Since OTP doesn't provide any CSV parsing capabilities, I decided to write my own parser based on gen_fsm. Following the short explanation given in Perl's Text::CSV_XS module, I implemented a simple state machine. Erlang's excellent binary processing capabilities and built-in gen_fsm behavior make the code compact and surprisingly easy to write. You can download the code from here. A short explanation follows.
The state machine has only a handful of states:
start_field
Start reading a CSV field. It might be double quoted and might contain special characters such as \r, \n, a comma or a double quote. Go to read_field or read_quoted_field, depending on whether the field starts with a double quote.
read_field
Once a field has started, we switch to this state and read bytes until an end-of-field condition is detected. The end of a field is marked by either a comma or a newline.
read_quoted_field
We're inside a double-quoted field; read everything until another double quote is encountered. That double quote might be the end of this field, or the start of an embedded double quote, which is written as two consecutive double quotes. Switch to escaped_double_quote if a double quote is encountered.
escaped_double_quote
We enter this state upon encountering a double quote in the read_quoted_field state. If another double quote follows, it's an escaped double quote; otherwise, it's nothing special. Both cases go back to read_quoted_field.
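To make the four states concrete, here is a minimal sketch that writes them as plain recursive functions over a binary instead of gen_fsm callbacks. It is not the downloadable code: the module name csv_sketch and the function parse/1 are made up for illustration. It also assumes that a lone double quote closes the quoted field and hands the rest of the field to read_field, and that an unterminated quoted field is returned as-is.

```erlang
-module(csv_sketch).
-export([parse/1]).

%% Hypothetical entry point: parse a CSV binary into a list of rows,
%% each row being a list of binary fields.
parse(Bin) when is_binary(Bin) ->
    start_field(Bin, [], []).

%% Fields and Rows are accumulated in reverse order.
start_field(<<>>, [], Rows) ->
    lists:reverse(Rows);
start_field(<<>>, Fields, Rows) ->
    %% Input ended right after a comma: emit a trailing empty field.
    finish([<<>> | Fields], Rows);
start_field(<<$", Rest/binary>>, Fields, Rows) ->
    read_quoted_field(Rest, <<>>, Fields, Rows);
start_field(Bin, Fields, Rows) ->
    read_field(Bin, <<>>, Fields, Rows).

%% Unquoted field: a comma ends the field, a newline ends the row.
read_field(<<>>, Acc, Fields, Rows) ->
    finish([Acc | Fields], Rows);
read_field(<<$,, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [Acc | Fields], Rows);
read_field(<<$\r, $\n, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<$\n, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_field(Rest, <<Acc/binary, C>>, Fields, Rows).

%% Quoted field: commas and newlines are ordinary data here.
read_quoted_field(<<>>, Acc, Fields, Rows) ->
    %% Assumption: an unterminated quoted field is kept as-is.
    finish([Acc | Fields], Rows);
read_quoted_field(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    escaped_double_quote(Rest, Acc, Fields, Rows);
read_quoted_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, C>>, Fields, Rows).

%% We have just seen a double quote inside a quoted field.
escaped_double_quote(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    %% Two consecutive quotes: keep one literal quote, stay quoted.
    read_quoted_field(Rest, <<Acc/binary, $">>, Fields, Rows);
escaped_double_quote(Rest, Acc, Fields, Rows) ->
    %% Assumption: a lone quote closed the field; look for the separator.
    read_field(Rest, Acc, Fields, Rows).

%% Close the last row and restore the original order.
finish(Fields, Rows) ->
    lists:reverse([lists:reverse(Fields) | Rows]).
```

With this sketch, csv_sketch:parse(<<"a,\"b,c\"\n">>) returns [[<<"a">>,<<"b,c">>]]: the comma inside the quotes stays part of the second field.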
Usage
The following public functions are exported:
parse_csv(File_path)
parse_csv(Binary_blob)
parse_csv(File_path, Options)
parse_csv(Binary_blob, Options)
where Options is:
[{callback_fn, Fun}, {callback_state, term()}]
Fun is a function of arity 2 that gets called with a list of fields and the callback state.
With no Options passed, the return value is a list of lists of fields, one inner list per line. With callback_fn passed, the callback is called at the end of each line with the list of fields for that line.
Examples:
parse_csv:parse_csv("/tmp/data.csv").
Returns:
[[<<"Date">>,<<"Source">>,<<"Destination">>,<<"Seconds">>,
  <<"CallerID">>,<<"Disposition">>,<<"Cost">>],
 [<<"2009-09-18 09:44:54">>,<<"5097213333">>,<<"18667778888">>,
  <<"66">>,<<"5098761323">>,<<"ANSWERED">>,<<"0">>]]
F = fun(Fields, State) -> io:format("~p~n",[Fields]), State + 1 end.
parse_csv:parse_csv("/tmp/data.csv", [{callback_fn, F},{callback_state, 0}]).
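The callback has the same shape as the function lists:foldl/3 expects, so a callback can be tried out on an already parsed list of rows before handing it to the parser. The sketch below counts lines containing an ANSWERED field. The module name csv_callback_sketch and the function count_answered/1 are made up for illustration, and lists:foldl only stands in for the parser here. It assumes, as the State + 1 example suggests, that the callback's return value becomes the state for the next line; what parse_csv itself returns when a callback is supplied is not covered above.

```erlang
-module(csv_callback_sketch).
-export([count_answered/1]).

%% Hypothetical helper: Rows is a list of lists of binary fields,
%% in the same shape parse_csv returns when called without Options.
count_answered(Rows) ->
    %% Same shape as callback_fn: Fun(Fields, State) -> NewState.
    F = fun(Fields, Count) ->
            case lists:member(<<"ANSWERED">>, Fields) of
                true  -> Count + 1;
                false -> Count
            end
        end,
    %% lists:foldl calls F(Row, Acc) once per row, starting from 0,
    %% which plays the role of {callback_state, 0}.
    lists:foldl(F, 0, Rows).
```

The same F could then be passed as {callback_fn, F} together with {callback_state, 0}.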
