Nov 07

CSV Parsing with Erlang

Posted By:  Praveen Ray

Since OTP doesn't provide any CSV parsing capabilities, I decided to write my own parser based on gen_fsm. Following the short explanation given in Perl's Text::CSV_XS module, I implemented a simple state machine. Erlang's excellent binary processing capabilities and built-in gen_fsm behavior make the code compact and surprisingly easy to implement. You can download the code from here. A short explanation follows.

The state machine has only a handful of states:

start_field

Start reading a CSV field. It might be double-quoted and might contain special characters such as \r, \n, comma, and double quote. Go to read_field or read_quoted_field, depending on whether a double quote started this field.

read_field

Once a field has started, we switch to this state and read bytes until an end-of-field condition is detected. The end of a field is marked by either a comma or a newline.

read_quoted_field

We're inside a double-quoted field; read everything until another double quote is encountered. That double quote might be the end of this field, or it might be an embedded double quote, which is written as two consecutive double quotes. Switch to escaped_double_quote if a double quote is encountered.

escaped_double_quote

We enter this state upon encountering a double quote inside the read_quoted_field state. If another double quote is seen, it's an escaped double quote; otherwise, it's nothing special. Both cases go back to read_quoted_field.
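The four states above can be sketched with plain, mutually recursive functions over a binary. This is only a minimal illustration, not the downloadable code: it does not use gen_fsm, the module name csv_sketch and the helper end_line/4 are made up for this example, and it assumes that a lone double quote inside a quoted field closes the field (handing the rest of the input to read_field), which may differ from how the real module treats that case.

```erlang
%% Illustrative sketch only; names are hypothetical, not the author's code.
-module(csv_sketch).
-export([parse/1]).

%% parse(Binary) -> list of rows, each row a list of binary fields.
parse(Bin) when is_binary(Bin) ->
    start_field(Bin, [], []).

%% start_field: decide whether the next field is quoted.
start_field(<<>>, [], Rows) ->
    lists:reverse(Rows);                      % input ended on a line boundary
start_field(<<>>, Row, Rows) ->
    %% input ended right after a comma: emit one last empty field
    lists:reverse([lists:reverse([<<>> | Row]) | Rows]);
start_field(<<$", Rest/binary>>, Row, Rows) ->
    read_quoted_field(Rest, <<>>, Row, Rows);
start_field(Bin, Row, Rows) ->
    read_field(Bin, <<>>, Row, Rows).

%% read_field: collect bytes until a comma, a newline, or end of input.
read_field(<<$,, Rest/binary>>, Acc, Row, Rows) ->
    start_field(Rest, [Acc | Row], Rows);
read_field(<<$\r, $\n, Rest/binary>>, Acc, Row, Rows) ->
    end_line(Rest, Acc, Row, Rows);
read_field(<<$\n, Rest/binary>>, Acc, Row, Rows) ->
    end_line(Rest, Acc, Row, Rows);
read_field(<<>>, Acc, Row, Rows) ->
    end_line(<<>>, Acc, Row, Rows);
read_field(<<C, Rest/binary>>, Acc, Row, Rows) ->
    read_field(Rest, <<Acc/binary, C>>, Row, Rows).

%% read_quoted_field: commas and newlines are ordinary data in here.
read_quoted_field(<<$", Rest/binary>>, Acc, Row, Rows) ->
    escaped_double_quote(Rest, Acc, Row, Rows);
read_quoted_field(<<C, Rest/binary>>, Acc, Row, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, C>>, Row, Rows);
read_quoted_field(<<>>, Acc, Row, Rows) ->
    end_line(<<>>, Acc, Row, Rows).           % unterminated quote: keep what we have

%% escaped_double_quote: a second quote is a literal quote character.
escaped_double_quote(<<$", Rest/binary>>, Acc, Row, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, $">>, Row, Rows);
escaped_double_quote(Rest, Acc, Row, Rows) ->
    %% assumption of this sketch: a lone quote closed the field
    read_field(Rest, Acc, Row, Rows).

%% end_line: finish the current field and row, then start the next line.
end_line(Rest, Acc, Row, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Row]) | Rows]).
```

Each function plays the role of one state, and the accumulators (current field, current row, finished rows) are passed along as arguments instead of being kept in a process.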

Usage

The following public functions are exported:

parse_csv(File_path)
parse_csv(Binary_blob)
parse_csv(File_path, Options)
parse_csv(Binary_blob, Options)

where Options is:

[{callback_fn, Fun}, {callback_state, term()}]

Fun is a fun of arity 2; it is called with a list of fields and the callback state.

With no Options passed, the return value is a list of lists of fields. With callback_fn passed, the callback is called at the end of each line with the list of fields for that line.

Examples:


parse_csv:parse_csv("/tmp/data.csv").

Returns:

[[<<"Date">>,<<"Source">>,<<"Destination">>,
  <<"Seconds">>,<<"CallerID">>,<<"Disposition">>,<<"Cost">>],
 [<<"2009-09-18 09:44:54">>,<<"5097213333">>,
  <<"18667778888">>,<<"66">>,<<"5098761323">>,<<"ANSWERED">>,
  <<"0">>]]

With a callback:

F = fun(Fields, State) -> io:format("~p~n",[Fields]), State + 1 end.
parse_csv:parse_csv("/tmp/data.csv", [{callback_fn, F},{callback_state, 0}]).

This calls F once per line: first with the second parameter set to 0, then 1, then 2, and so on. Note that your fun must return the new state, which becomes the second parameter to callback_fn for the next line.
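The way the callback state is threaded from one line to the next is essentially a fold. The sketch below illustrates that idea on an already-parsed list of rows rather than on parse_csv itself; the module callback_sketch and its functions run/3 and total_seconds/1 are hypothetical names made up for this example.

```erlang
%% Illustrative sketch only; module and function names are hypothetical.
-module(callback_sketch).
-export([run/3, total_seconds/1]).

%% run(Rows, Fun, State0): call Fun(Fields, State) once per row, feeding the
%% state returned by one call into the next, and return the final state.
run(Rows, Fun, State0) ->
    lists:foldl(Fun, State0, Rows).

%% Example callback: sum the 4th column ("Seconds"), skipping the header row.
total_seconds([_Header | DataRows]) ->
    Sum = fun(Fields, Acc) ->
              Acc + binary_to_integer(lists:nth(4, Fields))
          end,
    run(DataRows, Sum, 0).
```

A callback written this way can carry any term as its state, for example a counter, a running total, or a list of selected rows.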
 

3 Responses to “CSV Parsing with Erlang”

  1. Alicia Dambrose Says:

    I have seen some crappy posts but this one really impresses me. Good work.

  2. Felix Says:

    What’s the point of using a generic server for parsing data,
    wouldn’t a recursive solution suffice for that? It is very easy to simulate a state machine using mutually recursive functions like this (a and b are states):

    a(S) ->
    …;
    a(S) ->

    b(NewS).

    b(S) ->

    a(NewS).

  3. Praveen Ray Says:

    Felix,
    You’re right about mutually recursive functions. Since OTP provides an FSM behavior, it’s definitely a little easier to build state machines using OTP. But for such simple machines, it’s probably a good idea to do away with the OTP behavior altogether.
