Nov 07

CSV Parsing with Erlang

Posted By:  Praveen Ray

Since OTP doesn't provide any CSV parsing capabilities, I decided to write my own parser based on gen_fsm. Following the short explanation given in Perl's Text::CSV_XS module, I implemented a simple state machine. Erlang's excellent binary processing capabilities and built-in gen_fsm behavior make the code compact and surprisingly easy to implement. You can download the code from here. A short explanation follows.

The state machine has only a handful of states:

start_field

Start reading a CSV field – it might be double quoted and might contain special characters such as \r, \n, comma and double quote. Go to read_field or read_quoted_field, depending on whether a double quote started this field.

read_field

Once a field has started, we switch to this state and read bytes from the binary until an end-of-field condition is detected. The end of a field is marked by either a comma or a newline.

read_quoted_field

We're inside a double quoted field; read everything until another double quote is encountered. A double quote might mark the end of this field, or it might be the first half of an embedded double quote, which is written as two consecutive double quotes. Switch to escaped_double_quote if a double quote is encountered.

escaped_double_quote

We enter this state upon encountering a double quote inside the read_quoted_field state. If another double quote is seen, it's an escaped double quote; otherwise, it's nothing special. Both cases go back to read_quoted_field.
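The four states above can be sketched without gen_fsm as plain recursive functions over a binary, one function per state. This is only an illustration and not the downloadable code: the module name csv_sketch and the function parse/1 are made up for the example, and, unlike the description above, a lone double quote inside a quoted field is treated here as closing the field, with any bytes up to the next comma or newline read as unquoted.

```erlang
-module(csv_sketch).
-export([parse/1]).

%% parse(Binary) -> [[binary()]], one inner list of fields per line.
parse(Bin) when is_binary(Bin) ->
    start_field(Bin, [], []).

%% start_field: decide whether the next field is quoted or not.
start_field(<<>>, [], Rows) ->
    lists:reverse(Rows);
start_field(<<>>, Fields, Rows) ->
    %% Input ended right after a comma: emit a trailing empty field.
    lists:reverse([lists:reverse([<<>> | Fields]) | Rows]);
start_field(<<$", Rest/binary>>, Fields, Rows) ->
    read_quoted_field(Rest, <<>>, Fields, Rows);
start_field(Bin, Fields, Rows) ->
    read_field(Bin, <<>>, Fields, Rows).

%% read_field: collect bytes until a comma, a newline or end of input.
read_field(<<$,, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [Acc | Fields], Rows);
read_field(<<"\r\n", Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<$\n, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<>>, Acc, Fields, Rows) ->
    lists:reverse([lists:reverse([Acc | Fields]) | Rows]);
read_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_field(Rest, <<Acc/binary, C>>, Fields, Rows).

%% read_quoted_field: commas and newlines are ordinary data in here.
read_quoted_field(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    escaped_double_quote(Rest, Acc, Fields, Rows);
read_quoted_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, C>>, Fields, Rows);
read_quoted_field(<<>>, Acc, Fields, Rows) ->
    %% Unterminated quote: keep what has been read so far.
    lists:reverse([lists:reverse([Acc | Fields]) | Rows]).

%% escaped_double_quote: the previous byte was a double quote.
escaped_double_quote(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    %% Two consecutive quotes stand for one literal quote.
    read_quoted_field(Rest, <<Acc/binary, $">>, Fields, Rows);
escaped_double_quote(Rest, Acc, Fields, Rows) ->
    %% Assumption of this sketch: a lone quote closes the quoted field.
    read_field(Rest, Acc, Fields, Rows).
```

Carrying the partial field, the fields of the current line and the finished rows as function arguments plays the role that the state data would play in a gen_fsm implementation.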

Usage

The following public functions are exported:

parse_csv(File_path)
parse_csv(Binary_blob)
parse_csv(File_path, Options)
parse_csv(Binary_blob, Options)

where Options is:

[{callback_fn, Fun}, {callback_state, term()}]

Fun is a fun of arity 2 that gets called with a list of fields and the callback state.

With no Options passed, the return value is a list of lists of fields, one inner list per line. With callback_fn passed, the callback is called at the end of each line with that line's list of fields.

Examples:


parse_csv:parse_csv("/tmp/data.csv").

Returns:

 

[[<<"Date">>,<<"Source">>,<<"Destination">>,
  <<"Seconds">>,<<"CallerID">>,<<"Disposition">>,<<"Cost">>],
 [<<"2009-09-18 09:44:54">>,<<"5097213333">>,
  <<"18667778888">>,<<"66">>,<<"5098761323">>,<<"ANSWERED">>,
  <<"0">>]]

F = fun(Fields, State) -> io:format("~p~n",[Fields]), State + 1 end.
parse_csv:parse_csv("/tmp/data.csv", [{callback_fn, F},{callback_state, 0}]).

This calls F once per line: first with the second parameter set to 0, then 1, then 2, and so on. Note that your fun must return the updated state, which becomes the second parameter to callback_fn for the next line.
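Because the state is threaded from one line to the next, the callback can accumulate a result instead of just counting lines. The sketch below sums the "Seconds" column of the sample data shown earlier. The module csv_callback_sketch and its functions are hypothetical names, and demo/0 imitates the per-line calls with lists:foldl/3 so that it runs without the parser; that the parser threads state in exactly this way is taken from the description above.

```erlang
-module(csv_callback_sketch).
-export([sum_seconds/2, demo/0]).

%% Callback of arity 2: (Fields, State) -> NewState.
%% State is {LinesSeen, TotalSeconds}; the first line is the header row
%% and is skipped.
sum_seconds(_Header, {0, Total}) ->
    {1, Total};
sum_seconds(Fields, {LinesSeen, Total}) ->
    %% "Seconds" is the fourth column in the sample file.
    Seconds = binary_to_integer(lists:nth(4, Fields)),
    {LinesSeen + 1, Total + Seconds}.

%% Imitate the per-line callback calls over the sample rows.
demo() ->
    Rows = [[<<"Date">>, <<"Source">>, <<"Destination">>,
             <<"Seconds">>, <<"CallerID">>, <<"Disposition">>, <<"Cost">>],
            [<<"2009-09-18 09:44:54">>, <<"5097213333">>,
             <<"18667778888">>, <<"66">>, <<"5098761323">>,
             <<"ANSWERED">>, <<"0">>]],
    lists:foldl(fun sum_seconds/2, {0, 0}, Rows).
```

With the real parser, the same fun would be passed as {callback_fn, fun csv_callback_sketch:sum_seconds/2} together with {callback_state, {0, 0}}.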
 
Sep 21

The file upload example on the yaws web site is a good starting point, but it's too simplistic. I extended the example to make it useful in the real world. (Update: thanks to Steve Vinoski, this module (yaws_multipart) is now part of the yaws git tree.)

  1. It reads all parameters – uploaded files and other simple parameters.
  2. It takes a few options to help with file uploads. Specifically:
    1. {max_file_size, MaxBytes}: if the file exceeds MaxBytes bytes, return an error
    2. no_temp_file: read the uploaded file into memory without any temp files
    3. {temp_file, FullFilePath}: specify the full path for the temp file. If not given, a unique file name is generated
    4. {temp_dir, TempDir}: specify a directory in which to store the uploaded temp file. By default '/tmp' is used.

Using it is simple. Just call read_multipart_form from your 'out' function and it'll return a tuple whose first element is 'get_more', 'done' or 'error'. 'get_more' means more data needs to be read and you must call read_multipart_form again. 'done' means it has finished reading all parameters and you're free to proceed. The 'done' tuple also carries a 'dict' full of params, which can be queried for parameters by name. For file upload parameters it returns one of the following lists:

[{filename, "name of the uploaded file as entered on the form"},
  {value, Contents_of_the_file_all_in_memory}]
OR
[{filename, "name of the uploaded file as entered on the form"},
  {temp_file, "full pathname of the temp file"}]

In the second case, it's your responsibility to remove the temp file. Usage example:

-module(my_yaws_controller).
-export([out/1]).

out(Arg) ->
    Options = [no_temp_file],
    case yaws_multipart:read_multipart_form(Arg, Options) of
        {done, Params} ->
            io:format("Params : ~p~n", [Params]),
            %% dict:find/2 returns {ok, Value} or error, so match on {ok, ...}
            {ok, [{filename, File_name}, {value, File_content}]} =
                dict:find("my_file", Params),
            %% do something with File_name, File_content and Another_param
            Another_param = dict:find("another_param", Params);
        {error, Reason} ->
            %% ~p prints any term; Reason is not necessarily a string
            io:format("Error reading multipart form: ~p~n", [Reason]);
        Other ->
            %% anything else (such as a 'get_more' tuple) is returned unchanged
            Other
    end.
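When no_temp_file is not given, the file parameter arrives in the second shape, with a temp_file entry instead of value. A small helper can hide that difference and take care of the cleanup. The following is a sketch under that assumption; upload_sketch and file_contents/1 are hypothetical names and not part of yaws_multipart.

```erlang
-module(upload_sketch).
-export([file_contents/1]).

%% Takes the list stored in the dict for a file upload parameter and
%% returns the file contents, whichever of the two shapes it has.
file_contents(FileParam) ->
    case proplists:get_value(value, FileParam) of
        undefined ->
            %% Temp file case: read the contents, then remove the file,
            %% since cleaning it up is the caller's responsibility.
            Path = proplists:get_value(temp_file, FileParam),
            {ok, Bin} = file:read_file(Path),
            ok = file:delete(Path),
            Bin;
        Content ->
            %% In-memory case: the contents are already in the list.
            Content
    end.
```

Reading a large upload back into memory defeats the purpose of the temp file, so for big files it may be better to process the file on disk and only delete it afterwards.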