Nov 07

CSV Parsing with Erlang

Posted By:  Praveen Ray

Since OTP doesn't provide any CSV parsing capabilities, I decided to write my own parser based on gen_fsm. Following the short explanation given in Perl's Text::CSV_XS module, I implemented a simple state machine. Erlang's excellent binary processing capabilities and built-in gen_fsm behavior make the code compact and surprisingly easy to write. You can download the code from here. A short explanation follows.

The state machine has only a handful of states:

start_field

Start reading a CSV field. It might be double quoted and might contain special characters such as \r, \n, comma and double quote. Go to read_field or read_quoted_field, depending on whether a double quote started the field.

read_field

Once a field has started, we switch to this state and read bytes until an end-of-field condition is detected. End of field is marked by either a comma or a newline.

read_quoted_field

We're inside a double quoted field; read everything until another double quote is encountered. That double quote might be the end of this field, or the first half of an embedded double quote, which is written as two consecutive double quotes. Switch to escaped_double_quote whenever a double quote is encountered.

escaped_double_quote

We enter this state upon encountering a double quote inside the read_quoted_field state. If another double quote is seen, it's an escaped double quote; otherwise, it's nothing special. Both these cases go back to read_quoted_field.
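To make the four states concrete, here is a minimal sketch written as plain recursive functions over a binary. It is not the gen_fsm code from the download: the module name csv_sketch and the function parse/1 are made up for illustration. It also assumes that a lone double quote closes the quoted part and hands the rest of the field to read_field, which is one common reading of the rules and may differ from what the real module does.

```erlang
-module(csv_sketch).
-export([parse/1]).

%% parse(Bin) -> [[binary()]]: one list of fields per line.
%% Hypothetical helper for illustration only.
parse(Bin) when is_binary(Bin) ->
    start_field(Bin, [], []).

%% start_field: decide whether the next field is quoted.
start_field(<<>>, [], Rows) ->
    lists:reverse(Rows);
start_field(<<>>, Fields, Rows) ->
    %% input ended right after a comma: emit a final empty field
    lists:reverse([lists:reverse([<<>> | Fields]) | Rows]);
start_field(<<$", Rest/binary>>, Fields, Rows) ->
    read_quoted_field(Rest, <<>>, Fields, Rows);
start_field(Bin, Fields, Rows) ->
    read_field(Bin, <<>>, Fields, Rows).

%% read_field: collect bytes until a comma, a newline or end of input.
read_field(<<>>, Acc, Fields, Rows) ->
    lists:reverse([lists:reverse([Acc | Fields]) | Rows]);
read_field(<<$,, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [Acc | Fields], Rows);
read_field(<<"\r\n", Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<$\n, Rest/binary>>, Acc, Fields, Rows) ->
    start_field(Rest, [], [lists:reverse([Acc | Fields]) | Rows]);
read_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_field(Rest, <<Acc/binary, C>>, Fields, Rows).

%% read_quoted_field: commas and newlines are ordinary data in here.
read_quoted_field(<<>>, Acc, Fields, Rows) ->
    %% unterminated quote: keep what was read (a real parser might raise an error)
    lists:reverse([lists:reverse([Acc | Fields]) | Rows]);
read_quoted_field(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    escaped_double_quote(Rest, Acc, Fields, Rows);
read_quoted_field(<<C, Rest/binary>>, Acc, Fields, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, C>>, Fields, Rows).

%% escaped_double_quote: a second quote is a literal quote;
%% anything else means the quoted part is over (assumption, see above).
escaped_double_quote(<<$", Rest/binary>>, Acc, Fields, Rows) ->
    read_quoted_field(Rest, <<Acc/binary, $">>, Fields, Rows);
escaped_double_quote(Rest, Acc, Fields, Rows) ->
    read_field(Rest, Acc, Fields, Rows).
```

For example, csv_sketch:parse(<<"a,\"b,c\"\n1,2\n">>) yields [[<<"a">>,<<"b,c">>],[<<"1">>,<<"2">>]].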

Usage

The following public functions are exported:

parse_csv(File_path)
parse_csv(Binary_blob)
parse_csv(File_path, Options)
parse_csv(Binary_blob, Options)

where Options is:

[{callback_fn, Fun}, {callback_state, term()}]

Fun is a fun of arity 2. It gets called with a list of fields and the callback state.

With no Options passed, the return value is a list of lists of fields, one inner list per line. With callback_fn passed, the callback is called at the end of each line with that line's list of fields.

Examples:


parse_csv:parse_csv("/tmp/data.csv").

Returns:

[[<<"Date">>,<<"Source">>,<<"Destination">>,
  <<"Seconds">>,<<"CallerID">>,<<"Disposition">>,<<"Cost">>],
 [<<"2009-09-18 09:44:54">>,<<"5097213333">>,
  <<"18667778888">>,<<"66">>,<<"5098761323">>,<<"ANSWERED">>,
  <<"0">>]]

With a callback:

F = fun(Fields, State) -> io:format("~p~n",[Fields]), State + 1 end.
parse_csv:parse_csv("/tmp/data.csv", [{callback_fn, F},{callback_state, 0}]).

This calls F once per line: first with the second parameter set to 0, then 1, then 2, and so on. Note that your fun must return the new state, which becomes the second parameter to callback_fn for the next line.
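The state threading works like a fold, so a callback can do more than count lines. The sketch below sums the Seconds column and skips the header line. To stay runnable without the parser, it uses lists:foldl over the sample rows shown above as a stand-in for parse_csv calling the fun once per line. The module name csv_callback_sketch and the function names are made up for illustration, and what parse_csv itself returns when a callback is given is not covered here.

```erlang
-module(csv_callback_sketch).
-export([demo/0]).

%% Callback in the shape parse_csv expects: fun(Fields, State) -> NewState.
%% State is {LineNo, TotalSeconds}; the header line (LineNo 0) is skipped.
count_seconds(_Header, {0, Total}) ->
    {1, Total};
count_seconds([_Date, _Src, _Dst, Seconds | _], {N, Total}) ->
    {N + 1, Total + binary_to_integer(Seconds)}.

demo() ->
    Rows = [[<<"Date">>,<<"Source">>,<<"Destination">>,<<"Seconds">>,
             <<"CallerID">>,<<"Disposition">>,<<"Cost">>],
            [<<"2009-09-18 09:44:54">>,<<"5097213333">>,<<"18667778888">>,
             <<"66">>,<<"5098761323">>,<<"ANSWERED">>,<<"0">>]],
    %% lists:foldl stands in for parse_csv invoking the callback per line
    lists:foldl(fun count_seconds/2, {0, 0}, Rows).
```

Here csv_callback_sketch:demo() returns {2, 66}: two lines seen, 66 seconds in total. With the real parser, the same fun would be passed as {callback_fn, fun count_seconds/2} together with {callback_state, {0, 0}}.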
Sep 21

The file upload example on the yaws web site is a good starting point, but it's too simplistic. I extended the example so it's useful in the real world. (Update: Thanks to Steve Vinoski, this module (yaws_multipart) is now part of the yaws git tree.)

  1. It reads all parameters: uploaded files as well as other simple parameters.
  2. It takes a few options to help with file uploads. Specifically:
    1. {max_file_size, MaxBytes}: If the file exceeds MaxBytes bytes, return an error.
    2. no_temp_file: Read the uploaded file into memory without using any temp files.
    3. {temp_file, FullFilePath}: Specify the full path for the temp file. If not given, a unique file name is generated.
    4. {temp_dir, TempDir}: Specify a directory to store the uploaded temp file. By default '/tmp' is used.

Using it is simple. Just call read_multipart_form from your 'out' function and it'll return a tuple whose first element is one of 'get_more', 'done' or 'error'. 'get_more' means more data needs to be read and you must call read_multipart_form again. 'done' means it has finished reading all parameters and you're free to proceed. The 'done' tuple also carries a 'dict' full of params, which can be queried for parameters by name. For file upload parameters it returns one of the following lists:

[{filename, "name of the uploaded file as entered on the form"},
  {value, Contents_of_the_file_all_in_memory}]
OR
[{filename, "name of the uploaded file as entered on the form"},
  {temp_file, "full pathname of the temp file"}]

In the second case, it's your responsibility to remove the temp file. Usage example:

-module(my_yaws_controller).
-export([out/1]).

out(Arg) ->
    Options = [no_temp_file],
    case yaws_multipart:read_multipart_form(Arg, Options) of
        {done, Params} ->
            io:format("Params : ~p~n", [Params]),
            % dict:find/2 returns {ok, Value} when the key is present, else 'error'
            {ok, [{filename, File_name}, {value, File_content}]} = dict:find("my_file", Params),
            Another_param = dict:find("another_param", Params);
            % do something with File_name, File_content and Another_param
        {error, Reason} ->
            % ~p prints any term, so this works even if Reason is not a string
            io:format("Error reading multipart form: ~p~n", [Reason]);
        Other ->
            % anything else (such as 'get_more') is returned as-is
            Other
    end.
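Since removing the temp file is left to the caller, a small helper can handle both shapes of the file parameter list shown above. This is only a sketch: the module name upload_cleanup_sketch and the function cleanup_upload/1 are made up for illustration and are not part of yaws_multipart. It relies only on the standard proplists and file modules.

```erlang
-module(upload_cleanup_sketch).
-export([cleanup_upload/1]).

%% FileParam is the list stored in the params dict for a file field, e.g.
%%   [{filename, "a.txt"}, {temp_file, "/tmp/xyz"}]  or
%%   [{filename, "a.txt"}, {value, <<"...">>}].
cleanup_upload(FileParam) ->
    case proplists:get_value(temp_file, FileParam) of
        undefined ->
            ok;                  % in-memory upload: nothing on disk to remove
        Path ->
            file:delete(Path)    % returns ok | {error, Reason}
    end.
```

Call it once you are done with the upload, for instance after copying the temp file to its final location.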
Jun 17

In keeping with the KISS philosophy, nginx is an awesome, simple web server. It does a few things and does them extremely well.
It doesn’t do CGI but it does proxying, and that makes it extremely useful as a front end web server. I recently had to implement an extjs based progress bar for large file uploads with nginx acting as a front end to a Rack/mongrel based application. Here are the steps for Ubuntu.

Do not install nginx from the repo. Uninstall if it’s already installed.

apt-get remove nginx
mkdir -p /opt/downloads
cd /opt/downloads

Download nginx sources from nginx.net and unpack (I’m working with nginx-0.6.36):

tar zxf nginx-0.6.36.tar.gz

Download and untar the upload progress module from the nginx wiki:

tar zxf Nginx_uploadprogress_module-0.5.tar.gz
cd nginx-0.6.36
./configure --prefix=/opt/nginx --add-module=/opt/downloads/nginx_uploadprogress_module
make install

This’ll install nginx in /opt/nginx.

Configuration

Open up /opt/nginx/conf/nginx.conf and add the following lines:

http {
     client_max_body_size 30M; # adjust as per your need
     upload_progress proxied 1m;
 
     server {
         server_name my.server.com;
         listen 80;
         root /var/www/nginx-default/my-static-files;
 
         location /ajax {
             proxy_pass http://localhost:2300;
             proxy_redirect default;
             track_uploads proxied 30s;
         }
         location ^~ /report_file_uploads {
              report_uploads proxied;
         }
     }
}

This assumes ‘/ajax’ is the location proxied to the backend application.

In JavaScript, to get the progress bar going, send the following AJAX message in a loop after the form with the file upload field has been submitted.

Lots of details are omitted since they depend on your JavaScript library of choice (which, btw, for us is extjs).

var upload_id = 'MyUniqueID'; // upload_id must be unique for each upload session
this.send_ajax_message({
    url: '/report_file_uploads',
    headers: {'X-Progress-ID': upload_id},
    method: 'GET',
    success: function(r) {
        r = this.parse_ajax_response(r);
        if (r.state == 'uploading' || r.state == 'starting') {
            // fraction of the upload received so far, capped at 1.0
            var percent = Number(r.received) / Number(r.size);
            if (percent > 1.0)
                percent = 1.0;
            // show percent as you wish on your progress meter
            // sleep for a few seconds and send this ajax message again
        } else if (r.state == 'done' || r.state == 'error') {
            // kill your loop timer
            // finish your progress meter
        }
    }
});

Let me know if you’d like javascript fragment for extjs and I’ll post it but it’s relatively straightforward.

At Yellowfish, we specialize in web2.0 Ajax web application development using open source tools and modern software trends.
