Regex vs Grammar

H.Merijn Brand Sat, 23 May 2015 12:24:05 -0700

Liz told me to post here

In my work to port Text::CSV_XS from perl5 to perl6, I am close to
feature complete now, but the performance is not what I hoped for:


seconds to parse 10000 lines of CSV

perl5
Text::CSV::Easy_XS     0.016
Text::CSV::Easy_PP     0.017
Text::CSV_XS w/ bindc  0.034
Text::CSV_XS           0.039
Text::CSV_PP           0.525
Pegex::CSV             1.340

perl6
csv.pl                 7.270  state machine
csv-ip5xs             17.267  Inline::Perl5 with Text::CSV_XS
csv-ip5xsio           17.243  Inline::Perl5 with Text::CSV_XS w/ IO
csv-ip5pp             18.218  Inline::Perl5 with Text::CSV_PP
csv_gram.pl           14.226  A Grammar-based parser
test.pl               44.541  A reference parser (when I started)
test-t.pl             39.887  Current parser, all options implemented
csv-parser.pl         25.712  Tony-o's parser


So, currently for this kind of work, perl6 is between roughly 2780 and
5.4 times slower than perl5 (worst vs best / best vs worst)

As Text::CSV allows all settings to be changed between any two calls,
a static grammar engine is out of the question. As I started working
alongside someone else, we decided that I would explore the regular
expression approach and he would explore the grammar approach. The
latter never really happened :(

Currently, the regular expression causes the line being parsed to be
returned as chunks of interest. I rely on the first alternative in the
alternation taking precedence, so a quotation sequence that is equal to
part of the end-of-line sequence is still valid.

        my sub chunks (Str $str, Regex:D $re) {
            $str.defined or  return ();
            $str eq ""   and return ("");

            $str.split ($re, :all).flat.map: {
                if $_ ~~ Str {
                    $_   if .chars;
                    }
                else {
                    .Str if .Bool;
                    };
                };
            }

and then later

       my Regex $chx = $!eol.defined
           ?? rx{ $eol           | $sep | $quo | $esc }
           !! rx{ \r\n | \r | \n | $sep | $quo | $esc };

       $buffer.defined and @ch.push: chunks ($buffer, $chx);
       @ch or return parse_error (2012);
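
For readers who do not read perl6, here is a rough Python sketch of the
same idea. It is not the module's code: the names make_chunk_re and
chunks and the default values are assumptions for illustration only.
Python's re module tries alternatives left to right, so listing the
end-of-line sequence first gives it precedence over a separator that is
a prefix of it.

```python
import re

def make_chunk_re(sep=",", quo='"', esc='"', eol=None):
    """Build one alternation from fixed strings (hypothetical helper)."""
    # An undefined eol means any of \r\n, \r or \n; \r\n is listed first
    # so it is matched as one chunk rather than as \r followed by \n.
    eol_alts = [r"\r\n", r"\r", r"\n"] if eol is None else [re.escape(eol)]
    # sep, quo and esc are fixed strings, so escape regex metacharacters.
    alts = eol_alts + [re.escape(s) for s in (sep, quo, esc)]
    # The capturing group makes split() keep the delimiters in its result.
    return re.compile("(" + "|".join(alts) + ")")

def chunks(line, chunk_re):
    """Split a line into text and delimiter chunks, dropping empty pieces."""
    if line is None:
        return []
    if line == "":
        return [""]
    return [c for c in chunk_re.split(line) if c]

tokens = chunks('a,"b ""c""",d\r\n', make_chunk_re())

# eol "||" is tried before sep "|", so the trailing "||" stays one chunk.
overlap = chunks("a|b||", make_chunk_re(sep="|", eol="||"))
# overlap == ['a', '|', 'b', '||']
```

The state machine that interprets the chunks (quoting, escaping, field
boundaries) is left out; only the tokenizing step is shown.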

As it stands, the chunks function could be rewritten to use a
grammar that only changes whenever any of $eol, $sep, $quo, or $esc
changes. None of the other options - in the current program flow -
would influence the parser, as long as chunks returns the
same list of "tokens".
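
The "rebuild only when one of the four changes" idea could be sketched
as a cache keyed on those four values. Again this is Python for
illustration, not perl6 and not the module's code; tokenizer_for is a
hypothetical name, and a compiled regex stands in for whatever grammar
or regex object would be built.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenizer_for(eol, sep, quo, esc):
    """Return a compiled tokenizer, built once per distinct setting tuple."""
    eol_alts = [r"\r\n", r"\r", r"\n"] if eol is None else [re.escape(eol)]
    alts = eol_alts + [re.escape(s) for s in (sep, quo, esc)]
    return re.compile("(" + "|".join(alts) + ")")

a = tokenizer_for(None, ",", '"', '"')
b = tokenizer_for(None, ",", '"', '"')  # same settings: cached object reused
c = tokenizer_for(None, ";", '"', '"')  # sep changed: a new tokenizer is built
```

Changing any other option between calls would not touch the cache, since
only the four delimiter settings are part of the key.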

Is it worthwhile to try to rewrite chunks to use a dynamic grammar,
or should I wait for the regex engine to become faster?

As a side note: currently none of these four parts are allowed to be a
regular expression. If I stick to regular expressions, that could be an
option for future enhancements. All four are to be considered fixed
strings, where an undefined $eol means either \r\n, or \n, or \r.

Code is available in the perl6 ecosystem http://modules.perl6.org/
GIT repo is at https://github.com/Tux/CSV
Documentation is https://github.com/Tux/CSV/blob/master/Text-CSV.pod

The csv *function* is still work in progress.
The style used is not a point of discussion.

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.21   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/
