Liz told me to post here

In my work to port Text::CSV_XS from perl5 to perl6, I am close to feature-complete now, but the performance is not what I hoped for:
                  seconds to parse 10000 lines of CSV

perl5
    Text::CSV::Easy_XS      0.016
    Text::CSV::Easy_PP      0.017
    Text::CSV_XS w/ bindc   0.034
    Text::CSV_XS            0.039
    Text::CSV_PP            0.525
    Pegex::CSV              1.340

perl6
    csv.pl                  7.270   state machine
    csv-ip5xs              17.267   Inline::Perl5 with Text::CSV_XS
    csv-ip5xsio            17.243   Inline::Perl5 with Text::CSV_XS w/ IO
    csv-ip5pp              18.218   Inline::Perl5 with Text::CSV_PP
    csv_gram.pl            14.226   A Grammar-based parser
    test.pl                44.541   A reference parser (when I started)
    test-t.pl              39.887   Current parser, all options implemented
    csv-parser.pl          25.712   Tony-o's parser

So, currently for this kind of work, perl6 is between 2780 and 5.2 times slower than perl5 (worst vs best / best vs worst).

As Text::CSV allows every setting to be changed between any two calls, a static grammar engine is out of the question.

As I started working alongside someone else, we decided that I would explore the regular expression approach and he would explore the grammar approach. The latter never really happened :(

Currently, the regular expression splits the line to be parsed into chunks of interest. Here I take advantage of the rule that the first alternative is the most important one, so a quotation sequence that is equal to part of the end-of-line sequence is still valid.

    my sub chunks (Str $str, Regex:D $re) {
        $str.defined or return ();
        $str eq "" and return ("");
        $str.split ($re, :all).flat.map: {
            if $_ ~~ Str {
                $_ if .chars;
                }
            else {
                .Str if .Bool;
                };
            };
        }

and then later

    my Regex $chx = $!eol.defined
        ?? rx{ $eol | $sep | $quo | $esc }
        !! rx{ \r\n | \r | \n | $sep | $quo | $esc };
    $buffer.defined and @ch.push: chunks ($buffer, $chx);
    @ch or return parse_error (2012);

As it stands, the chunks function could be rebuilt to use a grammar that only changes whenever any of $eol, $sep, $quo, or $esc changes.
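The "only rebuild when one of the four strings changes" idea can be sketched without a grammar, by caching the chunk regex. This is a minimal sketch, not what the module does today: the helper name chunk-regex, the %cache hash, and the key separator are all hypothetical, and I assume the four values are plain strings with an undefined $eol meaning "any of the default line endings".

```perl6
# Hypothetical helper, not part of Text::CSV: returns a cached regex and
# builds a new one only for a combination of strings not seen before.
my %cache;

sub chunk-regex (Str $eol, Str:D $sep, Str:D $quo, Str:D $esc) {
    # An undefined $eol gets a placeholder so it has its own cache slot.
    # "\x1F" (unit separator) is an assumed delimiter for the key.
    my $key = ($eol // "\0", $sep, $quo, $esc).join("\x1F");

    %cache{$key} //= $eol.defined
        ?? rx{ $eol | $sep | $quo | $esc }
        !! rx{ \r\n | \r | \n | $sep | $quo | $esc };
}

# Two calls with the same settings reuse the regex object built first.
my $first  = chunk-regex(Str, ",", '"', '"');
my $second = chunk-regex(Str, ",", '"', '"');
say $first === $second;
```

This only avoids rebuilding the regex object. It does not make the matching itself any faster, so it would not answer the performance question on its own.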
None of the other options - in the current program flow - would influence the parser, as long as chunks returns the same list of "tokens".

Is it worthwhile to try to rebuild chunks to use a dynamic grammar, or do I wait for the regex engine to become faster?

As a side note: currently none of these four parts is allowed to be a regular expression. If I stick to regular expressions, that could be an option for future enhancements. All four are to be considered fixed strings, where an undefined $eol means either \r\n, or \n, or \r.

Code is available in the perl6 ecosystem http://modules.perl6.org/
GIT repo is at https://github.com/Tux/CSV
Documentation is https://github.com/Tux/CSV/blob/master/Text-CSV.pod

The csv *function* is still work in progress.

The style used is not a point of discussion.

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.21   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/