Re: Speed of csvReader

2016-01-27 Thread jmh530 via Digitalmars-d-learn

On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:


Yeah, in the course of this exercise, I found that the one 
thing that has had the biggest impact on performance is the 
number of allocations involved.  [...snip]


Really interesting discussion.


Re: Speed of csvReader

2016-01-27 Thread Gerald Jansen via Digitalmars-d-learn

On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:

...
So the moral of the story is: avoid large numbers of small 
allocations. If you have to do it, consider consolidating your 
allocations into a series of allocations of large(ish) buffers 
instead, and taking slices of the buffers.


Many thanks for the detailed explanation.


Re: Speed of csvReader

2016-01-26 Thread Gerald Jansen via Digitalmars-d-learn

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:


While this is no fancy range-based code, and one might say it's 
more hackish and C-like than idiomatic D, the problem is that 
current D compilers can't quite optimize range-based code to 
this extent yet. Perhaps in the future optimizers will improve 
so that more idiomatic, range-based code will have comparable 
performance with fastcsv. (At least in theory this should be 
possible.)


As a D novice still struggling with the concept that composable 
range-based functions can be more efficient than good-old looping 
(ya, I know, cache friendliness and GC avoidance), I find it 
extremely interesting that someone as expert as yourself would 
reach for a C-like approach for serious data crunching. Given 
that data crunching is the kind of thing I need to do a lot, I'm 
wondering how general your statement above might be at this time 
w.r.t. this and possibly other domains.




Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

2016-01-26 Thread Jesse Phillips via Digitalmars-d-learn

On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via 
Digitalmars-d-learn wrote: [...]
My suggestion is to take the unittests used in std.csv and try 
to get your code working with them. As fastcsv limitations 
would prevent replacing the std.csv implementation the API may 
not need to match, but keeping close to the same would be best.


My thought is to integrate the fastcsv code into std.csv, such 
that the current std.csv code will serve as fallback in the 
cases where fastcsv's limitations would prevent it from being 
used, with fastcsv being chosen where possible.


That is why I suggested starting with the unittests. I don't 
expect the implementations to share much code, std.csv is written 
to only use front, popFront, and empty. Most of the work is done 
in csvNextToken so it might be able to take advantage of 
random-access ranges for more performance. I just think the 
unittests will help to define where switching algorithms will be 
required since they exercise a good portion of the API.


Re: Speed of csvReader

2016-01-26 Thread Chris Wright via Digitalmars-d-learn
On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:

> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
>>
>> While this is no fancy range-based code, and one might say it's more
>> hackish and C-like than idiomatic D, the problem is that current D
>> compilers can't quite optimize range-based code to this extent yet.
>> Perhaps in the future optimizers will improve so that more idiomatic,
>> range-based code will have comparable performance with fastcsv. (At
>> least in theory this should be possible.)
> 
> As a D novice still struggling with the concept that composable
> range-based functions can be more efficient than good-old looping (ya, I
> know, cache friendliness and GC avoidance), I find it extremely
> interesting that someone as expert as yourself would reach for a C-like
> approach for serious data crunching. Given that data crunching is the
> kind of thing I need to do a lot, I'm wondering how general your
> statement above might be at this time w.r.t. this and possibly other
> domains.

You want to reduce allocations. Ranges often let you do that. However, 
it's sometimes unsafe to reuse range values that aren't immutable. That 
means, if you want to keep the values around, you need to copy them -- 
which introduces an allocation.

You can get fewer large allocations by reading the whole file at once 
manually and using slices into that large allocation.
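
As a minimal sketch of that whole-file approach (the file name and
delimiter are assumptions for illustration, not code from this thread):

import std.algorithm : map, splitter;
import std.array : array;
import std.file : read;

void main()
{
    // One allocation for the whole file; every field below is a
    // slice into this buffer, never a copy of the character data.
    auto text = cast(string) read("data.csv");

    // Lazily split into lines, then fields; the only further
    // allocations are the arrays that hold the slices themselves.
    auto rows = text.splitter('\n')
                    .map!(line => line.splitter(',').array)
                    .array;
}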


Re: Speed of csvReader

2016-01-26 Thread Gerald Jansen via Digitalmars-d-learn

On Tuesday, 26 January 2016 at 20:54:34 UTC, Chris Wright wrote:

On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:


While this is no fancy range-based code, and one might say 
it's more hackish and C-like than idiomatic D, the problem is 
that current D compilers can't quite optimize range-based 
code to this extent yet. Perhaps in the future optimizers 
will improve so that more idiomatic, range-based code will 
have comparable performance with fastcsv.


... data crunching ... I'm wondering how general your 
statement above might be at this time w.r.t. this and possibly 
other domains.


You can get fewer large allocations by reading the whole file 
at once manually and using slices into that large allocation.


Sure, that part is clear. Presumably the quoted comment referred 
to more than just that technique.




Re: Speed of csvReader

2016-01-26 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn 
wrote:
> On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
> 
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >>
> >> While this is no fancy range-based code, and one might say it's
> >> more hackish and C-like than idiomatic D, the problem is that
> >> current D compilers can't quite optimize range-based code to this
> >> extent yet.  Perhaps in the future optimizers will improve so that
> >> more idiomatic, range-based code will have comparable performance
> >> with fastcsv. (At least in theory this should be possible.)
> > 
> > As a D novice still struggling with the concept that composable
> > range-based functions can be more efficient than good-old looping
> > (ya, I know, cache friendliness and GC avoidance), I find it
> > extremely interesting that someone as expert as yourself would reach
> > for a C-like approach for serious data crunching. Given that data
> > crunching is the kind of thing I need to do a lot, I'm wondering how
> > general your statement above might be at this time w.r.t. this and
> > possibly other domains.
> 
> You want to reduce allocations. Ranges often let you do that. However,
> it's sometimes unsafe to reuse range values that aren't immutable.
> That means, if you want to keep the values around, you need to copy
> them -- which introduces an allocation.
> 
> You can get fewer large allocations by reading the whole file at once
> manually and using slices into that large allocation.

Yeah, in the course of this exercise, I found that the one thing that
has had the biggest impact on performance is the number of allocations
involved.  Basically, I noted that the fewer allocations are made, the
more efficient the code. I'm not sure exactly why this is so, but it's
probably something to do with the fact that tracing GCs work better with
fewer allocations of larger objects than with many allocations of small
objects.  I have also noted in the past that D's current GC runs
collections a little too often; in past projects I've obtained
significant speedup (in one case, up to 40% reduction of total runtime)
by suppressing automatic collections and scheduling them manually at a
lower frequency.
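
A minimal sketch of that manual-collection pattern (the function and
the collection interval are illustrative assumptions, not code from
those projects):

import core.memory : GC;

void processItems(string[] items)
{
    GC.disable();               // suppress automatic collections
    scope(exit) GC.enable();    // restore normal behaviour afterwards

    foreach (i, item; items)
    {
        // ... allocation-heavy work on item ...

        if (i > 0 && i % 100_000 == 0)
            GC.collect();       // collect on a lower-frequency schedule
    }
}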

In short, I've found that reducing GC load plays a much bigger role in
performance than the range vs. loops issue.

The reason I chose to write manual loops at first is to eliminate all
possibility of unexpected overhead that might hide behind range
primitives, as well as compiler limitations, as current optimizers
aren't exactly tuned for range-based idioms, and may fail to recognize
certain range-based idioms that would lead to much more efficient code.
However, in my second iteration when I made the fastcsv parser return an
input range instead of an array, I found only negligible performance
differences.  This suggests that perhaps range-based code may not
perform that badly after all. I have yet to test this hypothesis, as the
inner loop that parses fields in a single row is still a manual loop;
but my suspicion is that it wouldn't do too badly in range-based form
either.

What might make a big difference, though, is the part where slicing is
used, since that is essential for reducing the number of allocations.

The current iteration of struct-based parsing code, for instance, went
through an initial version that was excruciatingly slow for structs with
string fields. Why? Because the function takes const(char)[] as input,
and you can't legally get strings out of that unless you make a copy of
that data (since const means you cannot modify it, but somebody else
still might). So std.conv.to would allocate a new string and copy the
contents over, every time a string field was parsed, resulting in a
large number of small allocations.

To solve this, I decided to use a string buffer: instead of one
allocation per string, pre-allocate a large-ish char[] buffer, and every
time a string field was parsed, append the data into the buffer. If the
buffer becomes full, allocate a new one. Take a slice of the buffer
corresponding to that field and cast it to string (this is safe since
the algorithm was constructed never to write over previous parts of the
buffer).  This seemingly trivial optimization won me a performance
improvement of an order of magnitude(!).
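
A minimal sketch of that string-buffer idea (the struct name and block
size are assumptions; this is not the actual fastcsv code):

import std.algorithm : max;

struct StringPool
{
    private char[] buf;
    private size_t used;

    // Copy one parsed field into the pool and return it as a string.
    string intern(const(char)[] field)
    {
        if (buf.length - used < field.length)
        {
            // Old buffers stay alive via the slices already handed out.
            buf = new char[max(64 * 1024, field.length)];
            used = 0;
        }
        buf[used .. used + field.length] = field[];
        auto s = buf[used .. used + field.length];
        used += field.length;
        // Safe cast: this region of buf is never written to again.
        return cast(string) s;
    }
}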

This is particularly enlightening, since it suggests that even the
overhead of copying all the string fields out of the original data into
a new buffer does not add up to that much.  The new struct-based parser
also returns an input range rather than an array; I found that
constructing the array directly vs. copying from an input range didn't
really make that big of a difference either. What did make a huge
difference is reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations.
If you have to do it, consider consolidating your allocations into a
series of allocations of large(ish) buffers instead, and taking slices
of the buffers.

Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

2016-01-26 Thread bachmeier via Digitalmars-d-learn

On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
My thought is to integrate the fastcsv code into std.csv, such 
that the current std.csv code will serve as fallback in the 
cases where fastcsv's limitations would prevent it from being 
used, with fastcsv being chosen where possible.


Wouldn't it be simpler to add a new function? Otherwise you'll 
end up with very different performance for almost the same data.


Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

2016-01-25 Thread H. S. Teoh via Digitalmars-d-learn
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via 
Digitalmars-d-learn wrote:
[...]
> My suggestion is to take the unittests used in std.csv and try to get
> your code working with them. As fastcsv limitations would prevent
> replacing the std.csv implementation the API may not need to match,
> but keeping close to the same would be best.

My thought is to integrate the fastcsv code into std.csv, such that the
current std.csv code will serve as fallback in the cases where fastcsv's
limitations would prevent it from being used, with fastcsv being chosen
where possible.

It may be possible to lift some of fastcsv's limitations, now that a few
performance bottlenecks have been identified (validation, excessive
number of small allocations, being the main ones). The code could be
generalized a bit more while preserving the optimizations in these key
areas.


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL


Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

2016-01-23 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via 
Digitalmars-d-learn wrote:
[...]
> I guess the next step is allowing Tuple rows with mixed types.

Alright. I threw together a new CSV parsing function that loads CSV data
into an array of structs. Currently, the implementation is not quite
polished yet (it blindly assumes the first row is a header row, which it
discards), but it does work, and outperforms std.csv by about an order
of magnitude.

The initial implementation was very slow (albeit still somewhat faster
than std.csv, by about 10% or so) when given a struct with string fields.
However, structs with POD fields are lightning fast (not significantly
different from before, in spite of all the calls to std.conv.to!). This
suggested that the slowdown was caused by excessive allocations of small
strings, causing a heavy GC load.  This suspicion was confirmed when I
ran the same input data with a struct where all string fields were
replaced with const(char)[] (so that std.conv.to simply returned slices
to the data) -- the performance shot back up to about 1700 msecs, a
little slower than the original version of reading into an array of
array of const(char)[] slices, but about 58 times(!) the performance of
std.csv.

So I tried a simple optimization: instead of allocating a string per
field, allocate a 64KB string buffer and copy string field values into
it, then take slices from the buffer to assign to the struct's string
fields.  With this optimization, running times came down to about the
1900 msec range, which is only marginally slower than the const(char)[]
case and about 51 times faster than std.csv.

Here are the actual benchmark values:

1) std.csv: 2126883 records, 102136 msecs

2) fastcsv (struct with string fields): 2126883 records, 1978 msecs

3) fastcsv (struct with const(char)[] fields): 2126883 records, 1743 msecs

The latest code is available on github:

https://github.com/quickfur/fastcsv

The benchmark driver now has 3 new targets:

stdstruct   - std.csv parsing of CSV into structs
faststruct  - fastcsv parsing of CSV into struct (string fields)
faststruct2 - fastcsv parsing of CSV into struct (const(char)[] fields)

Note that the structs are hard-coded into the code, so they will only
work with the census.gov test file.

Things still left to do:

- Fix header parsing to have a consistent interface with std.csv, or at
  least allow the user to configure whether or not the first row should
  be discarded.

- Support transcription to Tuples?

- Refactor the code to have less copy-pasta.

- Ummm... make it ready for integration with std.csv maybe? ;-)


T

-- 
Fact is stranger than fiction.


Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

2016-01-23 Thread Jesse Phillips via Digitalmars-d-learn

On Sunday, 24 January 2016 at 01:57:11 UTC, H. S. Teoh wrote:

- Ummm... make it ready for integration with std.csv maybe? ;-)


T


My suggestion is to take the unittests used in std.csv and try to 
get your code working with them. As fastcsv limitations would 
prevent replacing the std.csv implementation the API may not need 
to match, but keeping close to the same would be best.


Re: Speed of csvReader

2016-01-22 Thread data pulverizer via Digitalmars-d-learn

On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
Digitalmars-d-learn wrote:

[...]
> >   https://github.com/quickfur/fastcsv

[...]

Fixed some boundary condition crashes and reverted doubled 
quote handling in unquoted fields (since those are illegal 
according to RFC 4180).  Performance is back in the ~1200 msec 
range.



T


Hi H. S. Teoh, I have used your fastcsv on my file:

import std.file;
import fastcsv;
import std.stdio;
import std.datetime;

void main(){
  StopWatch sw;
  sw.start();
  auto input = cast(string) read("Acquisition_2009Q2.txt");
  auto mydata = fastcsv.csvToArray!('|')(input);
  sw.stop();
  double time = sw.peek().msecs;
  writeln("Time (s): ", time/1000);
}

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.


Re: Speed of csvReader

2016-01-22 Thread Edwin van Leeuwen via Digitalmars-d-learn

On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via 
Digitalmars-d-learn wrote:

[...]
> >   https://github.com/quickfur/fastcsv

[...]

Fixed some boundary condition crashes and reverted doubled 
quote handling in unquoted fields (since those are illegal 
according to RFC 4180).  Performance is back in the ~1200 msec 
range.



T


That's pretty impressive. Maybe turn it into a dub package so 
that data pulverizer could easily test it on his data :)


Re: Speed of csvReader

2016-01-22 Thread data pulverizer via Digitalmars-d-learn

On Friday, 22 January 2016 at 21:41:46 UTC, data pulverizer wrote:

On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:

[...]


Hi H. S. Teoh, I have used your fastcsv on my file:

import std.file;
import fastcsv;
import std.stdio;
import std.datetime;

void main(){
  StopWatch sw;
  sw.start();
  auto input = cast(string) read("Acquisition_2009Q2.txt");
  auto mydata = fastcsv.csvToArray!('|')(input);
  sw.stop();
  double time = sw.peek().msecs;
  writeln("Time (s): ", time/1000);
}

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.


I guess the next step is allowing Tuple rows with mixed types.


Re: Speed of csvReader

2016-01-22 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via 
Digitalmars-d-learn wrote:
[...]
> >$ dmd file_read_5.d fastcsv.d
> >$ ./file_read_5
> >Time (s): 0.679
> >
> >Fastest so far, very nice.

Thanks!


> I guess the next step is allowing Tuple rows with mixed types.

I thought about that a little today. I'm guessing that most of the
performance will be dependent on the conversion into the target types.
Right now it's extremely fast because, for the most part, it's just
taking slices of an existing string. It shouldn't be too hard to extend
the current code so that instead of assembling the string slices in a
block buffer, it will run them through std.conv.to instead and store
them in an array of some given struct. But there may be performance
degradation because now we have to do non-trivial operations on the
string slices.
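
A rough sketch of that direction (the struct layout and helper are
hypothetical):

import std.conv : to;

struct Record
{
    int id;
    double price;
    const(char)[] name;   // keeping a slice avoids a copy entirely
}

// Hypothetical helper: turn one row of field slices into a Record.
Record toRecord(const(char)[][] fields)
{
    Record r;
    r.id    = fields[0].to!int;
    r.price = fields[1].to!double;  // likely the costlier conversion
    r.name  = fields[2];            // just a slice, no allocation
    return r;
}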

Converting from const(char)[] to string probably should be avoided where
not necessary, since otherwise it will involve lots and lots of small
allocations and the GC will become very slow. Converting to ints may not
be too bad... but conversion to types like floating point may be quite
slow. Now, assembling the resulting structs into an array could
potentially be slow... but perhaps an analogous block buffer technique
can be used to create the array piecemeal in separate blocks, and only
perform the final assembly into a single array at the very end (thus
avoiding reallocating and copying the growing array as we go along).
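
That piecemeal assembly could look roughly like this (a generic sketch
with an arbitrary block size, not measured code):

struct BlockBuilder(T)
{
    private T[][] blocks;
    private T[] current;
    private size_t used;

    void put(T item)
    {
        if (used == current.length)
        {
            blocks ~= current[0 .. used];   // retire the filled block
            current = new T[4096];          // assumed block size
            used = 0;
        }
        current[used++] = item;
    }

    // One final allocation and copy, instead of repeated regrowth.
    T[] finish()
    {
        import std.array : join;
        blocks ~= current[0 .. used];
        return blocks.join;
    }
}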

But we'll see.  Performance predictions are rarely accurate; only a
profiler will tell the truth about where the real bottlenecks are. :-)


T

-- 
LINUX = Lousy Interface for Nefarious Unix Xenophobes.


Re: Speed of csvReader

2016-01-22 Thread Jesse Phillips via Digitalmars-d-learn

On Friday, 22 January 2016 at 01:36:40 UTC, cym13 wrote:

On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
And now that you mention this, RFC-4180 does not allow doubled 
quotes in an unquoted field. I'll take that out of the code 
(it improves performance :-D).


Right, re-reading the RFC would have been a great thing. That 
said I saw that kind of CSV in the real world, so I don't know 
what to think of it. I'm not saying it should be supported, but 
I wonder if there are points outside RFC-4180 that are taken 
for granted.


You have to understand CSV didn't come from a standard. People 
started using it because it was simple for writing out some 
tabular data. Then they changed it because their data changed. 
It's not like their language came with a CSV parser; it was 
always hand-written, and people still do it today. And that is 
why data is delimited with so many things other than commas 
(people thought they wouldn't need to escape their data).


So yes, some CSV parsers will accept comments but that just means 
it breaks for people that have # in their data. Yeah, you can 
assume that two double quotes in unquoted data is just a quote, 
but then it breaks for those who have that kind of data which 
isn't escaped.


There are also many other issues with CSV data, like whether the 
file is in ASCII or UTF or some other code page. And many times 
CSV isn't well formed because the data was output without proper 
escaping.


std.csv isn't the end-all of CSV parsers, but it will at least 
handle well-formed CSV that uses different separators or quotes.


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
On Thursday, 21 January 2016 at 10:40:39 UTC, data pulverizer 
wrote:
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
wrote:



Okay, without registering, not gonna get that data.

So, usual things to think about: did you turn on release mode?
What about inlining?

Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d


That helped a lot. I disabled the GC and inlined as you 
suggested, and the time is now:


Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22)))

   user  system elapsed
  0.852   0.021   0.872

I should probably have begun with this timing. It's not my 
intention to turn this into a speed-only competition; however, 
the ingest of files and speed of calculation are very important 
to me.


I should probably add compiler version info:

~$ dmd --version
DMD64 D Compiler v2.069.2
Copyright (c) 1999-2015 by Digital Mars written by Walter Bright

Running Ubuntu 14.04 LTS



Re: Speed of csvReader

2016-01-21 Thread Edwin van Leeuwen via Digitalmars-d-learn
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:

  StopWatch sw;
  sw.start();
  auto buffer = std.file.readText("Acquisition_2009Q2.txt");
  auto records = csvReader!row_type(buffer, '|').array;
  sw.stop();



Is it csvReader or readText that is slow? i.e. could you move 
sw.start(); one line down (after the readText command) and see 
how long just the csvReader part takes?


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:

On 01/21/2016 02:40 AM, data pulverizer wrote:


dmd -release -inline code.d


These two as well please:

  -O -boundscheck=off


the ingest of files and
speed of calculation is very important to me.


We should understand why D is slow in this case. :)

Ali


Thank you, adding those two flags brings down the time a little 
more ...


Time (s): 6.832


Re: Speed of csvReader

2016-01-21 Thread Ali Çehreli via Digitalmars-d-learn

On 01/21/2016 02:40 AM, data pulverizer wrote:


dmd -release -inline code.d


These two as well please:

  -O -boundscheck=off


the ingest of files and
speed of calculation is very important to me.


We should understand why D is slow in this case. :)

Ali



Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
I have been reading large text files with D's csv file reader and 
have found it slow compared to R's read.table function which is 
not known to be particularly fast. Here I am reading Fannie Mae 
mortgage acquisition data which can be found here 
http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html after registering:


D Code:

import std.algorithm;
import std.array;
import std.file;
import std.csv;
import std.stdio;
import std.typecons;
import std.datetime;

alias row_type = Tuple!(string, string, string, string, string, 
string, string, string,
string, string, string, string, string, 
string, string, string,
string, string, string, string, string, 
string);


void main(){
  StopWatch sw;
  sw.start();
  auto buffer = std.file.readText("Acquisition_2009Q2.txt");
  auto records = csvReader!row_type(buffer, '|').array;
  sw.stop();
  double time = sw.peek().msecs;
  writeln("Time (s): ", time/1000);
}

Time (s): 13.478

R Code:

system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22)))

   user  system elapsed
  7.810   0.067   7.874


R takes about half as long to read the file. Both read the data 
in the "equivalent" type format. Am I doing something incorrect 
here?


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole 
wrote:



Okay, without registering, not gonna get that data.

So, usual things to think about: did you turn on release mode?
What about inlining?

Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d


That helped a lot. I disabled the GC and inlined as you 
suggested, and the time is now:


Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22)))

   user  system elapsed
  0.852   0.021   0.872

I should probably have begun with this timing. It's not my 
intention to turn this into a speed-only competition; however, 
the ingest of files and speed of calculation are very important 
to me.




Re: Speed of csvReader

2016-01-21 Thread Rikki Cattermole via Digitalmars-d-learn

On 21/01/16 10:39 PM, data pulverizer wrote:

I have been reading large text files with D's csv file reader and have
found it slow compared to R's read.table function which is not known to
be particularly fast. Here I am reading Fannie Mae mortgage acquisition
data which can be found here
http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html
after registering:

D Code:

import std.algorithm;
import std.array;
import std.file;
import std.csv;
import std.stdio;
import std.typecons;
import std.datetime;

alias row_type = Tuple!(string, string, string, string, string, string,
string, string,
 string, string, string, string, string, string,
string, string,
 string, string, string, string, string, string);

void main(){
   StopWatch sw;
   sw.start();
   auto buffer = std.file.readText("Acquisition_2009Q2.txt");
   auto records = csvReader!row_type(buffer, '|').array;
   sw.stop();
   double time = sw.peek().msecs;
   writeln("Time (s): ", time/1000);
}

Time (s): 13.478

R Code:

system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|",
colClasses = rep("character", 22)))
user  system elapsed
   7.810   0.067   7.874


R takes about half as long to read the file. Both read the data in the
"equivalent" type format. Am I doing something incorrect here?


Okay, without registering, not gonna get that data.

So, usual things to think about: did you turn on release mode?
What about inlining?

Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d


Re: Speed of csvReader

2016-01-21 Thread Saurabh Das via Digitalmars-d-learn

On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:

  StopWatch sw;
  sw.start();
  auto buffer = std.file.readText("Acquisition_2009Q2.txt");
  auto records = csvReader!row_type(buffer, '|').array;
  sw.stop();



Is it csvReader or readText that is slow? i.e. could you move 
sw.start(); one line down (after the readText command) and see 
how long just the csvReader part takes?


Please try this:

auto records = 
File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array;


Can you put up some sample data and share the number of records 
in the file as well.


Actually since you're aiming for speed, this might be better:

sw.start();
auto records = 
File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a 
=> cast(dchar)a).csvReader!row_type('|').array

sw.stop();

Please do verify that the end result is the same - I'm not 100% 
confident of the cast.


Thanks,
Saurabh



Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:

On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van 
Leeuwen wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:

  StopWatch sw;
  sw.start();
  auto buffer = std.file.readText("Acquisition_2009Q2.txt");
  auto records = csvReader!row_type(buffer, '|').array;
  sw.stop();



Is it csvReader or readText that is slow? i.e. could you move 
sw.start(); one line down (after the readText command) and 
see how long just the csvReader part takes?


Please try this:

auto records = 
File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array;


Can you put up some sample data and share the number of 
records in the file as well.


Actually since you're aiming for speed, this might be better:

sw.start();
auto records = 
File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a 
=> cast(dchar)a).csvReader!row_type('|').array

sw.stop();

Please do verify that the end result is the same - I'm not 100% 
confident of the cast.


Thanks,
Saurabh


@Saurabh I have tried your latest suggestion and the time reduces 
fractionally to:


Time (s): 6.345

the previous suggestion actually increased the time

@Edwin van Leeuwen The csvReader is what takes the most time, the 
readText takes 0.229 s


Re: Speed of csvReader

2016-01-21 Thread Saurabh Das via Digitalmars-d-learn
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen 
wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:

  StopWatch sw;
  sw.start();
  auto buffer = std.file.readText("Acquisition_2009Q2.txt");
  auto records = csvReader!row_type(buffer, '|').array;
  sw.stop();



Is it csvReader or readText that is slow? i.e. could you move 
sw.start(); one line down (after the readText command) and see 
how long just the csvReader part takes?


Please try this:

auto records = 
File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array;


Can you put up some sample data and share the number of records 
in the file as well.




Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:

On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
Actually since you're aiming for speed, this might be better:


sw.start();
auto records = 
File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array

sw.stop();

Please do verify that the end result is the same - I'm not 
100% confident of the cast.


Thanks,
Saurabh


@Saurabh I have tried your latest suggestion and the time 
reduces fractionally to:


Time (s): 6.345

the previous suggestion actually increased the time

@Edwin van Leeuwen The csvReader is what takes the most time, 
the readText takes 0.229 s


p.s. @Saurabh the result looks fine from the cast.

Thanks


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 16:25:55 UTC, bachmeier wrote:
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
wrote:



Running Ubuntu 14.04 LTS


In that case, have you looked at

http://lancebachmeier.com/rdlang/

If this is a serious bottleneck you can solve it with two lines

evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22))`);

auto x = RMatrix(evalR("x"));

and then you've got access to the data in D.


Thanks. That's certainly something to try.


Re: Speed of csvReader

2016-01-21 Thread Saurabh Das via Digitalmars-d-learn
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
wrote:

On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:

Interesting that reading a file is so slow.

Your timings from R, is that including reading the file also?


Yes, it's just insane, isn't it?


It is insane. Earlier in the thread we were clearly tackling the 
wrong problem. Hence the adage, "measure first" :-/.


As suggested by Edwin van Leeuwen, can you give us a timing of:
auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array;


Thanks,
Saurabh



Re: Speed of csvReader

2016-01-21 Thread wobbles via Digitalmars-d-learn
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:

On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das 
wrote:

[...]


Actually since you're aiming for speed, this might be better:

sw.start();
auto records = 
File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array

sw.stop();

Please do verify that the end result is the same - I'm not 
100% confident of the cast.


Thanks,
Saurabh


@Saurabh I have tried your latest suggestion and the time 
reduces fractionally to:


Time (s): 6.345

the previous suggestion actually increased the time

@Edwin van Leeuwen The csvReader is what takes the most time, 
the readText takes 0.229 s


Interesting that reading a file is so slow.

Your timings from R, is that including reading the file also?


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:

Interesting that reading a file is so slow.

Your timings from R, is that including reading the file also?


Yes, it's just insane, isn't it?


Re: Speed of csvReader

2016-01-21 Thread bachmeier via Digitalmars-d-learn

On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:


We should understand why D is slow in this case. :)

Ali


fread source is here:

https://github.com/Rdatatable/data.table/blob/master/src/fread.c

Good luck trying to work through that (which explains why I'm 
using D). I don't know what their magic is, but data.table is 
many times faster than anything else in R, so I don't think it's 
trivial.


Re: Speed of csvReader

2016-01-21 Thread bachmeier via Digitalmars-d-learn
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer 
wrote:



Running Ubuntu 14.04 LTS


In that case, have you looked at

http://lancebachmeier.com/rdlang/

If this is a serious bottleneck you can solve it with two lines

evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", 
colClasses = rep("character", 22))`);

auto x = RMatrix(evalR("x"));

and then you've got access to the data in D.


Re: Speed of csvReader

2016-01-21 Thread Edwin van Leeuwen via Digitalmars-d-learn
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer 
wrote:

On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
@Edwin van Leeuwen The csvReader is what takes the most time, 
the readText takes 0.229 s


The underlying problem most likely is that csvReader has (AFAIK) 
never been properly optimized/profiled (very old piece of the 
library). You could try to implement a rough csvReader using 
buffer.byLine() and for each line use split("|") to split at "|". 
That should be faster, because it doesn't do any checking.


Untested code:
string[][] res = buffer.byLine().map!((a) => a.split("|").array).array;




Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 17:17:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer 
wrote:

On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:

Interesting that reading a file is so slow.

Your timings from R, is that including reading the file also?


Yes, it's just insane, isn't it?


It is insane. Earlier in the thread we were clearly tackling the 
wrong problem. Hence the adage, "measure first" :-/.


As suggested by Edwin van Leeuwen, can you give us a timing of:
auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array;


Thanks,
Saurabh


Good news and bad news. I was going for something similar to what 
you have above, and both slash the time a lot:


Time (s): 1.024

But now the output is a little garbled. For some reason the 
splitter isn't splitting correctly - or we are not applying it 
properly. Line 0:


["11703051", "RETAIL", "BANK OF AMERICA, 
N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", "75|1|26", 
"80", "|N", "|", "O ", "ASH", "OU", " REFINANCE|PUD|1|INVE", 
"TOR", "C", "|801||FRM", "\n\n", "863", "", "FRM"]


Re: Speed of csvReader

2016-01-21 Thread Justin Whear via Digitalmars-d-learn
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
 
> It's interesting that the first output array is not the same as
> the input

byLine reuses a buffer (for speed) and the subsequent split operation 
just returns slices into that buffer.  So when byLine progresses to the 
next line the strings (slices) returned previously now point into a 
buffer with different contents.  You should either use byLineCopy or .idup 
to create copies of the relevant strings.  If your use-case allows for 
streaming and doesn't require having all the data present at once, you 
could continue to use byLine and just be careful not to refer to previous 
rows.
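
In code, the difference is roughly this (file name hypothetical):

import std.algorithm : map;
import std.array : array;
import std.stdio : File;

void main()
{
    // Dangerous if the rows are kept: each line is a slice of one
    // internal buffer that byLine reuses on every iteration.
    //auto stale = File("data.txt").byLine.array;

    // Safe: byLineCopy allocates a fresh string per line...
    auto rows1 = File("data.txt").byLineCopy.array;

    // ...as does an explicit .idup of each reused buffer.
    auto rows2 = File("data.txt").byLine.map!(l => l.idup).array;
}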


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
On Thursday, 21 January 2016 at 18:31:17 UTC, data pulverizer 
wrote:
Good news and bad news. I was going for something similar to 
what you have above, and both slash the time a lot:


Time (s): 1.024

But now the output is a little garbled. For some reason the 
splitter isn't splitting correctly - or we are not applying it 
properly. Line 0:


["11703051", "RETAIL", "BANK OF AMERICA, 
N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", 
"75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " 
REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", 
"", "FRM"]


I should probably include the first few lines of the file:

10511550|RETAIL|FLAGSTAR CAPITAL MARKETS 
CORPORATION|5|222000|360|04/2009|06/2009|44|44|2|37|823|NO|NO 
CASH-OUT REFINANCE|PUD|1|PRINCIPAL|AZ|863||FRM
11031040|BROKER|SUNTRUST MORTGAGE 
INC.|4.99|456000|360|03/2009|05/2009|83|83|1|47|744|NO|NO 
CASH-OUT REFINANCE|SF|1|PRINCIPAL|MD|211|12|FRM
11445182|CORRESPONDENT|CITIMORTGAGE, 
INC.|4.875|172000|360|05/2009|07/2009|80|80|2|25|797|NO|CASH-OUT 
REFINANCE|SF|1|PRINCIPAL|TX|758||FRM
11703051|RETAIL|BANK OF AMERICA, 
N.A.|4.875|207000|360|03/2009|05/2009|75|75|1|26|806|NO|NO 
CASH-OUT REFINANCE|PUD|1|INVESTOR|CO|801||FRM
16033316|CORRESPONDENT|JPMORGAN CHASE BANK, NATIONAL 
ASSOCIATION|5|17|360|05/2009|07/2009|80|80|1|23|771|NO|CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|224||FRM



It's interesting that the first output array is not the same as 
the input


Re: Speed of csvReader

2016-01-21 Thread Gerald Jansen via Digitalmars-d-learn
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
I have been reading large text files with D's csv file reader 
and have found it slow compared to R's read.table function


This great blog post has an optimized FastReader for CSV files:

http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html


Re: Speed of csvReader

2016-01-21 Thread cym13 via Digitalmars-d-learn

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:

[...]


It may be fast, but I think that's related to the fact that this 
is not a CSV parser. Don't get me wrong, it is able to parse a 
format defined by delimiters, but true CSV is one hell of a 
beast. Of course most data looks like:


number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,oh my gosh

but you can have delimiters inside a field:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,"14,2",gentle
3,Pinkie Pie,169,oh my gosh

or quotes in a quoted field; in that case you have to double 
the quotes:


number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said ""oh my gosh"""

but in that case external quotes aren't required:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,He said ""oh my gosh""

but at least it's always one record per line, no? No? No.

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,"He said
""oh my gosh""
And she replied
""Come on! Have fun!"""

I'll stop there, but you get the picture. Simply splitting by 
line then separator may work well on most data, but I wouldn't 
put it in production or in the standard library. Note that I 
think you did a great job optimizing your code, and I respect 
that, it's just a friendly reminder.




Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via 
Digitalmars-d-learn wrote:

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
>On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
>Digitalmars-d-learn wrote: [...]
>This piqued my interest today, so I decided to take a shot at 
>writing a fast CSV parser.  First, I downloaded a sample 
>large CSV file from: [...]


Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my 
machine but I get crt1.o errors, for example:


.../crt1.o(.debug_info): relocation 0 has invalid symbol index 
0


are there flags that I should be compiling with or some other 
thing that I am missing?


Did you supply a main() function? If not, it won't run, because 
fastcsv.d is only a module.  If you want to run the benchmark, 
you'll have to compile both benchmark.d and fastcsv.d together.



T


Thanks, I got used to getting away with running the "script" file 
in the same folder as a single file module - it usually works but 
occasionally (like now) I have to compile both together as you 
suggested.


Re: Speed of csvReader

2016-01-21 Thread Brad Anderson via Digitalmars-d-learn

On Thursday, 21 January 2016 at 22:13:38 UTC, Brad Anderson wrote:

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:

[...]


What about wrapping the slices in a range-like interface that 
would unescape the quotes on demand? You could even set a flag 
on it during the initial pass to say the field has double 
quotes that need to be escaped so it doesn't need to take a 
per-pop performance hit checking for double quotes (that's 
probably a pretty minor boost, if any, though).


Oh, you discussed range-based later. I should have finished 
reading before replying.


Re: Speed of csvReader

2016-01-21 Thread Jon D via Digitalmars-d-learn

On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via 
Digitalmars-d-learn wrote: [...]
FWIW - I've been implementing a few programs manipulating 
delimited files, e.g. tab-delimited. Simpler than CSV files 
because there is no escaping inside the data. I've been trying 
to do this in relatively straightforward ways, e.g. using 
byLine rather than byChunk. (Goal is to explore the power of D 
standard libraries).


I've gotten significant speed-ups in a couple different ways:
* DMD libraries 2.068+  -  byLine is dramatically faster
* LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the 
DMD compiler


While byLine has improved a lot, it's still not the fastest 
thing in the world, because it still performs (at least) one OS 
roundtrip per line, not to mention it will auto-reencode to 
UTF-8. If your data is already in a known encoding, reading in 
the entire file and casting to (|w|d)string then splitting it 
by line will be a lot faster, since you can eliminate a lot of 
I/O roundtrips that way.


No disagreement, but I had other goals. At a high level, I'm 
trying to learn and evaluate D, which partly involves 
understanding the strengths and weaknesses of the standard 
library. From this perspective, byLine was a logical starting 
point. More specifically, the tools I'm writing are often used in 
unix pipelines, so input can be a mixture of standard input and 
files. And, the files can be arbitrarily large. In these cases, 
reading the entire file is not always appropriate. Buffering 
usually is, and my code knows when it is dealing with files vs 
standard input and could handle these differently. However, 
standard library code could handle these distinctions as well, 
which was part of the reason for trying the straightforward 
approach.


Aside - Despite the 'learning D' motivation, the tools are real 
tools, and writing them in D has been a clear win, especially 
with the byLine performance improvements in 2.068.




Re: Speed of csvReader

2016-01-21 Thread Jon D via Digitalmars-d-learn
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
I have been reading large text files with D's csv file reader 
and have found it slow compared to R's read.table function 
which is not known to be particularly fast.


FWIW - I've been implementing a few programs manipulating 
delimited files, e.g. tab-delimited. Simpler than CSV files 
because there is no escaping inside the data. I've been trying to 
do this in relatively straightforward ways, e.g. using byLine 
rather than byChunk. (Goal is to explore the power of D standard 
libraries).


I've gotten significant speed-ups in a couple different ways:
* DMD libraries 2.068+  -  byLine is dramatically faster
* LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the 
DMD compiler
* Avoid utf-8 to dchar conversion - This conversion often occurs 
silently when working with ranges, but is generally not needed 
when manipulating data.
* Avoid unnecessary string copies. e.g. Don't gratuitously 
convert char[] to string.
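
For the auto-decoding point, a small sketch of the byte-level
alternative (the tab delimiter is an assumed example):

import std.algorithm : splitter;
import std.string : representation;

void fields(const(char)[] line)
{
    // representation gives a ubyte view of the same memory, so the
    // silent UTF-8 -> dchar decoding of char ranges never kicks in.
    auto bytes = line.representation;
    foreach (field; bytes.splitter(cast(ubyte) '\t'))
    {
        // field slices the original line; cast back to char[] only
        // where actual text handling is needed.
        auto text = cast(const(char)[]) field;
    }
}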


At this point performance of the utilities I've been writing is 
quite good. They don't have direct equivalents with other tools 
(such as gnu core utils), so a head-to-head is not appropriate, 
but generally it seems the tools are quite competitive without 
needing to do my own buffer or memory management. And, they are 
dramatically faster than the same tools written in perl (which I 
was happy with).


--Jon


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >[...]
> 
> It may be fast, but I think that's related to the fact that this is
> not a CSV parser. Don't get me wrong, it is able to parse a format
> defined by delimiters, but true CSV is one hell of a beast.
[...]

As I stated, I didn't fully implement the parsing of quoted fields. (Or,
for that matter, the correct parsing of crazy wrapped values like you
pointed out.) This is not finished code; it's more of a proof of
concept.


T

-- 
Lottery: tax on the stupid. -- Slashdotter


Re: Speed of csvReader

2016-01-21 Thread Brad Anderson via Digitalmars-d-learn

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:

[snip]
There are some limitations to this approach: while the current 
code does try to unwrap quoted values in the CSV, it does not 
correctly parse escaped double quotes ("") in the fields. This 
is because to process those values correctly we'd have to copy 
the field data into a new string and construct its interpreted 
value, which is slow.  So I leave it as an exercise for the 
reader to implement (it's not hard, when the double 
double-quote sequence is detected, allocate a new string with 
the interpreted data instead of slicing the original data. 
Either that, or just unescape the quotes in the application 
code itself).


What about wrapping the slices in a range-like interface that 
would unescape the quotes on demand? You could even set a flag on 
it during the initial pass to say the field has double quotes 
that need to be escaped so it doesn't need to take a per-pop 
performance hit checking for double quotes (that's probably a 
pretty minor boost, if any, though).
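
A sketch of that wrapper idea (hypothetical, not from the fastcsv 
code): a character range over a field slice that collapses doubled 
quotes lazily, so nothing is allocated unless the caller materializes 
it, e.g. with std.array.array:

struct Unescaped
{
    const(char)[] s;   // a field slice, possibly containing "" pairs

    @property bool empty() const { return s.length == 0; }
    @property char front() const { return s[0]; }

    void popFront()
    {
        // A "" pair yields a single quote: front emitted the first,
        // so skip both characters here.
        if (s.length >= 2 && s[0] == '"' && s[1] == '"')
            s = s[2 .. $];
        else
            s = s[1 .. $];
    }
}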




Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
Digitalmars-d-learn wrote: [...]
This piqued my interest today, so I decided to take a shot at 
writing a fast CSV parser.  First, I downloaded a sample large 
CSV file from: [...]


Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my 
machine but I get crt1.o errors, for example:


.../crt1.o(.debug_info): relocation 0 has invalid symbol index 0

are there flags that I should be compiling with or some other 
thing that I am missing?


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via 
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
> >Digitalmars-d-learn wrote: [...] This piqued
> >my interest today, so I decided to take a shot at writing a fast CSV
> >parser.  First, I downloaded a sample large CSV file from: [...]
> 
> Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine
> but I get crt1.o errors, for example:
> 
> .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0
> 
> are there flags that I should be compiling with or some other thing
> that I am missing?

Did you supply a main() function? If not, it won't run, because
fastcsv.d is only a module.  If you want to run the benchmark, you'll
have to compile both benchmark.d and fastcsv.d together.


T

-- 
Give a man a fish, and he eats once. Teach a man to fish, and he will sit 
forever.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via 
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote:
> >R takes about half as long to read the file. Both read the data in
> >the "equivalent" type format. Am I doing something incorrect here?
> 
> CsvReader hasn't been compared and optimized from other CSV readers.
> It does have allocation for the parsed string (even if it isn't
> changed) and it does a number of validation checks.
[...]

This piqued my interest today, so I decided to take a shot at writing a
fast CSV parser.  First, I downloaded a sample large CSV file from:

ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip

This file has over 2 million records, so I thought it would serve as a
good dataset to run benchmarks on.

Since the OP wanted the loaded data in an array of records, as opposed
to iterating over the records as an input range, I decided that the best
way to optimize this use case was to load the entire file into memory
and then return an array of slices to this data, instead of wasting time
(and memory) copying the data.

Furthermore, since it will be an array of records which are arrays of
slices to field values, another optimization is to allocate a large
buffer for storing consecutive field slices, and then in the outer array
just slice the buffer to represent a record. This greatly cuts down on
the number of GC allocations needed.

Once the buffer is full, we don't allocate a larger buffer and copy
everything over; this is unnecessary (and wasteful) because the outer
array doesn't care where its elements point to. Instead, we allocate a
new buffer, leaving previous records pointing to slices of the old
buffer, and start appending more field slices in the new buffer, and so
on. After all, the records don't have to exist in consecutive slices.
There's just a minor overhead in that if we run out of space in the
buffer while in the middle of parsing a record, we need to copy the
current record's field slices into the new buffer, so that all the
fields belonging to this record remain contiguous (so that the outer
array can just slice them). This is a very small overhead compared to
copying the entire buffer into a new memory block (as would happen if we
kept the buffer as a single array that needs to expand), so it ought to
be negligible.

So in a nutshell, what we have is an outer array, each element of which
is a slice (representing a record) that points to some slice of one of
the buffers. Each buffer is a contiguous sequence of slices
(representing a field) pointing to some segment of the original data.

Here's the code:

---
/**
 * Experimental fast CSV reader.
 *
 * Based on RFC 4180.
 */
module fastcsv;

/**
 * Reads CSV data from the given filename.
 */
auto csvFromUtf8File(string filename)
{
import std.file : read;
return csvFromString(cast(string) read(filename));
}

/**
 * Parses CSV data in a string.
 *
 * Params:
 *  fieldDelim = The field delimiter (default: ',')
 *  data = The data in CSV format.
 */
auto csvFromString(dchar fieldDelim=',', dchar quote='"')(const(char)[] data)
{
import core.memory;
import std.array : appender;

enum fieldBlockSize = 1 << 16;
auto fields = new const(char)[][fieldBlockSize];
size_t curField = 0;

GC.disable();
auto app = appender!(const(char)[][][]);

// Scan data
size_t i;
while (i < data.length)
{
// Parse records
size_t firstField = curField;
while (i < data.length && data[i] != '\n' && data[i] != '\r')
{
// Parse fields
size_t firstChar, lastChar;
if (data[i] == quote)
{
i++;
firstChar = i;
while (i < data.length && data[i] != fieldDelim &&
   data[i] != '\n' && data[i] != '\r')
{
i++;
}
                lastChar = (i < data.length && data[i-1] == quote) ? i-1 : i;
}
else
{
firstChar = i;
while (i < data.length && data[i] != fieldDelim &&
   data[i] != '\n' && data[i] != '\r')
{
i++;
}
lastChar = i;
}
if (curField >= fields.length)

Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 20:46:15 UTC, Gerald Jansen wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
I have been reading large text files with D's csv file reader 
and have found it slow compared to R's read.table function


This great blog post has an optimized FastReader for CSV files:

http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html


Thanks a lot Gerald, the blog and the discussions were very 
useful and revealing. For me it shows that you can use the D 
language to write fast code and then, if you need to wring out 
more performance, you can go as low-level as you want, all 
without leaving the D language or its tooling ecosystem.


Re: Speed of csvReader

2016-01-21 Thread Jesse Phillips via Digitalmars-d-learn

On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
Of course, running without GC collection is not a fair 
comparison with std.csv, so I added an option to my benchmark 
program to disable the GC for std.csv as well.  While the 
result was slightly faster, it was still much slower than my 
fastcsv code. (Though to be fair, std.csv does perform 
validation checks and so forth that fastcsv doesn't even try 
to.)


As mentioned, validation can be turned off:

auto data = std.csv.csvReader!(string, 
Malformed.ignore)(input).array;


I forgot to mention that one of the requirements for std.csv was 
that it worked on the base range type, input range. Not that 
slicing wouldn't be a valid addition.


I was also going to do the same thing with my sliced CSV, no 
fixing of the escaped quote. That would have just been a helper 
function the user could map over the results.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
> FWIW - I've been implementing a few programs manipulating delimited
> files, e.g. tab-delimited. Simpler than CSV files because there is no
> escaping inside the data. I've been trying to do this in relatively
> straightforward ways, e.g. using byLine rather than byChunk. (Goal is
> to explore the power of D standard libraries).
> 
> I've gotten significant speed-ups in a couple different ways:
> * DMD libraries 2.068+  -  byLine is dramatically faster
> * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the DMD compiler

While byLine has improved a lot, it's still not the fastest thing in the
world, because it still performs (at least) one library call per line,
not to mention it will auto-reencode to UTF-8. If your data is already
in a known encoding, reading in the entire file and casting it to
(|w|d)string, then splitting it by line, will be a lot faster, since you
can eliminate a lot of I/O roundtrips that way.
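
For illustration, a minimal sketch of that approach (assuming the file
is valid UTF-8 and fits in memory; the file name is made up):

import std.algorithm : splitter;
import std.file : read;

void main()
{
    // One bulk read instead of per-line I/O; the cast assumes the
    // file already contains valid UTF-8.
    auto text = cast(string) read("data.csv");

    foreach (line; text.splitter('\n'))
    {
        // Each line is a slice into the one big buffer: no copies,
        // no re-encoding.
    }
}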

In any case, it's well-known that gdc/ldc generally produce code that's
about 20%-30% faster than dmd-compiled code, sometimes a lot more. While
DMD has gotten some improvements in this area recently, it still has a
long way to go before it can catch up.  For performance-sensitive code I
always reach for gdc instead of dmd.


> * Avoid utf-8 to dchar conversion - This conversion often occurs
> silently when working with ranges, but is generally not needed when
> manipulating data.
[...]
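
For what it's worth, a minimal way to opt out of that silent decoding
in range code (a sketch using std.utf.byCodeUnit, not necessarily what
Jon did):

import std.algorithm : splitter;
import std.utf : byCodeUnit;

void main()
{
    string line = "a,b,c";
    // byCodeUnit iterates raw chars, sidestepping the implicit
    // UTF-8 -> dchar decoding that string ranges otherwise perform.
    foreach (field; line.byCodeUnit.splitter(','))
    {
        // field covers the original code units; no decoding happened
    }
}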

Yet another nail in the coffin of auto-decoding.  I wonder how many more
nails we will need before Andrei is convinced...


T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it 
often, use it unsparingly, to subdue your listeners into submission!


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
are there flags that I should be compiling with or some other 
thing that I am missing?


Did you supply a main() function? If not, it won't run, because 
fastcsv.d is only a module.  If you want to run the benchmark, 
you'll have to compile both benchmark.d and fastcsv.d together.
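
For example, something along these lines should do (exact flags to 
taste):

dmd -O -release -inline benchmark.d fastcsv.d
./benchmark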



T


Great benchmarks! This is something else for me to learn from.



Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 11:03:23PM +, cym13 via Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >[...]
> 
> It may be fast but I think it may be related to the fact that this is
> not a CSV parser. Don't get me wrong, it is able to parse a format
> defined by delimiters but true CSV is one hell of a beast.

Alright, I decided to take on the challenge to write a "real" CSV
parser... since it's a bit tedious to keep posting code in the forum,
I've pushed it to github instead:

https://github.com/quickfur/fastcsv


[...]
> but you can have delimiters inside a field:
> 
> number,name,price,comment
> 1,Twilight,150,good friend
> 2,Fluttershy,"14,2",gentle
> 3,Pinkie Pie,169,oh my gosh

Fixed.


> or quotes in a quoted field, in that case you have to double the quotes:
> 
> number,name,price,comment
> 1,Twilight,150,good friend
> 2,Fluttershy,142,gentle
> 3,Pinkie Pie,169,"He said ""oh my gosh"""

Fixed.  Well, except that I don't actually interpret the doubled
quotes; I leave it up to the caller to filter them out at the
application level.


> but in that case external quotes aren't required:
> 
> number,name,price,comment
> 1,Twilight,150,good friend
> 2,Fluttershy,142,gentle
> 3,Pinkie Pie,169,He said ""oh my gosh""

Actually, this already worked before. (Excepting the untranslated
doubled quotes, of course.)


> but at least it's always one record per line, no? No? No.
> 
> number,name,price,comment
> 1,Twilight,150,good friend
> 2,Fluttershy,142,gentle
> 3,Pinkie Pie,169,"He said
> ""oh my gosh""
> And she replied
> ""Come on! Have fun!"""

Fixed.


> I'll stop there, but you get the picture. Simply splitting by line
> then separator may work well on most data, but I wouldn't put it in
> production or in the standard library.

Actually, my code does *not* split by line then by separator. Did you
read it? ;-)


T

-- 
The most powerful one-line C program: #include "/dev/tty" -- IOCCC


Re: Speed of csvReader

2016-01-21 Thread cym13 via Digitalmars-d-learn

On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
And now that you mention this, RFC-4180 does not allow doubled 
quotes in an unquoted field. I'll take that out of the code (it 
improves performance :-D).


Right, re-reading the RFC first would have been a good idea. That 
said, I have seen that kind of CSV in the real world, so I don't 
know what to make of it. I'm not saying it should be supported, 
but I wonder which points outside RFC-4180 are taken for granted 
in practice.




Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 04:31:03PM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
> On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn 
> wrote:
[...]
> > https://github.com/quickfur/fastcsv
> 
> Oh, forgot to mention, the parsing times are still lightning fast
> after the fixes I mentioned: still around 1190 msecs or so.
> 
> Now I'm tempted to actually implement doubled-quote interpretation...
> as long as the input file doesn't contain unreasonable amounts of
> doubled quotes, I'm expecting the speed should remain pretty fast.
[...]

Done, commits pushed to github.

The new code now parses doubled quotes correctly.  The performance is
slightly worse now, around 1300 msecs on average, even in files that
don't have any doubled quotes (it's a penalty incurred by the inner loop
needing to detect doubled quote sequences).
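
For the curious, the shape of that check is roughly this (a sketch,
not the exact code from the repository):

// Scan a quoted field starting at index i, treating "" as an escaped
// quote that stays inside the field. Returns the index of the field's
// closing quote (or data.length if unterminated).
size_t scanQuotedField(const(char)[] data, size_t i, char quote = '"')
{
    while (i < data.length)
    {
        if (data[i] == quote)
        {
            if (i + 1 < data.length && data[i+1] == quote)
                i += 2;     // doubled quote: part of the field data
            else
                break;      // lone quote: the field ends here
        }
        else
            i++;
    }
    return i;
}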

My benchmark input file doesn't have any doubled quotes, however (code
correctness with doubled quotes is gauged by unittests only); so the
performance numbers may not accurately reflect true performance in the
general case. (But if doubled quotes are rare, as I'm expecting, the
actual performance shouldn't change too much in general usage...)

Maybe somebody who has a file with lots of ""'s can run the benchmark to
see how badly it performs? :-P


T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be 
algorithms.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 22, 2016 at 01:13:07AM +, Jesse Phillips via 
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
> >but in that case external quotes aren't required:
> >
> >number,name,price,comment
> >1,Twilight,150,good friend
> >2,Fluttershy,142,gentle
> >3,Pinkie Pie,169,He said ""oh my gosh""
> 
> std.csv will reject this. If validation is turned off this is fine but
> your data will include "".
> 
> "A field containing new lines, commas, or double quotes should be
> enclosed in double quotes (customizable)"
> 
> This is because it is not possible to decide what the correct parsing
> should be. Is the data intended to include two double quotes? What if
> there was only one quote there; do I have to remember it was there and
> decide not to throw it out because I didn't see another quote? At this
> point the data is not following CSV rules, so if I'm validating I'm
> throwing it out, and if I'm not validating I'm not stripping data.

This case is still manageable, because there are no embedded commas.
Everything between the last comma and the next comma or newline
unambiguously belongs to the current field.  As to how to interpret it
(should the result contain single or doubled quotes?), though, that
could potentially be problematic.

And now that you mention this, RFC-4180 does not allow doubled quotes in
an unquoted field. I'll take that out of the code (it improves
performance :-D).


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely 
repeat each other.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
> On Thu, Jan 21, 2016 at 11:03:23PM +, cym13 via Digitalmars-d-learn wrote:
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> > >[...]
> > 
> > It may be fast but I think it may be related to the fact that this is
> > not a CSV parser. Don't get me wrong, it is able to parse a format
> > defined by delimiters but true CSV is one hell of a beast.
> 
> Alright, I decided to take on the challenge to write a "real" CSV
> parser... since it's a bit tedious to keep posting code in the forum,
> I've pushed it to github instead:
> 
>   https://github.com/quickfur/fastcsv

Oh, forgot to mention, the parsing times are still lightning fast after
the fixes I mentioned: still around 1190 msecs or so.

Now I'm tempted to actually implement doubled-quote interpretation... as
long as the input file doesn't contain unreasonable amounts of doubled
quotes, I'm expecting the speed should remain pretty fast.


T


Re: Speed of csvReader

2016-01-21 Thread cym13 via Digitalmars-d-learn

On Friday, 22 January 2016 at 00:26:16 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 11:03:23PM +, cym13 via 
Digitalmars-d-learn wrote:

[...]


Alright, I decided to take on the challenge to write a "real" 
CSV parser... since it's a bit tedious to keep posting code in 
the forum, I've pushed it to github instead:


https://github.com/quickfur/fastcsv


[...]

[...]


Fixed.



[...]


Fixed.  Well, except that I don't actually interpret the doubled 
quotes; I leave it up to the caller to filter them out at the 
application level.




[...]


Actually, this already worked before. (Excepting the untranslated 
doubled quotes, of course.)



[...]


Fixed.



[...]


Actually, my code does *not* split by line then by separator. 
Did you read it? ;-)



T


Great! Sorry for the separator thing, I didn't read your code 
carefully. You still lack some things like comments and surely 
more things that I don't know about, but it's getting there. I 
didn't think you'd go through the trouble of fixing those things, 
to be honest; I'm impressed.


Re: Speed of csvReader

2016-01-21 Thread Jesse Phillips via Digitalmars-d-learn

On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:

but in that case external quotes aren't required:

number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,He said ""oh my gosh""


std.csv will reject this. If validation is turned off this is 
fine but your data will include "".


"A field containing new lines, commas, or double quotes should be 
enclosed in double quotes (customizable)"


This is because it is not possible to decide what the correct 
parsing should be. Is the data intended to include two double 
quotes? What if there was only one quote there; do I have to 
remember it was there and decide not to throw it out because I 
didn't see another quote? At this point the data is not following 
CSV rules, so if I'm validating I'm throwing it out, and if I'm 
not validating I'm not stripping data.


Re: Speed of csvReader

2016-01-21 Thread Jesse Phillips via Digitalmars-d-learn

On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
Great! Sorry for the separator thing, I didn't read your code 
carefully. You still lack some things like comments and surely 
more things that I don't know about, but it's getting there. I 
didn't think you'd go through the trouble of fixing those 
things, to be honest; I'm impressed.


CSV doesn't have comments, sorry.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Jan 22, 2016 at 12:56:02AM +, cym13 via Digitalmars-d-learn wrote:
[...]
> Great! Sorry for the separator thing, I didn't read your code
> carefully. You still lack some things like comments and surely more
> things that I don't know about, but it's getting there.

Comments? You mean in the code?  'cos the CSV grammar described in
RFC-4180 doesn't seem to have the possibility of comments in the CSV
itself...


> I didn't think you'd go through the trouble of fixing those things,
> to be honest; I'm impressed.

They weren't that hard to fix, because the original code already had a
separate path for quoted values, so it was just a matter of deleting
some of the loop conditions to make the quoted path accept delimiters
and newlines. In fact, the original code already accepted doubled
quotes in the unquoted field path.

Only the interpretation of doubled quotes required modifications to
both inner loops.

Now having said that, though, I think there are some bugs in the code
that might cause an array overrun... and the fix might slow things down
yet a bit more. There are also some fundamental limitations:

1) The CSV data has to be loadable into memory in its entirety. This may
not be possible for very large files, or on machines with low memory.

2) There is no ranged-based interface. I *think* this should be possible
to add, but it will probably increase the overhead and make the code
slower.

3) There is no validation of the input whatsoever. If you feed it
malformed CSV, it will give you nonsensical output. Well, it may crash,
but hopefully won't anymore after I fix those missing bounds checks...
but it will still give you nonsensical output.

4) The accepted syntax is actually a little larger than strict CSV (in
the sense of RFC-4180); Unicode input is accepted but RFC-4180 does not
allow Unicode. This may actually be a plus, though, because I'm
expecting that modern CSV may actually contain Unicode data, not just
the ASCII range defined in RFC-4180.


T

-- 
The volume of a pizza of thickness a and radius z can be described by the 
following formula: pi zz a. -- Wouter Verhelst


Re: Speed of csvReader

2016-01-21 Thread cym13 via Digitalmars-d-learn

On Friday, 22 January 2016 at 01:14:48 UTC, Jesse Phillips wrote:

On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
Great! Sorry for the separator thing, I didn't read your code 
carefully. You still lack some things like comments and surely 
more things that I don't know about, but it's getting there. I 
didn't think you'd go through the trouble of fixing those 
things, to be honest; I'm impressed.


CSV doesn't have comments, sorry.


I've met libraries that accept lines beginning with # as comments 
(outside of "" of course) and wrongly assumed it was a standard 
thing. I stand corrected.


Re: Speed of csvReader

2016-01-21 Thread H. S. Teoh via Digitalmars-d-learn
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn 
wrote:
> [...]
> > >   https://github.com/quickfur/fastcsv
[...]

Fixed some boundary condition crashes and reverted doubled quote
handling in unquoted fields (since those are illegal according to RFC
4180).  Performance is back in the ~1200 msec range.


T

-- 
There is no gravity. The earth sucks.


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn

On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote:

On Thu, 21 Jan 2016 18:37:08 +, data pulverizer wrote:

It's interesting that the first output array is not the same 
as the input


byLine reuses a buffer (for speed) and the subsequent split 
operation just returns slices into that buffer.  So when byLine 
progresses to the next line the strings (slices) returned 
previously now point into a buffer with different contents.  
You should either use byLineCopy or .idup to create copies of 
the relevant strings.  If your use-case allows for streaming 
and doesn't require having all the data present at once, you 
could continue to use byLine and just be careful not to refer 
to previous rows.
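
To make the difference concrete, a small sketch (file name made up):

import std.algorithm : map;
import std.array : array, split;
import std.stdio : File;

void main()
{
    // Risky if the rows outlive the iteration: each row would slice
    // byLine's reused internal buffer.
    //auto rows = File("data.csv").byLine.map!(l => l.split(",")).array;

    // Safe: byLineCopy allocates a fresh string per line, so the
    // slices stay valid.
    auto rows = File("data.csv").byLineCopy
                                .map!(l => l.split(","))
                                .array;
}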


Thanks. It now works with byLineCopy()

Time (s): 1.128


Re: Speed of csvReader

2016-01-21 Thread Jesse Phillips via Digitalmars-d-learn
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer 
wrote:
R takes about half as long to read the file. Both read the data 
in the "equivalent" type format. Am I doing something incorrect 
here?


csvReader hasn't been benchmarked and optimized against other CSV 
readers. It allocates for each parsed string (even if it is 
unchanged) and performs a number of validation checks.


You may get some improvement by disabling the CSV validation, but 
again, this hasn't been tested for performance.


csvReader!(string,Malformed.ignore)(str)

Generally people recommend GDC/LDC if you need performance from 
the resulting executable, but csvReader being slower isn't the 
most surprising thing.


Before submitting my library to Phobos I had started a CSV reader 
that would do no allocations and instead return string slices. 
It was never completed, so it never had performance testing done 
against it. It could very well be slower.


https://github.com/JesseKPhillips/JPDLibs/blob/csvoptimize/csv/csv.d

My original CSV parser was really slow because I parsed the 
string twice.


Re: Speed of csvReader

2016-01-21 Thread data pulverizer via Digitalmars-d-learn
On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer 
wrote:
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear 
wrote:

On Thu, 21 Jan 2016 18:37:08 +, data pulverizer wrote:

It's interesting that the first output array is not the same 
as the input


byLine reuses a buffer (for speed) and the subsequent split 
operation just returns slices into that buffer.  So when 
byLine progresses to the next line the strings (slices) 
returned previously now point into a buffer with different 
contents.  You should either use byLineCopy or .idup to create 
copies of the relevant strings.  If your use-case allows for 
streaming and doesn't require having all the data present at 
once, you could continue to use byLine and just be careful not 
to refer to previous rows.


Thanks. It now works with byLineCopy()

Time (s): 1.128


Currently the timing is similar to python pandas:

# Script (Python 2.7.6)
import pandas as pd
import time

col_types = {'col1': str, 'col2': str, 'col3': str, 'col4': str,
             'col5': str, 'col6': str, 'col7': str, 'col8': str,
             'col9': str, 'col10': str, 'col11': str, 'col12': str,
             'col13': str, 'col14': str, 'col15': str, 'col16': str,
             'col17': str, 'col18': str, 'col19': str, 'col20': str,
             'col21': str, 'col22': str}

begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep='|', dtype=col_types)
end = time.time()

print end - begin

$ python file_read.py
1.19544792175