Re: Speed of csvReader
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote: Yeah, in the course of this exercise, I found that the one thing that has had the biggest impact on performance is the amount of allocations involved. [...snip] Really interesting discussion.
Re: Speed of csvReader
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote: ... So the moral of the story is: avoid large numbers of small allocations. If you have to do it, consider consolidating your allocations into a series of allocations of large(ish) buffers instead, and taking slices of the buffers. Many thanks for the detailed explanation.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: While this is no fancy range-based code, and one might say it's more hackish and C-like than idiomatic D, the problem is that current D compilers can't quite optimize range-based code to this extent yet. Perhaps in the future optimizers will improve so that more idiomatic, range-based code will have comparable performance with fastcsv. (At least in theory this should be possible.)

As a D novice still struggling with the concept that composable range-based functions can be more efficient than good-old looping (ya, I know, cache friendliness and GC avoidance), I find it extremely interesting that someone as expert as yourself would reach for a C-like approach for serious data crunching. Given that data crunching is the kind of thing I need to do a lot, I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains.
Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote: On Sun, Jan 24, 2016 at 06:07:41AM +, Jesse Phillips via Digitalmars-d-learn wrote: [...] My suggestion is to take the unittests used in std.csv and try to get your code working with them. As fastcsv's limitations would prevent replacing the std.csv implementation, the API may not need to match, but keeping close to the same would be best. My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible.

That is why I suggested starting with the unittests. I don't expect the implementations to share much code; std.csv is written to only use front, popFront, and empty. Most of the work is done in csvNextToken, so it might be able to take advantage of random-access ranges for more performance. I just think the unittests will help to define where switching algorithms will be required, since they exercise a good portion of the API.
Re: Speed of csvReader
On Tue, 26 Jan 2016 18:16:28 +, Gerald Jansen wrote: > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: >> >> While this is no fancy range-based code, and one might say it's more >> hackish and C-like than idiomatic D, the problem is that current D >> compilers can't quite optimize range-based code to this extent yet. >> Perhaps in the future optimizers will improve so that more idiomatic, >> range-based code will have comparable performance with fastcsv. (At >> least in theory this should be possible.) > > As a D novice still struggling with the concept that composable > range-based functions can be more efficient than good-old looping (ya, I > know, cache friendliness and GC avoidance), I find it extremely > interesting that someone as expert as yourself would reach for a C-like > approach for serious data crunching. Given that data crunching is the > kind of thing I need to do a lot, I'm wondering how general your > statement above might be at this time w.r.t. this and possibly other > domains.

You want to reduce allocations. Ranges often let you do that. However, it's sometimes unsafe to reuse range values that aren't immutable. That means, if you want to keep the values around, you need to copy them -- which introduces an allocation.

You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.
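The whole-file-plus-slices approach described above can be sketched in a few lines of D. This is a rough illustration, not code from the thread (the function name and shape are mine): every field ends up as a slice into the single buffer returned by read, so the only sizable allocations are the file buffer and the row arrays.

```d
import std.algorithm : splitter;
import std.array : array;
import std.file : read;
import std.string : lineSplitter;

// Read the whole file in one allocation, then slice it into rows and
// fields. No per-field copies are made: each field aliases `data`.
const(char)[][][] readDelimited(string path, char sep = '|')
{
    auto data = cast(const(char)[]) read(path); // one big allocation
    const(char)[][][] rows;
    foreach (line; data.lineSplitter)
        rows ~= line.splitter(sep).array;       // slices into `data`
    return rows;
}
```

Note the trade-off: the entire file stays pinned in memory for as long as any slice of it is alive.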
Re: Speed of csvReader
On Tuesday, 26 January 2016 at 20:54:34 UTC, Chris Wright wrote: On Tue, 26 Jan 2016 18:16:28 +, Gerald Jansen wrote: On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: While this is no fancy range-based code, and one might say it's more hackish and C-like than idiomatic D, the problem is that current D compilers can't quite optimize range-based code to this extent yet. Perhaps in the future optimizers will improve so that more idiomatic, range-based code will have comparable performance with fastcsv. ... data crunching ... I'm wondering how general your statement above might be at this time w.r.t. this and possibly other domains. You can get fewer large allocations by reading the whole file at once manually and using slices into that large allocation.

Sure, that part is clear. Presumably the quoted comment referred to more than just that technique.
Re: Speed of csvReader
On Tue, Jan 26, 2016 at 08:54:34PM +, Chris Wright via Digitalmars-d-learn wrote: > On Tue, 26 Jan 2016 18:16:28 +, Gerald Jansen wrote: > > > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: > >> > >> While this is no fancy range-based code, and one might say it's > >> more hackish and C-like than idiomatic D, the problem is that > >> current D compilers can't quite optimize range-based code to this > >> extent yet. Perhaps in the future optimizers will improve so that > >> more idiomatic, range-based code will have comparable performance > >> with fastcsv. (At least in theory this should be possible.) > > > > As a D novice still struggling with the concept that composable > > range-based functions can be more efficient than good-old looping > > (ya, I know, cache friendliness and GC avoidance), I find it > > extremely interesting that someone as expert as yourself would reach > > for a C-like approach for serious data crunching. Given that data > > crunching is the kind of thing I need to do a lot, I'm wondering how > > general your statement above might be at this time w.r.t. this and > > possibly other domains. > > You want to reduce allocations. Ranges often let you do that. However, > it's sometimes unsafe to reuse range values that aren't immutable. > That means, if you want to keep the values around, you need to copy > them -- which introduces an allocation. > > You can get fewer large allocations by reading the whole file at once > manually and using slices into that large allocation.

Yeah, in the course of this exercise, I found that the one thing that has had the biggest impact on performance is the amount of allocations involved. Basically, I noted that the fewer allocations made, the more efficient the code. I'm not sure exactly why this is so, but it's probably something to do with the fact that tracing GCs work better with fewer allocations of larger objects than with many allocations of small objects.
I have also noted in the past that D's current GC runs collections a little too often; in past projects I've obtained significant speedup (in one case, up to 40% reduction of total runtime) by suppressing automatic collections and scheduling them manually at a lower frequency. In short, I've found that reducing GC load plays a much bigger role in performance than the range vs. loops issue.

The reason I chose to write manual loops at first is to eliminate all possibility of unexpected overhead that might hide behind range primitives, as well as compiler limitations, as current optimizers aren't exactly tuned for range-based idioms, and may fail to recognize certain range-based idioms that would lead to much more efficient code. However, in my second iteration when I made the fastcsv parser return an input range instead of an array, I found only negligible performance differences. This suggests that perhaps range-based code may not perform that badly after all. I have yet to test this hypothesis, as the inner loop that parses fields in a single row is still a manual loop; but my suspicion is that it wouldn't do too badly in range-based form either.

What might make a big difference, though, is the part where slicing is used, since that is essential for reducing the number of allocations. The current iteration of struct-based parsing code, for instance, went through an initial version that was excruciatingly slow for structs with string fields. Why? Because the function takes const(char)[] as input, and you can't legally get strings out of that unless you make a copy of that data (since const means you cannot modify it, but somebody else still might). So std.conv.to would allocate a new string and copy the contents over, every time a string field was parsed, resulting in a large number of small allocations.
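The manual GC-scheduling trick mentioned above (suppress automatic collections, then collect at a lower frequency) looks roughly like this. A sketch only: the batch size of 100_000 is an invented tuning knob, not a number from the post.

```d
import core.memory : GC;

// Process a range of records with automatic collections suppressed,
// triggering a collection manually every `batch` records instead.
// Returns the number of records processed.
size_t processAll(R)(R records, size_t batch = 100_000)
{
    GC.disable();                // no automatic collection cycles
    scope(exit) GC.enable();     // restore normal behaviour on exit
    size_t n;
    foreach (rec; records)
    {
        // ... per-record work would go here ...
        if (++n % batch == 0)
            GC.collect();        // collect on our own, lower-frequency schedule
    }
    return n;
}
```

Whether this helps depends entirely on the allocation profile of the per-record work, so it is worth measuring before and after.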
To solve this string-allocation problem, I decided to use a string buffer: instead of one allocation per string, pre-allocate a large-ish char[] buffer, and every time a string field was parsed, append the data into the buffer. If the buffer becomes full, allocate a new one. Take a slice of the buffer corresponding to that field and cast it to string (this is safe since the algorithm was constructed never to write over previous parts of the buffer). This seemingly trivial optimization won me a performance improvement of an order of magnitude(!).

This is particularly enlightening, since it suggests that even the overhead of copying all the string fields out of the original data into a new buffer does not add up to that much. The new struct-based parser also returns an input range rather than an array; I found that constructing the array directly vs. copying from an input range didn't really make that big of a difference either. What did make a huge difference is reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations. If you have to do it, consider consolidating your allocations into a series of allocations of large(ish) buffers instead, and taking slices of the buffers.
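The string-buffer scheme described above can be sketched as a small helper. This is a reconstruction of the technique, not the actual fastcsv code; the 64KB size follows the post, while the struct and its names are invented.

```d
// Pre-allocate large char[] buffers; each parsed string field is copied
// in and returned as a slice cast to string. One GC allocation per 64KB
// of string data instead of one per field.
struct StringBuf
{
    enum bufSize = 64 * 1024;   // 64KB, as in the post
    char[] buf;
    size_t used;

    string intern(const(char)[] field)
    {
        if (buf.length - used < field.length)
        {
            // Buffer full: start a fresh one. Slices already handed out
            // keep the old buffer alive, so they remain valid.
            buf = new char[](field.length > bufSize ? field.length : bufSize);
            used = 0;
        }
        auto slice = buf[used .. used + field.length];
        slice[] = field[];
        used += field.length;
        // Safe cast: earlier parts of the buffer are never written again.
        return cast(string) slice;
    }
}
```

The cast to string is the key move: it is only sound because the append-only discipline guarantees no later write aliases an earlier slice.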
Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote: My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible. Wouldn't it be simpler to add a new function? Otherwise you'll end up with very different performance for almost the same data.
Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)
On Sun, Jan 24, 2016 at 06:07:41AM +, Jesse Phillips via Digitalmars-d-learn wrote: [...] > My suggestion is to take the unittests used in std.csv and try to get > your code working with them. As fastcsv's limitations would prevent > replacing the std.csv implementation, the API may not need to match, > but keeping close to the same would be best.

My thought is to integrate the fastcsv code into std.csv, such that the current std.csv code will serve as fallback in the cases where fastcsv's limitations would prevent it from being used, with fastcsv being chosen where possible. It may be possible to lift some of fastcsv's limitations, now that a few performance bottlenecks have been identified (validation and an excessive number of small allocations being the main ones). The code could be generalized a bit more while preserving the optimizations in these key areas.

T -- BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)
On Fri, Jan 22, 2016 at 10:04:58PM +, data pulverizer via Digitalmars-d-learn wrote: [...] > I guess the next step is allowing Tuple rows with mixed types. Alright. I threw together a new CSV parsing function that loads CSV data into an array of structs. Currently, the implementation is not quite polished yet (it blindly assumes the first row is a header row, which it discards), but it does work, and outperforms std.csv by about an order of magnitude.

The initial implementation was very slow (albeit still somewhat faster than std.csv, by about 10% or so) when given a struct with string fields. However, structs with POD fields are lightning fast (not significantly different from before, in spite of all the calls to std.conv.to!). This suggested that the slowdown was caused by excessive allocations of small strings, causing a heavy GC load. This suspicion was confirmed when I ran the same input data with a struct where all string fields were replaced with const(char)[] (so that std.conv.to simply returned slices to the data) -- the performance shot back up to about 1700 msecs, a little slower than the original version of reading into an array of array of const(char)[] slices, but about 58 times(!) the performance of std.csv.

So I tried a simple optimization: instead of allocating a string per field, allocate 64KB string buffers and copy string field values into them, then take slices from the buffer to assign to the struct's string fields. With this optimization, running times came down to about the 1900 msec range, which is only marginally slower than the const(char)[] case, about 51 times faster than std.csv.
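The struct-filling step being described — converting a row of field slices into typed struct members via std.conv.to — can be sketched like this. A hypothetical helper, not the fastcsv implementation:

```d
import std.conv : to;

// Fill a struct T from one row of field slices: each member is converted
// with std.conv.to. For string members this allocates a copy; for
// const(char)[] members, to() can hand back the slice as-is, which is
// why the const(char)[] variant benchmarks faster.
T toRecord(T)(const(char)[][] fields)
{
    T rec;
    foreach (i, ref member; rec.tupleof)
        member = fields[i].to!(typeof(member));
    return rec;
}
```

For example, with a hypothetical `struct Loan { int id; double rate; string bank; }`, a row of three slices converts via `toRecord!Loan(...)`.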
Here are the actual benchmark values:

1) std.csv: 2126883 records, 102136 msecs
2) fastcsv (struct with string fields): 2126883 records, 1978 msecs
3) fastcsv (struct with const(char)[] fields): 2126883 records, 1743 msecs

The latest code is available on github: https://github.com/quickfur/fastcsv

The benchmark driver now has 3 new targets:

stdstruct - std.csv parsing of CSV into structs
faststruct - fastcsv parsing of CSV into struct (string fields)
faststruct2 - fastcsv parsing of CSV into struct (const(char)[] fields)

Note that the structs are hard-coded into the code, so they will only work with the census.gov test file.

Things still left to do:

- Fix header parsing to have a consistent interface with std.csv, or at least allow the user to configure whether or not the first row should be discarded.
- Support transcription to Tuples?
- Refactor the code to have less copy-pasta.
- Ummm... make it ready for integration with std.csv maybe? ;-)

T -- Fact is stranger than fiction.
Re: Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)
On Sunday, 24 January 2016 at 01:57:11 UTC, H. S. Teoh wrote: - Ummm... make it ready for integration with std.csv maybe? ;-) T

My suggestion is to take the unittests used in std.csv and try to get your code working with them. As fastcsv's limitations would prevent replacing the std.csv implementation, the API may not need to match, but keeping close to the same would be best.
Re: Speed of csvReader
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...] > > https://github.com/quickfur/fastcsv [...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4180). Performance is back in the ~1200 msec range. T

Hi H. S. Teoh, I have used your fastcsv on my file:

import std.file;
import fastcsv;
import std.stdio;
import std.datetime;

void main(){
    StopWatch sw;
    sw.start();
    auto input = cast(string) read("Acquisition_2009Q2.txt");
    auto mydata = fastcsv.csvToArray!('|')(input);
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
}

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.
Re: Speed of csvReader
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: [...] > > https://github.com/quickfur/fastcsv [...] Fixed some boundary condition crashes and reverted doubled quote handling in unquoted fields (since those are illegal according to RFC 4180). Performance is back in the ~1200 msec range. T

That's pretty impressive. Maybe turn it into a dub package so that data pulverizer could easily test it on his data :)
Re: Speed of csvReader
On Friday, 22 January 2016 at 21:41:46 UTC, data pulverizer wrote: On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote: [...] Hi H. S. Teoh, I have used your fastcsv on my file: [...code snipped...]

$ dmd file_read_5.d fastcsv.d
$ ./file_read_5
Time (s): 0.679

Fastest so far, very nice.

I guess the next step is allowing Tuple rows with mixed types.
Re: Speed of csvReader
On Fri, Jan 22, 2016 at 10:04:58PM +, data pulverizer via Digitalmars-d-learn wrote: [...] > >$ dmd file_read_5.d fastcsv.d > >$ ./file_read_5 > >Time (s): 0.679 > > > >Fastest so far, very nice. Thanks!

> I guess the next step is allowing Tuple rows with mixed types.

I thought about that a little today. I'm guessing that most of the performance will be dependent on the conversion into the target types. Right now it's extremely fast because, for the most part, it's just taking slices of an existing string. It shouldn't be too hard to extend the current code so that instead of assembling the string slices in a block buffer, it will run them through std.conv.to instead and store them in an array of some given struct. But there may be performance degradation because now we have to do non-trivial operations on the string slices.

Converting from const(char)[] to string probably should be avoided where not necessary, since otherwise it will involve lots and lots of small allocations and the GC will become very slow. Converting to ints may not be too bad... but conversion to types like floating point may be quite slow.

Now, assembling the resulting structs into an array could potentially be slow... but perhaps an analogous block buffer technique can be used to create the array piecemeal in separate blocks, and only perform the final assembly into a single array at the very end (thus avoiding reallocating and copying the growing array as we go along). But we'll see. Performance predictions are rarely accurate; only a profiler will tell the truth about where the real bottlenecks are. :-)

T -- LINUX = Lousy Interface for Nefarious Unix Xenophobes.
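The block-buffer array assembly idea floated at the end can be sketched as follows — a speculative illustration of the technique, not existing fastcsv code: grow the result in fixed-size blocks, then join them with a single exact-size allocation at the end.

```d
// Collect items from a range into fixed-size blocks, then assemble them
// into one array at the end, avoiding the repeated reallocate-and-copy
// cost of a single growing array.
T[] collectBlocks(T, R)(R source, size_t blockSize = 4096)
{
    T[][] blocks;
    T[] cur;
    size_t n;                   // items in the current (last) block
    foreach (item; source)
    {
        if (n == cur.length)    // current block full (or none yet)
        {
            cur = new T[](blockSize);
            blocks ~= cur;
            n = 0;
        }
        cur[n++] = item;
    }
    // Final assembly: one allocation of exactly the right size.
    auto total = blocks.length ? (blocks.length - 1) * blockSize + n : 0;
    auto result = new T[](total);
    size_t pos;
    foreach (i, b; blocks)
    {
        auto len = (i + 1 == blocks.length) ? n : b.length;
        result[pos .. pos + len] = b[0 .. len];
        pos += len;
    }
    return result;
}
```

As the post itself cautions, whether this beats a plain appender is something only a profiler can settle.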
Re: Speed of csvReader
On Friday, 22 January 2016 at 01:36:40 UTC, cym13 wrote: On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote: And now that you mention this, RFC-4180 does not allow doubled quotes in an unquoted field. I'll take that out of the code (it improves performance :-D). Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.

You have to understand CSV didn't come from a standard. People started using it because it was simple for writing out some tabular data. Then they changed it because their data changed. It's not like their language came with a CSV parser; it was always hand written, and people still do it today. And that is why data is delimited with so many things other than commas (people thought they wouldn't need to escape their data).

So yes, some CSV parsers will accept comments, but that just means it breaks for people that have # in their data. Yeah, you can assume that two double quotes in unquoted data is just a quote, but then it breaks for those who have that kind of data which isn't escaped. There are also many other issues with CSV data, like whether the file is in ASCII or UTF or some other code page. And many times CSV isn't well formed because the data was output without proper escaping. std.csv isn't the end-all of CSV parsers, but it will at least handle well-formed CSV that uses different separators or quotes.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 10:40:39 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole wrote: Okay without registering not gonna get that data. So usual things to think about, did you turn on release mode? What about inlining? Lastly how about disabling the GC? import core.memory : GC; GC.disable(); dmd -release -inline code.d

That helped a lot. I disabled the GC and inlined as you suggested and the time is now: Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22)))
 user system elapsed
 0.852 0.021 0.872

I should probably have begun with this timing. It's not my intention to turn this into a speed-only competition; however, the ingest of files and speed of calculation is very important to me. I should probably add compiler version info:

~$ dmd --version
DMD64 D Compiler v2.069.2
Copyright (c) 1999-2015 by Digital Mars written by Walter Bright

Running Ubuntu 14.04 LTS
Re: Speed of csvReader
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: StopWatch sw; sw.start(); auto buffer = std.file.readText("Acquisition_2009Q2.txt"); auto records = csvReader!row_type(buffer, '|').array; sw.stop(); Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes?
Re: Speed of csvReader
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote: On 01/21/2016 02:40 AM, data pulverizer wrote: dmd -release -inline code.d These two as well please: -O -boundscheck=off the ingest of files and speed of calculation is very important to me. We should understand why D is slow in this case. :) Ali Thank you, adding those two flags brings down the time a little more ... Time (s): 6.832
Re: Speed of csvReader
On 01/21/2016 02:40 AM, data pulverizer wrote: dmd -release -inline code.d These two as well please: -O -boundscheck=off the ingest of files and speed of calculation is very important to me. We should understand why D is slow in this case. :) Ali
Speed of csvReader
I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function which is not known to be particularly fast. Here I am reading Fannie Mae mortgage acquisition data which can be found here http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html after registering:

D Code:

import std.algorithm;
import std.array;
import std.file;
import std.csv;
import std.stdio;
import std.typecons;
import std.datetime;

alias row_type = Tuple!(string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string);

void main(){
    StopWatch sw;
    sw.start();
    auto buffer = std.file.readText("Acquisition_2009Q2.txt");
    auto records = csvReader!row_type(buffer, '|').array;
    sw.stop();
    double time = sw.peek().msecs;
    writeln("Time (s): ", time/1000);
}

Time (s): 13.478

R Code:

system.time(x <- read.table("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22)))
 user system elapsed
 7.810 0.067 7.874

R takes about half as long to read the file. Both read the data in the "equivalent" type format. Am I doing something incorrect here?
Re: Speed of csvReader
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole wrote: Okay without registering not gonna get that data. So usual things to think about, did you turn on release mode? What about inlining? Lastly how about disabling the GC? import core.memory : GC; GC.disable(); dmd -release -inline code.d

That helped a lot. I disabled the GC and inlined as you suggested and the time is now: Time (s): 8.754

However, R's data.table package gives us:

system.time(x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22)))
 user system elapsed
 0.852 0.021 0.872

I should probably have begun with this timing. It's not my intention to turn this into a speed-only competition; however, the ingest of files and speed of calculation is very important to me.
Re: Speed of csvReader
On 21/01/16 10:39 PM, data pulverizer wrote: I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function [...] Am I doing something incorrect here?

Okay without registering not gonna get that data. So usual things to think about, did you turn on release mode? What about inlining? Lastly how about disabling the GC?

import core.memory : GC;
GC.disable();

dmd -release -inline code.d
Re: Speed of csvReader
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen wrote: On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: StopWatch sw; sw.start(); auto buffer = std.file.readText("Acquisition_2009Q2.txt"); auto records = csvReader!row_type(buffer, '|').array; sw.stop(); Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes? Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well. Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh
Re: Speed of csvReader
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen wrote: On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: StopWatch sw; sw.start(); auto buffer = std.file.readText("Acquisition_2009Q2.txt"); auto records = csvReader!row_type(buffer, '|').array; sw.stop(); Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes? Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well. Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh @Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time @Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s
Re: Speed of csvReader
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen wrote: On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: StopWatch sw; sw.start(); auto buffer = std.file.readText("Acquisition_2009Q2.txt"); auto records = csvReader!row_type(buffer, '|').array; sw.stop(); Is it csvReader or readText that is slow? i.e. could you move sw.start(); one line down (after the readText command) and see how long just the csvReader part takes? Please try this: auto records = File("Acquisition_2009Q2.txt").byLine.joiner("\n").csvReader!row_type('|').array; Can you put up some sample data and share the number of records in the file as well.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh @Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time @Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s p.s. @Saurabh the result looks fine from the cast. Thanks
Re: Speed of csvReader
On Thursday, 21 January 2016 at 16:25:55 UTC, bachmeier wrote: On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer wrote: Running Ubuntu 14.04 LTS In that case, have you looked at http://lancebachmeier.com/rdlang/ If this is a serious bottleneck you can solve it with two lines evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`); auto x = RMatrix(evalR("x")); and then you've got access to the data in D. Thanks. That's certainly something to try.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote: Interesting that reading a file is so slow. Your timings from R, is that including reading the file also? Yes, it's just insane, isn't it?

It is insane. Earlier in the thread we were tackling the wrong problem clearly. Hence the adage, "measure first" :-/.

As suggested by Edwin van Leeuwen, can you give us a timing of:

auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array;

Thanks, Saurabh
Re: Speed of csvReader
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote: [...] Actually since you're aiming for speed, this might be better: sw.start(); auto records = File("Acquisition_2009Q2.txt").byChunk(1024*1024).joiner.map!(a => cast(dchar)a).csvReader!row_type('|').array sw.stop(); Please do verify that the end result is the same - I'm not 100% confident of the cast. Thanks, Saurabh @Saurabh I have tried your latest suggestion and the time reduces fractionally to: Time (s): 6.345 the previous suggestion actually increased the time @Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s Interesting that reading a file is so slow. Your timings from R, is that including reading the file also?
Re: Speed of csvReader
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote: Interesting that reading a file is so slow. Your timings from R, is that including reading the file also? Yes, it's just insane, isn't it?
Re: Speed of csvReader
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote: We should understand why D is slow in this case. :) Ali fread source is here: https://github.com/Rdatatable/data.table/blob/master/src/fread.c Good luck trying to work through that (which explains why I'm using D). I don't know what their magic is, but data.table is many times faster than anything else in R, so I don't think it's trivial.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer wrote: Running Ubuntu 14.04 LTS In that case, have you looked at http://lancebachmeier.com/rdlang/ If this is a serious bottleneck you can solve it with two lines evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|", colClasses = rep("character", 22))`); auto x = RMatrix(evalR("x")); and then you've got access to the data in D.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote: @Edwin van Leeuwen The csvReader is what takes the most time, the readText takes 0.229 s The underlying problem most likely is that csvReader has (AFAIK) never been properly optimized/profiled (it's a very old piece of the library). You could try to implement a rough csvReader using buffer.byLine() and, for each line, use split("|") to split at the delimiter. That should be faster, because it doesn't do any checking. Untested code: string[][] res = buffer.byLine().map!((a) => a.split("|").array).array;
Re: Speed of csvReader
On Thursday, 21 January 2016 at 17:17:52 UTC, Saurabh Das wrote: On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote: Interesting that reading a file is so slow. Your timings from R, is that including reading the file also? Yes, it's just insane, isn't it? It is insane. Earlier in the thread we were clearly tackling the wrong problem. Hence the adage, "measure first" :-/. As suggested by Edwin van Leeuwen, can you give us a timing of: auto records = File("Acquisition_2009Q2.txt", "r").byLine.map!(a => a.split("|").array).array; Thanks, Saurabh Good news and bad news. I was going for something similar to what you have above and both slash the time a lot: Time (s): 1.024 But now the output is a little garbled. For some reason the splitter isn't splitting correctly - or we are not applying it properly. Line 0: ["11703051", "RETAIL", "BANK OF AMERICA, N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", "", "FRM"]
Re: Speed of csvReader
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote: > It's interesting that the output first array is not the same as the > input byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use-case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows.
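A minimal runnable sketch of the distinction described above (file name and contents invented for illustration):

```d
import std.algorithm : map;
import std.array : array, split;
import std.file : remove, write;
import std.stdio : File;

void main()
{
    write("demo.psv", "1|one\n2|two\n");
    scope(exit) remove("demo.psv");

    // byLine would hand out slices of one reused buffer, so earlier
    // rows end up pointing at later lines' contents. byLineCopy
    // allocates a fresh string per line, so the slices produced by
    // split() remain valid after iteration finishes.
    auto rows = File("demo.psv").byLineCopy
        .map!(line => line.split("|"))
        .array;

    assert(rows == [["1", "one"], ["2", "two"]]);
}
```

With byLine in place of byLineCopy, the rows would all alias the same internal buffer and the earlier rows' contents would be clobbered, which is exactly the garbling seen above.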
Re: Speed of csvReader
On Thursday, 21 January 2016 at 18:31:17 UTC, data pulverizer wrote: Good news and bad news. I was going for something similar to what you have above and both slash the time a lot: Time (s): 1.024 But now the output is a little garbled. For some reason the splitter isn't splitting correctly - or we are not applying it properly. Line 0: ["11703051", "RETAIL", "BANK OF AMERICA, N.A.|4.875|207000|3", "0", "03/200", "|05", "2009|75", "75|1|26", "80", "|N", "|", "O ", "ASH", "OU", " REFINANCE|PUD|1|INVE", "TOR", "C", "|801||FRM", "\n\n", "863", "", "FRM"] I should probably include the first few lines of the file: 10511550|RETAIL|FLAGSTAR CAPITAL MARKETS CORPORATION|5|222000|360|04/2009|06/2009|44|44|2|37|823|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|AZ|863||FRM 11031040|BROKER|SUNTRUST MORTGAGE INC.|4.99|456000|360|03/2009|05/2009|83|83|1|47|744|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|MD|211|12|FRM 11445182|CORRESPONDENT|CITIMORTGAGE, INC.|4.875|172000|360|05/2009|07/2009|80|80|2|25|797|NO|CASH-OUT REFINANCE|SF|1|PRINCIPAL|TX|758||FRM 11703051|RETAIL|BANK OF AMERICA, N.A.|4.875|207000|360|03/2009|05/2009|75|75|1|26|806|NO|NO CASH-OUT REFINANCE|PUD|1|INVESTOR|CO|801||FRM 16033316|CORRESPONDENT|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|5|17|360|05/2009|07/2009|80|80|1|23|771|NO|CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|224||FRM It's interesting that the output first array is not the same as the input.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Re: Speed of csvReader
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: [...] It may be fast but I think it may be related to the fact that this is not a CSV parser. Don't get me wrong, it is able to parse a format defined by delimiters but true CSV is one hell of a beast. Of course most data look like: number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,142,gentle 3,Pinkie Pie,169,oh my gosh but you can have delimiters inside a field: number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,"14,2",gentle 3,Pinkie Pie,169,oh my gosh or quotes in a quoted field, in that case you have to double the quotes: number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,142,gentle 3,Pinkie Pie,169,"He said ""oh my gosh""" but in that case external quotes aren't required: number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,142,gentle 3,Pinkie Pie,169,He said ""oh my gosh"" but at least it's always one record per line, no? No? No. number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,142,gentle 3,Pinkie Pie,169,"He said ""oh my gosh"" And she replied ""Come on! Have fun!""" I'll stop there, but you get the picture. Simply splitting by line then separator may work well on most data, but I wouldn't put it in production or in the standard library. Note that I think you did a great job optimizing your code, and I respect that, it's just a friendly reminder.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via Digitalmars-d-learn wrote: On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: >On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via >This piqued my interest today, so I decided to take a shot at >writing a fast CSV parser. First, I downloaded a sample >large CSV file from: [...] Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get crt1.o errors, for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing? Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T Thanks, I got used to getting away with running the "script" file in the same folder as a single-file module - it usually works but occasionally (like now) I have to compile both together as you suggested.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 22:13:38 UTC, Brad Anderson wrote: On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: [...] What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be escaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though). Oh, you discussed range-based later. I should have finished reading before replying.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 10:09:24PM +, Jon D via Digitalmars-d-learn wrote: [...] FWIW - I've been implementing a few programs manipulating delimited files, e.g. tab-delimited. Simpler than CSV files because there is no escaping inside the data. I've been trying to do this in relatively straightforward ways, e.g. using byLine rather than byChunk. (Goal is to explore the power of D standard libraries). I've gotten significant speed-ups in a couple different ways: * DMD libraries 2.068+ - byLine is dramatically faster * LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way. No disagreement, but I had other goals. At a high level, I'm trying to learn and evaluate D, which partly involves understanding the strengths and weaknesses of the standard library. From this perspective, byLine was a logical starting point. More specifically, the tools I'm writing are often used in unix pipelines, so input can be a mixture of standard input and files. And, the files can be arbitrarily large. In these cases, reading the entire file is not always appropriate. Buffering usually is, and my code knows when it is dealing with files vs standard input and could handle these differently. However, standard library code could handle these distinctions as well, which was part of the reason for trying the straightforward approach. Aside - Despite the 'learning D' motivation, the tools are real tools, and writing them in D has been a clear win, especially with the byLine performance improvements in 2.068.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function which is not known to be particularly fast. FWIW - I've been implementing a few programs manipulating delimited files, e.g. tab-delimited. Simpler than CSV files because there is no escaping inside the data. I've been trying to do this in relatively straightforward ways, e.g. using byLine rather than byChunk. (Goal is to explore the power of D standard libraries). I've gotten significant speed-ups in a couple different ways: * DMD libraries 2.068+ - byLine is dramatically faster * LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler * Avoid utf-8 to dchar conversion - This conversion often occurs silently when working with ranges, but is generally not needed when manipulating data. * Avoid unnecessary string copies. e.g. Don't gratuitously convert char[] to string. At this point performance of the utilities I've been writing is quite good. They don't have direct equivalents with other tools (such as gnu core utils), so a head-to-head is not appropriate, but generally it seems the tools are quite competitive without needing to do my own buffer or memory management. And, they are dramatically faster than the same tools written in perl (which I was happy with). --Jon
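The auto-decoding cost mentioned in that list can be seen directly; here is a small sketch (byCodeUnit from std.utf is one way to opt out):

```d
import std.range.primitives : front, walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";
    // Range primitives over a string silently decode UTF-8 into dchar:
    // 6 code points, with decoding work on every step.
    static assert(is(typeof(s.front) == dchar));
    assert(s.walkLength == 6);
    // byCodeUnit iterates the raw code units instead: 8 bytes, no
    // decoding, which is all that's needed for delimiter scanning.
    assert(s.byCodeUnit.walkLength == 8);
}
```

For data that only ever gets compared against ASCII delimiters like '|' or '\t', skipping the decode step removes per-character overhead without changing the result.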
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 11:03:23PM +, cym13 via Digitalmars-d-learn wrote: > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: > >[...] > > It may be fast but I think it may be related to the fact that this is > not a CSV parser. Don't get me wrong, it is able to parse a format > defined by delimiters but true CSV is one hell of a beast. [...] As I stated, I didn't fully implement the parsing of quoted fields. (Or, for that matter, the correct parsing of crazy wrapped values like you pointed out.) This is not finished code; it's more of a proof of concept. T -- Lottery: tax on the stupid. -- Slashdotter
Re: Speed of csvReader
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: [snip] There are some limitations to this approach: while the current code does try to unwrap quoted values in the CSV, it does not correctly parse escaped double quotes ("") in the fields. This is because to process those values correctly we'd have to copy the field data into a new string and construct its interpreted value, which is slow. So I leave it as an exercise for the reader to implement (it's not hard: when the doubled double-quote sequence is detected, allocate a new string with the interpreted data instead of slicing the original data; either that, or just unescape the quotes in the application code itself). What about wrapping the slices in a range-like interface that would unescape the quotes on demand? You could even set a flag on it during the initial pass to say the field has double quotes that need to be unescaped so it doesn't need to take a per-pop performance hit checking for double quotes (that's probably a pretty minor boost, if any, though).
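One possible shape for the "unescape at the application level" option mentioned above; the helper name is hypothetical and not part of fastcsv or std.csv:

```d
import std.array : replace;

// Hypothetical helper: collapse RFC 4180 doubled quotes ("") in a
// field that was sliced verbatim out of the input buffer. It only
// allocates when called, so untouched fields keep their zero-copy
// slices.
string unescapeQuotes(const(char)[] field)
{
    return field.replace(`""`, `"`).idup;
}

void main()
{
    assert(unescapeQuotes(`He said ""oh my gosh""`) == `He said "oh my gosh"`);
    assert(unescapeQuotes("no quotes here") == "no quotes here");
}
```

A caller could map this over only the fields flagged as containing doubled quotes, which is essentially the on-demand scheme the post sketches.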
Re: Speed of csvReader
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via Digitalmars-d-learn wrote: This piqued my interest today, so I decided to take a shot at writing a fast CSV parser. First, I downloaded a sample large CSV file from: [...] Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine but I get crt1.o errors, for example: .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 are there flags that I should be compiling with or some other thing that I am missing?
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via Digitalmars-d-learn wrote: > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: > >On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via This piqued > >my interest today, so I decided to take a shot at writing a fast CSV > >parser. First, I downloaded a sample large CSV file from: [...] > > Hi H. S. Teoh, I tried to compile your code (fastcsv.d) on my machine > but I get crt1.o errors, for example: > > .../crt1.o(.debug_info): relocation 0 has invalid symbol index 0 > > are there flags that I should be compiling with or some other thing > that I am missing? Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T -- Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via Digitalmars-d-learn wrote: > On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: > >R takes about half as long to read the file. Both read the data in > >the "equivalent" type format. Am I doing something incorrect here? > > CsvReader hasn't been compared and optimized against other CSV readers. > It does have allocation for the parsed string (even if it isn't > changed) and it does a number of validation checks. [...] This piqued my interest today, so I decided to take a shot at writing a fast CSV parser. First, I downloaded a sample large CSV file from: ftp://ftp.census.gov/econ2013/CBP_CSV/cbp13co.zip This file has over 2 million records, so I thought it would serve as a good dataset to run benchmarks on. Since the OP wanted the loaded data in an array of records, as opposed to iterating over the records as an input range, I decided that the best way to optimize this use case was to load the entire file into memory and then return an array of slices into this data, instead of wasting time (and memory) copying the data. Furthermore, since it will be an array of records which are arrays of slices to field values, another optimization is to allocate a large buffer for storing consecutive field slices, and then in the outer array just slice the buffer to represent a record. This greatly cuts down on the number of GC allocations needed. Once the buffer is full, we don't allocate a larger buffer and copy everything over; this is unnecessary (and wasteful) because the outer array doesn't care where its elements point to. Instead, we allocate a new buffer, leaving previous records pointing to slices of the old buffer, and start appending more field slices in the new buffer, and so on. After all, the records don't have to exist in consecutive slices.
There's just a minor overhead in that if we run out of space in the buffer while in the middle of parsing a record, we need to copy the current record's field slices into the new buffer, so that all the fields belonging to this record remain contiguous (so that the outer array can just slice them). This is a very small overhead compared to copying the entire buffer into a new memory block (as would happen if we kept the buffer as a single array that needs to expand), so it ought to be negligible. So in a nutshell, what we have is an outer array, each element of which is a slice (representing a record) that points to some slice of one of the buffers. Each buffer is a contiguous sequence of slices (representing a field) pointing to some segment of the original data. Here's the code:

---
/**
 * Experimental fast CSV reader.
 *
 * Based on RFC 4180.
 */
module fastcsv;

/**
 * Reads CSV data from the given filename.
 */
auto csvFromUtf8File(string filename)
{
    import std.file : read;
    return csvFromString(cast(string) read(filename));
}

/**
 * Parses CSV data in a string.
 *
 * Params:
 *  fieldDelim = The field delimiter (default: ',')
 *  data = The data in CSV format.
 */
auto csvFromString(dchar fieldDelim=',', dchar quote='"')(const(char)[] data)
{
    import core.memory;
    import std.array : appender;

    enum fieldBlockSize = 1 << 16;
    auto fields = new const(char)[][fieldBlockSize];
    size_t curField = 0;

    GC.disable();
    auto app = appender!(const(char)[][][]);

    // Scan data
    size_t i;
    while (i < data.length)
    {
        // Parse records
        size_t firstField = curField;
        while (i < data.length && data[i] != '\n' && data[i] != '\r')
        {
            // Parse fields
            size_t firstChar, lastChar;
            if (data[i] == quote)
            {
                i++;
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = (i < data.length && data[i-1] == quote) ? i-1 : i;
            }
            else
            {
                firstChar = i;
                while (i < data.length && data[i] != fieldDelim &&
                       data[i] != '\n' && data[i] != '\r')
                {
                    i++;
                }
                lastChar = i;
            }
            if (curField >= fields.length)
Re: Speed of csvReader
On Thursday, 21 January 2016 at 20:46:15 UTC, Gerald Jansen wrote: On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function This great blog post has an optimized FastReader for CSV files: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html Thanks a lot Gerald, the blog and the discussions were very useful and revealing - for me it shows that you can use the D language to write fast code and then, if you need to wring out more performance, you can go as low-level as you want, all without leaving the D language or its tooling ecosystem.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: Of course, running without GC collection is not a fair comparison with std.csv, so I added an option to my benchmark program to disable the GC for std.csv as well. While the result was slightly faster, it was still much slower than my fastcsv code. (Though to be fair, std.csv does perform validation checks and so forth that fastcsv doesn't even try to.) As mentioned, validation can be turned off:

auto data = std.csv.csvReader!(string, Malformed.ignore)(input).array;

I forgot to mention that one of the requirements for std.csv was that it worked on the base range type, input range. Not that slicing wouldn't be a valid addition. I was also going to do the same thing with my sliced CSV, no fixing of the escaped quote. That would have just been a helper function the user could map over the results.
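For reference, a small runnable sketch of that validation-off call shape (the input here is made up, and materializing each record via map is just one way to collect the rows):

```d
import std.algorithm : map;
import std.array : array;
import std.csv : csvReader, Malformed;

void main()
{
    auto input = "1|Twilight|150\n2|Fluttershy|142";
    // Malformed.ignore switches off std.csv's validation checks;
    // '|' overrides the default ',' delimiter.
    auto rows = csvReader!(string, Malformed.ignore)(input, '|')
        .map!(r => r.array)
        .array;
    assert(rows == [["1", "Twilight", "150"], ["2", "Fluttershy", "142"]]);
}
```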
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote: [...] > FWIW - I've been implementing a few programs manipulating delimited > files, e.g. tab-delimited. Simpler than CSV files because there is no > escaping inside the data. I've been trying to do this in relatively > straightforward ways, e.g. using byLine rather than byChunk. (Goal is > to explore the power of D standard libraries). > > I've gotten significant speed-ups in a couple different ways: > * DMD libraries 2.068+ - byLine is dramatically faster > * LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler While byLine has improved a lot, it's still not the fastest thing in the world, because it still performs (at least) one OS roundtrip per line, not to mention it will auto-reencode to UTF-8. If your data is already in a known encoding, reading in the entire file and casting to (|w|d)string then splitting it by line will be a lot faster, since you can eliminate a lot of I/O roundtrips that way. In any case, it's well-known that gdc/ldc generally produce code that's about 20%-30% faster than dmd-compiled code, sometimes a lot more. While DMD has gotten some improvements in this area recently, it still has a long way to go before it can catch up. For performance-sensitive code I always reach for gdc instead of dmd. > * Avoid utf-8 to dchar conversion - This conversion often occurs > silently when working with ranges, but is generally not needed when > manipulating data. [...] Yet another nail in the coffin of auto-decoding. I wonder how many more nails we will need before Andrei is convinced... T -- The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
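The whole-file approach described above, as a minimal sketch (file name invented; the cast assumes the bytes are already valid UTF-8):

```d
import std.algorithm : splitter;
import std.array : array;
import std.file : read, remove, write;

void main()
{
    write("demo.txt", "a|b\nc|d\n");
    scope(exit) remove("demo.txt");

    // One read() call pulls in the whole file, replacing per-line
    // I/O round trips; splitting afterwards only produces slices
    // into the single buffer, no copies.
    auto text = cast(string) read("demo.txt");
    auto lines = text.splitter('\n').array;
    assert(lines == ["a|b", "c|d", ""]);
}
```

Note the trailing empty slice after the final newline; a real loader would skip or trim it.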
Re: Speed of csvReader
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote: are there flags that I should be compiling with or some other thing that I am missing? Did you supply a main() function? If not, it won't run, because fastcsv.d is only a module. If you want to run the benchmark, you'll have to compile both benchmark.d and fastcsv.d together. T Great benchmarks! This is something else for me to learn from.
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote: > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: > >[...] > > It may be fast but I think it may be related to the fact that this is > not a CSV parser. Don't get me wrong, it is able to parse a format > defined by delimiters but true CSV is one hell of a beast. Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...] > but you can have delimiters inside a field: > > number,name,price,comment > 1,Twilight,150,good friend > 2,Fluttershy,"14,2",gentle > 3,Pinkie Pie,169,oh my gosh Fixed. > or quotes in a quoted field, in that case you have to double the quotes: > > number,name,price,comment > 1,Twilight,150,good friend > 2,Fluttershy,142,gentle > 3,Pinkie Pie,169,"He said ""oh my gosh""" Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level. > but in that case external quotes aren't required: > > number,name,price,comment > 1,Twilight,150,good friend > 2,Fluttershy,142,gentle > 3,Pinkie Pie,169,He said ""oh my gosh"" Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.) > but at least it's always one record per line, no? No? No. > > number,name,price,comment > 1,Twilight,150,good friend > 2,Fluttershy,142,gentle > 3,Pinkie Pie,169,"He said > ""oh my gosh"" > And she replied > ""Come on! Have fun!""" Fixed. > I'll stop there, but you get the picture. Simply splitting by line > then separator may work well on most data, but I wouldn't put it in > production or in the standard library. Actually, my code does *not* split by line then by separator. Did you read it? ;-) T -- The most powerful one-line C program: #include "/dev/tty" -- IOCCC
Re: Speed of csvReader
On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote: And now that you mention this, RFC-4180 does not allow doubled quotes in an unquoted field. I'll take that out of the code (it improves performance :-D). Right, re-reading the RFC would have been a great thing. That said I saw that kind of CSV in the real world, so I don't know what to think of it. I'm not saying it should be supported, but I wonder if there are points outside RFC-4180 that are taken for granted.
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 04:31:03PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: > On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn > wrote: [...] > > https://github.com/quickfur/fastcsv > > Oh, forgot to mention, the parsing times are still lightning fast > after the fixes I mentioned: still around 1190 msecs or so. > > Now I'm tempted to actually implement doubled-quote interpretation... > as long as the input file doesn't contain unreasonable amounts of > doubled quotes, I'm expecting the speed should remain pretty fast. [...] Done, commits pushed to github. The new code now parses doubled quotes correctly. The performance is slightly worse now, around 1300 msecs on average, even in files that don't have any doubled quotes (it's a penalty incurred by the inner loop needing to detect doubled quote sequences). My benchmark input file doesn't have any doubled quotes, however (code correctness with doubled quotes is gauged by unittests only); so the performance numbers may not accurately reflect true performance in the general case. (But if doubled quotes are rare, as I'm expecting, the actual performance shouldn't change too much in general usage...) Maybe somebody who has a file with lots of ""'s can run the benchmark to see how badly it performs? :-P T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Re: Speed of csvReader
On Fri, Jan 22, 2016 at 01:13:07AM +0000, Jesse Phillips via Digitalmars-d-learn wrote: > On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote: > >but in that case external quotes aren't required: > > > >number,name,price,comment > >1,Twilight,150,good friend > >2,Fluttershy,142,gentle > >3,Pinkie Pie,169,He said ""oh my gosh"" > > std.csv will reject this. If validation is turned off this is fine but > your data will include "". > > "A field containing new lines, commas, or double quotes should be > enclosed in double quotes (customizable)" > > This is because it is not possible to decide what the correct parsing should > be. Is the data intending to include two double quotes? What if there was > only one quote there; do I have to remember it was there and decide > not to throw it out because I didn't see another quote? At this point > the data is not following CSV rules, so if I'm validating I'm throwing > it out, and if I'm not validating I'm not stripping data. This case is still manageable, because there are no embedded commas. Everything between the last comma and the next comma or newline unambiguously belongs to the current field. As to how to interpret it (should the result contain single or doubled quotes?), though, that could potentially be problematic. And now that you mention this, RFC-4180 does not allow doubled quotes in an unquoted field. I'll take that out of the code (it improves performance :-D). T -- First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: > On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote: > > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote: > > >[...] > > > > It may be fast but I think it may be related to the fact that this is > > not a CSV parser. Don't get me wrong, it is able to parse a format > > defined by delimiters but true CSV is one hell of a beast. > > Alright, I decided to take on the challenge to write a "real" CSV > parser... since it's a bit tedious to keep posting code in the forum, > I've pushed it to github instead: > > https://github.com/quickfur/fastcsv Oh, forgot to mention, the parsing times are still lightning fast after the fixes I mentioned: still around 1190 msecs or so. Now I'm tempted to actually implement doubled-quote interpretation... as long as the input file doesn't contain unreasonable amounts of doubled quotes, I'm expecting the speed should remain pretty fast. --T
Re: Speed of csvReader
On Friday, 22 January 2016 at 00:26:16 UTC, H. S. Teoh wrote: On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote: [...] Alright, I decided to take on the challenge to write a "real" CSV parser... since it's a bit tedious to keep posting code in the forum, I've pushed it to github instead: https://github.com/quickfur/fastcsv [...] [...] Fixed. [...] Fixed. Well, except the fact that I don't actually interpret the doubled quotes, but leave it up to the caller to filter them out at the application level. [...] Actually, this has already worked before. (Excepting the untranslated doubled quotes, of course.) [...] Fixed. [...] Actually, my code does *not* split by line then by separator. Did you read it? ;-) T Great! Sorry for the separator thing, I didn't read your code carefully. You still lack some things like comments and surely more things that I don't know about, but it's getting there. I didn't think you'd go through the trouble of fixing those things, to be honest; I'm impressed.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote: but in that case external quotes aren't required: number,name,price,comment 1,Twilight,150,good friend 2,Fluttershy,142,gentle 3,Pinkie Pie,169,He said ""oh my gosh"" std.csv will reject this. If validation is turned off this is fine but your data will include "". "A field containing new lines, commas, or double quotes should be enclosed in double quotes (customizable)" This is because it is not possible to decide what the correct parsing should be. Is the data intending to include two double quotes? What if there was only one quote there; do I have to remember it was there and decide not to throw it out because I didn't see another quote? At this point the data is not following CSV rules, so if I'm validating I'm throwing it out, and if I'm not validating I'm not stripping data.
Re: Speed of csvReader
On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote: Great! Sorry for the separator thing, I didn't read your code carefully. You still lack some things like comments and surely more things that I don't know about, but it's getting there. I didn't think you'd go through the trouble of fixing those things, to be honest; I'm impressed. CSV doesn't have comments, sorry.
Re: Speed of csvReader
On Fri, Jan 22, 2016 at 12:56:02AM +0000, cym13 via Digitalmars-d-learn wrote: [...] > Great! Sorry for the separator thing, I didn't read your code > carefully. You still lack some things like comments and surely more > things that I don't know about, but it's getting there. Comments? You mean in the code? 'cos the CSV grammar described in RFC-4180 doesn't seem to have the possibility of comments in the CSV itself... > I didn't think you'd go through the trouble of fixing those things, to > be honest; I'm impressed. They weren't that hard to fix, because the original code already had a separate path for quoted values, so it was just a matter of deleting some of the loop conditions to make the quoted path accept delimiters and newlines. In fact, the original code already accepted doubled quotes in the unquoted field path. Only the implementation of doubled-quote interpretation required modifications to both inner loops. Now having said that, though, I think there are some bugs in the code that might cause an array overrun... and the fix might slow things down yet a bit more. There are also some fundamental limitations:

1) The CSV data has to be loadable into memory in its entirety. This may not be possible for very large files, or on machines with low memory.

2) There is no range-based interface. I *think* this should be possible to add, but it will probably increase the overhead and make the code slower.

3) There is no validation of the input whatsoever. If you feed it malformed CSV, it will give you nonsensical output. Well, it may crash, but hopefully won't anymore after I fix those missing bounds checks... but it will still give you nonsensical output.

4) The accepted syntax is actually a little larger than strict CSV (in the sense of RFC-4180); Unicode input is accepted but RFC-4180 does not allow Unicode.
This may actually be a plus, though, because I'm expecting that modern CSV may actually contain Unicode data, not just the ASCII range defined in RFC-4180. T -- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
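The allocation-avoiding approach described above (slicing the loaded buffer instead of copying fields) can be sketched roughly as follows. This is a hypothetical, simplified helper, not the actual fastcsv code: it ignores quoted fields entirely, which the real implementation handles in a separate path.

```d
// Simplified sketch of allocation-light field splitting: each field is
// a slice into the original buffer, so no per-field copy is made.
// Quoting and doubled-quote interpretation are deliberately omitted.
string[] splitFields(string line, char delim = ',')
{
    string[] fields;
    size_t start = 0;
    foreach (i, c; line)
    {
        if (c == delim)
        {
            fields ~= line[start .. i]; // slice, not a copy
            start = i + 1;
        }
    }
    fields ~= line[start .. $];
    return fields;
}

unittest
{
    assert(splitFields("a|b|c", '|') == ["a", "b", "c"]);
    assert(splitFields("x") == ["x"]);
}
```

The fields remain valid only as long as the underlying buffer does, which is exactly why the whole file must stay in memory (limitation 1 above).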
Re: Speed of csvReader
On Friday, 22 January 2016 at 01:14:48 UTC, Jesse Phillips wrote: On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote: Great! Sorry for the separator thing, I didn't read your code carefully. You still lack some things like comments and surely more things that I don't know about, but it's getting there. I didn't think you'd go through the trouble of fixing those things, to be honest; I'm impressed. CSV doesn't have comments, sorry. I've met libraries that accepted lines beginning with # as comments (outside of "", of course) and wrongly assumed it was a standard thing. I stand corrected.
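Even without parser support, such non-standard # comment lines are easy to strip before parsing. A small illustrative sketch (the input strings are made up, and this naive filter would also drop a line whose quoted field happens to start a line with #):

```d
import std.algorithm : filter, startsWith;
import std.array : array;

void main()
{
    // Hypothetical input containing a non-standard '#' comment line.
    string[] lines = ["# generated 2016-01-22", "a,b,c", "1,2,3"];

    // Drop comment lines before handing the data to a CSV parser.
    auto data = lines.filter!(l => !l.startsWith("#")).array;
    assert(data == ["a,b,c", "1,2,3"]);
}
```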
Re: Speed of csvReader
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn wrote: > [...] > > > https://github.com/quickfur/fastcsv [...] Fixed some boundary-condition crashes and reverted doubled-quote handling in unquoted fields (since those are illegal according to RFC 4180). Performance is back in the ~1200 msec range. T -- There is no gravity. The earth sucks.
Re: Speed of csvReader
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote: On Thu, 21 Jan 2016 18:37:08 +, data pulverizer wrote: It's interesting that the first output array is not the same as the input. byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line, the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows. Thanks. It now works with byLineCopy(). Time (s): 1.128
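The byLine pitfall described above can be demonstrated in a few lines. This sketch writes a throwaway sample file (the filename is made up for the demo) and uses byLineCopy so the accumulated rows stay valid:

```d
import std.array : split;
import std.file : remove, write;
import std.stdio : File;

void main()
{
    // Hypothetical sample file, created just for this demo.
    write("demo.csv", "a,b\nc,d\n");
    scope (exit) remove("demo.csv");

    auto f = File("demo.csv");

    // byLine reuses a single internal buffer, so slices taken from a
    // previous line are invalidated when the next line is read.
    // byLineCopy allocates a fresh string per line, so slices of those
    // lines remain valid after the loop advances.
    string[][] rows;
    foreach (line; f.byLineCopy)
        rows ~= line.split(",");

    assert(rows == [["a", "b"], ["c", "d"]]);
}
```

With plain byLine, the same loop would leave every element of rows pointing into the one shared buffer, which then holds only the last line's contents.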
Re: Speed of csvReader
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote: R takes about half as long to read the file. Both read the data in the "equivalent" type format. Am I doing something incorrect here? csvReader hasn't been benchmarked against and optimized relative to other CSV readers. It does allocate for the parsed string (even if it isn't changed) and it does a number of validation checks. You may get some improvement by disabling the CSV validation, but again this wasn't tested for performance: csvReader!(string,Malformed.ignore)(str) Generally people recommend using GDC/LDC if you need performance from the resulting executable, but csvReader being slower isn't the most surprising. Before submitting my library to Phobos I had started a CSV reader that would do no allocations and instead return string slices. This was never completed, so it never had performance testing done against it; it could very well be slower. https://github.com/JesseKPhillips/JPDLibs/blob/csvoptimize/csv/csv.d My original CSV parser was really slow because I parsed the string twice.
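For reference, a small usage sketch of the Malformed.ignore variant mentioned above (the sample data is made up; '|' matches the separator used in the benchmarks elsewhere in this thread):

```d
import std.algorithm : map;
import std.array : array;
import std.csv : csvReader, Malformed;

void main()
{
    string data = "1|2|3\n4|5|6";

    // Malformed.ignore turns off std.csv's validation checks; malformed
    // input then yields garbage instead of throwing a CSVException.
    auto rows = csvReader!(string, Malformed.ignore)(data, '|')
                .map!(r => r.array)
                .array;
    assert(rows == [["1", "2", "3"], ["4", "5", "6"]]);
}
```

Whether skipping validation actually buys measurable speed would need benchmarking, as the post notes.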
Re: Speed of csvReader
On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer wrote: On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote: On Thu, 21 Jan 2016 18:37:08 +, data pulverizer wrote: It's interesting that the first output array is not the same as the input. byLine reuses a buffer (for speed) and the subsequent split operation just returns slices into that buffer. So when byLine progresses to the next line, the strings (slices) returned previously now point into a buffer with different contents. You should either use byLineCopy or .idup to create copies of the relevant strings. If your use case allows for streaming and doesn't require having all the data present at once, you could continue to use byLine and just be careful not to refer to previous rows. Thanks. It now works with byLineCopy(). Time (s): 1.128

Currently the timing is similar to Python pandas:

# Script (Python 2.7.6)
import pandas as pd
import time

col_types = {'col1': str, 'col2': str, 'col3': str, 'col4': str,
             'col5': str, 'col6': str, 'col7': str, 'col8': str,
             'col9': str, 'col10': str, 'col11': str, 'col12': str,
             'col13': str, 'col14': str, 'col15': str, 'col16': str,
             'col17': str, 'col18': str, 'col19': str, 'col20': str,
             'col21': str, 'col22': str}

begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep = '|', dtype = col_types)
end = time.time()
print end - begin

$ python file_read.py
1.19544792175