Re: Processing a gzipped csv-file by line-by-line
On 5/11/17 8:18 PM, H. S. Teoh via Digitalmars-d-learn wrote:
> On Wed, May 10, 2017 at 11:40:08PM +, Jesse Phillips via Digitalmars-d-learn wrote:
> > If you can get the zip to decompress into a range of dchar then std.csv will work with it. It is by far not the fastest, but much speed is lost since it supports input ranges and doesn't specialize on any other range type.
>
> I actually spent some time today to look into whether fastcsv can possibly be made to work with general input ranges as long as they support slicing... and immediately ran into the infamous autodecoding issue: strings are not random-access ranges because of autodecoding, so it would require either extensive code surgery to make it work, or ugly hacks to bypass autodecoding. I'm quite tempted to attempt the latter, in fact, but not now since it's getting busier at work and I don't have that much free time to spend on a major refactoring of fastcsv.

Yeah, iopipe treats char[] as a random-access sliceable range :) Autodecoding gets annoying if you want to do anything fancy (like chain(somestr, someotherstr))

> Alternatively, I could possibly hack together a version of fastcsv that took a range of const(char)[] as input (rather than a single string), so that, in theory, it could handle arbitrarily large input files as long as the caller can provide a range of data blocks, e.g., File.byChunk, or in this particular case, a range of decompressed data blocks from whatever decompressor is used to extract the data. As long as you consume the individual rows without storing references to them indefinitely (don't try to make an array of the entire dataset), fastcsv's optimizations should still work, since unreferenced blocks will eventually get cleaned up by the GC when memory runs low.

I'm interested in getting a fast CSV parser built on top of iopipe. I may fork your code and see if I can get it to work. Since you already work on arrays, it should be quite simple, since arrays are also iopipes by default.

-Steve
Re: Processing a gzipped csv-file by line-by-line
On 5/10/17 7:17 PM, Nicholas Wilson wrote:
> On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
> > What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?
> >
> > Is it possible to combine
> >
> > http://dlang.org/phobos/std_zlib.html
> >
> > with some stream variant of
> >
> > File(path).byLineFast
> >
> > ?
>
> I suggest you take a look at Steven's iopipe (also watch his DConf presentation). It should be very simple.

Yeah, this should work and be quite fast:

    import iopipe.zip;
    import iopipe.textpipe;
    import iopipe.bufpipe;
    import iopipe.stream;

    foreach(line; openDev(path).bufd.unzip.decodeText.byLineRange)

I think that was actually one of my slide examples.

-Steve
Re: Processing a gzipped csv-file by line-by-line
On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
> On Wed, May 10, 2017 at 11:40:08PM +, Jesse Phillips via Digitalmars-d-learn wrote:
> [...]
> > H.S. Teoh mentioned fastcsv but requires all the data to be in memory.
>
> Or you could use std.mmfile. But if it's decompressed data, then it would still need to be small enough to fit in memory. Well, in theory you *could* use an anonymous mapping for std.mmfile as an OS-backed virtual memory buffer to decompress into, but it's questionable whether that's really worth the effort.
>
> > If you can get the zip to decompress into a range of dchar then std.csv will work with it. It is by far not the fastest, but much speed is lost since it supports input ranges and doesn't specialize on any other range type.
>
> I actually spent some time today to look into whether fastcsv can possibly be made to work with general input ranges as long as they support slicing... and immediately ran into the infamous autodecoding issue: strings are not random-access ranges because of autodecoding, so it would require either extensive code surgery to make it work, or ugly hacks to bypass autodecoding. I'm quite tempted to attempt the latter, in fact, but not now since it's getting busier at work and I don't have that much free time to spend on a major refactoring of fastcsv.
>
> Alternatively, I could possibly hack together a version of fastcsv that took a range of const(char)[] as input (rather than a single string), so that, in theory, it could handle arbitrarily large input files as long as the caller can provide a range of data blocks, e.g., File.byChunk, or in this particular case, a range of decompressed data blocks from whatever decompressor is used to extract the data. As long as you consume the individual rows without storing references to them indefinitely (don't try to make an array of the entire dataset), fastcsv's optimizations should still work, since unreferenced blocks will eventually get cleaned up by the GC when memory runs low.
>
> T

I hacked your code to work with std.experimental.allocator. If I remember correctly, it was a fair bit faster for my use. Let me know if you would like me to tidy it up into a pull request.

Thanks for the library.

Also, I sent you an email. Not sure if you got it.

Laeeth
Re: Processing a gzipped csv-file by line-by-line
On Wed, May 10, 2017 at 11:40:08PM +, Jesse Phillips via Digitalmars-d-learn wrote:
[...]
> H.S. Teoh mentioned fastcsv but requires all the data to be in memory.

Or you could use std.mmfile. But if it's decompressed data, then it would still need to be small enough to fit in memory. Well, in theory you *could* use an anonymous mapping for std.mmfile as an OS-backed virtual memory buffer to decompress into, but it's questionable whether that's really worth the effort.

> If you can get the zip to decompress into a range of dchar then
> std.csv will work with it. It is by far not the fastest, but much
> speed is lost since it supports input ranges and doesn't specialize on
> any other range type.

I actually spent some time today to look into whether fastcsv can possibly be made to work with general input ranges as long as they support slicing... and immediately ran into the infamous autodecoding issue: strings are not random-access ranges because of autodecoding, so it would require either extensive code surgery to make it work, or ugly hacks to bypass autodecoding. I'm quite tempted to attempt the latter, in fact, but not now since it's getting busier at work and I don't have that much free time to spend on a major refactoring of fastcsv.

Alternatively, I could possibly hack together a version of fastcsv that took a range of const(char)[] as input (rather than a single string), so that, in theory, it could handle arbitrarily large input files as long as the caller can provide a range of data blocks, e.g., File.byChunk, or in this particular case, a range of decompressed data blocks from whatever decompressor is used to extract the data. As long as you consume the individual rows without storing references to them indefinitely (don't try to make an array of the entire dataset), fastcsv's optimizations should still work, since unreferenced blocks will eventually get cleaned up by the GC when memory runs low.

T

--
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
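
The "range of data blocks" idea can be sketched roughly as follows. This is not fastcsv code and not something posted in the thread; `eachCsvRecord` is a hypothetical helper that only finds record boundaries across block boundaries. It assumes `\n` line endings and RFC-style quoting, where an escaped quote (`""`) toggles the quote state twice and so cancels out. The slice handed to the callback is only valid during the call, which matches the advice above about not storing references to rows.

```d
import std.stdio : writeln;

/// Hypothetical helper: calls `sink` with each complete CSV record found in
/// a sequence of text blocks (e.g. File.byChunk or decompressed blocks).
void eachCsvRecord(Blocks)(Blocks blocks,
                           scope void delegate(const(char)[] record) sink)
{
    char[] pending; // text after the last complete record

    foreach (block; blocks)
    {
        pending ~= cast(const(char)[]) block;

        size_t start = 0;
        bool inQuotes = false; // `pending` always begins at a record boundary

        foreach (i, c; pending) // iterates code units, no autodecoding
        {
            if (c == '"')
                inQuotes = !inQuotes; // an escaped quote ("") toggles twice
            else if (c == '\n' && !inQuotes)
            {
                sink(pending[start .. i]);
                start = i + 1;
            }
        }

        // Keep only the unfinished tail; the consumed text becomes garbage.
        pending = pending[start .. $].dup;
    }

    if (pending.length)
        sink(pending); // last record without a trailing newline
}

void main()
{
    // Two blocks that split a quoted, multi-line field in the middle.
    auto blocks = ["id,note\n1,\"two\nli", "nes\"\n2,plain\n"];

    string[] records;
    eachCsvRecord(blocks, (const(char)[] r) { records ~= r.idup; });

    assert(records == ["id,note", "1,\"two\nlines\"", "2,plain"]);
    writeln(records.length, " records");
}
```

Rescanning `pending` from the start on every block keeps the quote state trivial at the cost of extra work for very long records; a real parser would carry the scan position and quote state across blocks instead.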
Re: Processing a gzipped csv-file by line-by-line
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
> What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?
>
> Is it possible to combine
>
> http://dlang.org/phobos/std_zlib.html
>
> with some stream variant of
>
> File(path).byLineFast
>
> ?

I was curious what byLineFast was. I'm guessing it's from here: https://github.com/biod/BioD/blob/master/bio/core/utils/bylinefast.d. I didn't test it, but it appears it may pre-date the speed improvements made to std.stdio.byLine perhaps a year and a half ago. If so, it might be worth comparing it to the current Phobos version, and of course iopipe.

As mentioned in one of the other replies, byLine and its variants aren't appropriate for CSV with escapes. For that, a real CSV parser is needed. As an alternative, run a converter that converts from CSV to another format.

--Jon
Re: Processing a gzipped csv-file by line-by-line
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
> What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?
>
> Is it possible to combine
>
> http://dlang.org/phobos/std_zlib.html
>
> with some stream variant of
>
> File(path).byLineFast
>
> ?

You can't really parse a CSV file line by line, because a quoted field may itself contain a newline. H.S. Teoh mentioned fastcsv, but it requires all the data to be in memory.

If you can get the zip to decompress into a range of dchar then std.csv will work with it. It is far from the fastest; much of the speed is lost because it supports any input range and doesn't specialize on any other range type.
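
A small sketch of why splitting on lines and parsing CSV are not the same thing (the sample data and the helper names `countLines` and `parseRecords` are made up for illustration). The second record contains a quoted field with an embedded newline, so a line splitter sees three lines where std.csv sees two records:

```d
import std.array : array;
import std.csv : csvReader;
import std.range : walkLength;
import std.stdio : writeln;
import std.string : lineSplitter;

/// Hypothetical helper: number of physical lines in the text.
size_t countLines(string text)
{
    return text.lineSplitter.walkLength;
}

/// Hypothetical helper: all records as arrays of string fields.
string[][] parseRecords(string text)
{
    string[][] rows;
    foreach (record; csvReader!string(text))
        rows ~= record.array; // copy the fields out of the lazy record
    return rows;
}

void main()
{
    // Header plus one data record whose second field spans two lines.
    string text = "id,note\n1,\"two\nlines\"\n";

    auto rows = parseRecords(text);

    assert(countLines(text) == 3);            // naive line count
    assert(rows.length == 2);                 // actual CSV records
    assert(rows[1] == ["1", "two\nlines"]);   // newline kept inside the field

    writeln(countLines(text), " lines, ", rows.length, " records");
}
```

If the data is known never to contain quoted newlines, line-based processing is fine; the parser only matters once escapes are possible.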
Re: Processing a gzipped csv-file by line-by-line
On Wednesday, 10 May 2017 at 23:19:15 UTC, H. S. Teoh wrote:
> Also, if you need to parse lots of CSV data very fast, you might be interested in this:
>
> https://github.com/quickfur/fastcsv
>
> T

Or asdf: https://github.com/tamediadigital/asdf
Re: Processing a gzipped csv-file by line-by-line
On Wed, May 10, 2017 at 11:17:44PM +, Nicholas Wilson via Digitalmars-d-learn wrote:
> On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
> > What's the fastest way to on-the-fly decompress and process a gzipped
> > csv-file line by line?
> >
> > Is it possible to combine
> >
> > http://dlang.org/phobos/std_zlib.html
> >
> > with some stream variant of
> >
> > File(path).byLineFast
> >
> > ?
>
> I suggest you take a look at Steven's iopipe (also watch his DConf
> presentation). It should be very simple.

Also, if you need to parse lots of CSV data very fast, you might be interested in this:

https://github.com/quickfur/fastcsv

T

--
Just because you can, doesn't mean you should.
Re: Processing a gzipped csv-file by line-by-line
On Wednesday, 10 May 2017 at 22:20:52 UTC, Nordlöw wrote:
> What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?
>
> Is it possible to combine
>
> http://dlang.org/phobos/std_zlib.html
>
> with some stream variant of
>
> File(path).byLineFast
>
> ?

I suggest you take a look at Steven's iopipe (also watch his DConf presentation). It should be very simple.
Re: Processing a gzipped csv-file by line-by-line
Nordlöw wrote:
> What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?
>
> Is it possible to combine
>
> http://dlang.org/phobos/std_zlib.html
>
> with some stream variant of
>
> File(path).byLineFast
>
> ?

iv.vfs[0] can do that (transparently decompress gzip files, and more). yet it is far from "fastest", so i don't think that it will fit. yet i can't miss such a great opportunity for self-promotion.

[0] http://repo.or.cz/iv.d.git/tree/HEAD:/vfs
Processing a gzipped csv-file by line-by-line
What's the fastest way to on-the-fly decompress and process a gzipped csv-file line by line?

Is it possible to combine

http://dlang.org/phobos/std_zlib.html

with some stream variant of

File(path).byLineFast

?
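
For reference, a Phobos-only baseline for this question might look like the sketch below. It is not streaming and makes no claim to be the fastest option: it assumes the decompressed data fits in memory, inflates the whole buffer with std.zlib.uncompress, and hands the text to std.csv. `gunzipToString` and `gzipText` are hypothetical helper names; `gzipText` only builds an in-memory gzip buffer so the example runs without a real file.

```d
import std.array : array;
import std.csv : csvReader;
import std.stdio : writeln;
import std.zlib : Compress, HeaderFormat, uncompress;

/// Hypothetical helper: inflates a complete gzip buffer in one call.
string gunzipToString(const(void)[] gz)
{
    // winbits = 15 + 16 asks zlib to expect a gzip header and trailer.
    // The result is a freshly allocated buffer, so viewing it as
    // immutable text is acceptable for this sketch.
    return cast(string) uncompress(gz, 0, 15 + 16);
}

/// Hypothetical helper: gzip-compresses text in memory (stand-in for a .gz file).
const(ubyte)[] gzipText(string text)
{
    auto c = new Compress(HeaderFormat.gzip);
    auto head = cast(const(ubyte)[]) c.compress(text);
    auto tail = cast(const(ubyte)[]) c.flush();
    return head ~ tail;
}

void main()
{
    // With a real file this would be: auto gz = std.file.read(path);
    auto gz = gzipText("name,qty\nbolt,12\nnut,7\n");

    // csvReader!string yields one lazy range of string fields per record.
    foreach (record; csvReader!string(gunzipToString(gz)))
        writeln(record.array);
}
```

For files too large to hold in memory, the decompression step would have to produce blocks incrementally (for example with std.zlib.UnCompress over File.byChunk, or with iopipe), and the parser would need to accept those blocks.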