Re: [go-nuts] compressing long list of short strings

2016-08-11 Thread Egon
On Thursday, 11 August 2016 02:24:49 UTC+3, Alex Flint wrote:
> There are around 2M strings, and their total size is ~6 GB, so an average
> of 3k each.

What kind of data? How large is the alphabet? What is the distribution of letters? Examples would be good :)

> I actually looked
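The corpus statistics Egon asks about (alphabet size, letter distribution) are easy to gather before choosing a compressor. A minimal sketch, with a made-up `stats` helper and toy input:

```go
package main

import "fmt"

// stats reports the number of distinct runes in the corpus and how
// often each one occurs, the kind of summary useful for picking a
// compression scheme. Hypothetical helper, not from the thread.
func stats(strs []string) (alphabet int, freq map[rune]int) {
	freq = make(map[rune]int)
	for _, s := range strs {
		for _, r := range s {
			freq[r]++
		}
	}
	return len(freq), freq
}

func main() {
	n, freq := stats([]string{"abba", "cab"})
	fmt.Println(n, freq['a']) // 3 distinct letters; 'a' occurs 3 times
}
```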

Re: [go-nuts] compressing long list of short strings

2016-08-10 Thread Dan Kortschak
This looks like something that is solved for genomics data. If you are OK with decompressing m strings where m << n, then the BGZF addition to gzip would work for you. In brief, BGZF splits the gzip stream into 64 KB blocks, which can be indexed. The spec for BGZF is here [1] (section 4 from page 11 on) and
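The BGZF-style scheme Dan describes can be sketched with the standard library alone: group the strings into small blocks, gzip each block independently, and index which block each string landed in, so a lookup decompresses one block rather than the whole list. This is a simplified illustration of the idea, not the BGZF format itself (real BGZF stores the block size in a gzip extra field and uses virtual file offsets):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// entry locates one string: which block holds it and where inside
// the decompressed block it starts.
type entry struct{ block, off, n int }

type store struct {
	blocks [][]byte // one self-contained gzip member per block
	index  []entry
}

// build groups strings into blocks of at most blockSize decompressed
// bytes and gzips each block independently, BGZF-style.
func build(strs []string, blockSize int) *store {
	s := &store{}
	var buf bytes.Buffer
	flush := func() {
		if buf.Len() == 0 {
			return
		}
		var out bytes.Buffer
		zw := gzip.NewWriter(&out)
		zw.Write(buf.Bytes())
		zw.Close()
		s.blocks = append(s.blocks, out.Bytes())
		buf.Reset()
	}
	for _, str := range strs {
		if buf.Len() > 0 && buf.Len()+len(str) > blockSize {
			flush()
		}
		s.index = append(s.index, entry{len(s.blocks), buf.Len(), len(str)})
		buf.WriteString(str)
	}
	flush()
	return s
}

// get decompresses only the single block containing string i.
func (s *store) get(i int) (string, error) {
	e := s.index[i]
	zr, err := gzip.NewReader(bytes.NewReader(s.blocks[e.block]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	data, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(data[e.off : e.off+e.n]), nil
}

func main() {
	s := build([]string{"alpha", "beta", "gamma", "delta"}, 8)
	v, _ := s.get(2)
	fmt.Println(v) // gamma
}
```

The block size trades compression ratio against lookup cost: bigger blocks give the compressor more context but mean more wasted decompression per lookup.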

Re: [go-nuts] compressing long list of short strings

2016-08-10 Thread Alex Flint
There are around 2M strings, and their total size is ~6 GB, so an average of 3k each. I actually looked briefly at Go's compress/flate to see whether something like what you're describing is possible without writing my own compressor, but I couldn't see any obvious way to get at the underlying
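One internal that compress/flate does expose is preset dictionaries, via `flate.NewWriterDict` and `flate.NewReaderDict`. That allows each short string to be compressed independently while still exploiting redundancy shared across the list. A minimal sketch; the dictionary contents here are a placeholder, and in practice it would be built from substrings that recur across the corpus:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// A preset dictionary shared by every string. Placeholder contents;
// a real one would be mined from the corpus.
var dict = []byte("the quick brown fox jumps over the lazy dog")

// compress deflates one string against the shared dictionary.
func compress(s string) []byte {
	var buf bytes.Buffer
	zw, _ := flate.NewWriterDict(&buf, flate.BestCompression, dict)
	io.WriteString(zw, s)
	zw.Close()
	return buf.Bytes()
}

// decompress inflates one string; only the dictionary and this
// string's compressed bytes are needed, not the rest of the list.
func decompress(b []byte) string {
	zr := flate.NewReaderDict(bytes.NewReader(b), dict)
	defer zr.Close()
	out, _ := io.ReadAll(zr)
	return string(out)
}

func main() {
	c := compress("the lazy dog jumps over the fox")
	fmt.Println(decompress(c)) // the lazy dog jumps over the fox
}
```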

Re: [go-nuts] compressing long list of short strings

2016-08-10 Thread Ian Lance Taylor
On Wed, Aug 10, 2016 at 3:27 PM, Alex Flint wrote:
> I have a long list of short strings that I want to compress, but I want to
> be able to decompress an arbitrary string in the list at any time without
> decompressing the entire list.
>
> I know the list ahead of time and

[go-nuts] compressing long list of short strings

2016-08-10 Thread Alex Flint
I have a long list of short strings that I want to compress, but I want to be able to decompress an arbitrary string in the list at any time without decompressing the entire list. I know the list ahead of time, and it doesn't matter how much preprocessing time is involved. It is also fine if there
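The simplest baseline for this requirement is to compress each string independently and record where each one's compressed bytes start, so lookup i decompresses only that string. A minimal sketch using compress/flate; `pack` and `unpack` are illustrative names, not from the thread:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// pack deflates each string independently into one blob; offsets[i]
// marks where string i's compressed bytes begin, with a sentinel
// end offset appended last.
func pack(strs []string) (blob []byte, offsets []int) {
	var buf bytes.Buffer
	for _, s := range strs {
		offsets = append(offsets, buf.Len())
		zw, _ := flate.NewWriter(&buf, flate.DefaultCompression)
		io.WriteString(zw, s)
		zw.Close()
	}
	offsets = append(offsets, buf.Len())
	return buf.Bytes(), offsets
}

// unpack decompresses only string i, slicing its bytes out of the blob.
func unpack(blob []byte, offsets []int, i int) string {
	zr := flate.NewReader(bytes.NewReader(blob[offsets[i]:offsets[i+1]]))
	defer zr.Close()
	out, _ := io.ReadAll(zr)
	return string(out)
}

func main() {
	blob, offs := pack([]string{"hello", "world", "go-nuts"})
	fmt.Println(unpack(blob, offs, 1)) // world
}
```

The catch, and the reason for the rest of the thread, is that deflate carries per-stream overhead and finds no cross-string redundancy this way, so for 2M short strings a shared dictionary or block-level grouping compresses much better.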