Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-30 Thread simba nyatsanga
Hi Everyone, Just an update on the above questions. I've updated the numbers in the Google Sheet using data with less entropy here: https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0 I've also got the benchmarking code. Although some of the data examples

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-25 Thread simba nyatsanga
Thanks all for the great feedback! Thanks Daniel for the sample data sets. I loaded them up and they're quite comparable in size to some of the data I'm dealing with. In my case the shapes range from 150 to ~100 million rows. Column-wise they range from 2-3 columns to ~500,000 columns. Thanks

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Daniel Lemire
Here are some realistic tabular data sets... https://github.com/lemire/RealisticTabularDataSets They are small by modern standards but they are also one GitHub clone away. - Daniel On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney wrote: > Thanks Ted. I will echo these

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Wes McKinney
Thanks Ted. I will echo these comments and recommend to run tests on larger and preferably "real" datasets rather than randomly generated ones. The more repetition and less entropy in a dataset, the better Parquet performs relative to other storage options. Web-scale datasets often exhibit these
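
A minimal sketch of why repetition helps, assuming a recent pyarrow (the column values, file names, and codec choice here are purely illustrative): Parquet's dictionary encoding, combined with the column codec, exploits exactly this kind of redundancy.

    import os
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A highly repetitive string column, as often found in real datasets.
    colour = np.random.choice(["red", "green", "blue"], size=1_000_000)
    table = pa.table({"colour": colour})

    # Write the same data with and without dictionary encoding to see the effect on disk size.
    pq.write_table(table, "dict_on.parquet", use_dictionary=True, compression="snappy")
    pq.write_table(table, "dict_off.parquet", use_dictionary=False, compression="snappy")

    for path in ("dict_on.parquet", "dict_off.parquet"):
        print(path, os.path.getsize(path), "bytes")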

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Ted Dunning
Simba, Nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. While that might be a viable test in some situations, there are many where there is considerably less entropy in the data being stored. For instance, if you store
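
One way to see the difference is to compare file sizes for a uniform random column against a column drawn from a small set of repeated values. This is only a sketch, assuming pandas with the pyarrow engine installed; the sizes and values are illustrative:

    import os
    import numpy as np
    import pandas as pd

    n = 1_000_000
    # Essentially uniform random floats: close to incompressible.
    uniform = pd.DataFrame({"x": np.random.uniform(size=n)})
    # Values drawn from a tiny vocabulary: far less entropy, much more compressible.
    repetitive = pd.DataFrame({"x": np.random.choice([0.0, 1.0, 2.5], size=n)})

    for name, df in [("uniform", uniform), ("repetitive", repetitive)]:
        path = f"{name}.parquet"
        df.to_parquet(path, engine="pyarrow", compression="brotli")
        print(name, os.path.getsize(path), "bytes")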

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Uwe, thanks. I've attached a Google Sheet link https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0 Kind Regards Simba On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn wrote: > Hello Simba, > > your plots did not come through. Try

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Uwe L. Korn
Hello Simba, your plots did not come through. Try uploading them somewhere and linking to them in the emails. Attachments are always stripped on Apache mailing lists. Uwe On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote: > Hi Everyone, > > I did some benchmarking to compare the disk size

[Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Everyone, I did some benchmarking to compare the disk size performance when writing Pandas DataFrames to parquet files using Snappy and Brotli compression. I then compared these numbers with those of my current file storage solution. In my current (non-Arrow+Parquet) solution, every column in
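
For reference, a minimal version of this kind of disk-size comparison with pandas and the pyarrow engine might look like the following sketch; the DataFrame shape and file names are placeholders, not the original benchmark:

    import os
    import numpy as np
    import pandas as pd

    # Placeholder DataFrame; a real run would use the actual benchmark data.
    df = pd.DataFrame(np.random.randn(1_000_000, 10),
                      columns=[f"col{i}" for i in range(10)])

    for codec in ("snappy", "brotli"):
        path = f"bench_{codec}.parquet"
        df.to_parquet(path, engine="pyarrow", compression=codec)
        print(codec, os.path.getsize(path), "bytes")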