Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Ryan Blue
> Instead of using an existing columnar format like parquet (one file for one type of stats) to store indexes, any reason why we have developed our own format and any benchmarks taken against Puffin vs other formats?
The format needs to store large blobs, which can easily be multiple megabytes

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Russell Spitzer
+1
On Wed, Jun 22, 2022 at 9:34 AM Piotr Findeisen wrote:
> Hi Ajantha,
>
> Thank you for spending the time to look into this.
>
> re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
> of data, and some stats sketches or indices can be bigger than others.
> Also, the

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Piotr Findeisen
Hi Ajantha,
Thank you for spending the time to look into this.
re a: I think I remember Ryan saying Parquet isn't good for bigger pieces of data, and some stats sketches or indices can be bigger than others. Also, the Parquet row logical / columnar storage format doesn't give as much benefit for

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-21 Thread Ajantha Bhat
Thank you Piotr for all of the work you’ve put into this. I just checked the spec. I have a few newbie questions.
a. Instead of using an existing columnar format like parquet (one file for one type of stats) to store indexes, any reason why we have developed our own format and any benchmarks

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Miao Wang
+1 on the format! It looks great! Thanks for materializing the initial design idea.
Miao
From: Kyle Bendickson
Date: Sunday, June 12, 2022 at 1:55 PM
To: dev@iceberg.apache.org
Subject: Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Kyle Bendickson
+1 [non-binding] Thank you Piotr for all of the work you’ve put into this. This should greatly benefit not only Iceberg on Trino, but hopefully can also be used in many novel ways thanks to its well-thought-out generic design and its ability to be extended with new sketches. Looking forward

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread Alexander Jo
+1, let's do it!
On Fri, Jun 10, 2022 at 2:47 PM John Zhuge wrote:
> +1 Looking forward to the features it enables.
>
> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu wrote:
>
>> +1. Looking forward to the partition stats.
>> Best,
>>
>> Yufei
>>
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread John Zhuge
+1 Looking forward to the features it enables.
On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu wrote:
> +1. Looking forward to the partition stats.
> Best,
>
> Yufei
>
>
> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks wrote:
>
>> +1 as well. Excited about the progress here.
>>
>> -Dan
>>
>> On Thu,

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread Yufei Gu
+1. Looking forward to the partition stats.
Best,
Yufei
On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks wrote:
> +1 as well. Excited about the progress here.
>
> -Dan
>
> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen wrote:
>
>> +1, really nice! Indexes are coming!
>>
>> On Fri, Jun 10, 2022 at 8:04

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Daniel Weeks
+1 as well. Excited about the progress here.
-Dan
On Thu, Jun 9, 2022, 6:25 PM Junjie Chen wrote:
> +1, really nice! Indexes are coming!
>
> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho wrote:
>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Junjie Chen
+1, really nice! Indexes are coming!
On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho wrote:
> +1, it's an exciting step for Iceberg, look forward to all the new
> statistics and secondary indices it will allow.
>
> Had a few questions of what the reference to Puffin file(s) will be in the
> Iceberg

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Szehon Ho
+1, it's an exciting step for Iceberg, look forward to all the new statistics and secondary indices it will allow.
Had a few questions of what the reference to Puffin file(s) will be in the Iceberg spec, but it's orthogonal to Puffin file format itself.
Thanks,
Szehon
On Thu, Jun 9, 2022 at

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Ryan Blue
+1 from me! There may also be people that haven't followed the design discussions and we can start a DISCUSS thread if needed. But if everyone is comfortable with the design and implementation, I think it's ready for a vote as well. Huge thanks to Piotr for getting this ready! I think the format

[VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Piotr Findeisen
Hi Everyone,
I propose that we adopt Puffin file format as a file format for statistics and indexes in Iceberg tables.
Puffin file format specification: https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
(previous discussions: https://github.com/apache/iceberg/pull/4944,