Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Ryan Blue
> Instead of using an existing columnar format like parquet (one file for
one type of stats) to store indexes, any reason why we have developed our
own format and any benchmarks taken against Puffin vs other formats?

The format needs to store large blobs, which can easily be multiple
megabytes for indexes like bloom filters. Formats like Parquet aren't
designed for large values like this. Some Parquet writers use page sizes
that are on the order of 64 KB and the default for the Java implementation
is 1 MB. When single values are this large and you want to be able to
quickly get to any one value, it makes no sense to batch them up into row
groups. Instead, a simple container format makes sense.
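
To make the "simple container" idea concrete, here is a minimal Python sketch
of a file that concatenates large blobs and appends a small index at the end.
It is an illustration only and is not the Puffin layout: the function names
(write_container, read_blob), the JSON index, and the 4-byte length suffix are
all assumptions made for this example.

```python
import io
import json
import struct


def write_container(blobs):
    """Concatenate raw blobs, then append a JSON index and its 4-byte length.

    `blobs` maps a (hypothetical) blob name to its bytes payload.
    """
    out = io.BytesIO()
    index = []
    for name, payload in blobs.items():
        # Record where each payload starts so a reader can seek straight to it.
        index.append({"name": name, "offset": out.tell(), "length": len(payload)})
        out.write(payload)
    footer = json.dumps(index).encode("utf-8")
    out.write(footer)
    # Little-endian unsigned int holding the footer size, stored last.
    out.write(struct.pack("<I", len(footer)))
    return out.getvalue()


def read_blob(data, name):
    """Return one blob by name, using only the trailing index to locate it."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    index = json.loads(data[-4 - footer_len:-4].decode("utf-8"))
    for entry in index:
        if entry["name"] == name:
            start = entry["offset"]
            return data[start:start + entry["length"]]
    raise KeyError(name)
```

A reader of such a file needs only the last few bytes to find the index and can
then fetch a single multi-megabyte blob directly; nothing is batched into pages
or row groups.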

> How these Puffin files are linked to Iceberg's metadata files is still a
missing link for me.

Most likely, the initial use of Puffin files will be storing NDV sketches.
The Puffin file will be tracked by Snapshot.
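
For readers unfamiliar with NDV (number of distinct values) sketches, the
following Python sketch shows the general idea with a k-minimum-values
estimator. This is an assumed, simplified estimator chosen for illustration;
the thread does not say which sketch algorithm the stored blobs will use, and
the class name KmvSketch is hypothetical.

```python
import hashlib
import heapq


class KmvSketch:
    """Estimate the number of distinct values from the k smallest hashes seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to ignore repeats

    @staticmethod
    def _hash(value):
        # Map a value to a float in [0, 1) using a stable hash.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        # The k-th smallest hash shrinks as more distinct values arrive.
        return (self.k - 1) / -self._heap[0]
```

The sketch state has a fixed size no matter how many rows are scanned, which is
why it can be stored as a single opaque payload per column.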

> do we need to read the previous snapshot's Puffin file and write back a
new file with updated stats?

No, we will carry stats files forward without modification. Stats can be a
little out of date.
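
A small Python sketch of what "carrying forward" could look like. How stats
files are referenced from table metadata was still open at the time of this
thread, so the Snapshot record, its fields, and the helper names below are
hypothetical and only model the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    stats_file: Optional[str] = None         # path of a stats file, if any
    stats_snapshot_id: Optional[int] = None  # snapshot the stats were computed for


def commit_data_change(parent: Snapshot, new_id: int) -> Snapshot:
    """Create the next snapshot; the parent's stats file is reused untouched."""
    return Snapshot(new_id, parent.stats_file, parent.stats_snapshot_id)


def attach_fresh_stats(snapshot: Snapshot, path: str) -> Snapshot:
    """Point a snapshot at a newly written stats file (e.g. after an ANALYZE)."""
    return Snapshot(snapshot.snapshot_id, path, snapshot.snapshot_id)


def stats_are_stale(snapshot: Snapshot) -> bool:
    """True when the stats file was computed for an earlier snapshot."""
    return (
        snapshot.stats_file is not None
        and snapshot.stats_snapshot_id != snapshot.snapshot_id
    )
```

In this model a commit never reads or rewrites the stats file, so commit time
does not grow with the size of the stats; the cost is that a planner may see
slightly stale estimates until stats are recomputed.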

> do we plan to asynchronously support collecting the stats like "ANALYZE
table" and modify the table metadata with the stats file names? (might need
an Iceberg commit to write new table metadata)

Stats can be handled however engines choose to implement this. ANALYZE is a
good option to begin with.

> we are looking forward to storing [partition level stats] in the Puffin
format. But I'm not sure about storing it as a single file with millions of
rows.

These aren't designed for millions of rows. Instead, each sketch or index
would cover a whole partition. This format is mainly for storing a few
large payloads, not tabular data with many rows and columns.

Ryan

On Mon, Jun 20, 2022 at 11:52 PM Ajantha Bhat  wrote:

> Thank you Piotr for all of the work you’ve put into this.
>
> I just checked the spec. I have a few newbie questions.
>
> a. Instead of using an existing columnar format like parquet (one file for
> one type of stats) to store indexes, any reason why we have developed our
> own format and any benchmarks taken against Puffin vs other formats?
>
> b. How these Puffin files are linked to Iceberg's metadata files is still
> a missing link for me. As the Puffin spec says, these stats are table level
> (updated per snapshots). So, do we need an Iceberg spec change to store the
> file names of these Puffin files so that remove_orphan_files will not
> clean it up accidentally? (also needed for expire_snapshots)
>
> c. NDV's are column level stats. So, I expect the latest puffin file of
> that snapshot will have one row of stats representing stats for each
> column. But if we are to implement secondary index or table level partition
> stats, there can be many rows (millions) in puffin based on the dataset.
> So, for every commit, do we need to read the previous snapshot's Puffin
> file and write back a new file with updated stats? (the file might be very
> huge when data grows?). I think it will affect the commit time. Any
> thoughts on this?
>
> d. Slightly related to the above point, do we plan to asynchronously
> support collecting the stats like "ANALYZE table" and modify the table
> metadata with the stats file names? (might need an Iceberg commit to write
> new table metadata)
>
> e. Even though table level partition stats are available from _partition
> metadata table (along with filter push down support), computing metadata
> table per query will be expensive.
> Hence, we are looking forward to storing them in the Puffin format. But
> I'm not sure about storing it as a single file with millions of rows.
> I would like to collaborate and discuss more on this.
>
> Thanks,
> Ajantha
>
> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang 
> wrote:
>
>> +1 on the format! It looks great!
>>
>>
>>
>> Thanks for materializing the initial design idea.
>>
>>
>>
>> Miao
>>
>> *From: *Kyle Bendickson 
>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>> *To: *dev@iceberg.apache.org 
>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>> statistics and indexes
>>
>> +1 [non-binding]
>>
>>
>>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>>
>>
>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>> be used in many novel ways due to its well thought out generic design and
>> incorporation of the ability to extend with new sketches.
>>
>>
>>
>> Looking forward to the improvements this will bring.

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Russell Spitzer
+1

On Wed, Jun 22, 2022 at 9:34 AM Piotr Findeisen 
wrote:

> Hi Ajantha,
>
> Thank you for spending the time to look into this.
>
> re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
> of data, and some stats sketches or indices can be bigger than others.
> Also, the Parquet row logical / columnar storage format doesn't give as
> much benefit for what's closer to key-value storage.
>
> re b:
> this is still TBD, e.g.
> https://github.com/apache/iceberg/pull/4945
> https://github.com/apache/iceberg/pull/5021
>
> re c, e:
> for partition-level, it's not decided yet how it will be handled
>
> re d:
> yes, ANALYZE can be a separate operation, see
> https://github.com/trinodb/trino/pull/12317 for a POC
>
> Best regards,
> PF
>
>
>
> On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat 
> wrote:
>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>> I just checked the spec. I have a few newbie questions.
>>
>> a. Instead of using an existing columnar format like parquet (one file
>> for one type of stats) to store indexes, any reason why we have developed
>> our own format and any benchmarks taken against Puffin vs other formats?
>>
>> b. How these Puffin files are linked to Iceberg's metadata files is still
>> a missing link for me. As the Puffin spec says, these stats are table level
>> (updated per snapshots). So, do we need an Iceberg spec change to store the
>> file names of these Puffin files so that remove_orphan_files will not
>> clean it up accidentally? (also needed for expire_snapshots)
>>
>> c. NDV's are column level stats. So, I expect the latest puffin file of
>> that snapshot will have one row of stats representing stats for each
>> column. But if we are to implement secondary index or table level partition
>> stats, there can be many rows (millions) in puffin based on the dataset.
>> So, for every commit, do we need to read the previous snapshot's Puffin
>> file and write back a new file with updated stats? (the file might be very
>> huge when data grows?). I think it will affect the commit time. Any
>> thoughts on this?
>>
>> d. Slightly related to the above point, do we plan to asynchronously
>> support collecting the stats like "ANALYZE table" and modify the table
>> metadata with the stats file names? (might need an Iceberg commit to write
>> new table metadata)
>>
>> e. Even though table level partition stats are available from _partition
>> metadata table (along with filter push down support), computing metadata
>> table per query will be expensive.
>> Hence, we are looking forward to storing them in the Puffin format. But
>> I'm not sure about storing it as a single file with millions of rows.
>> I would like to collaborate and discuss more on this.
>>
>> Thanks,
>> Ajantha
>>
>> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang 
>> wrote:
>>
>>> +1 on the format! It looks great!
>>>
>>>
>>>
>>> Thanks for materializing the initial design idea.
>>>
>>>
>>>
>>> Miao
>>>
>>> *From: *Kyle Bendickson 
>>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>>> *To: *dev@iceberg.apache.org 
>>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>>> statistics and indexes
>>>
>>> +1 [non-binding]
>>>
>>>
>>>
>>> Thank you Piotr for all of the work you’ve put into this.
>>>
>>>
>>>
>>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>>> be used in many novel ways due to its well thought out generic design and
>>> incorporation of the ability to extend with new sketches.
>>>
>>>
>>>
>>> Looking forward to the improvements this will bring.
>>>
>>>
>>>
>>> - Kyle
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
>>> wrote:
>>>
>>> +1, let's do it!
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>>>
>>> +1  Looking forward to the features it enables.
>>>
>>>
>>>
>>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>>>
>>> +1. Looking forward to the partition stats.
>>>
>>> Best,
>>>
>>>
>>>

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-22 Thread Piotr Findeisen
Hi Ajantha,

Thank you for spending the time to look into this.

re a: I think I remember Ryan saying Parquet isn't good for bigger pieces
of data, and some stats sketches or indices can be bigger than others.
Also, the Parquet row logical / columnar storage format doesn't give as
much benefit for what's closer to key-value storage.

re b:
this is still TBD, e.g.
https://github.com/apache/iceberg/pull/4945
https://github.com/apache/iceberg/pull/5021

re c, e:
for partition-level, it's not decided yet how it will be handled

re d:
yes, ANALYZE can be a separate operation, see
https://github.com/trinodb/trino/pull/12317 for a POC

Best regards,
PF



On Tue, Jun 21, 2022 at 8:52 AM Ajantha Bhat  wrote:

> Thank you Piotr for all of the work you’ve put into this.
>
> I just checked the spec. I have a few newbie questions.
>
> a. Instead of using an existing columnar format like parquet (one file for
> one type of stats) to store indexes, any reason why we have developed our
> own format and any benchmarks taken against Puffin vs other formats?
>
> b. How these Puffin files are linked to Iceberg's metadata files is still
> a missing link for me. As the Puffin spec says, these stats are table level
> (updated per snapshots). So, do we need an Iceberg spec change to store the
> file names of these Puffin files so that remove_orphan_files will not
> clean it up accidentally? (also needed for expire_snapshots)
>
> c. NDV's are column level stats. So, I expect the latest puffin file of
> that snapshot will have one row of stats representing stats for each
> column. But if we are to implement secondary index or table level partition
> stats, there can be many rows (millions) in puffin based on the dataset.
> So, for every commit, do we need to read the previous snapshot's Puffin
> file and write back a new file with updated stats? (the file might be very
> huge when data grows?). I think it will affect the commit time. Any
> thoughts on this?
>
> d. Slightly related to the above point, do we plan to asynchronously
> support collecting the stats like "ANALYZE table" and modify the table
> metadata with the stats file names? (might need an Iceberg commit to write
> new table metadata)
>
> e. Even though table level partition stats are available from _partition
> metadata table (along with filter push down support), computing metadata
> table per query will be expensive.
> Hence, we are looking forward to storing them in the Puffin format. But
> I'm not sure about storing it as a single file with millions of rows.
> I would like to collaborate and discuss more on this.
>
> Thanks,
> Ajantha
>
> On Mon, Jun 13, 2022 at 2:45 AM Miao Wang 
> wrote:
>
>> +1 on the format! It looks great!
>>
>>
>>
>> Thanks for materializing the initial design idea.
>>
>>
>>
>> Miao
>>
>> *From: *Kyle Bendickson 
>> *Date: *Sunday, June 12, 2022 at 1:55 PM
>> *To: *dev@iceberg.apache.org 
>> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for
>> statistics and indexes
>>
>> +1 [non-binding]
>>
>>
>>
>> Thank you Piotr for all of the work you’ve put into this.
>>
>>
>>
>> This should greatly benefit not only Iceberg on Trino, but hopefully can
>> be used in many novel ways due to its well thought out generic design and
>> incorporation of the ability to extend with new sketches.
>>
>>
>>
>> Looking forward to the improvements this will bring.
>>
>>
>>
>> - Kyle
>>
>>
>>
>> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
>> wrote:
>>
>> +1, let's do it!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>>
>> +1  Looking forward to the features it enables.
>>
>>
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>>
>> +1. Looking forward to the partition stats.
>>
>> Best,
>>
>>
>>
>> Yufei
>>
>>
>>
>>
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>>
>> +1 as well.  Excited about the progress here.
>>
>>
>>
>> -Dan
>>
>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen 
>> wrote:
>>
>> +1, really nice! Indexes are coming!
>>
>>
>>
>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho 
>> wrote:
>>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-20 Thread Ajantha Bhat
Thank you Piotr for all of the work you’ve put into this.

I just checked the spec. I have a few newbie questions.

a. Instead of using an existing columnar format like Parquet (one file for
one type of stats) to store indexes, is there any reason why we have developed
our own format, and have any benchmarks been taken of Puffin vs. other formats?

b. How these Puffin files are linked to Iceberg's metadata files is still a
missing link for me. As the Puffin spec says, these stats are table level
(updated per snapshot). So, do we need an Iceberg spec change to store the
file names of these Puffin files so that remove_orphan_files will not clean
them up accidentally? (also needed for expire_snapshots)

c. NDVs are column-level stats. So, I expect the latest Puffin file of
that snapshot will have one row of stats for each column. But if we are to
implement secondary indexes or table-level partition stats, there can be many
rows (millions) in Puffin based on the dataset. So, for every commit, do we
need to read the previous snapshot's Puffin file and write back a new file
with updated stats? (The file might be very large when data grows.) I think it
will affect the commit time. Any thoughts on this?

d. Slightly related to the above point, do we plan to asynchronously
support collecting the stats like "ANALYZE table" and modify the table
metadata with the stats file names? (might need an Iceberg commit to write
new table metadata)

e. Even though table-level partition stats are available from the _partition
metadata table (along with filter pushdown support), computing the metadata
table per query will be expensive.
Hence, we are looking forward to storing them in the Puffin format. But I'm
not sure about storing them as a single file with millions of rows.
I would like to collaborate and discuss more on this.

Thanks,
Ajantha

On Mon, Jun 13, 2022 at 2:45 AM Miao Wang  wrote:

> +1 on the format! It looks great!
>
>
>
> Thanks for materializing the initial design idea.
>
>
>
> Miao
>
> *From: *Kyle Bendickson 
> *Date: *Sunday, June 12, 2022 at 1:55 PM
> *To: *dev@iceberg.apache.org 
> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for statistics
> and indexes
>
> +1 [non-binding]
>
>
>
> Thank you Piotr for all of the work you’ve put into this.
>
>
>
> This should greatly benefit not only Iceberg on Trino, but hopefully can
> be used in many novel ways due to its well thought out generic design and
> incorporation of the ability to extend with new sketches.
>
>
>
> Looking forward to the improvements this will bring.
>
>
>
> - Kyle
>
>
>
> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
> wrote:
>
> +1, let's do it!
>
>
>
> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>
> +1  Looking forward to the features it enables.
>
>
>
> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>
> +1. Looking forward to the partition stats.
>
> Best,
>
>
>
> Yufei
>
>
>
>
>
> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>
> +1 as well.  Excited about the progress here.
>
>
>
> -Dan
>
> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen  wrote:
>
> +1, really nice! Indexes are coming!
>
>
>
> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho  wrote:
>
> +1, it's an exciting step for Iceberg, look forward to all the new
> statistics and secondary indices it will allow.
>
>
>
> Had a few questions of what the reference to Puffin file(s) will be in the
> Iceberg spec, but it's orthogonal to Puffin file format itself.
>
>
>
> Thanks,
>
> Szehon
>
>
>
> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:
>
> +1 from me!
>
>
>
> There may also be people that haven't followed the design discussions and
> we can start a DISCUSS thread if needed. But if everyone is comfortable
> with the design and implementation, I think it's ready for a vote as well.
>
>
>
> Huge thanks to Piotr for getting this ready! I think the format is going
> to be really useful for both stats and indexes in Iceberg.
>
>
>
> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen 
> wrote:
>
> Hi Everyone,
>
> I propose that we adopt Puffin file format as a file format for statistics
> and indexes in Iceberg tables.
>
>
>
> Puffin file format specification:
>
> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Miao Wang
+1 on the format! It looks great!

Thanks for materializing the initial design idea.

Miao
From: Kyle Bendickson 
Date: Sunday, June 12, 2022 at 1:55 PM
To: dev@iceberg.apache.org 
Subject: Re: [VOTE] Adopt Puffin format as a file format for statistics and 
indexes

+1 [non-binding]

Thank you Piotr for all of the work you’ve put into this.

This should greatly benefit not only Iceberg on Trino, but hopefully can be 
used in many novel ways due to its well thought out generic design and 
incorporation of the ability to extend with new sketches.

Looking forward to the improvements this will bring.

- Kyle

On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <alex...@starburstdata.com> wrote:
+1, let's do it!

On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <jzh...@apache.org> wrote:
+1  Looking forward to the features it enables.

On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <flyrain...@gmail.com> wrote:
+1. Looking forward to the partition stats.
Best,

Yufei


On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <dwe...@apache.org> wrote:
+1 as well.  Excited about the progress here.

-Dan

On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <chenjunjied...@gmail.com> wrote:
+1, really nice! Indexes are coming!

On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
+1, it's an exciting step for Iceberg, look forward to all the new statistics
and secondary indices it will allow.

Had a few questions of what the reference to Puffin file(s) will be in the
Iceberg spec, but it's orthogonal to Puffin file format itself.

Thanks,
Szehon

On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <b...@tabular.io> wrote:
+1 from me!

There may also be people that haven't followed the design discussions and we
can start a DISCUSS thread if needed. But if everyone is comfortable with the
design and implementation, I think it's ready for a vote as well.

Huge thanks to Piotr for getting this ready! I think the format is going to be
really useful for both stats and indexes in Iceberg.

On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
Hi Everyone,

I propose that we adopt Puffin file format as a file format for statistics and 
indexes in Iceberg tables.

Puffin file format specification:
https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
(previous discussions: https://github.com/apache/iceberg/pull/4944,
https://github.com/apache/iceberg-docs/pull/69)

Intended use:
* statistics in Iceberg tables (see
https://github.com/apache/iceberg/pull/4945 and associated proposed
implementation https://github.com/apache/iceberg/pull/4741

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Kyle Bendickson
+1 [non-binding]

Thank you Piotr for all of the work you’ve put into this.

This should greatly benefit not only Iceberg on Trino, but hopefully can be
used in many novel ways due to its well thought out generic design and
incorporation of the ability to extend with new sketches.

Looking forward to the improvements this will bring.

- Kyle

On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
wrote:

> +1, let's do it!
>
> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>
>> +1  Looking forward to the features it enables.
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>>
>>> +1. Looking forward to the partition stats.
>>> Best,
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>>>
>>>> +1 as well.  Excited about the progress here.
>>>>
>>>> -Dan
>>>>
>>>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen wrote:
>>>>
>>>>> +1, really nice! Indexes are coming!
>>>>>
>>>>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho wrote:
>>>>>
>>>>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>>>>> statistics and secondary indices it will allow.
>>>>>>
>>>>>> Had a few questions of what the reference to Puffin file(s) will be
>>>>>> in the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>>>>>
>>>>>> Thanks,
>>>>>> Szehon
>>>>>>
>>>>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue wrote:
>>>>>>
>>>>>>> +1 from me!
>>>>>>>
>>>>>>> There may also be people that haven't followed the design
>>>>>>> discussions and we can start a DISCUSS thread if needed. But if
>>>>>>> everyone is comfortable with the design and implementation, I think
>>>>>>> it's ready for a vote as well.
>>>>>>>
>>>>>>> Huge thanks to Piotr for getting this ready! I think the format is
>>>>>>> going to be really useful for both stats and indexes in Iceberg.
>>>>>>>
>>>>>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I propose that we adopt Puffin file format as a file format for
>>>>>>>> statistics and indexes in Iceberg tables.
>>>>>>>>
>>>>>>>> Puffin file format specification:
>>>>>>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>>>>>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>>>>>>>> https://github.com/apache/iceberg-docs/pull/69)
>>>>>>>>
>>>>>>>> Intended use:
>>>>>>>> * statistics in Iceberg tables (see
>>>>>>>> https://github.com/apache/iceberg/pull/4945 and associated
>>>>>>>> proposed implementation https://github.com/apache/iceberg/pull/4741)
>>>>>>>> * in the future: storage for secondary indexes
>>>>>>>>
>>>>>>>> Puffin file reader and writer implementation:
>>>>>>>> https://github.com/apache/iceberg/pull/4537
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> PF
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>
>>
>> --
>> John Zhuge
>>
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread Alexander Jo
+1, let's do it!

On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:

> +1  Looking forward to the features it enables.
>
> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>
>> +1. Looking forward to the partition stats.
>> Best,
>>
>> Yufei
>>
>>
>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>>
>>> +1 as well.  Excited about the progress here.
>>>
>>> -Dan
>>>
>>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen 
>>> wrote:
>>>
>>>> +1, really nice! Indexes are coming!
>>>>
>>>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho wrote:
>>>>
>>>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>>>> statistics and secondary indices it will allow.
>>>>>
>>>>> Had a few questions of what the reference to Puffin file(s) will be in
>>>>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>>>>
>>>>> Thanks,
>>>>> Szehon
>>>>>
>>>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue wrote:
>>>>>
>>>>>> +1 from me!
>>>>>>
>>>>>> There may also be people that haven't followed the design discussions
>>>>>> and we can start a DISCUSS thread if needed. But if everyone is
>>>>>> comfortable with the design and implementation, I think it's ready
>>>>>> for a vote as well.
>>>>>>
>>>>>> Huge thanks to Piotr for getting this ready! I think the format is
>>>>>> going to be really useful for both stats and indexes in Iceberg.
>>>>>>
>>>>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> I propose that we adopt Puffin file format as a file format for
>>>>>>> statistics and indexes in Iceberg tables.
>>>>>>>
>>>>>>> Puffin file format specification:
>>>>>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>>>>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>>>>>>> https://github.com/apache/iceberg-docs/pull/69)
>>>>>>>
>>>>>>> Intended use:
>>>>>>> * statistics in Iceberg tables (see
>>>>>>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>>>>>>> implementation https://github.com/apache/iceberg/pull/4741)
>>>>>>> * in the future: storage for secondary indexes
>>>>>>>
>>>>>>> Puffin file reader and writer implementation:
>>>>>>> https://github.com/apache/iceberg/pull/4537
>>>>>>>
>>>>>>> Thanks,
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>
>
> --
> John Zhuge
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread John Zhuge
+1  Looking forward to the features it enables.

On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:

> +1. Looking forward to the partition stats.
> Best,
>
> Yufei
>
>
> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>
>> +1 as well.  Excited about the progress here.
>>
>> -Dan
>>
>> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen 
>> wrote:
>>
>>> +1, really nice! Indexes are coming!
>>>
>>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho 
>>> wrote:
>>>
>>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>>> statistics and secondary indices it will allow.
>>>>
>>>> Had a few questions of what the reference to Puffin file(s) will be in
>>>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>>>
>>>> Thanks,
>>>> Szehon
>>>>
>>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue wrote:
>>>>
>>>>> +1 from me!
>>>>>
>>>>> There may also be people that haven't followed the design discussions
>>>>> and we can start a DISCUSS thread if needed. But if everyone is comfortable
>>>>> with the design and implementation, I think it's ready for a vote as well.
>>>>>
>>>>> Huge thanks to Piotr for getting this ready! I think the format is
>>>>> going to be really useful for both stats and indexes in Iceberg.
>>>>>
>>>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I propose that we adopt Puffin file format as a file format for
>>>>>> statistics and indexes in Iceberg tables.
>>>>>>
>>>>>> Puffin file format specification:
>>>>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>>>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>>>>>> https://github.com/apache/iceberg-docs/pull/69)
>>>>>>
>>>>>> Intended use:
>>>>>> * statistics in Iceberg tables (see
>>>>>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>>>>>> implementation https://github.com/apache/iceberg/pull/4741)
>>>>>> * in the future: storage for secondary indexes
>>>>>>
>>>>>> Puffin file reader and writer implementation:
>>>>>> https://github.com/apache/iceberg/pull/4537
>>>>>>
>>>>>> Thanks,
>>>>>> PF
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Best Regards
>>>
>>

-- 
John Zhuge


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-10 Thread Yufei Gu
+1. Looking forward to the partition stats.
Best,

Yufei


On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:

> +1 as well.  Excited about the progress here.
>
> -Dan
>
> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen  wrote:
>
>> +1, really nice! Indexes are coming!
>>
>> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho 
>> wrote:
>>
>>> +1, it's an exciting step for Iceberg, look forward to all the new
>>> statistics and secondary indices it will allow.
>>>
>>> Had a few questions of what the reference to Puffin file(s) will be in
>>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>>
>>> Thanks,
>>> Szehon
>>>
>>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:
>>>
>>>> +1 from me!
>>>>
>>>> There may also be people that haven't followed the design discussions
>>>> and we can start a DISCUSS thread if needed. But if everyone is comfortable
>>>> with the design and implementation, I think it's ready for a vote as well.
>>>>
>>>> Huge thanks to Piotr for getting this ready! I think the format is
>>>> going to be really useful for both stats and indexes in Iceberg.
>>>>
>>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I propose that we adopt Puffin file format as a file format for
>>>>> statistics and indexes in Iceberg tables.
>>>>>
>>>>> Puffin file format specification:
>>>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>>> (previous discussions: https://github.com/apache/iceberg/pull/4944,
>>>>> https://github.com/apache/iceberg-docs/pull/69)
>>>>>
>>>>> Intended use:
>>>>> * statistics in Iceberg tables (see
>>>>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>>>>> implementation https://github.com/apache/iceberg/pull/4741)
>>>>> * in the future: storage for secondary indexes
>>>>>
>>>>> Puffin file reader and writer implementation:
>>>>> https://github.com/apache/iceberg/pull/4537
>>>>>
>>>>> Thanks,
>>>>> PF
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Best Regards
>>
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Daniel Weeks
+1 as well.  Excited about the progress here.

-Dan

On Thu, Jun 9, 2022, 6:25 PM Junjie Chen  wrote:

> +1, really nice! Indexes are coming!
>
> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho  wrote:
>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>> Had a few questions of what the reference to Puffin file(s) will be in
>> the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>
>> Thanks,
>> Szehon
>>
>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:
>>
>>> +1 from me!
>>>
>>> There may also be people that haven't followed the design discussions
>>> and we can start a DISCUSS thread if needed. But if everyone is comfortable
>>> with the design and implementation, I think it's ready for a vote as well.
>>>
>>> Huge thanks to Piotr for getting this ready! I think the format is going
>>> to be really useful for both stats and indexes in Iceberg.
>>>
>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen 
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I propose that we adopt Puffin file format as a file format for
>>>> statistics and indexes in Iceberg tables.
>>>>
>>>> Puffin file format specification:
>>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>>> (previous discussions:  https://github.com/apache/iceberg/pull/4944,
>>>> https://github.com/apache/iceberg-docs/pull/69)
>>>>
>>>> Intended use:
>>>> * statistics in Iceberg tables (see
>>>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>>>> implementation https://github.com/apache/iceberg/pull/4741)
>>>> * in the future: storage for secondary indexes
>>>>
>>>> Puffin file reader and writer implementation:
>>>> https://github.com/apache/iceberg/pull/4537
>>>>
>>>> Thanks,
>>>> PF
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Best Regards
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Junjie Chen
+1, really nice! Indexes are coming!

On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho  wrote:

> +1, it's an exciting step for Iceberg, look forward to all the new
> statistics and secondary indices it will allow.
>
> Had a few questions of what the reference to Puffin file(s) will be in the
> Iceberg spec, but it's orthogonal to Puffin file format itself.
>
> Thanks,
> Szehon
>
> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:
>
>> +1 from me!
>>
>> There may also be people that haven't followed the design discussions and
>> we can start a DISCUSS thread if needed. But if everyone is comfortable
>> with the design and implementation, I think it's ready for a vote as well.
>>
>> Huge thanks to Piotr for getting this ready! I think the format is going
>> to be really useful for both stats and indexes in Iceberg.
>>
>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen 
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> I propose that we adopt Puffin file format as a file format for
>>> statistics and indexes in Iceberg tables.
>>>
>>> Puffin file format specification:
>>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>>> (previous discussions:  https://github.com/apache/iceberg/pull/4944,
>>> https://github.com/apache/iceberg-docs/pull/69)
>>>
>>> Intended use:
>>> * statistics in Iceberg tables (see
>>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>>> implementation https://github.com/apache/iceberg/pull/4741)
>>> * in the future: storage for secondary indexes
>>>
>>> Puffin file reader and writer implementation:
>>> https://github.com/apache/iceberg/pull/4537
>>>
>>> Thanks,
>>> PF
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Best Regards


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Szehon Ho
+1, it's an exciting step for Iceberg, look forward to all the new
statistics and secondary indices it will allow.

Had a few questions of what the reference to Puffin file(s) will be in the
Iceberg spec, but it's orthogonal to Puffin file format itself.

Thanks,
Szehon

On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:

> +1 from me!
>
> There may also be people that haven't followed the design discussions and
> we can start a DISCUSS thread if needed. But if everyone is comfortable
> with the design and implementation, I think it's ready for a vote as well.
>
> Huge thanks to Piotr for getting this ready! I think the format is going
> to be really useful for both stats and indexes in Iceberg.
>
> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen 
> wrote:
>
>> Hi Everyone,
>>
>> I propose that we adopt Puffin file format as a file format for
>> statistics and indexes in Iceberg tables.
>>
>> Puffin file format specification:
>> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
>> (previous discussions:  https://github.com/apache/iceberg/pull/4944,
>> https://github.com/apache/iceberg-docs/pull/69)
>>
>> Intended use:
>> * statistics in Iceberg tables (see
>> https://github.com/apache/iceberg/pull/4945 and associated proposed
>> implementation https://github.com/apache/iceberg/pull/4741)
>> * in the future: storage for secondary indexes
>>
>> Puffin file reader and writer implementation:
>> https://github.com/apache/iceberg/pull/4537
>>
>> Thanks,
>> PF
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Ryan Blue
+1 from me!

There may also be people that haven't followed the design discussions and
we can start a DISCUSS thread if needed. But if everyone is comfortable
with the design and implementation, I think it's ready for a vote as well.

Huge thanks to Piotr for getting this ready! I think the format is going to
be really useful for both stats and indexes in Iceberg.

On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen 
wrote:

> Hi Everyone,
>
> I propose that we adopt Puffin file format as a file format for statistics
> and indexes in Iceberg tables.
>
> Puffin file format specification:
> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
> (previous discussions:  https://github.com/apache/iceberg/pull/4944,
> https://github.com/apache/iceberg-docs/pull/69)
>
> Intended use:
> * statistics in Iceberg tables (see
> https://github.com/apache/iceberg/pull/4945 and associated proposed
> implementation https://github.com/apache/iceberg/pull/4741)
> * in the future: storage for secondary indexes
>
> Puffin file reader and writer implementation:
> https://github.com/apache/iceberg/pull/4537
>
> Thanks,
> PF
>
>

-- 
Ryan Blue
Tabular