Re: Parquet Sync Notes July 31st 2024

2024-08-20 Thread Alkis Evlogimenos
Thank you Fokko.

PR is up: https://github.com/apache/parquet-benchmark/pull/1

Re: Parquet Sync Notes July 31st 2024

2024-08-19 Thread Julien Le Dem
Thanks Fokko!

Re: Parquet Sync Notes July 31st 2024

2024-08-19 Thread Fokko Driesprong
Done!

Kind regards,
Fokko

Re: Parquet Sync Notes July 31st 2024

2024-08-19 Thread Alkis Evlogimenos
Hello Julien. I finally got around to compiling binaries for the benchmarking
repo. Can you add an empty README.md in
https://github.com/apache/parquet-benchmark because otherwise I can't fork
an empty repo (!!!).

Cheers,

Re: Parquet Sync Notes July 31st 2024

2024-08-06 Thread Julien Le Dem
That works for me.
@Alkis Evlogimenos: when you open a PR on
parquet-benchmark, just make it clear how this binary got there and that it
is an unofficial build from the Arrow project waiting for an official
release.



Re: Parquet Sync Notes July 31st 2024

2024-08-06 Thread Rok Mihevc
That would be a temporary solution until parquet-cpp is released? Seems ok
as it's a utility thing.

Re: Parquet Sync Notes July 31st 2024

2024-08-06 Thread Alkis Evlogimenos
Perhaps it is best to compile static binaries of the above and upload to
https://github.com/apache/parquet-benchmark along with a readme?

Re: Parquet Sync Notes July 31st 2024

2024-08-06 Thread Rok Mihevc
Arrow releases are cut ~every three months and the last release was mid
July (https://arrow.apache.org/release/17.0.0.html).
I would speculate 18.0.0 will be public mid September.

Re: Parquet Sync Notes July 31st 2024

2024-08-06 Thread Alkis Evlogimenos
Thank you Julien. When can we expect a new Arrow package release so that I
can compile a doc for customers to donate footers to us?

binary in question:
https://github.com/apache/arrow/blob/main/cpp/tools/parquet/parquet_dump_footer.cc
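
The footers being donated here are the Thrift-serialized file metadata that
parquet_dump_footer extracts, and their size is the heart of the wide-schema
concern in the notes below. A minimal sketch of inspecting one locally
(Python with pyarrow; the 2000-column table and file name are illustrative,
not from the thread):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a deliberately wide table: 2000 columns, three rows each.
    table = pa.table({f"col_{i}": [1, 2, 3] for i in range(2000)})
    pq.write_table(table, "wide.parquet")

    meta = pq.ParquetFile("wide.parquet").metadata
    print(meta.num_columns)      # 2000
    print(meta.serialized_size)  # bytes taken by the Thrift footer alone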

Re: Parquet Sync Notes July 31st 2024

2024-08-02 Thread Julien Le Dem
Following up on my action item, I have created the parquet-benchmark repo:
https://github.com/apache/parquet-benchmark

On Wed, Jul 31, 2024 at 3:46 PM Julien Le Dem  wrote:

> Attendees:
>
>    - Micah: Google, no special topic today
>    - Alkis: Databricks, storage stack. Topic: Parquet extension PR so that
>    we can go in the format. Want to fix the metadata to make it work for
>    wide schemas.
>    - Vinoo: Palantir -> startup in data space. Working on improving the
>    website.
>    - Julien: Datadog. Topic: Make parquet reading possible to be done
>    sequentially (as opposed to footer first)
>    - Rok: Voltron -> freelance in Fintech. Care about Parquet performance.
>    Have time to contribute to footers (“V3”).
>
> Follow up items:
>
> Micah’s Parquet format changes process
>
>    - First PR merged, need to finalize Java
>    - => Mostly done
>
> Jira -> github migration
>
>    - Getting started with github. Will follow up on the mailing list.
>    - => mostly closed discussion. Some follow up async on the discussion.
>
> Agenda:
>
>    - Finalizing [EXTERNAL] Parquet extensions
>    <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
>       - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to update
>       the PR with everything in the doc except Alternatives Considered and
>       split the examples into another page.
>    - New footer metadata discussion.
>
> Discussion:
>
>    - Extensions:
>       - Add functionality to read/write the extension and show that we can
>       ignore it.
>          - 1: write an extension and read the old footer that ignores it.
>          - 2: write an extension and allow reading it back.
>    - New metadata:
>       - Flatbuffer is bigger than thrift: need to optimize metadata.
>          - Start from a 1-1 implementation of the existing footer and keep
>          iterating 1 commit at a time.
>       - Would like to have a branch in github arrow cpp or a public fork
>       on github to share the prototype.
>       - Add to parquet-tool to print the footer.
>          - Add utility to obfuscate schema so that people can share their
>          metadata without sharing proprietary information.
>          - That way we can have data about slow footers and validate we
>          can read faster with the new footer.
>          - => creation of a database of footers.
>       - Getting a feel of what features are used by users.
>          - Alkis would want to share his findings through a blog post.
>       - Also need to make sure the addition of the new footer doesn’t
>       impact old footers too much.
>       - Possibly:
>          - Codspeed for performance testing
>          - Thrift linter: https://github.com/thrift-labs/thrift-fmt
>    - AI:
>       - [Julien] Create a parquet-benchmark repo for a footer db and
>       other things
>          - Example: https://github.com/rok/parquet-benchmark
>       - Alkis to pick where on github to push his prototype branch
>       - Follow up on: https://github.com/apache/parquet-format/pull/445
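
Julien's "sequentially (as opposed to footer first)" topic rests on a fixed
property of the format: the last 8 bytes of a Parquet file are a 4-byte
little-endian footer length followed by the magic "PAR1", so a reader must
seek to the end of the file before it can locate any column data. A minimal
sketch of that tail read (Python; "some.parquet" is an illustrative file
name):

    import struct

    def read_footer(path: str) -> bytes:
        with open(path, "rb") as f:
            f.seek(-8, 2)  # the 8-byte trailer: footer length + magic
            footer_len, magic = struct.unpack("<I4s", f.read(8))
            assert magic == b"PAR1", "not a Parquet file"
            f.seek(-(8 + footer_len), 2)  # back up over the footer itself
            return f.read(footer_len)     # raw Thrift-serialized FileMetaData

    print(len(read_footer("some.parquet")))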
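
The schema-obfuscation utility floated under "New metadata" would let people
donate footers without leaking proprietary column names. The sketch below is
only a rough illustration of the idea, not the proposed parquet-tool change
(Python with pyarrow; the salt and file name are made up): each column name
is replaced with a stable salted hash, so the shape of the schema survives
while the names do not.

    import hashlib
    import pyarrow.parquet as pq

    SALT = b"choose-a-private-salt"  # hypothetical; defeats dictionary guessing

    def obfuscate(name: str) -> str:
        return "c_" + hashlib.sha256(SALT + name.encode()).hexdigest()[:12]

    schema = pq.read_schema("some.parquet")  # hypothetical input file
    print([obfuscate(field.name) for field in schema])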


Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Steve Loughran
In my vector IO PR it was raising false positives about a new class.

Maybe the process should be something like "submitter needs approval for
extra exclusions".

More troublesome: once a release has shipped, those exclusions should be
stripped from the master branch, as then a regression is a genuine failure.

Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Gang Wu
Let me take a look at the exclusions of japicmp. Will try to remove
them as much as possible.

Best,
Gang

Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Gábor Szádovszky
Sorry, I was not able to attend the meeting. Let me put some notes here:

2. We have been fighting with compatibility issues for a while now. That's
why we introduced japicmp. I can see many exclusions in the master pom. I
think we should investigate if these exclusions cause any issues before the
next release. We should avoid excluding code from the compatibility checker
even if it seems reasonable because they tend to be kept there from release
to release.

3. Let's call it parquet-mr 2.0. "Parquet" might be confusing since this
mailing list is used for both parquet-mr and parquet-format.
I think the main requirement for parquet-mr 2.0 would be to have it
"standalone". The core should contain an API that is able read/write
parquet files at different levels. It should be released in a way that it
does not bring any dependencies. (What is required should be shaded.) We
can even start developing the new API in a separate module under parquet-mr
1.x, so our clients can play with it and give feedback.

Re: Parquet Sync meeting notes - April 23 2024

2024-04-24 Thread Claire McGinty
Great news on 1.14 getting ready for release! It's a little last minute,
but I cleaned up a draft I've been working on for supporting an
Array#Contains predicate and opened a PR. It is a lot of code to review,
but surprisingly didn't require any breaking changes. I'm wondering if it's
feasible for this feature to make it into 1.14.0 :)

Best,
Claire

Re: Parquet Sync meeting notes - April 23 2024

2024-04-24 Thread Steve Loughran
Where is the timetable for these calls? I think I'd like to join in if the
timing works for me (UK).

Re: Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Prem Sahoo
I will defer here: as per the Parquet community, Parquet V2 encoding is not
final yet, so they haven't made it official. I have no clue how pyarrow is
supporting it. I thought the Parquet used by pyarrow and Spark should have
the same flavor, but unfortunately it does not 😞, which is very concerning.
Spark doesn't support writing Parquet V2, while pyarrow does.
Sent from my iPhone

Re: Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Gang Wu
I would expect so. parquet-mr has a complete implementation of all v2
encodings, and some other Parquet implementations (e.g. Apache Arrow C++
and arrow-rs) have already supported most (if not all) v2 encodings for a
long time.

Best,
Gang
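
For reference, the writer-side difference is easy to see from pyarrow. A
minimal sketch (the output path is illustrative; data_page_version="2.0"
asks pyarrow to emit DataPageV2 pages with the v2 encodings discussed
above):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1000))})
    pq.write_table(table, "v2_pages.parquet", data_page_version="2.0")

    # Reading back works with pyarrow's reader; per the thread, Spark has
    # no equivalent writer option today.
    print(pq.read_table("v2_pages.parquet").num_rows)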

Re: Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Prem Sahoo
Are we planning to put Parquet V2 encoding in 2.0?
Sent from my iPhone

> On Apr 23, 2024, at 10:31 AM, Xinli shang  wrote:
> 
> 4/23/2024
>
> Attendees: Fokko Driesprong, Vinoo Ganesh, Xinli Shang
>
> Parquet-mr 1.14 release:
>
> 1. Fokko and Gang will discuss starting the release soon
>
> 2. There are a few breaking changes we need to make to ensure backward
> compatibility and do proper testing
>
> 3. Vinoo will shadow and do some testing
>
> 4. Ideas on the release of Parquet 2.0. We start collecting thoughts and
> welcome everybody to share opinions.
> --
> Xinli Shang


Re: Parquet sync meeting notes - 1/26/2022

2022-01-27 Thread Xinli shang
Here is the link for the Cell-Level encryption pre-design. Feel free to share
the feedback in the file directly by adding comments.

On Wed, Jan 26, 2022 at 9:51 AM Xinli shang  wrote:

> 1/26/2022
>
> Attendees: Xinli Shang, Gidon Gershinsky, Pavi Subenderan, Jason Zhang
>
>    1. Data masking
>       1. Pavi: Will create a PR by next week
>       2. PARQUET-2062
>       3. Will have a high-level design sent out soon
>    2. Cell level encryption
>       1. Xinli: Will send out the draft design soon
>       2. Key questions: Should we have the same key for all the cells in
>       the same column? It could generate millions of keys if we do it.
>       3. There are two options explored: 1) use FPE to encrypt in place,
>       2) add extra columns to utilize existing modular encryption. Will
>       have them in the design.
>    3. Release of 1.13.0
>       1. Data masking (null)
>          1. PARQUET-2062 will be done in a few weeks.
>       2. ID resolution instead of name
>          1. PARQUET-2006: need to see if it needs a specification change,
>          the scope of the change, and an ETA. We will decide whether to
>          include it in 1.13.0.
>
>
> Xinli Shang
> Apache Parquet PMC Chair
> Tech Lead Manager at Uber Data Infra

-- 
Xinli Shang
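
The "Data masking (null)" item above (PARQUET-2062) is about serving a
masked column as all nulls. A toy sketch of the idea (Python with pyarrow,
not the actual parquet-mr implementation; the sample table is made up),
preserving schema and row count while hiding the values:

    import pyarrow as pa

    def mask_column(table: pa.Table, name: str) -> pa.Table:
        i = table.schema.get_field_index(name)
        field = table.schema.field(i)
        nulls = pa.nulls(len(table), type=field.type)  # all-null stand-in
        return table.set_column(i, field, nulls)

    t = pa.table({"ssn": ["123-45-6789"], "city": ["Oslo"]})
    print(mask_column(t, "ssn"))  # ssn column present but fully null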


Re: Parquet sync meeting May 2021

2021-05-25 Thread Micah Kornfield
> Iceberg filtering / V2 api
> Adopt Arrow as data model


Curious whether more notes or discussion around the two points above will
happen on the mailing list? Or are there relevant JIRAs?

Thanks,
Micah



On Tue, May 25, 2021 at 9:45 AM Xinli shang  wrote:

> 5/25/2021
>
> Attendees: Xinli Shang, Gábor Szádovszky, Gidon Gershinsky
>
> 1. Parquet 1.12.0 post-release issues:
>    1. Release 1.12.1? Let's wait a bit, since testing and integration are
>       still going on. Better to have more fixes in the release.
>    2. Two integer overflow issues are being reviewed.
> 2. INT96 issue (PARQUET-2037) - fixed and merged
> 3. 1.13.0 planning ideas brainstorming:
>    1. ID resolution instead of name
>    2. Iceberg filtering / V2 API
>    3. Adopt Arrow as data model
>    4. Vectorization API
>    5. Unified file encryption
>    6. Data masking (null)
>
>
> --
> Xinli Shang, Tech Lead Manager at Uber Data Infra
>


Re: Parquet sync meeting 11/24/2020

2020-11-24 Thread Gidon Gershinsky
Thanks Xinli,

A slight correction re 2.1: the recent improvements / pull requests are in
the Java version, parquet-mr. But the ideas for some of them indeed came
from our work with the Arrow teams, which develop a C++ version of Parquet
modular encryption.

Cheers, Gidon


On Tue, Nov 24, 2020 at 8:46 PM Xinli shang  wrote:

> 11/24/2020
>
> Hi all,
>
> Attendees:
>
> 1. To solve the Parquet upgrade issue with the Avro version, should we
>    release Parquet Avro separately?
>    1. For upgrading Avro from 1.8 to 1.9, Parquet only has unit-test and
>       parquet-cli changes, and users can exclude Avro from Parquet.
>    2. Separating would still be beneficial long term, but it is not easy;
>       for now, it is not required.
> 2. Column Encryption
>    1. The C++ version has had several PRs (improvements) recently.
> 3. Data masking
>    1. Some upper layers can develop their own data masking easily.
>    2. We might think about some simple tools rather than executing them in
>       Parquet.
>    3. Developed null data masking in Parquet and it works now. Will open a
>       Google doc and we can discuss from there.
> 4. Parquet 1.11.x adoption in Presto
>    1. A PR has been created but it has a unit test failure.
> 5. Parquet 1.11.x feature adoption in Iceberg
>    1. Iceberg meeting notes
>       <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#>
>       for discussing this issue.
>    2. Issue summary and proposals
>       <https://docs.google.com/document/d/1f8erGSnhVcdD0UokGx2opjmGvCU69g7fsiPXCJhP3MA/edit#>
>    3. For having a Parquet V2 API to support Iceberg: if we do that, it
>       makes sense to have a vectorized API with the Parquet V2 API. Let's
>       bring other PMCs/committers into the discussion at the next
>       community meeting.
> 6. Parquet 1.12.0
>    1. Will cut an RC release soon
>
> Please let me know if you have any questions.
>
> Xinli Shang | Tech Lead Manager @ Uber Data Infra
>
>
> --
> Xinli Shang
>


Re: Parquet sync meeting 11/24/2020

2020-11-24 Thread Ryan Blue
Sorry I wasn't able to make it to the sync today. I should be able to make
it to the next one and we can talk about getting some of Iceberg's changes
upstream.



-- 
Ryan Blue
Software Engineer
Netflix


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Xinli shang
Sorry for the meeting ID issue. I don't know why this same ID worked for
the last two meetings but not today. I will send out another invitation
soon, targeting early December as a makeup.

It was just Julien and me in the meeting today, and we briefly talked about
the current outstanding tasks. We can discuss them in more detail in the
makeup meeting.

On Thu, Nov 21, 2019 at 9:15 AM Julien Le Dem
 wrote:

> that worked, thanks!
>
> On Thu, Nov 21, 2019 at 9:11 AM Xinli shang 
> wrote:
>
> > Can you try https://uber.zoom.us/j/142456544?
> >
> > On Thu, Nov 21, 2019 at 9:07 AM Gabor Szadovszky 
> wrote:
> >
> > > Hi,
> > >
> > > Is it just me who cannot join to the meeting? It says "Invalid meeting
> > > ID"...
> > >
> > > Cheers,
> > > Gabor
> > >
> >
> >
> > --
> > Xinli Shang
> >
>


-- 
Xinli Shang


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Julien Le Dem
that worked, thanks!

On Thu, Nov 21, 2019 at 9:11 AM Xinli shang  wrote:

> Can you try https://uber.zoom.us/j/142456544?
>
> On Thu, Nov 21, 2019 at 9:07 AM Gabor Szadovszky  wrote:
>
> > Hi,
> >
> > Is it just me who cannot join to the meeting? It says "Invalid meeting
> > ID"...
> >
> > Cheers,
> > Gabor
> >
>
>
> --
> Xinli Shang
>


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Xinli shang
Can you try https://uber.zoom.us/j/142456544?

On Thu, Nov 21, 2019 at 9:07 AM Gabor Szadovszky  wrote:

> Hi,
>
> Is it just me who cannot join to the meeting? It says "Invalid meeting
> ID"...
>
> Cheers,
> Gabor
>


-- 
Xinli Shang


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Julien Le Dem
Same for me. Can someone send a new link?

On Thu, Nov 21, 2019 at 9:08 AM Jim Apple  wrote:

> The same is happening to me. Additionally, one of the toll-free phone
> numbers did not pick up.
>
> No outages I see: https://statusgator.com/services/zoom,
> https://status.zoom.us/
>
> On 2019/11/21 17:06:56, Gabor Szadovszky  wrote:
> > Hi,
> >
> > Is it just me who cannot join to the meeting? It says "Invalid meeting
> > ID"...
> >
> > Cheers,
> > Gabor
> >
>


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Jim Apple
The same is happening to me. Additionally, one of the toll-free phone numbers 
did not pick up.

No outages I see: https://statusgator.com/services/zoom, https://status.zoom.us/

On 2019/11/21 17:06:56, Gabor Szadovszky  wrote: 
> Hi,
> 
> Is it just me who cannot join to the meeting? It says "Invalid meeting
> ID"...
> 
> Cheers,
> Gabor
> 


Re: Parquet Sync - 10/17/2019 - Meeting Notes

2019-10-17 Thread Julien Le Dem
Thanks for the notes. Sorry I missed the sync because of a conflict.

On Thu, Oct 17, 2019 at 10:00 AM Gidon Gershinsky  wrote:

> A slight correction re C++. I said the following:
> C++ work is near completion/merge. Deepak has reviewed it and made
> additional changes / refactoring.
>


Re: Parquet Sync - 10/17/2019 - Meeting Notes

2019-10-17 Thread Gidon Gershinsky
A slight correction re C++. I said the following:
C++ work is near completion/merge. Deepak has reviewed it and made
additional changes / refactoring.

On Thu, Oct 17, 2019 at 7:33 PM  wrote:

> 10/17/2019
>
> Attendee:
> Gidon
> Gabor
> Ryan
> Karfiol
> Xinli
>
> Topics:
>
> Column Encryption
> For the C++ version, Gidon worked with Deepak to keep reviews going.
> For Java, we are blocked on the Parquet 1.11 release. Gabor proposed to
> branch off Parquet 1.11 and merge later, but we would need to be in
> master as the final step.
>
> Bloom Filter
> Next step is to wait for the Parquet 1.11 release.
>
> Parquet 1.11 Validation
> Ryan - the release can go ahead without me if there are enough PMCs
> Gabor - I will try to push this effort in a couple of weeks
>
> Ongoing Parquet Work
> There is work underway to create PRs shortly that optimize Parquet's byte
> buffer usage, parallelization, S3 reading, etc.
>
> Xinli Shang (Uber)
>


Re: Parquet Sync - Meeting Notes

2019-09-19 Thread Gidon Gershinsky
Hi Xinli,

Regarding the parquet-cpp encryption - there are no integration errors. A
number of pull requests have been merged by now; the remaining code has
been reviewed and updated; the only outstanding question (how/when to turn
off OpenSSL being included in Arrow) has been addressed by the C++ leads
and will be resolved soon.

Cheers, Gidon.

On Thu, Sep 19, 2019 at 8:37 PM  wrote:

> Hi all,
>
> This is the meeting notes that I took. Feel free to add or correct it if
> something is missed or wrong.
>
> 9/19/2019
>
> Attendees:
> Xinli Shang (Uber)
> Gidon Gershinsky (IBM)
> Jim Apple (Netflix)
> Nandor Kollar, Gabor, and several other Cloudera folks
> Julien Le Dem (WeWork)
> Deepak (Vertica)
> Please add yourself if you are missed.
> Topics:
> Column Encryption
> Parquet-format has the specification merged.
> One PR is merged into parquet-mr; the second is being reviewed.
> For parquet-cpp, we still have some integration errors.
> Xinli backported the encryption code to Parquet 1.10.1 to mitigate the
> risk. We can wait for the 1.11.0 release before deciding whether the
> public community should do that.
>
> Bloom filter
> The spec has been checked in to parquet-format.
> Will continue validating correctness on parquet-mr (feature branch) and
> parquet-cpp (master branch? some code, like the reader/writer, is not in
> the master branch yet).
> Netflix has done enough testing on performance. The remaining tests are
> mainly for correctness.
> There are unit tests and integration tests that cover this.
>
> Parquet-format 2.7.0
> Releasing parquet-format is slow right now. We need the release before
> checking into parquet-mr master.
> There are several options. We prefer option 3, which is to release the
> bloom filter and Parquet encryption together in 2.7.0.
> Three PMCs in this meeting voted +1 for option 3.
> Ryan can help with the release, signing keys, etc.
>
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use parquet-tools, according to
> Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in their descriptions.
>
> 1.11.0 Release
> Column index validation - need Ryan to review it.
>
> Someone is proposing byte_stream_split encoding on the mailing list.
> Ryan made a proposal and the owner just replied that they will try the
> proposal and get back.
>
> Merge Parquet and ORC
> Ryan and Owen had a talk at ApacheCon about merging ORC and Parquet.
> There are a lot of benefits to doing that but also a lot of work.
> Overall, people in this meeting support this effort.
> Ryan can start socializing this effort.
>
> Xinli Shang (Uber)
>
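
On the byte_stream_split encoding proposal mentioned in the notes: the
transform itself is a simple byte transpose that tends to make float data
more compressible, because bytes with similar statistics (e.g. exponents)
end up adjacent. A toy round-trip sketch, not the eventual parquet
implementation:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ByteStreamSplitSketch {
  // Scatter the b-th byte of every value into the b-th stream, then
  // concatenate the four streams.
  static byte[] encode(float[] values) {
    final int n = values.length;
    byte[] raw = new byte[4 * n];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(values);
    byte[] out = new byte[4 * n];
    for (int i = 0; i < n; i++) {
      for (int b = 0; b < 4; b++) {
        out[b * n + i] = raw[4 * i + b];  // byte b of value i -> stream b
      }
    }
    return out;
  }

  static float[] decode(byte[] encoded) {
    final int n = encoded.length / 4;
    byte[] raw = new byte[4 * n];
    for (int i = 0; i < n; i++) {
      for (int b = 0; b < 4; b++) {
        raw[4 * i + b] = encoded[b * n + i];
      }
    }
    float[] values = new float[n];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(values);
    return values;
  }

  public static void main(String[] args) {
    float[] in = {1.0f, 1.5f, 2.0f, 2.5f};
    System.out.println(Arrays.equals(in, decode(encode(in))));  // true
  }
}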

Re: Parquet Sync Meeting Notes

2019-07-19 Thread Jim Apple
I believe this has now been both voted on a few months ago and approved by 
Zoltan last week.

If someone could merge it, that would get us one step closer to a 
parquet-format release.

On 2019/07/17 17:59:06, Xinli shang  wrote: 
> Gidon pointed out that the encryption parquet-format PR is the one below
> only. Sorry for the confusion.
> https://github.com/apache/parquet-format/pull/142



Re: Parquet Sync Meeting Notes

2019-07-17 Thread Xinli shang
Gidon pointed out that the encryption parquet-format PR is the one below
only. Sorry for the confusion.
https://github.com/apache/parquet-format/pull/142

On Wed, Jul 17, 2019 at 10:57 AM Xinli shang  wrote:

> 7/17/2019
>
> Attendees:
>
> Ryan Blue (Netflix)
>
> Jim Apple (Netflix)
>
> Gidon Gershinsky (IBM)
>
> Steven (Yelp)
>
> Deepak and several other folks (Vertica)
>
> Xinli Shang (Uber)
>
> Junjie Chen
>
> Topics:
>
> 1. Column Encryption
>    1. Gidon:
>       1. C++ version code review: Have addressed all feedback. The last
>          step is testing. Hopefully the testing can be done tomorrow.
>       2. Reviewed the bloom filter design from the Parquet encryption
>          perspective. It is straightforward.
>       3. Not much done on the Java version Parquet side. Worked with
>          Xinli to fix several issues.
>       4. Found throughput issues in Java and fixed them.
>    2. Xinli:
>       1. Gidon sent out a design which consolidates different ways of
>          deploying Parquet encryption, but it has not gained much
>          attention from the community. Please have a look if you are
>          interested.
>       2. There is a discussion about unifying table properties in HMS
>          (HIVE-21848) for both ORC and Parquet column encryption. Please
>          chime in if you have a concern.
>       3. Java version parquet-mr PR review is going slowly. How do we
>          move faster? We need more people to review:
>          1. https://github.com/apache/parquet-mr/pull/613
>          2. https://github.com/apache/parquet-mr/pull/614
>          3. https://github.com/apache/parquet-mr/pull/643
>    3. Jim:
>       1. What is blocking the parquet-mr review? We need more people to
>          review it. There are a lot of PRs now.
>    4. Deepak:
>       1. Does the Parquet encryption work with Hive?
>          1. Yes, we have tested it. (Xinli)
>       2. Also has questions about the table properties definition.
>          1. HIVE-21848 (Xinli)
> 2. Bloom filter
>    1. Junjie Chen:
>       1. We need one more PMC vote.
>    2. Ryan:
>       1. I will have a look next week. Were the issues raised earlier
>          addressed?
>          1. Yes. (Junjie)
>       2. parquet-format should be considered upstream of parquet-cpp and
>          parquet-mr, which are implementations.
>       3. We need the encryption specification merged into parquet-format
>          ASAP, then the bloom filter. Otherwise, parquet-format will
>          depend on parquet-cpp and parquet-mr, which is not right.
>          https://github.com/apache/parquet-format/pull/68
>          https://github.com/apache/parquet-format/pull/142
>    3. Xinli:
>       1. Is parquet-format 2.6 + encryption compatible with Parquet 2.7
>          (encryption + bloom filter)?
>          1. By design, yes. (Gidon)
>       2. Please add Xinli for testing if we have a prototype for the
>          bloom filter, to make sure they are compatible.
> 3. Parquet-1.11.0 Release Validation
>    1. Ryan:
>       1. Both Ryan and Zoltan are very busy. No progress so far.
>       2. We need to write a test to make sure the data write/read paths
>          are correct.
> 4. Remove old Parquet modules
>    1. Ryan:
>       1. No time. If somebody has time to do it, go for it.
>
>
> --
> Xinli Shang
>


-- 
Xinli Shang


Re: Parquet Sync - Meeting notes

2019-05-02 Thread Zoltan Ivanfi
Hi,

I would like to add the following to the notes for topic "1. key signing":
- Zoltan brought up the question of whether and how PMC-s from the US
could remotely sign the keys of committers/PMC-s located in Europe.
- Julien and Ryan commented that for the purpose of signing releases
it is not really necessary for the signer's key to be in the web of
trust as long as it is in the central KEYS file (especially if the
signers participate in Parquet Syncs discussing the RC, thereby
implicitly confirming their ownership of that signature).

Br,

Zoltan

On Tue, Apr 30, 2019 at 10:25 PM Xinli shang  wrote:
>
> Hi all,
>
> This is to follow up of the meeting notes below. I created Jira ticket 
> PARQUET-1396 and the design can be found here.  The recorded video in Hadoop 
> Contributor Meetup can also help reading the design. Please share your 
> feedback by commenting on the design doc.
>
> On top of Gidon’s change, we introduced a plugin/interface to Parquet to 
> activate encryption and build up encryption properties. Currently, we 
> implement its schema driven implementation, but it can be implemented in 
> another way too. I will send out the design soon.
>
>
> Xinli
>
>
>
>
> --
> Xinli Shang


Re: Parquet Sync - Meeting notes

2019-04-30 Thread Xinli shang
Hi all,

This is a follow-up to the meeting notes below. I created the Jira ticket
PARQUET-1396 and the design can be found here.
The recorded video from the Hadoop Contributor Meetup can also help with
reading the design. Please share your feedback by commenting on the design
doc.


   1. On top of Gidon’s change, we introduced a plugin/interface to Parquet
   to activate encryption and build up encryption properties. Currently, we
   provide a schema-driven implementation, but it can be implemented in
   other ways too. I will send out the design soon.


Xinli

On Tue, Apr 30, 2019 at 12:30 PM Xinli shang  wrote:

> 4/30/2019
>
> Attendees:
>
> Zoltan and several other folks (Cloudera)
>
> Brian (SAS)
>
> Ryan Blue (Netflix)
>
> Julien (WeWork)
>
> Wes McKinney (Ursa Labs)
>
> Gidon Gershinsky (IBM)
>
> Steven (?)
>
> Aniket (?)
>
> Deepak (?)
>
> Xinli Shang (Uber)
>
>
> Topics:
>
> 1. Key signing issue
>    1. Zoltan/Julien/Ryan:
>       1. We already have an email exchange about this issue.
>       2. In the past, it was done in person, but it is OK to sign each
>          other's keys via video conference. We can do a video session of
>          signing keys.
>       3. It is painful to do this every release.
> 2. Column Encryption
>    1. Gidon:
>       1. The C++ version is progressing well. It is pretty much done.
>       2. Waiting for the Parquet-1.11.0 release to send out the code
>          review.
>       3. Found issues in Java. Worked around them. Will talk to the Java
>          community.
>    2. Xinli:
>       1. On top of Gidon's change, we introduced a plugin/interface to
>          Parquet to activate encryption and build up encryption
>          properties. Currently, we provide a schema-driven
>          implementation, but it can be implemented in other ways too. I
>          will send out the design soon.
>    3. Gidon:
>       1. Overall we took a bottom-up approach. We might need another
>          layer on top of these to make adoption easier.
>    4. Ryan:
>       1. Different companies can have different implementations. It is
>          good to have a plugin model.
>    5. Brian: Question about the key metadata and KMS.
>       1. Currently, Parquet designs it as a byte array. Depending on the
>          implementation, it can be used to record the KMS/key metadata.
> 3. Parquet-1.11.0 Release Validation
>    1. Ryan:
>       1. Validate the write path of the column index - we need to test
>          that the calculation is correct; validation is independent. Ryan
>          will take this task.
>    2. Brian:
>       1. Can help with some testing in the summer if needed.
>    3. Steven:
>       1. What is the test strategy? Any fuzzing tests?
>    4. Ryan:
>       1. We have some random tests but they are not reliable. Inside
>          Netflix, we have stable fuzzing tests. May need to port some to
>          Parquet.
>    5. Xinli:
>       1. We have run a lot of regression tests on Parquet-1.11.0. We
>          added the encryption code on top of 1.11.0 and ran a lot of
>          tests. No new-feature tests of 1.11.0 yet, but existing feature
>          tests are so far so good. Let us know if you want us to add more
>          tests to our test suite.
> 4. Remove old Parquet modules
>    1. Ryan:
>       1. We should remove those old modules if they are not needed.
>       2. Hive module - seems not used.
>       3. Scrooge module - if it is only used by one company, we might not
>          want to maintain it.
>       4. Does anybody still use parquet-tools instead of parquet-cli?
>          Maybe we can mark it as deprecated.
>       5. Open a Jira ticket for it.
>    2. Julien:
>       1. Twitter may use it. Julien will check with Twitter.
>       2. We should communicate widely.
>
>
> --
> Xinli Shang (Uber)
>


-- 
Xinli Shang
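
A note on Brian's key-metadata question in the notes above: since the
format only reserves an opaque byte array for key metadata, one plausible
pattern is to serialize whatever the deployment's KMS needs into it. The
field names below are invented for illustration; this is not a Parquet
API:

import java.nio.charset.StandardCharsets;

public class KeyMetadataSketch {
  // The key metadata is an opaque byte[]; a small JSON document is one
  // way a reader-side KMS client could locate and unwrap the actual key.
  static byte[] buildKeyMetadata(String kmsInstanceId, String masterKeyId,
      String wrappedDek) {
    String json = String.format(
        "{\"kmsInstance\":\"%s\",\"masterKeyId\":\"%s\",\"wrappedDEK\":\"%s\"}",
        kmsInstanceId, masterKeyId, wrappedDek);
    return json.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] metadata = buildKeyMetadata("kms-prod-eu", "mk-2019-04", "AAAAC3Nz...");
    // A writer would attach `metadata` to the column's encryption
    // properties; a reader hands it back to its KMS client to recover the
    // data encryption key.
    System.out.println(new String(metadata, StandardCharsets.UTF_8));
  }
}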


Re: Parquet Sync

2019-04-20 Thread Xinli shang
I will send out an invitation soon with a Zoom link and a conference phone
number. I believe everyone with the link should be able to join using a
browser, or call in by phone.

I plan to send it to the Parquet dev mailing list. Feel free to send me
extra contacts who want to join. Everyone is welcome!

On Sat, Apr 20, 2019 at 8:04 AM Brian Bowman  wrote:

> Does the sync happen on Google Hangout?  Could someone please provide a
> link on where to sign up/connect?
>
> Thanks,
>
> Brian
>
> > On Apr 18, 2019, at 12:51 PM, Xinli shang 
> wrote:
> >
> >
> > Hi all,
> >
> > Please send your agenda for the next Parquet community sync up meeting. I
> > will compile and send the list before the meeting. One of the agenda I
> have
> > so far is encryption.  The meeting will be tentatively at April 30
> Tuesday
> > 9-10am PT, just like our previous regular meeting time. Please let me
> know
> > if you have any questions for agenda or date/time.
> >
> > Xinli
> >
> > On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem
> >  wrote:
> >
> >> It would be fine to have a rotation.
> >>
> >> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I'd be happy to help. I have organized a few of these in the past, and
> >> I've
> >>> recently started similar meetings for the Impala project.
> >>>
> >>> If someone else wants to do it, that's fine for me, too, of course.
> >>>
> >>> Cheers, Lars
> >>>
> >>> On Mon, Apr 15, 2019, 22:14 Julien Le Dem 
> >> wrote:
> >>>
>  Hello all,
>  Since I have been away with the new baby the Parquet syncs have fallen
>  behind.
>  I'd like a volunteer to run those.
>  Responsibilities include taking notes and posting them on the list.
>  Also occasionally finding a good time for the meeting.
>  Any takers? This could be a rotating duty as well.
>  Thank you
>  Julien
> 
> >>>
> >>
> >
> >
> > --
> > Xinli Shang
>
-- 
Xinli Shang


Re: Parquet Sync

2019-04-20 Thread Brian Bowman
Does the sync happen on Google Hangout?  Could someone please provide a link on 
where to sign up/connect?

Thanks,

Brian

> On Apr 18, 2019, at 12:51 PM, Xinli shang  wrote:
> 
> 
> Hi all,
> 
> Please send your agenda for the next Parquet community sync up meeting. I
> will compile and send the list before the meeting. One of the agenda I have
> so far is encryption.  The meeting will be tentatively at April 30 Tuesday
> 9-10am PT, just like our previous regular meeting time. Please let me know
> if you have any questions for agenda or date/time.
> 
> Xinli
> 
> On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem
>  wrote:
> 
>> It would be fine to have a rotation.
>> 
>> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
>> wrote:
>> 
>>> Hi,
>>> 
>>> I'd be happy to help. I have organized a few of these in the past, and
>> I've
>>> recently started similar meetings for the Impala project.
>>> 
>>> If someone else wants to do it, that's fine for me, too, of course.
>>> 
>>> Cheers, Lars
>>> 
>>> On Mon, Apr 15, 2019, 22:14 Julien Le Dem 
>> wrote:
>>> 
 Hello all,
 Since I have been away with the new baby the Parquet syncs have fallen
 behind.
 I'd like a volunteer to run those.
 Responsibilities include taking notes and posting them on the list.
 Also occasionally finding a good time for the meeting.
 Any takers? This could be a rotating duty as well.
 Thank you
 Julien
 
>>> 
>> 
> 
> 
> --
> Xinli Shang


Re: Parquet Sync

2019-04-18 Thread Xinli shang
Hi all,

Please send your agenda items for the next Parquet community sync-up
meeting. I will compile and send the list before the meeting. One of the
agenda items I have so far is encryption. The meeting will tentatively be
on Tuesday, April 30, 9-10am PT, just like our previous regular meeting
time. Please let me know if you have any questions about the agenda or the
date/time.

Xinli

On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem
 wrote:

> It would be fine to have a rotation.
>
> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
> wrote:
>
> > Hi,
> >
> > I'd be happy to help. I have organized a few of these in the past, and
> I've
> > recently started similar meetings for the Impala project.
> >
> > If someone else wants to do it, that's fine for me, too, of course.
> >
> > Cheers, Lars
> >
> > On Mon, Apr 15, 2019, 22:14 Julien Le Dem 
> wrote:
> >
> > > Hello all,
> > > Since I have been away with the new baby the Parquet syncs have fallen
> > > behind.
> > > I'd like a volunteer to run those.
> > > Responsibilities include taking notes and posting them on the list.
> > > Also occasionally finding a good time for the meeting.
> > > Any takers? This could be a rotating duty as well.
> > > Thank you
> > > Julien
> > >
> >
>


-- 
Xinli Shang


Re: Parquet Sync

2019-04-16 Thread Xinli shang
Thanks, Julien! I will work with Lars for the rotation.


On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem
 wrote:

> It would be fine to have a rotation.
>
> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
> wrote:
>
> > Hi,
> >
> > I'd be happy to help. I have organized a few of these in the past, and
> I've
> > recently started similar meetings for the Impala project.
> >
> > If someone else wants to do it, that's fine for me, too, of course.
> >
> > Cheers, Lars
> >
> > On Mon, Apr 15, 2019, 22:14 Julien Le Dem 
> wrote:
> >
> > > Hello all,
> > > Since I have been away with the new baby the Parquet syncs have fallen
> > > behind.
> > > I'd like a volunteer to run those.
> > > Responsibilities include taking notes and posting them on the list.
> > > Also occasionally finding a good time for the meeting.
> > > Any takers? This could be a rotating duty as well.
> > > Thank you
> > > Julien
> > >
> >
>


Re: Parquet Sync

2019-04-16 Thread Brian Bowman
All,

I look forward to participating in the upcoming Parquet Syncs.  I'll be happy 
to be a "scribe in rotation" but would first like to participate in a couple of 
Syncs. 

By way of introduction:  I'm Brian Bowman, 34+ year veteran of SAS R&D.  I've 
been working with Parquet Open Source and C++ for the past four months but have 
no prior open source experience.  My career has been programming in Assembly, 
C, Java and SAS, with decades of work in file format design, storage layer 
internals, and scalable distributed access control capabilities.  For the past 
5 years I've been doing core R&D for Cloud Analytic Services (CAS)  -- the 
modern SAS distributed analytics and data management framework.  I work on the 
CAS distributed table, I/O, and indexing capabilities ... and now Parquet 
integration with CAS.
 
Arrow/Parquet are exciting technologies and I look forward to more work with 
this group as our efforts move ahead.

Best,

Brian

Brian Bowman
Principal Software Developer 
Analytic Server R&D
SAS Institute Inc.


On 4/16/19, 1:54 AM, "Julien Le Dem"  wrote:


It would be fine to have a rotation.

On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
wrote:

> Hi,
>
> I'd be happy to help. I have organized a few of these in the past, and 
I've
> recently started similar meetings for the Impala project.
>
> If someone else wants to do it, that's fine for me, too, of course.
>
> Cheers, Lars
>
> On Mon, Apr 15, 2019, 22:14 Julien Le Dem  wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
>




Re: Parquet Sync

2019-04-15 Thread Julien Le Dem
It would be fine to have a rotation.

On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
wrote:

> Hi,
>
> I'd be happy to help. I have organized a few of these in the past, and I've
> recently started similar meetings for the Impala project.
>
> If someone else wants to do it, that's fine for me, too, of course.
>
> Cheers, Lars
>
> On Mon, Apr 15, 2019, 22:14 Julien Le Dem  wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
>


Re: Parquet Sync

2019-04-15 Thread Julien Le Dem
No requirement to be a PMC member no.

On Mon, Apr 15, 2019 at 10:41 PM Xinli shang 
wrote:

> Is there any requirement like being PMC of Parquet?
>
> On Mon, Apr 15, 2019 at 10:14 PM Julien Le Dem 
> wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
> --
> Xinli Shang
>


Re: Parquet Sync

2019-04-15 Thread Lars Volker
Hi,

I'd be happy to help. I have organized a few of these in the past, and I've
recently started similar meetings for the Impala project.

If someone else wants to do it, that's fine for me, too, of course.

Cheers, Lars

On Mon, Apr 15, 2019, 22:14 Julien Le Dem  wrote:

> Hello all,
> Since I have been away with the new baby the Parquet syncs have fallen
> behind.
> I'd like a volunteer to run those.
> Responsibilities include taking notes and posting them on the list.
> Also occasionally finding a good time for the meeting.
> Any takers? This could be a rotating duty as well.
> Thank you
> Julien
>


Re: Parquet Sync

2019-04-15 Thread Xinli shang
Is there any requirement like being PMC of Parquet?

On Mon, Apr 15, 2019 at 10:14 PM Julien Le Dem 
wrote:

> Hello all,
> Since I have been away with the new baby the Parquet syncs have fallen
> behind.
> I'd like a volunteer to run those.
> Responsibilities include taking notes and posting them on the list.
> Also occasionally finding a good time for the meeting.
> Any takers? This could be a rotating duty as well.
> Thank you
> Julien
>
-- 
Xinli Shang


Re: Parquet sync meeting notes

2018-11-06 Thread Julien Le Dem
- I reached out to Ryan who will get back on the PR
- I reached out to Jacques regarding page level stats
- also advertised it on twitter:
https://twitter.com/J_/status/1059860813115052032

On Tue, Nov 6, 2018 at 9:30 AM Julien Le Dem 
wrote:

> Attendees:
>
>    - Gabor (Cloudera)
>    - Nandor (Cloudera)
>    - Zoltan (Cloudera): new parquet-mr release
>    - Anna (Cloudera): new parquet-mr release. Would like an encryption
>      update
>    - Gidon (IBM): status of encryption design sign-off
>    - Xinli (Uber): encryption
>    - Steven (Yelp)
>    - Julien (WeWork)
>    - Aniket (Google): Cloud Dataproc. Interested in the bloom filter.
>
> Parquet-mr release:
>
>    - Column indexes
>    - Jira open: remove the page level statistics:
>      https://issues.apache.org/jira/browse/PARQUET-1365
>    - Action: reach out about page level stats.
>
> Encryption:
>
>    - https://github.com/apache/parquet-format/pull/114
>    - Work on the C++ implementation and at Uber is blocked on this.
>
> Bloom Filter:
>
>    - Will reach out on the mailing list.
>
> Meeting time:
>
>    - Will start a new vote.
>
>
>


Re: parquet sync notes

2018-10-15 Thread Aniket Mokashi
I would like to attend the next sync. Where do I find instructions to join
this meeting?

On Tue, Oct 9, 2018 at 10:13 AM Julien Le Dem
 wrote:

> Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
> Anna (Cloudera): process, feature branches, etiquette of waiting for
> someone? Blocked
> Zoltan (Cloudera): Feature branches? When to review them?
> Nandor (Cloudera): parquet file with multiple row groups, schema evolution
> Zoltan (Cloudera): column index
> Junjie (Tencent): listening
> Gidon (IBM): encryption next steps
> Jim: bloom filter, bit weaving
> Xinli (Uber): encryption
> Julien (WeWork): encryption
>
> Bloom filter:
>
>    - PR for doc. Parquet-format feature branch.
>       - To be reviewed by: Deepak, Jim, Ryan.
>
> Encryption:
>
>    - Another encryption effort exists; Julien to send intros: Xinli,
>      Gidon, Zoltan
>    - New requirements, updated doc, implement code changes.
>
> Process:
>
>    - Feature branches:
>       - Julien to follow up with Ryan
>       - Feature branches are treated like master:
>          - Every change is reviewed individually through a PR
>          - Every change has a Jira
>          - The only difference is that it is OK to make incompatible
>            changes
>    - Squash merge vs merge commit: merge commit keeps the history but
>      clutters it. 3 options:
>       - Merge commit:
>          - Clutters history (not linear anymore)
>          - But if each commit in the branch has a Jira, seems fine
>       - Squash:
>          - Loses the detailed commits of the feature
>          - Keeps history linear
>       - Rebase feature branch:
>          - Keeps history linear and keeps history
>          - But need to address conflicts for each commit in the branch
>          - Commits in the branch are now disconnected from the PR
>            (modified after the fact)
>    - When is it appropriate to wait:
>       - Balance:
>          - Making sure we don't make incompatible changes to the format
>            and we have final features
>          - Making it easier for people to contribute.
>       - Anna to start a conversation around our etiquette:
>          - How long is it appropriate to wait on feedback
>          - How to know who is the best committer to drive a PR to
>            conclusion
>
> Filtering nested types support:
>
>    - We should store stats for nested types
>
> Page Index benchmark:
>
>    - Nice results, comparing random to sorted files:
>       - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
>       - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
>    - Need to compare the page-size effect on compression and file size
>
> Appending to a parquet file:
>
>    - The type of a column chunk should be consistent with the schema in
>      the footer.
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: parquet sync notes

2018-09-27 Thread Zoltan Ivanfi
Hi,

I have created the feature branches:

- https://github.com/apache/parquet-mr/tree/bloom-filter
- https://github.com/apache/parquet-format/tree/bloom-filter

- https://github.com/apache/parquet-mr/tree/encryption
- https://github.com/apache/parquet-format/tree/encryption

I have also cherry-picked the encryption commits to the latter one.

Br,

Zoltan

On Wed, Sep 26, 2018 at 10:29 AM 俊杰陈  wrote:

> Hi Zoltan
>
> PR #62 contains some rebase info which is not relate to change itself so I
> created PR#99. Actually it only contains one file change now, I will add
> another document file later.
>
> On Wed, Sep 26, 2018 at 3:19 PM, Zoltan Ivanfi wrote:
>
> > Hi,
> >
> > It seems to me that PR #99 does not supersede PR #62, as the latter
> affects
> > 16 files but the former only modifies a single one. Or has the rest of
> the
> > changes been already merged to the codebase from another PR? I checked
> the
> > history and I don't see anything related.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Wed, Sep 26, 2018 at 4:25 AM 俊杰陈  wrote:
> >
> > > Hi
> > >
> > > the pr28 and pr62 of parquet-format was closed. Will we create a
> feature
> > > branch for bloom filter on parquet-mr as well?
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Thanks & Best Regards
>


Re: parquet sync notes

2018-09-26 Thread 俊杰陈
Hi Zoltan,

PR #62 contains some rebase info that is not related to the change itself,
so I created PR #99. It only contains one file change now; I will add
another documentation file later.

On Wed, Sep 26, 2018 at 3:19 PM, Zoltan Ivanfi wrote:

> Hi,
>
> It seems to me that PR #99 does not supersede PR #62, as the latter affects
> 16 files but the former only modifies a single one. Or has the rest of the
> changes been already merged to the codebase from another PR? I checked the
> history and I don't see anything related.
>
> Thanks,
>
> Zoltan
>
> On Wed, Sep 26, 2018 at 4:25 AM 俊杰陈  wrote:
>
> > Hi
> >
> > the pr28 and pr62 of parquet-format was closed. Will we create a feature
> > branch for bloom filter on parquet-mr as well?
> >
> >
> > --
> > Thanks & Best Regards
> >
>


-- 
Thanks & Best Regards


Re: parquet sync notes

2018-09-26 Thread Zoltan Ivanfi
Hi,

It seems to me that PR #99 does not supersede PR #62, as the latter affects
16 files but the former only modifies a single one. Or has the rest of the
changes been already merged to the codebase from another PR? I checked the
history and I don't see anything related.

Thanks,

Zoltan

On Wed, Sep 26, 2018 at 4:25 AM 俊杰陈  wrote:

> Hi
>
> the pr28 and pr62 of parquet-format was closed. Will we create a feature
> branch for bloom filter on parquet-mr as well?
>
>
>
> --
> Thanks & Best Regards
>


Re: parquet sync notes

2018-09-25 Thread 俊杰陈
Hi,

PRs #28 and #62 of parquet-format were closed. Will we create a feature
branch for the bloom filter on parquet-mr as well?

On Wed, Sep 26, 2018 at 12:48 AM, Julien Le Dem wrote:

> Lars (Cloudera Impala): listening in.
> Zoltan, Gabor and Nandor (Cloudera):
>
>    - feature branch reviewed and merged
>    - Parquet-format release
>       - Define scope
>
> Ryan (Netflix)
> Junjie (Tencent): bloom filter
> Jim Apple (cloud service): bloom filter in parquet-mr? Since it got into
> parquet-cpp
> Gidon (IBM): encryption
> Sahil (Cloudera, Impala/Hive): listening in
> Julien (WeWork)
>
> Status update from Gabor:
>
>    - Waiting for reviews.
>       - Plan to merge this Friday.
>       - Please review in the next few days.
>
> Parquet format release:
>
>    - Nanosecond precision
>    - Deprecation of Java-related code
>    - Encryption metadata
>       - One more PR to merge
>    - Plan:
>       - Revert the encryption patches and put them in a feature branch in
>         parquet-format
>       - Apply the same process to bloom filters
>       - The owner of a PR can update it to target the feature branch
>
> Encryption:
>
>    - Old readers can read non-encrypted columns
>       - Changes to metadata
>       - One last PR on parquet-format
>       - We should have a vote before merging it.
>    - Make sure parquet-cpp depends on the source-of-truth Thrift in
>      parquet-format.
>
> Bloom filter:
>
>    - parquet-format/62 and parquet-format/99
>    - parquet-format/28: should be closed as it is outdated. We should
>      port the doc to the more recent PR.
>


-- 
Thanks & Best Regards


Re: Parquet sync meeting minutes

2018-08-17 Thread Zoltan Ivanfi
Hi,

Sorry, that was an error on my side, I suggested Nandor to add a TLDR
section with this title. I agree with your comment, Wes, outcome would have
been a better choice of word than decision.

Br,

Zoltan

On Fri, Aug 17, 2018 at 6:36 PM Wes McKinney  wrote:

> hi Nandor,
>
> A fine detail, and I may be wrong, but I don't think decisions can
> technically be made on a call because time zones do not permit
> everyone to join always and not all collaborators are comfortable
> having live discussions in English. see [1]
>
> You can present the consensus of the participants in the call summary
> and others in the community have an opportunity to provide feedback.
> The "decision" is therefore one based on lazy consensus thereafter if
> there are no objections or follow up discussion
>
> - Wes
>
> [1]: https://www.apache.org/foundation/how-it-works.html#management
>
> On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
>  wrote:
> > Topics discussed and decisions (meeting held on 2018 August 15th, at
> > 6pm CET / 9 am PST):
> >
> > - Aligning page row boundaries between different columns: Debated,
> > please follow-up
> > - Remove Java specific code from parquet-format: Accepted
> > - Column encryption: Please review
> > - Parquet-format release: Scope accepted
> > - C++ mono-repo: Please vote
> >
> >
> >
> > Aligning page row boundaries between different columns (Gabor)
> > --
> >
> > Background: In the existing specification of column indexes, page
> > boundaries are not aligned between different column in respect to row
> > count.
> >
> > Gabor: implemented this logic, interested parties can review the code
> here:
> > - https://github.com/apache/parquet-mr/pull/509
> > - https://github.com/apache/parquet-mr/commits/column-indexes
> >
> > Main takeaway from implementation:
> >
> > - Index filtering logic as currently specified is overcomplicated.
> > - May become a maintenance burden and results in steep learning curve
> > for onboarding - new developers.
> > - Can not be made transparent, vectorized readers (Hive, Spark) have
> > to implement a similar logic.
> >
> > Suggestion:
> >
> > - Align page row boundaries between different columns, i.e. the n-th
> > page of every column should contain the same number of rows.
> > - Filtering logic would be a lot simpler.
> > - Vectorized readers will get index-based filtering without any change
> > required on their side.
> >
> > Response:
> > - Ryan doesn't recommend it. Performance numbers?
> > - Discuss offline or on dev mailing list
> > - Timeline for reaching decision? Within a week. (Gabor already has a
> > working implementation.)
> >
> >
> >
> > Remove Java specific code from parquet-format (Nandor)
> > --
> >
> > Background: Parquet-format contains a few Java classes. Earlier no
> > changes were required in these, but this has changed in recent
> > features, especially with the new column encryption feature, which
> > would add substantial new code.
> >
> > Suggestion (Nandor): Instead of cluttering parquet-format further with
> > java-specific code, move these classes to parquet-mr and deprecate
> > them in parquet-format.
> >
> > What is the motivation behind the status quo? Julien: We may need a
> > different Thrift version in the parquet-thrift binding than in the
> > parquet files themselves. If we move these classes to parquet-mr, we
> > should shade thrift. Additionally, currently a thrift-compiler is only
> > needed for parquet-format, not parquet-mr, this will change. Gabor:
> > Dockerization may help.
> >
> > Julien: We could merge the two repos altogether as well. Gabor: This,
> > however would move the specification into the Java implementation,
> > which would be against the cross-language ideology, so let's keep the
> > separate repo for the format. Zoltan: Other language binding should
> > also consider directly using it instead of copying parquet.thrift into
> > their source code.
> >
> >
> >
> > Column encryption (Gidon)
> > -
> >
> > Under development:
> > - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> > - Anonymization and data masking (PARQUET-1376)
> >
> > Java PRs under review:
> > - https://github.com/apache/parquet-mr/pull/471
> > - https://github.com/apache/parquet-mr/pull/472
> >
> > C++ PR:
> > - https://github.com/apache/parquet-cpp/pull/475
> >
> >
> > We need more testing (both unit tests and interop tests between Java and
> C++).
> >
> >
> >
> > Parquet-format release (Zoltan)
> > ---
> >
> > Suggested scope (Zoltan):
> > - Column encryption
> > - Nanosec precision
> > - Anything else?
> >
> > Discussion:
> > - Nothing else to add.
> > - Wes welcomes the nano precision, will be needed in parquet-cpp as well.
> >
> >
> >
> > C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> > --
> >
> >
> 

Re: Parquet sync meeting minutes

2018-08-17 Thread Wes McKinney
hi Nandor,

A fine detail, and I may be wrong, but I don't think decisions can
technically be made on a call, because time zones do not always permit
everyone to join and not all collaborators are comfortable
having live discussions in English. See [1].

You can present the consensus of the participants in the call summary
and others in the community have an opportunity to provide feedback.
The "decision" is therefore one based on lazy consensus thereafter if
there are no objections or follow up discussion

- Wes

[1]: https://www.apache.org/foundation/how-it-works.html#management

On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
 wrote:
> Topics discussed and decisions (meeting held on 2018 August 15th, at
> 6pm CET / 9 am PST):
>
> - Aligning page row boundaries between different columns: Debated,
> please follow-up
> - Remove Java specific code from parquet-format: Accepted
> - Column encryption: Please review
> - Parquet-format release: Scope accepted
> - C++ mono-repo: Please vote
>
>
>
> Aligning page row boundaries between different columns (Gabor)
> --
>
> Background: In the existing specification of column indexes, page
> boundaries are not aligned between different columns with respect to row
> count.
>
> Gabor: implemented this logic, interested parties can review the code here:
> - https://github.com/apache/parquet-mr/pull/509
> - https://github.com/apache/parquet-mr/commits/column-indexes
>
> Main takeaway from implementation:
>
> - Index filtering logic as currently specified is overcomplicated.
> - May become a maintenance burden and result in a steep learning curve
> for onboarding new developers.
> - Cannot be made transparent; vectorized readers (Hive, Spark) have
> to implement a similar logic.
>
> Suggestion:
>
> - Align page row boundaries between different columns, i.e. the n-th
> page of every column should contain the same number of rows.
> - Filtering logic would be a lot simpler (see the sketch after the
> responses below).
> - Vectorized readers will get index-based filtering without any change
> required on their side.
>
> Response:
> - Ryan doesn't recommend it. Performance numbers?
> - Discuss offline or on dev mailing list
> - Timeline for reaching decision? Within a week. (Gabor already has a
> working implementation.)
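>
> A minimal sketch of the difference (hypothetical helper, not parquet-mr
> API): with aligned boundaries one page index works for every column, while
> unaligned boundaries force a per-column search over that column's list of
> first row indexes.
>
>     // Hypothetical illustration, not actual parquet-mr code.
>     final class PageLookup {
>         // Aligned case: every column shares the same page row boundaries.
>         static int pageForRowAligned(long row, long rowsPerPage) {
>             return (int) (row / rowsPerPage);
>         }
>         // Unaligned case: each column needs a binary search over its own
>         // array of first row indexes (one entry per page).
>         static int pageForRowUnaligned(long row, long[] firstRowIndex) {
>             int lo = 0, hi = firstRowIndex.length - 1;
>             while (lo < hi) {
>                 int mid = (lo + hi + 1) >>> 1;
>                 if (firstRowIndex[mid] <= row) lo = mid; else hi = mid - 1;
>             }
>             return lo;
>         }
>     }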
>
>
>
> Remove Java specific code from parquet-format (Nandor)
> --
>
> Background: Parquet-format contains a few Java classes. Earlier no
> changes were required in these, but this has changed in recent
> features, especially with the new column encryption feature, which
> would add substantial new code.
>
> Suggestion (Nandor): Instead of cluttering parquet-format further with
> java-specific code, move these classes to parquet-mr and deprecate
> them in parquet-format.
>
> What is the motivation behind the status quo? Julien: We may need a
> different Thrift version in the parquet-thrift binding than in the
> parquet files themselves. If we move these classes to parquet-mr, we
> should shade thrift. Additionally, currently a thrift-compiler is only
> needed for parquet-format, not parquet-mr, this will change. Gabor:
> Dockerization may help.
>
> Julien: We could merge the two repos altogether as well. Gabor: This,
> however would move the specification into the Java implementation,
> which would be against the cross-language ideology, so let's keep the
> separate repo for the format. Zoltan: Other language bindings should
> also consider directly using it instead of copying parquet.thrift into
> their source code.
>
>
>
> Column encryption (Gidon)
> -
>
> Under development:
> - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> - Anonymization and data masking (PARQUET-1376)
>
> Java PRs under review:
> - https://github.com/apache/parquet-mr/pull/471
> - https://github.com/apache/parquet-mr/pull/472
>
> C++ PR:
> - https://github.com/apache/parquet-cpp/pull/475
>
>
> We need more testing (both unit tests and interop tests between Java and C++).
>
>
>
> Parquet-format release (Zoltan)
> ---
>
> Suggested scope (Zoltan):
> - Column encryption
> - Nanosec precision
> - Anything else?
>
> Discussion:
> - Nothing else to add.
> - Wes welcomes the nano precision; it will be needed in parquet-cpp as well.
>
>
>
> C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> --
>
>
> Background: duplicated CI system and codebase, circular dependencies
> between libraries
>
> Suggestion (Wes): move parquet-cpp into arrow codebase. Details can be
> read here: 
> https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
>
>
> Resolution: No objections but no final decision either, vote on the
> parquet list: 
> https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E


Re: Parquet sync notes

2018-06-12 Thread Gidon Gershinsky
There are four PRs, each dependent on its predecessor. Please review in
this order:

1) #94 in parquet-format: Thrift additions (crypto structures)

2) #95 in parquet-format: encryption/decryption of footer, headers and
column metadata - via cipher interfaces

3) #471 in parquet-mr: crypto package: implementation of cipher interfaces,
AES-* algorithms, configuration and API of Parquet encryption

4) #472 in parquet-mr: utilization of crypto package in existing Parquet
classes for encryption/decryption of pages, and passing an encryptor object
to ParquetWriter/Reader.
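
For readers skimming the PRs, here is a minimal sketch of what such cipher
interfaces could look like (names and shapes are hypothetical, not the actual
API of the PRs above; AES-GCM via the JDK, output laid out as IV ||
ciphertext+tag):

    import java.nio.ByteBuffer;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    // Hypothetical interface: AAD can bind a ciphertext to its file position.
    interface BlockEncryptor {
        byte[] encrypt(byte[] plaintext, byte[] aad);
    }

    final class AesGcmEncryptor implements BlockEncryptor {
        private final SecretKeySpec key;
        private final SecureRandom random = new SecureRandom();

        AesGcmEncryptor(byte[] keyBytes) {
            this.key = new SecretKeySpec(keyBytes, "AES");
        }

        @Override
        public byte[] encrypt(byte[] plaintext, byte[] aad) {
            try {
                byte[] iv = new byte[12]; // fresh nonce per encrypted block
                random.nextBytes(iv);
                Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
                if (aad != null) {
                    cipher.updateAAD(aad);
                }
                byte[] ciphertext = cipher.doFinal(plaintext); // includes GCM tag
                return ByteBuffer.allocate(iv.length + ciphertext.length)
                        .put(iv).put(ciphertext).array();
            } catch (Exception e) {
                throw new RuntimeException("encryption failed", e);
            }
        }
    }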



On Tue, Jun 12, 2018 at 7:50 PM, Julien Le Dem 
wrote:

>  QingHui (Criteo): parquet-protobuf
> Lars (impala), Jim (Cloudera): Bloom filter benchmarks
> Ryan (Netflix):
> JunJie (Intel): Bloomfilter and dictionary comparison benchmarks
> Gidon (IBM): Encryption, feedback
> Xinli Shang (Uber): Encryption
>
> Bloomfilter and dictionary comparison benchmarks:
>
>- PARQUET-41
>- Feedback to find the number of distinct values for which bloom filter
>outperforms filter based search
>- Action: JunJie to share code and update benchmark
>
> Encryption:
>
>- Progress on multi-key design: need review
>- Need review on PR as well
>- Discussion on how to pass parameters down to Parquet to specify what
>to encrypt
>- Action: Gidon to share PR again and others to review.
>


Re: Parquet sync

2018-04-24 Thread Julien Le Dem
Notes:
attendees/agenda:
Ryan (Netflix):

   -  Spark update to parquet 1.10 pending

Nandor, Zoltan, Gabor, Anna (Cloudera):

   - Backport schema description language. New logical types => introduce
   parameters. Need to evolve schema parser.
   - Need review on column indexes PARQUET-1211. PR 456 :
   -
  - https://github.com/apache/parquet-mr/pull/456

Gidon:

   - Encryption

Benoit, xinhui: protobuf

   - Jackson shading
   - Parquet version in Spark
   - https://issues.apache.org/jira/browse/PARQUET-968

Julien (Wework)

Notes:
Parquet version in Spark:

   - PR in spark https://github.com/apache/spark/pull/21070
   -
  - would like in 2.4
  - Databricks would like TPCDS run.

Jackson shading:

   - Debug level => job crashes
   - Prettify schema.
   - Hadoop using Jackson 1.8
   - https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1281

Parquet-proto:

   - https://issues.apache.org/jira/browse/PARQUET-968
   - Tested on amazon presto fork.
   - https://github.com/apache/parquet-mr/pull/411 is ready to merge.

Encryption:

   - Need pluggable mechanism
   - Will open PRs on parquet-format/parquet-mr
   - For review: https://github.com/apache/parquet-format/pull/84

Schema language for new logical types:

   - The timestamp type has 2 parameters:
   - It is ok to have a breaking change in the parquet schema text
   representation.
   - Will be added as a follow up.
   - https://github.com/apache/parquet-mr/pull/463

Schema index implementation:

   - Please review: https://github.com/apache/parquet-mr/pull/456/files
   - Write path only for now
   - More PR are blocked by it.
   - Will work on read path soon.

Parquet 1.8.3: PARQUET-1277

   - PARQUET-1217 Incorrect handling of missing values in Statistics
   - PARQUET-1246 Ignore float/double statistics in case of NaN
   - Will be used for a spark patch release
   - No other ticket requested






On Tue, Apr 24, 2018 at 12:05 PM, Julien Le Dem 
wrote:

> Happening now:
> https://meet.google.com/esu-yiit-mun
>


Re: parquet sync happening now

2018-03-28 Thread Julien Le Dem
 Agenda/attendees:
Lars (Impala): how to store splitter elements for page indexes when there are
very long common prefixes.
Ryan (Netflix):

   - getting 1.10 out of parquet-mr
   - Status on statistics bug fixes changes

Zoltan and team (Cloudera):

   - Parquet-format release


   - New logical type representation

Marcel (unaffiliated):

   -  Iceberg

Benoit (Criteo):

   -  Parquet-proto

Deepak (Vertica)
Julien (WeWork):

   - Parquet-proto
   - Release



Notes

   - Releases 1.10
   -
  - Parquet-mr 1.10 => Ryan to start release process
  - Parquet-format => Zoltan to start release process


   - Statistics error fixes
   -
  - Null/NaN/+0/-0
  -
 - NaN is never < nor > any value, so if it is the 1st value it gets
 stuck in min/max (see the sketch after these notes)
 - +0 == -0 according to < but not according to compareTo
  - Page index Statistics
   -
  - long common prefix
  -
 - Action: Lars, Zoltan, Gabor: create proposal and follow up on
 email
  - Parquet-proto: Parquet-968
   -
  - https://github.com/apache/parquet-mr/pull/411
  - Action: Julien, Ryan to give feedback on PR
  - Support for proto-2 is not needed and will be dropped in next
  release.
  -
 -
 https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1259
 - Action: Benoit to comment on the Jira
  - New logical type representation
   - Iceberg: reach out to Ryan for details.
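
A quick JDK-only illustration of the two pitfalls above (no Parquet code
involved):

    public class StatsPitfalls {
        public static void main(String[] args) {
            // NaN is neither <, >, nor == any value, so a naive min update
            // never replaces it once it is the first value seen.
            double min = Double.NaN;
            if (1.0 < min) {
                min = 1.0; // never fires
            }
            System.out.println(min); // NaN: the stat is stuck

            // +0.0 and -0.0 are equal for == and <, but not for compareTo.
            System.out.println(0.0 == -0.0);               // true
            System.out.println(Double.compare(0.0, -0.0)); // 1, i.e. 0.0 > -0.0
        }
    }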




On Wed, Mar 28, 2018 at 9:00 AM, Julien Le Dem 
wrote:

> https://meet.google.com/xpc-gwie-sem
>


Re: Parquet sync starting now

2018-03-13 Thread Julien Le Dem
Notes:

Attendees:

   - Julien (WeWork): proto, release
   - Marcel: Iceberg
   - Zoltan, Gabor, Anna (Cloudera): bug null values.
   -
  - https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1222
  

  - https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1217
  

   - Lars, Zoltan Borok-nagy (Cloudera Impala): new way of merging changes
   after moving to gitbox.
   - Deepak (Vertica): encryption in c++
   - Benoit, Singhue (Criteo): protobuf. Merging
   -
  - https://github.com/apache/parquet-mr/pull/411
  - PARQUET-968
   - Chao (Uber): encryption, Native Rust implementation.
   - Gidon (IBM): encryption jira, status and next steps.



   - Protobuf:
   -
  - https://github.com/apache/parquet-mr/pull/411
  - In use for a few weeks.
  - Introduces a breaking change:
  -
 - Empty maps become null maps
  - Will add flag to avoid compatibility break
   - Rust:
   -
  - Been working for 1 year
  - 2 contributors.
  - Read implementation only for now.
  - Want to contribute to the parquet project.
  - Plan to have Parquet-rust using Arrow-rust
  - Personal project.
   - Encryption: https://issues.apache.org/jira/browse/PARQUET-1178
   -
  - Need review: https://github.com/apache/parquet-format/pull/84/files
  - Chao: Hive tables use the Parquet format. Different engines (e.g.
  Presto) use the data, so security should be implemented at the node level
  - Deepak: make sure there are no incompatibility issues.
  - Gidon: has been looking at the C++ implementation. Cross
  compatibility working.
  - Action:
  -
 - Provide feedback on PR and doc.
 - Gidon to share Java.
 - Deepak take a look and provide cpp point of view
  - Bugs:
   -
  - PARQUET-1222: Handling of NaN and +0/-0:
  -
 - 1: fix current behavior (ignore NaN and +0/-0 in stats)
 - 2: provide a better total ordering including NaN etc
  - PARQUET-1217: if null_count is populated but not min/max, old
  parquet-mr uses a default min/max of 0 for numbers.
  -
 - Need a fix in parquet-mr
 - Old readers will have problems:
 -
- Possibly provide a 1.8.3 release with the bug fix for projects
depending on an old version.
- For example Spark:

https://github.com/apache/spark/blob/34811e0b908449fd59bca476604612b1d200778d/pom.xml#L132
- Will reach out to the spark team to see if they can upgrade.
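
On PARQUET-1217, a sketch of the defensive check a reader can apply
(hypothetical types, not the parquet-mr API): only trust min/max when the
writer explicitly set them, since a Thrift default of 0 is otherwise
indistinguishable from a real bound.

    // Hypothetical illustration of defensive statistics handling.
    final class ColumnStats {
        final Long nullCount; // null means "field not set in the footer"
        final byte[] min;     // null means "field not set in the footer"
        final byte[] max;

        ColumnStats(Long nullCount, byte[] min, byte[] max) {
            this.nullCount = nullCount;
            this.min = min;
            this.max = max;
        }

        // A populated null_count alone says nothing about value bounds,
        // so min/max filtering must require both fields to be present.
        boolean minMaxUsable() {
            return min != null && max != null;
        }
    }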



On Tue, Mar 13, 2018 at 10:01 AM, Julien Le Dem 
wrote:

> https://meet.google.com/jpy-mump-ngc
>


Re: parquet sync

2018-02-14 Thread Julien Le Dem
Notes:
Attendees, Agenda:
Lars (Cloudera Impala): Zoltan's proposal to get to a more stable release or
a feature flag
Qinghui, Benoit, Miguel, Justin (Criteo): Pull request. Parquet-proto.
PARQUET-968
Gidon (IBM): encryption JIRA. On track
Ryan (Netflix): getting 1.10 out
Zoltan (Cloudera): column index fixes from Gabor, ideas on list
Anna (Cloudera): Compatibility issues.

Discussion:
Compatibility issues and flags:

   - Define standard flags for features that are supported or not:
   -
  - New Compression algorithms: Brotli, ZStandard, ...
  - New Encodings (since v1): Delta-int, …
   - Flags are standards across parquet implementations to limit usage of
   features to a set supported across all components
   - Define (a few) profiles with the sets of features supported for a
   given version (1.0, 2.0, 3.0)
   -
  - These are goals for any implementation to support.
   - To be discussed: optional features that can be ignored and don’t
   prevent reading the file (ex: bloom filters, page index)
   -  Zoltan: create jira and google doc with a design proposal
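
A design sketch of what such a profile gate could look like (illustrative
only; names are invented, not part of any Parquet API):

    import java.util.EnumSet;
    import java.util.Set;

    // Features that are optional or not yet universally supported.
    enum Feature { BROTLI, ZSTD, DELTA_BINARY_PACKED, BLOOM_FILTER, PAGE_INDEX }

    final class WriterProfile {
        // Baseline profile: only features every implementation can read.
        static final Set<Feature> V1 = EnumSet.noneOf(Feature.class);
        // A newer profile adds encodings/compressions agreed across components.
        static final Set<Feature> V2 =
                EnumSet.of(Feature.ZSTD, Feature.DELTA_BINARY_PACKED);

        static void require(Set<Feature> profile, Feature wanted) {
            if (!profile.contains(wanted)) {
                throw new IllegalArgumentException(
                        wanted + " is not allowed by the selected profile");
            }
        }
    }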

Parquet-proto:

   - Criteo to validate and give +1 :
   https://github.com/apache/parquet-mr/pull/411
   - New feature needed:
   -
 - support: empty list vs null list.
  - Will create a JIRA and submit a new PR

Column indexes: (By Gabor) PR: https://github.com/apache/parquet-mr/pull/456

   - Needs modification in parquet-format utils (not the thrift metadata)
   => new release
   - first version writing into parquet-mr
   - Action:
   -
  - Ryan to review
  - Ryan and Zoltan to follow up on making parquet-format release






On Wed, Feb 14, 2018 at 9:02 AM, Julien Le Dem 
wrote:

> starting now on google hangout:
> https://meet.google.com/nhj-cvpt-atx
>


Re: parquet sync

2018-01-30 Thread Julien Le Dem
notes:
Julien (Wework)
Gidon (IBM): secure analytics. JIRA + Draft
Ryan (Netflix): Parquet-787 needs review
Lars (Cloudera, Impala): Discuss Zoltan’s proposal. Feature sets
Jim (Cloudera, Impala): Bloom filters
Zoltan (Cloudera): Java 8 transition, breaking changes management
Gabor (Cloudera): column index implement in parquet-mr
Nandor (Cloudera)
Uwe (Blue Yonder)
Marcel

Agenda:

   -  Bloom filters: https://github.com/apache/parquet-cpp/pull/432
   -
  - Patch out for review for bloom filter in C++
  - Perf comp for Bloom filter and Dictionary?
  - Need guidance on bloom filter size and mechanism not to write too
  big a bloom filter.
  - Ryan to follow up
   - Proposition for secure analytics: PARQUET-1178
   -
  - Allow encryption while maintaining Parquet push down capabilities
  - Step 1: encryption with single key, allowing individual columns to
  be encrypted or not.
   - Java 8 transition:
   -
  - Will move Parquet to Java8
   - breaking changes management, feature set proposal from Zoltan
   -
  - Parquet
   - Parquet-787 needs review
   -
  - Works in production at Netflix
  - Please review and approve if appropriate
   - Next sync, Tuesday in 2 weeks.



On Tue, Jan 30, 2018 at 6:59 PM, Julien Le Dem 
wrote:

> happening now: meet.google.com/nhj-cvpt-atx
>


Re: Parquet Sync timing

2017-12-04 Thread Zoltan Ivanfi
Hi,

I would suggest voting for the weekdays and choosing the best one or
choosing the two best ones and alternating between them. We could repeat
this process every 3-6 months.

I created a poll for this purpose, please vote here:

https://goo.gl/forms/Pr8U1wsRmpEZhdHy1

Thanks,

Zoltan

On Fri, Dec 1, 2017 at 8:50 AM Gabor Szadovszky <
gabor.szadovs...@cloudera.com> wrote:

> Hi,
>
> Unfortunately, the regular timing of the Parquet Sync meeting (Wednesday,
> 6PM CET) is not good for me. I don’t want to mess up everyone’s calendar,
> though.
> What do you think about having every second meeting on Thursday?
>
> Thanks a lot,
> Gabor


Re: parquet sync starting in a few minutes

2017-11-22 Thread Julien Le Dem
 Notes from the meeting

Attendees:
Julien (WeWork): release
Hakan (Criteo): moving to parquet.
Marcel (unaffiliated)
Lars (Impala, Cloudera): new statistics min_value/max_value fields in
parquet_v2.
Gabor (Cloudera): min/max stats impl., parquet-mr.
Zoltan (Cloudera): Min/max
Anna (Cloudera): Min/Max
Uwe (BlueYonder)
Vuk Ercegovac (Cloudera)
Ryan (Netflix): getting reviews /429, parquet 2.0 reviews
Eric Owhadi (Trafodion): page level filtering. Min/max

Min_value/max_value implementation:
 https://issues.apache.org/jira/browse/PARQUET-1025


   - We should deprecate compareTo in Binary since it is at the physical
   type level when ordering is a logical type notion
  - We discussed a possible better implementation of compareTo that
  would take the LogicalType into account but agreed this would be a
  separate effort
   - Add a Comparator based on the logical type that is the preferred way
   of comparing 2 values (see the sketch after this list)
   - stats writer implementation:
  - The preferred implementation is for writers to implement the new
  min_value/max_value metadata field instead of old min/max
independently of
  the version.
   -
  - Optionally writers might decide to also populate min/max for
  compatibility with older tools but we should do this only if the need
  arises.
   - Action: provide feedback on the JIRA above (PARQUET-1025)
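
As an example of why the comparator must be logical-type-aware: UTF8 values
are stored as binary, and byte-wise comparison must treat bytes as unsigned,
while a naive signed-byte compareTo orders any non-ASCII lead byte before
ASCII. A minimal sketch:

    import java.util.Comparator;

    // Unsigned lexicographic comparison, the correct order for UTF8 binary.
    final class UnsignedBytesComparator implements Comparator<byte[]> {
        @Override
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                // Mask to compare as unsigned 0..255 instead of signed -128..127.
                int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return a.length - b.length;
        }
    }

For example, "é" (UTF-8 bytes 0xC3 0xA9) correctly sorts after "a" with the
unsigned comparator, whereas signed byte comparison would put it first because
0xC3 is negative as a Java byte.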

Ryan has two PRs for review:

   - Make sure the Hadoop api does not leak through the Parquet api.
   https://github.com/apache/parquet-mr/pull/429
   - Improved Read allocation API:
   https://github.com/apache/parquet-mr/pull/390

Action: give feedback on pull requests.

next meeting in 2 weeks. same time.




On Wed, Nov 22, 2017 at 8:57 AM, Julien Le Dem 
wrote:

> https://meet.google.com/udi-dvmo-sva
>


Re: parquet sync starting now

2017-10-11 Thread Julien Le Dem
Attendees/agenda:

Santlal

Deepak (Vertica): deprecation of older compression.

Lars (Cloudera, Impala): Column indexes

Marcel: Column indexes

Ryan (Netflix): release parquet-format 2.4.0. need help on java side.
parquet related table format (id based column projection)

Jim (Cloudera)

Zoltan (Cloudera)

Anna (Cloudera)

Julien:


New compression alg / Deprecation of older compression:

 - we can't remove algos that have been used (lzo, brotli). We can add a
recommendation on which algorithms to use.

 - added language to clarify support of algorithms plus dependency on
installing some.

 - LZ4 widely available

 - zstandard harder to install but better.

Column indexes:

 - action: make max always present

 - always have min and max values (max not optional)

 - add metadata to capture if min/max are ordered. enum.

 - clarify meaning of null page.

 - todo: update PR and merge soon.

parquet-format release: blocked on page index

parquet related table format discussion: will happen separately.


next meeting in 2 weeks.

On Wed, Oct 11, 2017 at 9:06 AM, Julien Le Dem 
wrote:

> https://meet.google.com/oto-xpdf-kug
>


Re: parquet sync

2017-09-28 Thread Julien Le Dem
Parquet Sync Sept 27 2017:
Attendance and agenda:
Lars (Cloudera Impala):
 - Parquet page index status
Zoltan (Cloudera impala):
 - vectorization
 - api annotation (Private/Public)
Ryan (Netflix):
 - logical types commit
 - Compression tests
Wes (TwoSigma):
 - Compression C++
Julien:
 - testing parquet files: JSON and Parquet.
Jim (Cloudera)

Notes:
Page Index status:
 - need feedback on PR: https://github.com/apache/parquet-format/pull/63
   Action: Julien, Marcel Review
Vectorization:
- https://issues.apache.org/jira/browse/PARQUET-131
  original discussion in parquet which stalled.
- https://issues.apache.org/jira/browse/HIVE-14815
   Hive vectorized parquet read.
   Use annotations to clarify the state of an api
   - Zoltan to open jira: annotations.
   - need to reopen vectorized reader discussion. Follow up on JIRA-131
Logical types:
 - action: need to review PR:
https://github.com/apache/parquet-format/pull/51
Compression tests:
 - Ryan: used parquet-cli with 4 largest/most expensive tables
   => some are big map of k/v pairs, others are features/structured
ran 5 times + average.
will send spreadsheet with results for brotli/zstandard/lz4
brotli/zstandard look like winners: need more extensive tests
 brotli level 5 seems to be a good tradeoff compression cost/size
 lz4 quickest compression time but largest output
 zstandard a bit faster and a bit smaller than brotli
 uses:
   - jbrotli: embedded native library in jar
   - zstd: zlibnative path. packaged in ubuntu
 - action: Ryan cleanup and send out report
 - Wes: C++
speed: gzip, snappy, lz4, zstd
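
A minimal sketch of the kind of measurement described above, using the JDK's
Deflater as a stand-in codec (brotli/zstd/lz4 would need their own libraries;
run several times and average, as Ryan did):

    import java.util.zip.Deflater;

    public class CompressionBench {
        // Compress the input at one level and return the compressed size.
        static long compressedSize(byte[] input, int level) {
            Deflater deflater = new Deflater(level);
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end();
            return total;
        }

        public static void main(String[] args) {
            byte[] data = new byte[1 << 20];
            new java.util.Random(42).nextBytes(data); // high-entropy worst case
            for (int level : new int[] {1, 5, 9}) {
                long t0 = System.nanoTime();
                long size = compressedSize(data, level);
                long ms = (System.nanoTime() - t0) / 1_000_000;
                System.out.printf("level %d: %d bytes, %d ms%n", level, size, ms);
            }
        }
    }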

parquet files for tests:
 - Impala has a repository of files for tests:
https://github.com/apache/incubator-impala/tree/master/testdata
 - old compat test repo: https://github.com/Parquet/parquet-compatibility
 - have a repository of files.
 - open a JIRA: Lars.

parquet-tools merge command:
  - merge command: puts row groups after one another.
  - need a jira to add a comment on how this works (concatenates existing
row groups without combining them into larger ones)

​


Re: Parquet sync starting now

2017-08-16 Thread Lars Volker
Here are the notes I took:

Pooja (CMU, Cloudera): Present her work on Parquet indices
Yaliang (Presto), Zoltan (Cloudera), Anna (Cloudera), Marcel, Deepak
(Vertica): Interested in Parquet index work
Ryan (Netflix): Parquet indices, compression
Junjie (Intel): Bloom filter proposal

Parquet Indices:

   - Pooja presented her work.
   - We discussed that valid_values should be kept, distinct_values should
   be removed from the proposal.
   - It'd be interesting to see figures for larger page sizes, parquet-mr
   uses 1MB
   - There was agreement that page indexes should eventually replace page
   statistics
   - We discussed the following next steps
  - Prepare a PR for parquet-format, continue the discussion there
  - Link the slides to the JIRA and mail them to dev@
  - Update the title of PARQUET-922 to better reflect the ongoing work
  (Lars did this already)
  - Add the performance evaluation to the design doc

Compression:

   - Ryan built a JAR that supports zstd, lz4, brotli and is happy to share
   it with anyone who'd like to run their own experiments

Bloom Filters:

   - Junjie prepared a sheet comparing the performance of bloom filters
   with dictionary compression. Folks hadn't had time to look at the results
   so we'll continue the discussion on dev@ and in the next Parquet sync.
   - We may also want to compare them to index based page skipping




On Wed, Aug 16, 2017 at 9:01 AM, Lars Volker  wrote:

> Join us here: https://meet.google.com/zyd-mwbm-zpe
>


Re: Parquet sync starting now

2017-08-14 Thread Wes McKinney
I have not taken a look at the performance of different compression
algorithms yet. Are there any example datasets that anyone would like
to see statistics for? Otherwise I will generate some high and low
entropy datasets with dictionary encoding disabled (so that the
compression is handled more by the byte compressors than by
dictionaries).



Re: Parquet sync starting now

2017-08-11 Thread Julien Le Dem
Sorry for the delay. See notes below.
I'm on vacation next week and Lars will send an invitation for the next sync
 August 16th.
Pooja will talk about her work on page indices.
Here are the notes from last sync:

Parquet Sync Aug 2 2017


Anna (Cloudera):

Deepak (Vertica): timestamp format

Jim (Cloudera): Bloom filters

Lars (Cloudera Impala): feedback on Brotli, Pooja’s file indexes

Marcel: index page proposal

Ryan (Netflix): Merge

Zoltan (Cloudera Budapest)

JunJie (Intel): Bloom Filter.

Julien: Bloom Filters


Bloom Filters:

 - to be efficient, needs about 1 byte per distinct value (see the sizing
sketch at the end of this section)

   - useful if there are many distinct values that are bigger than 1 byte
(example: UUIDs)

 - Benchmarking:

   - difficulty enabling dictionary filtering in Hive and spark sql:
https://issues.apache.org/jira/browse/PARQUET-1061

  - Ryan to follow up on how to configure it

 - hashing discussion:

   - We will use a block-based hashing algorithm.

   - false positive rate > 0.1%

   - Definition of hash function:

  - currently has only one (Murmur3).

  - TODO: define metadata using union to allow for other hash functions
in the future

  - TODO: clarify what variation of Murmur3 we are using.
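
The "1 byte per distinct value" rule of thumb falls out of the standard Bloom
filter sizing formula, bits per value = -ln(p) / (ln 2)^2 for a target false
positive rate p. A worked sketch:

    public class BloomSizing {
        // Optimal bits per distinct value for a target false positive rate p.
        static double bitsPerValue(double p) {
            return -Math.log(p) / (Math.log(2) * Math.log(2));
        }

        public static void main(String[] args) {
            System.out.println(bitsPerValue(0.01));  // ~9.6 bits
            System.out.println(bitsPerValue(0.02));  // ~8.1 bits, i.e. ~1 byte
            System.out.println(bitsPerValue(0.001)); // ~14.4 bits
        }
    }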


Index pages:

 - good IO savings by skipping pages.

 - if columns

 - added metadata for position of dictionary location.

 - Next time presentation of the result.


Timestamp Format:

 - Ryan to update the PR with conclusion


Feedback on Brotli:

 - why not LZ4 or ZStandard?

 - Wes to try ou to compare in C++

 - Ryan to compare in Java with his datasets.

 - For reference:

   - comparison graphs, including brotli vs. zstd:
https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/

   -
http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/compress/Lz4Codec.html


PGP keys size:

 - Use larger PGP key id to avoid collision:


Github integration:

 - Use new Apache - Github integration to allow admin rights on Github.

 - Start a thread

On Wed, Aug 2, 2017 at 4:28 PM, 俊杰陈  wrote:

> Hi Julien
> Do we have meeting minutes for sync up?  I can't hear clearly from the
> hangout due to a vpn issue from home.
>
> 2017-08-03 0:01 GMT+08:00 Julien Le Dem :
>
> > on hangout:
> > https://hangouts.google.com/hangouts/_/calendar/
> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> >
>
>
>
> --
> Thanks & Best Regards
>


Re: Parquet sync starting now

2017-08-04 Thread Jeff Knupp
Thanks! Good to know :)

-Jeff

On Fri, Aug 4, 2017 at 9:50 AM, Uwe L. Korn  wrote:

> Hello Jeff,
>
> they are open for anyone and everyone is appreciated! We use these syncs
> to exchange and discuss things about the Parquet project as well as the
> Parquet format. It is also a good point to start if you want to know
> what the current "hot topics" in Parquet are and how you could get
> involved.
>
> Uwe
>
> On Fri, Aug 4, 2017, at 03:48 PM, Jeff Knupp wrote:
> > Just out of curiosity, are these sync meetings restricted to committers
> > and
> > higher or can anyone listen in?
> >
> > Cheers,
> > Jeff Knupp
> >
> > On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈  wrote:
> >
> > > Hi Julien
> > > Do we have meeting minutes for sync up?  I can't hear clearly from the
> > > hangout due to a vpn issue from home.
> > >
> > > 2017-08-03 0:01 GMT+08:00 Julien Le Dem :
> > >
> > > > on hangout:
> > > > https://hangouts.google.com/hangouts/_/calendar/
> > > > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
>


Re: Parquet sync starting now

2017-08-04 Thread Uwe L. Korn
Hello Jeff,

they are open for anyone and everyone is appreciated! We use these syncs
to exchange and discuss things about the Parquet project as well as the
Parquet format. It is also a good point to start if you want to know
what the current "hot topics" in Parquet are and how you could get
involved.

Uwe

On Fri, Aug 4, 2017, at 03:48 PM, Jeff Knupp wrote:
> Just out of curiosity, are these sync meetings restricted to committers
> and
> higher or can anyone listen in?
> 
> Cheers,
> Jeff Knupp
> 
> On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈  wrote:
> 
> > Hi Julien
> > Do we have meeting minutes for sync up?  I can't hear clearly from the
> > hangout due to a vpn issue from home.
> >
> > 2017-08-03 0:01 GMT+08:00 Julien Le Dem :
> >
> > > on hangout:
> > > https://hangouts.google.com/hangouts/_/calendar/
> > > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >


Re: Parquet sync starting now

2017-08-04 Thread Jeff Knupp
Just out of curiosity, are these sync meetings restricted to committers and
higher or can anyone listen in?

Cheers,
Jeff Knupp

On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈  wrote:

> Hi Julien
> Do we have meeting minutes for sync up?  I can't hear clearly from the
> hangout due to a vpn issue from home.
>
> 2017-08-03 0:01 GMT+08:00 Julien Le Dem :
>
> > on hangout:
> > https://hangouts.google.com/hangouts/_/calendar/
> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> >
>
>
>
> --
> Thanks & Best Regards
>


Re: Parquet sync starting now

2017-08-02 Thread 俊杰陈
Hi Julien
Do we have meeting minutes for sync up?  I can't hear clearly from the
hangout due to a vpn issue from home.

2017-08-03 0:01 GMT+08:00 Julien Le Dem :

> on hangout:
> https://hangouts.google.com/hangouts/_/calendar/
> anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
>



-- 
Thanks & Best Regards


Re: Parquet sync starting now

2017-07-19 Thread Julien Le Dem
Notes:
Parquet Sync Jul 19 2017
Intros, Agenda:
Anna, Zoltan (Cloudera Budapest): Column Chunk deprecation (PARQUET-291), type 
dependent sort orderings
Cheng (Intel Shanghai): Parquet Bloom Filter
Jim (Cloudera): Bloom Filter
Lars (Cloudera Impala): 
Marcel: Column index design
Ryan (Netflix): Bloom Filters, Parquet-908 (Logical types), Arrow timestamp
Pooja (Cloudera):
Julien: parquet-mr release, logical types, bloom filter  

Bloom Filter: PARQUET-41
 - 
https://docs.google.com/document/d/1I2UWCQPd-_6uO8gqf4cDSgRJxspd-ykTTnHdTCSnKUM/edit
 

 - use case: get-by-id on a given highly unique column
 - Distinct value count: table property? or end user input?
 - discussion for picking the hash function and how to set the bits: 
- Jim referred to: 
   - http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf 

   - Block split Bloom filter 
https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202 

 - Where we should store the Bloom filter data: in between row groups.
   - offset and length in the column metadata
 - how do we know the number of distinct values? provided or figure out on the 
fly:
   - keep hashes in memory: 8 bytes per distinct hash
 - no Bloom Filter for dictionary encoded columns.
 - UUID columns are a good example.
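
A compact sketch of the block-split ("split block") idea from the
cache-efficient Bloom filter paper linked above: hash the value once (with
whatever hash function is chosen, e.g. Murmur3), pick one cache-line-sized
block, and set/check one bit in each word of that block. The salt constants
below are illustrative odd constants, not necessarily the ones a final spec
would fix.

    // Illustrative split-block Bloom filter; one 256-bit block per probe.
    final class SplitBlockBloomFilter {
        private static final int[] SALT = {
            0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
            0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31
        };
        private final int[] words; // 8 x 32-bit words per block

        SplitBlockBloomFilter(int numBlocks) {
            this.words = new int[numBlocks * 8];
        }

        void insert(long hash) {
            int block = blockIndex(hash);
            for (int i = 0; i < 8; i++) {
                words[block * 8 + i] |= mask(hash, i);
            }
        }

        boolean mightContain(long hash) {
            int block = blockIndex(hash);
            for (int i = 0; i < 8; i++) {
                if ((words[block * 8 + i] & mask(hash, i)) == 0) {
                    return false;
                }
            }
            return true;
        }

        // Map the top 32 bits of the hash onto [0, numBlocks).
        private int blockIndex(long hash) {
            return (int) (((hash >>> 32) * (words.length / 8)) >>> 32);
        }

        // One bit per word, selected by a per-word salt multiply.
        private static int mask(long hash, int i) {
            return 1 << (((int) hash * SALT[i]) >>> 27);
        }
    }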

Column Indices:
  - Pooja update:
  - writing column indices to parquet files:
 - update from design: all offsets written together.
 - Pooja: to update the design doc
 - < .1% write overhead 
  - Parquet index filter
  - TODO: IO layer to skip pages instead of reading them. 

Logical Types:
 - Consensus with new structure
 - Arrow includes the TZ in DateTime. Will use UTC for parquet ts
 - TODO Ryan: get back on the PR. get it ready for commit

parquet-cli: 
 - +1 already
 - Ryan to commit

Brotli compression:
 - TODO: feedback from Impala.

Parquet-mr:
 - Patch release.

parquet-thrift
 - need to upgrade to latest. thrift 0.7 is a pain to compile on recent macOS

type dependent sort:
 - signed comparison for int96?
   - min and max are wrong, with the exception of min == max
- interval type: Zoltan to open a jira

Next time:
 follow up on Column Chunk deprecation (PARQUET-291)

> On Jul 19, 2017, at 9:29 AM, Julien Le Dem  wrote:
> 
> https://plus.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.vtfomsfgpbvjqd8d3kb8hte3j8



Re: Parquet sync starting now

2017-07-19 Thread Wes McKinney
The video call is full

On Wed, Jul 19, 2017 at 12:29 PM, Julien Le Dem  wrote:
> https://plus.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.vtfomsfgpbvjqd8d3kb8hte3j8


Re: Parquet sync starting in 10 min

2017-06-07 Thread Julien Le Dem
Notes:
Attendees/agenda building:
Zoltan (Cloudera):
 - timestamp, min/max
Anna (cloudera)
Deepak (Vertica):
 - timestamp
 - c++/java: bloom filter.
Lars (Cloudera Impala)
 - page skipping indexes
 - open PRs
Pooja (Cloudera Impala):
 - page skipping indexes
Julien (Dremio):
 - page skipping indexes
 - timestamp


Agenda:
 - open PRs
  TODO (all): review:
   - https://github.com/apache/parquet-format/pull/54
   - https://github.com/apache/parquet-mr/pull/414
   - https://github.com/apache/parquet-mr/pull/411
   - https://github.com/apache/parquet-mr/pull/413
   - https://github.com/apache/parquet-mr/pull/410
  TODO:
follow up (Julien, Lars, Ryan): https://github.com/apache/parquet-format/pull/53
Ryan follow up https://github.com/apache/parquet-format/pull/51
Julien more tests: https://github.com/apache/parquet-format/pull/50
Ryan follow up: https://github.com/apache/parquet-format/pull/49
 - PR triage:
   - TODO: Lars to do a pass on parquet-format
   - TODO: Julien to do a pass on parquet-mr
 - timestamps:
   - When reading from Parquet to Arrow, if the timestamp is adjusted to UTC
we use the UTC timezone in Arrow; otherwise no timezone (timestamp without
timezone)
   - follow up on jira about timestamp with timezone: PARQUET-906
 - min/max: PARQUET-686
   - final conclusion: https://github.com/apache/parquet-format/pull/46
   - PARQUET-839 => duplicate of PARQUET-686
   - TODO close obsolete PRs:
  - https://github.com/apache/parquet-format/pull/42
  - https://github.com/apache/parquet-mr/pull/362
   - We need an implementation in parquet-mr for the metadata in
https://github.com/apache/parquet-format/pull/46
  - TODO: Zoltan to open a jira
  - impala has an implementation, we should test they are compatible
 - bloom filter
   - PARQUET-319: see linked PR and doc.
  - https://github.com/apache/parquet-format/pull/28
  - https://docs.google.com/document/d/1mIZ0W24Cr79QHJWN1sQ3dIUc4lAK5
AVqozwSwtpFhW8/edit#heading=h.hmt1hrab3fpc
  - TODO: review and give feedback
 - page skipping indexes
- plan is prototype a writer in impala then a reader.
- We’ll review the results to finalize the metadata in 5-6 weeks.
- dealing with statistics coming from parquet-cpp
  - new min/max_value fields will be the reference




-- 
Julien


Re: Parquet sync starting in 10 min

2017-06-07 Thread Wes McKinney
Sorry, I was unable to join the sync today. I'm interested to discuss
more my comments on

https://github.com/apache/parquet-format/pull/51#discussion_r119911623

I'll wait for the notes from the call and maybe we can continue the
discussion on GitHub

On Wed, Jun 7, 2017 at 12:53 PM, Julien Le Dem  wrote:
> 10am PT on google hangout:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> Reminder that this is open to all.
> Here is how it goes:
> - we do a "round table" of people present where they quickly introduce
> themselves and state the topics they wish discussed (if any. Being a "fly
> on the wall" is totally fine too)
> - based on that first round we summarize the agenda and go over the topics
> one by one. (can be just bringing attention of people to a PR that needs a
> review or asking if it makes sense to implement some new feature)
>  - In the end we send notes back to the list and follow ups happen on JIRA,
> github PRs and the dev list.
>  - if the time is inconvenient to you say so on the list and we can figure
> out something.
>
> --
> Julien


Re: parquet sync starting now

2017-05-24 Thread Julien Le Dem
 Notes

Ryan (Netflix):
 - Parquet bloom filters
Julien (Dremio):
 - timestamp logical type
 - timestamp unknown ordering
 - pig decimal
Deepak (Vertica):
  - timestamp
  - bloom filter

Bloom filters:
 - Intel came back with good numbers on their bloom filters Pull Request
 - TODO: define the spec to make sure it’s portable
 - we need to minimize the need for tuning:
   - 5% default false positive rate?
   - detect overfilling to increase size automatically
   - keep hashes in memory or rehash values to fix overfilling?
   - possibly HLL for cardinality estimation (but let’s not increase the
scope)
 - Ryan will help intel with their Pull Request
 - Deepak will look into a c++ prototype to confirm portability.

Timestamp logical type:
 - need to reconcile arrow and parquet
   - https://issues.apache.org/jira/browse/ARROW-637
   -
https://github.com/apache/arrow/blob/3d8b1906ba7b0a6c856e8f3aeb54621489080794/format/Schema.fbs#L117
   - https://github.com/apache/parquet-format/pull/51#discussion_r118303404
 - discrepancy:
   - in Arrow, the timezone in the type means "with timezone”. No timezone
means “without timezone”
   - In parquet we just have a boolean flag that means “with/without
timezone”
   - that means the types are incompatible for now.
 - should the timezone field be optional in arrow and have an explicit
“withTimeZone” boolean flag?
 - Julien to send email cross list to clarify.
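
In JDK terms, the semantic gap is the one between an Instant (a point on the
timeline, i.e. "adjusted to UTC") and a LocalDateTime (wall-clock fields with
no absolute meaning until a zone is supplied). A quick illustration:

    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneOffset;

    public class TimestampSemantics {
        public static void main(String[] args) {
            // "With timezone": an absolute point on the timeline.
            Instant instant = Instant.parse("2017-05-24T17:00:00Z");

            // "Without timezone": wall-clock fields only.
            LocalDateTime wallClock = LocalDateTime.of(2017, 5, 24, 17, 0);

            // The same wall clock maps to different instants per zone:
            System.out.println(wallClock.toInstant(ZoneOffset.UTC));
            System.out.println(wallClock.toInstant(ZoneOffset.ofHours(-8)));
            System.out.println(instant);
        }
    }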

Decimal in Pig:  https://github.com/apache/parquet-mr/pull/404

 - Ryan to comment regarding parquet-avro impl:

Indexing review to be done next time



On Wed, May 24, 2017 at 10:04 AM, Julien Le Dem  wrote:

> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien


Re: Parquet sync starting now

2017-05-10 Thread Julien Le Dem
Notes:

Attendees and agenda building:

Ryan (Netflix):
  - new logical types representation
  - index proposal
Deepak (Vertica):
  - logical types for timestamps
Lars (Impala):
  - dummy ordering to test unknown ordering
  - implement new ordering in parquet-mr
Marcel (Impala):
  - index proposal
Uwe (Blue Yonder):
  - parquet cpp 1.1
Wes (twosigma):
  - parquet-cpp 1.1
  - indexing proposal
Zoltan (Cloudera - fileformats):
Julien (Dremio):
 - parquet-mr
  - indexing proposal: indexes near the footer.
 - new logical types

Discussion:
 - logical types: PARQUET-906
https://github.com/apache/parquet-format/pull/51
   - action: Marcel and Lars to give feedback
   - action: give feedback by next week
 - testing unknown ordering:
https://github.com/apache/parquet-format/pull/53/files
   - discussed pros and cons of approaches. Lars will follow up on the
JIRA/PR
 - parquet-cpp 1.1 release:
   - will include:
 - support for reading structs to arrow: (simple reader of one level
structs)
 - support for windows
 - reading and writing of lists of lists: (handles empty lists)
 - move arrow dependency from 0.2 to 0.3
   - rc coming soon.
   - todo: make summary/release notes
 - index proposal: PARQUET-922
- action Julien: open jira to implement footer reading optimization in
parquet-mr
- The new index metadata is before the footer to not impact regular
scan read.
- We will make pages stop on row boundaries when the index is present
  - add row_count to page v1
- discussion: do we need compression?
  - to be addressed later. We should prototype something first
- Deepak: open Jira for limiting stats size in parquet-cpp







On Wed, May 10, 2017 at 10:02 AM, Julien Le Dem  wrote:

> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien


Re: Parquet sync minutes

2017-04-26 Thread Marcel Kornacker
On Wed, Apr 26, 2017 at 11:02 AM, Julien Le Dem  wrote:
>  Attendance/Agenda:
> Deepak (Vertica):
>  - indexing discussion
> Wes (twosigma):
>  - indexing discussion
>  - parquet-cpp 1.1
> Marcel (Cloudera Impala):
>  - Index proposal
>  - sort order clarification went in
> Julien (Dremio):
>  - indexing
>  - protos
> Lukas (parquet-proto):
>  - parquet-proto
>
> Notes:
>  - parquet-proto:
>- 3 changes on the way:
>  - issue with protos repeated fields that often are not read by other
> integrations
>  - add support for protos generic types (may break compatibility?)
>  - schema evolution using ids in proto fields.
>- Lukas to send JIRAs
>- would want to merge them soon and have a release
>
>  - Index proposal for improving point queries and range queries.
> https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
>- todo (Marcel): clarify mechanism to store OffsetIndex and ColumnIndex
> outside the footer (probably just before).
>- todo (Marcel): add other optional fields form statistics in
> ColumnIndex (min, max, null_count, distinct_count)

I made the requested edits.

>- todo (everyone): iterate on the feedback
>- impala prototype planned for June
>
> - Logical types pull request:
> https://github.com/apache/parquet-format/pull/51/files
>   - todo: give more feedback
>
>
>
>
> --
> Julien


Re: Parquet sync up in 10 min

2017-04-14 Thread Julien Le Dem
Thank you!

On Fri, Apr 14, 2017 at 4:19 PM, Ryan Blue 
wrote:

> Thanks for the reminder! I've updated the PARQUET-686 PR so it is ready for
> comments. Thanks, everyone!
>
> On Fri, Apr 14, 2017 at 3:25 PM, Julien Le Dem  wrote:
>
> > Reminder:
> > give feedback in:
> >  -  https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BF
> > xf8U_Do5K2wSO4/edit#
> >  - https://github.com/apache/parquet-format/pull/51
> > 
> >  - (once updated by Ryan) https://github.com/apache/
> parquet-format/pull/46
> >
> >
> >
> >
> > --
> > Julien
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Julien


Re: Parquet sync up in 10 min

2017-04-14 Thread Ryan Blue
Thanks for the reminder! I've updated the PARQUET-686 PR so it is ready for
comments. Thanks, everyone!

On Fri, Apr 14, 2017 at 3:25 PM, Julien Le Dem  wrote:

> Reminder:
> give feedback in:
>  -  https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BF
> xf8U_Do5K2wSO4/edit#
>  - https://github.com/apache/parquet-format/pull/51
> 
>  - (once updated by Ryan) https://github.com/apache/parquet-format/pull/46
>
>
>
>
> --
> Julien
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Parquet sync up in 10 min

2017-04-14 Thread Julien Le Dem
Reminder:
give feedback in:
 -  https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BF
xf8U_Do5K2wSO4/edit#
 - https://github.com/apache/parquet-format/pull/51

 - (once updated by Ryan) https://github.com/apache/parquet-format/pull/46




-- 
Julien


Re: Parquet sync up in 10 min

2017-04-12 Thread Julien Le Dem
Notes from the sync (Full room today!)

Zoltan (Cloudera, Parquet)
Cheng (Databricks, Parquet - Spark integration): Index discussion
Ryan (Netflix): Order changes, Logical type - Timestamp
Deepak (Vertica - Parquet): Timestamp, indexes
Greg (Cloudera): Timestamp
Lars (Cloudera, Impala): Min/Max #46, feedback on indices
Marcel (Cloudera, Impala): Min/Max #46, Index pages
QinHui (Criteo): Migration project from JSON to Parquet using Protobufs.
Problem related to this.
Srinath (Databricks): Indexing
Julien (Dremio): Min/Max, Index discussion

Min/max: https://github.com/apache/parquet-format/pull/46
 - Discussed Forward compatibility requirements to have ColumnOrder as the
gatekeeper to interpret min_value and max_value fields
 - having the signed field is redundant and unnecessary
 - Action: Ryan to update the PR for final review this week (everyone).

Index:
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
 - 2 types of lookup structures.
  - SortColumnIndex: index of values on sorted columns. (just boundary
values) (only for main sorting column)
 - (name should be changed as it applies even if the column is not
sorted)
  - OffsetIndex: locate data pages by row number.
SortColumnIndex is used to narrow down the pages to apply a filter on.
OffsetIndex is used to find the selected rows in the other columns (projected
but not filtered on); a sketch of this read path follows below.
- Lars and Marcel to make sure the doc is linked in the JIRA and the JIRA
referred to in the title.
- Action for everyone: Provide feedback before April 19.
- After that create a PR in parquet-format (labelled experimental spec
until a reference implementation is finalized).
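To make the intended read path concrete, here is a minimal sketch (hypothetical
Java shapes; the actual structures are still being drafted in the doc above):

  import java.util.ArrayList;
  import java.util.List;

  // Boundary values per page of the column being filtered on.
  class SortColumnIndex {
    List<Long> pageMin = new ArrayList<>();
    List<Long> pageMax = new ArrayList<>();
  }

  // Maps each page to its first row number so other columns can be aligned.
  class OffsetIndex {
    List<Long> firstRowIndex = new ArrayList<>();
    List<Long> pageOffset = new ArrayList<>();   // file offset of each page
  }

  class IndexReadPath {
    // Step 1: keep only the pages whose [min, max] may contain the value.
    static List<Integer> candidatePages(SortColumnIndex idx, long value) {
      List<Integer> pages = new ArrayList<>();
      for (int p = 0; p < idx.pageMin.size(); p++) {
        if (value >= idx.pageMin.get(p) && value <= idx.pageMax.get(p)) {
          pages.add(p);   // may match: false positives possible, no misses
        }
      }
      return pages;
    }

    // Step 2: turn a surviving page into a row range; the OffsetIndex of
    // every projected-but-not-filtered column is then probed with it.
    static long[] rowRange(OffsetIndex idx, int page, long totalRows) {
      long first = idx.firstRowIndex.get(page);
      long last = (page + 1 < idx.firstRowIndex.size())
          ? idx.firstRowIndex.get(page + 1) - 1
          : totalRows - 1;
      return new long[] {first, last};
    }
  }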

Timestamp: https://github.com/apache/parquet-format/pull/51

 - PR #51 replaces the current LogicalType enum with a better, forward-compatible
union-based definition (see the sketch below).
 - Action for everyone: provide feedback before April 19
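For intuition on the forward-compatibility point, a minimal Java sketch (field
and type names are hypothetical; the real definition in PR #51 is Thrift):

  // With a closed enum, an old reader that sees an unknown ordinal can only
  // fail or misread the column. With a union of per-type structs, each
  // logical type is its own member, so an old reader can treat an unknown
  // member as "no logical annotation" and fall back to the physical type.
  interface LogicalType {}                            // the union
  final class StringType implements LogicalType {}    // simple member
  final class TimestampType implements LogicalType {  // parameterized member
    final boolean adjustedToUTC;                      // hypothetical fields
    final String unit;                                // e.g. "MILLIS", "MICROS"
    TimestampType(boolean adjustedToUTC, String unit) {
      this.adjustedToUTC = adjustedToUTC;
      this.unit = unit;
    }
  }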

 Protobuf:
 - QinHui to propose JIRA/PR for saving field ids in schema for protobufs.
 - capture unknown fields for which we only know the ID






On Wed, Apr 12, 2017 at 9:57 AM, Julien Le Dem  wrote:

> Marcel and Lars' doc:
> https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#heading=h.ft5dh2chrcjb
>
> On Wed, Apr 12, 2017 at 9:51 AM, Julien Le Dem  wrote:
>
>> 10am PT today on google hangout:
>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>>
>> --
>> Julien
>>
>
>
>
> --
> Julien
>



-- 
Julien


Re: Parquet sync up in 10 min

2017-04-12 Thread Julien Le Dem
Marcel and Lars' doc:
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#heading=h.ft5dh2chrcjb

On Wed, Apr 12, 2017 at 9:51 AM, Julien Le Dem  wrote:

> 10am PT today on google hangout:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien


Re: Parquet Sync

2017-04-03 Thread Julien Le Dem
Thank you,
The next Parquet sync will be Wednesday 4/12 at 10am PT on Google Hangout
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up

On Mon, Apr 3, 2017 at 1:36 PM, Lars Volker  wrote:

> Works for me, too.
>
> On Apr 3, 2017 20:58, "Marcel Kornacker"  wrote:
>
> > Works for me as well.
> >
> > On Mon, Apr 3, 2017 at 11:41 AM, Wes McKinney 
> wrote:
> > > +1
> > >
> > > On Mon, Apr 3, 2017 at 2:31 PM, Ryan Blue 
> > wrote:
> > >> Works for me.
> > >>
> > >> On Mon, Apr 3, 2017 at 11:28 AM, Julien Le Dem 
> > wrote:
> > >>
> > >>> I'll be in Munich this week for Dataworks/Hadoop Summit.
> > >>> I propose to move the Parquet Sync scheduled on Wednesday to the week
> > >>> after.
> > >>> Cheers.
> > >>> --
> > >>> Julien
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Ryan Blue
> > >> Software Engineer
> > >> Netflix
> >
>



-- 
Julien


Re: Parquet Sync

2017-04-03 Thread Lars Volker
Works for me, too.

On Apr 3, 2017 20:58, "Marcel Kornacker"  wrote:

> Works for me as well.
>
> On Mon, Apr 3, 2017 at 11:41 AM, Wes McKinney  wrote:
> > +1
> >
> > On Mon, Apr 3, 2017 at 2:31 PM, Ryan Blue 
> wrote:
> >> Works for me.
> >>
> >> On Mon, Apr 3, 2017 at 11:28 AM, Julien Le Dem 
> wrote:
> >>
> >>> I'll be in Munich this week for Dataworks/Hadoop Summit.
> >>> I propose to move the Parquet Sync scheduled on Wednesday to the week
> >>> after.
> >>> Cheers.
> >>> --
> >>> Julien
> >>>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
>


Re: Parquet Sync

2017-04-03 Thread Marcel Kornacker
Works for me as well.

On Mon, Apr 3, 2017 at 11:41 AM, Wes McKinney  wrote:
> +1
>
> On Mon, Apr 3, 2017 at 2:31 PM, Ryan Blue  wrote:
>> Works for me.
>>
>> On Mon, Apr 3, 2017 at 11:28 AM, Julien Le Dem  wrote:
>>
>>> I'll be in Munich this week for Dataworks/Hadoop Summit.
>>> I propose to move the Parquet Sync scheduled on Wednesday to the week
>>> after.
>>> Cheers.
>>> --
>>> Julien
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix


Re: Parquet Sync

2017-04-03 Thread Wes McKinney
+1

On Mon, Apr 3, 2017 at 2:31 PM, Ryan Blue  wrote:
> Works for me.
>
> On Mon, Apr 3, 2017 at 11:28 AM, Julien Le Dem  wrote:
>
>> I'll be in Munich this week for Dataworks/Hadoop Summit.
>> I propose to move the Parquet Sync scheduled on Wednesday to the week
>> after.
>> Cheers.
>> --
>> Julien
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix


Re: Parquet Sync

2017-04-03 Thread Ryan Blue
Works for me.

On Mon, Apr 3, 2017 at 11:28 AM, Julien Le Dem  wrote:

> I'll be in Munich this week for Dataworks/Hadoop Summit.
> I propose to move the Parquet Sync scheduled on Wednesday to the week
> after.
> Cheers.
> --
> Julien
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Parquet sync starting now on hangout

2017-03-10 Thread Julien Le Dem
It requires extra conversion when using code expecting millis timestamps.
That's probably not a strong argument against it, except that we now have
data stored in that format.
Those types were added a while ago:
https://issues.apache.org/jira/browse/PARQUET-12

On Thu, Mar 9, 2017 at 6:15 PM, Marcel Kornacker  wrote:

> Timestamp_millis seems like a subset of Timestamp_micros, unless I'm
> missing something: both need 8 bytes of storage, and you can obviously
> pad the former by multiplying by 1000 to arrive at the latter.
> Postgres supports timestamp_micros with a range of 4713BC/294276AD,
> and while dropping to a millisecond resolution will give you a wider
> range of years, I cannot imagine anyone needing that.
>
> Is there a reason why an application that wants to store
> millisecond-resolution timestamps can't simply use timestamp_micros?
>
> On Wed, Mar 8, 2017 at 2:39 PM, Ryan Blue  wrote:
> > TIMESTAMP_MILLIS is a common format for applications that aren't SQL
> engines
> > and is intended as a way for those apps to mark timestamps. SQL engines
> > would ideally recognize those values and be able to read them.
> >
> > rb
> >
> > On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker 
> wrote:
> >>
> >> One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
> >> addition to TIMESTAMP_MICROS? From  SQL perspective, only the latter
> >> is needed.
> >>
> >> On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem 
> wrote:
> >> > 2. The other thing to look into is HyperLogLog for approximate
> distinct
> >> > value count. Similar concepts than Bloom filters
> >> >
> >> > On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue 
> >> > wrote:
> >> >
> >> >> To follow up on the bloom filter discussion: The discussion on
> >> >> PARQUET-41
> >> >>  has a lot of
> >> >> information
> >> >> and context for the bloom filter spreadsheet
> >> >> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> >> >> I mentioned in the sync-up. The main things we need to worry about
> are:
> >> >>
> >> >> 1. When are bloom filters worth using? Columns with low % unique will
> >> >> already be dictionary-encoded and dictionary filtering has no
> >> >> false-positives.
> >> >> 2. How should Parquet track the % unique for a column to size the
> bloom
> >> >> filter correctly? 2x overloading results in a 10x increase in
> >> >> false-positives, so this must avoid overloading.
> >> >> 3. How should Parquet set the target false-positive probability? This
> >> >> is
> >> >> related to the number of lookups in queries. 1% FPP with 5 lookups
> >> >> results
> >> >> in 4.9% FPP for a query.
> >> >>
> >> >> I think there was also some analysis of page level vs row-group level
> >> >> bloom
> >> >> filters and using geometrically decreasing FPP (scalable bloom
> >> >> filters).
> >> >>
> >> >> rb
> >> >>
> >> >> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem 
> >> >> wrote:
> >> >>
> >> >> > Notes:
> >> >> >
> >> >> > Attendees/Agenda:
> >> >> > Zoltan (Cloudera, file formats):
> >> >> >   - timestamp types
> >> >> > Ryan (Netflix):
> >> >> >   - timestamp types
> >> >> >   - fix for sorting metadata (min-max)
> >> >> > Deepak (Vertica, parquet-cpp):
> >> >> >   - timestamp
> >> >> > Emily (IBM Spark Technology center)
> >> >> > Greg (Cloudera):
> >> >> >  - timestamp
> >> >> > Lars (Cloudera impala):
> >> >> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> >> >> > Marcel (Cl Impala):
> >> >> >  - timestamp
> >> >> >  - sorting/min max
> >> >> >  - bloom filters
> >> >> > Julien (Dremio):
> >> >> >  - sorting/min max
> >> >> >  - timestamp.
> >> >> >
> >> >> > - Timestamp (2 types):
> >> >> >   - Floating Timestamp
> >> >> > - ambiguity to the TZ: year/month/day/microseconds is the data
> >> >> stored.
> >> >> > - timezone less
> >> >> > - same binary representation as current Timestamp. Different
> >> >> > logical
> >> >> > annotation.
> >> >> > - how to store metadata. Same binary format w/wo.
> >> >> > - action: Ryan to propose a PR on parquet-format
> >> >> >   - Timestamp with Timezone.
> >> >> > - stored in UTC
> >> >> > - client side conversion to UTC
> >> >> > - writer timezone should be stored in the metadata?
> >> >> >   - need to clarify if time can be adjusted.
> >> >> >   - Int96: to be deprecated
> >> >> > - int64 used instead with logical type.
> >> >> > - won’t fix int96 ordering. Instead use replacement type.
> >> >> > - Lars to update the JIRA (PARQUET-323)
> >> >> >   - new binary format : int64 storing actual date (year month day)
> +
> >> >> > microseconds since midnight.
> >> >> > - Marcel to open a JIRA.
> >> >> > - Sorting:
> >> >> >   - Ryan to update the the PR (
> >> >> > https://github.com/apache/parquet-format/pull/46)
> >> >> > - Bloom filter: (PARQUET-319, PARQUET-41)
> >> >> >   - take analysis from original PR:
> >> >> > - https://github.com/

Re: Parquet sync starting now on hangout

2017-03-09 Thread Marcel Kornacker
Timestamp_millis seems like a subset of Timestamp_micros, unless I'm
missing something: both need 8 bytes of storage, and you can obviously
pad the former by multiplying by 1000 to arrive at the latter.
Postgres supports timestamp_micros with a range of 4713BC/294276AD,
and while dropping to a millisecond resolution will give you a wider
range of years, I cannot imagine anyone needing that.

Is there a reason why an application that wants to store
millisecond-resolution timestamps can't simply use timestamp_micros?
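(For what it's worth, the embedding is lossless and the range argument is easy
to check; illustrative Java with an assumed example value:)

  class MicrosRange {
    public static void main(String[] args) {
      // Any millisecond timestamp embeds exactly into microseconds.
      long millisSinceEpoch = 1_489_017_600_000L;   // example value
      long microsSinceEpoch = Math.multiplyExact(millisSinceEpoch, 1000L);

      // 2^63 microseconds is roughly 292,000 years, so the wider range that
      // millisecond resolution would buy is unlikely to matter in practice.
      double years = Math.pow(2, 63) / (365.25 * 86_400 * 1_000_000.0);
      System.out.printf("micros=%d, range ~ +/-%.0f years%n",
          microsSinceEpoch, years);
    }
  }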

On Wed, Mar 8, 2017 at 2:39 PM, Ryan Blue  wrote:
> TIMESTAMP_MILLIS is a common format for applications that aren't SQL engines
> and is intended as a way for those apps to mark timestamps. SQL engines
> would ideally recognize those values and be able to read them.
>
> rb
>
> On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker  wrote:
>>
>> One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
>> addition to TIMESTAMP_MICROS? From  SQL perspective, only the latter
>> is needed.
>>
>> On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem  wrote:
>> > 2. The other thing to look into is HyperLogLog for approximate distinct
>> > value count. Similar concepts than Bloom filters
>> >
>> > On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue 
>> > wrote:
>> >
>> >> To follow up on the bloom filter discussion: The discussion on
>> >> PARQUET-41
>> >>  has a lot of
>> >> information
>> >> and context for the bloom filter spreadsheet
>> >> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
>> >> I mentioned in the sync-up. The main things we need to worry about are:
>> >>
>> >> 1. When are bloom filters worth using? Columns with low % unique will
>> >> already be dictionary-encoded and dictionary filtering has no
>> >> false-positives.
>> >> 2. How should Parquet track the % unique for a column to size the bloom
>> >> filter correctly? 2x overloading results in a 10x increase in
>> >> false-positives, so this must avoid overloading.
>> >> 3. How should Parquet set the target false-positive probability? This
>> >> is
>> >> related to the number of lookups in queries. 1% FPP with 5 lookups
>> >> results
>> >> in 4.9% FPP for a query.
>> >>
>> >> I think there was also some analysis of page level vs row-group level
>> >> bloom
>> >> filters and using geometrically decreasing FPP (scalable bloom
>> >> filters).
>> >>
>> >> rb
>> >>
>> >> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem 
>> >> wrote:
>> >>
>> >> > Notes:
>> >> >
>> >> > Attendees/Agenda:
>> >> > Zoltan (Cloudera, file formats):
>> >> >   - timestamp types
>> >> > Ryan (Netflix):
>> >> >   - timestamp types
>> >> >   - fix for sorting metadata (min-max)
>> >> > Deepak (Vertica, parquet-cpp):
>> >> >   - timestamp
>> >> > Emily (IBM Spark Technology center)
>> >> > Greg (Cloudera):
>> >> >  - timestamp
>> >> > Lars (Cloudera impala):
>> >> >  - min-max (https://github.com/apache/parquet-format/pull/46)
>> >> > Marcel (Cl Impala):
>> >> >  - timestamp
>> >> >  - sorting/min max
>> >> >  - bloom filters
>> >> > Julien (Dremio):
>> >> >  - sorting/min max
>> >> >  - timestamp.
>> >> >
>> >> > - Timestamp (2 types):
>> >> >   - Floating Timestamp
>> >> > - ambiguity to the TZ: year/month/day/microseconds is the data
>> >> stored.
>> >> > - timezone less
>> >> > - same binary representation as current Timestamp. Different
>> >> > logical
>> >> > annotation.
>> >> > - how to store metadata. Same binary format w/wo.
>> >> > - action: Ryan to propose a PR on parquet-format
>> >> >   - Timestamp with Timezone.
>> >> > - stored in UTC
>> >> > - client side conversion to UTC
>> >> > - writer timezone should be stored in the metadata?
>> >> >   - need to clarify if time can be adjusted.
>> >> >   - Int96: to be deprecated
>> >> > - int64 used instead with logical type.
>> >> > - won’t fix int96 ordering. Instead use replacement type.
>> >> > - Lars to update the JIRA (PARQUET-323)
>> >> >   - new binary format : int64 storing actual date (year month day) +
>> >> > microseconds since midnight.
>> >> > - Marcel to open a JIRA.
>> >> > - Sorting:
>> >> >   - Ryan to update the the PR (
>> >> > https://github.com/apache/parquet-format/pull/46)
>> >> > - Bloom filter: (PARQUET-319, PARQUET-41)
>> >> >   - take analysis from original PR:
>> >> > - https://github.com/apache/parquet-mr/pull/215
>> >> > - https://github.com/apache/parquet-format/pull/28
>> >> >   - need to define metadata.
>> >> > - C++ code reuse between parquet-cpp, impala, …
>> >> >   - impala team to discuss how they want to do that.
>> >> > - store page level stats in footer (PARQUET-907)
>> >> >   - several options:
>> >> > - Index Page: similar to an ISAM index. 1 per row group: if
>> >> > ordered
>> >> > just maxes and offsets
>> >> > - add optional field in footer metadata.
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Mar 8, 2017

Re: Parquet sync starting now on hangout

2017-03-08 Thread Ryan Blue
TIMESTAMP_MILLIS is a common format for applications that aren't SQL
engines and is intended as a way for those apps to mark timestamps. SQL
engines would ideally recognize those values and be able to read them.

rb

On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker  wrote:

> One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
> addition to TIMESTAMP_MICROS? From  SQL perspective, only the latter
> is needed.
>
> On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem  wrote:
> > 2. The other thing to look into is HyperLogLog for approximate distinct
> > value count. Similar concepts than Bloom filters
> >
> > On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue 
> wrote:
> >
> >> To follow up on the bloom filter discussion: The discussion on
> PARQUET-41
> >>  has a lot of
> >> information
> >> and context for the bloom filter spreadsheet
> >> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> >> I mentioned in the sync-up. The main things we need to worry about are:
> >>
> >> 1. When are bloom filters worth using? Columns with low % unique will
> >> already be dictionary-encoded and dictionary filtering has no
> >> false-positives.
> >> 2. How should Parquet track the % unique for a column to size the bloom
> >> filter correctly? 2x overloading results in a 10x increase in
> >> false-positives, so this must avoid overloading.
> >> 3. How should Parquet set the target false-positive probability? This is
> >> related to the number of lookups in queries. 1% FPP with 5 lookups
> results
> >> in 4.9% FPP for a query.
> >>
> >> I think there was also some analysis of page level vs row-group level
> bloom
> >> filters and using geometrically decreasing FPP (scalable bloom filters).
> >>
> >> rb
> >>
> >> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem 
> wrote:
> >>
> >> > Notes:
> >> >
> >> > Attendees/Agenda:
> >> > Zoltan (Cloudera, file formats):
> >> >   - timestamp types
> >> > Ryan (Netflix):
> >> >   - timestamp types
> >> >   - fix for sorting metadata (min-max)
> >> > Deepak (Vertica, parquet-cpp):
> >> >   - timestamp
> >> > Emily (IBM Spark Technology center)
> >> > Greg (Cloudera):
> >> >  - timestamp
> >> > Lars (Cloudera impala):
> >> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> >> > Marcel (Cl Impala):
> >> >  - timestamp
> >> >  - sorting/min max
> >> >  - bloom filters
> >> > Julien (Dremio):
> >> >  - sorting/min max
> >> >  - timestamp.
> >> >
> >> > - Timestamp (2 types):
> >> >   - Floating Timestamp
> >> > - ambiguity to the TZ: year/month/day/microseconds is the data
> >> stored.
> >> > - timezone less
> >> > - same binary representation as current Timestamp. Different
> logical
> >> > annotation.
> >> > - how to store metadata. Same binary format w/wo.
> >> > - action: Ryan to propose a PR on parquet-format
> >> >   - Timestamp with Timezone.
> >> > - stored in UTC
> >> > - client side conversion to UTC
> >> > - writer timezone should be stored in the metadata?
> >> >   - need to clarify if time can be adjusted.
> >> >   - Int96: to be deprecated
> >> > - int64 used instead with logical type.
> >> > - won’t fix int96 ordering. Instead use replacement type.
> >> > - Lars to update the JIRA (PARQUET-323)
> >> >   - new binary format : int64 storing actual date (year month day) +
> >> > microseconds since midnight.
> >> > - Marcel to open a JIRA.
> >> > - Sorting:
> >> >   - Ryan to update the the PR (
> >> > https://github.com/apache/parquet-format/pull/46)
> >> > - Bloom filter: (PARQUET-319, PARQUET-41)
> >> >   - take analysis from original PR:
> >> > - https://github.com/apache/parquet-mr/pull/215
> >> > - https://github.com/apache/parquet-format/pull/28
> >> >   - need to define metadata.
> >> > - C++ code reuse between parquet-cpp, impala, …
> >> >   - impala team to discuss how they want to do that.
> >> > - store page level stats in footer (PARQUET-907)
> >> >   - several options:
> >> > - Index Page: similar to an ISAM index. 1 per row group: if
> ordered
> >> > just maxes and offsets
> >> > - add optional field in footer metadata.
> >> >
> >> >
> >> >
> >> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem 
> >> wrote:
> >> >
> >> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >> > >
> >> > > --
> >> > > Julien
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Julien
> >> >
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >
> >
> >
> > --
> > Julien
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Parquet sync starting now on hangout

2017-03-08 Thread Marcel Kornacker
One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in
addition to TIMESTAMP_MICROS? From a SQL perspective, only the latter
is needed.

On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem  wrote:
> 2. The other thing to look into is HyperLogLog for approximate distinct
> value count. Similar concepts than Bloom filters
>
> On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue  wrote:
>
>> To follow up on the bloom filter discussion: The discussion on PARQUET-41
>>  has a lot of
>> information
>> and context for the bloom filter spreadsheet
>> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
>> I mentioned in the sync-up. The main things we need to worry about are:
>>
>> 1. When are bloom filters worth using? Columns with low % unique will
>> already be dictionary-encoded and dictionary filtering has no
>> false-positives.
>> 2. How should Parquet track the % unique for a column to size the bloom
>> filter correctly? 2x overloading results in a 10x increase in
>> false-positives, so this must avoid overloading.
>> 3. How should Parquet set the target false-positive probability? This is
>> related to the number of lookups in queries. 1% FPP with 5 lookups results
>> in 4.9% FPP for a query.
>>
>> I think there was also some analysis of page level vs row-group level bloom
>> filters and using geometrically decreasing FPP (scalable bloom filters).
>>
>> rb
>>
>> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem  wrote:
>>
>> > Notes:
>> >
>> > Attendees/Agenda:
>> > Zoltan (Cloudera, file formats):
>> >   - timestamp types
>> > Ryan (Netflix):
>> >   - timestamp types
>> >   - fix for sorting metadata (min-max)
>> > Deepak (Vertica, parquet-cpp):
>> >   - timestamp
>> > Emily (IBM Spark Technology center)
>> > Greg (Cloudera):
>> >  - timestamp
>> > Lars (Cloudera impala):
>> >  - min-max (https://github.com/apache/parquet-format/pull/46)
>> > Marcel (Cl Impala):
>> >  - timestamp
>> >  - sorting/min max
>> >  - bloom filters
>> > Julien (Dremio):
>> >  - sorting/min max
>> >  - timestamp.
>> >
>> > - Timestamp (2 types):
>> >   - Floating Timestamp
>> > - ambiguity to the TZ: year/month/day/microseconds is the data
>> stored.
>> > - timezone less
>> > - same binary representation as current Timestamp. Different logical
>> > annotation.
>> > - how to store metadata. Same binary format w/wo.
>> > - action: Ryan to propose a PR on parquet-format
>> >   - Timestamp with Timezone.
>> > - stored in UTC
>> > - client side conversion to UTC
>> > - writer timezone should be stored in the metadata?
>> >   - need to clarify if time can be adjusted.
>> >   - Int96: to be deprecated
>> > - int64 used instead with logical type.
>> > - won’t fix int96 ordering. Instead use replacement type.
>> > - Lars to update the JIRA (PARQUET-323)
>> >   - new binary format : int64 storing actual date (year month day) +
>> > microseconds since midnight.
>> > - Marcel to open a JIRA.
>> > - Sorting:
>> >   - Ryan to update the the PR (
>> > https://github.com/apache/parquet-format/pull/46)
>> > - Bloom filter: (PARQUET-319, PARQUET-41)
>> >   - take analysis from original PR:
>> > - https://github.com/apache/parquet-mr/pull/215
>> > - https://github.com/apache/parquet-format/pull/28
>> >   - need to define metadata.
>> > - C++ code reuse between parquet-cpp, impala, …
>> >   - impala team to discuss how they want to do that.
>> > - store page level stats in footer (PARQUET-907)
>> >   - several options:
>> > - Index Page: similar to an ISAM index. 1 per row group: if ordered
>> > just maxes and offsets
>> > - add optional field in footer metadata.
>> >
>> >
>> >
>> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem 
>> wrote:
>> >
>> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>> > >
>> > > --
>> > > Julien
>> > >
>> >
>> >
>> >
>> > --
>> > Julien
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Julien


Re: Parquet sync starting now on hangout

2017-03-08 Thread Julien Le Dem
2. The other thing to look into is HyperLogLog for approximate distinct
value count. Similar concept to Bloom filters.
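A rough sketch of the HyperLogLog idea (illustration only, not a vetted
implementation; no small-cardinality correction, and the alpha constant
assumes at least 128 registers):

  import java.util.Random;

  class Hll {
    static final int P = 10;                   // 2^10 = 1024 registers
    final byte[] reg = new byte[1 << P];

    void add(long hash) {
      int idx = (int) (hash >>> (64 - P));     // high P bits pick a register
      long rest = hash << P;                   // remaining bits
      byte rank = (byte) (Long.numberOfLeadingZeros(rest) + 1);
      if (rank > reg[idx]) reg[idx] = rank;    // keep the max rank seen
    }

    double estimate() {                        // harmonic mean of 2^-reg
      double sum = 0;
      for (byte r : reg) sum += Math.pow(2, -r);
      double m = reg.length;
      return 0.7213 / (1 + 1.079 / m) * m * m / sum;
    }

    public static void main(String[] args) {
      Hll h = new Hll();
      Random rnd = new Random(42);             // random longs stand in for hashes
      for (int i = 0; i < 100_000; i++) h.add(rnd.nextLong());
      System.out.printf("estimate ~ %.0f (true 100000)%n", h.estimate());
    }
  }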

On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue  wrote:

> To follow up on the bloom filter discussion: The discussion on PARQUET-41
>  has a lot of
> information
> and context for the bloom filter spreadsheet
> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
> I mentioned in the sync-up. The main things we need to worry about are:
>
> 1. When are bloom filters worth using? Columns with low % unique will
> already be dictionary-encoded and dictionary filtering has no
> false-positives.
> 2. How should Parquet track the % unique for a column to size the bloom
> filter correctly? 2x overloading results in a 10x increase in
> false-positives, so this must avoid overloading.
> 3. How should Parquet set the target false-positive probability? This is
> related to the number of lookups in queries. 1% FPP with 5 lookups results
> in 4.9% FPP for a query.
>
> I think there was also some analysis of page level vs row-group level bloom
> filters and using geometrically decreasing FPP (scalable bloom filters).
>
> rb
>
> On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem  wrote:
>
> > Notes:
> >
> > Attendees/Agenda:
> > Zoltan (Cloudera, file formats):
> >   - timestamp types
> > Ryan (Netflix):
> >   - timestamp types
> >   - fix for sorting metadata (min-max)
> > Deepak (Vertica, parquet-cpp):
> >   - timestamp
> > Emily (IBM Spark Technology center)
> > Greg (Cloudera):
> >  - timestamp
> > Lars (Cloudera impala):
> >  - min-max (https://github.com/apache/parquet-format/pull/46)
> > Marcel (Cl Impala):
> >  - timestamp
> >  - sorting/min max
> >  - bloom filters
> > Julien (Dremio):
> >  - sorting/min max
> >  - timestamp.
> >
> > - Timestamp (2 types):
> >   - Floating Timestamp
> > - ambiguity to the TZ: year/month/day/microseconds is the data
> stored.
> > - timezone less
> > - same binary representation as current Timestamp. Different logical
> > annotation.
> > - how to store metadata. Same binary format w/wo.
> > - action: Ryan to propose a PR on parquet-format
> >   - Timestamp with Timezone.
> > - stored in UTC
> > - client side conversion to UTC
> > - writer timezone should be stored in the metadata?
> >   - need to clarify if time can be adjusted.
> >   - Int96: to be deprecated
> > - int64 used instead with logical type.
> > - won’t fix int96 ordering. Instead use replacement type.
> > - Lars to update the JIRA (PARQUET-323)
> >   - new binary format : int64 storing actual date (year month day) +
> > microseconds since midnight.
> > - Marcel to open a JIRA.
> > - Sorting:
> >   - Ryan to update the the PR (
> > https://github.com/apache/parquet-format/pull/46)
> > - Bloom filter: (PARQUET-319, PARQUET-41)
> >   - take analysis from original PR:
> > - https://github.com/apache/parquet-mr/pull/215
> > - https://github.com/apache/parquet-format/pull/28
> >   - need to define metadata.
> > - C++ code reuse between parquet-cpp, impala, …
> >   - impala team to discuss how they want to do that.
> > - store page level stats in footer (PARQUET-907)
> >   - several options:
> > - Index Page: similar to an ISAM index. 1 per row group: if ordered
> > just maxes and offsets
> > - add optional field in footer metadata.
> >
> >
> >
> > On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem 
> wrote:
> >
> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> > >
> > > --
> > > Julien
> > >
> >
> >
> >
> > --
> > Julien
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Julien


Re: Parquet sync starting now on hangout

2017-03-08 Thread Ryan Blue
To follow up on the bloom filter discussion: The discussion on PARQUET-41
<https://issues.apache.org/jira/browse/PARQUET-41> has a lot of information
and context for the bloom filter spreadsheet
<https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit?usp=sharing>
I mentioned in the sync-up. The main things we need to worry about are:

1. When are bloom filters worth using? Columns with low % unique will
already be dictionary-encoded and dictionary filtering has no
false-positives.
2. How should Parquet track the % unique for a column to size the bloom
filter correctly? 2x overloading results in a 10x increase in
false-positives, so this must avoid overloading.
3. How should Parquet set the target false-positive probability? This is
related to the number of lookups in queries. 1% FPP with 5 lookups results
in 4.9% FPP for a query.

I think there was also some analysis of page level vs row-group level bloom
filters and using geometrically decreasing FPP (scalable bloom filters).
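For reference, the standard sizing math behind points 2 and 3, as a quick
sketch (nothing Parquet-specific, just the classic bloom filter formulas):

  class BloomMath {
    // Bits for n distinct values at target false-positive probability p:
    //   m = -n * ln(p) / (ln 2)^2, with k = (m / n) * ln 2 hash functions.
    static long bits(long n, double p) {
      return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Effective FPP when a query does q independent lookups:
    //   1 - (1 - p)^q, e.g. 1 - 0.99^5 ~= 4.9% as quoted above.
    static double queryFpp(double p, int q) {
      return 1.0 - Math.pow(1.0 - p, q);
    }

    public static void main(String[] args) {
      System.out.println(bits(1_000_000, 0.01));  // ~9.6M bits (~1.2 MB)
      System.out.println(queryFpp(0.01, 5));      // ~0.049
    }
  }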

rb

On Wed, Mar 8, 2017 at 11:51 AM, Julien Le Dem  wrote:

> Notes:
>
> Attendees/Agenda:
> Zoltan (Cloudera, file formats):
>   - timestamp types
> Ryan (Netflix):
>   - timestamp types
>   - fix for sorting metadata (min-max)
> Deepak (Vertica, parquet-cpp):
>   - timestamp
> Emily (IBM Spark Technology center)
> Greg (Cloudera):
>  - timestamp
> Lars (Cloudera impala):
>  - min-max (https://github.com/apache/parquet-format/pull/46)
> Marcel (Cl Impala):
>  - timestamp
>  - sorting/min max
>  - bloom filters
> Julien (Dremio):
>  - sorting/min max
>  - timestamp.
>
> - Timestamp (2 types):
>   - Floating Timestamp
> - ambiguity to the TZ: year/month/day/microseconds is the data stored.
> - timezone less
> - same binary representation as current Timestamp. Different logical
> annotation.
> - how to store metadata. Same binary format w/wo.
> - action: Ryan to propose a PR on parquet-format
>   - Timestamp with Timezone.
> - stored in UTC
> - client side conversion to UTC
> - writer timezone should be stored in the metadata?
>   - need to clarify if time can be adjusted.
>   - Int96: to be deprecated
> - int64 used instead with logical type.
> - won’t fix int96 ordering. Instead use replacement type.
> - Lars to update the JIRA (PARQUET-323)
>   - new binary format : int64 storing actual date (year month day) +
> microseconds since midnight.
> - Marcel to open a JIRA.
> - Sorting:
>   - Ryan to update the the PR (
> https://github.com/apache/parquet-format/pull/46)
> - Bloom filter: (PARQUET-319, PARQUET-41)
>   - take analysis from original PR:
> - https://github.com/apache/parquet-mr/pull/215
> - https://github.com/apache/parquet-format/pull/28
>   - need to define metadata.
> - C++ code reuse between parquet-cpp, impala, …
>   - impala team to discuss how they want to do that.
> - store page level stats in footer (PARQUET-907)
>   - several options:
> - Index Page: similar to an ISAM index. 1 per row group: if ordered
> just maxes and offsets
> - add optional field in footer metadata.
>
>
>
> On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem  wrote:
>
> > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >
> > --
> > Julien
> >
>
>
>
> --
> Julien
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Parquet sync starting now on hangout

2017-03-08 Thread Julien Le Dem
Notes:

Attendees/Agenda:
Zoltan (Cloudera, file formats):
  - timestamp types
Ryan (Netflix):
  - timestamp types
  - fix for sorting metadata (min-max)
Deepak (Vertica, parquet-cpp):
  - timestamp
Emily (IBM Spark Technology center)
Greg (Cloudera):
 - timestamp
Lars (Cloudera, Impala):
 - min-max (https://github.com/apache/parquet-format/pull/46)
Marcel (Cloudera, Impala):
 - timestamp
 - sorting/min max
 - bloom filters
Julien (Dremio):
 - sorting/min max
 - timestamp.

- Timestamp (2 types):
  - Floating Timestamp
- ambiguous with respect to the TZ: year/month/day/microseconds is the data stored.
- timezone less
- same binary representation as current Timestamp. Different logical
annotation.
- how to store metadata. Same binary format w/wo.
- action: Ryan to propose a PR on parquet-format
  - Timestamp with Timezone.
- stored in UTC
- client side conversion to UTC
- writer timezone should be stored in the metadata?
  - need to clarify if time can be adjusted.
  - Int96: to be deprecated (a decode sketch of the current layout follows
these notes)
- int64 used instead with logical type.
- won’t fix int96 ordering. Instead use replacement type.
- Lars to update the JIRA (PARQUET-323)
  - new binary format: int64 storing the actual date (year, month, day) +
microseconds since midnight.
- Marcel to open a JIRA.
- Sorting:
  - Ryan to update the PR (
https://github.com/apache/parquet-format/pull/46)
- Bloom filter: (PARQUET-319, PARQUET-41)
  - take analysis from original PR:
- https://github.com/apache/parquet-mr/pull/215
- https://github.com/apache/parquet-format/pull/28
  - need to define metadata.
- C++ code reuse between parquet-cpp, impala, …
  - impala team to discuss how they want to do that.
- store page level stats in footer (PARQUET-907)
  - several options:
- Index Page: similar to an ISAM index. 1 per row group: if ordered
just maxes and offsets
- add optional field in footer metadata.
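For reference, a sketch of decoding the INT96 layout being deprecated here
(assuming the layout from the earlier Int96 discussion: little-endian
nanoseconds of day in the first 8 bytes, Julian day number in the last 4;
the byte order is exactly the compatibility concern raised then):

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;

  class Int96Decode {
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;  // 1970-01-01
    static final long MICROS_PER_DAY = 86_400_000_000L;

    static long toMicrosSinceEpoch(byte[] int96) {       // 12 bytes in
      ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
      long nanosOfDay = buf.getLong();                   // first 8 bytes
      long julianDay = buf.getInt();                     // last 4 bytes
      return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY
          + nanosOfDay / 1_000;
    }
  }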



On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem  wrote:

> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien


Re: parquet sync starting now

2017-02-28 Thread Deepak Majeti
I am in favor of the two timestamp type solution as well.
We also have a choice between nanosecond and microsecond/millisecond
precision. Not all tools require nanosecond precision.
I propose the following.

- Add two logical types for nanosecond precision (TIMESTAMP, TIMESTAMP_TZ).
  The underlying physical type will be Fixed Length Byte Array (length 12).
First 4 bytes for the number of days and 8 bytes for nanoseconds elapsed on
that day.

- Add two more logical types for microsecond/millisecond precision
(TIMESTAMP_TZ_MICROS, TIMESTAMP_TZ_MILLIS). TIMESTAMP_MICROS and
TIMESTAMP_MILLIS types already exist.
  The underlying physical type will be INT64. Microseconds/milliseconds
from epoch.

TIMESTAMP_TZ types will be stored in UTC.
The question now is how to store the TIMESTAMP type that behaves like a
string. One way is to convert it to UTC as well and also store the writer
timezone in the schema. Since the Parquet library would then use the same
timezone when writing and when reading values back from UTC, the apparent
timestamp value stays constant.
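A sketch of the proposed 12-byte encoding (the endianness and the epoch of the
day count are not pinned down above; little-endian and days since 1970-01-01
are assumed here purely for illustration):

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;

  class TimestampTzEncode {
    // First 4 bytes: number of days; next 8 bytes: nanoseconds elapsed on
    // that day, matching the layout proposed above.
    static byte[] encode(int days, long nanosOfDay) {
      return ByteBuffer.allocate(12)
          .order(ByteOrder.LITTLE_ENDIAN)
          .putInt(days)
          .putLong(nanosOfDay)
          .array();
    }
  }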


On Mon, Feb 27, 2017 at 9:00 PM, Greg Rahn  wrote:

> I think the decision comes down to how many TIMESTAMP types Parquet (and
> systems that use it as a format) wants to support, and which use cases are
> being targeted.
>
> If the answer is two, then it makes sense to follow the ANSI standard and
> what Postgres et al. have done:
> - timestamp [ without time zone ] - never adjust for TZ, treated like a
> timestamp string
> - timestamp with time zone - normalize to UTC based on explicit TZ input
> or implicit TZ from env, results normalized to local/client TZ
>
> If the answer is three, then it makes sense to mimic Oracle’s timestamp
> types despite the differences from ANSI SQL naming/behavior:
> - timestamp [ without time zone ] - never adjust for TZ, treated like a
> timestamp string
> - timestamp with local time zone - normalize to UTC based on explicit TZ
> or implicit TZ from env, results normalized to local/client TZ
> - timestamp with time zone - explicitly store the TZ and never normalize
> or convert, results may contain >1 TZ
>
> It seems to me that the two timestamp type solution is the most popular,
> appears to solve most requirements, and one could explicitly store the TZ
> offset in another column and systems could provide a solution/function to
> convert and return results that contain >1 TZ.  Without normalization and
> storing the TZ explicitly, it forces the application to apply a conversion
> function for any comparison operations; otherwise it becomes logically
> impossible (or ambiguous) to apply date range filters on such a type.  This
> has an obvious performance impact.
>
> My vote would be for the two type solution, however, it is worth
> explicitly noting, that "timestamp with time zone" requires functionality
> beyond the Parquet file format to do the TZ adjustments whether that be
> done server-side on the result sets knowing client-side TZ settings, or in
> the client-side driver code.  Obviously the latter can result in many more
> places for bugs to creep in as every driver implementation needs to do the
> correct TZ adjustment.
>
>
> > On Feb 27, 2017, at 4:42 PM, Marcel Kornacker  wrote:
> >
> > Greg, thanks for this writeup.
> >
> > Going back to "timestamp with timezone" in Parquet: does anything
> > speak *against* following the SQL standard and storing UTC without an
> > attached timezone (and leaving it to the client to do the conversion
> > correctly for timestamp literals)?
> >
> > On Mon, Feb 27, 2017 at 4:03 PM, Greg Rahn  wrote:
> >> As pointed out, there are several different behaviors and type names
> from
> >> the SQL world:
> >>
> >> timestamp without time zone (aka timestamp)
> >>
> >> value not normalized - behaves same as storing the timestamp literal as
> a
> >> string or datetime with fractional seconds
> >> value normalized to UTC based on local environment and adjusted to local
> >> timezone on retrieval
> >>
> >> timestamp with time zone (aka timestamptz)
> >>
> >> value normalized to UTC based on specified time zone offset or value or
> >> falls back to client env when not specified,  adjusted to local
> timezone on
> >> retrieval
> >> value not normalized and time zone value is actually stored based on
> >> specified time zone offset or value or falls back to client env when not
> >> specified, values returned are not normalized to local time zone
> >>
> >> I'm not sure I would 100% agree with your table Zoltan, or the comment
> on
> >> Postgres not following the ANSI standard, as this is a key description
> from
> >> the ANSI standard:
> >>
> >>> A datetime value, of data type TIME WITHOUT TIME ZONE or TIMESTAMP
> WITHOUT
> >>> TIME ZONE, may represent a local time, whereas a datetime value of
> data type
> >>> TIME WITH TIME ZONE or TIMESTAMP WITH TIME ZONE represents UTC.
> >>
> >>
> >> Thus one challenge that exists is that there are two valid behaviors for
> >> TIMESTAMP [WITHOUT TIME ZONE] — conversion from/to local time, o

Re: parquet sync starting now

2017-02-27 Thread Greg Rahn
I think the decision comes down to how many TIMESTAMP types Parquet (and
systems that use it as a format) wants to support, and which use cases are
being targeted.

If the answer is two, then it makes sense to follow the ANSI standard and what 
Postgres et al. have done:
- timestamp [ without time zone ] - never adjust for TZ, treated like a 
timestamp string
- timestamp with time zone - normalize to UTC based on explicit TZ input or 
implicit TZ from env, results normalized to local/client TZ

If the answer is three, then it makes sense to mimic Oracle’s timestamp types 
despite the differences from ANSI SQL naming/behavior:
- timestamp [ without time zone ] - never adjust for TZ, treated like a 
timestamp string
- timestamp with local time zone - normalize to UTC based on explicit TZ or 
implicit TZ from env, results normalized to local/client TZ
- timestamp with time zone - explicitly store the TZ and never normalize or 
convert, results may contain >1 TZ

It seems to me that the two timestamp type solution is the most popular, 
appears to solve most requirements, and one could explicitly store the TZ 
offset in another column and systems could provide a solution/function to 
convert and return results that contain >1 TZ.  Without normalization and 
storing the TZ explicitly, it forces the application to apply a conversion
function for any comparison operations; otherwise it becomes logically
impossible (or ambiguous) to apply date range filters on such a type.  This has 
an obvious performance impact.

My vote would be for the two-type solution; however, it is worth explicitly
noting that "timestamp with time zone" requires functionality beyond the
Parquet file format to do the TZ adjustments whether that be done server-side 
on the result sets knowing client-side TZ settings, or in the client-side 
driver code.  Obviously the latter can result in many more places for bugs to 
creep in as every driver implementation needs to do the correct TZ adjustment.
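For illustration, the two-type split maps directly onto java.time (sketch
only; Parquet would store an int64, this just shows the client-side
semantics):

  import java.time.Instant;
  import java.time.LocalDateTime;
  import java.time.ZoneId;
  import java.time.ZonedDateTime;

  class TwoTimestampTypes {
    public static void main(String[] args) {
      // timestamp [without time zone]: a wall-clock value, never adjusted.
      LocalDateTime wallClock = LocalDateTime.parse("2009-05-12T12:00:00");

      // timestamp with time zone: normalized to UTC using the explicit or
      // implicit input zone, then rendered in the client's local zone.
      Instant normalized =
          ZonedDateTime.of(wallClock, ZoneId.of("US/Eastern")).toInstant();
      ZonedDateTime rendered = normalized.atZone(ZoneId.of("US/Pacific"));

      System.out.println(wallClock);   // unchanged by the session TZ
      System.out.println(rendered);    // shifts with the client zone
    }
  }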


> On Feb 27, 2017, at 4:42 PM, Marcel Kornacker  wrote:
> 
> Greg, thanks for this writeup.
> 
> Going back to "timestamp with timezone" in Parquet: does anything
> speak *against* following the SQL standard and storing UTC without an
> attached timezone (and leaving it to the client to do the conversion
> correctly for timestamp literals)?
> 
> On Mon, Feb 27, 2017 at 4:03 PM, Greg Rahn  wrote:
>> As pointed out, there are several different behaviors and type names from
>> the SQL world:
>> 
>> timestamp without time zone (aka timestamp)
>> 
>> value not normalized - behaves same as storing the timestamp literal as a
>> string or datetime with fractional seconds
>> value normalized to UTC based on local environment and adjusted to local
>> timezone on retrieval
>> 
>> timestamp with time zone (aka timestamptz)
>> 
>> value normalized to UTC based on specified time zone offset or value or
>> falls back to client env when not specified,  adjusted to local timezone on
>> retrieval
>> value not normalized and time zone value is actually stored based on
>> specified time zone offset or value or falls back to client env when not
>> specified, values returned are not normalized to local time zone
>> 
>> I'm not sure I would 100% agree with your table Zoltan, or the comment on
>> Postgres not following the ANSI standard, as this is a key description from
>> the ANSI standard:
>> 
>>> A datetime value, of data type TIME WITHOUT TIME ZONE or TIMESTAMP WITHOUT
>>> TIME ZONE, may represent a local time, whereas a datetime value of data type
>>> TIME WITH TIME ZONE or TIMESTAMP WITH TIME ZONE represents UTC.
>> 
>> 
>> Thus one challenge that exists is that there are two valid behaviors for
>> TIMESTAMP [WITHOUT TIME ZONE] — conversion from/to local time, or no
>> adjustment.  The other challenge here is that what some RDBMS systems call
>> the type corresponds to an ANSI type name, however, the behavior differs.
>> This is exactly the case for ANSI's TIMESTAMP WITH TIME ZONE and Oracle's
>> TIMESTAMP WITH TIME ZONE as Oracle's TIMESTAMP WITH LOCAL TIME ZONE would
>> map to ANSI TIMESTAMP WITH TIME ZONE because neither store the TZ info, but
>> both adjust for it.  Oracle's TIMESTAMP WITH TIME ZONE does store the TZ
>> explicitly but does not adjust.  Oracle notes this difference in naming
>> between ANSI & Oracle using a disclaimer [1]:
>> 
>>> This chapter describes Oracle datetime and interval datatypes. It does not
>>> attempt to describe ANSI datatypes or other kinds of datatypes except when
>>> noted.
>> 
>> 
>> Also I will note some wording clarity -- when the ANSI SQL standard uses the
>> phrase "time zone displacement" this means the data type includes a value
>> for the time zone offset (displacement from UTC), be it explicitly or
>> implicitly.  It does not mean that "time zone displacement" is actually
>> stored on disk as this contradicts this ANSI statement:
>> 
>>> whereas a datetime value of data type TIMESTAMP WITH TIME ZONE represents
>

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
Greg, thanks for this writeup.

Going back to "timestamp with timezone" in Parquet: does anything
speak *against* following the SQL standard and storing UTC without an
attached timezone (and leaving it to the client to do the conversion
correctly for timestamp literals)?

On Mon, Feb 27, 2017 at 4:03 PM, Greg Rahn  wrote:
> As pointed out, there are several different behaviors and type names from
> the SQL world:
>
> timestamp without time zone (aka timestamp)
>
> value not normalized - behaves same as storing the timestamp literal as a
> string or datetime with fractional seconds
> value normalized to UTC based on local environment and adjusted to local
> timezone on retrieval
>
> timestamp with time zone (aka timestamptz)
>
> value normalized to UTC based on specified time zone offset or value or
> falls back to client env when not specified,  adjusted to local timezone on
> retrieval
> value not normalized and time zone value is actually stored based on
> specified time zone offset or value or falls back to client env when not
> specified, values returned are not normalized to local time zone
>
> I'm not sure I would 100% agree with your table Zoltan, or the comment on
> Postgres not following the ANSI standard, as this is a key description from
> the ANSI standard:
>
>> A datetime value, of data type TIME WITHOUT TIME ZONE or TIMESTAMP WITHOUT
>> TIME ZONE, may represent a local time, whereas a datetime value of data type
>> TIME WITH TIME ZONE or TIMESTAMP WITH TIME ZONE represents UTC.
>
>
> Thus one challenge that exists is that there are two valid behaviors for
> TIMESTAMP [WITHOUT TIME ZONE] — conversion from/to local time, or no
> adjustment.  The other challenge here is that what some RDBMS systems call
> the type corresponds to an ANSI type name, however, the behavior differs.
> This is exactly the case for ANSI's TIMESTAMP WITH TIME ZONE and Oracle's
> TIMESTAMP WITH TIME ZONE as Oracle's TIMESTAMP WITH LOCAL TIME ZONE would
> map to ANSI TIMESTAMP WITH TIME ZONE because neither store the TZ info, but
> both adjust for it.  Oracle's TIMESTAMP WITH TIME ZONE does store the TZ
> explicitly but does not adjust.  Oracle notes this difference in naming
> between ANSI & Oracle using a disclaimer [1]:
>
>> This chapter describes Oracle datetime and interval datatypes. It does not
>> attempt to describe ANSI datatypes or other kinds of datatypes except when
>> noted.
>
>
> Also I will note some wording clarity -- when the ANSI SQL standard uses the
> phrase "time zone displacement" this means the data type includes a value
> for the time zone offset (displacement from UTC), be it explicitly or
> implicitly.  It does not mean that "time zone displacement" is actually
> stored on disk as this contradicts this ANSI statement:
>
>> whereas a datetime value of data type TIMESTAMP WITH TIME ZONE represents
>> UTC.
>
>
> So IMO, all the Postgres-based systems (Vertica, Redshift, Greenplum, etc)
> implement the ANSI standard for TIMESTAMP WITH TIME ZONE — they normalize
> input values to UTC and return all values converted to local time zone.
> Greenplum explicitly cites their TIMESTAMP implementation is ANSI SQL:2008
> compliant for feature id F051-03 [2] (which is the ANSI TIMESTAMP type).
>
> Also see these behavior notes and documentation.
>
> Vertica -
> https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/SQLReferenceManual/DataTypes/Date-Time/TIMESTAMP.htm
> PostgreSQL -
> https://www.postgresql.org/docs/current/static/datatype-datetime.html#DATATYPE-TIMEZONES
> Redshift -
> http://docs.aws.amazon.com/redshift/latest/dg/r_Datetime_types.html#r_Datetime_types-timestamptz
>
>
> If I look at all of this and compare it to the Parquet Logical Types [3]
> doc, to me the thing that is missing is an unambiguous statement of client
> behavior for timezone conversion or not.  To my knowledge Apache Drill is
> the only system that seems to have implemented the Parquet TIMESTAMP_MILLIS
> type and it looks like they chose to implement it using the normalize to UTC
> behavior which is not my preference and not what most RDBMS systems do,
> including Postgres, Vertica, Redshift, Greenplum, Netezza, and Oracle do for
> TIMESTAMP [WITHOUT TIME ZONE].
>
> For example, changing the local system timezone setting does not change the
> results of the query for a TIMESTAMP [WITHOUT TIME ZONE] type:
>
> create table ts1 (t timestamp without time zone);
> insert into ts1 values (timestamp '2009-05-12 12:00:00');
> insert into ts1 values (timestamp '2009-05-13 12:00:00');
> insert into ts1 values (timestamp '2009-05-14 12:00:00');
>
> show timezone;
>name   |  setting
> --+
>  timezone | US/Eastern
>
> select * from ts1;
>   t
> -
>  2009-05-12 12:00:00
>  2009-05-13 12:00:00
>  2009-05-14 12:00:00
>
> set timezone US/Pacific;
> show timezone;
>
>name   |  setting
> --+
>  timezone | US/Pacific
>
> select * from ts1;
>   t
> --

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
On Mon, Feb 27, 2017 at 10:43 AM, Zoltan Ivanfi  wrote:
> What you describe (storing in UTC and adjusting to local time) is the
> implicit timezone that is associated with the plain TIMESTAMP type of ANSI
> SQL. Excerpts:

Postgres allows explicit timezone offsets in timestamp literals. When
these are present they override the implicit local timezone.

>
>   Datetime data types that contain time fields (TIME and TIMESTAMP) are
> maintained
>   in Universal Coordinated Time (UTC), with an explicit or implied time zone
> part.
>
>   A TIME or TIMESTAMP that does not specify WITH TIME ZONE has an im-
>   plicit time zone equal to the local time zone for the SQL-session.
>   The value of time represented in the data changes along with the
>   local time zone for the SQL-session. However, the meaning of the
>   time does not change because it is effectively maintained in UTC.
>
> Zoltan
>
> On Mon, Feb 27, 2017 at 7:34 PM Marcel Kornacker  wrote:
>>
>> On Mon, Feb 27, 2017 at 8:47 AM, Zoltan Ivanfi  wrote:
>> > Hi,
>> >
>> > Although the draft of SQL-92[1] does not explicitly state that the time
>> > zone
>> > offset has to be stored, the following excerpts strongly suggest that
>> > the
>> > time zone has to be stored with each individual value of TIMESTAMP WITH
>> > TIME
>> > ZONE:
>> >
>> >   The length of a TIMESTAMP is 19 positions [...]
>> >   The length of a TIMESTAMP WITH TIME ZONE is 25 positions [...]
>> >
>> > The draft of SQL 2003[2] standard is more specific:
>> >
>> >   TIMESTAMP and TIME may also be specified as being WITH TIME ZONE, in
>> > which
>> > case every value has associated with it a time zone displacement.
>>
>> It is not clear to me whether that means that the timezone is *stored*
>> with the value. In Postgres, a timestamp still has an associated
>> timezone on input and output (but it only stores the timezone
>> normalized to UTC).
>>
>> >
>> > However, the TIMESTAMP WITH TIME ZONE type in PostgreSQL does not
>> > conform to
>> > this definition, but behaves like the plain TIMESTAMP type of the SQL
>> > specification. Oracle calls this type TIMESTAMP WITH LOCAL TIME instead.
>> >
>> > It's easier to get an overview on this chaos if we list the different
>> > interpretations and data types:
>> >
>> > Timestamp stored with a time zone:
>> > - ANSI SQL type: TIMESTAMP WITH TIME ZONE
>> > - Oracle type: TIMESTAMP WITH TIME ZONE
>> > - PostgeSQL type: -
>> >
>> > Timestamp stored without a time zone that represents a time in UTC
>> > (automatically converted to/from local time):
>> > - ANSI SQL type: TIMESTAMP
>> > - Oracle type: TIMESTAMP WITH LOCAL TIME
>> > - PostgeSQL type: TIMESTAMP WITH TIME ZONE
>> > - Parquet type: TIMESTAMP_MILLIS
>> >
>> > Timestamp stored without a time zone that represents floating time (has
>> > the
>> > same apparent value regardless of time zone, does not refer to an
>> > absolute
>> > time instant):
>> > - ANSI SQL type: -
>> > - Oracle type: TIMESTAMP
>> > - PostgeSQL type: TIMESTAMP
>> > - Impala type: TIMESTAMP (stored as INT96 in Parquet)
>> >
>> > [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
>> > [2] http://www.wiscorp.com/sql_2003_standard.zip
>> >
>> > Zoltan
>> >
>> >
>> >
>> > On Thu, Feb 23, 2017 at 10:46 PM Marcel Kornacker 
>> > wrote:
>> >>
>> >> Regarding timestamp with timezone: I'm not sure whether the SQL
>> >> standard requires the timezone to be stored along with the timestamp
>> >> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
>> >> that topic).
>> >>
>> >> Cc'ing Greg Rahn to shed some more light on that.
>> >>
>> >> Regarding 'make Impala depend on parquet-cpp': could someone expand on
>> >> why we want to do this? There probably is overlap between
>> >> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
>> >> specific requirements (and are also different from each other), so
>> >> trying to unify this into parquet-cpp seems difficult.
>> >>
>> >> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem 
>> >> wrote:
>> >> >  Attendees/agenda:
>> >> > - Nandor, Zoltan (Cloudera/file formats)
>> >> > - Lars (Cloudera/Impala)" Statistics progress
>> >> > - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
>> >> > - Wes (twosigma): parquet cpp rc. 1.0 Release
>> >> > - Julien (Dremio): parquet metadata. Statistics.
>> >> > - Deepak (HP/Vertica): Parquet-cpp
>> >> > - Kazuaki:
>> >> > - Ryan was excused :)
>> >> >
>> >> > Note:
>> >> >  - Statistics: https://github.com/apache/parquet-format/pull/46
>> >> >- Impala is waiting for parquet-format to settle on the format to
>> >> > finalize their implementation.
>> >> >- Action: Julien to follow up with Ryan on the PR
>> >> >
>> >> >  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
>> >> > (needs Ryan's feedback)
>> >> >- format is nanosecond level timestamp from midnight (64 bits)
>> >> > followed
>> >> > by number of days (32 bits)
>> >> >- it sounds like int96 ordering is different from natural by

Re: parquet sync starting now

2017-02-27 Thread Zoltan Ivanfi
What you describe (storing in UTC and adjusting to local time) is the
implicit timezone that is associated with the plain TIMESTAMP type of ANSI
SQL. Excerpts:

  Datetime data types that contain time fields (TIME and TIMESTAMP) are
  maintained in Universal Coordinated Time (UTC), with an explicit or implied
  time zone part.

  A TIME or TIMESTAMP that does not specify WITH TIME ZONE has an im-
  plicit time zone equal to the local time zone for the SQL-session.
  The value of time represented in the data changes along with the
  local time zone for the SQL-session. However, the meaning of the
  time does not change because it is effectively maintained in UTC.

Zoltan

On Mon, Feb 27, 2017 at 7:34 PM Marcel Kornacker  wrote:

> On Mon, Feb 27, 2017 at 8:47 AM, Zoltan Ivanfi  wrote:
> > Hi,
> >
> > Although the draft of SQL-92[1] does not explicitly state that the time
> zone
> > offset has to be stored, the following excerpts strongly suggest that the
> > time zone has to be stored with each individual value of TIMESTAMP WITH
> TIME
> > ZONE:
> >
> >   The length of a TIMESTAMP is 19 positions [...]
> >   The length of a TIMESTAMP WITH TIME ZONE is 25 positions [...]
> >
> > The draft of SQL 2003[2] standard is more specific:
> >
> >   TIMESTAMP and TIME may also be specified as being WITH TIME ZONE, in
> which
> > case every value has associated with it a time zone displacement.
>
> It is not clear to me whether that means that the timezone is *stored*
> with the value. In Postgres, a timestamp still has an associated
> timezone on input and output (but it only stores the timezone
> normalized to UTC).
>
> >
> > However, the TIMESTAMP WITH TIME ZONE type in PostgreSQL does not
> conform to
> > this definition, but behaves like the plain TIMESTAMP type of the SQL
> > specification. Oracle calls this type TIMESTAMP WITH LOCAL TIME instead.
> >
> > It's easier to get an overview on this chaos if we list the different
> > interpretations and data types:
> >
> > Timestamp stored with a time zone:
> > - ANSI SQL type: TIMESTAMP WITH TIME ZONE
> > - Oracle type: TIMESTAMP WITH TIME ZONE
> > - PostgeSQL type: -
> >
> > Timestamp stored without a time zone that represents a time in UTC
> > (automatically converted to/from local time):
> > - ANSI SQL type: TIMESTAMP
> > - Oracle type: TIMESTAMP WITH LOCAL TIME
> > - PostgeSQL type: TIMESTAMP WITH TIME ZONE
> > - Parquet type: TIMESTAMP_MILLIS
> >
> > Timestamp stored without a time zone that represents floating time (has
> the
> > same apparent value regardless of time zone, does not refer to an
> absolute
> > time instant):
> > - ANSI SQL type: -
> > - Oracle type: TIMESTAMP
> > - PostgeSQL type: TIMESTAMP
> > - Impala type: TIMESTAMP (stored as INT96 in Parquet)
> >
> > [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
> > [2] http://www.wiscorp.com/sql_2003_standard.zip
> >
> > Zoltan
> >
> >
> >
> > On Thu, Feb 23, 2017 at 10:46 PM Marcel Kornacker 
> wrote:
> >>
> >> Regarding timestamp with timezone: I'm not sure whether the SQL
> >> standard requires the timezone to be stored along with the timestamp
> >> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
> >> that topic).
> >>
> >> Cc'ing Greg Rahn to shed some more light on that.
> >>
> >> Regarding 'make Impala depend on parquet-cpp': could someone expand on
> >> why we want to do this? There probably is overlap between
> >> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
> >> specific requirements (and are also different from each other), so
> >> trying to unify this into parquet-cpp seems difficult.
> >>
> >> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem 
> wrote:
> >> >  Attendees/agenda:
> >> > - Nandor, Zoltan (Cloudera/file formats)
> >> > - Lars (Cloudera/Impala)" Statistics progress
> >> > - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
> >> > - Wes (twosigma): parquet cpp rc. 1.0 Release
> >> > - Julien (Dremio): parquet metadata. Statistics.
> >> > - Deepak (HP/Vertica): Parquet-cpp
> >> > - Kazuaki:
> >> > - Ryan was excused :)
> >> >
> >> > Note:
> >> >  - Statistics: https://github.com/apache/parquet-format/pull/46
> >> >- Impala is waiting for parquet-format to settle on the format to
> >> > finalize their implementation.
> >> >- Action: Julien to follow up with Ryan on the PR
> >> >
> >> >  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
> >> > (needs Ryan's feedback)
> >> >- format is nanosecond level timestamp from midnight (64 bits)
> >> > followed by number of days (32 bits)
> >> >- it sounds like int96 ordering is different from natural byte array
> >> > ordering because days is last in the bytes
> >> >- discussion about swapping bytes:
> >> >   - format dependent on the boost library used
> >> >   - there could be performance concerns in Impala against changing it
> >> >   - there may be a separate project in impala to swap the bytes for
> >> > kudu compatibility.

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
On Mon, Feb 27, 2017 at 8:47 AM, Zoltan Ivanfi  wrote:
> Hi,
>
> Although the draft of SQL-92[1] does not explicitly state that the time zone
> offset has to be stored, the following excerpts strongly suggest that the
> time zone has to be stored with each individual value of TIMESTAMP WITH TIME
> ZONE:
>
>   The length of a TIMESTAMP is 19 positions [...]
>   The length of a TIMESTAMP WITH TIME ZONE is 25 positions [...]
>
> The draft of the SQL 2003[2] standard is more specific:
>
>   TIMESTAMP and TIME may also be specified as being WITH TIME ZONE, in which
> case every value has associated with it a time zone displacement.

It is not clear to me whether that means that the timezone is *stored*
with the value. In Postgres, a timestamp still has an associated
> timezone on input and output (but it only stores the timestamp
> normalized to UTC).

>
> However, the TIMESTAMP WITH TIME ZONE type in PostgreSQL does not conform to
> this definition, but behaves like the plain TIMESTAMP type of the SQL
> specification. Oracle calls this type TIMESTAMP WITH LOCAL TIME ZONE instead.
>
> It's easier to get an overview of this chaos if we list the different
> interpretations and data types:
>
> Timestamp stored with a time zone:
> - ANSI SQL type: TIMESTAMP WITH TIME ZONE
> - Oracle type: TIMESTAMP WITH TIME ZONE
> - PostgreSQL type: -
>
> Timestamp stored without a time zone that represents a time in UTC
> (automatically converted to/from local time):
> - ANSI SQL type: TIMESTAMP
> - Oracle type: TIMESTAMP WITH LOCAL TIME ZONE
> - PostgreSQL type: TIMESTAMP WITH TIME ZONE
> - Parquet type: TIMESTAMP_MILLIS
>
> Timestamp stored without a time zone that represents floating time (has the
> same apparent value regardless of time zone, does not refer to an absolute
> time instant):
> - ANSI SQL type: -
> - Oracle type: TIMESTAMP
> - PostgreSQL type: TIMESTAMP
> - Impala type: TIMESTAMP (stored as INT96 in Parquet)
>
> [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
> [2] http://www.wiscorp.com/sql_2003_standard.zip
>
> Zoltan
>
>
>
> On Thu, Feb 23, 2017 at 10:46 PM Marcel Kornacker  wrote:
>>
>> Regarding timestamp with timezone: I'm not sure whether the SQL
>> standard requires the timezone to be stored along with the timestamp
>> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
>> that topic).
>>
>> Cc'ing Greg Rahn to shed some more light on that.
>>
>> Regarding 'make Impala depend on parquet-cpp': could someone expand on
>> why we want to do this? There probably is overlap between
>> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
>> specific requirements (and are also different from each other), so
>> trying to unify this into parquet-cpp seems difficult.
>>
>> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem  wrote:
>> >  Attendees/agenda:
>> > - Nandor, Zoltan (Cloudera/file formats)
>> > - Lars (Cloudera/Impala): Statistics progress
>> > - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
>> > - Wes (twosigma): parquet cpp rc. 1.0 Release
>> > - Julien (Dremio): parquet metadata. Statistics.
>> > - Deepak (HP/Vertica): Parquet-cpp
>> > - Kazuaki:
>> > - Ryan was excused :)
>> >
>> > Note:
>> >  - Statistics: https://github.com/apache/parquet-format/pull/46
>> >- Impala is waiting for parquet-format to settle on the format to
>> > finalize their implementation.
>> >- Action: Julien to follow up with Ryan on the PR
>> >
>> >  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
>> > (needs Ryan's feedback)
>> >- format is nanosecond level timestamp from midnight (64 bits)
>> > followed by number of days (32 bits)
>> >- it sounds like int96 ordering is different from natural byte array
>> > ordering because days is last in the bytes
>> >- discussion about swapping bytes:
>> >   - format dependent on the boost library used
>> >   - there could be performance concerns in Impala against changing it
>> >   - there may be a separate project in impala to swap the bytes for
>> > kudu compatibility.
>> >- discussion about deprecating int96:
>> >  - need to be able to read them always
>> >  - no need to define ordering if we have a clear replacement
>> >  - need to clarify the requirements for an alternative
>> >  - int64 could be enough; it sounds like nanosecond granularity might
>> > not be needed.
>> >- Julien to create JIRAs:
>> >  - int96 ordering
>> >  - int96 deprecation, replacement.
>> >
>> > - extra timestamp logical type:
>> >  - floating timestamp: (no TZ stored; up to the reader to interpret
>> > the TS based on their TZ)
>> > - this would be better for following the SQL standard
>> > - Julien to create JIRA
>> >  - timestamp with timezone (per SQL):
>> > - each value has timezone
>> > - TZ can be different for each value
>> > - Julien to create JIRA
>> >
>> >  - parquet-cpp 1.0 release
>> >- Uwe to update release script in master.
>> >- Uwe to launch a new vote with new RC

Re: parquet sync starting now

2017-02-27 Thread Zoltan Ivanfi
Hi,

Although the draft of SQL-92[1] does not explicitly state that the time
zone offset has to be stored, the following excerpts strongly suggest that
the time zone has to be stored with each individual value of TIMESTAMP WITH
TIME ZONE:

  The length of a TIMESTAMP is 19 positions [...]
  The length of a TIMESTAMP WITH TIME ZONE is 25 positions [...]

The draft of the SQL 2003[2] standard is more specific:

  TIMESTAMP and TIME may also be specified as being WITH TIME ZONE, in
which case every value has associated with it a time zone displacement.

However, the TIMESTAMP WITH TIME ZONE type in PostgreSQL does not conform
to this definition, but behaves like the plain TIMESTAMP type of the SQL
specification. Oracle calls this type TIMESTAMP WITH LOCAL TIME ZONE instead.

It's easier to get an overview of this chaos if we list the different
interpretations and data types:

Timestamp stored with a time zone:
- ANSI SQL type: TIMESTAMP WITH TIME ZONE
- Oracle type: TIMESTAMP WITH TIME ZONE
- PostgreSQL type: -

Timestamp stored without a time zone that represents a time in UTC
(automatically converted to/from local time):
- ANSI SQL type: TIMESTAMP
- Oracle type: TIMESTAMP WITH LOCAL TIME ZONE
- PostgreSQL type: TIMESTAMP WITH TIME ZONE
- Parquet type: TIMESTAMP_MILLIS

Timestamp stored without a time zone that represents floating time (has the
same apparent value regardless of time zone, does not refer to an absolute
time instant):
- ANSI SQL type: -
- Oracle type: TIMESTAMP
- PostgreSQL type: TIMESTAMP
- Impala type: TIMESTAMP (stored as INT96 in Parquet)
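
To make the second and third interpretations concrete, here is a
minimal C++20 sketch (illustrative only: the std::chrono types are
standard, but mapping them to the SQL semantics above is my own
reading):

  #include <chrono>
  #include <cstdint>

  // An instant (e.g. Parquet TIMESTAMP_MILLIS): an absolute point in
  // time normalized to UTC; readers convert to local time for display.
  std::chrono::sys_time<std::chrono::milliseconds>
  instant_from_millis(int64_t ms) {
    return std::chrono::sys_time<std::chrono::milliseconds>(
        std::chrono::milliseconds(ms));
  }

  // A floating timestamp: the same wall-clock fields in every time
  // zone, not anchored to an instant; std::chrono::local_time models
  // exactly this.
  std::chrono::local_time<std::chrono::milliseconds>
  floating_from_millis(int64_t ms) {
    return std::chrono::local_time<std::chrono::milliseconds>(
        std::chrono::milliseconds(ms));
  }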

[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
[2] http://www.wiscorp.com/sql_2003_standard.zip

Zoltan


On Thu, Feb 23, 2017 at 10:46 PM Marcel Kornacker  wrote:

> Regarding timestamp with timezone: I'm not sure whether the SQL
> standard requires the timezone to be stored along with the timestamp
> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
> that topic).
>
> Cc'ing Greg Rahn to shed some more light on that.
>
> Regarding 'make Impala depend on parquet-cpp': could someone expand on
> why we want to do this? There probably is overlap between
> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
> specific requirements (and are also different from each other), so
> trying to unify this into parquet-cpp seems difficult.
>
> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem  wrote:
> >  Attendees/agenda:
> > - Nandor, Zoltan (Cloudera/file formats)
> > - Lars (Cloudera/Impala): Statistics progress
> > - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
> > - Wes (twosigma): parquet cpp rc. 1.0 Release
> > - Julien (Dremio): parquet metadata. Statistics.
> > - Deepak (HP/Vertica): Parquet-cpp
> > - Kazuaki:
> > - Ryan was excused :)
> >
> > Note:
> >  - Statistics: https://github.com/apache/parquet-format/pull/46
> >- Impala is waiting for parquet-format to settle on the format to
> > finalize their implementation.
> >- Action: Julien to follow up with Ryan on the PR
> >
> >  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
> > (needs Ryan's feedback)
> >- format is nanosecond level timestamp from midnight (64 bits)
> > followed by number of days (32 bits)
> >- it sounds like int96 ordering is different from natural byte array
> > ordering because days is last in the bytes
> >- discussion about swapping bytes:
> >   - format dependent on the boost library used
> >   - there could be performance concerns in Impala against changing it
> >   - there may be a separate project in impala to swap the bytes for
> > kudu compatibility.
> >- discussion about deprecating int96:
> >  - need to be able to read them always
> >  - no need to define ordering if we have a clear replacement
> >  - need to clarify the requirements for an alternative
> >  - int64 could be enough; it sounds like nanosecond granularity might
> > not be needed.
> >- Julien to create JIRAs:
> >  - int96 ordering
> >  - int96 deprecation, replacement.
> >
> > - extra timestamp logical type:
> >  - floating timestamp: (no TZ stored; up to the reader to interpret
> > the TS based on their TZ)
> > - this would be better for following the SQL standard
> > - Julien to create JIRA
> >  - timestamp with timezone (per SQL):
> > - each value has timezone
> > - TZ can be different for each value
> > - Julien to create JIRA
> >
> >  - parquet-cpp 1.0 release
> >- Uwe to update release script in master.
> >- Uwe to launch a new vote with new RC
> >
> >  - make impala depend on parquet-cpp
> >   - duplication between parquet/impala/kudu
> >   - need to measure level of overlap
> >   - Wes to open JIRA for this
> >   - also need an "apache commons for C++" for SQL type operations:
> >  -> could be in arrow
> >
> >   - metadata improvements.
> >- add page level metadata in footer
> >- page skipping.
> >- Julien to open JIRA.
> >
> >  - add version of the writer in the footer (more precise than current).

Re: parquet sync starting now

2017-02-23 Thread Marcel Kornacker
Yes, that sounds like a good idea.

On Thu, Feb 23, 2017 at 2:16 PM, Wes McKinney  wrote:
> I made some comments about sharing C++ code more generally amongst
> Impala, Kudu, Parquet, and Arrow.
>
> There's a significant amount of byte and bit processing code that
> should have little coupling to the Impala or Kudu runtime:
>
> - SIMD algorithms for hashing
> - RLE encoding
> - Dictionary encoding
> - Bit packing and unpacking (we actually had a contribution to
> parquet-cpp from Daniel Lemire on this)
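>
> As a simplified illustration of the run-length half of that family
> (Parquet's actual encoding is a hybrid RLE/bit-packing scheme; this
> sketch is deliberately not that format):
>
>   #include <cstdint>
>   #include <utility>
>   #include <vector>
>
>   // Collapse consecutive repeats into (value, run length) pairs.
>   std::vector<std::pair<int32_t, uint32_t>>
>   rle_encode(const std::vector<int32_t>& values) {
>     std::vector<std::pair<int32_t, uint32_t>> runs;
>     for (int32_t v : values) {
>       if (!runs.empty() && runs.back().first == v) {
>         ++runs.back().second;    // extend the current run
>       } else {
>         runs.push_back({v, 1});  // start a new run
>       }
>     }
>     return runs;
>   }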
>
> Since Impala's Parquet scanner is tightly coupled to its in-memory
> data structures, using the Parquet reading and writing classes in
> parquet-cpp would require more careful analysis. The sharing of
> generic algorithms and SIMD utilities seems less controversial to me.
>
> Since Arrow is more of a library to be linked into other projects
> (e.g. parquet-cpp links against libarrow and uses its headers), and
> Arrow needs to do all these things as well as Parquet, we're planning
> to migrate this code to the Arrow codebase. So it might make sense for
> Arrow to be the place to assemble generic vectorized processing code,
> then link libarrow.a into parquet-cpp, Impala, and Kudu. I can help
> with as much of the legwork as possible with this, and I think all of
> our projects would benefit from the unification of efforts and unit
> testing / benchmarking.
>
> Thanks
> Wes
>
> On Thu, Feb 23, 2017 at 4:46 PM, Marcel Kornacker  wrote:
>> Regarding timestamp with timezone: I'm not sure whether the SQL
>> standard requires the timezone to be stored along with the timestamp
>> for 'timestamp with timezone' (at least Oracle and Postgres diverge on
>> that topic).
>>
>> Cc'ing Greg Rahn to shed some more light on that.
>>
>> Regarding 'make Impala depend on parquet-cpp': could someone expand on
>> why we want to do this? There probably is overlap between
>> Impala/Kudu/parquet-cpp, but the runtime systems of the first two have
>> specific requirements (and are also different from each other), so
>> trying to unify this into parquet-cpp seems difficult.
>>
>> On Thu, Feb 23, 2017 at 11:22 AM, Julien Le Dem  wrote:
>>>  Attendees/agenda:
>>> - Nandor, Zoltan (Cloudera/file formats)
>>> - Lars (Cloudera/Impala): Statistics progress
>>> - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps
>>> - Wes (twosigma): parquet cpp rc. 1.0 Release
>>> - Julien (Dremio): parquet metadata. Statistics.
>>> - Deepak (HP/Vertica): Parquet-cpp
>>> - Kazuaki:
>>> - Ryan was excused :)
>>>
>>> Note:
>>>  - Statistics: https://github.com/apache/parquet-format/pull/46
>>>- Impala is waiting for parquet-format to settle on the format to
>>> finalize their implementation.
>>>- Action: Julien to follow up with Ryan on the PR
>>>
>>>  - Int96 timestamps: https://github.com/apache/parquet-format/pull/49
>>> (needs Ryan's feedback)
>>>- format is nanosecond level timestamp from midnight (64 bits) followed
>>> by number of days (32 bits)
>>>- it sounds like int96 ordering is different from natural byte array
>>> ordering because days is last in the bytes
>>>- discussion about swapping bytes:
>>>   - format dependent on the boost library used
>>>   - there could be performance concerns in Impala against changing it
>>>   - there may be a separate project in impala to swap the bytes for
>>> kudu compatibility.
>>>- discussion about deprecating int96:
>>>  - need to be able to read them always
>>>  - no need to define ordering if we have a clear replacement
>>>  - need to clarify the requirements for an alternative
>>>  - int64 could be enough; it sounds like nanosecond granularity might
>>> not be needed.
>>>- Julien to create JIRAs:
>>>  - int96 ordering
>>>  - int96 deprecation, replacement.
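>>>
>>> As a reference point, a hedged sketch of decoding that INT96 layout
>>> (assuming a little-endian writer with nanos-of-day in the first 8
>>> bytes and the Julian day in the last 4, per the note above; 2440588
>>> is the Julian day of the Unix epoch):
>>>
>>>   #include <cstdint>
>>>   #include <cstring>
>>>
>>>   int64_t Int96ToNanosSinceEpoch(const uint8_t bytes[12]) {
>>>     int64_t nanos_of_day;
>>>     int32_t julian_day;
>>>     std::memcpy(&nanos_of_day, bytes, 8);    // assumes little-endian host
>>>     std::memcpy(&julian_day, bytes + 8, 4);  // days last, which is why
>>>                                              // byte-array order mis-sorts
>>>     return (int64_t{julian_day} - 2440588) * 86400 * 1000000000LL +
>>>            nanos_of_day;
>>>   }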
>>>
>>> - extra timestamp logical type:
>>>  - floating timestamp: (no TZ stored; up to the reader to interpret
>>> the TS based on their TZ)
>>> - this would be better for following the SQL standard
>>> - Julien to create JIRA
>>>  - timestamp with timezone (per SQL):
>>> - each value has timezone
>>> - TZ can be different for each value
>>> - Julien to create JIRA
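>>>
>>> One way such a value could be modeled is an instant plus a per-value
>>> displacement; a minimal sketch (the names are hypothetical, not a
>>> proposed format):
>>>
>>>   #include <cstdint>
>>>
>>>   struct TimestampTz {
>>>     int64_t micros_utc;      // absolute instant, normalized to UTC
>>>     int16_t offset_minutes;  // time zone displacement of this value
>>>   };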
>>>
>>>  - parquet-cpp 1.0 release
>>>- Uwe to update release script in master.
>>>- Uwe to launch a new vote with new RC
>>>
>>>  - make impala depend on parquet-cpp
>>>   - duplication between parquet/impala/kudu
>>>   - need to measure level of overlap
>>>   - Wes to open JIRA for this
>>>   - also need an "apache commons for C++" for SQL type operations:
>>>  -> could be in arrow
>>>
>>>   - metadata improvements.
>>>- add page level metadata in footer
>>>- page skipping.
>>>- Julien to open JIRA.
>>>
>>>  - add version of the writer in the footer (more precise than current).
>>>- Zoltan to open Jira
>>>- possibly add bitfield for bug fixes.
>>>
>>> On Thu, Feb 23, 2017 at 10:01 AM, Julien Le Dem  wrote:
>>>
 https://hangouts.google.com/hangout
