Re: Parquet dictionary size limits?

2023-09-19 Thread Aaron Niskode-Dossett
> ols to inspect the file metadata and found that a dictionary had been written:
>
> % parquet-tools meta testdata-case1.parquet
>   file schema: testdata.TestRecord
>   stringField: REQUIRED BINARY L:STRING R:0 D:0
>   row group 1: RC:501 TS:18262874 OFFSET:4
>   stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:501 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>
> But in the second case (50k unique values), parquet-tools shows that no dictionary gets created, and the file size is *much* bigger:
>
> % parquet-tools meta testdata-case2.parquet
>   file schema: testdata.TestRecord
>   stringField: REQUIRED BINARY L:STRING R:0 D:0
>   row group 1: RC:501 TS:18262874 OFFSET:4
>   stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:501 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: , num_nulls: 0]
>
> (I created a gist of my test reproduction here <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>
> Based on this, I'm guessing there's some tip-over point after which Parquet will give up on writing a dictionary for a given column? After reading the Configuration docs <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>, I tried increasing the dictionary page size configuration 5x, with the same result (no dictionary created).
>
> So to summarize, my questions are:
>
> - What's the heuristic for Parquet dictionary writing to succeed for a given column?
> - Is that heuristic configurable at all?
> - For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?
>
> Thanks for your time!
> - Claire

-- Aaron Niskode-Dossett, Data Engineering -- Etsy
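
To make the "tip-over point" guess and the frequency-based idea in the quoted message concrete, here is a toy Python sketch. It is not parquet-mr's code. It assumes a writer that gives up on the dictionary once the distinct values exceed a byte budget, and every function name, the byte accounting, and the ("idx"/"raw") tagging scheme are hypothetical, made up for illustration only.

```python
from collections import Counter


def capped_dictionary_encode(values, max_dict_bytes):
    """Toy model: dictionary-encode until the dictionary outgrows its budget.

    Returns ("DICTIONARY", entries, indices) or ("PLAIN", values, None).
    `values` must be a list of strings.
    """
    entries = {}      # value -> dictionary index
    dict_bytes = 0    # assumed cost: UTF-8 length of each distinct value
    indices = []
    for value in values:
        if value not in entries:
            dict_bytes += len(value.encode("utf-8"))
            if dict_bytes > max_dict_bytes:
                # Budget exceeded: give up on the dictionary for this column.
                return ("PLAIN", list(values), None)
            entries[value] = len(entries)
        indices.append(entries[value])
    return ("DICTIONARY", list(entries), indices)


def frequency_dictionary_encode(values, max_entries):
    """Hypothetical variant: keep only the most frequent values in the dictionary."""
    top = [v for v, _ in Counter(values).most_common(max_entries)]
    lookup = {v: i for i, v in enumerate(top)}
    # Each item is either ("idx", dictionary index) or ("raw", literal value).
    encoded = [("idx", lookup[v]) if v in lookup else ("raw", v) for v in values]
    return top, encoded


def frequency_dictionary_decode(top, encoded):
    """Invert frequency_dictionary_encode."""
    return [top[payload] if kind == "idx" else payload for kind, payload in encoded]
```

In this model a low-cardinality column stays dictionary-encoded, while a column whose distinct values outgrow the budget falls back to plain values all at once. The frequency-based variant degrades more gradually, at the cost of a per-value tag that the real format would need some way to represent.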

Re: building parquet macbook m1 with thrift 0.15.0

2023-06-14 Thread Aaron Niskode-Dossett
> bison being out of date, etc etc.
>
> suggestions?

-- Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: join mailing list

2022-07-19 Thread Aaron Niskode-Dossett
at 6:06 PM Sol Lederman wrote:
> Hi,
>
> I'd like to join the parquet support mailing list.
>
> Thanks.
>
> Sol Lederman

-- Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: Look for protobuf reviewers for PR-900

2022-03-24 Thread Aaron Niskode-Dossett
> ; Protobuf to review the change. If you can help, please review. Thanks.
>
> --
> Xinli Shang

-- Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: Two blogs about Apache Parquet were just published on the Uber EngBlog site

2022-03-14 Thread Aaron Niskode-Dossett
> ld like to thank Gidon for his continuous collaborations with us!
>
> If you have any questions about the blog, feel free to reach out!
>
> Xinli Shang
> Tech Lead Manager at Uber Data Infra
> VP Apache Parquet PMC Chair

-- Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: About the security issue of log4j for Parquet

2021-12-13 Thread Aaron Niskode-Dossett
> et doesn't have a dependency on the log4j versions that are impacted. Have a good weekend!
>
> --
> Xinli Shang

-- Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: [VOTE] Release Apache Parquet 1.12.0 RC4

2021-03-23 Thread Aaron Niskode-Dossett
1.11.1

On Tue, Mar 23, 2021 at 10:47 AM Xinli shang wrote:
> Let's discuss it in today's community sync meeting.
>
> On Tue, Mar 23, 2021 at 8:37 AM Aaron Niskode-Dossett wrote:
> > Gabor and Ismaël, thank you both for the very clear explanations of what's

Re: [VOTE] Release Apache Parquet 1.12.0 RC1

2021-01-29 Thread Aaron Niskode-Dossett
> On 2021/1/29, 18:01, "Gidon Gershinsky" wrote:
>
> Regarding the technical reason behind this addition - we needed it to enable encryption in one of the writing paths.
>
> Cheers, Gidon
>
> On Thu, Jan 28, 202

Re: [VOTE] Release Apache Parquet 1.12.0 RC1

2021-01-28 Thread Aaron Niskode-Dossett
> 3e53276572c105b4ccac71293e988adc30
> * This corresponds to the tag: apache-parquet-1.12.0-rc1
> * https://github.com/apache/parquet-mr/tree/ad59c33e53276572c105b4ccac71293e988adc30
>
> The release tarba

Re: Create a parquet-protobuf JIRA component

2020-10-21 Thread Aaron Niskode-Dossett
Gabor -- is there an active parquet committer who works in the protobuf module? There are several open PRs (mostly from David, one from me, perhaps others) that would constitute nice improvements to that module.

Thanks,
Aaron

On Wed, Oct 21, 2020 at 7:39 AM Aaron Niskode-Dossett < aniskoded

Re: Create a parquet-protobuf JIRA component

2020-10-21 Thread Aaron Niskode-Dossett
> .
>
> Aaron and I have put up a few PRs re: protobuf integration
>
> Is anyone able for review and potential push?
>
> Thanks.
>
> On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett wro

Re: Create a parquet-protobuf JIRA component

2020-10-20 Thread Aaron Niskode-Dossett
> tial push?
>
> Thanks.
>
> On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett wrote:
> > Hello Parquet project members,
> >
> > Could a parquet-protobuf component be added to the project JIRA? There are a few open JIRA tickets that would be nice to

Re: protobuf3 and oneof fields

2020-09-29 Thread Aaron Niskode-Dossett
I played around with the code and found a simple, maybe too simple, solution and opened a PR. Fingers crossed.

On Tue, Sep 29, 2020 at 10:55 AM Aaron Niskode-Dossett <aniskodedoss...@etsy.com> wrote:
> Thank you, David, I agree with your conclusions. I opened PARQUET-1917.

Create a parquet-protobuf JIRA component

2020-09-29 Thread Aaron Niskode-Dossett
Hello Parquet project members,

Could a parquet-protobuf component be added to the project JIRA? There are a few open JIRA tickets that would be nice to categorize. If the component is created, I would be happy to categorize the tickets.

Thank you,
Aaron

-- Aaron Niskode-Dossett, Data Engineering

Re: protobuf3 and oneof fields

2020-09-29 Thread Aaron Niskode-Dossett
> ell.
>
> I would expect "bar_int" and "bar_int2" to be 'null' here.
>
> Have you filed a JIRA with this reproduction?
>
> Thanks.
>
> On Fri, Sep 25, 2020 at 9:58 AM Aaron Niskode-Dossett wrote:

protobuf3 and oneof fields

2020-09-25 Thread Aaron Niskode-Dossett
0 | 0 | hello world |
+---+---++--+

I would expect that bar_int and bar_int2 are EMPTY for all three rows since only bar_string is set in the oneof. Is this the right expectation for me to have?

Thank you!

-- Aaron Niskode-Dossett, Data Engineering -- Etsy
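
The expectation in this message can be spelled out as a small Python sketch. It is only an illustration of the desired output and not the parquet-protobuf writer: the helper name and the `populated` argument (standing in for whatever the protobuf runtime reports as the set member of the oneof) are hypothetical. Only the field names bar_int, bar_int2 and bar_string come from the thread.

```python
def null_unset_oneof_members(row, oneof_members, populated):
    """Return a copy of `row` where every oneof member except `populated` is None."""
    cleaned = dict(row)
    for name in oneof_members:
        if name != populated:
            # Unset oneof members become null instead of the type default (0, "").
            cleaned[name] = None
    return cleaned


# A row as currently observed: unset members show the default value 0.
observed = {"bar_int": 0, "bar_int2": 0, "bar_string": "hello world"}
expected = null_unset_oneof_members(
    observed, ["bar_int", "bar_int2", "bar_string"], populated="bar_string"
)
print(expected)
```

Under this reading, a reader of the file could tell "bar_int was explicitly set to 0" apart from "bar_int was never set", which the default-filled rows above cannot express.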