ols to inspect the file metadata and found that a dictionary had been written:
> >>> >> > >
> >>> >> > > % parquet-tools meta testdata-case1.parquet
> >>> >> > > > file schema: testdata.TestRecord
> >>> >> > > >
> >>> >> > > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> >>> >> > > > row group 1: RC:501 TS:18262874 OFFSET:4
> >>> >> > > >
> >>> >> > > > stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:501 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
> >>> >> > >
> >>> >> > >
> >>> >> > > But in the second case (50k unique values), parquet-tools shows that no
> >>> >> > > dictionary gets created, and the file size is *much* bigger:
> >>> >> > >
> >>> >> > > % parquet-tools meta testdata-case2.parquet
> >>> >> > > > file schema: testdata.TestRecord
> >>> >> > > >
> >>> >> > > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> >>> >> > > > row group 1: RC:501 TS:18262874 OFFSET:4
> >>> >> > > >
> >>> >> > > > stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:501 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: , num_nulls: 0]
> >>> >> > >
> >>> >> > >
> >>> >> > > (I created a gist of my test reproduction here
> >>> >> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
> >>> >> > >
> >>> >> > > Based on this, I'm guessing there's some tip-over point after which Parquet
> >>> >> > > will give up on writing a dictionary for a given column? After reading
> >>> >> > > the Configuration docs
> >>> >> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
> >>> >> > > I tried increasing the dictionary page size configuration 5x, with the same
> >>> >> > > result (no dictionary created).
> >>> >> > >
> >>> >> > > So to summarize, my questions are:
> >>> >> > >
> >>> >> > > - What's the heuristic for Parquet dictionary writing to succeed for a
> >>> >> > > given column?
> >>> >> > > - Is that heuristic configurable at all?
> >>> >> > > - For high-cardinality datasets, has the idea of a frequency-based
> >>> >> > > dictionary encoding been explored? Say, if the data follows a certain
> >>> >> > > statistical distribution, we can create a dictionary of the most frequent
> >>> >> > > values only?
> >>> >> > >
> >>> >> > > Thanks for your time!
> >>> >> > > - Claire
> >>> >> > >
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
>
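
To make the third question above concrete, here is a minimal sketch of what a
frequency-based dictionary could look like: only the most frequent values get
a dictionary index, and everything else is stored as-is. This is purely
illustrative and is not how parquet-mr encodes data. The function names and
the max_entries parameter are hypothetical, chosen only for this example.

```python
from collections import Counter


def build_frequency_dictionary(values, max_entries):
    """Map the max_entries most frequent values to small integer indices."""
    counts = Counter(values)
    most_frequent = counts.most_common(max_entries)
    return {value: index for index, (value, _count) in enumerate(most_frequent)}


def encode(values, dictionary):
    """Tag each value as a dictionary index or as a plain (unencoded) value."""
    return [
        ("dict", dictionary[value]) if value in dictionary else ("plain", value)
        for value in values
    ]


def decode(encoded, dictionary):
    """Invert encode() using the same dictionary."""
    reverse = {index: value for value, index in dictionary.items()}
    return [
        reverse[payload] if kind == "dict" else payload
        for kind, payload in encoded
    ]


# Hypothetical skewed column: two values dominate, the rest are rare.
values = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
dictionary = build_frequency_dictionary(values, max_entries=2)
encoded = encode(values, dictionary)
```

With a skewed distribution, most entries become small indices while the rare
values pay the full plain cost, so the dictionary stays bounded no matter how
many unique values the column holds.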
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
>bison being out of date, etc etc.
>
> suggestions?
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
at 6:06 PM Sol Lederman wrote:
> Hi,
>
> I'd like to join the parquet support mailing list.
>
> Thanks.
>
> Sol Lederman
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
; Protobuf to review the change. If you can help, please review. Thanks.
>
> --
> Xinli Shang
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
ld like to thank Gidon
> for his continuous collaborations with us!
>
> If you have any questions about the blog, feel free to reach out!
>
> Xinli Shang
>
> Tech Lead Manager at Uber Data Infra
>
> VP Apache Parquet PMC Chair
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
et doesn't have a dependency on the log4j versions
> that are impacted. Have a good weekend!
>
> --
> Xinli Shang
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
1.11.1
On Tue, Mar 23, 2021 at 10:47 AM Xinli Shang
wrote:
> Let's discuss it in today's community sync meeting.
>
> On Tue, Mar 23, 2021 at 8:37 AM Aaron Niskode-Dossett
> wrote:
>
> > Gabor and Ismaël, thank you both for the very clear explanations of what's
>
>
> On 2021/1/29, 18:01, "Gidon Gershinsky" wrote:
>
> Regarding the technical reason behind this addition - we needed it to
> enable encryption in one of the writing paths.
>
> Cheers, Gidon
>
>
> On Thu, Jan 28, 202
3e53276572c105b4ccac71293e988adc30
> > > * This corresponds to the tag: apache-parquet-1.12.0-rc1
> > > *
> > >
> > >
> >
> https://github.com/apache/parquet-mr/tree/ad59c33e53276572c105b4ccac71293e988adc30
> > >
> > > The release tarba
Gabor -- is there an active parquet committer who works in the protobuf
module? There are several open PRs (mostly from David, one from me, perhaps
others) that would constitute nice improvements to that module.
Thanks, Aaron
On Wed, Oct 21, 2020 at 7:39 AM Aaron Niskode-Dossett <
aniskoded
.
> > >
> > > Aaron and I have put up a few PRs re: protobuf integration
> > >
> > > Is anyone available to review and potentially push?
> > >
> > > Thanks.
> > >
> > > On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett
> > > wro
tial push?
>
> Thanks.
>
> On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett
> wrote:
>
> > Hello Parquet project members,
> >
> > Could a parquet-protobuf component be added to the project JIRA? There are a
> > few open JIRA tickets that would be nice to
I played around with the code and found a simple, maybe too simple,
solution and opened a PR. Fingers crossed.
On Tue, Sep 29, 2020 at 10:55 AM Aaron Niskode-Dossett <
aniskodedoss...@etsy.com> wrote:
> Thank you, David, I agree with your conclusions. I opened PARQUET-1917.
>
>
Hello Parquet project members,
Could a parquet-protobuf component be added to the project JIRA? There are a
few open JIRA tickets that would be nice to categorize. If the component
is created, I would be happy to categorize the tickets.
Thank you, Aaron
--
Aaron Niskode-Dossett, Data Engineering
ell.
> >
> > I would expect "bar_int" and "bar_int2" to be 'null' here.
> >
> > Have you filed a JIRA with this reproduction?
> >
> > Thanks.
> >
> > On Fri, Sep 25, 2020 at 9:58 AM Aaron Niskode-Dossett
> > wrote:
> >
>
0 | 0 | hello world |
+---+---++--+
I would expect that bar_int and bar_int2 are EMPTY for all three rows since
only bar_string is set in the oneof.
Is this the right expectation for me to have?
Thank you!
--
Aaron Niskode-Dossett, Data Engineering -- Etsy