Re: Parquet dictionary size limits?

2023-09-19 Thread Aaron Niskode-Dossett
> [...] fallback.
>
> Cheers,
> Micah
>
> On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:
>
> > Hi dev@,
> >
> > I'm running some benchmarking on Parquet read/write performance and have a
> > few questions about how dictionary encoding works under the hood. Let me
> > know if there's a better channel for this :)
> >
> > My test case uses parquet-avro, where I'm writing a single file containing
> > 5 million records. Each record has a single column, an Avro String field
> > (Parquet binary field). I ran two configurations of base setup: in the
> > first case, the string field has 5,000 possible unique values. In the
> > second case, it has 50,000 unique values.
> >
> > In the first case (5k unique values), I used parquet-tools to inspect the
> > file metadata and found that a dictionary had been written:
> >
> > % parquet-tools meta testdata-case1.parquet
> > > file schema:  testdata.TestRecord
> > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > >
> > > row group 1:  RC:501 TS:18262874 OFFSET:4
> > > stringField:  BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:501 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
> >
> > But in the second case (50k unique values), parquet-tools shows that no
> > dictionary gets created, and the file size is *much* bigger:
> >
> > % parquet-tools meta testdata-case2.parquet
> > > file schema:  testdata.TestRecord
> > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > >
> > > row group 1:  RC:501 TS:18262874 OFFSET:4
> > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:501 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: , num_nulls: 0]
> >
> > (I created a gist of my test reproduction here:
> > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806 )
> >
> > Based on this, I'm guessing there's some tip-over point after which Parquet
> > will give up on writing a dictionary for a given column? After reading the
> > Configuration docs at
> > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md,
> > I tried increasing the dictionary page size configuration 5x, with the same
> > result (no dictionary created).
> >
> > So to summarize, my questions are:
> >
> > - What's the heuristic for Parquet dictionary writing to succeed for a
> >   given column?
> > - Is that heuristic configurable at all?
> > - For high-cardinality datasets, has the idea of a frequency-based
> >   dictionary encoding been explored? Say, if the data follows a certain
> >   statistical distribution, we can create a dictionary of the most frequent
> >   values only?
> >
> > Thanks for your time!
> > - Claire
>
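For anyone reproducing this setup, the dictionary knobs discussed above are set on the
parquet-avro writer builder (or via the corresponding parquet.* Hadoop keys). The sketch
below shows where they live; the schema and cardinalities are stand-ins rather than the
exact code from Claire's gist, and whether the dictionary survives at higher cardinality
is exactly the tip-over behaviour the questions above ask about.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class DictionaryWriteSketch {
  public static void main(String[] args) throws Exception {
    // Single string column, mirroring the testdata.TestRecord case above.
    Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
        .fields().requiredString("stringField").endRecord();

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case1.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
                 .withDictionaryEncoding(true)
                 // Size budget for the dictionary page per column chunk,
                 // i.e. the parquet.dictionary.page.size setting.
                 .withDictionaryPageSize(5 * 1024 * 1024)
                 .build()) {
      for (int i = 0; i < 5_000_000; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("stringField", Integer.toString(i % 5_000)); // ~5k distinct values
        writer.write(record);
      }
    }
  }
}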


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: building parquet macbook m1 with thrift 0.15.0

2023-06-14 Thread Aaron Niskode-Dossett
My solution was to write a docker container to build parquet on my
macbook.  I spent a couple of hours trying and failing to build it directly
and got a docker solution working in far less time.

On Wed, Jun 14, 2023 at 7:50 AM Steve Loughran 
wrote:

> How do people get a version of the native thrift binaries onto their
> macbook such that parquet builds?
>
>
>1. as homebrew is on 0.18.1, and if you try to build with that you can
>see that thrift has added some new things to implement.
>2. try to rebuild thrift 0.15 and you end up in cmake pain with xcode
>bison being out of date, etc etc.
>
> suggestions?
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: join mailing list

2022-07-19 Thread Aaron Niskode-Dossett
Hi Sol,

Welcome! You can send an email to "dev-subscr...@parquet.apache.org" to
start that process.

We should probably add a note about that here:
https://parquet.apache.org/community/ -- I don't see explicit instructions
in the community section.

Best, Aaron

On Mon, Jul 18, 2022 at 6:06 PM Sol Lederman  wrote:

> Hi,
>
> I'd like to join the parquet support mailing list.
>
> Thanks.
>
> Sol Lederman
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: Look for protobuf reviewers for PR-900

2022-03-24 Thread Aaron Niskode-Dossett
I have a little bit of familiarity and added a comment.

On Sun, Mar 20, 2022 at 1:26 PM Xinli shang  wrote:

> Hi all,
>
> We have a PR <https://github.com/apache/parquet-mr/pull/900> related to
> Protobuf pending review. We are looking for people who are familiar with
> Protobuf to review the change. If you can help, please review. Thanks.
>
> --
> Xinli Shang
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: Two blogs about Apache Parquet were just published on the Uber EngBlog site

2022-03-14 Thread Aaron Niskode-Dossett
Thank you for sharing, those were quite interesting.

On Fri, Mar 11, 2022 at 10:43 AM Xinli shang 
wrote:

> Hi all,
>
> Uber EngBlog site just pushed two articles about Apache Parquet: Cost
> Efficiency @ Scale in Big Data File Format
> <https://eng.uber.com/cost-efficiency-big-data/> and One Stone, Three
> Birds: Finer-Grained Encryption @ Apache Parquet™
> <
> https://eng.uber.com/one-stone-three-birds-finer-grained-encryption-apache-parquet/
> >.
> Please check them out!
>
>
> The first one is about how to use Parquet ZSTD, the Column Pruning (deletion)
> tool, Precision Reduction, Multi-Column Ordering, and the fast translation tool
> in Parquet to reduce storage space and improve cost efficiency. This project
> alone saves storage at the hundred-PB level, which is equivalent to several
> million dollars in savings per year.
>
> The second one talks about using Apache Parquet's fine-grained encryption
> feature to solve three challenges: encryption, access control, and data
> retention! This wraps up the work we have done with the community in the
> last 3 years around Parquet Modular Encryption. I would like to thank Gidon
> for his continuous collaborations with us!
>
> If you have any questions about the blog, feel free to reach out!
>
> Xinli Shang
>
> Tech Lead Manager at Uber Data Infra
>
> VP Apache Parquet PMC Chair
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: About the security issue of log4j for Parquet

2021-12-13 Thread Aaron Niskode-Dossett
Thank you, that was on my to do list today.

On Sun, Dec 12, 2021 at 3:39 PM Xinli shang  wrote:

> Hi all,
>
> Most of you must have heard of the severe security issue
> (https://www.randori.com/blog/cve-2021-44228) in log4j. I just want to give
> a short update that Parquet doesn't have a dependency on the log4j versions
> that are impacted. Have a good weekend!
>
> --
> Xinli Shang
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: [VOTE] Release Apache Parquet 1.12.0 RC4

2021-03-23 Thread Aaron Niskode-Dossett
+1 (non-binding)

- cloned the 1.12.0-rc-4 tag from github
- compiled jars locally and all tests passed
- used the 1.12.0 jars as dependencies for a local application that streams
data into protobuf-parquet files
- confirmed data is correct and can be read with parquet-tools compiled
from parquet 1.11.1

On Tue, Mar 23, 2021 at 10:47 AM Xinli shang 
wrote:

> Let's discuss it in today's community sync meeting.
>
> On Tue, Mar 23, 2021 at 8:37 AM Aaron Niskode-Dossett  wrote:
>
> > Gabor and Ismaël, thank you both for the very clear explanations of what's
> > going on.
> >
> > Based on Gabor's description of Avro compatibility I would be +1
> > (non-binding) for the current RC.
> >
> > On Tue, Mar 23, 2021 at 4:36 AM Gabor Szadovszky  wrote:
> >
> > > Thanks, Ismaël, for the explanation. I have a couple of notes about your
> > > concerns.
> > >
> > > - Parquet 1.12.0 as per the semantic versioning is not a major but a minor
> > >   release. (It is different from the Avro versioning strategy where the
> > >   second version number means major version changes.)
> > > - The jackson dependency is shaded in the parquet jars so the
> > >   synchronization of the version is not needed (and not even possible).
> > > - Using the latest Avro version makes sense but if we do not use it for the
> > >   current release it should not cause any issues in our clients. Let's check
> > >   the following example. We upgrade to the latest 1.10.2 Avro release in
> > >   parquet then release it under 1.12.0. Later on Avro creates a new release
> > >   (e.g. 1.10.3 or even 1.11.0) while Parquet does not. In this case our
> > >   clients need to upgrade Avro without Parquet. If it is a major Avro release
> > >   it might occur that the Parquet code has to be updated, but usually it is
> > >   not the case. (The last time we've had to change production code for an
> > >   Avro upgrade was from 1.7.6 to 1.8.0.) I think our clients should be able
> > >   to upgrade Avro independently from Parquet and vice versa (until there are
> > >   incompatibility issues). I would even change Parquet's Avro dependency to
> > >   "provided", but that might be a breaking change and we clearly won't do it
> > >   just before the release.
> > >
> > > What do you think? Anyone have a strong opinion about this topic?
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Mon, Mar 22, 2021 at 6:31 PM Ismaël Mejía  wrote:
> > >
> > > > Sure. The Avro upgrade, feature/API wise, is minor for Parquet, so the
> > > > possibility of adding a regression is really REALLY minor. The hidden issue
> > > > is the new transitive dependencies introduced by Avro, concretely Jackson
> > > > 2.12.2.
> > > >
> > > > Since Parquet 1.12.0 is a major version it is probably a good moment to
> > > > upgrade Jackson too; that's why I opened [1] (already merged). In particular
> > > > now that Spark merged support for both Avro 1.10.2 [2] and Jackson 2.12.2
> > > > [3] for the upcoming 3.2.0 release, so now Spark can easily bring upgraded
> > > > Parquet too with all the dependencies well aligned. This of course is not a
> > > > blocker for the release or for other downstream projects but it might help
> > > > to make their life better because they will have less dependency alignment
> > > > issues to battle.
> > > >
> > > > Ismaël
> > > >
> > > > [1] https://github.com/apache/parquet-mr/pull/883
> > > > [2] https://github.com/apache/spark/pull/31866
> > > > [3] https://github.com/apache/spark/pull/31878
> > > >
> > > >
> > > > On Mon, Mar 22, 2021, 3:37 PM Xinli shang  wrote:
> > > >
> > > > > Hi Ismaël,
> > > > >
> > > > > Can you explain a little bit more: if we don't upgrade in this release,
> > > > > what could be the worst-case scenario for the ecosystem? The last-minute
> > > > > upgrading seems a rush to me, but I would like to hear what the impact
> > > > > would be if we don't.  As Gabor mentioned, this should not be a
> > > > > show-stopper.
> > > > >
> > > > > Xinli
> > > > >
> >

Re: [VOTE] Release Apache Parquet 1.12.0 RC1

2021-01-29 Thread Aaron Niskode-Dossett
I haven't seen this java.lang.NoSuchMethodError:
java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer; problem before, but
some research suggests that this can happen when a JDK newer than 8 is used
to compile code that gets run with JDK 8.  Setting --source and --target
alone does not address the problem, but adding "--release 8" will.

I don't yet know enough about the Parquet build process to know if this is
systemic or an individual user issue.  It seems like we specify Java 8
through the pom file?

Another common suggestion to address this specific issue is to cast
ByteBuffer to Buffer before calling methods like position, but that seems
like an incredibly fragile fix.
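For illustration, the cast workaround mentioned above forces the call site to bind
against java.nio.Buffer, whose position(int) signature is the one that exists on Java 8;
a minimal sketch (the class and helper names are made up, this is not parquet-mr code):

import java.nio.Buffer;
import java.nio.ByteBuffer;

public class ByteBufferCompat {
  // Binding the call to Buffer.position(int) avoids the covariant
  // ByteBuffer.position(int) override that only exists in JDK 9+ class files,
  // so bytecode compiled on a newer JDK still links on a Java 8 runtime.
  static void setPosition(ByteBuffer buffer, int newPosition) {
    ((Buffer) buffer).position(newPosition);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    setPosition(buf, 8);
    System.out.println(buf.position()); // prints 8
  }
}

Compiling with --release 8 is the less invasive fix, since it makes javac link against
the Java 8 API signatures (and reject newer ones) without touching the source.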

References:
http://openjdk.java.net/jeps/247
https://stackoverflow.com/questions/61267495/exception-in-thread-main-java-lang-nosuchmethoderror-java-nio-bytebuffer-flip
https://github.com/eclipse/jetty.project/issues/3244

On Fri, Jan 29, 2021 at 6:56 AM Wang, Yuming 
wrote:

>
> It seems there is something wrong with JDK 8:
>
> java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
>   at org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:197)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeOrAppendBitPackedRun(RunLengthBitPackingHybridEncoder.java:193)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeInt(RunLengthBitPackingHybridEncoder.java:179)
>   at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.getBytes(DictionaryValuesWriter.java:167)
>   at org.apache.parquet.column.values.fallback.FallbackValuesWriter.getBytes(FallbackValuesWriter.java:74)
>   at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:60)
>   at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
>   at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:235)
>   at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:222)
>   at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
>   at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:307)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:465)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
>
>
> On 2021/1/29, 18:01, "Gidon Gershinsky"  wrote:
>
> External Email
>
> Regarding the technical reason behind this addition - we needed it to
> enable encryption in one of the writing paths.
>
> Cheers, Gidon
>
>
> On Thu, Jan 28, 2021 at 7:09 PM Aaron Niskode-Dossett  wrote:
>
> > My (non-binding) view is that this is ok.  In a different Apache project, we
> > didn't allow a change like that in minor versions and it delayed some key
> > work by several months.
> >
> > On Thu, Jan 28, 2021 at 3:00 AM Gabor Szadovszky  wrote:
> >
> > > Thanks a lot, Fokko.
> > >
> > > Regarding the breaking change. We have the maven plugin japicmp executed in
> > > the verify phase, so I was curious why it did not catch this issue. It seems
> > > the plugin allows source incompatible changes for minor version upgrades by
> > > default. It sounds reasonable to me but I am curious about the opinion of
> > > the community. See details about the plugin at
> > > https://siom79.github.io/japicmp/MavenPlugin.html. Search for
> > > METHOD_ADDED_TO_INTERFACE to find info about the current issue.
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > > On Wed, Jan 27, 2021 at 11:08 PM Driesprong, Fokko  wrote:

Re: [VOTE] Release Apache Parquet 1.12.0 RC1

2021-01-28 Thread Aaron Niskode-Dossett
ass ParquetIO {
> >
> >   public long defaultBlockSize() {
> >     return 0;
> >   }
> >
> > + @Override
> > + public String getPath() {
> > +   return this.file.location();
> > + }
> > }
> >
> > This change is introduced here:
> > https://github.com/apache/parquet-mr/commit/5c6916c23cb2b9c225ea80328550ee0e11aee225
> >
> > It is breaking, but not sure if it is blocking.
> >
> > A +1 (non-binding) from my side!
> >
> > Cheers, Fokko
> >
> >
> >
> >
> > On Wed, Jan 27, 2021 at 16:46, Gabor Szadovszky  wrote:
> >
> > > Hi everyone,
> > >
> > > I propose the following RC to be released as the official Apache Parquet
> > > 1.12.0 release.
> > >
> > > The commit id is ad59c33e53276572c105b4ccac71293e988adc30
> > > * This corresponds to the tag: apache-parquet-1.12.0-rc1
> > > * https://github.com/apache/parquet-mr/tree/ad59c33e53276572c105b4ccac71293e988adc30
> > >
> > > The release tarball, signature, and checksums are here:
> > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.0-rc1
> > >
> > > You can find the KEYS file here:
> > > * https://downloads.apache.org/parquet/KEYS
> > >
> > > Binary artifacts are staged in Nexus here:
> > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >
> > > This release includes the features Parquet Modular Encryption and Parquet
> > > Bloom Filter. See details at:
> > > * https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0-rc1/CHANGES.md
> > >
> > > Please download, verify, and test.
> > >
> > > Please vote in the next 72 hours.
> > >
> > > [ ] +1 Release this as Apache Parquet 1.12.0
> > > [ ] +0
> > > [ ] -1 Do not release this because...
> > >
> > >
> > > PS.: Starting with RC1 instead of RC0 because I missed updating the
> > > CHANGES.md the first time.
> > >
> >
>
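As background for the METHOD_ADDED_TO_INTERFACE discussion quoted above, here is a small
self-contained illustration (hypothetical names, not the real org.apache.parquet.io
interfaces) of why adding an abstract method to a published interface is source-breaking
for existing implementors such as Fokko's ParquetIO snippet, while a default method is not:

public class InterfaceEvolution {
  // Version 1 of a published interface only had defaultBlockSize().
  interface OutputFileLike {
    long defaultBlockSize();

    // Added in "version 2". As an abstract method this would break every
    // existing implementor at compile time; giving it a default body keeps
    // old implementors source-compatible.
    default String getPath() {
      return null;
    }
  }

  // Written against version 1; still compiles because getPath() has a default.
  static class LegacyOutputFile implements OutputFileLike {
    @Override
    public long defaultBlockSize() {
      return 0L;
    }
  }

  public static void main(String[] args) {
    OutputFileLike file = new LegacyOutputFile();
    System.out.println(file.defaultBlockSize() + " " + file.getPath());
  }
}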


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: Create a parquet-protobuf JIRA component

2020-10-21 Thread Aaron Niskode-Dossett
Gabor -- is there an active parquet committer who works in the protobuf
module? There are several open PRs (mostly from David, one from me, perhaps
others) that would constitute nice improvements to that module.

Thanks, Aaron

On Wed, Oct 21, 2020 at 7:39 AM Aaron Niskode-Dossett <
aniskodedoss...@etsy.com> wrote:

> Wonderful, thank you!  My company hopes to use proto+parquet a lot and I
> look forward to contributing!
>
> On Wed, Oct 21, 2020 at 2:54 AM Gabor Szadovszky  wrote:
>
>> Sorry, I've missed this thread. Just created the component. Feel free to
>> use it.
>>
>> On Tue, Oct 20, 2020 at 4:27 PM Aaron Niskode-Dossett
>>  wrote:
>>
>> > Hi, just bumping this request for a parquet-protobuf JIRA component
>> again.
>> >
>> > On Fri, Oct 2, 2020 at 9:03 AM David  wrote:
>> >
>> > > Hello Gang,
>> > >
>> > > I too would like to see this too.
>> > >
>> > > Aaron and I have put up a few PRs re: protobuf integration
>> > >
>> > > Is anyone able for review and potential push?
>> > >
>> > > Thanks.
>> > >
>> > > On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett
>> > >  wrote:
>> > >
>> > > > Hello Parquet project members,
>> > > >
>> > > > Could a parquet-protobuf component be added to the project JIRA?  There are a
>> > > > few open JIRA tickets that would be nice to categorize.  If the component
>> > > > is created, I would be happy to categorize the tickets.
>> > > >
>> > > > Thank you, Aaron
>> > > >
>> > > > --
>> > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
>> > > >
>> > >
>> >
>> >
>> > --
>> > Aaron Niskode-Dossett, Data Engineering -- Etsy
>> >
>>
>
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: Create a parquet-protobuf JIRA component

2020-10-21 Thread Aaron Niskode-Dossett
Wonderful, thank you!  My company hopes to use proto+parquet a lot and I
look forward to contributing!

On Wed, Oct 21, 2020 at 2:54 AM Gabor Szadovszky  wrote:

> Sorry, I've missed this thread. Just created the component. Feel free to
> use it.
>
> On Tue, Oct 20, 2020 at 4:27 PM Aaron Niskode-Dossett
>  wrote:
>
> > Hi, just bumping this request for a parquet-protobuf JIRA component
> again.
> >
> > On Fri, Oct 2, 2020 at 9:03 AM David  wrote:
> >
> > > Hello Gang,
> > >
> > > I too would like to see this too.
> > >
> > > Aaron and I have put up a few PRs re: protobuf integration
> > >
> > > Is anyone able for review and potential push?
> > >
> > > Thanks.
> > >
> > > On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett
> > >  wrote:
> > >
> > > > Hello Parquet project members,
> > > >
> > > > Could a parquet-protobuf component be added to the project JIRA?  There are a
> > > > few open JIRA tickets that would be nice to categorize.  If the component
> > > > is created, I would be happy to categorize the tickets.
> > > >
> > > > Thank you, Aaron
> > > >
> > > > --
> > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
> > > >
> > >
> >
> >
> > --
> > Aaron Niskode-Dossett, Data Engineering -- Etsy
> >
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: Create a parquet-protobuf JIRA component

2020-10-20 Thread Aaron Niskode-Dossett
Hi, just bumping this request for a parquet-protobuf JIRA component again.

On Fri, Oct 2, 2020 at 9:03 AM David  wrote:

> Hello Gang,
>
> I too would like to see this too.
>
> Aaron and I have put up a few PRs re: protobuf integration
>
> Is anyone able for review and potential push?
>
> Thanks.
>
> On Tue, Sep 29, 2020 at 12:01 PM Aaron Niskode-Dossett
>  wrote:
>
> > Hello Parquet project members,
> >
> > Could a parquet-protobuf component be added to the project JIRA?  There are a
> > few open JIRA tickets that would be nice to categorize.  If the component
> > is created, I would be happy to categorize the tickets.
> >
> > Thank you, Aaron
> >
> > --
> > Aaron Niskode-Dossett, Data Engineering -- Etsy
> >
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: protobuf3 and oneof fields

2020-09-29 Thread Aaron Niskode-Dossett
I played around with the code and found a simple, maybe too simple,
solution and opened a PR.  Fingers crossed.

On Tue, Sep 29, 2020 at 10:55 AM Aaron Niskode-Dossett <
aniskodedoss...@etsy.com> wrote:

> Thank you, David, I agree with your conclusions.  I opened PARQUET-1917.
>
> On Tue, Sep 29, 2020 at 10:18 AM David  wrote:
>
>> Hello,
>>
>> Perhaps a bit more nuance here.  I believe that the values are technically
>> correct (they should be the default value of 0), but we should not be
>> storing them as 0 values.  We need to check the hasBar*() to determine if
>> the value should be stored or omitted.
>>
>> Thanks.
>>
>> On Tue, Sep 29, 2020 at 10:39 AM David  wrote:
>>
>> > Hello,
>> >
>> > I too have been poking around the Parquet-Proto package as well.
>> >
>> > I would expect "bar_int" and "bar_int2" to be 'null' here.
>> >
>> > Have you filed a JIRA with this reproduction?
>> >
>> > Thanks.
>> >
>> > On Fri, Sep 25, 2020 at 9:58 AM Aaron Niskode-Dossett
>> >  wrote:
>> >
>> >> Hello,
>> >>
>> >> I am experimenting with serializing protobuf3 to parquet and have a
>> >> question about how "oneOf" fields should be treated.  I will describe an
>> >> example.  I'm running parquet 1.11.1 with PARQUET-1684 applied.  That JIRA
>> >> is about how default values are written out, and seems related to my
>> >> question.
>> >>
>> >> SCHEMA
>> >> 
>> >> message Person {
>> >>   int32 foo = 1;
>> >>   oneof optional_bar {
>> >>     int32 bar_int = 200;
>> >>     int32 bar_int2 = 201;
>> >>     string bar_string = 300;
>> >>   }
>> >> }
>> >>
>> >> CODE
>> >> 
>> >> I set values for foo and bar_string
>> >>
>> >> for (int i = 0; i < 3; i += 1) {
>> >>     com.etsy.grpcparquet.Person message = Person.newBuilder()
>> >>         .setFoo(i)
>> >>         .setBarString("hello world")
>> >>         .build();
>> >>     message.writeDelimitedTo(out);
>> >> }
>> >> And then I write the protobuf file out to parquet.
>> >>
>> >> RESULT
>> >> ---
>> >> $ parquet-tools show example.parquet
>> >>
>> >> +-------+-----------+------------+--------------+
>> >> |   foo |   bar_int |   bar_int2 | bar_string   |
>> >> |-------+-----------+------------+--------------|
>> >> |     0 |         0 |          0 | hello world  |
>> >> |     1 |         0 |          0 | hello world  |
>> >> |     2 |         0 |          0 | hello world  |
>> >> +-------+-----------+------------+--------------+
>> >>
>> >> I would expect that bar_int and bar_int2 are EMPTY for all three rows
>> >> since
>> >> only bar_string is set in the oneof.
>> >>
>> >> Is this the right expectation for me to have?
>> >>
>> >> Thank you!
>> >>
>> >> --
>> >> Aaron Niskode-Dossett, Data Engineering -- Etsy
>> >>
>> >
>>
>
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
>
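For reference, the check David describes above -- consulting the generated hasBar*()
accessors so that only the oneof member that is actually set gets emitted -- looks
roughly like this from the caller's side. It reuses the Person schema quoted above and
is only a sketch, not the parquet-protobuf change from the PR:

import com.etsy.grpcparquet.Person;

public class OneofCheckSketch {
  // Builds a printable view of the record, emitting a oneof member only when
  // protobuf reports it as set instead of relying on proto3 default values.
  static String describe(Person person) {
    StringBuilder sb = new StringBuilder("foo=" + person.getFoo());
    if (person.hasBarInt()) {
      sb.append(", bar_int=").append(person.getBarInt());
    }
    if (person.hasBarInt2()) {
      sb.append(", bar_int2=").append(person.getBarInt2());
    }
    if (person.hasBarString()) {
      sb.append(", bar_string=").append(person.getBarString());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Person p = Person.newBuilder().setFoo(1).setBarString("hello world").build();
    // Prints "foo=1, bar_string=hello world"; the unset int members are skipped.
    System.out.println(describe(p));
  }
}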


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Create a parquet-protobuf JIRA component

2020-09-29 Thread Aaron Niskode-Dossett
Hello Parquet project members,

Could a parquet-protobuf component be added to the project JIRA?  There are a
few open JIRA tickets that would be nice to categorize.  If the component
is created, I would be happy to categorize the tickets.

Thank you, Aaron

-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


Re: protobuf3 and oneof fields

2020-09-29 Thread Aaron Niskode-Dossett
Thank you, David, I agree with your conclusions.  I opened PARQUET-1917.

On Tue, Sep 29, 2020 at 10:18 AM David  wrote:

> Hello,
>
> Perhaps a bit more nuance here.  I believe that the values are technically
> correct (they should be the default value of 0), but we should not be
> storing them as 0 values.  We need to check the hasBar*() to determine if
> the value should be stored or omitted.
>
> Thanks.
>
> On Tue, Sep 29, 2020 at 10:39 AM David  wrote:
>
> > Hello,
> >
> > I too have been poking around the Parquet-Proto package as well.
> >
> > I would expect "bar_int" and "bar_int2" to be 'null' here.
> >
> > Have you filed a JIRA with this reproduction?
> >
> > Thanks.
> >
> > On Fri, Sep 25, 2020 at 9:58 AM Aaron Niskode-Dossett
> >  wrote:
> >
> >> Hello,
> >>
> >> I am experimenting with serializing protobuf3 to parquet and have a
> >> question about how "oneOf" fields should be treated.  I will describe an
> >> example.  I'm running parquet 1.11.1 with PARQUET-1684 applied.  That JIRA
> >> is about how default values are written out, and seems related to my
> >> question.
> >>
> >> SCHEMA
> >> 
> >> message Person {
> >>   int32 foo = 1;
> >>   oneof optional_bar {
> >>     int32 bar_int = 200;
> >>     int32 bar_int2 = 201;
> >>     string bar_string = 300;
> >>   }
> >> }
> >>
> >> CODE
> >> 
> >> I set values for foo and bar_string
> >>
> >> for (int i = 0; i < 3; i += 1) {
> >>     com.etsy.grpcparquet.Person message = Person.newBuilder()
> >>         .setFoo(i)
> >>         .setBarString("hello world")
> >>         .build();
> >>     message.writeDelimitedTo(out);
> >> }
> >> And then I write the protobuf file out to parquet.
> >>
> >> RESULT
> >> ---
> >> $ parquet-tools show example.parquet
> >>
> >> +-------+-----------+------------+--------------+
> >> |   foo |   bar_int |   bar_int2 | bar_string   |
> >> |-------+-----------+------------+--------------|
> >> |     0 |         0 |          0 | hello world  |
> >> |     1 |         0 |          0 | hello world  |
> >> |     2 |         0 |          0 | hello world  |
> >> +-------+-----------+------------+--------------+
> >>
> >> I would expect that bar_int and bar_int2 are EMPTY for all three rows
> >> since
> >> only bar_string is set in the oneof.
> >>
> >> Is this the right expectation for me to have?
> >>
> >> Thank you!
> >>
> >> --
> >> Aaron Niskode-Dossett, Data Engineering -- Etsy
> >>
> >
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy


protobuf3 and oneof fields

2020-09-25 Thread Aaron Niskode-Dossett
Hello,

I am experimenting with serializing protobuf3 to parquet and have a
question about how "oneOf" fields should be treated.  I will describe an
example.  I'm running parquet 1.11.1 with PARQUET-1684 applied.  That JIRA
is about how default values are written out, and seems related to my
question.

SCHEMA

message Person {
  int32 foo = 1;
  oneof optional_bar {
    int32 bar_int = 200;
    int32 bar_int2 = 201;
    string bar_string = 300;
  }
}

CODE

I set values for foo and bar_string

for (int i = 0; i < 3; i += 1) {
    com.etsy.grpcparquet.Person message = Person.newBuilder()
        .setFoo(i)
        .setBarString("hello world")
        .build();
    message.writeDelimitedTo(out);
}
And then I write the protobuf file out to parquet.

RESULT
---
$ parquet-tools show example.parquet


+-------+-----------+------------+--------------+
|   foo |   bar_int |   bar_int2 | bar_string   |
|-------+-----------+------------+--------------|
|     0 |         0 |          0 | hello world  |
|     1 |         0 |          0 | hello world  |
|     2 |         0 |          0 | hello world  |
+-------+-----------+------------+--------------+

I would expect that bar_int and bar_int2 are EMPTY for all three rows since
only bar_string is set in the oneof.

Is this the right expectation for me to have?

Thank you!

-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy