Re: Next Parquet Sync Up

2015-07-21 Thread Jacques Nadeau
Any chance we can have these on either a different day or time? The Drill hangout is every Tuesday at 10am so I always have to pick one or the other. On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi nyigitb...@netflix.com.invalid wrote: An update to actions, I will create a PR for the

Re: Next Parquet Sync Up

2015-07-22 Thread Jacques Nadeau
on a Monday. Anybody objects? Julien On Jul 21, 2015, at 17:37, Jacques Nadeau jacq...@apache.org wrote: Any chance we can have these on either a different day or time? The Drill hangout is every Tuesday at 10am so I always have to pick one or the other

Re: Parquet Index Pages

2015-07-09 Thread Jacques Nadeau
I think we should start a design discussion around this. I think there were early ideas by some of the initial authors. However, I don't think it has been designed. On Jul 9, 2015 9:16 AM, Patrick Woody patrick.woo...@gmail.com wrote: Just wanted to follow up here. Is there any information on

Re: Next Parquet sync up tomorrow Wednesday 10am PT on hangout

2015-08-31 Thread Jacques Nadeau
By Wednesday, you mean the day after tomorrow, right? :) On Mon, Aug 31, 2015 at 10:29 PM, Julien Le Dem wrote: > Wed, September 2, 10:00 AM PDT > https://plus.google.com/hangouts/_/event/cob1rrt1spt1f15qbsfeqv51cmc > -- > Julien

Re: Binary sort order and RC

2016-09-08 Thread Jacques Nadeau
A non-binding +1 from me on releasing sooner/more often. On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue wrote: > Hey everyone, > I'd like to put together a release candidate for 1.9.0. The other issues are done, but the sort order min/max issue, PARQUET-686 is still

Re: Binary sort order and RC

2016-10-09 Thread Jacques Nadeau
> So I’m cool with making necessary changes to get this in sooner rather than later, I’ve mostly been blocking on code reviews. If there’s a commitment made to releasing 1.9.1

Re: Compression test data

2017-09-28 Thread Jacques Nadeau
Thanks for sharing these Ryan. Definitely intriguing. On Wed, Sep 27, 2017 at 5:38 PM, Ryan Blue wrote: > For anyone that would also like to test the compression codecs, I’ve uploaded a copy of parquet-cli that can read and write zstd, lz4, and brotli to my Apache

Codec value missing from Turbodbc files? Format issue?

2017-11-20 Thread Jacques Nadeau
One of our community members hit an issue where we couldn't parse a Parquet footer. It looks like the file is missing the Codec field for a column but the Parquet Thrift spec expects one. https://community.dremio.com/t/unable-to-read-parquet-footer-with-file-generated-with-turbodbc/474/9 Was

Re: Codec value missing from Turbodbc files? Format issue?

2017-11-20 Thread Jacques Nadeau
r readers can't read the files or metadata because of how Thrift handles enums. > rb > On Mon, Nov 20, 2017 at 8:34 AM, Jacques Nadeau <jacq...@apache.org> wrote: > > One of our community members hit an issue where we couldn't parse a Parquet

Re: Iceberg table format

2017-12-07 Thread Jacques Nadeau
Sounds super interesting. Would love to collaborate on this. Do you have a repo or mailing list where you are working on this? On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue wrote: > Hi everyone, > I mentioned in the sync-up this morning that I’d send out an

Re: [VOTE] Accept donation of Parquet Rust implementation

2018-03-06 Thread Jacques Nadeau
+1 (non-binding) On Tue, Mar 6, 2018 at 12:31 PM, Uwe L. Korn wrote: > +1 > > On Tue, Mar 6, 2018, at 9:29 PM, Ryan Blue wrote: > > +1 > > > > Thanks for starting a vote, Wes! > > > > On Tue, Mar 6, 2018 at 12:24 PM, Wes McKinney > wrote: > > > > > Dear

Re: [DISCUSS] Upgrade to Jackson 2.x and remove the shading

2019-02-18 Thread Jacques Nadeau
I haven't looked at the usage, but I wonder whether the core modules truly need Jackson. I don't think most of the systems that read Parquet use the Jackson part (?). If so, maybe the code could be refactored to remove the dependency and move it to an optional component. We want to do the same

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
The big win in v2 pages (if I remember correctly) is that the variable length encoding is no longer interleaved. That would provide a big performance lift when pulling into arrow vectors (and variable length decoding typically dominates total read processing time, on average I've seen 5-10x per
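As a rough illustration of the interleaving point, the sketch below contrasts a length-prefixed layout (each value preceded by its own length) with a split layout (all lengths first, then all bytes). It is a simplification and not the full Parquet encoding: the function names are hypothetical, and the split layout stores lengths as plain 4-byte integers rather than the delta-binary-packed form that DELTA_LENGTH_BYTE_ARRAY specifies.

```python
import struct
from itertools import accumulate


def encode_interleaved(values):
    """Length-prefixed layout: [len][bytes][len][bytes]..."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def decode_interleaved(buf):
    """Walk the buffer value by value; each length must be read before the next one can be found."""
    out, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + n])
        pos += n
    return out


def encode_split(values):
    """Split layout (simplified): value count, all lengths, then all bytes back to back."""
    header = struct.pack("<I", len(values))
    lengths = b"".join(struct.pack("<I", len(v)) for v in values)
    return header + lengths + b"".join(values)


def decode_split(buf):
    """Return (offsets, data): an offsets array plus one contiguous data buffer."""
    (count,) = struct.unpack_from("<I", buf, 0)
    lengths = struct.unpack_from(f"<{count}I", buf, 4)
    data = buf[4 + 4 * count:]
    offsets = [0, *accumulate(lengths)]  # running sum of lengths
    return offsets, data


values = [b"parquet", b"", b"arrow", b"v2"]
assert decode_interleaved(encode_interleaved(values)) == values

offsets, data = decode_split(encode_split(values))
assert [data[offsets[i]:offsets[i + 1]] for i in range(len(values))] == values
```

In the split layout the decoder produces an offsets array and a single data buffer without touching individual values, which is close to how Arrow represents variable-length columns.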

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
> v1 vs v2? > Thanks, > Micah > [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407 > On Fr

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
Gabor seems to agree that delta is V2 only. > To summarize, no delta encodings are used for V1 pages. They are available for V2 only. https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau wrote: > Good point. I had me

Re: Metadata summary file deprecation

2020-09-23 Thread Jacques Nadeau
Hey Jason, I'd suggest you look at Apache Iceberg. It is a much more mature way of handling metadata efficiency issues and provides a substantial superset of functionality over the old metadata cache files. On Wed, Sep 23, 2020 at 4:16 PM Jason Altekruse wrote: > Hello again, > > I took a look

Re: Writing very large rowgroups to Apache Parquet

2020-07-11 Thread Jacques Nadeau
I'd suggest a new write pattern. Write the columns a page at a time to separate files, then use a second process to concatenate the columns and append the footer. Odds are you would do better than OS swapping and take memory requirements down to page size times field count. In S3 I believe you could
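A toy sketch of that two-pass pattern, under loud assumptions: this is not a real Parquet writer. There are no page headers, no magic bytes, and no Thrift metadata; the JSON footer is only a stand-in, and `spool_pages` and `concatenate` are hypothetical names.

```python
import json
import os
import tempfile


def spool_pages(page_stream, spool_dir):
    """First pass: append each finished page to its column's own spool file.

    Only the page currently being written is held in memory by this function.
    """
    files = {}
    try:
        for column, page in page_stream:
            if column not in files:
                files[column] = open(os.path.join(spool_dir, f"{column}.col"), "wb")
            files[column].write(page)
    finally:
        for f in files.values():
            f.close()
    return {column: f.name for column, f in files.items()}


def concatenate(paths, out_path, chunk_size=1 << 16):
    """Second pass: copy each column file into the output, then append a footer.

    The footer here is JSON (a stand-in, not Parquet's Thrift footer) followed by
    its length as a 4-byte little-endian integer.
    """
    footer = {}
    with open(out_path, "wb") as out:
        for column, path in paths.items():
            start = out.tell()
            with open(path, "rb") as src:
                while chunk := src.read(chunk_size):
                    out.write(chunk)
            footer[column] = {"offset": start, "length": out.tell() - start}
        footer_bytes = json.dumps(footer).encode()
        out.write(footer_bytes)
        out.write(len(footer_bytes).to_bytes(4, "little"))
    return footer


with tempfile.TemporaryDirectory() as tmp:
    stream = [("id", b"\x01\x02"), ("name", b"ab"), ("id", b"\x03"),
              ("name", b"cd"), ("name", b"e")]
    paths = spool_pages(stream, tmp)
    out_file = os.path.join(tmp, "out.bin")
    footer = concatenate(paths, out_file)
    with open(out_file, "rb") as f:
        blob = f.read()
```

The second pass streams fixed-size chunks, so its memory use is bounded by `chunk_size` regardless of how large each column grows.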

Re: Writing very large rowgroups to Apache Parquet

2020-07-17 Thread Jacques Nadeau
ected to be at least 5mb if I read their docs correctly [1]) >> [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html >> On Saturday, July 11, 2020, Jacques Nadeau wrote: >> > I'd suggest a new write pattern. Write the columns page

Re: Request deprecation / removal of LZ4 compression

2021-02-16 Thread Jacques Nadeau
There is some ambiguity in the discussion and proposals here around deprecating future writing versus supporting reading of already written data and what it means to deprecate something in the format specification. I think it would be a mistake for someone who has written Hadoop-Lz4 for several

Re: Multiple pages with indexes vs multiple row groups with one data page per chunk

2022-03-19 Thread Jacques Nadeau
I can take your comment two ways: what is the downside to large pages or what is the downside to small row groups. One of the key considerations I've dealt with is that page is the unit of compression and if I recall correctly, parquet uses block rather than stream compression. This means you

[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-12-15 Thread Jacques Nadeau (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059056#comment-15059056 ] Jacques Nadeau commented on PARQUET-369: +1 for Ryan's suggestion. Not sure how many Java users

[jira] [Created] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2017-06-09 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1028: Summary: [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't Key: PARQUET-1028 URL: https://issues.apache.org/jira

[jira] [Commented] (PARQUET-1154) [C++] Add function to concatenate a collection of Parquet files to create a new single file

2017-11-04 Thread Jacques Nadeau (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239319#comment-16239319 ] Jacques Nadeau commented on PARQUET-1154: - As an aside, it would be really nice

[jira] [Created] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1475: Summary: DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor Key: PARQUET-1475 URL: https://issues.apache.org/jira/browse

[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Jacques Nadeau (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014777#comment-17014777 ] Jacques Nadeau commented on PARQUET-1698: - In our internal work we actually separate this out