Re: Order of encodings?

2021-06-02 Thread Gabor Szadovszky
Hi Micah, The way you described is how parquet-mr works at the write path. Meanwhile, based on the parquet-mr code, it seems that the scenario explained can be read properly. (If we think we have/will have writers that support such scenarios we shall write unit tests for them.) Cheers, Gabor On

Re: Decouple parquet-mr compression API from hadoop compression API

2021-06-03 Thread Gabor Szadovszky
Hi Xin Dong, There are a couple of open jiras related to this. Like PARQUET-1812 about using the airlift implementation of the codecs or your own jiras about the provider-aware codecs. I strongly agree on having compression codecs that are indep

Re: Decouple parquet-mr compression API from hadoop compression API

2021-06-25 Thread Gabor Szadovszky
know when we are ready and your comments will be > appreciated. > > Thanks, > Xin Dong > > -Original Message- > From: Gabor Szadovszky > Sent: Thursday, June 3, 2021 3:49 PM > To: Parquet Dev > Subject: Re: Decouple parquet-mr compression API from hadoop compress

Re: num_values vs num_rows vs num_nulls

2021-07-15 Thread Gabor Szadovszky
Hi Jorge, Please correct me if I'm wrong but it seems the schema of your column is similar to the following: optional group column1 (LIST) { repeated group list { optional int32 element; } } Based on the specs in the thrift file

Re: statistics null count in nested types

2021-07-16 Thread Gabor Szadovszky
Hi Jorge, Spark (similarly to other JVM-based implementations) is most probably using parquet-mr. parquet-mr counts null values independently from the level in the structure. An additional twist here is we cannot store empty lists but null lists (when the list itself is null) if it is optional. T
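
A minimal sketch of how the three counts discussed in these two threads can differ for the LIST schema quoted in the num_values message above. It assumes the usual Dremel-style repetition/definition levels (maximum definition level 3 and maximum repetition level 1 for `column1.list.element`). The `count_levels` helper and the sample rows are hypothetical and are not parquet-mr code; the null count follows the "independently from the level" behaviour described in the message.

```python
# Hypothetical illustration; not parquet-mr code.
# Schema: optional group column1 (LIST) { repeated group list { optional int32 element; } }
MAX_DEF = 3  # optional group -> repeated group -> optional int32


def count_levels(levels):
    """levels: (repetition_level, definition_level) pairs of one column chunk."""
    num_values = len(levels)                               # every level entry, nulls included
    num_rows = sum(1 for rep, _ in levels if rep == 0)     # repetition level 0 starts a new row
    num_nulls = sum(1 for _, d in levels if d < MAX_DEF)   # null at any level of the structure
    return num_values, num_rows, num_nulls


# Assumed rows: [1, None], None (the list itself is null), [2]
levels = [(0, 3), (1, 2), (0, 0), (0, 3)]
counts = count_levels(levels)
print(counts)
```

Here the null element and the null list both add to the null count, while only three of the four level entries start a row.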

Re: 【vulnerability confirmation】parquet-format-structures-1.12.0

2021-08-16 Thread Gabor Szadovszky
Hi, It is required to shade the thrift library into parquet-format-structures because we use thrift to serialize/deserialize the metadata structures in the parquet files. So, you really don't have any way to change it at runtime. If it is urgent you may build your parquet-mr on your own with an upg

Any Parquet implementations might be impacted by PARQUET-2078

2021-08-27 Thread Gabor Szadovszky
Hi everyone, It turned out that since parquet-mr 1.12.0 in certain conditions we write wrong values into ColumnMetaData.dictionary_page_offset and ColumnChunk.file_offset

Re: Any Parquet implementations might be impacted by PARQUET-2078

2021-08-30 Thread Gabor Szadovszky
ies on the value of ColumnChunk.file_offset at least in cases when the file was written by parquet-mr 1.12.0. I've also created PARQUET-2080 <https://issues.apache.org/jira/browse/PARQUET-2080> to deprecate the field in the format. Regards, Gabor On Fri, Aug 27, 2021 at 11:11 AM Gabor Sza

Re: [VOTE] Release Apache Parquet 1.12.1 RC0

2021-09-13 Thread Gabor Szadovszky
Thanks a lot for working on this, Xinli. Do not forget that you also have a vote :) I have some issues with the content of the release. I would not include the change PARQUET-2043. It is not a bugfix and contains a lot of changes around dependencies. I feel it a bit risky to include it in a patch

Re: Concatenation of parquet files

2021-09-14 Thread Gabor Szadovszky
Hi Pau, I guess attachments are not allowed in the apache lists so we cannot see the image. If the two row groups contain the very same data in the same order and encoded with the same encoding, compressed with the same codec I think, they should be the same binary. I am not sure why you have dif

Re: [VOTE] Release Apache Parquet 1.12.1 RC1

2021-09-14 Thread Gabor Szadovszky
Thanks for the new RC, Xinli. The content seems correct to me. The checksum and sign are correct. Unit tests pass. My vote is +1 (binding) On Mon, Sep 13, 2021 at 8:11 PM Xinli shang wrote: > Hi everyone, > > > I propose the following RC to be released as the official Apache Parquet > 1.12.1 r

Re: Guidelines for Thrift max message size? (Thrift 0.14+)

2021-09-27 Thread Gabor Szadovszky
Hi Antoine, I do not have too much to add just hate getting no replies on the dev list. Parquet-mr doesn't have a release with thrift 0.14+ yet. (The latest release 1.12.1 went out with 0.13.0.) I don't know how common a >100MB file footer is. Since we read the whole footer at once to memory and p

Re: Map Type duplicate keys

2021-10-26 Thread Gabor Szadovszky
Hi Micah, Parquet-MR does not have its own data model (except an example implementation used for unit tests). So it is up to the data model how the values are handled. I think it is possible to store key-value pairs with the same key using the example implementation but there are no such tests. I

Re: unable to get rid of NoSuchMethodError with parquet-cli

2021-11-23 Thread Gabor Szadovszky
Hey, I can reproduce the same on master. It seems the same issue happens with older versions as well. I don't know how we did not find it yet. (Or I am making the same mistake as you :) ). Could you please create a jira about it and continue the discussion there? Thanks a lot, Gabor On Tue, Nov

PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-08 Thread Gabor Szadovszky
Hi, I started working on the jira PARQUET-1025. It is about implementing the new min-max statistics specified in PARQUET-686. After looking in the code deeper I think the spec of the new mi

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-14 Thread Gabor Szadovszky
n the library and the application and as such is much better >> suited to a full-stack implementation like Impala than the parquet-mr + >> separate application stack. Neither relying on the application to calculate >> standards-compliant statistics nor forcing the applica

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-14 Thread Gabor Szadovszky
d comparison on unsigned fields. > It makes it way too easy for developers to shoot themselves in the foot. > > Zoltan > > > On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky < > gabor.szadovs...@cloudera.com> wrote: > >> Hi, >> >> During the deve

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-14 Thread Gabor Szadovszky
> rb > > > On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky < > gabor.szadovs...@cloudera.com <mailto:gabor.szadovs...@cloudera.com>> wrote: > >> Thanks a lot, Zoltan for making it clear. >> >> Meanwhile I’ve discovered that the problem I’ve menti

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-14 Thread Gabor Szadovszky
types. I would even >> consider deprecating the current way of getting UINTs from parquet-mr >> because I think that using regular Java integers for UINTs is a weak point >> of the API. >> >> Zoltan >> >> On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky &l

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

2017-11-14 Thread Gabor Szadovszky
think it is worth the effort to make compareTo > work for all Binary cases when we know we can't fix this for the other > primitives, which must use Comparators anyway. That's why I think the best > solution is to deprecate the use of Comparable#compareTo and remove it. > &

Parquet Sync timing

2017-11-30 Thread Gabor Szadovszky
Hi, Unfortunately, the regular timing of the Parquet Sync meeting (Wednesday, 6PM CET) is not good for me. I don’t want to mess up everyone’s calendar, though. What do you think about having every second meeting on Thursday? Thanks a lot, Gabor

parquet-mr build fail with jdk7; move to jdk8?

2018-01-18 Thread Gabor Szadovszky
Hi, The last commit in parquet-mr master (c6764c4a0848abf1d581e22df8b33e28ee9f2ced) does not build with jdk7, only with jdk8. We did not catch the issue because both Travis and I use jdk8 to build parquet-mr. (The source level in the pom.xml is set to 1.7 so both jdk7 and jdk8 should be able

Re: parquet-mr build fail with jdk7; move to jdk8?

2018-01-18 Thread Gabor Szadovszky
DK7 these days. > > rb > > On Thu, Jan 18, 2018 at 2:31 AM, Gabor Szadovszky < > gabor.szadovs...@cloudera.com> wrote: > >> Hi, >> >> The last commit in parquet-mr master >> (c6764c4a0848abf1d581e22df8b33e28ee9f2ced) >> does not build with

Re: Date and time for next parquet sync

2018-01-23 Thread Gabor Szadovszky
Hi All, As usual, I’m the one who complains… Tuesday/Thursday would be better for me. If one of these days is suitable for everyone I would be happy to participate. If not, I’m fine with going to the next meeting instead. Cheers, Gabor > On 24 Jan 2018, at 00:56, Lars Volker wrote: > > Hi Al

Re: Date and Time for next Parquet sync

2018-02-07 Thread Gabor Szadovszky
Hi All, I would vote on Tuesday but don’t have any problem with skipping this one if Wednesday fits more for others. Cheers, Gabor > On 7 Feb 2018, at 19:00, Lars Volker wrote: > > Hi All, > > I propose to have the next regular Parquet sync next week, either on > Tuesday or Wednesday at 9am

Re: Parquet performance tuning for help

2018-02-14 Thread Gabor Szadovszky
Hi, The old statistics have many problems. The sorting order was not defined properly and by specification it does not care about the logical type which can modify the order (e.g. UTF8 vs. DECIMAL for the primitive type BINARY.) See PARQUET-686
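
A small sketch of the ordering problem this message mentions, assuming that DECIMAL values in a BINARY column are stored as big-endian two's-complement unscaled integers and that UTF8 values are ordered by unsigned byte comparison. The `decode_decimal` helper is hypothetical and only illustrates why a single min/max ordering cannot serve both logical types.

```python
# Hypothetical illustration; not parquet-mr code.
def decode_decimal(raw: bytes) -> int:
    """Unscaled DECIMAL value: big-endian two's complement."""
    return int.from_bytes(raw, byteorder="big", signed=True)


values = [b"\xff", b"\x01"]  # unscaled -1 and 1

bytewise_min = min(values)                     # unsigned lexicographic byte order
decimal_min = min(values, key=decode_decimal)  # numeric order of the decoded values

print(bytewise_min, decimal_min)
```

The same two byte strings yield different minimum values depending on the comparison used, so statistics written without a defined sort order cannot be trusted for filtering.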

New parquet-format release 2.5.0

2018-02-21 Thread Gabor Szadovszky
Hi, I’ve created PARQUET-1234 to track the parquet-format release 2.5.0. Added “format-2.5.0” to the fix version of all the related JIRAs: https://issues.apache.org/jira/issues/?jql=project%20%3D%20parquet%20AND%20fixVersion%20%3D%20format-2.

Subject: [VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-03 Thread Gabor Szadovszky
Hi everyone, Zoltan and I propose the following RC to be released as official Apache Parquet Format 2.5.0 release. The commit id is f0fa7c14a4699581b41d8ba9aff1512663cc0fb4 * This corresponds to the tag: apache-parquet-format-2.5.0 * https://github.com/apache/parquet-format/tree/f0fa7c14a469958

Re: [VOTE] Release Apache Parquet Java 1.10.0 RC0

2018-04-05 Thread Gabor Szadovszky
+1 (non-binding) Validated signature, checksums, matched source tarballs with the git repo. Gabor > On 6 Apr 2018, at 03:05, Ryan Blue wrote: > > And if anyone wants to try out the new command-line interface, you can call > it like this: > > hadoop jar parquet-cli-1.10.0-runtime.jar org.apach

[RESULT][VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-06 Thread Gabor Szadovszky
te after it if everyone agrees. Regards, Gabor > On 4 Apr 2018, at 20:22, Ryan Blue wrote: > > +1 (binding) > > Built & tested, validated checksums and signature. RAT results look fine. > > On Tue, Apr 3, 2018 at 2:57 AM, Gabor Szadovszky < > gabor.szadovs..

[VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-09 Thread Gabor Szadovszky
Hi everyone, Unfortunately, the previous vote has failed due to timeout. Now, Zoltan and I propose a new vote for the same RC to be released as official Apache Parquet Format 2.5.0 release. The commit id is f0fa7c14a4699581b41d8ba9aff1512663cc0fb4 * This corresponds to the tag: apache-parquet-f

Re: [VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-17 Thread Gabor Szadovszky
>>> >>> Checked this for the last vote. >>> >>> On Mon, Apr 9, 2018 at 4:53 AM, Gabor Szadovszky < >>> gabor.szadovs...@cloudera.com> wrote: >>> >>>> Hi everyone, >>>> >>>> Unfortunatel

[RESULT][VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-18 Thread Gabor Szadovszky
n Ivanfi wrote: > > +1 (binding) > > Checked sigs, built and tested. > > On Tue, Apr 17, 2018 at 1:39 PM Gabor Szadovszky < > gabor.szadovs...@cloudera.com> wrote: > >> Hi everyone, >> >> We reached the required 3 binding +1 votes. As there was no de

Releasing parquet-mr 1.8.3

2018-04-19 Thread Gabor Szadovszky
Hi All, Zoltan and I are planning to do the maintenance release 1.8.3 of parquet-mr. I’ve created the usual JIRA for it: PARQUET-1277 We would like to backport PARQUET-1217 and PARQUET-1246

Re: Releasing parquet-mr 1.8.3

2018-04-21 Thread Gabor Szadovszky
changes to see what other bug fixes >> should be included? I think there was one about closing files that would be >> useful. >> >> On Thu, Apr 19, 2018 at 11:32 PM, Gabor Szadovszky < >> gabor.szadovs...@cloudera.com> wrote: >> >>> Hi All

Re: Releasing parquet-mr 1.8.3

2018-04-21 Thread Gabor Szadovszky
in the CHANGES.md. I’ll update both based on git. Gabor > On 21 Apr 2018, at 13:49, Gabor Szadovszky > wrote: > > Thanks a lot, Ryan. > > I’ve already made a change on the 1.8 branch to use jdk7 in Travis. We’ll > also do the release build with java7. > > The commit fo

[VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-04 Thread Gabor Szadovszky
Hi everyone, Zoltan and I propose the following RC to be released as official Apache Parquet MR 1.8.3 release. The commit id is aef7230e114214b7cc962a8f3fc5aeed6ce80828 * This corresponds to the tag: apache-parquet-1.8.3 * https://github.com/apache/parquet-mr/tree/aef7230e114214b7cc962a8f3fc5ae

Re: [VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-07 Thread Gabor Szadovszky
> https://gist.github.com/xhochy/fd62748ba8c300a5f238a80e8bacfc90 > > I can provide more information if you can tell me what you would need. > > Uwe > > On Fri, May 4, 2018, at 2:12 PM, Gabor Szadovszky wrote: >> Hi everyone, >> >> Zoltan and I propose the following RC to be released

Re: [VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-09 Thread Gabor Szadovszky
binding) >>>> >>>> * Built and tested on Debian 8 >>>> * verified sha1 >>>> * verified signature >>>> >>>> was quite a hassle to build with manually installing protobuf and >> thrift. >>>> For newer releases,

Re: [VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-10 Thread Gabor Szadovszky
; wrote: > Thanks, Gabor! Looks good to me. > > Changing my vote to +1. > > rb > > On Wed, May 9, 2018 at 5:20 AM, Gabor Szadovszky < > gabor.szadovs...@cloudera.com <mailto:gabor.szadovs...@cloudera.com>> wrote: > > > Created PARQUET-1294 <https:/

parquet-mr exposing public API

2018-05-10 Thread Gabor Szadovszky
Hi Everyone, Unfortunately, I was not able to participate in the last Parquet sync where you discussed how to separate the public Java code that is public for internal use only from the code that is public to be exposed to the user. I am currently adding some new code related to the column indexes and woul

Re: parquet-mr exposing public API

2018-05-10 Thread Gabor Szadovszky
ckages for some reason. > > For the yetus annotations, what is the benefit of using these? Is there > some integration with javadoc tools? What are the annotations that you're > proposing to use and what do they mean? > > rb > > On Thu, May 10, 2018 at 6:44 AM, Gabor

[RESULT][VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-11 Thread Gabor Szadovszky
Hi Everyone, With 3 binding +1 and 1 non-binding +1 votes this vote PASSES. Thanks all for voting. We’ll release the artifacts and send the announcement. Regards, Gabor > On 10 May 2018, at 10:06, Gabor Szadovszky > wrote: > > Hi Everyone, > > We reached the required 3 bi

Re: parquet-mr exposing public API

2018-05-15 Thread Gabor Szadovszky
ld you please check and comment about my proposal? Cheers, Gabor > On 11 May 2018, at 07:12, Gabor Szadovszky > wrote: > > Hi, > > I completely agree on using private and package-protected visibility wherever > it is possible. Let me explain an actual example where it is not

Re: [Announce] new Parquet committer Gábor Szádovszky

2018-05-16 Thread Gabor Szadovszky
Thanks a lot, Julien. I'm proud to be a Parquet committer. :) On Wed, May 16, 2018 at 6:15 AM, Julien wrote: > We are happy to announce that Gábor has accepted to become a Parquet > committer. > Welcome Gábor! > Julien

Permissions for committers

2018-05-22 Thread Gabor Szadovszky
Hi, Could someone help me to have the required permissions on github so I can push commits? Thanks a lot, Gabor

Re: Permissions for committers

2018-05-22 Thread Gabor Szadovszky
: >> >> You don’t push commits to GitHub. You push them to the Apache git and they >> get replicated to GitHub >> >>> On Tue, May 22, 2018 at 09:37 Julien Le Dem wrote: >>> >>> Do you have your github id configured in id.apache.org ? >

Estimated row-group size is significantly higher than the written one

2018-06-21 Thread Gabor Szadovszky
Hi All, One of our customers faced the following issue. parquet.block.size is configured to 128M. (parquet.writer.max-padding is left with the default 8M.) On average 7 row-groups are generated in one block with the sizes ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, ~4M. By increasing the padding to e.g. 60M

Re: Estimated row-group size is significantly higher than the written one

2018-06-25 Thread Gabor Szadovszky
ssion ratio of the 5 pages to estimate the final > size. We'd probably want to use some overhead value for the header. And, > we'd want to separate the amount of buffered data from our row group size > estimate, which are currently the same thing. > > rb > > On Thu, J

Re: New parquet contributor

2018-08-07 Thread Gabor Szadovszky
Hi Anatoli, I’ve added you to the contributors list in JIRA. Cheers, Gabor On Tue, Aug 7, 2018 at 5:04 PM Anatoli Shein wrote: > Hi, > > Could you please add me to contributors of Parquet? My apache username is > anatoli.shein > > Thank you, > Anatoli Shein >

Status of column index in parquet-mr

2018-08-18 Thread Gabor Szadovszky
Hi, The implementation of column index (writing and filtering) is almost done. All the implementation work was done under the PARQUET-1201 . Subtasks were used to decompose the work. Every change made was done on the separate feature branch colum

Re: Status of column index in parquet-mr

2018-08-21 Thread Gabor Szadovszky
Hi, Row alignment in my wording was the 1st definition in Uwe's mail. From the column index based filtering point of view the implementation and the logic would be much simpler in this case but I do understand that the page sizes would not be optimal. It seems, the community is against the row align

PARQUET-1399: Move parquet-mr related code from parquet-format

2018-08-22 Thread Gabor Szadovszky
Hi, I've just created this issue and worked on it. Created the PR https://github.com/apache/parquet-mr/pull/517 I'd like to keep the history of the files so the PR contains a merge commit and then another one to have it working as a submodule. What do you think? If you guys agree with the approac

Re: Status of column index in parquet-mr

2018-08-22 Thread Gabor Szadovszky
lumn-indexes <https://github.com/apache/parquet-mr/tree/column-indexes>. Also, feel free to comment on any modification on the branch itself. Any opinions about the improvement idea for writing column indexes only if it would result better filtering? Thanks a lot, Gabor On Tue, Aug 21, 201

Re: PARQUET-1399: Move parquet-mr related code from parquet-format

2018-08-23 Thread Gabor Szadovszky
Thanks a lot, Uwe. I'll do it manually after my change is working. But it is not... :( So, "mvn clean install" works just fine, every test passes. While "mvn clean install -DskipTests && mvn test" fails. That's why Travis fails as well. One stacktrace I've investigated (BTW reproducible in my IDE)

Re: [VOTE] Release Apache Parquet format 2.6.0 RC0

2018-09-27 Thread Gabor Szadovszky
+1 (non-binding) - Checked source tarball content - Checked checksums, signature Cheers, Gabor On Thu, Sep 27, 2018 at 5:10 PM Zoltan Ivanfi wrote: > +1 (binding) > > - contents look good > - units tests pass > - checksums match > - signature matches > > Thanks, > > Zoltan > > On Thu, Sep 27,

Is parquet-mr thread-safe?

2018-11-09 Thread Gabor Szadovszky
Hi All, After finally pushing column-indexes we've got the issue PARQUET-1456. ColumnIndexBuilder uses a non-thread-safe cache to build ColumnIndex objects while reading. I was not aware that parquet-mr might be used in a concurrent way. Never seen any unit tests that suggest we are prepared for t

Re: [VOTE] Release Apache Parquet 1.11.0 RC0

2018-11-22 Thread Gabor Szadovszky
Hi, Verified source tarball checksums and content. All are correct. Unit tests pass. +1 (non-binding) Cheers, Gabor On Wed, Nov 21, 2018 at 7:11 PM Zoltan Ivanfi wrote: > Dear Parquet Users and Developers, > > I propose the following RC to be released as the official Apache > Parquet 1.11.0 r

Re: [VOTE] Release Apache Parquet 1.11.0 RC1

2018-11-25 Thread Gabor Szadovszky
Checked source tarball content and the related checksum/signature. All are correct. Unit tests pass. +1 (non-binding) Cheers, Gabor On Fri, Nov 23, 2018 at 2:38 PM Zoltan Ivanfi wrote: > Dear Parquet Users and Developers, > > I propose the following RC to be released as the official Apache > Pa

Re: [VOTE] Release Apache Parquet 1.11.0 RC2

2018-12-14 Thread Gabor Szadovszky
Hi, Checked tarball: checksum/signature are correct. Content is correct based on release tag. Unit tests pass. +1 (non-binding) On Thu, Dec 13, 2018 at 9:17 PM Zoltan Ivanfi wrote: > Dear Parquet Users and Developers, > > I propose the following RC to be released as the official Apache > Parqu

Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-10 Thread Gabor Szadovszky
Hi, Checked tarball: checksum/signature are correct. Content is correct based on release tag. Unit tests pass. +1 (non-binding) Cheers, Gabor On Wed, Jan 9, 2019 at 4:51 PM Zoltan Ivanfi wrote: > Dear Parquet Users and Developers, > > I propose the following RC to be released as the official

Re: [Discussion] How to build bloom filter in parquet

2019-01-17 Thread Gabor Szadovszky
Thanks for raising this, Junjie. One more topic worth adding: Which columns do we want to write bloom filters for? May it depend on the type? Is bloom filter required if we have dictionary? Is bloom filter required if the column is ordered and we have column indexes? (etc.) On Thu, Jan 17, 2019

Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-22 Thread Gabor Szadovszky
; On Sat, Jan 12, 2019 at 3:07 AM 俊杰陈 wrote: > > > > > > > +1 (non-binding) > > > > * contents looks good > > > > * unit tests passed > > > > > > > > > > > > Zoltan Ivanfi 于2019年1月11日周五 下午9:31写道: > &g

Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-22 Thread Gabor Szadovszky
Sorry, Uwe. I wanted to reply to you. :) On Tue, Jan 22, 2019 at 9:44 AM Gabor Szadovszky wrote: > Hi Wes, > > Thanks for checking the RC and voting. > I would like to highlight that the mentioned issues are also reproducible > on Linux with the parquet release 1.10.0 s

Re: [DISCUSS] Bump Apache Thrift dependency to 0.12.0

2019-01-25 Thread Gabor Szadovszky
May it cause any problems that we write the thrift structures in the parquet files (footer, page headers etc.) with a different version than before? It might require some tests to check whether the older readers are able to read the files written with the new thrift. Any thoughts? On Thu, Jan 24, 2019 at 8:49 PM

Re: [DISCUSS] Bump Apache Thrift dependency to 0.12.0

2019-01-28 Thread Gabor Szadovszky
I: parquet-cpp already uses Thrift 0.12 in some of its binary > > > > distributions. So when there is a problem with old readers, one has > to > > > > notice that we already have files out in the wild. > > > > > > > > Cheers > > >

Re: [DISCUSS] Remove old modules?

2019-01-29 Thread Gabor Szadovszky
Hi, I agree with Fokko. It would be nice to drop these modules but only in the next major release. On Tue, Jan 29, 2019 at 11:57 AM Uwe L. Korn wrote: > Hello Fokko, > > I have put up a PR for the Scala update > https://github.com/apache/parquet-mr/pull/605. parquet-scrooge fails due > to a Thr

Re: [VOTE] Release Apache Parquet 1.10.1 RC0

2019-01-29 Thread Gabor Szadovszky
Hi Ryan, Checked the tarball: checksum/signature are correct. Content is correct based on the release tag. Unit tests pass. +1 (non-binding) Cheers, Gabor On Mon, Jan 28, 2019 at 11:08 PM Ryan Blue wrote: > Hi everyone, > > I propose the following RC to be released as official Apache Parquet

[VOTE] Release Apache Parquet 1.11.0 RC4

2019-02-14 Thread Gabor Szadovszky
Dear Parquet Users and Developers, I propose the following RC to be released as the official Apache Parquet 1.11.0 release: The commit id is 22a9f54d0a537bc6153cc485a3ed6fab9204b337 * This corresponds to the tag: apache-parquet-1.11.0 * https://github.com/apache/parquet-mr/tree/22a9f54d0a537bc615

Re: [VOTE] Release Apache Parquet 1.11.0 RC4

2019-02-15 Thread Gabor Szadovszky
PARK-26874 > > We should investigate that before moving forward. > > On Thu, Feb 14, 2019 at 8:26 AM Gabor Szadovszky wrote: > > > Dear Parquet Users and Developers, > > > > I propose the following RC to be released as the official Apache &g

Reverting the merge blocks command feature

2019-02-19 Thread Gabor Szadovszky
Hi All, While working on a fix I've discovered that the recently added (since 1.10.0) feature PARQUET-1381 is not properly implemented and causes some unit test failures with my independent fix. After a deeper investigation I think the design of this feature is conceptually incompatible with

Re: Reverting the merge blocks command feature

2019-02-19 Thread Gabor Szadovszky
Sorry, wrong PR. So, see PARQUET-1381 <https://issues.apache.org/jira/browse/PARQUET-1381> and PR #621 <https://github.com/apache/parquet-mr/pull/621>. On Tue, Feb 19, 2019 at 5:44 PM Gabor Szadovszky wrote: > Hi All, > > During working on a fix I've discovered that

Re: Reverting the merge blocks command feature

2019-02-21 Thread Gabor Szadovszky
Hi All, I'm planning to push the revert tomorrow if there are no objections. Cheers, Gabor On Tue, Feb 19, 2019 at 6:02 PM Gabor Szadovszky wrote: > Sorry, wrong PR. So, see PARQUET-1381 > <https://issues.apache.org/jira/browse/PARQUET-1381> and PR #621 > <https://githu

Re: Reverting the merge blocks command feature

2019-02-21 Thread Gabor Szadovszky
ntation of the feature. For the details of this investigation see the PARQUET-1381. On Thu, Feb 21, 2019 at 6:43 PM Ryan Blue wrote: > Was the motivation for this the bug that was found with PARQUET-1414? How > did we catch this? > > On Thu, Feb 21, 2019 at 4:56 AM Gabor Szadovszk

Re: Parquet-mr - ParquetFileReader IO and memory foot-print

2019-03-04 Thread Gabor Szadovszky
Hi Tomer, parquet-mr does not support lazy reading currently. The reason is performance. The pages for one column are written one after another (aka column chunks) and then similarly the other pages for the other columns. It means if you would like to keep only one page per column in the memory it

[VOTE] Release Apache Parquet 1.11.0 RC5

2019-03-13 Thread Gabor Szadovszky
Dear Parquet Users and Developers, I propose the following RC to be released as the official Apache Parquet 1.11.0 release: The commit id is e85dbd3774038d7f42d69c14fcd9884ff5a3cb48 * This corresponds to the tag: apache-parquet-1.11.0 * https://github.com/apache/parquet-mr/tree/e85dbd3774038d7f42

Re: [VOTE] Release Apache Parquet 1.11.0 RC5

2019-03-18 Thread Gabor Szadovszky
ts due to the way how shading works. Do you have any idea what can we do to prevent such issues? Thanks a lot, Gabor On Wed, Mar 13, 2019 at 1:57 PM Gabor Szadovszky wrote: > Dear Parquet Users and Developers, > > I propose the following RC to be released as the official Apache &

[VOTE] Release Apache Parquet 1.11.0 RC6

2019-03-19 Thread Gabor Szadovszky
Dear Parquet Users and Developers, I propose the following RC to be released as the official Apache Parquet 1.11.0 release: The commit id is 9756b0e2b35437a09716707a81e2ac0c187112ed * This corresponds to the tag: apache-parquet-1.11.0 * https://github.com/apache/parquet-mr/tree/9756b0e2b35437a097

Re: [VOTE] Release Apache Parquet 1.11.0 RC6

2019-04-16 Thread Gabor Szadovszky
ssue about Hadoop-lzo, but that is present in > the > > > 1.10.1 release also. > > > > > > Andy. > > > > > > > > > On 3/20/19, 7:50 AM, "Zoltan Ivanfi" > > wrote: > > > > > > CAUTION – UNVE

Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-19 Thread Gabor Szadovszky
Hi Ryan, I’ve downloaded the tar and checked the signature and the checksums. SHA and ASC are fine. MD5 is not and the content does not seem to be a common MD5 either: apache-parquet-1.8.2.tar.gz: B3 74 39 95 BE E6 16 11 8C 28 F3 24 59 86 84 BA The artifacts on Nexus are good with all the rela

Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-24 Thread Gabor Szadovszky
k the octets > the hash matches.. > > [blue@work Downloads]$ md5sum apache-parquet-1.8.2.tar.gz > b3743995bee616118c28f324598684ba apache-parquet-1.8.2.tar.gz > > rb > > > On Thu, Jan 19, 2017 at 8:06 AM, Gabor Szadovszky < > gabor.szadovs...@cloudera.com>
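
A short sketch of the comparison discussed in this thread: the space-separated uppercase octets from the first message and the `md5sum` output quoted in the reply are the same digest in two notations. The `normalize_digest` helper is hypothetical; the two digest strings are copied from the messages above.

```python
# Hypothetical helper for comparing two textual notations of the same digest.
def normalize_digest(text: str) -> str:
    """Drop all whitespace and lowercase the hex digits."""
    return "".join(text.split()).lower()


spaced_octets = "B3 74 39 95 BE E6 16 11 8C 28 F3 24 59 86 84 BA"
md5sum_output = "b3743995bee616118c28f324598684ba"

same_digest = normalize_digest(spaced_octets) == md5sum_output
print(same_digest)
```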

Re: New PMC member: Gabor Szadovszky

2019-07-01 Thread Gabor Szadovszky
Fokko, Gabor, > > > > > > > > > > > > > > > > Sorry I'm late to the party:), but - congratulations!!! > Excellent > > > > > news, > > > > > > > the > > > > > > > > best thing I've he

Re: [VOTE] Parquet Bloom filter spec sign-off

2019-07-15 Thread Gabor Szadovszky
Hi Junjie, Sorry for bringing up this a bit late but I have some problems with the format update. The parquet.thrift file is updated to have the bloom filters as a page (just as dictionaries and data pages). Meanwhile, the spec (BloomFilter.md) says that the bloom filter is stored near the footer.

Re: [VOTE] Parquet Bloom filter spec sign-off

2019-07-17 Thread Gabor Szadovszky
oom_filter_page_header in PageHeader structure, while > the BloomFilterHeader is kept intentionally for convenience. Since the > spec and the thrift should be aligned with each other eventually, so > the vote target is both of them. > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor

Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

2019-07-25 Thread Gabor Szadovszky
Hi Martin, I've removed from CC those who are members of the parquet dev list. I also suggest writing to the dev list only and letting the others subscribe to it if they are interested, or follow the discussion at https://lists.apache.org/list.html?dev@parquet.apache.org. Thanks a lot for this sum

Re: [VOTE] Release Apache Parquet Format 2.7.0 RC0

2019-09-26 Thread Gabor Szadovszky
Checksums/signatures are correct. Tarball content is correct. Unit tests pass. +1 (binding) On Thu, Sep 26, 2019 at 6:02 AM 俊杰陈 wrote: > +1, downloaded, verified the signature key ID is A4B2E9B5 which is > from Ryan, ran mvn install successfully. > > On Thu, Sep 26, 2019 at 11:20 AM Jim Apple

Updating parquet web site

2019-10-18 Thread Gabor Szadovszky
Dear All, There is some content on our web site that has been due for an update for a while. To spin up the process it would be great if we could follow the same git PR process we already have for our existing git repos. Jim has already created PARQUET-1675

Working on 1.11.0 RC7

2019-10-18 Thread Gabor Szadovszky
Dear All, In the next couple of weeks I'll be working on the next release candidate of 1.11.0. If you have any ongoing issues that you think will be nice to have in 1.11.0, please set "Fix Version/s" accordingly. (If it is not really targeted to 1.11.0, please, remove the related tag.) If you thin

Re: Updating parquet web site

2019-10-18 Thread Gabor Szadovszky
Hi Uwe, parquet-site sounds good to me. Cheers, Gabor On Fri, Oct 18, 2019 at 10:19 AM Uwe L. Korn wrote: > Hello Gabor, > > can we call this for clarity https://github.com/apache/parquet-site ? > > Thanks > Uwe > > On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadov

Re: Working on 1.11.0 RC7

2019-10-18 Thread Gabor Szadovszky
or 1.11.0? Please let me know. > > Cheers, Fokko > > Op vr 18 okt. 2019 om 09:55 schreef Gabor Szadovszky : > > > Dear All, > > > > In the next couple of weeks I'll be working on the next release candidate > > of 1.11.0. If you have any ongoing issues that you

Re: PARQUET-1441/parquet-mr #560 in 1.11.0 release?

2019-10-24 Thread Gabor Szadovszky
Thanks Fokko for answering. We forgot to resolve the jira when we pushed the PR. I've just resolved it. On Thu, Oct 24, 2019 at 10:35 AM Driesprong, Fokko wrote: > Hi Michael, > > Thanks for asking. The commit will be included from Parquet 1.11 as this > version will be tagged from master. It looks

release process - using rc tags

2019-10-30 Thread Gabor Szadovszky
Dear All, Our current tagging policy in the release process requires using the same tag for all the release candidates, which means at RC2 we remove the tag from the RC1 head and add it again to the RC2 head and so on. I think it is not a good practice. It is hard to track RCs and rewriting git history is usu

Re: release process - using rc tags

2019-11-05 Thread Gabor Szadovszky
cess for Iceberg and that's what we > > > > decided > > > > > to go with in that community. Here are the docs if you'd like to > copy > > > > them > > > > > to update the Parquet docs: > > > > >

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-11-07 Thread Gabor Szadovszky
+1 for adding BYTE_STREAM_SPLIT encoding to parquet-format. On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney wrote: > +1 from me on adding the FP encoding > > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin wrote: > > > > Hello all, > > > > > > thanks for the vote Ryan and to Wes for the feedback. > >

[VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-13 Thread Gabor Szadovszky
Hi everyone, I propose the following RC to be released as official Apache Parquet 1.11.0 release. The commit id is 18519eb8e059865652eee3ff0e8593f126701da4 * This corresponds to the tag: apache-parquet-1.11.0-rc7 * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-18 Thread Gabor Szadovszky
> > > +1 > > Verified signature, checksum and ran mvn install successfully. > > > > Wang, Yuming 于2019年11月14日周四 下午2:05写道: > > > > > > +1 > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" > > -Phadoop-3.

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-18 Thread Gabor Szadovszky
s doesn't introduce an > unreasonable amount of overhead. In some cases, it should actually be > smaller since the column indexes are truncated and page stats are not. > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky > wrote: > > > Hi Fokko, > > > > For the firs

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-19 Thread Gabor Szadovszky
ding parquet-1.11.x branch. I > expected to compare the release with the branch and tag but I found the > branch is not present. > > Thanks, > Ismaël > > > > On Tue, Nov 19, 2019 at 8:35 AM Gabor Szadovszky wrote: > > > Hi Ryan, > > > > It is not easy to c

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-19 Thread Gabor Szadovszky
Not changing the version to 1.12 was also intentional. Until we have a successful vote for 1.11.0 it is not released and therefore we are still working on 1.11. I'll upgrade the version to 1.12.0-SNAPSHOT after 1.11.0 is released. On Tue, Nov 19, 2019 at 11:04 AM Gabor Szadovszky wrote:

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-20 Thread Gabor Szadovszky
+1 > (non-binding). > > Cheers, Fokko > > Op di 19 nov. 2019 om 18:03 schreef Ryan Blue > > > Gabor, what I meant was: have we tried this with real data to see the > > effect? I think those results would be helpful. > > > > On Mon, Nov 18, 2019 at 11:35 PM Ga
