Re: Fwd: [C++] Parquet and Arrow overlap

2024-04-24 Thread Uwe L. Korn
> Should we consider
> Parquet developers from projects other than parquet-mr as Parquet committers?

We are doing this (speaking as a Parquet PMC who didn't work on parquet-mr, but 
parquet-cpp).

Best
Uwe

On Wed, Apr 24, 2024, at 2:38 PM, Gang Wu wrote:
> +1 for moving parquet-cpp issues from Apache Jira to Arrow's GitHub issues.
>
> Besides, I want to echo Will's question in the thread. Should we consider
> Parquet developers from projects other than parquet-mr as Parquet committers?
> Currently the apache/parquet-format and apache/parquet-testing repositories are
> solely governed by the Apache Parquet PMC. It would be better for the entire
> Parquet community if developers with sufficient contributions to open source
> Parquet projects (including but not limited to parquet-cpp, arrow-rs, cudf,
> etc.) could be considered for Parquet committer and PMC roles.
>
> Best,
> Gang
>
> On Wed, Apr 24, 2024 at 7:04 PM Uwe L. Korn  wrote:
>
>> I would be very supportive of this move. The Parquet C++ development has
>> been under the umbrella of the Arrow repository for more than five (six?)
>> years now. Thus, the issues should also be aligned with the Arrow project.
>>
>> Uwe
>>
>> On Tue, Apr 23, 2024, at 8:27 PM, Rok Mihevc wrote:
>> > Bumping this thread again to see if there is the will to call for a vote
>> > and move parquet-cpp issues from Apache Jira to Arrow's GitHub issues, as
>> > was done for Arrow.
>> > I'm willing to do the move as I already did it for Arrow.
>> >
>> > Rok
>> >
>> > On Sat, Apr 15, 2023 at 4:53 AM Micah Kornfield 
>> > wrote:
>> >
>> >> Bumping this thread again to see if any Parquet PMC members can chime
>> >> in/maybe start a formal vote to move governance of Parquet-CPP under the
>> >> Arrow umbrella.
>> >>
>> >> -Micah
>> >>
>> >> On 2023/02/02 10:34:25 Antoine Pitrou wrote:
>> >> >
>> >> >
>> >> > Hi Will,
>> >> >
>> >> > On 01/02/2023 at 20:27, Will Jones wrote:
>> >> > >
>> >> > > First, it's not obvious where issues are supposed to be opened: in
>> >> Parquet
>> >> > > Jira or Arrow GitHub issues. Looking back at some of the original
>> >> > > discussion, it looks like the intention was
>> >> > >
>> >> > > * use PARQUET-XXX for issues relating to Parquet core
>> >> > >> * use ARROW-XXX for issues relating to Arrow's consumption of
>> Parquet
>> >> > >> core (e.g. changes that are in parquet/arrow right now)
>> >> > >>
>> >> > > The README for the old parquet-cpp repo [3] states instead in its
>> >> > > migration note:
>> >> > >
>> >> > >   JIRA issues should continue to be opened in the PARQUET JIRA
>> project.
>> >> > >
>> >> > > Either way, it doesn't seem like this process is obvious to people.
>> >> Perhaps
>> >> > > we could clarify this and add notices to Arrow's GitHub issues
>> >> template?
>> >> >
>> >> > I agree we should clarify this. I have no personal preference, but I
>> >> will note
>> >> > that Github issues decrease friction as having a GH account is already
>> >> necessary
>> >> > for submitting PRs.
>> >> >
>> >> > > Second, committer status is a little unclear. I am a committer on
>> >> Arrow,
>> >> > > but not on Parquet right now. Does that mean I should only merge
>> >> Parquet
>> >> > > C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
>> >> > > Parquet changes at all?
>> >> >
>> >> > Since Parquet C++ is part of Arrow C++, you are allowed to merge
>> Parquet
>> >> C++
>> >> > changes. As always you should ensure you have sufficient understanding
>> >> of the
>> >> > contribution, and that it follows established practices:
>> >> > https://arrow.apache.org/docs/dev/developers/reviewing.html
>> >> >
>> >> > > Also, are the contributions to Arrow C++ Parquet being actively
>> >> reviewed
>> >> > > for potential new committers?
>> >> >
>> >> > I would certainly do so.
>> >> >
>> >> > Regards
>> >> >
>> >> > Antoine.
>> >> >
>> >> >
>> >>
>>


Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread Uwe L. Korn
+1 (binding)

On Thu, Mar 7, 2024, at 3:08 PM, Gábor Szádovszky wrote:
> +1 (binding) - Not sure if "binding" matters for this case
> Thanks, Antoine, for working on this!
>
> Antoine Pitrou  wrote (on Thu, Mar 7, 2024, at
> 14:18):
>
>>
>> Hello,
>>
>> As discussed previously on this ML [1], I am proposing to expand
>> the types supported by the BYTE_STREAM_SPLIT encoding. The currently
>> supported types are FLOAT and DOUBLE. The proposal expands the
>> supported types to INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
>>
>> The format addition is tracked on JIRA where some measurements on
>> sample data are also published and discussed [2].
>>
>> (please note that the original ML thread only discussed expanding
>> to FIXED_LEN_BYTE_ARRAY; discussion on the JIRA issue led to the
>> conclusion that it would also be beneficial to cover INT32 and INT64)
>>
>> The format additions are submitted as a PR in [3].
>> A data file for integration testing is submitted in [4].
>> An implementation for Parquet C++ is submitted in [5].
>> An implementation for parquet-mr is submitted in [6].
>>
>> This vote will be open for at least 1 week.
>>
>> +1: Accept the format additions
>> +0: ...
>> -1: Reject the format additions because ...
>>
>> Regards
>>
>> Antoine.
>>
>>
>> [1] https://lists.apache.org/thread/5on7rnc141jnw2cdxtsfgm5xhhdmsb4q
>> [2] https://issues.apache.org/jira/browse/PARQUET-2414
>> [3] https://github.com/apache/parquet-format/pull/229
>> [4] https://github.com/apache/parquet-testing/pull/46
>> [5] https://github.com/apache/arrow/pull/40094
>> [6] https://github.com/apache/parquet-mr/pull/1291
>>
>>
>>
>>
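
For anyone unfamiliar with the encoding under vote, here is a minimal sketch in
Python of what BYTE_STREAM_SPLIT does once extended to INT32 (an illustration
only, with made-up helper names; not the parquet-cpp implementation): the i-th
byte of every value is scattered into stream i, so the slowly-varying
high-order bytes end up adjacent and compress better.

    import struct

    def byte_stream_split_encode(values, width=4):
        # Pack the INT32 values little-endian, then scatter byte i of each
        # value into stream i; the streams are stored back to back.
        raw = b"".join(struct.pack("<i", v) for v in values)
        return b"".join(raw[i::width] for i in range(width))

    def byte_stream_split_decode(data, width=4):
        # Inverse: gather byte j of every stream to rebuild value j.
        n = len(data) // width
        streams = [data[i * n:(i + 1) * n] for i in range(width)]
        raw = bytes(streams[i][j] for j in range(n) for i in range(width))
        return [struct.unpack("<i", raw[4 * j:4 * j + 4])[0] for j in range(n)]

    values = [1, 2, 3, 250000]
    assert byte_stream_split_decode(byte_stream_split_encode(values)) == values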


Re: parquet-format status

2024-03-07 Thread Uwe L. Korn
I can strongly second Antoine's response here. It is a small but very important
repository holding crucial information for the project.

Best
Uwe

On Thu, Mar 7, 2024, at 1:17 PM, Antoine Pitrou wrote:
> Hello,
>
> I am surprised that this is suggesting to deprecate or delete a
> repository just because a website building procedure isn't properly
> setup to deal with it.
>
> ISTM the "right" solution would be for the Parquet website to
> automatically update its contents based on the latest released version
> of parquet-format. Perhaps using a git submodule or something.
>
> Regards
>
> Antoine.
>
>
> On Tue, 5 Mar 2024 21:30:45 -0500
> Vinoo Ganesh 
> wrote:
>> Hi Parquet Dev -
>> 
>> There have been some conversations about content stored on the
>> parquet-format github repo vs. the website. Doing a cursory pass of the
>> parquet-format  repo, it looks
>> like, other than the markdown documentation stored in the repo, most of the
>> core code was marked as deprecated here:
>> https://github.com/apache/parquet-format/pull/105, content was moved to
>> parquet-mr, and that entire repo really only exists to host this file:
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
>> It's possible I'm missing something, but is my understanding correct?
>> 
>> If so, would it make sense to just deprecate parquet-format as a repo, move
>> the content to be exclusively hosted on parquet-site, and host the thrift
>> file elsewhere? This would solve the content duplication problem between
>> parquet format and the website, and would cut down on having to manage a
>> separate repo. I know there is benefit to having comments/discussions on
>> PRs or issues on the repo, but we could also pretty easily port this to the
>> site.
>> 
>> I'm sure this proposal will elicit some strong responses, but wanted to see
>> if anyone had insights here / if I'm missing anything.
>> 
>> Thanks, Vinoo
>> 
>> 
>> 
>>


Re: [VOTE][Format] Add Float16 type to specification

2023-10-10 Thread Uwe L. Korn
+1 (binding)

On Sat, Oct 7, 2023, at 5:49 AM, Daniel Weeks wrote:
> +1
>
> On Fri, Oct 6, 2023, 8:33 PM Gang Wu  wrote:
>
>> +1 (non-binding)
>>
>> Best,
>> Gang
>>
>> On Sat, Oct 7, 2023 at 11:05 AM Micah Kornfield 
>> wrote:
>>
>> > I'm +1 (non-binding) for the proposal in general.
>> >
>> > I do have a concern that we should be implementing
>> > https://issues.apache.org/jira/browse/PARQUET-2182 (ignoring stats for
>> > logical types the reader doesn't understand) and its equivalent in other
>> > libraries first, but given potential low usage we can possibly do that
>> as a
>> > follow-up.
>> >
>> >
>> >
>> >
>> > On Fri, Oct 6, 2023 at 12:50 AM Gábor Szádovszky 
>> wrote:
>> >
>> > > +1
>> > >
>> > > About the naming. We already use INT_8, INT_16 etc. for logical types
>> for
>> > > integer values. What do you think about FLOAT_16 to be consistent?
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > > On 2023/10/05 22:17:13 Ryan Blue wrote:
>> > > > +1
>> > > >
>> > > > I'm all for adding a 2-byte floating point representation since even
>> > > 4-byte
>> > > > floats are quite expensive to store.
>> > > >
>> > > > On Thu, Oct 5, 2023 at 1:43 PM Xinli shang 
>> > > wrote:
>> > > >
>> > > > > +1
>> > > > >
>> > > > > On Thu, Oct 5, 2023 at 1:32 PM Antoine Pitrou 
>> > > wrote:
>> > > > >
>> > > > > >
>> > > > > > Hello,
>> > > > > >
>> > > > > > +1 from me (non-binding).
>> > > > > >
>> > > > > > Regards
>> > > > > >
>> > > > > > Antoine.
>> > > > > >
>> > > > > >
>> > > > > > On Wed, 4 Oct 2023 16:14:00 -0400
>> > > > > > Ben Harkins 
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi everyone,
>> > > > > > >
>> > > > > > > I would like to propose adding a half-precision floating point
>> > > type to
>> > > > > > > the Parquet format specification, in accordance with the active
>> > > > > > > proposal here:
>> > > > > > >
>> > > > > > >
>> > > > > > >- https://github.com/apache/parquet-format/pull/184
>> > > > > > >
>> > > > > > > To summarize, the current proposal would introduce a Float16
>> > > logical
>> > > > > > > type, represented by a little-endian 2-byte FixedLenByteArray.
>> > The
>> > > > > > > value's encoding would adhere to the IEEE-754 standard [1].
>> > > > > > > Furthermore, implementations should ensure that any value
>> > > comparisons
>> > > > > > > and ordering requirements (mainly for column statistics)
>> emulate
>> > > the
>> > > > > > > behavior of native (i.e. physical) floating point types.
>> > > > > > >
>> > > > > > > As for how this would look in practice, there are currently
>> > several
>> > > > > > > implementations of this proposal that are more or less
>> complete:
>> > > > > > >
>> > > > > > >
>> > > > > > >- C++ (and Python):
>> > https://github.com/apache/arrow/pull/36073
>> > > > > > >- Java: https://github.com/apache/parquet-mr/pull/1142
>> > > > > > >- Go: https://github.com/apache/arrow/pull/37599
>> > > > > > >
>> > > > > > > Of course, we're prepared to make adjustments to the
>> > > implementations as
>> > > > > > > needed, since the format additions will need to be approved
>> > before
>> > > > > those
>> > > > > > > PRs are merged. I should also note that naming conventions
>> > haven't
>> > > been
>> > > > > > > extensively discussed, so feel free to chime in if you have a
>> > > strong
>> > > > > > > preference for HALF or HALF_FLOAT over FLOAT16!
>> > > > > > >
>> > > > > > >
>> > > > > > > This vote will be open for at least 72 hours.
>> > > > > > >
>> > > > > > > [ ] +1 Add this type to the format specification
>> > > > > > > [ ] +0
>> > > > > > > [ ] -1 Do not add this type to the format specification
>> > because...
>> > > > > > >
>> > > > > > > Thanks!
>> > > > > > >
>> > > > > > > Ben
>> > > > > > >
>> > > > > > > [1]:
>> > > > > https://en.wikipedia.org/wiki/Half-precision_floating-point_format
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Xinli Shang
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Ryan Blue
>> > > > Tabular
>> > > >
>> > >
>> >
>>
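
As a small illustration of the representation under vote (a sketch, not taken
from the proposal or any of the linked PRs): Python's struct format "e" is
IEEE-754 binary16, so it shows exactly what the 2-byte little-endian
FixedLenByteArray payload would contain.

    import struct

    payload = struct.pack("<e", 1.5)   # 2-byte little-endian IEEE-754 half
    print(payload.hex())               # 003e -> bit pattern 0x3E00
    assert struct.unpack("<e", payload)[0] == 1.5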


Re: [Request] Send automated notifications to a separate mailing-list

2023-08-22 Thread Uwe L. Korn
+1

On Tue, Aug 22, 2023, at 5:29 AM, Gang Wu wrote:
> +1 on this.
>
> We may create the following mailing lists:
> - iss...@parquet.apache.org : notifications from JIRA issues.
> - comm...@parquet.apache.org : notifications from Github PRs and comments.
>
> This is what the Apache ORC community currently does. Can one of the PMCs
> do this?
> Probably we need a formal vote before proceeding.
> https://infra.apache.org/mailing-list-moderation.html#new-mailing-list
>
> Best,
> Gang
>
> On Tue, Aug 22, 2023 at 8:49 AM Xinli shang  wrote:
>
>> It is a good idea. Thanks, Antoine, for the proposal.
>>
>> On Tue, Aug 22, 2023 at 2:03 AM Julien Le Dem
>> wrote:
>>
>> > +1
>> >
>> > On Mon, Aug 21, 2023 at 10:16 AM Antoine Pitrou 
>> > wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > I would like to request that automated notifications (from GitHub,
>> > > Jira... whatever) be sent to a separate mailing-list and GMane mirror.
>> > > Currently, the endless stream of automated notifications in this
>> > > mailing-list means that discussions between humans quickly get lost or
>> > > even unnoticed by other people.
>> > >
>> > > For the record, we did this move in Apache Arrow and never came back.
>> > >
>> > > Thanks in advance
>> > >
>> > > Antoine.
>> > >
>> > >
>> > >
>> >
>>
>>
>> --
>> Xinli Shang
>>


Re: [RESULT] Release Apache Parquet Format 2.9.0 RC0

2021-04-15 Thread Uwe L. Korn
Published the release. 

On Wed, Apr 14, 2021, at 6:30 PM, Driesprong, Fokko wrote:
> Yes, you'll need PMC permissions to do that.
> 
> A PMC could fetch the artifacts from
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.9.0-rc0/
> and push them into svn as described below :)
> 
> Cheers, Fokko
> 
> On Wed, Apr 14, 2021 at 17:38, Antoine Pitrou  wrote:
> 
> > On Wed, 14 Apr 2021 17:30:34 +0200
> > Antoine Pitrou  wrote:
> >
> > > Ok, it seems PMC intervention is needed for the step
> > > "3. Copy the release artifacts in SVN into releases" outlined in
> > > https://parquet.apache.org/documentation/how-to-release/ .
> > >
> > > Basically, the `apache-parquet-format-2.9.0-rc0` directory from the SVN
> > > dev/parquet repository should be copied as
> > > `apache-parquet-format-2.9.0` to the SVN release/parquet repository.
> > >
> > > Could a PMC member do that?
> >
> > AFAICT, the required steps are the following (the last one is rejected
> > for me):
> >
> >   $ svn co https://dist.apache.org/repos/dist/dev/parquet candidates
> >   $ svn co https://dist.apache.org/repos/dist/release/parquet releases
> >   $ svn cp candidates/apache-parquet-format-2.9.0-rc0/
> > releases/apache-parquet-format-2.9.0
> >   $ cd releases/
> >   $ svn ci -m "Parquet Format: Add release 2.9.0"
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On Wed, 14 Apr 2021 12:10:11 +0200
> > > Antoine Pitrou  wrote:
> > > > Hello,
> > > >
> > > > The vote to release 2.9.0 RC0 as Apache Parquet Format 2.9.0 is PASSED
> > > > with the required three +1 binding votes.
> > > >
> > > > I will try to finalize the release myself, but I may need help from a
> > > > PMC member.
> > > >
> > > > Best regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > On Wed, 7 Apr 2021 15:10:42 +0200
> > > > Antoine Pitrou  wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I propose the following RC to be released as official Apache Parquet
> > > > > Format 2.9.0 release.
> > > > >
> > > > > The commit id is b4f0c0a643a6ec1a7def37115dd6967ba9346df7
> > > > > * This corresponds to the tag: apache-parquet-format-2.9.0-rc0
> > > > > *
> > https://github.com/apache/parquet-format/tree/b4f0c0a643a6ec1a7def37115dd6967ba9346df7
> > > > >
> > > > > The release tarball, signature, and checksums are here:
> > > > > *
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.9.0-rc0/
> > > > >
> > > > > You can find the KEYS file here:
> > > > > * https://downloads.apache.org/parquet/KEYS
> > > > >
> > > > > Binary artifacts are staged in Nexus here:
> > > > > *
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/
> > > > >
> > > > > This release includes the following important fixes and improvements:
> > > > >
> > > > > * PARQUET-1996 - [Format] Add interoperable LZ4 codec, deprecate
> > existing LZ4 codec
> > > > > * PARQUET-2013 - [Format] Mention that converted types are deprecated
> > > > >
> > > > > ...among other changes (see CHANGES.md for full list).
> > > > >
> > > > > Please download, verify, and test.
> > > > >
> > > > > Please vote in the next 72 hours.
> > > > >
> > > > > [ ] +1 Release this as Apache Parquet 2.9.0
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this because...
> > > > >
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> 


Re: [VOTE] Release Apache Parquet Format 2.9.0 RC0

2021-04-12 Thread Uwe L. Korn
+1 (binding) 

Verified signature and checksum on the artifact, passed the tests on macOS 11 
(ARM64) with

mamba create -p $(pwd)/../env maven thrift-cpp=0.13
conda activate $(pwd)/../env
mvn test

On Fri, Apr 9, 2021, at 10:27 AM, Gabor Szadovszky wrote:
> Thanks, Wes. If this is the case I am happy to make this final step after
> the vote passes.
> 
> On Fri, Apr 9, 2021 at 3:54 AM Wes McKinney  wrote:
> 
> > hi Gabor — I think you may need to be a PMC member? I'm not sure though.
> >
> > +1 (binding), verified signature and checksum on the artifact
> >
> > On Wed, Apr 7, 2021 at 10:19 AM Gabor Szadovszky  wrote:
> > >
> > > I've updated the KEYS file with your public key in the release repo (
> > > downloads.apache.org is updated already). Please keep in mind that you
> > will
> > > still need write access to the release repo to finalize the release after
> > > the vote passes. Guys, any idea how to request write access to a repo?
> > >
> > > Verified checksum and signature; unit tests pass; parquet-mr builds with
> > > the new RC.
> > > +1(binding)
> > >
> > >
> > >
> > >
> > > On Wed, Apr 7, 2021 at 4:51 PM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Ok, I've tried multiple variations and I still can't commit to the
> > > > release repository.
> > > >
> > > > May I ask you to commit the following patch:
> > > > https://gist.github.com/pitrou/0f9f1ffe280cfb48ea9427ebec19b65e
> > > >
> > > > You can check that the key block matches the one I added in the dev
> > > > repo.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Wed, 7 Apr 2021 16:35:16 +0200
> > > > Gabor Szadovszky
> > > > 
> > > > wrote:
> > > > > I don't have too much experience in svn. I usually follow the
> > commands
> > > > > listed in the how to release doc and it works for me. (Don't
> > remember if
> > > > > I've had to do some initial steps.) As a committer you should have
> > write
> > > > > access to all the repositories of the Parquet community.
> > > > >
> > > > > On Wed, Apr 7, 2021 at 4:18 PM Antoine Pitrou 
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Ah!  It seems I can't push to that repo:
> > > > > >
> > > > > > Sending        KEYS
> > > > > > Transmitting file data .
> > > > > > svn: E195023: Commit failed (details
> > follow):
> > > > > > svn: E195023: Changing file
> > > > '/home/antoine/apache/parquet-release/KEYS' is
> > > > > > forbidden by the server
> > > > > > svn: E175013: Access to
> > > > > > '/repos/dist/!svn/txr/46918-13e8/release/parquet/KEYS' forbidden
> > > > > >
> > > > > >
> > > > > > The URL I used for checkout is
> > > > > > https://apit...@dist.apache.org/repos/dist/release/parquet
> > > > > > Should I use another one?
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, 7 Apr 2021 16:00:26 +0200
> > > > > > Gabor Szadovszky
> > > > > > 
> > > > > > wrote:
> > > > > > > Sorry, I've missed you updated the dev repo. The downloads page
> > > > mirrors
> > > > > > the
> > > > > > > release repo. Yet another place (besides the parquet-format and
> > > > > > parquet-mr
> > > > > > > repos) where we store a KEYS file for whatever reason. Please
> > update
> > > > the
> > > > > > > one in the release repo.
> > > > > > >
> > > > > > > On Wed, Apr 7, 2021 at 3:47 PM Gabor Szadovszky <
> > > > > > > gabor.szadovs...@cloudera.com> wrote:
> > > > > > >
> > > > > > > > I guess it only requires some time to sync. Last time the
> > release
> > > > > > tarball
> > > > > > > > required ~1hour to sync.
> > > > > > > >
> > > > > > > > On Wed, Apr 7, 2021 at 3:42 PM Antoine Pitrou <
> > anto...@python.org>
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >>
> > > > > > > >> Hi Gabor,
> > > > > > > >>
> > > > > > > >> Ok, I updated the KEYS file in the Parquet SVN repository.
> > > > > > > >> The changes do appear in
> > > > > > > >> https://dist.apache.org/repos/dist/dev/parquet/KEYS -- but
> > not in
> > > > > > > >> https://downloads.apache.org/parquet/KEYS .  Is there any
> > > > additional
> > > > > > > >> step I should perform?
> > > > > > > >>
> > > > > > > >> Regards
> > > > > > > >>
> > > > > > > >> Antoine.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Wed, 7 Apr 2021 15:19:24 +0200
> > > > > > > >> Gabor Szadovszky  wrote:
> > > > > > > >>
> > > > > > > >> > Hi Antoine,
> > > > > > > >> >
> > > > > > > >> > Thanks for initiating this release! You need to update the
> > > > listed
> > > > > > KEYS
> > > > > > > >> file
> > > > > > > >> > with your public key otherwise we cannot validate the
> > > > signature.
> > > > > > (To do
> > > > > > > >> > that you need to update the releases svn repo. See details
> > in
> > > > the
> > > > > > how to
> > > > > > > >> > release doc about the publishing.)
> > > > > > > >> >
> > > > > > > >> > Regards,
> > > > > > > >> > Gabor
> > > > > > > >> >
> > > > > > > >> > On Wed, Apr 7, 2021 at 3:10 PM Antoine Pitrou <
> > > > anto...@python.org>
> > > > > >
> > > > > > > >> 

Re: [C++] Changing the versioning string for Parquet-CPP

2021-03-12 Thread Uwe L. Korn
When we merged this into the Arrow repo, at least from my side, there was the 
intention to maybe revert that again at some stage. The thought behind moving 
parquet-cpp out of the Arrow repo again was based on the idea that Parquet was 
one of the many interfaces Arrow provides access to, but not one of the 
outstanding ones. Nowadays, I have the feeling that Parquet and Arrow have a 
much more bound-together relationship than I initially expected. Thus we should 
probably accept that parquet-cpp will stay in the Arrow repo for a very long 
time, and this should extend to the versioning as well.

Also, we had the assumption that from time to time the Parquet community would 
make separate releases. I have no memory anymore of how we assumed these 
releases would happen, or why.

Basically, we had some assumptions under which keeping the version numbers 
separate made sense. All of the assumptions I can think of turned out to be 
false, so keeping the version in line with Arrow (C++) makes total sense 
nowadays.

Uwe
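
For context, the string in question is the "created_by" field that parquet-cpp
writes into every file footer. A quick way to inspect it (a sketch assuming
pyarrow and a scratch path):

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"x": [1, 2, 3]}), "/tmp/example.parquet")
    # Prints e.g. "parquet-cpp-arrow version 3.0.0" after the proposed change.
    print(pq.ParquetFile("/tmp/example.parquet").metadata.created_by)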

On Tue, Mar 9, 2021, at 7:57 PM, Micah Kornfield wrote:
> I think there might have been some old agreement on this when parquet-cpp
> was moved into the Arrow repo.  I can't seem to find the thread, but it
> would be nice for some PMC members to chime it to make sure this seems OK
> to them.
> 
> On Sat, Mar 6, 2021 at 7:38 AM Antoine Pitrou  wrote:
> 
> > On Fri, 5 Mar 2021 10:26:57 -0800
> > Micah Kornfield 
> > wrote:
> > >
> > > I'd like to propose that we change the default version string [1] for
> > > parquet-cpp to reflect arrow releases (e.g. "parquet-cpp-arrow version
> > > 3.0.0" instead of "parquet-cpp version 1.5.1-snapshot").
> >
> > +1.  This definitely makes the most sense.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-30 Thread Uwe L. Korn
I'm also in favor of disabling support for now. Having to deal with broken 
files, or with detecting the various incompatible implementations, will in the 
long term harm more than not supporting LZ4 for a while. Snappy is generally 
more widely used than LZ4 in this category, as it has been available since the 
inception of Parquet, and should thus be considered a viable alternative.

Cheers
Uwe
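
For producers who want to follow this advice today, a minimal sketch (assuming
pyarrow; the path is made up) of explicitly choosing a codec other than LZ4:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1000))})
    # Pick Snappy (or ZSTD) explicitly instead of relying on LZ4.
    pq.write_table(table, "/tmp/data.parquet", compression="snappy")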

On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
> >
> >
> > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > hi folks,
> > >
> > > (cross-posting to dev@arrow and dev@parquet since there are
> > > stakeholders in both places)
> > >
> > > It seems there are still problems at least with the C++ implementation
> > > of LZ4 compression in Parquet files
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > I don't have any particular opinion on how to solve the LZ4 issue, but
> > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > compression algorithms available, and they span different parts of the
> > speed/compression spectrum, so it would be a pity to disable one of them.
> 
> It's true, however I think it's worse to write LZ4-compressed files
> that cannot be read by other Parquet implementations (if that's what's
> happening as I understand it?). If we are indeed shipping something
> broken then we either should fix it or disable it until it can be
> fixed.
> 
> > Regards
> >
> > Antoine.
>


Re: Updating parquet web site

2019-10-18 Thread Uwe L. Korn
Hello Gabor,

can we, for clarity, call this https://github.com/apache/parquet-site ?

Thanks
Uwe

On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadovszky wrote:
> Dear All,
> 
> There is some stuff on our web site that has been ready for an update (for a
> while). To spin up the process it would be great if we could follow the
> same git PR process we already have for our existing git repos. Jim has
> already created PARQUET-1675
>  for moving the
> existing svn repo to git.
> 
> If there are no objections I will create an infra ticket to move the svn
> repo https://svn.apache.org/repos/asf/parquet to the new git repository
> https://github.com/apache/parquet.
> 
> Regards,
> Gabor
>


[jira] [Updated] (PARQUET-1586) [C++] Add --dump options to parquet-reader tool to dump def/rep levels

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1586:
-
Fix Version/s: (was: 1.10.1)
   cpp-1.6.0

> [C++] Add --dump options to parquet-reader tool to dump def/rep levels
> --
>
> Key: PARQUET-1586
> URL: https://issues.apache.org/jira/browse/PARQUET-1586
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1586) [C++] Add --dump options to parquet-reader tool to dump def/rep levels

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1586.
--
   Resolution: Fixed
Fix Version/s: 1.10.1

Issue resolved by pull request 4385
[https://github.com/apache/arrow/pull/4385]

> [C++] Add --dump options to parquet-reader tool to dump def/rep levels
> --
>
> Key: PARQUET-1586
> URL: https://issues.apache.org/jira/browse/PARQUET-1586
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1583) [C++] Remove parquet::Vector class

2019-05-21 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1583.
--
Resolution: Fixed

Issue resolved by pull request 4354
[https://github.com/apache/arrow/pull/4354]

> [C++] Remove parquet::Vector class
> --
>
> Key: PARQUET-1583
> URL: https://issues.apache.org/jira/browse/PARQUET-1583
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm not sure this code is needed anymore, added during the early days of the 
> project in 2016



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Parquet vs. other Open Source Columnar Formats

2019-05-09 Thread Uwe L. Korn
Hello,

Be aware that Avro and Protobuf are general serialization formats, not columnar 
ones such as Parquet or ORC. They are good for RPC or row-wise streaming, 
whereas the latter two are perfect for analytics.

Uwe

> On 09.05.2019 at 20:33, David Mollitor  wrote:
> 
> I'm sure there are many different opinions on the matter, but in regards to
> Avro, I would say it is becoming more and more of a niche player.
> 
> Many folks are choosing to go with Google Protobufs for RPC and Parquet/ORC
> for analytic workloads.
> 
>> On Thu, May 9, 2019 at 2:30 PM Brian Bowman  wrote:
>> 
>> All,
>> 
>> Is it fair to say that Parquet is fast becoming the dominant open source
>> columnar storage format?   How do those of you with long-term Hadoop
>> experience see this?  For example, is Parquet overtaking ORC and Avro?
>> 
>> Thanks,
>> 
>> Brian
>> 



[jira] [Commented] (PARQUET-1022) [C++] Append mode in parquet-cpp

2019-03-13 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791492#comment-16791492
 ] 

Uwe L. Korn commented on PARQUET-1022:
--

There is no implementation of merging/concatenating files in C++ yet, but that 
would be much easier to implement than an append mode. For merging files, you 
would read the footers of all files, copy the binary content of the RowGroups 
into the new file, and then compute a new footer from the information in the 
existing footers. From what I can think of currently, this should not require 
decoding any RowGroups, so implementing an explicit merge will be a lot faster 
than materialising the data first and writing a fully new file.
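
For comparison, pyarrow currently only offers the slow path: a sketch of the
decode-and-rewrite fallback (hypothetical function name), which materialises
every RowGroup, unlike the binary copy described above.

    import pyarrow.parquet as pq

    def merge_decode_rewrite(paths, out_path):
        first = pq.ParquetFile(paths[0])
        with pq.ParquetWriter(out_path, first.schema_arrow) as writer:
            for path in paths:
                f = pq.ParquetFile(path)
                for i in range(f.num_row_groups):
                    # Decodes and re-encodes each row group.
                    writer.write_table(f.read_row_group(i))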

> [C++] Append mode in parquet-cpp
> 
>
> Key: PARQUET-1022
> URL: https://issues.apache.org/jira/browse/PARQUET-1022
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: yugu
>Assignee: Wes McKinney
>Priority: Major
>
> As said, currently trying to work out a append feature for parquet files in 
> c++.
> (been searching through repo etc, can't find example tho..)
> Current solution is to (assume no schema changes that is):
> Read in metadata
> Change metadata based on appended rows+ original rows
> Append a new row group (or multiple row group writer)
> Write the new rows.
> ---
> The problem is that, is approached this way, the original last row group may 
> not be complete filled. Was wondering if there is a fix or I'm using the api 
> wrong...
> Thanks ! : D



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1022) [C++] Append mode in parquet-cpp

2019-03-12 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790518#comment-16790518
 ] 

Uwe L. Korn commented on PARQUET-1022:
--

[~thamha] The solution here is to write more files and combine them afterwards. 
Reading Parquet files that have no footer is not possible. Even with the 
possibility of modifying an existing file, you would lose all data in your 
crash scenario, as in the second write the data would be written over the 
footer. As long as no new footer has been written, the data in the file cannot 
be restored.
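
A sketch of the suggested pattern (assuming pyarrow; "events/" is a
hypothetical directory): write many small complete files, then read the
directory as one dataset.

    import pyarrow.parquet as pq

    # part-0.parquet, part-1.parquet, ... were each written and closed
    # separately, so every file has a valid footer even after a crash.
    table = pq.read_table("events/")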

> [C++] Append mode in parquet-cpp
> 
>
> Key: PARQUET-1022
> URL: https://issues.apache.org/jira/browse/PARQUET-1022
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: yugu
>Assignee: Wes McKinney
>Priority: Major
>
> As said, currently trying to work out a append feature for parquet files in 
> c++.
> (been searching through repo etc, can't find example tho..)
> Current solution is to (assume no schema changes that is):
> Read in metadata
> Change metadata based on appended rows+ original rows
> Append a new row group (or multiple row group writer)
> Write the new rows.
> ---
> The problem is that, is approached this way, the original last row group may 
> not be complete filled. Was wondering if there is a fix or I'm using the api 
> wrong...
> Thanks ! : D



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: parquet using encoding other than UTF-8

2019-02-05 Thread Uwe L. Korn
Hello Manik,

this is not possible at the moment. As Parquet is a portable on-disk format, we 
focus on having a single representation for each data type. Readers and writers 
thus only need to implement these representations, which keeps them simpler. 
Especially as you are the producer but not the consumer, even adding a new type 
would not solve your problem: you can only use a new logical type once it has 
been implemented in all the readers and your consumers have all updated to 
those reader versions.

As Unicode, and thus UTF-8, supports all characters one can think of, you 
should always be able to convert your strings to it. Given that Parquet encodes 
and compresses the data afterwards anyway, the conversion is a bit of CPU 
overhead but should not make a difference in the size and form of the data 
actually stored in the files. I would also guess that a UTF-16 to UTF-8 
conversion costs less CPU than the Parquet compression process.

Did this help you, or is there any reason why you really cannot convert your 
data to UTF-8?

Uwe
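
A minimal sketch of the suggested conversion on the producer side (assuming
pyarrow; the sample payloads are made up): decode the UTF-16 input once and
let Parquet store UTF-8 strings.

    import pyarrow as pa
    import pyarrow.parquet as pq

    utf16_payloads = ["Grüße".encode("utf-16"), "データ".encode("utf-16")]
    strings = [b.decode("utf-16") for b in utf16_payloads]  # to Python str
    table = pa.table({"s": pa.array(strings, pa.string())})
    pq.write_table(table, "/tmp/utf8.parquet")  # stored as UTF-8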

On Wed, Feb 6, 2019, at 6:19 AM, Manik Singla wrote:
> Hi
> 
> I am new to Parquet. I am trying to save UTF-16 or some other encoding than
> UTF-8.
> I am also trying to use encoding hint when saving ByteBuffer.
> 
> I don't find a way to use anything other than UTF-8.
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says
> we can extend primitive types to solve cases.
> 
> Other thing I want to mention is I am only the producer of parquet file but
> not consumer.
> 
> Could you guide me to which examples I can look into, or what would be the right way?
> 
> 
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
> 
> "Life doesn't consist in holding good cards but playing those you hold
> well."


[jira] [Assigned] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter

2019-02-05 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1521:


Assignee: Wes McKinney

> [C++] Do not use "extern template class" with parquet::ColumnWriter
> ---
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do 
> similar refactoring for parquet::TypedColumnWriter, leaving the current 
> public API for writing columns unchanged



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter

2019-02-05 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1521.
--
Resolution: Fixed

Issue resolved by pull request 3551
[https://github.com/apache/arrow/pull/3551]

> [C++] Do not use "extern template class" with parquet::ColumnWriter
> ---
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do 
> similar refactoring for parquet::TypedColumnWriter, leaving the current 
> public API for writing columns unchanged



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-04 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760517#comment-16760517
 ] 

Uwe L. Korn commented on PARQUET-1523:
--

This would probably benefit from an arrow-compute kernel for vectorisation.
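
A rough illustration of the idea (numpy standing in for a vectorised compute
kernel; not the parquet-cpp API): update the statistics once per batch rather
than once per element.

    import numpy as np

    def update_stats(batch, cur_min, cur_max):
        # One vectorised pass instead of a virtual call per value.
        return min(cur_min, batch.min()), max(cur_max, batch.max())

    lo, hi = update_stats(np.array([3, -1, 7]), np.inf, -np.inf)
    assert (lo, hi) == (-1, 7)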

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to 
> refactor to a vector-based comparison to update the minimum and maximum 
> elements in a single virtual call
> cc [~mdeepak] [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Parquet 1.10.1 RC0

2019-01-31 Thread Uwe L. Korn
+1 (binding)

Built and tested using Ryan's script on Ubuntu 16.04.

The script helped me a bit as it included the necessary Maven
options. Thanks!
For the future, it would be good to include a script, as we have in Arrow,
that also checks the signature. We keep it in the main tree and the script
also downloads the source tarball; that way the script is simply in git and
not part of the release.
Uwe


On Thu, Jan 31, 2019, at 7:36 PM, Ryan Blue wrote:
> Uwe suggested that we include a validation script for release votes.
> It's a little late to add one to this release, but here is one to make
> it easier. Just run the attached script from the project root.
> 
> The script checks for the correct versions of thrift and protobuf. If
> found, those modules and modules that depend on them are built and
> tested. Otherwise, it just tests the modules that don't require
> additional installs. This builds the project and runs RAT checks, then
> runs tests and prints a message at the end.
> 
> rb
> 
> 
> On Wed, Jan 30, 2019 at 10:11 PM Dongjoon Hyun
>  wrote:
>> +1 for 1.10.1 RC0 (non-binding).
>> 
>> I tested the src tar artifact on Ubuntu 16.04 and passed all UTs.
>> (Also, I saw the result of Ryan's Spark PR
>> https://github.com/apache/spark/pull/23704.)
>> 
>> Thank you for the release.
>> 
>> Cheers,
>> Dongjoon.
>> 
>> On Wed, Jan 30, 2019 at 5:07 PM Dongjoon Hyun
>>  wrote:
>>> Sure! I'll make a PR for that.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Wed, Jan 30, 2019 at 3:11 PM Ryan Blue  wrote: 
>>> Looks like the README is out of date. I don't think we should
 fail this RC based on the README. Can you open a pull request to
 update it. 
 I think the correct version is protobuf 3.5.1 for the 1.10.x line.
 And we should remove the current release from the README. 
 On Wed, Jan 30, 2019 at 1:27 PM Dongjoon Hyun
  wrote:
> Hi, All.
> 
> For RC testing, I downloaded the src tar file and followed the
> README. I'm wondering if the README is up-to-date and proper.> 
> In the master branch, `protoc` is handled by maven plugin. 
> 
> But, in RC, while `.travis.yml` installs `protobuf-3.5.1`, README
> tells you to install `protobuf-2.5.0`, and the build eventually fails due to
> `TestProto3.proto`.
>
>> main:
>> [mkdir] Created dir: /Users/dongjoon/APACHE/RC-PARQUET/apache-parquet-1.10.1/parquet-protobuf/target/generated-test-sources
>> [mkdir] Created dir: /Users/dongjoon/APACHE/RC-PARQUET/apache-parquet-1.10.1/parquet-protobuf/target/generated-test-sources/java
>>  [exec] src/test/resources/TestProto3.proto:1:10: Unrecognized syntax identifier "proto3".  This parser only recognizes "proto2".
>
> Also, README says `The current release is version 1.8.1` instead
> of `1.10.1`. Is it worth fixing?
> 
> Bests,
> Dongjoon.
> 
> 
> On Wed, Jan 30, 2019 at 10:45 AM Ryan Blue
>  wrote:
>> +1 (binding)
>> 
>>  Validated source signature, checksum. Ran unit tests. Tested
>>  Iceberg with the candidate.
>> 
>>  For reference, here’s how to test this candidate in a gradle
>>  project:
>> 
>>  repositories {
>>    ...
>>    maven {
>>      url 'https://repository.apache.org/content/repositories/orgapacheparquet-1022/'
>>    }
>>  }
>> 
>>  And in a maven project:
>> 
>>  <repositories>
>>    ...
>>    <repository>
>>      <id>parquet-1.10.1</id>
>>      <name>Parquet 1.10.1 RC0</name>
>>      <url>https://repository.apache.org/content/repositories/orgapacheparquet-1022/</url>
>>      <releases>
>>        <enabled>true</enabled>
>>      </releases>
>>      <snapshots>
>>        <enabled>false</enabled>
>>      </snapshots>
>>    </repository>
>>  </repositories>
>> 
>> 
>>  On Tue, Jan 29, 2019 at 6:06 AM Nandor Kollar
>>  >>  wrote:
>> 
>>  > +1 (non-binding)
>>  >
>>  > Verified signature and checksum, ran unit tests, all passed.
>>  >
>>  > Cheers,
>>  > Nandor
>>  >
>>  > On Tue, Jan 29, 2019 at 1:21 PM Gabor Szadovszky
>>  >  wrote:>>  >
>>  > > Hi Ryan,
>>  > >
>>  > > Checked the tarball: checksum/signature are correct. Content
>>  > > is correct>>  > > based on the release tag. Unit tests pass.
>>  > >
>>  > > +1 (non-binding)
>>  > >
>>  > > Cheers,
>>  > > Gabor
>>  > >
>>  > >
>>  > > On Mon, Jan 28, 2019 at 11:08 PM Ryan Blue
>>  > > >>  > > wrote:
>>  > >
>>  > > > Hi everyone,
>>  > > >
>>  > > > I propose the following RC to be released as official
>>  > > > Apache Parquet>>  > Java
>>  > > > 1.10.1 release.
>>  > > >
>>  > > > The commit id is a89df8f9932b6ef6633d06069e50c9b7970bebd1>>  
>> > > >
>>  > > >- This corresponds to the tag: apache-parquet-1.10.1
>>  > > >- 

Re: [DISCUSS] Remove old modules?

2019-01-29 Thread Uwe L. Korn
Hello Fokko, 

I have put up a PR for the Scala update 
https://github.com/apache/parquet-mr/pull/605. parquet-scrooge fails due to a 
Thrift parsing error but parquet-scala succeeds with Scala 2.12 With dropping 
scrooge, we could at least move this forward.

Uwe

> Am 29.01.2019 um 11:40 schrieb Nandor Kollar :
> 
> Removing parquet-hive-* is a great idea; the code in Parquet is not
> maintained any more and is just a burden there.
> 
> As of parquet-pig, I'd prefer moving it to Pig (if Pig community accepts it
> as it is) instead of dropping it or moving to a separate project. I know
> people who still use Pig with Parquet.
> 
> Regards,
> Nandor
> 
>> On Mon, Jan 28, 2019 at 6:29 PM Ryan Blue  wrote:
>> 
>> Hi everyone,
>> 
>> I’m working on the 1.10.1 build and I’ve noticed that we will have several
>> modules that are not maintained or are very old. This includes all of the
>> Hive modules that moved into Hive years ago and also modules like
>> parquet-scrooge and parquet-scala that are based on Scala 2.10 that has
>> been EOL for years.
>> 
>> We also have 2 command-line utilities, parquet-tools and parquet-cli. The
>> parquet-cli version is friendlier to use, but I’m clearly biased. In any
>> case, I don’t think we need to maintain both and it is confusing for users
>> to have two modules that do the same thing.
>> 
>> I propose we remove the following modules:
>> 
>>   - parquet-hive-*
>>   - parquet-scrooge
>>   - parquet-scala
>>   - parquet-tools
>>   - parquet-hadoop-bundle (shaded deps)
>>   - parquet-cascading (in favor of parquet-cascading3, if we keep it)
>> 
>>   There are also modules that I’m not sure about. Does anyone use these?
>> 
>>   - parquet-thrift
>>   - parquet-pig
>>   - parquet-cascading3
>> 
>> Pig hasn’t had an update (other than project-wide changes) since Oct 2017.
>> I think it may be time to drop support in Pig and allow that to exist as a
>> separate project if anyone is still interested in it.
>> 
>> In the last few years, we’ve moved more to a model where processing
>> frameworks and engines maintain their own integration. Spark, Presto,
>> Iceberg, and Hive fall into this category. So I would prefer to drop Pig
>> and Cascading3. I’m fine keeping thrift if people think it is useful.
>> 
>> Thoughts?
>> 
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 


Re: [DISCUSS] Bump Apache Thrift dependency to 0.12.0

2019-01-27 Thread Uwe L. Korn
To give a bit of input: I have written a simple Parquet file with 
pyarrow/parquet-cpp, one with Thrift 0.11 and one with Thrift 0.12, and they 
were identical. We have been using Thrift 0.11 in there for quite a while, so I 
would assume that there are no problems between the versions. Otherwise I would 
have expected users to report many issues about pyarrow/Spark interop.

Uwe
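
The comparison described above can be reproduced with a byte-for-byte check of
the two outputs (a sketch; the file names are hypothetical outputs of pyarrow
builds against Thrift 0.11 and 0.12):

    import hashlib

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    assert sha256("thrift011.parquet") == sha256("thrift012.parquet")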

On Sun, Jan 27, 2019, at 3:33 AM, Wes McKinney wrote:
> I thought the Thrift binary protocol is stable at this point (I'm not
> sure what process Apache Thrift uses to ensure this), but I agree it
> would be worth a smoke test of new-Thrift vs. old-Thrift. I've build
> parquet-cpp with 0.12 and not had any issues
> 
> - Wes
> 
> On Fri, Jan 25, 2019 at 10:34 AM Ryan Blue  wrote:
> >
> > The thrift dependency in parquet-format and the one in parquet-mr can
> > coexist because we shade the one in parquet-format. Thrift should also be
> > binary compatible, although I don't think they publish any guarantees.
> >
> > On Fri, Jan 25, 2019 at 12:53 AM Uwe L. Korn  wrote:
> >
> > > As an FYI: parquet-cpp already uses Thrift 0.12 in some of its binary
> > > distributions. So when there is a problem with old readers, one has to
> > > notice that we already have files out in the wild.
> > >
> > > Cheers
> > > Uwe
> > >
> > > On Fri, Jan 25, 2019, at 9:13 AM, Gabor Szadovszky wrote:
> > > > May it cause any problems that we write the thrift structures in the
> > > > parquet files (footer, page headers etc.) with a different version as
> > > > before? It might require some tests if the older readers are able to 
> > > > read
> > > > the files written with the new thrift.
> > > > Any thoughts?
> > > >
> > > > On Thu, Jan 24, 2019 at 8:49 PM Ryan Blue 
> > > wrote:
> > > >
> > > > > Why is it a problem that thrift can't be compiled with Java 11? We
> > > should
> > > > > only have a binary dependency.
> > > > >
> > > > > +1 for moving thrift forward, though.
> > > > >
> > > > > On Thu, Jan 24, 2019 at 11:38 AM Driesprong, Fokko
> > > 
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to discuss updating the Thrift dependency to 0.12.0 of
> > > > > > Parquet. In my effort to make Parquet forward compatible for JDK11
> > > > > > <https://github.com/apache/parquet-mr/pull/596>, I stumbled upon
> > > some
> > > > > > issues. One of them that we still rely, in both the CI and
> > > documentation,
> > > > > > on Thrift 0.9.3 (released October 2015). Unfortunately, this version
> > > of
> > > > > > Thrift won't compile under Java11:
> > > > > > [javac] Compiling 105 source files to
> > > > > /home/travis/build/apache/parquet-mr
> > > > > > /thrift-0.9.3/lib/java/build
> > > > > > [javac] warning: [options] bootstrap class path not set in
> > > conjunction
> > > > > with
> > > > > > -source 5
> > > > > > [javac] error: Source option 5 is no longer supported. Use 6 or
> > > later.
> > > > > > [javac] error: Target option 1.5 is no longer supported. Use 1.6 or
> > > > > later.
> > > > > >
> > > > > > Target 1.5, feels a bit awkward, doesn't it? My main question to the
> > > > > > dev-list is; is there any particular reason why we shouldn't update
> > > the
> > > > > > Thrift dependency to 0.12.0. I know that it will have an impact on
> > > > > Parquet,
> > > > > > but if we want to support Java11, we need to move forward 
> > > > > > eventually.
> > > > > >
> > > > > > After updating the thrift-maven plugin
> > > > > > <https://github.com/apache/parquet-mr/pull/600>, I was able to run
> > > the
> > > > > CI
> > > > > > against Thrift 0.12.0 <https://github.com/apache/parquet-mr/pull/601
> > > >.
> > > > > >
> > > > > > Cheers, Fokko
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix


Re: [DISCUSS] Parquet Java 1.10.1 release?

2019-01-27 Thread Uwe L. Korn
Hello Ryan,

Making a bugfix release sounds fine for this case.

Sadly, as with all other RCs, it would help to have better instructions on how 
to verify the release candidate.

Uwe

On Fri, Jan 25, 2019, at 8:19 PM, Ryan Blue wrote:
> Hi everyone,
> 
> The Spark community caught a correctness bug in Parquet, PARQUET-1510
>  and SPARK-26677
> . The dictionary filter
> was ignoring null values and skipping row groups incorrectly.
> 
> Spark is considering disabling Parquet dictionary filters, but PARQUET-1309
>  causes a problem
> because the stats and dictionary filter config properties are swapped. And,
> it is a bad idea to disable filtering for all of Parquet due to a bug like
> this. (I've also suggested a work-around that I think is more likely.)
> 
> Since this is a correctness bug and Spark can't update to 1.11.0 in a patch
> release of Spark, if the Parquet release were finished, I think we should
> create a 1.10.1 release. I would include the fixes for PARQUET-1309 and
> PARQUET-1510.
> 
> Is everyone okay with me creating a release candidate for 1.10.1? If so,
> are there any other bugs that should be fixed in 1.10.1?
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


[jira] [Assigned] (PARQUET-1504) Add an option to convert Parquet Int96 to Arrow Timestamp

2019-01-27 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1504:


Assignee: Yongyan Wang

> Add an option to convert Parquet Int96 to Arrow Timestamp
> -
>
> Key: PARQUET-1504
> URL: https://issues.apache.org/jira/browse/PARQUET-1504
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yongyan Wang
>Assignee: Yongyan Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Add an option to convert Parquet Int96 to Arrow Timestamp for SchemaConverter



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1504) Add an option to convert Parquet Int96 to Arrow Timestamp

2019-01-27 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1504.
--
   Resolution: Fixed
Fix Version/s: 1.12.0

Issue resolved by PR https://github.com/apache/parquet-mr/pull/594

> Add an option to convert Parquet Int96 to Arrow Timestamp
> -
>
> Key: PARQUET-1504
> URL: https://issues.apache.org/jira/browse/PARQUET-1504
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yongyan Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Add an option to convert Parquet Int96 to Arrow Timestamp for SchemaConverter



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1496) [Java] Update Scala to 2.12

2019-01-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1496:
-
Description: 
When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
the build fails for me in {{parquet-scala}} with:
{code:java}
[INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 ---
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.java,**/*.scala,]
[INFO] excludes = []
[INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
info: compiling
[INFO] Compiling 1 source files to 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
1547922718010
[ERROR] error: error while loading package, Missing dependency 'object 
java.lang.Object in compiler mirror', required by 
/Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
[ERROR] error: error while loading package, Missing dependency 'object 
java.lang.Object in compiler mirror', required by 
/Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
[ERROR] error: scala.reflect.internal.MissingRequirementError: object 
java.lang.Object in compiler mirror not found.
[ERROR] at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
[ERROR] at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
[INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
[INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
[INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
[INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
[INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
[INFO] at scala.tools.nsc.Main.main(Main.scala)
[INFO] at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[INFO] at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[INFO] at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[INFO] at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[INFO] at 
org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
[INFO] at 
org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26){code}
This is because the referenced JARs were built for a Scala version that does 
not support JDK 11; we need to update to 2.12.

  was:
When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
the build fails for me in {{parquet-scala}} with:
{code:java}
[INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 ---
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.java,**/*.scala,]
[INFO] excludes = []
[INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
info: compiling
[INFO] Compiling 1 source files to 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
1547922718010
[ERROR] error: error while loading package

[jira] [Updated] (PARQUET-1496) [Java] Update Scala to 2.12

2019-01-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1496:
-
Summary: [Java] Update Scala to 2.12  (was: [Java] Build fails on OSX and 
Java 10)

> [Java] Update Scala to 2.12
> ---
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>    Reporter: Uwe L. Korn
>Priority: Major
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO] at scala.tools.nsc.Main.main(Main.scala)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [INFO] at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [INFO] at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> [INFO] at 
> org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
> [INFO] at 
> org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1496) [Java] Update Scala to 2.12

2019-01-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1496:


Assignee: Uwe L. Korn

> [Java] Update Scala to 2.12
> ---
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>    Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO] at scala.tools.nsc.Main.main(Main.scala)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [INFO] at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [INFO] at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> [INFO] at 
> org_scala_

Re: [DISCUSS] Bump Apache Thrift dependency to 0.12.0

2019-01-25 Thread Uwe L. Korn
As an FYI: parquet-cpp already uses Thrift 0.12 in some of its binary 
distributions. So if the new Thrift causes problems for old readers, note 
that such files are already out in the wild.

Cheers
Uwe

On Fri, Jan 25, 2019, at 9:13 AM, Gabor Szadovszky wrote:
> Could it cause any problems that we write the thrift structures in the
> parquet files (footer, page headers etc.) with a different version than
> before? It might require some tests to check whether older readers are able
> to read the files written with the new thrift.
> Any thoughts?
> 
> On Thu, Jan 24, 2019 at 8:49 PM Ryan Blue  wrote:
> 
> > Why is it a problem that thrift can't be compiled with Java 11? We should
> > only have a binary dependency.
> >
> > +1 for moving thrift forward, though.
> >
> > On Thu, Jan 24, 2019 at 11:38 AM Driesprong, Fokko 
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to discuss updating the Thrift dependency of Parquet to
> > > 0.12.0. In my effort to make Parquet forward compatible with JDK11, I
> > > stumbled upon some issues. One of them is that we still rely, in both the
> > > CI and documentation, on Thrift 0.9.3 (released October 2015). Unfortunately, this version of
> > > Thrift won't compile under Java11:
> > > [javac] Compiling 105 source files to
> > /home/travis/build/apache/parquet-mr
> > > /thrift-0.9.3/lib/java/build
> > > [javac] warning: [options] bootstrap class path not set in conjunction
> > with
> > > -source 5
> > > [javac] error: Source option 5 is no longer supported. Use 6 or later.
> > > [javac] error: Target option 1.5 is no longer supported. Use 1.6 or
> > later.
> > >
> > > Target 1.5 feels a bit awkward, doesn't it? My main question to the
> > > dev-list is: is there any particular reason why we shouldn't update the
> > > Thrift dependency to 0.12.0? I know that it will have an impact on
> > > Parquet, but if we want to support Java11, we need to move forward
> > > eventually.
> > >
> > > After updating the thrift-maven plugin, I was able to run the CI
> > > against Thrift 0.12.0.
> > >
> > > Cheers, Fokko
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >


Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-22 Thread Uwe L. Korn
Hello Gabor,

you probably confused Wes and me ;) I'm going to give this a try on Linux and 
try to verify the release. The Java 9+ issues sadly make it harder for me to 
verify it on my normal system.

Uwe

On Tue, Jan 22, 2019, at 9:44 AM, Gabor Szadovszky wrote:
> Hi Wes,
> 
> Thanks for checking the RC and voting.
> I would like to highlight that the mentioned issues are also reproducible
> on Linux with the parquet release 1.10.0, so these are not regressions. We
> currently support Java 8 only, so I cannot see any issue with the release.
> I completely agree that these issues should be fixed, but I think we should
> not stop the release because of them.
> 
> Cheers,
> Gabor
> 
> On Mon, Jan 21, 2019 at 6:40 PM Uwe L. Korn  wrote:
> 
> > Hi,
> >
> > I'm sadly giving a +0 here. The signatures look good but I was unable to
> > build with JDK 9/10/11 on OSX.
> > https://issues.apache.org/jira/browse/PARQUET-1497 and
> > https://issues.apache.org/jira/browse/PARQUET-1496 are the problems I'm
> > running into. I've also opened a PR to document how to install thrift on
> > OSX: https://github.com/apache/parquet-mr/pull/595
> >
> > Cheers
> > Uwe
> >
> > On Fri, Jan 18, 2019, at 1:55 PM, Anna Szonyi wrote:
> > > Hi All,
> > >
> > > While not a binding one, it's a "+1 (non-binding), everything looks good"
> > > from me!
> > >
> > > Best,
> > > Anna
> > >
> > > On Thu, Jan 17, 2019 at 6:13 PM Zoltan Ivanfi 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Friendly reminder to please vote for the release. We need 2 more
> > binding +1
> > > > votes.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Sat, Jan 12, 2019 at 3:07 AM 俊杰陈  wrote:
> > > >
> > > > > +1  (non-binding)
> > > > > * contents looks good
> > > > > * unit tests passed
> > > > >
> > > > >
> > > > > > Zoltan Ivanfi  wrote on Fri, Jan 11, 2019 at 9:31 PM:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > * contents look good
> > > > > > * unit tests pass
> > > > > > * checksums match
> > > > > > * signature matches
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 11:48 AM Gabor Szadovszky <
> > ga...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Checked tarball: checksum/signature are correct. Content is
> > correct
> > > > > based
> > > > > > > on release tag. Unit tests pass.
> > > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gabor
> > > > > > >
> > > > > > > On Wed, Jan 9, 2019 at 4:51 PM Zoltan Ivanfi
> >  > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Dear Parquet Users and Developers,
> > > > > > > >
> > > > > > > > I propose the following RC to be released as the official
> > Apache
> > > > > > > > Parquet 1.11.0 release:
> > > > > > > >
> > > > > > > > The commit id is 8be767d12cca295cf9858a521725fc440b0c6f93
> > > > > > > > * This corresponds to the tag: apache-parquet-1.11.0
> > > > > > > > *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/parquet-mr/tree/8be767d12cca295cf9858a521725fc440b0c6f93
> > > > > > > >
> > > > > > > > The release tarball, signature, and checksums are here:
> > > > > > > > *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc3/
> > > > > > > >
> > > > > > > > You can find the KEYS file here:
> > > > > > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > > > > >
> > > > > > > > Binary artifacts are staged in Nexus here:
> > > > > > > > *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet/1.11.0/
> > > > > > > >
> > > > > > > > This release includes the following new features:
> > > > > > > >
> > > > > > > > - PARQUET-1201 - Column indexes
> > > > > > > > - PARQUET-1253 - Support for new logical type representation
> > > > > > > > - PARQUET-1381 - Add merge blocks command to parquet-tools
> > > > > > > > - PARQUET-1388 - Nanosecond precision time and timestamp -
> > > > parquet-mr
> > > > > > > >
> > > > > > > > The release also includes bug fixes, including:
> > > > > > > >
> > > > > > > > - PARQUET-1472: Dictionary filter fails on
> > FIXED_LEN_BYTE_ARRAY.
> > > > > > > >
> > > > > > > > Please download, verify, and test. The vote will be open for at
> > > > least
> > > > > > 72
> > > > > > > > hours.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Zoltan
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > > >
> > > >
> >


[jira] [Resolved] (PARQUET-1501) v1.8.x to be fixed with PARQUET-952 solution

2019-01-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1501.
--
Resolution: Won't Fix

> v1.8.x to be fixed with PARQUET-952 solution
> 
>
> Key: PARQUET-1501
> URL: https://issues.apache.org/jira/browse/PARQUET-1501
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1, 1.8.2, 1.8.3
>Reporter: Vijayakumar N
>Priority: Major
>
> The following issue was fixed in Avro parquet v1.11.0. PARQUET-952: Avro union 
> with single type fails with 'is not a group'
>  
> But we require this fix in the v1.8.x series also. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
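
For readers unfamiliar with PARQUET-952: Avro permits a union with a single branch, and parquet-avro's AvroSchemaConverter used to assume every union maps to a Parquet group, failing with 'is not a group'. A hedged sketch of a schema that exercises this code path (class and field names are made up; AvroSchemaConverter is the real converter):
{code:java}
import java.util.Collections;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class SingleUnionSketch {
    public static void main(String[] args) {
        // A union with exactly one branch -- legal in Avro.
        Schema singleUnion = Schema.createUnion(
                Collections.singletonList(Schema.create(Schema.Type.STRING)));
        Schema record = SchemaBuilder.record("Example").fields()
                .name("f").type(singleUnion).noDefault()
                .endRecord();
        // With the PARQUET-952 fix the union is unwrapped to its only branch;
        // on unfixed 1.8.x this conversion fails with "is not a group".
        MessageType parquetSchema = new AvroSchemaConverter().convert(record);
        System.out.println(parquetSchema);
    }
}
{code}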


[jira] [Updated] (PARQUET-1501) v1.8.x to be fixed with PARQUET-952 solution

2019-01-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1501:
-
Description: 
The following issue was fixed in Avro parquet v1.11.0. PARQUET-952: Avro union with 
single type fails with 'is not a group'

 

But we require this fix in the v1.8.x series also. 

 

  was:
The following issue fixed in AVro parquet v1.11.0.
h1. PARQUET-952: Avro union with single type fails with 'is not a group'

 

But we require this fix in v1.8.x series also. 

 


> v1.8.x to be fixed with PARQUET-952 solution
> 
>
> Key: PARQUET-1501
> URL: https://issues.apache.org/jira/browse/PARQUET-1501
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1, 1.8.2, 1.8.3
>Reporter: Vijayakumar N
>Priority: Major
>
> The following issue was fixed in Avro parquet v1.11.0. PARQUET-952: Avro union 
> with single type fails with 'is not a group'
>  
> But we require this fix in the v1.8.x series also. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1501) v1.8.x to be fixed with PARQUET-952 solution

2019-01-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748498#comment-16748498
 ] 

Uwe L. Korn commented on PARQUET-1501:
--

[~nvijayrech] We probably won't make a bugfix release for 1.8 as long as there 
is nobody taking care of this. Closing as we don't have sufficient volunteers.

> v1.8.x to be fixed with PARQUET-952 solution
> 
>
> Key: PARQUET-1501
> URL: https://issues.apache.org/jira/browse/PARQUET-1501
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1, 1.8.2, 1.8.3
>Reporter: Vijayakumar N
>Priority: Major
>
> The following issue was fixed in Avro parquet v1.11.0.
> h1. PARQUET-952: Avro union with single type fails with 'is not a group'
>  
> But we require this fix in the v1.8.x series also. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1501) v1.8.x to be fixed with PARQUET-952 solution

2019-01-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1501:
-
Flags: Patch  (was: Patch,Important)

> v1.8.x to be fixed with PARQUET-952 solution
> 
>
> Key: PARQUET-1501
> URL: https://issues.apache.org/jira/browse/PARQUET-1501
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1, 1.8.2, 1.8.3
>Reporter: Vijayakumar N
>Priority: Major
>
> The following issue was fixed in Avro parquet v1.11.0.
> h1. PARQUET-952: Avro union with single type fails with 'is not a group'
>  
> But we require this fix in the v1.8.x series also. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-21 Thread Uwe L. Korn
Hi,

I'm sadly giving a +0 here. The signatures look good but I was unable to build 
with JDK 9/10/11 on OSX. https://issues.apache.org/jira/browse/PARQUET-1497 and 
https://issues.apache.org/jira/browse/PARQUET-1496 are the problems I'm running 
into. I've also opened a PR to document how to install thrift on OSX: 
https://github.com/apache/parquet-mr/pull/595

Cheers
Uwe

On Fri, Jan 18, 2019, at 1:55 PM, Anna Szonyi wrote:
> Hi All,
> 
> While not a binding one, it's a "+1 (non-binding), everything looks good"
> from me!
> 
> Best,
> Anna
> 
> On Thu, Jan 17, 2019 at 6:13 PM Zoltan Ivanfi 
> wrote:
> 
> > Hi,
> >
> > Friendly reminder to please vote for the release. We need 2 more binding +1
> > votes.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Sat, Jan 12, 2019 at 3:07 AM 俊杰陈  wrote:
> >
> > > +1  (non-binding)
> > > * contents looks good
> > > * unit tests passed
> > >
> > >
> > > Zoltan Ivanfi  wrote on Fri, Jan 11, 2019 at 9:31 PM:
> > >
> > > > +1 (binding)
> > > >
> > > > * contents look good
> > > > * unit tests pass
> > > > * checksums match
> > > > * signature matches
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Thu, Jan 10, 2019 at 11:48 AM Gabor Szadovszky 
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Checked tarball: checksum/signature are correct. Content is correct
> > > based
> > > > > on release tag. Unit tests pass.
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Cheers,
> > > > > Gabor
> > > > >
> > > > > On Wed, Jan 9, 2019 at 4:51 PM Zoltan Ivanfi 
> > > > >  > >
> > > > > wrote:
> > > > >
> > > > > > Dear Parquet Users and Developers,
> > > > > >
> > > > > > I propose the following RC to be released as the official Apache
> > > > > > Parquet 1.11.0 release:
> > > > > >
> > > > > > The commit id is 8be767d12cca295cf9858a521725fc440b0c6f93
> > > > > > * This corresponds to the tag: apache-parquet-1.11.0
> > > > > > *
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/parquet-mr/tree/8be767d12cca295cf9858a521725fc440b0c6f93
> > > > > >
> > > > > > The release tarball, signature, and checksums are here:
> > > > > > *
> > > > > >
> > > > >
> > > >
> > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc3/
> > > > > >
> > > > > > You can find the KEYS file here:
> > > > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > > >
> > > > > > Binary artifacts are staged in Nexus here:
> > > > > > *
> > > > > >
> > > > >
> > > >
> > >
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet/1.11.0/
> > > > > >
> > > > > > This release includes the following new features:
> > > > > >
> > > > > > - PARQUET-1201 - Column indexes
> > > > > > - PARQUET-1253 - Support for new logical type representation
> > > > > > - PARQUET-1381 - Add merge blocks command to parquet-tools
> > > > > > - PARQUET-1388 - Nanosecond precision time and timestamp -
> > parquet-mr
> > > > > >
> > > > > > The release also includes bug fixes, including:
> > > > > >
> > > > > > - PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY.
> > > > > >
> > > > > > Please download, verify, and test. The vote will be open for at
> > least
> > > > 72
> > > > > > hours.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >


[jira] [Created] (PARQUET-1497) [Java] Building on OSX fails with OpenJDK 11

2019-01-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1497:


 Summary: [Java] Building on OSX fails with OpenJDK 11
 Key: PARQUET-1497
 URL: https://issues.apache.org/jira/browse/PARQUET-1497
 Project: Parquet
  Issue Type: Bug
  Components: parquet-thrift
Affects Versions: 1.10.0
Reporter: Uwe L. Korn


When trying to build with OpenJDK 11, I get errors due to the Generated 
annotation not being resolved:
{code:java}
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
parquet-format-structures ---

[INFO] Changes detected - recompiling the module!

[INFO] Compiling 51 source files to 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/classes

[INFO] -

[WARNING] COMPILATION WARNING :

[INFO] -

[WARNING] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java:
 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java
 uses or overrides a deprecated API.

[WARNING] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java:
 Recompile with -Xlint:deprecation for details.

[INFO] 2 warnings

[INFO] -

[INFO] -

[ERROR] COMPILATION ERROR :

[INFO] -

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[37,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[40,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[43,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[41,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[40,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[42,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimeUnit.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/MilliSeconds.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/MilliSeconds.java:[40,2]
 cannot find symbol

  symbol: class Generated

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/MicroSeconds.java:[32,24]
 package javax.annotation does not exist

[ERROR] 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources
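
Background for the errors above: Thrift 0.9.x generates sources annotated with javax.annotation.Generated, which lived in the java.xml.ws.annotation module that was deprecated in Java 9 and removed in Java 11. A minimal illustration of the failing pattern; adding a dependency such as javax.annotation:javax.annotation-api is the usual remedy (an assumption, not necessarily the fix parquet-mr adopted):
{code:java}
// Thrift 0.9.x emits generated sources shaped roughly like this. On Java 8
// the annotation resolves from the JDK itself; on Java 11 the compile fails
// with "package javax.annotation does not exist" unless a dependency such as
// javax.annotation:javax.annotation-api supplies it.
import javax.annotation.Generated;

@Generated("Thrift Compiler")
public class GeneratedSketch {
    public static void main(String[] args) {
        System.out.println("compiles only if javax.annotation.Generated is on the classpath");
    }
}
{code}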

[jira] [Created] (PARQUET-1498) [Java] Add instructions to install thrift via homebrew

2019-01-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1498:


 Summary: [Java] Add instructions to install thrift via homebrew
 Key: PARQUET-1498
 URL: https://issues.apache.org/jira/browse/PARQUET-1498
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 1.11.0


Instead of building it manually, one can also install it via Homebrew, which 
is much more convenient. As this is not the latest thrift version, you need to 
explicitly include it in your {{PATH}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1496) [Java] Build fails on OSX and Java 10

2019-01-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1496:


 Summary: [Java] Build fails on OSX and Java 10
 Key: PARQUET-1496
 URL: https://issues.apache.org/jira/browse/PARQUET-1496
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.0
Reporter: Uwe L. Korn


When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
the build fails for me in {{parquet-scala}} with:
{code:java}
[INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 ---
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.java,**/*.scala,]
[INFO] excludes = []
[INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
info: compiling
[INFO] Compiling 1 source files to 
/Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
1547922718010
[ERROR] error: error while loading package, Missing dependency 'object 
java.lang.Object in compiler mirror', required by 
/Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
[ERROR] error: error while loading package, Missing dependency 'object 
java.lang.Object in compiler mirror', required by 
/Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
[ERROR] error: scala.reflect.internal.MissingRequirementError: object 
java.lang.Object in compiler mirror not found.
[ERROR] at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
[ERROR] at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
[INFO] at 
scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
[INFO] at 
scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
[INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
[INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
[INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
[INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
[INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
[INFO] at scala.tools.nsc.Main.main(Main.scala)
[INFO] at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[INFO] at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[INFO] at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[INFO] at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[INFO] at 
org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
[INFO] at 
org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Draft REPORT] Apache Parquet - January 2019

2019-01-07 Thread Uwe L. Korn
+1

Uwe

On Mon, Jan 7, 2019, at 9:14 PM, Ryan Blue wrote:
> +1
> 
> On Mon, Jan 7, 2019 at 11:39 AM Julien Le Dem
>  wrote:
> 
> > ## Description:
> > Parquet is a standard and interoperable columnar file format
> > for efficient analytics. Parquet has 3 sub-projects:
> > - parquet-format: format reference doc along with thrift based metadata
> > definition (used by both sub-projects below)
> > - parquet-mr: java apis and implementation of the format along with
> > integrations to various projects (thrift, pig, protobuf, avro, ...)
> > - parquet-cpp: C++ apis and implementation of the format along with Python
> > bindings and arrow integration.
> >
> > ## Issues:
> >  No issue at this time
> >
> > ## Activity:
> > Current activity around:
> >
> >- encryption
> >- Page indexing
> >- cutting a new release
> >- improvement on parquet-proto
> >
> >
> > ## Health report:
> > The discussion volume on the mailing lists is stable.
> > Tickets get created and closed at a reasonable pace.
> >
> > ## PMC changes:
> >
> >  - Currently 24 PMC members.
> >  - No new PMC members added in the last 3 months
> >  - Last PMC addition was Zoltan Ivanfi on Sun Apr 15 2018
> >
> > ## Committer base changes:
> >
> >  - Currently 31 committers.
> >  - No new committers added in the last 3 months
> >  - Last committer addition was Benoit Hanotte at Mon May 28 2018
> >
> > ## Releases:
> >
> >  - Last release was Format 2.6.0 on Mon Oct 01 2018
> >
> > ## Mailing list activity:
> >
> >  - dev@parquet.apache.org:
> > - 216 subscribers (up 2 in the last 3 months):
> > - 529 emails sent to list (757 in previous quarter)
> >
> >
> > ## JIRA activity:
> >
> >  - 49 JIRA tickets created in the last 3 months
> >  - 65 JIRA tickets closed/resolved in the last 3 months
> >
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file

2018-12-21 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726776#comment-16726776
 ] 

Uwe L. Korn commented on PARQUET-1481:
--

Can you describe how you generated this Parquet file?

> [C++] SEGV when reading corrupt parquet file
> 
>
> Key: PARQUET-1481
> URL: https://issues.apache.org/jira/browse/PARQUET-1481
> Project: Parquet
>  Issue Type: Bug
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Major
> Attachments: corrupt.parquet
>
>
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('corrupt.parquet')
> fish: 'python' terminated by signal SIGSEGV (Address boundary error)
>  
> Stack report from macOS:
>  
> 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10
> 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732
> 2 libc++.1.dylib 0x7fff4f04acb0 
> std::__1::condition_variable::wait(std::__1::unique_lock&) + 
> 18
> 3 libc++.1.dylib 0x7fff4f04b728 
> std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&)
>  + 46
> 4 libparquet.11.dylib 0x000115512d00 
> std::__1::__assoc_state::move() + 48
> 5 libparquet.11.dylib 0x0001154faa15 
> parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093
> 6 libparquet.11.dylib 0x0001154fb6fe 
> parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*)
>  + 350
> 7 libparquet.11.dylib 0x0001154fce47 
> parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + 
> 23
> 8 _parquet.so 0x00011598d97b 
> __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, 
> _object*) + 1035



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
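
Independent of the C++ fix, the reader-side expectation is that a corrupt footer surfaces as an error rather than a crash. A hedged Java sketch of the same probe with parquet-mr (file name taken from the report above; the error handling is illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class SafeReadSketch {
    public static void main(String[] args) {
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path("corrupt.parquet"), new Configuration()))) {
            // A malformed footer should surface here as an exception,
            // never as a process crash.
            System.out.println(reader.getFooter().getBlocks().size() + " row groups");
        } catch (Exception e) {
            System.err.println("Unreadable Parquet file: " + e.getMessage());
        }
    }
}
{code}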


Re: [Discuss] Code of conduct

2018-12-09 Thread Uwe L. Korn
Hello Julien,

As per ASF policy, https://www.apache.org/foundation/policies/conduct.html 
also applies to the Apache Parquet channels. Would that be sufficient for you?

Cheers
Uwe

On Sat, Dec 8, 2018, at 2:14 AM, Julien Le Dem wrote:
> We currently don’t have an explicit code of conduct. We’ve always
> encouraged respectful discussions and as far as I know all discussions have
> been that way.
> However, I don’t think we should wait for an incident to create the need
> for an explicit code of conduct. I suggest we adopt the contributor
> covenant as it is well aligned with our values as far as I am concerned.
> I also think that explicitly adopting it will encourage others to do the
> same in the open source community.
> Best
> Julien


[jira] [Updated] (PARQUET-490) [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit tests

2018-11-02 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-490:

Summary: [C++] Incorporate DELTA_BINARY_PACKED value encoder into library 
and add unit tests  (was: Incorporate DELTA_BINARY_PACKED value encoder into 
library and add unit tests)

> [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit 
> tests
> ---
>
> Key: PARQUET-490
> URL: https://issues.apache.org/jira/browse/PARQUET-490
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> There is some code for this currently found in 
> {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
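
For background on what this ticket covers: DELTA_BINARY_PACKED stores a first value and then per-value deltas rebased by the block's minimum delta, so sorted data packs into very few bits. A sketch of just that core transform (the real encoding adds block headers, miniblocks, and per-miniblock bit widths):
{code:java}
import java.util.Arrays;

public class DeltaSketch {
    /** Deltas between consecutive values, shifted so the minimum delta is zero. */
    static long[] rebasedDeltas(long[] values) {
        long[] d = new long[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            d[i - 1] = values[i] - values[i - 1];
        }
        long min = Arrays.stream(d).min().orElse(0L);
        for (int i = 0; i < d.length; i++) {
            d[i] -= min; // small non-negative numbers bit-pack tightly
        }
        return d;
    }

    public static void main(String[] args) {
        // Deltas 3, 2, 5 rebased by the minimum 2 become [1, 0, 3].
        System.out.println(Arrays.toString(rebasedDeltas(new long[] {100, 103, 105, 110})));
    }
}
{code}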


[jira] [Updated] (PARQUET-492) [C++] Incorporate DELTA_BYTE_ARRAY value encoder into library and add unit tests

2018-11-02 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-492:

Summary: [C++] Incorporate DELTA_BYTE_ARRAY value encoder into library and 
add unit tests  (was: Incorporate DELTA_BYTE_ARRAY value encoder into library 
and add unit tests)

> [C++] Incorporate DELTA_BYTE_ARRAY value encoder into library and add unit 
> tests
> 
>
> Key: PARQUET-492
> URL: https://issues.apache.org/jira/browse/PARQUET-492
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> See {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-491) [C++] Incorporate DELTA_LENGTH_BYTE_ARRAY value encoder into library and add unit tests

2018-11-02 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-491:

Summary: [C++] Incorporate DELTA_LENGTH_BYTE_ARRAY value encoder into 
library and add unit tests  (was: Incorporate DELTA_LENGTH_BYTE_ARRAY value 
encoder into library and add unit tests)

> [C++] Incorporate DELTA_LENGTH_BYTE_ARRAY value encoder into library and add 
> unit tests
> ---
>
> Key: PARQUET-491
> URL: https://issues.apache.org/jira/browse/PARQUET-491
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> See {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1454) ld-linux-x86-64.so.2 is missing

2018-10-31 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669869#comment-16669869
 ] 

Uwe L. Korn commented on PARQUET-1454:
--

{{ld-linux-x86-64.so.2}} is a library that is normally present on all Linux 
systems and should not be bundled. Can you tell us which Linux distribution you 
are using?

> ld-linux-x86-64.so.2 is missing
> ---
>
> Key: PARQUET-1454
> URL: https://issues.apache.org/jira/browse/PARQUET-1454
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.0
>Reporter: Stets Alexander
>Priority: Minor
>  Labels: documentation
>
> parquet-avro uses the dependency org.xerial.snappy:snappy-java.
> snappy-java needs to extract a native lib. For this it uses 
> ld-linux-x86-64.so.2.
> If your OS doesn't contain ld-linux-x86-64.so.2, you get an exception like this:
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.2-b0bbcae9-e398-4a99-ad6d-19c86734be76-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory 
> (needed by 
> /tmp/snappy-1.1.2-b0bbcae9-e398-4a99-ad6d-19c86734be76-libsnappyjava.so)
> But the documentation doesn't contain information about it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
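
Background: snappy-java extracts a bundled native library at runtime, and on musl-based distributions such as Alpine the glibc loader ld-linux-x86-64.so.2 is absent, so loading fails as reported. A hedged sketch of where the error surfaces (the probe code is illustrative, not from snappy-java's docs):
{code:java}
import java.io.IOException;
import org.xerial.snappy.Snappy;

public class SnappyProbe {
    public static void main(String[] args) throws IOException {
        try {
            // Triggers extraction and loading of the bundled native library.
            byte[] compressed = Snappy.compress("hello".getBytes());
            System.out.println(compressed.length + " bytes");
        } catch (UnsatisfiedLinkError e) {
            // This is where the missing ld-linux-x86-64.so.2 shows up.
            System.err.println("Native snappy unavailable: " + e.getMessage());
        }
    }
}
{code}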


[jira] [Assigned] (PARQUET-1160) [C++] Implement BYTE_ARRAY-backed Decimal reads

2018-09-30 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1160:


Assignee: Ted Haining  (was: Phillip Cloud)

> [C++] Implement BYTE_ARRAY-backed Decimal reads
> ---
>
> Key: PARQUET-1160
> URL: https://issues.apache.org/jira/browse/PARQUET-1160
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Phillip Cloud
>Assignee: Ted Haining
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Attachments: 20180726193815980.parquet
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> What systems support writing Decimals with this underlying type?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
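
For context on the wire format involved: the Parquet spec stores a BYTE_ARRAY-backed (or FIXED_LEN_BYTE_ARRAY-backed) DECIMAL as a big-endian two's-complement unscaled integer, with the scale carried by the column's logical type. A minimal decoding sketch in Java (not the parquet-cpp code under discussion):
{code:java}
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalSketch {
    /** Big-endian two's-complement unscaled bytes plus the column's scale. */
    static BigDecimal decode(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        byte[] unscaled = {0x04, (byte) 0xD2}; // 0x04D2 == 1234
        System.out.println(decode(unscaled, 2)); // prints 12.34
    }
}
{code}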


[jira] [Resolved] (PARQUET-1160) [C++] Implement BYTE_ARRAY-backed Decimal reads

2018-09-30 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1160.
--
   Resolution: Fixed
Fix Version/s: 1.10.1

Issue resolved by pull request 2646
[https://github.com/apache/arrow/pull/2646]

> [C++] Implement BYTE_ARRAY-backed Decimal reads
> ---
>
> Key: PARQUET-1160
> URL: https://issues.apache.org/jira/browse/PARQUET-1160
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.1
>
> Attachments: 20180726193815980.parquet
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> What systems support writing Decimals with this underlying type?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1160) [C++] Implement BYTE_ARRAY-backed Decimal reads

2018-09-30 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1160:
-
Fix Version/s: (was: 1.10.1)
   cpp-1.6.0

> [C++] Implement BYTE_ARRAY-backed Decimal reads
> ---
>
> Key: PARQUET-1160
> URL: https://issues.apache.org/jira/browse/PARQUET-1160
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Attachments: 20180726193815980.parquet
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> What systems support writing Decimals with this underlying type?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Donate C (GLib) bindings for C++ implementation

2018-09-25 Thread Uwe L. Korn
Hello Kou,

this was already mentioned on the pull request, but I'm copying it here for others:

We would very much like this in the Apache repository and are very grateful for 
the code donation. We should hold the usual formal code donation vote and then 
merge it.

Uwe

On Tue, Sep 25, 2018, at 7:20 AM, Kouhei Sutou wrote:
> I've created a pull request to view the changes for this work:
> 
>   https://github.com/apache/arrow/pull/2622
> 
> 
> Thanks,
> --
> kou
> 
> In <20180925.140014.513561756859357130@clear-code.com>
>   "Donate C (GLib) bindings for C++ implementation" on Tue, 25 Sep 2018 
> 14:00:14 +0900 (JST),
>   Kouhei Sutou  wrote:
> 
> > Hi,
> > 
> > I want to donate Parquet C (GLib) bindings for Parquet C++
> > implementation:
> > 
> >   https://github.com/red-data-tools/parquet-glib
> > 
> > It just implements Arrow file reader/writer for now. I'll
> > implement more features later.
> > 
> > Could you please give any feedback on this? Is this useful for the
> > Parquet project? Should I work in a separate repository?
> > 
> > 
> > Background:
> > 
> > I'm the author of Arrow C (GLib) bindings for Arrow C++
> > implementation. Parquet C (GLib) bindings uses Arrow C
> > (GLib) bindings. The Parquet C++ implementation has been merged
> > into the Arrow repository recently. So, it's convenient for me
> > that the Parquet C (GLib) bindings exist in the Arrow
> > repository.
> > 
> > This project got a pull request today:
> > https://github.com/red-data-tools/parquet-glib/pull/2
> > 
> > I thought that it would be better to donate this project to the
> > Parquet project if the C (GLib) bindings are useful for the Parquet
> > project.
> > 
> > 
> > Thanks,
> > --
> > kou


[jira] [Commented] (PARQUET-1422) [C++] Use Arrow IO interfaces natively rather than current parquet:: wrappers

2018-09-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624704#comment-16624704
 ] 

Uwe L. Korn commented on PARQUET-1422:
--

+1 from me, that's one of the refactoring benefits I was hoping to see.

> [C++] Use Arrow IO interfaces natively rather than current parquet:: wrappers
> -
>
> Key: PARQUET-1422
> URL: https://issues.apache.org/jira/browse/PARQUET-1422
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We are beginning to do some work on asynchronous IO in Arrow and it would be 
> great to be able to leverage this in the Parquet core internals. 
> I am proposing to remove the Parquet-specific virtual file interfaces in
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L221
> and instead rely directly on the Arrow ones in arrow::io. In addition to 
> reducing the amount of code we have to maintain, we will also be able to 
> improve performance of Parquet by utilizing common utilities for managing 
> asynchronous / background IO
> cc [~mdeepak] [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-09-20 Thread Uwe L. Korn
Hello Wes,

I'm definitely +1 on archiving the master branch. I'm not sure what exactly 
you mean by this. I would simply add a final commit that deletes all code and 
adds a message to the README that the repository has moved into another repo.

Cheers
Uwe

On Thu, Sep 13, 2018, at 10:47 PM, Wes McKinney wrote:
> hi folks,
> 
> Could I get some feedback about the follow-up items? There are still
> some parts of the codebase that need to be migrated. Additionally, I'm
> proposing to archive the master branch so that people with build
> toolchains running against parquet-cpp master will be forced to
> migrate. The hard part is over; I would like to get things closed out
> on apache/parquet-cpp and move development forward.
> 
> Thanks,
> Wes
> On Sun, Sep 9, 2018 at 8:45 PM Wes McKinney  wrote:
> >
> > Might make sense to archive the master branch so that people's
> > now-outdated build toolchains (where they may be cloning
> > apache/parquet-cpp) will fail fast. We are already starting to get bug
> > reports along these lines.
> >
> > Thoughts?
> > On Sat, Sep 8, 2018 at 10:43 AM Wes McKinney  wrote:
> > >
> > > We should probably also write a blog post on the Apache Arrow website
> > > to increase visibility of this move to the broader community.
> > >
> > > On Sat, Sep 8, 2018 at 10:42 AM Wes McKinney  wrote:
> > > >
> > > > Dear all -- the merge has been completed, thank you! 318 patches
> > > > (after the filter-branch grafting procedure) were merged to
> > > > apache/arrow
> > > >
> > > > We have some follow up work to do:
> > > >
> > > > * Move patches from apache/parquet-cpp to apache/arrow
> > > > * Add CONTRIBUTING.md and note to README that patches are no longer
> > > > accepted at the old location
> > > > * Migrate CLI utiltiies and other small items that did not survive the
> > > > merge: tools/, benchmarks/, and examples/
> > > > * Develop new release procedure for Apache Parquet
> > > >
> > > > On this third point, we can also import their git history if desired.
> > > > Incorporating them into the build will be comparatively easy to the
> > > > library integration.
> > > >
> > > > There are already some JIRA issues open for some of these, but
> > > > anything else please create issues so we can keep track.
> > > >
> > > > I'm already quite excited to get busy with some refactoring and
> > > > internals improvements that I had avoided because of the painful
> > > > development procedure.
> > > >
> > > > Thanks,
> > > > Wes


[RESULT][VOTE] Release Apache Parquet C++ 1.5.0 RC0

2018-09-06 Thread Uwe L. Korn
With three binding votes, the release passes.

I will upload the tarballs in the next hours.

Uwe

On Sun, Sep 2, 2018, at 6:27 AM, Ryan Blue wrote:
> +1 (binding)
> 
> Verified using the validation script on Ubuntu 18.04. Thanks for getting
> this out, Uwe!
> 
> On Fri, Aug 31, 2018 at 3:46 PM Wes McKinney  wrote:
> 
> > +1 (binding). Verified sigs, checksums, unit tests on Ubuntu 14.04
> > Trusty using verify-release-candidate script
> >
> > I haven't been able to find the time to run on Windows but I trust our
> > CI to keep the build working properly
> >
> > Thanks Uwe for managing the release
> > On Thu, Aug 30, 2018 at 4:36 PM Wes McKinney  wrote:
> > >
> > > It may take me until sometime tomorrow to run the build since I want
> > > to check on Windows also
> > > On Wed, Aug 29, 2018 at 11:10 AM Uwe L. Korn  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > Verified on Ubuntu 16.04 using `./dev/release/verify-release-candidate
> > 1.5.0 0`
> > > >
> > > > On Wed, Aug 29, 2018, at 5:09 PM, Uwe L. Korn wrote:
> > > > > All,
> > > > >
> > > > > I propose that we accept the following release candidate as the
> > official
> > > > > Apache Parquet C++ 1.5.0 release.
> > > > >
> > > > > Parquet C++ 1.5.0-rc0 includes the following:
> > > > > ---
> > > > > The CHANGELOG for the release is available at:
> > > > >
> > https://gitbox.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.5.0-rc0
> > > > >
> > > > > The tag used to create the release candidate is:
> > > > >
> > https://gitbox.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.5.0-rc0
> > > > >
> > > > > The release candidate is available at:
> > > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz
> > > > >
> > > > > The MD5 checksum of the release candidate can be found at:
> > > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.md5
> > > > >
> > > > > The signature of the release candidate can be found at:
> > > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.asc
> > > > >
> > > > > The GPG key used to sign the release is available at:
> > > > > https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > >
> > > > > The release is based on the commit hash
> > > > > 80e110c823c5631ce4a4f0a5da486e759219f1e3.
> > > > >
> > > > > Please download, verify, and test.
> > > > >
> > > > > The vote will close on Sat, 1 Sep 16:56:37 CEST 2018
> > > > >
> > > > > [ ] +1 Release this as Apache Parquet C++ 1.5.0
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Parquet C++ 1.5.0 because...
> >
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-09-04 Thread Uwe L. Korn
Hello Wes,

I don't have much time this week but I hope to squeeze in some minutes tomorrow 
afternoon to review the code. As this is a very big merge, I want to be extra 
careful to not break anything really badly. Hopefully more eyes will help.

Thank you for all the work in pushing this forward in the last days!

Uwe

On Tue, Sep 4, 2018, at 6:27 PM, Wes McKinney wrote:
> Dear all,
> 
> The repo merge is nearly ready to go modulo some fixes to CI. There
> will be a number of follow up issues to re-establish the various
> (untested) build procedures in parquet-cpp
> 
> https://github.com/apache/arrow/pull/2453
> 
> I would like to merge this by EOD Wednesday 9/5, or Thursday at
> latest, so we can get the patches from apache/parquet-cpp moved over
> and avoid any disruption to development process. If there are any
> comments please let me know
> 
> - Wes
> On Tue, Aug 21, 2018 at 12:23 PM Wes McKinney  wrote:
> >
> > hi all,
> >
> > with 3 binding +1 votes, the vote carries. We will discuss with Apache
> > Arrow about how to specifically proceed
> >
> > I have already done the preparatory work to undertake the merge
> >
> > https://github.com/apache/arrow/pull/2453
> >
> > thanks
> > Wes
> >
> > On Tue, Aug 21, 2018 at 10:41 AM, Wes McKinney  wrote:
> > > Yes, feel free to have a look at
> > >
> > > https://github.com/apache/arrow/pull/2453
> > >
> > > I'm not much in favor of having a commingled non-linear history that
> > > makes git bisect difficult. We will have to discuss on the Arrow ML
> > >
> > > Here's an example from Apache Spark where a similar merge took place
> > >
> > > https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53
> > >
> > > It would be my preference to have a single squashed commit whose
> > > message attributes the developers of the code and provides links back
> > > to the original commit history in the commit message
> > >
> > > - Wes
> > >
> > >
> > > On Tue, Aug 21, 2018 at 9:52 AM, Uwe L. Korn  wrote:
> > >> I have a very strong preference to keep the git history. I will have a 
> > >> look tomorrow to find the correct git magic to get a linear history. For 
> > >> me a single merge commit would be ok but I'm fine to spend an additional 
> > >> hour on this if you care strongly about linear history.
> > >>
> > >> Uwe
> > >>
> > >> On Sun, Aug 19, 2018, at 7:36 PM, Wes McKinney wrote:
> > >>> OK. I'm a bit -0 on doing anything that results in Arrow having a
> > >>> nonlinear git history (and rebasing is not really an option) but we
> > >>> can discuss that more later
> > >>>
> > >>> On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn  wrote:
> > >>> > +1 on this but also see my comments in the mail on the discussions.
> > >>> >
> > >>> > We should also keep the git history of parquet-cpp, that should not 
> > >>> > be hard with git and there is probably a StackOverflow answer out 
> > >>> > there that gives you the commands to do the merge.
> > >>> >
> > >>> > Uwe
> > >>> >
> > >>> > On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> > >>> >> In case any are interested: my estimate of the work involved in the
> > >>> >> migration to be about a full day of total work, possibly less. As 
> > >>> >> soon
> > >>> >> as the migration plan is decided upon I intend to execute ASAP so 
> > >>> >> that
> > >>> >> ongoing development efforts are not disrupted.
> > >>> >>
> > >>> >> Additionally, in flight patches do not all need to be merged. Patches
> > >>> >> can be easily edited to apply against the modified repository
> > >>> >> structure
> > >>> >>
> > >>> >> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney  
> > >>> >> wrote:
> > >>> >> > hi all,
> > >>> >> >
> > >>> >> > As discussed on the mailing list [1] I am proposing to undertake a
> > >>> >> > restructuring of the development process for parquet-cpp and its
> > >>> >> > consumption in the Arrow ecosystem to benefit the developers and 
> > >>> >> > users
> > >>> >> > of both communities.

Re: [VOTE] Release Apache Parquet C++ 1.5.0 RC0

2018-08-29 Thread Uwe L. Korn
+1 (binding)

Verified on Ubuntu 16.04 using `./dev/release/verify-release-candidate 1.5.0 0`

On Wed, Aug 29, 2018, at 5:09 PM, Uwe L. Korn wrote:
> All,
> 
> I propose that we accept the following release candidate as the official
> Apache Parquet C++ 1.5.0 release.
> 
> Parquet C++ 1.5.0-rc0 includes the following:
> ---
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.5.0-rc0
> 
> The tag used to create the release candidate is:
> https://gitbox.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.5.0-rc0
> 
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz
> 
> The MD5 checksum of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.md5
> 
> The signature of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.asc
> 
> The GPG key used to sign the release is available at:
> https://dist.apache.org/repos/dist/dev/parquet/KEYS
> 
> The release is based on the commit hash 
> 80e110c823c5631ce4a4f0a5da486e759219f1e3.
> 
> Please download, verify, and test.
> 
> The vote will close on Sat, 1 Sep 16:56:37 CEST 2018
> 
> [ ] +1 Release this as Apache Parquet C++ 1.5.0
> [ ] +0
> [ ] -1 Do not release this as Apache Parquet C++ 1.5.0 because...


[VOTE] Release Apache Parquet C++ 1.5.0 RC0

2018-08-29 Thread Uwe L. Korn
All,

I propose that we accept the following release candidate as the official
Apache Parquet C++ 1.5.0 release.

Parquet C++ 1.5.0-rc0 includes the following:
---
The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.5.0-rc0

The tag used to create the release candidate is:
https://gitbox.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.5.0-rc0

The release candidate is available at:
https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz

The MD5 checksum of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.md5

The signature of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.5.0-rc0/apache-parquet-cpp-1.5.0.tar.gz.asc

The GPG key used to sign the release is available at:
https://dist.apache.org/repos/dist/dev/parquet/KEYS

The release is based on the commit hash 
80e110c823c5631ce4a4f0a5da486e759219f1e3.

Please download, verify, and test.

The vote will close on Sat, 1 Sep 16:56:37 CEST 2018

[ ] +1 Release this as Apache Parquet C++ 1.5.0
[ ] +0
[ ] -1 Do not release this as Apache Parquet C++ 1.5.0 because...


Re: Doing a 1.5.0 C++ release

2018-08-27 Thread Uwe L. Korn
I can do the release sometime in the next 48h. I need to find 30 minutes of 
concentration; that's all that's blocking the release currently.

Uwe

> Am 27.08.2018 um 19:04 schrieb Wes McKinney :
> 
> Uwe -- are you going to be the RM? Let me know if there's anything I
> can do to help.
> 
> Thanks
> 
>> On Sun, Aug 26, 2018 at 1:06 PM, Wes McKinney  wrote:
>> I think we should be able to cut a release now? We can also proceed
>> with the Arrow merge at the same time once we agree on exactly how
>> to do that.
>> 
>>> On Wed, Aug 22, 2018 at 7:30 AM, Uwe L. Korn  wrote:
>>> For me it would also be quite useful to have 
>>> https://github.com/apache/parquet-cpp/pull/492 in the release.
>>> 
>>> Uwe
>>> 
>>>> On Tue, Aug 21, 2018, at 6:26 PM, Wes McKinney wrote:
>>>> I will review PARQUET-1372 again today so we can get that in soon.
>>>> 
>>>> I suggest we release 1.5.0 immediately after that so we are not
>>>> delayed in the monorepo merge. We need to conduct a vote there so it
>>>> will be a minimum of a few days anyhow until we're able to do that
>>>> 
>>>> - Wes
>>>> 
>>>>> On Sun, Aug 19, 2018 at 6:06 PM, Deepak Majeti  
>>>>> wrote:
>>>>> Uwe,
>>>>> 
>>>>> I would like to get https://issues.apache.org/jira/browse/PARQUET-1372 
>>>>> into
>>>>> this release as well. There is a PR already open for this JIRA and I got
>>>>> some feedback. I will address the feedback in the next couple of days.
>>>>> 
>>>>>> On Sun, Aug 19, 2018 at 8:48 AM Uwe L. Korn  wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> as we are in the process of doing/voting on a repo merge with the Arrow
>>>>>> project and also because there was some time since the last release, I
>>>>>> would like to proceed with a 1.5.0 release soon. Please have a look over
>>>>>> the issues at
>>>>>> https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and
>>>>>> move the non-critical ones to 1.6.0 or help in fixing those that should 
>>>>>> go
>>>>>> into 1.5.0. Is there anything else currently in progress that should be
>>>>>> merged before we release?
>>>>>> 
>>>>>> Uwe
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> regards,
>>>>> Deepak Majeti



Re: Date and time for next Parquet sync

2018-08-27 Thread Uwe L. Korn
Hello Nandor,

I can probably make this time. Just a timezone question: is it 6pm CET or 6pm 
CEST? I guess the latter.

See 
http://timesched.pocoo.org/?date=2018-08-28=central-europe-standard-time!,pacific-standard-time=1080,1140

Uwe

On Mon, Aug 27, 2018, at 12:20 PM, Nandor Kollar wrote:
> Hi All,
> 
> As discussed on the last Parquet sync, I propose to have another meeting
> on August 28th, at 6pm CET / 9am PST, to discuss those topics which we
> didn't have time for at the sync on August 15th, and of course any new
> topic too.
> 
> Sorry for the late notice, feel free to propose another time slot if it
> is not suitable for you! Calendar entry to follow.
> 
> Regards,
> Nandor


[jira] [Resolved] (PARQUET-1372) [C++] Add an API to allow writing RowGroups based on their size rather than num_rows

2018-08-25 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1372.
--
Resolution: Fixed

Issue resolved by pull request 484
[https://github.com/apache/parquet-cpp/pull/484]

> [C++] Add an API to allow writing RowGroups based on their size rather than 
> num_rows
> 
>
> Key: PARQUET-1372
> URL: https://issues.apache.org/jira/browse/PARQUET-1372
> Project: Parquet
>  Issue Type: Task
>Reporter: Anatoli Shein
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> The current API allows writing RowGroups with a specified number of rows, 
> however it does not allow writing RowGroups of a specified size. In order to 
> write RowGroups of a specified size, we need to write rows in chunks while 
> checking the total_bytes_written after each chunk is written. This is 
> currently impossible because the call to NextColumn() closes the current 
> column writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
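
A sketch of the size-bounded writing pattern this issue enabled. It assumes a
schema with a single REQUIRED INT64 column and the buffered row-group API from
the resolving PR (AppendBufferedRowGroup, column(), total_bytes_written());
verify the exact names against the current parquet-cpp headers:

#include <algorithm>
#include <cstdint>
#include <parquet/api/writer.h>

// Write ~64 MiB row groups by chunking WriteBatch calls and checking the
// bytes written after each chunk. Writer setup and error handling omitted.
void WriteSizeBoundedRowGroups(parquet::ParquetFileWriter* writer,
                               const int64_t* values, int64_t num_rows) {
  const int64_t kTargetBytes = 64 * 1024 * 1024;
  const int64_t kChunkRows = 1024;
  int64_t offset = 0;
  while (offset < num_rows) {
    // Buffered mode keeps all column writers open between chunks.
    parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
    while (offset < num_rows && rg->total_bytes_written() < kTargetBytes) {
      int64_t n = std::min(kChunkRows, num_rows - offset);
      auto* col = static_cast<parquet::Int64Writer*>(rg->column(0));
      col->WriteBatch(n, nullptr, nullptr, values + offset);
      offset += n;
    }
  }
}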


[jira] [Resolved] (PARQUET-1392) [C++] Supply row group indices to parquet::arrow::FileReader::ReadTable

2018-08-23 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1392.
--
Resolution: Fixed

Resolved by PR https://github.com/apache/parquet-cpp/pull/492

> [C++] Supply row group indices to parquet::arrow::FileReader::ReadTable
> ---
>
> Key: PARQUET-1392
> URL: https://issues.apache.org/jira/browse/PARQUET-1392
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>    Reporter: Uwe L. Korn
>    Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> By looking at the Parquet statistics, a user can already determine with their 
> own logic which RowGroups are interesting to them. Currently we only provide 
> functions to read the whole file or individual RowGroups. By supplying 
> {{parquet::arrow}} with all requested RowGroups at once, it can better optimize 
> its memory allocations as well as make better use of the underlying thread pool.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Moved] (PARQUET-1403) Can't save a df using Parquet if using float16

2018-08-23 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn moved ARROW-3112 to PARQUET-1403:
-

Affects Version/s: (was: 0.10.0)
   cpp-1.4.0
  Component/s: (was: Python)
   parquet-cpp
 Workflow: patch-available, re-open possible  (was: jira)
  Key: PARQUET-1403  (was: ARROW-3112)
  Project: Parquet  (was: Apache Arrow)

> Can't save a df using Parquet if using float16
> --
>
> Key: PARQUET-1403
> URL: https://issues.apache.org/jira/browse/PARQUET-1403
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Naor Volkovich
>Priority: Major
>
> When trying to save a Pandas DF using "to_parquet" where that DF has a column 
> with a dtype of float16, I get the error: 
> "pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet 
> schema conversion: halffloat"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
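
Until halffloat conversion is implemented, the usual workaround is to upcast
before writing; a minimal Python sketch (the column name is hypothetical):

import pandas as pd

df = pd.DataFrame({"x": pd.Series([1.0, 2.0], dtype="float16")})
# No Arrow-to-Parquet conversion exists for float16 here, so upcast first.
df.astype({"x": "float32"}).to_parquet("out.parquet")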


Re: PARQUET-1399: Move parquet-mr related code from parquet-format

2018-08-22 Thread Uwe L. Korn
Hello Gabor
 
> I've just realized that merge commit in github is "not enabled for this
> repository". Any suggestions on how we can work around this?

You have to merge manually on your command line using "git merge … && git push 
origin master".
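
Spelled out, a minimal sketch (remote and branch names are hypothetical):

git fetch origin
git checkout master
git merge --no-ff my-feature-branch   # hypothetical branch name
git push origin master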

Uwe 


Re: Doing a 1.5.0 C++ release

2018-08-22 Thread Uwe L. Korn
For me it would also be quite useful to have 
https://github.com/apache/parquet-cpp/pull/492 in the release.

Uwe

On Tue, Aug 21, 2018, at 6:26 PM, Wes McKinney wrote:
> I will review PARQUET-1372 again today so we can get that in soon.
> 
> I suggest we release 1.5.0 immediately after that so we are not
> delayed in the monorepo merge. We need to conduct a vote there so it
> will be a minimum of a few days anyhow until we're able to do that
> 
> - Wes
> 
> On Sun, Aug 19, 2018 at 6:06 PM, Deepak Majeti  
> wrote:
> > Uwe,
> >
> > I would like to get https://issues.apache.org/jira/browse/PARQUET-1372 into
> > this release as well. There is a PR already open for this JIRA and I got
> > some feedback. I will address the feedback in the next couple of days.
> >
> > On Sun, Aug 19, 2018 at 8:48 AM Uwe L. Korn  wrote:
> >
> >> Hello,
> >>
> >> as we are in the process of doing/voting on a repo merge with the Arrow
> >> project and also because there was some time since the last release, I
> >> would like to proceed with a 1.5.0 release soon. Please have a look over
> >> the issues at
> >> https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and
> >> move the non-critical ones to 1.6.0 or help in fixing those that should go
> >> into 1.5.0. Is there anything else currently in progress that should be
> >> merged before we release?
> >>
> >> Uwe
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti


Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-21 Thread Uwe L. Korn
I have a very strong preference to keep the git history. I will have a look 
tomorrow to find the correct git magic to get a linear history. For me a single 
merge commit would be ok but I'm fine to spend an additional hour on this if 
you care strongly about linear history.

Uwe

On Sun, Aug 19, 2018, at 7:36 PM, Wes McKinney wrote:
> OK. I'm a bit -0 on doing anything that results in Arrow having a
> nonlinear git history (and rebasing is not really an option) but we
> can discuss that more later
> 
> On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn  wrote:
> > +1 on this but also see my comments in the mail on the discussions.
> >
> > We should also keep the git history of parquet-cpp, that should not be hard 
> > with git and there is probably a StackOverflow answer out there that gives 
> > you the commands to do the merge.
> >
> > Uwe
> >
> > On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> >> In case any are interested: my estimate of the work involved in the
> >> migration to be about a full day of total work, possibly less. As soon
> >> as the migration plan is decided upon I intend to execute ASAP so that
> >> ongoing development efforts are not disrupted.
> >>
> >> Additionally, in flight patches do not all need to be merged. Patches
> >> can be easily edited to apply against the modified repository
> >> structure
> >>
> >> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney  wrote:
> >> > hi all,
> >> >
> >> > As discussed on the mailing list [1] I am proposing to undertake a
> >> > restructuring of the development process for parquet-cpp and its
> >> > consumption in the Arrow ecosystem to benefit the developers and users
> >> > of both communities.
> >> >
> >> > The specific actions we would take would be:
> >> >
> >> > 1) Move the source code currently located at src/ in the
> >> > apache/parquet-cpp repository [2] to the cpp/src/ directory located in
> >> > apache/arrow [3]
> >> >
> >> > 2) The parquet code tree would remain separate from the Arrow code
> >> > tree, though the two projects will continue to share code as they do
> >> > now
> >> >
> >> > 3) The build system in apache/parquet-cpp would be effectively
> >> > deprecated and can be mostly discarded, as it is largely redundant and
> >> > duplicated from the build system in apache/arrow
> >> >
> >> > 4) The Parquet and Arrow C++ communities will collaborate to provide
> >> > development workflows to enable contributors working exclusively on
> > >>> >> > the Parquet core functionality to be able to work unencumbered by
> >> > unnecessary build or test dependencies from the rest of the Arrow
> >> > codebase. Note that parquet-cpp already builds a significant portion
> >> > of Apache Arrow en route to creating its libraries
> >> >
> >> > 5) The Parquet community can create scripts to "cut" Parquet C++
> >> > releases by packaging up the appropriate components and ensuring that
> >> > they can be built and installed independently as now
> >> >
> >> > 6) The CI processes would be merged -- since we already build the
> >> > Parquet libraries in Arrow's CI workflow, this would amount to
> >> > building the Parquet unit tests and running them.
> >> >
> >> > 7) Patches contributed that do not involve Arrow-related functionality
> >> > could use the PARQUET- marking, though some ARROW- patches may
> >> > span both codebases
> >> >
> >> > 8) Parquet C++ committers can be given push rights on apache/arrow
> >> > subject to ongoing good citizenry (e.g. not merging patches that break
> >> > builds). The Arrow PMC may need to vote on the procedure for offering
> >> > pass-through commit rights to anyone who has been invited to be a
> >> > committer for Apache Parquet
> >> >
> >> > 9) The contributors who work on both Arrow and Parquet will work in
> > >>> >> > good faith to ensure that the needs of Parquet-only developers (i.e.
> >> > who consume Parquet files in some way unrelated to the Arrow columnar
> >> > standard) are accommodated
> >> >
> >> > There are a number of particular details we will need to discuss
> >> > further (such as the specific logistics of the codebase surgery; e.g.
> > >>> >> > how to manage the commit history in apache/parquet-cpp -- do we care
> > >>> >> > about git blame?)

[jira] [Comment Edited] (PARQUET-1395) [C++] Tests fail due to not finding libboost_system.so

2018-08-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585947#comment-16585947
 ] 

Uwe L. Korn edited comment on PARQUET-1395 at 8/20/18 1:58 PM:
---

Ok, that is definitely the root of the problem.

In the cmake version I use locally, I have CMAKE_SKIP_INSTALL_RPATH=OFF and 
CMAKE_SKIP_RPATH=OFF. The CMAKE_INSTALL_RPATH_USE_LINK_PATH option was not set 
by cmake. I have cmake 3.10.0 from conda-forge installed.


was (Author: xhochy):
Ok, that is definitely the root of the problem.

In the cmake version I use locally, I get have CMAKE_SKIP_INSTALL_RPATH=OFF and 
CMAKE_SKIP_RPATH=OFF. The CMAKE_INSTALL_RPATH_USE_LINK_PATH option was not set 
by cmake. I have cmake 3.10.0 from conda-forge installed.

> [C++] Tests fail due to not finding libboost_system.so
> --
>
> Key: PARQUET-1395
> URL: https://issues.apache.org/jira/browse/PARQUET-1395
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> When building:
> {code}
> -- Boost version: 1.67.0
> -- Found the following Boost libraries:
> --   regex
> -- Boost include dir: /home/antoine/miniconda3/envs/pyarrow/include
> -- Boost libraries: 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so
> {code}
> Then:
> {code}
> $ ./build-debug/debug/memory-test 
> ./build-debug/debug/memory-test: error while loading shared libraries: 
> libboost_system.so.1.67.0: cannot open shared object file: No such file or 
> directory
> {code}
> {code}
> $ ldd ./build-debug/debug/memory-test 
>   linux-vdso.so.1 (0x7fffcbfed000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f64e2f07000)
>   libarrow.so.11 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.11 (0x7f64e28ad000)
>   libboost_regex.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so.1.67.0 
> (0x7f64e25a9000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f64e226a000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f64e1ecc000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f64e1cb9000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f64e18c8000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f64e3415000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f64e16c4000)
>   libboost_system.so.1.67.0 => not found
>   libboost_filesystem.so.1.67.0 => not found
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f64e14bc000)
>   libicudata.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicudata.so.58 
> (0x7f64df9bc000)
>   libicui18n.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicui18n.so.58 
> (0x7f64df547000)
>   libicuuc.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicuuc.so.58 
> (0x7f64df199000)
> {code}
> It looks like our cmake build script doesn't link explicitly with the conda 
> env's libboost_system.so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
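
For readers hitting the same failure, one plausible configure-time workaround
(standard CMake options; not necessarily the fix that was eventually applied):

cmake -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON ..
# or pin the conda environment's lib directory explicitly:
cmake -DCMAKE_INSTALL_RPATH=$CONDA_PREFIX/lib ..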


[jira] [Commented] (PARQUET-1395) [C++] Tests fail due to not finding libboost_system.so

2018-08-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585947#comment-16585947
 ] 

Uwe L. Korn commented on PARQUET-1395:
--

Ok, that is definitely the root of the problem.

In the cmake version I use locally, I have CMAKE_SKIP_INSTALL_RPATH=OFF and 
CMAKE_SKIP_RPATH=OFF. The CMAKE_INSTALL_RPATH_USE_LINK_PATH option was not set 
by cmake. I have cmake 3.10.0 from conda-forge installed.

> [C++] Tests fail due to not finding libboost_system.so
> --
>
> Key: PARQUET-1395
> URL: https://issues.apache.org/jira/browse/PARQUET-1395
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> When building:
> {code}
> -- Boost version: 1.67.0
> -- Found the following Boost libraries:
> --   regex
> -- Boost include dir: /home/antoine/miniconda3/envs/pyarrow/include
> -- Boost libraries: 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so
> {code}
> Then:
> {code}
> $ ./build-debug/debug/memory-test 
> ./build-debug/debug/memory-test: error while loading shared libraries: 
> libboost_system.so.1.67.0: cannot open shared object file: No such file or 
> directory
> {code}
> {code}
> $ ldd ./build-debug/debug/memory-test 
>   linux-vdso.so.1 (0x7fffcbfed000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f64e2f07000)
>   libarrow.so.11 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.11 (0x7f64e28ad000)
>   libboost_regex.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so.1.67.0 
> (0x7f64e25a9000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f64e226a000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f64e1ecc000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f64e1cb9000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f64e18c8000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f64e3415000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f64e16c4000)
>   libboost_system.so.1.67.0 => not found
>   libboost_filesystem.so.1.67.0 => not found
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f64e14bc000)
>   libicudata.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicudata.so.58 
> (0x7f64df9bc000)
>   libicui18n.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicui18n.so.58 
> (0x7f64df547000)
>   libicuuc.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicuuc.so.58 
> (0x7f64df199000)
> {code}
> It looks like our cmake build script doesn't link explicitly with the conda 
> env's libboost_system.so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1395) [C++] Tests fail due to not finding libboost_system.so

2018-08-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585931#comment-16585931
 ] 

Uwe L. Korn commented on PARQUET-1395:
--

[~pitrou] can you post the output of the following?

{code}
% objdump -x ./release/memory-test | grep RPATH
  RPATH                /home/uwe/miniconda3/envs/pyarrow-dev/lib
{code}

> [C++] Tests fail due to not finding libboost_system.so
> --
>
> Key: PARQUET-1395
> URL: https://issues.apache.org/jira/browse/PARQUET-1395
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> When building:
> {code}
> -- Boost version: 1.67.0
> -- Found the following Boost libraries:
> --   regex
> -- Boost include dir: /home/antoine/miniconda3/envs/pyarrow/include
> -- Boost libraries: 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so
> {code}
> Then:
> {code}
> $ ./build-debug/debug/memory-test 
> ./build-debug/debug/memory-test: error while loading shared libraries: 
> libboost_system.so.1.67.0: cannot open shared object file: No such file or 
> directory
> {code}
> {code}
> $ ldd ./build-debug/debug/memory-test 
>   linux-vdso.so.1 (0x7fffcbfed000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f64e2f07000)
>   libarrow.so.11 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.11 (0x7f64e28ad000)
>   libboost_regex.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so.1.67.0 
> (0x7f64e25a9000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f64e226a000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f64e1ecc000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f64e1cb9000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f64e18c8000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f64e3415000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f64e16c4000)
>   libboost_system.so.1.67.0 => not found
>   libboost_filesystem.so.1.67.0 => not found
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f64e14bc000)
>   libicudata.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicudata.so.58 
> (0x7f64df9bc000)
>   libicui18n.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicui18n.so.58 
> (0x7f64df547000)
>   libicuuc.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicuuc.so.58 
> (0x7f64df199000)
> {code}
> It looks like our cmake build script doesn't link explicitly with the conda 
> env's libboost_system.so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1395) [C++] Tests fail due to not finding libboost_system.so

2018-08-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585927#comment-16585927
 ] 

Uwe L. Korn commented on PARQUET-1395:
--

We should have a look at what {{conda}} is doing internally. I know that 
{{conda}} is adjusting the RPATH during installation to ensure that the correct 
libs are used. This is most likely a thing we also need to do when we install 
into a conda environment.

I'm still a bit confused as to why this pops up now; this was working quite 
well for a long time.

> [C++] Tests fail due to not finding libboost_system.so
> --
>
> Key: PARQUET-1395
> URL: https://issues.apache.org/jira/browse/PARQUET-1395
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> When building:
> {code}
> -- Boost version: 1.67.0
> -- Found the following Boost libraries:
> --   regex
> -- Boost include dir: /home/antoine/miniconda3/envs/pyarrow/include
> -- Boost libraries: 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so
> {code}
> Then:
> {code}
> $ ./build-debug/debug/memory-test 
> ./build-debug/debug/memory-test: error while loading shared libraries: 
> libboost_system.so.1.67.0: cannot open shared object file: No such file or 
> directory
> {code}
> {code}
> $ ldd ./build-debug/debug/memory-test 
>   linux-vdso.so.1 (0x7fffcbfed000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f64e2f07000)
>   libarrow.so.11 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.11 (0x7f64e28ad000)
>   libboost_regex.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_regex.so.1.67.0 
> (0x7f64e25a9000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f64e226a000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f64e1ecc000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f64e1cb9000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f64e18c8000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f64e3415000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f64e16c4000)
>   libboost_system.so.1.67.0 => not found
>   libboost_filesystem.so.1.67.0 => not found
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f64e14bc000)
>   libicudata.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicudata.so.58 
> (0x7f64df9bc000)
>   libicui18n.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicui18n.so.58 
> (0x7f64df547000)
>   libicuuc.so.58 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libicuuc.so.58 
> (0x7f64df199000)
> {code}
> It looks like our cmake build script doesn't link explicitly with the conda 
> env's libboost_system.so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1393) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays

2018-08-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1393:


 Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to 
read into continuous arrays
 Key: PARQUET-1393
 URL: https://issues.apache.org/jira/browse/PARQUET-1393
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: Uwe L. Korn
 Fix For: cpp-1.6.0


Instead of creating a chunk per RowGroup, we should read, at least for primitive 
types, into a single, pre-allocated Array. This needs some new functionality in 
the Record reader classes and thus should be done after 
https://github.com/apache/parquet-cpp/pull/462 is merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1392) [C++] Supply row group indices to parquet::arrow::FileReader::ReadTable

2018-08-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1392:


 Summary: [C++] Supply row group indices to 
parquet::arrow::FileReader::ReadTable
 Key: PARQUET-1392
 URL: https://issues.apache.org/jira/browse/PARQUET-1392
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: cpp-1.5.0


By looking at the Parquet statistics, a user can already determine with their 
own logic which RowGroups are interesting to them. Currently we only provide 
functions to read the whole file or individual RowGroups. By supplying 
{{parquet::arrow}} with all requested RowGroups at once, it can better optimize 
its memory allocations as well as make better use of the underlying thread pool.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1158:


Assignee: Uwe L. Korn

> [C++] Basic RowGroup filtering
> --
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>    Reporter: Uwe L. Korn
>    Assignee: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply them in the Arrow 
> read methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1158:
-
Summary: [C++] Basic RowGroup filtering  (was: C++: Basic RowGroup 
filtering)

> [C++] Basic RowGroup filtering
> --
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>    Reporter: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply them in the Arrow 
> read methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1158:
-
Fix Version/s: (was: cpp-1.5.0)
   cpp-1.6.0

> [C++] Basic RowGroup filtering
> --
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>    Reporter: Uwe L. Korn
>    Assignee: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply them in the Arrow 
> read methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-19 Thread Uwe L. Korn
+1 on this but also see my comments in the mail on the discussions.

We should also keep the git history of parquet-cpp, that should not be hard 
with git and there is probably a StackOverflow answer out there that gives you 
the commands to do the merge.

Uwe

On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> In case any are interested: my estimate of the work involved in the
> migration to be about a full day of total work, possibly less. As soon
> as the migration plan is decided upon I intend to execute ASAP so that
> ongoing development efforts are not disrupted.
> 
> Additionally, in flight patches do not all need to be merged. Patches
> can be easily edited to apply against the modified repository
> structure
> 
> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney  wrote:
> > hi all,
> >
> > As discussed on the mailing list [1] I am proposing to undertake a
> > restructuring of the development process for parquet-cpp and its
> > consumption in the Arrow ecosystem to benefit the developers and users
> > of both communities.
> >
> > The specific actions we would take would be:
> >
> > 1) Move the source code currently located at src/ in the
> > apache/parquet-cpp repository [2] to the cpp/src/ directory located in
> > apache/arrow [3]
> >
> > 2) The parquet code tree would remain separate from the Arrow code
> > tree, though the two projects will continue to share code as they do
> > now
> >
> > 3) The build system in apache/parquet-cpp would be effectively
> > deprecated and can be mostly discarded, as it is largely redundant and
> > duplicated from the build system in apache/arrow
> >
> > 4) The Parquet and Arrow C++ communities will collaborate to provide
> > development workflows to enable contributors working exclusively on
> > the Parquet core functionality to be able to work unencumbered by
> > unnecessary build or test dependencies from the rest of the Arrow
> > codebase. Note that parquet-cpp already builds a significant portion
> > of Apache Arrow en route to creating its libraries
> >
> > 5) The Parquet community can create scripts to "cut" Parquet C++
> > releases by packaging up the appropriate components and ensuring that
> > they can be built and installed independently as now
> >
> > 6) The CI processes would be merged -- since we already build the
> > Parquet libraries in Arrow's CI workflow, this would amount to
> > building the Parquet unit tests and running them.
> >
> > 7) Patches contributed that do not involve Arrow-related functionality
> > could use the PARQUET- marking, though some ARROW- patches may
> > span both codebases
> >
> > 8) Parquet C++ committers can be given push rights on apache/arrow
> > subject to ongoing good citizenry (e.g. not merging patches that break
> > builds). The Arrow PMC may need to vote on the procedure for offering
> > pass-through commit rights to anyone who has been invited to be a
> > committer for Apache Parquet
> >
> > 9) The contributors who work on both Arrow and Parquet will work in
> > good faith to ensure that the needs of Parquet-only developers (i.e.
> > who consume Parquet files in some way unrelated to the Arrow columnar
> > standard) are accommodated
> >
> > There are a number of particular details we will need to discuss
> > further (such as the specific logistics of the codebase surgery; e.g.
> > how to manage the commit history in apache/parquet-cpp -- do we care
> > about git blame?)
> >
> > This vote is to determine if the Parquet PMC is in favor of working in
> > good faith to execute on the above plan. I will inquire with the Arrow
> > PMC to see if we need to have a corresponding vote there, and also how
> > to handle the management of commit rights.
> >
> > [ ] +1: In favor of implementing the proposed monorepo plan
> > [ ] +0: . . .
> > [ ] -1: Not in favor because . . .
> >
> > Here is my vote: +1.
> >
> > Thank you,
> > Wes
> >
> > [1]: 
> > https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
> > [2]: https://github.com/apache/parquet-cpp/tree/master/src/parquet
> > [3]: https://github.com/apache/arrow/tree/master/cpp/src


Doing a 1.5.0 C++ release

2018-08-19 Thread Uwe L. Korn
Hello,

as we are in the process of doing/voting on a repo merge with the Arrow project 
and also because there was some time since the last release, I would like to 
proceed with a 1.5.0 release soon. Please have a look over the issues at 
https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and move the 
non-critical ones to 1.6.0 or help in fixing those that should go into 1.5.0. 
Is there anything else currently in progress that should be merged before we 
release?

Uwe


[jira] [Updated] (PARQUET-1122) [C++] Support 2-level list encoding in Arrow decoding

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1122:
-
Fix Version/s: (was: cpp-1.5.0)
   cpp-1.6.0

> [C++] Support 2-level list encoding in Arrow decoding
> -
>
> Key: PARQUET-1122
> URL: https://issues.apache.org/jira/browse/PARQUET-1122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: centos 7.3, Anaconda 4.4.0 python 3.6.1
>Reporter: Luke Higgins
>Priority: Minor
> Fix For: cpp-1.6.0
>
>
> While trying to read a parquet file (written by NiFi) I am getting an error.
> code:
> import pyarrow.parquet as pq
> t = pq.read_table('test.parq')
> error:
> Traceback (most recent call last):
>   File "parquet_reader.py", line 2, in 
> t = pq.read_table('test.parq')
>   File "/opt/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/opt/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 119, in read
> nthreads=nthreads)
>   File "pyarrow/_parquet.pyx", line 466, in 
> pyarrow._parquet.ParquetReader.read_all 
> (/arrow/python/build/temp.linux-x86_64-3.6/_parquet.cxx:9181)
>   File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8115)
> pyarrow.lib.ArrowNotImplementedError: No support for reading columns of type 
> list



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-19 Thread Uwe L. Korn
Back from vacation, I also want to finally raise my voice.

With the current state of the Parquet<->Arrow development, I see a benefit in 
merging the code base for now, but not necessarily forever.

Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter 
is built and that uses some of the more standard-library-like features of Arrow. 
It is also the go-to place where the same toolchain and CI setup is used, and 
where we directly apply all improvements that we make in Arrow itself. These are 
the points that make it special in comparison to other tools providing Arrow 
adapters, like Turbodbc.

Thus, I think that the current move to merge the code bases is ok for me. I 
must say that I'm not 100% certain that this is the best move but currently I 
lack better alternatives. As previously mentioned, we should take extra care 
that we can still do separate releases and also provide a path for a future 
where we split parquet-cpp into its own project/repository again.

An important point that we should keep in mind (and why I was a bit concerned 
the previous times this discussion was raised) is that we have to be careful not 
to pull everything that touches Arrow into the Arrow repository. Having separate 
repositories for projects, each with its own release cycle, is for me still the 
long-term aim. I expect that there will be many more projects that will use 
Arrow's I/O libraries as well as emit Arrow structures. These libraries should 
also be usable in Python/C++/Ruby/R/… These libraries are then hopefully not all 
developed by the same core group of Arrow/Parquet developers we have currently. 
For this to function really well, we will need a more stable API in Arrow as 
well as a good set of build tooling that other libraries can build upon when 
using Arrow functionality. In addition to being stable, the API must provide a 
good UX in the abstraction layers through which the Arrow functions are exposed, 
so that high-performance applications are not high-maintenance due to frequent 
API changes in Arrow. That said, this is currently a wish for the future. We are 
building and iterating heavily on these APIs to form a good basis for future 
developments. The repo merge will thus hopefully improve development speed so 
that we spend less time on toolchain maintenance and can focus on the 
user-facing APIs.

Uwe

On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
> Thanks Ryan, will do. The people I'd still like to hear from are:
> 
> * Phillip Cloud
> * Uwe Korn
> 
> As ASF contributors we are responsible to both be pragmatic as well as
> act in the best interests of the community's health and productivity.
> 
> 
> 
> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue  wrote:
> > I don't have an opinion here, but could someone send a summary of what is
> > decided to the dev list once there is consensus? This is a long thread for
> > parts of the project I don't work on, so I haven't followed it very closely.
> >
> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney  wrote:
> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> Can we enforce that parquet-cpp changes will not be committed without a
> >> corresponding Parquet JIRA?
> >>
> >> I think we would use the following policy:
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> >> We've already been dealing with annoyances relating to issues
> >> straddling the two projects (debugging an issue on Arrow side to find
> >> that it has to be fixed on Parquet side); this would make things
> >> simpler for us
> >>
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> simplify forking later (if needed) and be able to maintain the commit
> >> history. I don't know if it's possible to squash parquet-cpp commits and
> >> arrow commits separately before merging.
> >>
> >> This seems rather onerous for both contributors and maintainers and
> >> not in line with the goal of improving productivity. In the event that
> >> we fork I see it as a traumatic event for the community. If it does
> >> happen, then we can write a script (using git filter-branch and other
> >> such tools) to extract commits related to the forked code.
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti 
> >> wrote:
> >> > I have a few more logistical questions to add.
> >> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet
> >> JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >> >
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > simplify forking later (if needed) and be able to maintain the commit history.

Re: num_level in Parquet Cpp library & how to add a JSON field?

2018-08-19 Thread Uwe L. Korn
Hello Ivy,

> Is there any ways to read the data in logical format? because I want to 
> check if my final output is correct.

I usually use the parquet-cli from the parquet-mr project to check if my file 
is written correctly. This should give you much more informative output.

Simple usage:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn -DskipTests=true package
cd parquet-cli
mvn dependency:copy-dependencies
java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta 


Note that these commands may not all work out of the box for you. In case 
anything breaks I can highly recommend reading parquet-mr's READMEs.

Uwe

> 
> Thanks!
> -Ivy
> 
> On 2018/08/03 13:46:15, "Uwe L. Korn"  wrote: 
> > Hello Ivy,
> > 
> > "primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not 
> > yet seen anyone use the JSON field with parquet-cpp but the JSON type is 
> > simply a binary string with an annotation so I would expect everything to 
> > just work.
> > 
> > Uwe
> > 
> > On Thu, Aug 2, 2018, at 7:59 PM, ivywu...@gmail.com wrote:
> > > Hi, 
> > > I’m creating a parquet file using the parquet C++ library. I’ve been 
> > > looking for answers online but still can’t figure out the following 
> > > questions.
> > > 
> > > 1. What does num_level mean in the WriteBatch method?
> > >  WriteBatch(int64_t num_levels, const int16_t* def_levels,
> > > const int16_t* rep_levels,
> > > const typename ParquetType::c_type* values)
> > > 
> > > 2. How to create a filed for JSON datatype?  By looking at this link 
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, it 
> > > seems JSON is not considered as a nested datatype.  To create a filed 
> > > for JSON data, what primitive type should it be? According to the link, 
> > > it says "binary primitive type", does it mean "Type::BYTE_ARRAY"?
> > >   PrimitiveNode::Make("JSON_field", Repetition::REQUIRED, Type:: ?, 
> > > LogicalType::JSON))
> > >   
> > > Any help is appreciated! 
> > > Thanks,
> > > Ivy
> > > 
> > 
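
To make both answers concrete, a minimal C++ sketch (assuming a flat REQUIRED
column; in that case there are no repetition or definition levels, the level
pointers may be null, and num_levels simply equals the number of values):

#include <parquet/api/writer.h>

using parquet::schema::PrimitiveNode;

// 2. JSON is an annotation on the BYTE_ARRAY physical type:
auto json_field = PrimitiveNode::Make(
    "JSON_field", parquet::Repetition::REQUIRED,
    parquet::Type::BYTE_ARRAY, parquet::LogicalType::JSON);

// 1. For a flat REQUIRED column, num_levels == number of values:
// int64_writer->WriteBatch(num_values, /*def_levels=*/nullptr,
//                          /*rep_levels=*/nullptr, values);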


[jira] [Resolved] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1390.
--
Resolution: Fixed

Issue resolved by pull request 516
[https://github.com/apache/parquet-mr/pull/516]

> [Java] Upgrade to Arrow 0.10.0
> --
>
> Key: PARQUET-1390
> URL: https://issues.apache.org/jira/browse/PARQUET-1390
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Parquet is using Arrow 0.8.0 but version 0.10.0 was recently released. There 
> are numerous bug fixes and improvements, including building with JDK 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1390:


Assignee: Andy Grove

> [Java] Upgrade to Arrow 0.10.0
> --
>
> Key: PARQUET-1390
> URL: https://issues.apache.org/jira/browse/PARQUET-1390
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Parquet is using Arrow 0.8.0 but version 0.10.0 was recently released. There 
> are numerous bug fixes and improvements, including building with JDK 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1390:
-
Fix Version/s: 1.11.0

> [Java] Upgrade to Arrow 0.10.0
> --
>
> Key: PARQUET-1390
> URL: https://issues.apache.org/jira/browse/PARQUET-1390
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Parquet is using Arrow 0.8.0 but version 0.10.0 was recently released. There 
> are numerous bug fixes and improvements, including building with JDK 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Status of column index in parquet-mr

2018-08-19 Thread Uwe L. Korn
Hello Gabor,

comment in-line

> The implementation was done based on the original design of column indexes
>  meaning
> that no row alignment is required between the pages (the only requirement
> is for the pages to respect row boundaries).
> As we described in the previous parquet sync the design/implementation would
> be much more clear (and might perform a bit better) if the row alignment
> would also be required. I would be happy to modify the implementation if we
> would decide to align pages on rows.* I would like to have a final decision
> on this topic before merging this feature.*

I'm not 100% certain what "row alignment" means here; I'm thinking of two very 
different things.

1.  It would mean that all columns in a RowGroup would have the same number of 
pages that would all align on the same set of rows.
2. It would mean that pages are only split on the highest nesting level, i.e. 
only split on what would be the horizontal boundaries on a 2D-table. I.e. not 
splitting any cells of this table structure.

If the interpretation is 1, then I think this would generate far too many pages 
for very sparse columns. But I'm guessing that the interpretation is rather 2, 
and there I would be more interested in the concerns that were raised in the 
sync. This type of alignment is also something that gave me some headaches when 
implementing things in parquet-cpp. From a Parquet developer's perspective, 
this would really ease the implementation, but I'm wondering if there are 
use-cases where a single cell of a table becomes larger than what we would 
normally put into a page.

Uwe


Re: Date and time for next Parquet sync

2018-08-12 Thread Uwe L. Korn
As the meeting falls into my summer vacation I cannot participate but will try 
to join again if there is a meeting two weeks later.

Uwe

> Am 08.08.2018 um 16:43 schrieb Nandor Kollar :
> 
> Hi All,
> 
> It has been a while since we had a Parquet sync, therefore I'd like to
> propose to have one next week on August 15th, at 6pm CET / 9 am PST.
> 
> I'll send a meeting invite with the details soon, let me know if this time
> is not suitable for you!
> 
> Since the last sync there are couple of topics to discuss, like:
> - Status of Parquet encryption
> - Release a new minor version, scope of the new release
> - Bloom filters
> - Move Java specific code from parquet-format to parquet-mr
> - parquet.thrift usage best practices in different language bindings (Java,
> C++, Python, Rust)
> - LZ4 incompatibility
> 
> The agenda is open for suggestions.
> 
> Regards,
> Nandor



[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568403#comment-16568403
 ] 

Uwe L. Korn commented on PARQUET-1370:
--

I'm doing the same; my code looks as follows:
{code:python}
import io
from pyarrow.parquet import ParquetFile

reader = …some file handle…
# Wrap the raw handle in a 512 KiB buffer so many small page reads are
# served from memory instead of separate filesystem requests.
reader = io.BufferedReader(reader, 512 * 1024)
parquet_file = ParquetFile(reader)
{code}
This was so simple that I thought it might not be relevant for now. Having a 
general C++ implementation of {{io.BufferedReader}} in Arrow C++ might be a 
simpler approach to our problem. The usage of `io.BufferedReader` probably 
involves some additional memory copies and overhead, as we have to switch 
between Python and C++ often.

(In my case, the file handle is coming from [https://github.com/mbr/simplekv] / 
[https://github.com/blue-yonder/storefact] )
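Later pyarrow versions expose this buffering natively, so the Python-level 
wrapper is no longer required. A minimal sketch, assuming a recent pyarrow and 
an illustrative file name:

{code}
import pyarrow.parquet as pq

# buffer_size > 0 enables read buffering inside the C++ reader,
# replacing the io.BufferedReader wrapper shown above.
pf = pq.ParquetFile('example.parquet', buffer_size=512 * 1024)
{code}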

 

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already reads consecutive column chunks (and 
> the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads 
> concerned with optimal read performance.
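Later pyarrow releases expose exactly this kind of coalescing. A minimal 
sketch, assuming a recent pyarrow and an illustrative file name:

{code}
import pyarrow.parquet as pq

# pre_buffer=True coalesces adjacent column-chunk ranges into fewer,
# larger reads, which is what this issue asks for on remote filesystems.
table = pq.read_table('example.parquet', pre_buffer=True)
{code}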



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568368#comment-16568368
 ] 

Uwe L. Korn commented on PARQUET-1370:
--

[~rgruener] I was also plagued by this issue, but I wrapped my Python file 
handle in [https://docs.python.org/3/library/io.html#io.BufferedReader] and 
this gave me sufficient performance. This was especially useful for me as I'm 
working with object stores like S3 or Azure Blob, where consecutive reads of 
40 kB or 512 kB make nearly no difference because the HTTP request overhead is 
the main bottleneck.
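A minimal sketch of that wrapping; the unbuffered local file handle stands in 
for an S3/Azure file object:

{code}
import io
from pyarrow.parquet import ParquetFile

# buffering=0 yields a raw, unbuffered handle (a stand-in for an
# object-store file object); the wrapper then batches many small
# page reads into 512 KiB requests.
raw = open('example.parquet', 'rb', buffering=0)
buffered = io.BufferedReader(raw, buffer_size=512 * 1024)
parquet_file = ParquetFile(buffered)
{code}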

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already reads consecutive column chunks (and 
> the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads 
> concerned with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568247#comment-16568247
 ] 

Uwe L. Korn commented on PARQUET-1369:
--

[~rgruener] Moved it.

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the Java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the 
> statistics for that column are not set. I should note that the column only 
> has a single value, though it still seems like a problem that pyarrow can't 
> recognize it (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here and not 
> a JIRA since I wanted to be sure this is actually an issue and there wasn't 
> a ticket already made there (I couldn't find one but I wanted to be sure). 
> Either way, I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Moved] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-03 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn moved ARROW-2800 to PARQUET-1369:
-

Fix Version/s: (was: 0.11.0)
   cpp-1.5.0
Affects Version/s: (was: 0.9.0)
   cpp-1.4.0
  Component/s: (was: Python)
   parquet-cpp
 Workflow: patch-available, re-open possible  (was: jira)
  Key: PARQUET-1369  (was: ARROW-2800)
  Project: Parquet  (was: Apache Arrow)

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the Java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the 
> statistics for that column are not set. I should note that the column only 
> has a single value, though it still seems like a problem that pyarrow can't 
> recognize it (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here and not 
> a JIRA since I wanted to be sure this is actually an issue and there wasn't 
> a ticket already made there (I couldn't find one but I wanted to be sure). 
> Either way, I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: num_level in Parquet Cpp library & how to add a JSON field?

2018-08-03 Thread Uwe L. Korn
Hello Ivy,

"primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not yet 
seen anyone use the JSON field with parquet-cpp but the JSON type is simply a 
binary string with an annotation so I would expect everything to just work.

Uwe

On Thu, Aug 2, 2018, at 7:59 PM, ivywu...@gmail.com wrote:
> Hi, 
> I’m creating a parquet file using the parquet C++ library. I’ve been 
> looking for answers online but still can’t figure out the following 
> questions.
> 
> 1. What does num_level mean in the WriteBatch method?
>  WriteBatch(int64_t num_levels, const int16_t* def_levels,
> const int16_t* rep_levels,
> const typename ParquetType::c_type* values)
> 
> 2. How do I create a field for the JSON datatype? By looking at this link 
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, it 
> seems JSON is not considered a nested datatype. To create a field 
> for JSON data, what primitive type should it be? According to the link, 
> it says "binary primitive type"; does it mean "Type::BYTE_ARRAY"?
>   PrimitiveNode::Make("JSON_field", Repetition::REQUIRED, Type:: ?, 
> LogicalType::JSON))
>   
> Any help is appreciated! 
> Thanks,
> Ivy
> 


[jira] [Resolved] (PARQUET-1357) [C++] FormatStatValue truncates binary statistics on zero character

2018-08-01 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1357.
--
Resolution: Fixed

Issue resolved by PR https://github.com/apache/parquet-cpp/pull/479

> [C++] FormatStatValue truncates binary statistics on zero character
> ---
>
> Key: PARQUET-1357
> URL: https://issues.apache.org/jira/browse/PARQUET-1357
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>    Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> As {{FormatStatValue}} is currently called with a C-style string, we cannot 
> pass the actual binary content with its length. Instead change the interface 
> to {{std::string}}.
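To illustrate why a C-style string loses data here, a Python sketch of C 
NUL-termination semantics (not parquet-cpp code):

{code}
import ctypes

stat_value = b'ab\x00cd'                # binary min/max with an embedded NUL
truncated = ctypes.c_char_p(stat_value).value
print(truncated)                        # b'ab' -- bytes after \x00 are lost
print(len(stat_value), len(truncated))  # 5 2
{code}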



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-07-31 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563874#comment-16563874
 ] 

Uwe L. Korn commented on PARQUET-1361:
--

What is the problem with the generated Parquet file? I don't understand where 
there is an issue.

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Ken Terada
>Priority: Major
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>
>
> The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
> values for INT type columns which causes unexpected parsing errors in 
> downstream systems ingesting those files.
> e.g.,
> {{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
> type}}
> *+Reproduction Steps+*
> OS: CentOS 7.5.1804
> Python: 3.4.8
> +Prerequisites:+
> * Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
> {{PyArrow: 0.9.0}}
> +Step 1+
> Generate the parquet file.
> {{sample_w_null.csv}}
> {code}
> col1,col2,col3,col4,col5
> 1,2,,4,5
> {code}
> {{parquet-1361-repro-1.py}}
> {code}
> #!/usr/bin/python
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> input_file = 'sample_w_null.csv'
> output_file = 'int_unknown.parquet'
> p_schema = {'col1': np.int32,
>             'col2': np.int32,
>             'col3': np.unicode_,
>             'col4': np.int32,
>             'col5': np.int32}
> df = pd.read_csv(input_file, dtype=p_schema)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, output_file)
> {code}
> +Step 2+
> Inspect the metadata of the generated file.
> {{parquet-1361-repro-2.py}}
> {code}
> #!/usr/bin/python
> import pyarrow.parquet as pq
> for filename in ['int_unknown.parquet']:
>     pq_file = pq.ParquetFile(filename)
>     print(pq_file.metadata)
>     print(pq_file.schema)
>     print(pq_file.num_row_groups)
>     print(pq.read_table(filename,
>                         columns=['col1','col2','col3','col4','col5']).to_pandas())
> {code}
> Results
> {code}
> 
>   created_by: parquet-cpp version 1.4.1-SNAPSHOT
>   num_columns: 6
>   num_rows: 1
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1434
> 
> col1: INT32
> col2: INT32
> col3: INT32 UNKNOWN
> col4: INT32
> col5: INT32
> __index_level_0__: INT64
> 1
>col1  col2  col3  col4  col5
> 0 1 2  None 4 5
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1363) Add IP address logical type

2018-07-30 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562055#comment-16562055
 ] 

Uwe L. Korn commented on PARQUET-1363:
--

[~tmgstev] You would probably need two types: IPv4 and IPv6

> Add IP address logical type
> ---
>
> Key: PARQUET-1363
> URL: https://issues.apache.org/jira/browse/PARQUET-1363
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Tristan Stevens
>Priority: Major
>
> IP addresses can be represented much more compactly as a 64-bit integer, 
> which is much more efficient for storage and allows consumers to do equality 
> or subnet (range) comparisons using long-integer arithmetic.
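To illustrate the range comparison the description refers to, a sketch using 
Python's ipaddress module (IPv4 shown, which fits in 32 bits):

{code}
import ipaddress

ip = int(ipaddress.IPv4Address('192.168.1.7'))   # 3232235783
net = ipaddress.IPv4Network('192.168.1.0/24')
lo = int(net.network_address)
hi = int(net.broadcast_address)

# Subnet membership becomes a plain integer range check, the kind of
# predicate that can be pushed down against column min/max statistics.
assert lo <= ip <= hi
{code}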



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1348) [C++] Allow Arrow FileWriter To Write FileMetaData

2018-07-28 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1348:


Assignee: Robert Gruener

> [C++] Allow Arrow FileWriter To Write FileMetaData
> --
>
> Key: PARQUET-1348
> URL: https://issues.apache.org/jira/browse/PARQUET-1348
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> The arrow [FileWriter open 
> method|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L111]
>  only takes in a schema (which does not include row group information) and 
> not the full FileMetaData. This does not allow the summary _metadata file to 
> be created, and will need to be changed to write the full file metadata 
> object.
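The workflow this change enables, as later exposed in pyarrow; a sketch with 
illustrative paths and column name:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'x': [1, 2, 3]})
collected = []  # receives the FileMetaData of every file written
pq.write_to_dataset(table, 'dataset_root', metadata_collector=collected)

# The summary _metadata file aggregates the row-group metadata of all parts.
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=collected)
{code}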



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1348) [C++] Allow Arrow FileWriter To Write FileMetaData

2018-07-28 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1348.
--
   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 481
[https://github.com/apache/parquet-cpp/pull/481]

> [C++] Allow Arrow FileWriter To Write FileMetaData
> --
>
> Key: PARQUET-1348
> URL: https://issues.apache.org/jira/browse/PARQUET-1348
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> The arrow [FileWriter open 
> method|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L111]
>  only takes in a schema (which does not include row group information) and 
> not the full FileMetaData. This does not allow the summary _metadata file to 
> be created, and will need to be changed to write the full file metadata 
> object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1358) [C++] index_page_offset should be unset as it is not supported.

2018-07-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1358.
--
Resolution: Fixed

Issue resolved by pull request 480
[https://github.com/apache/parquet-cpp/pull/480]

> [C++] index_page_offset should be unset as it is not supported.
> ---
>
> Key: PARQUET-1358
> URL: https://issues.apache.org/jira/browse/PARQUET-1358
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>    Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> We currently set it to 0, although this is an optional attribute and should 
> not be set at all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1358) [C++] index_page_offset should be unset as it is not supported.

2018-07-26 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1358:


 Summary: [C++] index_page_offset should be unset as it is not 
supported.
 Key: PARQUET-1358
 URL: https://issues.apache.org/jira/browse/PARQUET-1358
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Affects Versions: cpp-1.4.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: cpp-1.5.0


We currently set it to 0, although this is an optional attribute and should not 
be set at all.
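The unset-versus-zero distinction, illustrated with a simplified Python 
stand-in for the Thrift struct (field names mirror parquet.thrift, but the 
class itself is hypothetical):

{code}
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnMetaDataSketch:                  # simplified stand-in, not the real struct
    data_page_offset: int
    index_page_offset: Optional[int] = None  # optional: absent unless actually set

meta = ColumnMetaDataSketch(data_page_offset=4)
# Readers should see "no index page" (None), not a bogus offset of 0.
assert meta.index_page_offset is None
{code}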



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1357) [C++] FormatStatValue truncates binary statistics on zero characters

2018-07-26 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1357:


 Summary: [C++] FormatStatValue truncates binary statistics on zero 
characters
 Key: PARQUET-1357
 URL: https://issues.apache.org/jira/browse/PARQUET-1357
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Affects Versions: cpp-1.4.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: cpp-1.5.0


As {{FormatStatValue}} is currently called with a C-style string, we cannot 
pass the actual binary content with its length. Instead change the interface to 
{{std::string}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1357) [C++] FormatStatValue truncates binary statistics on zero character

2018-07-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1357:
-
Summary: [C++] FormatStatValue truncates binary statistics on zero 
character  (was: [C++] FormatStatValue truncates binary statistics on zero 
characters)

> [C++] FormatStatValue truncates binary statistics on zero character
> ---
>
> Key: PARQUET-1357
> URL: https://issues.apache.org/jira/browse/PARQUET-1357
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>    Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> As {{FormatStatValue}} is currently called with a C-style string, we cannot 
> pass the actual binary content with its length. Instead change the interface 
> to {{std::string}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1349) [C++] PARQUET_RPATH_ORIGIN is not picked by the build

2018-07-14 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1349:


 Summary: [C++] PARQUET_RPATH_ORIGIN is not picked by the build
 Key: PARQUET-1349
 URL: https://issues.apache.org/jira/browse/PARQUET-1349
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Affects Versions: cpp-1.4.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: cpp-1.5.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

