Re: Doing a 1.5.0 C++ release

2018-08-21 Thread Wes McKinney
I will review PARQUET-1372 again today so we can get that in soon.

I suggest we release 1.5.0 immediately after that so we are not
delayed in the monorepo merge. We need to conduct a vote there so it
will be a minimum of a few days anyhow until we're able to do that.

- Wes

On Sun, Aug 19, 2018 at 6:06 PM, Deepak Majeti wrote:
> Uwe,
>
> I would like to get https://issues.apache.org/jira/browse/PARQUET-1372 into
> this release as well. There is a PR already open for this JIRA and I got
> some feedback. I will address the feedback in the next couple of days.
>
> On Sun, Aug 19, 2018 at 8:48 AM Uwe L. Korn wrote:
>
>> Hello,
>>
>> As we are in the process of doing/voting on a repo merge with the Arrow
>> project, and also because it has been some time since the last release, I
>> would like to proceed with a 1.5.0 release soon. Please have a look over
>> the issues at
>> https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and
>> move the non-critical ones to 1.6.0 or help in fixing those that should go
>> into 1.5.0. Is there anything else currently in progress that should be
>> merged before we release?
>>
>> Uwe
>>
>
>
> --
> regards,
> Deepak Majeti


[RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-21 Thread Wes McKinney
hi all,

with 3 binding +1 votes, the vote carries. We will discuss with Apache
Arrow how specifically to proceed.

I have already done the preparatory work to undertake the merge:

https://github.com/apache/arrow/pull/2453

thanks
Wes


[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-21 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587678#comment-16587678
 ] 

Wes McKinney commented on PARQUET-1241:
---

Please either add a new codec or add an option to {{Lz4Codec}} to use the 
framed format. We can discuss further in a relevant Arrow JIRA

> Use LZ4 frame format
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp, parquet-format
> Reporter: Lawrence Chan
> Priority: Major
>
> The parquet-format spec doesn't currently specify whether LZ4-compressed data
> should be framed or not. We should choose one and make it explicit in the
> spec, as the two are not interoperable. After some discussions with others [1],
> we think it would be beneficial to use the framed format, which adds a small
> header in exchange for more self-contained decompression as well as a richer
> feature set (checksums, parallel decompression, etc.).
> The current Arrow implementation compresses using the LZ4 block format, and
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
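
The interoperability problem described above comes from the framed format starting with a small header that raw block data lacks. As a rough illustration only, the sketch below checks for the 4-byte magic number (0x184D2204, little-endian) that opens an LZ4 frame. The helper name `looks_like_lz4_frame` is hypothetical and is not part of parquet-cpp, Arrow, or any LZ4 binding. The check is a heuristic: a raw block could by coincidence begin with the same four bytes.

```python
import struct

# Magic number that opens an LZ4 frame, stored little-endian on disk.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristic: True if buf begins with the LZ4 frame magic number.

    Hypothetical helper for illustration; raw LZ4 block data carries no
    header at all, so a False result does not prove the data is a block.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


# Usage: the four magic bytes in on-disk order are 04 22 4D 18.
framed_prefix = bytes([0x04, 0x22, 0x4D, 0x18]) + b"...payload..."
print(looks_like_lz4_frame(framed_prefix))
```

A reader that has to cope with files from both kinds of writer could use a check like this only as a fallback; an explicit codec identifier in the file metadata, as suggested in the issue, would be the unambiguous route.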



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-21 Thread Wes McKinney
Yes, feel free to have a look at

https://github.com/apache/arrow/pull/2453

I'm not very in favor of having a commingled non-linear history that
makes git bisect difficult. We will have to discuss on the Arrow ML

Here's an example from Apache Spark where a similar merge took place

https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53

It would be my preference to have a single squashed commit whose
message attributes the developers of the code and provides links back
to the original commit history in the commit message

- Wes



Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-21 Thread Uwe L. Korn
I have a very strong preference to keep the git history. I will have a look 
tomorrow to find the correct git magic to get a linear history. For me a single 
merge commit would be ok but I'm fine to spend an additional hour on this if 
you care strongly about linear history.

Uwe

On Sun, Aug 19, 2018, at 7:36 PM, Wes McKinney wrote:
> OK. I'm a bit -0 on doing anything that results in Arrow having a
> nonlinear git history (and rebasing is not really an option) but we
> can discuss that more later
> 
> On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn wrote:
> > +1 on this but also see my comments in the mail on the discussions.
> >
> > We should also keep the git history of parquet-cpp; that should not be hard
> > with git, and there is probably a StackOverflow answer out there that gives
> > you the commands to do the merge.
> >
> > Uwe
> >
> > On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> >> In case any are interested: my estimate of the work involved in the
> >> migration is about a full day of total work, possibly less. As soon
> >> as the migration plan is decided upon I intend to execute ASAP so that
> >> ongoing development efforts are not disrupted.
> >>
> >> Additionally, in-flight patches do not all need to be merged. Patches
> >> can be easily edited to apply against the modified repository
> >> structure.
> >>
> >> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney wrote:
> >> > hi all,
> >> >
> >> > As discussed on the mailing list [1] I am proposing to undertake a
> >> > restructuring of the development process for parquet-cpp and its
> >> > consumption in the Arrow ecosystem to benefit the developers and users
> >> > of both communities.
> >> >
> >> > The specific actions we would take would be:
> >> >
> >> > 1) Move the source code currently located at src/ in the
> >> > apache/parquet-cpp repository [2] to the cpp/src/ directory located in
> >> > apache/arrow [3]
> >> >
> >> > 2) The parquet code tree would remain separate from the Arrow code
> >> > tree, though the two projects will continue to share code as they do
> >> > now
> >> >
> >> > 3) The build system in apache/parquet-cpp would be effectively
> >> > deprecated and can be mostly discarded, as it is largely redundant and
> >> > duplicated from the build system in apache/arrow
> >> >
> >> > 4) The Parquet and Arrow C++ communities will collaborate to provide
> >> > development workflows to enable contributors working exclusively on
> >> > the Parquet core functionality to be able to work unencumbered with
> >> > unnecessary build or test dependencies from the rest of the Arrow
> >> > codebase. Note that parquet-cpp already builds a significant portion
> >> > of Apache Arrow en route to creating its libraries
> >> >
> >> > 5) The Parquet community can create scripts to "cut" Parquet C++
> >> > releases by packaging up the appropriate components and ensuring that
> >> > they can be built and installed independently as now
> >> >
> >> > 6) The CI processes would be merged -- since we already build the
> >> > Parquet libraries in Arrow's CI workflow, this would amount to
> >> > building the Parquet unit tests and running them.
> >> >
> >> > 7) Patches contributed that do not involve Arrow-related functionality
> >> > could use the PARQUET- marking, though some ARROW- patches may
> >> > span both codebases
> >> >
> >> > 8) Parquet C++ committers can be given push rights on apache/arrow
> >> > subject to ongoing good citizenry (e.g. not merging patches that break
> >> > builds). The Arrow PMC may need to vote on the procedure for offering
> >> > pass-through commit rights to anyone who has been invited to be a
> >> > committer for Apache Parquet
> >> >
> >> > 9) The contributors who work on both Arrow and Parquet will work in
> >> > good faith to ensure that the needs of Parquet-only developers (i.e.
> >> > who consume Parquet files in some way unrelated to the Arrow columnar
> >> > standard) are accommodated
> >> >
> >> > There are a number of particular details we will need to discuss
> >> > further (such as the specific logistics of the codebase surgery; e.g.
> >> > how to manage the commit history in apache/parquet-cpp -- do we care
> >> > about git blame?)
> >> >
> >> > This vote is to determine if the Parquet PMC is in favor of working in
> >> > good faith to execute on the above plan. I will inquire with the Arrow
> >> > PMC to see if we need to have a corresponding vote there, and also how
> >> > to handle the management of commit rights.
> >> >
> >> > [ ] +1: In favor of implementing the proposed monorepo plan
> >> > [ ] +0: . . .
> >> > [ ] -1: Not in favor because . . .
> >> >
> >> > Here is my vote: +1.
> >> >
> >> > Thank you,
> >> > Wes
> >> >
> >> > [1]: 
> >> > https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
> >> > [2]: https://github.com/apache/parquet-cpp/tree/master/src/parquet
> >> > [3]: 

[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-08-21 Thread Gidon Gershinsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1398:
--
Labels: pull-request-available  (was: )

> Separate iv_prefix for GCM and CTR modes
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-format
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Minor
> Labels: pull-request-available
>
> There is an ambiguity in what the iv_prefix applies to: GCM, CTR, or both.
> This parameter will be moved to the Algorithms structures (from the
> FileCryptoMetaData structure).





[jira] [Created] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-08-21 Thread Gidon Gershinsky (JIRA)
Gidon Gershinsky created PARQUET-1398:
-

Summary: Separate iv_prefix for GCM and CTR modes
Key: PARQUET-1398
URL: https://issues.apache.org/jira/browse/PARQUET-1398
Project: Parquet
Issue Type: Sub-task
Components: parquet-format
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


There is an ambiguity in what the iv_prefix applies to: GCM, CTR, or both.
This parameter will be moved to the Algorithms structures (from the
FileCryptoMetaData structure).





Re: Status of column index in parquet-mr

2018-08-21 Thread Gabor Szadovszky
Hi,

Row alignment in my wording was the 1st definition in Uwe's mail. From the
point of view of column-index-based filtering, the implementation and the
logic would be much simpler in this case, but I do understand that the page
sizes would not be optimal. It seems the community is against row
alignment, so I would close this topic.
The 2nd definition is required for column indexes and is already mentioned in
the spec. For page v2 we already do the same anyway.
Tim, as you mentioned, it would still require skipping values. The biggest
pain in my implementation was passing the required values through the API
to implement the skipping. The calculation itself was not that complicated.

Thanks a lot,
Gabor

On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong wrote:

> I had a similar concern to Uwe - if there are a large number of columns
> with variable size there does seem to be a real risk of having many tiny
> pages.
>
> I wonder if we could do something in-between where we allow different page
> sizes for different columns, but require that the row ranges for pages of
> different columns either are the same or one contains the other. I.e. if
> you have row ranges [a, b) and [c, d) from two different columns, then
> either they don't overlap (c >= b || a >= d) or one contains the other:
> (c >= a && d <= b) || (a >= c && b <= d).
>
> E.g. if you have three columns, small, medium and large that fit 5,
> 2 and 1020 values per page, you could meet the above constraint with
> the following set of row ranges where pages are truncated when a page with
> an enclosing row range is full.
>
> Small: [0, 5), [5, 10)
> Medium: [0, 2), [2, 4), [4, 5), [5, 7), [7, 9), [9, 10)
> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19390, 2), ...
>
> That seems like it would simplify the calculation of the relevant pages on
> the read path, although you would still need to have logic to skip values
> within a page.
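
The "disjoint or nested" constraint on row ranges proposed above can be written down as a small check. This is a minimal sketch, not code from parquet-mr or parquet-cpp: the names `compatible` and `layout_is_valid` and the example layouts are hypothetical, and row ranges are assumed to be half-open `[start, end)` pairs as in the mail.

```python
from itertools import combinations
from typing import Dict, List, Tuple

RowRange = Tuple[int, int]  # half-open [start, end)


def compatible(r1: RowRange, r2: RowRange) -> bool:
    """True if two row ranges do not overlap or one contains the other."""
    a, b = r1
    c, d = r2
    disjoint = c >= b or a >= d
    nested = (c >= a and d <= b) or (a >= c and b <= d)
    return disjoint or nested


def layout_is_valid(pages: Dict[str, List[RowRange]]) -> bool:
    """Check every pair of pages taken from two different columns."""
    for (_, left), (_, right) in combinations(pages.items(), 2):
        for r1 in left:
            for r2 in right:
                if not compatible(r1, r2):
                    return False
    return True


# Hypothetical layouts: col_b pages are truncated where a col_a page ends.
good = {"col_a": [(0, 6), (6, 12)], "col_b": [(0, 4), (4, 6), (6, 10), (10, 12)]}
# Here the col_b page (4, 8) straddles the col_a boundary at row 6.
bad = {"col_a": [(0, 6), (6, 12)], "col_b": [(0, 4), (4, 8), (8, 12)]}

print(layout_is_valid(good))  # True
print(layout_is_valid(bad))   # False
```

The pairwise loop is quadratic and only meant to make the rule concrete; a writer that truncates pages whenever an enclosing page fills up would satisfy the rule by construction and would not need such a check.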
>
>
>
> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn wrote:
>
> > Hello Gabor,
> >
> > comment in-line
> >
> > > The implementation was done based on the original design of column
> > > indexes, meaning that no row alignment is required between the pages
> > > (the only requirement is for the pages to respect row boundaries).
> > > As we described in the previous parquet sync, the design/implementation
> > > would be much clearer (and might perform a bit better) if row
> > > alignment were also required. I would be happy to modify the
> > > implementation if we decide to align pages on rows. *I would like to
> > > have a final decision on this topic before merging this feature.*
> >
> > I'm not 100% certain what "row alignment" could mean; I'm thinking of two
> > very different things.
> >
> > 1. It would mean that all columns in a RowGroup would have the same
> > number of pages that would all align on the same set of rows.
> > 2. It would mean that pages are only split on the highest nesting level,
> > i.e. only split on what would be the horizontal boundaries on a 2D-table.
> > I.e. not splitting any cells of this table structure.
> >
> > If the interpretation is 1, then I think this would generate far too many
> > pages for very sparse columns. But I'm guessing that the interpretation
> > is rather 2, and there I would be more interested in the concerns that
> > were raised in the sync. This type of alignment is also something that
> > gave me some headaches when implementing things in parquet-cpp. From a
> > Parquet developer's perspective, this would really ease the
> > implementation, but I'm wondering if there are use cases where a single
> > cell of a table becomes larger than what we would normally put into a
> > page.
> >
> > Uwe
> >
>
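
The second interpretation of "row alignment" in Uwe's mail (pages may only be split between top-level rows) can be sketched with repetition levels, relying on the Parquet convention that a repetition level of 0 marks the first value of a new record. The function name `split_on_record_boundaries` and the value-count target `max_values` are assumptions made for illustration; real writers typically size pages by bytes, not by value count.

```python
from typing import List, Tuple


def split_on_record_boundaries(
    rep_levels: List[int], max_values: int
) -> List[Tuple[int, int]]:
    """Split value indices into half-open pages that start only at new records.

    max_values is a soft target: a page is closed at the first record
    boundary at or after the target, so one long record can exceed it.
    """
    pages: List[Tuple[int, int]] = []
    page_start = 0
    for i, level in enumerate(rep_levels):
        # Only a value with repetition level 0 opens a new record, so that
        # is the only place where the current page may be closed.
        if level == 0 and i - page_start >= max_values:
            pages.append((page_start, i))
            page_start = i
    if page_start < len(rep_levels):
        pages.append((page_start, len(rep_levels)))
    return pages


# Records of 3, 1, 4 and 1 values; target of 3 values per page.
levels = [0, 1, 1, 0, 0, 1, 1, 1, 0]
print(split_on_record_boundaries(levels, 3))  # [(0, 3), (3, 8), (8, 9)]
```

The middle page holds five values against a target of three because the four-value record cannot be split. That is the situation the mail asks about: a single cell that is larger than what would normally go into a page.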