Re: Date and time for next Parquet sync

2018-09-07 Thread Ryan Blue
We may want to push this out another week because it also conflicts with
Strata NY. I think a few of us will be travelling Tuesday and both Julien
and I have talks on Wednesday.

On Fri, Sep 7, 2018 at 6:24 AM Gidon Gershinsky wrote:

> Hi Nandor,
>
> Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another week.
> Sept 18 is the Yom Kippur eve - this basically means I won't have a
> technical ability to join a call.
>
> Regarding the Google doc vs reviewed PR + .md file - it indeed becomes
> difficult and unnecessary to maintain two versions of the same
> documentation. Following your last mail, there was a high volume of review
> activity at the Google doc, but now that the spike is winding down, I'll be
> removing the duplicate part from the Google doc (keeping the samples), with
> new comments to go to PRs (md and code). I'll send a detailed mail early
> next week.
>
>
> Cheers, Gidon.
>
> On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar wrote:
>
> > Hi All,
> >
> > I'd like to propose a Parquet Sync next week Tuesday (September
> > 18th) at 6pm CEST / 9 am PST.
> >
> > Some of the topics which would be nice to discuss:
> > - review column indexes (PRs and feature branch)
> > - move Java code from format to mr (PR #517)
> > - Bloom filter spec
> > - columnar encryption spec (and general question, where to track
> > specs, Google doc vs reviewed PR + .md file)
> > - Refactor modules to use the new logical type API (PR under review)
> > - new format release scope (nano precision timestamp, Bloom filter?,
> > columnar encryption?)
> >
> > I'll send the meeting invite shortly. Feel free to propose another time
> > slot if this one is not suitable for you, and bring any additional topics
> > you'd like to discuss.
> >
> > Regards,
> > Nandor
> >
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-09-07 Thread Wes McKinney
After a lot of time beating my head against Windows toolchain issues
(I now know a _lot_ about this topic!) I have a green build at

https://github.com/apache/arrow/pull/2453

I'd like to merge this before much more time passes (i.e. today if
possible) and work on getting the outstanding patches migrated.

The only code that isn't a straight-copy is

https://github.com/apache/arrow/pull/2453/commits/fe5d435c9c58af42df4a37e7c97e37f33ae1857d

This contains all the modifications to the build system and CI to get
things fully working.

I will have to rebase (preserving the author and committer for each
patch) and then merge --ff-only to get this in

- Wes
On Tue, Sep 4, 2018 at 2:22 PM Wes McKinney  wrote:
>
> Great. It is definitely going to require some follow up patches to fix
> up the various packaging tasks, but at least the Linux Python wheels
> will still be working to start
> On Tue, Sep 4, 2018 at 2:04 PM Uwe L. Korn  wrote:
> >
> > Hello Wes,
> >
> > I have not much time this week but I hope to squeeze in some minutes 
> > tomorrow afternoon to review the code. As this is a very big merge, I want 
> > to be extra careful to not break anything really badly. Hopefully more eyes 
> > will help.
> >
> > Thank you for all the work in pushing this forward in the last days!
> >
> > Uwe
> >
> > On Tue, Sep 4, 2018, at 6:27 PM, Wes McKinney wrote:
> > > Dear all,
> > >
> > > The repo merge is nearly ready to go modulo some fixes to CI. There
> > > will be a number of follow up issues to re-establish the various
> > > (untested) build procedures in parquet-cpp
> > >
> > > https://github.com/apache/arrow/pull/2453
> > >
> > > I would like to merge this by EOD Wednesday 9/5, or Thursday at
> > > latest, so we can get the patches from apache/parquet-cpp moved over
> > > and avoid any disruption to development process. If there are any
> > > comments please let me know
> > >
> > > - Wes
> > > On Tue, Aug 21, 2018 at 12:23 PM Wes McKinney  wrote:
> > > >
> > > > hi all,
> > > >
> > > > with 3 binding +1 votes, the vote carries. We will discuss with Apache
> > > > Arrow about how to specifically proceed
> > > >
> > > > I have already done the preparatory work to undertake the merge
> > > >
> > > > https://github.com/apache/arrow/pull/2453
> > > >
> > > > thanks
> > > > Wes
> > > >
> > > > > On Tue, Aug 21, 2018 at 10:41 AM, Wes McKinney wrote:
> > > > > Yes, feel free to have a look at
> > > > >
> > > > > https://github.com/apache/arrow/pull/2453
> > > > >
> > > > > > I'm not much in favor of having a commingled non-linear history that
> > > > > makes git bisect difficult. We will have to discuss on the Arrow ML
> > > > >
> > > > > Here's an example from Apache Spark where a similar merge took place
> > > > >
> > > > > https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53
> > > > >
> > > > > It would be my preference to have a single squashed commit whose
> > > > > message attributes the developers of the code and provides links back
> > > > > to the original commit history in the commit message
> > > > >
> > > > > - Wes
> > > > >
> > > > >
> > > > > On Tue, Aug 21, 2018 at 9:52 AM, Uwe L. Korn  wrote:
> > > > >> I have a very strong preference to keep the git history. I will have 
> > > > >> a look tomorrow to find the correct git magic to get a linear 
> > > > >> history. For me a single merge commit would be ok but I'm fine to 
> > > > >> spend an additional hour on this if you care strongly about linear 
> > > > >> history.
> > > > >>
> > > > >> Uwe
> > > > >>
> > > > >> On Sun, Aug 19, 2018, at 7:36 PM, Wes McKinney wrote:
> > > > >>> OK. I'm a bit -0 on doing anything that results in Arrow having a
> > > > >>> nonlinear git history (and rebasing is not really an option) but we
> > > > >>> can discuss that more later
> > > > >>>
> > > > > >>> On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn wrote:
> > > > >>> > +1 on this but also see my comments in the mail on the 
> > > > >>> > discussions.
> > > > >>> >
> > > > >>> > We should also keep the git history of parquet-cpp, that should 
> > > > >>> > not be hard with git and there is probably a StackOverflow answer 
> > > > >>> > out there that gives you the commands to do the merge.
> > > > >>> >
> > > > >>> > Uwe
> > > > >>> >
> > > > >>> > On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> > > > > >>> >> In case any are interested: I estimate the work involved in the
> > > > > >>> >> migration to be about a full day in total, possibly less. As soon
> > > > > >>> >> as the migration plan is decided upon I intend to execute it ASAP
> > > > > >>> >> so that ongoing development efforts are not disrupted.
> > > > > >>> >>
> > > > > >>> >> Additionally, in-flight patches do not all need to be merged.
> > > > > >>> >> Patches can be easily edited to apply against the modified
> > > > > >>> >> repository structure
> > > > >>> >> On Wed, Aug 15, 2018 at 

Re: Date and time for next Parquet sync

2018-09-07 Thread Gidon Gershinsky
Hi Nandor,

Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another week.
Sept 18 is the Yom Kippur eve - this basically means I won't have a
technical ability to join a call.

Regarding the Google doc vs reviewed PR + .md file - it indeed becomes
difficult and unnecessary to maintain two versions of the same
documentation. Following your last mail, there was a high volume of review
activity at the Google doc, but now that the spike is winding down, I'll be
removing the duplicate part from the Google doc (keeping the samples), with
new comments to go to PRs (md and code). I'll send a detailed mail early
next week.


Cheers, Gidon.

On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar wrote:

> Hi All,
>
> I'd like to propose a Parquet Sync next week Tuesday (September
> 18th) at 6pm CEST / 9 am PST.
>
> Some of the topics which would be nice to discuss:
> - review column indexes (PRs and feature branch)
> - move Java code from format to mr (PR #517)
> - Bloom filter spec
> - columnar encryption spec (and general question, where to track
> specs, Google doc vs reviewed PR + .md file)
> - Refactor modules to use the new logical type API (PR under review)
> - new format release scope (nano precision timestamp, Bloom filter?,
> columnar encryption?)
>
> I'll send the meeting invite shortly. Feel free to propose another time
> slot if this one is not suitable for you, and bring any additional topics
> you'd like to discuss.
>
> Regards,
> Nandor
>


Date and time for next Parquet sync

2018-09-07 Thread Nandor Kollar
Hi All,

I'd like to propose a Parquet Sync next week Tuesday (September
18th) at 6pm CEST / 9 am PST.

Some of the topics which would be nice to discuss:
- review column indexes (PRs and feature branch)
- move Java code from format to mr (PR #517)
- Bloom filter spec
- columnar encryption spec (and general question, where to track
specs, Google doc vs reviewed PR + .md file)
- Refactor modules to use the new logical type API (PR under review)
- new format release scope (nano precision timestamp, Bloom filter?,
columnar encryption?)

I'll send the meeting invite shortly. Feel free to propose another time
slot if this one is not suitable for you, and bring any additional topics
you'd like to discuss.

Regards,
Nandor


[jira] [Commented] (PARQUET-1400) Deprecate parquet-mr related code in parquet-format

2018-09-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607068#comment-16607068
 ] 

ASF GitHub Bot commented on PARQUET-1400:
-

gszadovszky opened a new pull request #105: PARQUET-1400: Deprecate parquet-mr 
related code in parquet-format
URL: https://github.com/apache/parquet-format/pull/105
 
 
   



> Deprecate parquet-mr related code in parquet-format
> ---
>
> Key: PARQUET-1400
> URL: https://issues.apache.org/jira/browse/PARQUET-1400
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
>
> There are Java classes in the
> [parquet-format|https://github.com/apache/parquet-format] repo that should
> be in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test].
> These classes should be deprecated, with a note that they will be moved to the
> [parquet-mr|https://github.com/apache/parquet-mr] repo.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1415) Improve logic when to write column indexes

2018-09-07 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1415:
-

Summary: Improve logic when to write column indexes
Key: PARQUET-1415
URL: https://issues.apache.org/jira/browse/PARQUET-1415
Project: Parquet
Issue Type: Improvement
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Currently, we always write column indexes. If the data is ordered
(ASCENDING or DESCENDING), filtering benefits greatly from column
indexes. If the data is UNORDERED, however, it is not obvious that filtering
based on column indexes makes sense. For example, if the data is random, the
min/max values of the different pages might be close to each other, so in most
cases filtering based on these values would not drop any of the pages. On the
other hand, UNORDERED does not mean that the values are random. It can
happen that the values are clustered or semi-ordered. We should detect these
cases somehow before writing the column indexes, and write them only if the
min/max values for the pages do not overlap too much.

Another simple case is when we have only one page. In this case writing column
indexes is useless.
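
One possible way to express the "do not overlap too much" check is sketched below in Python. This is only an illustration, not the parquet-mr implementation: the `(min, max)` tuple representation, the function names `overlap_ratio` and `should_write_column_index`, and the `max_overlap` threshold of 0.5 are all assumptions made for the example. It measures the fraction of page pairs whose min/max ranges intersect and skips the column index when there is only one page or when most ranges overlap.

```python
from typing import List, Tuple

# Hypothetical representation: one (min, max) pair per page.
PageStats = Tuple[int, int]


def overlap_ratio(pages: List[PageStats]) -> float:
    """Return the fraction of page pairs whose [min, max] ranges intersect."""
    n = len(pages)
    if n < 2:
        # With fewer than two pages nothing can be skipped; treat as full overlap.
        return 1.0
    overlapping = 0
    for i in range(n):
        lo_i, hi_i = pages[i]
        for j in range(i + 1, n):
            lo_j, hi_j = pages[j]
            # Two closed ranges intersect when each starts before the other ends.
            if lo_i <= hi_j and lo_j <= hi_i:
                overlapping += 1
    return overlapping / (n * (n - 1) / 2)


def should_write_column_index(pages: List[PageStats],
                              max_overlap: float = 0.5) -> bool:
    """Decide whether a column index is likely to help filtering.

    max_overlap is an assumed threshold, not a value from the issue.
    """
    if len(pages) <= 1:
        # Single page: writing a column index is useless.
        return False
    return overlap_ratio(pages) <= max_overlap
```

The pairwise comparison is quadratic in the number of pages, which is acceptable for a sketch; a real writer would probably want a cheaper estimate, and the threshold would need benchmarking.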





[jira] [Created] (PARQUET-1414) Limit page size based on maximum row count

2018-09-07 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1414:
-

Summary: Limit page size based on maximum row count
Key: PARQUET-1414
URL: https://issues.apache.org/jira/browse/PARQUET-1414
Project: Parquet
Issue Type: Improvement
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


For column-index-based filtering it is important to have enough pages for a
column. When the encoding suits the data perfectly, it can happen that all of
the values are encoded in one page (e.g. a column of an ascending counter).

With this improvement we would be able to limit each page by the maximum number
of rows written to it, so that we would have enough pages for every column. A
good default value should be determined by benchmarking. Initially, we can use
10k.





[jira] [Assigned] (PARQUET-1400) Deprecate parquet-mr related code in parquet-format

2018-09-07 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1400:
-

Assignee: Gabor Szadovszky

> Deprecate parquet-mr related code in parquet-format
> ---
>
> Key: PARQUET-1400
> URL: https://issues.apache.org/jira/browse/PARQUET-1400
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> There are Java classes in the
> [parquet-format|https://github.com/apache/parquet-format] repo that should
> be in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test].
> These classes should be deprecated, with a note that they will be moved to the
> [parquet-mr|https://github.com/apache/parquet-mr] repo.


