Re: Date and time for next Parquet sync
We may want to push this out another week because it also conflicts with
Strata NY. I think a few of us will be travelling Tuesday and both Julien
and I have talks on Wednesday.

On Fri, Sep 7, 2018 at 6:24 AM Gidon Gershinsky wrote:

> Hi Nandor,
>
> Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another week.
> Sept 18 is the Yom Kippur eve - this basically means I won't have a
> technical ability to join a call.
>
> Regarding the Google doc vs reviewed PR + .md file - it indeed becomes
> difficult and unnecessary to maintain two versions of the same
> documentation. Following your last mail, there was a high volume of
> review activity at the google doc, but now that the spike is winding
> down, I'll be removing the duplicate part from the google doc (keeping
> the samples), with new comments to go to PRs (md and code). I'll send a
> detailed mail early next week.
>
> Cheers, Gidon.
>
> On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar wrote:
>
> > Hi All,
> >
> > I'd like to propose having a Parquet Sync next week Tuesday (September
> > 18th) at 6pm CEST / 9 am PST.
> >
> > Some of the topics which would be nice to discuss:
> > - review column indexes (PRs and feature branch)
> > - move Java code from format to mr (PR #517)
> > - Bloom filter spec
> > - columnar encryption spec (and general question, where to track
> >   specs, Google doc vs reviewed PR + .md file)
> > - Refactor modules to use the new logical type API (PR under review)
> > - new format release scope (nano precision timestamp, bloom filter?,
> >   columnar encryption?)
> >
> > I'll send the meeting invite shortly. Feel free to propose another time
> > slot if it is not suitable for you, and bring any additional topic
> > you'd like to discuss.
> >
> > Regards,
> > Nandor

--
Ryan Blue
Software Engineer
Netflix
Re: [RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++
After a lot of time beating my head against Windows toolchain issues
(I now know a _lot_ about this topic!) I have a green build at

https://github.com/apache/arrow/pull/2453

I'd like to merge this before much more time passes (i.e. today if
possible) and work on getting the outstanding patches migrated.

The only code that isn't a straight-copy is

https://github.com/apache/arrow/pull/2453/commits/fe5d435c9c58af42df4a37e7c97e37f33ae1857d

This contains all the modifications to the build system and CI to get
things fully working. I will have to rebase (preserving the author and
committer for each patch) and then merge --ff-only to get this in

- Wes

On Tue, Sep 4, 2018 at 2:22 PM Wes McKinney wrote:
>
> Great. It is definitely going to require some follow up patches to fix
> up the various packaging tasks, but at least the Linux Python wheels
> will still be working to start
> On Tue, Sep 4, 2018 at 2:04 PM Uwe L. Korn wrote:
> >
> > Hello Wes,
> >
> > I have not much time this week but I hope to squeeze in some minutes
> > tomorrow afternoon to review the code. As this is a very big merge, I want
> > to be extra careful to not break anything really badly. Hopefully more eyes
> > will help.
> >
> > Thank you for all the work in pushing this forward in the last days!
> >
> > Uwe
> >
> > On Tue, Sep 4, 2018, at 6:27 PM, Wes McKinney wrote:
> > > Dear all,
> > >
> > > The repo merge is nearly ready to go modulo some fixes to CI. There
> > > will be a number of follow up issues to re-establish the various
> > > (untested) build procedures in parquet-cpp
> > >
> > > https://github.com/apache/arrow/pull/2453
> > >
> > > I would like to merge this by EOD Wednesday 9/5, or Thursday at
> > > latest, so we can get the patches from apache/parquet-cpp moved over
> > > and avoid any disruption to development process. If there are any
> > > comments please let me know
> > >
> > > - Wes
> > > On Tue, Aug 21, 2018 at 12:23 PM Wes McKinney wrote:
> > > >
> > > > hi all,
> > > >
> > > > with 3 binding +1 votes, the vote carries. We will discuss with Apache
> > > > Arrow about how to specifically proceed
> > > >
> > > > I have already done the preparatory work to undertake the merge
> > > >
> > > > https://github.com/apache/arrow/pull/2453
> > > >
> > > > thanks
> > > > Wes
> > > >
> > > > On Tue, Aug 21, 2018 at 10:41 AM, Wes McKinney
> > > > wrote:
> > > > > Yes, feel free to have a look at
> > > > >
> > > > > https://github.com/apache/arrow/pull/2453
> > > > >
> > > > > I'm not very in favor of having a commingled non-linear history that
> > > > > makes git bisect difficult. We will have to discuss on the Arrow ML
> > > > >
> > > > > Here's an example from Apache Spark where a similar merge took place
> > > > >
> > > > > https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53
> > > > >
> > > > > It would be my preference to have a single squashed commit whose
> > > > > message attributes the developers of the code and provides links back
> > > > > to the original commit history in the commit message
> > > > >
> > > > > - Wes
> > > > >
> > > > >
> > > > > On Tue, Aug 21, 2018 at 9:52 AM, Uwe L. Korn wrote:
> > > > >> I have a very strong preference to keep the git history. I will have
> > > > >> a look tomorrow to find the correct git magic to get a linear
> > > > >> history. For me a single merge commit would be ok but I'm fine to
> > > > >> spend an additional hour on this if you care strongly about linear
> > > > >> history.
> > > > >>
> > > > >> Uwe
> > > > >>
> > > > >> On Sun, Aug 19, 2018, at 7:36 PM, Wes McKinney wrote:
> > > > >>> OK. I'm a bit -0 on doing anything that results in Arrow having a
> > > > >>> nonlinear git history (and rebasing is not really an option) but we
> > > > >>> can discuss that more later
> > > > >>>
> > > > >>> On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn
> > > > >>> wrote:
> > > > >>> > +1 on this but also see my comments in the mail on the
> > > > >>> > discussions.
> > > > >>> >
> > > > >>> > We should also keep the git history of parquet-cpp, that should
> > > > >>> > not be hard with git and there is probably a StackOverflow answer
> > > > >>> > out there that gives you the commands to do the merge.
> > > > >>> >
> > > > >>> > Uwe
> > > > >>> >
> > > > >>> > On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> > > > >>> >> In case any are interested: my estimate of the work involved in
> > > > >>> >> the migration to be about a full day of total work, possibly
> > > > >>> >> less. As soon as the migration plan is decided upon I intend to
> > > > >>> >> execute ASAP so that ongoing development efforts are not
> > > > >>> >> disrupted.
> > > > >>> >>
> > > > >>> >> Additionally, in flight patches do not all need to be merged.
> > > > >>> >> Patches can be easily edited to apply against the modified
> > > > >>> >> repository structure
> > > > >>> >>
> > > > >>> >> On Wed, Aug 15, 2018 at
Re: Date and time for next Parquet sync
Hi Nandor,

Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another week.
Sept 18 is the Yom Kippur eve - this basically means I won't have a
technical ability to join a call.

Regarding the Google doc vs reviewed PR + .md file - it indeed becomes
difficult and unnecessary to maintain two versions of the same
documentation. Following your last mail, there was a high volume of
review activity at the google doc, but now that the spike is winding
down, I'll be removing the duplicate part from the google doc (keeping
the samples), with new comments to go to PRs (md and code). I'll send a
detailed mail early next week.

Cheers, Gidon.

On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar wrote:

> Hi All,
>
> I'd like to propose having a Parquet Sync next week Tuesday (September
> 18th) at 6pm CEST / 9 am PST.
>
> Some of the topics which would be nice to discuss:
> - review column indexes (PRs and feature branch)
> - move Java code from format to mr (PR #517)
> - Bloom filter spec
> - columnar encryption spec (and general question, where to track
>   specs, Google doc vs reviewed PR + .md file)
> - Refactor modules to use the new logical type API (PR under review)
> - new format release scope (nano precision timestamp, bloom filter?,
>   columnar encryption?)
>
> I'll send the meeting invite shortly. Feel free to propose another time
> slot if it is not suitable for you, and bring any additional topic
> you'd like to discuss.
>
> Regards,
> Nandor
Date and time for next Parquet sync
Hi All,

I'd like to propose having a Parquet Sync next week Tuesday (September
18th) at 6pm CEST / 9 am PST.

Some of the topics which would be nice to discuss:
- review column indexes (PRs and feature branch)
- move Java code from format to mr (PR #517)
- Bloom filter spec
- columnar encryption spec (and general question, where to track
  specs, Google doc vs reviewed PR + .md file)
- Refactor modules to use the new logical type API (PR under review)
- new format release scope (nano precision timestamp, bloom filter?,
  columnar encryption?)

I'll send the meeting invite shortly. Feel free to propose another time
slot if it is not suitable for you, and bring any additional topic
you'd like to discuss.

Regards,
Nandor
[jira] [Commented] (PARQUET-1400) Deprecate parquet-mr related code in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607068#comment-16607068 ]

ASF GitHub Bot commented on PARQUET-1400:
-----------------------------------------

gszadovszky opened a new pull request #105: PARQUET-1400: Deprecate parquet-mr related code in parquet-format
URL: https://github.com/apache/parquet-format/pull/105

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Deprecate parquet-mr related code in parquet-format
> ---------------------------------------------------
>
>                 Key: PARQUET-1400
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1400
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> There are java classes in the
> [parquet-format|https://github.com/apache/parquet-format] repo that shall be
> in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test]
> These classes shall be deprecated by mentioning they will be moved to the
> [parquet-mr|https://github.com/apache/parquet-mr] repo.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (PARQUET-1415) Improve logic when to write column indexes
Gabor Szadovszky created PARQUET-1415:
-----------------------------------------

             Summary: Improve logic when to write column indexes
                 Key: PARQUET-1415
                 URL: https://issues.apache.org/jira/browse/PARQUET-1415
             Project: Parquet
          Issue Type: Improvement
            Reporter: Gabor Szadovszky
            Assignee: Gabor Szadovszky

Currently, we always write column indexes. If the data is ordered
(ASCENDING or DESCENDING), filtering benefits greatly from column indexes.
If the data is UNORDERED, however, it is not obvious whether filtering
based on column indexes would make sense. For example, if the data is
random, the min/max values of the different pages might be close to each
other, so in most cases filtering based on these values would not drop
any of the pages.

On the other hand, UNORDERED values are not necessarily random. The
values may be clustered or semi-ordered. We shall detect these cases
somehow before writing the column indexes, and write them only if the
min/max values for the pages do not overlap too much.

Another simple case is when we have only one page. In this case writing
column indexes is useless.
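
The "do not overlap too much" check described above could be approximated as follows. This is only a sketch and not what parquet-mr does: the function name should_write_column_index, the average-coverage metric and the 0.5 threshold are all assumptions made for illustration, and it handles numeric min/max values only. The idea is that a point lookup for a value has to read every page whose [min, max] range contains it, so if the average page range covers most of the column's overall range, the index drops almost nothing.

```python
from typing import List, Tuple


def should_write_column_index(
    page_ranges: List[Tuple[float, float]],
    max_avg_coverage: float = 0.5,  # hypothetical threshold, would need benchmarking
) -> bool:
    """Decide whether per-page min/max values are selective enough to keep.

    page_ranges holds one (min, max) pair per page of a single column chunk.
    """
    # A single page (or none) cannot be skipped, so the index is useless.
    if len(page_ranges) <= 1:
        return False

    overall_min = min(lo for lo, _ in page_ranges)
    overall_max = max(hi for _, hi in page_ranges)
    overall_width = overall_max - overall_min

    # All pages hold the same single value: nothing can be filtered out.
    if overall_width == 0:
        return False

    # Average fraction of the overall value range covered by one page.
    # Close to 1.0 means nearly every page matches nearly every lookup.
    covered = sum(hi - lo for lo, hi in page_ranges)
    avg_coverage = covered / (len(page_ranges) * overall_width)
    return avg_coverage <= max_avg_coverage
```

With this metric, sorted or clustered pages score low (index kept), while pages of uniformly random values each span almost the whole range and score close to 1.0 (index skipped).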
[jira] [Created] (PARQUET-1414) Limit page size based on maximum row count
Gabor Szadovszky created PARQUET-1414:
-----------------------------------------

             Summary: Limit page size based on maximum row count
                 Key: PARQUET-1414
                 URL: https://issues.apache.org/jira/browse/PARQUET-1414
             Project: Parquet
          Issue Type: Improvement
            Reporter: Gabor Szadovszky
            Assignee: Gabor Szadovszky

For column index based filtering it is important to have enough pages for
a column. When an encoding suits the data perfectly, all of the values
may end up encoded in a single page (e.g. a column holding an ascending
counter). With this improvement we would be able to cap the number of
rows written to a page, so that every column has enough pages.

A good default value should be benchmarked. As an initial value, we can
use 10k.
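
The effect of such a row-count cap can be illustrated with a small simulation. This is a sketch under stated assumptions, not the parquet-mr page writer: split_into_pages, its parameters and the 1 MiB size threshold are hypothetical names and values, each value is treated as one row (no nesting), and encoded_sizes stands in for the encoded size in bytes of each value. Only the 10k row limit comes from the issue text.

```python
from typing import List, Optional


def split_into_pages(
    encoded_sizes: List[int],
    max_page_bytes: int,
    max_page_rows: Optional[int] = 10_000,  # initial value suggested in the issue
) -> List[int]:
    """Return the number of rows that land in each page.

    A page is closed as soon as its encoded size reaches max_page_bytes
    or, when max_page_rows is set, its row count reaches that limit.
    """
    pages: List[int] = []
    rows_in_page = 0
    bytes_in_page = 0

    for size in encoded_sizes:
        rows_in_page += 1
        bytes_in_page += size

        size_limit_hit = bytes_in_page >= max_page_bytes
        row_limit_hit = max_page_rows is not None and rows_in_page >= max_page_rows
        if size_limit_hit or row_limit_hit:
            pages.append(rows_in_page)
            rows_in_page = 0
            bytes_in_page = 0

    # Flush the last, partially filled page.
    if rows_in_page:
        pages.append(rows_in_page)
    return pages
```

For a column that encodes to roughly one byte per value, 25,000 rows fit in one page under a size-only limit of 1 MiB, whereas the row cap splits them into three pages that a column index can then distinguish.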
[jira] [Assigned] (PARQUET-1400) Deprecate parquet-mr related code in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-1400:
-----------------------------------------

    Assignee: Gabor Szadovszky

> Deprecate parquet-mr related code in parquet-format
> ---------------------------------------------------
>
>                 Key: PARQUET-1400
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1400
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>
> There are java classes in the
> [parquet-format|https://github.com/apache/parquet-format] repo that shall be
> in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test]
> These classes shall be deprecated by mentioning they will be moved to the
> [parquet-mr|https://github.com/apache/parquet-mr] repo.