Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-20 Thread Wes McKinney
A couple more points to make re: Uwe's comments: > An important point that we should keep in mind (and why I was a bit concerned in > the previous times this discussion was raised) is that we have to be careful > to not pull everything that touches Arrow into the Arrow repository. An important

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-19 Thread Wes McKinney
hi Uwe, I agree with your points. Currently we have 3 software artifacts: 1. Arrow C++ libraries 2. Parquet C++ libraries with Arrow columnar integration 3. C++ interop layer for Python + Cython bindings Changes in #1 prompt an awkward workflow involving multiple PRs; as a result of this we

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-19 Thread Uwe L. Korn
Back from vacation, I also want to finally raise my voice. With the current state of the Parquet<->Arrow development, I see a benefit in merging the code base for now, but not necessarily forever. Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built and that

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-07 Thread Wes McKinney
Thanks Ryan, will do. The people I'd still like to hear from are: * Phillip Cloud * Uwe Korn As ASF contributors we are responsible both for being pragmatic and for acting in the best interests of the community's health and productivity. On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue wrote: > I

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-07 Thread Ryan Blue
I don't have an opinion here, but could someone send a summary of what is decided to the dev list once there is consensus? This is a long thread for parts of the project I don't work on, so I haven't followed it very closely. On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney wrote: > > It will be

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-07 Thread Wes McKinney
> It will be difficult to track parquet-cpp changes if they get mixed with > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs? > Can we enforce that parquet-cpp changes will not be committed without a > corresponding Parquet JIRA? I think we would use the following

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-07 Thread Deepak Majeti
I have a few more logistical questions to add. It will be difficult to track parquet-cpp changes if they get mixed with Arrow changes. Will we establish some guidelines for filing Parquet JIRAs? Can we enforce that parquet-cpp changes will not be committed without a corresponding Parquet JIRA? I

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-07 Thread Wes McKinney
Do other people have opinions? I would like to undertake this work in the near future (the next 8-10 weeks); I would be OK with taking responsibility for the primary codebase surgery. Some logistical questions: * We have a handful of pull requests in flight in parquet-cpp that would need to be

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-01 Thread Wes McKinney
Thanks Tim. Indeed, it's not very simple. Just today Antoine cleaned up some platform code intending to improve the performance of bit-packing in Parquet writes, and we ended up with 2 interdependent PRs: * https://github.com/apache/parquet-cpp/pull/483 * https://github.com/apache/arrow/pull/2355

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-01 Thread Tim Armstrong
I don't have a direct stake in this beyond wanting to see Parquet be successful, but I thought I'd give my two cents. For me, the thing that makes the biggest difference in contributing to a new codebase is the number of steps in the workflow for writing, testing, posting and iterating on a

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
hi, On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti wrote: > I think the circular dependency can be broken if we build a new library for > the platform code. This will also make it easy for other projects such as > ORC to use it. > I also remember your proposal a while ago of having a separate

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
> The current Arrow adaptor code for parquet should live in the arrow repo. > That will remove a majority of the dependency issues. Joshua's work would not > have been blocked in parquet-cpp if that adapter was in the arrow repo. This > will be similar to the ORC adaptor. This has been

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Julian Hyde
A controlled fork doesn’t sound like a terrible option. Copy the code from parquet into arrow, and for a limited period of time it would be the primary. When that period is over, the code in parquet becomes the primary. During the period during which arrow has the primary, the parquet release

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
> If you still strongly feel that the only way forward is to clone the > parquet-cpp repo and part ways, I will withdraw my concern. Having two > parquet-cpp repos is in no way a better approach. Yes, indeed. In my view, the next best option after a monorepo is to fork. That would obviously be a

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Deepak Majeti
Wes, Unfortunately, I cannot show you any practical fact-based problems of a non-existent Arrow-Parquet mono-repo. Bringing in related Apache community experiences is more meaningful than how mono-repos work at Google and other big organizations. We solely depend on volunteers and cannot hire

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
@Antoine > By the way, one concern with the monorepo approach: it would slightly > increase Arrow CI times (which are already too large). A typical CI run in Arrow is taking about 45 minutes: https://travis-ci.org/apache/arrow/builds/410119750 Parquet run takes about 28

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Joshua Storck
I recently worked on an issue that had to be implemented in parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585, ARROW-2586). I found the circular dependencies confusing and hard to work with. For example, I still have a PR open in parquet-cpp (created on May 10) because

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Joshua Storck
Your point about the constraints of the ASF release process is well taken, and as a developer who's trying to work in the current environment I would be much happier if the codebases were merged. The main issues I worry about when you put codebases like these together are: 1. The delineation of

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Wes McKinney
> I would like to point out that arrow's use of orc is a great example of how > it would be possible to manage parquet-cpp as a separate codebase. That gives > me hope that the projects could be managed separately some day. Well, I don't know that ORC is the best example. The ORC C++ codebase

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Wes McKinney
hi Josh, > I can imagine use cases for parquet that don't involve arrow and tying them > together seems like the wrong choice. Apache is "Community over Code"; right now it's the same people building these projects -- my argument (which I think you agree with?) is that we should work more

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Wes McKinney
On Mon, Jul 30, 2018 at 8:50 PM, Ted Dunning wrote: > On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote: > >> >> > The community will be less willing to accept large >> > changes that require multiple rounds of patches for stability and API >> > convergence. Our contributions to Libhdfs++ in

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Ted Dunning
On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote: > > > The community will be less willing to accept large > > changes that require multiple rounds of patches for stability and API > > convergence. Our contributions to Libhdfs++ in the HDFS community took a > > significantly long time for the

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Wes McKinney
hi, On Mon, Jul 30, 2018 at 6:52 PM, Deepak Majeti wrote: > Wes, > > I definitely appreciate and do see the impact of contributions made by > everyone. I made this statement not to rate any contributions but solely to > support my concern. > The contribution barrier is higher simply because of

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Julian Hyde
I'm not going to comment on the design of the parquet-cpp module and whether it is “closer” to parquet or arrow. But I do think Wes’s proposal is consistent with Apache policy. PMCs make releases and govern communities; they don’t exist to manage code bases, except as a means to the end of

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Deepak Majeti
Wes, I definitely appreciate and do see the impact of contributions made by everyone. I made this statement not to rate any contributions but solely to support my concern. The contribution barrier is higher simply because of the increased code, build, and test dependencies. If the community has

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Wes McKinney
hi Deepak On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti wrote: > @Wes > My observation is that most of the parquet-cpp contributors you listed that > overlap with the Arrow community mainly contribute to the Arrow > bindings(parquet::arrow layer)/platform API changes in the parquet-cpp > repo.

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Deepak Majeti
@Wes My observation is that most of the parquet-cpp contributors you listed that overlap with the Arrow community mainly contribute to the Arrow bindings (parquet::arrow layer)/platform API changes in the parquet-cpp repo. Very few of them review/contribute patches to the parquet-cpp core. I

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Donald E. Foss
Could this work if each module were configured as a git sub-repo? A top-level build tool would go into each sub-repo and pick the correct release version to test. Tests in Python would depend on the cpp sub-repo to ensure the API still passes. This should be the best of both worlds, if sub-repos are supposed

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-29 Thread Philipp Moritz
I do not claim to have insight into parquet-cpp development. However, from our experience developing Ray, I can say that the monorepo approach (for Ray) has improved things a lot. Previously, we tried various schemes to split the project into multiple repos, but the build system and test infrastructure

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-29 Thread Wes McKinney
hi Donald, This would make things worse, not better. Code changes routinely involve changes to the build system, and so you could be talking about having to make changes to 2 or 3 git repositories as the result of a single new feature or bug fix. There isn't really a cross-repo CI solution

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-29 Thread Wes McKinney
hi Deepak, responses inline On Sun, Jul 29, 2018 at 10:44 PM, Deepak Majeti wrote: > I dislike the current build system complications as well. > > However, in my opinion, combining the code bases will severely impact the > progress of the parquet-cpp project and implicitly the progress of the >

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-29 Thread Deepak Majeti
I dislike the current build system complications as well. However, in my opinion, combining the code bases will severely impact the progress of the parquet-cpp project and implicitly the progress of the entire parquet project. Combining would have made much more sense if parquet-cpp were a mature