I do not claim to have insight into parquet-cpp development. However, from
our experience developing Ray, I can say that the monorepo approach (for
Ray) has improved things a lot. Before that, we tried various schemes to
split the project into multiple repos, but the duplicated build system and
test infrastructure, plus the overhead of synchronizing changes, slowed
development down significantly (and fixing bugs that touched both the
subrepos and the main repo was inconvenient).

Also, the decision to put arrow and parquet-cpp into a common repo is
independent of how tightly coupled the two projects are (and there could be
a matrix entry in Travis which tests that PRs keep them decoupled, or
rather that they both just depend on a small common "base"). Google and
Facebook demonstrate such independence by having many, many projects in the
same repo, of course. I think it would be great if the open source
community moved more in this direction too.
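
As a rough illustration of what such a decoupling check could look like, here
is a minimal Python sketch that scans C++ sources for `#include` lines and
flags any arrow header outside an allowed "base". The allowed prefixes, the
file names and the function name are all hypothetical, chosen only for
illustration; they do not describe the actual layout of either project.

```python
import re

# Hypothetical "base": the only arrow headers parquet sources may include.
# These prefixes are assumptions made for this sketch.
ALLOWED_ARROW_PREFIXES = ("arrow/util/", "arrow/io/", "arrow/memory_pool.h")

# Matches lines such as: #include "arrow/io/file.h" or #include <arrow/table.h>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^">]+)[">]', re.MULTILINE)


def find_violations(sources, allowed=ALLOWED_ARROW_PREFIXES):
    """Return (path, header) pairs for arrow includes outside the allowed base.

    `sources` maps a file path to the text of that file.
    """
    violations = []
    for path, text in sorted(sources.items()):
        for header in INCLUDE_RE.findall(text):
            if header.startswith("arrow/") and not header.startswith(allowed):
                violations.append((path, header))
    return violations


if __name__ == "__main__":
    # Made-up file content, standing in for the sources touched by a PR.
    example = {
        "parquet/reader.cc": '#include "arrow/io/file.h"\n#include "arrow/table.h"\n',
    }
    for path, header in find_violations(example):
        print(f"{path}: includes {header}, which is outside the common base")
```

A CI job could run something like this over the parquet sources and fail the
build when the returned list is non-empty.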

Best,
Philipp.

On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Donald,
>
> This would make things worse, not better. Code changes routinely
> involve changes to the build system, and so you could be talking about
> having to make changes to 2 or 3 git repositories as the result of a
> single new feature or bug fix. There isn't really a cross-repo CI
> solution available.
>
> I've seen some approaches to the monorepo problem using multiple git
> repositories, such as
>
> https://github.com/twosigma/git-meta
>
> Until something like this has first class support by the GitHub
> platform and its CI services (Travis CI, Appveyor), I don't think it
> will work for us.
>
> - Wes
>
> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com>
> wrote:
> > Could this work if each module is configured as a git sub-repo? The
> > top-level build tool would go into each sub-repo and pick the correct
> > release version to test. The Python tests would depend on the cpp
> > sub-repo to ensure the API tests still pass.
> >
> > This should be the best of both worlds, if sub-repos are a supported
> > option.
> >
> > --Donald E. Foss
> >
> > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
> > wrote:
> >
> >> I dislike the current build system complications as well.
> >>
> >> However, in my opinion, combining the code bases will severely impact
> >> the progress of the parquet-cpp project and implicitly the progress of
> >> the entire parquet project.
> >> Combining would make much more sense if parquet-cpp were a mature
> >> project and codebase. But parquet-cpp (and the entire parquet project)
> >> is evolving continuously, with new features being added, including
> >> bloom filters, column encryption, and indexes.
> >>
> >> If the two code bases are merged, it will be much more difficult to
> >> contribute to the parquet-cpp project, since Arrow bindings will now
> >> have to be supported as well. Please correct me if I am wrong here.
> >>
> >> Of the two evils, I think handling the build system and packaging
> >> duplication is much more manageable, since they are quite stable at
> >> this point.
> >>
> >> Regarding "* API changes cause awkward release coordination issues
> >> between Arrow and Parquet": can we make minor releases for parquet-cpp
> >> (with the API changes needed) as and when Arrow is released?
> >>
> >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> >> converting between Arrow columnar memory format and Parquet": can this
> >> be moved to the Arrow project, exposing the more stable low-level APIs
> >> in parquet-cpp?
> >>
> >> I am also curious if the Arrow and Parquet Java implementations have
> >> similar API compatibility issues.
> >>
> >>
> >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com>
> >> wrote:
> >>
> >> > hi folks,
> >> >
> >> > We've been struggling for quite some time with the development
> >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >> >
> >> > To explain the root issues:
> >> >
> >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> >> > includes file interfaces, memory management, miscellaneous algorithms
> >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> >> > dependency was introduced, there was significant duplicated code
> >> > between these codebases and incompatible abstract interfaces for
> >> > things like files
> >> >
> >> > * we maintain Arrow conversion code in parquet-cpp for converting
> >> > between Arrow columnar memory format and Parquet
> >> >
> >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> >> > Apache Arrow. This introduces a circular dependency into our CI.
> >> >
> >> > * Substantial portions of our CMake build system and related tooling
> >> > are duplicated between the Arrow and Parquet repos
> >> >
> >> > * API changes cause awkward release coordination issues between Arrow
> >> > and Parquet
> >> >
> >> > I believe the best way to remedy the situation is to adopt a
> >> > "Community over Code" approach and find a way for the Parquet and
> >> > Arrow C++ development communities to operate out of the same code
> >> > repository, i.e. the apache/arrow git repository.
> >> >
> >> > This would bring major benefits:
> >> >
> >> > * Shared CMake build infrastructure, developer tools, and CI
> >> > infrastructure (Parquet is already being built as a dependency in
> >> > Arrow's CI systems)
> >> >
> >> > * Shared packaging and release management infrastructure
> >> >
> >> > * Reduce / eliminate problems due to API changes (where we currently
> >> > introduce breakage into our CI workflow when there is a breaking /
> >> > incompatible change)
> >> >
> >> > * Arrow releases would include a coordinated snapshot of the Parquet
> >> > implementation as it stands
> >> >
> >> > Continuing with the status quo has become unsatisfactory to me and as
> >> > a result I've become less motivated to work on the parquet-cpp
> >> > codebase.
> >> >
> >> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> >> > Majeti. I think the issue of commit privileges could be resolved
> >> > without too much difficulty or time.
> >> >
> >> > I also think that the Apache Parquet community could create release
> >> > scripts to cut a minimal versioned Apache Parquet C++ release if that
> >> > is deemed truly necessary.
> >> >
> >> > I know that some people are wary of monorepos and megaprojects, but as
> >> > an example, TensorFlow is at least 10 times as large a project in
> >> > terms of LOC and number of different platform components, and it
> >> > seems to be getting along just fine. I think we should be able to work
> >> > together as a community to function just as well.
> >> >
> >> > Interested in the opinions of others, and any other ideas for
> >> > practical solutions to the above problems.
> >> >
> >> > Thanks,
> >> > Wes
> >> >
> >>
> >>
> >> --
> >> regards,
> >> Deepak Majeti
> >>
>
