Re: Multifile parquet support

2020-09-07 Thread Radu Teodorescu
Thank you everyone for the input,

Per Neal’s suggestion here is a PR: https://github.com/apache/arrow/pull/8130

The implementation of the logic is all in and it runs as expected in my OS X 
environment, but for starters it fails in CI, and I still need to add tests and 
clean up the examples, etc. to make it merge-ready.

As it stands though it fully illustrates my intentions, so please let me know 
what you think while I add the proper cleanup touches.

Thanks again
Radu

> On Sep 4, 2020, at 3:31 PM, Weston Pace  wrote:
> 
> Hello Radu,
> 
> If your goal is strictly "append" with common schema then maybe the
> terminology you are looking for is "append a parquet file to a parquet
> dataset" and not "append a row group to a multi-file parquet file".
> Parquet datasets (and arrow datasets) support having a common schema
> which is used to validate and/or represent the individual files in the
> dataset.  You can append a file to a parquet dataset and it should be
> straightforward.  However, each file in the dataset could stand on its
> own and will have its own schema as part of that file's metadata.  If
> that is a deal breaker then it probably is not an arrow question
> because I do not believe the parquet cpp library that arrow relies on
> supports creating a parquet file without a schema.
> 
> -Weston
> 
> On Fri, Sep 4, 2020 at 8:04 AM Neal Richardson
>  wrote:
>> 
>> Hi Radu,
>> It might be easier to get feedback on some concrete code. Perhaps make a PR
>> with a proof of concept and we can discuss there?
>> 
>> Neal
>> 
>> On Fri, Sep 4, 2020 at 4:27 AM Radu Teodorescu 
>> wrote:
>> 
>>> Micah and all,
>>> Thanks for that pointer, I certainly didn’t follow it in detail at the
>>> time.
>>> 
>>> My question/thoughts are actually more limited in scope: I am
>>> specifically targeting features that are supported by the standard AND
>>> by other major parquet implementations.
>>> 
>>> Specifically, I would like to enable support for having RowGroups in
>>> separate files and (as a side effect) be able to keep metadata in a separate
>>> file.
>>> This seems to be supported by the spec and by most readers including arrow
>>> (at least from scanning the code).
>>> 
>>> If the above are true (or at least not known to be false), it seems like
>>> the writer can be modified fairly easily to support that and I am happy to
>>> look into making that change.
>>> 
>>> Thoughts?
>>> Radu
>>> 
>>> PS: don’t mean to be stubborn by keeping it on the arrow group, but it
>>> seems like it is an arrow implementation specific goal.
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Sep 3, 2020, at 6:42 PM, Micah Kornfield wrote:
>>>> 
>>>> Hi Radu,
>>>> This is a conversation best had on dev@parquet.  It came up recently [1]
>>>> and I cross-posted there as well.
>>>> 
>>>> [1]
>>>> https://lists.apache.org/thread.html/re4fe4bc80c9eadd446761588f9b03d827193f91269a7c14ce0c444dd%40%3Cdev.arrow.apache.org%3E
>>>> 
>>>> On Thu, Sep 3, 2020 at 3:20 PM Radu Teodorescu wrote:
>>>> 
>>>>> Hello,
>>>>> What is the current thinking around allowing the logical content of a
>>>>> parquet file to be split across multiple files?
>>>>> I see that in theory there is support for reading files where different
>>>>> row groups are in separate files but I cannot see any features that
>>>>> allow that for writing.
>>>>> 
>>>>> On a somewhat related note, what are the thoughts on supporting parquet
>>>>> file append mode?
>>>>> Specifically if the metadata is stored in a standalone file one can
>>>>> easily add new row groups to an existing file and create a new version
>>>>> of the metadata file without affecting potential consumers of the
>>>>> existing data.
> 
> 
> 
>>> 
>>> 
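
The append idea in the original message above (new row groups written to new files, plus a fresh version of a standalone metadata file) can be illustrated with a toy model. The sketch below is not Parquet and uses no Arrow API: data files are plain JSON, and the names `append_rows`, `read_snapshot`, `part-N.json` and `_metadata.vN.json` are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_rows(root: Path, rows: list, version: int) -> Path:
    """Write rows to a new data file and publish a new metadata version.

    Existing data and metadata files are never modified, so a reader that
    holds an older metadata file keeps seeing a consistent snapshot.
    """
    # Load the previous metadata version, if there is one.
    prev = root / f"_metadata.v{version - 1}.json"
    groups = json.loads(prev.read_text())["row_groups"] if prev.exists() else []

    # Each append becomes one new, immutable "row group" file.
    data_file = root / f"part-{version}.json"
    data_file.write_text(json.dumps(rows))

    # The new metadata lists all earlier row groups plus the new one.
    groups = groups + [{"file": data_file.name, "num_rows": len(rows)}]
    meta = root / f"_metadata.v{version}.json"
    meta.write_text(json.dumps({"row_groups": groups}))
    return meta


def read_snapshot(meta: Path) -> list:
    """Read every row referenced by one metadata version."""
    rows = []
    for group in json.loads(meta.read_text())["row_groups"]:
        rows.extend(json.loads((meta.parent / group["file"]).read_text()))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    v1 = append_rows(root, [{"x": 1}, {"x": 2}], version=1)
    v2 = append_rows(root, [{"x": 3}], version=2)
    old_view = read_snapshot(v1)  # unaffected by the second append
    new_view = read_snapshot(v2)  # sees both row groups
```

In this model a consumer that opened the version 1 metadata is not disturbed by the second append, which is the property the original message asks for. Whether the Arrow writer can offer the same guarantee is what the PR discussion has to settle.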



Re: Closing Plasma issues?

2020-09-07 Thread Antoine Pitrou


I would certainly be ok with removing Plasma.  Factually, it's unmaintained.

Regards

Antoine.


Le 07/09/2020 à 21:06, Uwe L. Korn a écrit :
> If we do that, we should be clear about it and remove the code. Shipping 
> Plasma as part of the release while not maintaining it like the other parts 
> of the Arrow libraries seems inconsistent and will just be an annoyance to 
> users who find a partly unusable component.
> 
> Cheers
> Uwe
> 
> On Mon, Sep 7, 2020, at 7:58 PM, Robert Nishihara wrote:
>> I think that makes sense. They can be reopened if necessary.
>>
>> On Mon, Sep 7, 2020 at 9:49 AM Antoine Pitrou  wrote:
>>
>>>
>>> Hello,
>>>
>>> The Plasma component in our C++ codebase is now unmaintained, with the
>>> original authors and maintainers having forked the codebase on their
>>> side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
>>>  Is there any concern about this?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>


Re: Closing Plasma issues?

2020-09-07 Thread Uwe L. Korn
If we do that, we should be clear about it and remove the code. Shipping 
Plasma as part of the release while not maintaining it like the other parts 
of the Arrow libraries seems inconsistent and will just be an annoyance to 
users who find a partly unusable component.

Cheers
Uwe

On Mon, Sep 7, 2020, at 7:58 PM, Robert Nishihara wrote:
> I think that makes sense. They can be reopened if necessary.
> 
> On Mon, Sep 7, 2020 at 9:49 AM Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > The Plasma component in our C++ codebase is now unmaintained, with the
> > original authors and maintainers having forked the codebase on their
> > side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
> >  Is there any concern about this?
> >
> > Regards
> >
> > Antoine.
> >
>


Re: Closing Plasma issues?

2020-09-07 Thread Robert Nishihara
I think that makes sense. They can be reopened if necessary.

On Mon, Sep 7, 2020 at 9:49 AM Antoine Pitrou  wrote:

>
> Hello,
>
> The Plasma component in our C++ codebase is now unmaintained, with the
> original authors and maintainers having forked the codebase on their
> side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
>  Is there any concern about this?
>
> Regards
>
> Antoine.
>


Closing Plasma issues?

2020-09-07 Thread Antoine Pitrou


Hello,

The Plasma component in our C++ codebase is now unmaintained, with the
original authors and maintainers having forked the codebase on their
side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
 Is there any concern about this?

Regards

Antoine.


[NIGHTLY] Arrow Build Report for Job nightly-2020-09-07-0

2020-09-07 Thread Crossbow


Arrow Build Report for Job nightly-2020-09-07-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0

Failed Tasks:
- centos-8-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-centos-8-aarch64
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-test-conda-python-3.7-hdfs-2.9.2

Succeeded Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-centos-7-amd64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-osx-clang-py38
- conda-win-vs2017-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-debian-stretch-arm64
- example-cpp-minimal-build-static-system-dependency:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-travis-homebrew-r-autobrew
- nuget:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-nuget
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-07-0-github-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: