Re: plan for Go implementation of Plasma

2018-12-19 Thread Kouhei Sutou
Hi, The GObject Plasma bindings mentioned by Philipp are the official C bindings for Plasma (Plasma GLib): https://github.com/apache/arrow/tree/master/c_glib/plasma-glib The Ruby bindings use them, so we'll maintain and improve them. There were examples to generate Go bindings for Arrow from Arrow GLib

[jira] [Created] (ARROW-4085) [GLib] Use "field" for struct data type

2018-12-19 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4085: --- Summary: [GLib] Use "field" for struct data type Key: ARROW-4085 URL: https://issues.apache.org/jira/browse/ARROW-4085 Project: Apache Arrow Issue Type:

Re: Dictionary with repeated values?

2018-12-19 Thread Wes McKinney
The way that dictionary encoding is implemented in C++ (with DictionaryType, DictionaryArray) is a construct particular to the library. At the protocol level, dictionary encoding is a property of a field at some level of a schema tree [1]. The dictionary itself is a record batch with a single
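As a rough illustration of the general idea behind dictionary encoding (a list of distinct values plus integer indices pointing into it), here is a minimal plain-Python sketch. It does not use the Arrow C++ or Python API; the function names `dictionary_encode` and `dictionary_decode` are hypothetical and exist only for this example.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Split values into (dictionary, indices).

    The dictionary holds each distinct value once, in first-seen order.
    Each index is the position of the original value in the dictionary.
    """
    dictionary: List[Hashable] = []
    positions: Dict[Hashable, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: Sequence[Hashable], indices: Sequence[int]) -> List[Hashable]:
    """Rebuild the dense values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]
```

Decoding only needs index lookups, so in this sketch nothing would break if the dictionary happened to contain a repeated value; the encoder above simply never produces one.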

Re: plan for Go implementation of Plasma

2018-12-19 Thread Philipp Moritz
Hey Dustin, Thanks for getting in touch! Here are two additional ways to do it: 5. Native Go client library: If Go has support to ship file descriptors over Unix domain sockets (which I think it has, see https://github.com/opencontainers/runc/blob/master/libcontainer/utils/cmsg.go) and interact
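The mechanism mentioned here, shipping a file descriptor over a Unix domain socket, can be sketched with the Python standard library (Python is used only for illustration; this is not the Plasma client protocol and not a Go implementation). The helpers `send_fd`, `recv_fd` and `demo` are hypothetical names, and the sketch assumes a Unix-like system where `SCM_RIGHTS` ancillary messages are available.

```python
import array
import os
import socket
import tempfile


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    # At least one byte of ordinary payload must accompany the ancillary data.
    sock.sendmsg(
        [b"x"],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo() -> bytes:
    """Pass a temp file's descriptor through a socket pair and read via the copy."""
    sender, receiver = socket.socketpair()
    try:
        with tempfile.TemporaryFile() as f:
            f.write(b"shared bytes")
            f.flush()
            send_fd(sender, f.fileno())
            received = recv_fd(receiver)
            try:
                # The received descriptor shares the file offset, so rewind first.
                os.lseek(received, 0, os.SEEK_SET)
                return os.read(received, 64)
            finally:
                os.close(received)
    finally:
        sender.close()
        receiver.close()
```

In a real client the two socket ends would live in different processes; the socket pair inside one process is used here only to keep the sketch self-contained.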

plan for Go implementation of Plasma

2018-12-19 Thread Dustin Long
Hi all! I am a developer on qri, a data-science tool built on IPFS and written in Go. We're interested in integrating Arrow and especially Plasma, in order to be able to share datasets with other apps like Jupyter Notebook. Having this functionality is going to be key for how we

[jira] [Created] (ARROW-4084) Simplify Status and stringstream boilerplate

2018-12-19 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4084: - Summary: Simplify Status and stringstream boilerplate Key: ARROW-4084 URL: https://issues.apache.org/jira/browse/ARROW-4084 Project: Apache Arrow

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Alberto Ramón
Some answers / ideas: The typical: write to Kafka. The fashionable: Pravega (from Apache Flink). The future: wait for erasure coding in HDFS 3. On Wed, 19 Dec 2018 at 16:41, Wes McKinney wrote: > We could certainly develop some tools in C++ and/or Python to assist > with the compaction workflows. If you

[jira] [Created] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2018-12-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4083: --- Summary: [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type) Key: ARROW-4083 URL:

[jira] [Created] (ARROW-4082) [C++] CMake tweaks: allow RelWithDebInfo, improve FindClangTools

2018-12-19 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4082: Summary: [C++] CMake tweaks: allow RelWithDebInfo, improve FindClangTools Key: ARROW-4082 URL: https://issues.apache.org/jira/browse/ARROW-4082 Project:

[jira] [Created] (ARROW-4081) Sum methods on Mac OS X panic when the array is empty

2018-12-19 Thread Jonathan A Sternberg (JIRA)
Jonathan A Sternberg created ARROW-4081: --- Summary: Sum methods on Mac OS X panic when the array is empty Key: ARROW-4081 URL: https://issues.apache.org/jira/browse/ARROW-4081 Project: Apache

[jira] [Created] (ARROW-4080) [Rust] Improving lengthy build times in Appveyor

2018-12-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4080: --- Summary: [Rust] Improving lengthy build times in Appveyor Key: ARROW-4080 URL: https://issues.apache.org/jira/browse/ARROW-4080 Project: Apache Arrow Issue

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Wes McKinney
We could certainly develop some tools in C++ and/or Python to assist with the compaction workflows. If you have an idea about how these might look and be generally useful, please feel free to propose them in a JIRA issue. On Wed, Dec 19, 2018 at 9:09 AM Joel Pfaff wrote: > > Unfortunately I cannot use

[jira] [Created] (ARROW-4079) [C++] Add machine benchmarks

2018-12-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4079: - Summary: [C++] Add machine benchmarks Key: ARROW-4079 URL: https://issues.apache.org/jira/browse/ARROW-4079 Project: Apache Arrow Issue Type: Wish

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Joel Pfaff
Unfortunately I cannot use Kudu in my projects; I would have loved to give it a try. I did not know about Hudi; it seems very similar to what we do (Parquet + Avro), so I will have a look. I am following the Iceberg project very closely, because it appears to solve a lot of problems that we face on a

[jira] [Created] (ARROW-4078) [CI] Need separate doc building job

2018-12-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4078: - Summary: [CI] Need separate doc building job Key: ARROW-4078 URL: https://issues.apache.org/jira/browse/ARROW-4078 Project: Apache Arrow Issue Type: Bug

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Krisztián Szűcs
We now have a nightly docs build: https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=docs Once we decide where to upload it, we can publish nightly dev docs. On Wed, Dec 19, 2018 at 3:12 PM Wes McKinney wrote: > Indeed. I had opened an issue about this some time ago > >

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Wes McKinney
Indeed. I had opened an issue about this some time ago https://issues.apache.org/jira/browse/ARROW-1299 On Wed, Dec 19, 2018 at 8:10 AM Antoine Pitrou wrote: > > > On 19/12/2018 at 15:07, Wes McKinney wrote: > > +1 also. The C++ README has grown quite long, for example. Probably to > > put

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Antoine Pitrou
On 19/12/2018 at 15:07, Wes McKinney wrote: > +1 also. The C++ README has grown quite long, for example. Probably to > put all of that in the Sphinx project. > > One downside of Sphinx is that some things can grow out of date on the > website in between releases. Within the codebase itself,

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Wes McKinney
+1 also. The C++ README has grown quite long, for example. Probably to put all of that in the Sphinx project. One downside of Sphinx is that some things can grow out of date on the website in between releases. Within the codebase itself, we can remedy this by directing people to the .rst files

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
On Wed, Dec 19, 2018 at 7:47 AM Antoine Pitrou wrote: > > > On 19/12/2018 at 14:42, Wes McKinney wrote: > > > > * Our PR merge tool (dev/merge_arrow_pr.py) squashes all the commits > > anyway, so squashing twice is redundant > > The problem is you can then get spurious conflicts if you base a

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Uwe L. Korn
This can also be solved by using a table format like https://github.com/uber/hudi or https://github.com/apache/incubator-iceberg where the latter has a PR open for a basic Python implementation with pyarrow. These table formats support using Avro and Parquet seamlessly together without the

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
On Wed, Dec 19, 2018 at 7:47 AM Francois Saint-Jacques wrote: > > No issue with this. > > When the final squash is done, which title/body is preserved? The PR title (in GitHub) and the PR description are what matter. The commit messages don't really matter > > On Wed, Dec 19, 2018 at 8:43 AM

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Wes McKinney
This turns out to be a very common problem (landing incremental updates, dealing with compaction and small files). It's part of the reason that systems like Apache Kudu were developed, e.g. https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ If you

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Antoine Pitrou
On 19/12/2018 at 14:42, Wes McKinney wrote: > > * Our PR merge tool (dev/merge_arrow_pr.py) squashes all the commits > anyway, so squashing twice is redundant The problem is you can then get spurious conflicts if you base a PR on another. Happened to me several times. Regards Antoine.

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Francois Saint-Jacques
No issue with this. When the final squash is done, which title/body is preserved? On Wed, Dec 19, 2018 at 8:43 AM Wes McKinney wrote: > hi folks, > > As the contributor base has grown, our development styles have grown > increasingly diverse. > > Sometimes contributors are used to working in a

[jira] [Created] (ARROW-4077) [Gandiva] fix CI if ctest doesn't run any tests

2018-12-19 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4077: - Summary: [Gandiva] fix CI if ctest doesn't run any tests Key: ARROW-4077 URL: https://issues.apache.org/jira/browse/ARROW-4077 Project: Apache Arrow

Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
hi folks, As the contributor base has grown, our development styles have grown increasingly diverse. Sometimes contributors are used to working in a Gerrit-style workflow where patches are always squashed with `git rebase -i` into a single patch, and then force pushed to the PR branch. I'd like

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Joel Pfaff
Hello, For my company's use cases, we have found that the number of files was a critical part of the time spent doing the execution plan, so we found the idea of very regularly writing small parquet files to be rather inefficient. There are some formats that support an `append` semantic (I have

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Francois Saint-Jacques
Hello Darren, what Uwe suggests is usually the way to go: your active process writes to a new file every time. Then you have a parallel process/thread that does compaction of smaller files in the background such that you don't have too many files. On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Uwe L. Korn
Hello Darren, you're out of luck here. Parquet files are immutable and meant for batch writes. Once they're written you cannot modify them anymore. To load them, you need to know their metadata which is in the footer. The footer is always at the end of the file and written once you call
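To make the footer point concrete, here is a minimal standard-library sketch that checks whether a file looks like a finished, unencrypted Parquet file. It relies on the Parquet file layout (a 4-byte `PAR1` magic at the start, and at the end the file metadata followed by a 4-byte little-endian metadata length and another `PAR1`). The function name `has_complete_footer` is hypothetical, and this is not how pyarrow itself validates files; it only shows why a file that is still being written fails to load.

```python
import io
import os
import struct
from typing import BinaryIO

PARQUET_MAGIC = b"PAR1"
# Leading magic (4) + footer length field (4) + trailing magic (4).
MIN_SIZE = 12


def has_complete_footer(f: BinaryIO) -> bool:
    """Return True if a seekable binary file has Parquet magic bytes at both
    ends and a footer length that fits inside the file."""
    size = f.seek(0, os.SEEK_END)
    if size < MIN_SIZE:
        return False
    f.seek(0)
    head = f.read(4)
    f.seek(size - 8)
    tail = f.read(8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    return (
        head == PARQUET_MAGIC
        and tail[4:] == PARQUET_MAGIC
        and footer_len <= size - MIN_SIZE
    )
```

A reader process could call this as `with open(path, "rb") as f: has_complete_footer(f)` (with `path` being whatever file it is about to load) and skip files whose writer has not finished yet.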

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Uwe L. Korn
+1, I would also like to see them in Sphinx. Uwe > On 19.12.2018 at 11:13, Antoine Pitrou wrote: > > > We should decide where we want to put developer docs. > > I would favour putting them in the Sphinx docs, personally. > > Regards > > Antoine. > > >> On 19/12/2018 at 02:20, Wes

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Antoine Pitrou
We should decide where we want to put developer docs. I would favour putting them in the Sphinx docs, personally. Regards Antoine. On 19/12/2018 at 02:20, Wes McKinney wrote: > Some projects have a REVIEWERS.md file > >

[jira] [Created] (ARROW-4076) [Python] schema validation and filters

2018-12-19 Thread George Sakkis (JIRA)
George Sakkis created ARROW-4076: Summary: [Python] schema validation and filters Key: ARROW-4076 URL: https://issues.apache.org/jira/browse/ARROW-4076 Project: Apache Arrow Issue Type: Bug