[DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

2019-05-16 Thread Wes McKinney
hi folks, In a prior mailing list thread from February [1] I brought up some work I'd done in C++ to create an API to define custom data types that can be embedded in built-in Arrow logical types. These are serialized through IPC by adding special fields to the `custom_metadata` member of Field
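A minimal sketch of the idea described above, assuming the extension type is recorded as two string entries in a field's key-value `custom_metadata`. The `Field` class and both helper functions are hypothetical stand-ins, not the Arrow API. The key names are an assumption too: they match the ones later documented in the Arrow columnar format, but the message above does not spell them out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumed key names; the message above only says "special fields".
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass
class Field:
    """Hypothetical stand-in for an Arrow Field (not the real class)."""
    name: str
    storage_type: str  # the built-in logical type that carries the data
    custom_metadata: Dict[str, str] = field(default_factory=dict)


def tag_extension(f: Field, ext_name: str, serialized: str) -> Field:
    """Return a copy of `f` whose metadata marks it as an extension type."""
    meta = dict(f.custom_metadata)  # copy so the input field is not mutated
    meta[EXT_NAME_KEY] = ext_name
    meta[EXT_META_KEY] = serialized
    return Field(f.name, f.storage_type, meta)


def read_extension(f: Field) -> Optional[Tuple[str, str]]:
    """Return (extension name, serialized parameters), or None if untagged."""
    ext_name = f.custom_metadata.get(EXT_NAME_KEY)
    if ext_name is None:
        return None
    return ext_name, f.custom_metadata.get(EXT_META_KEY, "")
```

A reader that does not recognize the extension name can ignore both entries and treat the column as its storage type, which is why the tag lives in metadata rather than in the type itself.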

[jira] [Created] (ARROW-5359) timestamp_as_object support for pa.Table.to_pandas in pyarrow

2019-05-16 Thread Joe Muruganandam (JIRA)
Joe Muruganandam created ARROW-5359: --- Summary: timestamp_as_object support for pa.Table.to_pandas in pyarrow Key: ARROW-5359 URL: https://issues.apache.org/jira/browse/ARROW-5359 Project: Apache

[jira] [Created] (ARROW-5358) [Rust] Implement equality check for ArrayData and Array

2019-05-16 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5358: --- Summary: [Rust] Implement equality check for ArrayData and Array Key: ARROW-5358 URL: https://issues.apache.org/jira/browse/ARROW-5358 Project: Apache Arrow Issue

[jira] [Created] (ARROW-5357) [Rust] change Buffer::len to represent total bytes instead of used bytes

2019-05-16 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5357: --- Summary: [Rust] change Buffer::len to represent total bytes instead of used bytes Key: ARROW-5357 URL: https://issues.apache.org/jira/browse/ARROW-5357 Project: Apache Arrow

[jira] [Created] (ARROW-5356) [JS] Implement Duration type, integration test support for Interval and Duration types

2019-05-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5356: --- Summary: [JS] Implement Duration type, integration test support for Interval and Duration types Key: ARROW-5356 URL: https://issues.apache.org/jira/browse/ARROW-5356

Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-05-16 Thread Wes McKinney
hi Joris, Somewhat related to this, I want to also point out that we have C++ extension types [1]. As part of this, it would also be good to define and document a public API for users to create ExtensionArray subclasses that can be serialized and deserialized using this machinery. As a

[jira] [Created] (ARROW-5355) [C++] DictionaryBuilder provides information to determine array builder type at run-time

2019-05-16 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5355: --- Summary: [C++] DictionaryBuilder provides information to determine array builder type at run-time Key: ARROW-5355 URL: https://issues.apache.org/jira/browse/ARROW-5355

Re: [DISCUSS] PR Backlog reduction

2019-05-16 Thread Wes McKinney
hi Micah, This sounds like a reasonable proposal, and I agree in particular for regular contributors that it makes sense to close PRs that are not close to being in merge-readiness to thin the noise of the patch queue. We have some short-term issues such as various reviewers being busy lately

Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Joris Van den Bossche
Missed the email of Wes, but yeah, I think we basically said the same. Answer to another question you raised in the notebook:
> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be

[DISCUSS] PR Backlog reduction

2019-05-16 Thread Micah Kornfield
Our backlog of open PRs is slowly creeping up. This isn't great because it allows contributions to slip through the cracks (which in turn possibly turns off new contributors). Perusing PRs I think things roughly fall into the following categories.
1. PRs are work in progress that never got

Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Joris Van den Bossche
Hi Rick, Thanks for exploring this! I am still quite new to Parquet myself, so the following might not be fully correct, but based on my current understanding, to enable projects like dask to write the different pieces of a Parquet dataset using pyarrow, we need the following functionalities: -

[jira] [Created] (ARROW-5354) [C++] allow Array to have null buffers when all elements are null

2019-05-16 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5354: Summary: [C++] allow Array to have null buffers when all elements are null Key: ARROW-5354 URL: https://issues.apache.org/jira/browse/ARROW-5354 Project:

Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Wes McKinney
hi Richard, We have been discussing this in https://issues.apache.org/jira/browse/ARROW-1983
All that is currently missing is (AFAICT):
* A C++ function to write a vector of FileMetaData as a _metadata file (make sure the file path is set in the metadata objects)
* A Python binding for this

[jira] [Created] (ARROW-5353) 0-row table can be written but not read

2019-05-16 Thread Thomas Buhrmann (JIRA)
Thomas Buhrmann created ARROW-5353: -- Summary: 0-row table can be written but not read Key: ARROW-5353 URL: https://issues.apache.org/jira/browse/ARROW-5353 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-5352) [Rust] BinaryArray filter loses replaces nulls with empty strings

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5352: - Summary: [Rust] BinaryArray filter loses replaces nulls with empty strings Key: ARROW-5352 URL: https://issues.apache.org/jira/browse/ARROW-5352 Project: Apache

[jira] [Created] (ARROW-5351) [Rust] Add support for take kernel functions

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5351: - Summary: [Rust] Add support for take kernel functions Key: ARROW-5351 URL: https://issues.apache.org/jira/browse/ARROW-5351 Project: Apache Arrow Issue

[jira] [Created] (ARROW-5350) [Rust] Support filtering on nested array types

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5350: - Summary: [Rust] Support filtering on nested array types Key: ARROW-5350 URL: https://issues.apache.org/jira/browse/ARROW-5350 Project: Apache Arrow Issue

Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Richard Zamora
Note that I was asked to post here after making a similar comment on GitHub (https://github.com/apache/arrow/pull/4236)… I am hoping to help improve the use of pyarrow.parquet within dask (https://github.com/dask/dask). To this end, I put together a simple notebook to explore how

[jira] [Created] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-16 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5349: Summary: [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData Key: ARROW-5349 URL: https://issues.apache.org/jira/browse/ARROW-5349

[jira] [Created] (ARROW-5348) [CI] [Java] Gandiva checkstyle failure

2019-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5348: - Summary: [CI] [Java] Gandiva checkstyle failure Key: ARROW-5348 URL: https://issues.apache.org/jira/browse/ARROW-5348 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-5347) [C++] Building fails on Windows with gtest symbol issue

2019-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5347: - Summary: [C++] Building fails on Windows with gtest symbol issue Key: ARROW-5347 URL: https://issues.apache.org/jira/browse/ARROW-5347 Project: Apache Arrow