[jira] [Created] (ARROW-9310) [Java] Use feature enum in Java
Micah Kornfield created ARROW-9310:

Summary: [Java] Use feature enum in Java
Key: ARROW-9310
URL: https://issues.apache.org/jira/browse/ARROW-9310
Project: Apache Arrow
Issue Type: Sub-task
Components: Java
Reporter: Micah Kornfield

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9311) [JavaScript] Use feature enum in JavaScript
Micah Kornfield created ARROW-9311:

Summary: [JavaScript] Use feature enum in JavaScript
Key: ARROW-9311
URL: https://issues.apache.org/jira/browse/ARROW-9311
Project: Apache Arrow
Issue Type: Sub-task
Components: JavaScript
Reporter: Micah Kornfield
[jira] [Created] (ARROW-9315) [Java] Fix the failure of testAllocationManagerType
Liya Fan created ARROW-9315:

Summary: [Java] Fix the failure of testAllocationManagerType
Key: ARROW-9315
URL: https://issues.apache.org/jira/browse/ARROW-9315
Project: Apache Arrow
Issue Type: Bug
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

The failure appears intermittently in CI builds.
[jira] [Created] (ARROW-9309) Start writing out feature enums to value (umbrella issue)
Micah Kornfield created ARROW-9309:

Summary: Start writing out feature enums to value (umbrella issue)
Key: ARROW-9309
URL: https://issues.apache.org/jira/browse/ARROW-9309
Project: Apache Arrow
Issue Type: Improvement
Reporter: Micah Kornfield

Proposed logic:
1. Add a flag, where appropriate, for "supports dictionary replacement" if there is a possibility it can be used.
2. Only add compressed buffers when requested.
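The proposed forward-compatibility check could be sketched in Python. The enum names and values below are illustrative assumptions, not the final schema.fbs definition:

```python
from enum import IntEnum

# Hypothetical sketch of the proposed Feature enum; names and values
# are assumptions for illustration only.
class Feature(IntEnum):
    UNUSED = 0
    DICTIONARY_REPLACEMENT = 1
    COMPRESSED_BODY = 2

def unsupported_features(declared, supported):
    """Return the features a stream declares that this reader lacks."""
    return [f for f in declared if f not in supported]

# A reader that handles dictionary replacement but not body compression
# can reject an incoming stream up front instead of failing mid-read.
reader = {Feature.DICTIONARY_REPLACEMENT}
stream = [Feature.DICTIONARY_REPLACEMENT, Feature.COMPRESSED_BODY]
missing = unsupported_features(stream, reader)
```

The idea is that each implementation compares the features a schema declares against its own capabilities before decoding any batches.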
[jira] [Created] (ARROW-9314) [Go] Use Feature enum
Micah Kornfield created ARROW-9314:

Summary: [Go] Use Feature enum
Key: ARROW-9314
URL: https://issues.apache.org/jira/browse/ARROW-9314
Project: Apache Arrow
Issue Type: Sub-task
Components: Go
Reporter: Micah Kornfield
[jira] [Created] (ARROW-9313) [Rust] Use feature enum
Micah Kornfield created ARROW-9313:

Summary: [Rust] Use feature enum
Key: ARROW-9313
URL: https://issues.apache.org/jira/browse/ARROW-9313
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust
Reporter: Micah Kornfield
[jira] [Created] (ARROW-9312) [C++] Use feature enum
Micah Kornfield created ARROW-9312:

Summary: [C++] Use feature enum
Key: ARROW-9312
URL: https://issues.apache.org/jira/browse/ARROW-9312
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Micah Kornfield
[jira] [Created] (ARROW-9308) Add Feature enum to schema.fbs for forward compatibility
Micah Kornfield created ARROW-9308:

Summary: Add Feature enum to schema.fbs for forward compatibility
Key: ARROW-9308
URL: https://issues.apache.org/jira/browse/ARROW-9308
Project: Apache Arrow
Issue Type: Improvement
Components: Format
Reporter: Micah Kornfield
Assignee: Micah Kornfield
[jira] [Created] (ARROW-9307) [Ruby] Add Arrow::RecordBatchIterator#to_a
Kouhei Sutou created ARROW-9307:

Summary: [Ruby] Add Arrow::RecordBatchIterator#to_a
Key: ARROW-9307
URL: https://issues.apache.org/jira/browse/ARROW-9307
Project: Apache Arrow
Issue Type: Improvement
Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-9306) [Ruby] Add support for Arrow::RecordBatch.new(raw_table)
Kouhei Sutou created ARROW-9306:

Summary: [Ruby] Add support for Arrow::RecordBatch.new(raw_table)
Key: ARROW-9306
URL: https://issues.apache.org/jira/browse/ARROW-9306
Project: Apache Arrow
Issue Type: Improvement
Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-9305) [Python] Dependency load failure in Windows wheel build
Wes McKinney created ARROW-9305:

Summary: [Python] Dependency load failure in Windows wheel build
Key: ARROW-9305
URL: https://issues.apache.org/jira/browse/ARROW-9305
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 1.0.0

The Windows wheels are experiencing a DLL load failure, probably due to one of the dependencies.
[jira] [Created] (ARROW-9304) [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull
Wes McKinney created ARROW-9304:

Summary: [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull
Key: ARROW-9304
URL: https://issues.apache.org/jira/browse/ARROW-9304
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Fix For: 1.0.0

StructBuilder should probably also add "UnsafeAppendNull" so that there is the option of using the Unsafe* operations on the children.
[jira] [Created] (ARROW-9303) Can't install R arrow on CentOS 7.6.1810
Nathan TeBlunthuis created ARROW-9303:

Summary: Can't install R arrow on CentOS 7.6.1810
Key: ARROW-9303
URL: https://issues.apache.org/jira/browse/ARROW-9303
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 0.17.1
Environment: CentOS 7.6.1810, R 4.0.2
Reporter: Nathan TeBlunthuis

I'm following the instructions here: https://arrow.apache.org/install/

arrow::install_arrow() gives this error:

{code}
./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or directory
{code}

This leaves me without a working arrow::read_feather.
[jira] [Created] (ARROW-9302) Specifying columns in a dataset drops the index (pandas) metadata.
Troy Zimmerman created ARROW-9302:

Summary: Specifying columns in a dataset drops the index (pandas) metadata.
Key: ARROW-9302
URL: https://issues.apache.org/jira/browse/ARROW-9302
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Troy Zimmerman

I'm not sure if this is a missing feature, or just undocumented, or perhaps not even something I should expect to work. Let's start with a multi-index dataframe.

{code}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df
               data  id                      when
letter number
a      1        0.0  a1  2020-05-05 08:30:01+00:00
b      2        1.1  b2  2020-05-05 08:30:01+00:00
       3        1.2  b3  2020-05-05 08:30:01+00:00
c      4        2.1  c4  2020-05-05 08:30:01+00:00
       5        2.2  c5  2020-05-05 08:30:01+00:00
       6        2.3  c6  2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
{code}

This of course works as expected, so let's write the table to disk and read it with a {{dataset}}.

{code}
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4  2020-05-05 08:30:01+00:00
       5        2.2  c5  2020-05-05 08:30:01+00:00
       6        2.3  c6  2020-05-05 08:30:01+00:00
{code}

The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.

{code}
>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
{code}

Hmm, not quite what I was thinking, but excluding the indices from the columns seems like a dumb move on my part, so let's try again, and this time include all columns to be safe.
{code}
>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4  2020-05-05 08:30:01+00:00
1      c       5   2.2  c5  2020-05-05 08:30:01+00:00
2      c       6   2.3  c6  2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
{code}

It seems that when I specify any or all columns, the schema metadata is lost along the way, so {{to_pandas}} doesn't reconstruct the dataframe to match the original.

Here are my relevant versions:
- arrow-cpp: 0.17.1
- pyarrow: 0.17.1
- parquet-cpp: 1.5.1
- python: 3.7.6
- thrift-cpp: 0.13.0
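Until the metadata propagation is fixed, one workaround sketch (assuming the caller knows which columns formed the original index) is to re-apply the index by hand after {{to_pandas}}:

```python
import pandas as pd

# Workaround sketch: the dataset column selection returns a flat frame,
# so re-apply the known index columns manually. The literal frame below
# stands in for the result of to_table(...).to_pandas() from the report.
flat = pd.DataFrame({
    "letter": ["c", "c", "c"],
    "number": [4, 5, 6],
    "data": [2.1, 2.2, 2.3],
    "id": ["c4", "c5", "c6"],
})
restored = flat.set_index(["letter", "number"])
```

This recovers the multi-index shape, though not any other pandas metadata that was stored in the original schema.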
[jira] [Created] (ARROW-9301) [R] Cannot open parquet files with binary arrays
Steve Jacobs created ARROW-9301:

Summary: [R] Cannot open parquet files with binary arrays
Key: ARROW-9301
URL: https://issues.apache.org/jira/browse/ARROW-9301
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 1.0.0
Environment: apache arrow 0.17.1
Reporter: Steve Jacobs

When trying to open a parquet file with a binary column, the following error is returned:

{code}
Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
  cannot handle Array of type binary
{code}
[jira] [Created] (ARROW-9300) [Java] Separate Netty Memory to its own module
Ryan Murray created ARROW-9300:

Summary: [Java] Separate Netty Memory to its own module
Key: ARROW-9300
URL: https://issues.apache.org/jira/browse/ARROW-9300
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray

Finish the work started in ARROW-8230.
[jira] [Created] (ARROW-9299) Expose ORC metadata() in Python ORCFile
Jeremy Dyer created ARROW-9299:

Summary: Expose ORC metadata() in Python ORCFile
Key: ARROW-9299
URL: https://issues.apache.org/jira/browse/ARROW-9299
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 0.17.1
Reporter: Jeremy Dyer

There is currently no way for a user to directly access the underlying ORC metadata of a given file. The C++ functions and objects already exist; the plumbing is just missing in the Cython/Python layer, along with potentially a few C++ shims. Giving users the ability to retrieve the metadata without first reading the entire file could help numerous applications increase their query performance by letting them intelligently determine which ORC stripes should be read. This would allow for something like:

{code:java}
import pyarrow as pa

orc_metadata = pa.orc.ORCFile(filename).metadata()
{code}
[GitHub] [arrow-testing] pitrou merged pull request #33: Add IPC fuzz regression files
pitrou merged pull request #33:
URL: https://github.com/apache/arrow-testing/pull/33

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-testing] pitrou opened a new pull request #33: Add IPC fuzz regression files
pitrou opened a new pull request #33:
URL: https://github.com/apache/arrow-testing/pull/33
[GitHub] [arrow-testing] pitrou merged pull request #32: Add IPC fuzz regression files
pitrou merged pull request #32:
URL: https://github.com/apache/arrow-testing/pull/32
[GitHub] [arrow-testing] pitrou opened a new pull request #32: Add IPC fuzz regression files
pitrou opened a new pull request #32:
URL: https://github.com/apache/arrow-testing/pull/32
[jira] [Created] (ARROW-9298) [C++] Fix crashes on invalid input (OSS-Fuzz)
Antoine Pitrou created ARROW-9298:

Summary: [C++] Fix crashes on invalid input (OSS-Fuzz)
Key: ARROW-9298
URL: https://issues.apache.org/jira/browse/ARROW-9298
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
Fix For: 1.0.0
[jira] [Created] (ARROW-9297) [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
Joris Van den Bossche created ARROW-9297:

Summary: [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
Key: ARROW-9297
URL: https://issues.apache.org/jira/browse/ARROW-9297
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

Related to ARROW-3762 (the parquet issue, which has been solved), and discovered in ARROW-9139. When creating a Parquet file with a large binary column (larger than BinaryArray capacity):

{code}
# code from the test_parquet.py::test_binary_array_overflow_to_chunked test
values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
table = pa.table({'byte_col': values})
pq.write_table(table, "test_large_binary.parquet")
{code}

then reading this with the parquet API works (fixed by ARROW-3762):

{code}
In [3]: pq.read_table("test_large_binary.parquet")
Out[3]:
pyarrow.Table
byte_col: binary
{code}

but with the Datasets API this still fails:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

In [4]: dataset.to_table()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
----> 1 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: This class cannot yet iterate chunked arrays
{code}
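The error stems from BinaryArray's 32-bit value offsets: a single array can hold at most about 2 GiB of value bytes, so larger columns come back as chunked arrays, which the scanner cannot yet iterate. A toy illustration of capacity-driven chunking (not Arrow's actual implementation):

```python
# Toy sketch: split binary values into chunks so that no chunk's total
# byte size exceeds the int32 offset capacity of a single BinaryArray.
INT32_MAX = 2**31 - 1

def chunk_values(values, capacity=INT32_MAX):
    chunks, current, size = [], [], 0
    for v in values:
        if size + len(v) > capacity and current:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks

# With a tiny capacity the split is easy to see: the first chunk holds
# two 2-byte values (4 bytes total), the third value starts a new chunk.
print([len(c) for c in chunk_values([b"xx", b"yy", b"zz"], capacity=4)])
```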
[jira] [Created] (ARROW-9296) [CI][Rust] Enable more clippy lint checks
Krisztian Szucs created ARROW-9296:

Summary: [CI][Rust] Enable more clippy lint checks
Key: ARROW-9296
URL: https://issues.apache.org/jira/browse/ARROW-9296
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, Rust
Reporter: Krisztian Szucs

Currently only {{clippy::redundant_field_names}} is allowed, so we should incrementally extend the list of enabled lints.
[jira] [Created] (ARROW-9295) [Archery] Support rust clippy in the lint command
Krisztian Szucs created ARROW-9295:

Summary: [Archery] Support rust clippy in the lint command
Key: ARROW-9295
URL: https://issues.apache.org/jira/browse/ARROW-9295
Project: Apache Arrow
Issue Type: Improvement
Components: Archery
Reporter: Krisztian Szucs
Fix For: 2.0.0

https://github.com/apache/arrow/pull/7501 introduces clippy support, which we should move to the main linting job.