Joris Van den Bossche created ARROW-9297: --------------------------------------------
Summary: [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
Key: ARROW-9297
URL: https://issues.apache.org/jira/browse/ARROW-9297
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

Related to ARROW-3762 (the parquet issue, which has been solved), and discovered in ARROW-9139.

When creating a Parquet file with a large binary column (larger than BinaryArray capacity):

{code}
# code from the test_parquet.py::test_binary_array_overflow_to_chunked test
values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
table = pa.table({'byte_col': values})
pq.write_table(table, "test_large_binary.parquet")
{code}

then reading this with the parquet API works (fixed by ARROW-3762):

{code}
In [3]: pq.read_table("test_large_binary.parquet")
Out[3]:
pyarrow.Table
byte_col: binary
{code}

but with the Datasets API this still fails:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

In [4]: dataset.to_table()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-4-6fb0d79c4511> in <module>
----> 1 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: This class cannot yet iterate chunked arrays
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)