Joris Van den Bossche created ARROW-9297: --------------------------------------------
Summary: [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
Key: ARROW-9297
URL: https://issues.apache.org/jira/browse/ARROW-9297
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

Related to ARROW-3762 (the parquet issue, which has been solved), and discovered in ARROW-9139.

When creating a Parquet file with a large binary column (larger than BinaryArray capacity):

{code}
# code from the test_parquet.py::test_binary_array_overflow_to_chunked test
values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
table = pa.table({'byte_col': values})
pq.write_table(table, "test_large_binary.parquet")
{code}

then reading this with the parquet API works (fixed by ARROW-3762):

{code}
In [3]: pq.read_table("test_large_binary.parquet")
Out[3]:
pyarrow.Table
byte_col: binary
{code}

but with the Datasets API this still fails:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

In [4]: dataset.to_table()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-4-6fb0d79c4511> in <module>
----> 1 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: This class cannot yet iterate chunked arrays
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)