yurikoomiga created ARROW-16642:
-----------------------------------

             Summary: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.
                 Key: ARROW-16642
                 URL: https://issues.apache.org/jira/browse/ARROW-16642
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 8.0.0
         Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0,
pyarrow 7.0.0, ubuntu 9.4.0, python3.8

            Reporter: yurikoomiga
         Attachments: test_std_02.py

Hi All

When I use Arrow to read a Parquet file as follows:
```
// Open the Parquet file and build a parquet::arrow::FileReader.
auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties),
    &_reader);
// Create a RecordBatchReader over one row group and the selected columns.
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
// Read the first batch.
status = _rb_batch->ReadNext(&_batch);
```

status is not OK and an error occurs like this:
`IOError: Corrupt snappy compressed data.`

When I comment out the statement `_reader->set_use_threads(true);`, the program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with `_reader->set_use_threads(true);`; reading a single column does not trigger it.

The test Parquet file is created by pyarrow. I use only one row group, and the group has 3,000,000 records.
The Parquet file has 20 columns, including int and string types.

You can create a test Parquet file using the attached Python script.
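
The attached script itself is not inlined here; a minimal pyarrow sketch that produces a similarly shaped file (one snappy-compressed row group of 3,000,000 rows with 20 columns of int and string types; all column names and values below are hypothetical, not taken from test_std_02.py) might look like:

```
import pyarrow as pa
import pyarrow.parquet as pq

N = 3_000_000  # rows in the single row group, as described above

# 10 int64 columns and 10 string columns; names/values are placeholders.
cols = {}
for i in range(10):
    cols[f"int_col_{i}"] = pa.array(range(N), type=pa.int64())
for i in range(10):
    cols[f"str_col_{i}"] = pa.array([f"value_{j % 1000}" for j in range(N)],
                                    type=pa.string())

table = pa.table(cols)
pq.write_table(
    table,
    "test.parquet",
    row_group_size=N,       # keep everything in one row group
    compression="snappy",   # snappy-compressed data pages
)
```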

Reading the file uses C++, arrow 7.0.0, snappy 1.1.8.

Writing the file uses python3.8, pyarrow 7.0.0.

Looking forward to your reply

Thank you!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)