If you are able to provide a file that reproduces the error, that
would also be very helpful (and we can open a Jira issue to track the
problem).

On Fri, Mar 5, 2021 at 10:19 PM Micah Kornfield <[email protected]> wrote:
>
> Hi Ruben,
> I'm not an expert here, but is it possible the CSV has newlines inside quotes 
> or some oddity?  There are a lot of configuration options for Read CSV and 
> you might want to validate that the defaults are at the most conservative 
> settings.
>
> -Micah
>
> On Fri, Mar 5, 2021 at 12:40 PM Ruben Laguna <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm getting "CSV parser got out of sync with chunker". Any idea how to
>> troubleshoot this?
>> If I feed the original file, it fails after 1477218 rows;
>> if I remove the first line after the header, it fails after 2919443 rows;
>> if I remove the first 2 lines after the header, it fails after 55339 rows;
>> if I remove the first 3 lines after the header, it fails after 8200437 rows;
>> if I remove the first 4 lines after the header, it fails after 1866573 rows.
>> To me it doesn't make sense; the failure shows up at different, seemingly
>> random places.
>>
>> What could be causing this? Source code below:
>>
>>
>>
>> Traceback (most recent call last):
>>   File "pa_inspect.py", line 15, in <module>
>>     for b in reader:
>>   File "pyarrow/ipc.pxi", line 497, in __iter__
>>   File "pyarrow/ipc.pxi", line 531, in 
>> pyarrow.lib.RecordBatchReader.read_next_batch
>>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
>> pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
>>
>>
>> import pyarrow as pa
>> from pyarrow import csv
>> import pyarrow.parquet as pq
>>
>> # 
>> http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
>> # 
>> http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
>> reader = csv.open_csv('inspect.csv')
>>
>>
>> # ParquetWriter : 
>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
>> # RecordBatch
>> # 
>> http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>> crow = 0
>> with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
>>     for b in reader:
>>         print(b.num_rows,b.num_columns)
>>         crow = crow + b.num_rows
>>         print(crow)
>>         writer.write_table(pa.Table.from_batches([b]))
>>
>> --
>> /Rubén
