If you are able to provide a file that reproduces the error, that would also be very helpful (and we can open a Jira issue to track the problem).
On Fri, Mar 5, 2021 at 10:19 PM Micah Kornfield <[email protected]> wrote:
>
> Hi Ruben,
> I'm not an expert here, but is it possible the CSV has newlines inside quotes
> or some other oddity? There are a lot of configuration options for Read CSV and
> you might want to validate that the defaults are at the most conservative
> settings.
>
> -Micah
>
> On Fri, Mar 5, 2021 at 12:40 PM Ruben Laguna <[email protected]> wrote:
>>
>> Hi,
>>
>> I'm getting "CSV parser got out of sync with chunker"; any idea how to
>> troubleshoot this?
>> If I feed the original file, it fails after 1477218 rows.
>> If I remove the first line after the header, it fails after 2919443 rows.
>> If I remove the first 2 lines after the header, it fails after 55339 rows.
>> If I remove the first 3 lines after the header, it fails after 8200437 rows.
>> If I remove the first 4 lines after the header, it fails after 1866573 rows.
>> It doesn't make sense to me; the failure shows up at different, seemingly
>> random places.
>>
>> What could be causing this? Source code below ->
>>
>> Traceback (most recent call last):
>>   File "pa_inspect.py", line 15, in <module>
>>     for b in reader:
>>   File "pyarrow/ipc.pxi", line 497, in __iter__
>>   File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
>>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
>> pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
>>
>> import pyarrow as pa
>> from pyarrow import csv
>> import pyarrow.parquet as pq
>>
>> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
>> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
>> reader = csv.open_csv('inspect.csv')
>>
>> # ParquetWriter: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
>> # RecordBat
>> # http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>> crow = 0
>> with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
>>     for b in reader:
>>         print(b.num_rows, b.num_columns)
>>         crow = crow + b.num_rows
>>         print(crow)
>>         writer.write_table(pa.Table.from_batches([b]))
>>
>> --
>> /Rubén
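
Following up on Micah's suggestion about newlines inside quoted fields: below is a minimal sketch of opening the same file with explicit, more conservative CSV options. The newlines_in_values flag and the block_size value are illustrative assumptions, not a confirmed fix for this particular file.

import pyarrow as pa
from pyarrow import csv
import pyarrow.parquet as pq

# Explicit options instead of the defaults. newlines_in_values=True tells the
# parser that quoted fields may contain line breaks; block_size controls how
# much input each chunk of the streaming reader covers (the value here is
# illustrative, not a recommendation).
read_options = csv.ReadOptions(block_size=1 << 20)
parse_options = csv.ParseOptions(newlines_in_values=True)

reader = csv.open_csv(
    'inspect.csv',
    read_options=read_options,
    parse_options=parse_options,
)

# Same streaming CSV -> Parquet loop as in the script quoted above.
with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
    for batch in reader:
        writer.write_table(pa.Table.from_batches([batch]))

If the error goes away with newlines_in_values=True, that would point to embedded newlines inside quoted values rather than a parser bug.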
