[ https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027457#comment-17027457 ]
Artem KOZHEVNIKOV edited comment on ARROW-7731 at 2/4/20 9:11 AM:
------------------------------------------------------------------

I found another edge case that is maybe linked to this (pyarrow=0.15.1):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

l1 = pa.array([list(range(100))] * 10**7, type=pa.list_(pa.int16()))
# if the chunks were concatenated, the list offsets would overflow int32
tt = pa.Table.from_pydict({'big': pa.chunked_array([l1] * 10)})
pq.write_table(tt, '/tmp/test.parquet')         # takes a while, but works
tt_reload = pq.read_table('/tmp/test.parquet')  # consumes a huge amount of memory before failing
{code}

{code:python}
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-7-bf871e1a4f57> in <module>
----> 1 tt_reload = pq.read_table('/tmp/test.parquet')

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size)
   1279                       buffer_size=buffer_size)
   1280     return pf.read(columns=columns, use_threads=use_threads,
-> 1281                    use_pandas_metadata=use_pandas_metadata)
   1282
   1283

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1135             table = piece.read(columns=columns, use_threads=use_threads,
   1136                                partitions=self.partitions,
-> 1137                                use_pandas_metadata=use_pandas_metadata)
   1138             tables.append(table)
   1139

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, partitions, file, use_pandas_metadata)
    603             table = reader.read_row_group(self.row_group, **options)
    604         else:
--> 605             table = reader.read(**options)
    606
    607         if len(self.partition_keys) > 0:

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    251             columns, use_pandas_metadata=use_pandas_metadata)
    252         return self.reader.read_all(column_indices=column_indices,
--> 253                                     use_threads=use_threads)
    254
    255     def scan_contents(self, columns=None, batch_size=65536):

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

/opt/conda/envs/model/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 0: Offset invariant failure: 21474837 inconsistent offset for non-null slot: -2147483596<2147483600
{code}

The thrown error is not explicit. I wonder whether the created Parquet file is actually valid (I have not yet tried to reload it with Spark), or whether it is only the pyarrow read path that cannot handle it.
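The negative offset in the message is consistent with int32 wraparound: each chunk's offsets fit individually, but the cumulative value count of the reassembled column exceeds 2^31 - 1 (2147483647). Below is a minimal sketch of a pre-write sanity check under that assumption; the helper name offsets_overflow_int32 is hypothetical, not a pyarrow API:

{code:python}
import pyarrow as pa

INT32_MAX = 2**31 - 1  # largest value a list<...> offset can represent

def offsets_overflow_int32(chunked_column):
    """Return True if concatenating the chunks of a list<...> column would
    push the cumulative value count past the int32 offset range."""
    total_values = 0
    for chunk in chunked_column.chunks:
        # each chunk's offsets restart at 0; concatenation would rebase them
        total_values += len(chunk.values)
        if total_values > INT32_MAX:
            return True
    return False

# scaled-down version of the column from the snippet above
l1 = pa.array([list(range(100))] * 10**4, type=pa.list_(pa.int16()))
col = pa.chunked_array([l1] * 10)
print(offsets_overflow_int32(col))  # False here; True for the 10**7-row original
{code}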
> [C++][Parquet] Support LargeListArray
> -------------------------------------
>
>                 Key: ARROW-7731
>                 URL: https://issues.apache.org/jira/browse/ARROW-7731
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: marc abboud
>            Priority: Major
>              Labels: parquet
>
> For now it is not possible to write a pyarrow.Table containing a LargeListArray to Parquet. The lines
> {code:java}
> from pyarrow import parquet
> import pyarrow as pa
>
> indices = [1, 2, 3]
> indptr = [0, 1, 2, 3]
> q = pa.lib.LargeListArray.from_arrays(indptr, indices)
> table = pa.Table.from_arrays([q], names=['no'])
> parquet.write_table(table, '/test'){code}
> yields the error
> {code:java}
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: large_list<item: int64>
> {code}
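Until the Parquet writer supports LargeListArray, one hedged workaround sketch is to downcast the column to a regular list before writing. This assumes the offsets fit in int32 and that the running pyarrow version implements the large_list -> list cast; the output path is illustrative only:

{code:python}
import pyarrow as pa
from pyarrow import parquet

indices = [1, 2, 3]
indptr = [0, 1, 2, 3]
q = pa.LargeListArray.from_arrays(indptr, indices)
table = pa.Table.from_arrays([q], names=['no'])

# downcast large_list<int64> -> list<int64> so the Parquet writer accepts it
narrow = table.cast(pa.schema([('no', pa.list_(pa.int64()))]))
parquet.write_table(narrow, '/tmp/no.parquet')
{code}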