[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908622#comment-16908622 ]

Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:39 AM:

Hi all, below is a simple piece of code that reproduces the issue using:
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0
{code}
The generated file is roughly 170 MB.
{code:java}
import pandas as pd
import numpy as np

pd.DataFrame(np.random.randint(0, 1, (1000, 10)),
             columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599)
{code}

> [Python][Parquet] Failure when reading Parquet file from S3
> -----------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> I am reading parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files, 90 MB each (approx. 3 GB total)
> Number of records: approx. 100M
> Code snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
>
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>     return self.read(use_pandas_metadata=True, **kwargs)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>     table = reader.read(**options)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: the same code works on a relatively smaller dataset (approx. < 50M records)*

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
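The failing path above streams the Parquet file through the s3fs file object with partial ranged reads. One way to sidestep that code path entirely (a sketch, not a fix from this thread) is to pull the whole object into memory first and hand pyarrow an in-memory buffer; `read_fully` and the example path are illustrative names, not part of the issue.

```python
import io


def read_fully(fs, path):
    """Read an entire remote object into one in-memory buffer.

    `fs` is any fsspec-style filesystem (s3fs.S3FileSystem, gcsfs, ...)
    whose open() returns a file-like object. Reading everything in a
    single call avoids the partial ranged reads where the
    "Unexpected end of stream" error surfaces.
    """
    with fs.open(path, "rb") as f:
        return io.BytesIO(f.read())


# Usage sketch (bucket/path is a placeholder):
#   import s3fs
#   import pyarrow.parquet as pq
#   buf = read_fully(s3fs.S3FileSystem(), "path/to/file.snappy.parquet")
#   table = pq.read_table(buf)
```

The trade-off is memory: each ~90 MB part file is held fully in RAM before parsing, which is acceptable for files of the sizes reported here.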
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907823#comment-16907823 ]

Wong Chung Hoi commented on ARROW-6058:
---------------------------------------

Hi all, FYI, I see the same issue on BOTH GCP (pandas.read_parquet with gcsfs) and AWS (pandas.read_parquet with s3fs). I have also tried running the same code on the same dataset in an older docker build with an older version of pyarrow, and it works. This is preventing us from using the latest pyarrow to handle big parquet files.
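The comment above reports that an older pyarrow works while 0.14.1 does not, though the working version is not named in the thread. A small guard like the following encodes only what the issue metadata states (Affects Versions: 0.14.1, Fix For: 0.15.0); the helper name is illustrative and any broader 0.14.x matching is an assumption.

```python
def pyarrow_version_affected(version):
    """Return True if `version` is in the pyarrow 0.14.x line this issue
    reports broken; the issue's Fix For field is 0.15.0.

    Assumes a plain "major.minor[.patch]" version string.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (0, 14)


# Usage sketch: refuse the S3 read path on an affected install, e.g.
#   import pyarrow
#   if pyarrow_version_affected(pyarrow.__version__):
#       raise RuntimeError("pyarrow hit ARROW-6058; upgrade to >= 0.15.0")
```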