[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wong Chung Hoi (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622 ]

Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:39 AM:


Hi all,

below is a simple piece of code that reproduces the issue, using the following versions:

 
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0 
{code}
 

The generated file is roughly 170 MB.

 
{code:java}
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 1, (1000, 10)),
                  columns=[str(i) for i in range(10)])
df.to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599)
{code}
 



> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: the same code works on smaller datasets (approx. < 50M records).*
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)



[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-14 Thread Wong Chung Hoi (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907823#comment-16907823 ]

Wong Chung Hoi commented on ARROW-6058:
---

Hi all,

FYI, I see the same issue on BOTH GCP (pandas.read_parquet with gcsfs) and AWS (pandas.read_parquet with s3fs).

I have also tried running the same code on the same dataset in an older Docker build with an older version of pyarrow, and it works.

This prevents us from using the latest pyarrow to handle big Parquet files.
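Until a fix lands, pinning known-good versions is the practical stopgap. For example, in requirements.txt (the thread does not state which older pyarrow release worked, so the exact pins below are an assumption for illustration — substitute whatever versions your older build used):

```text
# Last versions observed working in our older image (assumed; verify locally).
pyarrow==0.13.0
s3fs<0.3
```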
