[jira] [Comment Edited] (ARROW-434) Segfaults and encoding issues in Python Parquet reads

2016-12-22 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770300#comment-15770300
 ] 

Wes McKinney edited comment on ARROW-434 at 12/22/16 3:28 PM:
--

I will look into the taxi data issue if you can get me access to the file 
(sharing it via Dropbox or Google Drive is fine too). 

Where did "nation.dict.parquet" come from originally? I see it in jcrobak's 
GitHub repo, but I don't see it in github.com/parquet/parquet-compatibility. 


was (Author: wesmckinn):
I will look into the airlines issue if you can get me access to the file 
(sharing it via Dropbox or Google Drive is fine too). 

Where did "nation.dict.parquet" come from originally? I see it in jcrobak's 
GitHub repo, but I don't see it in github.com/parquet/parquet-compatibility. 

> Segfaults and encoding issues in Python Parquet reads
> -
>
> Key: ARROW-434
> URL: https://issues.apache.org/jira/browse/ARROW-434
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu, Python 3.5, installed pyarrow from conda-forge
> Reporter: Matthew Rocklin
> Assignee: Wes McKinney
> Priority: Minor
>  Labels: parquet, python
>
> I've conda installed pyarrow and am trying to read data from the 
> parquet-compatibility project.  I haven't explicitly built parquet-cpp or 
> anything and may or may not have old versions lying around, so please take 
> this issue with a grain of salt:
> {code:python}
> In [1]: import pyarrow.parquet
> In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
> ---------------------------------------------------------------------------
> ArrowException                            Traceback (most recent call last)
>  in ()
> ----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
>  in pyarrow.parquet.read_table 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx
>  in pyarrow.parquet.ParquetReader.read_all 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx
>  in pyarrow.error.check_status 
> (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()
> ArrowException: NotImplemented: list<: uint8>
> {code}
> Additionally I tried to read data from a Python file-like object pointing to 
> data on S3.  Let me know if you'd prefer a separate issue.
> {code:python}
> In [1]: import s3fs
> In [2]: fs = s3fs.S3FileSystem()
> In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')
> In [4]: f.read(100)
> Out[4]: 
> b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> Here is a more reproducible version:
> {code:python}
> In [1]: with open('nation.plain.parquet', 'rb') as f:
>    ...:     data = f.read()
>    ...: 
> In [2]: from io import BytesIO
> In [3]: f = BytesIO(data)
> In [4]: f.seek(0)
> Out[4]: 0
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> I was, however, pleased with the round-trip functionality within this 
> project.
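
Until reads from Python file-like objects are reliable, one possible 
workaround is to copy the object's bytes to a local temporary file and pass 
the resulting path to the reader instead. The sketch below uses only the 
standard library. spool_to_tempfile is a hypothetical helper name, not part of 
pyarrow or s3fs, and this has not been confirmed as a fix for the segfault 
described above.

```python
import shutil
import tempfile


def spool_to_tempfile(fileobj, suffix=".parquet"):
    """Copy a binary file-like object to a named temporary file.

    Hypothetical helper (not a pyarrow or s3fs API). Returns the path of
    the new file; the caller is responsible for deleting it afterwards.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so that a path-based reader can open it later.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name
```

The returned path could then be handed to pyarrow.parquet.read_table(path), 
at the cost of one extra copy of the data on local disk.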



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ARROW-434) Segfaults and encoding issues in Python Parquet reads

2016-12-22 Thread Matthew Rocklin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770242#comment-15770242
 ] 

Matthew Rocklin edited comment on ARROW-434 at 12/22/16 3:22 PM:
-

Cool, verified that it works on my end.  The taxi data on s3fs is still failing 
with an encoding error.  I've been having difficulty managing permissions on S3 
to make this publicly available (just ignorance on my part).  In the meantime, 
here's the status of the files in the "parquet compatibility project":

{code}
In [1]: import pyarrow.parquet

In [2]: from glob import glob

In [3]: filenames = sorted(glob('*.parquet'))

In [4]: filenames
Out[4]: 
['customer.impala.parquet',
 'foo.parquet',
 'gzip-nation.impala.parquet',
 'nation.dict.parquet',
 'nation.impala.parquet',
 'nation.plain.parquet',
 'snappy-nation.impala.parquet',
 'test-converted-type-null.parquet',
 'test-null-dictionary.parquet',
 'test-null.parquet',
 'test.parquet']

In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...: 
   ...: 
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet

In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
 in ()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx
 in pyarrow.parquet.read_table 
(/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx
 in pyarrow.parquet.ParquetReader.read_all 
(/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2275)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/error.pyx
 in pyarrow.error.check_status 
(/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/error.cxx:1197)()

ArrowException: IOError: Unexpected end of stream.
{code}
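
"Unexpected end of stream" can have several causes. One cheap check that does 
not involve pyarrow is whether a file still carries the 4-byte "PAR1" marker 
that the Parquet format places at both the start and the end of a file; a 
missing trailing marker would suggest a truncated copy. The sketch below is 
illustrative only: check_parquet_magic is a hypothetical helper, not part of 
pyarrow, and passing the check does not show that the pages inside the file 
are well formed.

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def check_parquet_magic(path):
    """Return (header_ok, footer_ok) for the file at `path`.

    Hypothetical diagnostic helper. Files too short to hold both
    markers are reported as (False, False).
    """
    size = os.path.getsize(path)
    if size < 2 * len(PARQUET_MAGIC):
        return (False, False)
    with open(path, "rb") as f:
        header = f.read(len(PARQUET_MAGIC))
        # Jump to the last four bytes of the file.
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        footer = f.read(len(PARQUET_MAGIC))
    return (header == PARQUET_MAGIC, footer == PARQUET_MAGIC)
```

Running this over the same filenames list as above would show whether 
nation.dict.parquet differs from the files that read successfully.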


was (Author: mrocklin):
Cool, verified that it works on my end.  The airlines data on s3fs is still 
failing with an encoding error.  I've been having difficulty managing 
permissions on S3 to make this publicly available (just ignorance on my part).  
In the meantime, here's the status of the files in the "parquet compatibility 
project":

{code}
In [1]: import pyarrow.parquet

In [2]: from glob import glob

In [3]: filenames = sorted(glob('*.parquet'))

In [4]: filenames
Out[4]: 
['customer.impala.parquet',
 'foo.parquet',
 'gzip-nation.impala.parquet',
 'nation.dict.parquet',
 'nation.impala.parquet',
 'nation.plain.parquet',
 'snappy-nation.impala.parquet',
 'test-converted-type-null.parquet',
 'test-null-dictionary.parquet',
 'test-null.parquet',
 'test.parquet']

In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...: 
   ...: 
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet

In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
 in ()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx
 in pyarrow.parquet.read_table 
(/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx
 in