[
https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770242#comment-15770242
]
Matthew Rocklin edited comment on ARROW-434 at 12/22/16 3:22 PM:
-
Cool, verified that it works on my end. The taxi data on s3fs is still failing
with an encoding error. I've been having difficulty managing permissions on S3
to make this publicly available (just ignorance on my part). In the meantime,
here's the status of the files in the "parquet compatibility project":
{code}
In [1]: import pyarrow.parquet
In [2]: from glob import glob
In [3]: filenames = sorted(glob('*.parquet'))
In [4]: filenames
Out[4]:
['customer.impala.parquet',
'foo.parquet',
'gzip-nation.impala.parquet',
'nation.dict.parquet',
'nation.impala.parquet',
'nation.plain.parquet',
'snappy-nation.impala.parquet',
'test-converted-type-null.parquet',
'test-null-dictionary.parquet',
'test-null.parquet',
'test.parquet']
In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...:
   ...:
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet
In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
in ()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2275)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/error.cxx:1197)()

ArrowException: IOError: Unexpected end of stream.
{code}
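For anyone who wants to rerun this check outside of IPython, here is a rough sketch of the same loop as a small function. The name `check_parquet_files` and the optional `read_table` argument are just my own illustration (not part of pyarrow); by default it falls back to `pyarrow.parquet.read_table`, the same call used in the session above.

```python
def check_parquet_files(filenames, read_table=None):
    """Try to read each file and return {filename: None, or an error string}.

    `read_table` is a hypothetical hook for swapping in another reader;
    when omitted, pyarrow.parquet.read_table is used, as in the session above.
    """
    if read_table is None:
        # Imported lazily so the function can be exercised with a stand-in reader.
        import pyarrow.parquet
        read_table = pyarrow.parquet.read_table

    results = {}
    for fn in filenames:
        try:
            read_table(fn)
        except Exception as e:
            # Record the failure and keep going with the remaining files.
            results[fn] = "{}: {}".format(type(e).__name__, e)
        else:
            results[fn] = None
    return results

# Example use (assumes the .parquet files are in the current directory):
#   from glob import glob
#   for fn, err in sorted(check_parquet_files(glob('*.parquet')).items()):
#       print('Succeeded on' if err is None else 'Failed on', fn, err or '')
```

Collecting the results in a dict rather than printing as it goes should make it easier to compare runs across pyarrow builds.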
was (Author: mrocklin):
Cool, verified that it works on my end. The airlines data on s3fs is still
failing with an encoding error. I've been having difficulty managing
permissions on S3 to make this publicly available (just ignorance on my part).
In the meantime, here's the status of the files in the "parquet compatibility
project":
{code}
In [1]: import pyarrow.parquet
In [2]: from glob import glob
In [3]: filenames = sorted(glob('*.parquet'))
In [4]: filenames
Out[4]:
['customer.impala.parquet',
'foo.parquet',
'gzip-nation.impala.parquet',
'nation.dict.parquet',
'nation.impala.parquet',
'nation.plain.parquet',
'snappy-nation.impala.parquet',
'test-converted-type-null.parquet',
'test-null-dictionary.parquet',
'test-null.parquet',
'test.parquet']
In [5]: for fn in filenames:
   ...:     try:
   ...:         t = pyarrow.parquet.read_table(fn)
   ...:     except Exception as e:
   ...:         print('Failed on', fn, e)
   ...:     else:
   ...:         print("Succeeded on", fn)
   ...:
   ...:
Succeeded on customer.impala.parquet
Succeeded on foo.parquet
Succeeded on gzip-nation.impala.parquet
Failed on nation.dict.parquet IOError: Unexpected end of stream.
Succeeded on nation.impala.parquet
Succeeded on nation.plain.parquet
Succeeded on snappy-nation.impala.parquet
Succeeded on test-converted-type-null.parquet
Succeeded on test-null-dictionary.parquet
Succeeded on test-null.parquet
Succeeded on test.parquet
In [6]: pyarrow.parquet.read_table('nation.dict.parquet')
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
in ()
----> 1 pyarrow.parquet.read_table('nation.dict.parquet')

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-268ffbeffb1cd0617e52d381d500a2d10f61124c/python/build/temp.linux-x86_64-3.5/parquet.cxx:2907)()

/home/mrocklin/Software/anaconda/envs/arrow-test/lib/python3.5/site-packages/pyarrow/parquet.pyx
in