[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows
Tom Augspurger created ARROW-8462:
-
Summary: Crash in lib.concat_tables on Windows
Key: ARROW-8462
URL: https://issues.apache.org/jira/browse/ARROW-8462
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger

This crashes for me with pyarrow 0.16 on my Windows VM:

{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])
print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug info on Windows, unfortunately. With `python -X faulthandler` I see:

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
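One general way to get a Python-level traceback out of a native crash like this, without remembering the `-X faulthandler` flag, is to enable the stdlib faulthandler at the top of the script (standard CPython, nothing Arrow-specific; shown here as a debugging sketch):

```python
import faulthandler

# Dump the Python traceback on fatal errors (segfaults, access violations);
# same effect as running with `python -X faulthandler`.
faulthandler.enable()

# ...then the repro from the report:
# import pyarrow as pa
# import pandas as pd
# t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
# pa.lib.concat_tables([t])
```

The pyarrow lines are commented out so the sketch runs anywhere; the crash itself needs the Windows VM described above.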
Re: AttributeError importing pyarrow 0.16.0
Thanks for linking to that. The Python there does seem problematic. Upgrading to TravisCI's "bionic" image (with Python 3.7.5 instead of 3.7.1) seems to have fixed it.

Tom

On Mon, Feb 10, 2020 at 1:34 PM Wes McKinney wrote:

> hi Tom,
>
> Looks like it could be https://bugs.python.org/issue32973, but I'm not
> sure. I wasn't able to reproduce locally. The Python version 3.7.1
> running in CI is also potentially suspicious.
>
> This class of error seems to have a lot of bug reports based on Google
> searches.
>
> Message isn't picklable, so we should probably fix that regardless:
>
> https://issues.apache.org/jira/browse/ARROW-7826
>
> - Wes
>
> On Mon, Feb 10, 2020 at 12:17 PM Tom Augspurger wrote:
> >
> > Hi all,
> >
> > I'm seeing a strange issue when importing pyarrow on the intake CI. I get an
> > exception saying
> >
> > AttributeError: type object 'pyarrow.lib.Message' has no attribute
> > '__reduce_cython__'
> >
> > The full traceback is:
> >
> > __ test_arrow_import ___
> >
> >     def test_arrow_import():
> > >       import pyarrow
> >
> > intake/cli/server/tests/test_server.py:32:
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > ../../../virtualenv/python3.7.1/lib/python3.7/site-packages/pyarrow/__init__.py:49: in
> >     from pyarrow.lib import cpu_count, set_cpu_count
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > >   ???
> > E   AttributeError: type object 'pyarrow.lib.Message' has no attribute
> > '__reduce_cython__'
> >
> > pyarrow/ipc.pxi:21: AttributeError
> >
> > _ TestServerV1Source.test_read_part_compressed _
> >
> > I'm unable to reproduce this locally, and was wondering if anyone else has
> > seen something similar.
> > Pyarrow was installed using pip / a wheel (
> > https://travis-ci.org/intake/intake/jobs/648523104#L311).
> >
> > A common cause of this error message is building with too old of a Cython.
> > While checking this, I noticed
> > that some of the files are generated with Cython 0.29.8, while others were
> > generated with 0.29.14.
> > I have no idea if this is a problem in general or if it's causing this
> > specific issue.
> >
> > ```
> > _hdfs.cpp:1:/* Generated by Cython 0.29.14 */
> > include/arrow/python/pyarrow_lib.h:20:/* Generated by Cython 0.29.8 */
> > include/arrow/python/pyarrow_api.h:21:/* Generated by Cython 0.29.8 */
> > _plasma.cpp:1:/* Generated by Cython 0.29.14 */
> > _fs.cpp:1:/* Generated by Cython 0.29.14 */
> > lib_api.h:1:/* Generated by Cython 0.29.14 */
> > gandiva.cpp:1:/* Generated by Cython 0.29.14 */
> > _json.cpp:1:/* Generated by Cython 0.29.14 */
> > _parquet.cpp:1:/* Generated by Cython 0.29.14 */
> > _csv.cpp:1:/* Generated by Cython 0.29.14 */
> > _compute.cpp:1:/* Generated by Cython 0.29.14 */
> > _dataset.cpp:1:/* Generated by Cython 0.29.14 */
> > _flight.cpp:1:/* Generated by Cython 0.29.14 */
> > lib.cpp:1:/* Generated by Cython 0.29.14 */
> > ```
> >
> > See https://travis-ci.org/intake/intake/jobs/648523104 for the full log.
> >
> > Thanks for any pointers!
AttributeError importing pyarrow 0.16.0
Hi all,

I'm seeing a strange issue when importing pyarrow on the intake CI. I get an exception saying

AttributeError: type object 'pyarrow.lib.Message' has no attribute '__reduce_cython__'

The full traceback is:

__ test_arrow_import ___

    def test_arrow_import():
>       import pyarrow

intake/cli/server/tests/test_server.py:32:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../virtualenv/python3.7.1/lib/python3.7/site-packages/pyarrow/__init__.py:49: in
    from pyarrow.lib import cpu_count, set_cpu_count
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   AttributeError: type object 'pyarrow.lib.Message' has no attribute '__reduce_cython__'

pyarrow/ipc.pxi:21: AttributeError

_ TestServerV1Source.test_read_part_compressed _

I'm unable to reproduce this locally, and was wondering if anyone else has seen something similar. Pyarrow was installed using pip / a wheel (https://travis-ci.org/intake/intake/jobs/648523104#L311).

A common cause of this error message is building with too old a Cython. While checking this, I noticed that some of the files are generated with Cython 0.29.8, while others were generated with 0.29.14. I have no idea if this is a problem in general or if it's causing this specific issue.
```
_hdfs.cpp:1:/* Generated by Cython 0.29.14 */
include/arrow/python/pyarrow_lib.h:20:/* Generated by Cython 0.29.8 */
include/arrow/python/pyarrow_api.h:21:/* Generated by Cython 0.29.8 */
_plasma.cpp:1:/* Generated by Cython 0.29.14 */
_fs.cpp:1:/* Generated by Cython 0.29.14 */
lib_api.h:1:/* Generated by Cython 0.29.14 */
gandiva.cpp:1:/* Generated by Cython 0.29.14 */
_json.cpp:1:/* Generated by Cython 0.29.14 */
_parquet.cpp:1:/* Generated by Cython 0.29.14 */
_csv.cpp:1:/* Generated by Cython 0.29.14 */
_compute.cpp:1:/* Generated by Cython 0.29.14 */
_dataset.cpp:1:/* Generated by Cython 0.29.14 */
_flight.cpp:1:/* Generated by Cython 0.29.14 */
lib.cpp:1:/* Generated by Cython 0.29.14 */
```

See https://travis-ci.org/intake/intake/jobs/648523104 for the full log.

Thanks for any pointers!
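As a sanity check for the mixed-version suspicion above, the version stamp Cython writes at the top of each generated file can be collected mechanically (my own sketch, not part of any Arrow tooling; the file extensions and the 200-byte header window are assumptions):

```python
import re
from pathlib import Path

CYTHON_RE = re.compile(r"Generated by Cython (\d+\.\d+\.\d+)")

def cython_versions(root):
    """Map each Cython-generated file under `root` to the version in its header."""
    versions = {}
    for path in Path(root).rglob("*"):
        if path.suffix not in {".c", ".cpp", ".h"} or not path.is_file():
            continue
        try:
            # The "/* Generated by Cython X.Y.Z */" comment sits in the first lines.
            head = path.read_text(errors="ignore")[:200]
        except OSError:
            continue
        m = CYTHON_RE.search(head)
        if m:
            versions[str(path)] = m.group(1)
    return versions
```

If `set(cython_versions("python/").values())` has more than one element, the build mixed Cython versions, like the listing above shows.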
Is FileSystem._isfilestore considered public?
Hi,

In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming from a hack to ensure compatibility with pyarrow. The details aren't too important. The core of the issue is that the pyarrow parquet writer makes a couple of checks of `FileSystem._isfilestore` via `_mkdir_if_not_exists`, e.g. in https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350.

Is it OK for my FileSystem subclass to override `_isfilestore`? Is it considered public?

Thanks,
Tom
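For concreteness, the kind of override in question looks like the following sketch. The base class here is a stand-in for pyarrow.filesystem.FileSystem (so the snippet runs without pyarrow), and `DaskFileSystem` is a hypothetical subclass name:

```python
class FileSystem:
    """Stand-in for pyarrow.filesystem.FileSystem (assumption: the real base
    defaults _isfilestore to False for non-local filesystems)."""

    def _isfilestore(self):
        # pyarrow's parquet writer consults this (via _mkdir_if_not_exists)
        # to decide whether to create parent directories before writing.
        return False


class DaskFileSystem(FileSystem):
    # Hypothetical subclass: report that this filesystem supports directory
    # creation, so _mkdir_if_not_exists takes the mkdir path.
    def _isfilestore(self):
        return True
```

Whether relying on the underscore-prefixed method is supported is exactly the question being asked here.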
[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec
Tom Augspurger created ARROW-7102:
-
Summary: Make filesystem wrappers compatible with fsspec
Key: ARROW-7102
URL: https://issues.apache.org/jira/browse/ARROW-7102
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Tom Augspurger

[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a common API for a variety of filesystem implementations. I'm proposing an FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec implementation.

Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific to s3fs: https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320. This implementation could be removed entirely once an FSSpecWrapper is done, or kept as an alias if it's part of the public API.

This is related to ARROW-3717, which requested a GCSFSWrapper for working with Google Cloud Storage.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
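A minimal sketch of the proposed wrapper, to make the idea concrete (this is not pyarrow code; the method surface shown is an assumption about what pyarrow's writers touch, and the fsspec calls used — `exists`, `makedirs`, `mkdir`, `open` — are part of fsspec's AbstractFileSystem interface):

```python
class FSSpecWrapper:
    """Sketch: adapt any fsspec-style filesystem to the small filesystem
    surface pyarrow's parquet writer uses. Exact API is an assumption."""

    def __init__(self, fs):
        self.fs = fs  # any fsspec AbstractFileSystem implementation

    def _isfilestore(self):
        # Report that directories can be created, so _mkdir_if_not_exists
        # actually creates them before writing.
        return True

    def exists(self, path):
        return self.fs.exists(path)

    def mkdir(self, path, create_parents=True):
        if create_parents:
            # fsspec spells recursive creation `makedirs`.
            self.fs.makedirs(path, exist_ok=True)
        else:
            self.fs.mkdir(path)

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode)
```

Because every fsspec implementation exposes this interface, one wrapper would cover s3fs, gcsfs, and the rest, which is what makes S3FSWrapper redundant.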
Re: Benchmarking dashboard proposal
I'll see if I can figure out why the benchmarks at https://pandas.pydata.org/speed/arrow/ aren't being updated this weekend.

On Fri, Jan 18, 2019 at 2:34 AM Uwe L. Korn wrote:

> Hello,
>
> note that we have (had?) the Python benchmarks continuously running and
> reported at https://pandas.pydata.org/speed/arrow/. Seems like this
> stopped in July 2018.
>
> Uwe
>
> On Fri, Jan 18, 2019, at 9:23 AM, Antoine Pitrou wrote:
> >
> > Hi Areg,
> >
> > That sounds like a good idea to me. Note our benchmarks are currently
> > scattered across the various implementations. The two that I know of:
> >
> > - the C++ benchmarks are standalone executables created using the Google
> >   Benchmark library, aptly named "*-benchmark" (or "*-benchmark.exe" on
> >   Windows)
> > - the Python benchmarks use the ASV utility:
> >   https://github.com/apache/arrow/blob/master/docs/source/python/benchmarks.rst
> >
> > There may be more in the other implementations.
> >
> > Regards
> >
> > Antoine.
> >
> > Le 18/01/2019 à 07:13, Melik-Adamyan, Areg a écrit :
> > > Hello,
> > >
> > > I want to restart/attach to the discussions for creating an Arrow benchmarking dashboard. I want to propose a performance benchmark run per commit to track the changes.
> > > The proposal includes building infrastructure for per-commit tracking comprising the following parts:
> > > - Hosted JetBrains for OSS https://teamcity.jetbrains.com/ as a build system
> > > - Agents running in the cloud, both VM/container (DigitalOcean, or others) and bare-metal (Packet.net/AWS), and on-premise (Nvidia boxes?)
> > > - JFrog Artifactory storage and management for OSS projects https://jfrog.com/open-source/#artifactory2
> > > - Codespeed as a frontend https://github.com/tobami/codespeed
> > >
> > > I am volunteering to build such a system (if needed, more Intel folks will be involved) so we can start tracking performance on various platforms and understand how changes affect it.
> > >
> > > Please, let me know your thoughts!
> > >
> > > Thanks,
> > > -Areg.
Re: Continuous benchmarking setup
Currently, there are 3 snowflakes :)

- Benchmark setup: https://github.com/TomAugspurger/asv-runner
  + Some setup to bootstrap a clean install with airflow, conda, asv, supervisor, etc. All the infrastructure around running the benchmarks.
  + Each project adds itself to the list of benchmarks, as in https://github.com/TomAugspurger/asv-runner/pull/3. Then things are re-deployed. Deployment requires ansible and an SSH key for the benchmark machine.
- Benchmark publishing: After running all the benchmarks, the results are collected and pushed to https://github.com/tomaugspurger/asv-collection
- Benchmark hosting: A cron job on the server hosting the pandas docs pulls https://github.com/tomaugspurger/asv-collection and serves them from the `/speed` directory.

There are many things that could be improved on here, but I personally won't have time in the near term. Happy to assist, though.

On Mon, Apr 23, 2018 at 10:15 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Tom -- is the publishing workflow for this documented someplace, or
> available in a GitHub repo? We want to make sure we don't accumulate
> any "snowflakes" in the development process.
>
> thanks!
> Wes
>
> On Fri, Apr 13, 2018 at 8:36 AM, Tom Augspurger
> <tom.augspurge...@gmail.com> wrote:
> > They are run daily and published to http://pandas.pydata.org/speed/
> >
> > From: Antoine Pitrou <anto...@python.org>
> > Sent: Friday, April 13, 2018 4:28:11 AM
> > To: dev@arrow.apache.org
> > Subject: Re: Continuous benchmarking setup
> >
> > Nice! Are the benchmark results published somewhere?
> >
> > Le 13/04/2018 à 02:50, Tom Augspurger a écrit :
> >> https://github.com/TomAugspurger/asv-runner/ is the setup for the
> >> projects currently running. Adding arrow to
> >> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml
> >> might work. I'll have to redeploy with the update.
> >>
> >> From: Wes McKinney <wesmck...@gmail.com>
> >> Sent: Thursday, April 12, 2018 7:24:20 PM
> >> To: dev@arrow.apache.org
> >> Subject: Re: Continuous benchmarking setup
> >>
> >> hi Antoine,
> >>
> >> I have a bare metal machine at home (affectionately known as the
> >> "pandabox") that's available via SSH that we've been using for
> >> continuous benchmarking for other projects. Arrow is welcome to use
> >> it. I can give you access to the machine if you would like. Hopefully,
> >> we can suitably the process of setting up a continuous benchmarking
> >> machine so that if we need to migrate to a new machine, it is not too
> >> much of a hardship to do so.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>> Hello
> >>>
> >>> With the following changes, it seems we might reach the point where
> >>> we're able to run the Python-based benchmark suite across multiple
> >>> commits (at least the ones not anterior to those changes):
> >>> https://github.com/apache/arrow/pull/1775
> >>>
> >>> To make this truly useful, we would need a dedicated host. Ideally a
> >>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> >>> If running virtualized, the VM should have dedicated physical CPU cores.
> >>>
> >>> That machine would run the benchmarks on a regular basis (perhaps once
> >>> per night) and publish the results in static HTML form somewhere.
> >>>
> >>> (note: nice to have in the future might be access to NVidia hardware,
> >>> but right now there are no CUDA benchmarks in the Python benchmarks)
> >>>
> >>> What should be the procedure here?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
Re: Continuous benchmarking setup
https://github.com/TomAugspurger/asv-runner/ is the setup for the projects currently running. Adding arrow to https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might work. I'll have to redeploy with the update.

From: Wes McKinney
Sent: Thursday, April 12, 2018 7:24:20 PM
To: dev@arrow.apache.org
Subject: Re: Continuous benchmarking setup

hi Antoine,

I have a bare metal machine at home (affectionately known as the "pandabox") that's available via SSH that we've been using for continuous benchmarking for other projects. Arrow is welcome to use it. I can give you access to the machine if you would like. Hopefully, we can suitably the process of setting up a continuous benchmarking machine so that if we need to migrate to a new machine, it is not too much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least the ones not anterior to those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host. Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: nice to have in the future might be access to NVidia hardware,
> but right now there are no CUDA benchmarks in the Python benchmarks)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals
Tom Augspurger created ARROW-1897:
-
Summary: Incorrect numpy_type for pandas metadata of Categoricals
Key: ARROW-1897
URL: https://issues.apache.org/jira/browse/ARROW-1897
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
Fix For: 0.9.0

If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a {{Categorical}} should be the storage type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the
> result of str(dtype) for the underlying NumPy array that holds the data. So
> for datetimetz this is datetime64[ns] and for categorical, it may be any of
> the supported integer categorical types.

So the 'numpy_type' field should be something like {{'int8'}} instead of {{'object'}}.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
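A sketch of what the spec implies, using pandas only (no pyarrow needed): the expected numpy_type is str(dtype) of the categorical's *codes* array, not of the categories themselves.

```python
import pandas as pd

# The codes are the physical storage of a categorical; with only two
# categories, pandas stores them as int8.
idx = pd.CategoricalIndex(['one', 'two'], name='idx')
expected_numpy_type = str(idx.codes.dtype)
print(expected_numpy_type)  # int8
```

So for the example in this report, the metadata's 'numpy_type' should read 'int8', matching the last sentence above.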
[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword
Tom Augspurger created ARROW-1593:
-
Summary: [PYTHON] serialize_pandas should pass through the preserve_index keyword
Key: ARROW-1593
URL: https://issues.apache.org/jira/browse/ARROW-1593
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Assignee: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

I'm doing some benchmarking of Arrow serialization for dask.distributed, to serialize dataframes. Overall things look good compared to the current implementation (using pickle). The biggest difference was pickle's ability to use pandas' RangeIndex to avoid serializing the entire Index of values when possible.

I suspect that a "range type" isn't in scope for Arrow, but in the meantime applications using Arrow could detect the {{RangeIndex}} and pass

{code:python}
pyarrow.serialize_pandas(df, preserve_index=False)
{code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
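The detection described in the last paragraph is small on the application side. A sketch (my own helper name; it assumes any RangeIndex is droppable, which glosses over non-default start/step values):

```python
import pandas as pd

def should_preserve_index(df):
    # A default RangeIndex (0..n-1) carries no information worth
    # serializing, so the caller can skip it.
    return not isinstance(df.index, pd.RangeIndex)

df = pd.DataFrame({"A": [1, 2]})
print(should_preserve_index(df))                 # False: default RangeIndex
print(should_preserve_index(df.set_index("A")))  # True: a real index

# then: pyarrow.serialize_pandas(df, preserve_index=should_preserve_index(df))
```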
[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name
Tom Augspurger created ARROW-1586:
-
Summary: [PYTHON] serialize_pandas roundtrip loses columns name
Key: ARROW-1586
URL: https://issues.apache.org/jira/browse/ARROW-1586
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

The serialize / deserialize roundtrip loses {{df.columns.name}}:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the pandas section of the Schema metadata.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
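Until the pandas section of the metadata carries it, one workaround is to stash the name around any roundtrip. A sketch (the helper name is mine; `roundtrip` stands in for any serialize/deserialize pair, e.g. `lambda d: pa.deserialize_pandas(pa.serialize_pandas(d))`):

```python
def roundtrip_keeping_columns_name(df, roundtrip):
    # df.columns.name is lost by the roundtrip, so save it
    # beforehand and reattach it to the result.
    name = df.columns.name
    out = roundtrip(df)
    out.columns.name = name
    return out
```

This only papers over the symptom; the real fix is storing the name in the Schema metadata as suggested above.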
[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns
Tom Augspurger created ARROW-1585:
-
Summary: serialize_pandas round trip fails on integer columns
Key: ARROW-1585
URL: https://issues.apache.org/jira/browse/ARROW-1585
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

This roundtrip fails, since the integer column labels are converted to strings during serialization and not converted back when deserializing:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({0: [1, 2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{Int64Index([0])}} for the columns.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
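A caller-side workaround sketch until this is fixed (helper name and heuristic are mine: it converts labels back only when every label is a numeric string, which would also rewrite columns that were genuinely named "0"):

```python
import pandas as pd

def restore_int_columns(df):
    # If the roundtrip stringified every column label, convert them
    # back to integers; otherwise leave the frame untouched.
    if len(df.columns) and all(str(c).isdigit() for c in df.columns):
        out = df.copy()
        out.columns = [int(c) for c in df.columns]
        return out
    return df

restored = restore_int_columns(pd.DataFrame({"0": [1, 2]}))
print(list(restored.columns))  # [0]
```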
[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length
Tom Augspurger created ARROW-1557:
-
Summary: pyarrow.Table.from_arrays doesn't validate names length
Key: ARROW-1557
URL: https://issues.apache.org/jira/browse/ARROW-1557
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor

pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and {{names}} match. I think this should raise a {{ValueError}}:

{code:python}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}

(This is my first time using JIRA, hopefully I didn't mess up too badly)

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
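The check being asked for is a one-liner. A sketch (this is not pyarrow's actual code, just the shape of the validation):

```python
def validate_from_arrays_args(arrays, names):
    # Reject mismatched lengths up front instead of silently dropping
    # the extra names, as the example above shows pyarrow 0.7.0 doing.
    if names is not None and len(names) != len(arrays):
        raise ValueError(
            f"got {len(arrays)} arrays but {len(names)} names"
        )

validate_from_arrays_args([[1, 2], [3, 4]], ['a', 'b'])  # ok, no exception
try:
    validate_from_arrays_args([[1, 2], [3, 4]], ['a', 'b', 'c'])
except ValueError as e:
    print(e)  # got 2 arrays but 3 names
```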