[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-14 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-8462:
-

 Summary: Crash in lib.concat_tables on Windows
 Key: ARROW-8462
 URL: https://issues.apache.org/jira/browse/ARROW-8462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger


This crashes for me with pyarrow 0.16 on my Windows VM:


{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
{code}

I installed pyarrow from conda-forge. I'm not really sure how to get more debug 
info on Windows, unfortunately. With `python -X faulthandler` I see

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}
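
For reference, the same traceback can also be produced by enabling the standard-library faulthandler at the top of the script, instead of passing {{-X faulthandler}} on the command line:

{code:python}
import faulthandler

# Dumps the Python stack when a fatal error (e.g. an access violation)
# kills the interpreter, which is how the traceback above was captured.
faulthandler.enable()
{code}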





Re: AttributeError importing pyarrow 0.16.0

2020-02-10 Thread Tom Augspurger
Thanks for linking to that. The Python there does seem problematic.
Upgrading to TravisCI's "bionic" image (with Python 3.7.5 instead of 3.7.1)
seems to have fixed it.

Tom

On Mon, Feb 10, 2020 at 1:34 PM Wes McKinney  wrote:

> hi Tom,
>
> Looks like it could be https://bugs.python.org/issue32973, but I'm not
> sure. I wasn't able to reproduce locally. The Python version 3.7.1
> running in CI is also potentially suspicious.
>
> This class of error seems to have a lot of bug reports based on Google
> searches
>
> Message isn't picklable so we should probably fix that regardless
>
> https://issues.apache.org/jira/browse/ARROW-7826
>
> - Wes
>
> On Mon, Feb 10, 2020 at 12:17 PM Tom Augspurger
>  wrote:
> >
> > Hi all,
> >
> > I'm seeing a strange issue when importing pyarrow on the intake CI. I
> get an
> > exception saying
> >
> > AttributeError: type object 'pyarrow.lib.Message' has no attribute
> > '__reduce_cython__'
> >
> > The full traceback is:
> >
> > __ test_arrow_import ___
> >
> > def test_arrow_import():
> >
> > >   import pyarrow
> >
> > intake/cli/server/tests/test_server.py:32:
> >
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >
> >
> > ../../../virtualenv/python3.7.1/lib/python3.7/site-packages/pyarrow/__init__.py:49:
> > in <module>
> >
> > from pyarrow.lib import cpu_count, set_cpu_count
> >
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >
> > >   ???
> >
> > E   AttributeError: type object 'pyarrow.lib.Message' has no attribute
> > '__reduce_cython__'
> >
> > pyarrow/ipc.pxi:21: AttributeError
> >
> > _ TestServerV1Source.test_read_part_compressed _
> >
> >
> > I'm unable to reproduce this locally, and was wondering if anyone else
> > has seen something similar.
> > Pyarrow was installed using pip / a wheel
> > (https://travis-ci.org/intake/intake/jobs/648523104#L311).
> >
> > A common cause of this error message is building with too old a version
> > of Cython. While checking this, I noticed that some of the files are
> > generated with Cython 0.29.8, while others were generated with 0.29.14.
> > I have no idea if this is a problem in general or if it's causing this
> > specific issue.
> >
> > ```
> > _hdfs.cpp:1:/* Generated by Cython 0.29.14 */
> > include/arrow/python/pyarrow_lib.h:20:/* Generated by Cython 0.29.8 */
> > include/arrow/python/pyarrow_api.h:21:/* Generated by Cython 0.29.8 */
> > _plasma.cpp:1:/* Generated by Cython 0.29.14 */
> > _fs.cpp:1:/* Generated by Cython 0.29.14 */
> > lib_api.h:1:/* Generated by Cython 0.29.14 */
> > gandiva.cpp:1:/* Generated by Cython 0.29.14 */
> > _json.cpp:1:/* Generated by Cython 0.29.14 */
> > _parquet.cpp:1:/* Generated by Cython 0.29.14 */
> > _csv.cpp:1:/* Generated by Cython 0.29.14 */
> > _compute.cpp:1:/* Generated by Cython 0.29.14 */
> > _dataset.cpp:1:/* Generated by Cython 0.29.14 */
> > _flight.cpp:1:/* Generated by Cython 0.29.14 */
> > lib.cpp:1:/* Generated by Cython 0.29.14 */
> > ```
> >
> > See https://travis-ci.org/intake/intake/jobs/648523104 for the full log.
> >
> >
> > Thanks for any pointers!
>


AttributeError importing pyarrow 0.16.0

2020-02-10 Thread Tom Augspurger
Hi all,

I'm seeing a strange issue when importing pyarrow on the intake CI. I get an
exception saying

AttributeError: type object 'pyarrow.lib.Message' has no attribute
'__reduce_cython__'

The full traceback is:

__ test_arrow_import ___

def test_arrow_import():

>   import pyarrow

intake/cli/server/tests/test_server.py:32:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

../../../virtualenv/python3.7.1/lib/python3.7/site-packages/pyarrow/__init__.py:49:
in <module>

from pyarrow.lib import cpu_count, set_cpu_count

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

E   AttributeError: type object 'pyarrow.lib.Message' has no attribute
'__reduce_cython__'

pyarrow/ipc.pxi:21: AttributeError

_ TestServerV1Source.test_read_part_compressed _


I'm unable to reproduce this locally, and was wondering if anyone else has
seen something similar.
Pyarrow was installed using pip / a wheel
(https://travis-ci.org/intake/intake/jobs/648523104#L311).

A common cause of this error message is building with too old a version of
Cython. While checking this, I noticed that some of the files are generated
with Cython 0.29.8, while others were generated with 0.29.14.
I have no idea if this is a problem in general or if it's causing this
specific issue.

```
_hdfs.cpp:1:/* Generated by Cython 0.29.14 */
include/arrow/python/pyarrow_lib.h:20:/* Generated by Cython 0.29.8 */
include/arrow/python/pyarrow_api.h:21:/* Generated by Cython 0.29.8 */
_plasma.cpp:1:/* Generated by Cython 0.29.14 */
_fs.cpp:1:/* Generated by Cython 0.29.14 */
lib_api.h:1:/* Generated by Cython 0.29.14 */
gandiva.cpp:1:/* Generated by Cython 0.29.14 */
_json.cpp:1:/* Generated by Cython 0.29.14 */
_parquet.cpp:1:/* Generated by Cython 0.29.14 */
_csv.cpp:1:/* Generated by Cython 0.29.14 */
_compute.cpp:1:/* Generated by Cython 0.29.14 */
_dataset.cpp:1:/* Generated by Cython 0.29.14 */
_flight.cpp:1:/* Generated by Cython 0.29.14 */
lib.cpp:1:/* Generated by Cython 0.29.14 */
```

See https://travis-ci.org/intake/intake/jobs/648523104 for the full log.


Thanks for any pointers!


Is FileSystem._isfilestore considered public?

2019-11-26 Thread Tom Augspurger
Hi,

In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming
from a hack to ensure compatibility with pyarrow. The details aren't too
important. The core of the issue is that the pyarrow parquet writer makes a
couple of checks for `FileSystem._isfilestore` via `_mkdir_if_not_exists`,
e.g. in
https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350.

Is it OK for my FileSystem subclass to override `_isfilestore`? Is it
considered public?
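
For concreteness, the override in question would look something like this (a sketch; `MyFileSystem` is a stand-in name for our actual subclass):

```
from pyarrow.filesystem import FileSystem

class MyFileSystem(FileSystem):
    # The parquet writer consults this (via _mkdir_if_not_exists) to decide
    # whether directories need to be created before writing.
    def _isfilestore(self):
        return True
```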

Thanks,

Tom


[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec

2019-11-08 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-7102:
-

 Summary: Make filesystem wrappers compatible with fsspec
 Key: ARROW-7102
 URL: https://issues.apache.org/jira/browse/ARROW-7102
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Tom Augspurger


[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a 
common API for a variety of filesystem implementations. I'm proposing an 
FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
implementation.

 

Right now, pyarrow has a pyarrow.filesystem.S3FSWrapper, which is specific to 
s3fs: 
[https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
This implementation could be removed entirely once an FSSpecWrapper is done, 
or kept as an alias if it's part of the public API.

 

This is related to ARROW-3717, which requested a GCSFSWrapper for working with 
Google Cloud Storage.
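
A rough sketch of the shape this could take (names are hypothetical, and a real wrapper would need to cover the whole FileSystem interface):

{code:python}
from pyarrow.filesystem import FileSystem


class FSSpecWrapper(FileSystem):
    """Adapt any fsspec.AbstractFileSystem to pyarrow's FileSystem API."""

    def __init__(self, fs):
        self.fs = fs  # e.g. gcsfs.GCSFileSystem() or s3fs.S3FileSystem()

    def isdir(self, path):
        return self.fs.isdir(path)

    def isfile(self, path):
        return self.fs.isfile(path)

    def mkdir(self, path, create_parents=True):
        if create_parents:
            self.fs.makedirs(path, exist_ok=True)
        else:
            self.fs.mkdir(path)

    def open(self, path, mode='rb'):
        return self.fs.open(path, mode=mode)
{code}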





Re: Benchmarking dashboard proposal

2019-01-18 Thread Tom Augspurger
This weekend I'll see if I can figure out why the benchmarks at
https://pandas.pydata.org/speed/arrow/ aren't being updated.

On Fri, Jan 18, 2019 at 2:34 AM Uwe L. Korn  wrote:

> Hello,
>
> note that we have (had?) the Python benchmarks continuously running and
> reported at https://pandas.pydata.org/speed/arrow/. Seems like this
> stopped in July 2018.
>
> Uwe
>
> On Fri, Jan 18, 2019, at 9:23 AM, Antoine Pitrou wrote:
> >
> > Hi Areg,
> >
> > That sounds like a good idea to me.  Note our benchmarks are currently
> > scattered across the various implementations.  The two that I know of:
> >
> > - the C++ benchmarks are standalone executables created using the Google
> > Benchmark library, aptly named "*-benchmark" (or "*-benchmark.exe" on
> > Windows)
> > - the Python benchmarks use the ASV utility:
> >
> https://github.com/apache/arrow/blob/master/docs/source/python/benchmarks.rst
> >
> > There may be more in the other implementations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 18/01/2019 à 07:13, Melik-Adamyan, Areg a écrit :
> > > Hello,
> > >
> > > I want to restart/attach to the discussion about creating an Arrow
> > > benchmarking dashboard. I want to propose a performance benchmark run per
> > > commit to track the changes.
> > > The proposal includes building infrastructure for per-commit tracking,
> > > comprising the following parts:
> > > - Hosted JetBrains TeamCity for OSS (https://teamcity.jetbrains.com/) as
> > >   a build system
> > > - Agents running in the cloud, both VM/container (DigitalOcean, or
> > >   others) and bare-metal (Packet.net/AWS), and on-premise (Nvidia boxes?)
> > > - JFrog Artifactory storage and management for OSS projects
> > >   (https://jfrog.com/open-source/#artifactory2)
> > > - Codespeed as a frontend (https://github.com/tobami/codespeed)
> > >
> > > I am volunteering to build such a system (if needed, more Intel folks
> > > will be involved) so we can start tracking performance on various
> > > platforms and understand how changes affect it.
> > >
> > > Please, let me know your thoughts!
> > >
> > > Thanks,
> > > -Areg.
> > >
> > >
> > >
>


Re: Continuous benchmarking setup

2018-04-23 Thread Tom Augspurger
Currently, there are 3 snowflakes :)

- Benchmark setup: https://github.com/TomAugspurger/asv-runner
  + Some setup to bootstrap a clean install with airflow, conda, asv,
supervisor, etc. All the infrastructure around running the benchmarks.
  + Each project adds itself to the list of benchmarks, as in
https://github.com/TomAugspurger/asv-runner/pull/3. Then things are
re-deployed. Deployment requires ansible and an SSH key for the benchmark
machine
- Benchmark publishing: After running all the benchmarks, the results are
collected and pushed to https://github.com/tomaugspurger/asv-collection
- Benchmark hosting: A cron job on the server hosting pandas docs pulls
https://github.com/tomaugspurger/asv-collection and serves them from the
`/speed` directory.

There are many things that could be improved on here, but I personally
won't have time in the near term. Happy to assist though.

On Mon, Apr 23, 2018 at 10:15 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Tom -- is the publishing workflow for this documented someplace, or
> available in a GitHub repo? We want to make sure we don't accumulate
> any "snowflakes" in the development process.
>
> thanks!
> Wes
>
> On Fri, Apr 13, 2018 at 8:36 AM, Tom Augspurger
> <tom.augspurge...@gmail.com> wrote:
> > They are run daily and published to http://pandas.pydata.org/speed/
> >
> >
> > 
> > From: Antoine Pitrou <anto...@python.org>
> > Sent: Friday, April 13, 2018 4:28:11 AM
> > To: dev@arrow.apache.org
> > Subject: Re: Continuous benchmarking setup
> >
> >
> > Nice! Are the benchmark results published somewhere?
> >
> >
> >
> > Le 13/04/2018 à 02:50, Tom Augspurger a écrit :
> >> https://github.com/TomAugspurger/asv-runner/ is the setup for the
> >> projects currently running. Adding arrow to
> >> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml
> >> might work. I'll have to redeploy with the update.
> >>
> >> 
> >> From: Wes McKinney <wesmck...@gmail.com>
> >> Sent: Thursday, April 12, 2018 7:24:20 PM
> >> To: dev@arrow.apache.org
> >> Subject: Re: Continuous benchmarking setup
> >>
> >> hi Antoine,
> >>
> >> I have a bare metal machine at home (affectionately known as the
> >> "pandabox") that's available via SSH that we've been using for
> >> continuous benchmarking for other projects. Arrow is welcome to use
> >> it. I can give you access to the machine if you would like. Hopefully,
> >> we can suitably document the process of setting up a continuous benchmarking
> >> machine so that if we need to migrate to a new machine, it is not too
> >> much of a hardship to do so.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou <anto...@python.org>
> wrote:
> >>>
> >>> Hello
> >>>
> >>> With the following changes, it seems we might reach the point where
> >>> we're able to run the Python-based benchmark suite across multiple
> >>> commits (at least the ones not anterior to those changes):
> >>> https://github.com/apache/arrow/pull/1775
> >>>
> >>> To make this truly useful, we would need a dedicated host.  Ideally a
> >>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> >>> If running virtualized, the VM should have dedicated physical CPU
> cores.
> >>>
> >>> That machine would run the benchmarks on a regular basis (perhaps once
> >>> per night) and publish the results in static HTML form somewhere.
> >>>
> >>> (note: nice to have in the future might be access to NVidia hardware,
> >>> but right now there are no CUDA benchmarks in the Python benchmarks)
> >>>
> >>> What should be the procedure here?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>
>


Re: Continuous benchmarking setup

2018-04-12 Thread Tom Augspurger
https://github.com/TomAugspurger/asv-runner/ is the setup for the projects 
currently running. Adding arrow to  
https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might 
work. I'll have to redeploy with the update.


From: Wes McKinney 
Sent: Thursday, April 12, 2018 7:24:20 PM
To: dev@arrow.apache.org
Subject: Re: Continuous benchmarking setup

hi Antoine,

I have a bare metal machine at home (affectionately known as the
"pandabox") that's available via SSH that we've been using for
continuous benchmarking for other projects. Arrow is welcome to use
it. I can give you access to the machine if you would like. Hopefully,
we can suitably document the process of setting up a continuous benchmarking
machine so that if we need to migrate to a new machine, it is not too
much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou  wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least the ones not anterior to those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host.  Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: nice to have in the future might be access to NVidia hardware,
> but right now there are no CUDA benchmarks in the Python benchmarks)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1897:
-

 Summary: Incorrect numpy_type for pandas metadata of Categoricals
 Key: ARROW-1897
 URL: https://issues.apache.org/jira/browse/ARROW-1897
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
 Fix For: 0.9.0


If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow just uses 'object' always.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
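
For reference, the expected value can be read straight off the categorical's codes (a minimal sketch using the index from the example above):

{code:python}
import pandas as pd

idx = pd.CategoricalIndex(['one', 'two'], name='idx')
# The physical storage type is the dtype of the codes array; pandas picks
# the smallest integer type that fits the categories (int8 here).
print(str(idx.codes.dtype))  # -> 'int8'
{code}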





[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword

2017-09-21 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1593:
-

 Summary: [PYTHON] serialize_pandas should pass through the 
preserve_index keyword
 Key: ARROW-1593
 URL: https://issues.apache.org/jira/browse/ARROW-1593
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Assignee: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


I'm doing some benchmarking of Arrow serialization for dask.distributed to 
serialize dataframes.

Overall things look good compared to the current implementation (using pickle). 
The biggest difference was pickle's ability to use pandas' RangeIndex to avoid 
serializing the entire Index of values when possible.

I suspect that a "range type" isn't in scope for arrow, but in the meantime 
applications using Arrow could detect the `RangeIndex` and pass 
{{pyarrow.serialize_pandas(df, preserve_index=False)}}.
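
A sketch of that application-side detection, assuming the {{preserve_index}} keyword this issue proposes:

{code:python}
import pandas as pd
import pyarrow as pa


def serialize(df):
    # A plain RangeIndex is cheap to reconstruct, so skip serializing it.
    preserve = not isinstance(df.index, pd.RangeIndex)
    return pa.serialize_pandas(df, preserve_index=preserve)
{code}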





[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1586:
-

 Summary: [PYTHON] serialize_pandas roundtrip loses columns name
 Key: ARROW-1586
 URL: https://issues.apache.org/jira/browse/ARROW-1586
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


The serialize / deserialize roundtrip loses {{ df.columns.name }}

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], 
name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the 
pandas section of the Schema metadata.
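
One way to see where the name is lost is to inspect the pandas metadata pyarrow embeds in the schema (a sketch; under 0.7.0 the columns name doesn't appear anywhere in it, so there is nothing for the deserializer to restore):

{code:python}
import json
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))
meta = json.loads(
    pa.Table.from_pandas(df).schema.metadata[b'pandas'].decode('utf-8'))
# Under pyarrow 0.7.0 this prints False: 'col_name' is absent from the
# metadata, which is why deserialize_pandas can't restore df.columns.name.
print('col_name' in json.dumps(meta))
{code}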





[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1585:
-

 Summary: serialize_pandas round trip fails on integer columns
 Key: ARROW-1585
 URL: https://issues.apache.org/jira/browse/ARROW-1585
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


This roundtrip fails, since the integer column name isn't converted back 
from a string after deserializing:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({0: [1,
2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{ Int64Index([0]) }} for the columns.





[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1557:
-

 Summary: pyarrow.Table.from_arrays doesn't validate names length
 Key: ARROW-1557
 URL: https://issues.apache.org/jira/browse/ARROW-1557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor


pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
{{names}} matches. I think this should raise with a {{ValueError}}:

{code:python}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a',
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}
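
The missing guard is small; a sketch of the check (not pyarrow's actual implementation):

{code:python}
def _check_names(arrays, names):
    # Table.from_arrays should reject mismatched inputs up front rather
    # than silently dropping the extra names, as Out[2] above shows it does.
    if names is not None and len(names) != len(arrays):
        raise ValueError("got {} names for {} arrays".format(
            len(names), len(arrays)))
{code}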

(This is my first time using JIRA, hopefully I didn't mess up too badly)


