[ https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Glasson updated ARROW-8888:
---------------------------------
    Description: 
When calling pa.Table.from_pandas(), the conversion is much slower on the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) than on the single-threaded path.

I have a simple example, but the time difference is much worse with a real table.

{code:java}
Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 10000000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
{code}

  was:
When calling pa.Table.from_pandas() the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) the conversion is much much slower.

I have a simple example - but the time difference is much worse with a real table.

Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 10000000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


> Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8888
>                 URL: https://issues.apache.org/jira/browse/ARROW-8888
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: macOS: 10.15.4 (also happening on Windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
>            Reporter: Kevin Glasson
>            Priority: Minor
>
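
For anyone who wants to see the effect without pyarrow installed, here is a minimal standard-library sketch of how a thread pool can cost more than it saves when there is only one cheap task to run. It is not pyarrow's implementation: `convert_column`, `convert_serial` and `convert_threaded` are hypothetical stand-ins, and creating a fresh pool on every call is an assumption made for the illustration. Whether this is what actually happens inside dataframe_to_arrays is not established by the report.

```python
import timeit
from concurrent.futures import ThreadPoolExecutor


def convert_column(values):
    # Hypothetical stand-in for a cheap per-column conversion.
    return len(values)


def convert_serial(columns):
    # Convert every column in the calling thread.
    return [convert_column(col) for col in columns]


def convert_threaded(columns, nthreads=4):
    # Assumption: a new pool is created per call, so the cost of starting
    # and joining worker threads is paid on every conversion.
    with ThreadPoolExecutor(nthreads) as executor:
        return list(executor.map(convert_column, columns))


# A single column, mirroring the one-column DataFrame in the report.
columns = [[0] * 1000]

serial_result = convert_serial(columns)
threaded_result = convert_threaded(columns)

runs = 200
serial_time = timeit.timeit(lambda: convert_serial(columns), number=runs)
threaded_time = timeit.timeit(lambda: convert_threaded(columns), number=runs)

print(f"serial:   {serial_time / runs * 1e6:.1f} µs per call")
print(f"threaded: {threaded_time / runs * 1e6:.1f} µs per call")
```

Both paths return the same result; only the per-call time differs, and the absolute numbers will vary by machine. With a single column there is nothing to parallelise, so any time spent on pool setup is pure overhead, which is consistent with nthreads=1 being faster in the timings above.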