[jira] [Created] (ARROW-11014) [Rust] [DataFusion] ParquetExec reports incorrect statistics

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11014:
--

 Summary: [Rust] [DataFusion] ParquetExec reports incorrect 
statistics
 Key: ARROW-11014
 URL: https://issues.apache.org/jira/browse/ARROW-11014
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


ParquetExec represents one or more Parquet files, but it currently returns 
statistics based only on the first file rather than aggregating statistics 
across all of the files.
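
For illustration only, a minimal Python sketch (field names hypothetical) of how 
per-file statistics could be combined so the reported statistics cover the whole 
file set:

{code:python}
# Combine per-file Parquet statistics into statistics for the whole file set.
# Row counts and byte sizes add up; per-column mins/maxes take the extremes.
def merge_statistics(per_file_stats):
    merged = {"num_rows": 0, "total_byte_size": 0, "column_min": None, "column_max": None}
    for stats in per_file_stats:
        merged["num_rows"] += stats["num_rows"]
        merged["total_byte_size"] += stats["total_byte_size"]
        merged["column_min"] = stats["column_min"] if merged["column_min"] is None \
            else min(merged["column_min"], stats["column_min"])
        merged["column_max"] = stats["column_max"] if merged["column_max"] is None \
            else max(merged["column_max"], stats["column_max"])
    return merged

files = [
    {"num_rows": 100, "total_byte_size": 4096, "column_min": 1, "column_max": 50},
    {"num_rows": 200, "total_byte_size": 8192, "column_min": 7, "column_max": 99},
]
print(merge_statistics(files))  # num_rows=300, total_byte_size=12288, min=1, max=99
{code}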



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11013) [Rust] CSV Reader cannot handle leading/trailing WhiteSpace

2020-12-22 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11013:
---

 Summary: [Rust] CSV Reader cannot handle leading/trailing 
WhiteSpace
 Key: ARROW-11013
 URL: https://issues.apache.org/jira/browse/ARROW-11013
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 2.0.0
Reporter: Mike Seddon


Currently the CSV Reader assumes very clean input data that does not contain 
leading or trailing whitespace. This means that data like the TPC-H 'answers' 
set from the databricks/tpch_dbgen repo (sample below) cannot be parsed.

Spark uses the Univocity parser library, which provides the options 
'ignoreLeadingWhitespace' and 'ignoreTrailingWhitespace'; equivalent options 
would help fix this issue.

```
l|l|sum_qty|sum_base_price|sum_disc_price|sum_charge|avg_qty|avg_price|avg_disc|count_order
A|F|37734107.00|56586554400.73|53758257134.87|55909065222.83|25.52|38273.13|0.05|   1478493
```
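
For reference, a minimal Python sketch of the desired trimming behaviour (data 
hypothetical); the standard library's csv module offers skipinitialspace, and an 
explicit strip covers trailing whitespace, roughly mirroring Spark's 
ignoreLeadingWhitespace/ignoreTrailingWhitespace:

```
import csv
import io

# Pipe-delimited row with padded fields, similar to the TPC-H answer files.
data = "A|F|37734107.00|   1478493 \n"

reader = csv.reader(io.StringIO(data), delimiter="|", skipinitialspace=True)
for row in reader:
    # skipinitialspace drops spaces right after the delimiter; strip() also
    # removes trailing whitespace, like ignoreTrailingWhitespace would.
    print([field.strip() for field in row])  # ['A', 'F', '37734107.00', '1478493']
```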



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11012) [Rust] [DataFusion] Make write_csv and write_parquet concurrent

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11012:
--

 Summary: [Rust] [DataFusion] Make write_csv and write_parquet 
concurrent
 Key: ARROW-11012
 URL: https://issues.apache.org/jira/browse/ARROW-11012
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ExecutionContext.write_csv and write_parquet currently iterate over the output 
partitions, executing them one at a time and writing the results out. We should 
run these as tokio tasks so they can run concurrently. This should, in theory, 
help with memory usage when the plan contains repartition operators.

We may want to add a configuration option so that we can choose between serial 
and parallel writes.
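
The intended pattern, sketched here in Python with asyncio purely for 
illustration (function names hypothetical; the actual implementation would spawn 
tokio tasks in Rust):

{code:python}
import asyncio

async def write_partition(partition_id: int) -> None:
    # Stand-in for executing one output partition and writing its results.
    await asyncio.sleep(0.1)
    print(f"partition {partition_id} written")

async def write_all(num_partitions: int, concurrent: bool = True) -> None:
    if concurrent:
        # One task per partition, all running concurrently.
        await asyncio.gather(*(write_partition(p) for p in range(num_partitions)))
    else:
        # Serial fallback, matching the current behaviour.
        for p in range(num_partitions):
            await write_partition(p)

asyncio.run(write_all(4))
{code}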



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11011) [Rust] [DataFusion] Implement hash partitioning

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11011:
--

 Summary: [Rust] [DataFusion] Implement hash partitioning
 Key: ARROW-11011
 URL: https://issues.apache.org/jira/browse/ARROW-11011
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Once https://issues.apache.org/jira/browse/ARROW-10582 is implemented, we 
should add support for hash partitioning. The logical and physical plans already 
support creating a plan with hash partitioning, but the execution still needs to 
be implemented in repartition.rs.
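
For illustration only, a minimal Python sketch of the core idea (the real work 
would hash Arrow arrays in Rust inside repartition.rs):

{code:python}
# Route each row to an output partition based on the hash of its key.
def hash_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"id": i, "value": i * 10} for i in range(8)]
for i, part in enumerate(hash_partition(rows, "id", 3)):
    print(i, [r["id"] for r in part])
{code}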



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11010) [Python] `np.float` deprecation warning in `_pandas_logical_type_map`

2020-12-22 Thread Tim Swast (Jira)
Tim Swast created ARROW-11010:
-

 Summary: [Python] `np.float` deprecation warning in 
`_pandas_logical_type_map`
 Key: ARROW-11010
 URL: https://issues.apache.org/jira/browse/ARROW-11010
 Project: Apache Arrow
  Issue Type: Task
Reporter: Tim Swast


I get the following warning when converting a floating point column in a pandas 
dataframe into a pyarrow array:

 

```
/Users/swast/src/python-bigquery/.nox/prerelease_deps/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1031: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. Use `float` by itself, which is identical in behavior, to silence this warning. If you specifically wanted the numpy scalar type, use `np.float_` here.
  'floating': np.float,
```
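
A hedged sketch of the kind of one-line change that silences the warning (only 
the relevant map entry is shown; using the builtin float would behave the same):

```
import numpy as np

# `np.float` is a deprecated alias for the builtin `float`; a concrete dtype
# (or the builtin itself) avoids the DeprecationWarning with the same behaviour.
_pandas_logical_type_map_fragment = {
    'floating': np.float64,  # previously: np.float
}
```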



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11009) [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc

2020-12-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-11009:


 Summary: [Python] Add environment variable to elect default usage 
of system memory allocator instead of jemalloc/mimalloc
 Key: ARROW-11009
 URL: https://issues.apache.org/jira/browse/ARROW-11009
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 3.0.0


We routinely get reports like ARROW-11007 where there is suspicion of a memory 
leak (which may or may not be valid). Having an easy way, requiring no changes 
to application code, to switch away from the non-system memory allocator 
(jemalloc/mimalloc) would help determine whether the observed memory usage 
patterns are the result of the allocator being used.
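
For context, the switch is already possible from application code; a minimal 
sketch (the environment variable name below is hypothetical, just to illustrate 
the proposed no-code-change toggle):

{code:python}
import os
import pyarrow as pa

# What users can already do today, but only by editing application code:
pa.set_memory_pool(pa.system_memory_pool())

# The proposal is roughly an opt-in like this, handled inside Arrow itself
# (environment variable name hypothetical):
if os.environ.get("ARROW_USE_SYSTEM_MEMORY_ALLOCATOR") == "1":
    pa.set_memory_pool(pa.system_memory_pool())
{code}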



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11008) [Rust][DataFusion] Simplify count accumulator

2020-12-22 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-11008:


 Summary: [Rust][DataFusion] Simplify count accumulator
 Key: ARROW-11008
 URL: https://issues.apache.org/jira/browse/ARROW-11008
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

2020-12-22 Thread Michael Peleshenko (Jira)
Michael Peleshenko created ARROW-11007:
--

 Summary: [Python] Memory leak in pq.read_table and table.to_pandas
 Key: ARROW-11007
 URL: https://issues.apache.org/jira/browse/ARROW-11007
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: Michael Peleshenko


While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we 
observed a memory leak in the read_table and to_pandas methods. See below for 
sample code to reproduce it. Memory does not seem to be returned after deleting 
the table and df as it was in pyarrow 0.12.1.

*Sample Code*
{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 200
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)

    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}
*Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    161.7 MiB    161.7 MiB           1   @profile
    10                                         def read_file(f):
    11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB           1       del table
    14    256.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    256.3 MiB    256.3 MiB           1   @profile
    10                                         def read_file(f):
    11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB           1       del table
    14    320.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    320.3 MiB    320.3 MiB           1   @profile
    10                                         def read_file(f):
    11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB           1       del table
    14    359.8 MiB     -1.9 MiB           1       del df
{code}
*Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    138.4 MiB    138.4 MiB           1   @profile
    10                                         def read_file(f):
    11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    139.3 MiB    139.3 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB           1       del table
    14    139.1 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents

     9    139.1 MiB    139.1 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df =

[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

2020-12-22 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-11006:


 Summary: [Python] Array to_numpy slow compared to Numpy.view
 Key: ARROW-11006
 URL: https://issues.apache.org/jira/browse/ARROW-11006
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Paul Balanca
Assignee: Paul Balanca


The method `to_numpy` is quite slow compared to Numpy slicing and viewing 
performance. For instance:
{code:java}
N = 100
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The previous benchmark is clearly an extreme case, but the idea is that for any 
operation not available in PyArrow, falling back on Numpy is a good option, and 
the cost of extraction should be as small as possible (there are scenarios where 
you cannot easily cache this view, so you end up calling `to_numpy` a fair 
number of times).

I believe part of this overhead is probably due to PyArrow implementing a very 
generic Pandas conversion and using it even for very simple Numpy-like dense 
arrays.
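
As a point of comparison, a hedged sketch of a lower-level zero-copy view that 
works for a primitive array with no nulls and zero offset (not a general 
replacement for `to_numpy`):

{code:python}
import numpy as np
import pyarrow as pa

np_arr = np.arange(100, dtype=np.int64)
pa_arr = pa.array(np_arr)

# For a primitive array with no nulls and zero offset, the second buffer holds
# the raw values; numpy can view it without copying.
data_buffer = pa_arr.buffers()[1]
view = np.frombuffer(data_buffer, dtype=np.int64)[:len(pa_arr)]
assert view[5] == 5
{code}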



--
This message was sent by Atlassian Jira
(v8.3.4#803005)