[jira] [Updated] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers

2020-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10304:
---
Labels: pull-request-available  (was: )

> [C++][Compute] Optimize variance kernel for integers
> 
>
> Key: ARROW-10304
> URL: https://issues.apache.org/jira/browse/ARROW-10304
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current variance kernel converts all data types to `double` before 
> calculation. That is sub-optimal for integers: integer arithmetic is much 
> faster than floating point, e.g., summation is 4x faster [1].
> A quick test calculating int32 variance shows up to a 3x performance gain. 
> Another benefit is that integer arithmetic is exact.
> [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ
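
A minimal sketch of the idea in Python/NumPy (illustrative only, not the
actual C++ kernel): accumulate exact integer sums and defer floating-point
math to the final division.

```python
import numpy as np

def int32_variance(values: np.ndarray, ddof: int = 0) -> float:
    """Variance of an int32 array via exact integer accumulation."""
    n = len(values)
    v = values.astype(np.int64)   # widen so the running sums stay exact
    s = int(v.sum())              # exact integer sum
    ss = int((v * v).sum())       # exact integer sum of squares
    # Only this final step uses floating point:
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof)
    # (a real kernel would chunk to avoid int64 overflow on long arrays)
    return (ss - s * s / n) / (n - ddof)

a = np.array([1, 2, 3, 4], dtype=np.int32)
assert int32_variance(a) == np.var(a)  # both give 1.25
```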



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhargav Parsi updated ARROW-10309:
--
Attachment: error2.txt

> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: error2.txt, image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: https://arrow.apache.org/install/ 
> (the steps for CentOS 6/7).
> After that I ran `gem install red-arrow`.
> That gives:
> !image-2020-10-14-14-51-27-796.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214297#comment-17214297
 ] 

Bhargav Parsi commented on ARROW-10309:
---

I believe the problem is `--ruby=/usr/bin/ruby`: on our system that is Ruby 
2.0.0, but the default rvm version is 2.6.3, which lives at a different path, 
`/usr/local/rvm/rubies/ruby-2.6.3/bin/ruby`.

> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: https://arrow.apache.org/install/ 
> (the steps for CentOS 6/7).
> After that I ran `gem install red-arrow`.
> That gives:
> !image-2020-10-14-14-51-27-796.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhargav Parsi updated ARROW-10309:
--
Attachment: image-2020-10-14-14-51-27-796.png

> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: https://arrow.apache.org/install/ 
> (the steps for CentOS 6/7).
> After that I ran `gem install red-arrow`.
> That gives:
> ```
> Building native extensions.  This could take a while...
> Building native extensions.  This could take a while...
> ERROR:  Error installing red-arrow:
>     ERROR: Failed to build gem native extension.
>
>     /usr/bin/ruby extconf.rb
> checking --enable-debug-build option... no
> checking C++ compiler... g++
> checking g++ version... 4.8 (gnu++11)
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
>
> Provided configuration options:
>     --with-opt-dir
>     --without-opt-dir
>     --with-opt-include
>     --without-opt-include=${opt-dir}/include
>     --with-opt-lib
>     --without-opt-lib=${opt-dir}/lib64
>     --with-make-prog
>     --without-make-prog
>     --srcdir=.
>     --curdir
>     --ruby=/usr/bin/ruby
>     --enable-debug-build
>     --disable-debug-build
> /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
>     from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
>     from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
>     from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
>     from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
>     from extconf.rb:6:in `<main>'
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhargav Parsi updated ARROW-10309:
--
Description: 
I am trying to install red-arrow on CentOS 
(centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.

I followed the steps mentioned here: https://arrow.apache.org/install/ 
(the steps for CentOS 6/7).

After that I ran `gem install red-arrow`.

That gives:
!image-2020-10-14-14-51-27-796.png!

  was:
I am trying to install red-arrow on CentOS 
(centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.

I followed the steps mentioned here: https://arrow.apache.org/install/ 
(the steps for CentOS 6/7).

After that I ran `gem install red-arrow`.

That gives:
```
Building native extensions.  This could take a while...
Building native extensions.  This could take a while...
ERROR:  Error installing red-arrow:
    ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib64
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/usr/bin/ruby
    --enable-debug-build
    --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
    from extconf.rb:6:in `<main>'
```


> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: https://arrow.apache.org/install/ 
> (the steps for CentOS 6/7).
> After that I ran `gem install red-arrow`.
> That gives:
> !image-2020-10-14-14-51-27-796.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhargav Parsi updated ARROW-10309:
--
Description: 
I am trying to install red-arrow on CentOS 
(centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.

I followed the steps mentioned here: https://arrow.apache.org/install/ 
(the steps for CentOS 6/7).

After that I ran `gem install red-arrow`.

That gives:
```
Building native extensions.  This could take a while...
Building native extensions.  This could take a while...
ERROR:  Error installing red-arrow:
    ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib64
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/usr/bin/ruby
    --enable-debug-build
    --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
    from extconf.rb:6:in `<main>'
```

  was:
I am trying to install red-arrow on CentOS 
(centos-release-7-6.1810.2.el7.centos.x86_64).

I followed the steps mentioned here: https://arrow.apache.org/install/ 
(the steps for CentOS 6/7).

After that I ran `gem install red-arrow`.

That gives:
```
Building native extensions.  This could take a while...
Building native extensions.  This could take a while...
ERROR:  Error installing red-arrow:
    ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib64
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/usr/bin/ruby
    --enable-debug-build
    --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
    from extconf.rb:6:in `<main>'
```


> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: https://arrow.apache.org/install/ 
> (the steps for CentOS 6/7).
> After that I ran `gem install red-arrow`.
> That gives:
> ```
> Building native extensions.  This could take a while...
> Building native extensions.  This could take a while...
> ERROR:  Error installing red-arrow:
>     ERROR: Failed to build gem native extension.
>
>     /usr/bin/ruby extconf.rb
> checking --enable-debug-build option... no
> checking C++ compiler... g++
> checking g++ version... 4.8 (gnu++11)
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
>
> Provided configuration options:
>     --with-opt-dir
>     --without-opt-dir
>     --with-opt-include
>     --without-opt-include=${opt-dir}/include
>     --with-opt-lib

[jira] [Created] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-14 Thread Bhargav Parsi (Jira)
Bhargav Parsi created ARROW-10309:
-

 Summary: [Ruby] gem install red-arrow fails
 Key: ARROW-10309
 URL: https://issues.apache.org/jira/browse/ARROW-10309
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
Reporter: Bhargav Parsi


I am trying to install red-arrow on CentOS 
(centos-release-7-6.1810.2.el7.centos.x86_64).

I followed the steps mentioned here: https://arrow.apache.org/install/ 
(the steps for CentOS 6/7).

After that I ran `gem install red-arrow`.

That gives:
```
Building native extensions.  This could take a while...
Building native extensions.  This could take a while...
ERROR:  Error installing red-arrow:
    ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib64
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/usr/bin/ruby
    --enable-debug-build
    --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
    from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
    from extconf.rb:6:in `<main>'
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214287#comment-17214287
 ] 

Antoine Pitrou commented on ARROW-10308:


Also, if you're interested in only some of the columns, you can reduce the 
processing time using {{ConvertOptions.include_columns}}: 
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html

But really, consider using Parquet if you can. It's a highly optimized binary 
format.
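
A minimal usage sketch (hypothetical file and column names):

```python
from pyarrow import csv

# Only the listed columns are converted; the rest are skipped entirely.
opts = csv.ConvertOptions(include_columns=["id", "price"])
table = csv.read_csv("data.csv", convert_options=opts)
```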

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214276#comment-17214276
 ] 

Antoine Pitrou commented on ARROW-10308:


Processing a CSV file can be costly. On a 12-core 24-thread machine with a 64 
MiB block size, I get around 1.5 GiB/s.

Profiling at the C++ level, it seems that the main bottlenecks are:
 * CSV parsing itself (finding boundaries, escape characters etc.): 22% of 
total CPU time
 * Building up double arrays (most of which is converting from string to 
double): 53% of total CPU time
 * Building up string arrays: 19% of total CPU time

If you're generating the data yourself (as opposed to getting it from a third 
party), I would really recommend using Parquet rather than CSV.
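
For example, a one-time conversion sketch (hypothetical paths); later reads 
then skip CSV parsing entirely:

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Pay the CSV parsing cost once...
table = csv.read_csv("data.csv")
pq.write_table(table, "data.parquet")

# ...then all later reads come from the optimized binary format.
table = pq.read_table("data.parquet")
```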

 

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255
 ] 

Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:39 PM:
-

The bad news: the default `block_size` of 1MB, and the default use of native 
file objects, are not so good for my workloads. Moreover, I don't know what's 
going on with the speeds O_O

The good news: I now know how to consistently get around 1.8GiB/s speed for my 
workload.

Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer 
types) x (11 block sizes) x (2 times everything) 
[^arrow-csv-benchmark-times.csv]

And also a scatter plot.   !arrow-csv-benchmark-plot.png!

Note that the x-axis is the base-2 logarithm of the block size.

Do you think there's a place for changing the defaults of `block_size` and 
buffer objects for local paths?


was (Author: drorspei):
The bad news: the default `block_size` of 1MB, and the default use of native 
file objects, are not so good for my workloads. Moreover, I don't know what's 
going on with the speeds O_O

The good news: I now know how to consistently get around 1.8GiB/s speed for my 
workload.

Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer 
types) x (11 block sizes) x (2 times everything) 
[^arrow-csv-benchmark-times.csv]

And also a scatter plot.   !arrow-csv-benchmark-plot.png!

** Note that the x-axis is log in base 2 of the block size.

Do you think there's a place for changing the defaults of `block_size` and 
buffer objects for local paths?

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255
 ] 

Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:38 PM:
-

The bad news: the default `block_size` of 1MB, and the default use of native 
file objects, are not so good for my workloads. Moreover, I don't know what's 
going on with the speeds O_O

The good news: I now know how to consistently get around 1.8GiB/s speed for my 
workload.

Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer 
types) x (11 block sizes) x (2 times everything) 
[^arrow-csv-benchmark-times.csv]

And also a scatter plot.   !arrow-csv-benchmark-plot.png!

** Note that the x-axis is log in base 2 of the block size.

Do you think there's a place for changing the defaults of `block_size` and 
buffer objects for local paths?


was (Author: drorspei):
The bad news: the default `block_size` of 1MB, and the default use of native 
file objects, are not so good for my workloads. Moreover, I don't know what's 
going on with the speeds O_O

The good news: I now know how to consistently get around 1.8GiB/s speed for my 
workload.

Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer 
types) x (11 block sizes) x (2 times everything) 
[^arrow-csv-benchmark-times.csv]

And also a scatter plot.  !arrow-csv-benchmark-plot.png!

Do you think there's a place for changing the defaults of `block_size` and 
buffer objects for local paths?

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214269#comment-17214269
 ] 

Dror Speiser commented on ARROW-10308:
--

Yup, the graph confirms that block size in the range 32-100 MB is a good choice 
for my files.

But it still only gets to 1.8 GiB/s, which is slower than my SSD (2+ GiB/s). Is 
this reasonable? Are you not expecting the processing to be at least as fast as 
reading the files?

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214259#comment-17214259
 ] 

Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:29 PM:
-

I'm running in multi-threaded mode, with 48 vCPUs. htop shows them all 
lighting up when running the benchmark.

As for buffer objects: in most cases it would be faster to read entire files 
into memory and then use BufferReader, though there's a higher chance of 
exhausting available RAM.
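
A minimal sketch of that approach (hypothetical path):

```python
import pyarrow as pa
from pyarrow import csv

# Read the whole file into memory, then parse from an Arrow-native
# BufferReader instead of an OS file handle.
with open("data.csv", "rb") as f:
    buf = pa.py_buffer(f.read())
table = csv.read_csv(pa.BufferReader(buf))
```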


was (Author: drorspei):
I'm running in multi-threaded mode, with 48 vCPUs. htop shows them all 
lighting up when running the benchmark.

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214260#comment-17214260
 ] 

Antoine Pitrou commented on ARROW-10308:


If you really have 400 columns in your file, you may want to try a much larger 
block size, e.g. 32 MB.
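
A minimal sketch of that tweak (hypothetical path):

```python
from pyarrow import csv

# Raise the read block size from the ~1 MB default to 32 MiB.
opts = csv.ReadOptions(block_size=32 << 20)
table = csv.read_csv("data.csv", read_options=opts)
```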

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214259#comment-17214259
 ] 

Dror Speiser commented on ARROW-10308:
--

I'm running in multi-threaded mode, with 48 vCPUs. htop shows them all 
lighting up when running the benchmark.

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214258#comment-17214258
 ] 

Dror Speiser commented on ARROW-10308:
--

Also, given the results suggested by my profiling, there may still be 30-50% 
performance to gain with the defaults, if the time really is going to lock 
synchronisation.

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214257#comment-17214257
 ] 

Antoine Pitrou commented on ARROW-10308:


The adequate block size is heavily dependent on various characteristics, so 
it's not really possible to provide a one-size-fits-all default value.

As for "buffer objects for local paths", I guess I don't really understand the 
question.

Also: when you say "1.8GiB/s speed", is this in single-thread or multi-thread 
mode? If the latter, how many CPU cores are active?

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255
 ] 

Dror Speiser commented on ARROW-10308:
--

The bad news: the default `block_size` of 1MB, and the default use of native 
file objects, are not so good for my workloads. Moreover, I don't know what's 
going on with the speeds O_O

The good news: I now know how to consistently get around 1.8GiB/s speed for my 
workload.

Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer 
types) x (11 block sizes) x (2 times everything) 
[^arrow-csv-benchmark-times.csv]

And also a scatter plot.  !arrow-csv-benchmark-plot.png!

Do you think there's a place for changing the defaults of `block_size` and 
buffer objects for local paths?

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dror Speiser updated ARROW-10308:
-
Attachment: arrow-csv-benchmark-times.csv

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-times.csv, 
> benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dror Speiser updated ARROW-10308:
-
Attachment: arrow-csv-benchmark-plot.png

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214192#comment-17214192
 ] 

Antoine Pitrou commented on ARROW-10308:


1) No, it uses native file objects in that case.

2) Thank you, don't hesitate to report the numbers!

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214190#comment-17214190
 ] 

Dror Speiser commented on ARROW-10308:
--

Thanks for the quick response!

1) Sorry, I should have made this more explicit: while the benchmark uses 
BytesIO, I was experiencing these speeds when calling 
`pd.read_csv("/path/to/my.csv")`. Does pyarrow use `BufferReader` in this case?

2) Thanks for the tip, I'll try this out and report back if the numbers change.

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214163#comment-17214163
 ] 

Antoine Pitrou commented on ARROW-10308:


Two things:
1) you are using a Python file object (a {{BytesIO}} object). This will 
unnecessarily reduce performance. Instead you should use an Arrow native file 
object (for example {{pyarrow.BufferReader}}).
2) depending on the CSV file size and structure, it can be beneficial to change 
the CSV read block size in {{pyarrow.csv.ReadOptions}}: 
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
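
A minimal sketch combining both suggestions (hypothetical path and block 
size):

```python
import pyarrow as pa
from pyarrow import csv

data = open("data.csv", "rb").read()
table = csv.read_csv(
    pa.BufferReader(data),  # Arrow-native file object instead of BytesIO
    read_options=csv.ReadOptions(block_size=32 << 20),  # larger read blocks
)
```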


> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dror Speiser updated ARROW-10308:
-
Attachment: Dockerfile
benchmark-csv.py

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dror Speiser updated ARROW-10308:
-
Attachment: (was: Dockerfile)

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dror Speiser updated ARROW-10308:
-
Attachment: (was: benchmark-csv.py)

> read_csv from python is slow on some work loads
> ---
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, benchmark-csv.py, profile1.svg, 
> profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the Docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10308) read_csv from python is slow on some work loads

2020-10-14 Thread Dror Speiser (Jira)
Dror Speiser created ARROW-10308:


 Summary: read_csv from python is slow on some work loads
 Key: ARROW-10308
 URL: https://issues.apache.org/jira/browse/ARROW-10308
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 1.0.1
 Environment: Machine: Azure, 48 vcpus, 384GiB ram
OS: Ubuntu 18.04
Dockerfile and script: attached, or here: 
https://github.com/drorspei/arrow-csv-benchmark
Reporter: Dror Speiser
 Attachments: Dockerfile, benchmark-csv.py, profile1.svg, profile2.svg, 
profile3.svg, profile4.svg

Hi!

I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
processing data around 0.5GiB/s. "Real workloads" means many string, float, and 
all-null columns, and large file size (5-10GiB), though the file size didn't 
matter too much.

Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
the time is spent on shared pointer lock mechanisms (though I'm not sure if 
this is to be trusted). I've attached the dumps in svg format.

I've also attached a script and a Dockerfile to run a benchmark, which 
reproduces the speeds I see. Building the docker image and running it on a 
large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly around 
0.5GiB/s.

This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
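For reference, a minimal sketch of this kind of throughput measurement (illustrative only, not the attached benchmark-csv.py; the file path is a placeholder):

{code:python}
import os
import time

import pyarrow.csv as csv

path = "data.csv"  # placeholder: any large CSV with string/float/all-null columns
size_gib = os.path.getsize(path) / 2**30

start = time.perf_counter()
table = csv.read_csv(path)
elapsed = time.perf_counter() - start

print(f"{size_gib / elapsed:.2f} GiB/s ({size_gib:.1f} GiB in {elapsed:.1f}s)")
{code}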



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10303) [Rust] Parallel type transformation in CSV reader

2020-10-14 Thread Sergej Fries (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergej Fries closed ARROW-10303.

Resolution: Feedback Received

> [Rust] Parallel type transformation in CSV reader
> -
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Sergej Fries
>Priority: Minor
>  Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when the CSV file is read, a single thread is responsible for 
> reading the file and for transformation of returned string values into 
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
> seconds. Out of this time, only ~10% is reading the file, and ~68% 
> is transformation of the string values into correct data types.
> My proposal is to parallelize the part responsible for the data type 
> transformation.
> It seems to be quite simple to achieve since after the CSV reader reads a 
> batch, all projected columns are transformed one by one using an iterator 
> over a vector and a map function afterwards. I believe that if one uses the 
> rayon crate, the only change will be the adjustment of "iter()" into 
> "par_iter()" and
> changing
> {code}
> impl<R: Read> Reader<R>
> {code}
> into:
> {code}
> impl<R: Read + std::marker::Sync> Reader<R>
> {code}
>  
> But maybe I overlook something crucial (as I am quite new to Rust and Arrow). 
> Any advice from someone experienced is therefore very welcome!
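The same idea sketched in Python rather than Rust, with a thread pool standing in for rayon's par_iter (convert_column is a hypothetical per-column parser, for illustration only):

{code:python}
from concurrent.futures import ThreadPoolExecutor

def convert_column(args):
    # hypothetical: parse one column of strings into its target type
    values, target_type = args
    return [target_type(v) if v != "" else None for v in values]

def convert_batch(columns, types):
    # transform all projected columns of a batch in parallel, not one by one
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert_column, zip(columns, types)))

# convert_batch([["1", "2"], ["3.5", ""]], [int, float])
# -> [[1, 2], [3.5, None]]
{code}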



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10303) [Rust] Parallel type transformation in CSV reader

2020-10-14 Thread Sergej Fries (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214124#comment-17214124
 ] 

Sergej Fries commented on ARROW-10303:
--

Ah, cool, seems that I didn't check DataFusion-related issues well enough 
before posting. Thanks for linking!

I will then close this issue.

> [Rust] Parallel type transformation in CSV reader
> -
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Sergej Fries
>Priority: Minor
>  Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when the CSV file is read, a single thread is responsible for 
> reading the file and for transformation of returned string values into 
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
> seconds. Out of this time, only ~10% is reading the file, and ~68% 
> is transformation of the string values into correct data types.
> My proposal is to parallelize the part responsible for the data type 
> transformation.
> It seems to be quite simple to achieve since after the CSV reader reads a 
> batch, all projected columns are transformed one by one using an iterator 
> over a vector and a map function afterwards. I believe that if one uses the 
> rayon crate, the only change will be the adjustment of "iter()" into 
> "par_iter()" and
> changing
> {code}
> impl<R: Read> Reader<R>
> {code}
> into:
> {code}
> impl<R: Read + std::marker::Sync> Reader<R>
> {code}
>  
> But maybe I overlook something crucial (as I am quite new to Rust and Arrow). 
> Any advice from someone experienced is therefore very welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10145:

Summary: [C++][Dataset] Assert integer overflow in partitioning falls back 
to string  (was: [C++][Dataset] Integer-like partition field values outside 
int32 range error on reading)

> [C++][Dataset] Assert integer overflow in partitioning falls back to string
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}
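Until the fix lands, one possible workaround is to declare the partition schema explicitly instead of relying on inference (a sketch, assuming the pyarrow.dataset API of 1.0+):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# declare 'part' as int64 up front so inference never picks int32
part = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")
dataset = ds.dataset("test_int64_partition/", format="parquet", partitioning=part)
table = dataset.to_table()
{code}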



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10145:

Fix Version/s: (was: 2.0.1)
   3.0.0

> [C++][Dataset] Integer-like partition field values outside int32 range error 
> on reading
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2020-10-14 Thread David Sherrier (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214095#comment-17214095
 ] 

David Sherrier commented on ARROW-5409:
---

If no one is working on this, I would like to pick this up.  

Thanks

> [C++] Improvement for IsIn Kernel when right array is small
> ---
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Preeti Suman
>Priority: Major
> Fix For: 3.0.0
>
>
> The core of the algorithm (as Python) is 
> {code:java}
> for i, elem in enumerate(array):
>   output[i] = (elem in memo_table)
> {code}
>  Often the right operand list will be very small; in this case, the hash table 
> should be replaced with a constant vector, as sketched below. 
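A pure-Python sketch of the proposed strategy (the threshold of 4 is an arbitrary illustration):

{code:python}
def is_in(array, values, small_threshold=4):
    # for a tiny right operand, a linear scan beats building a hash table
    if len(values) <= small_threshold:
        return [any(elem == v for v in values) for elem in array]
    memo_table = set(values)
    return [elem in memo_table for elem in array]

# is_in([1, 2, 3, 4], [2, 4]) -> [False, True, False, True]
{code}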



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10307) [Rust] async parquet reader

2020-10-14 Thread Remi Dettai (Jira)
Remi Dettai created ARROW-10307:
---

 Summary: [Rust] async parquet reader
 Key: ARROW-10307
 URL: https://issues.apache.org/jira/browse/ARROW-10307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Remi Dettai


The aim of this issue is to discuss and try to implement async in the Parquet 
crate for read traits.

It focuses on the read part to limit the complexity and impact of the changes. 
The design choices should also make sense for the write part.

Related issues:
[ARROW-9275|https://issues.apache.org/jira/browse/ARROW-9275] is a more generic 
and abstract discussion about async. This issue focuses on Parquet reads.

[ARROW-9464|https://issues.apache.org/jira/browse/ARROW-9464] focuses on 
threading in datafusion but overlaps with this issue when datafusion reads from 
parquet

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading

2020-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10145:
---
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Integer-like partition field values outside int32 range error 
> on reading
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214036#comment-17214036
 ] 

Antoine Pitrou commented on ARROW-10304:


For the record, the slowdown seems mostly due to int->double conversion. That 
doesn't change the overall result, though :-)

> [C++][Compute] Optimize variance kernel for integers
> 
>
> Key: ARROW-10304
> URL: https://issues.apache.org/jira/browse/ARROW-10304
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
> Fix For: 3.0.0
>
>
> The current variance kernel converts all data types to `double` before 
> calculation. It's sub-optimal for integers. Integer arithmetic is much faster 
> than floating point, e.g., summation is 4x faster [1].
> A quick test for calculating int32 variance shows up to 3x performance gain. 
> Another benefit is that integer arithmetic is accurate.
> [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ
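A sketch of the accumulation this implies: keep exact integer sums and convert to double only once at the end (population variance shown for brevity):

{code:python}
def int_variance(values):
    n = len(values)
    s = sum(values)                   # exact integer sum
    sq = sum(v * v for v in values)   # exact integer sum of squares
    # single int->double conversion at the end: Var = E[X^2] - (E[X])^2
    return (sq - s * s / n) / n

# int_variance([1, 2, 3, 4]) -> 1.25
{code}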



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers

2020-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10304:
---
Fix Version/s: 3.0.0

> [C++][Compute] Optimize variance kernel for integers
> 
>
> Key: ARROW-10304
> URL: https://issues.apache.org/jira/browse/ARROW-10304
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
> Fix For: 3.0.0
>
>
> The current variance kernel converts all data types to `double` before 
> calculation. It's sub-optimal for integers. Integer arithmetic is much faster 
> than floating point, e.g., summation is 4x faster [1].
> A quick test for calculating int32 variance shows up to 3x performance gain. 
> Another benefit is that integer arithmetic is accurate.
> [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10305:

Flags:   (was: Important)

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10305:

Component/s: C++

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10305:

Summary: [C++][R] Filter datasets with string expressions  (was: [R] Error: 
Filter expression not supported for Arrow Datasets (substr, grepl, str_detect))

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10305:

Affects Version/s: (was: 1.0.1)

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10305:

Issue Type: New Feature  (was: Improvement)

> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10300) [Rust] Improve benchmark documentation for generating/converting TPC-H data

2020-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10300:
---
Summary: [Rust] Improve benchmark documentation for generating/converting 
TPC-H data  (was: [Rust] Parquet/CSV TPC-H data)

> [Rust] Improve benchmark documentation for generating/converting TPC-H data
> ---
>
> Key: ARROW-10300
> URL: https://issues.apache.org/jira/browse/ARROW-10300
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Remi Dettai
>Assignee: Andy Grove
>Priority: Minor
>
> The TPC-H benchmark for datafusion works with Parquet/CSV data but the data 
> generation routine described in the README generates `.tbl` data.
> Could we describe how the TPC-H Parquet/CSV data can be generated to make the 
> benchmark easier to set up and more reproducible?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10300) [Rust] Parquet/CSV TPC-H data

2020-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10300:
--

Assignee: Andy Grove

> [Rust] Parquet/CSV TPC-H data
> -
>
> Key: ARROW-10300
> URL: https://issues.apache.org/jira/browse/ARROW-10300
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Remi Dettai
>Assignee: Andy Grove
>Priority: Minor
>
> The TPC-H benchmark for datafusion works with Parquet/CSV data but the data 
> generation routine described in the README generates `.tbl` data.
> Could we describe how the TPC-H Parquet/CSV data can be generated to make the 
> benchmark easier to set up and more reproducible?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10197:
---
Labels: pull-request-available  (was: )

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Assignee: Kirill Lykov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Looks like there is no way to execute an expression on filtered data in 
> Python. 
>  Basically, I cannot pass `SelectionVector` to the projector's `evaluate` method:
> ```python
> import pyarrow as pa
> import pyarrow.gandiva as gandiva
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                               pa.array([5., 45., 36., 73., 83., 23., 76.])],
>                              ['a', 'b'])
> builder = gandiva.TreeExprBuilder()
> node_a = builder.make_field(table.schema.field("a"))
> node_b = builder.make_field(table.schema.field("b"))
> fifty = builder.make_literal(50.0, pa.float64())
> eleven = builder.make_literal(11.0, pa.float64())
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
> cond_2 = builder.make_function("greater_than", [node_a, node_b], pa.bool_())
> cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
> cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
> condition = builder.make_condition(cond)
> filter = gandiva.make_filter(table.schema, condition)
> # filterResult has type SelectionVector
> filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
> print(filterResult)
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
> field_result = pa.field("c", pa.float64())
> expr = builder.make_expression(sum, field_result)
> projector = gandiva.make_projector(
>     table.schema, [expr], pa.default_memory_pool())
> # problem: there is no way to pass filterResult to the projector here
> r, = projector.evaluate(table.to_batches()[0], filterResult)
> ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10270) [R] Fix CSV timestamp_parsers test on R-devel

2020-10-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10270.
-
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8447
[https://github.com/apache/arrow/pull/8447]

> [R] Fix CSV timestamp_parsers test on R-devel
> -
>
> Key: ARROW-10270
> URL: https://issues.apache.org/jira/browse/ARROW-10270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Apparently there is a change in the development version of R with respect to 
> timezone handling. I suspect it is this: 
> https://github.com/wch/r-source/blob/trunk/doc/NEWS.Rd#L296-L300
> It causes this failure:
> {code}
> ── 1. Failure: read_csv_arrow() can read timestamps (@test-csv.R#216)  
> ─
> `tbl` not equal to `df`.
> Component "time": 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure: read_csv_arrow() can read timestamps (@test-csv.R#219)  
> ─
> `tbl` not equal to `df`.
> Component "time": 'tzone' attributes are inconsistent ('UTC' and '')
> {code}
> This needs to be fixed for the CRAN release because they check on the devel 
> version. But it doesn't need to block the 2.0 release candidate because I can 
> (at minimum) skip these tests before submitting to CRAN (FYI [~kszucs])
> I'll also add a CI job to test on R-devel. I just removed 2 R jobs so we can 
> afford to add one back.
> cc [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10301) [C++] Add "all" boolean reducing kernel

2020-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10301:
-
Summary: [C++] Add "all" boolean reducing kernel  (was: Add "all" boolean 
reducing kernel)

> [C++] Add "all" boolean reducing kernel
> ---
>
> Key: ARROW-10301
> URL: https://issues.apache.org/jira/browse/ARROW-10301
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Andrew Wieteska
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: analytics
> Fix For: 3.0.0
>
>
> As discussed on GitHub: 
> [https://github.com/apache/arrow/pull/8294#discussion_r504034461]
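For context, a semantic sketch of such a kernel (an assumption based on the title; null handling mirroring the existing "any" kernel is a guess, not stated in the issue):

{code:python}
def all_kernel(values):
    # values: iterable of True/False/None; null (None) entries are skipped
    return all(v for v in values if v is not None)

# all_kernel([True, None, True]) -> True
# all_kernel([True, False, None]) -> False
{code}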



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10303) [Rust] Parallel type transformation in CSV reader

2020-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10303:
-
Summary: [Rust] Parallel type transformation in CSV reader  (was: Parallel 
type transformation in CSV reader)

> [Rust] Parallel type transformation in CSV reader
> -
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Sergej Fries
>Priority: Minor
>  Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when the CSV file is read, a single thread is responsible for 
> reading the file and for transformation of returned string values into 
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
> seconds. Out of this time, only ~10% is reading the file, and ~68% 
> is transformation of the string values into correct data types.
> My proposal is to parallelize the part responsible for the data type 
> transformation.
> It seems to be quite simple to achieve since after the CSV reader reads a 
> batch, all projected columns are transformed one by one using an iterator 
> over a vector and a map function afterwards. I believe that if one uses the 
> rayon crate, the only change will be the adjustment of "iter()" into 
> "par_iter()" and
> changing
> {code}
> impl<R: Read> Reader<R>
> {code}
> into:
> {code}
> impl<R: Read + std::marker::Sync> Reader<R>
> {code}
>  
> But maybe I overlook something crucial (as I am quite new to Rust and Arrow). 
> Any advice from someone experienced is therefore very welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-10197:


Assignee: Kirill Lykov

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Assignee: Kirill Lykov
>Priority: Major
> Fix For: 3.0.0
>
>
> Looks like there is no way to execute an expression on filtered data in 
> Python. 
>  Basically, I cannot pass `SelectionVector` to the projector's `evaluate` method:
> ```python
> import pyarrow as pa
> import pyarrow.gandiva as gandiva
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                               pa.array([5., 45., 36., 73., 83., 23., 76.])],
>                              ['a', 'b'])
> builder = gandiva.TreeExprBuilder()
> node_a = builder.make_field(table.schema.field("a"))
> node_b = builder.make_field(table.schema.field("b"))
> fifty = builder.make_literal(50.0, pa.float64())
> eleven = builder.make_literal(11.0, pa.float64())
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
> cond_2 = builder.make_function("greater_than", [node_a, node_b], pa.bool_())
> cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
> cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
> condition = builder.make_condition(cond)
> filter = gandiva.make_filter(table.schema, condition)
> # filterResult has type SelectionVector
> filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
> print(filterResult)
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
> field_result = pa.field("c", pa.float64())
> expr = builder.make_expression(sum, field_result)
> projector = gandiva.make_projector(
>     table.schema, [expr], pa.default_memory_pool())
> # problem: there is no way to pass filterResult to the projector here
> r, = projector.evaluate(table.to_batches()[0], filterResult)
> ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment

2020-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213950#comment-17213950
 ] 

Joris Van den Bossche commented on ARROW-9459:
--

This could probably also be solved by making the parsing lazy -> ARROW-10131

> [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
> --
>
> Key: ARROW-9459
> URL: https://issues.apache.org/jira/browse/ARROW-9459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration
>
> See some timing checks here: 
> https://github.com/dask/dask/pull/6346#issuecomment-656548675
> Parsing all statistics, even from a centralized {{_metadata}} file, can be 
> quite expensive. If you know in advance that you are not going to use them 
> (eg you are only going to do filtering on the partition fields, and otherwise 
> read all data), it could be nice to have an option to disable parsing 
> statistics.
> cc [~rjzamora] [~bkietz] [~fsaintjacques]
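To make the proposal concrete, a sketch of what such an option might look like from Python (the parse_statistics flag is hypothetical, not an existing pyarrow parameter):

{code:python}
import pyarrow.dataset as ds

# hypothetical flag: skip parsing row-group statistics when filtering
# will only touch partition fields anyway
dataset = ds.parquet_dataset("data/_metadata", parse_statistics=False)
{code}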



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment

2020-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9459:
-
Fix Version/s: 3.0.0

> [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
> --
>
> Key: ARROW-9459
> URL: https://issues.apache.org/jira/browse/ARROW-9459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration
> Fix For: 3.0.0
>
>
> See some timing checks here: 
> https://github.com/dask/dask/pull/6346#issuecomment-656548675
> Parsing all statistics, even from a centralized {{_metadata}} file, can be 
> quite expensive. If you know in advance that you are not going to use them 
> (eg you are only going to do filtering on the partition fields, and otherwise 
> read all data), it could be nice to have an option to disable parsing 
> statistics.
> cc [~rjzamora] [~bkietz] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment

2020-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10131:
--
Fix Version/s: 3.0.0

> [C++][Dataset] Lazily parse parquet metadata / statistics in 
> ParquetDatasetFactory and ParquetFileFragment
> --
>
> Key: ARROW-10131
> URL: https://issues.apache.org/jira/browse/ARROW-10131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration
> Fix For: 3.0.0
>
>
> Related to ARROW-9730, parsing of the statistics in parquet metadata is 
> expensive, and therefore should be avoided when possible.
> For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in 
> python) parses all statistics of all files and all columns. While when doing 
> a filtered read, you might only need the statistics of certain files (eg if a 
> filter on a partition field already excluded many files) and certain columns 
> (eg only the columns on which you are actually filtering).
> The current API is a bit all-or-nothing (both ParquetDatasetFactory and a 
> later EnsureCompleteMetadata parse all statistics, and neither allows parsing a 
> subset, or only parsing the other (non-statistics) metadata, ...), so I think 
> we should try to think of better abstractions.
> cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

2020-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213891#comment-17213891
 ] 

Antoine Pitrou commented on ARROW-9128:
---

This is unassigned, so you can definitely take it up.

> [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
> -
>
> Key: ARROW-9128
> URL: https://issues.apache.org/jira/browse/ARROW-9128
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10306) [C++] Add string replacement kernel

2020-10-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10306:


 Summary: [C++] Add string replacement kernel 
 Key: ARROW-10306
 URL: https://issues.apache.org/jira/browse/ARROW-10306
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to 
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html]
 with a plain variant, and optionally a RE2 version.
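A sketch of the two variants in Python terms (illustrative; it mirrors the pandas semantics rather than any existing Arrow API):

{code:python}
import re

def replace_plain(values, pat, repl):
    # literal substring replacement
    return [v.replace(pat, repl) for v in values]

def replace_regex(values, pat, repl):
    # regex variant (RE2 in the C++ kernel; Python's re here for illustration)
    compiled = re.compile(pat)
    return [compiled.sub(repl, v) for v in values]

# replace_plain(["foo", "food"], "oo", "0") -> ["f0", "f0d"]
{code}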



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10303) Parallel type transformation in CSV reader

2020-10-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213864#comment-17213864
 ] 

Jorge Leitão commented on ARROW-10303:
--

Linking to ARROW-9707, which is related to this.

> Parallel type transformation in CSV reader
> --
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Sergej Fries
>Priority: Minor
>  Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when the CSV file is read, a single thread is responsible for 
> reading the file and for transformation of returned string values into 
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
> seconds. Out of this time, only ~10% is reading the file, and ~68% 
> is transformation of the string values into correct data types.
> My proposal is to parallelize the part responsible for the data type 
> transformation.
> It seems to be quite simple to achieve since after the CSV reader reads a 
> batch, all projected columns are transformed one by one using an iterator 
> over a vector and a map function afterwards. I believe that if one uses the 
> rayon crate, the only change will be the adjustment of "iter()" into 
> "par_iter()" and
> changing
> {code}
> impl<R: Read> Reader<R>
> {code}
> into:
> {code}
> impl<R: Read + std::marker::Sync> Reader<R>
> {code}
>  
> But maybe I overlook something crucial (as I am quite new to Rust and Arrow). 
> Any advice from someone experienced is therefore very welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10195) [C++] Add string struct extract kernel using re2

2020-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10195:
---
Labels: pull-request-available  (was: )

> [C++] Add string struct extract kernel using re2
> 
>
> Key: ARROW-10195
> URL: https://issues.apache.org/jira/browse/ARROW-10195
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Similar to Pandas' str.extract a way to convert a string to a struct of 
> strings using the re2 regex library (when having named captured groups). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

2020-10-14 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213857#comment-17213857
 ] 

Maarten Breddels commented on ARROW-9128:
-

Shall I implement this?

> [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
> -
>
> Key: ARROW-9128
> URL: https://issues.apache.org/jira/browse/ARROW-9128
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-14 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213783#comment-17213783
 ] 

Uwe Korn commented on ARROW-10276:
--

You have to look at the differences between the {{pip list}} outputs on these 
two machines if it works on your desktop. The error might be coming from 
differing {{pandas}} versions.

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use the orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below.
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-14 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213779#comment-17213779
 ] 

utsav edited comment on ARROW-10276 at 10/14/20, 9:50 AM:
--

[~uwe] I can use it on my desktop though. Does this issue arise if the 
dependencies it needs are of a specific version despite what the requirements 
file says? I can recall it needing NumPy and pandas.  I used numpy==1.19.2, 
pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises 
from [https://github.com/apache/arrow/issues/2468] and ARROW-3141


was (Author: utri092):
[~uwe] I can use it on my desktop though. Does this issue arise if the 
dependencies it needs are of a specific version despite what the requirements 
file says? I can recall it needing NumPy and pandas.  I used numpy==1.19.2, 
pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises 
from this issue

[https://github.com/apache/arrow/issues/2468] and ARROW-3141

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use the orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below.
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-14 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213779#comment-17213779
 ] 

utsav commented on ARROW-10276:
---

[~uwe] I can use it on my desktop though. Does this issue arise if the 
dependencies it needs are of a specific version despite what the requirements 
file says? I can recall it needing NumPy and pandas.  I used numpy==1.19.2, 
pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises 
from this issue

[https://github.com/apache/arrow/issues/2468] and ARROW-3141

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use the orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below.
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-14 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213720#comment-17213720
 ] 

Uwe Korn commented on ARROW-10276:
--

Yes, Spark 3.0.1 is still not compatible with {{pyarrow=0.17}}, you can use 
0.14 and 0.15 with the latest Spark release but not newer AFAIK. So there is 
currently no combination that will work for you.

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use the orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below.
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect() or others, are not 
supported while filtering after open_dataset(). Specifically, the code below:
{code:java}
library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, to filter and 
collect a very large dataset. Is there anything that can be done to implement 
this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect() or others, are not 
supported while filtering after open_dataset(). Specifically, the code below:
{code:java}
library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, to filter and 
collect a very large dataset. Is there anything that can be done to implement 
this new feature?

Thank you.


> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and 
> collect a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 
{code:java}
library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
 

gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

{{library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")}}

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.


> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
> supported when filtering after open_dataset(). Specifically, the code below:
>  
> {code:java}
> library(dplyr)
>  library(arrow)
>  data = data.frame(a = c("a", "a2", "a3"))
>  write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
>  
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions can be very helpful, if not necessary, for filtering and 
> collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:
{code:java}
library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 
{code:java}
library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
 

gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.


> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
> supported when filtering after open_dataset(). Specifically, the code below:
> {code:java}
> library(dplyr)
>  library(arrow)
>  data = data.frame(a = c("a", "a2", "a3"))
>  write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions can be very helpful, if not necessary, for filtering and 
> collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

{{library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")}}

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

```library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")```

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.


> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
> supported when filtering after open_dataset(). Specifically, the code below:
>  
> {{library(dplyr)
>  library(arrow)
>  data = data.frame(a = c("a", "a2", "a3"))
>  write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")}}
> gives this error:
>  
> {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) 
> == "a"
>  Call collect() first to pull data into R.}}
> These expressions can be very helpful, if not necessary, for filtering and 
> collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

```library(dplyr)
 library(arrow)
 data = data.frame(a = c("a", "a2", "a3"))
 write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")```

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

{{library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
  filter(substr(a, 1, 1) == "a")}}

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.


> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, 
> str_detect)
> -
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
> supported when filtering after open_dataset(). Specifically, the code below:
>  
> ```library(dplyr)
>  library(arrow)
>  data = data.frame(a = c("a", "a2", "a3"))
>  write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")```
> gives this error:
>  
> {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) 
> == "a"
>  Call collect() first to pull data into R.}}
> These expressions can be very helpful, if not necessary, for filtering and 
> collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9856) [R] Add bindings for string compute functions

2020-10-14 Thread Pal (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213641#comment-17213641
 ] 

Pal commented on ARROW-9856:


This issue is also related to 
https://issues.apache.org/jira/browse/ARROW-10305. 

> [R] Add bindings for string compute functions
> -
>
> Key: ARROW-9856
> URL: https://issues.apache.org/jira/browse/ARROW-9856
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 3.0.0
>
>
> See https://arrow.apache.org/docs/cpp/compute.html#string-predicates and 
> below. Since R's base string functions, as well as stringr/stringi, aren't 
> generics that we can define methods for, this will probably make most sense 
> within the context of a dplyr expression where we have more control over the 
> evaluation.
> This will require enabling utf8proc in the builds; there's already an 
> rtools-package for it.
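
A hypothetical sketch of how such a dplyr-context binding might look from the 
user's side once implemented (the str_detect() translation and the kernel 
name are assumptions for illustration, not the current API):
{code:java}
library(dplyr)
library(arrow)
library(stringr)  # for str_detect()

ds <- open_dataset("some_dataset/")  # hypothetical example dataset

# Hypothetical: within filter(), str_detect() would be translated to an
# Arrow C++ string kernel (e.g. a match_substring-style function) and
# evaluated by Arrow before any data is pulled into R.
ds %>%
  filter(str_detect(a, "^a")) %>%
  collect()
{code}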



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)

2020-10-14 Thread Pal (Jira)
Pal created ARROW-10305:
---

 Summary: [R] Error: Filter expression not supported for Arrow 
Datasets (substr, grepl, str_detect)
 Key: ARROW-10305
 URL: https://issues.apache.org/jira/browse/ARROW-10305
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 1.0.1
Reporter: Pal


Hi,

Some expressions, such as substr(), grepl(), str_detect(), and others, are not 
supported when filtering after open_dataset(). Specifically, the code below:

 

{{library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")

ds <- open_dataset("Test_filter/")

data_flt <- ds %>% 
  filter(substr(a, 1, 1) == "a")}}

gives this error:

 

{{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
Call collect() first to pull data into R.}}

These expressions can be very helpful, if not necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)