[jira] [Updated] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers
[ https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10304: --- Labels: pull-request-available (was: ) > [C++][Compute] Optimize variance kernel for integers > > > Key: ARROW-10304 > URL: https://issues.apache.org/jira/browse/ARROW-10304 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The current variance kernel converts all data types to `double` before > calculation, which is sub-optimal for integers. Integer arithmetic is much faster > than floating point, e.g., summation is 4x faster [1]. > A quick test calculating int32 variance shows up to a 3x performance gain. > Another benefit is that integer arithmetic is exact. > [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ -- This message was sent by Atlassian Jira (v8.3.4#803005)
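The optimization described above can be sketched outside of Arrow: accumulate an exact integer sum and sum of squares, and touch floating point only in the final division. A minimal Python illustration of the idea (not the actual C++ kernel):

```python
# Sketch of the integer-variance idea: all sums are exact integer
# arithmetic; the only floating-point operation is the final division.

def int_variance(values, ddof=0):
    """Variance of a sequence of integers.

    ddof=0 gives the population variance, ddof=1 the sample variance.
    """
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof values")
    s = sum(values)                   # exact integer sum
    ss = sum(v * v for v in values)   # exact integer sum of squares
    # var = (ss - s*s/n) / (n - ddof) == (n*ss - s*s) / (n * (n - ddof));
    # the numerator n*ss - s*s is still computed exactly in integers.
    return (n * ss - s * s) / (n * (n - ddof))

print(int_variance([1, 2, 3, 4]))  # population variance: 1.25
```

Because the accumulators stay integral, there is no floating-point rounding until the end, which is why the result is exact as well as faster.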
[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails
[ https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhargav Parsi updated ARROW-10309: -- Attachment: error2.txt > [Ruby] gem install red-arrow fails > -- > > Key: ARROW-10309 > URL: https://issues.apache.org/jira/browse/ARROW-10309 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Reporter: Bhargav Parsi >Priority: Major > Attachments: error2.txt, image-2020-10-14-14-51-27-796.png > > > I am trying to install red-arrow on CentOS > (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3. > I followed the steps mentioned here: > https://arrow.apache.org/install/ > using the steps for CentOS 6/7. > After that I ran `gem install red-arrow`. > That gives > !image-2020-10-14-14-51-27-796.png!
[jira] [Commented] (ARROW-10309) [Ruby] gem install red-arrow fails
[ https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214297#comment-17214297 ] Bhargav Parsi commented on ARROW-10309: --- I believe the problem is `--ruby=/usr/bin/ruby`: on our system that is Ruby 2.0.0, but the default rvm version is 2.6.3, which lives at a different path, `/usr/local/rvm/rubies/ruby-2.6.3/bin/ruby`.
[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails
[ https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhargav Parsi updated ARROW-10309: -- Attachment: image-2020-10-14-14-51-27-796.png
[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails
[ https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhargav Parsi updated ARROW-10309: -- Description: I am trying to install red-arrow on CentOS (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3. I followed the steps mentioned here: https://arrow.apache.org/install/ (the CentOS 6/7 instructions). After that I ran `gem install red-arrow`. That gives !image-2020-10-14-14-51-27-796.png! was: I am trying to install red-arrow on CentOS (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3. I followed the steps mentioned here: https://arrow.apache.org/install/ (the CentOS 6/7 instructions). After that I ran `gem install red-arrow`. That gives ```
Building native extensions. This could take a while...
ERROR:  Error installing red-arrow:
        ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.

Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib64
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=/usr/bin/ruby
        --enable-debug-build
        --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
        from extconf.rb:6:in `'
```
[jira] [Updated] (ARROW-10309) [Ruby] gem install red-arrow fails
[ https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhargav Parsi updated ARROW-10309: -- Description: I am trying to install red-arrow on CentOS (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3. I followed the steps mentioned here: https://arrow.apache.org/install/ (the CentOS 6/7 instructions). After that I ran `gem install red-arrow`. That gives ```
Building native extensions. This could take a while...
ERROR:  Error installing red-arrow:
        ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.

Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib64
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=/usr/bin/ruby
        --enable-debug-build
        --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
        from extconf.rb:6:in `'
``` was: I am trying to install red-arrow on CentOS (centos-release-7-6.1810.2.el7.centos.x86_64). I followed the steps mentioned here: https://arrow.apache.org/install/ (the CentOS 6/7 instructions). After that I ran `gem install red-arrow`. That gives the same error output as above.
[jira] [Created] (ARROW-10309) [Ruby] gem install red-arrow fails
Bhargav Parsi created ARROW-10309: - Summary: [Ruby] gem install red-arrow fails Key: ARROW-10309 URL: https://issues.apache.org/jira/browse/ARROW-10309 Project: Apache Arrow Issue Type: Bug Components: Ruby Reporter: Bhargav Parsi I am trying to install red-arrow on CentOS (centos-release-7-6.1810.2.el7.centos.x86_64). I followed the steps mentioned here: https://arrow.apache.org/install/ (the CentOS 6/7 instructions). After that I ran `gem install red-arrow`. That gives ```
Building native extensions. This could take a while...
ERROR:  Error installing red-arrow:
        ERROR: Failed to build gem native extension.

    /usr/bin/ruby extconf.rb
checking --enable-debug-build option... no
checking C++ compiler... g++
checking g++ version... 4.8 (gnu++11)
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.

Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib64
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=/usr/bin/ruby
        --enable-debug-build
        --disable-debug-build
/usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:111:in `try_cxx_warning_flag': uninitialized constant ExtPP::Compiler::CONFTEST (NameError)
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:136:in `block in check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `each'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:135:in `check_warning_flags'
        from /usr/local/share/gems/gems/extpp-0.0.8/lib/extpp/compiler.rb:18:in `check'
        from extconf.rb:6:in `'
```
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214287#comment-17214287 ] Antoine Pitrou commented on ARROW-10308: Also, if you're interested in only some of the columns, you can reduce the processing time using {{ConvertOptions.include_columns}}: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html But really, consider using Parquet if you can. It's a highly optimized binary format. > read_csv from python is slow on some work loads > --- > > Key: ARROW-10308 > URL: https://issues.apache.org/jira/browse/ARROW-10308 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 1.0.1 > Environment: Machine: Azure, 48 vcpus, 384GiB ram > OS: Ubuntu 18.04 > Dockerfile and script: attached, or here: > https://github.com/drorspei/arrow-csv-benchmark >Reporter: Dror Speiser >Priority: Minor > Labels: csv, performance > Attachments: Dockerfile, arrow-csv-benchmark-plot.png, > arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, > profile3.svg, profile4.svg > > > Hi! > I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, > processing data around 0.5 GiB/s. "Real workloads" means many string, float, > and all-null columns, and large file sizes (5-10 GiB), though the file size > didn't matter too much. > Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of > the time is spent on shared pointer lock mechanisms (though I'm not sure if > this is to be trusted). I've attached the dumps in svg format. > I've also attached a script and a Dockerfile to run a benchmark, which > reproduces the speeds I see. Building the docker image and running it on a > large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly > around 0.5 GiB/s. > This is all also available here: > https://github.com/drorspei/arrow-csv-benchmark
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214276#comment-17214276 ] Antoine Pitrou commented on ARROW-10308: Processing a CSV file can be costly. On a 12-core 24-thread machine with a 64 MiB block size, I get around 1.5 GiB/s. Profiling at the C++ level, it seems that the main bottlenecks are: * CSV parsing itself (finding boundaries, escape characters etc.): 22% of total CPU time * Building up double arrays (most of which is converting from string to double): 53% of total CPU time * Building up string arrays: 19% of total CPU time If you're generating the data yourself (as opposed to getting it from a third party), I would really recommend using Parquet rather than CSV.
[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255 ] Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:39 PM: - The bad news: the default `block_size` of 1MB, and the default use of native file objects, are not so good for my workloads. Moreover, I don't know what's going on with the speeds O_O The good news: I now know how to consistently get around 1.8GiB/s speed for my workload. Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer types) x (11 block sizes) x (2 times everything) [^arrow-csv-benchmark-times.csv] And also a scatter plot. !arrow-csv-benchmark-plot.png! Note that the x-axis is log in base 2 of the block size. Do you think there's a place for changing the defaults of `block_size` and buffer objects for local paths? was (Author: drorspei): The bad news: the default `block_size` of 1MB, and the default use of native file objects, are not so good for my workloads. Moreover, I don't know what's going on with the speeds O_O The good news: I now know how to consistently get around 1.8GiB/s speed for my workload. Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer types) x (11 block sizes) x (2 times everything) [^arrow-csv-benchmark-times.csv] And also a scatter plot. !arrow-csv-benchmark-plot.png! ** Note that the x-axis is log in base 2 of the block size. Do you think there's a place for changing the defaults of `block_size` and buffer objects for local paths? 
[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255 ] Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:38 PM: - The bad news: the default `block_size` of 1MB, and the default use of native file objects, are not so good for my workloads. Moreover, I don't know what's going on with the speeds O_O The good news: I now know how to consistently get around 1.8GiB/s speed for my workload. Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer types) x (11 block sizes) x (2 times everything) [^arrow-csv-benchmark-times.csv] And also a scatter plot. !arrow-csv-benchmark-plot.png! ** Note that the x-axis is log in base 2 of the block size. Do you think there's a place for changing the defaults of `block_size` and buffer objects for local paths? was (Author: drorspei): The bad news: the default `block_size` of 1MB, and the default use of native file objects, are not so good for my workloads. Moreover, I don't know what's going on with the speeds O_O The good news: I now know how to consistently get around 1.8GiB/s speed for my workload. Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer types) x (11 block sizes) x (2 times everything) [^arrow-csv-benchmark-times.csv] And also a scatter plot. !arrow-csv-benchmark-plot.png! Do you think there's a place for changing the defaults of `block_size` and buffer objects for local paths? 
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214269#comment-17214269 ] Dror Speiser commented on ARROW-10308: -- Yup, the graph confirms that a block size in the range 32-100 MB is a good choice for my files. But it still only gets to 1.8 GiB/s, which is slower than my SSD (2+ GiB/s). Is this reasonable? Were you not expecting the processing to be at least as fast as reading the files?
[jira] [Comment Edited] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214259#comment-17214259 ] Dror Speiser edited comment on ARROW-10308 at 10/14/20, 8:29 PM: - I'm running multi-threaded, with 48 vcpus; htop shows them all lighting up when running the benchmark. For buffer objects: in most cases it would be faster to read entire files and then use BufferReader, though there's a higher chance of maxing out the available RAM. was (Author: drorspei): I'm running in multi-thread, with 48 vcpus. htop shows them all lighting up when running the benchmark.
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214260#comment-17214260 ] Antoine Pitrou commented on ARROW-10308: If you really have 400 columns in your file, you may want to try a much larger block size, e.g. 32 MB.
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214259#comment-17214259 ] Dror Speiser commented on ARROW-10308: -- I'm running in multi-threaded mode, with 48 vcpus. htop shows them all lighting up when running the benchmark.
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214258#comment-17214258 ] Dror Speiser commented on ARROW-10308: -- Also, given the results suggested by the profiling I did, there is still the possibility of winning 30-50% performance for the defaults, if it really is about lock synchronisation.
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214257#comment-17214257 ] Antoine Pitrou commented on ARROW-10308: The appropriate block size depends heavily on various file characteristics, so it's not really possible to provide a one-size-fits-all default value. As for "buffer objects for local paths", I guess I don't really understand the question. Also: when you say "1.8GiB/s speed", is this in single-thread or multi-thread mode? If the latter, how many CPU cores are active?
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214255#comment-17214255 ] Dror Speiser commented on ARROW-10308: -- The bad news: the default `block_size` of 1MB, and the default use of native file objects, are not so good for my workloads. Moreover, I don't know what's going on with the speeds O_O The good news: I now know how to consistently get around 1.8GiB/s for my workload. Attached is a csv with all the numbers: 220 runs = (5 rounds) x (2 buffer types) x (11 block sizes) x (2 times everything) [^arrow-csv-benchmark-times.csv] And also a scatter plot. !arrow-csv-benchmark-plot.png! Do you think there's a case for changing the defaults of `block_size` and buffer objects for local paths?
[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dror Speiser updated ARROW-10308: - Attachment: arrow-csv-benchmark-times.csv
[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dror Speiser updated ARROW-10308: - Attachment: arrow-csv-benchmark-plot.png
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214192#comment-17214192 ] Antoine Pitrou commented on ARROW-10308: 1) No, it uses native file objects in that case. 2) Thank you, don't hesitate to report the numbers!
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214190#comment-17214190 ] Dror Speiser commented on ARROW-10308: -- Thanks for the quick response! 1) Sorry, I should have made this more explicit: while the benchmark uses BytesIO, I was experiencing these speeds when calling `pd.read_csv("/path/to/my.csv")`. Does pyarrow use `BufferReader` in this case? 2) Thanks for the tip, I'll try this out and report back if the numbers change.
[jira] [Commented] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214163#comment-17214163 ] Antoine Pitrou commented on ARROW-10308: Two things: 1) you are using a Python file object (a {{BytesIO}} object). This will unnecessarily reduce performance. Instead you should use an Arrow native file object (for example {{pyarrow.BufferReader}}). 2) depending on the CSV file size and structure, it can be beneficial to change the CSV read block size in {{pyarrow.csv.ReadOptions}}: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dror Speiser updated ARROW-10308: - Attachment: Dockerfile benchmark-csv.py
[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dror Speiser updated ARROW-10308: - Attachment: (was: Dockerfile)
[jira] [Updated] (ARROW-10308) read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dror Speiser updated ARROW-10308: - Attachment: (was: benchmark-csv.py)
[jira] [Created] (ARROW-10308) read_csv from python is slow on some work loads
Dror Speiser created ARROW-10308: Summary: read_csv from python is slow on some work loads Key: ARROW-10308 URL: https://issues.apache.org/jira/browse/ARROW-10308 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 1.0.1 Environment: Machine: Azure, 48 vcpus, 384GiB ram OS: Ubuntu 18.04 Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark Reporter: Dror Speiser Attachments: Dockerfile, benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg Hi! I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" means many string, float, and all-null columns, and large file sizes (5-10 GiB), though the file size didn't matter too much. Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of the time is spent on shared pointer lock mechanisms (though I'm not sure if this is to be trusted). I've attached the dumps in svg format. I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the docker image and running it on a large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s. This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
[jira] [Closed] (ARROW-10303) [Rust] Parallel type transformation in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergej Fries closed ARROW-10303. Resolution: Feedback Received > [Rust] Parallel type transformation in CSV reader > - > > Key: ARROW-10303 > URL: https://issues.apache.org/jira/browse/ARROW-10303 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Sergej Fries >Priority: Minor > Labels: CSVReader > Attachments: tracing.png > > > Currently, when a CSV file is read, a single thread is responsible both for reading the file and for transforming the returned string values into the correct data types. > In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 seconds. Of this time, only ~10% is spent reading the file, while ~68% is spent transforming the string values into the correct data types. > My proposal is to parallelize the part responsible for the data type transformation. > It seems quite simple to achieve: after the CSV reader reads a batch, all projected columns are transformed one by one using an iterator over a vector followed by a map function. I believe that with the rayon crate, the only changes needed are adjusting "iter()" to "par_iter()" and changing > {code:java} > impl<R: Read> Reader<R> > {code} > into: > {code:java} > impl<R: Read + std::marker::Sync> Reader<R> > {code} > > But maybe I'm overlooking something crucial (being quite new to Rust and Arrow). Any advice from someone experienced is therefore very welcome!
[jira] [Commented] (ARROW-10303) [Rust] Parallel type transformation in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214124#comment-17214124 ] Sergej Fries commented on ARROW-10303: -- Ah, cool, it seems I didn't check DataFusion-related issues well enough before posting. Thanks for linking! I will then close this issue.
[jira] [Updated] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string
[ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10145: Summary: [C++][Dataset] Assert integer overflow in partitioning falls back to string (was: [C++][Dataset] Integer-like partition field values outside int32 range error on reading) > [C++][Dataset] Assert integer overflow in partitioning falls back to string > --- > > Key: ARROW-10145 > URL: https://issues.apache.org/jira/browse/ARROW-10145 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset > Small reproducer: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({'part': [3760212050]*10, 'col': range(10)}) > pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part']) > In [35]: pq.read_table("test_int64_partition/") > ... > ArrowInvalid: error parsing '3760212050' as scalar of type int32 > In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this) > In ../src/arrow/dataset/partition.cc, line 218, code: > (_error_or_value26).status() > In ../src/arrow/dataset/partition.cc, line 229, code: > (_error_or_value27).status() > In ../src/arrow/dataset/discovery.cc, line 256, code: > (_error_or_value17).status() > In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True) > Out[36]: > pyarrow.Table > col: int64 > part: dictionary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading
[ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10145: Fix Version/s: (was: 2.0.1) 3.0.0
[jira] [Commented] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small
[ https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214095#comment-17214095 ] David Sherrier commented on ARROW-5409: --- If no one is working on this, I would like to pick this up. Thanks > [C++] Improvement for IsIn Kernel when right array is small > --- > > Key: ARROW-5409 > URL: https://issues.apache.org/jira/browse/ARROW-5409 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Preeti Suman >Priority: Major > Fix For: 3.0.0 > > > The core of the algorithm (as python) is > {code:java} > for idx, elem in array: > output[i] = (elem in memo_table) > {code} > Often the right operand list will be very small, in this case, the hashtable > should be replaced with a constant vector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
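[Editor's note] The optimization proposed in ARROW-5409 above can be sketched in a few lines: when the right operand is small, a linear scan over a plain vector can beat a hash-table probe. The cutoff value below is an arbitrary placeholder, not a measured threshold:

```python
def is_in_hash(values, right):
    """Baseline: memo-table (hash set) membership, as in the current kernel."""
    memo = set(right)
    return [v in memo for v in values]

def is_in_small(values, right, cutoff=4):
    """Proposed variant: for a small right operand, scan a plain list
    instead of building and probing a hash table; fall back to the
    hash path for larger operands. The cutoff is illustrative."""
    if len(right) > cutoff:
        return is_in_hash(values, right)
    return [any(v == r for r in right) for v in values]

values = [1, 5, 2, 7, 5]
assert is_in_small(values, [5, 7]) == is_in_hash(values, [5, 7])
```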
[jira] [Created] (ARROW-10307) [Rust] async parquet reader
Remi Dettai created ARROW-10307: --- Summary: [Rust] async parquet reader Key: ARROW-10307 URL: https://issues.apache.org/jira/browse/ARROW-10307 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Remi Dettai The aim of this issue is to discuss and try to implement async in the Parquet crate for read traits. It focuses on the read part to limit the complexity and impact of the changes. The design choices should also make sense for the write part. Related issues: [ARROW-9275|https://issues.apache.org/jira/browse/ARROW-9275] is a more generic and abstract discussion about async. This issue focuses on Parquet read [ARROW-9464|https://issues.apache.org/jira/browse/ARROW-9464] focuses on threading in datafusion but overlaps with this issue when datafusion reads from parquet -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading
[ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10145: --- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Integer-like partition field values outside int32 range error > on reading > --- > > Key: ARROW-10145 > URL: https://issues.apache.org/jira/browse/ARROW-10145 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 2.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset > Small reproducer: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({'part': [3760212050]*10, 'col': range(10)}) > pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part']) > In [35]: pq.read_table("test_int64_partition/") > ... > ArrowInvalid: error parsing '3760212050' as scalar of type int32 > In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this) > In ../src/arrow/dataset/partition.cc, line 218, code: > (_error_or_value26).status() > In ../src/arrow/dataset/partition.cc, line 229, code: > (_error_or_value27).status() > In ../src/arrow/dataset/discovery.cc, line 256, code: > (_error_or_value17).status() > In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True) > Out[36]: > pyarrow.Table > col: int64 > part: dictionary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers
[ https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214036#comment-17214036 ] Antoine Pitrou commented on ARROW-10304: For the record, the slowdown seems mostly due to int->double conversion. That doesn't change the overall result, though :-) > [C++][Compute] Optimize variance kernel for integers > > > Key: ARROW-10304 > URL: https://issues.apache.org/jira/browse/ARROW-10304 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > Fix For: 3.0.0 > > > Current variance kernel converts all data type to `double` before > calculation. It's sub-optimal for integers. Integer arithmetic is much faster > than floating points, e.g., summation is 4x faster [1]. > A quick test for calculating int32 variance shows up to 3x performance gain. > Another benefit is that integer arithmetic is accurate. > [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ -- This message was sent by Atlassian Jira (v8.3.4#803005)
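[Editor's note] The idea behind ARROW-10304 above is to accumulate the sum and sum of squares in integer arithmetic and convert to floating point only once at the end. A sketch of that formulation (Python ints stand in for the wide integer accumulators a C++ kernel would use; this is not Arrow's actual implementation):

```python
def int_variance(values, ddof=0):
    """Variance over integer inputs via exact integer accumulators.
    Both the sum and the sum of squares are computed exactly in
    integer arithmetic; only the final division is floating point."""
    n = len(values)
    s = sum(values)                   # exact integer sum
    ss = sum(v * v for v in values)   # exact integer sum of squares
    # var = (sum_sq - sum^2 / n) / (n - ddof)
    return (ss - s * s / n) / (n - ddof)

print(int_variance([1, 2, 3, 4]))  # population variance -> 1.25
```

Because the accumulators are exact, the usual catastrophic-cancellation worry of the sum-of-squares formula only enters at the final division, which is part of why the issue notes integer arithmetic is also more accurate.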
[jira] [Updated] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers
[ https://issues.apache.org/jira/browse/ARROW-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10304: --- Fix Version/s: 3.0.0 > [C++][Compute] Optimize variance kernel for integers > > > Key: ARROW-10304 > URL: https://issues.apache.org/jira/browse/ARROW-10304 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > Fix For: 3.0.0 > > > Current variance kernel converts all data type to `double` before > calculation. It's sub-optimal for integers. Integer arithmetic is much faster > than floating points, e.g., summation is 4x faster [1]. > A quick test for calculating int32 variance shows up to 3x performance gain. > Another benefit is that integer arithmetic is accurate. > [1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10305: Flags: (was: Important) > [C++][R] Filter datasets with string expressions > > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10305: Component/s: C++ > [C++][R] Filter datasets with string expressions > > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10305: Summary: [C++][R] Filter datasets with string expressions (was: [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)) > [C++][R] Filter datasets with string expressions > > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10305: Affects Version/s: (was: 1.0.1) > [C++][R] Filter datasets with string expressions > > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10305: Issue Type: New Feature (was: Improvement) > [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, > str_detect) > - > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10300) [Rust] Improve benchmark documentation for generating/converting TPC-H data
[ https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10300: --- Summary: [Rust] Improve benchmark documentation for generating/converting TPC-H data (was: [Rust] Parquet/CSV TPC-H data) > [Rust] Improve benchmark documentation for generating/converting TPC-H data > --- > > Key: ARROW-10300 > URL: https://issues.apache.org/jira/browse/ARROW-10300 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Remi Dettai >Assignee: Andy Grove >Priority: Minor > > The TPC-H benchmark for datafusion works with Parquet/CSV data but the data > generation routine described in the README generates `.tbl` data. > Could we describe how the TPC-H Parquet/CSV data can be generated to make the > benchmark easier to setup and more reproducible ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
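[Editor's note] The dbgen `.tbl` files mentioned in ARROW-10300 above are pipe-delimited with a trailing `|` per row. A stdlib-only sketch of the conversion step the README could document (column names and sample data here are illustrative; a real pipeline would stream whole files and then write Parquet, e.g. with pyarrow):

```python
import csv
import io

def tbl_to_csv(tbl_text, column_names):
    """Convert dbgen '.tbl' text (pipe-delimited, trailing '|') into
    CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(column_names)
    for line in tbl_text.splitlines():
        if not line:
            continue
        # Strip the trailing delimiter before splitting into fields.
        writer.writerow(line.rstrip("|").split("|"))
    return out.getvalue()

sample = "1|Customer#000000001|foo|\n2|Customer#000000002|bar|\n"
print(tbl_to_csv(sample, ["c_custkey", "c_name", "c_comment"]))
```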
[jira] [Assigned] (ARROW-10300) [Rust] Parquet/CSV TPC-H data
[ https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-10300: -- Assignee: Andy Grove > [Rust] Parquet/CSV TPC-H data > - > > Key: ARROW-10300 > URL: https://issues.apache.org/jira/browse/ARROW-10300 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Remi Dettai >Assignee: Andy Grove >Priority: Minor > > The TPC-H benchmark for datafusion works with Parquet/CSV data but the data > generation routine described in the README generates `.tbl` data. > Could we describe how the TPC-H Parquet/CSV data can be generated to make the > benchmark easier to setup and more reproducible ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data
[ https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10197: --- Labels: pull-request-available (was: ) > [Gandiva][python] Execute expression on filtered data > - > > Key: ARROW-10197 > URL: https://issues.apache.org/jira/browse/ARROW-10197 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva, Python >Reporter: Kirill Lykov >Assignee: Kirill Lykov >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Looks like there is no way to execute an expression on filtered data in > python. > Basically, I cannot pass `SelectionVector` to projector's `evaluate` method > ```python > import pyarrow as pa > import pyarrow.gandiva as gandiva > table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]), > pa.array([5., 45., 36., 73., > 83., 23., 76.])], > ['a', 'b']) > builder = gandiva.TreeExprBuilder() > node_a = builder.make_field(table.schema.field("a")) > node_b = builder.make_field(table.schema.field("b")) > fifty = builder.make_literal(50.0, pa.float64()) > eleven = builder.make_literal(11.0, pa.float64()) > cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_()) > cond_2 = builder.make_function("greater_than", [node_a, node_b], > pa.bool_()) > cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_()) > cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3]) > condition = builder.make_condition(cond) > filter = gandiva.make_filter(table.schema, condition) > filterResult = filter.evaluate(table.to_batches()[0], > pa.default_memory_pool()) --> filterResult has type SelectionVector > print(result) > sum = builder.make_function("add", [node_a, node_b], pa.float64()) > field_result = pa.field("c", pa.float64()) > expr = builder.make_expression(sum, field_result) > projector = gandiva.make_projector( > table.schema, [expr], pa.default_memory_pool()) > r, = projector.evaluate(table.to_batches()[0], result) --> Here there is a > problem that I don't know how to use filterResult with projector > ``` > In C++, I see that it is possible to pass SelectionVector as second argument > to projector::Evaluate: > [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270] > > Meanwhile, it looks like it is impossible in `gandiva.pyx`: > [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10270) [R] Fix CSV timestamp_parsers test on R-devel
[ https://issues.apache.org/jira/browse/ARROW-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-10270. - Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8447 [https://github.com/apache/arrow/pull/8447] > [R] Fix CSV timestamp_parsers test on R-devel > - > > Key: ARROW-10270 > URL: https://issues.apache.org/jira/browse/ARROW-10270 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Apparently there is a change in the development version of R with respect to > timezone handling. I suspect it is this: > https://github.com/wch/r-source/blob/trunk/doc/NEWS.Rd#L296-L300 > It causes this failure: > {code} > ── 1. Failure: read_csv_arrow() can read timestamps (@test-csv.R#216) > ─ > `tbl` not equal to `df`. > Component "time": 'tzone' attributes are inconsistent ('UTC' and '') > ── 2. Failure: read_csv_arrow() can read timestamps (@test-csv.R#219) > ─ > `tbl` not equal to `df`. > Component "time": 'tzone' attributes are inconsistent ('UTC' and '') > {code} > This needs to be fixed for the CRAN release because they check on the devel > version. But it doesn't need to block the 2.0 release candidate because I can > (at minimum) skip these tests before submitting to CRAN (FYI [~kszucs]) > I'll also add a CI job to test on R-devel. I just removed 2 R jobs so we can > afford to add one back. > cc [~romainfrancois] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10301) [C++] Add "all" boolean reducing kernel
[ https://issues.apache.org/jira/browse/ARROW-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10301: - Summary: [C++] Add "all" boolean reducing kernel (was: Add "all" boolean reducing kernel) > [C++] Add "all" boolean reducing kernel > --- > > Key: ARROW-10301 > URL: https://issues.apache.org/jira/browse/ARROW-10301 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Andrew Wieteska >Assignee: Andrew Wieteska >Priority: Major > Labels: analytics > Fix For: 3.0.0 > > > As discussed on GitHub: > [https://github.com/apache/arrow/pull/8294#discussion_r504034461] -- This message was sent by Atlassian Jira (v8.3.4#803005)
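[Editor's note] The semantics of the "all" reducing kernel proposed in ARROW-10301 above can be sketched as a short-circuiting reduction over a nullable boolean array. Whether nulls are skipped or propagated is a design choice discussed on the linked PR; the null-skipping behavior below is illustrative only:

```python
def all_kernel(values):
    """Reduce a nullable boolean sequence (None = null) to a single bool.
    Nulls are skipped in this sketch; an empty input reduces to True,
    the identity of logical AND."""
    result = True
    for v in values:
        if v is None:
            continue
        result = result and v
        if not result:
            break  # short-circuit on the first False
    return result

print(all_kernel([True, None, True]))   # -> True
print(all_kernel([True, False, None]))  # -> False
```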
[jira] [Updated] (ARROW-10303) [Rust] Parallel type transformation in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10303: - Summary: [Rust] Parallel type transformation in CSV reader (was: Parallel type transformation in CSV reader) > [Rust] Parallel type transformation in CSV reader > - > > Key: ARROW-10303 > URL: https://issues.apache.org/jira/browse/ARROW-10303 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Sergej Fries >Priority: Minor > Labels: CSVReader > Attachments: tracing.png > > > Currently, when the CSV file is read, a single thread is responsible for > reading the file and for transformation of returned string values into > correct data types. > In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 > seconds. Out of this time, only ~10% is reading the file, and ~68% > is transformation of the string values into correct data types. > My proposal is to parallelize the part responsible for the data type > transformation. > It seems to be quite simple to achieve since after the CSV reader reads a > batch, all projected columns are transformed one by one using an iterator > over vector and a map function afterwards. I believe that if one uses the > rayon crate, the only change will be the adjustment of "iter()" into > "par_iter()" and > changing > `impl<R: Read> Reader<R>` > into: > `impl<R: Read + std::marker::Sync> Reader<R>` > > But maybe I overlook something crucial (as being quite new in Rust and Arrow). > Any advice from someone experienced is therefore very welcome! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10197) [Gandiva][python] Execute expression on filtered data
[ https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-10197: Assignee: Kirill Lykov > [Gandiva][python] Execute expression on filtered data > - > > Key: ARROW-10197 > URL: https://issues.apache.org/jira/browse/ARROW-10197 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva, Python >Reporter: Kirill Lykov >Assignee: Kirill Lykov >Priority: Major > Fix For: 3.0.0 > > > Looks like there is no way to execute an expression on filtered data in > python. > Basically, I cannot pass `SelectionVector` to projector's `evaluate` method > ```python > import pyarrow as pa > import pyarrow.gandiva as gandiva > table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]), > pa.array([5., 45., 36., 73., > 83., 23., 76.])], > ['a', 'b']) > builder = gandiva.TreeExprBuilder() > node_a = builder.make_field(table.schema.field("a")) > node_b = builder.make_field(table.schema.field("b")) > fifty = builder.make_literal(50.0, pa.float64()) > eleven = builder.make_literal(11.0, pa.float64()) > cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_()) > cond_2 = builder.make_function("greater_than", [node_a, node_b], > pa.bool_()) > cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_()) > cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3]) > condition = builder.make_condition(cond) > filter = gandiva.make_filter(table.schema, condition) > filterResult = filter.evaluate(table.to_batches()[0], > pa.default_memory_pool()) --> filterResult has type SelectionVector > print(result) > sum = builder.make_function("add", [node_a, node_b], pa.float64()) > field_result = pa.field("c", pa.float64()) > expr = builder.make_expression(sum, field_result) > projector = gandiva.make_projector( > table.schema, [expr], pa.default_memory_pool()) > r, = projector.evaluate(table.to_batches()[0], result) --> Here there is a > problem that I don't know how to use filterResult with projector > ``` > In C++, I see that it is possible to pass SelectionVector as second argument > to projector::Evaluate: > [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270] > > Meanwhile, it looks like it is impossible in `gandiva.pyx`: > [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
[ https://issues.apache.org/jira/browse/ARROW-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213950#comment-17213950 ] Joris Van den Bossche commented on ARROW-9459: -- This could probably also be solved by making the parsing lazy -> ARROW-10131 > [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment > -- > > Key: ARROW-9459 > URL: https://issues.apache.org/jira/browse/ARROW-9459 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset, dataset-dask-integration > > See some timing checks here: > https://github.com/dask/dask/pull/6346#issuecomment-656548675 > Parsing all statistics, even from a centralized {{_metadata}} file, can be > quite expensive. If you know in advance that you are not going to use them > (eg you are only going to do filtering on the partition fields, and otherwise > read all data), it could be nice to have an option to disable parsing > statistics. > cc [~rjzamora] [~bkietz] [~fsaintjacques] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
[ https://issues.apache.org/jira/browse/ARROW-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9459: - Fix Version/s: 3.0.0 > [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment > -- > > Key: ARROW-9459 > URL: https://issues.apache.org/jira/browse/ARROW-9459 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset, dataset-dask-integration > Fix For: 3.0.0 > > > See some timing checks here: > https://github.com/dask/dask/pull/6346#issuecomment-656548675 > Parsing all statistics, even from a centralized {{_metadata}} file, can be > quite expensive. If you know in advance that you are not going to use them > (eg you are only going to do filtering on the partition fields, and otherwise > read all data), it could be nice to have an option to disable parsing > statistics. > cc [~rjzamora] [~bkietz] [~fsaintjacques] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment
[ https://issues.apache.org/jira/browse/ARROW-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-10131: -- Fix Version/s: 3.0.0 > [C++][Dataset] Lazily parse parquet metadata / statistics in > ParquetDatasetFactory and ParquetFileFragment > -- > > Key: ARROW-10131 > URL: https://issues.apache.org/jira/browse/ARROW-10131 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset, dataset-dask-integration > Fix For: 3.0.0 > > > Related to ARROW-9730, parsing of the statistics in parquet metadata is > expensive, and therefore should be avoided when possible. > For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in > python) parses all statistics of all files and all columns. While when doing > a filtered read, you might only need the statistics of certain files (eg if a > filter on a partition field already excluded many files) and certain columns > (eg only the columns on which you are actually filtering). > The current API is a bit all-or-nothing (both ParquetDatasetFactory, or a > later EnsureCompleteMetadata parse all statistics, and don't allow parsing a > subset, or only parsing the other (non-statistics) metadata, ...), so I think > we should try to think of better abstractions. > cc [~rjzamora] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
[ https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213891#comment-17213891 ] Antoine Pitrou commented on ARROW-9128: --- This is unassigned, so you can definitely take it up. > [C++] Implement string space trimming kernels: trim, ltrim, and rtrim > - > > Key: ARROW-9128 > URL: https://issues.apache.org/jira/browse/ARROW-9128 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10306) [C++] Add string replacement kernel
Maarten Breddels created ARROW-10306: Summary: [C++] Add string replacement kernel Key: ARROW-10306 URL: https://issues.apache.org/jira/browse/ARROW-10306 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels Similar to [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html] with a plain variant, and optionally a RE2 version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10303) Parallel type transformation in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213864#comment-17213864 ] Jorge Leitão commented on ARROW-10303: -- Linking to ARROW-9707, that is related to this > Parallel type transformation in CSV reader > -- > > Key: ARROW-10303 > URL: https://issues.apache.org/jira/browse/ARROW-10303 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Sergej Fries >Priority: Minor > Labels: CSVReader > Attachments: tracing.png > > > Currently, when the CSV file is read, a single thread is responsible for > reading the file and for transformation of returned string values into > correct data types. > In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 > seconds. Out of this time, only ~10% is reading the file, and ~68% > is transformation of the string values into correct data types. > My proposal is to parallelize the part responsible for the data type > transformation. > It seems to be quite simple to achieve since after the CSV reader reads a > batch, all projected columns are transformed one by one using an iterator > over vector and a map function afterwards. I believe that if one uses the > rayon crate, the only change will be the adjustment of "iter()" into > "par_iter()" and > changing > `impl<R: Read> Reader<R>` > into: > `impl<R: Read + std::marker::Sync> Reader<R>` > > But maybe I overlook something crucial (as being quite new in Rust and Arrow). > Any advice from someone experienced is therefore very welcome! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10195) [C++] Add string struct extract kernel using re2
[ https://issues.apache.org/jira/browse/ARROW-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10195: --- Labels: pull-request-available (was: ) > [C++] Add string struct extract kernel using re2 > > > Key: ARROW-10195 > URL: https://issues.apache.org/jira/browse/ARROW-10195 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Maarten Breddels >Assignee: Maarten Breddels >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Similar to Pandas' str.extract a way to convert a string to a struct of > strings using the re2 regex library (when having named captured groups). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
[ https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213857#comment-17213857 ] Maarten Breddels commented on ARROW-9128: - Shall I implement this? > [C++] Implement string space trimming kernels: trim, ltrim, and rtrim > - > > Key: ARROW-9128 > URL: https://issues.apache.org/jira/browse/ARROW-9128 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213783#comment-17213783 ] Uwe Korn commented on ARROW-10276: -- You have to look at the differences between the {{pip list}} outputs on these two machines if it works on your desktop. The error might be coming from differing {{pandas}} versions. > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_compat_error, build_pip_wheel.sh, > dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5. > The code is as follows: > ``` > import pandas as pd > df_pd = df.toPandas() > npArr = df_pd.to_numpy() > ``` > The error is as follows:- > ``` > /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas > attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is > set to true; however, failed by the reason below: > module 'pyarrow' has no attribute 'compat' > Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' > is set to true. > warnings.warn(msg) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213779#comment-17213779 ] utsav edited comment on ARROW-10276 at 10/14/20, 9:50 AM: -- [~uwe] I can use it on my desktop though. Does this issue arise if the dependencies it needs are of a specific version despite what the requirements file says? I can recall it needing NumPy and pandas. I used numpy==1.19.2, pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises from [https://github.com/apache/arrow/issues/2468] and ARROW-3141 was (Author: utri092): [~uwe] I can use it on my desktop though. Does this issue arise if the dependencies it needs are of a specific version despite what the requirements file says? I can recall it needing NumPy and pandas. I used numpy==1.19.2, pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises from this issue [https://github.com/apache/arrow/issues/2468] and ARROW-3141 > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_compat_error, build_pip_wheel.sh, > dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5. 
> The code is as follows: > ``` > import pandas as pd > df_pd = df.toPandas() > npArr = df_pd.to_numpy() > ``` > The error is as follows:- > ``` > /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas > attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is > set to true; however, failed by the reason below: > module 'pyarrow' has no attribute 'compat' > Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' > is set to true. > warnings.warn(msg) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213779#comment-17213779 ] utsav commented on ARROW-10276: --- [~uwe] I can use it on my desktop though. Does this issue arise if the dependencies it needs are of a specific version despite what the requirements file says? I can recall it needing NumPy and pandas. I used numpy==1.19.2, pandas==1.1.2, six==1.15.0, pytz==2020.1 and Cython==0.29.2. My doubt arises from this issue [https://github.com/apache/arrow/issues/2468] and ARROW-3141 > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_compat_error, build_pip_wheel.sh, > dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5. 
> The code is as follows: > ``` > import pandas as pd > df_pd = df.toPandas() > npArr = df_pd.to_numpy() > ``` > The error is as follows:- > ``` > /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas > attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is > set to true; however, failed by the reason below: > module 'pyarrow' has no attribute 'compat' > Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' > is set to true. > warnings.warn(msg) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213720#comment-17213720 ] Uwe Korn commented on ARROW-10276: -- Yes, Spark 3.0.1 is still not compatible with {{pyarrow=0.17}}, you can use 0.14 and 0.15 with the latest Spark release but not newer AFAIK. So there is currently no combination that will work for you. > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_compat_error, build_pip_wheel.sh, > dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5. > The code is as follows: > ``` > import pandas as pd > df_pd = df.toPandas() > npArr = df_pd.to_numpy() > ``` > The error is as follows:- > ``` > /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas > attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is > set to true; however, failed by the reason below: > module 'pyarrow' has no attribute 'compat' > Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' > is set to true. > warnings.warn(msg) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
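The version constraint Uwe describes can be made explicit before enabling the Arrow code path. This is a hypothetical guard: the function name is invented, and the accepted version set simply encodes the comment's claim that only pyarrow 0.14 and 0.15 worked with the Spark releases of the time:

```python
def arrow_optimization_usable(pyarrow_version: str) -> bool:
    """Hypothetical check reflecting Uwe Korn's comment: with the Spark
    releases current at the time, only pyarrow 0.14/0.15 worked, so any
    other version should fall back to the non-Arrow toPandas path."""
    major, minor = (int(p) for p in pyarrow_version.split(".")[:2])
    return (major, minor) in {(0, 14), (0, 15)}
```

In a Spark session one would then set spark.sql.execution.arrow.enabled accordingly instead of relying on the runtime fallback warning shown in the report.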
[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-10305: Description: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_dataset(). Specifically, the code below: {code:java} library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a") {code} gives this error: {code:java} Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.{code} These expressions may be very helpful, if not necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature? Thank you. was: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_dataset(). Specifically, the code below: {code:java} library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a") {code} gives this error: {code:java} Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.{code} These expressions may be very helpful, if not necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature? Thank you. 
> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, > str_detect) > - > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_dataset(). Specifically, the code below: > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error: > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, if not necessary, to filter and > collect a very large dataset. Is there anything that can be done to implement > this new feature? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-10305: Description: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : {code:java} library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a") {code} gives this error : {code:java} Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.{code} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. was: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : {{library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")}} gives this error : {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. 
> [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, > str_detect) > - > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). Specifically, the code below > : > > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > > gives this error : > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-10305: Description: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : {{library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")}} gives this error : {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. was: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : ```library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")``` gives this error : {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. > [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, > str_detect) > - > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). 
Specifically, the code below > : > > {{library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a")}} > gives this error : > > {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) > == "a" > Call collect() first to pull data into R.}} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-10305: Description: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : ```library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")``` gives this error : {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. was: Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_datatset(). Specifically, the code below : {{library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")}} gives this error : {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything it can be done to implement this new feature ? Thank you. > [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, > str_detect) > - > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering after open_datatset(). 
Specifically, the code below > : > > ```library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a")``` > gives this error : > > {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) > == "a" > Call collect() first to pull data into R.}} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything it can be done to implement > this new feature ? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9856) [R] Add bindings for string compute functions
[ https://issues.apache.org/jira/browse/ARROW-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213641#comment-17213641 ] Pal commented on ARROW-9856: This issue is also related to https://issues.apache.org/jira/browse/ARROW-10305. > [R] Add bindings for string compute functions > - > > Key: ARROW-9856 > URL: https://issues.apache.org/jira/browse/ARROW-9856 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 3.0.0 > > > See https://arrow.apache.org/docs/cpp/compute.html#string-predicates and > below. Since R's base string functions, as well as stringr/stringi, aren't > generics that we can define methods for, this will probably make most sense > within the context of a dplyr expression where we have more control over the > evaluation. > This will require enabling utf8proc in the builds; there's already an > rtools-package for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10305) [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect)
Pal created ARROW-10305: --- Summary: [R] Error: Filter expression not supported for Arrow Datasets (substr, grepl, str_detect) Key: ARROW-10305 URL: https://issues.apache.org/jira/browse/ARROW-10305 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 1.0.1 Reporter: Pal Hi, Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_dataset(). Specifically, the code below: {{library(dplyr) library(arrow) data = data.frame(a = c("a", "a2", "a3")) write_parquet(data, "Test_filter/data.parquet") ds <- open_dataset("Test_filter/") data_flt <- ds %>% filter(substr(a, 1, 1) == "a")}} gives this error: {{Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a" Call collect() first to pull data into R.}} These expressions may be very helpful, if not necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature? Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)