[jira] [Updated] (ARROW-10331) [Rust] [DataFusion] Re-organize errors
[ https://issues.apache.org/jira/browse/ARROW-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10331:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Re-organize errors
> --------------------------------------
>
> Key: ARROW-10331
> URL: https://issues.apache.org/jira/browse/ARROW-10331
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Affects Versions: 3.0.0
> Reporter: Jorge Leitão
> Assignee: Jorge Leitão
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> DataFusion's errors do not have much love these days, and I think that they
> need a lift. For example,
> * we use "General" very often
> * the error is called "ExecutionError", even though sometimes it happens
> during planning
> * the error "InvalidColumn" is not being used
> * there is not much documentation about the errors

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-10331) [Rust] [DataFusion] Re-organize errors
Jorge Leitão created ARROW-10331:
------------------------------------

Summary: [Rust] [DataFusion] Re-organize errors
Key: ARROW-10331
URL: https://issues.apache.org/jira/browse/ARROW-10331
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Affects Versions: 3.0.0
Reporter: Jorge Leitão
Assignee: Jorge Leitão

DataFusion's errors do not have much love these days, and I think that they need a lift. For example,
* we use "General" very often
* the error is called "ExecutionError", even though sometimes it happens during planning
* the error "InvalidColumn" is not being used
* there is not much documentation about the errors
[jira] [Commented] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion
[ https://issues.apache.org/jira/browse/ARROW-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215738#comment-17215738 ]

Jorge Leitão commented on ARROW-10330:
--------------------------------------

Good idea. (y)

I moved this to 3.0.0 to not block the 2.0.0 release.

> [Rust][Datafusion] Implement nullif() function for DataFusion
> -------------------------------------------------------------
[jira] [Updated] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion
[ https://issues.apache.org/jira/browse/ARROW-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-10330:
---------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [Rust][Datafusion] Implement nullif() function for DataFusion
> -------------------------------------------------------------
[jira] [Closed] (ARROW-10327) [Rust] [DataFusion] Iterator of futures
[ https://issues.apache.org/jira/browse/ARROW-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão closed ARROW-10327.
--------------------------------
    Resolution: Won't Fix

As discussed in #8473 and #8480, this is better handled via buffering, to avoid memory issues.

> [Rust] [DataFusion] Iterator of futures
> ---------------------------------------
>
> Key: ARROW-10327
> URL: https://issues.apache.org/jira/browse/ARROW-10327
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Reporter: Jorge Leitão
> Assignee: Jorge Leitão
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
[jira] [Created] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion
Evan Chan created ARROW-10330:
---------------------------------

Summary: [Rust][Datafusion] Implement nullif() function for DataFusion
Key: ARROW-10330
URL: https://issues.apache.org/jira/browse/ARROW-10330
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Evan Chan
Fix For: 2.0.0

Here is the common definition of NULLIF() function:
https://www.w3schools.com/sql/func_sqlserver_nullif.asp

Among other uses, it is used to protect denominators from divide by 0 errors. We have implemented it at UrbanLogiq and would like to contribute this back.
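The NULLIF() semantics referenced above can be sketched elementwise in plain Python, with None standing in for SQL NULL. This is an illustration of the SQL semantics only, not the DataFusion implementation offered in the ticket; the helper names (`nullif`, `safe_div`) are made up for the sketch.

```python
def nullif(left, right):
    """Return left, except None wherever left equals right (per SQL NULLIF)."""
    out = []
    for l, r in zip(left, right):
        if l is None or (r is not None and l == r):
            out.append(None)  # NULL propagates; equal values become NULL
        else:
            out.append(l)
    return out

def safe_div(a, b):
    """The divide-by-zero guard from the ticket: a / NULLIF(b, 0)."""
    denom = nullif(b, [0] * len(b))
    return [None if d is None else n / d for n, d in zip(a, denom)]
```

Dividing by `nullif(b, 0)` turns a would-be division error into a NULL result, which is exactly the denominator-protection use case mentioned in the ticket.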
[jira] [Updated] (ARROW-10320) [Rust] Convert RecordBatchIterator to a Stream
[ https://issues.apache.org/jira/browse/ARROW-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-10320:
---------------------------------
    Summary: [Rust] Convert RecordBatchIterator to a Stream  (was: Convert RecordBatchIterator to a Stream)

> [Rust] Convert RecordBatchIterator to a Stream
> ----------------------------------------------
>
> Key: ARROW-10320
> URL: https://issues.apache.org/jira/browse/ARROW-10320
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Reporter: Jorge Leitão
> Assignee: Jorge Leitão
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> So that the unit of work is a single record batch instead of a part of a
> partition.
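The ARROW-10320 change is Rust-specific, but the shape of it (a blocking iterator of record batches wrapped as an async stream, so the unit of work becomes one batch) can be sketched with a Python async generator. Everything here is illustrative; `fake_batches` is a made-up stand-in for a RecordBatchIterator, not an Arrow API.

```python
import asyncio

def fake_batches(partition):
    # Stand-in for a blocking RecordBatchIterator over one partition.
    for i in range(3):
        yield f"partition{partition}-batch{i}"

async def batch_stream(partition):
    # Wrap the blocking iterator as an async stream: downstream operators
    # await one batch at a time instead of consuming a whole partition.
    for batch in fake_batches(partition):
        await asyncio.sleep(0)  # yield control between batches
        yield batch

async def collect():
    return [b async for b in batch_stream(0)]

batches = asyncio.run(collect())
```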
[jira] [Assigned] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small
[ https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Sherrier reassigned ARROW-5409:
-------------------------------------
    Assignee: David Sherrier

> [C++] Improvement for IsIn Kernel when right array is small
> -----------------------------------------------------------
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Preeti Suman
> Assignee: David Sherrier
> Priority: Major
> Fix For: 3.0.0
>
> The core of the algorithm (as Python) is
> {code:java}
> for i, elem in enumerate(array):
>     output[i] = elem in memo_table
> {code}
> Often the right operand list will be very small; in this case, the hashtable
> should be replaced with a constant vector.
[jira] [Commented] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small
[ https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215627#comment-17215627 ]

Wes McKinney commented on ARROW-5409:
-------------------------------------

Please go ahead. We'll need some benchmarks written so that we can establish a heuristic for choosing between the two algorithms.

> [C++] Improvement for IsIn Kernel when right array is small
> -----------------------------------------------------------
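The heuristic under discussion can be sketched in Python: below some size threshold on the right operand, a linear scan beats building a hash table. The threshold value here is an arbitrary placeholder (the whole point of the requested benchmarks is to measure the real cutoff), and the function is an illustration, not the Arrow C++ kernel.

```python
SMALL_RIGHT_THRESHOLD = 8  # placeholder; the real cutoff needs benchmarks

def is_in(values, right):
    """Membership of each element of `values` in `right`."""
    if len(right) <= SMALL_RIGHT_THRESHOLD:
        # Small right operand: scan a plain tuple ("constant vector")
        # instead of paying hash-table construction and lookup costs.
        probe = tuple(right)
        return [v in probe for v in values]
    # General case: memo-table (hash set) lookup, as the kernel does today.
    memo = set(right)
    return [v in memo for v in values]
```

Both branches compute the same answer; only the probe data structure changes, which is why a size-based heuristic is safe to apply.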
[jira] [Updated] (ARROW-10329) [Rust][Datafusion] Datafusion queries involving a column name that begins with a number produces unexpected results
[ https://issues.apache.org/jira/browse/ARROW-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Morgan Cassels updated ARROW-10329:
-----------------------------------
    Summary: [Rust][Datafusion] Datafusion queries involving a column name that begins with a number produces unexpected results  (was: Datafusion queries involving a column name that begins with a number produces unexpected results)

> [Rust][Datafusion] Datafusion queries involving a column name that begins
> with a number produces unexpected results
> -------------------------------------------------------------------------
>
> Key: ARROW-10329
> URL: https://issues.apache.org/jira/browse/ARROW-10329
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust - DataFusion
> Reporter: Morgan Cassels
> Priority: Major
>
> This bug can be worked around by wrapping column names in quotes.
> Example:
> {{let query = "SELECT 16_20mph, 21_25mph FROM foo;"}}
> {{let logical_plan = ctx.create_logical_plan(query)?;}}
> {{logical_plan.schema().fields()}} now has fields: {{[_20mph, _25mph]}}
> The resulting table produced by this query looks like:
> ||{{_20mph}}||{{_25mph}}||
> |16|21|
> |16|21|
> Every row is identical, where the column value is equal to the initial number
> that appears in the column name.
[jira] [Updated] (ARROW-10329) Datafusion queries involving a column name that begins with a number produces unexpected results
[ https://issues.apache.org/jira/browse/ARROW-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Morgan Cassels updated ARROW-10329:
-----------------------------------
    Description:
This bug can be worked around by wrapping column names in quotes.
Example:
{{let query = "SELECT 16_20mph, 21_25mph FROM foo;"}}
{{let logical_plan = ctx.create_logical_plan(query)?;}}
{{logical_plan.schema().fields() now has fields: [_20mph, _25mph]}}
The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|
Every row is identical, where the column value is equal to the initial number that appears in the column name.

  was:
This bug can be worked around by wrapping column names in quotes.
Example:
{{let query = "SELECT 16_20mph, 21_25mph FROM foo;"}}
{{let logical_plan = ctx.create_logical_plan(query)?;}}
{{logical_plan.schema().fields() }}now has fields: {{_20mph, _25mph}}
The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|
Every row is identical, where the column value is equal to the initial number that appears in the column name.

> Datafusion queries involving a column name that begins with a number produces
> unexpected results
> -----------------------------------------------------------------------------
[jira] [Created] (ARROW-10329) Datafusion queries involving a column name that begins with a number produces unexpected results
Morgan Cassels created ARROW-10329:
--------------------------------------

Summary: Datafusion queries involving a column name that begins with a number produces unexpected results
Key: ARROW-10329
URL: https://issues.apache.org/jira/browse/ARROW-10329
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Morgan Cassels

This bug can be worked around by wrapping column names in quotes.
Example:
{{let query = "SELECT 16_20mph, 21_25mph FROM foo;"}}
{{let logical_plan = ctx.create_logical_plan(query)?;}}
{{logical_plan.schema().fields()}} now has fields: {{_20mph, _25mph}}
The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|
Every row is identical, where the column value is equal to the initial number that appears in the column name.
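The behavior reported above is consistent with the SQL lexer splitting `16_20mph` into a numeric literal `16` followed by an identifier `_20mph`, which would explain both the renamed fields and the constant column values. A minimal Python sketch of that tokenization (an illustration of typical SQL lexing rules, not DataFusion's actual lexer):

```python
import re

# Greedy number-then-identifier rules, as a typical SQL lexer applies them.
TOKEN = re.compile(r"\d+|[A-Za-z_][A-Za-z0-9_]*")

def tokenize(text):
    return TOKEN.findall(text)

# "16_20mph" is not lexed as one identifier: the lexer consumes "16" as a
# number, then "_20mph" as an identifier, matching the fields and the
# constant-valued columns observed in the bug report.
tokens = tokenize("16_20mph")
```

Quoting the column name turns the whole string into a single quoted-identifier token, which is why the workaround in the report succeeds.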
[jira] [Updated] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-10321:
------------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [C++] Building AVX512 code when we should not
> ---------------------------------------------
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Neal Richardson
> Assignee: Frank Du
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
> Time Spent: 40m
> Remaining Estimate: 0h
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging
> Arrow for an old macOS SDK version, we found what I believe are 2 different
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it
> was still trying to compile one of the AVX512 files, which failed. I added a
> patch that made that file conditional, but there's probably a proper cmake
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]
[jira] [Resolved] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-10321.
-------------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8478
https://github.com/apache/arrow/pull/8478

> [C++] Building AVX512 code when we should not
> ---------------------------------------------
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215519#comment-17215519 ]

Antoine Pitrou commented on ARROW-10308:
----------------------------------------

> Antoine, do you think this is a good idea? Do you have input on what csv
> compositions are found in the wild?

Yes, that sounds like a very good idea. Instead of generating data, I think it's better to use actual data. You can find a variety of real-world datasets here: https://github.com/awslabs/open-data-registry

A commonly used dataset for demonstration and benchmarking purposes is the New York taxi dataset: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

You may also find datasets of Twitter messages, which would be more text-heavy and therefore would stress the CSV reader a bit differently.

Generally, for multi-thread benchmarking, you want files that are at least 1 GB long. It may be possible to take a smaller file and replicate its contents a number of times to reach the desired size, though.

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384 GiB RAM
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data around 0.5 GiB/s. "Real workloads" means many string, float,
> and all-null columns, and large file size (5-10 GiB), though the file size
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly
> around 0.5 GiB/s.
> This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215483#comment-17215483 ]

Dror Speiser commented on ARROW-10308:
--------------------------------------

Yeah, Azure doesn't tell me how many physical cores are at my disposal, which makes it hard to compare between setups. But even if it's 12 cpus with hyperthreading and bad advertising, there is still a gap to be explained between single-thread and multi-thread performance.

I offer to work on a benchmark that measures reading csvs of different sizes and compositions, for a variety of block sizes, and run it on a few different machine sizes on AWS (tiny to xlarge) and Azure, and report the results here.

Antoine, do you think this is a good idea? Do you have input on what csv compositions are found in the wild? You said that narrow columns are common; how would you quantify this?

Personally I work with finance and real estate data; I can create "data profiles" for what I see in my own workloads and share.

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
[jira] [Updated] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion
[ https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-10313:
------------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [C++] Improve UTF8 validation speed and CSV string conversion
> -------------------------------------------------------------
>
> Key: ARROW-10313
> URL: https://issues.apache.org/jira/browse/ARROW-10313
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Antoine Pitrou
> Assignee: Antoine Pitrou
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.0.0
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV
> string conversion.
> This is because we must validate many small UTF8 strings individually.
[jira] [Closed] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
[ https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson closed ARROW-10324.
-----------------------------------
      Assignee: Neal Richardson
    Resolution: Duplicate

> function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
> -----------------------------------------------------------------------------
>
> Key: ARROW-10324
> URL: https://issues.apache.org/jira/browse/ARROW-10324
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Akash Shah
> Assignee: Neal Richardson
> Priority: Major
>
> For the following code snippet
> {code:java}
> // code placeholder
> library(arrow)
> download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet', 'sample.parquet')
> read_parquet(file = 'sample.parquet', as_data_frame = TRUE)
> {code}
> I get:
> {code:java}
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
>   embedded nul in string: '\0 at \0'
> {code}
> So, I thought, what if I could read the file as binaries and replace the
> embedded nul character \0 myself.
> {code:java}
> parquet <- read_parquet(file = 'sample.parquet', as_data_frame = FALSE)
> raw <- write_to_raw(parquet, format = "file")
> print(raw)
> {code}
> In this case, I get an indecipherable stream of characters and nuls, which
> makes it very difficult to remove '00' characters that are problematic in the
> stream.
> {code:java}
> [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 00 06 00
> [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 00 00 00
> [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 01 00 00
> [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 00 00 00
> [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00
> [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 00 00 01 05
> [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 6c 61 6e 67
> [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 04 00
> {code}
> Is there a way to handle this while reading Apache parquet?
[jira] [Commented] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
[ https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215444#comment-17215444 ]

Neal Richardson commented on ARROW-10324:
-----------------------------------------

This is the same as ARROW-6582. We're working on a proper solution but don't have one yet. Two things to note:

1. In the upcoming release, it won't error anymore; it will truncate the string at the nul. Arguably that's worse because you won't know you have a problem.
2. I think you can work around this by reading with {{as_data_frame = FALSE}} as you have done, and then cast the offending column(s) to {{binary()}} before bringing the data into R. That will give you a list of raw vectors, and you should be able to filter out the {{00}}s and then call {{rawToChar()}} on them (assuming what you want is to drop the nuls).

> function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
> -----------------------------------------------------------------------------
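The workaround described in the comment is R-specific (cast to {{binary()}}, filter out the {{00}} bytes, then {{rawToChar()}}), but the core "drop the nuls" step can be sketched in Python on raw bytes. The sample bytes below are made up to mirror the `'\0 at \0'` string from the error message; this is an illustration, not an Arrow API.

```python
def drop_nuls(raw: bytes) -> str:
    # Remove embedded NUL bytes, then decode: the same effect as filtering
    # the 00s from an R raw vector before calling rawToChar().
    return raw.replace(b"\x00", b"").decode("utf-8")

# The string from the reported error, '\0 at \0', as raw bytes:
cleaned = drop_nuls(b"\x00 at \x00")
```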
[jira] [Updated] (ARROW-10328) [C++] Consider using fast-double-parser
[ https://issues.apache.org/jira/browse/ARROW-10328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-10328:
-----------------------------------
    Description:
We use Google's double-conversion library for parsing strings to doubles. We should consider using this library, which is more than 2x faster:
https://github.com/lemire/fast_double_parser

Parsing doubles is important for CSV performance.

  was:
We use Google's double-conversion library for parsing strings to doubles. We should consider using this library, which is more than 2x faster.

Parsing doubles is important for CSV performance.

> [C++] Consider using fast-double-parser
> ---------------------------------------
[jira] [Created] (ARROW-10328) [C++] Consider using fast-double-parser
Antoine Pitrou created ARROW-10328:
--------------------------------------

Summary: [C++] Consider using fast-double-parser
Key: ARROW-10328
URL: https://issues.apache.org/jira/browse/ARROW-10328
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
Fix For: 3.0.0

We use Google's double-conversion library for parsing strings to doubles. We should consider using fast-double-parser instead, which is more than 2x faster.

Parsing doubles is important for CSV performance.
[jira] [Updated] (ARROW-10327) [Rust] [DataFusion] Iterator of futures
[ https://issues.apache.org/jira/browse/ARROW-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10327:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Iterator of futures
> ---------------------------------------
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215407#comment-17215407 ]

Antoine Pitrou commented on ARROW-10308:
----------------------------------------

For the record, on a 12-core 24-thread CPU, I get between 8x and 10x scaling from single-core to multi-core. This is far from linear scaling, but not horrific either.

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
[jira] [Created] (ARROW-10327) [Rust] [DataFusion] Iterator of futures
Jorge Leitão created ARROW-10327:
------------------------------------

Summary: [Rust] [DataFusion] Iterator of futures
Key: ARROW-10327
URL: https://issues.apache.org/jira/browse/ARROW-10327
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão
[jira] [Commented] (ARROW-10197) [Gandiva][python] Execute expression on filtered data
[ https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215400#comment-17215400 ] Kirill Lykov commented on ARROW-10197: -- To simplify navigation, PR is there https://github.com/apache/arrow/pull/8461 > [Gandiva][python] Execute expression on filtered data > - > > Key: ARROW-10197 > URL: https://issues.apache.org/jira/browse/ARROW-10197 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva, Python >Reporter: Kirill Lykov >Assignee: Kirill Lykov >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Looks like there is no way to execute an expression on filtered data in > python. > Basically, I cannot pass `SelectionVector` to projector's `evaluate` method > ```python > import pyarrow as pa > import pyarrow.gandiva as gandiva > table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]), > pa.array([5., 45., 36., 73., > 83., 23., 76.])], > ['a', 'b']) > builder = gandiva.TreeExprBuilder() > node_a = builder.make_field(table.schema.field("a")) > node_b = builder.make_field(table.schema.field("b")) > fifty = builder.make_literal(50.0, pa.float64()) > eleven = builder.make_literal(11.0, pa.float64()) > cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_()) > cond_2 = builder.make_function("greater_than", [node_a, node_b], > pa.bool_()) > cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_()) > cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3]) > condition = builder.make_condition(cond) > filter = gandiva.make_filter(table.schema, condition) > filterResult = filter.evaluate(table.to_batches()[0], > pa.default_memory_pool()) --> filterResult has type SelectionVector > print(result) > sum = builder.make_function("add", [node_a, node_b], pa.float64()) > field_result = pa.field("c", pa.float64()) > expr = builder.make_expression(sum, field_result) > 
projector = gandiva.make_projector( > table.schema, [expr], pa.default_memory_pool()) > r, = projector.evaluate(table.to_batches()[0], filterResult) --> Here the > problem is that I don't know how to pass filterResult to the projector > ``` > In C++, it is possible to pass a SelectionVector as the second argument > to Projector::Evaluate: > [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270] > > Meanwhile, it appears to be impossible in `gandiva.pyx`: > [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154] -- This message was sent by Atlassian Jira (v8.3.4#803005)
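The filter-then-project flow the snippet above is trying to express can be illustrated in plain Python. This is a conceptual sketch of what a selection vector does (the list of row indices that passed the filter), not the Gandiva API:

```python
# Conceptual sketch of filter + project with a selection vector.
# Same data and conditions as the Gandiva example above.
a = [1.0, 31.0, 46.0, 3.0, 57.0, 44.0, 22.0]
b = [5.0, 45.0, 36.0, 73.0, 83.0, 23.0, 76.0]

def make_selection_vector(a, b):
    """Indices of rows where (a < 50 and a > b) or (b < 11)."""
    return [i for i in range(len(a))
            if (a[i] < 50.0 and a[i] > b[i]) or b[i] < 11.0]

def project_sum(a, b, selection):
    """Evaluate c = a + b only at the selected indices."""
    return [a[i] + b[i] for i in selection]

sel = make_selection_vector(a, b)
print(sel)                     # → [0, 2, 5]
print(project_sum(a, b, sel))  # → [6.0, 82.0, 67.0]
```

This is exactly what passing the SelectionVector to `Projector::Evaluate` achieves in C++: the expression is computed only for rows that survived the filter.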
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215401#comment-17215401 ] Antoine Pitrou commented on ARROW-10308: "vcpu" doesn't mean anything precise unfortunately. What is the CPU model and how many *physical* cores are allocated to the virtual machine? > I am familiar with the simdjson library that claims to parse json files at > over 2 GiB/s, on a single core It all depends what "parsing" entails, what data it is tested on, and what is done with the data once parsed. On our internal micro-benchmarks, the Arrow CSV parser runs at around 600 MB/s (on a single core), but that's data-dependent. I tend to test on data with narrow column values since that's what "big data" often looks like, and that's the most difficult case for a CSV parser. It's possible that better speeds can be achieved on larger column values (such as large binary strings). But parsing isn't sufficient, then you have to convert the data to Arrow format, which also means you switch from a row-oriented format to a column-oriented format. That part probably hits quite hard on the memory and cache subsystem. > [Python] read_csv from python is slow on some work loads > > > Key: ARROW-10308 > URL: https://issues.apache.org/jira/browse/ARROW-10308 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 1.0.1 > Environment: Machine: Azure, 48 vcpus, 384GiB ram > OS: Ubuntu 18.04 > Dockerfile and script: attached, or here: > https://github.com/drorspei/arrow-csv-benchmark >Reporter: Dror Speiser >Priority: Minor > Labels: csv, performance > Attachments: Dockerfile, arrow-csv-benchmark-plot.png, > arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, > profile3.svg, profile4.svg > > > Hi! > I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, > processing data around 0.5GiB/s. 
"Real workloads" means many string, float, > and all-null columns, and large file size (5-10GiB), though the file size > didn't matter too much. > Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of > the time is spent on shared pointer lock mechanisms (though I'm not sure if > this is to be trusted). I've attached the dumps in svg format. > I've also attached a script and a Dockerfile to run a benchmark, which > reproduces the speeds I see. Building the docker image and running it on a > large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly > around 0.5GiB/s. > This is all also available here: > https://github.com/drorspei/arrow-csv-benchmark -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215396#comment-17215396 ] Wes McKinney commented on ARROW-10308: -- I do think we should be doing better here than we are so it merits some analysis to see if some default options should change. The results do strike me as peculiar > [Python] read_csv from python is slow on some work loads > > > Key: ARROW-10308 > URL: https://issues.apache.org/jira/browse/ARROW-10308 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 1.0.1 > Environment: Machine: Azure, 48 vcpus, 384GiB ram > OS: Ubuntu 18.04 > Dockerfile and script: attached, or here: > https://github.com/drorspei/arrow-csv-benchmark >Reporter: Dror Speiser >Priority: Minor > Labels: csv, performance > Attachments: Dockerfile, arrow-csv-benchmark-plot.png, > arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, > profile3.svg, profile4.svg > > > Hi! > I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, > processing data around 0.5GiB/s. "Real workloads" means many string, float, > and all-null columns, and large file size (5-10GiB), though the file size > didn't matter to much. > Moreover, profiling a little a bit with py-spy, it seems that maybe 30-50% of > the time is spent on shared pointer lock mechanisms (though I'm not sure if > this is to be trusted). I've attached the dumps in svg format. > I've also attached a script and a Dockerfile to run a benchmark, which > reproduces the speeds I see. Building the docker image and running it on a > large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly > around 0.5GiB/s. > This is all also available here: > https://github.com/drorspei/arrow-csv-benchmark -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215393#comment-17215393 ] Dror Speiser commented on ARROW-10308: -- Thanks for the suggestions :) I am indeed getting the files from a third party, and I'm converting them to parquet on arrival using arrow. I'm actually content with 0.5 GiB/s. I'm here because I saw a tweet by Wes McKinney saying that the csv parser in arrow is "extremely fast". I tweeted back my results and he suggested that I open an issue. I would like to note that the numbers don't quite add up. If the cpu usage is totally accounted for by the operations of parsing and building arrays, then that would mean that a single processor is doing between 0.06 to 0.13 GiB/s, which is very slow. When I run the benchmark without threads I get 0.3 GiB/s, which is reasonable for a single processor. But it also means that the 48 vcpus I have are very far from achieving a linear speedup, which is in line with my profiling (though the attached images are for a block size of 1 MB). Do you see a linear speedup on your machine? As for processing CSVs being costly in general, I'm not familiar enough with other libraries to say, but I am familiar with the simdjson library that claims to parse JSON files at over 2 GiB/s on a single core. I'm looking at the code of both projects, hoping I'll be able to contribute something from simdjson to the csv parser in arrow. 
> [Python] read_csv from python is slow on some work loads > > > Key: ARROW-10308 > URL: https://issues.apache.org/jira/browse/ARROW-10308 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 1.0.1 > Environment: Machine: Azure, 48 vcpus, 384GiB ram > OS: Ubuntu 18.04 > Dockerfile and script: attached, or here: > https://github.com/drorspei/arrow-csv-benchmark >Reporter: Dror Speiser >Priority: Minor > Labels: csv, performance > Attachments: Dockerfile, arrow-csv-benchmark-plot.png, > arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, > profile3.svg, profile4.svg > > > Hi! > I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, > processing data around 0.5GiB/s. "Real workloads" means many string, float, > and all-null columns, and large file size (5-10GiB), though the file size > didn't matter to much. > Moreover, profiling a little a bit with py-spy, it seems that maybe 30-50% of > the time is spent on shared pointer lock mechanisms (though I'm not sure if > this is to be trusted). I've attached the dumps in svg format. > I've also attached a script and a Dockerfile to run a benchmark, which > reproduces the speeds I see. Building the docker image and running it on a > large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly > around 0.5GiB/s. > This is all also available here: > https://github.com/drorspei/arrow-csv-benchmark -- This message was sent by Atlassian Jira (v8.3.4#803005)
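For context on the single-core numbers discussed in this thread, a rough stdlib-only throughput measurement looks like the sketch below. The numbers are purely illustrative; Arrow's C++ parser is far faster than Python's `csv` module, but the same measurement shape (bytes parsed / wall time) is what the benchmark in the attached repository reports:

```python
import csv
import io
import time

# Build an in-memory CSV with narrow column values -- the hard case for a
# CSV parser, as noted in the discussion above.
rows = 200_000
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["a", "b", "c"])
for i in range(rows):
    writer.writerow([i, i * 0.5, "x"])
data = buf.getvalue()

# Parse it back on a single core and compute throughput.
start = time.perf_counter()
parsed = list(csv.reader(io.StringIO(data)))
elapsed = time.perf_counter() - start

mb = len(data.encode()) / 1e6
print(f"parsed {mb:.1f} MB in {elapsed:.3f}s ({mb / elapsed:.0f} MB/s, single core)")
```

Note that parsing is only half the cost in Arrow: the parsed cells still have to be type-converted and transposed into columnar arrays, which is the memory-bound part Antoine describes above.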
[jira] [Updated] (ARROW-10326) [Rust] Add missing method docs for Arrays
[ https://issues.apache.org/jira/browse/ARROW-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-10326: - Description: Whenever a PR comes we don't inspect documentation thus some of the methods are missing documentations about what they do. We should regularly check and carefully inspect the explanations if they are adequate or not. This issue is for filling in all missing doc comments. (was: Currently, whenever a PR comes we don't inspect documentation thus some of the methods are missing documentations about what they do. We should regularly check and carefully inspect the explanations if they are adequate or not. This issue is for filling in all missing doc comments.) > [Rust] Add missing method docs for Arrays > - > > Key: ARROW-10326 > URL: https://issues.apache.org/jira/browse/ARROW-10326 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Priority: Major > > Whenever a PR comes we don't inspect documentation thus some of the methods > are missing documentations about what they do. We should regularly check and > carefully inspect the explanations if they are adequate or not. This issue is > for filling in all missing doc comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10326) [Rust] Add missing method docs for Arrays
[ https://issues.apache.org/jira/browse/ARROW-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-10326: - Description: Currently, whenever a PR comes we don't inspect documentation thus some of the methods are missing documentations about what they do. We should regularly check and carefully inspect the explanations if they are adequate or not. This issue is for filling in all missing doc comments. (was: Currently, whenever a PR comes we don't inspect documentation thus some of the methods are missing documentations about what they do. We should regularly check and carefully inspect the explanations that are adequate and not missing. This issue is for filling in all missing doc comments.) > [Rust] Add missing method docs for Arrays > - > > Key: ARROW-10326 > URL: https://issues.apache.org/jira/browse/ARROW-10326 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Priority: Major > > Currently, whenever a PR comes we don't inspect documentation thus some of > the methods are missing documentations about what they do. We should > regularly check and carefully inspect the explanations if they are adequate > or not. This issue is for filling in all missing doc comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10326) [Rust] Add missing method docs for Arrays
Mahmut Bulut created ARROW-10326: Summary: [Rust] Add missing method docs for Arrays Key: ARROW-10326 URL: https://issues.apache.org/jira/browse/ARROW-10326 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Mahmut Bulut Currently, whenever a PR comes we don't inspect documentation thus some of the methods are missing documentations about what they do. We should regularly check and carefully inspect the explanations that are adequate and not missing. This issue is for filling in all missing doc comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model
[ https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215335#comment-17215335 ] Andrew Lamb commented on ARROW-9707: FWIW now that DataFusion uses `async` -- https://github.com/apache/arrow/pull/8285 -- I think the number of threads issue cited in this PR is a non-issue (as DataFusion no longer launches its own threads) > [Rust] [DataFusion] Re-implement threading model > > > Key: ARROW-9707 > URL: https://issues.apache.org/jira/browse/ARROW-9707 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Attachments: image-2020-09-24-22-46-46-959.png > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The current threading model is very simple and does not scale. We currently > use 1-2 dedicated threads per partition and they all run simultaneously, > which is a huge problem if you have more partitions than logical or physical > cores. > This task is to re-implement the threading model so that query execution uses > a fixed (configurable) number of threads. Work will be broken down into > stages and tasks and each in-process executor (running on a dedicated thread) > will process its queue of tasks. > This process will be driven by a scheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005)
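The fixed-thread-pool idea from the issue description can be sketched with Python's standard library. This is a toy model of the scheduling concept (partitions become queued tasks for a fixed, configurable number of workers), not DataFusion's actual scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_partition(partition_id):
    # Stand-in for real partition work (scan, filter, aggregate, ...).
    return sum(range(partition_id * 10, partition_id * 10 + 10))

num_partitions = 64  # can safely exceed the core count
max_threads = 4      # fixed worker pool, configurable

# Unlike 1-2 dedicated threads per partition, only max_threads partitions
# run simultaneously; the rest wait in the executor's task queue.
with ThreadPoolExecutor(max_workers=max_threads) as pool:
    results = list(pool.map(execute_partition, range(num_partitions)))

print(len(results))  # → 64
```

With the move to `async` (PR 8285), DataFusion instead delegates this queueing to the async runtime's worker pool, which is why the thread-explosion concern became moot.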
[jira] [Assigned] (ARROW-10311) [Release] Update crossbow verification process
[ https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-10311: --- Assignee: Krisztian Szucs > [Release] Update crossbow verification process > -- > > Key: ARROW-10311 > URL: https://issues.apache.org/jira/browse/ARROW-10311 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The automatized crossbow RC verification tasks needs to be updated since > multiple builds are failing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10311) [Release] Update crossbow verification process
[ https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-10311. - Resolution: Fixed Issue resolved by pull request 8464 [https://github.com/apache/arrow/pull/8464] > [Release] Update crossbow verification process > -- > > Key: ARROW-10311 > URL: https://issues.apache.org/jira/browse/ARROW-10311 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The automatized crossbow RC verification tasks needs to be updated since > multiple builds are failing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10325) [C++][Compute] Separate aggregate kernel registration
Yibo Cai created ARROW-10325: Summary: [C++][Compute] Separate aggregate kernel registration Key: ARROW-10325 URL: https://issues.apache.org/jira/browse/ARROW-10325 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yibo Cai Assignee: Yibo Cai We have basic aggregate kernels 'count/mean/sum/min_max' implemented in one file (plus a simd version), and the more complicated 'mode' and 'variance/stddev' kernels implemented in separate files. The 'mode' and 'variance/stddev' kernels are currently registered together with the basic kernels in aggregate_basic.cc, and there are 'mode' and 'variance/stddev' kernel related function definitions in aggregate_basic_internal.h. This is not good. They should be moved from the basic kernel source to their own implementation files and registered separately. -- This message was sent by Atlassian Jira (v8.3.4#803005)
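The registration split proposed above follows a common registry pattern: each kernel family registers itself from its own module instead of one central file registering everything. A minimal sketch in Python (names hypothetical, not Arrow's C++ API):

```python
# Hypothetical kernel registry sketch -- illustrates the structure only.
from collections import Counter

registry = {}

def register(name):
    """Decorator each kernel module uses to register its own kernels."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

# "aggregate_basic": the simple kernels register here.
@register("sum")
def agg_sum(values):
    return sum(values)

# "aggregate_mode": registered separately, from its own module,
# instead of being wired up inside aggregate_basic.
@register("mode")
def agg_mode(values):
    return Counter(values).most_common(1)[0][0]

print(registry["sum"]([1, 2, 3]))   # → 6
print(registry["mode"]([1, 2, 2]))  # → 2
```

The benefit is purely organizational: aggregate_basic no longer needs to know about, or include headers for, the more complicated kernels.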
[jira] [Resolved] (ARROW-9898) [C++][Gandiva] Error handling in castINT fails in some environments
[ https://issues.apache.org/jira/browse/ARROW-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar resolved ARROW-9898. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8096 [https://github.com/apache/arrow/pull/8096] > [C++][Gandiva] Error handling in castINT fails in some environments > -- > > Key: ARROW-9898 > URL: https://issues.apache.org/jira/browse/ARROW-9898 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In some environments the error path in castINT leads to a segfault. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9898) [C++][Gandiva] Error handling in castINT fails in some environments
[ https://issues.apache.org/jira/browse/ARROW-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar updated ARROW-9898: - Component/s: C++ - Gandiva > [C++][Gandiva] Error handling in castINT fails in some enviroments > -- > > Key: ARROW-9898 > URL: https://issues.apache.org/jira/browse/ARROW-9898 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > In some environment the error path in castINT leads to segfault. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion
[ https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-10313. Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8470 [https://github.com/apache/arrow/pull/8470] > [C++] Improve UTF8 validation speed and CSV string conversion > - > > Key: ARROW-10313 > URL: https://issues.apache.org/jira/browse/ARROW-10313 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV > string conversion. > This is because we must validate many small UTF8 strings individually. -- This message was sent by Atlassian Jira (v8.3.4#803005)
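The bottleneck described above (validating many small UTF-8 strings one by one) is easy to illustrate in Python. Arrow's C++ validator performs the same check, but vectorized and without per-string overhead:

```python
# Each cell in a string column must be individually validated as UTF-8.
# Pure-Python illustration of the per-string check; the last value is invalid.
values = [b"hello", b"caf\xc3\xa9", b"\xff\xfe"]

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print([is_valid_utf8(v) for v in values])  # → [True, True, False]
```

When a CSV file has millions of short string cells, this per-cell validation dominates conversion time, which is what ARROW-10313 optimizes.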
[jira] [Updated] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10321: --- Labels: pull-request-available (was: ) > [C++] Building AVX512 code when we should not > - > > Key: ARROW-10321 > URL: https://issues.apache.org/jira/browse/ARROW-10321 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: Frank Du >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging > Arrow for an old macOS SDK version, we found what I believe are 2 different > problems: > 1. The check for AVX512 support was returning true when in fact the compiler > did not support it > 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it > was still trying to compile one of the AVX512 files, which failed. I added a > patch that made that file conditional, but there's probably a proper cmake > way to tell it not to compile that file at all > cc [~yibo] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
[ https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash Shah updated ARROW-10324: --- Docs Text: > sessionInfo() R version 3.4.4 (2018-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] stringr_1.4.0 dplyr_1.0.2tictoc_1.0 arrow_1.0.1sparklyr_1.4.0 Description: For the following code snippet {code:java} // code placeholder library(arrow) download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet') read_parquet(file = 'sample.parquet',as_data_frame = TRUE) {code} I get - {code:java} Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded nul in string: '\0 at \0' {code} So, I thought, what if I could read the file as binaries and replace the embedded nul character \0 myself. {code:java} parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) raw <- write_to_raw(parquet,format = "file") print(raw){code} In this case, I get an indecipherable stream of characters and nuls, which makes it very difficult to remove '00' characters that are problematic in the stream. 
{code:java} [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 00 06 00 [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 00 00 00 [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 01 00 00 [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 00 00 00 [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 00 00 01 05 [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 6c 61 6e 67 [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 04 00 {code} Is there a way to handle this while reading Apache parquet? was: For the following code snippet {code:java} // code placeholder library(arrow) download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet') read_parquet(file = 'sample.parquet',as_data_frame = TRUE) {code} I get - {code:java} Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded nul in string: '\0 at \0' {code} | | So, I thought, what if I could read the file as binaries and replace the embedded nul character \0 myself. | {code:java} parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) raw <- write_to_raw(parquet,format = "file") print(raw) {code} | In this case, I get an indecipherable stream of characters and nuls, which makes it very difficult to remove '00' characters that are problematic in the stream. 
| {code:java} [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 00 06 00 [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 00 00 00 [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 01 00 00 [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 00 00 00 [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 00 00 01 05 [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 6c 61 6e 67 [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 04 00 {code} | Is there a way to handle this while reading Apache parquet? Issue Type: Improvement (was: Bug) > function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present. > -- > > Key: ARROW-10324 > URL: https://issues.apache.org/jira/browse/ARROW-10324 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Akash Shah >Priority: Major > > For the following code snippet > {code:java} > // code placeholder > library(arrow) >
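As a generic workaround sketch (plain Python, outside arrow/R): NUL bytes can be stripped from a decoded string payload before further processing. Note this only makes sense for the string *values*; stripping NULs from the Arrow/Parquet byte stream itself would corrupt it, since the hex dump above shows NUL bytes used structurally by the format:

```python
# Hypothetical payload containing embedded NUL bytes, like the
# "embedded nul in string: '\0 at \0'" error above.
raw = b"ARROW1\x00\x00some text \x00 at \x00 the end"

# Replace the embedded NULs so the payload decodes cleanly as a string.
cleaned = raw.replace(b"\x00", b"")
text = cleaned.decode("utf-8")
print(text)  # → "ARROW1some text  at  the end"
```

Whether dropping (rather than, say, replacing with a space) is acceptable depends on the data; the underlying issue is that R strings cannot contain NUL bytes at all, so some substitution has to happen before `as_data_frame = TRUE` materializes them.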
[jira] [Assigned] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Du reassigned ARROW-10321: Assignee: Frank Du > [C++] Building AVX512 code when we should not > - > > Key: ARROW-10321 > URL: https://issues.apache.org/jira/browse/ARROW-10321 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: Frank Du >Priority: Major > > On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging > Arrow for an old macOS SDK version, we found what I believe are 2 different > problems: > 1. The check for AVX512 support was returning true when in fact the compiler > did not support it > 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it > was still trying to compile one of the AVX512 files, which failed. I added a > patch that made that file conditional, but there's probably a proper cmake > way to tell it not to compile that file at all > cc [~yibo] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.
Akash Shah created ARROW-10324: -- Summary: function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present. Key: ARROW-10324 URL: https://issues.apache.org/jira/browse/ARROW-10324 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Akash Shah For the following code snippet {code:java} // code placeholder library(arrow) download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet') read_parquet(file = 'sample.parquet',as_data_frame = TRUE) {code} I get - {code:java} Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded nul in string: '\0 at \0' {code} | | So, I thought, what if I could read the file as binaries and replace the embedded nul character \0 myself. | {code:java} parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) raw <- write_to_raw(parquet,format = "file") print(raw) {code} | In this case, I get an indecipherable stream of characters and nuls, which makes it very difficult to remove '00' characters that are problematic in the stream. | {code:java} [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 00 06 00 [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 00 00 00 [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 01 00 00 [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 00 00 00 [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 00 00 01 05 [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 6c 61 6e 67 [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 04 00 {code} | Is there a way to handle this while reading Apache parquet? -- This message was sent by Atlassian Jira (v8.3.4#803005)