[jira] [Updated] (ARROW-10331) [Rust] [DataFusion] Re-organize errors

2020-10-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10331:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Re-organize errors
> --
>
> Key: ARROW-10331
> URL: https://issues.apache.org/jira/browse/ARROW-10331
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 3.0.0
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> DataFusion's errors do not have much love these days, and I think that they 
> need a lift. For example,
>  * we use "General" very often
>  * the error is called "ExecutionError", even though sometimes it happens 
> during planning
>  * the error "InvalidColumn" is not being used
>  * There is not much documentation about the errors
>  





[jira] [Created] (ARROW-10331) [Rust] [DataFusion] Re-organize errors

2020-10-16 Thread Jira
Jorge Leitão created ARROW-10331:


 Summary: [Rust] [DataFusion] Re-organize errors
 Key: ARROW-10331
 URL: https://issues.apache.org/jira/browse/ARROW-10331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Affects Versions: 3.0.0
Reporter: Jorge Leitão
Assignee: Jorge Leitão


DataFusion's errors do not have much love these days, and I think that they 
need a lift. For example,
 * we use "General" very often
 * the error is called "ExecutionError", even though sometimes it happens 
during planning
 * the error "InvalidColumn" is not being used
 * There is not much documentation about the errors

 





[jira] [Commented] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion

2020-10-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215738#comment-17215738
 ] 

Jorge Leitão commented on ARROW-10330:
--

Good idea. (y)

I moved this to 3.0.0 to not block the 2.0.0 release.


> [Rust][Datafusion] Implement nullif() function for DataFusion
> -
>
> Key: ARROW-10330
> URL: https://issues.apache.org/jira/browse/ARROW-10330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Evan Chan
>Priority: Major
> Fix For: 3.0.0
>
>
> Here is the common definition of the NULLIF() function:
> [https://www.w3schools.com/sql/func_sqlserver_nullif.asp]
>  
> Among other uses, it is used to protect denominators from divide-by-0 errors.
> We have implemented it at UrbanLogiq and would like to contribute this back.





[jira] [Updated] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion

2020-10-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10330:
-
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust][Datafusion] Implement nullif() function for DataFusion
> -
>
> Key: ARROW-10330
> URL: https://issues.apache.org/jira/browse/ARROW-10330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Evan Chan
>Priority: Major
> Fix For: 3.0.0
>
>
> Here is the common definition of the NULLIF() function:
> [https://www.w3schools.com/sql/func_sqlserver_nullif.asp]
>  
> Among other uses, it is used to protect denominators from divide-by-0 errors.
> We have implemented it at UrbanLogiq and would like to contribute this back.





[jira] [Closed] (ARROW-10327) [Rust] [DataFusion] Iterator of futures

2020-10-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão closed ARROW-10327.

Resolution: Won't Fix

As discussed in #8473 and #8480, this is better handled via buffering, to avoid 
memory issues.

> [Rust] [DataFusion] Iterator of futures
> ---
>
> Key: ARROW-10327
> URL: https://issues.apache.org/jira/browse/ARROW-10327
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-10330) [Rust][Datafusion] Implement nullif() function for DataFusion

2020-10-16 Thread Evan Chan (Jira)
Evan Chan created ARROW-10330:
-

 Summary: [Rust][Datafusion] Implement nullif() function for 
DataFusion
 Key: ARROW-10330
 URL: https://issues.apache.org/jira/browse/ARROW-10330
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Evan Chan
 Fix For: 2.0.0


Here is the common definition of the NULLIF() function:

[https://www.w3schools.com/sql/func_sqlserver_nullif.asp]

Among other uses, it is used to protect denominators from divide-by-0 errors.

We have implemented it at UrbanLogiq and would like to contribute this back.
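
For reference, here is a minimal Python sketch of NULLIF's semantics (None standing in for SQL NULL; an illustration only, not the contributed DataFusion implementation):

{code:python}
# NULLIF(a, b): NULL when a equals b, otherwise a.
def nullif(a, b):
    if a is None or b is None:
        return a
    return None if a == b else a

# Typical use: guard a denominator against division by zero.
def safe_ratio(num, den):
    den = nullif(den, 0)
    return None if den is None else num / den

assert nullif(5, 5) is None
assert safe_ratio(10, 0) is None   # NULL instead of a ZeroDivisionError
assert safe_ratio(10, 2) == 5.0
{code}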





[jira] [Updated] (ARROW-10320) [Rust] Convert RecordBatchIterator to a Stream

2020-10-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10320:
-
Summary: [Rust] Convert RecordBatchIterator to a Stream  (was: Convert 
RecordBatchIterator to a Stream)

> [Rust] Convert RecordBatchIterator to a Stream
> --
>
> Key: ARROW-10320
> URL: https://issues.apache.org/jira/browse/ARROW-10320
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> So that the unit of work is a single record batch instead of a part of a 
> partition.





[jira] [Assigned] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2020-10-16 Thread David Sherrier (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Sherrier reassigned ARROW-5409:
-

Assignee: David Sherrier

> [C++] Improvement for IsIn Kernel when right array is small
> ---
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Preeti Suman
>Assignee: David Sherrier
>Priority: Major
> Fix For: 3.0.0
>
>
> The core of the algorithm (as Python) is 
> {code:java}
> for idx, elem in enumerate(array):
>   output[idx] = elem in memo_table
> {code}
>  Often the right operand list will be very small; in this case, the hash table 
> should be replaced with a constant vector. 
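
A minimal Python sketch of the proposed optimization (illustrative only; the real kernel is C++, and the cutoff below is a hypothetical threshold that the benchmarks would determine):

{code:python}
SMALL_SET_CUTOFF = 8  # hypothetical; the real value would come from benchmarks

def is_in(array, values):
    if len(values) <= SMALL_SET_CUTOFF:
        small = tuple(values)                     # "constant vector" path
        return [elem in small for elem in array]  # linear scan per element
    memo_table = set(values)                      # hash-table path
    return [elem in memo_table for elem in array]

print(is_in([1, 2, 3, 4], [2, 4]))  # [False, True, False, True]
{code}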





[jira] [Commented] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2020-10-16 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215627#comment-17215627
 ] 

Wes McKinney commented on ARROW-5409:
-

Please go ahead. We'll need some benchmarks to be written so that we can 
establish a heuristic for choosing which algorithm to use.

> [C++] Improvement for IsIn Kernel when right array is small
> ---
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Preeti Suman
>Priority: Major
> Fix For: 3.0.0
>
>
> The core of the algorithm (as Python) is 
> {code:java}
> for idx, elem in enumerate(array):
>   output[idx] = elem in memo_table
> {code}
>  Often the right operand list will be very small; in this case, the hash table 
> should be replaced with a constant vector. 





[jira] [Updated] (ARROW-10329) [Rust][Datafusion] Datafusion queries involving a column name that begins with a number produce unexpected results

2020-10-16 Thread Morgan Cassels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Morgan Cassels updated ARROW-10329:
---
Summary: [Rust][Datafusion] Datafusion queries involving a column name that 
begins with a number produce unexpected results  (was: Datafusion queries 
involving a column name that begins with a number produce unexpected results)

> [Rust][Datafusion] Datafusion queries involving a column name that begins 
> with a number produce unexpected results
> ---
>
> Key: ARROW-10329
> URL: https://issues.apache.org/jira/browse/ARROW-10329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Morgan Cassels
>Priority: Major
>
> This bug can be worked around by wrapping column names in quotes.
> Example:
> {{let query = "SELECT 16_20mph, 21_25mph FROM foo;";}}
> {{let logical_plan = ctx.create_logical_plan(query)?;}}
> {{logical_plan.schema().fields() now has fields: [_20mph, _25mph]}}
> The resulting table produced by this query looks like:
> ||{{_20mph}}||{{_25mph}}||
> |16|21|
> |16|21|
> Every row is identical, where the column value is equal to the initial number 
> that appears in the column name.





[jira] [Updated] (ARROW-10329) Datafusion queries involving a column name that begins with a number produce unexpected results

2020-10-16 Thread Morgan Cassels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Morgan Cassels updated ARROW-10329:
---
Description: 
This bug can be worked around by wrapping column names in quotes.

Example:

{{let query = "SELECT 16_20mph, 21_25mph FROM foo;";}}

{{let logical_plan = ctx.create_logical_plan(query)?;}}

{{logical_plan.schema().fields() now has fields: [_20mph, _25mph]}}

The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|

Every row is identical, where the column value is equal to the initial number 
that appears in the column name.

  was:
This bug can be worked around by wrapping column names in quotes.

Example:

{{let query = "SELECT 16_20mph, 21_25mph FROM foo;";}}

{{let logical_plan = ctx.create_logical_plan(query)?;}}

{{logical_plan.schema().fields() }}now has fields: {{_20mph, _25mph}}

The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|

Every row is identical, where the column value is equal to the initial number 
that appears in the column name.


> Datafusion queries involving a column name that begins with a number produce 
> unexpected results
> 
>
> Key: ARROW-10329
> URL: https://issues.apache.org/jira/browse/ARROW-10329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Morgan Cassels
>Priority: Major
>
> This bug can be worked around by wrapping column names in quotes.
> Example:
> {{let query = "SELECT 16_20mph, 21_25mph FROM foo;"}}
> {{let logical_plan = ctx.create_logical_plan(query)?;}}
> {{logical_plan.schema().fields() now has fields: [_20mph, _25mph]}}
> The resulting table produced by this query looks like:
> ||{{_20mph}}||{{_25mph}}||
> |16|21|
> |16|21|
> Every row is identical, where the column value is equal to the initial number 
> that appears in the column name.





[jira] [Created] (ARROW-10329) Datafusion queries involving a column name that begins with a number produce unexpected results

2020-10-16 Thread Morgan Cassels (Jira)
Morgan Cassels created ARROW-10329:
--

 Summary: Datafusion queries involving a column name that begins 
with a number produce unexpected results
 Key: ARROW-10329
 URL: https://issues.apache.org/jira/browse/ARROW-10329
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Morgan Cassels


This bug can be worked around by wrapping column names in quotes.

Example:

{{let query = "SELECT 16_20mph, 21_25mph FROM foo;";}}

{{let logical_plan = ctx.create_logical_plan(query)?;}}

{{logical_plan.schema().fields()}} now has fields: {{_20mph, _25mph}}

The resulting table produced by this query looks like:
||{{_20mph}}||{{_25mph}}||
|16|21|
|16|21|

Every row is identical, where the column value is equal to the initial number 
that appears in the column name.
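
The observed output is consistent with the SQL tokenizer splitting {{16_20mph}} into the numeric literal {{16}} followed by the identifier {{_20mph}}, so the projection behaves like {{SELECT 16 AS _20mph}} (a hedged reading of the report, not a confirmed diagnosis). A small illustration of the quoting workaround ({{foo}} is the reporter's placeholder table):

{code:python}
# Unquoted, a leading digit makes the name parse as <number><identifier>,
# i.e. a constant with an alias -- hence the identical rows above.
broken = 'SELECT 16_20mph, 21_25mph FROM foo;'

# Workaround: quote identifiers that begin with a digit.
fixed = 'SELECT "16_20mph", "21_25mph" FROM foo;'
{code}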





[jira] [Updated] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-16 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10321:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Resolved] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-16 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10321.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8478
[https://github.com/apache/arrow/pull/8478]

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215519#comment-17215519
 ] 

Antoine Pitrou commented on ARROW-10308:


> Antoine, do you think this is a good idea? Do you have input on what csv 
> compositions are found in the wild?

Yes, that sounds like a very good idea. Instead of generating data, I think 
it's better to use actual data. You can find a variety of real-world datasets 
here:
 [https://github.com/awslabs/open-data-registry]

A commonly used dataset for demonstration and benchmarking purposes is the New 
York taxi dataset:
 [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page]

You may also find datasets of Twitter messages, which would be more text-heavy 
and therefore would stress the CSV reader a bit differently.

Generally, for multi-thread benchmarking, you want files that are at least 1GB 
long. It may be possible to take a smaller file and replicate its contents a 
number of times to reach the desired size, though.
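
A minimal sketch of that replication step (file names are placeholders): repeat a small CSV's data rows under a single header line until the target size is reached.

{code:python}
TARGET_BYTES = 1 << 30  # ~1 GiB

with open("small.csv", "rb") as f:
    header = f.readline()
    body = f.read()

with open("big.csv", "wb") as out:
    out.write(header)
    written = len(header)
    while written < TARGET_BYTES:
        out.write(body)
        written += len(body)
{code}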

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215483#comment-17215483
 ] 

Dror Speiser commented on ARROW-10308:
--

Yeah, Azure doesn't tell me how many physical cores are at my disposal, which 
makes it hard to compare between setups. But even if it's 12 CPUs with 
hyperthreading and bad advertising, there is still a gap to be explained 
between single-thread and multi-thread performance.

I offer to work on a benchmark that measures reading CSVs of different sizes 
and compositions, for a variety of block sizes, run it on a few different 
machine sizes on AWS (tiny to xlarge) and Azure, and report the results here.

Antoine, do you think this is a good idea? Do you have input on what CSV 
compositions are found in the wild? You said that narrow columns are common; 
how would you quantify this? Personally, I work with finance and real estate 
data; I can create "data profiles" for what I see in my own workloads and 
share them.
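
One such measurement could be as small as the following sketch ({{foo.csv}} is a placeholder; the block size under test is set through {{pyarrow.csv.ReadOptions}}):

{code:python}
import os
import time
import pyarrow.csv as pacsv

opts = pacsv.ReadOptions(block_size=1 << 20)  # 1 MiB, as in the attached profiles
start = time.perf_counter()
table = pacsv.read_csv("foo.csv", read_options=opts)
elapsed = time.perf_counter() - start
print(f"{os.path.getsize('foo.csv') / elapsed / 2**30:.2f} GiB/s")
{code}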

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Updated] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion

2020-10-16 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10313:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Improve UTF8 validation speed and CSV string conversion
> -
>
> Key: ARROW-10313
> URL: https://issues.apache.org/jira/browse/ARROW-10313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV 
> string conversion.
> This is because we must validate many small UTF8 strings individually.
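
A toy Python illustration of the claim (not Arrow's code): validating a million short strings one by one costs noticeably more than validating the same bytes in one pass.

{code:python}
import timeit

values = [b"abc"] * 1_000_000

per_value = timeit.timeit(lambda: [v.decode("utf-8") for v in values], number=1)
one_pass = timeit.timeit(lambda: b"".join(values).decode("utf-8"), number=1)
print(f"per-value: {per_value:.3f}s  single pass: {one_pass:.3f}s")
{code}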





[jira] [Closed] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.

2020-10-16 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-10324.
---
  Assignee: Neal Richardson
Resolution: Duplicate

> function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present. 
> --
>
> Key: ARROW-10324
> URL: https://issues.apache.org/jira/browse/ARROW-10324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Akash Shah
>Assignee: Neal Richardson
>Priority: Major
>
> For the following code snippet
> {code:java}
> library(arrow)
> download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet')
> read_parquet(file = 'sample.parquet',as_data_frame = TRUE)
> {code}
> I get -
>  
> {code:java}
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
> embedded nul in string: '\0 at \0'
> {code}
>  
> So, I thought, what if I could read the file as binaries and replace the 
> embedded nul character \0 myself.
>  
> {code:java}
> parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) 
> raw <- write_to_raw(parquet,format = "file")
> print(raw){code}
>  
> In this case, I get an indecipherable stream of characters and nuls, which 
> makes it very difficult to remove '00' characters that are problematic in the 
> stream.
>  
> {code:java}
> [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 
> 0c 00 06 00 
> [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 
> 08 00 00 00 
> [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 
> a4 01 00 00 
> [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 
> 34 00 00 00 
> [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 
> 00 00 00 00 
> [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 
> 00 00 01 05 
> [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 
> 6c 61 6e 67 
> [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 
> 04 00
> {code}
>  
> Is there a way to handle this while reading Apache parquet?





[jira] [Commented] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.

2020-10-16 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215444#comment-17215444
 ] 

Neal Richardson commented on ARROW-10324:
-

This is the same as ARROW-6582. We're working on a proper solution but don't 
have one yet. Two things to note:

1. In the upcoming release, it won't error anymore; it will truncate the string 
at the nul. Arguably that's worse, because you won't know you have a problem.
2. I think you can work around this by reading with {{as_data_frame = FALSE}} 
as you have done, and then casting the offending column(s) to {{binary()}} before 
bringing the data into R. That will give you a list of raw vectors, and you 
should be able to filter out the {{00}}s and then call {{rawToChar()}} on them 
(assuming what you want is to drop the nuls). 
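
The same idea, sketched with pyarrow in Python rather than R (the column name comes from the strings visible in the raw dump above and is assumed; None entries pass through):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("sample.parquet")             # no data-frame conversion
col = table.column("warc_langs").cast(pa.binary())  # offending column as binary

# strip embedded NUL bytes, then decode back to strings
cleaned = [
    v if v is None else v.replace(b"\x00", b"").decode("utf-8")
    for v in col.to_pylist()
]
{code}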

> function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present. 
> --
>
> Key: ARROW-10324
> URL: https://issues.apache.org/jira/browse/ARROW-10324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Akash Shah
>Priority: Major
>
> For the following code snippet
> {code:java}
> library(arrow)
> download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet')
> read_parquet(file = 'sample.parquet',as_data_frame = TRUE)
> {code}
> I get -
>  
> {code:java}
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
> embedded nul in string: '\0 at \0'
> {code}
>  
> So, I thought, what if I could read the file as binaries and replace the 
> embedded nul character \0 myself.
>  
> {code:java}
> parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) 
> raw <- write_to_raw(parquet,format = "file")
> print(raw){code}
>  
> In this case, I get an indecipherable stream of characters and nuls, which 
> makes it very difficult to remove '00' characters that are problematic in the 
> stream.
>  
> {code:java}
> [1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 
> 0c 00 06 00 
> [29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 
> 08 00 00 00 
> [57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 
> a4 01 00 00 
> [85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 
> 34 00 00 00 
> [113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 
> 00 00 00 00 
> [141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 
> 00 00 01 05 
> [169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 
> 6c 61 6e 67 
> [197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 
> 04 00
> {code}
>  
> Is there a way to handle this while reading Apache parquet?





[jira] [Updated] (ARROW-10328) [C++] Consider using fast-double-parser

2020-10-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10328:
---
Description: 
We use Google's double-conversion library for parsing strings to doubles. We 
should consider using this library, which is more than 2x faster.
https://github.com/lemire/fast_double_parser

Parsing doubles is important for CSV performance.

  was:
We use Google's double-conversion library for parsing strings to doubles. We 
should consider using this library, which is more than 2x faster.

Parsing doubles is important for CSV performance.


> [C++] Consider using fast-double-parser
> ---
>
> Key: ARROW-10328
> URL: https://issues.apache.org/jira/browse/ARROW-10328
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>
> We use Google's double-conversion library for parsing strings to doubles. We 
> should consider using this library, which is more than 2x faster.
> https://github.com/lemire/fast_double_parser
> Parsing doubles is important for CSV performance.





[jira] [Created] (ARROW-10328) [C++] Consider using fast-double-parser

2020-10-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10328:
--

 Summary: [C++] Consider using fast-double-parser
 Key: ARROW-10328
 URL: https://issues.apache.org/jira/browse/ARROW-10328
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0


We use Google's double-conversion library for parsing strings to doubles. We 
should consider using this library, which is more than 2x faster.

Parsing doubles is important for CSV performance.





[jira] [Updated] (ARROW-10327) [Rust] [DataFusion] Iterator of futures

2020-10-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10327:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Iterator of futures
> ---
>
> Key: ARROW-10327
> URL: https://issues.apache.org/jira/browse/ARROW-10327
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215407#comment-17215407
 ] 

Antoine Pitrou commented on ARROW-10308:


For the record, on a 12-core 24-thread CPU, I get between 8x and 10x scaling 
from single-core to multi-core. This is far from linear scaling, but not 
horrific either.

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Created] (ARROW-10327) [Rust] [DataFusion] Iterator of futures

2020-10-16 Thread Jira
Jorge Leitão created ARROW-10327:


 Summary: [Rust] [DataFusion] Iterator of futures
 Key: ARROW-10327
 URL: https://issues.apache.org/jira/browse/ARROW-10327
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Commented] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-16 Thread Kirill Lykov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215400#comment-17215400
 ] 

Kirill Lykov commented on ARROW-10197:
--

To simplify navigation, the PR is here: https://github.com/apache/arrow/pull/8461

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Assignee: Kirill Lykov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Looks like there is no way to execute an expression on filtered data in 
> Python. 
>  Basically, I cannot pass `SelectionVector` to projector's `evaluate` method:
> ```python
> import pyarrow as pa
> import pyarrow.gandiva as gandiva
>
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                               pa.array([5., 45., 36., 73., 83., 23., 76.])],
>                              ['a', 'b'])
>
> builder = gandiva.TreeExprBuilder()
> node_a = builder.make_field(table.schema.field("a"))
> node_b = builder.make_field(table.schema.field("b"))
> fifty = builder.make_literal(50.0, pa.float64())
> eleven = builder.make_literal(11.0, pa.float64())
>
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
> cond_2 = builder.make_function("greater_than", [node_a, node_b], pa.bool_())
> cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
> cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
> condition = builder.make_condition(cond)
>
> filter = gandiva.make_filter(table.schema, condition)
> # filterResult has type SelectionVector
> filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
> print(filterResult)
>
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
> field_result = pa.field("c", pa.float64())
> expr = builder.make_expression(sum, field_result)
> projector = gandiva.make_projector(
>     table.schema, [expr], pa.default_memory_pool())
>
> # Problem: I don't know how to use filterResult with the projector here
> r, = projector.evaluate(table.to_batches()[0], filterResult)
> ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]





[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215401#comment-17215401
 ] 

Antoine Pitrou commented on ARROW-10308:


"vcpu" doesn't mean anything precise unfortunately. What is the CPU model and 
how many *physical* cores are allocated to the virtual machine?

> I am familiar with the simdjson library that claims to parse json files at 
> over 2 GiB/s, on a single core

It all depends on what "parsing" entails, what data it is tested on, and what is 
done with the data once parsed.

On our internal micro-benchmarks, the Arrow CSV parser runs at around 600 MB/s 
(on a single core), but that's data-dependent. I tend to test on data with 
narrow column values since that's what "big data" often looks like, and that's 
the most difficult case for a CSV parser. It's possible that better speeds can 
be achieved on larger column values (such as large binary strings).

But parsing isn't sufficient: you then have to convert the data to Arrow 
format, which also means switching from a row-oriented format to a 
column-oriented one. That part probably hits the memory and cache subsystem 
quite hard.

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215396#comment-17215396
 ] 

Wes McKinney commented on ARROW-10308:
--

I do think we should be doing better here than we are, so it merits some 
analysis to see if some default options should change. The results do strike me 
as peculiar.

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-16 Thread Dror Speiser (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215393#comment-17215393
 ] 

Dror Speiser commented on ARROW-10308:
--

Thanks for the suggestions :) I am indeed getting the files from a third party, 
and I'm converting them to Parquet on arrival using Arrow. I'm actually content 
with 0.5 GiB/s. I'm here because I saw a tweet by Wes McKinney saying that the 
CSV parser in Arrow is "extremely fast". I tweeted back my results and he 
suggested that I open an issue.

I would like to note that the numbers don't quite add up. If the CPU usage is 
totally accounted for by the operations of parsing and building arrays, then 
that would mean that a single processor is doing between 0.06 and 0.13 GiB/s, 
which is very slow.

When I run the benchmark without threads I get 0.3 GiB/s, which is reasonable 
for a single processor. But it also means that the 48 vcpus I have are very far 
from achieving a linear speedup, which is in line with my profiling (though 
the attached images are for a block size of 1 MB). Do you see a linear speedup 
on your machine?

As for processing CSVs being costly in general, I'm not familiar enough with 
other libraries to say, but I am familiar with the simdjson library, which 
claims to parse JSON files at over 2 GiB/s on a single core. I'm looking at the 
code of both projects, hoping I'll be able to contribute something from 
simdjson to the CSV parser in Arrow.

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Updated] (ARROW-10326) [Rust] Add missing method docs for Arrays

2020-10-16 Thread Mahmut Bulut (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahmut Bulut updated ARROW-10326:
-
Description: Whenever a PR comes we don't inspect documentation thus some 
of the methods are missing documentations about what they do. We should 
regularly check and carefully inspect the explanations if they are adequate or 
not. This issue is for filling in all missing doc comments.  (was: Currently, 
whenever a PR comes we don't inspect documentation thus some of the methods are 
missing documentations about what they do. We should regularly check and 
carefully inspect the explanations if they are adequate or not. This issue is 
for filling in all missing doc comments.)

> [Rust] Add missing method docs for Arrays
> -
>
> Key: ARROW-10326
> URL: https://issues.apache.org/jira/browse/ARROW-10326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Priority: Major
>
> Whenever a PR comes in, we don't inspect documentation, so some of the methods 
> are missing documentation about what they do. We should regularly check and 
> carefully inspect whether the explanations are adequate. This issue is 
> for filling in all missing doc comments.





[jira] [Updated] (ARROW-10326) [Rust] Add missing method docs for Arrays

2020-10-16 Thread Mahmut Bulut (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahmut Bulut updated ARROW-10326:
-
Description: Currently, whenever a PR comes we don't inspect documentation 
thus some of the methods are missing documentations about what they do. We 
should regularly check and carefully inspect the explanations if they are 
adequate or not. This issue is for filling in all missing doc comments.  (was: 
Currently, whenever a PR comes we don't inspect documentation thus some of the 
methods are missing documentations about what they do. We should regularly 
check and carefully inspect the explanations that are adequate and not missing. 
This issue is for filling in all missing doc comments.)

> [Rust] Add missing method docs for Arrays
> -
>
> Key: ARROW-10326
> URL: https://issues.apache.org/jira/browse/ARROW-10326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Priority: Major
>
> Currently, whenever a PR comes in, we don't inspect documentation, so some of 
> the methods are missing documentation about what they do. We should 
> regularly check and carefully inspect whether the explanations are adequate. 
> This issue is for filling in all missing doc comments.





[jira] [Created] (ARROW-10326) [Rust] Add missing method docs for Arrays

2020-10-16 Thread Mahmut Bulut (Jira)
Mahmut Bulut created ARROW-10326:


 Summary: [Rust] Add missing method docs for Arrays
 Key: ARROW-10326
 URL: https://issues.apache.org/jira/browse/ARROW-10326
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Mahmut Bulut


Currently, whenever a PR comes in, we don't inspect documentation, so some of the 
methods are missing documentation about what they do. We should regularly 
check and carefully inspect whether the explanations are adequate and none are 
missing. This issue is for filling in all missing doc comments.





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-10-16 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215335#comment-17215335
 ] 

Andrew Lamb commented on ARROW-9707:


FWIW, now that DataFusion uses `async` -- 
https://github.com/apache/arrow/pull/8285 -- I think the number-of-threads 
issue cited in this PR is a non-issue (as DataFusion no longer launches its own 
threads).

> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
> Attachments: image-2020-09-24-22-46-46-959.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.
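
For context, the fixed-pool idea from the description, as an illustrative Python sketch (DataFusion itself is Rust, and per the comment above it now relies on async rather than launching its own threads):

{code:python}
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 4  # fixed and configurable, independent of the partition count

def run_partition_task(partition_id):
    # stand-in for executing one partition of a query stage
    return f"partition {partition_id} done"

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    results = list(pool.map(run_partition_task, range(16)))
print(len(results))  # 16 tasks executed on 4 threads
{code}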





[jira] [Assigned] (ARROW-10311) [Release] Update crossbow verification process

2020-10-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-10311:
---

Assignee: Krisztian Szucs

> [Release] Update crossbow verification process
> --
>
> Key: ARROW-10311
> URL: https://issues.apache.org/jira/browse/ARROW-10311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The automated crossbow RC verification tasks need to be updated since 
> multiple builds are failing.





[jira] [Resolved] (ARROW-10311) [Release] Update crossbow verification process

2020-10-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10311.
-
Resolution: Fixed

Issue resolved by pull request 8464
[https://github.com/apache/arrow/pull/8464]

> [Release] Update crossbow verification process
> --
>
> Key: ARROW-10311
> URL: https://issues.apache.org/jira/browse/ARROW-10311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The automated crossbow RC verification tasks need to be updated since 
> multiple builds are failing.





[jira] [Created] (ARROW-10325) [C++][Compute] Separate aggregate kernel registration

2020-10-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-10325:


 Summary: [C++][Compute] Separate aggregate kernel registration
 Key: ARROW-10325
 URL: https://issues.apache.org/jira/browse/ARROW-10325
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


We have the basic aggregate kernels 'count/mean/sum/min_max' implemented in one 
file (plus a SIMD version) and the more complicated 'mode' and 'variance/stddev' 
kernels implemented in separate files.
The 'mode' and 'variance/stddev' kernels are currently registered together with 
the basic kernels in aggregate_basic.cc, and there are 'mode' and 
'variance/stddev' kernel-related function definitions in 
aggregate_basic_internal.h.
This is not good. They should be moved from the basic kernel source to their own 
implementation files and registered separately.





[jira] [Resolved] (ARROW-9898) [C++][Gandiva] Error handling in castINT fails in some environments

2020-10-16 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-9898.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8096
[https://github.com/apache/arrow/pull/8096]

> [C++][Gandiva] Error handling in castINT fails in some environments
> --
>
> Key: ARROW-9898
> URL: https://issues.apache.org/jira/browse/ARROW-9898
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In some environments, the error path in castINT leads to a segfault.





[jira] [Updated] (ARROW-9898) [C++][Gandiva] Error handling in castINT fails in some environments

2020-10-16 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar updated ARROW-9898:
-
Component/s: C++ - Gandiva

> [C++][Gandiva] Error handling in castINT fails in some environments
> --
>
> Key: ARROW-9898
> URL: https://issues.apache.org/jira/browse/ARROW-9898
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In some environments, the error path in castINT leads to a segfault.





[jira] [Resolved] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion

2020-10-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10313.

Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8470
[https://github.com/apache/arrow/pull/8470]

> [C++] Improve UTF8 validation speed and CSV string conversion
> -
>
> Key: ARROW-10313
> URL: https://issues.apache.org/jira/browse/ARROW-10313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV 
> string conversion.
> This is because we must validate many small UTF8 strings individually.





[jira] [Updated] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10321:
---
Labels: pull-request-available  (was: )

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Updated] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls present.

2020-10-16 Thread Akash Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akash Shah updated ARROW-10324:
---
  Docs Text: 
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C   LC_TIME=en_US.UTF-8  
 
 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C  LC_ADDRESS=C 
 
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C  
 

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] stringr_1.4.0  dplyr_1.0.2tictoc_1.0 arrow_1.0.1sparklyr_1.4.0
Description: 
For the following code snippet
{code:java}
library(arrow)

download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet')

read_parquet(file = 'sample.parquet',as_data_frame = TRUE)

{code}
I get -

 
{code:java}
Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded 
nul in string: '\0 at \0'
{code}
 

So I thought: what if I could read the file as raw binary and replace the 
embedded nul character \0 myself?

 
{code:java}
parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) 
raw <- write_to_raw(parquet,format = "file")
print(raw){code}
 

In this case, I get an indecipherable stream of bytes and nuls, which makes it 
very difficult to locate and remove the problematic 0x00 bytes in the stream.

 
{code:java}
[1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 
00 06 00 
[29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 
00 00 00 
[57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 
01 00 00 
[85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 
00 00 00 
[113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 
00 00 00 00 
[141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 
00 00 01 05 
[169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 
6c 61 6e 67 
[197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 
04 00
{code}
 
Is there a way to handle this while reading Apache Parquet?

  was:
For the following code snippet
{code:java}
// code placeholder
library(arrow)

download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet')

read_parquet(file = 'sample.parquet',as_data_frame = TRUE)

{code}
I get -

 
{code:java}
Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded 
nul in string: '\0 at \0'
{code}
| |

 

So, I thought, what if I could read the file as binaries and replace the 
embedded nul character \0 myself.

 
| {code:java}
parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) 
raw <- write_to_raw(parquet,format = "file")
print(raw)
{code}
 |

 

In this case, I get an indecipherable stream of characters and nuls, which 
makes it very difficult to remove '00' characters that are problematic in the 
stream.

 
| {code:java}
[1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 
00 06 00 
[29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 
00 00 00 
[57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 
01 00 00 
[85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 
00 00 00 
[113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 
00 00 00 00 
[141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 
00 00 01 05 
[169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 
6c 61 6e 67 
[197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 
04 00
{code}
 |

 

Is there a way to handle this while reading Apache parquet?

 Issue Type: Improvement  (was: Bug)

> function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls are present. 
> --
>
> Key: ARROW-10324
> URL: https://issues.apache.org/jira/browse/ARROW-10324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Akash Shah
>Priority: Major
>
> For the following code snippet
> {code:java}
> library(arrow)
> 

[jira] [Assigned] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-16 Thread Frank Du (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Du reassigned ARROW-10321:


Assignee: Frank Du

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Frank Du
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it.
> 2. Even when we manually set the runtime SIMD level to less than AVX512, the 
> build still tried to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there is probably a proper CMake 
> way to avoid compiling that file at all.
> cc [~yibo] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10324) function read_parquet(*,as_data_frame=TRUE) fails when embedded nuls are present.

2020-10-16 Thread Akash Shah (Jira)
Akash Shah created ARROW-10324:
--

 Summary: function read_parquet(*,as_data_frame=TRUE) fails when 
embedded nuls are present. 
 Key: ARROW-10324
 URL: https://issues.apache.org/jira/browse/ARROW-10324
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Akash Shah


For the following code snippet
{code:java}
library(arrow)

download.file('https://github.com/akashshah59/embedded_nul_parquet/raw/main/CC-MAIN-20200702045758-20200702075758-7.parquet','sample.parquet')

read_parquet(file = 'sample.parquet',as_data_frame = TRUE)

{code}
I get -

 
{code:java}
Error in Table__to_dataframe(x, use_threads = option_use_threads()) : embedded 
nul in string: '\0 at \0'
{code}

 

So I thought: what if I could read the file as raw binary and replace the 
embedded nul character \0 myself?

 
{code:java}
parquet <- read_parquet(file = 'sample.parquet',as_data_frame = FALSE) 
raw <- write_to_raw(parquet,format = "file")
print(raw)
{code}

 

In this case, I get an indecipherable stream of bytes and nuls, which makes it 
very difficult to locate and remove the problematic 0x00 bytes in the stream.

 
{code:java}
[1] 41 52 52 4f 57 31 00 00 ff ff ff ff d0 02 00 00 10 00 00 00 00 00 0a 00 0c 
00 06 00 
[29] 05 00 08 00 0a 00 00 00 00 01 04 00 0c 00 00 00 08 00 08 00 00 00 04 00 08 
00 00 00 
[57] 04 00 00 00 0d 00 00 00 70 02 00 00 38 02 00 00 10 02 00 00 d0 01 00 00 a4 
01 00 00 
[85] 74 01 00 00 34 01 00 00 04 01 00 00 cc 00 00 00 9c 00 00 00 64 00 00 00 34 
00 00 00 
[113] 04 00 00 00 d4 fd ff ff 00 00 01 05 14 00 00 00 0c 00 00 00 04 00 00 00 
00 00 00 00 
[141] c4 fd ff ff 0a 00 00 00 77 61 72 63 5f 6c 61 6e 67 73 00 00 00 fe ff ff 
00 00 01 05 
[169] 14 00 00 00 0c 00 00 00 04 00 00 00 00 00 00 00 f0 fd ff ff 0b 00 00 00 
6c 61 6e 67 
[197] 5f 64 65 74 65 63 74 00 2c fe ff ff 00 00 01 03 18 00 00 00 0c 00 00 00 
04 00
{code}

 

Is there a way to handle this while reading Apache Parquet?
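
One possible direction, sketched here in C++ against the Arrow API rather than 
the R package: read the offending column as binary (so no UTF8 conversion 
applies), strip the embedded 0x00 bytes, and rebuild a string array before 
handing the data to R. ScrubNuls is a hypothetical helper written for 
illustration, not an existing Arrow or arrow-R function.

{code:cpp}
#include <arrow/api.h>

#include <memory>
#include <string>

// Build a clean utf8 array from a binary column by dropping embedded
// NUL bytes from each value.
arrow::Result<std::shared_ptr<arrow::Array>> ScrubNuls(
    const arrow::BinaryArray& in) {
  arrow::StringBuilder builder;
  for (int64_t i = 0; i < in.length(); ++i) {
    if (in.IsNull(i)) {
      ARROW_RETURN_NOT_OK(builder.AppendNull());
      continue;
    }
    std::string value = in.GetString(i);
    std::string clean;
    clean.reserve(value.size());
    for (char c : value) {
      if (c != '\0') clean.push_back(c);  // drop embedded NULs
    }
    ARROW_RETURN_NOT_OK(builder.Append(clean));
  }
  return builder.Finish();
}
{code}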



--
This message was sent by Atlassian Jira
(v8.3.4#803005)