[jira] [Comment Edited] (ARROW-9915) [Java] getObject API for temporal types is inconsistent and in some cases incorrect

2020-09-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191169#comment-17191169
 ] 

Micah Kornfield edited comment on ARROW-9915 at 9/6/20, 12:23 AM:
--

[~mjadczak_gsa] thank you for the detailed analysis.  The last time this was 
[discussed on the mailing 
list|https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E],
 it was decided not to make these changes (but possibly add utility methods).  
we can potentially discuss again.


was (Author: emkornfi...@gmail.com):
[~mjadczak_gsa] thank you for the detailed analysis.  The last time this was 
[discussed on the mailing 
list|[https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E]],
 it was decided not to make these changes (but possibly add utility methods).  
we can potentially discuss again.

> [Java] getObject API for temporal types is inconsistent and in some cases 
> incorrect
> ---
>
> Key: ARROW-9915
> URL: https://issues.apache.org/jira/browse/ARROW-9915
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.16.0, 0.17.0, 
> 0.17.1, 1.0.0
>Reporter: Matt Jadczak
>Priority: Major
>
> It seems that the work which has been tracked in ARROW-2015 and merged in 
> [https://github.com/apache/arrow/pull/2966] to change the return types of the 
> various Time and Date vector types when using the getObject API missed some 
> of the vector types which are temporal and so should return a temporal type, 
> and provided an incorrect implementation for others (some of this was pointed 
> out in the initial PR review, but it seems that it slipped through the cracks 
> and was not addressed before merging).
> Here is a table of the various temporal vector types, what they currently 
> return from getObject, and what they should return, in my opinion (I have 
> included ones in which the implementation is correct for completeness, and 
> coloured them green).
>  
>  
> ||Vector class||Current return type||Proposed return type||Comments||
> |DateDayVector|Integer|LocalDate|Currently returns the raw value of days 
> since epoch, should return the actual date|
> |DateMilliVector|LocalDateTime|LocalDate|This type is supposed to encode a 
> date, not a datetime, so even though epoch millis are used, the result should 
> be a LocalDate|
> |{color:#00875a}DurationVector{color}|{color:#00875a}Duration{color}|{color:#00875a}Duration{color}|{color:#00875a}Correct.{color}|
> |IntervalDayVector|Duration|Period|As per 
> [https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251] , 
> Interval should be a calendar-based datatype, not a time-based one. This is 
> represented in Java by a Period type. However, I note that the Java Period 
> class does not support milliseconds, unlike the Arrow type, which might be 
> why Duration is being returned. Some discussion may be needed on the best way 
> to deal with this.|
> |{color:#00875a}IntervalYearVector{color}|{color:#00875a}Period{color}|{color:#00875a}Period{color}|{color:#00875a}Correct.{color}|
> |TimeMicroVector|Long|LocalTime|Currently returns the raw number of micros, 
> should return the actual time|
> |TimeMilliVector|LocalDateTime|LocalTime|Currently returns a datetime on 
> 1970-01-01 with the correct time component, should just return the time|
> |TimeNanoVector|Long|LocalTime|Currently returns the raw number of nanos, 
> should return the actual time|
> |TimeSecVector|Integer|LocalTime|Currently returns the raw number of seconds, 
> should return the actual time|
> |{color:#00875a}TimeStampMicroVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampMilliVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampNanoVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampSecVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |TimeStampMicroTZVector|Long|ZonedDateTime|Currently returns the underlying 
> micros, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampMilliTZVector|Long|ZonedDateTime|Currently returns the underlying 
> millis, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampNanoTZVector|Long|ZonedDateTime|Currently returns the underlying 
> nanos, and TZ has to be 

[jira] [Commented] (ARROW-9915) [Java] getObject API for temporal types is inconsistent and in some cases incorrect

2020-09-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191169#comment-17191169
 ] 

Micah Kornfield commented on ARROW-9915:


[~mjadczak_gsa] thank you for the detailed analysis.  The last time this was 
[discussed on the mailing 
list|[https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E]],
 it was decided not to make these changes (but possibly add utility methods).  
we can potentially discuss again.

> [Java] getObject API for temporal types is inconsistent and in some cases 
> incorrect
> ---
>
> Key: ARROW-9915
> URL: https://issues.apache.org/jira/browse/ARROW-9915
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.16.0, 0.17.0, 
> 0.17.1, 1.0.0
>Reporter: Matt Jadczak
>Priority: Major
>
> It seems that the work which has been tracked in ARROW-2015 and merged in 
> [https://github.com/apache/arrow/pull/2966] to change the return types of the 
> various Time and Date vector types when using the getObject API missed some 
> of the vector types which are temporal and so should return a temporal type, 
> and provided an incorrect implementation for others (some of this was pointed 
> out in the initial PR review, but it seems that it slipped through the cracks 
> and was not addressed before merging).
> Here is a table of the various temporal vector types, what they currently 
> return from getObject, and what they should return, in my opinion (I have 
> included ones in which the implementation is correct for completeness, and 
> coloured them green).
>  
>  
> ||Vector class||Current return type||Proposed return type||Comments||
> |DateDayVector|Integer|LocalDate|Currently returns the raw value of days 
> since epoch, should return the actual date|
> |DateMilliVector|LocalDateTime|LocalDate|This type is supposed to encode a 
> date, not a datetime, so even though epoch millis are used, the result should 
> be a LocalDate|
> |{color:#00875a}DurationVector{color}|{color:#00875a}Duration{color}|{color:#00875a}Duration{color}|{color:#00875a}Correct.{color}|
> |IntervalDayVector|Duration|Period|As per 
> [https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251] , 
> Interval should be a calendar-based datatype, not a time-based one. This is 
> represented in Java by a Period type. However, I note that the Java Period 
> class does not support milliseconds, unlike the Arrow type, which might be 
> why Duration is being returned. Some discussion may be needed on the best way 
> to deal with this.|
> |{color:#00875a}IntervalYearVector{color}|{color:#00875a}Period{color}|{color:#00875a}Period{color}|{color:#00875a}Correct.{color}|
> |TimeMicroVector|Long|LocalTime|Currently returns the raw number of micros, 
> should return the actual time|
> |TimeMilliVector|LocalDateTime|LocalTime|Currently returns a datetime on 
> 1970-01-01 with the correct time component, should just return the time|
> |TimeNanoVector|Long|LocalTime|Currently returns the raw number of nanos, 
> should return the actual time|
> |TimeSecVector|Integer|LocalTime|Currently returns the raw number of seconds, 
> should return the actual time|
> |{color:#00875a}TimeStampMicroVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampMilliVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampNanoVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampSecVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |TimeStampMicroTZVector|Long|ZonedDateTime|Currently returns the underlying 
> micros, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampMilliTZVector|Long|ZonedDateTime|Currently returns the underlying 
> millis, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampNanoTZVector|Long|ZonedDateTime|Currently returns the underlying 
> nanos, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampSecTZVector|Long|ZonedDateTime|Currently returns the underlying 
> seconds, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> I am happy to submit a PR to fix this if there is no other reason for this to 
> persist other than these types being rarely used, but note that these would 
> all be breaking API changes.

[jira] [Updated] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray

2020-09-05 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-9921:
-
Issue Type: Improvement  (was: Bug)

> [Rust] Add `from(Vec>)` to [Large]StringArray
> --
>
> Key: ARROW-9921
> URL: https://issues.apache.org/jira/browse/ARROW-9921
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> and deprecate TryFrom, that currently uses a builder.
> This should have some performance improvement, and simplifies the interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray

2020-09-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9921:
--
Labels: pull-request-available  (was: )

> [Rust] Add `from(Vec>)` to [Large]StringArray
> --
>
> Key: ARROW-9921
> URL: https://issues.apache.org/jira/browse/ARROW-9921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> and deprecate TryFrom, that currently uses a builder.
> This should have some performance improvement, and simplifies the interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray

2020-09-05 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9921:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] Add `from(Vec>)` to [Large]StringArray
> --
>
> Key: ARROW-9921
> URL: https://issues.apache.org/jira/browse/ARROW-9921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> and deprecate TryFrom, that currently uses a builder.
> This should have some performance improvement, and simplifies the interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray

2020-09-05 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9921:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] Add `from(Vec>)` to [Large]StringArray
> --
>
> Key: ARROW-9921
> URL: https://issues.apache.org/jira/browse/ARROW-9921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> and deprecate TryFrom, that currently uses a builder.
> This should have some performance improvement, and simplifies the interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9921:


 Summary: [Rust] Add `from(Vec>)` to [Large]StringArray
 Key: ARROW-9921
 URL: https://issues.apache.org/jira/browse/ARROW-9921
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


and deprecate TryFrom, that currently uses a builder.

This should have some performance improvement, and simplifies the interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places

2020-09-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9916.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8113
[https://github.com/apache/arrow/pull/8113]

> [RUST] Avoid cloning ArrayData in several places
> 
>
> Key: ARROW-9916
> URL: https://issues.apache.org/jira/browse/ARROW-9916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed this while benchmarking improvements in ARROW-9895. A flamegraph 
> showed a significant amount of time spent in Arc::clone/atomic_add followed 
> by Arc::drop/atomic_sub
>  The Array trait has two methods for accessing ArrayData, `.data()` which 
> clones an `Arc` and `.data_ref()` which only borrows the data. In 
> many places borrow can be used instead of clone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places

2020-09-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-9916:
-

Assignee: Jörn Horstmann

> [RUST] Avoid cloning ArrayData in several places
> 
>
> Key: ARROW-9916
> URL: https://issues.apache.org/jira/browse/ARROW-9916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed this while benchmarking improvements in ARROW-9895. A flamegraph 
> showed a significant amount of time spent in Arc::clone/atomic_add followed 
> by Arc::drop/atomic_sub
>  The Array trait has two methods for accessing ArrayData, `.data()` which 
> clones an `Arc` and `.data_ref()` which only borrows the data. In 
> many places borrow can be used instead of clone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9920:


 Summary: [Python] pyarrow.concat_arrays segfaults when passing it 
a chunked array
 Key: ARROW-9920
 URL: https://issues.apache.org/jira/browse/ARROW-9920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it 
the list of chunks:

{code}
In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])

In [2]: pa.concat_arrays(arr.chunks)
Out[2]: 

[
  0,
  1,
  3,
  4
]
{code}

but if passing the chunked array itself, you get a segfault:

{code}
In [4]: pa.concat_arrays(arr)
Segmentation fault (core dumped)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays

2020-09-05 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9918:
---
Description: 
[~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
that there is a performance hit when using array builders.

Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
DataFusion's math operations, and this number _increases with batch size_, 
which is detrimental since larger batch sizes leverage SIMD better.

Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch 
size:

 
{code:java}
sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms] 
  
change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms] 
  
change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms] 
  
change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
Performance has improved.
{code}
See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same 
batch size), but a measurable difference between {{sqrt_22_12}} and 
{{sqrt_22_14}} (different batch size).

This issue proposes that we migrate our datafusion and rust operations from 
builder-base array construction to a non-builder based, whenever possible.

 

Most code can be easily migrated to use {{From}} trait, a migration is along 
the lines of

 
{code:java}
let mut builder = Float64Builder::new(array.len());
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null()?;
} else {
builder.append_value(array.value(i).$FUNC())?;
}
}
Ok(Arc::new(builder.finish()))
{code}
 to

 
{code:java}
let result: Float64Array = (0..array.len())
.map(|i| {
if array.is_null(i) {
None
} else {
Some(array.value(i).$FUNC())
}
})
.collect::>>()
.into();
Ok(Arc::new(result)){code}
and this is even simpler as we do not need to use an extra API (Builder).

 

  was:
[~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
that there is a performance hit when using array builders.

Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
DataFusion's math operations, and this number _increases with batch size_, 
which is detrimental since larger batch sizes leverage SDMI better.

Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch 
size:

 
{code:java}
sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms] 
  
change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms] 
  
change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms] 
  
change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
Performance has improved.
{code}
See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same 
batch size), but a measurable difference between {{sqrt_22_12}} and 
{{sqrt_22_14}} (different batch size).

This issue proposes that we migrate our datafusion and rust operations from 
builder-base array construction to a non-builder based, whenever possible.

 

Most code can be easily migrated to use {{From}} trait, a migration is along 
the lines of

 
{code:java}
let mut builder = Float64Builder::new(array.len());
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null()?;
} else {
builder.append_value(array.value(i).$FUNC())?;
}
}
Ok(Arc::new(builder.finish()))
{code}
 to

 
{code:java}
let result: Float64Array = (0..array.len())
.map(|i| {
if array.is_null(i) {
None
} else {
Some(array.value(i).$FUNC())
}
})
.collect::>>()
.into();
Ok(Arc::new(result)){code}
and this is even simpler as we do not need to use an extra API (Builder).

 


> [Rust] [DataFusion] Favor "from" to create arrays
> -
>
> Key: ARROW-9918
> 

[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays for improved performance

2020-09-05 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9918:
---
Summary: [Rust] [DataFusion] Favor "from" to create arrays for improved 
performance  (was: [Rust] [DataFusion] Favor "from" to create arrays)

> [Rust] [DataFusion] Favor "from" to create arrays for improved performance
> --
>
> Key: ARROW-9918
> URL: https://issues.apache.org/jira/browse/ARROW-9918
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>
> [~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
> that there is a performance hit when using array builders.
> Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
> DataFusion's math operations, and this number _increases with batch size_, 
> which is detrimental since larger batch sizes leverage SIMD better.
> Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the 
> batch size:
>  
> {code:java}
> sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms]   
> 
> change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms]   
> 
> change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms]   
> 
> change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
> Performance has improved.
> {code}
> See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} 
> (same batch size), but a measurable difference between {{sqrt_22_12}} and 
> {{sqrt_22_14}} (different batch size).
> This issue proposes that we migrate our datafusion and rust operations from 
> builder-base array construction to a non-builder based, whenever possible.
>  
> Most code can be easily migrated to use {{From}} trait, a migration is along 
> the lines of
>  
> {code:java}
> let mut builder = Float64Builder::new(array.len());
> for i in 0..array.len() {
> if array.is_null(i) {
> builder.append_null()?;
> } else {
> builder.append_value(array.value(i).$FUNC())?;
> }
> }
> Ok(Arc::new(builder.finish()))
> {code}
>  to
>  
> {code:java}
> let result: Float64Array = (0..array.len())
> .map(|i| {
> if array.is_null(i) {
> None
> } else {
> Some(array.value(i).$FUNC())
> }
> })
> .collect::>>()
> .into();
> Ok(Arc::new(result)){code}
> and this is even simpler as we do not need to use an extra API (Builder).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays

2020-09-05 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-9918:
-
Summary: [Rust] [DataFusion] Favor "from" to create arrays  (was: [Rust] 
[DataFusion] Move away from builders to create arrays)

> [Rust] [DataFusion] Favor "from" to create arrays
> -
>
> Key: ARROW-9918
> URL: https://issues.apache.org/jira/browse/ARROW-9918
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>
> [~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
> that there is a performance hit when using array builders.
> Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
> DataFusion's math operations, and this number _increases with batch size_, 
> which is detrimental since larger batch sizes leverage SDMI better.
> Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the 
> batch size:
>  
> {code:java}
> sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms]   
> 
> change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms]   
> 
> change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms]   
> 
> change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
> Performance has improved.
> {code}
> See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} 
> (same batch size), but a measurable difference between {{sqrt_22_12}} and 
> {{sqrt_22_14}} (different batch size).
> This issue proposes that we migrate our datafusion and rust operations from 
> builder-base array construction to a non-builder based, whenever possible.
>  
> Most code can be easily migrated to use {{From}} trait, a migration is along 
> the lines of
>  
> {code:java}
> let mut builder = Float64Builder::new(array.len());
> for i in 0..array.len() {
> if array.is_null(i) {
> builder.append_null()?;
> } else {
> builder.append_value(array.value(i).$FUNC())?;
> }
> }
> Ok(Arc::new(builder.finish()))
> {code}
>  to
>  
> {code:java}
> let result: Float64Array = (0..array.len())
> .map(|i| {
> if array.is_null(i) {
> None
> } else {
> Some(array.value(i).$FUNC())
> }
> })
> .collect::>>()
> .into();
> Ok(Arc::new(result)){code}
> and this is even simpler as we do not need to use an extra API (Builder).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9918) [Rust] [DataFusion] Move away from builders to create arrays

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9918:


 Summary: [Rust] [DataFusion] Move away from builders to create 
arrays
 Key: ARROW-9918
 URL: https://issues.apache.org/jira/browse/ARROW-9918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


[~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
that there is a performance hit when using array builders.

Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
DataFusion's math operations, and this number _increases with batch size_, 
which is detrimental since larger batch sizes leverage SDMI better.

Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch 
size:

 
{code:java}
sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms] 
  
change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms] 
  
change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms] 
  
change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
Performance has improved.
{code}
See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same 
batch size), but a measurable difference between {{sqrt_22_12}} and 
{{sqrt_22_14}} (different batch size).

This issue proposes that we migrate our datafusion and rust operations from 
builder-base array construction to a non-builder based, whenever possible.

 

Most code can be easily migrated to use {{From}} trait, a migration is along 
the lines of

 
{code:java}
let mut builder = Float64Builder::new(array.len());
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null()?;
} else {
builder.append_value(array.value(i).$FUNC())?;
}
}
Ok(Arc::new(builder.finish()))
{code}
 to

 
{code:java}
let result: Float64Array = (0..array.len())
.map(|i| {
if array.is_null(i) {
None
} else {
Some(array.value(i).$FUNC())
}
})
.collect::>>()
.into();
Ok(Arc::new(result)){code}
and this is even simpler as we do not need to use an extra API (Builder).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9919:


 Summary: [Rust] [DataFusion] Math functions
 Key: ARROW-9919
 URL: https://issues.apache.org/jira/browse/ARROW-9919
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Jorge
Assignee: Jorge


See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9919:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-05 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9919:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-05 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9919:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9896) [Go] Rename ipc tool from 'arrow-json-intergration-test' to 'arrow-json'

2020-09-05 Thread FredGan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

FredGan closed ARROW-9896.
--
Resolution: Not A Problem

> [Go] Rename ipc tool from 'arrow-json-intergration-test' to 'arrow-json'
> 
>
> Key: ARROW-9896
> URL: https://issues.apache.org/jira/browse/ARROW-9896
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Affects Versions: 1.0.0
>Reporter: FredGan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There are five tools in go/arrow/ipc/cmd directory. The other four tools are 
> named the same with that described in the code. But the 
> 'arrow-json-intergration-test' is different. The code name it "arrow-json". 
> So maybe it would be better that this directory renamed to 'arrow-json'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)