[jira] [Comment Edited] (ARROW-9915) [Java] getObject API for temporal types is inconsistent and in some cases incorrect
[ https://issues.apache.org/jira/browse/ARROW-9915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191169#comment-17191169 ] Micah Kornfield edited comment on ARROW-9915 at 9/6/20, 12:23 AM: -- [~mjadczak_gsa] thank you for the detailed analysis. The last time this was [discussed on the mailing list|https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E], it was decided not to make these changes (but possibly add utility methods). we can potentially discuss again. was (Author: emkornfi...@gmail.com): [~mjadczak_gsa] thank you for the detailed analysis. The last time this was [discussed on the mailing list|[https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E]], it was decided not to make these changes (but possibly add utility methods). we can potentially discuss again. > [Java] getObject API for temporal types is inconsistent and in some cases > incorrect > --- > > Key: ARROW-9915 > URL: https://issues.apache.org/jira/browse/ARROW-9915 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.16.0, 0.17.0, > 0.17.1, 1.0.0 >Reporter: Matt Jadczak >Priority: Major > > It seems that the work which has been tracked in ARROW-2015 and merged in > [https://github.com/apache/arrow/pull/2966] to change the return types of the > various Time and Date vector types when using the getObject API missed some > of the vector types which are temporal and so should return a temporal type, > and provided an incorrect implementation for others (some of this was pointed > out in the initial PR review, but it seems that it slipped through the cracks > and was not addressed before merging). > Here is a table of the various temporal vector types, what they currently > return from getObject, and what they should return, in my opinion (I have > included ones in which the implementation is correct for completeness, and > coloured them green). > > > ||Vector class||Current return type||Proposed return type||Comments|| > |DateDayVector|Integer|LocalDate|Currently returns the raw value of days > since epoch, should return the actual date| > |DateMilliVector|LocalDateTime|LocalDate|This type is supposed to encode a > date, not a datetime, so even though epoch millis are used, the result should > be a LocalDate| > |{color:#00875a}DurationVector{color}|{color:#00875a}Duration{color}|{color:#00875a}Duration{color}|{color:#00875a}Correct.{color}| > |IntervalDayVector|Duration|Period|As per > [https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251] , > Interval should be a calendar-based datatype, not a time-based one. This is > represented in Java by a Period type. However, I note that the Java Period > class does not support milliseconds, unlike the Arrow type, which might be > why Duration is being returned. Some discussion may be needed on the best way > to deal with this.| > |{color:#00875a}IntervalYearVector{color}|{color:#00875a}Period{color}|{color:#00875a}Period{color}|{color:#00875a}Correct.{color}| > |TimeMicroVector|Long|LocalTime|Currently returns the raw number of micros, > should return the actual time| > |TimeMilliVector|LocalDateTime|LocalTime|Currently returns a datetime on > 1970-01-01 with the correct time component, should just return the time| > |TimeNanoVector|Long|LocalTime|Currently returns the raw number of nanos, > should return the actual time| > |TimeSecVector|Integer|LocalTime|Currently returns the raw number of seconds, > should return the actual time| > |{color:#00875a}TimeStampMicroVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampMilliVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampNanoVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampSecVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |TimeStampMicroTZVector|Long|ZonedDateTime|Currently returns the underlying > micros, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > |TimeStampMilliTZVector|Long|ZonedDateTime|Currently returns the underlying > millis, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > |TimeStampNanoTZVector|Long|ZonedDateTime|Currently returns the underlying > nanos, and TZ has to be
[jira] [Commented] (ARROW-9915) [Java] getObject API for temporal types is inconsistent and in some cases incorrect
[ https://issues.apache.org/jira/browse/ARROW-9915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191169#comment-17191169 ] Micah Kornfield commented on ARROW-9915: [~mjadczak_gsa] thank you for the detailed analysis. The last time this was [discussed on the mailing list|[https://lists.apache.org/thread.html/bd9286591fa8bd682322c625f4176701af78e2510005a0488da359b3%40%3Cdev.arrow.apache.org%3E]], it was decided not to make these changes (but possibly add utility methods). we can potentially discuss again. > [Java] getObject API for temporal types is inconsistent and in some cases > incorrect > --- > > Key: ARROW-9915 > URL: https://issues.apache.org/jira/browse/ARROW-9915 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.16.0, 0.17.0, > 0.17.1, 1.0.0 >Reporter: Matt Jadczak >Priority: Major > > It seems that the work which has been tracked in ARROW-2015 and merged in > [https://github.com/apache/arrow/pull/2966] to change the return types of the > various Time and Date vector types when using the getObject API missed some > of the vector types which are temporal and so should return a temporal type, > and provided an incorrect implementation for others (some of this was pointed > out in the initial PR review, but it seems that it slipped through the cracks > and was not addressed before merging). > Here is a table of the various temporal vector types, what they currently > return from getObject, and what they should return, in my opinion (I have > included ones in which the implementation is correct for completeness, and > coloured them green). > > > ||Vector class||Current return type||Proposed return type||Comments|| > |DateDayVector|Integer|LocalDate|Currently returns the raw value of days > since epoch, should return the actual date| > |DateMilliVector|LocalDateTime|LocalDate|This type is supposed to encode a > date, not a datetime, so even though epoch millis are used, the result should > be a LocalDate| > |{color:#00875a}DurationVector{color}|{color:#00875a}Duration{color}|{color:#00875a}Duration{color}|{color:#00875a}Correct.{color}| > |IntervalDayVector|Duration|Period|As per > [https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251] , > Interval should be a calendar-based datatype, not a time-based one. This is > represented in Java by a Period type. However, I note that the Java Period > class does not support milliseconds, unlike the Arrow type, which might be > why Duration is being returned. Some discussion may be needed on the best way > to deal with this.| > |{color:#00875a}IntervalYearVector{color}|{color:#00875a}Period{color}|{color:#00875a}Period{color}|{color:#00875a}Correct.{color}| > |TimeMicroVector|Long|LocalTime|Currently returns the raw number of micros, > should return the actual time| > |TimeMilliVector|LocalDateTime|LocalTime|Currently returns a datetime on > 1970-01-01 with the correct time component, should just return the time| > |TimeNanoVector|Long|LocalTime|Currently returns the raw number of nanos, > should return the actual time| > |TimeSecVector|Integer|LocalTime|Currently returns the raw number of seconds, > should return the actual time| > |{color:#00875a}TimeStampMicroVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampMilliVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampNanoVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |{color:#00875a}TimeStampSecVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}| > |TimeStampMicroTZVector|Long|ZonedDateTime|Currently returns the underlying > micros, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > |TimeStampMilliTZVector|Long|ZonedDateTime|Currently returns the underlying > millis, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > |TimeStampNanoTZVector|Long|ZonedDateTime|Currently returns the underlying > nanos, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > |TimeStampSecTZVector|Long|ZonedDateTime|Currently returns the underlying > seconds, and TZ has to be obtained separately. Should return the actual > datetime with timezone| > I am happy to submit a PR to fix this if there is no other reason for this to > persist other than these types being rarely used, but note that these would > all be breaking API changes.
[jira] [Updated] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
[ https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-9921: - Issue Type: Improvement (was: Bug) > [Rust] Add `from(Vec>)` to [Large]StringArray > -- > > Key: ARROW-9921 > URL: https://issues.apache.org/jira/browse/ARROW-9921 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > and deprecate TryFrom, that currently uses a builder. > This should have some performance improvement, and simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
[ https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9921: -- Labels: pull-request-available (was: ) > [Rust] Add `from(Vec>)` to [Large]StringArray > -- > > Key: ARROW-9921 > URL: https://issues.apache.org/jira/browse/ARROW-9921 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > and deprecate TryFrom, that currently uses a builder. > This should have some performance improvement, and simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
[ https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9921: Assignee: Apache Arrow JIRA Bot (was: Jorge) > [Rust] Add `from(Vec>)` to [Large]StringArray > -- > > Key: ARROW-9921 > URL: https://issues.apache.org/jira/browse/ARROW-9921 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Jorge >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > and deprecate TryFrom, that currently uses a builder. > This should have some performance improvement, and simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
[ https://issues.apache.org/jira/browse/ARROW-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9921: Assignee: Jorge (was: Apache Arrow JIRA Bot) > [Rust] Add `from(Vec>)` to [Large]StringArray > -- > > Key: ARROW-9921 > URL: https://issues.apache.org/jira/browse/ARROW-9921 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > and deprecate TryFrom, that currently uses a builder. > This should have some performance improvement, and simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
Jorge created ARROW-9921: Summary: [Rust] Add `from(Vec>)` to [Large]StringArray Key: ARROW-9921 URL: https://issues.apache.org/jira/browse/ARROW-9921 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge and deprecate TryFrom, that currently uses a builder. This should have some performance improvement, and simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places
[ https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9916. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8113 [https://github.com/apache/arrow/pull/8113] > [RUST] Avoid cloning ArrayData in several places > > > Key: ARROW-9916 > URL: https://issues.apache.org/jira/browse/ARROW-9916 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.0 >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I noticed this while benchmarking improvements in ARROW-9895. A flamegraph > showed a significant amount of time spent in Arc::clone/atomic_add followed > by Arc::drop/atomic_sub > The Array trait has two methods for accessing ArrayData, `.data()` which > clones an `Arc` and `.data_ref()` which only borrows the data. In > many places borrow can be used instead of clone. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places
[ https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-9916: - Assignee: Jörn Horstmann > [RUST] Avoid cloning ArrayData in several places > > > Key: ARROW-9916 > URL: https://issues.apache.org/jira/browse/ARROW-9916 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.0 >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > I noticed this while benchmarking improvements in ARROW-9895. A flamegraph > showed a significant amount of time spent in Arc::clone/atomic_add followed > by Arc::drop/atomic_sub > The Array trait has two methods for accessing ArrayData, `.data()` which > clones an `Arc` and `.data_ref()` which only borrows the data. In > many places borrow can be used instead of clone. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
Joris Van den Bossche created ARROW-9920: Summary: [Python] pyarrow.concat_arrays segfaults when passing it a chunked array Key: ARROW-9920 URL: https://issues.apache.org/jira/browse/ARROW-9920 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it the list of chunks: {code} In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) In [2]: pa.concat_arrays(arr.chunks) Out[2]: [ 0, 1, 3, 4 ] {code} but if passing the chunked array itself, you get a segfault: {code} In [4]: pa.concat_arrays(arr) Segmentation fault (core dumped) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays
[ https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-9918: --- Description: [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 that there is a performance hit when using array builders. Indeed, I get about 15% speedup by using {{From}} instead of a builder in DataFusion's math operations, and this number _increases with batch size_, which is detrimental since larger batch sizes leverage SIMD better. Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch size: {code:java} sqrt_20_12 time: [34.422 ms 34.503 ms 34.584 ms] change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_12 time: [150.13 ms 150.79 ms 151.42 ms] change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_14 time: [151.45 ms 151.68 ms 151.90 ms] change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05) Performance has improved. {code} See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same batch size), but a measurable difference between {{sqrt_22_12}} and {{sqrt_22_14}} (different batch size). This issue proposes that we migrate our datafusion and rust operations from builder-base array construction to a non-builder based, whenever possible. Most code can be easily migrated to use {{From}} trait, a migration is along the lines of {code:java} let mut builder = Float64Builder::new(array.len()); for i in 0..array.len() { if array.is_null(i) { builder.append_null()?; } else { builder.append_value(array.value(i).$FUNC())?; } } Ok(Arc::new(builder.finish())) {code} to {code:java} let result: Float64Array = (0..array.len()) .map(|i| { if array.is_null(i) { None } else { Some(array.value(i).$FUNC()) } }) .collect::>>() .into(); Ok(Arc::new(result)){code} and this is even simpler as we do not need to use an extra API (Builder). was: [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 that there is a performance hit when using array builders. Indeed, I get about 15% speedup by using {{From}} instead of a builder in DataFusion's math operations, and this number _increases with batch size_, which is detrimental since larger batch sizes leverage SDMI better. Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch size: {code:java} sqrt_20_12 time: [34.422 ms 34.503 ms 34.584 ms] change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_12 time: [150.13 ms 150.79 ms 151.42 ms] change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_14 time: [151.45 ms 151.68 ms 151.90 ms] change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05) Performance has improved. {code} See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same batch size), but a measurable difference between {{sqrt_22_12}} and {{sqrt_22_14}} (different batch size). This issue proposes that we migrate our datafusion and rust operations from builder-base array construction to a non-builder based, whenever possible. Most code can be easily migrated to use {{From}} trait, a migration is along the lines of {code:java} let mut builder = Float64Builder::new(array.len()); for i in 0..array.len() { if array.is_null(i) { builder.append_null()?; } else { builder.append_value(array.value(i).$FUNC())?; } } Ok(Arc::new(builder.finish())) {code} to {code:java} let result: Float64Array = (0..array.len()) .map(|i| { if array.is_null(i) { None } else { Some(array.value(i).$FUNC()) } }) .collect::>>() .into(); Ok(Arc::new(result)){code} and this is even simpler as we do not need to use an extra API (Builder). > [Rust] [DataFusion] Favor "from" to create arrays > - > > Key: ARROW-9918 >
[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays for improved performance
[ https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-9918: --- Summary: [Rust] [DataFusion] Favor "from" to create arrays for improved performance (was: [Rust] [DataFusion] Favor "from" to create arrays) > [Rust] [DataFusion] Favor "from" to create arrays for improved performance > -- > > Key: ARROW-9918 > URL: https://issues.apache.org/jira/browse/ARROW-9918 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > > [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 > that there is a performance hit when using array builders. > Indeed, I get about 15% speedup by using {{From}} instead of a builder in > DataFusion's math operations, and this number _increases with batch size_, > which is detrimental since larger batch sizes leverage SIMD better. > Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the > batch size: > > {code:java} > sqrt_20_12 time: [34.422 ms 34.503 ms 34.584 ms] > > change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05) > Performance has improved. > sqrt_22_12 time: [150.13 ms 150.79 ms 151.42 ms] > > change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05) > Performance has improved. > sqrt_22_14 time: [151.45 ms 151.68 ms 151.90 ms] > > change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05) > Performance has improved. > {code} > See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} > (same batch size), but a measurable difference between {{sqrt_22_12}} and > {{sqrt_22_14}} (different batch size). > This issue proposes that we migrate our datafusion and rust operations from > builder-base array construction to a non-builder based, whenever possible. > > Most code can be easily migrated to use {{From}} trait, a migration is along > the lines of > > {code:java} > let mut builder = Float64Builder::new(array.len()); > for i in 0..array.len() { > if array.is_null(i) { > builder.append_null()?; > } else { > builder.append_value(array.value(i).$FUNC())?; > } > } > Ok(Arc::new(builder.finish())) > {code} > to > > {code:java} > let result: Float64Array = (0..array.len()) > .map(|i| { > if array.is_null(i) { > None > } else { > Some(array.value(i).$FUNC()) > } > }) > .collect::>>() > .into(); > Ok(Arc::new(result)){code} > and this is even simpler as we do not need to use an extra API (Builder). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays
[ https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-9918: - Summary: [Rust] [DataFusion] Favor "from" to create arrays (was: [Rust] [DataFusion] Move away from builders to create arrays) > [Rust] [DataFusion] Favor "from" to create arrays > - > > Key: ARROW-9918 > URL: https://issues.apache.org/jira/browse/ARROW-9918 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > > [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 > that there is a performance hit when using array builders. > Indeed, I get about 15% speedup by using {{From}} instead of a builder in > DataFusion's math operations, and this number _increases with batch size_, > which is detrimental since larger batch sizes leverage SDMI better. > Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the > batch size: > > {code:java} > sqrt_20_12 time: [34.422 ms 34.503 ms 34.584 ms] > > change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05) > Performance has improved. > sqrt_22_12 time: [150.13 ms 150.79 ms 151.42 ms] > > change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05) > Performance has improved. > sqrt_22_14 time: [151.45 ms 151.68 ms 151.90 ms] > > change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05) > Performance has improved. > {code} > See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} > (same batch size), but a measurable difference between {{sqrt_22_12}} and > {{sqrt_22_14}} (different batch size). > This issue proposes that we migrate our datafusion and rust operations from > builder-base array construction to a non-builder based, whenever possible. > > Most code can be easily migrated to use {{From}} trait, a migration is along > the lines of > > {code:java} > let mut builder = Float64Builder::new(array.len()); > for i in 0..array.len() { > if array.is_null(i) { > builder.append_null()?; > } else { > builder.append_value(array.value(i).$FUNC())?; > } > } > Ok(Arc::new(builder.finish())) > {code} > to > > {code:java} > let result: Float64Array = (0..array.len()) > .map(|i| { > if array.is_null(i) { > None > } else { > Some(array.value(i).$FUNC()) > } > }) > .collect::>>() > .into(); > Ok(Arc::new(result)){code} > and this is even simpler as we do not need to use an extra API (Builder). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9918) [Rust] [DataFusion] Move away from builders to create arrays
Jorge created ARROW-9918: Summary: [Rust] [DataFusion] Move away from builders to create arrays Key: ARROW-9918 URL: https://issues.apache.org/jira/browse/ARROW-9918 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 that there is a performance hit when using array builders. Indeed, I get about 15% speedup by using {{From}} instead of a builder in DataFusion's math operations, and this number _increases with batch size_, which is detrimental since larger batch sizes leverage SDMI better. Below, {{sqrt_X_Y}}, X is the log2 of the array's size, Y is log2 of the batch size: {code:java} sqrt_20_12 time: [34.422 ms 34.503 ms 34.584 ms] change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_12 time: [150.13 ms 150.79 ms 151.42 ms] change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05) Performance has improved. sqrt_22_14 time: [151.45 ms 151.68 ms 151.90 ms] change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05) Performance has improved. {code} See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same batch size), but a measurable difference between {{sqrt_22_12}} and {{sqrt_22_14}} (different batch size). This issue proposes that we migrate our datafusion and rust operations from builder-base array construction to a non-builder based, whenever possible. Most code can be easily migrated to use {{From}} trait, a migration is along the lines of {code:java} let mut builder = Float64Builder::new(array.len()); for i in 0..array.len() { if array.is_null(i) { builder.append_null()?; } else { builder.append_value(array.value(i).$FUNC())?; } } Ok(Arc::new(builder.finish())) {code} to {code:java} let result: Float64Array = (0..array.len()) .map(|i| { if array.is_null(i) { None } else { Some(array.value(i).$FUNC()) } }) .collect::>>() .into(); Ok(Arc::new(result)){code} and this is even simpler as we do not need to use an extra API (Builder). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9919) [Rust] [DataFusion] Math functions
Jorge created ARROW-9919: Summary: [Rust] [DataFusion] Math functions Key: ARROW-9919 URL: https://issues.apache.org/jira/browse/ARROW-9919 Project: Apache Arrow Issue Type: Sub-task Reporter: Jorge Assignee: Jorge See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9919: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9919: Assignee: Apache Arrow JIRA Bot (was: Jorge) > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Jorge >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9919: Assignee: Jorge (was: Apache Arrow JIRA Bot) > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-9896) [Go] Rename ipc tool from 'arrow-json-intergration-test' to 'arrow-json'
[ https://issues.apache.org/jira/browse/ARROW-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] FredGan closed ARROW-9896. -- Resolution: Not A Problem > [Go] Rename ipc tool from 'arrow-json-intergration-test' to 'arrow-json' > > > Key: ARROW-9896 > URL: https://issues.apache.org/jira/browse/ARROW-9896 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Affects Versions: 1.0.0 >Reporter: FredGan >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > There are five tools in go/arrow/ipc/cmd directory. The other four tools are > named the same with that described in the code. But the > 'arrow-json-intergration-test' is different. The code name it "arrow-json". > So maybe it would be better that this directory renamed to 'arrow-json' -- This message was sent by Atlassian Jira (v8.3.4#803005)