[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python
Andy Grove created ARROW-18418: -- Summary: [WEBSITE] do not delete /datafusion-python Key: ARROW-18418 URL: https://issues.apache.org/jira/browse/ARROW-18418 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Assignee: Andy Grove do not delete /datafusion-python when publishing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
Andy Grove created ARROW-17878: -- Summary: [Website] Exclude Ballista docs from being deleted Key: ARROW-17878 URL: https://issues.apache.org/jira/browse/ARROW-17878 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Exclude Ballista docs from being deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17325) AQE should use available column statistics from completed query stages
[ https://issues.apache.org/jira/browse/ARROW-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove closed ARROW-17325.
--
Resolution: Invalid

> AQE should use available column statistics from completed query stages
> --
>
> Key: ARROW-17325
> URL: https://issues.apache.org/jira/browse/ARROW-17325
> Project: Apache Arrow
> Issue Type: Improvement
> Components: SQL
> Reporter: Andy Grove
> Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn
> calls ShuffleExchangeLike#runtimeStatistics or
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
> {code:scala}
> def computeStats(): Option[Statistics] = if (isMaterialized) {
>   val runtimeStats = getRuntimeStatistics
>   val dataSize = runtimeStats.sizeInBytes.max(0)
>   val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>   Some(Statistics(dataSize, numOutputRows, isRuntime = true))
> } else {
>   None
> }
> {code}
> I would like to also copy over the column statistics stored in
> Statistics.attributeStats so that they can be fed back into the logical plan
> optimization phase. This is a small change as shown below:
> {code:scala}
> def computeStats(): Option[Statistics] = if (isMaterialized) {
>   val runtimeStats = getRuntimeStatistics
>   val dataSize = runtimeStats.sizeInBytes.max(0)
>   val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>   val attributeStats = runtimeStats.attributeStats
>   Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
> } else {
>   None
> }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do
> not currently provide such column statistics, but other custom
> implementations can.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17325) AQE should use available column statistics from completed query stages
[ https://issues.apache.org/jira/browse/ARROW-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-17325:
---
Description:
In QueryStageExec.computeStats we copy partial statistics from materialized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics.

Only dataSize and numOutputRows are copied into the new Statistics object:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  Some(Statistics(dataSize, numOutputRows, isRuntime = true))
} else {
  None
}
{code}
I would like to also copy over the column statistics stored in Statistics.attributeStats so that they can be fed back into the logical plan optimization phase. This is a small change as shown below:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  val attributeStats = runtimeStats.attributeStats
  Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
} else {
  None
}
{code}
The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics, but other custom implementations can.

was:
In QueryStageExec.computeStats we copy partial statistics from materialized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics.

Only dataSize and numOutputRows are copied into the new Statistics object:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  Some(Statistics(dataSize, numOutputRows, isRuntime = true))
} else {
  None
}
{code}
I would like to also copy over the column statistics stored in Statistics.attributeStats so that they can be fed back into the logical plan optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics but other custom implementations can.

> AQE should use available column statistics from completed query stages
> --
>
> Key: ARROW-17325
> URL: https://issues.apache.org/jira/browse/ARROW-17325
> Project: Apache Arrow
> Issue Type: Improvement
> Components: SQL
> Reporter: Andy Grove
> Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn
> calls ShuffleExchangeLike#runtimeStatistics or
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
> {code:scala}
> def computeStats(): Option[Statistics] = if (isMaterialized) {
>   val runtimeStats = getRuntimeStatistics
>   val dataSize = runtimeStats.sizeInBytes.max(0)
>   val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>   Some(Statistics(dataSize, numOutputRows, isRuntime = true))
> } else {
>   None
> }
> {code}
> I would like to also copy over the column statistics stored in
> Statistics.attributeStats so that they can be fed back into the logical plan
> optimization phase. This is a small change as shown below:
> {code:scala}
> def computeStats(): Option[Statistics] = if (isMaterialized) {
>   val runtimeStats = getRuntimeStatistics
>   val dataSize = runtimeStats.sizeInBytes.max(0)
>   val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>   val attributeStats = runtimeStats.attributeStats
>   Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
> } else {
>   None
> }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do
> not currently provide such column statistics, but other custom
> implementations can.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17325) AQE should use available column statistics from completed query stages
Andy Grove created ARROW-17325:
--
Summary: AQE should use available column statistics from completed query stages
Key: ARROW-17325
URL: https://issues.apache.org/jira/browse/ARROW-17325
Project: Apache Arrow
Issue Type: Improvement
Components: SQL
Reporter: Andy Grove

In QueryStageExec.computeStats we copy partial statistics from materialized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics.

Only dataSize and numOutputRows are copied into the new Statistics object:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  Some(Statistics(dataSize, numOutputRows, isRuntime = true))
} else {
  None
}
{code}
I would like to also copy over the column statistics stored in Statistics.attributeStats so that they can be fed back into the logical plan optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics but other custom implementations can.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
[ https://issues.apache.org/jira/browse/ARROW-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-13656. -- Resolution: Won't Fix This is an old issue > [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts > --- > > Key: ARROW-13656 > URL: https://issues.apache.org/jira/browse/ARROW-13656 > Project: Apache Arrow > Issue Type: Task > Components: Website >Reporter: Andy Grove >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > https://github.com/apache/arrow-datafusion/issues/881 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post
Andy Grove created ARROW-16595: -- Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post Key: ARROW-16595 URL: https://issues.apache.org/jira/browse/ARROW-16595 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove DataFusion 8.0.0 Release Blog Post -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
Andy Grove created ARROW-13656: -- Summary: [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts Key: ARROW-13656 URL: https://issues.apache.org/jira/browse/ARROW-13656 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove Assignee: Andy Grove https://github.com/apache/arrow-datafusion/issues/881 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12435) [Rust][DataFusion] Remove unnecessary references to namespace in executor
[ https://issues.apache.org/jira/browse/ARROW-12435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12435. -- Resolution: Won't Fix Moved to https://github.com/apache/arrow-datafusion/issues/66 > [Rust][DataFusion] Remove unnecessary references to namespace in executor > - > > Key: ARROW-12435 > URL: https://issues.apache.org/jira/browse/ARROW-12435 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Ximo Guanter >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > There is no need to support multiple executor clusters from a scheduler, so > the namespace of an executor is implicitly defined by the scheduler it > connects to. See > [https://the-asf.slack.com/archives/C01QUFS30TD/p1618679585211100] for more > context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct
[ https://issues.apache.org/jira/browse/ARROW-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12403. -- Resolution: Won't Fix Moved to https://github.com/apache/arrow-datafusion/issues/65 > [Rust] [Ballista] Integration tests should check that query results are > correct > --- > > Key: ARROW-12403 > URL: https://issues.apache.org/jira/browse/ARROW-12403 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > The integration checks only test that the benchmark queries run without > error. They do not check that the results are correct. > I think some work already happened in DataFusion to check the TPC-H results > so hopefully we can re-use that work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc
[ https://issues.apache.org/jira/browse/ARROW-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12331. -- Resolution: Invalid > [Rust] [Ballista] Make CI build work with snmalloc > -- > > Key: ARROW-12331 > URL: https://issues.apache.org/jira/browse/ARROW-12331 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > > Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but > is building without default features due to snmalloc requiring cmake. > An alternative approach would be to build with cc instead of cmake. See the > above PR for conversation about this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion
[ https://issues.apache.org/jira/browse/ARROW-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12255. -- Resolution: Won't Fix Moved to https://github.com/apache/arrow-datafusion/issues/64 > [Rust] [Ballista] Integrate scheduler with DataFusion > - > > Key: ARROW-12255 > URL: https://issues.apache.org/jira/browse/ARROW-12255 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > The Ballista scheduler breaks a query down into stages based on changes in > partitioning in the plan, where each stage is broken down into tasks that can > be executed concurrently. > Rather than trying to run all the partitions at once, Ballista executors > process n concurrent tasks at a time and then request new tasks from the > scheduler. > This approach would help DataFusion scale better and it would be ideal to use > the same scheduler to scale across cores in DataFusion and across nodes in > Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
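The pull-based model described in ARROW-12255 (a fixed number of task slots, each asking for more work only when it is free) can be sketched roughly as follows. This is a minimal single-process illustration using threads and a shared queue, not Ballista's scheduler; the `Task` struct and the `run_with_slots` function are hypothetical names introduced only for this example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a unit of work: one partition of one query stage.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    stage_id: usize,
    partition: usize,
}

/// Runs every task using at most `slots` worker threads. Each worker pulls the
/// next task from the shared queue only after finishing its current one, so no
/// more than `slots` tasks are ever in flight.
fn run_with_slots(tasks: Vec<Task>, slots: usize) -> Vec<Task> {
    let queue = Arc::new(Mutex::new(VecDeque::from(tasks)));
    let done = Arc::new(Mutex::new(Vec::new()));

    let workers: Vec<_> = (0..slots)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let done = Arc::clone(&done);
            thread::spawn(move || loop {
                // Hold the queue lock only long enough to take one task.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(task) => {
                        // Real work (executing the partition) would happen here.
                        done.lock().unwrap().push(task);
                    }
                    // Queue drained: this worker stops asking for tasks.
                    None => break,
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
    let finished = done.lock().unwrap().clone();
    finished
}
```

Calling `run_with_slots(tasks, 3)` with eight tasks completes all eight while never running more than three at once; completion order is not deterministic, so callers should not rely on it.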
[jira] [Closed] (ARROW-12256) [Rust] [Ballista] Add DataFrame support
[ https://issues.apache.org/jira/browse/ARROW-12256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12256. -- Resolution: Invalid Ballista does already support DataFrame > [Rust] [Ballista] Add DataFrame support > --- > > Key: ARROW-12256 > URL: https://issues.apache.org/jira/browse/ARROW-12256 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > Ballista has so far been focused on SQL support rather than DataFrame > support. DataFrame support is partially implemented but needs more work to > complete. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12253) [Rust] [Ballista] Implement scalable joins
[ https://issues.apache.org/jira/browse/ARROW-12253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12253. -- Resolution: Won't Fix Moved to https://github.com/apache/arrow-datafusion/issues/63 > [Rust] [Ballista] Implement scalable joins > -- > > Key: ARROW-12253 > URL: https://issues.apache.org/jira/browse/ARROW-12253 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > The main issue limiting scalability in Ballista today is that joins are > implemented as hash joins where each partition of the probe side causes the > entire left side to be loaded into memory. > To make this scalable we need to hash partition left and right inputs so that > we can join the left and right partitions in parallel. > There is already work underway in DataFusion to implement this that we can > leverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
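The approach proposed in ARROW-12253, hash partitioning both join inputs so that matching partitions can be joined in parallel, can be sketched as below. This is a simplified in-memory illustration under stated assumptions, not the DataFusion or Ballista implementation: the `Row` shape and the function names `hash_partition`, `join_partition`, and `partitioned_join` are hypothetical, and the partition count `n` must be greater than zero.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

/// Hypothetical row shape: (join key, payload).
type Row = (u64, String);

/// Assigns each row to one of `n` partitions by hashing its join key.
/// Rows with equal keys always land in the same partition index.
fn hash_partition(rows: Vec<Row>, n: usize) -> Vec<Vec<Row>> {
    let mut parts: Vec<Vec<Row>> = (0..n).map(|_| Vec::new()).collect();
    for row in rows {
        let mut hasher = DefaultHasher::new();
        row.0.hash(&mut hasher);
        let idx = (hasher.finish() % n as u64) as usize;
        parts[idx].push(row);
    }
    parts
}

/// Hash join of one left partition with the matching right partition.
/// Only this partition's left rows are held in the build-side hash table.
fn join_partition(left: Vec<Row>, right: Vec<Row>) -> Vec<(u64, String, String)> {
    let mut build: HashMap<u64, Vec<String>> = HashMap::new();
    for (key, value) in left {
        build.entry(key).or_default().push(value);
    }
    let mut out = Vec::new();
    for (key, right_value) in right {
        if let Some(left_values) = build.get(&key) {
            for left_value in left_values {
                out.push((key, left_value.clone(), right_value.clone()));
            }
        }
    }
    out
}

/// Partitions both inputs the same way, then joins partition i of the left
/// side with partition i of the right side, one thread per partition.
fn partitioned_join(left: Vec<Row>, right: Vec<Row>, n: usize) -> Vec<(u64, String, String)> {
    let left_parts = hash_partition(left, n);
    let right_parts = hash_partition(right, n);
    let handles: Vec<_> = left_parts
        .into_iter()
        .zip(right_parts)
        .map(|(l, r)| thread::spawn(move || join_partition(l, r)))
        .collect();
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```

Because both sides use the same hash function and partition count, a key can only match within its own partition, which is what allows each build-side table to hold a fraction of the left input instead of all of it.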
[jira] [Closed] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?
[ https://issues.apache.org/jira/browse/ARROW-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12252. -- Resolution: Won't Fix Replaced by https://github.com/apache/arrow-datafusion/issues/18 > [Rust] [Ballista] How to continue "This week in Ballista"? > -- > > Key: ARROW-12252 > URL: https://issues.apache.org/jira/browse/ARROW-12252 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > The Ballista project published a weekly newsletter and this has been very > effective at building a community around the project. > We need to determine how we can continue with something like this, while > following the Apache way. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site
[ https://issues.apache.org/jira/browse/ARROW-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12257. -- Resolution: Fixed Replaced by https://github.com/apache/arrow-datafusion/issues/18 > [Rust] [Ballista] Publish user guide to Arrow site > -- > > Key: ARROW-12257 > URL: https://issues.apache.org/jira/browse/ARROW-12257 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > Ballista has a user guide in mdbook format and we need to figure out how to > get this published to the arrow site (it was previously hosted at > https://ballistacompute.org/docs/) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics
[ https://issues.apache.org/jira/browse/ARROW-12434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12434. Resolution: Fixed PR was merged > [Rust] [Ballista] Show executed plans with metrics > -- > > Key: ARROW-12434 > URL: https://issues.apache.org/jira/browse/ARROW-12434 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Show executed plans with metrics to help with debugging and performance tuning -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API
[ https://issues.apache.org/jira/browse/ARROW-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12261. -- Resolution: Fixed Moved to https://github.com/apache/arrow-datafusion/issues/2 > [Rust] [Ballista] Ballista should not have its own DataFrame API > > > Key: ARROW-12261 > URL: https://issues.apache.org/jira/browse/ARROW-12261 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > When building the Ballista POC it was necessary to implement a new DataFrame > API that wrapped the DataFusion API. > One issue is that it wasn't possible to override the behavior of the collect > method to make it use the Ballista context rather than the DataFusion context. > Now that the projects are in the same repo it should be easier to fix this > and have users always use the DataFusion DataFrame API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec
[ https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-12432: -- Assignee: Andy Grove > [Rust] [DataFusion] Add metrics for SortExec > > > Key: ARROW-12432 > URL: https://issues.apache.org/jira/browse/ARROW-12432 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Add metrics for SortExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec
[ https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12432. Resolution: Fixed Issue resolved by pull request 10078 [https://github.com/apache/arrow/pull/10078] > [Rust] [DataFusion] Add metrics for SortExec > > > Key: ARROW-12432 > URL: https://issues.apache.org/jira/browse/ARROW-12432 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Add metrics for SortExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12436) [Rust][Ballista] Add watch capabilities to config backend trait
[ https://issues.apache.org/jira/browse/ARROW-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12436. Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10085 [https://github.com/apache/arrow/pull/10085] > [Rust][Ballista] Add watch capabilities to config backend trait > --- > > Key: ARROW-12436 > URL: https://issues.apache.org/jira/browse/ARROW-12436 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Ximo Guanter >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > [arrow/lib.rs at 66aa3e7c365a8d4c4eca6e23668f2988e714b493 · apache/arrow > (github.com)|https://github.com/apache/arrow/blob/66aa3e7c365a8d4c4eca6e23668f2988e714b493/rust/ballista/rust/scheduler/src/lib.rs#L183] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec
[ https://issues.apache.org/jira/browse/ARROW-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-12437: -- Assignee: Andy Grove > [Rust] [Ballista] Ballista plans must not include RepartitionExec > - > > Key: ARROW-12437 > URL: https://issues.apache.org/jira/browse/ARROW-12437 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Ballista plans must not include RepartitionExec because it results in > incorrect results. Ballista needs to manage its own repartitioning in a > distributed-aware way later on. For now we just need to configure the > DataFusion context to disable repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec
[ https://issues.apache.org/jira/browse/ARROW-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12437. Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10086 [https://github.com/apache/arrow/pull/10086] > [Rust] [Ballista] Ballista plans must not include RepartitionExec > - > > Key: ARROW-12437 > URL: https://issues.apache.org/jira/browse/ARROW-12437 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Ballista plans must not include RepartitionExec because it results in > incorrect results. Ballista needs to manage its own repartitioning in a > distributed-aware way later on. For now we just need to configure the > DataFusion context to disable repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec
Andy Grove created ARROW-12437: -- Summary: [Rust] [Ballista] Ballista plans must not include RepartitionExec Key: ARROW-12437 URL: https://issues.apache.org/jira/browse/ARROW-12437 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Ballista plans must not include RepartitionExec because it results in incorrect results. Ballista needs to manage its own repartitioning in a distributed-aware way later on. For now we just need to configure the DataFusion context to disable repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12334. Fix Version/s: 5.0.0 Resolution: Fixed Issue resolved by pull request 10083 [https://github.com/apache/arrow/pull/10083] > [Rust] [Ballista] Aggregate queries producing incorrect results > --- > > Key: ARROW-12334 > URL: https://issues.apache.org/jira/browse/ARROW-12334 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I just ran benchmarks for the first time in a while and I see duplicate > entries for group by keys. > > For example, query 1 has "group by l_returnflag, l_linestatus" and I see > multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12433: --- Component/s: Rust > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Blocker > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12433. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10082 [https://github.com/apache/arrow/pull/10082] > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Blocker > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324321#comment-17324321 ] Andy Grove commented on ARROW-12433: Thanks [~alippai], that is a good suggestion. So the issue is that our builds with nightly Rust are failing (our SIMD feature requires nightly, and the nightly version of Rust we use does not have const generics yet). I went ahead with a PR to pin to 0.8.3 to fix our builds. > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Blocker > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
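For readers unfamiliar with pinning in Cargo, a fragment along these lines is one way to hold a dependency at 0.8.3. It is a sketch only: which manifest in the Arrow repository was changed, and the exact line used in the merged PR, are not stated in this thread and are assumptions here.

```toml
[dependencies]
# The "=" prefix requests exactly this version. A bare "0.8.3" would be
# treated as a caret requirement and could still resolve to a newer 0.8.x.
flatbuffers = "=0.8.3"
```

An exact pin like this is normally a temporary measure and would be relaxed once the toolchain in use supports the newer release.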
[jira] [Updated] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12433: --- Priority: Blocker (was: Critical) > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Blocker > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics
Andy Grove created ARROW-12434: -- Summary: [Rust] [Ballista] Show executed plans with metrics Key: ARROW-12434 URL: https://issues.apache.org/jira/browse/ARROW-12434 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 Show executed plans with metrics to help with debugging and performance tuning -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324308#comment-17324308 ] Andy Grove commented on ARROW-12433: [~alippai] Am I misunderstanding this issue? > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Critical > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
[ https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324307#comment-17324307 ] Andy Grove commented on ARROW-12433: CI is already using 1.51 ... "latest update on 2021-03-25, rust version 1.51.0" > [Rust] Builds failing due to new flatbuffer release introducing const generics > -- > > Key: ARROW-12433 > URL: https://issues.apache.org/jira/browse/ARROW-12433 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Andy Grove >Priority: Critical > > I filed [https://github.com/google/flatbuffers/issues/6572] but for now we > should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
Andy Grove created ARROW-12433: -- Summary: [Rust] Builds failing due to new flatbuffer release introducing const generics Key: ARROW-12433 URL: https://issues.apache.org/jira/browse/ARROW-12433 Project: Apache Arrow Issue Type: Bug Affects Versions: 4.0.0 Reporter: Andy Grove I filed [https://github.com/google/flatbuffers/issues/6572] but for now we should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec
Andy Grove created ARROW-12432: -- Summary: [Rust] [DataFusion] Add metrics for SortExec Key: ARROW-12432 URL: https://issues.apache.org/jira/browse/ARROW-12432 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Fix For: 5.0.0 Add metrics for SortExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324288#comment-17324288 ] Andy Grove commented on ARROW-12334: I tracked this down and there are two separate bugs: 1. We are getting RepartitionExec in the plan which is not compatible with Ballista and explodes the number of partitions (and likely causes incorrect results) 2. The query actually works fine and the final sort produces 2 rows, but the results are created by reading all the intermediate results as well > [Rust] [Ballista] Aggregate queries producing incorrect results > --- > > Key: ARROW-12334 > URL: https://issues.apache.org/jira/browse/ARROW-12334 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > I just ran benchmarks for the first time in a while and I see duplicate > entries for group by keys. > > For example, query 1 has "group by l_returnflag, l_linestatus" and I see > multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323818#comment-17323818 ] Andy Grove commented on ARROW-12421: This failure happens consistently on my 24 core Threadripper desktop running Ubuntu but I cannot reproduce it on my MacBook Pro or on my work PC (6 cores, also Ubuntu). > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major >
> {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12362) [Rust] [DataFusion] topk_query test failure
[ https://issues.apache.org/jira/browse/ARROW-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12362. -- Resolution: Duplicate > [Rust] [DataFusion] topk_query test failure > --- > > Key: ARROW-12362 > URL: https://issues.apache.org/jira/browse/ARROW-12362 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > I'm seeing this locally with latest from master.
> {code:java}
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
Andy Grove created ARROW-12421: -- Summary: [Rust] [DataFusion] topk_query test fails in master Key: ARROW-12421 URL: https://issues.apache.org/jira/browse/ARROW-12421 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove
{code:java}
Running target/debug/deps/user_defined_plan-6b63acb904117235

running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... ok

failures:

---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12380) [Rust][Ballista] Add scheduler ui
[ https://issues.apache.org/jira/browse/ARROW-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12380. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10026 [https://github.com/apache/arrow/pull/10026] > [Rust][Ballista] Add scheduler ui > - > > Key: ARROW-12380 > URL: https://issues.apache.org/jira/browse/ARROW-12380 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct
[ https://issues.apache.org/jira/browse/ARROW-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12403: --- Component/s: Rust - Ballista > [Rust] [Ballista] Integration tests should check that query results are > correct > --- > > Key: ARROW-12403 > URL: https://issues.apache.org/jira/browse/ARROW-12403 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Fix For: 5.0.0 > > > The integration checks only test that the benchmark queries run without > error. They do not check that the results are correct. > I think some work already happened in DataFusion to check the TPC-H results > so hopefully we can re-use that work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct
Andy Grove created ARROW-12403: -- Summary: [Rust] [Ballista] Integration tests should check that query results are correct Key: ARROW-12403 URL: https://issues.apache.org/jira/browse/ARROW-12403 Project: Apache Arrow Issue Type: Improvement Reporter: Andy Grove Fix For: 5.0.0 The integration checks only test that the benchmark queries run without error. They do not check that the results are correct. I think some work already happened in DataFusion to check the TPC-H results so hopefully we can re-use that work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12402) [Rust] [DataFusion] Implement SQL metrics framework
Andy Grove created ARROW-12402: -- Summary: [Rust] [DataFusion] Implement SQL metrics framework Key: ARROW-12402 URL: https://issues.apache.org/jira/browse/ARROW-12402 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 As a user, I would like the ability to inspect metrics for an executed plan to help with debugging and performance tuning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc
[ https://issues.apache.org/jira/browse/ARROW-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12331: --- Fix Version/s: (was: 4.0.0) > [Rust] [Ballista] Make CI build work with snmalloc > -- > > Key: ARROW-12331 > URL: https://issues.apache.org/jira/browse/ARROW-12331 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > > Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but > is building without default features due to snmalloc requiring cmake. > An alternative approach would be to build with cc instead of cmake. See the > above PR for conversation about this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12334: --- Fix Version/s: (was: 4.0.0) > [Rust] [Ballista] Aggregate queries producing incorrect results > --- > > Key: ARROW-12334 > URL: https://issues.apache.org/jira/browse/ARROW-12334 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > I just ran benchmarks for the first time in a while and I see duplicate > entries for group by keys. > > For example, query 1 has "group by l_returnflag, l_linestatus" and I see > multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version
[ https://issues.apache.org/jira/browse/ARROW-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12335: --- Fix Version/s: (was: 4.0.0) > [Rust] [Ballista] Bump DataFusion version > - > > Key: ARROW-12335 > URL: https://issues.apache.org/jira/browse/ARROW-12335 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Update Ballista to use latest DataFusion version -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12332) [Rust] [Ballista] Api server for scheduler
[ https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-12332: -- Assignee: (was: Sathis Kumar) > [Rust] [Ballista] Api server for scheduler > -- > > Key: ARROW-12332 > URL: https://issues.apache.org/jira/browse/ARROW-12332 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12332) [Rust] [Ballista] Api server for scheduler
[ https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-12332: -- Assignee: Sathis Kumar > [Rust] [Ballista] Api server for scheduler > -- > > Key: ARROW-12332 > URL: https://issues.apache.org/jira/browse/ARROW-12332 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Assignee: Sathis Kumar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12362) [Rust] [DataFusion] topk_query test failure
Andy Grove created ARROW-12362: -- Summary: [Rust] [DataFusion] topk_query test failure Key: ARROW-12362 URL: https://issues.apache.org/jira/browse/ARROW-12362 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 I'm seeing this locally with latest from master.
{code:java}
---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12361) [Rust] [DataFusion] Allow users to override physical optimization rules
Andy Grove created ARROW-12361: -- Summary: [Rust] [DataFusion] Allow users to override physical optimization rules Key: ARROW-12361 URL: https://issues.apache.org/jira/browse/ARROW-12361 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 As a user of DataFusion (in Ballista) I would like to override the list of physical optimization rules. It is currently possible to add new rules but not to remove existing rules. -- This message was sent by Atlassian Jira (v8.3.4#803005)
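Editorial sketch for ARROW-12361: one way "override, not just append" could look is a builder method that replaces the whole rule list. This is a minimal illustration only. The trait, the two rule types, `ExecutionConfig`, `with_physical_optimizer_rules`, and `rule_names` are hypothetical stand-ins written for this note; they are not taken from the issue and may not match DataFusion's actual API.

```rust
/// Hypothetical rule trait; only the rule name matters for this sketch.
trait PhysicalOptimizerRule {
    fn name(&self) -> &str;
}

/// Hypothetical built-in rules.
struct CoalesceBatches;
struct Repartition;

impl PhysicalOptimizerRule for CoalesceBatches {
    fn name(&self) -> &str {
        "coalesce_batches"
    }
}

impl PhysicalOptimizerRule for Repartition {
    fn name(&self) -> &str {
        "repartition"
    }
}

/// Hypothetical configuration object holding the ordered rule list.
struct ExecutionConfig {
    physical_optimizers: Vec<Box<dyn PhysicalOptimizerRule>>,
}

impl ExecutionConfig {
    /// Start with an assumed default set of rules.
    fn new() -> Self {
        let rules: Vec<Box<dyn PhysicalOptimizerRule>> =
            vec![Box::new(CoalesceBatches), Box::new(Repartition)];
        Self { physical_optimizers: rules }
    }

    /// Replace the entire rule list, which also allows removing defaults.
    fn with_physical_optimizer_rules(
        mut self,
        rules: Vec<Box<dyn PhysicalOptimizerRule>>,
    ) -> Self {
        self.physical_optimizers = rules;
        self
    }

    /// Names of the configured rules, in order.
    fn rule_names(&self) -> Vec<&str> {
        self.physical_optimizers.iter().map(|r| r.name()).collect()
    }
}

fn main() {
    let config = ExecutionConfig::new();
    println!("default rules: {:?}", config.rule_names());

    // A caller that cannot use repartitioning supplies its own list.
    let rules: Vec<Box<dyn PhysicalOptimizerRule>> = vec![Box::new(CoalesceBatches)];
    let config = config.with_physical_optimizer_rules(rules);
    println!("overridden rules: {:?}", config.rule_names());
}
```

Replacing the list wholesale keeps the API small; a caller that only wants to drop one rule can filter the defaults and pass the result back in.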
[jira] [Resolved] (ARROW-12332) [Rust] [Ballista] Api server for scheduler
[ https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12332. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9987 [https://github.com/apache/arrow/pull/9987] > [Rust] [Ballista] Api server for scheduler > -- > > Key: ARROW-12332 > URL: https://issues.apache.org/jira/browse/ARROW-12332 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318973#comment-17318973 ] Andy Grove commented on ARROW-12334: I'm now very confused about this issue. I have been working on debugging it and now it suddenly is working, so I don't know if it is an intermittent bug or not. When it works correctly, the query returns 4 rows and takes ~13 seconds for me. When it does not work it returns many times more rows and takes 3x as long. It would be good to get a second pair of eyes on this. > [Rust] [Ballista] Aggregate queries producing incorrect results > --- > > Key: ARROW-12334 > URL: https://issues.apache.org/jira/browse/ARROW-12334 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > I just ran benchmarks for the first time in a while and I see duplicate > entries for group by keys. > > For example, query 1 has "group by l_returnflag, l_linestatus" and I see > multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12332) [Rust] [Ballista] Api server for scheduler
[ https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12332: --- Summary: [Rust] [Ballista] Api server for scheduler (was: Api server for scheduler) > [Rust] [Ballista] Api server for scheduler > -- > > Key: ARROW-12332 > URL: https://issues.apache.org/jira/browse/ARROW-12332 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
[ https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318951#comment-17318951 ] Andy Grove commented on ARROW-12334: I tracked down the PR that introduced the regression in the original repo and it was [https://github.com/ballista-compute/ballista/pull/574] > [Rust] [Ballista] Aggregate queries producing incorrect results > --- > > Key: ARROW-12334 > URL: https://issues.apache.org/jira/browse/ARROW-12334 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > I just ran benchmarks for the first time in a while and I see duplicate > entries for group by keys. > > For example, query 1 has "group by l_returnflag, l_linestatus" and I see > multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date
[ https://issues.apache.org/jira/browse/ARROW-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12313. Resolution: Fixed Issue resolved by pull request 9990 [https://github.com/apache/arrow/pull/9990] > [Rust] [Ballista] Benchmark documentation out of date > - > > Key: ARROW-12313 > URL: https://issues.apache.org/jira/browse/ARROW-12313 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The scheduler/executor were refactored and the documentation for the > benchmarks now needs updating. I plan on fixing this over the weekend. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version
Andy Grove created ARROW-12335: -- Summary: [Rust] [Ballista] Bump DataFusion version Key: ARROW-12335 URL: https://issues.apache.org/jira/browse/ARROW-12335 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Fix For: 4.0.0 Update Ballista to use latest DataFusion version -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
Andy Grove created ARROW-12334: -- Summary: [Rust] [Ballista] Aggregate queries producing incorrect results Key: ARROW-12334 URL: https://issues.apache.org/jira/browse/ARROW-12334 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 I just ran benchmarks for the first time in a while and I see duplicate entries for group by keys. For example, query 1 has "group by l_returnflag, l_linestatus" and I see multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12260) [Website] [Rust] Announce Ballista donation
[ https://issues.apache.org/jira/browse/ARROW-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318904#comment-17318904 ] Andy Grove commented on ARROW-12260: https://github.com/apache/arrow-site/pull/100 > [Website] [Rust] Announce Ballista donation > --- > > Key: ARROW-12260 > URL: https://issues.apache.org/jira/browse/ARROW-12260 > Project: Apache Arrow > Issue Type: Task > Components: Website >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > Once the IP clearance vote passes and the PR has been merged, we should > announce the donation on the Arrow blog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays
[ https://issues.apache.org/jira/browse/ARROW-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10920: --- Fix Version/s: (was: 4.0.0) > [Rust] Segmentation fault in Arrow Parquet writer with huge arrays > -- > > Key: ARROW-10920 > URL: https://issues.apache.org/jira/browse/ARROW-10920 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > > I stumbled across this by chance. I am not too surprised that this fails but > I would expect it to fail gracefully and not with a segmentation fault.
> {code:java}
> use std::fs::File;
> use std::sync::Arc;
> use arrow::array::StringBuilder;
> use arrow::datatypes::{DataType, Field, Schema};
> use arrow::error::Result;
> use arrow::record_batch::RecordBatch;
> use parquet::arrow::ArrowWriter;
>
> fn main() -> Result<()> {
>     let schema = Schema::new(vec![
>         Field::new("c0", DataType::Utf8, false),
>         Field::new("c1", DataType::Utf8, true),
>     ]);
>     let batch_size = 250;
>     let repeat_count = 140;
>     let file = File::create("/tmp/test.parquet")?;
>     let mut writer = ArrowWriter::try_new(file, Arc::new(schema.clone()), None).unwrap();
>     let mut c0_builder = StringBuilder::new(batch_size);
>     let mut c1_builder = StringBuilder::new(batch_size);
>     println!("Start of loop");
>     for i in 0..batch_size {
>         let c0_value = format!("{:032}", i);
>         let c1_value = c0_value.repeat(repeat_count);
>         c0_builder.append_value(&c0_value)?;
>         c1_builder.append_value(&c1_value)?;
>     }
>     println!("Finish building c0");
>     let c0 = Arc::new(c0_builder.finish());
>     println!("Finish building c1");
>     let c1 = Arc::new(c1_builder.finish());
>     println!("Creating RecordBatch");
>     let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?;
>     // write the batch to parquet
>     println!("Writing RecordBatch");
>     writer.write(&batch).unwrap();
>     println!("Closing writer");
>     writer.close().unwrap();
>     Ok(())
> }
> {code}
> output:
> {code:java}
> Start of loop
> Finish building c0
> Finish building c1
> Creating RecordBatch
> Writing RecordBatch
> Segmentation fault (core dumped)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11625) [Rust] [DataFusion] Move SortExec partition check to constructor
[ https://issues.apache.org/jira/browse/ARROW-11625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11625: --- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Move SortExec partition check to constructor > > > Key: ARROW-11625 > URL: https://issues.apache.org/jira/browse/ARROW-11625 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > SortExec has the following error check at execution time and this could be > moved into the try_new constructor so the error check happens at planning > time instead.
> {code:java}
> if 1 != self.input.output_partitioning().partition_count() {
>     return Err(DataFusionError::Internal(
>         "SortExec requires a single input partition".to_owned(),
>     ));
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
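Editorial sketch for ARROW-11625: the proposal amounts to running the same partition check inside `try_new`, so an invalid plan is rejected when it is built rather than when it runs. The types below (`PlanError`, `Input`, and this cut-down `SortExec`) are simplified, hypothetical stand-ins written for this note and are not DataFusion's real definitions; only the error message is taken from the issue.

```rust
/// Hypothetical stand-in for DataFusion's error type.
#[derive(Debug, PartialEq)]
enum PlanError {
    Internal(String),
}

/// Hypothetical stand-in for an input plan; only its partition count matters here.
#[derive(Debug)]
struct Input {
    partition_count: usize,
}

/// Cut-down sort operator that validates its input when constructed.
#[derive(Debug)]
struct SortExec {
    input: Input,
}

impl SortExec {
    /// Fail at planning time if the input has more than one partition.
    fn try_new(input: Input) -> Result<Self, PlanError> {
        if input.partition_count != 1 {
            return Err(PlanError::Internal(
                "SortExec requires a single input partition".to_owned(),
            ));
        }
        Ok(Self { input })
    }

    /// Execution can now rely on the invariant established by `try_new`.
    fn input_partitions(&self) -> usize {
        self.input.partition_count
    }
}

fn main() {
    let ok = SortExec::try_new(Input { partition_count: 1 }).unwrap();
    println!("planned sort over {} partition", ok.input_partitions());

    let err = SortExec::try_new(Input { partition_count: 4 }).unwrap_err();
    println!("planning failed early: {:?}", err);
}
```

With the check in the constructor, code that executes the operator no longer needs to repeat it.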
[jira] [Updated] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups
[ https://issues.apache.org/jira/browse/ARROW-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11016: --- Fix Version/s: (was: 4.0.0) > [Rust] Parquet ArrayReader should allow reading a subset of row groups > -- > > Key: ARROW-11016 > URL: https://issues.apache.org/jira/browse/ARROW-11016 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Priority: Major > > Parquet ArrayReader currently only supports reading an entire file from start > to finish and does not allow selectively reading a subset of row groups. This > prevents us from parallelizing work across threads when processing a single > parquet file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
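Editorial sketch for ARROW-11016: if a reader could be told which row groups to read, work on a single file could be split by handing each worker its own set of row group indices. The function below only illustrates that assignment step. `assign_row_groups` and the round-robin policy are assumptions made for this note; the issue does not specify either, and no Parquet API is called.

```rust
/// Distribute row group indices `0..num_row_groups` across workers round-robin.
/// Returns one list of indices per worker (hypothetical helper, not a Parquet API).
fn assign_row_groups(num_row_groups: usize, num_workers: usize) -> Vec<Vec<usize>> {
    if num_workers == 0 {
        return Vec::new();
    }
    let mut assignments = vec![Vec::new(); num_workers];
    for row_group in 0..num_row_groups {
        assignments[row_group % num_workers].push(row_group);
    }
    assignments
}

fn main() {
    // Five row groups shared between two workers.
    for (worker, row_groups) in assign_row_groups(5, 2).iter().enumerate() {
        println!("worker {} reads row groups {:?}", worker, row_groups);
    }
}
```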
[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join
[ https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11094: --- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Implement Sort-Merge Join > - > > Key: ARROW-11094 > URL: https://issues.apache.org/jira/browse/ARROW-11094 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > The current hash join works well when one side of the join can be loaded into > memory but cannot scale beyond the available RAM. > The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the > left and right partitions, and write the intermediate results to disk, and > then stream both sides of the join by merging these sorted partitions and we > do not need to load one side into memory. At most, we need to load all > batches from both sides that contain the current join key values. > In order to reduce memory pressure we will want to limit the concurrency of > these sort operations. > We would still want to default to hash join when we know that the build-side > can fit into memory since it is more efficient than using a sort-merge join. > [https://en.wikipedia.org/wiki/Sort-merge_join] -- This message was sent by Atlassian Jira (v8.3.4#803005)
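Editorial sketch for ARROW-11094: the merge phase described above can be shown on two in-memory slices that are already sorted by key. This is a minimal inner-join illustration under that assumption. It leaves out sorting, spilling to disk, and Arrow record batches, and `sort_merge_join` is a name chosen for this note rather than anything in DataFusion.

```rust
/// Inner join of two inputs that are already sorted by their `i64` key.
/// Only the runs of rows sharing the current key are paired at any one time.
fn sort_merge_join<L: Clone, R: Clone>(
    left: &[(i64, L)],
    right: &[(i64, R)],
) -> Vec<(i64, L, R)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        let (lk, rk) = (left[i].0, right[j].0);
        if lk < rk {
            i += 1; // left key is behind: advance the left side
        } else if lk > rk {
            j += 1; // right key is behind: advance the right side
        } else {
            // Find the end of the run of equal keys on each side.
            let i_end = i + left[i..].iter().take_while(|row| row.0 == lk).count();
            let j_end = j + right[j..].iter().take_while(|row| row.0 == lk).count();
            // Emit the cross product of the two runs.
            for l in &left[i..i_end] {
                for r in &right[j..j_end] {
                    out.push((lk, l.1.clone(), r.1.clone()));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}

fn main() {
    let left = vec![(1, "a"), (2, "b"), (2, "c"), (4, "d")];
    let right = vec![(2, "x"), (2, "y"), (3, "z"), (4, "w")];
    for row in sort_merge_join(&left, &right) {
        println!("{:?}", row);
    }
}
```

Because both cursors only move forward, each side is read once, which is what makes streaming from sorted spill files possible.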
[jira] [Updated] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec
[ https://issues.apache.org/jira/browse/ARROW-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11020: --- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Implement better tests for ParquetExec > -- > > Key: ARROW-11020 > URL: https://issues.apache.org/jira/browse/ARROW-11020 > Project: Apache Arrow > Issue Type: Test > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > Implement better tests for ParquetExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10884) [Rust] [DataFusion] Benchmark crate does not have a SIMD feature
[ https://issues.apache.org/jira/browse/ARROW-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10884: --- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Benchmark crate does not have a SIMD feature > > > Key: ARROW-10884 > URL: https://issues.apache.org/jira/browse/ARROW-10884 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Minor > > The benchmarks run without SIMD by default. We need to add a feature to the > Cargo.toml to enable SIMD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date
[ https://issues.apache.org/jira/browse/ARROW-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12313: --- Summary: [Rust] [Ballista] Benchmark documentation out of date (was: [Rust] [Ballista] Benchmark docuementation out of date) > [Rust] [Ballista] Benchmark documentation out of date > - > > Key: ARROW-12313 > URL: https://issues.apache.org/jira/browse/ARROW-12313 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > The scheduler/executor were refactored and the documentation for the > benchmarks now needs updating. I plan on fixing this over the weekend. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism
[ https://issues.apache.org/jira/browse/ARROW-11059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11059: --- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Implement extensible configuration mechanism > > > Key: ARROW-11059 > URL: https://issues.apache.org/jira/browse/ARROW-11059 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > We are getting to the point where there are multiple settings we could add to > operators to fine-tune performance. Custom operators provided by crates that > extend DataFusion may also need this capability. > I propose that we add support for key-value configuration options so that we > don't need to plumb through each new configuration setting that we add. > For example, I am about to start on a "coalesce batches" operator and I would > like a setting such as "coalesce.batch.size". > For built-in settings like this we can provide information such as > documentation and default values and generate documentation from this. > For example, here is how Spark defines configs:
> {code:java}
> val PARQUET_VECTORIZED_READER_ENABLED =
>   buildConf("spark.sql.parquet.enableVectorizedReader")
>     .doc("Enables vectorized parquet decoding.")
>     .version("2.0.0")
>     .booleanConf
>     .createWithDefault(true)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
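Editorial sketch for ARROW-11059: a key-value mechanism of the kind proposed could pair each built-in setting with its documentation and default, and fall back to the default when the user has not set the key. Everything below is illustrative. `ConfigEntry`, `Config`, and the default value of 4096 are assumptions made for this note; only the key name "coalesce.batch.size" comes from the issue.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;

/// Hypothetical description of a built-in setting: key, documentation, default.
struct ConfigEntry {
    key: &'static str,
    doc: &'static str,
    default: &'static str,
}

/// The key is from the issue; the doc text and default value are assumed.
const COALESCE_BATCH_SIZE: ConfigEntry = ConfigEntry {
    key: "coalesce.batch.size",
    doc: "Target number of rows per coalesced batch.",
    default: "4096",
};

/// Untyped key-value store, so new settings need no extra plumbing.
#[derive(Default)]
struct Config {
    settings: HashMap<String, String>,
}

impl Config {
    fn set(&mut self, key: &str, value: &str) {
        self.settings.insert(key.to_owned(), value.to_owned());
    }

    /// Raw value for a setting, falling back to the entry's default.
    fn get(&self, entry: &ConfigEntry) -> &str {
        self.settings
            .get(entry.key)
            .map(String::as_str)
            .unwrap_or(entry.default)
    }

    /// Typed accessor; parsing happens at read time.
    fn get_usize(&self, entry: &ConfigEntry) -> Result<usize, ParseIntError> {
        self.get(entry).parse()
    }
}

fn main() {
    let mut config = Config::default();
    println!("{}: {}", COALESCE_BATCH_SIZE.key, COALESCE_BATCH_SIZE.doc);
    println!("default = {:?}", config.get_usize(&COALESCE_BATCH_SIZE));

    config.set("coalesce.batch.size", "8192");
    println!("override = {:?}", config.get_usize(&COALESCE_BATCH_SIZE));
}
```

Keeping the documentation and default next to the key is what would allow reference documentation to be generated from the entries, as the issue suggests.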
[jira] [Resolved] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI
[ https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12251. Resolution: Fixed Issue resolved by pull request 9979 [https://github.com/apache/arrow/pull/9979] > [Rust] [Ballista] Add Ballista tests to CI > -- > > Key: ARROW-12251 > URL: https://issues.apache.org/jira/browse/ARROW-12251 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Ballista is a standalone project (not part of the Arrow Rust workspace) and > therefore the tests will not run in CI without additional work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc
Andy Grove created ARROW-12331: -- Summary: [Rust] [Ballista] Make CI build work with snmalloc Key: ARROW-12331 URL: https://issues.apache.org/jira/browse/ARROW-12331 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Fix For: 4.0.0 Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but is building without default features due to snmalloc requiring cmake. An alternative approach would be to build with cc instead of cmake. See the above PR for conversation about this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12329) [Rust] [Ballista] Add README
[ https://issues.apache.org/jira/browse/ARROW-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12329. Resolution: Fixed Issue resolved by pull request 9981 [https://github.com/apache/arrow/pull/9981] > [Rust] [Ballista] Add README > > > Key: ARROW-12329 > URL: https://issues.apache.org/jira/browse/ARROW-12329 > Project: Apache Arrow > Issue Type: Task > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We did not bring a README over in the donation and need to write a new one > anyway now this is part of Arrow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12328) [Rust] [Ballista] Fix code formatting
[ https://issues.apache.org/jira/browse/ARROW-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12328. Resolution: Fixed Issue resolved by pull request 9980 [https://github.com/apache/arrow/pull/9980] > [Rust] [Ballista] Fix code formatting > - > > Key: ARROW-12328 > URL: https://issues.apache.org/jira/browse/ARROW-12328 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12329) [Rust] [Ballista] Add README
Andy Grove created ARROW-12329: -- Summary: [Rust] [Ballista] Add README Key: ARROW-12329 URL: https://issues.apache.org/jira/browse/ARROW-12329 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 We did not bring a README over in the donation and need to write a new one anyway now this is part of Arrow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12328) [Rust] [Ballista] Fix code formatting
Andy Grove created ARROW-12328: -- Summary: [Rust] [Ballista] Fix code formatting Key: ARROW-12328 URL: https://issues.apache.org/jira/browse/ARROW-12328 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI
[ https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318509#comment-17318509 ] Andy Grove commented on ARROW-12251: [~boazbe] The goal is to add "cargo build" and "cargo test" for Ballista to the existing GitHub Actions workflows for the Rust project. I have an initial PR up (linked to this JIRA) but I immediately ran into an issue with snmalloc. > [Rust] [Ballista] Add Ballista tests to CI > -- > > Key: ARROW-12251 > URL: https://issues.apache.org/jira/browse/ARROW-12251 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Ballista is a standalone project (not part of the Arrow Rust workspace) and > therefore the tests will not run in CI without additional work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date
Andy Grove created ARROW-12313: -- Summary: [Rust] [Ballista] Benchmark documentation out of date Key: ARROW-12313 URL: https://issues.apache.org/jira/browse/ARROW-12313 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 The scheduler/executor were refactored and the documentation for the benchmarks now needs updating. I plan on fixing this over the weekend. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform
[ https://issues.apache.org/jira/browse/ARROW-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-11982. Resolution: Fixed Issue resolved by pull request 9723 [https://github.com/apache/arrow/pull/9723] > [Rust] Donate Ballista Distributed Compute Platform > --- > > Key: ARROW-11982 > URL: https://issues.apache.org/jira/browse/ARROW-11982 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > See PR for details. > https://github.com/apache/arrow/pull/9723 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12284) [Rust] [DataFusion] Review the contract between DataFusion and Arrow
Andy Grove created ARROW-12284: -- Summary: [Rust] [DataFusion] Review the contract between DataFusion and Arrow Key: ARROW-12284 URL: https://issues.apache.org/jira/browse/ARROW-12284 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove I am creating this issue based on the discussion at the sync call earlier today. Apparently DataFusion is not only using the high-level Arrow API but is also accessing Arrow internals directly and this would be one challenge in moving to a majorly refactored Arrow implementation. Perhaps we need to review what the public Arrow API should be and which APIs DataFusion should or should not be using. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API
Andy Grove created ARROW-12261: -- Summary: [Rust] [Ballista] Ballista should not have its own DataFrame API Key: ARROW-12261 URL: https://issues.apache.org/jira/browse/ARROW-12261 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Fix For: 5.0.0 When building the Ballista POC it was necessary to implement a new DataFrame API that wrapped the DataFusion API. One issue is that it wasn't possible to override the behavior of the collect method to make it use the Ballista context rather than the DataFusion context. Now that the projects are in the same repo it should be easier to fix this and have users always use the DataFusion DataFrame API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12260) [Website] [Rust] Announce Ballista donation
Andy Grove created ARROW-12260: -- Summary: [Website] [Rust] Announce Ballista donation Key: ARROW-12260 URL: https://issues.apache.org/jira/browse/ARROW-12260 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove Assignee: Andy Grove Once the IP clearance vote passes and the PR has been merged, we should announce the donation on the Arrow blog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI
[ https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12251: --- Issue Type: Improvement (was: Bug) > [Rust] [Ballista] Add Ballista tests to CI > -- > > Key: ARROW-12251 > URL: https://issues.apache.org/jira/browse/ARROW-12251 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - Ballista >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > Ballista is a standalone project (not part of the Arrow Rust workspace) and > therefore the tests will not run in CI without additional work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site
Andy Grove created ARROW-12257: -- Summary: [Rust] [Ballista] Publish user guide to Arrow site Key: ARROW-12257 URL: https://issues.apache.org/jira/browse/ARROW-12257 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 Ballista has a user guide in mdbook format and we need to figure out how to get this published to the arrow site (it was previously hosted at https://ballistacompute.org/docs/) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12256) [Rust] [Ballista] Add DataFrame support
Andy Grove created ARROW-12256: -- Summary: [Rust] [Ballista] Add DataFrame support Key: ARROW-12256 URL: https://issues.apache.org/jira/browse/ARROW-12256 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Fix For: 5.0.0 Ballista has so far been focused on SQL support rather than DataFrame support. DataFrame support is partially implemented but needs more work to complete. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion
Andy Grove created ARROW-12255: -- Summary: [Rust] [Ballista] Integrate scheduler with DataFusion Key: ARROW-12255 URL: https://issues.apache.org/jira/browse/ARROW-12255 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista, Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 The Ballista scheduler breaks a query down into stages based on changes in partitioning in the plan, where each stage is broken down into tasks that can be executed concurrently. Rather than trying to run all the partitions at once, Ballista executors process n concurrent tasks at a time and then request new tasks from the scheduler. This approach would help DataFusion scale better and it would be ideal to use the same scheduler to scale across cores in DataFusion and across nodes in Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
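The "n concurrent tasks at a time" behaviour described in ARROW-12255 can be sketched roughly as below. This is only an illustration using the Rust standard library, not the Ballista scheduler: `run_task`, `run_stage`, and the shared in-process queue are hypothetical names, and a real executor would request tasks from the scheduler over the network rather than from a `Mutex`-protected queue.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical task body: "process" one partition and return (partition, result).
fn run_task(partition: usize) -> (usize, usize) {
    (partition, partition * 2)
}

/// Run one stage made of `partitions` tasks with at most `slots` tasks in flight.
/// Each worker asks the shared queue for a new task as soon as it finishes one.
/// With `slots == 0` no worker is started and no task runs.
fn run_stage(partitions: usize, slots: usize) -> Vec<(usize, usize)> {
    let queue: Arc<Mutex<VecDeque<usize>>> =
        Arc::new(Mutex::new((0..partitions).collect()));
    let results = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..slots)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the task itself runs without holding the queue lock.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(partition) => {
                        let result = run_task(partition);
                        results.lock().unwrap().push(result);
                    }
                    None => break, // no tasks left in this stage
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Completion order is nondeterministic, so sort by partition number.
    let mut out = results.lock().unwrap().clone();
    out.sort();
    out
}
```

Calling `run_stage(4, 2)` processes four partitions while never running more than two at once, which is the property the issue argues would help DataFusion as well.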
[jira] [Created] (ARROW-12253) [Rust] [Ballista] Implement scalable joins
Andy Grove created ARROW-12253: -- Summary: [Rust] [Ballista] Implement scalable joins Key: ARROW-12253 URL: https://issues.apache.org/jira/browse/ARROW-12253 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 The main issue limiting scalability in Ballista today is that joins are implemented as hash joins where each partition of the probe side causes the entire left side to be loaded into memory. To make this scalable we need to hash partition left and right inputs so that we can join the left and right partitions in parallel. There is already work underway in DataFusion to implement this that we can leverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
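The partitioned approach proposed in ARROW-12253 can be illustrated with a small standard-library sketch. It is not the DataFusion or Ballista implementation: the `Row` type, the function names, and the use of `DefaultHasher` are assumptions made for the example, and the partitions are joined sequentially here although each pair could be handed to a separate task.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical row type: a join key and a payload.
type Row = (u64, &'static str);

/// Map a key to one of `n` partitions (assumes n > 0).
fn partition_of(key: u64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}

/// Split rows into `n` partitions by hashing the join key.
fn hash_partition(rows: &[Row], n: usize) -> Vec<Vec<Row>> {
    let mut parts = vec![Vec::new(); n];
    for &row in rows {
        parts[partition_of(row.0, n)].push(row);
    }
    parts
}

/// Hash join of one left partition with the matching right partition.
/// Only this partition of the left side is held in the hash table.
fn join_partition(left: &[Row], right: &[Row]) -> Vec<(u64, &'static str, &'static str)> {
    let mut build: HashMap<u64, Vec<&'static str>> = HashMap::new();
    for &(key, value) in left {
        build.entry(key).or_default().push(value);
    }
    let mut out = Vec::new();
    for &(key, right_value) in right {
        if let Some(left_values) = build.get(&key) {
            for &left_value in left_values {
                out.push((key, left_value, right_value));
            }
        }
    }
    out
}

/// Partition both inputs the same way, then join partition i with partition i.
fn partitioned_join(left: &[Row], right: &[Row], n: usize) -> Vec<(u64, &'static str, &'static str)> {
    let left_parts = hash_partition(left, n);
    let right_parts = hash_partition(right, n);
    let mut out: Vec<_> = left_parts
        .iter()
        .zip(right_parts.iter())
        .flat_map(|(l, r)| join_partition(l, r))
        .collect();
    out.sort(); // deterministic order for inspection
    out
}
```

Because equal keys always hash to the same partition number on both sides, joining partition i of the left input only with partition i of the right input produces the same rows as a single large hash join, without loading the entire left side for every probe partition.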
[jira] [Created] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?
Andy Grove created ARROW-12252: -- Summary: [Rust] [Ballista] How to continue "This week in Ballista"? Key: ARROW-12252 URL: https://issues.apache.org/jira/browse/ARROW-12252 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove The Ballista project published a weekly newsletter and this has been very effective at building a community around the project. We need to determine how we can continue with something like this, while following the Apache way. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column
[ https://issues.apache.org/jira/browse/ARROW-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-12250: --- Description: I just pulled latest from master (commit d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran `cargo clean` followed by `cargo test`. One test fails (sometimes). It can fail in multiple ways: {code:java} arrow::arrow_writer::tests::fixed_size_binary_single_column stdout thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54 {code} {code:java} arrow::arrow_writer::tests::fixed_size_binary_single_column stdout thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked at 'Unable to get batch: ParquetError("Parquet error: underlying Thrift error: end of file")', parquet/src/arrow/arrow_writer.rs:927:14 {code} was: I just pulled latest from master (commit d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran `cargo clean` followed by `cargo test`. One test fails: {code:java} arrow::arrow_writer::tests::fixed_size_binary_single_column stdout thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54 {code} > [Rust] Failing test > arrow::arrow_writer::tests::fixed_size_binary_single_column > --- > > Key: ARROW-12250 > URL: https://issues.apache.org/jira/browse/ARROW-12250 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > I just pulled latest from master (commit > d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran > `cargo clean` followed by `cargo test`. > One test fails (sometimes). 
It can fail in multiple ways: > {code:java} > arrow::arrow_writer::tests::fixed_size_binary_single_column stdout > thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked > at 'called `Result::unwrap()` on an `Err` value: General("Could not parse > metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54 > {code} > {code:java} > arrow::arrow_writer::tests::fixed_size_binary_single_column stdout > thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked > at 'Unable to get batch: ParquetError("Parquet error: underlying Thrift > error: end of file")', parquet/src/arrow/arrow_writer.rs:927:14 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI
Andy Grove created ARROW-12251: -- Summary: [Rust] [Ballista] Add Ballista tests to CI Key: ARROW-12251 URL: https://issues.apache.org/jira/browse/ARROW-12251 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 Ballista is a standalone project (not part of the Arrow Rust workspace) and therefore the tests will not run in CI without additional work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column
Andy Grove created ARROW-12250: -- Summary: [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column Key: ARROW-12250 URL: https://issues.apache.org/jira/browse/ARROW-12250 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Fix For: 4.0.0 I just pulled latest from master (commit d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran `cargo clean` followed by `cargo test`. One test fails: {code:java} arrow::arrow_writer::tests::fixed_size_binary_single_column stdout thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12064) [Rust] [DataFusion] Make DataFrame extensible
Andy Grove created ARROW-12064: -- Summary: [Rust] [DataFusion] Make DataFrame extensible Key: ARROW-12064 URL: https://issues.apache.org/jira/browse/ARROW-12064 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove The DataFrame implementation currently has two types of logic: # Logic for building a logical query plan # Logic for executing a query using the DataFusion context We can make DataFrame more extensible by having it always delegate to the context for execution, allowing the same DataFrame logic to be used for local and distributed execution. We will likely need to introduce a new ExecutionContext trait with different implementations for DataFusion and Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
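One possible shape for the delegation described in ARROW-12064 is sketched below. Everything here is hypothetical: the issue only says a new `ExecutionContext` trait will likely be needed, so the trait method, the `LogicalPlan` stand-in, and the `LocalContext` / `DistributedContext` types are invented for illustration and do not reflect the actual DataFusion or Ballista APIs.

```rust
/// Hypothetical stand-in for a logical plan: just a description string.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan(String);

/// Hypothetical trait: anything that can execute a logical plan.
trait ExecutionContext {
    fn execute(&self, plan: &LogicalPlan) -> Vec<String>;
}

/// Stand-in for local (DataFusion-style) execution.
struct LocalContext;

/// Stand-in for distributed (Ballista-style) execution.
struct DistributedContext {
    scheduler_url: String,
}

impl ExecutionContext for LocalContext {
    fn execute(&self, plan: &LogicalPlan) -> Vec<String> {
        vec![format!("local: {}", plan.0)]
    }
}

impl ExecutionContext for DistributedContext {
    fn execute(&self, plan: &LogicalPlan) -> Vec<String> {
        vec![format!("remote via {}: {}", self.scheduler_url, plan.0)]
    }
}

/// The DataFrame only builds the plan; execution is always delegated.
struct DataFrame<'a> {
    ctx: &'a dyn ExecutionContext,
    plan: LogicalPlan,
}

impl<'a> DataFrame<'a> {
    fn new(ctx: &'a dyn ExecutionContext, table: &str) -> Self {
        DataFrame { ctx, plan: LogicalPlan(format!("Scan({})", table)) }
    }

    /// Plan-building logic, identical for every context.
    fn filter(self, predicate: &str) -> Self {
        let plan = LogicalPlan(format!("Filter({}, {})", predicate, self.plan.0));
        DataFrame { ctx: self.ctx, plan }
    }

    /// Execution logic: no knowledge of where the plan runs.
    fn collect(&self) -> Vec<String> {
        self.ctx.execute(&self.plan)
    }
}
```

With this split, `DataFrame::new(&LocalContext, "t").filter("a > 1").collect()` and the same chain built on a `DistributedContext` share all plan-building code and differ only in the context that runs the plan.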
[jira] [Updated] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform
[ https://issues.apache.org/jira/browse/ARROW-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11982: --- Description: See PR for details. https://github.com/apache/arrow/pull/9723 was:See [PR|[https://github.com/apache/arrow/pull/9723]] for details. > [Rust] Donate Ballista Distributed Compute Platform > --- > > Key: ARROW-11982 > URL: https://issues.apache.org/jira/browse/ARROW-11982 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 4.0.0 > > > See PR for details. > https://github.com/apache/arrow/pull/9723 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform
Andy Grove created ARROW-11982: -- Summary: [Rust] Donate Ballista Distributed Compute Platform Key: ARROW-11982 URL: https://issues.apache.org/jira/browse/ARROW-11982 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 See [PR|[https://github.com/apache/arrow/pull/9723]] for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11150) [Rust] Set up bi-weekly Rust sync call and update website
[ https://issues.apache.org/jira/browse/ARROW-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301681#comment-17301681 ] Andy Grove commented on ARROW-11150: We should also list the ASF slack channel: https://s.apache.org/slack-invite > [Rust] Set up bi-weekly Rust sync call and update website > - > > Key: ARROW-11150 > URL: https://issues.apache.org/jira/browse/ARROW-11150 > Project: Apache Arrow > Issue Type: Task > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > Given the momentum on the Rust implementation, I am going to set up a > bi-weekly sync call on Google Meet most likely. The call will be at the same > time as the current sync call but on alternate weeks. > I will update the web site to list both calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11948) [Rust] 3.0.1 patch release
Andy Grove created ARROW-11948: -- Summary: [Rust] 3.0.1 patch release Key: ARROW-11948 URL: https://issues.apache.org/jira/browse/ARROW-11948 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.1 Spreadsheet where I am tracking the fixes that get merged to maint-3.0.x https://docs.google.com/spreadsheets/d/111k0PGEVzxg1k7Q_d_1kV7E24VRB3DVJP1MnQImVrCc/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11934) [Rust] Document patch release process
[ https://issues.apache.org/jira/browse/ARROW-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300476#comment-17300476 ] Andy Grove commented on ARROW-11934: [~npr] Could I ask you to take a look at this google doc when you get a chance. In particular, could you participate in the conversation I am having with [~emkornfield] about whether we can make language-specific patch releases? > [Rust] Document patch release process > - > > Key: ARROW-11934 > URL: https://issues.apache.org/jira/browse/ARROW-11934 > Project: Apache Arrow > Issue Type: Task > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 3.0.1 > > > Now that we moved to voting on source releases for patch releases, we need to > document the process for doing so in the Rust implementation. > > Google doc for discussion / collaboration: > https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11239) [Rust] array::transform::tests::test_struct failed
[ https://issues.apache.org/jira/browse/ARROW-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11239: --- Fix Version/s: 3.0.1 > [Rust] array::transform::tests::test_struct failed > -- > > Key: ARROW-11239 > URL: https://issues.apache.org/jira/browse/ARROW-11239 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 3.0.0 >Reporter: Qingyou Meng >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Test *array::transform::tests::test_struct* in > *arrow/src/array/transform/mod.rs* fails when the first two elements are swapped: > change from > {code:java} > // code placeholder > let strings: ArrayRef = Arc::new(StringArray::from(vec![ > Some("joe"), > None,{code} > to > {code:java} > // code placeholder > let strings: ArrayRef = Arc::new(StringArray::from(vec![ > None, > Some("joe"),{code} > The failure was first found when I reported > https://issues.apache.org/jira/browse/ARROW-11160 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11269) [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas
[ https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11269: --- Fix Version/s: 3.0.1 > [Rust] Unable to read Parquet file because of mismatch in column-derived and > embedded schemas > - > > Key: ARROW-11269 > URL: https://issues.apache.org/jira/browse/ARROW-11269 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 3.0.0 >Reporter: Max Burke >Assignee: Neville Dipale >Priority: Blocker > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs > > Time Spent: 1h 50m > Remaining Estimate: 0h > > The issue seems to stem from the new(-ish) behavior of the Arrow Parquet > reader where the embedded arrow schema is used instead of deriving the schema > from the Parquet columns. > > However it seems like some cases still derive the schema type from the column > types, leading to the Arrow record batch reader erroring out that the column > types must match the schema types. > > In our case, the column type is an int96 datetime (ns) type, and the Arrow > type in the embedded schema is DataType::Timestamp(TimeUnit::Nanoseconds, > Some("UTC")). However, the code that constructs the Arrays seems to re-derive > this column type as DataType::Timestamp(TimeUnit::Nanoseconds, None) (because > the Parquet schema has no timezone information). And so, Parquet files that > we were able to read successfully with our branch of Arrow circa October are > now unreadable. > > I've attached an example of a Parquet file that demonstrates the problem. > This file was created in Python (as most of our Parquet files are). > > I've also attached a sample Rust program that will demonstrate the error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11311) [Rust] unset_bit is toggling bits, not unsetting them
[ https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11311: --- Fix Version/s: 3.0.1 > [Rust] unset_bit is toggling bits, not unsetting them > - > > Key: ARROW-11311 > URL: https://issues.apache.org/jira/browse/ARROW-11311 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 50m > Remaining Estimate: 0h > > The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not > unsetting them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
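The difference between toggling and unsetting a bit, which is the subject of ARROW-11311, can be shown with a minimal sketch. These are standalone helper functions written for illustration, not the code in Arrow's `bit_util` module; the least-significant-bit-first numbering within each byte is assumed here to match Arrow's bitmap convention.

```rust
/// Toggle bit `i` with XOR. Applied to a bit that is already 0,
/// this *sets* it, which is the kind of behaviour the issue reports.
fn toggle_bit(data: &mut [u8], i: usize) {
    data[i / 8] ^= 1u8 << (i % 8);
}

/// Clear bit `i` by AND-ing with the inverted mask.
/// The bit ends up 0 regardless of its previous value.
fn unset_bit(data: &mut [u8], i: usize) {
    data[i / 8] &= !(1u8 << (i % 8));
}

/// Read bit `i`.
fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1u8 << (i % 8))) != 0
}
```

The two operations agree when the bit is currently 1 and disagree when it is 0, so a test that only unsets bits that were previously set would not catch a toggle-based implementation.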
[jira] [Updated] (ARROW-11322) [Rust] Arrow `memory` made private is a breaking API change
[ https://issues.apache.org/jira/browse/ARROW-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11322: --- Fix Version/s: 3.0.1 > [Rust] Arrow `memory` made private is a breaking API change > --- > > Key: ARROW-11322 > URL: https://issues.apache.org/jira/browse/ARROW-11322 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Max Burke >Assignee: Jorge Leitão >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > We depend on functionality in the Arrow memory module for buffer building and > this was recently made private. > > Please make this module public again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11313) [Rust] Size hint of iterators is incorrect
[ https://issues.apache.org/jira/browse/ARROW-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11313: --- Fix Version/s: 3.0.1 > [Rust] Size hint of iterators is incorrect > -- > > Key: ARROW-11313 > URL: https://issues.apache.org/jira/browse/ARROW-11313 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11323) [Rust][DataFusion] ComputeError("concat requires input of at least one array")) with queries with ORDER BY or GROUP BY that return no
[ https://issues.apache.org/jira/browse/ARROW-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11323: --- Fix Version/s: 3.0.1 > [Rust][DataFusion] ComputeError("concat requires input of at least one > array")) with queries with ORDER BY or GROUP BY that return no > -- > > Key: ARROW-11323 > URL: https://issues.apache.org/jira/browse/ARROW-11323 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 1.5h > Remaining Estimate: 0h > > If you run a SQL query in DataFusion whose predicates produce no > rows and which also includes a GROUP BY or ORDER BY clause, you get the following > error: > "ArrowError(ComputeError("concat requires input of at least one > array"))" > Here are two test cases that show the problem: > https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/context.rs#L889 > {code} > #[tokio::test] > async fn sort_empty() -> Result<()> { > // The predicate on this query purposely generates no results > let results = > execute("SELECT c1, c2 FROM test WHERE c1 > 10 ORDER BY c1 > DESC, c2 ASC", 4).await?; > assert_eq!(results.len(), 0); > Ok(()) > } > #[tokio::test] > async fn aggregate_empty() -> Result<()> { > // The predicate on this query purposely generates no results > let results = execute("SELECT SUM(c1), SUM(c2) FROM test where c1 > > 10", 4).await?; > assert_eq!(results.len(), 0); > Ok(()) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11394) [Rust] Slice + Concat incorrect for structs
[ https://issues.apache.org/jira/browse/ARROW-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11394: --- Fix Version/s: 3.0.1 > [Rust] Slice + Concat incorrect for structs > --- > > Key: ARROW-11394 > URL: https://issues.apache.org/jira/browse/ARROW-11394 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 3.0.0 >Reporter: Ben Chambers >Assignee: Ben Chambers >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > If you slice an array and then use it with {{concat}} you get different > behaviors when using a primitive array (Float64 in the examples) or a struct > array. > In the case of a float, the result is what I'd expect -- it concatenates the > elements from the slice. > In the case of a struct, it is a bit surprising -- the result has the length > of the slice, but starts at the beginning of the original array. > {code:java} > // #[test] > fn test_repro() { > // Create float and struct array. 
> let float_array: ArrayRef = Arc::new(Float64Array::from(vec![1.0, 2.0, > 3.0, 4.0])); > let struct_array = Arc::new(StructArray::from(vec![( > Field::new("field", DataType::Float64, true), > float_array.clone(), > )])); > // Slice the float array and verify result is [3.0, 4.0] > let float_array_slice_ref = float_array.slice(2, 2); > let float_array_slice = float_array_slice_ref > .as_any() > .downcast_ref::>() > .unwrap(); > assert_eq!(float_array_slice, ::from(vec![3.0, 4.0])); > // Slice the struct array and verify result is [3.0, 4.0] > let struct_array_slice_ref = struct_array.slice(2, 2); > let struct_array_slice = struct_array_slice_ref > .as_any() > .downcast_ref::() > .unwrap(); > let struct_array_slice_floats = struct_array_slice > .column(0) > .as_any() > .downcast_ref::>() > .unwrap(); > assert_eq!( > struct_array_slice_floats, > ::from(vec![3.0, 4.0]) > ); > // Concat the float array, and verify the result is still [3.0, 4.0]. > let concat_float_array_ref = > > arrow::compute::kernels::concat::concat(&[float_array_slice]).unwrap(); > let concat_float_array = concat_float_array_ref > .as_any() > .downcast_ref::>() > .unwrap(); > assert_eq!(concat_float_array, ::from(vec![3.0, 4.0])); > // Concat the struct array and expect it to match the float array [3.0, > 4.0]. > let concat_struct_array_ref = > > arrow::compute::kernels::concat::concat(&[struct_array_slice]).unwrap(); > let concat_struct_array = concat_struct_array_ref > .as_any() > .downcast_ref::() > .unwrap(); > let concat_struct_array_floats = concat_struct_array > .column(0) > .as_any() > .downcast_ref::>() > .unwrap(); > // This is what is actually returned > assert_eq!( > concat_struct_array_floats, > ::from(vec![1.0, 2.0]) > ); > // This is what I'd expect, but fails: > assert_eq!( > concat_struct_array_floats, > ::from(vec![3.0, 4.0]) > ); > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11452) [Rust] Parquet reader cannot read file where a struct column has the same name as struct member columns
[ https://issues.apache.org/jira/browse/ARROW-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-11452: --- Fix Version/s: 3.0.1 > [Rust] Parquet reader cannot read file where a struct column has the same > name as struct member columns > > > Key: ARROW-11452 > URL: https://issues.apache.org/jira/browse/ARROW-11452 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 3.0.0 >Reporter: Max Burke >Assignee: Max Burke >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.0.1 > > Attachments: structs.parquet > > Time Spent: 2h 10m > Remaining Estimate: 0h > > For example, given the schema: > > count: struct<min: int64 not null, max: int64 not null, mean: int64 not null, count: uint64 not null, sum: int64 not null, variance: int64 not null> not null > child 0, min: int64 not null > child 1, max: int64 not null > child 2, mean: int64 not null > child 3, count: uint64 not null > child 4, sum: int64 not null > child 5, variance: int64 not null > ul_observation_date: struct<min: timestamp[us] not null, max: timestamp[us] not null, mean: timestamp[us] not null, count: uint64 not null, sum: timestamp[us] not null, variance: timestamp[us] not null> not null > child 0, min: timestamp[us] not null > child 1, max: timestamp[us] not null > child 2, mean: timestamp[us] not null > child 3, count: uint64 not null > child 4, sum: timestamp[us] not null > child 5, variance: timestamp[us] not null > > The array reader performs dictionary lookups for the type of columns of types > such as ul_observation_date, but when it looks up the field `count` it gets > the definition not of the ul_observation_date.count field but of the `count` > struct. > > Attached is a sample file that exhibits this behavior. > > [^structs.parquet] -- This message was sent by Atlassian Jira (v8.3.4#803005)