[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python

2022-11-29 Thread Andy Grove (Jira)
Andy Grove created ARROW-18418:
--

 Summary: [WEBSITE] do not delete /datafusion-python
 Key: ARROW-18418
 URL: https://issues.apache.org/jira/browse/ARROW-18418
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


do not delete /datafusion-python when publishing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted

2022-09-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-17878:
--

 Summary: [Website] Exclude Ballista docs from being deleted
 Key: ARROW-17878
 URL: https://issues.apache.org/jira/browse/ARROW-17878
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Andy Grove


Exclude Ballista docs from being deleted





[jira] [Closed] (ARROW-17325) AQE should use available column statistics from completed query stages

2022-08-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-17325.
--
Resolution: Invalid

> AQE should use available column statistics from completed query stages
> --
>
> Key: ARROW-17325
> URL: https://issues.apache.org/jira/browse/ARROW-17325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andy Grove
>Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeStats so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     val attributeStats = runtimeStats.attributeStats
>     Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.





[jira] [Updated] (ARROW-17325) AQE should use available column statistics from completed query stages

2022-08-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-17325:
---
Description: 
In QueryStageExec.computeStats we copy partial statistics from materialized query 
stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls 
ShuffleExchangeLike#runtimeStatistics or 
BroadcastExchangeLike#runtimeStatistics.

Only dataSize and numOutputRows are copied into the new Statistics object:

 {code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would like to also copy over the column statistics stored in 
Statistics.attributeStats so that they can be fed back into the logical plan 
optimization phase. This is a small change as shown below:

{code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    val attributeStats = runtimeStats.attributeStats
    Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
  } else {
    None
  }
{code}

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
not currently provide such column statistics, but other custom implementations 
can.

  was:
In QueryStageExec.computeStats we copy partial statistics from materialized query 
stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls 
ShuffleExchangeLike#runtimeStatistics or 
BroadcastExchangeLike#runtimeStatistics.

 

Only dataSize and numOutputRows are copied into the new Statistics object:

 {code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would like to also copy over the column statistics stored in 
Statistics.attributeStats so that they can be fed back into the logical plan 
optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
not currently provide such column statistics but other custom implementations 
can.


> AQE should use available column statistics from completed query stages
> --
>
> Key: ARROW-17325
> URL: https://issues.apache.org/jira/browse/ARROW-17325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andy Grove
>Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeStats so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     val attributeStats = runtimeStats.attributeStats
>     Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.





[jira] [Created] (ARROW-17325) AQE should use available column statistics from completed query stages

2022-08-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-17325:
--

 Summary: AQE should use available column statistics from completed 
query stages
 Key: ARROW-17325
 URL: https://issues.apache.org/jira/browse/ARROW-17325
 Project: Apache Arrow
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Grove


In QueryStageExec.computeStats we copy partial statistics from materialized query 
stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls 
ShuffleExchangeLike#runtimeStatistics or 
BroadcastExchangeLike#runtimeStatistics.

 

Only dataSize and numOutputRows are copied into the new Statistics object:

 {code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would like to also copy over the column statistics stored in 
Statistics.attributeStats so that they can be fed back into the logical plan 
optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
not currently provide such column statistics but other custom implementations 
can.
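
To make the proposal concrete, here is a minimal, language-neutral sketch in Python (not the Spark code itself). The `ColumnStat` and `Statistics` classes below are simplified stand-ins invented for illustration; only the field names `sizeInBytes`, `rowCount`, `attributeStats` and `isRuntime` come from the snippets in this issue, rendered here in snake_case. The sketch shows the intended behaviour: size and row count are clamped at zero as before, and the per-column statistics are carried over instead of being dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnStat:
    # Hypothetical, simplified column statistics (illustration only).
    distinct_count: Optional[int] = None
    null_count: Optional[int] = None


@dataclass
class Statistics:
    # Simplified stand-in for the Statistics object discussed in the issue.
    size_in_bytes: int
    row_count: Optional[int] = None
    attribute_stats: Dict[str, ColumnStat] = field(default_factory=dict)
    is_runtime: bool = False


def compute_stats(is_materialized: bool, runtime_stats: Statistics) -> Optional[Statistics]:
    """Copy runtime statistics of a materialized stage, including column stats."""
    if not is_materialized:
        return None
    data_size = max(runtime_stats.size_in_bytes, 0)
    num_output_rows = (
        None if runtime_stats.row_count is None else max(runtime_stats.row_count, 0)
    )
    # The proposed change: also carry over the per-column statistics.
    attribute_stats = dict(runtime_stats.attribute_stats)
    return Statistics(data_size, num_output_rows, attribute_stats, is_runtime=True)
```

If an exchange implementation reports no column statistics, `attribute_stats` is simply empty and the result matches the current behaviour.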





[jira] [Closed] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts

2022-07-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-13656.
--
Resolution: Won't Fix

This is an old issue

> [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
> ---
>
> Key: ARROW-13656
> URL: https://issues.apache.org/jira/browse/ARROW-13656
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andy Grove
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow-datafusion/issues/881





[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post

2022-05-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-16595:
--

 Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post
 Key: ARROW-16595
 URL: https://issues.apache.org/jira/browse/ARROW-16595
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove


DataFusion 8.0.0 Release Blog Post



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts

2021-08-18 Thread Andy Grove (Jira)
Andy Grove created ARROW-13656:
--

 Summary: [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog 
Posts
 Key: ARROW-13656
 URL: https://issues.apache.org/jira/browse/ARROW-13656
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


https://github.com/apache/arrow-datafusion/issues/881



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12435) [Rust][DataFusion] Remove unnecessary references to namespace in executor

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12435.
--
Resolution: Won't Fix

Moved to https://github.com/apache/arrow-datafusion/issues/66

> [Rust][DataFusion] Remove unnecessary references to namespace in executor
> -
>
> Key: ARROW-12435
> URL: https://issues.apache.org/jira/browse/ARROW-12435
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Ximo Guanter
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There is no need to support multiple executor clusters from a scheduler, so 
> the namespace of an executor is implicitly defined by the scheduler it 
> connects to. See 
> [https://the-asf.slack.com/archives/C01QUFS30TD/p1618679585211100] for more 
> context





[jira] [Closed] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12403.
--
Resolution: Won't Fix

Moved to https://github.com/apache/arrow-datafusion/issues/65

> [Rust] [Ballista] Integration tests should check that query results are 
> correct
> ---
>
> Key: ARROW-12403
> URL: https://issues.apache.org/jira/browse/ARROW-12403
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> The integration checks only test that the benchmark queries run without 
> error. They do not check that the results are correct.
> I think some work already happened in DataFusion to check the TPC-H results 
> so hopefully we can re-use that work.





[jira] [Closed] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12331.
--
Resolution: Invalid

> [Rust] [Ballista] Make CI build work with snmalloc
> --
>
> Key: ARROW-12331
> URL: https://issues.apache.org/jira/browse/ARROW-12331
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
>
> Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but 
> is building without default features due to snmalloc requiring cmake.
> An alternative approach would be to build with cc instead of cmake. See the 
> above PR for conversation about this.





[jira] [Closed] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12255.
--
Resolution: Won't Fix

Moved to https://github.com/apache/arrow-datafusion/issues/64

> [Rust] [Ballista] Integrate scheduler with DataFusion
> -
>
> Key: ARROW-12255
> URL: https://issues.apache.org/jira/browse/ARROW-12255
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> The Ballista scheduler breaks a query down into stages based on changes in 
> partitioning in the plan, where each stage is broken down into tasks that can 
> be executed concurrently.
> Rather than trying to run all the partitions at once, Ballista executors 
> process n concurrent tasks at a time and then request new tasks from the 
> scheduler.
> This approach would help DataFusion scale better and it would be ideal to use 
> the same scheduler to scale across cores in DataFusion and across nodes in 
> Ballista.
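
The pull-based model described above can be sketched as follows. This is an illustrative Python model, not Ballista's scheduler: the `Task` and `Scheduler` classes, the `poll`/`complete` method names, and the assumption that stages form a simple linear chain (each stage starts only after the previous one has finished) are all hypothetical simplifications.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    stage_id: int
    partition: int


class Scheduler:
    """Hands out tasks stage by stage (assumed linear chain of stages)."""

    def __init__(self, partitions_per_stage):
        # One queue of pending tasks per stage, one task per partition.
        self._stages = deque(
            deque(Task(stage_id, p) for p in range(n))
            for stage_id, n in enumerate(partitions_per_stage)
        )
        self._running = set()

    def poll(self, free_slots):
        """Return up to free_slots tasks from the current stage."""
        # Move on once the current stage has nothing pending or running.
        while self._stages and not self._stages[0] and not self._running:
            self._stages.popleft()
        if not self._stages:
            return []
        pending = self._stages[0]
        batch = [pending.popleft() for _ in range(min(free_slots, len(pending)))]
        self._running.update(batch)
        return batch

    def complete(self, task):
        self._running.discard(task)

    def done(self):
        return not self._running and all(not s for s in self._stages)


def run_executor(scheduler, slots):
    """Executor loop: take at most `slots` tasks, finish them, ask for more."""
    order = []
    while not scheduler.done():
        for task in scheduler.poll(slots):
            # A real executor would run the batch concurrently.
            order.append(task)
            scheduler.complete(task)
    return order
```

With two stages of 3 and 2 partitions and an executor with 2 slots, the executor never holds more than 2 tasks at once, and no task from the second stage is handed out until the first stage is complete.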





[jira] [Closed] (ARROW-12256) [Rust] [Ballista] Add DataFrame support

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12256.
--
Resolution: Invalid

Ballista already supports DataFrame.

> [Rust] [Ballista] Add DataFrame support
> ---
>
> Key: ARROW-12256
> URL: https://issues.apache.org/jira/browse/ARROW-12256
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> Ballista has so far been focused on SQL support rather than DataFrame 
> support. DataFrame support is partially implemented but needs more work to 
> complete.





[jira] [Closed] (ARROW-12253) [Rust] [Ballista] Implement scalable joins

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12253.
--
Resolution: Won't Fix

Moved to https://github.com/apache/arrow-datafusion/issues/63

> [Rust] [Ballista] Implement scalable joins
> --
>
> Key: ARROW-12253
> URL: https://issues.apache.org/jira/browse/ARROW-12253
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> The main issue limiting scalability in Ballista today is that joins are 
> implemented as hash joins where each partition of the probe side causes the 
> entire left side to be loaded into memory.
> To make this scalable we need to hash partition left and right inputs so that 
> we can join the left and right partitions in parallel.
> There is already work underway in DataFusion to implement this that we can 
> leverage.
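
As a rough illustration of the partitioned approach described above (a sketch in Python, not the DataFusion or Ballista implementation), both inputs are hash partitioned on the join key so that matching rows land in the same partition index, and each pair of partitions is then joined independently. The function names and the dict-based row representation are assumptions made for this example.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions buckets by hash of the join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def hash_join_partition(left, right, key):
    """Inner hash join of one left partition with one right partition."""
    table = defaultdict(list)
    for l_row in left:  # build side: only this partition is held in memory
        table[l_row[key]].append(l_row)
    return [
        {**l_row, **r_row}
        for r_row in right  # probe side
        for l_row in table.get(r_row[key], [])
    ]


def partitioned_hash_join(left, right, key, num_partitions=4):
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    out = []
    # Each (left, right) pair is independent, so these joins could run in parallel.
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(hash_join_partition(l_part, r_part, key))
    return out
```

Because each build side now holds only one partition of the left input, memory per task shrinks roughly in proportion to the number of partitions, assuming the keys are not heavily skewed.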





[jira] [Closed] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?

2021-04-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12252.
--
Resolution: Won't Fix

Replaced by https://github.com/apache/arrow-datafusion/issues/18

> [Rust] [Ballista] How to continue "This week in Ballista"?
> --
>
> Key: ARROW-12252
> URL: https://issues.apache.org/jira/browse/ARROW-12252
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> The Ballista project published a weekly newsletter and this has been very 
> effective at building a community around the project.
> We need to determine how we can continue with something like this, while 
> following the Apache way.





[jira] [Closed] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site

2021-04-21 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12257.
--
Resolution: Fixed

Replaced by https://github.com/apache/arrow-datafusion/issues/18

> [Rust] [Ballista] Publish user guide to Arrow site
> --
>
> Key: ARROW-12257
> URL: https://issues.apache.org/jira/browse/ARROW-12257
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> Ballista has a user guide in mdbook format and we need to figure out how to 
> get this published to the arrow site (it was previously hosted at 
> https://ballistacompute.org/docs/)





[jira] [Resolved] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12434.

Resolution: Fixed

PR was merged

> [Rust] [Ballista] Show executed plans with metrics
> --
>
> Key: ARROW-12434
> URL: https://issues.apache.org/jira/browse/ARROW-12434
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Show executed plans with metrics to help with debugging and performance tuning





[jira] [Closed] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12261.
--
Resolution: Fixed

Moved to https://github.com/apache/arrow-datafusion/issues/2

> [Rust] [Ballista] Ballista should not have its own DataFrame API
> 
>
> Key: ARROW-12261
> URL: https://issues.apache.org/jira/browse/ARROW-12261
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> When building the Ballista POC it was necessary to implement a new DataFrame 
> API that wrapped the DataFusion API.
> One issue is that it wasn't possible to override the behavior of the collect 
> method to make it use the Ballista context rather than the DataFusion context.
> Now that the projects are in the same repo it should be easier to fix this 
> and have users always use the DataFusion DataFrame API.





[jira] [Assigned] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-12432:
--

Assignee: Andy Grove

> [Rust] [DataFusion] Add metrics for SortExec
> 
>
> Key: ARROW-12432
> URL: https://issues.apache.org/jira/browse/ARROW-12432
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Add metrics for SortExec





[jira] [Resolved] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12432.

Resolution: Fixed

Issue resolved by pull request 10078
[https://github.com/apache/arrow/pull/10078]

> [Rust] [DataFusion] Add metrics for SortExec
> 
>
> Key: ARROW-12432
> URL: https://issues.apache.org/jira/browse/ARROW-12432
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Add metrics for SortExec





[jira] [Resolved] (ARROW-12436) [Rust][Ballista] Add watch capabilities to config backend trait

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12436.

Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10085
[https://github.com/apache/arrow/pull/10085]

> [Rust][Ballista] Add watch capabilities to config backend trait
> ---
>
> Key: ARROW-12436
> URL: https://issues.apache.org/jira/browse/ARROW-12436
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Ximo Guanter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [arrow/lib.rs at 66aa3e7c365a8d4c4eca6e23668f2988e714b493 · apache/arrow 
> (github.com)|https://github.com/apache/arrow/blob/66aa3e7c365a8d4c4eca6e23668f2988e714b493/rust/ballista/rust/scheduler/src/lib.rs#L183]





[jira] [Assigned] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-12437:
--

Assignee: Andy Grove

> [Rust] [Ballista] Ballista plans must not include RepartitionExec
> -
>
> Key: ARROW-12437
> URL: https://issues.apache.org/jira/browse/ARROW-12437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Ballista plans must not include RepartitionExec because it produces 
> incorrect results. Ballista needs to manage its own repartitioning in a 
> distributed-aware way later on. For now we just need to configure the 
> DataFusion context to disable repartition.





[jira] [Resolved] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12437.

Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10086
[https://github.com/apache/arrow/pull/10086]

> [Rust] [Ballista] Ballista plans must not include RepartitionExec
> -
>
> Key: ARROW-12437
> URL: https://issues.apache.org/jira/browse/ARROW-12437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Ballista plans must not include RepartitionExec because it produces 
> incorrect results. Ballista needs to manage its own repartitioning in a 
> distributed-aware way later on. For now we just need to configure the 
> DataFusion context to disable repartition.





[jira] [Created] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12437:
--

 Summary: [Rust] [Ballista] Ballista plans must not include 
RepartitionExec
 Key: ARROW-12437
 URL: https://issues.apache.org/jira/browse/ARROW-12437
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove


Ballista plans must not include RepartitionExec because it produces incorrect 
results. Ballista needs to manage its own repartitioning in a distributed-aware 
way later on. For now we just need to configure the DataFusion context to 
disable repartition.





[jira] [Resolved] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12334.

Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10083
[https://github.com/apache/arrow/pull/10083]

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.
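
A check along these lines could flag the symptom automatically. This is a hypothetical helper sketched in Python, not part of the Ballista benchmarks; the function name and the dict-based row representation are assumptions, and the sample `sum_qty` values in the usage lines are made up.

```python
from collections import Counter


def duplicate_group_keys(rows, group_by):
    """Return the group-by key tuples that occur in more than one result row."""
    counts = Counter(tuple(row[col] for col in group_by) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Usage with made-up values: a correct GROUP BY result has one row per key.
results = [
    {"l_returnflag": "A", "l_linestatus": "F", "sum_qty": 10},
    {"l_returnflag": "A", "l_linestatus": "F", "sum_qty": 7},
    {"l_returnflag": "N", "l_linestatus": "O", "sum_qty": 3},
]
duplicates = duplicate_group_keys(results, ["l_returnflag", "l_linestatus"])
```

An empty list means every group key appears exactly once, which is what a correct aggregate query should return.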





[jira] [Updated] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12433:
---
Component/s: Rust

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Resolved] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12433.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 10082
[https://github.com/apache/arrow/pull/10082]

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324321#comment-17324321
 ] 

Andy Grove commented on ARROW-12433:


Thanks [~alippai] that is a good suggestion

 

So the issue is that our builds with nightly Rust are failing (our SIMD feature 
requires nightly, and the nightly version of Rust we use does not have const 
generics yet). I went ahead with a PR to pin to 0.8.3 to fix our builds.

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Updated] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12433:
---
Priority: Blocker  (was: Critical)

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Created] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12434:
--

 Summary: [Rust] [Ballista] Show executed plans with metrics
 Key: ARROW-12434
 URL: https://issues.apache.org/jira/browse/ARROW-12434
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


Show executed plans with metrics to help with debugging and performance tuning





[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324308#comment-17324308
 ] 

Andy Grove commented on ARROW-12433:


[~alippai] Am I misunderstanding this issue?

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Critical
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Commented] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324307#comment-17324307
 ] 

Andy Grove commented on ARROW-12433:


CI is already using 1.51 ... "latest update on 2021-03-25, rust version 1.51.0"

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Priority: Critical
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3





[jira] [Created] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12433:
--

 Summary: [Rust] Builds failing due to new flatbuffer release 
introducing const generics
 Key: ARROW-12433
 URL: https://issues.apache.org/jira/browse/ARROW-12433
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Andy Grove


I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
should pin the dependency to 0.8.3





[jira] [Created] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12432:
--

 Summary: [Rust] [DataFusion] Add metrics for SortExec
 Key: ARROW-12432
 URL: https://issues.apache.org/jira/browse/ARROW-12432
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 5.0.0


Add metrics for SortExec





[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324288#comment-17324288
 ] 

Andy Grove commented on ARROW-12334:


I tracked this down and there are two separate bugs:

1. We are getting RepartitionExec in the plan which is not compatible with 
Ballista and explodes the number of partitions (and likely causes incorrect 
results)
2. The query actually works fine and the final sort produces 2 rows, but the 
results are created by reading all the intermediate results as well

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.





[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master

2021-04-16 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323818#comment-17323818
 ] 

Andy Grove commented on ARROW-12421:


This failure happens consistently on my 24 core Threadripper desktop running 
Ubuntu but I cannot reproduce it on my MacBook Pro or on my work PC (6 cores, 
also Ubuntu).

 

> [Rust] [DataFusion] topk_query test fails in master
> ---
>
> Key: ARROW-12421
> URL: https://issues.apache.org/jira/browse/ARROW-12421
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> {code:java}
>  Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
>  {code}





[jira] [Closed] (ARROW-12362) [Rust] [DataFusion] topk_query test failure

2021-04-16 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12362.
--
Resolution: Duplicate

> [Rust] [DataFusion] topk_query test failure
> ---
>
> Key: ARROW-12362
> URL: https://issues.apache.org/jira/browse/ARROW-12362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> I'm seeing this locally with latest from master.
> {code:java}
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
>  {code}





[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master

2021-04-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-12421:
--

 Summary: [Rust] [DataFusion] topk_query test fails in master
 Key: ARROW-12421
 URL: https://issues.apache.org/jira/browse/ARROW-12421
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove


{code:java}
 Running target/debug/deps/user_defined_plan-6b63acb904117235

running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... ok

failures:

---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 {code}





[jira] [Resolved] (ARROW-12380) [Rust][Ballista] Add scheduler ui

2021-04-16 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12380.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 10026
[https://github.com/apache/arrow/pull/10026]

> [Rust][Ballista] Add scheduler ui
> -
>
> Key: ARROW-12380
> URL: https://issues.apache.org/jira/browse/ARROW-12380
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Sathis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct

2021-04-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12403:
---
Component/s: Rust - Ballista

> [Rust] [Ballista] Integration tests should check that query results are 
> correct
> ---
>
> Key: ARROW-12403
> URL: https://issues.apache.org/jira/browse/ARROW-12403
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> The integration checks only test that the benchmark queries run without 
> error. They do not check that the results are correct.
> I think some work already happened in DataFusion to check the TPC-H results 
> so hopefully we can re-use that work.





[jira] [Created] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct

2021-04-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-12403:
--

 Summary: [Rust] [Ballista] Integration tests should check that 
query results are correct
 Key: ARROW-12403
 URL: https://issues.apache.org/jira/browse/ARROW-12403
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andy Grove
 Fix For: 5.0.0


The integration checks only test that the benchmark queries run without error. 
They do not check that the results are correct.

I think some work already happened in DataFusion to check the TPC-H results so 
hopefully we can re-use that work.
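
One cheap correctness check that would have caught the duplicate group-by keys reported in ARROW-12334 is sketched below. This is only an illustration, not the existing integration test code: the `GroupKey` type and the `duplicate_group_keys` function are hypothetical names, and the rows are assumed to have already been extracted from the query result.

```rust
use std::collections::HashSet;

/// One row of an aggregate result, reduced to its group-by key.
/// The fields mirror "group by l_returnflag, l_linestatus" (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct GroupKey {
    pub l_returnflag: String,
    pub l_linestatus: String,
}

/// Returns every key that appears more than once, in first-seen order.
/// A correct GROUP BY result yields an empty vector.
pub fn duplicate_group_keys(rows: &[GroupKey]) -> Vec<GroupKey> {
    let mut seen = HashSet::new();
    let mut reported = HashSet::new();
    let mut duplicates = Vec::new();
    for key in rows {
        // `insert` returns false when the key was already present,
        // so each duplicated key is reported exactly once.
        if !seen.insert(key.clone()) && reported.insert(key.clone()) {
            duplicates.push(key.clone());
        }
    }
    duplicates
}
```

An integration test could assert that this returns an empty vector for every aggregate benchmark query, in addition to comparing against known TPC-H answers.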





[jira] [Created] (ARROW-12402) [Rust] [DataFusion] Implement SQL metrics framework

2021-04-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-12402:
--

 Summary: [Rust] [DataFusion] Implement SQL metrics framework
 Key: ARROW-12402
 URL: https://issues.apache.org/jira/browse/ARROW-12402
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


As a user, I would like the ability to inspect metrics for an executed plan to 
help with debugging and performance tuning.
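
A rough sketch of what per-operator metrics could look like, assuming each operator owns a few atomic counters that it updates during execution and exposes afterwards. `Metric` and `FilterOperator` are hypothetical names for illustration only and are not taken from DataFusion.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

/// A single counter that an operator can update through a shared reference.
#[derive(Debug, Default)]
pub struct Metric {
    value: AtomicUsize,
}

impl Metric {
    pub fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    pub fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

/// Hypothetical operator that keeps even values and counts rows in and out.
#[derive(Debug, Default)]
pub struct FilterOperator {
    input_rows: Metric,
    output_rows: Metric,
}

impl FilterOperator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Processes one batch and records how many rows were seen and kept.
    pub fn execute(&self, batch: &[i64]) -> Vec<i64> {
        self.input_rows.add(batch.len());
        let out: Vec<i64> = batch.iter().copied().filter(|v| v % 2 == 0).collect();
        self.output_rows.add(out.len());
        out
    }

    /// Snapshot of the metrics, suitable for printing next to the plan.
    pub fn metrics(&self) -> HashMap<String, usize> {
        let mut m = HashMap::new();
        m.insert("input_rows".to_string(), self.input_rows.value());
        m.insert("output_rows".to_string(), self.output_rows.value());
        m
    }
}
```

Atomics are used here so that counters can be updated from `&self`, which matters if several partitions run the same operator concurrently.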





[jira] [Updated] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc

2021-04-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12331:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [Ballista] Make CI build work with snmalloc
> --
>
> Key: ARROW-12331
> URL: https://issues.apache.org/jira/browse/ARROW-12331
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
>
> Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but 
> is building without default features due to snmalloc requiring cmake.
> An alternative approach would be to build with cc instead of cmake. See the 
> above PR for conversation about this.





[jira] [Updated] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12334:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.





[jira] [Updated] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version

2021-04-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12335:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [Ballista] Bump DataFusion version
> -
>
> Key: ARROW-12335
> URL: https://issues.apache.org/jira/browse/ARROW-12335
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Update Ballista to use latest DataFusion version





[jira] [Assigned] (ARROW-12332) [Rust] [Ballista] Api server for scheduler

2021-04-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-12332:
--

Assignee: (was: Sathis Kumar)

> [Rust] [Ballista] Api server for scheduler
> --
>
> Key: ARROW-12332
> URL: https://issues.apache.org/jira/browse/ARROW-12332
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Sathis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-12332) [Rust] [Ballista] Api server for scheduler

2021-04-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-12332:
--

Assignee: Sathis Kumar

> [Rust] [Ballista] Api server for scheduler
> --
>
> Key: ARROW-12332
> URL: https://issues.apache.org/jira/browse/ARROW-12332
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Sathis
>Assignee: Sathis Kumar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-12362) [Rust] [DataFusion] topk_query test failure

2021-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-12362:
--

 Summary: [Rust] [DataFusion] topk_query test failure
 Key: ARROW-12362
 URL: https://issues.apache.org/jira/browse/ARROW-12362
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


I'm seeing this locally with latest from master.
{code:java}
---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
 {code}





[jira] [Created] (ARROW-12361) [Rust] [DataFusion] Allow users to override physical optimization rules

2021-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-12361:
--

 Summary: [Rust] [DataFusion] Allow users to override physical 
optimization rules
 Key: ARROW-12361
 URL: https://issues.apache.org/jira/browse/ARROW-12361
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


As a user of DataFusion (in Ballista) I would like to override the list of 
physical optimization rules. It is currently possible to add new rules but not 
to remove existing rules.
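
To illustrate the difference between adding rules and overriding them, here is a minimal sketch. All names (`OptimizerRule`, `PlannerConfig`, `add_rule`, `with_rules`, and the two rule structs) are hypothetical and only stand in for whatever the real configuration API ends up looking like.

```rust
/// Hypothetical stand-in for a physical optimization rule.
pub trait OptimizerRule {
    fn name(&self) -> &str;
}

pub struct CoalesceBatches;
impl OptimizerRule for CoalesceBatches {
    fn name(&self) -> &str {
        "coalesce_batches"
    }
}

pub struct Repartition;
impl OptimizerRule for Repartition {
    fn name(&self) -> &str {
        "repartition"
    }
}

/// Hypothetical planner configuration holding the ordered rule list.
pub struct PlannerConfig {
    rules: Vec<Box<dyn OptimizerRule>>,
}

impl PlannerConfig {
    /// Starts with the built-in default rules.
    pub fn new() -> Self {
        Self {
            rules: vec![Box::new(CoalesceBatches), Box::new(Repartition)],
        }
    }

    /// Appends a rule; the defaults stay in place.
    pub fn add_rule(mut self, rule: Box<dyn OptimizerRule>) -> Self {
        self.rules.push(rule);
        self
    }

    /// Replaces the whole list, which also allows removing defaults.
    pub fn with_rules(mut self, rules: Vec<Box<dyn OptimizerRule>>) -> Self {
        self.rules = rules;
        self
    }

    pub fn rule_names(&self) -> Vec<&str> {
        self.rules.iter().map(|r| r.name()).collect()
    }
}
```

With something like `with_rules`, a caller could pass only the rules it wants and leave out any default that does not suit a distributed engine.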





[jira] [Resolved] (ARROW-12332) [Rust] [Ballista] Api server for scheduler

2021-04-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12332.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9987
[https://github.com/apache/arrow/pull/9987]

> [Rust] [Ballista] Api server for scheduler
> --
>
> Key: ARROW-12332
> URL: https://issues.apache.org/jira/browse/ARROW-12332
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Sathis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-11 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318973#comment-17318973
 ] 

Andy Grove commented on ARROW-12334:


I'm now very confused about this issue. I have been working on debugging it and 
now it suddenly is working, so I don't know if it is an intermittent bug or 
not. When it works correctly, the query returns 4 rows and takes ~13 seconds 
for me. When it does not work it returns many times more rows and takes 3x as 
long.

It would be good to get a second pair of eyes on this.

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.





[jira] [Updated] (ARROW-12332) [Rust] [Ballista] Api server for scheduler

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12332:
---
Summary: [Rust] [Ballista] Api server for scheduler  (was: Api server for 
scheduler)

> [Rust] [Ballista] Api server for scheduler
> --
>
> Key: ARROW-12332
> URL: https://issues.apache.org/jira/browse/ARROW-12332
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Sathis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-11 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318951#comment-17318951
 ] 

Andy Grove commented on ARROW-12334:


I tracked down the PR that introduced the regression in the original repo and 
it was [https://github.com/ballista-compute/ballista/pull/574]

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.





[jira] [Resolved] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12313.

Resolution: Fixed

Issue resolved by pull request 9990
[https://github.com/apache/arrow/pull/9990]

> [Rust] [Ballista] Benchmark documentation out of date
> -
>
> Key: ARROW-12313
> URL: https://issues.apache.org/jira/browse/ARROW-12313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The scheduler/executor were refactored and the documentation for the 
> benchmarks now needs updating. I plan on fixing this over the weekend.





[jira] [Created] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12335:
--

 Summary: [Rust] [Ballista] Bump DataFusion version
 Key: ARROW-12335
 URL: https://issues.apache.org/jira/browse/ARROW-12335
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 4.0.0


Update Ballista to use latest DataFusion version





[jira] [Created] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12334:
--

 Summary: [Rust] [Ballista] Aggregate queries producing incorrect 
results
 Key: ARROW-12334
 URL: https://issues.apache.org/jira/browse/ARROW-12334
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


I just ran benchmarks for the first time in a while and I see duplicate entries 
for group by keys.

 

For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
multiple results with l_returnflag = 'A' and l_linestatus = 'F'.





[jira] [Commented] (ARROW-12260) [Website] [Rust] Announce Ballista donation

2021-04-11 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318904#comment-17318904
 ] 

Andy Grove commented on ARROW-12260:


https://github.com/apache/arrow-site/pull/100

> [Website] [Rust] Announce Ballista donation
> ---
>
> Key: ARROW-12260
> URL: https://issues.apache.org/jira/browse/ARROW-12260
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Once the IP clearance vote passes and the PR has been merged, we should 
> announce the donation on the Arrow blog.





[jira] [Updated] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10920:
---
Fix Version/s: (was: 4.0.0)

> [Rust] Segmentation fault in Arrow Parquet writer with huge arrays
> --
>
> Key: ARROW-10920
> URL: https://issues.apache.org/jira/browse/ARROW-10920
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>
> I stumbled across this by chance. I am not too surprised that this fails but 
> I would expect it to fail gracefully and not with a segmentation fault.
>  
> {code:java}
> use std::fs::File;
> use std::sync::Arc;
>
> use arrow::array::StringBuilder;
> use arrow::datatypes::{DataType, Field, Schema};
> use arrow::error::Result;
> use arrow::record_batch::RecordBatch;
> use parquet::arrow::ArrowWriter;
>
> fn main() -> Result<()> {
>     let schema = Schema::new(vec![
>         Field::new("c0", DataType::Utf8, false),
>         Field::new("c1", DataType::Utf8, true),
>     ]);
>     let batch_size = 250;
>     let repeat_count = 140;
>     let file = File::create("/tmp/test.parquet")?;
>     let mut writer =
>         ArrowWriter::try_new(file, Arc::new(schema.clone()), None).unwrap();
>     let mut c0_builder = StringBuilder::new(batch_size);
>     let mut c1_builder = StringBuilder::new(batch_size);
>     println!("Start of loop");
>     for i in 0..batch_size {
>         // c0 is a 32-character zero-padded number; c1 repeats it repeat_count times
>         let c0_value = format!("{:032}", i);
>         let c1_value = c0_value.repeat(repeat_count);
>         c0_builder.append_value(&c0_value)?;
>         c1_builder.append_value(&c1_value)?;
>     }
>     println!("Finish building c0");
>     let c0 = Arc::new(c0_builder.finish());
>     println!("Finish building c1");
>     let c1 = Arc::new(c1_builder.finish());
>     println!("Creating RecordBatch");
>     let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?;
>     // write the batch to parquet
>     println!("Writing RecordBatch");
>     writer.write(&batch).unwrap();
>     println!("Closing writer");
>     writer.close().unwrap();
>     Ok(())
> }
> {code}
> output:
> {code:java}
> Start of loop
> Finish building c0
> Finish building c1
> Creating RecordBatch
> Writing RecordBatch
> Segmentation fault (core dumped)
>  {code}
>  





[jira] [Updated] (ARROW-11625) [Rust] [DataFusion] Move SortExec partition check to constructor

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11625:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Move SortExec partition check to constructor
> 
>
> Key: ARROW-11625
> URL: https://issues.apache.org/jira/browse/ARROW-11625
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> SortExec has the following error check at execution time and this could be 
> moved into the try_new constructor so the error check happens at planning 
> time instead.
>  
> {code:java}
> if 1 != self.input.output_partitioning().partition_count() {
>     return Err(DataFusionError::Internal(
>         "SortExec requires a single input partition".to_owned(),
>     ));
> }
> {code}
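
A minimal sketch of the proposed change, assuming the constructor can see the input's partition count. The types here (`PlanError`, `Input`, `SortExecSketch`) are simplified, hypothetical stand-ins rather than the real DataFusion types; only the error message is taken from the snippet above.

```rust
#[derive(Debug, PartialEq)]
pub enum PlanError {
    Internal(String),
}

/// Hypothetical stand-in for an input plan; only the partition count matters here.
pub struct Input {
    pub partition_count: usize,
}

pub struct SortExecSketch {
    input: Input,
}

impl SortExecSketch {
    /// Validates the single-partition requirement at planning time,
    /// so an invalid plan is rejected before anything executes.
    pub fn try_new(input: Input) -> Result<Self, PlanError> {
        if input.partition_count != 1 {
            return Err(PlanError::Internal(
                "SortExec requires a single input partition".to_owned(),
            ));
        }
        Ok(Self { input })
    }

    /// Execution no longer needs to re-check the invariant.
    pub fn input_partitions(&self) -> usize {
        self.input.partition_count
    }
}
```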





[jira] [Updated] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11016:
---
Fix Version/s: (was: 4.0.0)

> [Rust] Parquet ArrayReader should allow reading a subset of row groups
> --
>
> Key: ARROW-11016
> URL: https://issues.apache.org/jira/browse/ARROW-11016
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>
> Parquet ArrayReader currently only supports reading an entire file from start 
> to finish and does not allow selectively reading a subset of row groups. This 
> prevents us from parallelizing work across threads when processing a single 
> parquet file.
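
To make the intended parallelism concrete, here is a small sketch of how row groups might be divided across threads once a reader can be restricted to a subset. It does not use the parquet crate: `assign_row_groups` and `read_in_parallel` are hypothetical helpers, and the `read` callback stands in for "read only these row groups and return the number of rows".

```rust
use std::thread;

/// Splits row group indices `0..num_row_groups` into at most `num_threads`
/// contiguous subsets of near-equal size.
pub fn assign_row_groups(num_row_groups: usize, num_threads: usize) -> Vec<Vec<usize>> {
    if num_row_groups == 0 || num_threads == 0 {
        return Vec::new();
    }
    // Ceiling division so that no more than `num_threads` subsets are produced.
    let chunk = (num_row_groups + num_threads - 1) / num_threads;
    (0..num_row_groups)
        .collect::<Vec<_>>()
        .chunks(chunk)
        .map(|c| c.to_vec())
        .collect()
}

/// Runs `read` once per subset on its own thread and sums the row counts.
pub fn read_in_parallel(assignments: Vec<Vec<usize>>, read: fn(&[usize]) -> usize) -> usize {
    let handles: Vec<_> = assignments
        .into_iter()
        .map(|groups| thread::spawn(move || read(&groups)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("reader thread panicked"))
        .sum()
}
```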





[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11094:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Implement Sort-Merge Join
> -
>
> Key: ARROW-11094
> URL: https://issues.apache.org/jira/browse/ARROW-11094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> The current hash join works well when one side of the join can be loaded into 
> memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
> left and right partitions, and write the intermediate results to disk, and 
> then stream both sides of the join by merging these sorted partitions and we 
> do not need to load one side into memory. At most, we need to load all 
> batches from both sides that contain the current join key values.
> In order to reduce memory pressure we will want to limit the concurrency of 
> these sort operations.
> We would still want to default to hash join when we know that the build-side 
> can fit into memory since it is more efficient than using a sort-merge join.
> [https://en.wikipedia.org/wiki/Sort-merge_join]
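
The merge phase described above can be sketched as follows. This is an in-memory illustration only, assuming both inputs are already sorted by an `i64` key; it does not cover spilling sorted partitions to disk or limiting sort concurrency. The function name `sort_merge_join` is made up for this example.

```rust
/// Inner-joins two inputs that are already sorted by key.
/// Only the rows sharing the current key are considered together,
/// which mirrors "load all batches that contain the current join key values".
pub fn sort_merge_join<L: Clone, R: Clone>(
    left: &[(i64, L)],
    right: &[(i64, R)],
) -> Vec<(i64, L, R)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        let (lk, rk) = (left[i].0, right[j].0);
        if lk < rk {
            i += 1; // left key is too small, advance left
        } else if lk > rk {
            j += 1; // right key is too small, advance right
        } else {
            // Find the run of equal keys on each side.
            let i_end = i + left[i..].iter().take_while(|e| e.0 == lk).count();
            let j_end = j + right[j..].iter().take_while(|e| e.0 == lk).count();
            // Emit the cross product of the two runs.
            for l in &left[i..i_end] {
                for r in &right[j..j_end] {
                    out.push((lk, l.1.clone(), r.1.clone()));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}
```

Neither side is ever held in memory as a whole hash table; each cursor only moves forward, which is what makes streaming from sorted files possible.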





[jira] [Updated] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11020:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Implement better tests for ParquetExec
> --
>
> Key: ARROW-11020
> URL: https://issues.apache.org/jira/browse/ARROW-11020
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Implement better tests for ParquetExec





[jira] [Updated] (ARROW-10884) [Rust] [DataFusion] Benchmark crate does not have a SIMD feature

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10884:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Benchmark crate does not have a SIMD feature
> 
>
> Key: ARROW-10884
> URL: https://issues.apache.org/jira/browse/ARROW-10884
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>
> The benchmarks run without SIMD by default. We need to add a feature to the 
> Cargo.toml to enable SIMD.





[jira] [Updated] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12313:
---
Summary: [Rust] [Ballista] Benchmark documentation out of date  (was: 
[Rust] [Ballista] Benchmark docuementation out of date)

> [Rust] [Ballista] Benchmark documentation out of date
> -
>
> Key: ARROW-12313
> URL: https://issues.apache.org/jira/browse/ARROW-12313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> The scheduler/executor were refactored and the documentation for the 
> benchmarks now needs updating. I plan on fixing this over the weekend.





[jira] [Updated] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11059:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Implement extensible configuration mechanism
> 
>
> Key: ARROW-11059
> URL: https://issues.apache.org/jira/browse/ARROW-11059
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> We are getting to the point where there are multiple settings we could add to 
> operators to fine-tune performance. Custom operators provided by crates that 
> extend DataFusion may also need this capability.
> I propose that we add support for key-value configuration options so that we 
> don't need to plumb through each new configuration setting that we add.
> For example, I am about to start on a "coalesce batches" operator and I would 
> like a setting such as "coalesce.batch.size".
> For built-in settings like this we can provide information such as 
> documentation and default values and generate documentation from this.
> For example, here is how Spark defines configs:
> {code:java}
>   val PARQUET_VECTORIZED_READER_ENABLED =
>     buildConf("spark.sql.parquet.enableVectorizedReader")
>       .doc("Enables vectorized parquet decoding.")
>       .version("2.0.0")
>       .booleanConf
>       .createWithDefault(true)
> {code}
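
The key-value proposal above could be sketched in Rust roughly as follows. This is a minimal illustration under stated assumptions, not DataFusion's actual API: the `ConfigEntry` and `Config` types, their method names, and the default of 4096 used in the usage note are all hypothetical. The sketch shows the three ideas in the description: arbitrary keys can be set (so extension crates need no plumbing), built-in settings carry documentation and a default, and documentation can be generated from the registry.

```rust
use std::collections::HashMap;

/// Describes one built-in setting: its key, documentation, and default value.
/// (Hypothetical type for illustration; not a DataFusion API.)
#[derive(Debug, Clone)]
pub struct ConfigEntry {
    pub key: &'static str,
    pub doc: &'static str,
    pub default: &'static str,
}

/// Key-value configuration backed by a registry of documented settings.
#[derive(Debug, Default)]
pub struct Config {
    entries: Vec<ConfigEntry>,
    values: HashMap<String, String>,
}

impl Config {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a built-in setting so it has documentation and a default.
    pub fn register(&mut self, entry: ConfigEntry) {
        self.entries.push(entry);
    }

    /// Set any key, including keys defined by crates that extend the engine.
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    /// Look up a value, falling back to the registered default if unset.
    pub fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned().or_else(|| {
            self.entries
                .iter()
                .find(|e| e.key == key)
                .map(|e| e.default.to_string())
        })
    }

    /// Convenience accessor that parses the value as a `usize`.
    pub fn get_usize(&self, key: &str) -> Option<usize> {
        self.get(key)?.parse().ok()
    }

    /// Generate one documentation line per registered setting.
    pub fn docs(&self) -> String {
        self.entries
            .iter()
            .map(|e| format!("{} (default: {}): {}", e.key, e.default, e.doc))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

With this shape, an operator would register `coalesce.batch.size` once with its documentation and default, and callers would read it through `get_usize` without any new function parameters being threaded through the planner.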





[jira] [Resolved] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12251.

Resolution: Fixed

Issue resolved by pull request 9979
[https://github.com/apache/arrow/pull/9979]

> [Rust] [Ballista] Add Ballista tests to CI
> --
>
> Key: ARROW-12251
> URL: https://issues.apache.org/jira/browse/ARROW-12251
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Ballista is a standalone project (not part of the Arrow Rust workspace) and 
> therefore the tests will not run in CI without additional work.





[jira] [Created] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12331:
--

 Summary: [Rust] [Ballista] Make CI build work with snmalloc
 Key: ARROW-12331
 URL: https://issues.apache.org/jira/browse/ARROW-12331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 4.0.0


Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but is 
building without default features due to snmalloc requiring cmake.

An alternative approach would be to build with cc instead of cmake. See the 
above PR for conversation about this.





[jira] [Resolved] (ARROW-12329) [Rust] [Ballista] Add README

2021-04-10 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12329.

Resolution: Fixed

Issue resolved by pull request 9981
[https://github.com/apache/arrow/pull/9981]

> [Rust] [Ballista] Add README
> 
>
> Key: ARROW-12329
> URL: https://issues.apache.org/jira/browse/ARROW-12329
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We did not bring a README over in the donation and need to write a new one 
> anyway now that this is part of Arrow.





[jira] [Resolved] (ARROW-12328) [Rust] [Ballista] Fix code formatting

2021-04-10 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12328.

Resolution: Fixed

Issue resolved by pull request 9980
[https://github.com/apache/arrow/pull/9980]

> [Rust] [Ballista] Fix code formatting
> -
>
> Key: ARROW-12328
> URL: https://issues.apache.org/jira/browse/ARROW-12328
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-12329) [Rust] [Ballista] Add README

2021-04-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-12329:
--

 Summary: [Rust] [Ballista] Add README
 Key: ARROW-12329
 URL: https://issues.apache.org/jira/browse/ARROW-12329
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


We did not bring a README over in the donation and need to write a new one 
anyway now that this is part of Arrow.





[jira] [Created] (ARROW-12328) [Rust] [Ballista] Fix code formatting

2021-04-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-12328:
--

 Summary: [Rust] [Ballista] Fix code formatting
 Key: ARROW-12328
 URL: https://issues.apache.org/jira/browse/ARROW-12328
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0








[jira] [Commented] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI

2021-04-10 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318509#comment-17318509
 ] 

Andy Grove commented on ARROW-12251:


[~boazbe] The goal is to add "cargo build" and "cargo test" for Ballista to the 
existing GitHub Actions for the Rust project. I have an initial PR up (linked 
to this JIRA) but I immediately ran into an issue with snmalloc.

> [Rust] [Ballista] Add Ballista tests to CI
> --
>
> Key: ARROW-12251
> URL: https://issues.apache.org/jira/browse/ARROW-12251
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Ballista is a standalone project (not part of the Arrow Rust workspace) and 
> therefore the tests will not run in CI without additional work.





[jira] [Created] (ARROW-12313) [Rust] [Ballista] Benchmark docuementation out of date

2021-04-09 Thread Andy Grove (Jira)
Andy Grove created ARROW-12313:
--

 Summary: [Rust] [Ballista] Benchmark docuementation out of date
 Key: ARROW-12313
 URL: https://issues.apache.org/jira/browse/ARROW-12313
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


The scheduler/executor were refactored and the documentation for the benchmarks 
now needs updating. I plan on fixing this over the weekend.





[jira] [Resolved] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform

2021-04-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-11982.

Resolution: Fixed

Issue resolved by pull request 9723
[https://github.com/apache/arrow/pull/9723]

> [Rust] Donate Ballista Distributed Compute Platform
> ---
>
> Key: ARROW-11982
> URL: https://issues.apache.org/jira/browse/ARROW-11982
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See PR for details.
> https://github.com/apache/arrow/pull/9723





[jira] [Created] (ARROW-12284) [Rust] [DataFusion] Review the contract between DataFusion and Arrow

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12284:
--

 Summary: [Rust] [DataFusion] Review the contract between 
DataFusion and Arrow
 Key: ARROW-12284
 URL: https://issues.apache.org/jira/browse/ARROW-12284
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


I am creating this issue based on the discussion at the sync call earlier today.

Apparently DataFusion is not only using the high-level Arrow API but is also 
accessing Arrow internals directly and this would be one challenge in moving to 
a majorly refactored Arrow implementation.

Perhaps we need to review what the public Arrow API should be and which APIs 
DataFusion should or should not be using.





[jira] [Created] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12261:
--

 Summary: [Rust] [Ballista] Ballista should not have its own 
DataFrame API
 Key: ARROW-12261
 URL: https://issues.apache.org/jira/browse/ARROW-12261
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 5.0.0


When building the Ballista POC it was necessary to implement a new DataFrame 
API that wrapped the DataFusion API.

One issue is that it wasn't possible to override the behavior of the collect 
method to make it use the Ballista context rather than the DataFusion context.

Now that the projects are in the same repo it should be easier to fix this and 
have users always use the DataFusion DataFrame API.





[jira] [Created] (ARROW-12260) [Website] [Rust] Announce Ballista donation

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12260:
--

 Summary: [Website] [Rust] Announce Ballista donation
 Key: ARROW-12260
 URL: https://issues.apache.org/jira/browse/ARROW-12260
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


Once the IP clearance vote passes and the PR has been merged, we should 
announce the donation on the Arrow blog.





[jira] [Updated] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI

2021-04-07 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12251:
---
Issue Type: Improvement  (was: Bug)

> [Rust] [Ballista] Add Ballista tests to CI
> --
>
> Key: ARROW-12251
> URL: https://issues.apache.org/jira/browse/ARROW-12251
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> Ballista is a standalone project (not part of the Arrow Rust workspace) and 
> therefore the tests will not run in CI without additional work.





[jira] [Created] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12257:
--

 Summary: [Rust] [Ballista] Publish user guide to Arrow site
 Key: ARROW-12257
 URL: https://issues.apache.org/jira/browse/ARROW-12257
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


Ballista has a user guide in mdbook format and we need to figure out how to get 
this published to the arrow site (it was previously hosted at 
https://ballistacompute.org/docs/)





[jira] [Created] (ARROW-12256) [Rust] [Ballista] Add DataFrame support

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12256:
--

 Summary: [Rust] [Ballista] Add DataFrame support
 Key: ARROW-12256
 URL: https://issues.apache.org/jira/browse/ARROW-12256
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 5.0.0


Ballista has so far been focused on SQL support rather than DataFrame support. 
DataFrame support is partially implemented but needs more work to complete.





[jira] [Created] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12255:
--

 Summary: [Rust] [Ballista] Integrate scheduler with DataFusion
 Key: ARROW-12255
 URL: https://issues.apache.org/jira/browse/ARROW-12255
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


The Ballista scheduler breaks a query down into stages based on changes in 
partitioning in the plan, where each stage is broken down into tasks that can 
be executed concurrently.

Rather than trying to run all the partitions at once, Ballista executors 
process n concurrent tasks at a time and then request new tasks from the 
scheduler.

This approach would help DataFusion scale better and it would be ideal to use 
the same scheduler to scale across cores in DataFusion and across nodes in 
Ballista.
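
The pull-based model described above (a fixed number of task slots, each requesting the next task when it finishes) can be sketched with standard-library threads. This is an illustrative sketch only, not Ballista's scheduler: the function name `run_with_slots` is hypothetical, and a real executor would request tasks from a remote scheduler rather than from an in-process queue.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Run `tasks` using at most `n` concurrent workers. Each worker pulls the
/// next task from the shared queue as soon as it finishes its current one,
/// so no more than `n` tasks are ever in flight.
/// (Hypothetical helper for illustration; not a Ballista API.)
pub fn run_with_slots<T, R, F>(tasks: Vec<T>, n: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    // Tag each task with its index so results can be returned in input order.
    let queue: VecDeque<(usize, T)> = tasks.into_iter().enumerate().collect();
    let queue = Arc::new(Mutex::new(queue));
    let work = Arc::new(work);

    let handles: Vec<_> = (0..n.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let work = Arc::clone(&work);
            thread::spawn(move || {
                let mut done = Vec::new();
                loop {
                    // Hold the lock only while taking a task, not while running it.
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        Some((idx, task)) => done.push((idx, work(task))),
                        None => break,
                    }
                }
                done
            })
        })
        .collect();

    let mut results: Vec<(usize, R)> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().map(|(_, r)| r).collect()
}
```

The same loop applies whether a "slot" is a core on one machine or an executor process on another node, which is the reuse the issue is asking for.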





[jira] [Created] (ARROW-12253) [Rust] [Ballista] Implement scalable joins

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12253:
--

 Summary: [Rust] [Ballista] Implement scalable joins
 Key: ARROW-12253
 URL: https://issues.apache.org/jira/browse/ARROW-12253
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


The main issue limiting scalability in Ballista today is that joins are 
implemented as hash joins where each partition of the probe side causes the 
entire left side to be loaded into memory.

To make this scalable we need to hash partition left and right inputs so that 
we can join the left and right partitions in parallel.

There is already work underway in DataFusion to implement this that we can 
leverage.
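
As a rough sketch of the partitioned approach described above: both inputs are hash partitioned on the join key with the same hash function, so matching keys always land in the same partition index and each pair of partitions can be joined independently. The types and function names below are hypothetical and operate on in-memory rows rather than Arrow record batches; the sketch joins the pairs sequentially, where a real engine would hand each pair to a separate task.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// A row is (join key, payload). Hypothetical type for illustration.
pub type Row = (u64, String);

/// Assign each row to one of `n` partitions by hashing its join key.
pub fn hash_partition(rows: Vec<Row>, n: usize) -> Vec<Vec<Row>> {
    assert!(n > 0, "need at least one partition");
    let mut parts: Vec<Vec<Row>> = (0..n).map(|_| Vec::new()).collect();
    for row in rows {
        let mut hasher = DefaultHasher::new();
        row.0.hash(&mut hasher);
        let idx = (hasher.finish() % n as u64) as usize;
        parts[idx].push(row);
    }
    parts
}

/// Inner hash join of one left partition with the matching right partition.
/// Only this one left partition is loaded into the hash table.
fn join_partition(left: &[Row], right: &[Row]) -> Vec<(u64, String, String)> {
    let mut build: HashMap<u64, Vec<&String>> = HashMap::new();
    for (k, v) in left {
        build.entry(*k).or_default().push(v);
    }
    let mut out = Vec::new();
    for (k, rv) in right {
        if let Some(lvs) = build.get(k) {
            for lv in lvs {
                out.push((*k, (*lv).clone(), rv.clone()));
            }
        }
    }
    out
}

/// Partition both sides the same way, then join partition i with partition i.
pub fn partitioned_join(
    left: Vec<Row>,
    right: Vec<Row>,
    n: usize,
) -> Vec<(u64, String, String)> {
    let left_parts = hash_partition(left, n);
    let right_parts = hash_partition(right, n);
    left_parts
        .iter()
        .zip(right_parts.iter())
        .flat_map(|(l, r)| join_partition(l, r))
        .collect()
}
```

The memory benefit comes from `join_partition` building its hash table from a single left partition instead of the entire left input.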





[jira] [Created] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12252:
--

 Summary: [Rust] [Ballista] How to continue "This week in Ballista"?
 Key: ARROW-12252
 URL: https://issues.apache.org/jira/browse/ARROW-12252
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove


The Ballista project published a weekly newsletter and this has been very 
effective at building a community around the project.

We need to determine how we can continue with something like this, while 
following the Apache way.





[jira] [Updated] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column

2021-04-07 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-12250:
---
Description: 
I just pulled latest from master (commit 
d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran 
`cargo clean` followed by `cargo test`.

One test fails (sometimes). It can fail in multiple ways:
{code:java}
 arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
at 'called `Result::unwrap()` on an `Err` value: General("Could not parse 
metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
 {code}
{code:java}
 arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
at 'Unable to get batch: ParquetError("Parquet error: underlying Thrift error: 
end of file")', parquet/src/arrow/arrow_writer.rs:927:14
 {code}

  was:
I just pulled latest from master (commit 
d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran 
`cargo clean` followed by `cargo test`.

One test fails:
{code:java}
 arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
at 'called `Result::unwrap()` on an `Err` value: General("Could not parse 
metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
 {code}


> [Rust] Failing test 
> arrow::arrow_writer::tests::fixed_size_binary_single_column
> ---
>
> Key: ARROW-12250
> URL: https://issues.apache.org/jira/browse/ARROW-12250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> I just pulled latest from master (commit 
> d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran 
> `cargo clean` followed by `cargo test`.
> One test fails (sometimes). It can fail in multiple ways:
> {code:java}
>  arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
> thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
> at 'called `Result::unwrap()` on an `Err` value: General("Could not parse 
> metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
>  {code}
> {code:java}
>  arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
> thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
> at 'Unable to get batch: ParquetError("Parquet error: underlying Thrift 
> error: end of file")', parquet/src/arrow/arrow_writer.rs:927:14
>  {code}





[jira] [Created] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12251:
--

 Summary: [Rust] [Ballista] Add Ballista tests to CI
 Key: ARROW-12251
 URL: https://issues.apache.org/jira/browse/ARROW-12251
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


Ballista is a standalone project (not part of the Arrow Rust workspace) and 
therefore the tests will not run in CI without additional work.





[jira] [Created] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12250:
--

 Summary: [Rust] Failing test 
arrow::arrow_writer::tests::fixed_size_binary_single_column
 Key: ARROW-12250
 URL: https://issues.apache.org/jira/browse/ARROW-12250
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
 Fix For: 4.0.0


I just pulled latest from master (commit 
d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran 
`cargo clean` followed by `cargo test`.

One test fails:
{code:java}
 arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
at 'called `Result::unwrap()` on an `Err` value: General("Could not parse 
metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
 {code}





[jira] [Created] (ARROW-12064) [Rust] [DataFusion] Make DataFrame extensible

2021-03-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-12064:
--

 Summary: [Rust] [DataFusion] Make DataFrame extensible
 Key: ARROW-12064
 URL: https://issues.apache.org/jira/browse/ARROW-12064
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


The DataFrame implementation currently has two types of logic:
 # Logic for building a logical query plan
 # Logic for executing a query using the DataFusion context

We can make DataFrame more extensible by having it always delegate to the 
context for execution, allowing the same DataFrame logic to be used for local 
and distributed execution.

We will likely need to introduce a new ExecutionContext trait with different 
implementations for DataFusion and Ballista.
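
One possible shape for this delegation is sketched below. Everything here is a simplified assumption for illustration: the `ExecutionContext` trait signature, the `LocalContext` and `DistributedContext` types, and the string-based `LogicalPlan` are hypothetical stand-ins, not the DataFusion or Ballista types. The point is only that `DataFrame` builds the plan and hands execution to whichever context it was created with.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a logical plan; a real plan is a tree of operators.
#[derive(Debug, Clone, Default)]
pub struct LogicalPlan {
    pub steps: Vec<String>,
}

/// Hypothetical trait: anything that can execute a logical plan.
pub trait ExecutionContext {
    fn execute(&self, plan: &LogicalPlan) -> String;
}

/// Stand-in for in-process execution.
pub struct LocalContext;

/// Stand-in for distributed execution via a scheduler.
pub struct DistributedContext {
    pub scheduler_url: String,
}

impl ExecutionContext for LocalContext {
    fn execute(&self, plan: &LogicalPlan) -> String {
        format!("local: {}", plan.steps.join(" -> "))
    }
}

impl ExecutionContext for DistributedContext {
    fn execute(&self, plan: &LogicalPlan) -> String {
        format!("remote[{}]: {}", self.scheduler_url, plan.steps.join(" -> "))
    }
}

/// The DataFrame only builds the plan; execution is always delegated.
pub struct DataFrame {
    ctx: Arc<dyn ExecutionContext>,
    plan: LogicalPlan,
}

impl DataFrame {
    pub fn new(ctx: Arc<dyn ExecutionContext>) -> Self {
        Self { ctx, plan: LogicalPlan::default() }
    }

    pub fn filter(mut self, expr: &str) -> Self {
        self.plan.steps.push(format!("Filter({})", expr));
        self
    }

    pub fn select(mut self, cols: &str) -> Self {
        self.plan.steps.push(format!("Projection({})", cols));
        self
    }

    /// No execution logic lives here; the context decides how to run the plan.
    pub fn collect(&self) -> String {
        self.ctx.execute(&self.plan)
    }
}
```

Because `collect` contains no execution logic of its own, the same plan-building code serves both the local and the distributed case.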

 

 





[jira] [Updated] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform

2021-03-16 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11982:
---
Description: 
See PR for details.

https://github.com/apache/arrow/pull/9723

  was:See [PR|[https://github.com/apache/arrow/pull/9723]] for details.


> [Rust] Donate Ballista Distributed Compute Platform
> ---
>
> Key: ARROW-11982
> URL: https://issues.apache.org/jira/browse/ARROW-11982
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> See PR for details.
> https://github.com/apache/arrow/pull/9723





[jira] [Created] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform

2021-03-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-11982:
--

 Summary: [Rust] Donate Ballista Distributed Compute Platform
 Key: ARROW-11982
 URL: https://issues.apache.org/jira/browse/ARROW-11982
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


See [PR|[https://github.com/apache/arrow/pull/9723]] for details.





[jira] [Commented] (ARROW-11150) [Rust] Set up bi-weekly Rust sync call and update website

2021-03-15 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301681#comment-17301681
 ] 

Andy Grove commented on ARROW-11150:


We should also list the ASF slack channel: https://s.apache.org/slack-invite

> [Rust] Set up bi-weekly Rust sync call and update website
> -
>
> Key: ARROW-11150
> URL: https://issues.apache.org/jira/browse/ARROW-11150
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Given the momentum on the Rust implementation, I am going to set up a 
> bi-weekly sync call on Google Meet most likely. The call will be at the same 
> time as the current sync call but on alternate weeks.
> I will update the web site to list both calls.





[jira] [Created] (ARROW-11948) [Rust] 3.0.1 patch release

2021-03-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-11948:
--

 Summary: [Rust] 3.0.1 patch release
 Key: ARROW-11948
 URL: https://issues.apache.org/jira/browse/ARROW-11948
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.1


Spreadsheet where I am tracking the fixes that get merged to maint-3.0.x

 

https://docs.google.com/spreadsheets/d/111k0PGEVzxg1k7Q_d_1kV7E24VRB3DVJP1MnQImVrCc/edit?usp=sharing





[jira] [Commented] (ARROW-11934) [Rust] Document patch release process

2021-03-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300476#comment-17300476
 ] 

Andy Grove commented on ARROW-11934:


[~npr] Could I ask you to take a look at this Google doc when you get a chance? 
In particular, could you participate in the conversation I am having with 
[~emkornfield] about whether we can make language-specific patch releases?

> [Rust] Document patch release process
> -
>
> Key: ARROW-11934
> URL: https://issues.apache.org/jira/browse/ARROW-11934
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.1
>
>
> Now that we moved to voting on source releases for patch releases, we need to 
> document the process for doing so in the Rust implementation.
>  
> Google doc for discussion / collaboration: 
> https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing





[jira] [Updated] (ARROW-11239) [Rust] array::transform::tests::test_struct failed

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11239:
---
Fix Version/s: 3.0.1

> [Rust] array::transform::tests::test_struct failed
> --
>
> Key: ARROW-11239
> URL: https://issues.apache.org/jira/browse/ARROW-11239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Qingyou Meng
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Test *array::transform::tests::test_struct* in 
> *arrow/src/array/transform/mod.rs* fails when the first two elements are 
> swapped. Change from
> {code:java}
> let strings: ArrayRef = Arc::new(StringArray::from(vec![
>     Some("joe"),
>     None,{code}
> to
> {code:java}
> let strings: ArrayRef = Arc::new(StringArray::from(vec![
>     None,
>     Some("joe"),{code}
> The failure was first found when I reported 
> https://issues.apache.org/jira/browse/ARROW-11160
>  
>  





[jira] [Updated] (ARROW-11269) [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11269:
---
Fix Version/s: 3.0.1

> [Rust] Unable to read Parquet file because of mismatch in column-derived and 
> embedded schemas
> -
>
> Key: ARROW-11269
> URL: https://issues.apache.org/jira/browse/ARROW-11269
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Max Burke
>Assignee: Neville Dipale
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
> Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet 
> reader where the embedded arrow schema is used instead of deriving the schema 
> from the Parquet columns.
>  
> However it seems like some cases still derive the schema type from the column 
> types, leading to the Arrow record batch reader erroring out that the column 
> types must match the schema types.
>  
> In our case, the column type is an int96 datetime (ns) type, and the Arrow 
> type in the embedded schema is DataType::Timestamp(TimeUnit::Nanoseconds, 
> Some("UTC")). However, the code that constructs the Arrays seems to re-derive 
> this column type as DataType::Timestamp(TimeUnit::Nanoseconds, None) (because 
> the Parquet schema has no timezone information). And so, Parquet files that 
> we were able to read successfully with our branch of Arrow circa October are 
> now unreadable.
>  
> I've attached an example of a Parquet file that demonstrates the problem. 
> This file was created in Python (as most of our Parquet files are).
>  
> I've also attached a sample Rust program that will demonstrate the error.





[jira] [Updated] (ARROW-11311) [Rust] unset_bit is toggling bits, not unsetting them

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11311:
---
Fix Version/s: 3.0.1

> [Rust] unset_bit is toggling bits, not unsetting them
> -
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.
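
To illustrate the difference between toggling and unsetting, here is a small standalone sketch. It is not the arrow crate's code and does not claim to show how the actual bug was written; the function names mirror the issue only for readability, and the XOR version is simply one common way a "clear" operation ends up toggling instead.

```rust
/// Single-bit masks for each bit position within a byte.
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

/// Toggles bit `i` with XOR: a set bit is cleared, but a clear bit is set.
pub fn toggle_bit(data: &mut [u8], i: usize) {
    data[i >> 3] ^= BIT_MASK[i & 7];
}

/// Clears bit `i` with AND-NOT: the bit is 0 afterwards regardless of its
/// previous value, and all other bits are left unchanged.
pub fn unset_bit(data: &mut [u8], i: usize) {
    data[i >> 3] &= !BIT_MASK[i & 7];
}

/// Returns whether bit `i` is set.
pub fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}
```

The two operations agree when the bit is already set and differ only when it is already clear, which is why this kind of bug can go unnoticed in tests that only unset previously set bits.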



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11322) [Rust] Arrow `memory` made private is a breaking API change

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11322:
---
Fix Version/s: 3.0.1

> [Rust] Arrow `memory` made private is a breaking API change
> ---
>
> Key: ARROW-11322
> URL: https://issues.apache.org/jira/browse/ARROW-11322
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Max Burke
>Assignee: Jorge Leitão
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We depend on functionality in the Arrow memory module for buffer building and 
> this was recently made private. 
>  
> Please make this module public again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11313) [Rust] Size hint of iterators is incorrect

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11313:
---
Fix Version/s: 3.0.1

> [Rust] Size hint of iterators is incorrect
> --
>
> Key: ARROW-11313
> URL: https://issues.apache.org/jira/browse/ARROW-11313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11323) [Rust][DataFusion] ComputeError("concat requires input of at least one array")) with queries with ORDER BY or GROUP BY that return no

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11323:
---
Fix Version/s: 3.0.1

> [Rust][DataFusion] ComputeError("concat requires input of at least one 
> array")) with queries with ORDER BY or GROUP BY that return no 
> --
>
> Key: ARROW-11323
> URL: https://issues.apache.org/jira/browse/ARROW-11323
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> If you run a SQL query in DataFusion which has predicates that produce no 
> rows and that also includes a GROUP BY or ORDER BY clause, you get the 
> following error:
> Error of "ArrowError(ComputeError("concat requires input of at least one 
> array"))"
> Here are two test cases that show the problem: 
> https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/context.rs#L889
> {code}
> #[tokio::test]
> async fn sort_empty() -> Result<()> {
>     // The predicate on this query purposely generates no results
>     let results =
>         execute("SELECT c1, c2 FROM test WHERE c1 > 10 ORDER BY c1 DESC, c2 ASC", 4).await?;
>     assert_eq!(results.len(), 0);
>     Ok(())
> }
>
> #[tokio::test]
> async fn aggregate_empty() -> Result<()> {
>     // The predicate on this query purposely generates no results
>     let results = execute("SELECT SUM(c1), SUM(c2) FROM test where c1 > 10", 4).await?;
>     assert_eq!(results.len(), 0);
>     Ok(())
> }
> {code}
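The failure mode described above can be illustrated without Arrow. The sketch below is a std-only model, not DataFusion's code: `concat_batches` and `sort_partitions` are hypothetical names, and a `Vec<i64>` stands in for an Arrow array. It shows a concat that rejects empty input, plus a caller that checks for the empty case first. That guard is only one plausible way to avoid the error; the fix in the linked pull request may differ.

```rust
/// Hypothetical stand-in for a concat kernel: like the kernel in the report,
/// it refuses to run on an empty list of inputs.
fn concat_batches(batches: &[Vec<i64>]) -> Result<Vec<i64>, String> {
    if batches.is_empty() {
        return Err("concat requires input of at least one array".to_string());
    }
    Ok(batches.iter().flatten().copied().collect())
}

/// Hypothetical sort step: concatenates all input batches, then sorts descending.
/// When the predicate filtered out every row there are no batches, so it
/// returns an empty result instead of calling concat.
fn sort_partitions(batches: &[Vec<i64>]) -> Result<Vec<i64>, String> {
    if batches.is_empty() {
        return Ok(Vec::new());
    }
    let mut all = concat_batches(batches)?;
    all.sort_by(|a, b| b.cmp(a));
    Ok(all)
}

fn main() {
    // No batches: calling concat directly reproduces the error.
    assert!(concat_batches(&[]).is_err());
    // The guarded caller returns an empty result instead.
    assert_eq!(sort_partitions(&[]), Ok(Vec::new()));
    // Non-empty input is concatenated and sorted as usual.
    assert_eq!(sort_partitions(&[vec![1, 3], vec![2]]), Ok(vec![3, 2, 1]));
}
```

Handling the empty case in the caller keeps the kernel's contract strict; the alternative of letting concat return an empty array would need a data type to be passed in, since there is no input to infer it from.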





[jira] [Updated] (ARROW-11394) [Rust] Slice + Concat incorrect for structs

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11394:
---
Fix Version/s: 3.0.1

> [Rust] Slice + Concat incorrect for structs
> ---
>
> Key: ARROW-11394
> URL: https://issues.apache.org/jira/browse/ARROW-11394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Ben Chambers
>Assignee: Ben Chambers
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> If you slice an array and then use it with {{concat}} you get different 
> behaviors when using a primitive array (Float64 in the examples) or a struct 
> array.
> In the case of a float, the result is what I'd expect -- it concatenates the 
> elements from the slice.
> In the case of a struct, it is a bit surprising -- the result has the length 
> of the slice, but starts at the beginning of the original array.
> {code:java}
> // #[test]
> fn test_repro() {
>     // Create float and struct array.
>     let float_array: ArrayRef = Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0, 4.0]));
>     let struct_array = Arc::new(StructArray::from(vec![(
>         Field::new("field", DataType::Float64, true),
>         float_array.clone(),
>     )]));
>
>     // Slice the float array and verify result is [3.0, 4.0]
>     let float_array_slice_ref = float_array.slice(2, 2);
>     let float_array_slice = float_array_slice_ref
>         .as_any()
>         .downcast_ref::<Float64Array>()
>         .unwrap();
>     assert_eq!(float_array_slice, &Float64Array::from(vec![3.0, 4.0]));
>
>     // Slice the struct array and verify result is [3.0, 4.0]
>     let struct_array_slice_ref = struct_array.slice(2, 2);
>     let struct_array_slice = struct_array_slice_ref
>         .as_any()
>         .downcast_ref::<StructArray>()
>         .unwrap();
>     let struct_array_slice_floats = struct_array_slice
>         .column(0)
>         .as_any()
>         .downcast_ref::<Float64Array>()
>         .unwrap();
>     assert_eq!(
>         struct_array_slice_floats,
>         &Float64Array::from(vec![3.0, 4.0])
>     );
>
>     // Concat the float array, and verify the result is still [3.0, 4.0].
>     let concat_float_array_ref =
>         arrow::compute::kernels::concat::concat(&[float_array_slice]).unwrap();
>     let concat_float_array = concat_float_array_ref
>         .as_any()
>         .downcast_ref::<Float64Array>()
>         .unwrap();
>     assert_eq!(concat_float_array, &Float64Array::from(vec![3.0, 4.0]));
>
>     // Concat the struct array and expect it to match the float array [3.0, 4.0].
>     let concat_struct_array_ref =
>         arrow::compute::kernels::concat::concat(&[struct_array_slice]).unwrap();
>     let concat_struct_array = concat_struct_array_ref
>         .as_any()
>         .downcast_ref::<StructArray>()
>         .unwrap();
>     let concat_struct_array_floats = concat_struct_array
>         .column(0)
>         .as_any()
>         .downcast_ref::<Float64Array>()
>         .unwrap();
>
>     // This is what is actually returned
>     assert_eq!(
>         concat_struct_array_floats,
>         &Float64Array::from(vec![1.0, 2.0])
>     );
>
>     // This is what I'd expect, but fails:
>     assert_eq!(
>         concat_struct_array_floats,
>         &Float64Array::from(vec![3.0, 4.0])
>     );
> }{code}
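The symptom reported above (correct length, but values taken from the start of the original array) is what one would see if a concat copied child values without applying the parent's slice offset. The std-only sketch below models that behavior. It is an assumption about the mechanism, not the Arrow implementation: `StructColumn`, `concat_ignoring_offset`, and `concat_with_offset` are hypothetical names, and a `Vec<f64>` stands in for the child array.

```rust
/// Simplified model of a struct array with one child column. Slicing only
/// changes `offset` and `len` on the parent; `child` still holds all of the
/// original values. (This model clones the child for simplicity.)
#[derive(Clone, Debug)]
struct StructColumn {
    child: Vec<f64>,
    offset: usize,
    len: usize,
}

impl StructColumn {
    fn new(child: Vec<f64>) -> Self {
        let len = child.len();
        StructColumn { child, offset: 0, len }
    }

    /// Returns a view of `len` rows starting `offset` rows into this one.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        StructColumn {
            child: self.child.clone(),
            offset: self.offset + offset,
            len,
        }
    }
}

/// Buggy variant: copies `len` child values but ignores the parent's offset,
/// so it always starts at the beginning of the original child.
fn concat_ignoring_offset(parts: &[StructColumn]) -> Vec<f64> {
    parts
        .iter()
        .flat_map(|p| p.child[..p.len].iter().copied())
        .collect()
}

/// Offset-aware variant: copies the window [offset, offset + len) of the child.
fn concat_with_offset(parts: &[StructColumn]) -> Vec<f64> {
    parts
        .iter()
        .flat_map(|p| p.child[p.offset..p.offset + p.len].iter().copied())
        .collect()
}

fn main() {
    let array = StructColumn::new(vec![1.0, 2.0, 3.0, 4.0]);
    let sliced = array.slice(2, 2);
    // Right length, wrong values: the behavior the report observes.
    assert_eq!(concat_ignoring_offset(&[sliced.clone()]), vec![1.0, 2.0]);
    // The behavior the report expects.
    assert_eq!(concat_with_offset(&[sliced]), vec![3.0, 4.0]);
}
```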





[jira] [Updated] (ARROW-11452) [Rust] Parquet reader cannot read file where a struct column has the same name as struct member columns

2021-03-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11452:
---
Fix Version/s: 3.0.1

> [Rust] Parquet reader cannot read file where a struct column has the same 
> name as struct member columns 
> 
>
> Key: ARROW-11452
> URL: https://issues.apache.org/jira/browse/ARROW-11452
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Max Burke
>Assignee: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
> Attachments: structs.parquet
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> For example, given the schema:
>  
> count: struct<min: int64 not null, max: int64 not null, mean: int64 not null, count: uint64 not null, sum: int64 not null, variance: int64 not null> not null
>   child 0, min: int64 not null
>   child 1, max: int64 not null
>   child 2, mean: int64 not null
>   child 3, count: uint64 not null
>   child 4, sum: int64 not null
>   child 5, variance: int64 not null
> ul_observation_date: struct<min: timestamp[us] not null, max: timestamp[us] not null, mean: timestamp[us] not null, count: uint64 not null, sum: timestamp[us] not null, variance: timestamp[us] not null> not null
>   child 0, min: timestamp[us] not null
>   child 1, max: timestamp[us] not null
>   child 2, mean: timestamp[us] not null
>   child 3, count: uint64 not null
>   child 4, sum: timestamp[us] not null
>   child 5, variance: timestamp[us] not null
>  
> The array reader performs dictionary lookups to find the types of columns 
> such as ul_observation_date, but when it looks up the field `count` it gets 
> the definition of the top-level `count` struct instead of the 
> ul_observation_date.count field. 
>  
> Attached is a sample file that exhibits this behavior.
>  
> [^structs.parquet]
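The collision described above can be reproduced with a plain map. The sketch below is a std-only illustration, not the Parquet reader's actual data structure: the flattened `schema`, `index_by_leaf_name`, and `index_by_full_path` are hypothetical, and type names are plain strings. It assumes the lookup is keyed by the bare field name, which would explain why `count` resolves to the top-level struct, and contrasts that with keying by the full dotted path.

```rust
use std::collections::HashMap;

/// Hypothetical flattened schema: (full dotted path, type name).
fn schema() -> Vec<(&'static str, &'static str)> {
    vec![
        ("count", "struct"),
        ("count.count", "uint64"),
        ("ul_observation_date", "struct"),
        ("ul_observation_date.min", "timestamp[us]"),
        ("ul_observation_date.count", "uint64"),
    ]
}

/// Index keyed by the last path segment only. Fields that share a name
/// collide, and the first one seen wins.
fn index_by_leaf_name(
    fields: &[(&'static str, &'static str)],
) -> HashMap<&'static str, &'static str> {
    let mut index = HashMap::new();
    for &(path, ty) in fields {
        let leaf = path.rsplit('.').next().unwrap_or(path);
        index.entry(leaf).or_insert(ty);
    }
    index
}

/// Index keyed by the full dotted path, so nested fields stay distinct.
fn index_by_full_path(
    fields: &[(&'static str, &'static str)],
) -> HashMap<&'static str, &'static str> {
    fields.iter().copied().collect()
}

fn main() {
    let fields = schema();

    // Looking up the member `count` by bare name finds the top-level struct.
    let by_leaf = index_by_leaf_name(&fields);
    assert_eq!(by_leaf["count"], "struct");

    // Keyed by full path, the struct and its members no longer collide.
    let by_path = index_by_full_path(&fields);
    assert_eq!(by_path["count"], "struct");
    assert_eq!(by_path["ul_observation_date.count"], "uint64");
}
```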




