[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python

2022-11-29 Thread Andy Grove (Jira)
Andy Grove created ARROW-18418:
--

 Summary: [WEBSITE] do not delete /datafusion-python
 Key: ARROW-18418
 URL: https://issues.apache.org/jira/browse/ARROW-18418
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


do not delete /datafusion-python when publishing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted

2022-09-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-17878:
--

 Summary: [Website] Exclude Ballista docs from being deleted
 Key: ARROW-17878
 URL: https://issues.apache.org/jira/browse/ARROW-17878
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Andy Grove


Exclude Ballista docs from being deleted



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17325) AQE should use available column statistics from completed query stages

2022-08-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-17325:
--

 Summary: AQE should use available column statistics from completed 
query stages
 Key: ARROW-17325
 URL: https://issues.apache.org/jira/browse/ARROW-17325
 Project: Apache Arrow
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Grove


In QueryStageExec.computeStats we copy partial statistics from materlized query 
stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls 
ShuffleExchangeLike#runtimeStatistics or 
BroadcastExchangeLike#runtimeStatistics.

 

Only dataSize and numOutputRows are copied into the new Statistics object:

 {code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would like to also copy over the column statistics stored in 
Statistics.attributeMap so that they can be fed back into the logical plan 
optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
not currently provide such column statistics but other custom implementations 
can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post

2022-05-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-16595:
--

 Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post
 Key: ARROW-16595
 URL: https://issues.apache.org/jira/browse/ARROW-16595
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove


DataFusion 8.0.0 Release Blog Post



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts

2021-08-18 Thread Andy Grove (Jira)
Andy Grove created ARROW-13656:
--

 Summary: [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog 
Posts
 Key: ARROW-13656
 URL: https://issues.apache.org/jira/browse/ARROW-13656
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


https://github.com/apache/arrow-datafusion/issues/881



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12437:
--

 Summary: [Rust] [Ballista] Ballista plans must not include 
RepartitionExec
 Key: ARROW-12437
 URL: https://issues.apache.org/jira/browse/ARROW-12437
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove


Ballista plans must not include RepartitionExec because it results in incorrect 
results. Ballista needs to manage its own repartitioning in a distributed-aware 
way later on. For now we just need to configure the DataFusion context to 
disable repartition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12434:
--

 Summary: [Rust] [Ballista] Show executed plans with metrics
 Key: ARROW-12434
 URL: https://issues.apache.org/jira/browse/ARROW-12434
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


Show executed plans with metrics to help with debugging and performance tuning



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12433:
--

 Summary: [Rust] Builds failing due to new flatbuffer release 
introducing const generics
 Key: ARROW-12433
 URL: https://issues.apache.org/jira/browse/ARROW-12433
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Andy Grove


I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
should pin the dependency to 0.8.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-12432:
--

 Summary: [Rust] [DataFusion] Add metrics for SortExec
 Key: ARROW-12432
 URL: https://issues.apache.org/jira/browse/ARROW-12432
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 5.0.0


Add metrics for SortExec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master

2021-04-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-12421:
--

 Summary: [Rust] [DataFusion] topk_query test fails in master
 Key: ARROW-12421
 URL: https://issues.apache.org/jira/browse/ARROW-12421
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove


{code:java}
 Running target/debug/deps/user_defined_plan-6b63acb904117235running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... okfailures: topk_query stdout 
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-+-+", "| customer_id | revenue |", 
"+-+-+", "| paul| 300 |", "| jorge   | 200  
   |", "| andy| 150 |", "+-+-+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-+-+
| customer_id | revenue |
+-+-+
| paul| 300 |
| jorge   | 200 |
| andy| 150 |
+-+-+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct

2021-04-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-12403:
--

 Summary: [Rust] [Ballista] Integration tests should check that 
query results are correct
 Key: ARROW-12403
 URL: https://issues.apache.org/jira/browse/ARROW-12403
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andy Grove
 Fix For: 5.0.0


The integration checks only test that the benchmark queries run without error. 
They do not check that the results are correct.

I think some work already happened in DataFusion to check the TPC-H results so 
hopefully we can re-use that work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12402) [Rust] [DataFusion] Implement SQL metrics framework

2021-04-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-12402:
--

 Summary: [Rust] [DataFusion] Implement SQL metrics framework
 Key: ARROW-12402
 URL: https://issues.apache.org/jira/browse/ARROW-12402
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


As a user, I would like the ability to inspect metrics for an executed plan to 
help with debugging and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12362) [Rust] [DataFusion] topk_query test failure

2021-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-12362:
--

 Summary: [Rust] [DataFusion] topk_query test failure
 Key: ARROW-12362
 URL: https://issues.apache.org/jira/browse/ARROW-12362
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


I'm seeing this locally with latest from master.
{code:java}
 topk_query stdout 
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-+-+", "| customer_id | revenue |", 
"+-+-+", "| paul| 300 |", "| jorge   | 200  
   |", "| andy| 150 |", "+-+-+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expectedn
+-+-+
| customer_id | revenue |
+-+-+
| paul| 300 |
| jorge   | 200 |
| andy| 150 |
+-+-+Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12361) [Rust] [DataFusion] Allow users to override physical optimization rules

2021-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-12361:
--

 Summary: [Rust] [DataFusion] Allow users to override physical 
optimization rules
 Key: ARROW-12361
 URL: https://issues.apache.org/jira/browse/ARROW-12361
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


As a user of DataFusion (in Ballista) I would override the list of physical 
optimization rules. It is currently possible to add new rules but not to remove 
existing rules.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12335:
--

 Summary: [Rust] [Ballista] Bump DataFusion version
 Key: ARROW-12335
 URL: https://issues.apache.org/jira/browse/ARROW-12335
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 4.0.0


Update Ballista to use latest DataFusion version



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12334:
--

 Summary: [Rust] [Ballista] Aggregate queries producing incorrect 
results
 Key: ARROW-12334
 URL: https://issues.apache.org/jira/browse/ARROW-12334
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


I just ran benchmarks for the first time in a while and I see duplicate entries 
for group by keys.

 

For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
multiple results with l_returnflag = 'A' and l_linestatus = 'F'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc

2021-04-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-12331:
--

 Summary: [Rust] [Ballista] Make CI build work with snmalloc
 Key: ARROW-12331
 URL: https://issues.apache.org/jira/browse/ARROW-12331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 4.0.0


Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but is 
building without default features due to snmalloc requiring cmake.

An alternative approach would be to build with cc instead of cmake. See the 
above PR for conversation about this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12329) [Rust] [Ballista] Add README

2021-04-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-12329:
--

 Summary: [Rust] [Ballista] Add README
 Key: ARROW-12329
 URL: https://issues.apache.org/jira/browse/ARROW-12329
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


We did not bring a README over in the donation and need to write a new one 
anyway now this is part of Arrow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12328) [Rust] [Ballista] Fix code formatting

2021-04-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-12328:
--

 Summary: [Rust] [Ballista] Fix code formatting
 Key: ARROW-12328
 URL: https://issues.apache.org/jira/browse/ARROW-12328
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12313) [Rust] [Ballista] Benchmark docuementation out of date

2021-04-09 Thread Andy Grove (Jira)
Andy Grove created ARROW-12313:
--

 Summary: [Rust] [Ballista] Benchmark docuementation out of date
 Key: ARROW-12313
 URL: https://issues.apache.org/jira/browse/ARROW-12313
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


The scheduler/executor were refactored and the documentation for the benchmarks 
now needs updating. I plan on fixing this over the weekend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12284) [Rust] [DataFusion] Review the contract between DataFusion and Arrow

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12284:
--

 Summary: [Rust] [DataFusion] Review the contract between 
DataFusion and Arrow
 Key: ARROW-12284
 URL: https://issues.apache.org/jira/browse/ARROW-12284
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


I am creating this issue based on the discussion at the sync call earlier today.

Apparently DataFusion is not only using the high-level Arrow API but is also 
accessing Arrow internals directly and this would be one challenge in moving to 
a majorly refactored Arrow implementation.

Perhaps we need to review what the public Arrow API should be and which APIs 
DataFusion should or should not be using.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12261:
--

 Summary: [Rust] [Ballista] Ballista should not have its own 
DataFrame API
 Key: ARROW-12261
 URL: https://issues.apache.org/jira/browse/ARROW-12261
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 5.0.0


When building the Ballista POC it was necessary to implement a new DataFrame 
API that wrapped the DataFusion API.

One issue is that it wasn't possible to override the behavior of the collect 
method to make it use the Ballista context rather than the DataFusion context.

Now that the projects are in the same repo it should be easier to fix this and 
have users always use the DataFusion DataFrame API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12260) [Website] [Rust] Announce Ballista donation

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12260:
--

 Summary: [Website] [Rust] Announce Ballista donation
 Key: ARROW-12260
 URL: https://issues.apache.org/jira/browse/ARROW-12260
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Andy Grove
Assignee: Andy Grove


Once the IP clearance vote passes and the PR has been merged, we should 
announce the donation on the Arrow blog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12257:
--

 Summary: [Rust] [Ballista] Publish user guide to Arrow site
 Key: ARROW-12257
 URL: https://issues.apache.org/jira/browse/ARROW-12257
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


Ballista has a user guide in mdbook format and we need to figure out how to get 
this published to the arrow site (it was previously hosted at 
https://ballistacompute.org/docs/)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12256) [Rust] [Ballista] Add DataFrame support

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12256:
--

 Summary: [Rust] [Ballista] Add DataFrame support
 Key: ARROW-12256
 URL: https://issues.apache.org/jira/browse/ARROW-12256
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
 Fix For: 5.0.0


Ballista has so far been focused on SQL support rather than DataFrame support. 
DataFrame support is partially implemented but needs more work to complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12255:
--

 Summary: [Rust] [Ballista] Integrate scheduler with DataFusion
 Key: ARROW-12255
 URL: https://issues.apache.org/jira/browse/ARROW-12255
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


The Ballista scheduler breaks a query down into stages based on changes in 
partitioning int he plan, where each stage is broken down into tasks that can 
be executed concurrently.

Rather than trying to run all the partitions at once, Ballista executors 
process n concurrent tasks at a time and then request new tasks from the 
scheduler.

This approach would help DataFusion scale better and it would be ideal to use 
the same scheduler to scale across cores in DataFusion and across nodes in 
Ballista.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12253) [Rust] [Ballista] Implement scalable joins

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12253:
--

 Summary: [Rust] [Ballista] Implement scalable joins
 Key: ARROW-12253
 URL: https://issues.apache.org/jira/browse/ARROW-12253
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 5.0.0


The main issue limiting scalability in Ballista today is that joins are 
implemented as hash joins where each partition of the probe side causes the 
entire left side to be loaded into memory.

To make this scalable we need to hash partition left and right inputs so that 
we can join the left and right partitions in parallel.

There is already work underway in DataFusion to implement this that we can 
leverage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12252:
--

 Summary: [Rust] [Ballista] How to continue "This week in Ballista"?
 Key: ARROW-12252
 URL: https://issues.apache.org/jira/browse/ARROW-12252
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove


The Ballista project published a weekly newsletter and this has been very 
effective at building a community around the project.

We need to determine how we can continue with something like this, while 
following the Apache way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12251:
--

 Summary: [Rust] [Ballista] Add Ballista tests to CI
 Key: ARROW-12251
 URL: https://issues.apache.org/jira/browse/ARROW-12251
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - Ballista
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


Ballista is a standalone project (not part of the Arrow Rust workspace) and 
therefore the tests will not run in CI without additional work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column

2021-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-12250:
--

 Summary: [Rust] Failing test 
arrow::arrow_writer::tests::fixed_size_binary_single_column
 Key: ARROW-12250
 URL: https://issues.apache.org/jira/browse/ARROW-12250
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
 Fix For: 4.0.0


I just pulled latest from master (commit 
d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran 
`cargo clean` followed by `cargo test`.

One test fails:
{code:java}
 arrow::arrow_writer::tests::fixed_size_binary_single_column stdout 
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked 
at 'called `Result::unwrap()` on an `Err` value: General("Could not parse 
metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12064) [Rust] [DataFusion] Make DataFrame extensible

2021-03-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-12064:
--

 Summary: [Rust] [DataFusion] Make DataFrame extensible
 Key: ARROW-12064
 URL: https://issues.apache.org/jira/browse/ARROW-12064
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


The DataFrame implementation currently has two types of logic:
 # Logic for building a logical query plan
 # Logic for executing a query using the DataFusion context

We can make DataFrame more extensible by having it always delegate to the 
context for execution, allowing the same DataFrame logic to be used for local 
and distributed execution.

We will likely need to introduce a new ExecutionContext trait with different 
implementations for DataFusion and Ballista.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform

2021-03-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-11982:
--

 Summary: [Rust] Donate Ballista Distributed Compute Platform
 Key: ARROW-11982
 URL: https://issues.apache.org/jira/browse/ARROW-11982
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


See [PR|[https://github.com/apache/arrow/pull/9723]] for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11948) [Rust] 3.0.1 patch release

2021-03-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-11948:
--

 Summary: [Rust] 3.0.1 patch release
 Key: ARROW-11948
 URL: https://issues.apache.org/jira/browse/ARROW-11948
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.1


Spreadsheet where I am tracking the fixes that get merged to maint-3.0.x

 

https://docs.google.com/spreadsheets/d/111k0PGEVzxg1k7Q_d_1kV7E24VRB3DVJP1MnQImVrCc/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11934) [Rust] Document patch release process

2021-03-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-11934:
--

 Summary: [Rust] Document patch release process
 Key: ARROW-11934
 URL: https://issues.apache.org/jira/browse/ARROW-11934
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.1


Now that we moved to voting on source releases for patch releases, we need to 
document the process for doing so in the Rust implementation.

 

Google doc for discussion / collaboration: 
https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11852) [Documentation] Update CONTRIBUTING to explain how to self-assign tickets

2021-03-03 Thread Andy Grove (Jira)
Andy Grove created ARROW-11852:
--

 Summary: [Documentation] Update CONTRIBUTING to explain how to 
self-assign tickets
 Key: ARROW-11852
 URL: https://issues.apache.org/jira/browse/ARROW-11852
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


In CONTRIBUTING.md [1] we should explain that new users need to ask someone to 
assign them the "Contributing" role so that they can self-assign issues in 
JIRA. It would also be useful to add notes for commuters on how to assign this 
role to people.

 [1]https://github.com/apache/arrow/blob/master/CONTRIBUTING.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11669) [Rust] [DataFusion] Remove concurrency field from GlobalLimitExec

2021-02-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-11669:
--

 Summary: [Rust] [DataFusion] Remove concurrency field from 
GlobalLimitExec
 Key: ARROW-11669
 URL: https://issues.apache.org/jira/browse/ARROW-11669
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


GlobalLimitExec has an unused concurrency field that can now be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11625) [Rust] [DataFusion] Move SortExec partition check to constructor

2021-02-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-11625:
--

 Summary: [Rust] [DataFusion] Move SortExec partition check to 
constructor
 Key: ARROW-11625
 URL: https://issues.apache.org/jira/browse/ARROW-11625
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


SortExec has the following error check at execution time and this could be 
moved into the try_new constructor so the error check happens at planning time 
instead.

 
{code:java}
if 1 != self.input.output_partitioning().partition_count() {
return Err(DataFusionError::Internal(
"SortExec requires a single input partition".to_owned(),
));
} {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11622) [Rust] [DataFusion] AggregateExpression name inconsistency

2021-02-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-11622:
--

 Summary: [Rust] [DataFusion] AggregateExpression name inconsistency
 Key: ARROW-11622
 URL: https://issues.apache.org/jira/browse/ARROW-11622
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove


I have an aggregate query and the AggregateExpr has this name
{code:java}
 SUM(l_extendedprice Multiply Int64(1)){code}
This is hiding the fact that the expression has a CAST operation:
{code:java}
 expr: BinaryExpr { left: Column { name: "l_extendedprice" }, op: Multiply, 
right: CastExpr { expr: Literal { value: Int64(1) }, cast_type: Float64 } }, 
nullable: true }{code}
In Ballista, this causes a problem with serde because after a rountrip, the 
expression has a name that includes the CAST and this causes a schema mismatch.
{code:java}
SUM(l_extendedprice Multiply CAST(Int64(1) AS Float64)) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-11606:
--

 Summary: [Rust] [DataFusion] Need guidance on HashAggregateExec 
reconstruction
 Key: ARROW-11606
 URL: https://issues.apache.org/jira/browse/ARROW-11606
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


We have run into an issue in the Ballista project where we are reconstructing 
the Final and Partial HashAggregateExec operators [1] for distributed execution 
and we need some guidance.

The Partial HashAggregateExec gets created OK and executes correctly.

However, when we create the Final HashAggregateExec, it is not finding the 
expected schema in the input operator. The partial exec outputs field names 
ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
seem to be looking for those names.

It is also worth noting that the Final and Partial executors are not connected 
directly in this usage.

The Partial exec is executed and output streamed to disk.

The Final exec then runs against the output from the Partial exec.

We may need to make changes in DataFusion to allow other crates to support this 
kind of use case?

 [1] https://github.com/ballista-compute/ballista/pull/491

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11545) [Rust] [DataFusion] SendableRecordBatchStream should implement Sync

2021-02-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-11545:
--

 Summary: [Rust] [DataFusion] SendableRecordBatchStream should 
implement Sync
 Key: ARROW-11545
 URL: https://issues.apache.org/jira/browse/ARROW-11545
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


SendableRecordBatchStream should implement Sync. I ran into issues in Ballista 
when trying to use this type in some circumstances.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11544) [Rust] [DataFusion] Implement as_any for AggregateExpr

2021-02-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-11544:
--

 Summary: [Rust] [DataFusion] Implement as_any for AggregateExpr
 Key: ARROW-11544
 URL: https://issues.apache.org/jira/browse/ARROW-11544
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


Implement as_any for AggregateExpr so it can be downcast to a known 
implementation. Ballista needs this for serde.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11454) [Website] [Rust] 3.0.0 Blog Post

2021-02-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-11454:
--

 Summary: [Website] [Rust] 3.0.0 Blog Post
 Key: ARROW-11454
 URL: https://issues.apache.org/jira/browse/ARROW-11454
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Website
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11440) [Rust] [DataFusion] Add method to CsvExec to get CSV schema

2021-01-30 Thread Andy Grove (Jira)
Andy Grove created ARROW-11440:
--

 Summary: [Rust] [DataFusion] Add method to CsvExec to get CSV 
schema
 Key: ARROW-11440
 URL: https://issues.apache.org/jira/browse/ARROW-11440
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 4.0.0


It is not possible to inspect an already created CsvExec and determine what the 
underlying schema is



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11360) [Rust] [DataFusion] Improve CSV "No files found" error message

2021-01-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11360:
--

 Summary: [Rust] [DataFusion] Improve CSV "No files found" error 
message
 Key: ARROW-11360
 URL: https://issues.apache.org/jira/browse/ARROW-11360
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


There are two places in DataFusion where the error message "No files found" is 
returned if no CSV files can be found in the specified directory with the 
specified file extension.

It would be much easier to debug issues if this error message stated the 
directory and file extension.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11276) [Rust] [DataFusion] Make MemoryStream public

2021-01-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-11276:
--

 Summary: [Rust] [DataFusion] Make MemoryStream public
 Key: ARROW-11276
 URL: https://issues.apache.org/jira/browse/ARROW-11276
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


I found the need to take a copy of MemoryStream for use in another project.

It would be nice if we could expose this as a supported public API so that 
other projects building physical operators can re-use it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11252) [Rust] [DataFusion[ Add DataFusion to h2oai/db-benchmark

2021-01-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-11252:
--

 Summary: [Rust] [DataFusion[ Add DataFusion to h2oai/db-benchmark
 Key: ARROW-11252
 URL: https://issues.apache.org/jira/browse/ARROW-11252
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


I would like to see DataFusion added to h2oai/db-benchmark so that we can see 
how we compare to other solutions (including Pandas, Spark, cuDF, and Polars).

Since Polars (another Rust DataFrame library that uses Arrow) has already been 
added, I am hoping that we can learn from their scripts.

There is an issue filed against db-benchmark for adding DataFusion:

https://github.com/h2oai/db-benchmark/issues/107



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11200) [Rust] [DateFusion] Physical operators and expressions should have public accessor methods

2021-01-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-11200:
--

 Summary: [Rust] [DateFusion] Physical operators and expressions 
should have public accessor methods
 Key: ARROW-11200
 URL: https://issues.apache.org/jira/browse/ARROW-11200
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Physical operators and expressions should have public accessor methods that 
expose the values used to construct them.

This allows projects that use DataFusion to be able to serialize a physical 
plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11195) [Rust] [DataFusiion] Built-in table providers should expose relevant fields

2021-01-09 Thread Andy Grove (Jira)
Andy Grove created ARROW-11195:
--

 Summary: [Rust] [DataFusiion] Built-in table providers should 
expose relevant fields
 Key: ARROW-11195
 URL: https://issues.apache.org/jira/browse/ARROW-11195
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


For projects that depend on the DataFusion logical plan, there is a breaking 
change currently compared to 2.0.0 where it is not possible to introspect the 
TableScan node for CSV and Parquet files to get the file path and other 
meta-data (such as delimiter for CSV).

I propose adding public methods to the specific providers so that other crates 
can downcast to specific implementations in order to access this data.

My specific use case is the ability to serialize a DataFusion logical plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11150) [Rust] Set up bi-weekly Rust sync call and update website

2021-01-06 Thread Andy Grove (Jira)
Andy Grove created ARROW-11150:
--

 Summary: [Rust] Set up bi-weekly Rust sync call and update website
 Key: ARROW-11150
 URL: https://issues.apache.org/jira/browse/ARROW-11150
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove


Given the momentum on the Rust implementation, I am going to set up a bi-weekly 
sync call on Google Meet most likely. The call will be at the same time as the 
current sync call but on alternate weeks.

I will update the web site to list both calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2020-12-31 Thread Andy Grove (Jira)
Andy Grove created ARROW-11094:
--

 Summary: [Rust] [DataFusion] Implement Sort-Merge Join
 Key: ARROW-11094
 URL: https://issues.apache.org/jira/browse/ARROW-11094
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 4.0.0


The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions in parallel and then stream both sides of the join by 
merging these sorted partitions and we do not need to load one side into 
memory. At most, we need to load all batches from both sides that contain the 
current join key values.

https://en.wikipedia.org/wiki/Sort-merge_join



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11093) [Rust] [DataFusion] RFC Roadmap for 2021

2020-12-31 Thread Andy Grove (Jira)
Andy Grove created ARROW-11093:
--

 Summary: [Rust] [DataFusion] RFC Roadmap for 2021
 Key: ARROW-11093
 URL: https://issues.apache.org/jira/browse/ARROW-11093
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove


Given the momentum and number of contributors involved in the Rust 
implementation, I think it would be useful to crowdsource a roadmap for the 
next few releases that we expect to release in 2021.

We have a small number of active committers on the project currently and it is 
hard for us to keep up with all the PRs sometimes, especially when so many 
different areas are being contributed to.

It would be helpful if we can co-ordinate to prioritize work for the release.

Of course, this is open source, and anyone can contribute anything at any time, 
but it would be nice to have some areas that we all agree are the main 
priorities.

I will create a PR to kick start this discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11068) [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec

2020-12-29 Thread Andy Grove (Jira)
Andy Grove created ARROW-11068:
--

 Summary: [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec
 Key: ARROW-11068
 URL: https://issues.apache.org/jira/browse/ARROW-11068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
this to wrap join output as well.

Rather than hard-code a list of operators that need to be wrapped, we should 
find a more generic mechanism so that plans can declare if their input and/or 
output batches should be coalesced (similar to how we handle partitioning) and 
this would allow custom operators outside of DataFusion to benefit from this 
optimization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11059:
--

 Summary: [Rust] [DataFusion] Implement extensible configuration 
mechanism
 Key: ARROW-11059
 URL: https://issues.apache.org/jira/browse/ARROW-11059
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


We are getting to the point where there are multiple settings we could add to 
operators to fine-tune performance. Custom operators provided by crates that 
extend DataFusion may also need this capability.

I propose that we add support for key-value configuration options so that we 
don't need to plumb through each new configuration setting that we add.

For example. I am about to start on a "coalesce batches" operator and I would 
like a setting such as "coalesce.batch.size".

For built-in settings like this we can provide information such as 
documentation and default values and generate documentation from this.

For example, here is how Spark defines configs:
{code:java}
  val PARQUET_VECTORIZED_READER_ENABLED =
buildConf("spark.sql.parquet.enableVectorizedReader")
  .doc("Enables vectorized parquet decoding.")
  .version("2.0.0")
  .booleanConf
  .createWithDefault(true) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11058:
--

 Summary: [Rust] [DataFusion] Implement "coalesce batches" operator
 Key: ARROW-11058
 URL: https://issues.apache.org/jira/browse/ARROW-11058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


When we have a FilterExec in the plan, it can produce lots of small batches and 
we therefore lose efficiency of vectorized operations.

We should implement a new CoalesceBatchExec and wrap every FilterExec with one 
of these so that small batches can be recombined into larger batches to improve 
the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11056) [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11056:
--

 Summary: [Rust] [DataFusion] Allow ParquetExec to parallelize work 
based on row groups
 Key: ARROW-11056
 URL: https://issues.apache.org/jira/browse/ARROW-11056
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ParquetExec currently parallelizes work by passinging individual files to 
threads. It would be nice to be able to do this in a finer-grained way by 
assigning row groups and/or column chunks instead. This will be especially 
important in distributed systems built on DataFusion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11053) [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11053:
--

 Summary: [Rust] [DataFusion] Optimize joins with dynamic capacity 
for output batches
 Key: ARROW-11053
 URL: https://issues.apache.org/jira/browse/ARROW-11053
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Rather than using the size of the left or right batches to determine the 
capacity of the output batches we can use the average size of previous output 
batches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11052) [Rust] [DataFusion] Implement metrics in join operator

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11052:
--

 Summary: [Rust] [DataFusion] Implement metrics in join operator
 Key: ARROW-11052
 URL: https://issues.apache.org/jira/browse/ARROW-11052
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Implement metrics in join operator to make it easier to debug performance 
issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11047) [Rust] [DataFusion] ParquetTable should avoid scanning all files twice

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11047:
--

 Summary: [Rust] [DataFusion] ParquetTable should avoid scanning 
all files twice
 Key: ARROW-11047
 URL: https://issues.apache.org/jira/browse/ARROW-11047
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ParquetTable currently reads the metadata for all files once in the constructor 
in order to get the schema, and does it again each time scan() is called.

We could read the metadata once and cache it instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11031) [Rust] Update assignees in JIRA where missing

2020-12-24 Thread Andy Grove (Jira)
Andy Grove created ARROW-11031:
--

 Summary: [Rust] Update assignees in JIRA where missing
 Key: ARROW-11031
 URL: https://issues.apache.org/jira/browse/ARROW-11031
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


I have started updating tasks in Jira that were missing assignees.

I see that [~nevi_me]  [~alamb] [~jorgecarleitao] have asked about how to do 
this in the past so I will briefly explain here.

First you need admin permissions to Jira (I will grant these to you now).

Next you need to add new contributors to the "contributors" role in JIRA. To do 
this, go to the project setting for Apache Arrow. Select "users and roles" on 
the left. Click "add user to role" in top right. Then enter the user name and 
choose the "contributor" role.

After that you can assign issues to that user (and they can self-assign too)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11030) [Rust] [DataFusion] TPC-H q12 does not finish with SF=100 Parquet input

2020-12-24 Thread Andy Grove (Jira)
Andy Grove created ARROW-11030:
--

 Summary: [Rust] [DataFusion] TPC-H q12 does not finish with SF=100 
Parquet input
 Key: ARROW-11030
 URL: https://issues.apache.org/jira/browse/ARROW-11030
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


TPC-H q12 does not finish with SF=100 Parquet input.

Either I have something wrong with my local setup or there is a major 
performance regression. I will investigate more after the holidays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11029) [Rust] [DataFusion] Join order optimization does not work with filter pushdown

2020-12-24 Thread Andy Grove (Jira)
Andy Grove created ARROW-11029:
--

 Summary: [Rust] [DataFusion] Join order optimization does not work 
with filter pushdown
 Key: ARROW-11029
 URL: https://issues.apache.org/jira/browse/ARROW-11029
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


When I run TPC-H query 12, I see that the join order is not optimized to put 
the smaller table on the left. I added some debug logging and see that the 
optimization sees row count of None on the left side because there is a filter 
wrapping the table scan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11022) [Rust] [DataFusion] Upgrade to tokio 1.0

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11022:
--

 Summary: [Rust] [DataFusion] Upgrade to tokio 1.0
 Key: ARROW-11022
 URL: https://issues.apache.org/jira/browse/ARROW-11022
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


https://tokio.rs/blog/2020-12-tokio-1-0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11020:
--

 Summary: [Rust] [DataFusion] Implement better tests for ParquetExec
 Key: ARROW-11020
 URL: https://issues.apache.org/jira/browse/ARROW-11020
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Implement better tests for ParquetExec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11019) [Rust] [DataFusion] Add support for reading partitioned Parquet files

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11019:
--

 Summary: [Rust] [DataFusion] Add support for reading partitioned 
Parquet files
 Key: ARROW-11019
 URL: https://issues.apache.org/jira/browse/ARROW-11019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Add support for reading Parquet files that are partitioned by key where the 
files are under a directory structure based on partition keys and values.

/path/to/files/KEY1=value/KEY2=value/files



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11017) [Rust] [DataFusion] Add support for Parquet schema merging

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11017:
--

 Summary: [Rust] [DataFusion] Add support for Parquet schema 
merging 
 Key: ARROW-11017
 URL: https://issues.apache.org/jira/browse/ARROW-11017
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Add support for Parquet schema merging so that we can read data sets where some 
files have additional columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11016:
--

 Summary: [Rust] Parquet ArrayReader should allow reading a subset 
of row groups
 Key: ARROW-11016
 URL: https://issues.apache.org/jira/browse/ARROW-11016
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove


Parquet ArrayReader currently only supports reading an entire file from start 
to finish and does not allow selectively reading a subset of row groups. This 
prevents us from parallelizing work across threads when processing a single 
parquet file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11014) [Rust] [DataFusion] ParquetExec reports incorrect statistics

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11014:
--

 Summary: [Rust] [DataFusion] ParquetExec reports incorrect 
statistics
 Key: ARROW-11014
 URL: https://issues.apache.org/jira/browse/ARROW-11014
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


ParquetExec represents one or more Parquet files and currently only returns 
statistics based on the first file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11012) [Rust] [DataFusion] Make write_csv and write_parquet concurrent

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11012:
--

 Summary: [Rust] [DataFusion] Make write_csv and write_parquet 
concurrent
 Key: ARROW-11012
 URL: https://issues.apache.org/jira/browse/ARROW-11012
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ExecutionContext.write_csv and write_parquet currently iterate over the output 
partitions and execute one at a time and write the results out. We should run 
these as tokio tasks so they can run concurrently. This should, in theory, help 
with memory usage when the plan contains repartition operators.

We may want to add a configuration option so we can choose between serial and 
parallel writes?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11011) [Rust] [DataFusion] Implement hash partitioning

2020-12-22 Thread Andy Grove (Jira)
Andy Grove created ARROW-11011:
--

 Summary: [Rust] [DataFusion] Implement hash partitioning
 Key: ARROW-11011
 URL: https://issues.apache.org/jira/browse/ARROW-11011
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Once https://issues.apache.org/jira/browse/ARROW-10582 is implemented, we 
should add support for hash partitioning. The logical physical plans already 
support create a plan with hash partitioning but the execution needs 
implementing in repartition.rs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10995) [Rust] [DataFusion] Improve parallelism when reading Parquet files

2020-12-20 Thread Andy Grove (Jira)
Andy Grove created ARROW-10995:
--

 Summary: [Rust] [DataFusion] Improve parallelism when reading 
Parquet files
 Key: ARROW-10995
 URL: https://issues.apache.org/jira/browse/ARROW-10995
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


Currently the unit of parallelism is the number of parquet files being read.

For example, if we run a query against a Parquet table that consists of 8 
partitions then we will attempt to run 8 async tasks in parallel and if there 
is a single Parquet file then we will only try and run 1 async task so this 
does not scale well.

A better approach would be to have one parallel task per "chunk" in each 
parquet file. This would involve an upfront step in the planner to scan the 
parquet meta-data to get a list of chunks and then split these up between the 
configured number of parallel tasks.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10985) [Rust] Update unsafe guidelines for adding JIRA references

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10985:
--

 Summary: [Rust] Update unsafe guidelines for adding JIRA references
 Key: ARROW-10985
 URL: https://issues.apache.org/jira/browse/ARROW-10985
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


We should document known soundness issues in Jira issues and reference them 
from the source code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10984) [Rust] Document use of unsafe in parquet crate

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10984:
--

 Summary: [Rust] Document use of unsafe in parquet crate
 Key: ARROW-10984
 URL: https://issues.apache.org/jira/browse/ARROW-10984
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~64 uses of unsafe in the parquet crate
{code:java}
./parquet/src/util/hash_util.rs:6
./parquet/src/util/bit_packing.rs:34
./parquet/src/util/bit_util.rs:1
./parquet/src/data_type.rs:12
./parquet/src/arrow/record_reader.rs:5
./parquet/src/arrow/array_reader.rs:8 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10983) [Rust] Document use of unsafe in datatypes.rs

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10983:
--

 Summary: [Rust] Document use of unsafe in datatypes.rs
 Key: ARROW-10983
 URL: https://issues.apache.org/jira/browse/ARROW-10983
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~12 uses of unsafe in datatypes and we should document them according 
to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10982) [Rust] Document use of unsafe in ffi.rs

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10982:
--

 Summary: [Rust] Document use of unsafe in ffi.rs
 Key: ARROW-10982
 URL: https://issues.apache.org/jira/browse/ARROW-10982
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10980) [Rust] Document use of unsafe in src/arrays

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10980:
--

 Summary: [Rust] Document use of unsafe in src/arrays
 Key: ARROW-10980
 URL: https://issues.apache.org/jira/browse/ARROW-10980
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


This needs to be broken down into smaller tasks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10978) [Rust] Document use of unsafe in utils

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10978:
--

 Summary: [Rust] Document use of unsafe in utils
 Key: ARROW-10978
 URL: https://issues.apache.org/jira/browse/ARROW-10978
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~18 uses of unsafe in src/utils and we should document them according 
to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10977) [Rust] Document use of unsafe in builder.rs

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10977:
--

 Summary: [Rust] Document use of unsafe in builder.rs
 Key: ARROW-10977
 URL: https://issues.apache.org/jira/browse/ARROW-10977
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~6 uses of unsafe in builder.rs and we should document them according 
to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10976) [Rust] Document use of unsafe in buffer.rs

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10976:
--

 Summary: [Rust] Document use of unsafe in buffer.rs
 Key: ARROW-10976
 URL: https://issues.apache.org/jira/browse/ARROW-10976
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~32 uses of unsafe in buffer.rs and we should document them according 
to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10975) [Rust] Document use of unsafe in IPC

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10975:
--

 Summary: [Rust] Document use of unsafe in IPC
 Key: ARROW-10975
 URL: https://issues.apache.org/jira/browse/ARROW-10975
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~6 uses of unsafe in IPC and we should document them according to the 
guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10974) [Rust] Document use of unsafe in memory.rs

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10974:
--

 Summary: [Rust] Document use of unsafe in memory.rs
 Key: ARROW-10974
 URL: https://issues.apache.org/jira/browse/ARROW-10974
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~10 uses of unsafe in memory.rs and we should document them according 
to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10973) [Rust] Document use of unsafe in compute kernels

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10973:
--

 Summary: [Rust] Document use of unsafe in compute kernels
 Key: ARROW-10973
 URL: https://issues.apache.org/jira/browse/ARROW-10973
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


There are ~7 uses of unsafe in the compute kernels and we should document them 
according to the guidelines in the Arrow crate README

 
{code:java}
// JUSTIFICATION
//  Benefit
//  Describe the benefit of using unsafe. E.g.
//  "30% performance degradation if the safe counterpart is used, see bench 
X."
//  Soundness
//  Describe why the code remains sound (according to the definition of 
rust's unsafe code guidelines). E.g.
//  "We bounded check these values at initialization and the array is 
immutable."
let ... = unsafe { ... }; {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10972) [Rust] RFC Roadmap for 4.0.0

2020-12-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10972:
--

 Summary: [Rust] RFC Roadmap for 4.0.0
 Key: ARROW-10972
 URL: https://issues.apache.org/jira/browse/ARROW-10972
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove


Given the momentum and number of contributors involved in the Rust 
implementation, I think it would be useful to crowdsource a roadmap for the 
next release.

We have a small number of active committers on the project currently and it is 
hard for us to keep up with all the PRs sometimes, especially when so many 
different areas are being contributed to.

It would be helpful if we can co-ordinate to prioritize work for the release.

Of course, this is open source, and anyone can contribute anything at any time, 
but it would be nice to have some areas that we all agree are the main 
priorities.

I will create a PR to kick start this discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins

2020-12-18 Thread Andy Grove (Jira)
Andy Grove created ARROW-10964:
--

 Summary: [Rust] [DataFusion] Optimize nested joins
 Key: ARROW-10964
 URL: https://issues.apache.org/jira/browse/ARROW-10964
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


Once [https://github.com/apache/arrow/pull/8961] is merged, we have an 
optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to work with nested joins, and 
this is not trivial. See discussion in 
[https://github.com/apache/arrow/pull/8961] for context.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10943) [Rust] Intermittent build failure in parquet encoding

2020-12-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-10943:
--

 Summary: [Rust] Intermittent build failure in parquet encoding
 Key: ARROW-10943
 URL: https://issues.apache.org/jira/browse/ARROW-10943
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove


I saw this test failure locally
{code:java}
 encodings::encoding::tests::test_bool stdout 
thread 'encodings::encoding::tests::test_bool' panicked at 'Invalid byte when 
reading bool', parquet/src/util/bit_util.rs:73:18
 {code}
I ran "cargo test" again and it passed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays

2020-12-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-10920:
--

 Summary: [Rust] Segmentation fault in Arrow Parquet writer with 
huge arrays
 Key: ARROW-10920
 URL: https://issues.apache.org/jira/browse/ARROW-10920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove


I stumbled across this by chance. I am not too surprised that this fails but I 
would expect it to fail gracefully and not with a segmentation fault.

 
{code:java}
 use std::fs::File;
use std::sync::Arc;

use arrow::array::StringBuilder;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

use parquet::arrow::ArrowWriter;

fn main() -> Result<()> {
let schema = Schema::new(vec![
Field::new("c0", DataType::Utf8, false),
Field::new("c1", DataType::Utf8, true),
]);
let batch_size = 250;
let repeat_count = 140;
let file = File::create("/tmp/test.parquet")?;
let mut writer = ArrowWriter::try_new(file, Arc::new(schema.clone()), 
None).unwrap();
let mut c0_builder = StringBuilder::new(batch_size);
let mut c1_builder = StringBuilder::new(batch_size);

println!("Start of loop");
for i in 0..batch_size {
let c0_value = format!("{:032}", i);
let c1_value = c0_value.repeat(repeat_count);
c0_builder.append_value(_value)?;
c1_builder.append_value(_value)?;
}

println!("Finish building c0");
let c0 = Arc::new(c0_builder.finish());

println!("Finish building c1");
let c1 = Arc::new(c1_builder.finish());

println!("Creating RecordBatch");
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?;

// write the batch to parquet
println!("Writing RecordBatch");
writer.write().unwrap();

println!("Closing writer");
writer.close().unwrap();

Ok(())
}
{code}
output:
{code:java}
Start of loop
Finish building c0
Finish building c1
Creating RecordBatch
Writing RecordBatch
Segmentation fault (core dumped)
 {code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10894) [Rust] [DataFusion] Optimizer rules should work with qualified column names

2020-12-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-10894:
--

 Summary: [Rust] [DataFusion] Optimizer rules should work with 
qualified column names
 Key: ARROW-10894
 URL: https://issues.apache.org/jira/browse/ARROW-10894
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


We have a number of optimization rules that deal with column names represented 
by strings.

In order to support qualified field names in queries we need to update these 
rules to work with a data structure representing optionally qualified column 
names instead.

Suggested data structure:
{code:java}
struct ColumnName {
  qualifier: Option,
  name: String
} {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10890) [Rust] [DataFusion] JOIN support

2020-12-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-10890:
--

 Summary: [Rust] [DataFusion] JOIN support
 Key: ARROW-10890
 URL: https://issues.apache.org/jira/browse/ARROW-10890
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


This is the tracking issue for JOIN support. See sub-tasks for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10889) [Rust] Document our approach to unsafe code in README

2020-12-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-10889:
--

 Summary: [Rust] Document our approach to unsafe code in README
 Key: ARROW-10889
 URL: https://issues.apache.org/jira/browse/ARROW-10889
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Andy Grove
 Fix For: 3.0.0


It would be helpful to document the project's stance on the use of unsafe code 
in a prominent location such as in the top-level README so that we can refer 
people to this when questions are asked about this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10888) [Rust] Tracking issue for safety issues

2020-12-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-10888:
--

 Summary: [Rust] Tracking issue for safety issues
 Key: ARROW-10888
 URL: https://issues.apache.org/jira/browse/ARROW-10888
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove


We have a number of functions in the Rust code base that use unsafe code but 
they are not declared as unsafe or documented as to why we are guaranteeing 
that they are actually safe.

I have seen recent criticism of the project on social media because of this and 
it is something that we should address.

If anyone is interested in working on this, please create sub-tasks under this 
JIRA.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10884) [Rust] [DataFusion] Benchmark crate does not have a SIMD feature

2020-12-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-10884:
--

 Summary: [Rust] [DataFusion] Benchmark crate does not have a SIMD 
feature
 Key: ARROW-10884
 URL: https://issues.apache.org/jira/browse/ARROW-10884
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


The benchmarks run without SIMD by default. We need to add a feature to the 
Cargo.toml to enable SIMD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10877) [Rust] [DataFusion] Add benchmark based on kaggle movies

2020-12-10 Thread Andy Grove (Jira)
Andy Grove created ARROW-10877:
--

 Summary: [Rust] [DataFusion] Add benchmark based on kaggle movies
 Key: ARROW-10877
 URL: https://issues.apache.org/jira/browse/ARROW-10877
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10829) [Rust] [DataFusion] Implement Into for DFSchema

2020-12-06 Thread Andy Grove (Jira)
Andy Grove created ARROW-10829:
--

 Summary: [Rust] [DataFusion] Implement Into for DFSchema
 Key: ARROW-10829
 URL: https://issues.apache.org/jira/browse/ARROW-10829
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


Implement Into for DFSchema



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10815) [Rust] [DataFusion] Finish integrating SQL relation names

2020-12-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10815:
--

 Summary: [Rust] [DataFusion] Finish integrating SQL relation names
 Key: ARROW-10815
 URL: https://issues.apache.org/jira/browse/ARROW-10815
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10813) [Rust] [DataFusion] Implement DFSchema

2020-12-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10813:
--

 Summary: [Rust] [DataFusion] Implement DFSchema
 Key: ARROW-10813
 URL: https://issues.apache.org/jira/browse/ARROW-10813
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


Implement DFSchema



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10793) [Rust] [DataFusion] Decide on CAST behaviour for invalid inputs

2020-12-02 Thread Andy Grove (Jira)
Andy Grove created ARROW-10793:
--

 Summary: [Rust] [DataFusion] Decide on CAST behaviour for invalid 
inputs
 Key: ARROW-10793
 URL: https://issues.apache.org/jira/browse/ARROW-10793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


This is a placeholder for now. See discussion on 
[https://github.com/apache/arrow/pull/8794]

Briefly, the issue is do we want CAST to return null for invalid inputs or 
throw an error. Spark has different behavior depending on whether ANSI mode is 
enabled or not.

I'm not sure if this is a DataFusion specific or a more general Arrow issue 
yet. It needs a discussion.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10783) [Rust] [DataFusion] Implement row count statistics for Parquet TableProvider

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10783:
--

 Summary: [Rust] [DataFusion] Implement row count statistics for 
Parquet TableProvider
 Key: ARROW-10783
 URL: https://issues.apache.org/jira/browse/ARROW-10783
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Following on from https://issues.apache.org/jira/browse/ARROW-10781 we should 
implement statistics for Parquet data sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10782) [Rust] [DataFusion] Optimize hash join to use smaller relation as build side

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10782:
--

 Summary: [Rust] [DataFusion] Optimize hash join to use smaller 
relation as build side
 Key: ARROW-10782
 URL: https://issues.apache.org/jira/browse/ARROW-10782
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


When performing an inner join using the hash join algorithm, it is more 
efficient to load the smaller table into memory and then stream the larger 
table.

We should the statistics made available in 
https://issues.apache.org/jira/browse/ARROW-10781 to build an optimizer rule to 
determine the smaller side of a join and use that as the build/hash side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10781) [Rust] [DataFusion] TableProvider should provide row count statistics

2020-12-01 Thread Andy Grove (Jira)
Andy Grove created ARROW-10781:
--

 Summary: [Rust] [DataFusion] TableProvider should provide row 
count statistics
 Key: ARROW-10781
 URL: https://issues.apache.org/jira/browse/ARROW-10781
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


In order to start building a cost-based optimizer, we need some statistics 
about data sources. The most basic statistic would be number of rows.

I propose that we add a Statistics struct that initially just makes a total row 
count available but that we can later extend to support more advanced 
statistics.
{code:java}
struct Statistics {
  row_count: Option
} {code}
We can then add a method to TableProvider:
{code:java}
trait TableProvider {
  fn statistics() -> Option;
} {code}
Statistics should be optional because not all data sources can provide 
statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10769) [CI] Integration tests are failing in master

2020-11-29 Thread Andy Grove (Jira)
Andy Grove created ARROW-10769:
--

 Summary: [CI] Integration tests are failing in master
 Key: ARROW-10769
 URL: https://issues.apache.org/jira/browse/ARROW-10769
 Project: Apache Arrow
  Issue Type: Bug
  Components: CI
Reporter: Andy Grove
Assignee: Neville Dipale


Caused by this PR being merged: [https://github.com/apache/arrow/pull/8715]

We could revert this PR to unblock people.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10761) [Rust] [DataFusion] Add SQL support for referencing fields in structs

2020-11-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-10761:
--

 Summary: [Rust] [DataFusion] Add SQL support for referencing 
fields in structs
 Key: ARROW-10761
 URL: https://issues.apache.org/jira/browse/ARROW-10761
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   5   6   7   8   9   10   >