[jira] [Created] (ARROW-11060) [Rust] Logical equality for list arrays
Neville Dipale created ARROW-11060: -- Summary: [Rust] Logical equality for list arrays Key: ARROW-11060 URL: https://issues.apache.org/jira/browse/ARROW-11060 Project: Apache Arrow Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Neville Dipale Apply logical equality to lists. This requires computing the merged nulls of a list and its child based on the list offsets. For example, a list with 3 slots and offsets (0, 1, 3, 5) covers 5 child values, so the list validity needs to be expanded to a 5-entry null mask. If the list's validity is [true, false, true] and the child validity is [t, f, t, f, t], we would get: [t, f, f, t, t] AND [t, f, t, f, t] = [t, f, f, f, t] -- This message was sent by Atlassian Jira (v8.3.4#803005)
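The expansion described above can be sketched in plain Rust. This is only an illustration of the offset-based merge, not the arrow crate's implementation: `merged_child_validity` is a hypothetical helper, and `Vec<bool>` stands in for Arrow's packed validity bitmaps.

```rust
/// Expand a list's validity to the child length using the list offsets,
/// then AND it with the child validity to obtain the merged null mask.
fn merged_child_validity(
    offsets: &[usize],
    list_validity: &[bool],
    child_validity: &[bool],
) -> Vec<bool> {
    let mut expanded = Vec::with_capacity(child_validity.len());
    for (slot, window) in offsets.windows(2).enumerate() {
        // Every child value inside a slot's offset range inherits that slot's validity.
        for _ in window[0]..window[1] {
            expanded.push(list_validity[slot]);
        }
    }
    expanded
        .iter()
        .zip(child_validity.iter())
        .map(|(l, c)| *l && *c)
        .collect()
}
```

With the example from the issue, the 3-slot validity expands to [t, f, f, t, t] and the AND with the child validity yields [t, f, f, f, t].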
[jira] [Created] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism
Andy Grove created ARROW-11059: -- Summary: [Rust] [DataFusion] Implement extensible configuration mechanism Key: ARROW-11059 URL: https://issues.apache.org/jira/browse/ARROW-11059 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 We are getting to the point where there are multiple settings we could add to operators to fine-tune performance. Custom operators provided by crates that extend DataFusion may also need this capability. I propose that we add support for key-value configuration options so that we don't need to plumb each new configuration setting through separately. For example, I am about to start on a "coalesce batches" operator and I would like a setting such as "coalesce.batch.size". For built-in settings like this we can attach information such as documentation and default values, and generate user documentation from it. For example, here is how Spark defines configs:
{code:java}
val PARQUET_VECTORIZED_READER_ENABLED =
  buildConf("spark.sql.parquet.enableVectorizedReader")
    .doc("Enables vectorized parquet decoding.")
    .version("2.0.0")
    .booleanConf
    .createWithDefault(true)
{code}
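A minimal sketch of what such a key-value mechanism could look like in Rust. `ConfigDefinition` and `ConfigOptions` are illustrative names assumed for this sketch, not an existing DataFusion API:

```rust
use std::collections::HashMap;

/// A built-in setting: key, documentation, and default value.
/// The `doc` field is what documentation could be generated from.
struct ConfigDefinition {
    key: &'static str,
    doc: &'static str,
    default: &'static str,
}

/// Key-value configuration store that falls back to registered defaults.
struct ConfigOptions {
    definitions: Vec<ConfigDefinition>,
    overrides: HashMap<String, String>,
}

impl ConfigOptions {
    fn new(definitions: Vec<ConfigDefinition>) -> Self {
        Self { definitions, overrides: HashMap::new() }
    }

    /// Set a value for any key, including keys defined by extension crates.
    fn set(&mut self, key: &str, value: &str) {
        self.overrides.insert(key.to_string(), value.to_string());
    }

    /// Look up a value, falling back to the registered default if present.
    fn get(&self, key: &str) -> Option<String> {
        self.overrides.get(key).cloned().or_else(|| {
            self.definitions
                .iter()
                .find(|d| d.key == key)
                .map(|d| d.default.to_string())
        })
    }
}
```

Because lookups go through a plain string map, new settings (built-in or from extension crates) need no new plumbing, which is the point of the proposal.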
[jira] [Created] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator
Andy Grove created ARROW-11058: -- Summary: [Rust] [DataFusion] Implement "coalesce batches" operator Key: ARROW-11058 URL: https://issues.apache.org/jira/browse/ARROW-11058 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 When we have a FilterExec in the plan, it can produce lots of small batches, and we therefore lose the efficiency of vectorized operations. We should implement a new CoalesceBatchExec and wrap every FilterExec with one so that small batches can be recombined into larger batches to improve the efficiency of upstream operators.
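The recombining step can be sketched with plain vectors standing in for record batches. `coalesce_batches` and `target_size` are illustrative names for this sketch, not the proposed operator's actual interface:

```rust
/// Recombine small input batches into batches of at least `target_size` rows.
/// `Vec<i32>` stands in for an Arrow record batch here.
fn coalesce_batches(inputs: Vec<Vec<i32>>, target_size: usize) -> Vec<Vec<i32>> {
    let mut output = Vec::new();
    let mut buffer: Vec<i32> = Vec::new();
    for batch in inputs {
        buffer.extend(batch);
        if buffer.len() >= target_size {
            // Emit the buffered rows as one larger batch.
            output.push(std::mem::take(&mut buffer));
        }
    }
    if !buffer.is_empty() {
        output.push(buffer); // flush the remainder at end of input
    }
    output
}
```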
[jira] [Created] (ARROW-11057) [Python] Data inconsistency with read and write
David Quijano created ARROW-11057: - Summary: [Python] Data inconsistency with read and write Key: ARROW-11057 URL: https://issues.apache.org/jira/browse/ARROW-11057 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Reporter: David Quijano I have been reading and writing some tables to Parquet and I found some inconsistencies.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# create a table with some data
a = pa.Table.from_pydict({'x': [1] * 100, 'y': [2] * 100, 'z': [3] * 100})

# write it to file
pq.write_table(a, 'test.parquet')

# read the same file
b = pq.read_table('test.parquet')

# a == b is True, that's good

# write table b to file
pq.write_table(b, 'test2.parquet')

# test.parquet differs from test2.parquet
{code}
Basically it is: * Create a table in memory * Write it to a file * Read it back * Write it to a different file The files are not the same; the second one contains extra information. The differences are consistent across different compressions (I tried snappy and zstd). Also, reading the second file and writing it again produces the same file. Is this a bug or expected behavior?
[jira] [Created] (ARROW-11056) [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups
Andy Grove created ARROW-11056: -- Summary: [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups Key: ARROW-11056 URL: https://issues.apache.org/jira/browse/ARROW-11056 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove ParquetExec currently parallelizes work by passing individual files to threads. It would be nice to do this in a finer-grained way by assigning row groups and/or column chunks instead. This will be especially important in distributed systems built on DataFusion.
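Finer-grained partitioning amounts to distributing (file, row group) pairs instead of whole files. A sketch under assumed shapes (`row_group_partitions` is a hypothetical helper; `row_group_counts` comes from the Parquet metadata in the real operator):

```rust
/// Assign (file, row group) pairs round-robin across `partitions` scan
/// partitions, instead of assigning whole files.
fn row_group_partitions(
    files: &[(&str, usize)], // (file name, number of row groups in that file)
    partitions: usize,
) -> Vec<Vec<(String, usize)>> {
    let mut out: Vec<Vec<(String, usize)>> = vec![Vec::new(); partitions];
    let mut next = 0;
    for (file, row_groups) in files {
        for rg in 0..*row_groups {
            out[next % partitions].push((file.to_string(), rg));
            next += 1;
        }
    }
    out
}
```

With this scheme a single large file with many row groups can still keep all threads busy, which is not possible when whole files are the unit of work.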
[jira] [Created] (ARROW-11055) [Rust] [DataFusion] Support date_trunc function
Pavel Tiunov created ARROW-11055: Summary: [Rust] [DataFusion] Support date_trunc function Key: ARROW-11055 URL: https://issues.apache.org/jira/browse/ARROW-11055 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Pavel Tiunov Support a date_trunc SQL function that truncates timestamps to the desired granularity, similar to Postgres and other major SQL databases. For example:
{code:java}
SELECT date_trunc(timestamp, 'day') AS day, count(*)
FROM orders
GROUP BY day
{code}
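At its core, truncation is integer arithmetic on the epoch value. A minimal sketch for the 'day' granularity only, using a plain i64 of epoch seconds rather than Arrow timestamp arrays (`date_trunc_day` is an illustrative name):

```rust
const SECONDS_PER_DAY: i64 = 86_400;

/// Truncate a timestamp (seconds since the Unix epoch, UTC) down to the
/// start of its day. Euclidean division keeps pre-epoch values correct.
fn date_trunc_day(ts_seconds: i64) -> i64 {
    ts_seconds.div_euclid(SECONDS_PER_DAY) * SECONDS_PER_DAY
}
```

Other granularities like 'hour' or 'minute' follow the same pattern with a different divisor; 'month' and 'year' need calendar arithmetic instead.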
[jira] [Created] (ARROW-11054) Update SQLParser to 0.70
Patsura Dmitry created ARROW-11054: -- Summary: Update SQLParser to 0.70 Key: ARROW-11054 URL: https://issues.apache.org/jira/browse/ARROW-11054 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Patsura Dmitry
[jira] [Created] (ARROW-11053) [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches
Andy Grove created ARROW-11053: -- Summary: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches Key: ARROW-11053 URL: https://issues.apache.org/jira/browse/ARROW-11053 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Rather than using the size of the left or right batches to determine the capacity of the output batches we can use the average size of previous output batches.
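One way to sketch the idea, with `OutputSizeEstimator` as a hypothetical helper rather than existing DataFusion code:

```rust
/// Track the average number of rows in previously emitted output batches and
/// use it as the capacity hint for the next batch, instead of sizing from
/// the left or right input batch.
struct OutputSizeEstimator {
    total_rows: usize,
    batches: usize,
    initial_capacity: usize,
}

impl OutputSizeEstimator {
    fn new(initial_capacity: usize) -> Self {
        Self { total_rows: 0, batches: 0, initial_capacity }
    }

    /// Capacity to reserve for the next output batch.
    fn next_capacity(&self) -> usize {
        if self.batches == 0 {
            self.initial_capacity // no history yet; fall back to a fixed guess
        } else {
            self.total_rows / self.batches
        }
    }

    /// Record the actual size of an emitted batch.
    fn record(&mut self, rows: usize) {
        self.total_rows += rows;
        self.batches += 1;
    }
}
```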
[jira] [Created] (ARROW-11052) [Rust] [DataFusion] Implement metrics in join operator
Andy Grove created ARROW-11052: -- Summary: [Rust] [DataFusion] Implement metrics in join operator Key: ARROW-11052 URL: https://issues.apache.org/jira/browse/ARROW-11052 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Implement metrics in join operator to make it easier to debug performance issues.
[jira] [Created] (ARROW-11051) [Rust] Array sum result is wrong with remainder fields
Ziru Niu created ARROW-11051: Summary: [Rust] Array sum result is wrong with remainder fields Key: ARROW-11051 URL: https://issues.apache.org/jira/browse/ARROW-11051 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 2.0.0 Environment: Ubuntu 20.04. rustc nightly Reporter: Ziru Niu Minimal example:
{code:java}
use arrow::{array::PrimitiveArray, datatypes::Int64Type};

fn main() {
    let mut s = vec![];
    for _ in 0..32 {
        s.push(Some(1i64));
        s.push(None);
    }
    let v: PrimitiveArray<Int64Type> = s.into();
    dbg!(arrow::compute::sum(&v));
}
{code}
The following code in `compute::sum` is wrong; the bit mask check is reversed:
{code:java}
remainder.iter().enumerate().for_each(|(i, value)| {
    if remainder_bits & (1 << i) != 0 {
        remainder_sum = remainder_sum + *value;
    }
});
{code}
[jira] [Created] (ARROW-11050) Help files read that write_parquet accepts objects of type RecordBatch - this results in a fatal crash
Matthew B Connor created ARROW-11050: Summary: Help files read that write_parquet accepts objects of type RecordBatch - this results in a fatal crash Key: ARROW-11050 URL: https://issues.apache.org/jira/browse/ARROW-11050 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 2.0.0 Environment: MacOS Catalina 10.15.7 R version 4.0.2 Reporter: Matthew B Connor write_parquet() fatally crashes the R environment when writing a 'record_batch' object. Repro:
{code:java}
working_dir <- getwd()
dir.create(paste0(working_dir, '/test'))
out_file <- '/test.snappy.parquet'
data(mtcars)
batch <- record_batch(mtcars)
write_parquet(batch, paste0(working_dir, out_file))
{code}
[jira] [Created] (ARROW-11049) [Python] Expose alternate memory pools
Antoine Pitrou created ARROW-11049: -- Summary: [Python] Expose alternate memory pools Key: ARROW-11049 URL: https://issues.apache.org/jira/browse/ARROW-11049 Project: Apache Arrow Issue Type: Improvement Reporter: Antoine Pitrou Fix For: 3.0.0 Currently, the default memory pool is exposed in Python but not the explicit memory pool singletons (jemalloc, mimalloc, system).
[jira] [Created] (ARROW-11048) [Rust] Add bench to MutableBuffer
Jorge Leitão created ARROW-11048: Summary: [Rust] Add bench to MutableBuffer Key: ARROW-11048 URL: https://issues.apache.org/jira/browse/ARROW-11048 Project: Apache Arrow Issue Type: Test Reporter: Jorge Leitão Assignee: Jorge Leitão
[jira] [Created] (ARROW-11047) [Rust] [DataFusion] ParquetTable should avoid scanning all files twice
Andy Grove created ARROW-11047: -- Summary: [Rust] [DataFusion] ParquetTable should avoid scanning all files twice Key: ARROW-11047 URL: https://issues.apache.org/jira/browse/ARROW-11047 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove ParquetTable currently reads the metadata for all files once in the constructor in order to get the schema, and does it again each time scan() is called. We could read the metadata once and cache it instead.
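A sketch of the caching approach in plain Rust. `ParquetTableSketch` and the schema-reading closure are illustrative stand-ins for the real ParquetTable and its Parquet footer I/O:

```rust
/// Read the file metadata once, in the constructor, and let every scan()
/// reuse the cached result instead of re-reading footers.
struct ParquetTableSketch {
    files: Vec<String>,
    cached_schema: String, // stand-in for the real parsed schema/metadata
}

impl ParquetTableSketch {
    fn try_new(files: Vec<String>, read_schema: impl Fn(&str) -> String) -> Self {
        // The real constructor would parse each file's Parquet footer here;
        // a caller-supplied closure stands in for that I/O.
        let cached_schema = files
            .first()
            .map(|f| read_schema(f))
            .unwrap_or_default();
        Self { files, cached_schema }
    }

    /// scan() returns the cached schema; no metadata is re-read.
    fn scan(&self) -> &str {
        &self.cached_schema
    }
}
```

The metadata-reading closure runs only inside the constructor, so repeated calls to scan() cost nothing extra, which is the behavior the issue asks for.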
[jira] [Created] (ARROW-11046) [Rust][DataFusion] Add count_distinct to dataframe API
Daniël Heres created ARROW-11046: Summary: [Rust][DataFusion] Add count_distinct to dataframe API Key: ARROW-11046 URL: https://issues.apache.org/jira/browse/ARROW-11046 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Daniël Heres Assignee: Daniël Heres
[jira] [Created] (ARROW-11045) [Rust] Improve performance allocator
Jorge Leitão created ARROW-11045: Summary: [Rust] Improve performance allocator Key: ARROW-11045 URL: https://issues.apache.org/jira/browse/ARROW-11045 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Leitão Assignee: Jorge Leitão