[jira] [Created] (ARROW-11060) [Rust] Logical equality for list arrays

2020-12-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11060:
--

 Summary: [Rust] Logical equality for list arrays
 Key: ARROW-11060
 URL: https://issues.apache.org/jira/browse/ARROW-11060
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Neville Dipale


Apply logical equality to lists. This requires computing the merged nulls of a 
list and its child based on list offsets.

For example, a list with 3 slots and offsets (0, 1, 3, 5) spans 5 child values, 
so its validity needs to be expanded to a 5-value null mask. If the list's 
validity is [true, false, true] and the child's validity is [t, f, t, f, t], we 
would get:

[t, f, f, t, t] AND [t, f, t, f, t] = [t, f, f, f, t]
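The merge described above can be sketched in Python (a minimal illustration with plain lists standing in for Arrow buffers; the helper name is hypothetical):

```python
def merged_child_validity(offsets, list_validity, child_validity):
    # Expand each list slot's validity across the child values it spans
    # (offsets[i]..offsets[i + 1]), then AND with the child's own validity.
    expanded = []
    for i, valid in enumerate(list_validity):
        expanded.extend([valid] * (offsets[i + 1] - offsets[i]))
    return [a and b for a, b in zip(expanded, child_validity)]

# The example from the description:
merged = merged_child_validity(
    [0, 1, 3, 5],                      # list offsets
    [True, False, True],               # list validity
    [True, False, True, False, True],  # child validity
)
# → [True, False, False, False, True]
```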



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11059:
--

 Summary: [Rust] [DataFusion] Implement extensible configuration 
mechanism
 Key: ARROW-11059
 URL: https://issues.apache.org/jira/browse/ARROW-11059
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


We are getting to the point where there are multiple settings we could add to 
operators to fine-tune performance. Custom operators provided by crates that 
extend DataFusion may also need this capability.

I propose that we add support for key-value configuration options so that we 
don't need to plumb through each new configuration setting that we add.

For example, I am about to start on a "coalesce batches" operator and I would 
like a setting such as "coalesce.batch.size".

For built-in settings like this we can attach metadata such as documentation 
strings and default values, and generate reference documentation from them.

For example, here is how Spark defines configs:
{code:java}
  val PARQUET_VECTORIZED_READER_ENABLED =
buildConf("spark.sql.parquet.enableVectorizedReader")
  .doc("Enables vectorized parquet decoding.")
  .version("2.0.0")
  .booleanConf
  .createWithDefault(true) {code}
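As a rough sketch of what a similar key-value registry could look like (all names here are hypothetical, not an existing DataFusion API; Python is used only for brevity):

```python
class ConfigEntry:
    """A documented config option with a default, loosely mirroring the
    Spark builder pattern shown above."""
    def __init__(self, key, doc, default):
        self.key, self.doc, self.default = key, doc, default

class ConfigRegistry:
    def __init__(self):
        self._entries = {}  # built-in option definitions
        self._values = {}   # user-supplied overrides

    def register(self, entry):
        self._entries[entry.key] = entry

    def set(self, key, value):
        self._values[key] = value

    def get(self, key):
        # Fall back to the registered default when no override is set.
        if key in self._values:
            return self._values[key]
        return self._entries[key].default

registry = ConfigRegistry()
registry.register(ConfigEntry("coalesce.batch.size",
                              "Target batch size for the coalesce operator.",
                              4096))
```

Because options are registered rather than hard-coded, new settings would not need to be plumbed through every constructor.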





[jira] [Created] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11058:
--

 Summary: [Rust] [DataFusion] Implement "coalesce batches" operator
 Key: ARROW-11058
 URL: https://issues.apache.org/jira/browse/ARROW-11058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


When we have a FilterExec in the plan, it can produce lots of small batches, and 
we therefore lose the efficiency of vectorized operations.

We should implement a new CoalesceBatchExec and wrap every FilterExec with one 
of these so that small batches can be recombined into larger batches to improve 
the efficiency of upstream operators.
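The coalescing idea can be sketched in Python, with plain lists standing in for record batches (the real operator would concatenate Arrow RecordBatches; the function name is illustrative):

```python
def coalesce_batches(batches, target_size):
    # Buffer incoming small batches and emit a combined batch once at
    # least target_size rows have accumulated; flush any remainder.
    out, buffer, buffered_rows = [], [], 0
    for batch in batches:
        buffer.append(batch)
        buffered_rows += len(batch)
        if buffered_rows >= target_size:
            out.append([row for b in buffer for row in b])
            buffer, buffered_rows = [], 0
    if buffer:
        out.append([row for b in buffer for row in b])
    return out
```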





[jira] [Created] (ARROW-11057) [Python] Data inconsistency with read and write

2020-12-28 Thread David Quijano (Jira)
David Quijano created ARROW-11057:
-

 Summary: [Python] Data inconsistency with read and write
 Key: ARROW-11057
 URL: https://issues.apache.org/jira/browse/ARROW-11057
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
Reporter: David Quijano


 

I have been reading and writing some tables to Parquet and I found some 
inconsistencies.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# create a table with some data
a = pa.Table.from_pydict({'x': [1]*100, 'y': [2]*100, 'z': [3]*100})
# write it to file
pq.write_table(a, 'test.parquet')
# read the same file
b = pq.read_table('test.parquet')
# a == b is True, that's good
# write table b to file
pq.write_table(b, 'test2.parquet')
# test.parquet is different from test2.parquet{code}
Basically it is:
 * Create table in memory
 * Write it to file
 * Read it again
 * Write it to a different file

The files are not the same. The second one contains extra information.

The differences are consistent across different compressions (I tried snappy 
and zstd).

Also, reading the second file and writing it again produces the same file.

Is this a bug or expected behavior?





[jira] [Created] (ARROW-11056) [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11056:
--

 Summary: [Rust] [DataFusion] Allow ParquetExec to parallelize work 
based on row groups
 Key: ARROW-11056
 URL: https://issues.apache.org/jira/browse/ARROW-11056
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ParquetExec currently parallelizes work by passing individual files to 
threads. It would be nice to be able to do this in a finer-grained way by 
assigning row groups and/or column chunks instead. This will be especially 
important in distributed systems built on DataFusion.





[jira] [Created] (ARROW-11055) [Rust] [DataFusion] Support date_trunc function

2020-12-28 Thread Pavel Tiunov (Jira)
Pavel Tiunov created ARROW-11055:


 Summary: [Rust] [DataFusion] Support date_trunc function
 Key: ARROW-11055
 URL: https://issues.apache.org/jira/browse/ARROW-11055
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Pavel Tiunov


Support the date_trunc SQL function to truncate timestamps to the desired 
granularity, similar to Postgres and other major SQL databases. For example:
{code:java}
SELECT date_trunc(timestamp, 'day') as day, count(*) FROM orders GROUP BY day
{code}
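The truncation semantics can be sketched in Python (granularities shown are illustrative, and the argument order follows the example above; this is not the DataFusion implementation):

```python
from datetime import datetime

def date_trunc(ts, granularity):
    # Zero out all components finer than the requested granularity.
    if granularity == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0,
                          second=0, microsecond=0)
    if granularity == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")

date_trunc(datetime(2020, 12, 28, 13, 45, 7), "day")
# → datetime(2020, 12, 28, 0, 0)
```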





[jira] [Created] (ARROW-11054) Update SQLParser to 0.70

2020-12-28 Thread Patsura Dmitry (Jira)
Patsura Dmitry created ARROW-11054:
--

 Summary: Update SQLParser to 0.70
 Key: ARROW-11054
 URL: https://issues.apache.org/jira/browse/ARROW-11054
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Patsura Dmitry








[jira] [Created] (ARROW-11053) [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11053:
--

 Summary: [Rust] [DataFusion] Optimize joins with dynamic capacity 
for output batches
 Key: ARROW-11053
 URL: https://issues.apache.org/jira/browse/ARROW-11053
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Rather than using the size of the left or right batches to determine the 
capacity of the output batches, we can use the average size of previous output 
batches.
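A minimal sketch of such an estimator in Python (the class name and initial-capacity fallback are assumptions, not the actual DataFusion code):

```python
class BatchCapacityEstimator:
    """Track a running average of previous output batch sizes and use it
    as the capacity hint for the next output batch."""
    def __init__(self, initial_capacity):
        self.initial = initial_capacity
        self.total_rows = 0
        self.batches = 0

    def record(self, rows):
        # Call after each output batch is produced.
        self.total_rows += rows
        self.batches += 1

    def next_capacity(self):
        # Fall back to a fixed initial capacity before any output exists.
        if self.batches == 0:
            return self.initial
        return self.total_rows // self.batches
```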





[jira] [Created] (ARROW-11052) [Rust] [DataFusion] Implement metrics in join operator

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11052:
--

 Summary: [Rust] [DataFusion] Implement metrics in join operator
 Key: ARROW-11052
 URL: https://issues.apache.org/jira/browse/ARROW-11052
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Implement metrics in the join operator to make it easier to debug performance 
issues.





[jira] [Created] (ARROW-11051) [Rust] Array sum result is wrong with remainder fields

2020-12-28 Thread Ziru Niu (Jira)
Ziru Niu created ARROW-11051:


 Summary: [Rust] Array sum result is wrong with remainder fields
 Key: ARROW-11051
 URL: https://issues.apache.org/jira/browse/ARROW-11051
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
 Environment: Ubuntu 20.04, rustc nightly
Reporter: Ziru Niu


Minimal example:

```
use arrow::{array::PrimitiveArray, datatypes::Int64Type};

fn main() {
    let mut s = vec![];
    for _ in 0..32 {
        s.push(Some(1i64));
        s.push(None);
    }
    let v: PrimitiveArray<Int64Type> = s.into();
    dbg!(arrow::compute::sum(&v));
}
```
 
The following code in `compute::sum` is wrong: the bit mask check is reversed.
```
remainder.iter().enumerate().for_each(|(i, value)| {
    if remainder_bits & (1 << i) != 0 {
        remainder_sum = remainder_sum + *value;
    }
});
```
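For reference, the intended semantics (bit i of the mask guarding values[i], with only valid values contributing to the sum) can be sketched in Python:

```python
def masked_sum(values, validity_bits):
    # Sum only the values whose validity bit is set; bit i of
    # validity_bits corresponds to values[i].
    total = 0
    for i, value in enumerate(values):
        if validity_bits & (1 << i):
            total += value
    return total

masked_sum([1, 2, 4, 8], 0b0101)  # → 5 (values at indices 0 and 2)
```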
 





[jira] [Created] (ARROW-11050) Help files read that write_parquet accepts objects of type RecordBatch - this results in a fatal crash

2020-12-28 Thread Matthew B Connor (Jira)
Matthew B Connor created ARROW-11050:


 Summary: Help files read that write_parquet accepts objects of 
type RecordBatch - this results in a fatal crash
 Key: ARROW-11050
 URL: https://issues.apache.org/jira/browse/ARROW-11050
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 2.0.0
 Environment: MacOS Catalina 10.15.7
R version 4.0.2
Reporter: Matthew B Connor


write_parquet() fatally crashes the R environment when writing a 'record_batch' 
object.

Repro:
{code:java}
working_dir <- getwd()
dir.create(paste0(working_dir, '/test'))
out_file <- '/test.snappy.parquet'
data(mtcars)
batch <- record_batch(mtcars)
write_parquet(batch, paste0(working_dir, out_file)){code}
 





[jira] [Created] (ARROW-11049) [Python] Expose alternate memory pools

2020-12-28 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-11049:
--

 Summary: [Python] Expose alternate memory pools
 Key: ARROW-11049
 URL: https://issues.apache.org/jira/browse/ARROW-11049
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Antoine Pitrou
 Fix For: 3.0.0


Currently, the default memory pool is exposed in Python but not the explicit 
memory pool singletons (jemalloc, mimalloc, system).

 





[jira] [Created] (ARROW-11048) [Rust] Add bench to MutableBuffer

2020-12-28 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-11048:


 Summary: [Rust] Add bench to MutableBuffer
 Key: ARROW-11048
 URL: https://issues.apache.org/jira/browse/ARROW-11048
 Project: Apache Arrow
  Issue Type: Test
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-11047) [Rust] [DataFusion] ParquetTable should avoid scanning all files twice

2020-12-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-11047:
--

 Summary: [Rust] [DataFusion] ParquetTable should avoid scanning 
all files twice
 Key: ARROW-11047
 URL: https://issues.apache.org/jira/browse/ARROW-11047
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


ParquetTable currently reads the metadata for all files once in the constructor 
in order to get the schema, and then reads it again each time scan() is called.

We could read the metadata once and cache it instead.
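The caching idea can be sketched in Python (names are illustrative; the real code would cache parsed Parquet footers keyed by file path):

```python
class MetadataCache:
    """Read each file's metadata once and reuse it on later scans,
    instead of re-parsing it on every scan() call."""
    def __init__(self, loader):
        self._loader = loader  # e.g. a function that parses Parquet metadata
        self._cache = {}

    def get(self, path):
        # Load on first access only; subsequent calls hit the cache.
        if path not in self._cache:
            self._cache[path] = self._loader(path)
        return self._cache[path]
```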





[jira] [Created] (ARROW-11046) [Rust][DataFusion] Add count_distinct to dataframe API

2020-12-28 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-11046:


 Summary: [Rust][DataFusion] Add count_distinct to dataframe API
 Key: ARROW-11046
 URL: https://issues.apache.org/jira/browse/ARROW-11046
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11045) [Rust] Improve performance allocator

2020-12-28 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-11045:


 Summary: [Rust] Improve performance allocator
 Key: ARROW-11045
 URL: https://issues.apache.org/jira/browse/ARROW-11045
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge Leitão
Assignee: Jorge Leitão





