[jira] [Created] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction

2021-02-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-11606:
--

 Summary: [Rust] [DataFusion] Need guidance on HashAggregateExec 
reconstruction
 Key: ARROW-11606
 URL: https://issues.apache.org/jira/browse/ARROW-11606
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


We have run into an issue in the Ballista project where we are reconstructing 
the Final and Partial HashAggregateExec operators [1] for distributed execution 
and we need some guidance.

The Partial HashAggregateExec gets created OK and executes correctly.

However, when we create the Final HashAggregateExec, it is not finding the 
expected schema in the input operator. The partial exec outputs field names 
ending with "[sum]" and "[count]" and so on but the final aggregate doesn't 
seem to be looking for those names.

It is also worth noting that the Final and Partial executors are not connected 
directly in this usage.

The Partial exec is executed and output streamed to disk.

The Final exec then runs against the output from the Partial exec.

Do we need to make changes in DataFusion so that other crates can support this 
kind of use case?

 [1] https://github.com/ballista-compute/ballista/pull/491
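
To make the symptom concrete, here is a minimal sketch of a name-based lookup 
failing against a partial-stage schema. It does not use DataFusion at all: 
`SketchSchema`, `partial_state_name`, `resolve_by_name` and the expression 
names such as "SUM(price)" are hypothetical, and only the "[sum]" / "[count]" 
suffixes come from the report above. How HashAggregateExec actually resolves 
its inputs may differ.

```rust
/// Hypothetical stand-in for a schema: an ordered list of field names.
/// This is NOT DataFusion's `Schema` type.
pub struct SketchSchema {
    pub fields: Vec<String>,
}

impl SketchSchema {
    /// Look a field up by exact name, returning its position if present.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.fields.iter().position(|f| f == name)
    }
}

/// Build a partial-aggregate state column name such as "SUM(price)[sum]".
/// The suffixes come from the report; the expression names are made up.
pub fn partial_state_name(expr: &str, state: &str) -> String {
    format!("{}[{}]", expr, state)
}

/// The schema a partial stage might write to disk (assumed layout).
pub fn partial_output_schema() -> SketchSchema {
    SketchSchema {
        fields: vec![
            "region".to_string(),
            partial_state_name("SUM(price)", "sum"),
            partial_state_name("AVG(price)", "count"),
            partial_state_name("AVG(price)", "sum"),
        ],
    }
}

/// Resolve the final stage's inputs by name. This fails when the final stage
/// asks for the plain expression name instead of the suffixed state column.
pub fn resolve_by_name(
    schema: &SketchSchema,
    wanted: &[&str],
) -> Result<Vec<usize>, String> {
    wanted
        .iter()
        .map(|name| {
            schema
                .index_of(name)
                .ok_or_else(|| format!("field not found: {}", name))
        })
        .collect()
}
```

In this sketch, asking for "SUM(price)" returns an error while asking for 
"SUM(price)[sum]" succeeds, which is the kind of mismatch described above when 
the two stages are built separately.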

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11605) [Rust] Adopt a MSRV policy

2021-02-11 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11605:
--

 Summary: [Rust] Adopt a MSRV policy
 Key: ARROW-11605
 URL: https://issues.apache.org/jira/browse/ARROW-11605
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


With all our crates now supporting stable Rust, we can decide on a Minimum 
Supported Rust Version, so that we don't introduce breakage to people relying 
on older Rust versions.

We could:
* Determine the earliest Rust version that compiles our crates (at least 1.39, 
due to async in DataFusion)
* Use this version in CI
* Decide on, and document, a policy for how we update versions

This might mean that when new features land in stable Rust, we would hold off 
on using them until our MSRV reaches the release that includes them.

Thoughts [~Dandandan] [~alamb] [~jorgecarleitao] [~andygrove] [~paddyhoran] 
[~sunchao]?





[jira] [Created] (ARROW-11604) [Rust] Remove some unsafe in buffer using fill

2021-02-11 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-11604:


 Summary: [Rust] Remove some unsafe in buffer using fill
 Key: ARROW-11604
 URL: https://issues.apache.org/jira/browse/ARROW-11604
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres


We can use 
https://doc.rust-lang.org/std/primitive.slice.html#method.fill

instead of using write_bytes.
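
As an illustration of the proposed change, the sketch below zeroes a byte 
buffer both ways. The function names are hypothetical and this is not the code 
in buffer.rs; it only contrasts `std::ptr::write_bytes`, which needs an 
`unsafe` block, with the safe `slice::fill`, which is stable from Rust 1.50.

```rust
use std::ptr;

/// Zero a byte buffer through the raw-pointer API.
pub fn zero_with_write_bytes(buf: &mut [u8]) {
    // SAFETY: the pointer and length both come from a valid, exclusively
    // borrowed slice, so the write stays in bounds.
    unsafe { ptr::write_bytes(buf.as_mut_ptr(), 0, buf.len()) };
}

/// Zero the same buffer with the safe `slice::fill`.
pub fn zero_with_fill(buf: &mut [u8]) {
    buf.fill(0);
}
```

Both functions leave the buffer in the same state; the second one needs no 
`unsafe` block and no safety comment.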





[jira] [Created] (ARROW-11603) [Rust] Fix clippy error

2021-02-11 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-11603:


 Summary: [Rust] Fix clippy error
 Key: ARROW-11603
 URL: https://issues.apache.org/jira/browse/ARROW-11603
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jorge Leitão
Assignee: Andrew Lamb








[jira] [Created] (ARROW-11602) [Rust] Clippy CI is failing

2021-02-11 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-11602:
---

 Summary: [Rust] Clippy CI is failing
 Key: ARROW-11602
 URL: https://issues.apache.org/jira/browse/ARROW-11602
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andrew Lamb
Assignee: Andrew Lamb


CI uses "stable" Rust, and stable was updated to 1.50 today: 
https://blog.rust-lang.org/2021/02/11/Rust-1.50.0.html

The new clippy is pickier, resulting in many new clippy warnings such as 
https://github.com/apache/arrow/pull/9469/checks?check_run_id=1881854256

We need to get CI back to green.





[jira] [Created] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions

2021-02-11 Thread David Li (Jira)
David Li created ARROW-11601:


 Summary: [C++][Dataset] Expose pre-buffering in 
ParquetFileFormatReaderOptions
 Key: ARROW-11601
 URL: https://issues.apache.org/jira/browse/ARROW-11601
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li


This can help performance on high-latency filesystems. However, some care will 
be needed, because with pre-buffering we can no longer create one Arrow reader 
per Parquet row group.





[jira] [Created] (ARROW-11600) Convert multi dimensional numpy array to pyarrow array

2021-02-11 Thread Bhavitvya Malik (Jira)
Bhavitvya Malik created ARROW-11600:
---

 Summary: Convert multi dimensional numpy array to pyarrow array
 Key: ARROW-11600
 URL: https://issues.apache.org/jira/browse/ARROW-11600
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 3.0.0, 2.0.0
Reporter: Bhavitvya Malik



{{data = np.zeros((10,8), dtype=np.uint8)}}
{{out = pa.array(list(data))}}
{{out.type  # ListType(list<item: uint8>)}}

{{data = np.zeros((3,4,6), dtype=np.uint8)}}
{{out = pa.array(list(data))  # Throws error ArrowInvalid: Can only convert 
1-dimensional array values}}

Conversion works for 2D numpy arrays, but it fails for N-dimensional numpy 
arrays where N > 2. Is it possible to extend the current feature to inner 
elements with more than one dimension?





[jira] [Created] (ARROW-11599) [Rust] Add function to create array with all nulls

2021-02-11 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11599:
--

 Summary: [Rust] Add function to create array with all nulls
 Key: ARROW-11599
 URL: https://issues.apache.org/jira/browse/ARROW-11599
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale
Assignee: Neville Dipale








[jira] [Created] (ARROW-11598) [Rust] Split buffer.rs in smaller files

2021-02-11 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-11598:


 Summary: [Rust] Split buffer.rs in smaller files
 Key: ARROW-11598
 URL: https://issues.apache.org/jira/browse/ARROW-11598
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-11597) [Rust] Split datatypes in a module

2021-02-11 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-11597:


 Summary: [Rust] Split datatypes in a module
 Key: ARROW-11597
 URL: https://issues.apache.org/jira/browse/ARROW-11597
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-11596) [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors

2021-02-11 Thread David Li (Jira)
David Li created ARROW-11596:


 Summary: [C++][Python][Dataset] SIGSEGV when executing scan tasks 
with Python executors
 Key: ARROW-11596
 URL: https://issues.apache.org/jira/browse/ARROW-11596
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li


This crashes for me with a segfault:
{code:python}
import concurrent.futures
import queue

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq


schema = pa.schema([("foo", pa.float64())])
table = pa.table([np.random.uniform(size=1024)], schema=schema)
path = "/tmp/foo.parquet"
pq.write_table(table, path)
dataset = pa.dataset.FileSystemDataset.from_paths(
    [path],
    schema=schema,
    format=ds.ParquetFileFormat(),
    filesystem=fs.LocalFileSystem(),
)

with concurrent.futures.ThreadPoolExecutor(2) as executor:
    tasks = dataset.scan()
    q = queue.Queue()

    def _prebuffer():
        for task in tasks:
            iterator = task.execute()
            next(iterator)
            q.put(iterator)

    executor.submit(_prebuffer).result()
    next(q.get())
{code}

{noformat}
$ uname -a
Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + 
x86_64 GNU/Linux
$ pip freeze
numpy==1.20.1
pyarrow==3.0.0
{noformat}






[jira] [Created] (ARROW-11595) [C++][NIGHTLY:test-conda-cpp-valgrind] GenerateBitsUnrolled triggers valgrind on uninit output

2021-02-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-11595:


 Summary: [C++][NIGHTLY:test-conda-cpp-valgrind] 
GenerateBitsUnrolled triggers valgrind on uninit output
 Key: ARROW-11595
 URL: https://issues.apache.org/jira/browse/ARROW-11595
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 4.0.0


https://github.com/ursacomputing/crossbow/runs/1877315066#step:6:2818

Comparison kernels generate an output bitmap for all array values, including 
those masked by a null bit. This should be fine since the indeterminate bits 
are also masked in the output but valgrind still triggers on the branching in 
GenerateBitsUnrolled.
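
The masking argument can be sketched as follows. This is not Arrow's C++ 
`GenerateBitsUnrolled`; `greater_than_bits` and `read_slot` are hypothetical 
names, and in safe Rust the value under a null slot is initialized rather than 
indeterminate. The sketch only shows that whatever bit is computed for a null 
slot cannot be observed through the validity bitmap, which is why the valgrind 
report concerns the branch itself and not the final result.

```rust
/// Pack `values[i] > threshold` into a bitmap, least-significant bit first.
/// Every slot is compared, including slots the validity bitmap marks null.
pub fn greater_than_bits(values: &[i32], threshold: i32) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, v) in values.iter().enumerate() {
        // This branch also runs for null slots, whatever value they hold.
        if *v > threshold {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

/// Read slot `i` of the result, treating null slots as `None`.
pub fn read_slot(result: &[u8], validity: &[u8], i: usize) -> Option<bool> {
    let mask = 1u8 << (i % 8);
    if validity[i / 8] & mask == 0 {
        None // the data bit underneath is never inspected
    } else {
        Some(result[i / 8] & mask != 0)
    }
}
```

Two inputs that differ only in the value stored under a null slot produce 
different raw bitmaps, yet `read_slot` returns the same answer for every slot.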





[jira] [Created] (ARROW-11594) [Rust] Support pretty printing with NullArrays

2021-02-11 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-11594:
---

 Summary: [Rust] Support pretty printing with NullArrays
 Key: ARROW-11594
 URL: https://issues.apache.org/jira/browse/ARROW-11594
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andrew Lamb




The whole point of `NullArray::new_with_type` is to be able to cheaply 
construct entirely null columns, with a smaller memory footprint.

Currently, trying to print them out causes a panic:

{code}
#[test]
fn test_pretty_format_null() -> Result<()> {
    // define a schema.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Utf8, true),
        Field::new("b", DataType::Int32, true),
    ]));

    let num_rows = 4;

    // define data (null)
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
            Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
        ],
    )?;

    // this call currently panics
    let _table = pretty_format_batches(&[batch])?;

    Ok(())
}

{code}

Panics:

{code}

failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called 
`Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27

{code}
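
One plausible cause, stated here as an assumption and not as a reading of 
display.rs, is a downcast to the concrete array type followed by `unwrap()`, 
which fails when the column is a null array reporting another data type. The 
sketch below uses hypothetical column types (`Int32Column`, `AllNullColumn`) 
and `std::any::Any` in place of the arrow crate's arrays to show a formatter 
that handles the all-null case and returns an error for unknown types.

```rust
use std::any::Any;

/// Hypothetical column types; these are not the arrow crate's arrays.
pub struct Int32Column(pub Vec<Option<i32>>);

pub struct AllNullColumn {
    pub len: usize,
}

/// Format one cell. A failed downcast falls through to the next candidate
/// type, and an unknown type yields an error where `unwrap()` would panic.
pub fn format_cell(column: &dyn Any, row: usize) -> Result<String, String> {
    if let Some(c) = column.downcast_ref::<Int32Column>() {
        match c.0.get(row) {
            Some(Some(v)) => Ok(v.to_string()),
            Some(None) => Ok(String::new()), // null slot prints as empty
            None => Err(format!("row {} out of bounds", row)),
        }
    } else if let Some(c) = column.downcast_ref::<AllNullColumn>() {
        if row < c.len {
            Ok(String::new()) // every slot is null
        } else {
            Err(format!("row {} out of bounds", row))
        }
    } else {
        Err("unsupported column type".to_string())
    }
}
```

With this shape, an all-null column prints empty cells, and an unexpected 
column type surfaces as an `Err` that the caller can report.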





[jira] [Created] (ARROW-11593) Parquet does not support wasm32-unknown-unknown target

2021-02-11 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11593:
--

 Summary: Parquet does not support wasm32-unknown-unknown target
 Key: ARROW-11593
 URL: https://issues.apache.org/jira/browse/ARROW-11593
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Dominik Moritz


The Arrow crate successfully compiles to WebAssembly (e.g. 
https://github.com/domoritz/arrow-wasm), but the Parquet crate currently does 
not support the `wasm32-unknown-unknown` target.

To reproduce, try the repository at 
https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0.
The problem seems to be in liblz4, even when lz4 is not included in the 
feature flags.





[jira] [Created] (ARROW-11592) Typo in comment

2021-02-11 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11592:
--

 Summary: Typo in comment
 Key: ARROW-11592
 URL: https://issues.apache.org/jira/browse/ARROW-11592
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Dominik Moritz
Assignee: Dominik Moritz





