[jira] [Created] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
Andy Grove created ARROW-11606:

Summary: [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
Key: ARROW-11606
URL: https://issues.apache.org/jira/browse/ARROW-11606
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Andy Grove

We have run into an issue in the Ballista project where we are reconstructing the Final and Partial HashAggregateExec operators [1] for distributed execution and we need some guidance.

The Partial HashAggregateExec gets created OK and executes correctly. However, when we create the Final HashAggregateExec, it is not finding the expected schema in the input operator. The partial exec outputs field names ending with "[sum]" and "[count]" and so on but the final aggregate doesn't seem to be looking for those names.

It is also worth noting that the Final and Partial executors are not connected directly in this usage. The Partial exec is executed and output streamed to disk. The Final exec then runs against the output from the Partial exec.

We may need to make changes in DataFusion to allow other crates to support this kind of use case?

[1] https://github.com/ballista-compute/ballista/pull/491

--
This message was sent by Atlassian Jira (v8.3.4#803005)
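To make the naming contract described above concrete, here is a minimal standard-library sketch of a two-phase average. It is not DataFusion's API: `Row`, `partial_avg`, and `final_avg` are hypothetical names invented for illustration, and the assumption that the final stage finds its inputs by the "[sum]" and "[count]" suffixes is taken only from the description in this report.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one row of partial-aggregate output:
/// column name -> value. Not a DataFusion type.
type Row = HashMap<String, f64>;

/// Partial stage: emits the intermediate state of AVG(column)
/// under names ending in "[sum]" and "[count]".
fn partial_avg(column: &str, values: &[f64]) -> Row {
    let mut row = Row::new();
    row.insert(format!("{}[sum]", column), values.iter().sum());
    row.insert(format!("{}[count]", column), values.len() as f64);
    row
}

/// Final stage: merges partial rows, looking the state columns up by name.
/// Returns `None` when a state column is missing or no rows were counted.
fn final_avg(column: &str, partials: &[Row]) -> Option<f64> {
    let sum_key = format!("{}[sum]", column);
    let count_key = format!("{}[count]", column);
    let mut sum = 0.0;
    let mut count = 0.0;
    for row in partials {
        sum += *row.get(&sum_key)?;
        count += *row.get(&count_key)?;
    }
    if count == 0.0 {
        None
    } else {
        Some(sum / count)
    }
}

fn main() {
    // Two partitions whose partial output could have been written to disk
    // and read back before the final stage runs.
    let partials = vec![
        partial_avg("x", &[1.0, 2.0, 3.0]),
        partial_avg("x", &[4.0, 5.0]),
    ];
    println!("{:?}", final_avg("x", &partials)); // Some(3.0)
    println!("{:?}", final_avg("y", &partials)); // None: no "y[sum]" column
}
```

The second call shows the failure mode in miniature: if the final stage looks for names the partial stage did not produce, the lookup fails even though the data is present.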
[jira] [Created] (ARROW-11605) [Rust] Adopt a MSRV policy
Neville Dipale created ARROW-11605:

Summary: [Rust] Adopt a MSRV policy
Key: ARROW-11605
URL: https://issues.apache.org/jira/browse/ARROW-11605
Project: Apache Arrow
Issue Type: Task
Components: Rust
Reporter: Neville Dipale

With all our crates now supporting stable Rust, we can decide on a Minimum Supported Rust Version, so that we don't introduce breakage to people relying on older Rust versions.

We could:
* Determine the earliest Rust version that compiles (at least 1.39 due to async in DF)
* Use this version in CI
* Decide on, and document, a policy for how we update versions

This might mean that when new changes land in stable, we'd likely hold off on using them until they are covered by our MSRV.

Thoughts [~Dandandan] [~alamb] [~jorgecarleitao] [~andygrove] [~paddyhoran] [~sunchao]?
[jira] [Created] (ARROW-11604) [Rust] Remove some unsafe in buffer using fill
Daniël Heres created ARROW-11604:

Summary: [Rust] Remove some unsafe in buffer using fill
Key: ARROW-11604
URL: https://issues.apache.org/jira/browse/ARROW-11604
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres

We can use https://doc.rust-lang.org/std/primitive.slice.html#method.fill instead of using write_bytes.
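As a rough illustration of the proposed change, the sketch below zeroes a byte slice both ways using only the standard library. The function names are hypothetical and this is not the code in the Arrow buffer module; it only contrasts `std::ptr::write_bytes` with `slice::fill`, which is stable as of Rust 1.50.

```rust
use std::ptr;

/// Zeroes `buf` through a raw pointer, in the style of the `write_bytes` approach.
fn zero_with_write_bytes(buf: &mut [u8]) {
    // SAFETY: the pointer and length both come from the same valid,
    // exclusively borrowed slice.
    unsafe { ptr::write_bytes(buf.as_mut_ptr(), 0, buf.len()) }
}

/// Zeroes `buf` with the safe `slice::fill`; no `unsafe` block is needed.
fn zero_with_fill(buf: &mut [u8]) {
    buf.fill(0);
}

fn main() {
    let mut a = vec![0xAAu8; 8];
    let mut b = a.clone();
    zero_with_write_bytes(&mut a);
    zero_with_fill(&mut b);
    // Both approaches leave the buffer in the same state.
    assert_eq!(a, b);
    println!("{:?}", b);
}
```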
[jira] [Created] (ARROW-11603) [Rust] Fix clippy error
Jorge Leitão created ARROW-11603:

Summary: [Rust] Fix clippy error
Key: ARROW-11603
URL: https://issues.apache.org/jira/browse/ARROW-11603
Project: Apache Arrow
Issue Type: Bug
Reporter: Jorge Leitão
Assignee: Andrew Lamb
[jira] [Created] (ARROW-11602) [Rust] Clippy CI is failing
Andrew Lamb created ARROW-11602:

Summary: [Rust] Clippy CI is failing
Key: ARROW-11602
URL: https://issues.apache.org/jira/browse/ARROW-11602
Project: Apache Arrow
Issue Type: Bug
Reporter: Andrew Lamb
Assignee: Andrew Lamb

CI uses "stable" Rust, and stable was updated to 1.50 today: https://blog.rust-lang.org/2021/02/11/Rust-1.50.0.html

The new clippy is pickier, resulting in many clippy warnings such as https://github.com/apache/arrow/pull/9469/checks?check_run_id=1881854256

We need to get CI back to green.
[jira] [Created] (ARROW-11601) [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
David Li created ARROW-11601:

Summary: [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
Key: ARROW-11601
URL: https://issues.apache.org/jira/browse/ARROW-11601
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li

This can help performance on high-latency filesystems. However, some care will be needed as then we won't be able to create one Arrow reader per Parquet row group anymore.
[jira] [Created] (ARROW-11600) Convert multi dimensional numpy array to pyarrow array
Bhavitvya Malik created ARROW-11600:

Summary: Convert multi dimensional numpy array to pyarrow array
Key: ARROW-11600
URL: https://issues.apache.org/jira/browse/ARROW-11600
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Affects Versions: 3.0.0, 2.0.0
Reporter: Bhavitvya Malik

{code:python}
import numpy as np
import pyarrow as pa

# A 2D array converts fine when passed as a list of 1D rows.
data = np.zeros((10,8), dtype=np.uint8)
out = pa.array(list(data))
out.type # ListType(list)

# A 3D array fails because each list element is itself 2D.
data = np.zeros((3,4,6), dtype=np.uint8)
out = pa.array(list(data)) # Throws error ArrowInvalid: Can only convert 1-dimensional array values
{code}

Although this works perfectly on 2D numpy arrays, it doesn't work on N-dimensional numpy arrays (where N > 2). Is it possible to extend the current feature to inner elements with more than one dimension?
[jira] [Created] (ARROW-11599) [Rust] Add function to create array with all nulls
Neville Dipale created ARROW-11599:

Summary: [Rust] Add function to create array with all nulls
Key: ARROW-11599
URL: https://issues.apache.org/jira/browse/ARROW-11599
Project: Apache Arrow
Issue Type: New Feature
Reporter: Neville Dipale
Assignee: Neville Dipale
[jira] [Created] (ARROW-11598) [Rust] Split buffer.rs in smaller files
Jorge Leitão created ARROW-11598:

Summary: [Rust] Split buffer.rs in smaller files
Key: ARROW-11598
URL: https://issues.apache.org/jira/browse/ARROW-11598
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão
[jira] [Created] (ARROW-11597) [Rust] Split datatypes in a module
Jorge Leitão created ARROW-11597:

Summary: [Rust] Split datatypes in a module
Key: ARROW-11597
URL: https://issues.apache.org/jira/browse/ARROW-11597
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão
[jira] [Created] (ARROW-11596) [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors
David Li created ARROW-11596:

Summary: [C++][Python][Dataset] SIGSEGV when executing scan tasks with Python executors
Key: ARROW-11596
URL: https://issues.apache.org/jira/browse/ARROW-11596
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li

This crashes for me with a segfault:

{code:python}
import concurrent.futures
import queue

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

schema = pa.schema([("foo", pa.float64())])
table = pa.table([np.random.uniform(size=1024)], schema=schema)
path = "/tmp/foo.parquet"
pq.write_table(table, path)

dataset = pa.dataset.FileSystemDataset.from_paths(
    [path],
    schema=schema,
    format=ds.ParquetFileFormat(),
    filesystem=fs.LocalFileSystem(),
)

with concurrent.futures.ThreadPoolExecutor(2) as executor:
    tasks = dataset.scan()
    q = queue.Queue()

    def _prebuffer():
        for task in tasks:
            iterator = task.execute()
            next(iterator)
            q.put(iterator)

    executor.submit(_prebuffer).result()
    next(q.get())
{code}

{noformat}
$ uname -a
Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 + x86_64 GNU/Linux
$ pip freeze
numpy==1.20.1
pyarrow==3.0.0
{noformat}
[jira] [Created] (ARROW-11595) [C++][NIGHTLY:test-conda-cpp-valgrind] GenerateBitsUnrolled triggers valgrind on uninit output
Ben Kietzman created ARROW-11595:

Summary: [C++][NIGHTLY:test-conda-cpp-valgrind] GenerateBitsUnrolled triggers valgrind on uninit output
Key: ARROW-11595
URL: https://issues.apache.org/jira/browse/ARROW-11595
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 4.0.0

https://github.com/ursacomputing/crossbow/runs/1877315066#step:6:2818

Comparison kernels generate an output bitmap for all array values, including those masked by a null bit. This should be fine since the indeterminate bits are also masked in the output but valgrind still triggers on the branching in GenerateBitsUnrolled.
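The masking argument above can be sketched with plain byte bitmaps. The affected kernels are C++; the Rust below is only an illustrative model with hypothetical function names, it assumes least-significant-bit-first packing, and it does not reproduce the valgrind report itself. It shows a comparison computed for every slot, including a null slot holding an arbitrary value, with readers consulting the validity bitmap before trusting a result bit.

```rust
/// Compares every slot against `threshold`, including null slots,
/// packing the results LSB-first into bytes.
fn greater_than_bits(values: &[i32], threshold: i32) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, v) in values.iter().enumerate() {
        if *v > threshold {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}

/// Reads slot `i`: the result bit is only meaningful when the validity bit is set.
fn read_slot(bits: &[u8], validity: &[u8], i: usize) -> Option<bool> {
    let is_set = |bitmap: &[u8]| (bitmap[i / 8] & (1u8 << (i % 8))) != 0;
    if is_set(validity) {
        Some(is_set(bits))
    } else {
        None
    }
}

fn main() {
    // Slot 1 is null; 99 stands in for whatever bytes happen to be there.
    let values = [5, 99, 1, 7];
    let validity = [0b0000_1101u8];
    let bits = greater_than_bits(&values, 3);
    for i in 0..values.len() {
        println!("slot {}: {:?}", i, read_slot(&bits, &validity, i));
    }
}
```

In this model the bit computed for the null slot is never observed, which is the sense in which the indeterminate bits are harmless; a tool that tracks uninitialized memory can still flag the comparison that produced them.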
[jira] [Created] (ARROW-11594) [Rust] Support pretty printing with NullArrays
Andrew Lamb created ARROW-11594:

Summary: [Rust] Support pretty printing with NullArrays
Key: ARROW-11594
URL: https://issues.apache.org/jira/browse/ARROW-11594
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Andrew Lamb

The whole point of `NullArray::new_with_type` is to be able to cheaply construct entirely null columns, with a smaller memory footprint.

Currently, trying to print them out causes a panic:

{code}
#[test]
fn test_pretty_format_null() -> Result<()> {
    // define a schema.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Utf8, true),
        Field::new("b", DataType::Int32, true),
    ]));

    let num_rows = 4;

    // define data (null)
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
            Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
        ],
    )?;

    // formatting the all-null batch is what panics
    let table = pretty_format_batches(&[batch])?;
    Ok(())
}
{code}

Panics:

{code}
failures:
util::pretty::tests::test_pretty_format_null stdout
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
{code}
[jira] [Created] (ARROW-11593) Parquet does not support wasm32-unknown-unknown target
Dominik Moritz created ARROW-11593:

Summary: Parquet does not support wasm32-unknown-unknown target
Key: ARROW-11593
URL: https://issues.apache.org/jira/browse/ARROW-11593
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Dominik Moritz

The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm) but the Parquet crate currently does not support the `wasm32-unknown-unknown` target.

Try out the repository at https://github.com/domoritz/parquet-wasm/commit/e877f9ad9c45c09f73d98fab2a8ad384a802b2e0.

The problem seems to be in liblz4, even if I do not include lz4 in the feature flags.
[jira] [Created] (ARROW-11592) Typo in comment
Dominik Moritz created ARROW-11592:

Summary: Typo in comment
Key: ARROW-11592
URL: https://issues.apache.org/jira/browse/ARROW-11592
Project: Apache Arrow
Issue Type: Task
Components: Rust
Reporter: Dominik Moritz
Assignee: Dominik Moritz