[jira] [Created] (ARROW-10036) [Rust] [DataFusion] Test that the final schema is expected in integration tests
Jorge created ARROW-10036:
-
Summary: [Rust] [DataFusion] Test that the final schema is expected in integration tests
Key: ARROW-10036
URL: https://issues.apache.org/jira/browse/ARROW-10036
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Jorge

Currently, our integration tests convert a RecordBatch to a string, which we use for testing, but they do not test that the final schema matches our expectations. We should add a test for this, which checks:
# field name
# field type
# field nullability
for every field in the schema.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-10035) [C++] Bump versions of vendored code
Antoine Pitrou created ARROW-10035:
--
Summary: [C++] Bump versions of vendored code
Key: ARROW-10035
URL: https://issues.apache.org/jira/browse/ARROW-10035
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou
Fix For: 2.0.0
[jira] [Created] (ARROW-10034) [Rust] Master build broken
Andy Grove created ARROW-10034:
--
Summary: [Rust] Master build broken
Key: ARROW-10034
URL: https://issues.apache.org/jira/browse/ARROW-10034
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 2.0.0

I merged quite a few PRs today. There was a conflict and I need to revert one of them. I am working on it.
[jira] [Created] (ARROW-10033) ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False
Adam Hooper created ARROW-10033:
---
Summary: ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False
Key: ARROW-10033
URL: https://issues.apache.org/jira/browse/ARROW-10033
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 1.0.1
Reporter: Adam Hooper

`ArrowReaderProperties` has a `::arrow::io::AsyncContext async_context_;` member. Its ctor creates a thread pool. Stack trace:

```
#0 arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
#1 0x008e4747 in arrow::internal::ThreadPool::Make (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
#2 0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252
#3 0x008a20ac in arrow::io::internal::MakeIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326
#4 0x008a21dd in arrow::io::internal::GetIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334
#5 0x008a064f in arrow::io::AsyncContext::AsyncContext (this=0xea6bb0) at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49
#6 0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties (this=0xea6b60, use_threads=false) at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579
#7 0x005e1b98 in parquet::default_arrow_reader_properties () at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53
#8 0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder (this=0x7fffb31f0c60) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930
#9 0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0, reader=0x7fffb31f0e08) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957
```

As a caller, I expect `use_threads=False` to prevent the creation of threads. (Maybe there should be an exception if `pre_buffer && !use_threads`?)
[jira] [Created] (ARROW-10032) [Documentation] C++ Windows docs are out of date
David Li created ARROW-10032:
Summary: [Documentation] C++ Windows docs are out of date
Key: ARROW-10032
URL: https://issues.apache.org/jira/browse/ARROW-10032
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: David Li

* The recommended VM does not include the C++ compiler - we should link to the build tools and describe which of them need installation
* Boost: the b2 script now requires --with not -with flags

Even with this:
* The developer prompt can't find cl.exe (the compiler)
* The PowerShell prompt can't use conda (it complains that a config file isn't signed)
[jira] [Created] (ARROW-10031) Support Java benchmark in Ursabot
Kazuaki Ishizaki created ARROW-10031:
Summary: Support Java benchmark in Ursabot
Key: ARROW-10031
URL: https://issues.apache.org/jira/browse/ARROW-10031
Project: Apache Arrow
Issue Type: New Feature
Components: CI, Java
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki

Based on [the suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e], Ursabot will support Java benchmarks.
[jira] [Created] (ARROW-10030) [Rust] Support fromIter and toIter
Jorge created ARROW-10030:
-
Summary: [Rust] Support fromIter and toIter
Key: ARROW-10030
URL: https://issues.apache.org/jira/browse/ARROW-10030
Project: Apache Arrow
Issue Type: Improvement
Reporter: Jorge

Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing

(dump of the proposal:)

Rust Arrow supports two main computational models:
# Batch operations, which leverage some form of vectorization
# Element-by-element operations, which emerge in more complex operations

This document concerns element-by-element operations, which are the most common operations outside of the library.

h2. Element-by-element operations

These operations are programmatically written as:
# Downcast the array to its specific type
# Initialize buffers
# Iterate over indices and perform the operation, appending to the buffers accordingly
# Create ArrayData with the required null bitmap, buffers, children, etc.
# Return an ArrayRef from the ArrayData

We can split this process into 3 parts:
# Initialization (1 and 2)
# Iteration (3)
# Finalization (4 and 5)

Currently, the API that we offer to our users is:
# as_any() to downcast the array based on its DataType
# Builders for all types, which users can initialize to match the downcasted array
# Iterate:
#* use for i in (0..array.len())
#* use Array::value(i) and Array::is_valid(i)/is_null(i)
#* use builder.append_value(new_value) or builder.append_null()
# Finish the builder and wrap the result in an Arc

This API has some issues:
# value(i) +is unsafe+, even though it is not marked as such
# builders are usually slow due to the checks that they need to perform
# the API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
* Implement IntoIterator, so that arrays can be iterated as Option-valued items
* Implement FromIterator with Item=Option, so that users can write:

{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}

This results in an API that is:
# Efficient, as it is our responsibility to make the `FromIterator` implementations populate the buffers/children etc. efficiently from an iterator
# Safe, as it does not allow segfaults
# Simple, as users do not need to worry about builders, buffers, etc., only native Rust