[jira] [Created] (ARROW-11942) [C++] If tasks are submitted quickly the thread pool may fail to spin up new threads
Weston Pace created ARROW-11942:
-----------------------------------

Summary: [C++] If tasks are submitted quickly the thread pool may fail to spin up new threads
Key: ARROW-11942
URL: https://issues.apache.org/jira/browse/ARROW-11942
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace
Assignee: Weston Pace

Probably only really affects unit tests. Consider an idle thread pool with 1 thread (ready_count_ == 1). If `Spawn` is called very quickly, it may look like `ready_count_` is still greater than 0 (because `ready_count_` doesn't necessarily decrement by the time `Spawn` returns), and so it will not spin up new threads.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
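A minimal, hypothetical Python sketch of the race described above (the real code is the Arrow C++ ThreadPool; `ToyPool` and its members are invented for illustration). The key point is that the *worker*, not `Spawn`, decrements the ready count, so two back-to-back `Spawn` calls can both observe a stale positive count and neither spins up a new thread:

```python
class ToyPool:
    """Hypothetical stand-in for the buggy Spawn logic."""

    def __init__(self):
        self.ready_count = 1   # one idle worker thread
        self.num_threads = 1

    def spawn(self):
        # Buggy check: ready_count is decremented by the worker when it
        # actually wakes up, which may happen well after spawn() returns.
        if self.ready_count == 0:
            self.num_threads += 1  # spin up a new thread

pool = ToyPool()
pool.spawn()  # the worker will take this task, but hasn't decremented ready_count yet
pool.spawn()  # still sees ready_count == 1, so no new thread is created
print(pool.num_threads)  # -> 1, even though two tasks now wait on one worker
```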
[jira] [Created] (ARROW-11941) [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue
Yibo Cai created ARROW-11941:
-----------------------------

Summary: [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue
Key: ARROW-11941
URL: https://issues.apache.org/jira/browse/ARROW-11941
Project: Apache Arrow
Issue Type: Bug
Components: Developer Tools
Reporter: Yibo Cai
Assignee: Yibo Cai

"DEBUG=1 dev/merge_arrow_pr.py" is meant to act as a dry run without writing anything. It doesn't merge the PR, but it does update the Jira issue status. This should be fixed.
[jira] [Created] (ARROW-11940) [Rust][Datafusion] Support joins on TimestampMillisecond columns
Morgan Cassels created ARROW-11940:
-----------------------------------

Summary: [Rust][Datafusion] Support joins on TimestampMillisecond columns
Key: ARROW-11940
URL: https://issues.apache.org/jira/browse/ARROW-11940
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Morgan Cassels

Joining DataFrames on a TimestampMillisecond column fails with the error:

```
'called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")'
```
[jira] [Created] (ARROW-11939) Bug in `pa.PythonFile`?
Dave Hirschfeld created ARROW-11939:
------------------------------------

Summary: Bug in `pa.PythonFile`?
Key: ARROW-11939
URL: https://issues.apache.org/jira/browse/ARROW-11939
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 3.0.0
Reporter: Dave Hirschfeld

```python
with pa.PythonFile('deleteme.jnk', 'wb') as f:
    pass

AttributeError: 'str' object has no attribute 'closed'
```
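The traceback suggests the first argument is being treated as an already-open, file-like object (something exposing a `.closed` attribute) rather than a filesystem path. A stdlib-only sketch of that contract, with a hypothetical `ToyPythonFile` standing in for `pa.PythonFile`:

```python
import io

class ToyPythonFile:
    """Hypothetical stand-in for pa.PythonFile: wraps an open, file-like
    object and expects it to expose a .closed attribute."""

    def __init__(self, handle):
        if not hasattr(handle, 'closed'):
            raise TypeError(
                "expected an open file-like object, got %r" % type(handle).__name__)
        self.handle = handle

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.handle.close()

    def write(self, data):
        self.handle.write(data)

# Passing a path string fails, mirroring the AttributeError above:
try:
    ToyPythonFile('deleteme.jnk')
except TypeError as e:
    print(e)

# Passing an open file-like object works:
buf = io.BytesIO()
with ToyPythonFile(buf) as f:
    f.write(b'hello')
    data = buf.getvalue()
print(data)  # b'hello'
```

If that reading is right, the user-side workaround is to pass an open file object (e.g. `open('deleteme.jnk', 'wb')` or an `io.BytesIO`) rather than a path string.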
[jira] [Created] (ARROW-11938) [R] Enable R build process to find locally built C++ library on Windows
Ian Cook created ARROW-11938:
-----------------------------

Summary: [R] Enable R build process to find locally built C++ library on Windows
Key: ARROW-11938
URL: https://issues.apache.org/jira/browse/ARROW-11938
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Ian Cook

Currently, {{configure.win}} and {{tools/winlibs.R}} have two ways of finding the Arrow C++ library:
# If {{RWINLIB_LOCAL}} is set, it gets the library from that zip file
# If not, it downloads it

Enable and document a third option for the case where the C++ library has been built locally. This will enable R package developers using Windows machines to make changes to code in the C++ library, build and install it, and then build the R package against it.
[jira] [Created] (ARROW-11937) [C++] GZip codec hangs if flushed twice
David Li created ARROW-11937:
-----------------------------

Summary: [C++] GZip codec hangs if flushed twice
Key: ARROW-11937
URL: https://issues.apache.org/jira/browse/ARROW-11937
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li
Fix For: 4.0.0

{code:java}
// "If deflate returns with avail_out == 0, this function must be called
// again with the same value of the flush parameter and more output space
// (updated avail_out), until the flush is complete (deflate returns
// with non-zero avail_out)."
return FlushResult{bytes_written, (bytes_written == 0)};
{code}

But contrary to the comment, we're checking bytes_written instead of avail_out. So if we flush twice, the second time we won't write any bytes, but we'll erroneously interpret that as zlib asking for a larger buffer rather than zlib telling us there's no more data to flush. Then we'll enter a loop where we keep doubling the buffer size forever, hanging the program.
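The distinction can be sketched in Python with a hypothetical `flush_result` helper (names invented for illustration): the "call flush again" decision should come from whether deflate ran out of output space (`avail_out == 0`), not from whether any bytes were written:

```python
def flush_result(bytes_written, avail_out):
    """Compare the buggy and fixed retry conditions for a flush call."""
    # Buggy: interprets "no bytes written" as "need a larger buffer".
    buggy_should_retry = (bytes_written == 0)
    # Fixed, per the zlib comment quoted above: only retry while deflate
    # left no output space, i.e. avail_out == 0.
    fixed_should_retry = (avail_out == 0)
    return buggy_should_retry, fixed_should_retry

# Second flush: nothing left to emit, plenty of output space remains.
buggy, fixed = flush_result(bytes_written=0, avail_out=4096)
print(buggy)  # True  -> caller doubles the buffer and loops forever
print(fixed)  # False -> flush is complete, loop terminates
```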
[jira] [Created] (ARROW-11936) Rust/Java incorrect serialization of Struct wrapped Int8Dictionary
Justin created ARROW-11936:
---------------------------

Summary: Rust/Java incorrect serialization of Struct wrapped Int8Dictionary
Key: ARROW-11936
URL: https://issues.apache.org/jira/browse/ARROW-11936
Project: Apache Arrow
Issue Type: Bug
Components: Java, Rust
Affects Versions: 3.0.0
Reporter: Justin

Using Rust, I serialized data to a file with the schema

{code:java}
Field { name: "val", data_type: Struct([Field { name: "val", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }
{code}

Using a Java client to read the serialized data results in a schema of

{code:java}
Schema not null>
{code}

whilst calling ArrowFileReader.loadNextBatch() results in

{code:java}
Exception in thread "main" java.util.NoSuchElementException
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:1000)
	at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:81)
	at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:99)
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
	at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:153)
{code}
[jira] [Created] (ARROW-11935) [C++] Add push generator
Antoine Pitrou created ARROW-11935:
-----------------------------------

Summary: [C++] Add push generator
Key: ARROW-11935
URL: https://issues.apache.org/jira/browse/ARROW-11935
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Sometimes a producer of values just wants to queue futures and let a consumer pop them iteratively.
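A hypothetical Python sketch of the idea (not the Arrow C++ API; `PushGenerator` and its methods are invented here): the producer pushes futures onto a queue and the consumer pops and resolves them in order:

```python
from concurrent.futures import Future
from queue import Queue

class PushGenerator:
    """Hypothetical push generator: producer side pushes futures,
    consumer side iterates over their results."""

    def __init__(self):
        self._queue = Queue()

    def producer_push(self, value):
        # Wrap each value in an already-completed future; a real
        # implementation could also queue pending futures.
        fut = Future()
        fut.set_result(value)
        self._queue.put(fut)

    def close(self):
        self._queue.put(None)  # end-of-stream marker

    def __iter__(self):
        while True:
            fut = self._queue.get()
            if fut is None:
                return
            yield fut.result()

gen = PushGenerator()
for v in (1, 2, 3):
    gen.producer_push(v)
gen.close()
values = list(gen)
print(values)  # [1, 2, 3]
```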
[jira] [Created] (ARROW-11934) [Rust] Document patch release process
Andy Grove created ARROW-11934:
-------------------------------

Summary: [Rust] Document patch release process
Key: ARROW-11934
URL: https://issues.apache.org/jira/browse/ARROW-11934
Project: Apache Arrow
Issue Type: Task
Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 3.0.1

Now that we have moved to voting on source releases for patch releases, we need to document the process for doing so in the Rust implementation.

Google doc for discussion / collaboration: https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing
[jira] [Created] (ARROW-11933) [Developer] Provide a dashboard for improved Pull Request management
Ben Kietzman created ARROW-11933:
---------------------------------

Summary: [Developer] Provide a dashboard for improved Pull Request management
Key: ARROW-11933
URL: https://issues.apache.org/jira/browse/ARROW-11933
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Affects Versions: 3.0.0
Reporter: Ben Kietzman

The [spark PR dashboard|https://github.com/databricks/spark-pr-dashboard] (instance at http://spark-prs.appspot.com/) provides a useful view of pull requests. Information is retrieved from the GitHub API and persisted to a database for analyses, including classification of pull requests based on which files they modify. The added context provides greater visibility of PRs to the committers interested in reviewing/merging them.
[jira] [Created] (ARROW-11932) [C++] Provide ArrayBuilder::AppendScalar
Ben Kietzman created ARROW-11932:
---------------------------------

Summary: [C++] Provide ArrayBuilder::AppendScalar
Key: ARROW-11932
URL: https://issues.apache.org/jira/browse/ARROW-11932
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 3.0.0
Reporter: Ben Kietzman
Fix For: 5.0.0

It would be useful to be able to append a Scalar (and/or ScalarVector) to an ArrayBuilder. For example, in https://github.com/apache/arrow/pull/9621#discussion_r587461083 (ARROW-11591) this could be used to accumulate an array of expected grouped aggregation results using existing scalar aggregate kernels.
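A stdlib-only Python sketch of the requested behavior (`ToyInt64Builder` is invented for illustration; the real API would live on the C++ `ArrayBuilder`): appending scalar values one at a time, including null scalars, and finishing into an array:

```python
class ToyInt64Builder:
    """Hypothetical builder accepting typed scalars, including nulls."""

    def __init__(self):
        self._values = []
        self._valid = []

    def append_scalar(self, scalar):
        # scalar is either an int or None (standing in for a null Scalar)
        self._valid.append(scalar is not None)
        self._values.append(0 if scalar is None else scalar)

    def finish(self):
        # Materialize the accumulated values, restoring nulls
        return [v if ok else None for v, ok in zip(self._values, self._valid)]

b = ToyInt64Builder()
for s in (1, None, 3):
    b.append_scalar(s)
print(b.finish())  # [1, None, 3]
```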
[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.
jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796777534

Agreed. I'll make the changes and get back to you.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.
pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796768922

Indeed, the JSON format doesn't support it, so that will be a problem if we want to do roundtripping tests with the integration machinery.

However, I think we can still use the "golden files" part of integration testing, because there the logic for each implementation is (see [here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/json_integration_test.cc#L225-L234) for the C++ implementation):

* read the JSON file and convert it into a series of record batches
* read the Arrow file and decode it into a series of record batches
* compare respective record batches for equality

Comparing for equality doesn't care if the dictionaries are shared, so this should be ok for testing the ability to read IPC files with shared dictionaries.
[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.
jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796737078

Ah, thanks, I wasn't aware of the Archery integration suite. Had a quick glance, and it seems to make sense. I was a bit worried it would require support for shared dictionaries in all languages, but it seems easy to disable languages per folder etc.

One thing I noticed is that the JSON format doesn't (appear to) support dictionary restatement, i.e. schema -> dict_batch[id=1] -> batch -> dict_batch[id=1] -> batch -> ..., as we have in the streaming format and as I'm currently explicitly testing in the bespoke tests.
[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.
pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796719583

@jmgpeeters It seems these should go into the "golden files" used for integration testing, see https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration

Integration testing is documented here: https://arrow.apache.org/docs/format/Integration.html

The integration testing machinery is maintained here: https://github.com/apache/arrow/tree/master/dev/archery/archery/integration

Don't hesitate to ask questions if you have trouble navigating this.