[jira] [Created] (ARROW-16117) [JS] Improve UTF8 decoding performance
Howard Zuo created ARROW-16117: Summary: [JS] Improve UTF8 decoding performance Key: ARROW-16117 URL: https://issues.apache.org/jira/browse/ARROW-16117 Project: Apache Arrow Issue Type: Improvement Environment: MacOS, Chrome, Safari Reporter: Howard Zuo

While profiling in-browser decoding of TPC-H Customer and Part, datasets with a lot of UTF8 strings, it turned out that much of the time was being spent in {{getVariableWidthBytes}} rather than in {{TextDecoder}} itself. Ideally all the time should be spent in {{TextDecoder}}. On Chrome {{getVariableWidthBytes}} took up to about 15% of the e2e decoding latency, and on Safari it was close to 40% (Safari's TextDecoder is much faster than Chrome's, so this took up relatively more time).

Doing the bounds check explicitly instead of implicitly basically eliminates this function from the profile. This is likely because the code in this PR is more amenable to V8/JSC's JIT, since {{x}} and {{y}} are now guaranteed to be SMIs ("small integers") instead of Object, allowing the JIT to emit efficient machine instructions that only deal in 32-bit integers. Once V8 discovers that {{x}} and {{y}} can potentially be null (upon iterating past the bounds), it "poisons" the codepath forever, since it has to deal with the null case. See this V8 post for a more in-depth explanation (in particular the examples underneath "Performance tips"): [https://v8.dev/blog/elements-kinds]

Empirically, on my machine decoding TPC-H Part dropped from 1.9s to 1.7s on Chrome, and Customer dropped from 1.4s to 1.2s.

[https://github.com/apache/arrow/pull/12793]

-- This message was sent by Atlassian Jira (v8.20.1#820001)
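The difference between the implicit and the explicit bounds check described in ARROW-16117 can be sketched roughly as follows. This is a minimal illustration, not the code from the linked PR: the names {{sliceImplicit}}, {{sliceExplicit}} and {{decodeAt}} and the sample data are assumptions made up for this example, and no performance numbers are implied.

```typescript
// Hypothetical stand-ins for the helper named in the issue; not the Arrow source.

// Implicit check: reading past the end of a typed array yields `undefined`,
// so x and y may be non-integers and the engine has to handle that case.
function sliceImplicit(values: Uint8Array, offsets: Int32Array, index: number): Uint8Array | null {
  const x: number | undefined = offsets[index];
  const y: number | undefined = offsets[index + 1];
  return x != null && y != null ? values.subarray(x, y) : null;
}

// Explicit check: compare the index against the length first, so x and y
// are only ever read in bounds and are always 32-bit integers.
function sliceExplicit(values: Uint8Array, offsets: Int32Array, index: number): Uint8Array | null {
  if (index < 0 || index + 1 >= offsets.length) {
    return null;
  }
  const x = offsets[index];
  const y = offsets[index + 1];
  return values.subarray(x, y);
}

// Sample data (assumed): three UTF8 strings packed back to back.
const decoder = new TextDecoder("utf-8");
const values = new TextEncoder().encode("foobarbaz");
const offsets = Int32Array.from([0, 3, 6, 9]);

// Decode the string at `index`, or return null when the index is out of range.
function decodeAt(index: number): string | null {
  const bytes = sliceExplicit(values, offsets, index);
  return bytes === null ? null : decoder.decode(bytes);
}
```

Both helpers return the same slices for valid indices and null past the end; the only difference is whether the out-of-range case is detected before or after reading from the offsets array.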
[jira] [Created] (ARROW-16116) [C++] Properly handle non-nullable fields in Parquet reading
David Li created ARROW-16116: Summary: [C++] Properly handle non-nullable fields in Parquet reading Key: ARROW-16116 URL: https://issues.apache.org/jira/browse/ARROW-16116 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li ARROW-15961 found that the Parquet Arrow reader wasn't respecting the nullability of fields. We need to ensure that when we reconstruct an array for a non-nullable field, it has no validity bitmap. We also need to add tests for this case: it is implicitly tested in a few places, but we should test it explicitly for all supported types.
[jira] [Created] (ARROW-16115) [C++] ScannerBuilder::Filter returns an error when given an augmented field
Weston Pace created ARROW-16115: Summary: [C++] ScannerBuilder::Filter returns an error when given an augmented field Key: ARROW-16115 URL: https://issues.apache.org/jira/browse/ARROW-16115 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Similar to {{ScannerBuilder::Project}}, we should consider augmented fields as viable options for filtering.
[jira] [Created] (ARROW-16114) [Python] Document parquet.FileMetadata and statistics
Will Jones created ARROW-16114: Summary: [Python] Document parquet.FileMetadata and statistics Key: ARROW-16114 URL: https://issues.apache.org/jira/browse/ARROW-16114 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 {{FileMetaData}} in the parquet module (returned by {{ParquetFile.metadata}}) isn't in the API docs. We should add it to the API docs so users know which fields are available.
[jira] [Created] (ARROW-16113) [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded
Joris Van den Bossche created ARROW-16113: Summary: [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded Key: ARROW-16113 URL: https://issues.apache.org/jira/browse/ARROW-16113 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-14612, see the discussion at [https://github.com/apache/arrow/pull/12530#discussion_r841760449]. ARROW-14612 changes the return value of the {{dictionaries}} attribute from None to a list in case some of the partitioning schema fields are not dictionary encoded. But this can result in an unclear mapping between arrays in {{Partitioning.dictionaries}} and fields in {{Partitioning.schema}}.
[jira] [Created] (ARROW-16112) [C++] Allow reordering fields of a StructArray via casting
David Li created ARROW-16112: Summary: [C++] Allow reordering fields of a StructArray via casting Key: ARROW-16112 URL: https://issues.apache.org/jira/browse/ARROW-16112 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li Follow-up to ARROW-15643 and possibly required for full handling of nested field refs in scanning. We may need to add a cast option to allow this since this can cause ambiguities.
[jira] [Created] (ARROW-16111) [C++][FlightRPC] Migrate SQL Client API to Result<>
Tobias Zagorni created ARROW-16111: Summary: [C++][FlightRPC] Migrate SQL Client API to Result<> Key: ARROW-16111 URL: https://issues.apache.org/jira/browse/ARROW-16111 Project: Apache Arrow Issue Type: Sub-task Components: C++, FlightRPC Reporter: Tobias Zagorni Convert this API too, as suggested here: [https://github.com/apache/arrow/pull/12719#discussion_r839570822]
[jira] [Created] (ARROW-16110) [C++] GcsFileSystem::Make ignores IOContext
Rok Mihevc created ARROW-16110: Summary: [C++] GcsFileSystem::Make ignores IOContext Key: ARROW-16110 URL: https://issues.apache.org/jira/browse/ARROW-16110 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Rok Mihevc The passed IO context is ignored and the default context is used. See the current function:
{code:cpp}
std::shared_ptr<GcsFileSystem> GcsFileSystem::Make(const GcsOptions& options,
                                                   const io::IOContext& context) {
  // Cannot use `std::make_shared<>` as the constructor is private.
  return std::shared_ptr<GcsFileSystem>(
      new GcsFileSystem(options, io::default_io_context()));
}
{code}
[jira] [Created] (ARROW-16109) python/pyarrow/tests/parquet/test_dataset.py::test_read_table_schema requires dataset mark
Raúl Cumplido created ARROW-16109: Summary: python/pyarrow/tests/parquet/test_dataset.py::test_read_table_schema requires dataset mark Key: ARROW-16109 URL: https://issues.apache.org/jira/browse/ARROW-16109 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Raúl Cumplido Fix For: 8.0.0 Following the contributing guidelines for the first time, I did not use the `-DARROW_DATASET=On` flag, as it does not appear in the documentation guidelines. A test failed when running the tests because the dataset module was not found: {code:java} ModuleNotFoundError: No module named 'pyarrow._dataset' {code} My expectation is that this test should have been skipped, as it requires dataset.
[jira] [Created] (ARROW-16108) [Gandiva][C++] Fix castINTERVALDAY and castINTERVALYEAR
Johnnathan Rodrigo Pego de Almeida created ARROW-16108: Summary: [Gandiva][C++] Fix castINTERVALDAY and castINTERVALYEAR Key: ARROW-16108 URL: https://issues.apache.org/jira/browse/ARROW-16108 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Johnnathan Rodrigo Pego de Almeida Fix an error where LLVM did not find these two functions. Fix the regex to allow negative values for Interval Day and Interval Year.
[jira] [Created] (ARROW-16107) [CI][Archery] Fix archery crossbow query to get latest prefix
Joris Van den Bossche created ARROW-16107: Summary: [CI][Archery] Fix archery crossbow query to get latest prefix Key: ARROW-16107 URL: https://issues.apache.org/jira/browse/ARROW-16107 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Developer Tools Reporter: Joris Van den Bossche This feature stopped working when the crossbow builds were split into 3 parts.
[jira] [Created] (ARROW-16106) [R] Support for filename-based partitioning
Nicola Crane created ARROW-16106: Summary: [R] Support for filename-based partitioning Key: ARROW-16106 URL: https://issues.apache.org/jira/browse/ARROW-16106 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane This was added in ARROW-14612 and now needs implementing in R
[jira] [Created] (ARROW-16105) [C++][Gandiva] Add support for LLVM 14
Kouhei Sutou created ARROW-16105: Summary: [C++][Gandiva] Add support for LLVM 14 Key: ARROW-16105 URL: https://issues.apache.org/jira/browse/ARROW-16105 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Kouhei Sutou Assignee: Kouhei Sutou Ubuntu 22.04 ships LLVM 14.
[jira] [Created] (ARROW-16104) [Packaging] Add support for Ubuntu 22.04
Kouhei Sutou created ARROW-16104: Summary: [Packaging] Add support for Ubuntu 22.04 Key: ARROW-16104 URL: https://issues.apache.org/jira/browse/ARROW-16104 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-16103) [R] arrow::create_package_with_all_dependencies() fails to download third party dependencies
Daniel Paierl created ARROW-16103: Summary: [R] arrow::create_package_with_all_dependencies() fails to download third party dependencies Key: ARROW-16103 URL: https://issues.apache.org/jira/browse/ARROW-16103 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 7.0.0 Environment: Windows 10 Ubuntu Focal OpenSuse ESP 15SP2 Reporter: Daniel Paierl

Hello, I'm in the unfortunate position that I need to get the arrow package onto a company R server without access to the web.

h2. Main Problem

`arrow::create_package_with_all_dependencies` from the R arrow package (7.0) fails to download the third party dependencies. This happens irrespective of OS (Windows, Ubuntu Focal, OpenSuse EPS 15 SP2) on company and private machines (several...). I suspect an issue with the function or the underlying shell script that downloads these third party dependencies. Similar to this Stack Overflow thread: [https://stackoverflow.com/questions/70044518/how-to-install-c-dependencies-for-the-arrow-package]

{code:java}
arrow::create_package_with_all_dependencies()
Downloading Arrow source file
trying URL 'https://cran.rstudio.com/src/contrib/arrow_7.0.0.tar.gz'
Content type 'application/x-gzip' length 4553836 bytes (4.3 MB)
downloaded 4.3 MB
Downloading files to C:\Users\\AppData\Local\Temp\4\Rtmp0srhMl\file1475c6909878/arrow/tools/thirdparty_dependencies
Error in arrow::create_package_with_all_dependencies() : Failed to download thirdparty dependencies
{code}

PS: Gee, I wish Jira had simple Markdown support.