[jira] [Created] (ARROW-16117) [JS] Improve UTF8 decoding performance

2022-04-04 Thread Howard Zuo (Jira)
Howard Zuo created ARROW-16117:
--

 Summary: [JS] Improve UTF8 decoding performance
 Key: ARROW-16117
 URL: https://issues.apache.org/jira/browse/ARROW-16117
 Project: Apache Arrow
  Issue Type: Improvement
 Environment: MacOS, Chrome, Safari

Reporter: Howard Zuo


While profiling in-browser decoding of the TPC-H Customer and Part datasets, 
which contain many UTF8 strings, it turned out that much of the time 
was being spent in {{getVariableWidthBytes}} rather than in {{TextDecoder}} 
itself. Ideally all the time should be spent in {{TextDecoder}}.

On Chrome {{getVariableWidthBytes}} took up to ~15% of the e2e decoding 
latency, and on Safari it was close to ~40% (Safari's TextDecoder is much 
faster than Chrome's, so this took up relatively more time).

The code in this PR is likely faster because it is more amenable to V8/JSC's 
JIT: {{x}} and {{y}} are now guaranteed to be SMIs ("small integers") instead 
of Object, allowing the JIT to emit efficient machine instructions that only 
deal in 32-bit integers. Once V8 discovers that {{x}} and {{y}} can 
potentially be null (upon iterating past the bounds), it "poisons" the codepath 
forever, since it has to deal with the null case.

See this V8 post for a more in-depth explanation (in particular see the 
examples underneath "Performance tips"):
[https://v8.dev/blog/elements-kinds]

Doing the bounds check explicitly instead of implicitly basically eliminates 
this function from the profile. Empirically, on my machine, decoding TPC-H 
Part dropped from 1.9s to 1.7s on Chrome, and Customer dropped from 1.4s to 
1.2s.

[https://github.com/apache/arrow/pull/12793]
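
A minimal sketch of the implicit versus explicit bounds check described above. The function names {{sliceImplicit}} and {{sliceExplicit}} and the sample data are hypothetical and only illustrate the idea; they are not copied from the Arrow JS source.

```typescript
// Illustrative sketch only: function names and data are hypothetical,
// not copied from the Arrow JS source.

// Implicit bounds check: an out-of-range read on a typed array yields
// `undefined`, so `x` and `y` are not always integers and the null-ish
// case has to be handled after the reads.
function sliceImplicit(values: Uint8Array, offsets: Int32Array, index: number): Uint8Array | null {
  const x: number | undefined = offsets[index];
  const y: number | undefined = offsets[index + 1];
  if (x == null || y == null) {
    return null;
  }
  return values.subarray(x, y);
}

// Explicit bounds check: the range is validated first, so every read
// below is in bounds and `x` and `y` are always 32-bit integers.
function sliceExplicit(values: Uint8Array, offsets: Int32Array, index: number): Uint8Array | null {
  if (index < 0 || index + 1 >= offsets.length) {
    return null;
  }
  const x = offsets[index];
  const y = offsets[index + 1];
  return values.subarray(x, y);
}

// Two UTF-8 strings, "foo" and "bar", stored back to back.
const exampleValues = new Uint8Array([102, 111, 111, 98, 97, 114]);
const exampleOffsets = new Int32Array([0, 3, 6]);

const secondSlice = sliceExplicit(exampleValues, exampleOffsets, 1); // bytes of "bar"
const pastEndSlice = sliceExplicit(exampleValues, exampleOffsets, 2); // null
```

Both versions return the same results for these inputs. Whether the explicit form is actually faster depends on the engine, so it should be confirmed by profiling, as in the measurements above.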

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16116) [C++] Properly handle non-nullable fields in Parquet reading

2022-04-04 Thread David Li (Jira)
David Li created ARROW-16116:


 Summary: [C++] Properly handle non-nullable fields in Parquet 
reading
 Key: ARROW-16116
 URL: https://issues.apache.org/jira/browse/ARROW-16116
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


ARROW-15961 found that the Parquet Arrow reader wasn't respecting the nullable 
aspect of fields. We need to ensure that when we reconstruct an array for a 
non-nullable field, it has no validity bitmap. We also need to add tests 
for this case: it is implicitly tested in a few places, but we should 
explicitly test it for all supported types.





[jira] [Created] (ARROW-16115) [C++] ScannerBuilder::Filter returns an error when given an augmented field

2022-04-04 Thread Weston Pace (Jira)
Weston Pace created ARROW-16115:
---

 Summary: [C++] ScannerBuilder::Filter returns an error when given 
an augmented field
 Key: ARROW-16115
 URL: https://issues.apache.org/jira/browse/ARROW-16115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Similar to {{ScannerBuilder::Project}}, we should consider augmented fields as 
viable options for filtering.





[jira] [Created] (ARROW-16114) [Python] Document parquet.FileMetadata and statistics

2022-04-04 Thread Will Jones (Jira)
Will Jones created ARROW-16114:
--

 Summary: [Python] Document parquet.FileMetadata and statistics
 Key: ARROW-16114
 URL: https://issues.apache.org/jira/browse/ARROW-16114
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


{{FileMetaData}} in the parquet module (returned by {{ParquetFile.metadata}}) 
isn't in the API docs. We should add it to the API docs so users know what 
fields are available.





[jira] [Created] (ARROW-16113) [Python] Partitioning.dictionaries in case of a subset of fields are dictionary encoded

2022-04-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16113:
-

 Summary: [Python] Partitioning.dictionaries in case of a subset of 
fields are dictionary encoded
 Key: ARROW-16113
 URL: https://issues.apache.org/jira/browse/ARROW-16113
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-14612, see the discussion at 
https://github.com/apache/arrow/pull/12530#discussion_r841760449

ARROW-14612 changes the return value of the {{dictionaries}} attribute from 
None to a list when some of the partitioning schema fields are not 
dictionary encoded.

But this can result in an unclear mapping between arrays in 
{{Partitioning.dictionaries}} and fields in {{Partitioning.schema}}.





[jira] [Created] (ARROW-16112) [C++] Allow reordering fields of a StructArray via casting

2022-04-04 Thread David Li (Jira)
David Li created ARROW-16112:


 Summary: [C++] Allow reordering fields of a StructArray via casting
 Key: ARROW-16112
 URL: https://issues.apache.org/jira/browse/ARROW-16112
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


Follow-up to ARROW-15643 and possibly required for full handling of nested 
field refs in scanning. We may need to add a cast option to allow this since 
this can cause ambiguities.





[jira] [Created] (ARROW-16111) [C++][FlightRPC] Migrate SQL Client API to Result<>

2022-04-04 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-16111:
--

 Summary: [C++][FlightRPC] Migrate SQL Client API to Result<>
 Key: ARROW-16111
 URL: https://issues.apache.org/jira/browse/ARROW-16111
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++, FlightRPC
Reporter: Tobias Zagorni


Convert this API too, as suggested here: 
[https://github.com/apache/arrow/pull/12719#discussion_r839570822]





[jira] [Created] (ARROW-16110) [C++] GcsFileSystem::Make ignores IOContext

2022-04-04 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16110:
--

 Summary: [C++] GcsFileSystem::Make ignores IOContext
 Key: ARROW-16110
 URL: https://issues.apache.org/jira/browse/ARROW-16110
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Rok Mihevc


The passed IO context is ignored and the default context is used instead. See the current function:

{code:cpp}
std::shared_ptr<GcsFileSystem> GcsFileSystem::Make(const GcsOptions& options,
                                                   const io::IOContext& context) {
  // Cannot use `std::make_shared<>` as the constructor is private.
  return std::shared_ptr<GcsFileSystem>(
      new GcsFileSystem(options, io::default_io_context()));
}
{code}





[jira] [Created] (ARROW-16109) python/pyarrow/tests/parquet/test_dataset.py::test_read_table_schema requires dataset mark

2022-04-04 Thread Jira
Raúl Cumplido created ARROW-16109:
-

 Summary: 
python/pyarrow/tests/parquet/test_dataset.py::test_read_table_schema requires 
dataset mark
 Key: ARROW-16109
 URL: https://issues.apache.org/jira/browse/ARROW-16109
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Raúl Cumplido
 Fix For: 8.0.0


Following the contributing guidelines for the first time, I did not use the 
`-DARROW_DATASET=On` flag, as it does not appear in the documentation guidelines.

There was a test failure when running tests because the dataset module was not 
found:
{code:java}
ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
My expectation is that this test should have been skipped as it requires 
dataset.





[jira] [Created] (ARROW-16108) [Gandiva][C++] Fix castINTERVALDAY and castINTERVALYEAR

2022-04-04 Thread Johnnathan Rodrigo Pego de Almeida (Jira)
Johnnathan Rodrigo Pego de Almeida created ARROW-16108:
--

 Summary: [Gandiva][C++] Fix castINTERVALDAY and castINTERVALYEAR
 Key: ARROW-16108
 URL: https://issues.apache.org/jira/browse/ARROW-16108
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Johnnathan Rodrigo Pego de Almeida


Fix an error where LLVM did not find these two functions.

Fix the regex to allow negative values for Interval Day and Interval Year.





[jira] [Created] (ARROW-16107) [CI][Archery] Fix archery crossbow query to get latest prefix

2022-04-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16107:
-

 Summary: [CI][Archery] Fix archery crossbow query to get latest 
prefix
 Key: ARROW-16107
 URL: https://issues.apache.org/jira/browse/ARROW-16107
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Developer Tools
Reporter: Joris Van den Bossche


This feature stopped working when the crossbow builds were split into 3 parts.





[jira] [Created] (ARROW-16106) [R] Support for filename-based partitioning

2022-04-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16106:


 Summary: [R] Support for filename-based partitioning
 Key: ARROW-16106
 URL: https://issues.apache.org/jira/browse/ARROW-16106
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


This was added in ARROW-14612 and now needs to be implemented in R.





[jira] [Created] (ARROW-16105) [C++][Gandiva] Add support for LLVM 14

2022-04-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16105:


 Summary: [C++][Gandiva] Add support for LLVM 14
 Key: ARROW-16105
 URL: https://issues.apache.org/jira/browse/ARROW-16105
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Ubuntu 22.04 ships LLVM 14.





[jira] [Created] (ARROW-16104) [Packaging] Add support for Ubuntu 22.04

2022-04-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16104:


 Summary: [Packaging] Add support for Ubuntu 22.04
 Key: ARROW-16104
 URL: https://issues.apache.org/jira/browse/ARROW-16104
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-16103) [R] arrow::create_package_with_all_dependencies() fails to download third party dependencies

2022-04-04 Thread Daniel Paierl (Jira)
Daniel Paierl created ARROW-16103:
-

 Summary: [R] arrow::create_package_with_all_dependencies() fails 
to download third party dependencies
 Key: ARROW-16103
 URL: https://issues.apache.org/jira/browse/ARROW-16103
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 7.0.0
 Environment: Windows 10
Ubuntu Focal
OpenSuse ESP 15SP2
Reporter: Daniel Paierl


Hello, I'm in the unfortunate position of needing to get the arrow package 
onto a company R server without access to the web.
h2. Main Problem

`arrow::create_package_with_all_dependencies` from the R arrow package (7.0) 
fails to download the third party dependencies. This happens irrespective of OS 
(Windows, Ubuntu Focal, OpenSuse EPS 15 SP2) on company and private machines 
(several...).
I suspect an issue with the function or the underlying shell script that 
downloads these third party dependencies.

 

Similar to [this Stackexchange 
thread|https://stackoverflow.com/questions/70044518/how-to-install-c-dependencies-for-the-arrow-package].

 
{code:java}
arrow::create_package_with_all_dependencies()
Downloading Arrow source file
trying URL 'https://cran.rstudio.com/src/contrib/arrow_7.0.0.tar.gz'
Content type 'application/x-gzip' length 4553836 bytes (4.3 MB)
downloaded 4.3 MB
Downloading files to 
C:\Users\\AppData\Local\Temp\4\Rtmp0srhMl\file1475c6909878/arrow/tools/thirdparty_dependencies
Error in arrow::create_package_with_all_dependencies() : 
  Failed to download thirdparty dependencies
{code}
 

PS: Gee, I wish Jira had simple MD support.


