[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields
Will Jones created ARROW-17923: -- Summary: [C++] Consider dictionary arrays for special fragment fields Key: ARROW-17923 URL: https://issues.apache.org/jira/browse/ARROW-17923 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones I noticed in ARROW-15281 we made {{__filename}} a string column. In common cases, this will be inefficient if materialized. If possible, it may be better to have them be dictionary arrays. As an example, [here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a user report of 10x increased memory usage caused by accidentally including these special fragment columns. -- This message was sent by Atlassian Jira (v8.20.10#820010)
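For illustration, a minimal pyarrow sketch of the memory difference dictionary encoding can make for a column like {{__filename}}, where a few distinct paths repeat many times (the column values and sizes below are illustrative, not taken from the linked report):
{code:python}
import pyarrow as pa

# A fragment path column typically repeats a handful of values many times.
filenames = pa.array(["part-0000.parquet"] * 100_000 + ["part-0001.parquet"] * 100_000)

plain_size = filenames.nbytes                      # offsets + full string data for every row
dict_size = filenames.dictionary_encode().nbytes   # int32 indices + two unique strings

print(plain_size, dict_size)
{code}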
[jira] [Created] (ARROW-17922) AWS SSO for R package
Brian Broderick created ARROW-17922: --- Summary: AWS SSO for R package Key: ARROW-17922 URL: https://issues.apache.org/jira/browse/ARROW-17922 Project: Apache Arrow Issue Type: Improvement Components: R Environment: Ubuntu 20.04 Reporter: Brian Broderick We'd like to use the arrow library to access datasets in AWS S3, but we are required to use SSO credentials, which don't work with the arrow R package. There is a similar issue for Python, but we wanted to flag that this is also an issue with R. -- This message was sent by Atlassian Jira (v8.20.10#820010)
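For context, a hedged sketch of the kind of workaround users fall back to today, shown in Python since analogous arguments exist on both {{pyarrow.fs.S3FileSystem}} and the R {{S3FileSystem$create()}}: resolve the SSO profile through boto3 and hand static credentials to Arrow explicitly (profile name, bucket and region here are placeholders):
{code:python}
import boto3
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Resolve temporary credentials from an SSO-enabled profile via boto3,
# then pass them to Arrow explicitly, since Arrow's S3 filesystem does
# not resolve SSO profiles on its own.
creds = (
    boto3.Session(profile_name="my-sso-profile")
    .get_credentials()
    .get_frozen_credentials()
)

s3 = fs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region="us-east-1",
)

dataset = ds.dataset("my-bucket/path/to/dataset", format="parquet", filesystem=s3)
{code}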
[jira] [Created] (ARROW-17921) [Java][Doc] Define next steps for Arrow Tools module
David Dali Susanibar Arce created ARROW-17921: - Summary: [Java][Doc] Define next steps for Arrow Tools module Key: ARROW-17921 URL: https://issues.apache.org/jira/browse/ARROW-17921 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Java Reporter: David Dali Susanibar Arce There are utilities such as [GenerateSampleData.java|https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/GenerateSampleData.java], for example, that need to be reviewed to determine how they relate to the Arrow Tools module, and a common pattern should be defined for future utilities that are planned. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17920) Built-in GRPC health checks in FlightServerBase
Akshaya Annavajhala (AK) created ARROW-17920: Summary: Built-in GRPC health checks in FlightServerBase Key: ARROW-17920 URL: https://issues.apache.org/jira/browse/ARROW-17920 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC Affects Versions: 9.0.0 Reporter: Akshaya Annavajhala (AK) Related: ARROW-14440 ([C++][FlightRPC] Add example of registering gRPC service on a Flight server). Part of that issue notes that it is currently impossible in Python for a server implementing FlightServerBase to be extended with other gRPC services. While I haven't verified that claim, it seems to be valid. Complete composability of an arbitrary gRPC service and a Flight server in Python might not be a goal in itself, but rather one possible way to implement a useful production Flight server that responds appropriately to [k8s-style health probes|https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command]. A concrete ask, perhaps more idiomatic to FlightServerBase, is overridable default implementations of [gRPC health check messages|https://github.com/grpc/grpc/blob/master/doc/health-checking.md#grpc-health-checking-protocol], which would allow "simple" Python derivations of FlightServerBase to be served in production environments (including Kubernetes, which is adding support for the generic gRPC health checking protocol). -- This message was sent by Atlassian Jira (v8.20.10#820010)
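For reference, this is roughly what the standard health-checking protocol looks like on a plain {{grpc.Server}} in Python, using the {{grpcio-health-checking}} package rather than any pyarrow API; the ask above is for FlightServerBase to provide an overridable equivalent, since attaching extra services like this to a FlightServerBase is what the issue describes as currently impossible in Python:
{code:python}
from concurrent import futures

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

# Standard gRPC health service registered on a plain grpc.Server.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
health_servicer = health.HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

# Mark the overall server (empty service name) as serving so that
# k8s-style gRPC health probes succeed.
health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)

server.add_insecure_port("[::]:8080")
server.start()
server.wait_for_termination()
{code}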
[jira] [Created] (ARROW-17919) [Java] Potentially inefficient variable-width vector reallocation
Antoine Pitrou created ARROW-17919: -- Summary: [Java] Potentially inefficient variable-width vector reallocation Key: ARROW-17919 URL: https://issues.apache.org/jira/browse/ARROW-17919 Project: Apache Arrow Issue Type: Wish Components: Java Reporter: Antoine Pitrou In several places in the Java codebase you can see this kind of pattern:
{code:java}
while (vector.getDataBuffer().capacity() < toCapacity) {
  vector.reallocDataBuffer();
}
{code}
In the event that a much larger capacity is requested, this will spuriously make several reallocations (doubling the capacity each time). It would probably be more efficient to reallocate directly to satisfy the desired capacity. Coincidentally, there's a {{reallocDataBuffer}} overload that seems to do just that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17918) [Python] ExtensionArray.__getitem__ is not called if called from StructArray
Rok Mihevc created ARROW-17918: -- Summary: [Python] ExtensionArray.__getitem__ is not called if called from StructArray Key: ARROW-17918 URL: https://issues.apache.org/jira/browse/ARROW-17918 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Rok Mihevc It seems that when getting a value from a StructScalar, extension information is lost. See:
{code:python}
import pyarrow as pa

class ExampleScalar(pa.ExtensionScalar):
    def as_py(self):
        print(f"ExampleScalar.as_py -> {self.value.as_py()}")
        return self.value.as_py()

class ExampleArray(pa.ExtensionArray):
    def __getitem__(self, item):
        return f"ExampleArray.__getitem__[{item}] -> {self.storage[item]}"

    def __arrow_ext_scalar_class__(self):
        return ExampleScalar

class ExampleType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.int64(), "ExampleExtensionType")

    def __arrow_ext_serialize__(self):
        return b""

    def __arrow_ext_class__(self):
        return ExampleArray

example_type = ExampleType()
arr = pa.array([1, 2, 3])
example_array = pa.ExtensionArray.from_storage(example_type, arr)
example_array2 = pa.StructArray.from_arrays([example_array, arr], ["a", "b"])

print("\nExample 1\n=")
print(example_array[0])
print(example_array.type)
print(type(example_array[0]))

print("\nExample 2\n=")
print(example_array2[0])
print(example_array2[0].type)
print(example_array2[0]["a"])
print(example_array2[0]["a"].type)
{code}
Returns:
{code:python}
Example 1
=
ExampleArray.__getitem__[0] -> 1
extension<ExampleExtensionType>
<class 'str'>

Example 2
=
[('a', 1), ('b', 1)]
struct<a: extension<ExampleExtensionType>, b: int64>
1
extension<ExampleExtensionType>
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17917) [C++] Add opaque device id identification to InputStream
Antoine Pitrou created ARROW-17917: -- Summary: [C++] Add opaque device id identification to InputStream Key: ARROW-17917 URL: https://issues.apache.org/jira/browse/ARROW-17917 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou For the purpose of collecting input/output statistics, it is important to know to which "device" these stats pertain, so as not to mix e.g. stats for a local NVMe drive, an NFS-attached drive, an S3 filesystem, or an in-memory buffer reader. I suggest adding this API to InputStream:
{code:c++}
/// \brief An opaque unique id for the device underlying this stream.
///
/// Any implementation is free to fill those bytes as it sees fit,
/// but it should be able to uniquely identify each "device"
/// (for example, a specific local drive, or a specific remote network
/// filesystem).
///
/// A suggested format is "<backend>:<device>" where "<backend>"
/// is a short string representing the backend kind
/// (for example "local", "s3"...) and "<device>" is a
/// backend-dependent string of bytes (for example a
/// `dev_t` for a POSIX local file).
///
/// This is not required to be printable nor human-readable,
/// and may contain NUL characters.
virtual std::string device_id() const = 0;
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17916) [Python] Allow disabling more components
Antoine Pitrou created ARROW-17916: -- Summary: [Python] Allow disabling more components Key: ARROW-17916 URL: https://issues.apache.org/jira/browse/ARROW-17916 Project: Apache Arrow Issue Type: Wish Components: Python Affects Versions: 9.0.0 Reporter: Antoine Pitrou Some users would like to build lightweight versions of PyArrow, for example for use in AWS Lambda or similar systems which constrain the total size of usable libraries. However, PyArrow currently mandates some Arrow C++ components which can lead to a very sizable Arrow binary install: Compute, CSV, Dataset, Filesystem, HDFS and JSON. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17915) [C++] Error when using Substrait ProjectRel
Dewey Dunnington created ARROW-17915: Summary: [C++] Error when using Substrait ProjectRel Key: ARROW-17915 URL: https://issues.apache.org/jira/browse/ARROW-17915 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Dewey Dunnington After ARROW-16989 and ARROW-15584, there is new behaviour with ProjectRel. I implemented a solution that worked with DuckDB's consumer, but when I try with Arrow's consumer I get an error:
{code:R}
library(arrow, warn.conflicts = FALSE)

plan_as_json <- '{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
    }
  ],
  "relations": [
    {
      "rel": {
        "project": {
          "common": {"emit": {"outputMapping": [3, 4]}},
          "input": {
            "read": {
              "baseSchema": {
                "names": ["int", "dbl"],
                "struct": {"types": [{"i32": {}}, {"fp64": {}}]}
              },
              "localFiles": {
                "items": [
                  {
                    "uriFile": "file://THIS_IS_THE_TEMP_FILE",
                    "parquet": {}
                  }
                ]
              }
            }
          },
          "expressions": [
            {"selection": {"directReference": {"structField": {"field": 1}}}},
            {"selection": {"directReference": {"structField": {"field": 0}}}}
          ]
        }
      }
    }
  ]
}'

temp_parquet <- tempfile()
write_parquet(data.frame(int = integer(), dbl = double()), temp_parquet)
plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)

arrow:::do_exec_plan_substrait(plan_as_json)
#> Error: Invalid: Invalid column index to add field.
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338  project_schema->AddField( num_columns + static_cast<int>(project.expressions().size()) - 1, std::move(project_field))
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/serde.cc:156  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
{code}
It's admittedly a goofy thing to do: to compute a new column that is an identical copy of an existing column and then discard the original. I can and should simplify the substrait that I'm generating, but maybe this is also valid substrait that should be accepted? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17914) [Java] Support reading a subset of fields from an IPC file or stream
David Li created ARROW-17914: Summary: [Java] Support reading a subset of fields from an IPC file or stream Key: ARROW-17914 URL: https://issues.apache.org/jira/browse/ARROW-17914 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: David Li C++ supports {{IpcReadOptions.included_fields}} which lets you load a subset of (top-level) fields from an IPC file or stream, potentially saving on I/O costs. It would be useful to support this in Java as well. Some refactoring would be required since MessageSerializer currently reads record batch messages in as a whole, and it would be good to quantify how much of a benefit this provides in different scenarios. -- This message was sent by Atlassian Jira (v8.20.10#820010)
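For comparison, the C++ option appears to be exposed through pyarrow in recent versions, which gives a feel for the behaviour a Java equivalent might mirror (file path and schema below are just an illustration, and passing {{options}} to {{pa.ipc.open_file}} assumes a reasonably recent pyarrow):
{code:python}
import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [0.1, 0.2, 0.3]})

# Write a small IPC file with three top-level fields.
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read back only fields 0 and 2; field 1 is skipped on load.
opts = pa.ipc.IpcReadOptions(included_fields=[0, 2])
with pa.OSFile("/tmp/example.arrow", "rb") as source:
    subset = pa.ipc.open_file(source, options=opts).read_all()

print(subset.schema)  # fields "a" and "c"
{code}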