[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields

2022-10-03 Thread Will Jones (Jira)
Will Jones created ARROW-17923:
--

 Summary: [C++] Consider dictionary arrays for special fragment 
fields
 Key: ARROW-17923
 URL: https://issues.apache.org/jira/browse/ARROW-17923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones


I noticed in ARROW-15281 we made {{__filename}} a string column. Since the value repeats for every row of a fragment, this will be inefficient if materialized. If possible, it may be better to make these special fields dictionary arrays.

As an example, 
[here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a 
user report of 10x increased memory usage caused by accidentally including 
these special fragment columns.
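
For illustration, here is a minimal PyArrow sketch (the file name and row count are made up) of how much smaller a dictionary-encoded column is when a single value repeats across a whole fragment:

{code:python}
import pyarrow as pa

# A hypothetical __filename-style column: one path repeated for every row.
filenames = pa.array(["part-0001.parquet"] * 1_000_000)
encoded = filenames.dictionary_encode()

# Offsets plus a million copies of the string, vs. int32 indices plus a
# one-entry dictionary: roughly 21 MB vs. 4 MB here.
print(filenames.nbytes, encoded.nbytes)
{code}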





[jira] [Created] (ARROW-17922) AWS SSO for R package

2022-10-03 Thread Brian Broderick (Jira)
Brian Broderick created ARROW-17922:
---

 Summary: AWS SSO for R package
 Key: ARROW-17922
 URL: https://issues.apache.org/jira/browse/ARROW-17922
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
 Environment: Ubuntu 20.04
Reporter: Brian Broderick


We'd like to use the arrow library to access datasets in AWS S3, but we are 
required to use SSO credentials, which don't work with the arrow R package. 
There is a similar issue for Python, but we wanted to flag that this is also 
an issue with R.
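
For reference, a minimal Python sketch of the reported failure mode (the bucket name is hypothetical): with only an SSO-backed profile configured, credential resolution fails even though the AWS CLI can access the bucket after {{aws sso login}}.

{code:python}
import pyarrow.dataset as ds

# Assumes AWS_PROFILE points at an SSO-backed profile (hypothetical bucket).
# Per the report, credential resolution fails here even though
# `aws s3 ls s3://example-bucket` succeeds after `aws sso login`.
dataset = ds.dataset("s3://example-bucket/data/", format="parquet")
{code}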





[jira] [Created] (ARROW-17921) [Java][Doc] Define next steps for Arrow Tools module

2022-10-03 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-17921:
-

 Summary: [Java][Doc] Define next steps for Arrow Tools module
 Key: ARROW-17921
 URL: https://issues.apache.org/jira/browse/ARROW-17921
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Java
Reporter: David Dali Susanibar Arce


There are utilities such as 
[GenerateSampleData.java|https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/GenerateSampleData.java]
 that need to be reviewed to determine how they relate to the Arrow Tools 
module, and a common pattern should be defined for future utilities that are 
planned.





[jira] [Created] (ARROW-17920) Built-in GRPC health checks in FlightServerBase

2022-10-03 Thread Akshaya Annavajhala (AK) (Jira)
Akshaya Annavajhala (AK) created ARROW-17920:


 Summary: Built-in GRPC health checks in FlightServerBase
 Key: ARROW-17920
 URL: https://issues.apache.org/jira/browse/ARROW-17920
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC
Affects Versions: 9.0.0
Reporter: Akshaya Annavajhala (AK)


Related: ARROW-14440 ([C++][FlightRPC] Add example of registering gRPC service 
on a Flight server).

A portion of that issue notes that it is currently impossible in Python for a 
server implementing FlightServerBase (FSB) to register additional gRPC 
services. While I haven't verified that claim, it seems to be valid.

Complete composability in Python of an arbitrary gRPC service + FlightServer 
might not be a goal, but it is one possible way to implement a useful 
production Flight server that responds appropriately to [k8s-style health 
probes|https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command].

A concrete ask, perhaps more idiomatic to FSB, is overridable default 
implementations of the [gRPC health check 
messages|https://github.com/grpc/grpc/blob/master/doc/health-checking.md#grpc-health-checking-protocol],
 which would allow "simple" Python derivations of FlightServerBase to be served 
in production environments (including k8s, which is adding support for the 
generic gRPC health checking protocol).
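
As a point of reference, here is a minimal sketch of the standard health service this would build in, using the grpcio and grpcio-health-checking packages. Today it has to run as a separate plain gRPC server rather than inside FSB; the port and the Flight service name registered below are illustrative assumptions:

{code:python}
from concurrent import futures

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

# Stand-alone gRPC server hosting only the standard health checking service.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
health_servicer = health.HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

# Mark the Flight service name as SERVING; k8s gRPC probes and
# grpc_health_probe query this endpoint.
health_servicer.set("arrow.flight.protocol.FlightService",
                    health_pb2.HealthCheckResponse.SERVING)

server.add_insecure_port("[::]:8080")
server.start()
server.wait_for_termination()
{code}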





[jira] [Created] (ARROW-17919) [Java] Potentially inefficient variable-width vector reallocation

2022-10-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17919:
--

 Summary: [Java] Potentially inefficient variable-width vector 
reallocation
 Key: ARROW-17919
 URL: https://issues.apache.org/jira/browse/ARROW-17919
 Project: Apache Arrow
  Issue Type: Wish
  Components: Java
Reporter: Antoine Pitrou


In several places in the Java codebase you can see this kind of pattern:
{code:java}
while (vector.getDataBuffer().capacity() < toCapacity) {
  vector.reallocDataBuffer();
}
{code}

In the event that a much larger capacity is requested, this will spuriously 
make several reallocations, doubling the capacity each time (for example, 
growing from 1 MiB to 1 GiB takes ten successive reallocations).

It would probably be more efficient to reallocate directly to satisfy the 
desired capacity.

Coincidentally, there's a {{reallocDataBuffer}} overload that seems to do just 
that.





[jira] [Created] (ARROW-17918) [Python] ExtensionArray.__getitem__ is not called if called from StructArray

2022-10-03 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17918:
--

 Summary: [Python] ExtensionArray.__getitem__ is not called if 
called from StructArray
 Key: ARROW-17918
 URL: https://issues.apache.org/jira/browse/ARROW-17918
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Rok Mihevc


It seems that when getting a value from a StructScalar, extension type 
information is lost. See:


{code:python}
import pyarrow as pa

class ExampleScalar(pa.ExtensionScalar):
    def as_py(self):
        print(f"ExampleScalar.as_py -> {self.value.as_py()}")
        return self.value.as_py()

class ExampleArray(pa.ExtensionArray):
    def __getitem__(self, item):
        return f"ExampleArray.__getitem__[{item}] -> {self.storage[item]}"
    def __arrow_ext_scalar_class__(self):
        return ExampleScalar

class ExampleType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.int64(), "ExampleExtensionType")
    def __arrow_ext_serialize__(self):
        return b""
    def __arrow_ext_class__(self):
        return ExampleArray

example_type = ExampleType()
arr = pa.array([1, 2, 3])
example_array = pa.ExtensionArray.from_storage(example_type, arr)
example_array2 = pa.StructArray.from_arrays([example_array, arr], ["a", "b"])

print("\nExample 1\n=")
print(example_array[0])
print(example_array.type)
print(type(example_array[0]))

print("\nExample 2\n=")
print(example_array2[0])
print(example_array2[0].type)
print(example_array2[0]["a"])
print(example_array2[0]["a"].type)
{code}

Returns:

{code:python}
Example 1
=
ExampleArray.__getitem__[0] -> 1
extension<ExampleExtensionType>
<class 'str'>

Example 2
=
[('a', 1), ('b', 1)]
struct<a: extension<ExampleExtensionType>, b: int64>
1
extension<ExampleExtensionType>
{code}






[jira] [Created] (ARROW-17917) [C++] Add opaque device id identification to InputStream

2022-10-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17917:
--

 Summary: [C++] Add opaque device id identification to InputStream
 Key: ARROW-17917
 URL: https://issues.apache.org/jira/browse/ARROW-17917
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


For the purpose of collecting input/output statistics, it is important to know 
to which "device" these stats pertain, so as not to mix e.g. stats for a local 
NVMe drive, an NFS-attached drive, an S3 filesystem, or an in-memory buffer 
reader.

I suggest adding this API to InputStream:
{code:c++}
/// \brief An opaque unique id for the device underlying this stream.
///
/// Any implementation is free to fill those bytes as it sees fit,
/// but it should be able to uniquely identify each "device"
/// (for example, a specific local drive, or a specific remote network
/// filesystem).
///
/// A suggested format is "<backend>:<device>" where "<backend>"
/// is a short string representing the backend kind
/// (for example "local", "s3"...) and "<device>" is a
/// backend-dependent string of bytes (for example a
/// `dev_t` for a POSIX local file).
///
/// This is not required to be printable nor human-readable,
/// and may contain NUL characters.
virtual std::string device_id() const = 0;
{code}






[jira] [Created] (ARROW-17916) [Python] Allow disabling more components

2022-10-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17916:
--

 Summary: [Python] Allow disabling more components
 Key: ARROW-17916
 URL: https://issues.apache.org/jira/browse/ARROW-17916
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Affects Versions: 9.0.0
Reporter: Antoine Pitrou


Some users would like to build lightweight versions of PyArrow, for example for 
use in AWS Lambda or similar systems which constrain the total size of usable 
libraries.

However, PyArrow currently mandates several Arrow C++ components (Compute, 
CSV, Dataset, Filesystem, HDFS and JSON), which can lead to a very sizable 
Arrow binary install.
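
To illustrate the footprint, here is a small sketch that lists the Arrow C++ shared libraries bundled with an existing PyArrow install (library names, extensions and sizes vary by platform and build):

{code:python}
import glob
import os

import pyarrow as pa

# Print the size of each bundled Arrow C++ shared library
# (matches libarrow* on Linux/macOS; Windows uses arrow*.dll instead).
for lib_dir in pa.get_library_dirs():
    for lib in sorted(glob.glob(os.path.join(lib_dir, "libarrow*"))):
        print(f"{os.path.basename(lib)}: {os.path.getsize(lib) / 1e6:.1f} MB")
{code}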





[jira] [Created] (ARROW-17915) [C++] Error when using Substrait ProjectRel

2022-10-03 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-17915:


 Summary: [C++] Error when using Substrait ProjectRel
 Key: ARROW-17915
 URL: https://issues.apache.org/jira/browse/ARROW-17915
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Dewey Dunnington


After ARROW-16989 and ARROW-15584, there is new behaviour with ProjectRel. I 
implemented a solution that worked with DuckDB's Substrait consumer, but when 
I try with Arrow's consumer I get an error:

{code:R}
library(arrow, warn.conflicts = FALSE)

plan_as_json <- '{
  "extensionUris": [
{
  "extensionUriAnchor": 1,
  "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
}
  ],
  "relations": [
{
  "rel": {
"project": {
  "common": {"emit": {"outputMapping": [3, 4]}},
  "input": {
"read": {
  "baseSchema": {
"names": ["int", "dbl"],
"struct": {"types": [{"i32": {}}, {"fp64": {}}]}
  },
  "localFiles": {
"items": [
  {
"uriFile": "file://THIS_IS_THE_TEMP_FILE",
"parquet": {}
  }
]
  }
}
  },
  "expressions": [
    {"selection": {"directReference": {"structField": {"field": 1}}}},
    {"selection": {"directReference": {"structField": {"field": 0}}}}
  ]
}
  }
}
  ]
}'

temp_parquet <- tempfile()
write_parquet(data.frame(int = integer(), dbl = double()), temp_parquet)
plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
arrow:::do_exec_plan_substrait(plan_as_json)
#> Error: Invalid: Invalid column index to add field.
#> 
/Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338
project_schema->AddField( num_columns + 
static_cast<int>(project.expressions().size()) - 1, std::move(project_field))
#> 
/Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/serde.cc:156 
 FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), 
ext_set, conversion_options)
{code}

It's admittedly a goofy thing to do: compute a new column that is an identical 
copy of an existing column and then discard the original. I can and should 
simplify the Substrait plan that I'm generating, but maybe this is also valid 
Substrait that should be accepted?





[jira] [Created] (ARROW-17914) [Java] Support reading a subset of fields from an IPC file or stream

2022-10-03 Thread David Li (Jira)
David Li created ARROW-17914:


 Summary: [Java] Support reading a subset of fields from an IPC 
file or stream
 Key: ARROW-17914
 URL: https://issues.apache.org/jira/browse/ARROW-17914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: David Li


C++ supports {{IpcReadOptions.included_fields}}, which lets you load a subset 
of (top-level) fields from an IPC file or stream, potentially saving on I/O 
costs. It would be useful to support this in Java as well. Some refactoring 
would be required, since MessageSerializer currently reads record batch 
messages as a whole, and it would be good to quantify how much of a benefit 
this provides in different scenarios.
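
For reference, a short sketch of the behaviour as exposed through the Python bindings (requires a reasonably recent PyArrow), i.e. what the Java equivalent would enable:

{code:python}
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"a": [1, 2], "b": ["x", "y"], "c": [0.1, 0.2]})

# Write a small IPC file into an in-memory buffer.
sink = pa.BufferOutputStream()
with ipc.new_file(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Read back only the first and third top-level fields.
opts = ipc.IpcReadOptions(included_fields=[0, 2])
reader = ipc.open_file(buf, options=opts)
print(reader.read_all().schema)  # a: int64, c: double
{code}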


