[jira] [Created] (ARROW-18439) Misleading message when loading parquet data with invalid null data
created ARROW-18439: Summary: Misleading message when loading parquet data with invalid null data Key: ARROW-18439 URL: https://issues.apache.org/jira/browse/ARROW-18439 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: I'm saving an Arrow table to Parquet. One column is a list of structs whose elements are marked as non-nullable, but the data isn't valid because I've put a null in one of the nested fields. When I save this data to Parquet and try to load it back, I get a very misleading message:
{code:java}
Length spanned by list offsets (2) larger than values array (length 1){code}
I would rather Arrow complained when creating the table or when saving it to Parquet. Here's how to reproduce the issue:
{code:java}
import io

import pyarrow as pa
import pyarrow.parquet as pq

struct = pa.struct(
    [
        pa.field("nested_string", pa.string(), nullable=False),
    ]
)
schema = pa.schema(
    [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
)
table = pa.table(
    {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]}, schema=schema
)

with io.BytesIO() as file:
    pq.write_table(table, file)
    file.seek(0)
    pq.read_table(file)  # Raises pa.ArrowInvalid
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
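A possible way to surface the problem earlier (my suggestion, not something the report proposes): {{Table.validate(full=True)}} runs the expensive data-level validation checks, which may flag the invalid null before the table is written; I haven't verified that it catches this particular nullability violation:
{code:python}
import pyarrow as pa

# Assuming `table` is the table from the reproducer above.
# full=True enables O(n) checks that inspect the data, not just the metadata.
try:
    table.validate(full=True)
except pa.ArrowInvalid as e:
    print(e)  # would fail here, before write_table is ever reached
{code}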
[jira] [Created] (ARROW-18438) [Go] firstTimeBitmapWriter.Finish() panics with 8n structs
Min-Young Wu created ARROW-18438: Summary: [Go] firstTimeBitmapWriter.Finish() panics with 8n structs Key: ARROW-18438 URL: https://issues.apache.org/jira/browse/ARROW-18438 Project: Apache Arrow Issue Type: Bug Components: Go, Parquet Affects Versions: 10.0.1 Reporter: Min-Young Wu Even after [ARROW-17169|https://issues.apache.org/jira/browse/ARROW-17169] I still get a panic at the same location. Below is a test case that panics:
{code:go}
func (ps *ParquetIOTestSuite) TestStructWithNullableListOfStructs() {
	bldr := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
		arrow.Field{
			Name: "l",
			Type: arrow.ListOf(arrow.StructOf(
				arrow.Field{Name: "a", Type: arrow.BinaryTypes.String},
			)),
		},
	))
	defer bldr.Release()

	lBldr := bldr.FieldBuilder(0).(*array.ListBuilder)
	stBldr := lBldr.ValueBuilder().(*array.StructBuilder)
	aBldr := stBldr.FieldBuilder(0).(*array.StringBuilder)

	bldr.AppendNull()
	bldr.Append(true)
	lBldr.Append(true)
	for i := 0; i < 8; i++ {
		stBldr.Append(true)
		aBldr.Append(strconv.Itoa(i))
	}

	arr := bldr.NewArray()
	defer arr.Release()

	field := arrow.Field{Name: "x", Type: arr.DataType(), Nullable: true}
	expected := array.NewTable(
		arrow.NewSchema([]arrow.Field{field}, nil),
		[]arrow.Column{*arrow.NewColumn(field, arrow.NewChunked(field.Type, []arrow.Array{arr}))},
		-1,
	)
	defer expected.Release()

	ps.roundTripTable(expected, false)
}
{code}
I've tried to trim down the input data and this is as minimal as I could get it. And yes:
* wrapping struct with initial null is required
* the inner list needs to contain 8 structs (or any multiple of 8)
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18437) [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context
Xuwei Fu created ARROW-18437: Summary: [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context Key: ARROW-18437 URL: https://issues.apache.org/jira/browse/ARROW-18437 Project: Apache Arrow Issue Type: Bug Components: Parquet Affects Versions: 11.0.0 Reporter: Xuwei Fu Assignee: Xuwei Fu Fix For: 11.0.0 When calling {{flushValues}}, it doesn't:
* clear {{total_value_count_}}
* re-advance the buffer by {{kMaxPageHeaderWriterSize}}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
James Bourbeau created ARROW-18436: -- Summary: `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space Key: ARROW-18436 URL: https://issues.apache.org/jira/browse/ARROW-18436 Project: Apache Arrow Issue Type: Bug Components: Python Environment:
- OS: macOS
- `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
- `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
Reporter: James Bourbeau When attempting to create a new filesystem object from a public dataset in S3, where there is a space in the object path, an error is raised. Here's a minimal reproducer:
```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
```
which fails with the following traceback:
```
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
```
Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that has a space with a `*` wildcard
```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
```
The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is somehow problematic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
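A possible stopgap (my suggestion, untested against this bucket): percent-encode the space before handing the URI to {{from_uri}}, since the failure happens in URI parsing rather than in S3 itself:
{code:python}
from urllib.parse import quote
from pyarrow.fs import FileSystem

# Hypothetical workaround: encode the space as %20 so the URI parses.
key = quote("trip data/fhvhv_tripdata_2022-06.parquet")
fs, path = FileSystem.from_uri(f"s3://nyc-tlc/{key}")
{code}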
[jira] [Created] (ARROW-18435) [C++][Java] Update ORC to 1.8.1
Gang Wu created ARROW-18435: --- Summary: [C++][Java] Update ORC to 1.8.1 Key: ARROW-18435 URL: https://issues.apache.org/jira/browse/ARROW-18435 Project: Apache Arrow Issue Type: Improvement Components: C++, Java Reporter: Gang Wu Assignee: Gang Wu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18434) [C++][Parquet] Parquet page index read support
Gang Wu created ARROW-18434: --- Summary: [C++][Parquet] Parquet page index read support Key: ARROW-18434 URL: https://issues.apache.org/jira/browse/ARROW-18434 Project: Apache Arrow Issue Type: Sub-task Reporter: Gang Wu Assignee: Gang Wu Implement read support for parquet page index and expose it from the reader API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18433) Optimize aggregate functions to work with batches.
A. Coady created ARROW-18433: Summary: Optimize aggregate functions to work with batches. Key: ARROW-18433 URL: https://issues.apache.org/jira/browse/ARROW-18433 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Affects Versions: 10.0.1 Reporter: A. Coady Most compute functions work with the dataset API without loading entire columns into memory. Aggregate functions which are associative could also work that way: `min`, `max`, `any`, `all`, `sum`, `product`, and even `unique` and `value_counts` (a hand-rolled version of this fold is sketched below). A couple of implementation ideas:
* expand the dataset API to support expressions which return scalars
* add a `BatchedArray` type which is like a `ChunkedArray` but with lazy loading
-- This message was sent by Atlassian Jira (v8.20.10#820010)
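For context, here is a minimal sketch of what the associative-aggregate pattern looks like today when folded by hand over scan batches (my illustration; the dataset path and column name are hypothetical):
{code:python}
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("data/")  # hypothetical dataset
running_min = None
for batch in dataset.to_batches(columns=["x"]):
    batch_min = pc.min(batch.column("x")).as_py()
    if batch_min is not None:
        running_min = batch_min if running_min is None else min(running_min, batch_min)
{code}
The proposal is essentially to push this incremental fold inside the dataset API so the column never has to be fully loaded.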
[jira] [Created] (ARROW-18432) [Python] Array constructor doesn't support arrow scalars.
A. Coady created ARROW-18432: Summary: [Python] Array constructor doesn't support arrow scalars. Key: ARROW-18432 URL: https://issues.apache.org/jira/browse/ARROW-18432 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: A. Coady
{code:python}
pa.array([pa.scalar(0)])
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type

pa.array([pa.scalar(0)], 'int64')
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: tried to convert to int64{code}
It seems odd that the array constructors don't recognize their own scalars. In practice, a list of scalars has to be converted with `.as_py()` just to be converted back, and that also loses the type information. -- This message was sent by Atlassian Jira (v8.20.10#820010)
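A hedged illustration of the `.as_py()` round-trip described above, re-attaching the type explicitly so it isn't lost (my example, not from the report):
{code:python}
import pyarrow as pa

scalars = [pa.scalar(0), pa.scalar(1)]
# Round-trip through Python objects, re-attaching the type by hand:
arr = pa.array([s.as_py() for s in scalars], type=scalars[0].type)
assert arr.type == pa.int64()
{code}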
[jira] [Created] (ARROW-18431) Acero's Execution Plan never finishes.
Pau Garcia Rodriguez created ARROW-18431: Summary: Acero's Execution Plan never finishes. Key: ARROW-18431 URL: https://issues.apache.org/jira/browse/ARROW-18431 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 10.0.0 Reporter: Pau Garcia Rodriguez We have observed that sometimes an execution plan with a small input never finishes (the future returned by the ExecPlan::finished() method is never marked as finished), even though the generator in the sink node is exhausted and has returned nullopt. This issue seems to happen at random: the same plan with the same input sometimes works (the plan is marked finished) and sometimes it doesn't. Since the ExecPlanImpl destructor forces the executing thread to wait for the plan to finish (when the plan has not yet finished), we enter a deadlock waiting for a plan that never finishes. Since this has only happened with small inputs and not in a deterministic way, we believe the issue might be in the ExecPlan::StartProducing method. Our hypothesis is that after the plan starts producing on each node, each node schedules its tasks and they finish immediately (due to the small input), and somehow the callback that marks the {{finished_}} future as finished is never executed.
{code:java}
Status StartProducing() {
  ...
  Future<> scheduler_finished =
      util::AsyncTaskScheduler::Make([this](util::AsyncTaskScheduler* async_scheduler) {
        ...
      });
  scheduler_finished.AddCallback([this](const Status& st) { finished_.MarkFinished(st); });
  ...
}{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18430) [Python] Cannot cast nested nullable field to not-nullable
created ARROW-18430: Summary: [Python] Cannot cast nested nullable field to not-nullable Key: ARROW-18430 URL: https://issues.apache.org/jira/browse/ARROW-18430 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: Casting from a nullable field to a not-nullable one works provided all values are present. So, for example, this is a valid cast:
{code:java}
table = pa.table({'column_1': pa.array([1, 2, 3])})
table.cast(
    pa.schema([
        f.with_nullable(False)
        for f in table.schema
    ])
)
{code}
But it doesn't work for nested fields. Here's an example:
{code:java}
import pyarrow as pa

record = {"nested_int": 1}

data_type = pa.struct(
    [
        pa.field("nested_int", pa.int32(), nullable=True),
    ]
)
data_type_after = pa.struct(
    [
        pa.field("nested_int", pa.int32(), nullable=False),
    ]
)

table = pa.table({"column_1": pa.array([record], data_type)})
table.cast(pa.schema([pa.field("column_1", data_type_after)]))
{code}
Throws:
{code:java}
pyarrow.lib.ArrowTypeError: cannot cast nullable field to non-nullable field: struct<nested_int: int32> struct<nested_int: int32 not null>
{code}
This is somewhat related to [https://github.com/apache/arrow/issues/13177] and https://issues.apache.org/jira/browse/ARROW-16603
-- This message was sent by Atlassian Jira (v8.20.10#820010)
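One possible workaround (mine, not from the report): rebuild the struct column from its children with the desired field metadata instead of casting:
{code:python}
import pyarrow as pa

# Assuming `table` and `data_type_after` from the reproducer above.
col = table["column_1"].combine_chunks()
rebuilt = pa.StructArray.from_arrays(
    [col.field("nested_int")],
    fields=list(data_type_after),  # these fields carry nullable=False
)
table = pa.table({"column_1": rebuilt})
{code}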
[jira] [Created] (ARROW-18429) [R] Bump dev version following 10.0.1 patch release
Nicola Crane created ARROW-18429: Summary: [R] Bump dev version following 10.0.1 patch release Key: ARROW-18429 URL: https://issues.apache.org/jira/browse/ARROW-18429 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Nicola Crane Assignee: Nicola Crane Fix For: 11.0.0 CI job fails with:
{code:java}
Insufficient package version (submitted: 10.0.0.9000, existing: 10.0.1)
Version contains large components (10.0.0.9000)
{code}
https://github.com/apache/arrow/actions/runs/3639669477/jobs/6145488845#step:10:567 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18428) [Website] Enable github issues on arrow-site repo
Joris Van den Bossche created ARROW-18428: - Summary: [Website] Enable github issues on arrow-site repo Key: ARROW-18428 URL: https://issues.apache.org/jira/browse/ARROW-18428 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Joris Van den Bossche Now that we are moving to GitHub issues, it probably makes sense to open issues about the website in its own arrow-site repo, instead of keeping them in the main arrow repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`
Yaron Gvili created ARROW-18427: --- Summary: [C++] Support negative tolerance in `AsofJoinNode` Key: ARROW-18427 URL: https://issues.apache.org/jira/browse/ARROW-18427 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing past-joining, i.e., joining right-table rows with a timestamp at or before that of the left-table row. This issue will add support for a negative tolerance, which would allow future-joining too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18426) Update committers and PMC members on website
Benson Muite created ARROW-18426: Summary: Update committers and PMC members on website Key: ARROW-18426 URL: https://issues.apache.org/jira/browse/ARROW-18426 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Benson Muite Assignee: Benson Muite Update committers and PMC members -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18425) Add support for Substrait round expression
Bryce Mecum created ARROW-18425: --- Summary: Add support for Substrait round expression Key: ARROW-18425 URL: https://issues.apache.org/jira/browse/ARROW-18425 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Bryce Mecum Work has been started on adding round to Substrait in [https://github.com/substrait-io/substrait/pull/322] and it looks like a mapping needs to be registered on the Acero side for Acero to consume plans with it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`
Yaron Gvili created ARROW-18424: --- Summary: [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness` Key: ARROW-18424 URL: https://issues.apache.org/jira/browse/ARROW-18424 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Doxygen is hitting the following error: `/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)`. See [this CI job output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381], for example. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18423) [Python] Expose reading a schema from an IPC message
Andre Kohn created ARROW-18423: -- Summary: [Python] Expose reading a schema from an IPC message Key: ARROW-18423 URL: https://issues.apache.org/jira/browse/ARROW-18423 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Andre Kohn Pyarrow currently does not implement reading the Arrow schema from an IPC message. [https://github.com/apache/arrow/blob/80b389efe902af376a85a8b3740e0dbdc5f80900/python/pyarrow/ipc.pxi#L1094] We'd like to consume Arrow IPC stream data like the following:
```
schema_msg = pyarrow.ipc.read_message(result_iter.next().data)
schema = pyarrow.ipc.read_schema(schema_msg)
for batch_data in result_iter:
    batch_msg = pyarrow.ipc.read_message(batch_data.data)
    batch = pyarrow.ipc.read_record_batch(batch_msg, schema)
```
The associated (tiny) PR on GitHub implements this reading by binding the existing C++ function. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18422) [C++] Provide enum reflection utility
Ben Kietzman created ARROW-18422: Summary: [C++] Provide enum reflection utility Key: ARROW-18422 URL: https://issues.apache.org/jira/browse/ARROW-18422 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Ben Kietzman Assignee: Ben Kietzman Now that we have C++17, we could try again with ARROW-13296 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18421) [C++][ORC] Add accessor for number of rows by stripe in reader
Louis Calot created ARROW-18421: --- Summary: [C++][ORC] Add accessor for number of rows by stripe in reader Key: ARROW-18421 URL: https://issues.apache.org/jira/browse/ARROW-18421 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Louis Calot I need the number of rows per stripe to be able to read specific ranges of records from the ORC file without reading it all. The number of rows is already stored in the implementation but not exposed in the API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18420) [C++][Parquet] Introduce ColumnIndex and OffsetIndex
Gang Wu created ARROW-18420: --- Summary: [C++][Parquet] Introduce ColumnIndex and OffsetIndex Key: ARROW-18420 URL: https://issues.apache.org/jira/browse/ARROW-18420 Project: Apache Arrow Issue Type: Sub-task Components: C++, Parquet Reporter: Gang Wu Assignee: Gang Wu Define the interfaces of ColumnIndex and OffsetIndex and provide an implementation that reads them from their serialized form. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18419) [C++] Update vendored fast_float
Kouhei Sutou created ARROW-18419: Summary: [C++] Update vendored fast_float Key: ARROW-18419 URL: https://issues.apache.org/jira/browse/ARROW-18419 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou For https://github.com/fastfloat/fast_float/pull/147 . -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python
Andy Grove created ARROW-18418: -- Summary: [WEBSITE] do not delete /datafusion-python Key: ARROW-18418 URL: https://issues.apache.org/jira/browse/ARROW-18418 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Assignee: Andy Grove do not delete /datafusion-python when publishing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18417) [C++] Support emit info in Substrait extension-multi and AsOfJoin
Yaron Gvili created ARROW-18417: --- Summary: [C++] Support emit info in Substrait extension-multi and AsOfJoin Key: ARROW-18417 URL: https://issues.apache.org/jira/browse/ARROW-18417 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Currently, Arrow-Substrait does not handle emit info that may appear in an extension-multi in a Substrait plan. Besides the generic handling in the Arrow-Substrait extension API, specific handling for AsOfJoin is required, because AsOfJoinNode produces an output schema that is different from the one used in the emit info. In particular, the AsOfJoinNode output schema does not include the on- and by-keys of the right tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18416) [R] Update NEWS for 10.0.1
Nicola Crane created ARROW-18416: Summary: [R] Update NEWS for 10.0.1 Key: ARROW-18416 URL: https://issues.apache.org/jira/browse/ARROW-18416 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18415) [R] Update R package README to reference GH Issues
Nicola Crane created ARROW-18415: Summary: [R] Update R package README to reference GH Issues Key: ARROW-18415 URL: https://issues.apache.org/jira/browse/ARROW-18415 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The R package README should be updated to refer to GH Issues for users who don't have a JIRA account -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18414) [Release] Add a post script to generate announce email
Kouhei Sutou created ARROW-18414: Summary: [Release] Add a post script to generate announce email Key: ARROW-18414 URL: https://issues.apache.org/jira/browse/ARROW-18414 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Fix For: 11.0.0 We want to generate an announce email like a vote email. e.g.: [ANNOUNCE] Apache Arrow 10.0.0 released https://lists.apache.org/thread/zdsogdwj3r7wjv93o84go4ykgrcwtr0p . FYI: We can generate a vote email by {{SOURCE_DEFAULT=0 SOURCE_VOTE=1 dev/release/02-source.sh ...}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18413) [C++][Parquet] FileMetaData exposes page index metadata
Gang Wu created ARROW-18413: --- Summary: [C++][Parquet] FileMetaData exposes page index metadata Key: ARROW-18413 URL: https://issues.apache.org/jira/browse/ARROW-18413 Project: Apache Arrow Issue Type: Sub-task Components: C++, Parquet Reporter: Gang Wu Assignee: Gang Wu The Parquet ColumnChunk thrift object has recorded metadata for the page index:
{quote}
struct ColumnChunk {
  /** File offset of ColumnChunk's OffsetIndex **/
  4: optional i64 offset_index_offset

  /** Size of ColumnChunk's OffsetIndex, in bytes **/
  5: optional i32 offset_index_length

  /** File offset of ColumnChunk's ColumnIndex **/
  6: optional i64 column_index_offset

  /** Size of ColumnChunk's ColumnIndex, in bytes **/
  7: optional i32 column_index_length
}
{quote}
We just need to add a public API to ColumnChunkMetaData to make it ready to read. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18412) [R] Windows build fails because of missing ChunkResolver symbols
Dewey Dunnington created ARROW-18412: Summary: [R] Windows build fails because of missing ChunkResolver symbols Key: ARROW-18412 URL: https://issues.apache.org/jira/browse/ARROW-18412 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Dewey Dunnington In recent nightly builds of the Windows package we have a build failure because some symbols related to the {{ChunkResolver}} are not found in the linking stage. https://github.com/ursacomputing/crossbow/actions/runs/3559717769/jobs/5979255297#step:9:2818 [~kou] suggested the following patch might fix the build: https://github.com/apache/arrow/pull/14530#issuecomment-1328341447 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field
created ARROW-18411: Summary: [Python] MapType comparison ignores nullable flag of item_field Key: ARROW-18411 URL: https://issues.apache.org/jira/browse/ARROW-18411 Project: Apache Arrow Issue Type: Bug Components: Python Environment: pyarrow==10.0.1 Reporter: By default MapType value fields are nullable:
{code:java}
pa.map_(pa.string(), pa.int32()).item_field.nullable == True
{code}
It is possible to mark the value field of a MapType as not-nullable:
{code:java}
pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False)).item_field.nullable == False{code}
But comparing these two types, which are semantically different, returns True:
{code:java}
pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))
# Returns True
{code}
So it looks like the comparison omits the nullable flag.
{code:java}
import pyarrow as pa
import pytest

print(pa.__version__)

map_type = pa.map_(pa.string(), pa.int32())
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)
with pytest.raises(pa.ArrowInvalid, match=r"Invalid Map: key field can not contain null values"):
    pa.array(
        [[("one", 1), ("two", 2), (None, None)]], map_type
    )

map_type = pa.map_(pa.string(), pa.int32())
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)

non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))
nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=True))
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)

assert nullable_map_type == map_type  # Should be different
assert str(nullable_map_type) == str(map_type)
assert non_null_map_type == map_type
assert non_null_map_type.item_type == map_type.item_type
assert non_null_map_type.item_field != map_type.item_field
assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
assert non_null_map_type.item_field.name == map_type.item_field.name
assert str(non_null_map_type) != str(map_type)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18410) [Packaging][Ubuntu] Add support for Ubuntu 22.10
Kouhei Sutou created ARROW-18410: Summary: [Packaging][Ubuntu] Add support for Ubuntu 22.10 Key: ARROW-18410 URL: https://issues.apache.org/jira/browse/ARROW-18410 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18409) [GLib][Plasma] Suppress deprecated warning in building plasma-glib
Kouhei Sutou created ARROW-18409: Summary: [GLib][Plasma] Suppress deprecated warning in building plasma-glib Key: ARROW-18409 URL: https://issues.apache.org/jira/browse/ARROW-18409 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou If we always get the "Plasma is deprecated since Arrow 10.0.0. ..." warning from {{plasma/common.h}}, we can't use the {{-Dwerror=true}} Meson option with plasma-glib. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18408) [C++] Add nightly test that uses an older version of protoc
Weston Pace created ARROW-18408: --- Summary: [C++] Add nightly test that uses an older version of protoc Key: ARROW-18408 URL: https://issues.apache.org/jira/browse/ARROW-18408 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Specifically, we should test the protoc version installed by Ubuntu 20.04 to help detect issues like ARROW-18406. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18407) [Release][Website] Use UTC for release date
Kouhei Sutou created ARROW-18407: Summary: [Release][Website] Use UTC for release date Key: ARROW-18407 URL: https://issues.apache.org/jira/browse/ARROW-18407 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools, Website Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18406) [C++] Can't build Arrow with Substrait on Ubuntu 20.04
Dewey Dunnington created ARROW-18406: Summary: [C++] Can't build Arrow with Substrait on Ubuntu 20.04 Key: ARROW-18406 URL: https://issues.apache.org/jira/browse/ARROW-18406 Project: Apache Arrow Issue Type: Improvement Reporter: Dewey Dunnington I recently tried to rebuild Arrow with Substrait on Ubuntu 20.04 and got the following error:
{code:java}
[100%] Building CXX object src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/type_internal.cc.o
/home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc: In function ‘arrow::Status arrow::engine::DecodeArg(const substrait::FunctionArgument&, int, arrow::engine::SubstraitCall*, const arrow::engine::ExtensionSet&, const arrow::engine::ConversionOptions&)’:
/home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:60:21: error: ‘bool substrait::FunctionArgument::has_enum_() const’ is private within this context
   60 |   if (arg.has_enum_()) {
      |                     ^
In file included from /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.h:30,
                 from /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:20:
/home/dewey/.r-arrow-dev-build/build/substrait_ep-generated/substrait/algebra.pb.h:21690:13: note: declared private here
21690 | inline bool FunctionArgument::has_enum_() const {
      |             ^~~~
[100%] Building CXX object src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/util.cc.o
make[2]: *** [src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/build.make:76: src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/expression_internal.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs
make[1]: *** [CMakeFiles/Makefile2:2028: src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
{code}
[~westonpace] suggested that it is probably a protobuf version problem! For me this is:
{code:java}
$ protoc --version
libprotoc 3.6.1
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18405) [Ruby] Raw table converter rebuilds chunked arrays
Sten Larsson created ARROW-18405: Summary: [Ruby] Raw table converter rebuilds chunked arrays Key: ARROW-18405 URL: https://issues.apache.org/jira/browse/ARROW-18405 Project: Apache Arrow Issue Type: Bug Components: Ruby Affects Versions: 10.0.0 Reporter: Sten Larsson Consider the following Ruby script:
{code:ruby}
require 'arrow'

data = Arrow::ChunkedArray.new([Arrow::Int64Array.new([1])])
table = Arrow::Table.new('column' => data)
puts table['column'].data_type
{code}
This prints "int64" with red-arrow 9.0.0 and "uint8" in 10.0.0. From my understanding it is due to this commit: [https://github.com/apache/arrow/commit/913d9c0a9a1a4398ed5f56d713d586770b4f702c#diff-f7f19bbc3945ea30ba06d851705f2d58f7666507bb101c4e151014ca398bd635R42] The old version would not call ArrayBuilder.build on a ChunkedArray, but the new version does. This is a problem for us, because we need the column to stay int64. A workaround is to specify a schema and list of arrays instead to bypass the raw table converter:
{code:ruby}
table = Arrow::Table.new([{name: 'column', type: 'int64'}], [data])
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18404) [Python] [Docs] Mention the C Data/Stream Interface in PyArrow Extending
Anja Boskovic created ARROW-18404: - Summary: [Python] [Docs] Mention the C Data/Stream Interface in PyArrow Extending Key: ARROW-18404 URL: https://issues.apache.org/jira/browse/ARROW-18404 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Anja Boskovic Assignee: Anja Boskovic The [Arrow C Data/Stream Interface|https://arrow.apache.org/docs/format/CDataInterface.html] is a relatively lightweight option for developers that want to expose Arrow Arrays to Python users. It is not mentioned as a recommendation in the documentation on [using pyarrow from C++ code|https://arrow.apache.org/docs/python/integration/extending.html]. The existing recommendation mentioned is [wrapping and unwrapping|https://arrow.apache.org/docs/python/integration/extending.html#wrapping-and-unwrapping]. I propose adding a section to this page. I would be happy to take that on, if others agree that is a good idea. -- This message was sent by Atlassian Jira (v8.20.10#820010)
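As a flavor of what such a section might show from the Python side, here is a sketch using the {{pyarrow.cffi}} helper and the low-level {{_export_to_c}}/{{_import_from_c}} hooks (illustrative only; a real docs section would presumably pair this with the consuming C++ code):
{code:python}
import pyarrow as pa
from pyarrow.cffi import ffi

arr = pa.array([1, 2, 3])

# Allocate the C Data Interface structs and export the array into them.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
arr._export_to_c(int(ffi.cast("uintptr_t", c_array)),
                 int(ffi.cast("uintptr_t", c_schema)))

# A consumer (possibly another library) imports from the same structs.
roundtripped = pa.Array._import_from_c(int(ffi.cast("uintptr_t", c_array)),
                                       int(ffi.cast("uintptr_t", c_schema)))
assert roundtripped.equals(arr)
{code}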
[jira] [Created] (ARROW-18403) [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported"
Nicola Crane created ARROW-18403: Summary: [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported" Key: ARROW-18403 URL: https://issues.apache.org/jira/browse/ARROW-18403 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane ARROW-17523 added support for the Substrait extension function "count", but when I write code which produces a Substrait plan which calls it, and then try to run it in Acero, I get an error. The plan:
{code:r}
message of type 'substrait.Plan' with 3 fields set
extension_uris {
  extension_uri_anchor: 1
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
}
extension_uris {
  extension_uri_anchor: 2
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
}
extension_uris {
  extension_uri_anchor: 3
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml"
}
extensions {
  extension_function {
    extension_uri_reference: 3
    function_anchor: 2
    name: "count"
  }
}
relations {
  rel {
    aggregate {
      input {
        project {
          common {
            emit {
              output_mapping: 9
              output_mapping: 10
              output_mapping: 11
              output_mapping: 12
              output_mapping: 13
              output_mapping: 14
              output_mapping: 15
              output_mapping: 16
              output_mapping: 17
            }
          }
          input {
            read {
              base_schema {
                names: "int"
                names: "dbl"
                names: "dbl2"
                names: "lgl"
                names: "false"
                names: "chr"
                names: "verses"
                names: "padded_strings"
                names: "some_negative"
                struct_ {
                  types { i32 { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                  types { bool_ { nullability: NULLABILITY_NULLABLE } }
                  types { bool_ { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                }
              }
              local_files {
                items {
                  uri_file: "file:///tmp/RtmpsBsoZJ/file1915f604cff4a"
                  parquet { }
                }
              }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 1 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 2 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 3 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 4 } }
[jira] [Created] (ARROW-18402) [C++] Expose `DeclarationInfo`
Yaron Gvili created ARROW-18402: --- Summary: [C++] Expose `DeclarationInfo` Key: ARROW-18402 URL: https://issues.apache.org/jira/browse/ARROW-18402 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18401) [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest
Dewey Dunnington created ARROW-18401: Summary: [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest Key: ARROW-18401 URL: https://issues.apache.org/jira/browse/ARROW-18401 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington I think this is an R problem where a string is not getting converted to a timestamp (the 'greater' kernel mentioned in the error probably doesn't and shouldn't exist). https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=40090=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=22256
{code:java}
══ Failed tests ══
── Error ('test-dplyr-query.R:694'): Scalars in expressions match the type of the field, if possible ──
Error: NotImplemented: Function 'greater' has no kernel matching input types (timestamp[us, tz=UTC], string)
Backtrace:
     ▆
  1. ├─testthat::expect_output(...) at test-dplyr-query.R:694:2
  2. │ └─testthat:::quasi_capture(...)
  3. │   ├─testthat (local) .capture(...)
  4. │   │ └─testthat::capture_output_lines(code, print, width = width)
  5. │   │   └─testthat:::eval_with_output(code, print = print, width = width)
  6. │   │     ├─withr::with_output_sink(path, withVisible(code))
  7. │   │     │ └─base::force(code)
  8. │   │     └─base::withVisible(code)
  9. │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 10. ├─tab %>% filter(times > "2018-10-07 19:04:05") %>% ...
 11. └─arrow::show_exec_plan(.)
 12.   ├─arrow::as_record_batch_reader(adq)
 13.   └─arrow:::as_record_batch_reader.arrow_dplyr_query(adq)
 14.     └─plan$Build(x)
 15.       └─node$Filter(.data$filtered_rows)
 16.         ├─self$preserve_extras(ExecNode_Filter(self, expr))
 17.         └─arrow:::ExecNode_Filter(self, expr)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18400) Quadratic memory usage of Table.to_pandas with nested data
Adam Reeve created ARROW-18400: -- Summary: Quadratic memory usage of Table.to_pandas with nested data Key: ARROW-18400 URL: https://issues.apache.org/jira/browse/ARROW-18400 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.1 Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X with 64 GB RAM Reporter: Adam Reeve Reading nested Parquet data and then converting it to a Pandas DataFrame shows quadratic memory usage and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts. Example code to generate nested Parquet data:
{code:python}
import numpy as np
import random
import string
import pandas as pd

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 1_024_000
filename = 'nested.parquet'

arr_len = 10
nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
        [{
            'a': None if i % 1000 == 0 else np.random.choice(1, size=3).astype(np.int64),
            'b': None if i % 100 == 0 else random.choice(range(100)),
            'c': None if i % 10 == 0 else make_random_string(5)
        } for i in range(arr_len)]
    ))

df = pd.DataFrame({'c1': nested_col})
df.to_parquet(filename)
{code}
And then read into a DataFrame with:
{code:python}
import pyarrow.parquet as pq

table = pq.read_table(filename)
df = table.to_pandas()
{code}
Only reading to an Arrow table isn't a problem, it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas but I assume the problem probably isn't Parquet specific. Memory usage I see when reading different sized files:
||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
|32,000|362|361|
|64,000|531|531|
|128,000|1,152|1,101|
|256,000|2,888|1,402|
|512,000|10,301|3,508|
|1,024,000|38,697|5,313|
|2,048,000|OOM|20,061|
With Arrow 10.0.1, memory usage approximately quadruples when row count doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but then quadruples from 1024k to 2048k rows. PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something changed between 7.0.0 and 8.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
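For anyone trying to reproduce the table above, a minimal sketch of one way to capture peak memory on Linux (my approach, not necessarily how these numbers were measured):
{code:python}
import resource

import pyarrow.parquet as pq

table = pq.read_table("nested.parquet")
df = table.to_pandas()

# ru_maxrss is reported in kilobytes on Linux.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"peak RSS: {peak_mb:.0f} MB")
{code}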
[jira] [Created] (ARROW-18399) [Python] Reduce warnings during tests
Antoine Pitrou created ARROW-18399: -- Summary: [Python] Reduce warnings during tests Key: ARROW-18399 URL: https://issues.apache.org/jira/browse/ARROW-18399 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Numerous warnings are displayed at the end of a test run; we should strive to reduce them: https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489 -- This message was sent by Atlassian Jira (v8.20.10#820010)
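A small, hedged sketch of one way to track an individual warning down to its source while cleaning these up (standard library only, not an Arrow API):
{code:python}
import warnings

# Promote one warning category to an error so the test run shows the
# offending stack trace instead of a summary line at the end.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    run_suspect_code()  # hypothetical: the code path emitting the warning
{code}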
[jira] [Created] (ARROW-18398) [C++] Sporadic error in StressSourceGroupedSumStop
Antoine Pitrou created ARROW-18398: -- Summary: [C++] Sporadic error in StressSourceGroupedSumStop Key: ARROW-18398 URL: https://issues.apache.org/jira/browse/ARROW-18398 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou I just saw this occasional failure: https://github.com/apache/arrow/actions/runs/3533672097/jobs/5929601817#step:11:294
{code}
[ RUN      ] ExecPlanExecution.StressSourceGroupedSumStop
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:850: Failure
Value of: _fut.Wait(::arrow::kDefaultAssertFinishesWaitSeconds)
  Actual: false
Expected: true
Google Test trace:
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:825: parallel
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:822: unslowed
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:60: Plan was destroyed before finishing
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18397) [C++] Clear S3 region resolver client at S3 shutdown
Antoine Pitrou created ARROW-18397: -- Summary: [C++] Clear S3 region resolver client at S3 shutdown Key: ARROW-18397 URL: https://issues.apache.org/jira/browse/ARROW-18397 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 10.0.2, 11.0.0 The S3 region resolver caches an S3 client at module scope. This client can be destroyed very late and trigger an assertion error in the AWS SDK because the SDK was already shut down: https://github.com/aws/aws-sdk-cpp/issues/2204 When explicitly finalizing S3, we should ensure we also destroy the cached S3 client. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18395) [C++] Move select-k implementation into separate module
Antoine Pitrou created ARROW-18395: -- Summary: [C++] Move select-k implementation into separate module Key: ARROW-18395 URL: https://issues.apache.org/jira/browse/ARROW-18395 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou The select-k kernel implementations are currently in {{vector_sort.cc}}, amongst other things. To make the code more readable and faster to compile, we should move them into their own file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18396) [C++] Move rank implementation into separate module
Antoine Pitrou created ARROW-18396: -- Summary: [C++] Move rank implementation into separate module Key: ARROW-18396 URL: https://issues.apache.org/jira/browse/ARROW-18396 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou The rank kernel implementations are currently in {{vector_sort.cc}}, amongst other things. To make the code more readable and faster to compile, we should move them into their own file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18394) [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail
Raúl Cumplido created ARROW-18394: - Summary: [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail Key: ARROW-18394 URL: https://issues.apache.org/jira/browse/ARROW-18394 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Fix For: 11.0.0 Currently the following jobs fail:
|test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
|test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
with:
{code:java}
_________________ test_roundtrip_with_bytes_unicode[columns0] _________________

columns = [b'foo']

    @pytest.mark.parametrize('columns', ([b'foo'], ['foo']))
    def test_roundtrip_with_bytes_unicode(columns):
        df = pd.DataFrame(columns=columns)
        table1 = pa.Table.from_pandas(df)
>       table2 = pa.Table.from_pandas(table1.to_pandas())

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
    ???
pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
    ???
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819: in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935: in _deserialize_column_index
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154: in _reconstruct_columns_from_metadata
    level = level.astype(dtype)
opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029: in astype
    return Index(new_values, name=self.name, dtype=new_values.dtype, copy=False)
opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518: in __new__
    klass = cls._dtype_to_subclass(arr.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pandas.core.indexes.base.Index'>, dtype = dtype('S3')

    @final
    @classmethod
    def _dtype_to_subclass(cls, dtype: DtypeObj):
        # Delay import for perf. https://github.com/pandas-dev/pandas/pull/31423
        if isinstance(dtype, ExtensionDtype):
            if isinstance(dtype, DatetimeTZDtype):
                from pandas import DatetimeIndex
                return DatetimeIndex
            elif isinstance(dtype, CategoricalDtype):
                from pandas import CategoricalIndex
                return CategoricalIndex
            elif isinstance(dtype, IntervalDtype):
                from pandas import IntervalIndex
                return IntervalIndex
            elif isinstance(dtype, PeriodDtype):
                from pandas import PeriodIndex
                return PeriodIndex
            return Index
        if dtype.kind == "M":
            from pandas import DatetimeIndex
            return DatetimeIndex
        elif dtype.kind == "m":
            from pandas import TimedeltaIndex
            return TimedeltaIndex
        elif dtype.kind == "f":
            from pandas.core.api import Float64Index
            return Float64Index
        elif dtype.kind == "u":
            from pandas.core.api import UInt64Index
            return UInt64Index
        elif dtype.kind == "i":
            from pandas.core.api import Int64Index
            return Int64Index
        elif dtype.kind == "O":
            # NB: assuming away MultiIndex
            return Index
        elif issubclass(
            dtype.type, (str, bool, np.bool_, complex, np.complex64, np.complex128)
        ):
            return Index

>       raise NotImplementedError(dtype)
E       NotImplementedError: |S3

opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595: NotImplementedError
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18393) [Docs][R] Include warning when viewing old docs (redirecting to stable docs)
Nicola Crane created ARROW-18393: Summary: [Docs][R] Include warning when viewing old docs (redirecting to stable docs) Key: ARROW-18393 URL: https://issues.apache.org/jira/browse/ARROW-18393 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Assignee: Alenka Frim Now that we have versioned docs, we also have old versions of the developer docs (eg https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (eg regarding communication channels, build instructions, etc), and typically when contributing to / developing with the latest arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs, similarly to how some projects warn about viewing old docs in general and point to the stable docs (eg https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could have a custom box on pages under /developers pointing to the dev docs instead of the stable docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18392) [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket
Raúl Cumplido created ARROW-18392: - Summary: [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket Key: ARROW-18392 URL: https://issues.apache.org/jira/browse/ARROW-18392 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Fix For: 11.0.0 Several nightly tests fail with:
{code:java}
=================================== FAILURES ===================================
____________________________ test_s3fs_wrong_region ____________________________

    @pytest.mark.s3
    def test_s3fs_wrong_region():
        from pyarrow.fs import S3FileSystem

        # wrong region for bucket
        fs = S3FileSystem(region='eu-north-1')
        msg = ("When getting information for bucket 'voltrondata-labs-datasets': "
               r"AWS Error UNKNOWN \(HTTP status 301\) during HeadBucket "
               "operation: No response body. Looks like the configured region is "
               "'eu-north-1' while the bucket is located in 'us-east-2'."
               "|NETWORK_CONNECTION")
        with pytest.raises(OSError, match=msg) as exc:
            fs.get_file_info("voltrondata-labs-datasets")

        # Sometimes fails on unrelated network error, so next call would also fail.
        if 'NETWORK_CONNECTION' in str(exc.value):
            return

        fs = S3FileSystem(region='us-east-2')
>       fs.get_file_info("voltrondata-labs-datasets")

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_fs.py:1339:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/_fs.pyx:571: in pyarrow._fs.FileSystem.get_file_info
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: When getting information for bucket 'voltrondata-labs-datasets': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
{code}
I can't seem to reproduce this locally, but it is pretty consistent:
* [test-conda-python-3.10|https://github.com/ursacomputing/crossbow/actions/runs/3528202639/jobs/5918051269]
* [test-conda-python-3.11|https://github.com/ursacomputing/crossbow/actions/runs/3528201175/jobs/5918048135]
* [test-conda-python-3.7|https://github.com/ursacomputing/crossbow/actions/runs/3528195566/jobs/5918035812]
* [test-conda-python-3.7-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528211334/jobs/5918069152]
* [test-conda-python-3.8|https://github.com/ursacomputing/crossbow/actions/runs/3528193702/jobs/5918032370]
* [test-conda-python-3.8-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528213536/jobs/5918073481]
* [test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3528205157/jobs/5918056277]
* [test-conda-python-3.9|https://github.com/ursacomputing/crossbow/actions/runs/3528202402/jobs/5918050613]
* [test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3528210560/jobs/5918067302]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18391) [R] Fix the version selector dropdown
Nicola Crane created ARROW-18391: Summary: [R] Fix the version selector dropdown Key: ARROW-18391 URL: https://issues.apache.org/jira/browse/ARROW-18391 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane ARROW-17887 updates the docs to use Bootstrap 5 which will break the docs version dropdown selector, as it relies on replacing a page element, but the page elements are different in this version of Bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18390) [CI][Python] Nightly python test for spark master missing test module
Raúl Cumplido created ARROW-18390: - Summary: [CI][Python] Nightly python test for spark master missing test module Key: ARROW-18390 URL: https://issues.apache.org/jira/browse/ARROW-18390 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 11.0.0 Currently the nightly test with spark master [test-conda-python-3.9-spark-master|https://github.com/ursacomputing/crossbow/actions/runs/3528196313/jobs/5918037939] fails with:
{code:java}
Starting test(python): pyspark.sql.tests.test_pandas_map (temp output: /spark/python/target/cbca1b18-4af7-4205-aa41-8c945bf1cf58/python__pyspark.sql.tests.test_pandas_map__9ptzo8sa.log)
/opt/conda/envs/arrow/bin/python: No module named pyspark.sql.tests.test_pandas_grouped_map
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18389) [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0
Raúl Cumplido created ARROW-18389: - Summary: [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0 Key: ARROW-18389 URL: https://issues.apache.org/jira/browse/ARROW-18389 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 11.0.0 [ARROW-18173|https://issues.apache.org/jira/browse/ARROW-18173] removed support for pandas < 1.0, so we should upgrade the nightly test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18388) [C++] Decide on duplicate column handling in scanner, add more tests
Weston Pace created ARROW-18388: --- Summary: [C++] Decide on duplicate column handling in scanner, add more tests Key: ARROW-18388 URL: https://issues.apache.org/jira/browse/ARROW-18388 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace When a schema has duplicate column names it can be difficult to know how to map between the fragment schema and the dataset schema in the default evolution strategy. It's not clear from the comments describing evolution what the exact behavior is right now. Some suggestions have been: * Grab the first column in the fragment schema with the same name * Always error if there are duplicate columns * Allow duplicate columns but expect there to be the same # of occurrences in both the fragment and dataset schema and assume the order is consistent -- This message was sent by Atlassian Jira (v8.20.10#820010)
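For illustration, duplicate names are perfectly legal in a schema, which is what makes the fragment-to-dataset mapping ambiguous (a small pyarrow demonstration; the scanner behavior under discussion lives in C++):
{code:python}
import pyarrow as pa

schema = pa.schema([pa.field("a", pa.int32()), pa.field("a", pa.string())])
print(schema.get_all_field_indices("a"))  # [0, 1] -> which one should a fragment map to?
{code}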
[jira] [Created] (ARROW-18387) [C++] Create many-column scanner microbenchmarks
Weston Pace created ARROW-18387: --- Summary: [C++] Create many-column scanner microbenchmarks Key: ARROW-18387 URL: https://issues.apache.org/jira/browse/ARROW-18387 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace When developing we often assume schemas are cheap and small only to find out later that we create easily avoided bottlenecks for users that have very large schemas. We should create some micro-benchmarks for the scanner that verify we get roughly the same performance, in data-bytes-per-second, with many-columns as we do with few-columns (note, that we probably suffer in rows-per-second since we are loading more columns and thus more data). This might also be a good time to create similar benchmarks for dataset discovery. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18386) [C++] Add support for filename, file index, and batch index columns to exec plan based scanner
Weston Pace created ARROW-18386: --- Summary: [C++] Add support for filename, file index, and batch index columns to exec plan based scanner Key: ARROW-18386 URL: https://issues.apache.org/jira/browse/ARROW-18386 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace Assignee: Weston Pace The old scanner currently appends these three fields to all outgoing batches. In retrospect, this caused some confusion, so I'd like to handle it slightly differently, where the user is able to request these fields, but they are not automatically appended. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18385) [Java]
Jacob Wujciak-Jens created ARROW-18385: -- Summary: [Java] Key: ARROW-18385 URL: https://issues.apache.org/jira/browse/ARROW-18385 Project: Apache Arrow Issue Type: Wish Components: Java Reporter: Jacob Wujciak-Jens Fix For: 11.0.0 Attachments: image.png While verifying 10.0.1 I came across this java test error that is caused by a mismatch in the ordering of the JSON metadata description (see attached image):
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.177 s <<< FAILURE! - in org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest
[ERROR] org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest.schemaCommentWithDatabaseMetadata Time elapsed: 0.141 s <<< FAILURE!
org.opentest4j.AssertionFailedError:
cc [~lidavidm]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18384) [Release][MSYS2] Show pull request title
Kouhei Sutou created ARROW-18384: Summary: [Release][MSYS2] Show pull request title Key: ARROW-18384 URL: https://issues.apache.org/jira/browse/ARROW-18384 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18383) [C++] Avoid global variables for thread pools and at-fork handlers
Antoine Pitrou created ARROW-18383: -- Summary: [C++] Avoid global variables for thread pools and at-fork handlers Key: ARROW-18383 URL: https://issues.apache.org/jira/browse/ARROW-18383 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 11.0.0 Investigation revealed an issue where the global IO thread pool could be constructed before the at-fork handler internal state. The IO thread pool, created on library load, would register an at-fork handler; then, the at-fork handler state would be initialized and clobber the handler registered just before. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18382) [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds
Antoine Pitrou created ARROW-18382: -- Summary: [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds Key: ARROW-18382 URL: https://issues.apache.org/jira/browse/ARROW-18382 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fuzzing builds (as run by OSS-Fuzz) enable Address Sanitizer through their own set of options rather than by enabling {{ARROW_USE_ASAN}}. However, the Arrow source code needs to be informed of this situation. One example of where this matters: eternal thread pools produce spurious leaks at shutdown because of the vector of at-fork handlers; this therefore needs to be worked around in those builds. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18381) MIGRATION: Create milestones for every needed fix version
Todd Farmer created ARROW-18381: --- Summary: MIGRATION: Create milestones for every needed fix version Key: ARROW-18381 URL: https://issues.apache.org/jira/browse/ARROW-18381 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer The Apache Arrow project uses the "Fix version" field in ASF Jira issues to track the version in which issues were resolved/fixed/implemented. The closest equivalent field in GitHub issues is the "milestone" field. This field is explicitly managed - the versions need to be added to the repository configuration before they can be used. This mapping needs to be established as a prerequisite for completing the import from ASF Jira. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18380) MIGRATION: Enable bot handling of GitHub issue linked PRs
Todd Farmer created ARROW-18380: --- Summary: MIGRATION: Enable bot handling of GitHub issue linked PRs Key: ARROW-18380 URL: https://issues.apache.org/jira/browse/ARROW-18380 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer GitHub workflows for the Apache Arrow project assume that PRs reference ASF Jira issues (or are minor changes). This needs to be revisited now that GitHub issue reporting is enabled, as there may well be no ASF Jira issue to link a PR against going forward. The resulting bot comments can be confusing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18379) [Python] Change warnings to _warnings in _plasma_store_entry_point
Alenka Frim created ARROW-18379: --- Summary: [Python] Change warnings to _warnings in _plasma_store_entry_point Key: ARROW-18379 URL: https://issues.apache.org/jira/browse/ARROW-18379 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Alenka Frim Assignee: Alenka Frim There is a leftover in {{python/pyarrow/__init__.py}} from [https://github.com/apache/arrow/pull/14343] due to {{warnings}} being imported as {{_warnings}}. Connected GitHub issue: [https://github.com/apache/arrow/issues/14693] -- This message was sent by Atlassian Jira (v8.20.10#820010)
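A minimal illustration of the bug class (not the actual pyarrow code): the module is imported under the alias {{_warnings}}, so a call through the bare name {{warnings}} fails at runtime.
{code:python}
import warnings as _warnings

def _plasma_store_entry_point():
    # BUG: the module was imported as `_warnings`, so the bare name
    # `warnings` is undefined here; this should be _warnings.warn(...).
    warnings.warn("plasma_store is deprecated", DeprecationWarning)

try:
    _plasma_store_entry_point()
except NameError as exc:
    print(exc)  # name 'warnings' is not defined
{code}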
[jira] [Created] (ARROW-18378) MIGRATION: Disable issue reporting in ASF Jira
Todd Farmer created ARROW-18378: --- Summary: MIGRATION: Disable issue reporting in ASF Jira Key: ARROW-18378 URL: https://issues.apache.org/jira/browse/ARROW-18378 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer ARROW-18364 enabled issue reporting for Apache Arrow in GitHub issues. Even though existing Jira issues have not yet been migrated and are still being worked in the Jira system, we should assess disabling creation of new issues in ASF Jira, and instead pointing users to GitHub issues. This may benefit the project by reducing the need to monitor inflow in two discrete systems. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18377) MIGRATION: Automate component labels from issue form content
Todd Farmer created ARROW-18377: --- Summary: MIGRATION: Automate component labels from issue form content Key: ARROW-18377 URL: https://issues.apache.org/jira/browse/ARROW-18377 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer ARROW-18364 added the ability to report issues in GitHub, and includes GitHub issue templates with a drop-down component(s) selector. These form elements drive the resulting issue markdown only, and cannot dynamically drive issue labels. Automating labels therefore requires GitHub Actions, which also have a few limitations. First, the issue form does not produce any structured data; it only produces the issue description markdown, so a parser is required. Second, ASF restricts GitHub actions to a selection of approved actions. While community actions exist to generate structured data from issue forms, it is likely that the Apache Arrow project will need to write its own parser and label application action. -- This message was sent by Atlassian Jira (v8.20.10#820010)
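A sketch of the kind of parser that might be needed, assuming the issue form renders the selected components under a "### Component(s)" heading; the heading text and label format here are assumptions, not the actual template:
{code:python}
import re

def extract_component_labels(issue_body: str) -> list[str]:
    # Pull the text under the assumed "### Component(s)" heading, stopping
    # at the next heading (or end of body), then split on commas.
    match = re.search(r"### Component\(s\)\s*\n(.+?)(?:\n#{1,6} |\Z)", issue_body, re.S)
    if match is None:
        return []
    components = [c.strip() for c in match.group(1).split(",") if c.strip()]
    return [f"Component: {c}" for c in components]

body = "### Component(s)\n\nPython, C++\n\n### Describe the bug\n..."
print(extract_component_labels(body))  # ['Component: Python', 'Component: C++']
{code}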
[jira] [Created] (ARROW-18376) MIGRATION: Add component labels to GitHub
Todd Farmer created ARROW-18376: --- Summary: MIGRATION: Add component labels to GitHub Key: ARROW-18376 URL: https://issues.apache.org/jira/browse/ARROW-18376 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer Similar to ARROW-18375, component labels have been established based on existing component values defined in ASF Jira. The following labels are needed: * Component: Archery * Component: Benchmarking * Component: C * Component: C# * Component: C++ * Component: C++ - Gandiva * Component: C++ - Plasma * Component: Continuous Integration * Component: Dart * Component: Developer Tools * Component: Documentation * Component: FlightRPC * Component: Format * Component: GLib * Component: Go * Component: GPU * Component: Integration * Component: Java * Component: JavaScript * Component: MATLAB * Component: Packaging * Component: Parquet * Component: Python * Component: R * Component: Ruby * Component: Swift * Component: Website * Component: Other -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18375) MIGRATION: Enable GitHub issue type labels
Todd Farmer created ARROW-18375: --- Summary: MIGRATION: Enable GitHub issue type labels Key: ARROW-18375 URL: https://issues.apache.org/jira/browse/ARROW-18375 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer As part of enabling GitHub issue reporting, the following labels have been defined and need to be added to the repository label options. Without these labels added, [new issues|https://github.com/apache/arrow/issues/14692] do not get the issue template-defined issue type labels set properly. Labels: * Type: bug * Type: enhancement * Type: usage -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18374) [Go][CI][Benchmarks] Fix Go Bench Script after conbench change
Matthew Topol created ARROW-18374: - Summary: [Go][CI][Benchmarks] Fix Go Bench Script after conbench change Key: ARROW-18374 URL: https://issues.apache.org/jira/browse/ARROW-18374 Project: Apache Arrow Issue Type: Bug Components: Benchmarking, Continuous Integration, Go Reporter: Matthew Topol Assignee: Matthew Topol Change [https://github.com/conbench/conbench/pull/417/files#] now requires passing an explicit {{github=None}} argument to {{BenchmarkResult}} to have it pull the GitHub info from the locally cloned repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
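For illustration, roughly what the call-site change looks like; only {{github=None}} is taken from the report above, while the import path and the other keyword arguments are assumptions standing in for whatever the Go bench script already sends:
{code:python}
from benchadapt.result import BenchmarkResult  # assumed import path

# Only `github=None` is taken from the report; the other fields below are
# illustrative placeholders.
result = BenchmarkResult(
    run_reason="commit",
    stats={"data": [0.123], "unit": "s", "iterations": 1},
    tags={"name": "BenchmarkExample"},
    context={"benchmark_language": "Go"},
    github=None,  # explicit None: derive repo/commit from the local clone
)
{code}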
[jira] [Created] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates
Todd Farmer created ARROW-18373: --- Summary: MIGRATION: Enable multiple component selection in issue templates Key: ARROW-18373 URL: https://issues.apache.org/jira/browse/ARROW-18373 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], we would like to enable selection of multiple components when reporting issues via GitHub issues. Additionally, we may want to add the needed Apache license to the issue templates and remove the exclusion rules from rat_exclude_files.txt. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18372) [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
Lucas Mation created ARROW-18372: Summary: [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell Key: ARROW-18372 URL: https://issues.apache.org/jira/browse/ARROW-18372 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 10.0.0 Reporter: Lucas Mation I have a large Parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset. Notice that the "collected" dataset is supposed to be only one row and one cell, containing the count (I've confirmed this by subsetting the dataset with "%>% head(10^6)" before computing the count, and it works). That is why the error below is so weird: ``` fa <- 'myparteq folder' #huge va <- open_dataset(fa) tic() d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect toc() Error in `collect()`: ! Invalid: negative malloc size Run `rlang::last_error()` to see where the error occurred. > rlang::last_error() Error in `collect()`: ! Invalid: negative malloc size --- Backtrace: 1. ... %>% collect 3. arrow:::collect.arrow_dplyr_query(.) Run `rlang::last_trace()` to see the full context. > rlang::last_trace() Error in `collect()`: ! Invalid: negative malloc size --- Backtrace: x 1. +-... %>% collect 2. +-dplyr::collect(.) 3. \-arrow:::collect.arrow_dplyr_query(.) 4. \-base::tryCatch(...) 5. \-base (local) tryCatchList(expr, classes, parentenv, handlers) 6. \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]]) 7. \-value[[3L]](cond) 8. \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema) 9. \-rlang::abort(msg, call = call) ``` I am running this on a Windows server with 512 GB of RAM. sessionInfo() R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows Server 2012 R2 x64 (build 9600) Matrix products: default locale: [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C [5] LC_TIME=Portuguese_Brazil.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 [9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0 [17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1 loaded via a namespace (and not attached): [1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1 [8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0 [15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3 [22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0 [29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 [36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3 [43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0 [50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1 arrow_info() Arrow package version: 10.0.0 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE gcs TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc FALSE mimalloc TRUE Arrow options(): 
arrow.use_threads FALSE Memory: Allocator mimalloc Current 74.82 Gb Max 97.75 Gb Runtime: SIMD Level avx2 Detected SIMD Level avx2 Build: C++ Library Version 10.0.0 C++ Compiler GNU C++ Compiler Version 10.3.0 Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18371) [C++] Expose *FromJSON helpers
Rok Mihevc created ARROW-18371: -- Summary: [C++] Expose *FromJSON helpers Key: ARROW-18371 URL: https://issues.apache.org/jira/browse/ARROW-18371 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc {{ArrayFromJSON}}, {{ExecBatchFromJSON}} and {{RecordBatchFromJSON}} helper functions would be useful when testing in projects that use Arrow. {{BatchesWithSchema}} and {{MakeBasicBatches}} could be considered as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18370) [Python] `ds.write_dataset` doesn't allow feather compression
Yu Zhu created ARROW-18370: -- Summary: [Python] `ds.write_dataset` doesn't allow feather compression Key: ARROW-18370 URL: https://issues.apache.org/jira/browse/ARROW-18370 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Yu Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18369) [C++] Support nested references as segment ids
Yaron Gvili created ARROW-18369: --- Summary: [C++] Support nested references as segment ids Key: ARROW-18369 URL: https://issues.apache.org/jira/browse/ARROW-18369 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18368) [Python] Expose grouping segment keys to PyArrow
Yaron Gvili created ARROW-18368: --- Summary: [Python] Expose grouping segment keys to PyArrow Key: ARROW-18368 URL: https://issues.apache.org/jira/browse/ARROW-18368 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Yaron Gvili This is a [follow-up task|https://github.com/apache/arrow/pull/14352#discussion_r1026926422] for a PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18367) Enable using InMemoryDataset to create substrait plans
Jianshen Liu created ARROW-18367: Summary: Enable using InMemoryDataset to create substrait plans Key: ARROW-18367 URL: https://issues.apache.org/jira/browse/ARROW-18367 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jianshen Liu Fix For: 11.0.0 We think that the `Named Table` relation supported by Substrait is an important abstraction in HPC for enabling remote execution. To enable the creation of named tables with the `engine::SerializePlan` API, we would like to add support for scan nodes backed by an `InMemoryDataset` when converting to Substrait plans. The idea is to save the `names` of a named table in the metadata of the schema used to create the InMemoryDataset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
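A Python-side sketch of the metadata idea; the {{substrait.named_table}} key is an illustrative choice, not a spec'd name:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Stash the named-table name in the schema metadata of the table backing the
# InMemoryDataset; "substrait.named_table" is an illustrative key only.
table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata(
    {"substrait.named_table": "my_table"}
)
dataset = ds.InMemoryDataset(table)
print(dataset.schema.metadata)  # {b'substrait.named_table': b'my_table'}
{code}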
[jira] [Created] (ARROW-18366) [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9
Kouhei Sutou created ARROW-18366: Summary: [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9 Key: ARROW-18366 URL: https://issues.apache.org/jira/browse/ARROW-18366 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou https://github.com/ursacomputing/crossbow/actions/runs/3502784911/jobs/5867407921#step:6:4748 {noformat} FAILED: gandiva-glib/Gandiva-1.0.gir env PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/share/pkgconfig:/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/meson-uninstalled /usr/bin/g-ir-scanner --quiet --no-libtool --namespace=Gandiva --nsversion=1.0 --warn-all --output gandiva-glib/Gandiva-1.0.gir --c-include=gandiva-glib/gandiva-glib.h --warn-all --include-uninstalled=./arrow-glib/Arrow-1.0.gir -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/gandiva-glib -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src --filelist=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib/libgandiva-glib.so.1100.0.0.p/Gandiva_1.0_gir_filelist --include=Arrow-1.0 --symbol-prefix=ggandiva --identifier-prefix=GGandiva --pkg-export=gandiva-glib --cflags-begin -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. 
-I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 -I/usr/include/gobject-introspection-1.0 --cflags-end --add-include-path=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib --add-include-path=/usr/share/gir-1.0 -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib --library gandiva-glib -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release --extra-library=gobject-2.0 --extra-library=glib-2.0 --extra-library=girepository-1.0 --sources-top-dirs /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/ --sources-top-dirs /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/ --warn-error /usr/bin/ld: /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release/libgandiva.so.1100: undefined reference to `std::__glibcxx_assert_fail(char const*, int, char const*, char const*)' collect2: error: ld returned 1 exit status {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18365) [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and decoding
Rok Mihevc created ARROW-18365: -- Summary: [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and decoding Key: ARROW-18365 URL: https://issues.apache.org/jira/browse/ARROW-18365 Project: Apache Arrow Issue Type: New Feature Components: C++, Parquet Reporter: Rok Mihevc [As suggested here|https://github.com/apache/arrow/pull/14191#discussion_r1019762308], a SIMD approach such as [FastDifferentialCoding|https://github.com/lemire/FastDifferentialCoding] could be used to speed up encoding and decoding. -- This message was sent by Atlassian Jira (v8.20.10#820010)
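For intuition, delta decoding is essentially a prefix sum over the stored deltas, which is exactly the loop such SIMD kernels vectorize; a scalar NumPy sketch of the concept (ignoring the bit-packing of blocks/miniblocks):
{code:python}
import numpy as np

values = np.array([100, 103, 101, 106, 110], dtype=np.int64)

# DELTA_BINARY_PACKED conceptually stores a first value plus the deltas
# between consecutive values (bit-packed in blocks/miniblocks).
first, deltas = values[0], np.diff(values)

# Decoding reduces to a running (prefix) sum over the deltas -- the hot loop
# that a SIMD kernel like FastDifferentialCoding accelerates.
decoded = np.concatenate(([first], first + np.cumsum(deltas)))
assert np.array_equal(decoded, values)
{code}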
[jira] [Created] (ARROW-18364) MIGRATION: Update GitHub issue templates to support bug reports and feature requests
Todd Farmer created ARROW-18364: --- Summary: MIGRATION: Update GitHub issue templates to support bug reports and feature requests Key: ARROW-18364 URL: https://issues.apache.org/jira/browse/ARROW-18364 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer The [GitHub issue creation page for Arrow|https://github.com/apache/arrow/issues/new/choose] directs users to open bug reports in Jira. Now that ASF Infra has disabled self-service registration in Jira, and in light of the pending migration of Apache Arrow issue tracking from ASF Jira to GitHub issues, we should enable bug reports to be submitted via GitHub directly. Issue templates will help distinguish bug reports and feature requests from existing usage assistance questions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)
Joris Van den Bossche created ARROW-18363: - Summary: [Docs] Include warning when viewing old contributing docs (redirecting to dev docs) Key: ARROW-18363 URL: https://issues.apache.org/jira/browse/ARROW-18363 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Now that we have versioned docs, we also have old versions of the developer docs (eg https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (eg regarding communication channels, build instructions, etc), and when contributing to / developing with the latest arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs, similar to how some projects warn about viewing old docs in general and point to the stable docs (eg https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could show a custom box on pages under /developers that points to the dev docs instead of the stable docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18362) Accelerate Parquet bit-packing decoding with ICX AVX-512
zhaoyaqi created ARROW-18362: Summary: Accelerate Parquet bit-packing decoding with ICX AVX-512 Key: ARROW-18362 URL: https://issues.apache.org/jira/browse/ARROW-18362 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: zhaoyaqi h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18361) [CI][Conan] Merge upstream changes
Kouhei Sutou created ARROW-18361: Summary: [CI][Conan] Merge upstream changes Key: ARROW-18361 URL: https://issues.apache.org/jira/browse/ARROW-18361 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou Updated: https://github.com/conan-io/conan-center-index/pull/14111 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18360) [Python] Incorrectly passing schema=None to do_put crashes
Bryan Cutler created ARROW-18360: Summary: [Python] Incorrectly passing schema=None to do_put crashes Key: ARROW-18360 URL: https://issues.apache.org/jira/browse/ARROW-18360 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: Bryan Cutler In pyarrow.flight, passing an incorrect value of None for schema in do_put will lead to a core dump. In pyarrow 9.0.0, entering the command leads to a segmentation fault: {code} In [3]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), schema=None) Segmentation fault (core dumped) {code} In pyarrow 7.0.0, the kernel crashes after attempting to access the writer, and I got the following: {code} In [38]: client = flight.FlightClient('grpc+tls://localhost:9643', disable_server_verification=True) In [39]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), None) In [40]: writer./home/conda/feedstock_root/build_artifacts/arrow-cpp-ext_1644752264449/work/cpp/src/arrow/flight/client.cc:736: Check failed: (batch_writer_) != (nullptr) miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(+0x66288c)[0x7f0feeae088c] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0x101)[0x7f0feeae0c91] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.700(+0x7c1e1)[0x7f0fa9e331e1] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so(+0x17cf1a)[0x7f0fefe7ff1a] miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(+0x144814)[0x559a7cb8f814] miniconda3/envs/dev/bin/python(+0x1445bf)[0x559a7cb8f5bf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44] miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(+0x151ef3)[0x559a7cb9cef3] miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44] miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x1311)[0x559a7cb7fbd1] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x66f)[0x559a7cb7ef2f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] 
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] miniconda3/envs/dev/bin/python(+0x1416f5)[0x559a7cb8c6f5] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x52)[0x559a7cb8c4a2] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x9ca)[0x559a7cb7f28a] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python(+0x1602d9)[0x559a7cbab2d9] miniconda3/envs/dev/bin/python(+0x19d5f5)[0x559a7cbe85f5] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python
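For reference, the crash-free call passes a concrete schema; a minimal sketch, where the server address and command are placeholders:
{code:python}
import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:8815")  # placeholder address
table = pa.table({"a": [1, 2, 3]})
descriptor = flight.FlightDescriptor.for_command(b"my-command")  # placeholder

# do_put expects a real Schema here; passing schema=None is what should
# raise a clean TypeError/ArrowInvalid instead of crashing the process.
writer, reader = client.do_put(descriptor, table.schema)
writer.write_table(table)
writer.close()
{code}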
[jira] [Created] (ARROW-18359) PrettyPrint Improvements
Will Jones created ARROW-18359: -- Summary: PrettyPrint Improvements Key: ARROW-18359 URL: https://issues.apache.org/jira/browse/ARROW-18359 Project: Apache Arrow Issue Type: Improvement Components: C++, Python, R Reporter: Will Jones We have some pretty printing capabilities, but we may want to think at a high level about the design first. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18358) [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow
Nicola Crane created ARROW-18358: Summary: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow Key: ARROW-18358 URL: https://issues.apache.org/jira/browse/ARROW-18358 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane In order to make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of open_dataset specifically for reading CSVs, with a signature more closely matching that of read_csv_arrow. This would just pass the arguments through to open_dataset (in the ellipses), but would make it simpler to have a docs page showing these options explicitly, and thus be clearer for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18357) [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow
Nicola Crane created ARROW-18357: Summary: [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow Key: ARROW-18357 URL: https://issues.apache.org/jira/browse/ARROW-18357 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane The {{read_csv_arrow()}} function allows users to pass in options via its parse_options, convert_options, and read_options parameters. We could also accept these in {{open_dataset()}}, so that users can more easily switch between {{read_csv_arrow()}} and {{open_dataset()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18356) [R] Handle as_data_frame argument if passed into open_dataset for CSVs
Nicola Crane created ARROW-18356: Summary: [R] Handle as_data_frame argument if passed into open_dataset for CSVs Key: ARROW-18356 URL: https://issues.apache.org/jira/browse/ARROW-18356 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane Currently, if the argument {{as_data_frame}} is passed into {{open_dataset()}} with a CSV format dataset, the error message returned is: {code:r} Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "as_data_frame" {code} Instead, we could silently ignore it if as_data_frame is set to {{FALSE}} and give a more helpful error if set to {{TRUE}} (i.e. direct user to call {{as.data.frame()}} or {{collect()}}). Reasoning: it'd be great to get to a point where users can just swap their {{read_csv_arrow()}} syntax for {{open_dataset()}} and get helpful results. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null
Nicola Crane created ARROW-18355: Summary: [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null Key: ARROW-18355 URL: https://issues.apache.org/jira/browse/ARROW-18355 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18354) [R] Better document the CSV read/parse/convert options we can use with open_dataset()
Nicola Crane created ARROW-18354: Summary: [R] Better document the CSV read/parse/convert options we can use with open_dataset() Key: ARROW-18354 URL: https://issues.apache.org/jira/browse/ARROW-18354 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane When a user opens a CSV dataset using open_dataset, they can take advantage of a lot of different options which can be specified via {{CsvReadOptions$create()}} etc. However, as these are passed in via the ellipses ({{...}}) argument, it's not particularly clear to users which arguments are supported. They are not documented in the {{open_dataset()}} docs, and the matter is further confused (see the code for {{CsvFileFormat$create()}}) by the fact that we support a mix of Arrow and readr parameters. We should better document the arguments we do support. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18353) [C++][Flight] Sporadic hang in UCX tests
Antoine Pitrou created ARROW-18353: -- Summary: [C++][Flight] Sporadic hang in UCX tests Key: ARROW-18353 URL: https://issues.apache.org/jira/browse/ARROW-18353 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Antoine Pitrou The UCX tests sometimes hang here. Full gdb backtraces for all threads: {code} Thread 8 (Thread 0x7f4562fcd700 (LWP 76837)): #0 0x7f4577b72ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x564ebe5b5b3c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #1 __pthread_cond_wait_common (abstime=0x0, mutex=0x564ebe5b5ae0, cond=0x564ebe5b5b10) at pthread_cond_wait.c:502 #2 __pthread_cond_wait (cond=0x564ebe5b5b10, mutex=0x564ebe5b5ae0) at pthread_cond_wait.c:655 #3 0x7f457b4ce7cb in std::condition_variable::wait >(std::unique_lock &, struct {...}) (this=0x564ebe5b5b10, __lock=..., __p=...) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/condition_variable:111 #4 0x7f457b4c7b5e in arrow::flight::transport::ucx::(anonymous namespace)::WriteClientStream::WritesDone (this=0x564ebe5b5a90) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:277 #5 0x7f457b4cc989 in arrow::flight::transport::ucx::(anonymous namespace)::UcxClientStream::DoFinish (this=0x564ebe5b5a90) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:692 #6 0x7f457af80e04 in arrow::flight::internal::ClientDataStream::Finish (this=0x564ebe5b5a90, st=...) at /arrow/cpp/src/arrow/flight/transport.cc:46 #7 0x7f457af4f6e1 in arrow::flight::ClientMetadataReader::ReadMetadata (this=0x564ebe560630, out=0x7f4562fcc170) at /arrow/cpp/src/arrow/flight/client.cc:263 #8 0x7f457b593af6 in operator() (__closure=0x564ebe4e4848) at /arrow/cpp/src/arrow/flight/test_definitions.cc:1538 #9 0x7f457b5b66b8 in std::__invoke_impl >(std::__invoke_other, struct {...} &&) (__f=...) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:60 #10 0x7f457b5b6529 in std::__invoke >(struct {...} &&) (__fn=...) 
at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:95 #11 0x7f457b5b63c4 in std::thread::_Invoker > >::_M_invoke<0>(std::_Index_tuple<0>) ( this=0x564ebe4e4848) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:264 #12 0x7f457b5b6224 in std::thread::_Invoker > >::operator()(void) ( this=0x564ebe4e4848) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:271 #13 0x7f457b5b5e1e in std::thread::_State_impl > > >::_M_run(void) (this=0x564ebe4e4840) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:215 #14 0x7f4578242a93 in std::execute_native_thread_routine (__p=) at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/new_allocator.h:82 #15 0x7f4577b6c6db in start_thread (arg=0x7f4562fcd700) at pthread_create.c:463 #16 0x7f4577ea561f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 7 (Thread 0x7f45725ca700 (LWP 76828)): #0 0x7f4577ea5947 in epoll_wait (epfd=36, events=events@entry=0x7f45725c86c0, maxevents=16, timeout=timeout@entry=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x7f45779fe3e3 in ucs_event_set_wait (event_set=0x7f4564026240, num_events=num_events@entry=0x7f45725c8804, timeout_ms=timeout_ms@entry=0, event_set_handler=event_set_handler@entry=0x7f4575d29320 , arg=arg@entry=0x7f45725c8800) at sys/event_set.c:198 #2 0x7f4575d29283 in uct_tcp_iface_progress (tl_iface=0x7f4564026900) at tcp/tcp_iface.c:327 #3 0x7f4577a7de22 in ucs_callbackq_dispatch (cbq=) at /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211 #4 uct_worker_progress (worker=) at /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638 #5 ucp_worker_progress (worker=0x7f4564000c80) at core/ucp_worker.c:2782 #6 0x7f457b4f186f in arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress (this=0x7f456404d3b0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759 #7 0x7f457b4eee40 in arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame (this=0x7f456404d3b0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449 #8 0x7f457b4f3661 in arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame (this=0x7f456c0016d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1037 #9 0x7f457b4d8c43 in arrow::flight::transport::ucx::(anonymous namespace)::PutServerStream::ReadImpl (this=0x7f45725c8b60, data=0x7f45725c8af0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:153 #10 0x7f457b4d8525 in arrow::flight::tra
[jira] [Created] (ARROW-18352) [R] Datasets API interface improvements
Nicola Crane created ARROW-18352: Summary: [R] Datasets API interface improvements Key: ARROW-18352 URL: https://issues.apache.org/jira/browse/ARROW-18352 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket for improvements for our interface to the datasets API, and making the experience more consistent between {{open_dataset()}} and the {{read_*()}} functions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18351) [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange
Antoine Pitrou created ARROW-18351: -- Summary: [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange Key: ARROW-18351 URL: https://issues.apache.org/jira/browse/ARROW-18351 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Antoine Pitrou I get a non-deterministic crash in the Flight UCX tests. {code} [--] 3 tests from UcxErrorHandlingTest [ RUN ] UcxErrorHandlingTest.TestGetFlightInfo [ OK ] UcxErrorHandlingTest.TestGetFlightInfo (24 ms) [ RUN ] UcxErrorHandlingTest.TestDoPut [ OK ] UcxErrorHandlingTest.TestDoPut (15 ms) [ RUN ] UcxErrorHandlingTest.TestDoExchange /arrow/cpp/src/arrow/util/future.cc:125: Check failed: !IsFutureFinished(state_) Future already marked finished {code} Here is the GDB backtrace: {code} #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x7f18c49cd7f1 in __GI_abort () at abort.c:79 #2 0x7f18c5854e00 in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:72 #3 0x7f18c5854e1c in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:74 #4 0x7f18c5855181 in arrow::util::ArrowLog::~ArrowLog (this=0x7f18c07fc380, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:250 #5 0x7f18c5826f86 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed (this=0x7f18a815f030, state=arrow::FutureState::FAILURE) at /arrow/cpp/src/arrow/util/future.cc:125 #6 0x7f18c58265af in arrow::ConcreteFutureImpl::DoMarkFailed (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:40 #7 0x7f18c5827660 in arrow::FutureImpl::MarkFailed (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:195 #8 0x7f18c80ff8d8 in arrow::Future >::DoMarkFinished (this=0x7f18a815efb0, res=...) at /arrow/cpp/src/arrow/util/future.h:660 #9 0x7f18c80fb37d in arrow::Future >::MarkFinished (this=0x7f18a815efb0, res=...) at /arrow/cpp/src/arrow/util/future.h:403 #10 0x7f18c80f5ae3 in arrow::flight::transport::ucx::UcpCallDriver::Impl::Push (this=0x7f18a804d2d0, status=...) 
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:780 #11 0x7f18c80f5c1f in arrow::flight::transport::ucx::UcpCallDriver::Impl::RecvActiveMessage (this=0x7f18a804d2d0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:791 #12 0x7f18c80f7d29 in arrow::flight::transport::ucx::UcpCallDriver::RecvActiveMessage (this=0x7f18b80017e0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1082 #13 0x7f18c80e3ea4 in arrow::flight::transport::ucx::(anonymous namespace)::UcxServerImpl::HandleIncomingActiveMessage (self=0x7f18a80259a0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:586 #14 0x7f18c4661a09 in ucp_am_invoke_cb (recv_flags=, reply_ep=, data_length=1, data=, user_hdr_length=, user_hdr=0x7f18c8081865, am_id=4132, worker=) at core/ucp_am.c:1220 #15 ucp_am_handler_common (name=, recv_flags=, am_flags=0, reply_ep=, total_length=, am_hdr=0x7f18c808185c, worker=) at core/ucp_am.c:1289 #16 ucp_am_handler_reply (am_arg=, am_data=, am_length=, am_flags=) at core/ucp_am.c:1327 #17 0x7f18c28e3f1c in uct_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, id=, iface=0x7f18a8027e20) at /usr/local/src/conda/ucx-1.13.1/src/uct/base/uct_iface.h:861 #18 uct_mm_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, am_id=, iface=0x7f18a8027e20) at sm/mm/base/mm_iface.h:256 #19 uct_mm_iface_process_recv (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:256 #20 uct_mm_iface_poll_fifo (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:304 #21 uct_mm_iface_progress (tl_iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:357 #22 0x7f18c4686e22 in ucs_callbackq_dispatch (cbq=) at /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211 #23 uct_worker_progress (worker=) at /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638 #24 ucp_worker_progress (worker=0x7f18a80008d0) at core/ucp_worker.c:2782 #25 0x7f18c80f586f in arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress (this=0x7f18a804d2d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759 #26 0x7f18c80f2e40 in arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame (this=0x7f18a804d2d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449 #27 0x7f18c80f7661 in arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame (this=0x7f18b8
[jira] [Created] (ARROW-18350) [C++] Use std::to_chars instead of std::to_string
Antoine Pitrou created ARROW-18350: -- Summary: [C++] Use std::to_chars instead of std::to_string Key: ARROW-18350 URL: https://issues.apache.org/jira/browse/ARROW-18350 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou {{std::to_chars}} is locale-independent unlike {{std::to_string}}; it may also be faster in some cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18349) [CI][C++][Flight] Exercise UCX on CI
Antoine Pitrou created ARROW-18349: -- Summary: [CI][C++][Flight] Exercise UCX on CI Key: ARROW-18349 URL: https://issues.apache.org/jira/browse/ARROW-18349 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration, FlightRPC Reporter: Antoine Pitrou Fix For: 11.0.0 UCX doesn't seem to be enabled in any CI configuration for now. We should have at least a nightly job with UCX enabled, for example one of the Conda or Ubuntu builds. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18348) [CI][Release][Yum] redhat-rpm-config is needed on AlmaLinux 9
Kouhei Sutou created ARROW-18348: Summary: [CI][Release][Yum] redhat-rpm-config is needed on AlmaLinux 9 Key: ARROW-18348 URL: https://issues.apache.org/jira/browse/ARROW-18348 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou Fix For: 10.0.2, 11.0.0 https://github.com/ursacomputing/crossbow/actions/runs/3485133283/jobs/5830385419#step:7:1909 {noformat} Building native extensions. This could take a while... ERROR: Error installing gobject-introspection: ERROR: Failed to build gem native extension. current directory: /usr/local/share/gems/gems/glib2-4.0.3/ext/glib2 /usr/bin/ruby -I /usr/share/rubygems -r ./siteconf20221117-855-v8bktd.rb extconf.rb checking for --enable-debug-build option... no checking for -Wall option to compiler... *** extconf.rb failed *** Could not create Makefile due to some reason, probably lack of necessary libraries and/or headers. Check the mkmf.log file for more details. You may need configuration options. Provided configuration options: --with-opt-dir --without-opt-dir --with-opt-include --without-opt-include=${opt-dir}/include --with-opt-lib --without-opt-lib=${opt-dir}/lib64 --with-make-prog --without-make-prog --srcdir=. --curdir --ruby=/usr/bin/$(RUBY_BASE_NAME) --enable-debug-build --disable-debug-build /usr/share/ruby/mkmf.rb:471:in `try_do': The compiler failed to generate an executable file. (RuntimeError) You have to install development tools first. from /usr/share/ruby/mkmf.rb:597:in `block in try_compile' from /usr/share/ruby/mkmf.rb:546:in `with_werror' from /usr/share/ruby/mkmf.rb:597:in `try_compile' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:65:in `block in try_compiler_option' from /usr/share/ruby/mkmf.rb:971:in `block in checking_for' from /usr/share/ruby/mkmf.rb:361:in `block (2 levels) in postpone' from /usr/share/ruby/mkmf.rb:331:in `open' from /usr/share/ruby/mkmf.rb:361:in `block in postpone' from /usr/share/ruby/mkmf.rb:331:in `open' from /usr/share/ruby/mkmf.rb:357:in `postpone' from /usr/share/ruby/mkmf.rb:970:in `checking_for' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:64:in `try_compiler_option' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:74:in `' from :85:in `require' from :85:in `require' from extconf.rb:27:in `' To see why this extension failed to compile, please check the mkmf.log which can be found here: /usr/local/lib64/gems/ruby/glib2-4.0.3/mkmf.log extconf failed, exit code 1 Gem files will remain installed in /usr/local/share/gems/gems/glib2-4.0.3 for inspection. Results logged to /usr/local/lib64/gems/ruby/glib2-4.0.3/gem_make.out {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18347) [C++] Hook up cancellation to exec plan
Weston Pace created ARROW-18347: --- Summary: [C++] Hook up cancellation to exec plan Key: ARROW-18347 URL: https://issues.apache.org/jira/browse/ARROW-18347 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace There are two ways to cancel an exec plan: calling StopProducing, or cancelling the task group. We should investigate which makes the most sense, and then configure the DeclarationToReader method to support cancelling on discard. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18346) [Python] Dataset writer API papercuts
David Li created ARROW-18346: Summary: [Python] Dataset writer API papercuts Key: ARROW-18346 URL: https://issues.apache.org/jira/browse/ARROW-18346 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.0 Reporter: David Li * Writer options are not very discoverable. Perhaps "file_options" should mention compression as an example of something you can control, so people looking for it know where to go next? * Compression seems like it might be common enough to warrant a top-level parameter somehow (even if it gets implemented differently internally)? * Either way, this needs a cookbook example. * {{make_write_options}} is lacking a docstring * Writer options objects are lacking {{__repr__}}s -- This message was sent by Atlassian Jira (v8.20.10#820010)
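For instance, the path a user currently has to discover looks roughly like this (a sketch; the output directory is a placeholder):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 3]})

# Compression is reachable only through make_write_options on the format
# object, not as a top-level write_dataset parameter -- the discoverability
# papercut described above.
fmt = ds.ParquetFileFormat()
opts = fmt.make_write_options(compression="zstd")
ds.write_dataset(table, "out/", format=fmt, file_options=opts)
{code}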
[jira] [Created] (ARROW-18344) [C++] Use input pre-sortedness to create sorted table with ConcatenateTables
Rok Mihevc created ARROW-18344: -- Summary: [C++] Use input pre-sortedness to create sorted table with ConcatenateTables Key: ARROW-18344 URL: https://issues.apache.org/jira/browse/ARROW-18344 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc When concatenating large sorted tables (e.g. sorted timeseries data), the resulting table is no longer sorted. However, the inputs' sortedness can be used to significantly speed up post-concatenation sorting. A potential API could be to add ConcatenateTablesOptions.inputs_sorted and implement the logic in ConcatenateTables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
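A small Python illustration of why pre-sortedness helps, with {{heapq.merge}} standing in for the proposed ConcatenateTables logic (the option name above is only proposed, not implemented):
{code:python}
import heapq

import pyarrow as pa

t1 = pa.table({"ts": [1, 4, 9]})
t2 = pa.table({"ts": [2, 3, 10]})

# Today: concatenate, then fully re-sort -- O(n log n) over all rows.
resorted = pa.concat_tables([t1, t2]).sort_by("ts")

# With inputs known to be sorted: a k-way merge of the already-sorted
# inputs is O(n log k), which is the speedup the proposal is after.
merged = list(heapq.merge(t1["ts"].to_pylist(), t2["ts"].to_pylist()))
assert merged == resorted["ts"].to_pylist()
{code}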
[jira] [Created] (ARROW-18345) [R] Create a CRAN-specific packaging checklist that lives in the R package directory
Dewey Dunnington created ARROW-18345: Summary: [R] Create a CRAN-specific packaging checklist that lives in the R package directory Key: ARROW-18345 URL: https://issues.apache.org/jira/browse/ARROW-18345 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington Like other packaging tasks, the CRAN packaging task (which is concerned with making sure the R package from the Arrow release complies with CRAN policies) is slightly different from the overall Arrow release task for the R package. For example, we often push patch-patch releases if the two-week window we get to "safely retain the package on CRAN" does not line up with a release vote. [~npr] has heroically been doing this for a long time, and while he has equally heroically volunteered to keep doing it, I am hoping the process of codifying this somewhere in the R repo will help a wider set of contributors understand the process (even if it was already documented elsewhere!). [~stephhazlitt] and I use {{usethis::use_release_issue()}} to manage our personal R package releases, and I'm wondering if creating a similar function or markdown template would help here. I'm happy to start the process of putting a PR up for discussion! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18343) [C++] AllocateBitmap() with out parameter is declared but not defined
Jin Shang created ARROW-18343: - Summary: [C++] AllocateBitmap() with out parameter is declared but not defined Key: ARROW-18343 URL: https://issues.apache.org/jira/browse/ARROW-18343 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Jin Shang [This variant of AllocateBitmap|https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L483] is declared but not defined in buffer.cc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18342) [C++] AsofJoinNode support for Boolean data field
Rok Mihevc created ARROW-18342: -- Summary: [C++] AsofJoinNode support for Boolean data field Key: ARROW-18342 URL: https://issues.apache.org/jira/browse/ARROW-18342 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc This is to add boolean data field support to asof join as proposed here: https://github.com/apache/arrow/pull/14485 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18341) [Doc][Python] Update note about bundling Arrow C++ on Windows
Alenka Frim created ARROW-18341: --- Summary: [Doc][Python] Update note about bundling Arrow C++ on Windows Key: ARROW-18341 URL: https://issues.apache.org/jira/browse/ARROW-18341 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Alenka Frim Assignee: Alenka Frim Fix For: 11.0.0 There is a note on the Python development page, under the Windows section, about bundling the Arrow C++ libraries with Python extensions: [https://arrow.apache.org/docs/dev/developers/python.html#building-on-windows] This note can be revised: * if you are using conda, the fact that the Arrow C++ libs are not bundled is fine, since conda will ensure those libs are found. * If you are not using conda, you have to ensure those libs can be found: either by updating {{PATH}} (every time before importing pyarrow), or by bundling them (... using the {{PYARROW_BUNDLE_ARROW_CPP}} env variable instead of {{--bundle-arrow-cpp}}), with the caveat that they won't be automatically updated when the arrow-cpp libs are rebuilt. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow
Joris Van den Bossche created ARROW-18340: - Summary: [Python] PyArrow C++ header files no longer always included in installed pyarrow Key: ARROW-18340 URL: https://issues.apache.org/jira/browse/ARROW-18340 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Assignee: Alenka Frim Fix For: 10.0.1 We have a Python build env var to control whether the Arrow C++ header files are included in the python package or not ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and only set to False in the conda recipe. After the cmake refactor, the PyArrow C++ header files no longer live in the Arrow C++ package, and so should _always_ be included in the python package, regardless of how arrow-cpp is installed. Initially this was done, but it seems that https://github.com/apache/arrow/pull/13892 removed this unconditional copy of the PyArrow header files to {{pyarrow/include}}. Now they are only copied if {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)