[jira] [Created] (ARROW-18439) Misleading message when loading parquet data with invalid null data
created ARROW-18439: Summary: Misleading message when loading parquet data with invalid null data Key: ARROW-18439 URL: https://issues.apache.org/jira/browse/ARROW-18439 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: I'm saving an Arrow table to Parquet. One column is a list of structs whose elements are marked as non-nullable, but the data isn't valid because I've put a null in one of the nested fields. When I save this data to Parquet and try to load it back, I get a very misleading message:
{code:java}
Length spanned by list offsets (2) larger than values array (length 1){code}
I would rather Arrow complained when creating the table or when saving it to Parquet. Here's how to reproduce the issue:
{code:java}
import io

import pyarrow as pa
import pyarrow.parquet as pq

struct = pa.struct(
    [
        pa.field("nested_string", pa.string(), nullable=False),
    ]
)
schema = pa.schema(
    [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
)
table = pa.table(
    {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]}, schema=schema
)

with io.BytesIO() as file:
    pq.write_table(table, file)
    file.seek(0)
    pq.read_table(file)  # Raises pa.ArrowInvalid
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
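A possible way to surface the problem earlier (my suggestion, not something the report proposes): {{Table.validate(full=True)}} runs the expensive data-level validation checks, which may flag the invalid null before the table is written; I haven't verified that it catches this particular nullability violation:
{code:python}
import pyarrow as pa

# Assuming `table` is the table from the reproducer above.
# full=True enables O(n) checks that inspect the data, not just the metadata.
try:
    table.validate(full=True)
except pa.ArrowInvalid as e:
    print(e)  # would fail here, before write_table is ever reached
{code}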
[jira] [Created] (ARROW-18438) [Go] firstTimeBitmapWriter.Finish() panics with 8n structs
Min-Young Wu created ARROW-18438: Summary: [Go] firstTimeBitmapWriter.Finish() panics with 8n structs Key: ARROW-18438 URL: https://issues.apache.org/jira/browse/ARROW-18438 Project: Apache Arrow Issue Type: Bug Components: Go, Parquet Affects Versions: 10.0.1 Reporter: Min-Young Wu Even after [ARROW-17169|https://issues.apache.org/jira/browse/ARROW-17169] I still get a panic at the same location. Below is a test case that panics:
{code:go}
func (ps *ParquetIOTestSuite) TestStructWithNullableListOfStructs() {
	bldr := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
		arrow.Field{
			Name: "l",
			Type: arrow.ListOf(arrow.StructOf(
				arrow.Field{Name: "a", Type: arrow.BinaryTypes.String},
			)),
		},
	))
	defer bldr.Release()

	lBldr := bldr.FieldBuilder(0).(*array.ListBuilder)
	stBldr := lBldr.ValueBuilder().(*array.StructBuilder)
	aBldr := stBldr.FieldBuilder(0).(*array.StringBuilder)

	bldr.AppendNull()
	bldr.Append(true)
	lBldr.Append(true)
	for i := 0; i < 8; i++ {
		stBldr.Append(true)
		aBldr.Append(strconv.Itoa(i))
	}

	arr := bldr.NewArray()
	defer arr.Release()

	field := arrow.Field{Name: "x", Type: arr.DataType(), Nullable: true}
	expected := array.NewTable(
		arrow.NewSchema([]arrow.Field{field}, nil),
		[]arrow.Column{*arrow.NewColumn(field, arrow.NewChunked(field.Type, []arrow.Array{arr}))},
		-1,
	)
	defer expected.Release()

	ps.roundTripTable(expected, false)
}
{code}
I've tried to trim down the input data and this is as minimal as I could get it. And yes:
* wrapping struct with initial null is required
* the inner list needs to contain 8 structs (or any multiple of 8)
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18437) [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context
Xuwei Fu created ARROW-18437: Summary: [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context Key: ARROW-18437 URL: https://issues.apache.org/jira/browse/ARROW-18437 Project: Apache Arrow Issue Type: Bug Components: Parquet Affects Versions: 11.0.0 Reporter: Xuwei Fu Assignee: Xuwei Fu Fix For: 11.0.0 When calling {{flushValues}}, it doesn't:
* clear {{total_value_count_}}
* re-advance the buffer by {{kMaxPageHeaderWriterSize}}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
James Bourbeau created ARROW-18436: -- Summary: `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space Key: ARROW-18436 URL: https://issues.apache.org/jira/browse/ARROW-18436 Project: Apache Arrow Issue Type: Bug Components: Python Environment:
- OS: macOS
- `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
- `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
Reporter: James Bourbeau When attempting to create a new filesystem object from a public dataset in S3, where there is a space in the object path, an error is raised. Here's a minimal reproducer:
```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
```
which fails with the following traceback:
```
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
```
Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that has a space with a `*` wildcard
```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
```
The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is somehow problematic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
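A possible stopgap (my suggestion, untested against this bucket): percent-encode the space before handing the URI to {{from_uri}}, since the failure happens in URI parsing rather than in S3 itself:
{code:python}
from urllib.parse import quote
from pyarrow.fs import FileSystem

# Hypothetical workaround: encode the space as %20 so the URI parses.
key = quote("trip data/fhvhv_tripdata_2022-06.parquet")
fs, path = FileSystem.from_uri(f"s3://nyc-tlc/{key}")
{code}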
[jira] [Created] (ARROW-18435) [C++][Java] Update ORC to 1.8.1
Gang Wu created ARROW-18435: --- Summary: [C++][Java] Update ORC to 1.8.1 Key: ARROW-18435 URL: https://issues.apache.org/jira/browse/ARROW-18435 Project: Apache Arrow Issue Type: Improvement Components: C++, Java Reporter: Gang Wu Assignee: Gang Wu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18434) [C++][Parquet] Parquet page index read support
Gang Wu created ARROW-18434: --- Summary: [C++][Parquet] Parquet page index read support Key: ARROW-18434 URL: https://issues.apache.org/jira/browse/ARROW-18434 Project: Apache Arrow Issue Type: Sub-task Reporter: Gang Wu Assignee: Gang Wu Implement read support for parquet page index and expose it from the reader API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18433) Optimize aggregate functions to work with batches.
A. Coady created ARROW-18433: Summary: Optimize aggregate functions to work with batches. Key: ARROW-18433 URL: https://issues.apache.org/jira/browse/ARROW-18433 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Affects Versions: 10.0.1 Reporter: A. Coady Most compute functions work with the dataset API without loading entire columns into memory. Aggregate functions which are associative could also work that way: `min`, `max`, `any`, `all`, `sum`, `product`, and even `unique` and `value_counts` (a hand-rolled version of this fold is sketched below). A couple of implementation ideas:
* expand the dataset API to support expressions which return scalars
* add a `BatchedArray` type which is like a `ChunkedArray` but with lazy loading
-- This message was sent by Atlassian Jira (v8.20.10#820010)
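For context, here is a minimal sketch of what the associative-aggregate pattern looks like today when folded by hand over scan batches (my illustration; the dataset path and column name are hypothetical):
{code:python}
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("data/")  # hypothetical dataset
running_min = None
for batch in dataset.to_batches(columns=["x"]):
    batch_min = pc.min(batch.column("x")).as_py()
    if batch_min is not None:
        running_min = batch_min if running_min is None else min(running_min, batch_min)
{code}
The proposal is essentially to push this incremental fold inside the dataset API so the column never has to be fully loaded.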
[jira] [Created] (ARROW-18432) [Python] Array constructor doesn't support arrow scalars.
A. Coady created ARROW-18432: Summary: [Python] Array constructor doesn't support arrow scalars. Key: ARROW-18432 URL: https://issues.apache.org/jira/browse/ARROW-18432 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: A. Coady
{code:python}
pa.array([pa.scalar(0)])
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type

pa.array([pa.scalar(0)], 'int64')
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: tried to convert to int64{code}
It seems odd that the array constructors don't recognize their own scalars. In practice, a list of scalars has to be converted with `.as_py()` just to be converted back, and that also loses the type information. -- This message was sent by Atlassian Jira (v8.20.10#820010)
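A hedged illustration of the `.as_py()` round-trip described above, re-attaching the type explicitly so it isn't lost (my example, not from the report):
{code:python}
import pyarrow as pa

scalars = [pa.scalar(0), pa.scalar(1)]
# Round-trip through Python objects, re-attaching the type by hand:
arr = pa.array([s.as_py() for s in scalars], type=scalars[0].type)
assert arr.type == pa.int64()
{code}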
[jira] [Created] (ARROW-18431) Acero's Execution Plan never finishes.
Pau Garcia Rodriguez created ARROW-18431: Summary: Acero's Execution Plan never finishes. Key: ARROW-18431 URL: https://issues.apache.org/jira/browse/ARROW-18431 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 10.0.0 Reporter: Pau Garcia Rodriguez We have observed that sometimes an execution plan with a small input never finishes (the future returned by the ExecPlan::finished() method is never marked as finished), even though the generator in the sink node is exhausted and has returned nullopt. This issue seems to happen at random: the same plan with the same input sometimes works (the plan is marked finished) and sometimes it doesn't. Since the ExecPlanImpl destructor forces the executing thread to wait for the plan to finish (when the plan has not yet finished), we enter a deadlock waiting for a plan that never finishes. Since this has only happened with small inputs and not in a deterministic way, we believe the issue might be in the ExecPlan::StartProducing method. Our hypothesis is that after the plan starts producing on each node, each node schedules its tasks and they finish immediately (due to the small input), and somehow the callback that marks the {{finished_}} future as finished is never executed.
{code:java}
Status StartProducing() {
  ...
  Future<> scheduler_finished =
      util::AsyncTaskScheduler::Make([this](util::AsyncTaskScheduler* async_scheduler) {
        ...
      });
  scheduler_finished.AddCallback([this](const Status& st) { finished_.MarkFinished(st); });
  ...
}{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18430) [Python] Cannot cast nested nullable field to not-nullable
created ARROW-18430: Summary: [Python] Cannot cast nested nullable field to not-nullable Key: ARROW-18430 URL: https://issues.apache.org/jira/browse/ARROW-18430 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.1 Reporter: Casting from a nullable field to a not-nullable one works provided all values are present. So, for example, this is a valid cast:
{code:java}
table = pa.table({'column_1': pa.array([1, 2, 3])})
table.cast(
    pa.schema([
        f.with_nullable(False)
        for f in table.schema
    ])
)
{code}
But it doesn't work for nested fields. Here's an example:
{code:java}
import pyarrow as pa

record = {"nested_int": 1}

data_type = pa.struct(
    [
        pa.field("nested_int", pa.int32(), nullable=True),
    ]
)
data_type_after = pa.struct(
    [
        pa.field("nested_int", pa.int32(), nullable=False),
    ]
)

table = pa.table({"column_1": pa.array([record], data_type)})
table.cast(pa.schema([pa.field("column_1", data_type_after)]))
{code}
Throws:
{code:java}
pyarrow.lib.ArrowTypeError: cannot cast nullable field to non-nullable field: struct<nested_int: int32> struct<nested_int: int32 not null>
{code}
This is somewhat related to [https://github.com/apache/arrow/issues/13177] and https://issues.apache.org/jira/browse/ARROW-16603
-- This message was sent by Atlassian Jira (v8.20.10#820010)
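One possible workaround (mine, not from the report): rebuild the struct column from its children with the desired field metadata instead of casting:
{code:python}
import pyarrow as pa

# Assuming `table` and `data_type_after` from the reproducer above.
col = table["column_1"].combine_chunks()
rebuilt = pa.StructArray.from_arrays(
    [col.field("nested_int")],
    fields=list(data_type_after),  # these fields carry nullable=False
)
table = pa.table({"column_1": rebuilt})
{code}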
[jira] [Created] (ARROW-18429) [R] Bump dev version following 10.0.1 patch release
Nicola Crane created ARROW-18429: Summary: [R] Bump dev version following 10.0.1 patch release Key: ARROW-18429 URL: https://issues.apache.org/jira/browse/ARROW-18429 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Nicola Crane Assignee: Nicola Crane Fix For: 11.0.0 CI job fails with:
{code:java}
Insufficient package version (submitted: 10.0.0.9000, existing: 10.0.1)
Version contains large components (10.0.0.9000)
{code}
https://github.com/apache/arrow/actions/runs/3639669477/jobs/6145488845#step:10:567 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18428) [Website] Enable github issues on arrow-site repo
Joris Van den Bossche created ARROW-18428: - Summary: [Website] Enable github issues on arrow-site repo Key: ARROW-18428 URL: https://issues.apache.org/jira/browse/ARROW-18428 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Joris Van den Bossche Now that we are moving to GitHub issues, it probably makes sense to open issues about the website in its own arrow-site repo, instead of keeping them in the main arrow repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`
Yaron Gvili created ARROW-18427: --- Summary: [C++] Support negative tolerance in `AsofJoinNode` Key: ARROW-18427 URL: https://issues.apache.org/jira/browse/ARROW-18427 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing past-joining, i.e., joining right-table rows with a timestamp at or before that of the left-table row. This issue will add support for a negative tolerance, which would allow future-joining too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18426) Update committers and PMC members on website
Benson Muite created ARROW-18426: Summary: Update committers and PMC members on website Key: ARROW-18426 URL: https://issues.apache.org/jira/browse/ARROW-18426 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Benson Muite Assignee: Benson Muite Update committers and PMC members -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18425) Add support for Substrait round expression
Bryce Mecum created ARROW-18425: --- Summary: Add support for Substrait round expression Key: ARROW-18425 URL: https://issues.apache.org/jira/browse/ARROW-18425 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Bryce Mecum Work has been started on adding round to Substrait in [https://github.com/substrait-io/substrait/pull/322] and it looks like a mapping needs to be registered on the Acero side for Acero to consume plans with it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`
Yaron Gvili created ARROW-18424: --- Summary: [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness` Key: ARROW-18424 URL: https://issues.apache.org/jira/browse/ARROW-18424 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Doxygen is hitting the following error: `/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)`. See [this CI job output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381], for example. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18423) [Python] Expose reading a schema from an IPC message
Andre Kohn created ARROW-18423: -- Summary: [Python] Expose reading a schema from an IPC message Key: ARROW-18423 URL: https://issues.apache.org/jira/browse/ARROW-18423 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Andre Kohn Pyarrow currently does not implement reading the Arrow schema from an IPC message. [https://github.com/apache/arrow/blob/80b389efe902af376a85a8b3740e0dbdc5f80900/python/pyarrow/ipc.pxi#L1094] We'd like to consume Arrow IPC stream data like the following:
```
schema_msg = pyarrow.ipc.read_message(result_iter.next().data)
schema = pyarrow.ipc.read_schema(schema_msg)
for batch_data in result_iter:
    batch_msg = pyarrow.ipc.read_message(batch_data.data)
    batch = pyarrow.ipc.read_record_batch(batch_msg, schema)
```
The associated (tiny) PR on GitHub implements this reading by binding the existing C++ function. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18422) [C++] Provide enum reflection utility
Ben Kietzman created ARROW-18422: Summary: [C++] Provide enum reflection utility Key: ARROW-18422 URL: https://issues.apache.org/jira/browse/ARROW-18422 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Ben Kietzman Assignee: Ben Kietzman Now that we have C++17, we could try again with ARROW-13296 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18421) [C++][ORC] Add accessor for number of rows by stripe in reader
Louis Calot created ARROW-18421: --- Summary: [C++][ORC] Add accessor for number of rows by stripe in reader Key: ARROW-18421 URL: https://issues.apache.org/jira/browse/ARROW-18421 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Louis Calot I need the number of rows per stripe to be able to read specific ranges of records from the ORC file without reading it all. The number of rows is already stored in the implementation but not exposed in the API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18420) [C++][Parquet] Introduce ColumnIndex and OffsetIndex
Gang Wu created ARROW-18420: --- Summary: [C++][Parquet] Introduce ColumnIndex and OffsetIndex Key: ARROW-18420 URL: https://issues.apache.org/jira/browse/ARROW-18420 Project: Apache Arrow Issue Type: Sub-task Components: C++, Parquet Reporter: Gang Wu Assignee: Gang Wu Define the interfaces of ColumnIndex and OffsetIndex and provide an implementation that reads them from their serialized form. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18419) [C++] Update vendored fast_float
Kouhei Sutou created ARROW-18419: Summary: [C++] Update vendored fast_float Key: ARROW-18419 URL: https://issues.apache.org/jira/browse/ARROW-18419 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou For https://github.com/fastfloat/fast_float/pull/147 . -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python
Andy Grove created ARROW-18418: -- Summary: [WEBSITE] do not delete /datafusion-python Key: ARROW-18418 URL: https://issues.apache.org/jira/browse/ARROW-18418 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Assignee: Andy Grove do not delete /datafusion-python when publishing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18417) [C++] Support emit info in Substrait extension-multi and AsOfJoin
Yaron Gvili created ARROW-18417: --- Summary: [C++] Support emit info in Substrait extension-multi and AsOfJoin Key: ARROW-18417 URL: https://issues.apache.org/jira/browse/ARROW-18417 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili Currently, Arrow-Substrait does not handle emit info that may appear in an extension-multi in a Substrait plan. Besides the generic handling in the Arrow-Substrait extension API, specific handling for AsOfJoin is required, because AsOfJoinNode produces an output schema that is different from the one used in the emit info. In particular, the AsOfJoinNode output schema does not include the on- and by-keys of the right tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18416) [R] Update NEWS for 10.0.1
Nicola Crane created ARROW-18416: Summary: [R] Update NEWS for 10.0.1 Key: ARROW-18416 URL: https://issues.apache.org/jira/browse/ARROW-18416 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18415) [R] Update R package README to reference GH Issues
Nicola Crane created ARROW-18415: Summary: [R] Update R package README to reference GH Issues Key: ARROW-18415 URL: https://issues.apache.org/jira/browse/ARROW-18415 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The R package README should be updated to refer to GH Issues for users who don't have a JIRA account -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18414) [Release] Add a post script to generate announce email
Kouhei Sutou created ARROW-18414: Summary: [Release] Add a post script to generate announce email Key: ARROW-18414 URL: https://issues.apache.org/jira/browse/ARROW-18414 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Fix For: 11.0.0 We want to generate an announce email like a vote email. e.g.: [ANNOUNCE] Apache Arrow 10.0.0 released https://lists.apache.org/thread/zdsogdwj3r7wjv93o84go4ykgrcwtr0p . FYI: We can generate a vote email by {{SOURCE_DEFAULT=0 SOURCE_VOTE=1 dev/release/02-source.sh ...}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18413) [C++][Parquet] FileMetaData exposes page index metadata
Gang Wu created ARROW-18413: --- Summary: [C++][Parquet] FileMetaData exposes page index metadata Key: ARROW-18413 URL: https://issues.apache.org/jira/browse/ARROW-18413 Project: Apache Arrow Issue Type: Sub-task Components: C++, Parquet Reporter: Gang Wu Assignee: Gang Wu The Parquet ColumnChunk thrift object has recorded metadata for the page index:
{quote}
struct ColumnChunk {
  /** File offset of ColumnChunk's OffsetIndex **/
  4: optional i64 offset_index_offset

  /** Size of ColumnChunk's OffsetIndex, in bytes **/
  5: optional i32 offset_index_length

  /** File offset of ColumnChunk's ColumnIndex **/
  6: optional i64 column_index_offset

  /** Size of ColumnChunk's ColumnIndex, in bytes **/
  7: optional i32 column_index_length
}
{quote}
We just need to add a public API to ColumnChunkMetaData to make it ready to read. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18412) [R] Windows build fails because of missing ChunkResolver symbols
Dewey Dunnington created ARROW-18412: Summary: [R] Windows build fails because of missing ChunkResolver symbols Key: ARROW-18412 URL: https://issues.apache.org/jira/browse/ARROW-18412 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Dewey Dunnington In recent nightly builds of the Windows package we have a build failure because some symbols related to the {{ChunkResolver}} are not found in the linking stage. https://github.com/ursacomputing/crossbow/actions/runs/3559717769/jobs/5979255297#step:9:2818 [~kou] suggested the following patch might fix the build: https://github.com/apache/arrow/pull/14530#issuecomment-1328341447 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field
created ARROW-18411: Summary: [Python] MapType comparison ignores nullable flag of item_field Key: ARROW-18411 URL: https://issues.apache.org/jira/browse/ARROW-18411 Project: Apache Arrow Issue Type: Bug Components: Python Environment: pyarrow==10.0.1 Reporter: By default MapType value fields are nullable:
{code:java}
pa.map_(pa.string(), pa.int32()).item_field.nullable == True
{code}
It is possible to mark the value field of a MapType as not-nullable:
{code:java}
pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False)).item_field.nullable == False{code}
But comparing these two types, which are semantically different, returns True:
{code:java}
pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))
# Returns True
{code}
So it looks like the comparison omits the nullable flag.
{code:java}
import pyarrow as pa
import pytest

print(pa.__version__)

map_type = pa.map_(pa.string(), pa.int32())
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)
with pytest.raises(pa.ArrowInvalid, match=r"Invalid Map: key field can not contain null values"):
    pa.array(
        [[("one", 1), ("two", 2), (None, None)]], map_type
    )

map_type = pa.map_(pa.string(), pa.int32())
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)

non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))
nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=True))
pa.array(
    [[("one", 1), ("two", 2), ("null", None)]], map_type
)

assert nullable_map_type == map_type  # Should be different
assert str(nullable_map_type) == str(map_type)
assert non_null_map_type == map_type
assert non_null_map_type.item_type == map_type.item_type
assert non_null_map_type.item_field != map_type.item_field
assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
assert non_null_map_type.item_field.name == map_type.item_field.name
assert str(non_null_map_type) != str(map_type)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18410) [Packaging][Ubuntu] Add support for Ubuntu 22.10
Kouhei Sutou created ARROW-18410: Summary: [Packaging][Ubuntu] Add support for Ubuntu 22.10 Key: ARROW-18410 URL: https://issues.apache.org/jira/browse/ARROW-18410 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18409) [GLib][Plasma] Suppress deprecated warning in building plasma-glib
Kouhei Sutou created ARROW-18409: Summary: [GLib][Plasma] Suppress deprecated warning in building plasma-glib Key: ARROW-18409 URL: https://issues.apache.org/jira/browse/ARROW-18409 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou If we always get the "Plasma is deprecated since Arrow 10.0.0. ..." warning from {{plasma/common.h}}, we can't use the {{-Dwerror=true}} Meson option with plasma-glib. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18408) [C++] Add nightly test that uses an older version of protoc
Weston Pace created ARROW-18408: --- Summary: [C++] Add nightly test that uses an older version of protoc Key: ARROW-18408 URL: https://issues.apache.org/jira/browse/ARROW-18408 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Specifically, we should test the protoc version installed by Ubuntu 20.04 to help detect issues like ARROW-18406. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18407) [Release][Website] Use UTC for release date
Kouhei Sutou created ARROW-18407: Summary: [Release][Website] Use UTC for release date Key: ARROW-18407 URL: https://issues.apache.org/jira/browse/ARROW-18407 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools, Website Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18406) [C++] Can't build Arrow with Substrait on Ubuntu 20.04
Dewey Dunnington created ARROW-18406: Summary: [C++] Can't build Arrow with Substrait on Ubuntu 20.04 Key: ARROW-18406 URL: https://issues.apache.org/jira/browse/ARROW-18406 Project: Apache Arrow Issue Type: Improvement Reporter: Dewey Dunnington I recently tried to rebuild Arrow with Substrait on Ubuntu 20.04 and got the following error:
{code:java}
[100%] Building CXX object src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/type_internal.cc.o
/home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc: In function ‘arrow::Status arrow::engine::DecodeArg(const substrait::FunctionArgument&, int, arrow::engine::SubstraitCall*, const arrow::engine::ExtensionSet&, const arrow::engine::ConversionOptions&)’:
/home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:60:21: error: ‘bool substrait::FunctionArgument::has_enum_() const’ is private within this context
   60 |   if (arg.has_enum_()) {
      |                     ^
In file included from /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.h:30,
                 from /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:20:
/home/dewey/.r-arrow-dev-build/build/substrait_ep-generated/substrait/algebra.pb.h:21690:13: note: declared private here
21690 | inline bool FunctionArgument::has_enum_() const {
      |             ^~~~
[100%] Building CXX object src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/util.cc.o
make[2]: *** [src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/build.make:76: src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/expression_internal.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs
make[1]: *** [CMakeFiles/Makefile2:2028: src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
{code}
[~westonpace] suggested that it is probably a protobuf version problem! For me this is:
{code:java}
$ protoc --version
libprotoc 3.6.1
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18405) [Ruby] Raw table converter rebuilds chunked arrays
Sten Larsson created ARROW-18405: Summary: [Ruby] Raw table converter rebuilds chunked arrays Key: ARROW-18405 URL: https://issues.apache.org/jira/browse/ARROW-18405 Project: Apache Arrow Issue Type: Bug Components: Ruby Affects Versions: 10.0.0 Reporter: Sten Larsson Consider the following Ruby script:
{code:ruby}
require 'arrow'

data = Arrow::ChunkedArray.new([Arrow::Int64Array.new([1])])
table = Arrow::Table.new('column' => data)
puts table['column'].data_type
{code}
This prints "int64" with red-arrow 9.0.0 and "uint8" in 10.0.0. From my understanding it is due to this commit: [https://github.com/apache/arrow/commit/913d9c0a9a1a4398ed5f56d713d586770b4f702c#diff-f7f19bbc3945ea30ba06d851705f2d58f7666507bb101c4e151014ca398bd635R42] The old version would not call ArrayBuilder.build on a ChunkedArray, but the new version does. This is a problem for us, because we need the column to stay int64. A workaround is to specify a schema and list of arrays instead to bypass the raw table converter:
{code:ruby}
table = Arrow::Table.new([{name: 'column', type: 'int64'}], [data])
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18404) [Python] [Docs] Mention the C Data/Stream Interface in PyArrow Extending
Anja Boskovic created ARROW-18404: - Summary: [Python] [Docs] Mention the C Data/Stream Interface in PyArrow Extending Key: ARROW-18404 URL: https://issues.apache.org/jira/browse/ARROW-18404 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Anja Boskovic Assignee: Anja Boskovic The [Arrow C Data/Stream Interface|https://arrow.apache.org/docs/format/CDataInterface.html] is a relatively lightweight option for developers that want to expose Arrow Arrays to Python users. It is not mentioned as a recommendation in the documentation on [using pyarrow from C++ code|https://arrow.apache.org/docs/python/integration/extending.html]. The existing recommendation mentioned is [wrapping and unwrapping|https://arrow.apache.org/docs/python/integration/extending.html#wrapping-and-unwrapping]. I propose adding a section to this page. I would be happy to take that on, if others agree that is a good idea. -- This message was sent by Atlassian Jira (v8.20.10#820010)
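As a flavor of what such a section might show from the Python side, here is a sketch using the {{pyarrow.cffi}} helper and the low-level {{_export_to_c}}/{{_import_from_c}} hooks (illustrative only; a real docs section would presumably pair this with the consuming C++ code):
{code:python}
import pyarrow as pa
from pyarrow.cffi import ffi

arr = pa.array([1, 2, 3])

# Allocate the C Data Interface structs and export the array into them.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
arr._export_to_c(int(ffi.cast("uintptr_t", c_array)),
                 int(ffi.cast("uintptr_t", c_schema)))

# A consumer (possibly another library) imports from the same structs.
roundtripped = pa.Array._import_from_c(int(ffi.cast("uintptr_t", c_array)),
                                       int(ffi.cast("uintptr_t", c_schema)))
assert roundtripped.equals(arr)
{code}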
[jira] [Created] (ARROW-18403) [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported"
Nicola Crane created ARROW-18403: Summary: [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported" Key: ARROW-18403 URL: https://issues.apache.org/jira/browse/ARROW-18403 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane ARROW-17523 added support for the Substrait extension function "count", but when I write code which produces a Substrait plan which calls it, and then try to run it in Acero, I get an error. The plan:
{code:r}
message of type 'substrait.Plan' with 3 fields set
extension_uris {
  extension_uri_anchor: 1
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
}
extension_uris {
  extension_uri_anchor: 2
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
}
extension_uris {
  extension_uri_anchor: 3
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml"
}
extensions {
  extension_function {
    extension_uri_reference: 3
    function_anchor: 2
    name: "count"
  }
}
relations {
  rel {
    aggregate {
      input {
        project {
          common {
            emit {
              output_mapping: 9
              output_mapping: 10
              output_mapping: 11
              output_mapping: 12
              output_mapping: 13
              output_mapping: 14
              output_mapping: 15
              output_mapping: 16
              output_mapping: 17
            }
          }
          input {
            read {
              base_schema {
                names: "int"
                names: "dbl"
                names: "dbl2"
                names: "lgl"
                names: "false"
                names: "chr"
                names: "verses"
                names: "padded_strings"
                names: "some_negative"
                struct_ {
                  types { i32 { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                  types { bool_ { nullability: NULLABILITY_NULLABLE } }
                  types { bool_ { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { string { nullability: NULLABILITY_NULLABLE } }
                  types { fp64 { nullability: NULLABILITY_NULLABLE } }
                }
              }
              local_files {
                items {
                  uri_file: "file:///tmp/RtmpsBsoZJ/file1915f604cff4a"
                  parquet { }
                }
              }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 1 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 2 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 3 } }
              root_reference { }
            }
          }
          expressions {
            selection {
              direct_reference { struct_field { field: 4 } }
[jira] [Created] (ARROW-18402) [C++] Expose `DeclarationInfo`
Yaron Gvili created ARROW-18402: --- Summary: [C++] Expose `DeclarationInfo` Key: ARROW-18402 URL: https://issues.apache.org/jira/browse/ARROW-18402 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yaron Gvili Assignee: Yaron Gvili -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18401) [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest
Dewey Dunnington created ARROW-18401: Summary: [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest Key: ARROW-18401 URL: https://issues.apache.org/jira/browse/ARROW-18401 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington I think this is an R problem where a string is not getting converted to a timestamp (the 'greater' kernel mentioned in the error probably doesn't and shouldn't exist). https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=40090=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=22256
{code:java}
══ Failed tests ══
── Error ('test-dplyr-query.R:694'): Scalars in expressions match the type of the field, if possible ──
Error: NotImplemented: Function 'greater' has no kernel matching input types (timestamp[us, tz=UTC], string)
Backtrace:
     ▆
  1. ├─testthat::expect_output(...) at test-dplyr-query.R:694:2
  2. │ └─testthat:::quasi_capture(...)
  3. │   ├─testthat (local) .capture(...)
  4. │   │ └─testthat::capture_output_lines(code, print, width = width)
  5. │   │   └─testthat:::eval_with_output(code, print = print, width = width)
  6. │   │     ├─withr::with_output_sink(path, withVisible(code))
  7. │   │     │ └─base::force(code)
  8. │   │     └─base::withVisible(code)
  9. │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 10. ├─tab %>% filter(times > "2018-10-07 19:04:05") %>% ...
 11. └─arrow::show_exec_plan(.)
 12.   ├─arrow::as_record_batch_reader(adq)
 13.   └─arrow:::as_record_batch_reader.arrow_dplyr_query(adq)
 14.     └─plan$Build(x)
 15.       └─node$Filter(.data$filtered_rows)
 16.         ├─self$preserve_extras(ExecNode_Filter(self, expr))
 17.         └─arrow:::ExecNode_Filter(self, expr)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18400) Quadratic memory usage of Table.to_pandas with nested data
Adam Reeve created ARROW-18400: -- Summary: Quadratic memory usage of Table.to_pandas with nested data Key: ARROW-18400 URL: https://issues.apache.org/jira/browse/ARROW-18400 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.1 Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X with 64 GB RAM Reporter: Adam Reeve Reading nested Parquet data and then converting it to a Pandas DataFrame shows quadratic memory usage and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts. Example code to generate nested Parquet data:
{code:python}
import numpy as np
import random
import string
import pandas as pd

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 1_024_000
filename = 'nested.parquet'

arr_len = 10
nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
        [{
            'a': None if i % 1000 == 0 else np.random.choice(1, size=3).astype(np.int64),
            'b': None if i % 100 == 0 else random.choice(range(100)),
            'c': None if i % 10 == 0 else make_random_string(5)
        } for i in range(arr_len)]
    ))

df = pd.DataFrame({'c1': nested_col})
df.to_parquet(filename)
{code}
And then read into a DataFrame with:
{code:python}
import pyarrow.parquet as pq

table = pq.read_table(filename)
df = table.to_pandas()
{code}
Only reading to an Arrow table isn't a problem, it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas but I assume the problem probably isn't Parquet specific. Memory usage I see when reading different sized files:
||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
|32,000|362|361|
|64,000|531|531|
|128,000|1,152|1,101|
|256,000|2,888|1,402|
|512,000|10,301|3,508|
|1,024,000|38,697|5,313|
|2,048,000|OOM|20,061|
With Arrow 10.0.1, memory usage approximately quadruples when row count doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but then quadruples from 1024k to 2048k rows. PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something changed between 7.0.0 and 8.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
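For anyone trying to reproduce the table above, a minimal sketch of one way to capture peak memory on Linux (my approach, not necessarily how these numbers were measured):
{code:python}
import resource

import pyarrow.parquet as pq

table = pq.read_table("nested.parquet")
df = table.to_pandas()

# ru_maxrss is reported in kilobytes on Linux.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"peak RSS: {peak_mb:.0f} MB")
{code}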
[jira] [Created] (ARROW-18399) [Python] Reduce warnings during tests
Antoine Pitrou created ARROW-18399: -- Summary: [Python] Reduce warnings during tests Key: ARROW-18399 URL: https://issues.apache.org/jira/browse/ARROW-18399 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Numerous warnings are displayed at the end of a test run; we should strive to reduce them: https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489 -- This message was sent by Atlassian Jira (v8.20.10#820010)
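A small, hedged sketch of one way to track an individual warning down to its source while cleaning these up (standard library only, not an Arrow API):
{code:python}
import warnings

# Promote one warning category to an error so the test run shows the
# offending stack trace instead of a summary line at the end.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    run_suspect_code()  # hypothetical: the code path emitting the warning
{code}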
[jira] [Created] (ARROW-18398) [C++] Sporadic error in StressSourceGroupedSumStop
Antoine Pitrou created ARROW-18398: -- Summary: [C++] Sporadic error in StressSourceGroupedSumStop Key: ARROW-18398 URL: https://issues.apache.org/jira/browse/ARROW-18398 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou I just saw this occasional failure: https://github.com/apache/arrow/actions/runs/3533672097/jobs/5929601817#step:11:294
{code}
[ RUN      ] ExecPlanExecution.StressSourceGroupedSumStop
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:850: Failure
Value of: _fut.Wait(::arrow::kDefaultAssertFinishesWaitSeconds)
  Actual: false
Expected: true
Google Test trace:
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:825: parallel
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:822: unslowed
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:60: Plan was destroyed before finishing
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18397) [C++] Clear S3 region resolver client at S3 shutdown
Antoine Pitrou created ARROW-18397: -- Summary: [C++] Clear S3 region resolver client at S3 shutdown Key: ARROW-18397 URL: https://issues.apache.org/jira/browse/ARROW-18397 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 10.0.2, 11.0.0 The S3 region resolver caches an S3 client at module scope. This client can be destroyed very late and trigger an assertion error in the AWS SDK because the SDK was already shut down: https://github.com/aws/aws-sdk-cpp/issues/2204 When explicitly finalizing S3, we should ensure we also destroy the cached S3 client. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18395) [C++] Move select-k implementation into separate module
Antoine Pitrou created ARROW-18395: -- Summary: [C++] Move select-k implementation into separate module Key: ARROW-18395 URL: https://issues.apache.org/jira/browse/ARROW-18395 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou The select-k kernel implementations are currently in {{vector_sort.cc}}, amongst other things. To make the code more readable and faster to compile, we should move them into their own file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18396) [C++] Move rank implementation into separate module
Antoine Pitrou created ARROW-18396: -- Summary: [C++] Move rank implementation into separate module Key: ARROW-18396 URL: https://issues.apache.org/jira/browse/ARROW-18396 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou The rank kernel implementations are currently in {{vector_sort.cc}}, amongst other things. To make the code more readable and faster to compile, we should move them into their own file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18394) [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail
Raúl Cumplido created ARROW-18394: - Summary: [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail Key: ARROW-18394 URL: https://issues.apache.org/jira/browse/ARROW-18394 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Fix For: 11.0.0 Currently the following jobs fail:
|test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
|test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
with:
{code:java}
_________________ test_roundtrip_with_bytes_unicode[columns0] _________________

columns = [b'foo']

    @pytest.mark.parametrize('columns', ([b'foo'], ['foo']))
    def test_roundtrip_with_bytes_unicode(columns):
        df = pd.DataFrame(columns=columns)
        table1 = pa.Table.from_pandas(df)
>       table2 = pa.Table.from_pandas(table1.to_pandas())

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
    ???
pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
    ???
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819: in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935: in _deserialize_column_index
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154: in _reconstruct_columns_from_metadata
    level = level.astype(dtype)
opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029: in astype
    return Index(new_values, name=self.name, dtype=new_values.dtype, copy=False)
opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518: in __new__
    klass = cls._dtype_to_subclass(arr.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pandas.core.indexes.base.Index'>, dtype = dtype('S3')

    @final
    @classmethod
    def _dtype_to_subclass(cls, dtype: DtypeObj):
        # Delay import for perf. https://github.com/pandas-dev/pandas/pull/31423
        if isinstance(dtype, ExtensionDtype):
            if isinstance(dtype, DatetimeTZDtype):
                from pandas import DatetimeIndex
                return DatetimeIndex
            elif isinstance(dtype, CategoricalDtype):
                from pandas import CategoricalIndex
                return CategoricalIndex
            elif isinstance(dtype, IntervalDtype):
                from pandas import IntervalIndex
                return IntervalIndex
            elif isinstance(dtype, PeriodDtype):
                from pandas import PeriodIndex
                return PeriodIndex
            return Index
        if dtype.kind == "M":
            from pandas import DatetimeIndex
            return DatetimeIndex
        elif dtype.kind == "m":
            from pandas import TimedeltaIndex
            return TimedeltaIndex
        elif dtype.kind == "f":
            from pandas.core.api import Float64Index
            return Float64Index
        elif dtype.kind == "u":
            from pandas.core.api import UInt64Index
            return UInt64Index
        elif dtype.kind == "i":
            from pandas.core.api import Int64Index
            return Int64Index
        elif dtype.kind == "O":
            # NB: assuming away MultiIndex
            return Index
        elif issubclass(
            dtype.type, (str, bool, np.bool_, complex, np.complex64, np.complex128)
        ):
            return Index

>       raise NotImplementedError(dtype)
E       NotImplementedError: |S3

opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595: NotImplementedError
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18393) [Docs][R] Include warning when viewing old docs (redirecting to stable docs)
Nicola Crane created ARROW-18393: Summary: [Docs][R] Include warning when viewing old docs (redirecting to stable docs) Key: ARROW-18393 URL: https://issues.apache.org/jira/browse/ARROW-18393 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Assignee: Alenka Frim Now that we have versioned docs, we also have old versions of the developer docs (eg https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (eg regarding communication channels, build instructions, etc), and typically when contributing to / developing with the latest arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs, similarly to how some projects warn about viewing old docs in general and point to the stable docs (eg https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could have a custom box on pages under /developers pointing to the dev docs instead of the stable docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18392) [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket
Raúl Cumplido created ARROW-18392: - Summary: [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket Key: ARROW-18392 URL: https://issues.apache.org/jira/browse/ARROW-18392 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Fix For: 11.0.0 Several nightly tests fail with:
{code:java}
=================================== FAILURES ===================================
____________________________ test_s3fs_wrong_region ____________________________

    @pytest.mark.s3
    def test_s3fs_wrong_region():
        from pyarrow.fs import S3FileSystem

        # wrong region for bucket
        fs = S3FileSystem(region='eu-north-1')
        msg = ("When getting information for bucket 'voltrondata-labs-datasets': "
               r"AWS Error UNKNOWN \(HTTP status 301\) during HeadBucket "
               "operation: No response body. Looks like the configured region is "
               "'eu-north-1' while the bucket is located in 'us-east-2'."
               "|NETWORK_CONNECTION")
        with pytest.raises(OSError, match=msg) as exc:
            fs.get_file_info("voltrondata-labs-datasets")

        # Sometimes fails on unrelated network error, so next call would also fail.
        if 'NETWORK_CONNECTION' in str(exc.value):
            return

        fs = S3FileSystem(region='us-east-2')
>       fs.get_file_info("voltrondata-labs-datasets")

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_fs.py:1339:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/_fs.pyx:571: in pyarrow._fs.FileSystem.get_file_info
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: When getting information for bucket 'voltrondata-labs-datasets': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
{code}
I can't seem to reproduce this locally, but it is pretty consistent:
* [test-conda-python-3.10|https://github.com/ursacomputing/crossbow/actions/runs/3528202639/jobs/5918051269]
* [test-conda-python-3.11|https://github.com/ursacomputing/crossbow/actions/runs/3528201175/jobs/5918048135]
* [test-conda-python-3.7|https://github.com/ursacomputing/crossbow/actions/runs/3528195566/jobs/5918035812]
* [test-conda-python-3.7-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528211334/jobs/5918069152]
* [test-conda-python-3.8|https://github.com/ursacomputing/crossbow/actions/runs/3528193702/jobs/5918032370]
* [test-conda-python-3.8-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528213536/jobs/5918073481]
* [test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3528205157/jobs/5918056277]
* [test-conda-python-3.9|https://github.com/ursacomputing/crossbow/actions/runs/3528202402/jobs/5918050613]
* [test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3528210560/jobs/5918067302]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18391) [R] Fix the version selector dropdown
Nicola Crane created ARROW-18391: Summary: [R] Fix the version selector dropdown Key: ARROW-18391 URL: https://issues.apache.org/jira/browse/ARROW-18391 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane ARROW-17887 updates the docs to use Bootstrap 5 which will break the docs version dropdown selector, as it relies on replacing a page element, but the page elements are different in this version of Bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18390) [CI][Python] Nightly python test for spark master missing test module
Raúl Cumplido created ARROW-18390: - Summary: [CI][Python] Nightly python test for spark master missing test module Key: ARROW-18390 URL: https://issues.apache.org/jira/browse/ARROW-18390 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 11.0.0 Currently the nightly test with spark master [test-conda-python-3.9-spark-master|https://github.com/ursacomputing/crossbow/actions/runs/3528196313/jobs/5918037939] fails with:
{code:java}
Starting test(python): pyspark.sql.tests.test_pandas_map (temp output: /spark/python/target/cbca1b18-4af7-4205-aa41-8c945bf1cf58/python__pyspark.sql.tests.test_pandas_map__9ptzo8sa.log)
/opt/conda/envs/arrow/bin/python: No module named pyspark.sql.tests.test_pandas_grouped_map
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18389) [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0
Raúl Cumplido created ARROW-18389: - Summary: [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0 Key: ARROW-18389 URL: https://issues.apache.org/jira/browse/ARROW-18389 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 11.0.0 [ARROW-18173|https://issues.apache.org/jira/browse/ARROW-18173] removed support for pandas < 1.0, so we should upgrade the nightly test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18388) [C++] Decide on duplicate column handling in scanner, add more tests
Weston Pace created ARROW-18388: --- Summary: [C++] Decide on duplicate column handling in scanner, add more tests Key: ARROW-18388 URL: https://issues.apache.org/jira/browse/ARROW-18388 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace When a schema has duplicate column names it can be difficult to know how to map between the fragment schema and the dataset schema in the default evolution strategy. It's not clear from the comments describing evolution what the exact behavior is right now. Some suggestions have been: * Grab the first column in the fragment schema with the same name * Always error if there are duplicate columns * Allow duplicate columns but expect there to be the same # of occurrences in both the fragment and dataset schema and assume the order is consistent -- This message was sent by Atlassian Jira (v8.20.10#820010)
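For illustration, duplicate names are perfectly legal in a schema, which is what makes the fragment-to-dataset mapping ambiguous (a small pyarrow demonstration; the scanner behavior under discussion lives in C++):
{code:python}
import pyarrow as pa

schema = pa.schema([pa.field("a", pa.int32()), pa.field("a", pa.string())])
print(schema.get_all_field_indices("a"))  # [0, 1] -> which one should a fragment map to?
{code}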
[jira] [Created] (ARROW-18387) [C++] Create many-column scanner microbenchmarks
Weston Pace created ARROW-18387: --- Summary: [C++] Create many-column scanner microbenchmarks Key: ARROW-18387 URL: https://issues.apache.org/jira/browse/ARROW-18387 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace When developing we often assume schemas are cheap and small only to find out later that we create easily avoided bottlenecks for users that have very large schemas. We should create some micro-benchmarks for the scanner that verify we get roughly the same performance, in data-bytes-per-second, with many-columns as we do with few-columns (note, that we probably suffer in rows-per-second since we are loading more columns and thus more data). This might also be a good time to create similar benchmarks for dataset discovery. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18386) [C++] Add support for filename, file index, and batch index columns to exec plan based scanner
Weston Pace created ARROW-18386: --- Summary: [C++] Add support for filename, file index, and batch index columns to exec plan based scanner Key: ARROW-18386 URL: https://issues.apache.org/jira/browse/ARROW-18386 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace Assignee: Weston Pace The old scanner currently appends these three fields to all outgoing batches. In retrospect, this caused some confusion, so I'd like to handle it slightly differently, where the user is able to request these fields, but they are not automatically appended. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18385) [Java]
Jacob Wujciak-Jens created ARROW-18385: -- Summary: [Java] Key: ARROW-18385 URL: https://issues.apache.org/jira/browse/ARROW-18385 Project: Apache Arrow Issue Type: Wish Components: Java Reporter: Jacob Wujciak-Jens Fix For: 11.0.0 Attachments: image.png While verifying 10.0.1 I came across this java test error that is caused by a mismatch in the ordering of the JSON metadata description (see attached image):
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.177 s <<< FAILURE! - in org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest
[ERROR] org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest.schemaCommentWithDatabaseMetadata Time elapsed: 0.141 s <<< FAILURE!
org.opentest4j.AssertionFailedError:
cc [~lidavidm]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18384) [Release][MSYS2] Show pull request title
Kouhei Sutou created ARROW-18384: Summary: [Release][MSYS2] Show pull request title Key: ARROW-18384 URL: https://issues.apache.org/jira/browse/ARROW-18384 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18383) [C++] Avoid global variables for thread pools and at-fork handlers
Antoine Pitrou created ARROW-18383: -- Summary: [C++] Avoid global variables for thread pools and at-fork handlers Key: ARROW-18383 URL: https://issues.apache.org/jira/browse/ARROW-18383 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 11.0.0 Investigation revealed an issue where the global IO thread pool could be constructed before the at-fork handler internal state. The IO thread pool, created on library load, would register an at-fork handler; then, the at-fork handler state would be initialized and clobber the handler registered just before. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18382) [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds
Antoine Pitrou created ARROW-18382: -- Summary: [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds Key: ARROW-18382 URL: https://issues.apache.org/jira/browse/ARROW-18382 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fuzzing builds (as run by OSS-Fuzz) enable Address Sanitizer through their own set of options rather than by enabling {{ARROW_USE_ASAN}}. However, the Arrow source code needs to be informed of this situation. One example of where this matters: eternal thread pools produce spurious leaks at shutdown because of the vector of at-fork handlers; this therefore needs to be worked around in those builds. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18381) MIGRATION: Create milestones for every needed fix version
Todd Farmer created ARROW-18381: --- Summary: MIGRATION: Create milestones for every needed fix version Key: ARROW-18381 URL: https://issues.apache.org/jira/browse/ARROW-18381 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer The Apache Arrow project uses the "Fix version" field in ASF Jira issues to track the version in which issues were resolved/fixed/implemented. The closest equivalent field in GitHub issues is the "milestone" field. This field is explicitly managed - the versions need to be added to the repository configuration before they can be used. This mapping needs to be established as a prerequisite for completing the import from ASF Jira. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18380) MIGRATION: Enable bot handling of GitHub issue linked PRs
Todd Farmer created ARROW-18380: --- Summary: MIGRATION: Enable bot handling of GitHub issue linked PRs Key: ARROW-18380 URL: https://issues.apache.org/jira/browse/ARROW-18380 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer GitHub workflows for the Apache Arrow project assume that PRs reference ASF Jira issues (or are minor changes). This needs to be revisited now that GitHub issue reporting is enabled, as there may well be no ASF Jira issue to link a PR against going forward. The resulting bot comments can be confusing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18379) [Python] Change warnings to _warnings in _plasma_store_entry_point
Alenka Frim created ARROW-18379: --- Summary: [Python] Change warnings to _warnings in _plasma_store_entry_point Key: ARROW-18379 URL: https://issues.apache.org/jira/browse/ARROW-18379 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Alenka Frim Assignee: Alenka Frim There is a leftover in {{python/pyarrow/__init__.py}} from [https://github.com/apache/arrow/pull/14343] due to {{warnings}} being imported as {{_warnings}}. Connected GitHub issue: [https://github.com/apache/arrow/issues/14693] -- This message was sent by Atlassian Jira (v8.20.10#820010)
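A minimal illustration of the bug class (not the actual pyarrow code): the module is imported under the alias {{_warnings}}, so a call through the bare name {{warnings}} fails at runtime.
{code:python}
import warnings as _warnings

def _plasma_store_entry_point():
    # BUG: the module was imported as `_warnings`, so the bare name
    # `warnings` is undefined here; this should be _warnings.warn(...).
    warnings.warn("plasma_store is deprecated", DeprecationWarning)

try:
    _plasma_store_entry_point()
except NameError as exc:
    print(exc)  # name 'warnings' is not defined
{code}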
[jira] [Created] (ARROW-18378) MIGRATION: Disable issue reporting in ASF Jira
Todd Farmer created ARROW-18378: --- Summary: MIGRATION: Disable issue reporting in ASF Jira Key: ARROW-18378 URL: https://issues.apache.org/jira/browse/ARROW-18378 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer ARROW-18364 enabled issue reporting for Apache Arrow in GitHub issues. Even though existing Jira issues have not yet been migrated and are still being worked in the Jira system, we should assess disabling creation of new issues in ASF Jira, and instead pointing users to GitHub issues. This may benefit the project by reducing the need to monitor inflow in two discrete systems. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18377) MIGRATION: Automate component labels from issue form content
Todd Farmer created ARROW-18377: --- Summary: MIGRATION: Automate component labels from issue form content Key: ARROW-18377 URL: https://issues.apache.org/jira/browse/ARROW-18377 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer ARROW-18364 added the ability to report issues in GitHub, and includes GitHub issue templates with a drop-down component(s) selector. These form elements drive the resulting issue markdown only, and cannot dynamically drive issue labels. Automating labels therefore requires GitHub Actions, which also have a few limitations. First, the issue form does not produce any structured data; it only produces the issue description markdown, so a parser is required. Second, ASF restricts GitHub actions to a selection of approved actions. While community actions exist to generate structured data from issue forms, it is likely that the Apache Arrow project will need to write its own parser and label application action. -- This message was sent by Atlassian Jira (v8.20.10#820010)
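A sketch of the kind of parser that might be needed, assuming the issue form renders the selected components under a "### Component(s)" heading; the heading text and label format here are assumptions, not the actual template:
{code:python}
import re

def extract_component_labels(issue_body: str) -> list[str]:
    # Pull the text under the assumed "### Component(s)" heading, stopping
    # at the next heading (or end of body), then split on commas.
    match = re.search(r"### Component\(s\)\s*\n(.+?)(?:\n#{1,6} |\Z)", issue_body, re.S)
    if match is None:
        return []
    components = [c.strip() for c in match.group(1).split(",") if c.strip()]
    return [f"Component: {c}" for c in components]

body = "### Component(s)\n\nPython, C++\n\n### Describe the bug\n..."
print(extract_component_labels(body))  # ['Component: Python', 'Component: C++']
{code}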
[jira] [Created] (ARROW-18376) MIGRATION: Add component labels to GitHub
Todd Farmer created ARROW-18376: --- Summary: MIGRATION: Add component labels to GitHub Key: ARROW-18376 URL: https://issues.apache.org/jira/browse/ARROW-18376 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer Similar to ARROW-18375, component labels have been established based on existing component values defined in ASF Jira. The following labels are needed: * Component: Archery * Component: Benchmarking * Component: C * Component: C# * Component: C++ * Component: C++ - Gandiva * Component: C++ - Plasma * Component: Continuous Integration * Component: Dart * Component: Developer Tools * Component: Documentation * Component: FlightRPC * Component: Format * Component: GLib * Component: Go * Component: GPU * Component: Integration * Component: Java * Component: JavaScript * Component: MATLAB * Component: Packaging * Component: Parquet * Component: Python * Component: R * Component: Ruby * Component: Swift * Component: Website * Component: Other -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18375) MIGRATION: Enable GitHub issue type labels
Todd Farmer created ARROW-18375: --- Summary: MIGRATION: Enable GitHub issue type labels Key: ARROW-18375 URL: https://issues.apache.org/jira/browse/ARROW-18375 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer As part of enabling GitHub issue reporting, the following labels have been defined and need to be added to the repository label options. Without these labels added, [new issues|https://github.com/apache/arrow/issues/14692] do not get the issue template-defined issue type labels set properly. Labels: * Type: bug * Type: enhancement * Type: usage -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18374) [Go][CI][Benchmarks] Fix Go Bench Script after conbench change
Matthew Topol created ARROW-18374: - Summary: [Go][CI][Benchmarks] Fix Go Bench Script after conbench change Key: ARROW-18374 URL: https://issues.apache.org/jira/browse/ARROW-18374 Project: Apache Arrow Issue Type: Bug Components: Benchmarking, Continuous Integration, Go Reporter: Matthew Topol Assignee: Matthew Topol Change [https://github.com/conbench/conbench/pull/417/files#] now requires passing an explicit {{github=None}} argument to {{BenchmarkResult}} to have it pull the GitHub info from the locally cloned repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
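For illustration, roughly what the call-site change looks like; only {{github=None}} is taken from the report above, while the import path and the other keyword arguments are assumptions standing in for whatever the Go bench script already sends:
{code:python}
from benchadapt.result import BenchmarkResult  # assumed import path

# Only `github=None` is taken from the report; the other fields below are
# illustrative placeholders.
result = BenchmarkResult(
    run_reason="commit",
    stats={"data": [0.123], "unit": "s", "iterations": 1},
    tags={"name": "BenchmarkExample"},
    context={"benchmark_language": "Go"},
    github=None,  # explicit None: derive repo/commit from the local clone
)
{code}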
[jira] [Created] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates
Todd Farmer created ARROW-18373: --- Summary: MIGRATION: Enable multiple component selection in issue templates Key: ARROW-18373 URL: https://issues.apache.org/jira/browse/ARROW-18373 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], we would like to enable selection of multiple components when reporting issues via GitHub issues. Additionally, we may want to add the needed Apache license to the issue templates and remove the exclusion rules from rat_exclude_files.txt. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18372) [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
Lucas Mation created ARROW-18372: Summary: [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell Key: ARROW-18372 URL: https://issues.apache.org/jira/browse/ARROW-18372 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 10.0.0 Reporter: Lucas Mation I have a large Parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset. Notice that the "collected" dataset is supposed to be only one row and one cell, containing the count (I've confirmed this by subsetting the dataset with "%>% head(10^6)" before computing the count, and it works). That is why the error below is so weird: ``` fa <- 'myparteq folder' #huge va <- open_dataset(fa) tic() d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect toc() Error in `collect()`: ! Invalid: negative malloc size Run `rlang::last_error()` to see where the error occurred. > rlang::last_error() Error in `collect()`: ! Invalid: negative malloc size --- Backtrace: 1. ... %>% collect 3. arrow:::collect.arrow_dplyr_query(.) Run `rlang::last_trace()` to see the full context. > rlang::last_trace() Error in `collect()`: ! Invalid: negative malloc size --- Backtrace: x 1. +-... %>% collect 2. +-dplyr::collect(.) 3. \-arrow:::collect.arrow_dplyr_query(.) 4. \-base::tryCatch(...) 5. \-base (local) tryCatchList(expr, classes, parentenv, handlers) 6. \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]]) 7. \-value[[3L]](cond) 8. \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema) 9. \-rlang::abort(msg, call = call) ``` I am running this on a Windows server with 512 GB of RAM. sessionInfo() R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows Server 2012 R2 x64 (build 9600) Matrix products: default locale: [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C [5] LC_TIME=Portuguese_Brazil.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 [9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0 [17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1 loaded via a namespace (and not attached): [1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1 [8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0 [15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3 [22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0 [29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 [36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3 [43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0 [50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1 arrow_info() Arrow package version: 10.0.0 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE gcs TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc FALSE mimalloc TRUE Arrow options(): 
arrow.use_threads FALSE Memory: Allocator mimalloc Current 74.82 Gb Max 97.75 Gb Runtime: SIMD Level avx2 Detected SIMD Level avx2 Build: C++ Library Version 10.0.0 C++ Compiler GNU C++ Compiler Version 10.3.0 Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18371) [C++] Expose *FromJSON helpers
Rok Mihevc created ARROW-18371: -- Summary: [C++] Expose *FromJSON helpers Key: ARROW-18371 URL: https://issues.apache.org/jira/browse/ARROW-18371 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc {{ArrayFromJSON}}, {{ExecBatchFromJSON}} and {{RecordBatchFromJSON}} helper functions would be useful when testing in projects that use Arrow. {{BatchesWithSchema}} and {{MakeBasicBatches}} could be considered as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18370) [Python] `ds.write_dataset` doesn't allow feather compression
Yu Zhu created ARROW-18370: -- Summary: [Python] `ds.write_dataset` doesn't allow feather compression Key: ARROW-18370 URL: https://issues.apache.org/jira/browse/ARROW-18370 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Yu Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18369) [C++] Support nested references as segment ids
Yaron Gvili created ARROW-18369: --- Summary: [C++] Support nested references as segment ids Key: ARROW-18369 URL: https://issues.apache.org/jira/browse/ARROW-18369 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Yaron Gvili -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18368) [Python] Expose grouping segment keys to PyArrow
Yaron Gvili created ARROW-18368: --- Summary: [Python] Expose grouping segment keys to PyArrow Key: ARROW-18368 URL: https://issues.apache.org/jira/browse/ARROW-18368 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Yaron Gvili This is a [follow-up task|https://github.com/apache/arrow/pull/14352#discussion_r1026926422] for a PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18367) Enable using InMemoryDataset to create substrait plans
Jianshen Liu created ARROW-18367: Summary: Enable using InMemoryDataset to create substrait plans Key: ARROW-18367 URL: https://issues.apache.org/jira/browse/ARROW-18367 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jianshen Liu Fix For: 11.0.0 We think that the `Named Table` relation supported by Substrait is an important abstraction in HPC for enabling remote execution. To enable the creation of named tables with the `engine::SerializePlan` API, we would like to add support for scan nodes backed by an `InMemoryDataset` when converting to Substrait plans. The idea is to save the `names` of a named table in the metadata of the schema used to create the InMemoryDataset. -- This message was sent by Atlassian Jira (v8.20.10#820010)
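A Python-side sketch of the metadata idea; the {{substrait.named_table}} key is an illustrative choice, not a spec'd name:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Stash the named-table name in the schema metadata of the table backing the
# InMemoryDataset; "substrait.named_table" is an illustrative key only.
table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata(
    {"substrait.named_table": "my_table"}
)
dataset = ds.InMemoryDataset(table)
print(dataset.schema.metadata)  # {b'substrait.named_table': b'my_table'}
{code}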
[jira] [Created] (ARROW-18366) [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9
Kouhei Sutou created ARROW-18366: Summary: [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9 Key: ARROW-18366 URL: https://issues.apache.org/jira/browse/ARROW-18366 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou https://github.com/ursacomputing/crossbow/actions/runs/3502784911/jobs/5867407921#step:6:4748 {noformat} FAILED: gandiva-glib/Gandiva-1.0.gir env PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/share/pkgconfig:/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/meson-uninstalled /usr/bin/g-ir-scanner --quiet --no-libtool --namespace=Gandiva --nsversion=1.0 --warn-all --output gandiva-glib/Gandiva-1.0.gir --c-include=gandiva-glib/gandiva-glib.h --warn-all --include-uninstalled=./arrow-glib/Arrow-1.0.gir -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/gandiva-glib -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src --filelist=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib/libgandiva-glib.so.1100.0.0.p/Gandiva_1.0_gir_filelist --include=Arrow-1.0 --symbol-prefix=ggandiva --identifier-prefix=GGandiva --pkg-export=gandiva-glib --cflags-begin -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. 
-I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 -I/usr/include/gobject-introspection-1.0 --cflags-end --add-include-path=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib --add-include-path=/usr/share/gir-1.0 -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib --library gandiva-glib -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release --extra-library=gobject-2.0 --extra-library=glib-2.0 --extra-library=girepository-1.0 --sources-top-dirs /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/ --sources-top-dirs /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/ --warn-error /usr/bin/ld: /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release/libgandiva.so.1100: undefined reference to `std::__glibcxx_assert_fail(char const*, int, char const*, char const*)' collect2: error: ld returned 1 exit status {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18365) [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and decoding
Rok Mihevc created ARROW-18365: -- Summary: [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and decoding Key: ARROW-18365 URL: https://issues.apache.org/jira/browse/ARROW-18365 Project: Apache Arrow Issue Type: New Feature Components: C++, Parquet Reporter: Rok Mihevc [As suggested here|https://github.com/apache/arrow/pull/14191#discussion_r1019762308], a SIMD approach such as [FastDifferentialCoding|https://github.com/lemire/FastDifferentialCoding] could be used to speed up encoding and decoding. -- This message was sent by Atlassian Jira (v8.20.10#820010)
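For intuition, delta decoding is essentially a prefix sum over the stored deltas, which is exactly the loop such SIMD kernels vectorize; a scalar NumPy sketch of the concept (ignoring the bit-packing of blocks/miniblocks):
{code:python}
import numpy as np

values = np.array([100, 103, 101, 106, 110], dtype=np.int64)

# DELTA_BINARY_PACKED conceptually stores a first value plus the deltas
# between consecutive values (bit-packed in blocks/miniblocks).
first, deltas = values[0], np.diff(values)

# Decoding reduces to a running (prefix) sum over the deltas -- the hot loop
# that a SIMD kernel like FastDifferentialCoding accelerates.
decoded = np.concatenate(([first], first + np.cumsum(deltas)))
assert np.array_equal(decoded, values)
{code}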
[jira] [Created] (ARROW-18364) MIGRATION: Update GitHub issue templates to support bug reports and feature requests
Todd Farmer created ARROW-18364: --- Summary: MIGRATION: Update GitHub issue templates to support bug reports and feature requests Key: ARROW-18364 URL: https://issues.apache.org/jira/browse/ARROW-18364 Project: Apache Arrow Issue Type: Task Reporter: Todd Farmer The [GitHub issue creation page for Arrow|https://github.com/apache/arrow/issues/new/choose] directs users to open bug reports in Jira. Now that ASF Infra has disabled self-service registration in Jira, and in light of the pending migration of Apache Arrow issue tracking from ASF Jira to GitHub issues, we should enable bug reports to be submitted via GitHub directly. Issue templates will help distinguish bug reports and feature requests from existing usage assistance questions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)
Joris Van den Bossche created ARROW-18363: - Summary: [Docs] Include warning when viewing old contributing docs (redirecting to dev docs) Key: ARROW-18363 URL: https://issues.apache.org/jira/browse/ARROW-18363 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Now that we have versioned docs, we also have old versions of the developer docs (eg https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (eg regarding communication channels, build instructions, etc), and when contributing to / developing with the latest arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs, similar to how some projects warn about viewing old docs in general and point to the stable docs (eg https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could show a custom box on pages under /developers that points to the dev docs instead of the stable docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18362) Accelerate Parquet bit-packing decoding with ICX AVX-512
zhaoyaqi created ARROW-18362: Summary: Accelerate Parquet bit-packing decoding with ICX AVX-512 Key: ARROW-18362 URL: https://issues.apache.org/jira/browse/ARROW-18362 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: zhaoyaqi h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18361) [CI][Conan] Merge upstream changes
Kouhei Sutou created ARROW-18361: Summary: [CI][Conan] Merge upstream changes Key: ARROW-18361 URL: https://issues.apache.org/jira/browse/ARROW-18361 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou Updated: https://github.com/conan-io/conan-center-index/pull/14111 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18360) [Python] Incorrectly passing schema=None to do_put crashes
Bryan Cutler created ARROW-18360: Summary: [Python] Incorrectly passing schema=None to do_put crashes Key: ARROW-18360 URL: https://issues.apache.org/jira/browse/ARROW-18360 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: Bryan Cutler In pyarrow.flight, passing an incorrect value of None for schema in do_put will lead to a core dump. In pyarrow 9.0.0, entering the command leads to a segmentation fault: {code} In [3]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), schema=None) Segmentation fault (core dumped) {code} In pyarrow 7.0.0, the kernel crashes after attempting to access the writer, and I got the following: {code} In [38]: client = flight.FlightClient('grpc+tls://localhost:9643', disable_server_verification=True) In [39]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), None) In [40]: writer./home/conda/feedstock_root/build_artifacts/arrow-cpp-ext_1644752264449/work/cpp/src/arrow/flight/client.cc:736: Check failed: (batch_writer_) != (nullptr) miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(+0x66288c)[0x7f0feeae088c] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0x101)[0x7f0feeae0c91] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.700(+0x7c1e1)[0x7f0fa9e331e1] miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so(+0x17cf1a)[0x7f0fefe7ff1a] miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(+0x144814)[0x559a7cb8f814] miniconda3/envs/dev/bin/python(+0x1445bf)[0x559a7cb8f5bf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44] miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac] miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(+0x151ef3)[0x559a7cb9cef3] miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44] miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x1311)[0x559a7cb7fbd1] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5] miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x66f)[0x559a7cb7ef2f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] 
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] miniconda3/envs/dev/bin/python(+0x1416f5)[0x559a7cb8c6f5] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x52)[0x559a7cb8c4a2] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d] miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03] miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x9ca)[0x559a7cb7f28a] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python(+0x1602d9)[0x559a7cbab2d9] miniconda3/envs/dev/bin/python(+0x19d5f5)[0x559a7cbe85f5] miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc] miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178] miniconda3/envs/dev/bin/python
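For reference, the crash-free call passes a concrete schema; a minimal sketch, where the server address and command are placeholders:
{code:python}
import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:8815")  # placeholder address
table = pa.table({"a": [1, 2, 3]})
descriptor = flight.FlightDescriptor.for_command(b"my-command")  # placeholder

# do_put expects a real Schema here; passing schema=None is what should
# raise a clean TypeError/ArrowInvalid instead of crashing the process.
writer, reader = client.do_put(descriptor, table.schema)
writer.write_table(table)
writer.close()
{code}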
[jira] [Created] (ARROW-18359) PrettyPrint Improvements
Will Jones created ARROW-18359: -- Summary: PrettyPrint Improvements Key: ARROW-18359 URL: https://issues.apache.org/jira/browse/ARROW-18359 Project: Apache Arrow Issue Type: Improvement Components: C++, Python, R Reporter: Will Jones We have some pretty printing capabilities, but we may want to think at a high level about the design first. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18358) [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow
Nicola Crane created ARROW-18358: Summary: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow Key: ARROW-18358 URL: https://issues.apache.org/jira/browse/ARROW-18358 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane In order to make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of open_dataset specifically for reading CSVs, with a signature more closely matching that of read_csv_arrow. This would just pass the arguments through to open_dataset (in the ellipses), but would make it simpler to have a docs page showing these options explicitly, and thus be clearer for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18357) [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow
Nicola Crane created ARROW-18357: Summary: [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow Key: ARROW-18357 URL: https://issues.apache.org/jira/browse/ARROW-18357 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane The {{read_csv_arrow()}} function allows users to pass in options via its parse_options, convert_options, and read_options parameters. We could also accept these in {{open_dataset()}}, so that users can more easily switch between {{read_csv_arrow()}} and {{open_dataset()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18356) [R] Handle as_data_frame argument if passed into open_dataset for CSVs
Nicola Crane created ARROW-18356: Summary: [R] Handle as_data_frame argument if passed into open_dataset for CSVs Key: ARROW-18356 URL: https://issues.apache.org/jira/browse/ARROW-18356 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane Currently, if the argument {{as_data_frame}} is passed into {{open_dataset()}} with a CSV format dataset, the error message returned is: {code:r} Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "as_data_frame" {code} Instead, we could silently ignore it if as_data_frame is set to {{FALSE}} and give a more helpful error if set to {{TRUE}} (i.e. direct user to call {{as.data.frame()}} or {{collect()}}). Reasoning: it'd be great to get to a point where users can just swap their {{read_csv_arrow()}} syntax for {{open_dataset()}} and get helpful results. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null
Nicola Crane created ARROW-18355: Summary: [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null Key: ARROW-18355 URL: https://issues.apache.org/jira/browse/ARROW-18355 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18354) [R] Better document the CSV read/parse/convert options we can use with open_dataset()
Nicola Crane created ARROW-18354: Summary: [R] Better document the CSV read/parse/convert options we can use with open_dataset() Key: ARROW-18354 URL: https://issues.apache.org/jira/browse/ARROW-18354 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane When a user opens a CSV dataset using open_dataset, they can take advantage of a lot of different options which can be specified via {{CsvReadOptions$create()}} etc. However, as these are passed in via the ellipses ({{...}}) argument, it's not particularly clear to users which arguments are supported. They are not documented in the {{open_dataset()}} docs, and the matter is further confused (see the code for {{CsvFileFormat$create()}}) by the fact that we support a mix of Arrow and readr parameters. We should better document the arguments we do support. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18353) [C++][Flight] Sporadic hang in UCX tests
Antoine Pitrou created ARROW-18353: -- Summary: [C++][Flight] Sporadic hang in UCX tests Key: ARROW-18353 URL: https://issues.apache.org/jira/browse/ARROW-18353 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Antoine Pitrou The UCX tests sometimes hang here. Full gdb backtraces for all threads: {code} Thread 8 (Thread 0x7f4562fcd700 (LWP 76837)): #0 0x7f4577b72ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x564ebe5b5b3c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #1 __pthread_cond_wait_common (abstime=0x0, mutex=0x564ebe5b5ae0, cond=0x564ebe5b5b10) at pthread_cond_wait.c:502 #2 __pthread_cond_wait (cond=0x564ebe5b5b10, mutex=0x564ebe5b5ae0) at pthread_cond_wait.c:655 #3 0x7f457b4ce7cb in std::condition_variable::wait >(std::unique_lock &, struct {...}) (this=0x564ebe5b5b10, __lock=..., __p=...) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/condition_variable:111 #4 0x7f457b4c7b5e in arrow::flight::transport::ucx::(anonymous namespace)::WriteClientStream::WritesDone (this=0x564ebe5b5a90) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:277 #5 0x7f457b4cc989 in arrow::flight::transport::ucx::(anonymous namespace)::UcxClientStream::DoFinish (this=0x564ebe5b5a90) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:692 #6 0x7f457af80e04 in arrow::flight::internal::ClientDataStream::Finish (this=0x564ebe5b5a90, st=...) at /arrow/cpp/src/arrow/flight/transport.cc:46 #7 0x7f457af4f6e1 in arrow::flight::ClientMetadataReader::ReadMetadata (this=0x564ebe560630, out=0x7f4562fcc170) at /arrow/cpp/src/arrow/flight/client.cc:263 #8 0x7f457b593af6 in operator() (__closure=0x564ebe4e4848) at /arrow/cpp/src/arrow/flight/test_definitions.cc:1538 #9 0x7f457b5b66b8 in std::__invoke_impl >(std::__invoke_other, struct {...} &&) (__f=...) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:60 #10 0x7f457b5b6529 in std::__invoke >(struct {...} &&) (__fn=...) 
at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:95 #11 0x7f457b5b63c4 in std::thread::_Invoker > >::_M_invoke<0>(std::_Index_tuple<0>) ( this=0x564ebe4e4848) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:264 #12 0x7f457b5b6224 in std::thread::_Invoker > >::operator()(void) ( this=0x564ebe4e4848) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:271 #13 0x7f457b5b5e1e in std::thread::_State_impl > > >::_M_run(void) (this=0x564ebe4e4840) at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:215 #14 0x7f4578242a93 in std::execute_native_thread_routine (__p=) at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/new_allocator.h:82 #15 0x7f4577b6c6db in start_thread (arg=0x7f4562fcd700) at pthread_create.c:463 #16 0x7f4577ea561f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 7 (Thread 0x7f45725ca700 (LWP 76828)): #0 0x7f4577ea5947 in epoll_wait (epfd=36, events=events@entry=0x7f45725c86c0, maxevents=16, timeout=timeout@entry=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x7f45779fe3e3 in ucs_event_set_wait (event_set=0x7f4564026240, num_events=num_events@entry=0x7f45725c8804, timeout_ms=timeout_ms@entry=0, event_set_handler=event_set_handler@entry=0x7f4575d29320 , arg=arg@entry=0x7f45725c8800) at sys/event_set.c:198 #2 0x7f4575d29283 in uct_tcp_iface_progress (tl_iface=0x7f4564026900) at tcp/tcp_iface.c:327 #3 0x7f4577a7de22 in ucs_callbackq_dispatch (cbq=) at /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211 #4 uct_worker_progress (worker=) at /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638 #5 ucp_worker_progress (worker=0x7f4564000c80) at core/ucp_worker.c:2782 #6 0x7f457b4f186f in arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress (this=0x7f456404d3b0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759 #7 0x7f457b4eee40 in arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame (this=0x7f456404d3b0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449 #8 0x7f457b4f3661 in arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame (this=0x7f456c0016d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1037 #9 0x7f457b4d8c43 in arrow::flight::transport::ucx::(anonymous namespace)::PutServerStream::ReadImpl (this=0x7f45725c8b60, data=0x7f45725c8af0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:153 #10 0x7f457b4d8525 in arrow::flight::tra
[jira] [Created] (ARROW-18352) [R] Datasets API interface improvements
Nicola Crane created ARROW-18352: Summary: [R] Datasets API interface improvements Key: ARROW-18352 URL: https://issues.apache.org/jira/browse/ARROW-18352 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket for improvements for our interface to the datasets API, and making the experience more consistent between {{open_dataset()}} and the {{read_*()}} functions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18351) [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange
Antoine Pitrou created ARROW-18351: -- Summary: [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange Key: ARROW-18351 URL: https://issues.apache.org/jira/browse/ARROW-18351 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Antoine Pitrou I get a non-deterministic crash in the Flight UCX tests. {code} [--] 3 tests from UcxErrorHandlingTest [ RUN ] UcxErrorHandlingTest.TestGetFlightInfo [ OK ] UcxErrorHandlingTest.TestGetFlightInfo (24 ms) [ RUN ] UcxErrorHandlingTest.TestDoPut [ OK ] UcxErrorHandlingTest.TestDoPut (15 ms) [ RUN ] UcxErrorHandlingTest.TestDoExchange /arrow/cpp/src/arrow/util/future.cc:125: Check failed: !IsFutureFinished(state_) Future already marked finished {code} Here is the GDB backtrace: {code} #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x7f18c49cd7f1 in __GI_abort () at abort.c:79 #2 0x7f18c5854e00 in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:72 #3 0x7f18c5854e1c in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:74 #4 0x7f18c5855181 in arrow::util::ArrowLog::~ArrowLog (this=0x7f18c07fc380, __in_chrg=) at /arrow/cpp/src/arrow/util/logging.cc:250 #5 0x7f18c5826f86 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed (this=0x7f18a815f030, state=arrow::FutureState::FAILURE) at /arrow/cpp/src/arrow/util/future.cc:125 #6 0x7f18c58265af in arrow::ConcreteFutureImpl::DoMarkFailed (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:40 #7 0x7f18c5827660 in arrow::FutureImpl::MarkFailed (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:195 #8 0x7f18c80ff8d8 in arrow::Future >::DoMarkFinished (this=0x7f18a815efb0, res=...) at /arrow/cpp/src/arrow/util/future.h:660 #9 0x7f18c80fb37d in arrow::Future >::MarkFinished (this=0x7f18a815efb0, res=...) at /arrow/cpp/src/arrow/util/future.h:403 #10 0x7f18c80f5ae3 in arrow::flight::transport::ucx::UcpCallDriver::Impl::Push (this=0x7f18a804d2d0, status=...) 
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:780 #11 0x7f18c80f5c1f in arrow::flight::transport::ucx::UcpCallDriver::Impl::RecvActiveMessage (this=0x7f18a804d2d0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:791 #12 0x7f18c80f7d29 in arrow::flight::transport::ucx::UcpCallDriver::RecvActiveMessage (this=0x7f18b80017e0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1082 #13 0x7f18c80e3ea4 in arrow::flight::transport::ucx::(anonymous namespace)::UcxServerImpl::HandleIncomingActiveMessage (self=0x7f18a80259a0, header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:586 #14 0x7f18c4661a09 in ucp_am_invoke_cb (recv_flags=, reply_ep=, data_length=1, data=, user_hdr_length=, user_hdr=0x7f18c8081865, am_id=4132, worker=) at core/ucp_am.c:1220 #15 ucp_am_handler_common (name=, recv_flags=, am_flags=0, reply_ep=, total_length=, am_hdr=0x7f18c808185c, worker=) at core/ucp_am.c:1289 #16 ucp_am_handler_reply (am_arg=, am_data=, am_length=, am_flags=) at core/ucp_am.c:1327 #17 0x7f18c28e3f1c in uct_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, id=, iface=0x7f18a8027e20) at /usr/local/src/conda/ucx-1.13.1/src/uct/base/uct_iface.h:861 #18 uct_mm_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, am_id=, iface=0x7f18a8027e20) at sm/mm/base/mm_iface.h:256 #19 uct_mm_iface_process_recv (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:256 #20 uct_mm_iface_poll_fifo (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:304 #21 uct_mm_iface_progress (tl_iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:357 #22 0x7f18c4686e22 in ucs_callbackq_dispatch (cbq=) at /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211 #23 uct_worker_progress (worker=) at /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638 #24 ucp_worker_progress (worker=0x7f18a80008d0) at core/ucp_worker.c:2782 #25 0x7f18c80f586f in arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress (this=0x7f18a804d2d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759 #26 0x7f18c80f2e40 in arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame (this=0x7f18a804d2d0) at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449 #27 0x7f18c80f7661 in arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame (this=0x7f18b8
[jira] [Created] (ARROW-18350) [C++] Use std::to_chars instead of std::to_string
Antoine Pitrou created ARROW-18350: -- Summary: [C++] Use std::to_chars instead of std::to_string Key: ARROW-18350 URL: https://issues.apache.org/jira/browse/ARROW-18350 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou {{std::to_chars}} is locale-independent unlike {{std::to_string}}; it may also be faster in some cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18349) [CI][C++][Flight] Exercise UCX on CI
Antoine Pitrou created ARROW-18349: -- Summary: [CI][C++][Flight] Exercise UCX on CI Key: ARROW-18349 URL: https://issues.apache.org/jira/browse/ARROW-18349 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration, FlightRPC Reporter: Antoine Pitrou Fix For: 11.0.0 UCX doesn't seem to be enabled in any CI configuration for now. We should have at least a nightly job with UCX enabled, for example one of the Conda or Ubuntu builds. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18348) [CI][Release][Yum] redhat-rpm-config is needed on AlmaLinux 9
Kouhei Sutou created ARROW-18348: Summary: [CI][Release][Yum] redhat-rpm-config is needed on AlmaLinux 9 Key: ARROW-18348 URL: https://issues.apache.org/jira/browse/ARROW-18348 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou Fix For: 10.0.2, 11.0.0 https://github.com/ursacomputing/crossbow/actions/runs/3485133283/jobs/5830385419#step:7:1909 {noformat} Building native extensions. This could take a while... ERROR: Error installing gobject-introspection: ERROR: Failed to build gem native extension. current directory: /usr/local/share/gems/gems/glib2-4.0.3/ext/glib2 /usr/bin/ruby -I /usr/share/rubygems -r ./siteconf20221117-855-v8bktd.rb extconf.rb checking for --enable-debug-build option... no checking for -Wall option to compiler... *** extconf.rb failed *** Could not create Makefile due to some reason, probably lack of necessary libraries and/or headers. Check the mkmf.log file for more details. You may need configuration options. Provided configuration options: --with-opt-dir --without-opt-dir --with-opt-include --without-opt-include=${opt-dir}/include --with-opt-lib --without-opt-lib=${opt-dir}/lib64 --with-make-prog --without-make-prog --srcdir=. --curdir --ruby=/usr/bin/$(RUBY_BASE_NAME) --enable-debug-build --disable-debug-build /usr/share/ruby/mkmf.rb:471:in `try_do': The compiler failed to generate an executable file. (RuntimeError) You have to install development tools first. from /usr/share/ruby/mkmf.rb:597:in `block in try_compile' from /usr/share/ruby/mkmf.rb:546:in `with_werror' from /usr/share/ruby/mkmf.rb:597:in `try_compile' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:65:in `block in try_compiler_option' from /usr/share/ruby/mkmf.rb:971:in `block in checking_for' from /usr/share/ruby/mkmf.rb:361:in `block (2 levels) in postpone' from /usr/share/ruby/mkmf.rb:331:in `open' from /usr/share/ruby/mkmf.rb:361:in `block in postpone' from /usr/share/ruby/mkmf.rb:331:in `open' from /usr/share/ruby/mkmf.rb:357:in `postpone' from /usr/share/ruby/mkmf.rb:970:in `checking_for' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:64:in `try_compiler_option' from /usr/local/share/gems/gems/glib2-4.0.3/lib/mkmf-gnome.rb:74:in `' from :85:in `require' from :85:in `require' from extconf.rb:27:in `' To see why this extension failed to compile, please check the mkmf.log which can be found here: /usr/local/lib64/gems/ruby/glib2-4.0.3/mkmf.log extconf failed, exit code 1 Gem files will remain installed in /usr/local/share/gems/gems/glib2-4.0.3 for inspection. Results logged to /usr/local/lib64/gems/ruby/glib2-4.0.3/gem_make.out {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18347) [C++] Hook up cancellation to exec plan
Weston Pace created ARROW-18347: --- Summary: [C++] Hook up cancellation to exec plan Key: ARROW-18347 URL: https://issues.apache.org/jira/browse/ARROW-18347 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace There are two ways to cancel an exec plan: calling StopProducing, or cancelling the task group. We should investigate which makes the most sense, and then configure the DeclarationToReader method to support cancelling on discard. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18346) [Python] Dataset writer API papercuts
David Li created ARROW-18346: Summary: [Python] Dataset writer API papercuts Key: ARROW-18346 URL: https://issues.apache.org/jira/browse/ARROW-18346 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 10.0.0 Reporter: David Li * Writer options are not very discoverable. Perhaps "file_options" should mention compression as an example of something you can control, so people looking for it know where to go next? * Compression seems like it might be common enough to warrant a top-level parameter somehow (even if it gets implemented differently internally)? * Either way, this needs a cookbook example. * {{make_write_options}} is lacking a docstring * Writer options objects are lacking {{__repr__}}s -- This message was sent by Atlassian Jira (v8.20.10#820010)
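For instance, the path a user currently has to discover looks roughly like this (a sketch; the output directory is a placeholder):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 3]})

# Compression is reachable only through make_write_options on the format
# object, not as a top-level write_dataset parameter -- the discoverability
# papercut described above.
fmt = ds.ParquetFileFormat()
opts = fmt.make_write_options(compression="zstd")
ds.write_dataset(table, "out/", format=fmt, file_options=opts)
{code}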
[jira] [Created] (ARROW-18344) [C++] Use input pre-sortedness to create sorted table with ConcatenateTables
Rok Mihevc created ARROW-18344: -- Summary: [C++] Use input pre-sortedness to create sorted table with ConcatenateTables Key: ARROW-18344 URL: https://issues.apache.org/jira/browse/ARROW-18344 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc When concatenating large sorted tables (e.g. sorted timeseries data), the resulting table is no longer sorted. However, the inputs' sortedness can be used to significantly speed up post-concatenation sorting. A potential API could be to add ConcatenateTablesOptions.inputs_sorted and implement the logic in ConcatenateTables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
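A small Python illustration of why pre-sortedness helps, with {{heapq.merge}} standing in for the proposed ConcatenateTables logic (the option name above is only proposed, not implemented):
{code:python}
import heapq

import pyarrow as pa

t1 = pa.table({"ts": [1, 4, 9]})
t2 = pa.table({"ts": [2, 3, 10]})

# Today: concatenate, then fully re-sort -- O(n log n) over all rows.
resorted = pa.concat_tables([t1, t2]).sort_by("ts")

# With inputs known to be sorted: a k-way merge of the already-sorted
# inputs is O(n log k), which is the speedup the proposal is after.
merged = list(heapq.merge(t1["ts"].to_pylist(), t2["ts"].to_pylist()))
assert merged == resorted["ts"].to_pylist()
{code}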
[jira] [Created] (ARROW-18345) [R] Create a CRAN-specific packaging checklist that lives in the R package directory
Dewey Dunnington created ARROW-18345: Summary: [R] Create a CRAN-specific packaging checklist that lives in the R package directory Key: ARROW-18345 URL: https://issues.apache.org/jira/browse/ARROW-18345 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington Like other packaging tasks, the CRAN packaging task (which is concerned with making sure the R package from the Arrow release complies with CRAN policies) is slightly different from the overall Arrow release task for the R package. For example, we often push patch-patch releases if the two-week window we get to "safely retain the package on CRAN" does not line up with a release vote. [~npr] has heroically been doing this for a long time, and while he has equally heroically volunteered to keep doing it, I am hoping the process of codifying this somewhere in the R repo will help a wider set of contributors understand the process (even if it was already documented elsewhere!). [~stephhazlitt] and I use {{usethis::use_release_issue()}} to manage our personal R package releases, and I'm wondering if creating a similar function or markdown template would help here. I'm happy to start the process of putting a PR up for discussion! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18343) [C++] AllocateBitmap() with out parameter is declared but not defined
Jin Shang created ARROW-18343: - Summary: [C++] AllocateBitmap() with out parameter is declared but not defined Key: ARROW-18343 URL: https://issues.apache.org/jira/browse/ARROW-18343 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Jin Shang [This variant of AllocateBitmap|https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L483] is declared but not defined in buffer.cc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18342) [C++] AsofJoinNode support for Boolean data field
Rok Mihevc created ARROW-18342: -- Summary: [C++] AsofJoinNode support for Boolean data field Key: ARROW-18342 URL: https://issues.apache.org/jira/browse/ARROW-18342 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Rok Mihevc This is to add boolean data field support to asof join as proposed here: https://github.com/apache/arrow/pull/14485 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18341) [Doc][Python] Update note about bundling Arrow C++ on Windows
Alenka Frim created ARROW-18341: --- Summary: [Doc][Python] Update note about bundling Arrow C++ on Windows Key: ARROW-18341 URL: https://issues.apache.org/jira/browse/ARROW-18341 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Alenka Frim Assignee: Alenka Frim Fix For: 11.0.0 There is a note on the Python development page, under the Windows section, about bundling the Arrow C++ libraries with Python extensions: [https://arrow.apache.org/docs/dev/developers/python.html#building-on-windows] This note can be revised: * if you are using conda, the fact that the Arrow C++ libs are not bundled is fine, since conda will ensure those libs are found. * If you are not using conda, you have to ensure those libs can be found: either by updating {{PATH}} (every time before importing pyarrow), or by bundling them (... using the {{PYARROW_BUNDLE_ARROW_CPP}} env variable instead of {{--bundle-arrow-cpp}}), with the caveat that they won't be automatically updated when the arrow-cpp libs are rebuilt. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow
Joris Van den Bossche created ARROW-18340: - Summary: [Python] PyArrow C++ header files no longer always included in installed pyarrow Key: ARROW-18340 URL: https://issues.apache.org/jira/browse/ARROW-18340 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Assignee: Alenka Frim Fix For: 10.0.1 We have a Python build env var to control whether the Arrow C++ header files are included in the python package or not ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and only set to False in the conda recipe. After the cmake refactor, the PyArrow C++ header files no longer live in the Arrow C++ package, and so should _always_ be included in the python package, regardless of how arrow-cpp is installed. Initially this was done, but it seems that https://github.com/apache/arrow/pull/13892 removed this unconditional copy of the PyArrow header files to {{pyarrow/include}}. Now they are only copied if {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)