Re: [I] [C++] Consuming or closing a RecordBatchReader created from a Dataset Scanner does not close underlying files [arrow]

2024-05-28 Thread via GitHub


bkietz closed issue #41771: [C++] Consuming or closing a RecordBatchReader 
created from a Dataset Scanner does not close underlying files
URL: https://github.com/apache/arrow/issues/41771


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [MATLAB] Add C Data Interface format import/export functionality for `arrow.tabular.RecordBatch` [arrow]

2024-05-28 Thread via GitHub


sgilmore10 closed issue #41803: [MATLAB] Add C Data Interface format 
import/export functionality for `arrow.tabular.RecordBatch`
URL: https://github.com/apache/arrow/issues/41803


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Implement a function to load field buffers from external buffers for StringView [arrow]

2024-05-28 Thread via GitHub


vibhatha closed issue #40931: [Java] Implement a function to load field buffers 
from external buffers for StringView
URL: https://github.com/apache/arrow/issues/40931


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Implement a strategy to return variable width buffer count for StringView in TypeLayout [arrow]

2024-05-28 Thread via GitHub


vibhatha closed issue #40935: [Java] Implement a strategy to return variable 
width buffer count for StringView in TypeLayout
URL: https://github.com/apache/arrow/issues/40935


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] TypeLayout enhancement to support StringView [arrow]

2024-05-28 Thread via GitHub


vibhatha closed issue #40934: [Java] TypeLayout enhancement to support 
StringView
URL: https://github.com/apache/arrow/issues/40934


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Support LZ4_RAW for parquet writing [arrow]

2024-05-28 Thread via GitHub


douglas-raillard-arm opened a new issue, #41863:
URL: https://github.com/apache/arrow/issues/41863

   ### Describe the enhancement requested
   
   `pyarrow.dataset.write_dataset(compression='lz4_raw')` currently fails with:
   
   ```
   Traceback (most recent call last):
     File "/work/projects/lisa/testpyarrow.py", line 3, in <module>
       _reencode_parquet('sched_switch.lz4.parquet', 'updated.parquet', compression='lz4_raw')  #, row_group_size=128*1024*1024, compression='LZ4'
     File "x.py", line 1, in my_write_parquet
       options = pyarrow.dataset.ParquetFileFormat().make_write_options(
     File "pyarrow/_dataset_parquet.pyx", line 206, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
     File "pyarrow/_dataset_parquet.pyx", line 594, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
     File "pyarrow/_dataset_parquet.pyx", line 599, in pyarrow._dataset_parquet.ParquetFileWriteOptions._set_properties
     File "pyarrow/_parquet.pyx", line 1855, in pyarrow._parquet._create_writer_properties
     File "pyarrow/_parquet.pyx", line 1369, in pyarrow._parquet.check_compression_name
   pyarrow.lib.ArrowException: Unsupported compression: lz4_raw
   ```
   
   And indeed, no mention of `lz4_raw` is to be found in 
`python/pyarrow/_parquet.pyx`.
   
   Would it be possible to add support for the LZ4_RAW codec when writing 
Parquet files, particularly using the dataset API?
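
   A minimal reproduction (a sketch, not from the original report; the table 
contents and output directory are made up):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table({"x": [1, 2, 3]})
   
   # As of pyarrow 16.x, make_write_options() raises
   # pyarrow.lib.ArrowException: Unsupported compression: lz4_raw
   ds.write_dataset(
       table,
       "out",
       format="parquet",
       file_options=ds.ParquetFileFormat().make_write_options(compression="lz4_raw"),
   )
   ```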
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Thread deadlock in ObjectOutputStream [arrow]

2024-05-28 Thread via GitHub


icexelloss opened a new issue, #41862:
URL: https://github.com/apache/arrow/issues/41862

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I am seeing a deadlock when destructing an ObjectOutputStream. I have 
attached the stack trace.
   
   I did some debugging and found that the issue seems to be that the mutex in 
question is already held by this thread (I checked the `__owner` field in the 
`pthread_mutex_t`, which points to the hanging thread).
   
   Unfortunately, the stack trace doesn't show exactly which mutex it is trying 
to lock. I wonder if someone more familiar with the IO code has some ideas 
about what might be the issue and where to dig deeper?
   
   
[arrow_object_output_stream_stacktrace.txt](https://github.com/apache/arrow/files/15469090/arrow_object_output_stream_stacktrace.txt)
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] How to concatenate multiple tables in one parquet? [arrow]

2024-05-27 Thread via GitHub


zliucd opened a new issue, #41858:
URL: https://github.com/apache/arrow/issues/41858

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi,
   
   Is it possible to write multiple tables into a single parquet file by 
appending the rows of each individual parquet? All tables read from the 
parquets have the same columns. This functionality is similar to pandas 
```pd.concat([df1, df2])```.
   
   
   For example:
   ```
   table1
   Name   Age
   Jim    36
   Bill   30
   
   table2
   Name   Age
   Sam    28
   Joe    30
   ```
   
   The concatenated table and parquet file should be:
   ```
   Name   Age
   Jim    36
   Bill   30
   Sam    28
   Joe    30
   ```
   
   We can concatenate tables using ```auto con_tables = 
arrow::ConcatenateTables(...)```, but the result cannot be passed directly to 
```parquet::arrow::WriteTable()```: the first parameter of WriteTable() is a 
single ```arrow::Table```.
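
   For reference, a sketch of the same row-wise merge in pyarrow (the file 
names here are hypothetical):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Read the individual Parquet files.
   table1 = pq.read_table("table1.parquet")
   table2 = pq.read_table("table2.parquet")
   
   # Row-wise concatenation; the tables must share the same columns.
   combined = pa.concat_tables([table1, table2])
   
   # Write all rows out as a single Parquet file.
   pq.write_table(combined, "combined.parquet")
   ```
   
   In C++ the shape is the same; note that `arrow::ConcatenateTables` returns 
an `arrow::Result<std::shared_ptr<arrow::Table>>`, so the result has to be 
unwrapped before it can be passed to `parquet::arrow::WriteTable()`.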
   
   This post shows how to merge tables by appending columns, but my context is 
appending rows. 
   https://stackoverflow.com/questions/71183352/merging-tables-in-apache-arrow
   
   Thanks.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Packaging][RPM] Mismatch between package version and library version in naming [arrow]

2024-05-27 Thread via GitHub


kou closed issue #41784: [Packaging][RPM] Mismatch between package version and 
library version in naming
URL: https://github.com/apache/arrow/issues/41784


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Error repeating df.to_parquet in pytest: "pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined" [arrow]

2024-05-27 Thread via GitHub


bjfar opened a new issue, #41857:
URL: https://github.com/apache/arrow/issues/41857

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Python version: 3.10.14
   pyarrow version: 16.1.0
   pandas version: 2.2.2
   pytest version: 8.2.1
   
   I have some apparently niche circumstances that trigger the following error:
   
   ```
   /home/benf/repos/tetra/python/tests/test_minimal.py:24:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/util/_decorators.py:333: in wrapper
       return func(*args, **kwargs)
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/frame.py:3113: in to_parquet
       return to_parquet(
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:476: in to_parquet
       impl = get_engine(engine)
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:63: in get_engine
       return engine_class()
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:169: in __init__
       import pandas.core.arrays.arrow.extension_types  # pyright: ignore[reportUnusedImport] # noqa: F401
   /home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/arrays/arrow/extension_types.py:59: in <module>
       pyarrow.register_extension_type(_period_type)
   pyarrow/types.pxi:1954: in pyarrow.lib.register_extension_type
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   >   ???
   E   pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined
   
   pyarrow/error.pxi:91: ArrowKeyError
   === short test summary info ===
   FAILED python/tests/test_minimal.py::test_pyarrow_issue_2 - pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined
   ```
   
   It seems to have something to do with how pytest orchestrates its tests. 
Here is my minimal example:
   
   test_minimal.py
   ```python
   import pytest
   import pandas as pd
   
   pytest_plugins = ["pytester"]
   
   def test_pyarrow_issue(testdir, tmp_path):
       path = str(tmp_path / "test.tar")
       df = pd.DataFrame()
       df.to_parquet(path)
   
   def test_pyarrow_issue_2(testdir, tmp_path):
       path = str(tmp_path / "test_2.tar")
       df = pd.DataFrame()
       df.to_parquet(path)
   ```
   
   Running `pytest test_minimal.py` then triggers the error.
   
   Notably, the error does *not* occur if either test is run independently, and 
it does not occur if the `testdir` fixture is removed or replaced with some 
other fixture. So I guess it has something to do with whatever `testdir` is 
doing under the hood. Presumably to do with how pandas/pyarrow get imported.
   
   In my real case I would really quite like to keep using the `testdir` 
fixture, though I can probably find a different way to do things. But 
nonetheless this behaviour seemed worth reporting. Not sure if it is a pyarrow 
issue though, or whether it is more of a pytest issue, or maybe even pandas.
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] arrow flight sql jdbc drive with Lz4Compression [arrow]

2024-05-27 Thread via GitHub


kou closed issue #41456: arrow flight sql jdbc drive with Lz4Compression
URL: https://github.com/apache/arrow/issues/41456


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [CI][Packaging] Fix conda arrow-nightlies channel [arrow]

2024-05-27 Thread via GitHub


amoeba opened a new issue, #41856:
URL: https://github.com/apache/arrow/issues/41856

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   The Conda [arrow-nightlies channel is 
empty](https://anaconda.org/arrow-nightlies/repo/files?label=main=conda), 
which means you can't install Arrow C++ or PyArrow nightlies from it at the 
moment. I noticed this in CI on 
https://github.com/apache/arrow-cookbook/pull/352. It's my understanding that 
the jobs that upload artifacts to this channel are running but failing, see the 
failing builds at http://crossbow.voltrondata.com/.
   
   From a quick look, the failures may just be due to Azure deprecations based 
on this error I see in a few Azure Pipelines logs:
   
   > The CondaEnvironment@1 (Conda environment) task has been deprecated since 
February 13, 2019 and will soon be retired. Use the Conda CLI ('conda') 
directly from a bash/pwsh/script task. Please visit 
https://aka.ms/azdo-deprecated-tasks to learn more about deprecated tasks.
   
   ### Component(s)
   
   Continuous Integration, Packaging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [R][CI]: Remove more defunct rhub containers [arrow]

2024-05-27 Thread via GitHub


jonkeane opened a new issue, #41841:
URL: https://github.com/apache/arrow/issues/41841

   ### Describe the enhancement requested
   
   While debugging a CRAN submission, I found another location where we are 
using the stale rhub containers.
   
   ### Component(s)
   
   Continuous Integration, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Format][FlightRPC] Flight SQL evolution [arrow]

2024-05-27 Thread via GitHub


lidavidm opened a new issue, #41840:
URL: https://github.com/apache/arrow/issues/41840

   ### Describe the enhancement requested
   
   From https://github.com/apache/arrow-rs/issues/5731#issuecomment-2133104504
   
   Originally Flight RPC was implemented as a framework wrapping gRPC. This was 
especially expedient for the C++ implementation. By now it's mostly a weight 
dragging down Flight users, especially Flight SQL.
   
   If we have the chance to evolve Flight SQL and/or Flight RPC, some changes 
may include:
   
   - Use a proper gRPC service definition, instead of opaque bytes payloads
   
   ### Component(s)
   
   FlightRPC, Format


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] take into account orc's capabilities for finding tzdb [arrow]

2024-05-27 Thread via GitHub


kou closed issue #41755: [C++] take into account orc's capabilities for finding 
tzdb
URL: https://github.com/apache/arrow/issues/41755


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Add support for FileIO [arrow-julia]

2024-05-27 Thread via GitHub


Beforerr opened a new issue, #507:
URL: https://github.com/apache/arrow-julia/issues/507

   It is registered in FileIO; however, neither `load` nor `fileio_load` is 
defined.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Issue using open_dataset() in r4.4.0 [arrow]

2024-05-26 Thread via GitHub


SHEvElynP opened a new issue, #41835:
URL: https://github.com/apache/arrow/issues/41835

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hello
   
   My workplace has recently moved from R 4.3.2 to R 4.4.0. I used to be able 
to do
   
     open_dataset(dir_name,
                  format = "arrow",
                  partitioning = hive_partition())
   
   but now I get an error saying "This build of the arrow package does not 
support Datasets". I attempted the workaround in 
https://github.com/apache/arrow/issues/40667#issuecomment-2007942987, but it 
broke my .Rproj file and RStudio would not open it, so I had to create a new 
one.
   
   Does anyone know any other workaround? I am fairly new to anything 
resembling coding.
   
   Thank you!
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Fields within a null struct are not initialized with null values [arrow]

2024-05-26 Thread via GitHub


timsaucer opened a new issue, #41833:
URL: https://github.com/apache/arrow/issues/41833

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When creating an array from a Python dict, field entries of a null struct 
are initialized with default values rather than null, even if their field is 
nullable. In the minimal example below, you would expect the values of 
`inner_1` and `inner_2` in the 3rd row to be null.
   
   ```
   import pyarrow as pa
   
   print(pa.array([
   {"outer": {"inner_1": 1, "inner_2": 2}},
   {"outer": {"inner_1": 3, "inner_2": None}},
   {"outer": None},
   ]))
   ```
   
   Generates the following output:
   
   ```
   -- is_valid: all not null
   -- child 0 type: struct<inner_1: int64, inner_2: int64>
     -- is_valid:
       [
         true,
         true,
         false
       ]
     -- child 0 type: int64
       [
         1,
         3,
         0
       ]
     -- child 1 type: int64
       [
         2,
         null,
         0
       ]
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner [arrow]

2024-05-25 Thread via GitHub


kou closed issue #41749: [GLib] Allow getting a RecordBatchReader from a 
Dataset or Dataset Scanner
URL: https://github.com/apache/arrow/issues/41749


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [R] Update relative URLs in README to absolute paths to prevent CRAN check failures [arrow]

2024-05-25 Thread via GitHub


thisisnic opened a new issue, #41829:
URL: https://github.com/apache/arrow/issues/41829

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   In #40148, we updated the README, but there were some URLs in there which 
pointed to relative links; we should update them to point to the absolute path 
so we don't fail CRAN checks.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [R] Update NEWS.md for 16.0.0 [arrow]

2024-05-25 Thread via GitHub


thisisnic closed issue #41420: [R] Update NEWS.md for 16.0.0
URL: https://github.com/apache/arrow/issues/41420


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Parquet][Benchmark] Adding benchmarking for reading Statistics [arrow]

2024-05-25 Thread via GitHub


mapleFU opened a new issue, #41826:
URL: https://github.com/apache/arrow/issues/41826

   ### Describe the enhancement requested
   
   This PR ( https://github.com/apache/arrow/pull/41761 ) adds a basic 
benchmark for metadata. We'd like to add more benchmarks on Statistics 
encoding/decoding.
   
   The Parquet standard supports statistics ( see 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244
 ). In C++ Parquet, the statistics are decoded from thrift and converted to an 
`EncodedStatistics` or `Statistics` ( see 
https://github.com/apache/arrow/blob/7c8ce4589ae9e3c4a9c0cd54cff81a54ac003079/cpp/src/parquet/statistics.h
 ).
   
   We'd like to add benchmarks for reading/writing Statistics, especially for 
BYTE_ARRAY, which can have a long `min_value` and `max_value`.
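
   For context, a minimal pyarrow sketch of the read path such a benchmark 
would exercise (the file name is hypothetical):
   
   ```python
   import pyarrow.parquet as pq
   
   # Opening the file reads the footer; statistics are decoded from thrift.
   meta = pq.ParquetFile("data.parquet").metadata
   
   # Per column chunk, min/max statistics are exposed after decoding.
   stats = meta.row_group(0).column(0).statistics
   if stats is not None and stats.has_min_max:
       print(stats.min, stats.max)
   ```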
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [CI][GLib] Suppress "`unlink': Permission denied" warnings in tests on Windows [arrow]

2024-05-24 Thread via GitHub


kou closed issue #41770: [CI][GLib] Suppress "`unlink': Permission denied" 
warnings in tests on Windows
URL: https://github.com/apache/arrow/issues/41770


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] python/adbc_driver_postgresql ingest NOT_IMPLEMENTED when running adbc_ingest with json array [arrow-adbc]

2024-05-24 Thread via GitHub


lidavidm closed issue #1868: python/adbc_driver_postgresql ingest 
NOT_IMPLEMENTED when running adbc_ingest with json array
URL: https://github.com/apache/arrow-adbc/issues/1868


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Enhance the `copyFrom*` functionality in StringView [arrow]

2024-05-24 Thread via GitHub


lidavidm closed issue #40933: [Java] Enhance the `copyFrom*` functionality in 
StringView
URL: https://github.com/apache/arrow/issues/40933


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Parquet] Unify normalize dictionary encoding handling [arrow]

2024-05-24 Thread via GitHub


mapleFU opened a new issue, #41818:
URL: https://github.com/apache/arrow/issues/41818

   ### Describe the enhancement requested
   
   This is mentioned here: 
https://github.com/apache/arrow/pull/40957#discussion_r1562703901
   
   There are some points:
   
   1. 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L444-L445 
: the encoding is not passed in the Encoder.
   2. But it's RLE in the decoder: 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L1607 ; it 
will be detected and normalized in other places, such as:
   3. 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L876
   
   We'd better unify them.
   
   
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Create Meson WrapDB Entry for Arrow [arrow]

2024-05-24 Thread via GitHub


WillAyd opened a new issue, #41816:
URL: https://github.com/apache/arrow/issues/41816

   ### Describe the enhancement requested
   
   Meson has a rather nice collection of projects in its WrapDB, which makes it 
rather easy to add dependencies to your project:
   
   https://mesonbuild.com/Wrapdb-projects.html
   
   I do not believe this would require Arrow to implement the Meson build 
system; we would just have to provide Meson patch files as part of the WrapDB:
   
   https://mesonbuild.com/Adding-new-projects-to-wrapdb.html
   
   This is also something I've explored for nanoarrow, with the only difference 
being that nanoarrow has meson build files in the source tree.
   
   Would this be something the Arrow team would be interested in? And if so, 
are there any thoughts on the dependencies we would like to provide? I was 
thinking something along the lines of:
   
   - arrow_core
   - arrow_parquet
   - arrow_flight
   - arrow_gandiva
   - arrow_acero
   - arrow_dataset
   - arrow_substrait
   
   To match how @raulcd created the new conda packages for pyarrow 
https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255#issuecomment-1920988437
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] `pyarrow.write_feather` can't be used in `atexit` contexts to write a `pandas.DataFrame` [arrow]

2024-05-24 Thread via GitHub


pjh40 opened a new issue, #41815:
URL: https://github.com/apache/arrow/issues/41815

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When `pyarrow.write_feather()` is given a `pandas.DataFrame`, 
`write_feather()` unconditionally calls `Table.from_pandas()` with the default 
`nthreads=None` argument.  This is then passed to 
`pandas_compat.dataframe_to_arrays()`, allowing it to heuristically use a 
`concurrent.futures.ThreadPoolExecutor` to convert columns.  This causes a 
runtime error when `write_feather` is used in an `atexit` (or 
`weakref.finalize`) context on exit of the interpreter:
   ```
   RuntimeError: cannot schedule new futures after interpreter shutdown
   ```
   This scenario can be avoided by adding a `use_threads` parameter to 
`write_feather` that can be used to force serial operation.
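
   A possible interim workaround (a sketch, assuming the caller converts the 
frame itself before writing; names here are made up) is to build the Table 
serially so no thread pool is needed:
   
   ```python
   import atexit
   
   import pandas as pd
   import pyarrow as pa
   from pyarrow import feather
   
   df = pd.DataFrame({"a": [1, 2, 3]})
   
   def save_on_exit():
       # nthreads=1 forces serial column conversion, avoiding the
       # ThreadPoolExecutor that fails during interpreter shutdown.
       table = pa.Table.from_pandas(df, nthreads=1)
       feather.write_feather(table, "out.feather")
   
   atexit.register(save_on_exit)
   ```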
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Clean up Assorted Warnings to get a clean nanoarrow build [arrow]

2024-05-24 Thread via GitHub


bkietz closed issue #41478: [C++] Clean up Assorted Warnings to get a clean 
nanoarrow build
URL: https://github.com/apache/arrow/issues/41478


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Segfault when collecting parquet dataset query results [arrow]

2024-05-24 Thread via GitHub


mrd0ll4r opened a new issue, #41813:
URL: https://github.com/apache/arrow/issues/41813

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Hello!
   I've been using arrow with R for a while now to great success.
   Recently, I've re-opened an old project (managed with renv, so I'm pretty 
confident all the package versions were the same).
   It is possible I upgraded the OS and/or OS packages in the meantime.
   Now, some of my queries on a gzip-compressed dataset of parquet files lead 
to a segfault:
   
   ```
*** caught segfault ***
   address 0x7f54ce520898, cause 'memory not mapped'
   
   Traceback:
1: Table__from_ExecPlanReader(self)
2: x$read_table()
3: as_arrow_table.RecordBatchReader(reader)
4: as_arrow_table(reader)
5: as_arrow_table.arrow_dplyr_query(x)
6: as_arrow_table(x)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)
   10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 
4)) {augment_io_error_msg(e, call, schema = schema())})
   11: compute.arrow_dplyr_query(x)
   12: collect.arrow_dplyr_query(.)
   13: collect(.)
   14: d_redacted %>% group_by(year, month, cid) %>% summarize(n = n()) %>% 
collect()
   ```
   
   I have a core dump from that session, but it's 46GB.
   I'm not a professional in analyzing these things, but this is what I got:
   ```
   Core was generated by `/usr/lib/R/bin/exec/R'.
   Program terminated with signal SIGSEGV, Segmentation fault.
   #0  0x7f612d4ea3b0 in 
arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, 
unsigned int, unsigned short const*, unsigned int const*, 
arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, 
arrow::compute::RowTableImpl const&, unsigned char*) () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   [Current thread is 1 (Thread 0x7f6093fff640 (LWP 2273813))]
   (gdb) bt
   #0  0x7f612d4ea3b0 in 
arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, 
unsigned int, unsigned short const*, unsigned int const*, 
arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, 
arrow::compute::RowTableImpl const&, unsigned char*) () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #1  0x7f612d4d7093 in void 
arrow::compute::KeyCompare::CompareBinaryColumnToRow(unsigned int, 
unsigned int, unsigned short const*, unsigned int const*, 
arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, 
arrow::compute::RowTableImpl const&, unsigned char*) () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #2  0x7f612d4d6278 in 
arrow::compute::KeyCompare::CompareColumnsToRows(unsigned int, unsigned short 
const*, unsigned int const*, arrow::compute::LightContext*, unsigned int*, 
unsigned short*, std::vector > const&, 
arrow::compute::RowTableImpl const&, bool, unsigned char*) ()
  from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #3  0x7f612d4d896e in ?? () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #4  0x7f612d3a98e6 in ?? () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #5  0x7f612d3ab154 in arrow::compute::SwissTable::find(int, unsigned int 
const*, unsigned char*, unsigned char const*, unsigned int*, 
arrow::util::TempVectorStack*, std::function const&, void*) 
const ()
  from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #6  0x7f612d4df2d0 in ?? () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #7  0x7f612d4dfb73 in ?? () from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #8  0x7f612cf8da83 in arrow::acero::aggregate::GroupByNode::Merge() () 
from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #9  0x7f612cf8f8a3 in 
arrow::acero::aggregate::GroupByNode::OutputResult(bool) ()
  from 
/home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #10 0x7f612cf941f6 in 
arrow::acero::aggregate::GroupByNode::InputReceived(arrow::acero::ExecNode*, 
arrow::compute::ExecBatch) ()
   

[I] Table.from_arrow can't import nan values into a non-null float column [arrow]

2024-05-24 Thread via GitHub


lord opened a new issue, #41812:
URL: https://github.com/apache/arrow/issues/41812

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   This small example fails with `ValueError: Field pyarrow.Field<a: double not null> was non-nullable but pandas column had 1 null values` on 16.1.0.
   
   ```
   import pandas as pd
   import pyarrow as pa
   
   df = pd.DataFrame({"a": [1.0, float("nan")]})
   schema = pa.schema([pa.field('a', pa.float64(), nullable=False)])
   pa.Table.from_pandas(df, schema=schema)
   ```
   
   I guess this seems like a bug to me, but I'm no pandas expert. It does feel 
like this makes roundtripping a non-null float column through pandas impossible?
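
   A possible workaround (a sketch; whether NaN should be treated as null here 
is exactly what this issue questions) is to convert the column with 
`from_pandas=False`, so NaN stays a floating-point value instead of becoming a 
null:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   
   df = pd.DataFrame({"a": [1.0, float("nan")]})
   schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])
   
   # from_pandas=False keeps NaN as a value rather than masking it as null.
   arr = pa.array(df["a"], type=pa.float64(), from_pandas=False)
   table = pa.Table.from_arrays([arr], schema=schema)
   ```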
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Importing an extension type without `ARROW:extension:metadata` crashes [arrow]

2024-05-24 Thread via GitHub


paleolimbot closed issue #41741: [C++] Importing an extension type without 
`ARROW:extension:metadata` crashes
URL: https://github.com/apache/arrow/issues/41741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] Add Compute Kernel for Casting from union to string [arrow]

2024-05-24 Thread via GitHub


llama90 opened a new issue, #41810:
URL: https://github.com/apache/arrow/issues/41810

   ### Describe the enhancement requested
   
   This is a sub-issue of the issue mentioned below.
   
   - #35560
   
   This issue is aiming to address #39182.
   
   A pull request (https://github.com/apache/arrow/pull/40237) has been 
submitted to resolve issue, and additional features that need to be supported 
have emerged.
   
   | From         | To   | Using Function |
   |--------------|------|----------------|
   | sparse_union | utf8 | cast_string    |
   | dense_union  | utf8 | cast_string    |
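
   A sketch of the call this kernel would enable (the example arrays are made 
up; today the cast is not implemented):
   
   ```python
   import pyarrow as pa
   
   # A sparse union with an int64 child and a utf8 child.
   union = pa.UnionArray.from_sparse(
       pa.array([0, 1], type=pa.int8()),
       [pa.array([1, 2]), pa.array(["a", "b"])],
   )
   
   # Desired: cast each union value to its utf8 representation.
   union.cast(pa.string())
   ```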
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] Add Compute Kernel for Casting from map to string [arrow]

2024-05-24 Thread via GitHub


llama90 opened a new issue, #41809:
URL: https://github.com/apache/arrow/issues/41809

   ### Describe the enhancement requested
   
   This is a sub-issue of the issue mentioned below.
   
   - #35560
   
   This issue is aiming to address #39182.
   
   A pull request (https://github.com/apache/arrow/pull/40237) has been 
submitted to resolve issue, and additional features that need to be supported 
have emerged.
   
   | From | To   | Using Function |
   |------|------|----------------|
   | map  | utf8 | cast_string    |
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [JAVA] JNI mvn generate-resources failed because arrow-bom is not generated [arrow]

2024-05-24 Thread via GitHub


jinchengchenghh opened a new issue, #41808:
URL: https://github.com/apache/arrow/issues/41808

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   arrow_ep/src/arrow_ep/java# mvn generate-resources -P 
generate-libs-cdata-all-os -Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR   
-Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -N
   [INFO] Scanning for projects...
   Downloading from central: 
https://repo.maven.apache.org/maven2/org/apache/arrow/arrow-bom/15.0.0-gluten-3/arrow-bom-15.0.0-gluten-3.pom
   [ERROR] [ERROR] Some problems were encountered while processing the POMs:
   [ERROR] Non-resolvable import POM: The following artifacts could not be 
resolved: org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 (absent): Could not 
find artifact org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 in central 
(https://repo.maven.apache.org/maven2) @ line 601, column 20
@
   [ERROR] The build could not read 1 project -> [Help 1]
   [ERROR]
   [ERROR]   The project org.apache.arrow:arrow-java-root:15.0.0-gluten-3 
(/mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/_build/release/third_party/arrow_ep/src/arrow_ep/java/pom.xml)
 has 1 error
   [ERROR] Non-resolvable import POM: The following artifacts could not be 
resolved: org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 (absent): Could not 
find artifact org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 in central 
(https://repo.maven.apache.org/maven2) @ line 601, column 20 -> [Help 2]
   [ERROR]
   [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR]
   [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
   [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
   
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Adding `variadicBufferCounts` to `RecordBatch` [arrow]

2024-05-23 Thread via GitHub


lidavidm closed issue #41730: [Java] Adding `variadicBufferCounts` to 
`RecordBatch`
URL: https://github.com/apache/arrow/issues/41730


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Nullability of struct child vectors not preserved in TransferPair [arrow]

2024-05-23 Thread via GitHub


lidavidm closed issue #41686: [Java] Nullability of struct child vectors not 
preserved in TransferPair
URL: https://github.com/apache/arrow/issues/41686


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Transition from gradle-enterprise-maven-extension to develocity-maven-extension [arrow]

2024-05-23 Thread via GitHub


lidavidm closed issue #41799: [Java] Transition from 
gradle-enterprise-maven-extension to develocity-maven-extension
URL: https://github.com/apache/arrow/issues/41799


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Implement a function to retrieve reference buffers in StringView [arrow]

2024-05-23 Thread via GitHub


lidavidm closed issue #40930: [Java] Implement a function to retrieve reference 
buffers in StringView 
URL: https://github.com/apache/arrow/issues/40930


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [GLib][CI] Use vcpkg for C++ dependencies when building GLib libraries with MSVC [arrow]

2024-05-23 Thread via GitHub


adamreeve opened a new issue, #41806:
URL: https://github.com/apache/arrow/issues/41806

   ### Describe the enhancement requested
   
   This is a follow up to #41134 and should hopefully allow building more of 
the GLib libraries with MSVC. 
   
   Context: https://github.com/apache/arrow/pull/41599#discussion_r1596163069 
and https://github.com/apache/arrow/pull/41599#issuecomment-2126145170
   
   ### Component(s)
   
   Continuous Integration, GLib


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Use immutables value-annotations instead of value artifact [arrow]

2024-05-23 Thread via GitHub


lidavidm closed issue #41789: [Java] Use immutables value-annotations instead 
of value artifact
URL: https://github.com/apache/arrow/issues/41789


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [GLib] Add support for MSVC with vcpkg [arrow]

2024-05-23 Thread via GitHub


kou closed issue #41134: [GLib] Add support for MSVC with vcpkg
URL: https://github.com/apache/arrow/issues/41134


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Thirdparty: Bump xsimd to 13.0.0 [arrow]

2024-05-23 Thread via GitHub


kou closed issue #41547: [C++] Thirdparty: Bump xsimd to 13.0.0
URL: https://github.com/apache/arrow/issues/41547


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Flight] Flight benchmark doesn't work anymore [arrow]

2024-05-23 Thread via GitHub


kou closed issue #41780: [C++][Flight] Flight benchmark doesn't work anymore
URL: https://github.com/apache/arrow/issues/41780


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Swift] Add Struct (Nested) types [arrow]

2024-05-23 Thread via GitHub


abandy opened a new issue, #41804:
URL: https://github.com/apache/arrow/issues/41804

   ### Describe the enhancement requested
   
   Struct (nested) types are currently not implemented in Swift. This issue 
adds nested types, since they are required to implement other Arrow features.
   
   ### Component(s)
   
   Swift


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [MATLAB] Add C Data Interface format import/export functionality for `arrow.tabular.RecordBatch` [arrow]

2024-05-23 Thread via GitHub


sgilmore10 opened a new issue, #41803:
URL: https://github.com/apache/arrow/issues/41803

   ### Describe the enhancement requested
   
   Now that #41656 has been closed, we should add MATLAB APIs for 
importing/exporting `arrow.tabular.RecordBatch`es using the C Data Interface 
format.
   
   The C Data Interface format import/export workflows would look like this:
   
   ### Import into MATLAB
   
   ```matlab
   cArray = arrow.c.Array
   cSchema = arrow.c.Schema
   .
   .
   . 
   % Pass cArray and cSchema to export APIs of another Arrow language bindings 
to fill in C struct details
   .
   .
   .
   % Import Arrow RecordBatch from pre-populated C Data interface format C 
structs
   rb = arrow.tabular.RecordBatch.importFromC(cArray, cSchema);
   ```
   
   ### Export from MATLAB
   
   ```matlab
   .
   .
   . 
   % Create C Data Interface format ArrowArray and ArrowSchema C structs using 
APIs of another Arrow language binding ...
   .
   .
   .
   rb = arrow.recordBatch(table((1:10)'));
   % Export Arrow RecordBatch from MATLAB to C Data Interface format and fill 
in C struct details
   rb.exportToC(cArrayAddress, cSchemaAddress)
   .
   .
   .
   % Import Arrow RecordBatch from pre-populated C Data Interface format C 
structs using APIs of another Arrow language binding 
   ```
   
   We can implement this functionality using the C Data Interface format C++ 
APIS defined in 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.h.
   
   
   ### Component(s)
   
   MATLAB


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] Java Cookbook fails on 16.0.0-SNAPSHOT [arrow-cookbook]

2024-05-23 Thread via GitHub


amoeba closed issue #347: [Java] Java Cookbook fails on 16.0.0-SNAPSHOT
URL: https://github.com/apache/arrow-cookbook/issues/347


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst [arrow]

2024-05-23 Thread via GitHub


pitrou closed issue #41186: [C++][Parquet][Doc] Denote PARQUET:field_id in 
parquet.rst
URL: https://github.com/apache/arrow/issues/41186


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Acero] An useless parameter for QueryContext::Init called in hash_join_benchmark [arrow]

2024-05-23 Thread via GitHub


pitrou closed issue #41720: [C++][Acero] An useless parameter for 
QueryContext::Init called in hash_join_benchmark
URL: https://github.com/apache/arrow/issues/41720


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Add functionality to MemoryManager for copying a slice of a buffer [arrow]

2024-05-23 Thread via GitHub


pitrou closed issue #39858: [C++] Add functionality to MemoryManager for 
copying a slice of a buffer
URL: https://github.com/apache/arrow/issues/39858


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][S3] Remove GetBucketRegion hack for newer AWS SDK versions [arrow]

2024-05-23 Thread via GitHub


pitrou opened a new issue, #41797:
URL: https://github.com/apache/arrow/issues/41797

   ### Describe the enhancement requested
   
   In https://github.com/aws/aws-sdk-cpp/issues/1885#issuecomment-2118124214 it 
was pointed out that the "x-amz-bucket-region" header of successful HeadBucket 
responses is now accessible using `S3Model::HeadBucketResult::GetRegion`. We 
should use that API whenever possible.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Deserialization as Vector{SubArray} breaks `push!` on DataFrame [arrow-julia]

2024-05-23 Thread via GitHub


maleadt opened a new issue, #506:
URL: https://github.com/apache/arrow-julia/issues/506

   I'm using Arrow v2.7.2 with DataFrames v1.6.1 on Julia 1.10, and am running 
into an issue that seems to stem from Arrow.jl deserializing my 
`Vector{Vector{T}}` columns as `Vector{SubArray{...}}`:
   
   ```julia
   julia> using Arrow, DataFrames
   
   julia> df = DataFrame(foo=Vector{Int}[]);
   
   julia> push!(df, [[1,2,3]])
   1×1 DataFrame
    Row │ foo
        │ Array…
   ─────┼───────────
      1 │ [1, 2, 3]
   
   julia> Arrow.write("/tmp/test.arrow", df)
   "/tmp/test.arrow"
   
   julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")));
   
   julia> typeof(df2.foo)
   Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, 
Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, 
Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
   ```
   
   This breaks certain `push!`es on the dataframe, which I haven't been able to 
reproduce in isolation, but which looks as follows:
   
   ```
   MethodError: Cannot `convert` an object of type Vector{Int64} to an object 
of type SubArray{Int64, 1, Arrow.Primitive{Int64, Vector{Int64}}, 
Tuple{UnitRange{Int64}}, true}
   
   Stacktrace:
 [1] push!(a::Vector{SubArray{Int64, 1, Arrow.Primitive{Int64, 
Vector{Int64}}, Tuple{UnitRange{Int64}}, true}}, item::Vector{Int64})
   @ Base ./array.jl:1118
 [2] _row_inserter!(df::DataFrame, loc::Int64, row::Tuple{String, 
Vector{Int64}, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, 
Bool, Bool, Bool, Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, 
Float64}, mode::Val{:push}, promote::Bool)
   @ DataFrames 
~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:663
 [3] push!(df::DataFrame, row::Tuple{String, Vector{Int64}, Int64, Int64, 
Int64, Int64, Int64, Int64, Int64, Int64, String, Bool, Bool, Bool, 
Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, Float64})
   @ DataFrames 
~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:457
   ```
   
   It's possible I'm doing something wrong; first time Arrow.jl user here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] [Python] Add functionality of `STSProfileCredentialsProvider` to default credentials chain for `S3FileSystem` [arrow]

2024-05-23 Thread via GitHub


fjetter opened a new issue, #41794:
URL: https://github.com/apache/arrow/issues/41794

   ### Describe the enhancement requested
   
   Given a typical AWS credentials setup that defines IAM roles like the 
following
   
   ```
   # ~/.aws/config
   [default]
   region=us-east-2
   role_arn=arn:aws:iam::123456789012:role/RoleName
   source_profile=default
   
   # ~/.aws/credentials
   [default]
   aws_access_key_id=XXX
   aws_secret_access_key=
   ```
   
   almost all AWS SDKs correctly interpret this as an `assume-role` method 
that generates a temporary STS token pair.
   
   For example, using python this looks like
   
   ```python
   import boto3
   b3sess = boto3.Session()
   creds = b3sess.get_credentials()
   {
   "method": creds.method,
   "secret": creds.secret_key[:5] + "...",
   "token": creds.token[:5] + "...",
   }
   
   {'method': 'assume-role', 'secret': 'jALbI...', 'token': 'IQoJb...'}
   ```
   
   The C++ SDK deviates from how the default credentials chain is implemented 
elsewhere and does not support this kind of configuration; instead it uses the 
plain access key + secret key pair found in the configuration, which does not 
necessarily provide sufficient permissions.
   
   Dask adopted the S3FileSystem as a more performant alternative to the 
existing default fsspec filesystem for its parquet reader, but this lack of 
support in the C++ SDK is a nasty blocker for further adoption. We ended up 
writing a workaround for our benchmarking by using boto to read the 
credentials and initialize the [S3FileSystem 
manually](https://github.com/coiled/benchmarks/blob/934a69e0ed093ef7319a5034b87c03a53dc0c0d8/tests/tpch/conftest.py#L290-L301), 
but this has a couple of flaws. For starters, it is pretty unergonomic and 
nontrivial; more importantly, it prohibits refreshing the token after 
expiration (max duration is 1hr).
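
   For reference, a minimal sketch of that workaround, assuming boto3 is 
installed; note the STS triple is fetched once and never refreshed:

   ```python
   import boto3
   from pyarrow import fs

   # Resolve credentials the same way boto3 does, including assume-role
   # profiles, then hand the resulting STS triple to PyArrow.
   session = boto3.Session()
   creds = session.get_credentials().get_frozen_credentials()

   s3 = fs.S3FileSystem(
       access_key=creds.access_key,
       secret_key=creds.secret_key,
       session_token=creds.token,  # temporary STS token; expires, no refresh
       region=session.region_name,
   )
   ```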
   
   There's been some discussion on the aws-sdk-cpp repo about this, with a 
suggestion to implement an amended credentials chain that includes the 
`STSProfileCredentialsProvider` (see 
[here](https://github.com/aws/aws-sdk-cpp/issues/150#issuecomment-538548438)), 
but it's also pointed out that this is flawed as well.
   
   Also related
   - https://github.com/aws/aws-sdk-cpp/issues/2814
   - https://github.com/aws/aws-sdk-cpp/pull/2815
   
   I know this is ultimately an aws-sdk-cpp problem, but end users of the 
Arrow `S3FileSystem` do not have this transparency and expect things to "just 
work", particularly when consuming the Python API, where they are used to how 
boto and other libraries parse credentials.
   
   cc @pitrou since you've been poking in this area recently
   
   ### Component(s)
   
   C++, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Can get tables info with schema contain custom field type [arrow]

2024-05-23 Thread via GitHub


Curricane closed issue #41722: Can get tables info with schema contain custom 
field type
URL: https://github.com/apache/arrow/issues/41722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] fmpp-maven-plugin generates files directly under target/generated-sources [arrow]

2024-05-22 Thread via GitHub


lidavidm closed issue #41787: [Java] fmpp-maven-plugin generates files directly 
under target/generated-sources
URL: https://github.com/apache/arrow/issues/41787


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [CI][Integration][Release] RC verification script failed [arrow]

2024-05-22 Thread via GitHub


kou opened a new issue, #41792:
URL: https://github.com/apache/arrow/issues/41792

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   verify-rc-source-integration-linux-almalinux-8-amd64:
   
   
https://github.com/ursacomputing/crossbow/actions/runs/9191601776/job/25278362624#step:6:86624
   
   ```text
   Traceback (most recent call last):
 File "/tmp/arrow-HEAD.WiViY/venv-source/bin/archery", line 8, in 
   sys.exit(archery())
^
 File 
"/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py",
 line 1157, in __call__
   return self.main(*args, **kwargs)
  ^^
 File 
"/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py",
 line 1078, in main
   rv = self.invoke(ctx)

 File 
"/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py",
 line 1688, in invoke
   return _process_result(sub_ctx.command.invoke(sub_ctx))
  ^^^
 File 
"/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py",
 line 1434, in invoke
   return ctx.invoke(self.callback, **ctx.params)
  ^^^
 File 
"/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py",
 line 783, in invoke
   return __callback(*args, **kwargs)
  ^^^
 File "/arrow/dev/archery/archery/cli.py", line 771, in integration
   from .integration.runner import write_js_test_json, run_all_tests
 File "/arrow/dev/archery/archery/integration/runner.py", line 36, in 

   from .tester_java import JavaTester
 File "/arrow/dev/archery/archery/integration/tester_java.py", line 53, in 

   _arrow_version = load_version_from_pom()
^^^
 File "/arrow/dev/archery/archery/integration/tester_java.py", line 37, in 
load_version_from_pom
   tree = ET.parse(os.path.join(ARROW_BUILD_ROOT, 'java', 'pom.xml'))
  ^^^
 File "/usr/lib64/python3.11/xml/etree/ElementTree.py", line 1218, in parse
   tree.parse(source, parser)
 File "/usr/lib64/python3.11/xml/etree/ElementTree.py", line 569, in parse
   source = open(source, "rb")
^^
   FileNotFoundError: [Errno 2] No such file or directory: '/java/pom.xml'
   ```
   
   ### Component(s)
   
   Continuous Integration, Integration, Release


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [CI][Conda] The CondaEnvironment@1 (Conda environment) task has been deprecated since February 13, 2019 and will soon be retired [arrow]

2024-05-22 Thread via GitHub


kou opened a new issue, #41791:
URL: https://github.com/apache/arrow/issues/41791

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   conda-linux-aarch64-cpu-py3:
   
   
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=64087=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=cf5ca333-5432-59fc-78af-35cb2d46743b=161
   
   ```text
   ##[error]The CondaEnvironment@1 (Conda environment) task has been deprecated 
since February 13, 2019 and will soon be retired. Use the Conda CLI ('conda') 
directly from a bash/pwsh/script task. Please visit 
https://aka.ms/azdo-deprecated-tasks to learn more about deprecated tasks.
   ```
   
   `CondaEnvironment` is used here: 
https://github.com/apache/arrow/blob/9185d7dad773ed8768f90fb63ad3ef7e7a92f108/dev/tasks/conda-recipes/azure.linux.yml#L49-L53
   
   ### Component(s)
   
   Continuous Integration, Packaging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Make git-dependent preprocessor definitions internal [arrow]

2024-05-22 Thread via GitHub


kou closed issue #41783: [C++] Make git-dependent preprocessor definitions 
internal
URL: https://github.com/apache/arrow/issues/41783


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Java] fmpp-maven-plugin generates files directly under target/generated-sources [arrow]

2024-05-22 Thread via GitHub


laurentgo opened a new issue, #41787:
URL: https://github.com/apache/arrow/issues/41787

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `fmpp-maven-plugin` is used in the `arrow-vector` module to generate source 
files before the compilation phase. Those files are generated directly under 
`target/generated-sources`, where they conflict with the 
`target/generated-sources/annotations` directory created by `javac`. Per 
convention, each plugin generates files under its own directory to prevent 
risks of conflicts.
   
   Although it doesn't cause a direct issue with the build, it may also confuse 
some IDEs (notably Eclipse and VSCode) which detect overlapping source 
directories.
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [CI][Python] AMD64 Conda Java C Data Interface Integration Failure building PyArrow trying to use PYARROW_PARQUET [arrow]

2024-05-22 Thread via GitHub


kou closed issue #41725: [CI][Python] AMD64 Conda Java C Data Interface 
Integration Failure building PyArrow trying to use PYARROW_PARQUET
URL: https://github.com/apache/arrow/issues/41725


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Add option to disable exact matches optional in join_asof [arrow]

2024-05-22 Thread via GitHub


0x26res opened a new issue, #41786:
URL: https://github.com/apache/arrow/issues/41786

   ### Describe the enhancement requested
   
   I would like to do a `join_asof` that would exclude exact matches.
   
   This is supported in pandas 
https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html
   
   In the example below, I expect [1, 2, 2, 3] instead of the current [1, 2, 3, 3].
   
   ```python
   left = pa.table({"left": [10, 20, 30, 40], "key": [1, 1, 1, 1]})
   right = pa.table(
   {
   "right": [9, 12, 30, 41],
   "key": [1, 1, 1, 1],
   "value": [1, 2, 3, 4],
   }
   )
   
   assert left.join_asof(
   right, on="left", by="key", tolerance=-10, right_on="right", 
right_by="key"
   ) == pa.table(
   {
   "left": [10, 20, 30, 40],
   "key": [1, 1, 1, 1],
   "value_right": [1, 2, 3, 3], # Should be [1,2,2,3]
   }
   )
   
   ```
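
   For comparison, a minimal pandas sketch of the requested behaviour using 
its `allow_exact_matches` flag (same data as above; tolerance omitted to keep 
the example small):

   ```python
   import pandas as pd

   left = pd.DataFrame({"left": [10, 20, 30, 40], "key": [1, 1, 1, 1]})
   right = pd.DataFrame({"right": [9, 12, 30, 41], "key": [1, 1, 1, 1],
                         "value": [1, 2, 3, 4]})

   # allow_exact_matches=False skips rows where right == left, which is the
   # option being requested for pyarrow's join_asof.
   merged = pd.merge_asof(left, right, left_on="left", right_on="right",
                          by="key", direction="backward",
                          allow_exact_matches=False)
   print(merged["value"].tolist())  # [1, 2, 2, 3]
   ```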
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Mismatch between package version and library version in naming [arrow]

2024-05-22 Thread via GitHub


daeden opened a new issue, #41784:
URL: https://github.com/apache/arrow/issues/41784

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   **Version**: 16.1.0
   
   **Platform**: Details about the operating system or environment where the 
bug was found
   
   **Summary**: The version number in the library name does not match the 
version of the package that is installed. This causes load-time issues where we 
fail to find library dependencies even when the version mismatch is only a 
minor version change, which should be backwards compatible.
   
   **Steps to Reproduce**:
   1. Install the package for version 16.1.0
   2. List the installed libraries (`ls /lib64/libarrow.so*`)
   
   **Expected Result**: The libraries should be labeled with the correct 
version number (16.1.0) with symlinks for major version and non versioned.
   
   For example, I would expect to find:
   
   ```
   $ ls  /lib64/libarrow.so*
   /lib64/libarrow.so@  /lib64/libarrow.so.16@  /lib64/libarrow.so.16.1.0*
   ```
   
   **Actual Result**: The libraries are labeled with an incorrect version 
number (1601.0.0)
   
   ```
   $ ls  /lib64/libarrow.so*
   /lib64/libarrow.so@  /lib64/libarrow.so.1601@  /lib64/libarrow.so.1601.0.0*
   ```
   
   ### Component(s)
   
   Packaging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet] Thrift: generate template method to accelerate reading thrift [arrow]

2024-05-22 Thread via GitHub


pitrou closed issue #41702: [C++][Parquet] Thrift: generate template method to 
accelerate reading thrift
URL: https://github.com/apache/arrow/issues/41702


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] ADBC Python Postgres - Stuck connections to the database [arrow-adbc]

2024-05-22 Thread via GitHub


gaspardc-met opened a new issue, #1881:
URL: https://github.com/apache/arrow-adbc/issues/1881

   ### What happened?
   
   Context before the bug (working):
   - Postgres database on Kubernetes with several tables
   - 4 services (webapp, machine learning inference, and FastAPI backend APIs) 
deployed on kubernetes and fetching data from postgres
   - 1 service, data orchestrator, writing data to Postgres
   - Fetching data from PG with `pd.read_sql` from Pandas and a SQLalchemy 
engine
   - Been doing this for 1+ year without any Postgres issues
   
   Switching to ADBC:
   - Following my upgrade to pandas >2.0.0 I wanted to switch to 
`adbc_driver_postgresql`'s `dbapi` connection with `pd.read_sql`
   - Initial tests were great, it was faster than before
   - Deployed this to production on all aforementioned services twice 
(initially with connection caching, then with no caching and properly closing 
each and every connection)
   - Once again initially smooth, everything worked and was fast
   
   Problem:
   - In both instances of the deployment, within ~12 hours, the connections 
would be stuck
   - Webapp or another service would create an ADBC connection, run the SQL 
query with `pd.read_sql` (we know this through caching), and then wait 
indefinitely. 
   - Reloading the webapp, clearing webapp cache, recreating the connection 
would do nothing at all
   - The log on the Postgres pod indicated a password issue with the current 
database/user, which never happened before
   - Both SQLalchemy and ADBC get the same postgres URI to create the 
engine/connection with
   - Reverting to SQLalchemy solved the problem, and the error has not been 
seen again
   
   ### How can we reproduce the bug?
   
   - The given URI was `"postgresql://{user}:{password}@{host}:{port}/{db}"` 
formatted with the proper values
   - The function was used to create the ADBC connection:
   
   ```python
   import pandas as pd
   from adbc_driver_postgresql import dbapi
   from adbc_driver_manager.dbapi import Connection

   def create_adbc_conn() -> Connection:
       logger_stdout.info(f"Creating a new ADBC connection at {pd.Timestamp.now()}.")
       uri = get_default_uri()  # URI shown above, formatted
       connection = dbapi.connect(uri=uri)
       logger_stdout.info("ADBC connection created")
       return connection
   ```
   
   - The function to execute the SQL query was:
   ```python
   from typing import Optional

   def handle_sql_query(
       sql: str,
       index_col: Optional[str] = None,
       connection: Optional[Connection] = None,
       need_to_close: bool = False,
   ) -> pd.DataFrame:
       # Note: the original snippet tested the undefined name `engine` here.
       if connection is None:
           logger_stdout.info(f"Connection is None, creating a new ADBC connection at {pd.Timestamp.now()}.")
           connection = create_adbc_conn()
           need_to_close = True
       try:
           logger_stdout.info("Executing SQL query with connection")
           return pd.read_sql_query(sql=sql, con=connection, index_col=index_col, parse_dates=[index_col])
       finally:
           if need_to_close:
               logger_stdout.info("Closing the ADBC connection.")
               connection.close()
   ```
   
   - The SQL queries ranged from `select * from TABLE_NAME` to selecting 
specific columns on a range of specific dates
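
   As an aside, a minimal sketch of the same flow using the DBAPI connection 
as a context manager (assuming the ADBC DBAPI objects support `with`, per 
PEP 249 conventions), which avoids leaking connections on exceptions:

   ```python
   import pandas as pd
   from adbc_driver_postgresql import dbapi

   def fetch_df(uri: str, sql: str) -> pd.DataFrame:
       # The connection is closed even if read_sql_query raises.
       with dbapi.connect(uri=uri) as connection:
           return pd.read_sql_query(sql=sql, con=connection)
   ```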
   
   ### Environment/Setup
   
   python 3.11
   pandas == 2.2.2
   adbc_driver_postgresql==0.11.0
   adbc-driver-manager==0.11.0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet] Minor: moving EncodedStats by default rather than copying [arrow]

2024-05-22 Thread via GitHub


mapleFU closed issue #41726: [C++][Parquet] Minor: moving EncodedStats by 
default rather than copying
URL: https://github.com/apache/arrow/issues/41726


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] Make git-dependent preprocessor definitions internal [arrow]

2024-05-22 Thread via GitHub


pitrou opened a new issue, #41783:
URL: https://github.com/apache/arrow/issues/41783

   ### Describe the enhancement requested
   
   The `ARROW_GIT_ID` and `ARROW_GIT_DESCRIPTION` preprocessor variables are 
currently exposed in `arrow/util/config.h`, and included from `arrow/config.h`. 
This means that any file indirectly including these headers will have to be 
recompiled if the git information changes - something which happens quite 
frequently during development.
   
   Using ccache with properly tuned configuration can work around the issue, 
but does not fully remove overhead. It also requires users to think about the 
best ccache configuration.
   
   By making those two variables private, we should fix the problem entirely.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] ParquetDataset object fails with a .read() method due to hive partition schema columns. [arrow]

2024-05-22 Thread via GitHub


j0bekt01 closed issue #41779: ParquetDataset object fails with a .read() method 
due to hive partition schema columns. 
URL: https://github.com/apache/arrow/issues/41779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Java][Flight] Flight SQL tests are flaky [arrow]

2024-05-22 Thread via GitHub


laurentgo opened a new issue, #41782:
URL: https://github.com/apache/arrow/issues/41782

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Several test failures in `flight-sql` module have been observed in multiple 
job executions:
   - https://github.com/apache/arrow/actions/runs/9185953424/job/25260768750
   - 
https://github.com/apache/arrow/actions/runs/9185714602/job/25260156899?pr=41772
   
   The reported issue is 
   
   ```
   Error:  Errors: 
   Error:TestFlightSqlStreams.tearDown:224 » IllegalState Memory was leaked 
by query. Memory leaked: (250384)
   Allocator(ROOT) 0/250384/250896/2147483647 (res/actual/peak/limit)
   ```
   
   Note that there are also multiple messages about unclosed `ManagedChannels` 
in:
   - `org.apache.arrow.flight.auth2.TestBasicAuth2`
   - 
`org.apache.arrow.flight.core/org.apache.arrow.flight.TestFlightGrpcUtils.testMultipleGrpcServices`
   - `org.apache.arrow.flight.core/org.apache.arrow.flight.TestDoExchange.setUp`
   - 
`org.apache.arrow.flight.core/org.apache.arrow.flight.TestServerOptions.addHealthCheckService`
   
   ```
   May 22, 2024 5:50:41 AM 
io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference cleanQueue
   SEVERE: *~*~*~ Previous channel ManagedChannelImpl{logId=505, 
target=directaddress:///localhost/127.0.0.1:} was garbage collected without 
being shut down! ~*~*~*
   Make sure to call shutdown()/shutdownNow()
   java.lang.RuntimeException: ManagedChannel allocation site
at 
io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.<init>(ManagedChannelOrphanWrapper.java:102)
at 
io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:60)
at 
io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:51)
at 
io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelImplBuilder.build(ManagedChannelImplBuilder.java:672)
at 
io.grpc@1.63.0/io.grpc.ForwardingChannelBuilder2.build(ForwardingChannelBuilder2.java:260)
at 
org.apache.arrow.flight.core/org.apache.arrow.flight.TestServerOptions.addHealthCheckService(TestServerOptions.java:191)
at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   ```
   
   but those seem to only cause warnings, not errors
   
   
   ### Component(s)
   
   FlightRPC, Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Flight] Flight benchmark doesn't work anymore [arrow]

2024-05-22 Thread via GitHub


pitrou opened a new issue, #41780:
URL: https://github.com/apache/arrow/issues/41780

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   On my local build:
   ```console
   $ /build/build-release/relwithdebinfo/arrow-flight-benchmark 
   Testing method: DoGet
   Using spawned TCP server
   Server running with pid 71195
   Server host: localhost
   Server port: 31337
   Failed with error: << IOError: Flight returned unavailable error, with 
message: failed to connect to all addresses; last error: UNKNOWN: 
ipv4:127.0.0.1:31337: Failed to connect to remote host: Connection refused. 
Detail: Unavailable
   ```
   
   ### Component(s)
   
   Benchmarking, C++, FlightRPC


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] ParquetDataset object fails with a .read() method due to hive partition schema columns. [arrow]

2024-05-22 Thread via GitHub


j0bekt01 opened a new issue, #41779:
URL: https://github.com/apache/arrow/issues/41779

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I'm trying to read Parquet files from S3 that have a Hive partition 
'/year=YYYY/month=MM/day=DD/hour=HH/' using the .read() method, but it fails, 
stating that one of the partition columns doesn't exist. However, if I exclude 
the partition columns and provide a list of columns that are actually present 
in the file, it reads without any issues. According to the documentation, the 
read() method should ignore Hive partition columns.
   
   ```python
   import datetime

   import polars as pl
   import pyarrow.parquet as pq
   import s3fs  # missing from the original snippet

   dt = datetime.datetime(2024, 5, 17)
   # `bucket` is defined elsewhere
   path = f"{bucket}/folder-to-files/year={dt.year}/month={dt.month:02d}/"
   dataset = pq.ParquetDataset(path, partitioning='hive', filesystem=s3fs.S3FileSystem())

   # This fails
   (
       pl.LazyFrame(dataset.read())
         .select(pl.all())
         .head(100)
         .collect()
   )

   # Remove the partition columns
   cols = dataset.schema.names
   [cols.remove(item) for item in ['year', 'month', 'day', 'hour'] if item in cols]

   # This works
   (
       pl.LazyFrame(dataset.read(columns=cols))
         .select(pl.all())
         .head(100)
         .collect()
   )
   ```
   windows 11
   python 3.10
   pyarrow 16.1.0
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet] Add file metadata read/write benchmark [arrow]

2024-05-22 Thread via GitHub


pitrou closed issue #41760: [C++][Parquet] Add file metadata read/write 
benchmark
URL: https://github.com/apache/arrow/issues/41760


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] pyarrow.fs.HadoopFileSystem Usage Problems [arrow]

2024-05-22 Thread via GitHub


deep826 opened a new issue, #41777:
URL: https://github.com/apache/arrow/issues/41777

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi, I use the pyarrow.fs.HadoopFileSystem client to interact with HDFS. I 
write some bytes to a file in HDFS, then download it to the local filesystem. 
When I read the file using Python's native `read`, the result is wrong, but 
when I use the pyarrow HDFS client to read the file in HDFS, the result is 
right. I'm confused. Here is a pseudocode snippet:

   ```python
   import sys

   a = 1000
   b = 64
   with hdfs_client.open_output_stream(path) as f:
       f.write(a.to_bytes(8, sys.byteorder))
       f.write(b.to_bytes(4, sys.byteorder))
   ```

   Here I write 12 bytes to the file at `path`, then I download it from HDFS 
to `local_path` and read these bytes as follows:

   ```python
   with open(local_path, 'rb') as f:
       bs = f.read(12)
   a = int.from_bytes(bs[0:8], sys.byteorder)
   b = int.from_bytes(bs[8:12], sys.byteorder)
   print(f"a: {a}, b: {b}")
   ```

   The printed result is a: 559903, b: 3158573824; the expected values are 
a: 1000, b: 64. So what's the problem?
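
   A minimal check, assuming `hdfs_client` is a `pyarrow.fs.HadoopFileSystem` 
(connection settings below are placeholders), to verify whether the bytes 
stored in HDFS are already wrong or only the downloaded copy is:

   ```python
   import sys
   from pyarrow import fs

   hdfs_client = fs.HadoopFileSystem("default")  # placeholder connection

   # Read the same 12 bytes back through pyarrow; if these print 1000 and 64,
   # the write was fine and the download step is the suspect.
   with hdfs_client.open_input_stream(path) as f:
       bs = f.read(12)
   print(int.from_bytes(bs[0:8], sys.byteorder))   # expect 1000
   print(int.from_bytes(bs[8:12], sys.byteorder))  # expect 64
   ```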
   
   ### Component(s)
   
   C, C++, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Wrong Length value in the example of ListView in Columnar specification document [arrow]

2024-05-22 Thread via GitHub


Jagdish-Motwani opened a new issue, #41774:
URL: https://github.com/apache/arrow/issues/41774

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   In the example "Layout: ``ListView`` Array with 5 elements", the length 
is specified as 4.
   Shouldn't it be 5?
   
   
   ### Snippet from the website
   
   
   We continue with the ListView type, but this instance illustrates out 
of order offsets and sharing of child array values. It is an array with length 
5 having logical values:
   
   [[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]
   It may have the following representation:
   
   * Length: 4, Null count: 1
   * Validity bitmap buffer:
   
 | Byte 0 (validity bitmap) | Bytes 1-63|
 |--|---|
 | 00011101 | 0 (padding)   |
   
   ...
   -
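
   To double-check the logical length, a small PyArrow sketch (assuming a 
recent PyArrow where `pa.list_view` and conversion from Python lists are 
available):

   ```python
   import pyarrow as pa

   arr = pa.array(
       [[12, -7, 25], None, [0, -127, 127, 50], [], [50, 12]],
       type=pa.list_view(pa.int8()),
   )
   print(len(arr))        # 5 logical elements
   print(arr.null_count)  # 1
   ```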
   
   ### Component(s)
   
   Website


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Python][Parquet] Documentation to parquet.write_table should be updated for new byte_stream_split encoding options [arrow]

2024-05-22 Thread via GitHub


jorisvandenbossche closed issue #41748: [Python][Parquet] Documentation to 
parquet.write_table should be updated for new byte_stream_split encoding options
URL: https://github.com/apache/arrow/issues/41748


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] Consuming or closing a RecordBatchReader created from a Dataset Scanner does not close underlying files [arrow]

2024-05-21 Thread via GitHub


adamreeve opened a new issue, #41771:
URL: https://github.com/apache/arrow/issues/41771

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Code to reproduce as a unit test that I added to 
`cpp/src/arrow/dataset/dataset_test.cc`, which logs the open files in the 
dataset directory (only works on Linux). This needs some extra headers:
   ```C++
   #include <filesystem>
   #include <unistd.h>
   #include "arrow/dataset/file_ipc.h"
   #include "arrow/ipc/api.h" 
   ```
   
   Test methods:
   ```C++
   void ListOpenFilesInDir(const std::string& directory, const std::string& 
context) {
 std::cout << "Open files in directory " << directory << " " << context << 
":" << std::endl;
 auto open_files = std::filesystem::directory_iterator("/proc/self/fd");
 for (const auto& entry : open_files)
 {
   char target_path[PATH_MAX];
   ssize_t len = ::readlink(entry.path().c_str(), target_path, PATH_MAX - 
1);
   if (len != -1) {
 target_path[len] = '\0';
 std::string open_file_path(target_path);
 if (open_file_path.find(directory) == 0)
 {
   std::cout << open_file_path << std::endl;
 }
   }
 }
   }
   
   TEST(TestDatasetScan, ScanToRecordBatchReader) {
 ASSERT_OK_AND_ASSIGN(auto tempdir, 
arrow::internal::TemporaryDir::Make("dataset-scan-test-"));
 std::string tempdir_path = tempdir->path().ToString();
   
 auto schema = arrow::schema({field("x", int64()), field("y", int64())});
 auto table = TableFromJSON(schema, {R"([
 [1, 2],
 [3, 4]
   ])"});
   
  auto format = std::make_shared<arrow::dataset::IpcFileFormat>();
  auto file_system = std::make_shared<arrow::fs::LocalFileSystem>();
 ASSERT_OK_AND_ASSIGN(auto file_path, tempdir->path().Join("data.arrow"));
 std::string file_path_str = file_path.ToString();
   
 {
   EXPECT_OK_AND_ASSIGN(auto out_stream, 
file_system->OpenOutputStream(file_path_str));
   ASSERT_OK_AND_ASSIGN(
   auto file_writer,
   MakeFileWriter(out_stream, schema, 
arrow::ipc::IpcWriteOptions::Defaults()));
   ASSERT_OK(file_writer->WriteTable(*table));
   ASSERT_OK(file_writer->Close());
 }
   
  std::vector<std::string> paths{file_path_str};
 FileSystemFactoryOptions options;
 ASSERT_OK_AND_ASSIGN(auto factory, 
arrow::dataset::FileSystemDatasetFactory::Make(file_system, paths, format, 
options));
 ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish());
   
 {
   ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
   ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
   {
 ASSERT_OK_AND_ASSIGN(auto record_batch_reader, 
scanner->ToRecordBatchReader());
 ASSERT_OK_AND_ASSIGN(auto read_table, record_batch_reader->ToTable());
 ListOpenFilesInDir(tempdir_path, "after read");
 ASSERT_OK(record_batch_reader->Close());
 ListOpenFilesInDir(tempdir_path, "after close");
   }
   ListOpenFilesInDir(tempdir_path, "after reader destruct");
 }
 ListOpenFilesInDir(tempdir_path, "after scanner destruct");
   }
   ```
   
   When I run this (on Fedora 39, using GCC 13)  I get output like:
   ```
   Open files in directory /tmp/dataset-scan-test-268jyz3s/ after read:
   /tmp/dataset-scan-test-268jyz3s/data.arrow
   Open files in directory /tmp/dataset-scan-test-268jyz3s/ after close:
   /tmp/dataset-scan-test-268jyz3s/data.arrow
   Open files in directory /tmp/dataset-scan-test-268jyz3s/ after reader 
destruct:
   Open files in directory /tmp/dataset-scan-test-268jyz3s/ after scanner 
destruct:
   ```
   
   This shows that neither consuming the `RecordBatchReader` by reading it into 
a table nor calling the `Close` method results in the IPC file being closed, 
it's only closed after the reader is destroyed. The `Close` implementation 
doesn't do anything other than consume all the data: 
https://github.com/apache/arrow/blob/37e5240e2430564b1c2dfa5d1e6a7a6b58576f83/cpp/src/arrow/dataset/scanner.cc#L113-L120
   
   For context, this causes errors when trying to remove the dataset directory 
on Windows when using the GLib bindings via Ruby, where there isn't a way to 
force destruction of the reader and we have to rely on GC (#41750).
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [CI][GLib] Suppress "`unlink': Permission denied" warnings in tests on Windows [arrow]

2024-05-21 Thread via GitHub


kou opened a new issue, #41770:
URL: https://github.com/apache/arrow/issues/41770

   ### Describe the enhancement requested
   
   
https://github.com/apache/arrow/actions/runs/9183539981/job/25254413025#step:12:83
   
   ```text
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-8gmqb9.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-jhtj66.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-cm213m.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-9f22cw.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-l8mur.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-h2rr21.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-8wxgv6.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-n5khu0.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-whjqi1.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-nigllm.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-5d2aoc.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-moorbx.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-vq6f58.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   test/run-test.rb: warning: Exception in finalizer 
#>
   C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`unlink': Permission denied @ apply2files - 
D:/a/_temp/data20240522-5072-giv9k1.parquet (Errno::EACCES)
from 
C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in 
`call'
   

[I] [Java] Rework how Java cookbooks are developed and built [arrow-cookbook]

2024-05-21 Thread via GitHub


amoeba opened a new issue, #351:
URL: https://github.com/apache/arrow-cookbook/issues/351

   In https://github.com/apache/arrow-cookbook/pull/350#issuecomment-2121850653 
it was pointed out that the way the Java cookbooks work could be improved quite 
a bit. We might consider two more recent approaches:
   
   - 
https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py
   - 
https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/javadoc_inventory.py


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [GLib] Separate version macros for each GLib library [arrow]

2024-05-21 Thread via GitHub


kou closed issue #41681: [GLib] Separate version macros for each GLib library
URL: https://github.com/apache/arrow/issues/41681


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Parquet] `parquet::arrow::FileWriter` does not propagate schema-level metadata when `ArrowWriterProperties::store_schema` is false [arrow]

2024-05-21 Thread via GitHub


TheNeuralBit opened a new issue, #41766:
URL: https://github.com/apache/arrow/issues/41766

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When `store_schema` is true the `FileWriter` first copies any existing 
metadata before storing the serialized schema:
   
https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/cpp/src/parquet/arrow/writer.cc#L537-L542
   
   But when `store_schema` is false, the `FileWriter` just returns an empty 
metadata, and custom metadata is not copied: 
https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/cpp/src/parquet/arrow/writer.cc#L531-L534
   
   Could someone confirm if this is intentional or not? It looks like an 
oversight to me and I have a patch ready to address it.
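
   A minimal reproduction sketch in Python (file names are arbitrary; 
`store_schema` is the option exposed through `pyarrow.parquet`):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"origin": "test"})

   pq.write_table(table, "with_schema.parquet", store_schema=True)
   pq.write_table(table, "without_schema.parquet", store_schema=False)

   # The custom key survives only when the serialized Arrow schema is stored.
   print(pq.read_schema("with_schema.parquet").metadata)     # has b'origin'
   print(pq.read_schema("without_schema.parquet").metadata)  # b'origin' gone
   ```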
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader [arrow]

2024-05-21 Thread via GitHub


paleolimbot opened a new issue, #41764:
URL: https://github.com/apache/arrow/issues/41764

   ### Describe the enhancement requested
   
   In https://github.com/apache/parquet-format/pull/240 there is concern 
regarding the ability to add a new logical type (in this case GEOMETRY) in a 
backwards compatible way such that readers that don't yet implement support for 
the new logical type can still read the file.
   
   @jorisvandenbossche found the place where the error would be thrown:
   
   
https://github.com/apache/arrow/blob/34f042762061f4e302e133c2d378ea444505049e/cpp/src/parquet/types.cc#L467
   
   I'm not sure what the best behaviour would be here: it would help drive 
support for new logical types to actually be written to files if it were 
possible to know that older readers won't choke on them. There was some 
indication that this would be a bug 
(https://github.com/apache/parquet-format/pull/240#issuecomment-2122972227); 
however, it is generally safer for a reader to error when it encounters a type 
that it doesn't understand. On the other hand, Arrow C++ silently drops 
unregistered extension types which, if I'm understanding the issue, is roughly 
the same.
   
   It seems like returning `NoLogicalType::Make();` would fall back to the 
physical type here; however, it also seems like that should be opt-in somehow 
and I don't see an obvious route to "type inference" options or similar at that 
particular place in the code.
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Import/Export ArrowDeviceArrayStream [arrow]

2024-05-21 Thread via GitHub


zeroshade closed issue #40078: [C++] Import/Export ArrowDeviceArrayStream
URL: https://github.com/apache/arrow/issues/40078


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [R][CI] CRAN-style openssl not being picked up [arrow]

2024-05-21 Thread via GitHub


assignUser closed issue #41426: [R][CI] CRAN-style openssl not being picked up
URL: https://github.com/apache/arrow/issues/41426


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Add a benchmark for grouper for preventing performance regression [arrow]

2024-05-21 Thread via GitHub


pitrou closed issue #41035: [C++] Add a benchmark for grouper for preventing 
performance regression
URL: https://github.com/apache/arrow/issues/41035


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] adbc_ingest() is dropping rows in Snowflake [arrow-adbc]

2024-05-21 Thread via GitHub


zeroshade closed issue #1847: adbc_ingest() is dropping rows in Snowflake
URL: https://github.com/apache/arrow-adbc/issues/1847


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Parquet] Add file metadata read/write benchmark [arrow]

2024-05-21 Thread via GitHub


pitrou opened a new issue, #41760:
URL: https://github.com/apache/arrow/issues/41760

   ### Describe the enhancement requested
   
   Following the discussions on the Parquet ML (see [this 
thread](https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo) and 
[this 
thread](https://lists.apache.org/thread/vs3w2z5bk6s3c975rrkqdttr1dpsdn7h)), we 
should add a benchmark to measure the overhead of Parquet file metadata parsing 
or serialization for different numbers of row groups and columns.
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Python] Segfault when reading a RecordBatchReader constructed from an Arrow Table [arrow]

2024-05-21 Thread via GitHub


Mytherin opened a new issue, #41758:
URL: https://github.com/apache/arrow/issues/41758

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   The following code snippet crashes for me when running PyArrow 16.1 in 
Python 3.12:
   
   ```py
   import pyarrow as pa
   
   print(pa.__version__)
   # 16.1.0
   
   tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
   t = pa.RecordBatchReader(tbl.to_batches())
   
   print(t.read_all())
   # zsh: segmentation fault  python3
   ```
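
   For what it's worth, a sketch of constructing the reader through the 
documented `from_batches` classmethod instead, which does not crash (this is a 
workaround, not a justification of the segfault):

   ```python
   import pyarrow as pa

   tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})

   # from_batches attaches the schema explicitly instead of invoking the
   # RecordBatchReader constructor directly.
   reader = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())
   print(reader.read_all())
   ```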
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Python] Expose bit_width and byte_width on Python Extension types with underlying fixed type [arrow]

2024-05-21 Thread via GitHub


jorisvandenbossche closed issue #41389: [Python] Expose bit_width and 
byte_width on Python Extension types with underlying fixed type
URL: https://github.com/apache/arrow/issues/41389


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++][Python] SEGFAULT when casting FixedSizeTensorArray to storage type then back to FixedSizeTensorArray [arrow]

2024-05-21 Thread via GitHub


judahrand opened a new issue, #41756:
URL: https://github.com/apache/arrow/issues/41756

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Minimum reproducible example:
   
   ```python
   import pyarrow
   
   tensor_type = pyarrow.fixed_shape_tensor(pyarrow.int32(), [4])
   storage_type = pyarrow.list_(pyarrow.int32(), 4)
   
   py_list = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
   storage_arr = pyarrow.array(py_list, storage_type)
   arr = pyarrow.ExtensionArray.from_storage(tensor_type, storage_arr)
   arr.cast(
   storage_type,
   ).cast(
   tensor_type,
   )
   ```
   
   ### Component(s)
   
   C++, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [C++] take into account orc's capabilities for finding tzdb [arrow]

2024-05-21 Thread via GitHub


h-vetinari opened a new issue, #41755:
URL: https://github.com/apache/arrow/issues/41755

   ### Describe the enhancement requested
   
   As one of the follow-ups to #36026, https://github.com/apache/orc/pull/1882 
got merged into orc 2.0.1, which will use conda(-forge)'s `tzdata` also on 
windows, even if the `TZDIR` environment variable is not being set (inserting 
that variable into all user environments would have been very intrusive).
   
   Based on this new functionality, I've successfully added orc-on-python 
support to arrow v13-v15, but some of the _other_ checks introduced in the 
context of #36026 now fail in 
https://github.com/conda-forge/pyarrow-feedstock/pull/122, because they haven't 
yet been taught to allow the case that orc>=2.0.1 now handles.
   
   ### Component(s)
   
   C++, Packaging, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Hadoop v 2.6.0 [arrow]

2024-05-21 Thread via GitHub


dwp0980 opened a new issue, #41753:
URL: https://github.com/apache/arrow/issues/41753

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hello,
   
   Is it folly to even attempt to connect pyarrow to Hadoop v2.6.0? At the 
moment, I'm pinned to Python 3.6.10 and therefore pyarrow 6.0.1.
   
   CoPilot is telling me that Hadoop 2.7.0 is the minimum supported version, 
but not specifically where it found that info, and so far my attempts to 
connect result in:
   
   `OSError: HDFS connection failed`
   
   So I just wanted to check whether I'm fighting a losing battle before I go 
much further with troubleshooting.
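
   For reference, a minimal sketch of the connection attempt (host and port 
are placeholders; libhdfs plus HADOOP_HOME/CLASSPATH must be set up):

   ```python
   import pyarrow.fs

   hdfs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
   print(hdfs.get_file_info(pyarrow.fs.FileSelector("/")))
   ```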
   
   Many thanks
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file [arrow]

2024-05-21 Thread via GitHub


mapleFU closed issue #41321: [C++][Parquet] Crash / heap-use-after-free in 
ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file
URL: https://github.com/apache/arrow/issues/41321


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [CI][Integration] Spark jobs are failing with problem on org.apache.arrow.flatbuf [arrow]

2024-05-21 Thread via GitHub


lidavidm closed issue #41571: [CI][Integration] Spark jobs are failing with 
problem on org.apache.arrow.flatbuf
URL: https://github.com/apache/arrow/issues/41571


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Packaging] soversion bumps on minor releases [arrow]

2024-05-21 Thread via GitHub


jorisvandenbossche closed issue #41659: [Packaging] soversion bumps on minor 
releases
URL: https://github.com/apache/arrow/issues/41659


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Python] 'pyarrow._parquet.SortingColumn' object has no attribute 'to_dict' [arrow]

2024-05-21 Thread via GitHub


AlenkaF closed issue #41699: [Python] 'pyarrow._parquet.SortingColumn' object 
has no attribute 'to_dict'
URL: https://github.com/apache/arrow/issues/41699


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner [arrow]

2024-05-21 Thread via GitHub


adamreeve opened a new issue, #41749:
URL: https://github.com/apache/arrow/issues/41749

   ### Describe the enhancement requested
   
   In order to allow efficient processing of large datasets, it should be 
possible to read a dataset or a scanner using a RecordBatchReader rather than 
using the `to_table` method.
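
   For reference, PyArrow already exposes the equivalent, which streams 
batches instead of materializing a table; a minimal sketch (the path is a 
placeholder):

   ```python
   import pyarrow.dataset as ds

   dataset = ds.dataset("/path/to/dataset", format="ipc")
   reader = dataset.scanner().to_reader()  # a pyarrow.RecordBatchReader
   for batch in reader:
       ...  # process one pyarrow.RecordBatch at a time
   ```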
   
   ### Component(s)
   
   GLib


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++] Corner case of temp vector stack overflow check [arrow]

2024-05-20 Thread via GitHub


felipecrv closed issue #41738: [C++] Corner case of temp vector stack overflow 
check
URL: https://github.com/apache/arrow/issues/41738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Python][Parquet] Documentation to parquet.write_table should be updated for new byte_stream_split encoding options [arrow]

2024-05-20 Thread via GitHub


etseidl opened a new issue, #41748:
URL: https://github.com/apache/arrow/issues/41748

   ### Describe the enhancement requested
   
   The docstring for `parquet.write_table` still says BYTE_STREAM_SPLIT 
encoding is valid only for floating-point data. This should be updated now 
that other fixed-length types are supported.
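
   For instance, a sketch exercising the wider support (column and file names 
are placeholders; dictionary encoding must be disabled for the requested 
encoding to apply):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"ints": pa.array([1, 2, 3], pa.int32())})

   # BYTE_STREAM_SPLIT now also applies to fixed-length types such as INT32.
   pq.write_table(table, "example.parquet", use_dictionary=False,
                  column_encoding={"ints": "BYTE_STREAM_SPLIT"})
   ```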
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [Java] arrow-vector 16.1.0 has a change that breaks Java 8 support [arrow]

2024-05-20 Thread via GitHub


lidavidm closed issue #41717: [Java] arrow-vector 16.1.0 has a change that 
breaks Java 8 support
URL: https://github.com/apache/arrow/issues/41717


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [CI] GitHub bot cannot run Java CIs [arrow]

2024-05-20 Thread via GitHub


kou closed issue #41735: [CI] GitHub bot cannot run Java CIs 
URL: https://github.com/apache/arrow/issues/41735


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Java][FlightSQL] Arrow Flight Driver returns -1 for getUpdateCount() [arrow]

2024-05-20 Thread via GitHub


rcprcp opened a new issue, #41747:
URL: https://github.com/apache/arrow/issues/41747

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Tested on Arrow Flight JDBC 15, 15.02 and with a locally-built Arrow Flight 
SQL JDBC Driver 17.0.0-SNAPSHOT. 
   
   When using a JDBC statement to execute an UPDATE, the getUpdateCount() 
method seems to always return -1. This seems to be incorrect: getUpdateCount() 
should return the number of updated rows.
   
   For a data source, we're using the [Voltron docker 
container](https://hub.docker.com/r/voltrondata/flight-sql).
   
   In this debugger image, you can see the return counts from 
getUpdateCount() and from the actual result set (the update returns one row, 
with one column, that indicates the number of updated rows):
   
   https://github.com/apache/arrow/assets/17998205/5189462f-e608-42fb-8996-61c871da6360
   
   Here is a screen grab of the output data: 
   https://github.com/apache/arrow/assets/17998205/61f24a80-ee23-4e98-9bed-a281c233237d
   
   And, in this screengrab, I used the Postgres driver to update a table in 
the postgres:latest Docker image: 
   https://github.com/apache/arrow/assets/17998205/bb7dcbfe-370a-4b5d-a1d3-6414e4b3b9ef
   
   The data in the Postgres table is different, but the getUpdateCount() method 
returned 1, which is correct for that data. 
   The test program I used is checked into this GitHub repo: 
[ZD128958](https://github.com/rcprcp/ZD128958)
   
   Thank you!
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



  1   2   3   4   5   6   7   8   9   10   >