Re: [I] [C++] Consuming or closing a RecordBatchReader created from a Dataset Scanner does not close underlying files [arrow]
bkietz closed issue #41771: [C++] Consuming or closing a RecordBatchReader created from a Dataset Scanner does not close underlying files URL: https://github.com/apache/arrow/issues/41771
Re: [I] [MATLAB] Add C Data Interface format import/export functionality for `arrow.tabular.RecordBatch` [arrow]
sgilmore10 closed issue #41803: [MATLAB] Add C Data Interface format import/export functionality for `arrow.tabular.RecordBatch` URL: https://github.com/apache/arrow/issues/41803
Re: [I] [Java] Implement a function to load field buffers from external buffers for StringView [arrow]
vibhatha closed issue #40931: [Java] Implement a function to load field buffers from external buffers for StringView URL: https://github.com/apache/arrow/issues/40931
Re: [I] [Java] Implement a strategy to return variable width buffer count for StringView in TypeLayout [arrow]
vibhatha closed issue #40935: [Java] Implement a strategy to return variable width buffer count for StringView in TypeLayout URL: https://github.com/apache/arrow/issues/40935
Re: [I] [Java] TypeLayout enhancement to support StringView [arrow]
vibhatha closed issue #40934: [Java] TypeLayout enhancement to support StringView URL: https://github.com/apache/arrow/issues/40934
[I] Support LZ4_RAW for parquet writing [arrow]
douglas-raillard-arm opened a new issue, #41863: URL: https://github.com/apache/arrow/issues/41863

### Describe the enhancement requested

`pyarrow.dataset.write_dataset(compression='lz4_raw')` currently fails with:

```
Traceback (most recent call last):
  File "/work/projects/lisa/testpyarrow.py", line 3, in <module>
    _reencode_parquet('sched_switch.lz4.parquet', 'updated.parquet', compression='lz4_raw')#, row_group_size=128*1024*1024, compression='LZ4')
  File "x.py", line 1, in my_write_parquet
    options = pyarrow.dataset.ParquetFileFormat().make_write_options(
  File "pyarrow/_dataset_parquet.pyx", line 206, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
  File "pyarrow/_dataset_parquet.pyx", line 594, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
  File "pyarrow/_dataset_parquet.pyx", line 599, in pyarrow._dataset_parquet.ParquetFileWriteOptions._set_properties
  File "pyarrow/_parquet.pyx", line 1855, in pyarrow._parquet._create_writer_properties
  File "pyarrow/_parquet.pyx", line 1369, in pyarrow._parquet.check_compression_name
pyarrow.lib.ArrowException: Unsupported compression: lz4_raw
```

Indeed, `lz4_raw` is not mentioned anywhere in `python/pyarrow/_parquet.pyx`. Would it be possible to add support for the LZ4_RAW codec when writing Parquet files, particularly through the dataset API?

### Component(s)

Python
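The error is raised while building the writer options, before any data is written. A minimal sketch reproducing it (the `"lz4"` comparison is an assumption that the plain LZ4 name remains accepted):

```python
import pyarrow.dataset as ds

fmt = ds.ParquetFileFormat()
try:
    # Raises pyarrow.lib.ArrowException: Unsupported compression: lz4_raw
    fmt.make_write_options(compression="lz4_raw")
except Exception as exc:
    print(exc)

# The plain "lz4" codec name is accepted today
opts = fmt.make_write_options(compression="lz4")
```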
[I] Thread deadlock in ObjectOutputStream [arrow]
icexelloss opened a new issue, #41862: URL: https://github.com/apache/arrow/issues/41862

### Describe the bug, including details regarding any error messages, version, and platform.

I am seeing a deadlock when destructing an ObjectOutputStream; I have attached the stack trace. I did some debugging and found that the mutex in question is already held by this thread (I checked the `__owner` field of the `pthread_mutex_t`, which points to the hanging thread). Unfortunately the stack trace doesn't show exactly which mutex it is trying to lock. I wonder if someone more familiar with the IO code has ideas about what the issue might be and where to dig deeper?

[arrow_object_output_stream_stacktrace.txt](https://github.com/apache/arrow/files/15469090/arrow_object_output_stream_stacktrace.txt)

### Component(s)

C++
[I] How to concatenate multiple tables in one parquet? [arrow]
zliucd opened a new issue, #41858: URL: https://github.com/apache/arrow/issues/41858

### Describe the usage question you have. Please include as many useful details as possible.

Hi,

Is it possible to write multiple tables into a single Parquet file by appending the rows of each individual table? All tables read from the Parquet files have the same columns. This functionality is similar to Python's ```dataframe.concat([df1, df2])```. For example:

```
table1
Name  Age
Jim   36
Bill  30

table2
Name  Age
Sam   28
Joe   30
```

The concatenated table and Parquet file should be:

```
Name  Age
Jim   36
Bill  30
Sam   28
Joe   30
```

We can concatenate tables using ```auto con_tables = arrow::ConcatenateTables(...)```, but it's not clear how to write ```con_tables``` using ```parquet::arrow::WriteTable()```; the first parameter of WriteTable() is a single ```arrow::Table```.

This post shows how to merge tables by appending columns, but my context is appending rows: https://stackoverflow.com/questions/71183352/merging-tables-in-apache-arrow

Thanks.

### Component(s)

C++
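For reference, `arrow::ConcatenateTables` returns a single `arrow::Table` (wrapped in an `arrow::Result`), which `parquet::arrow::WriteTable` can consume directly. A minimal sketch, assuming all input tables share the same schema (the function name and chunk size are illustrative):

```cpp
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Concatenate row-wise, then write the combined table to one Parquet file.
arrow::Status ConcatAndWrite(const std::vector<std::shared_ptr<arrow::Table>>& tables,
                             const std::string& path) {
  // ConcatenateTables appends rows; all tables must share one schema.
  ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));
  // Optional: merge the per-table chunks into contiguous columns.
  ARROW_ASSIGN_OR_RAISE(combined, combined->CombineChunks());
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*combined, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/65536);
}
```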
Re: [I] [Packaging][RPM] Mismatch between package version and library version in naming [arrow]
kou closed issue #41784: [Packaging][RPM] Mismatch between package version and library version in naming URL: https://github.com/apache/arrow/issues/41784
[I] Error repeating df.to_parquet in pytest: "pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined" [arrow]
bjfar opened a new issue, #41857: URL: https://github.com/apache/arrow/issues/41857

### Describe the bug, including details regarding any error messages, version, and platform.

Python version: 3.10.14
pyarrow version: 16.1.0
pandas version: 2.2.2
pytest version: 8.2.1

I have some apparently niche circumstances that trigger the following error:

```
/home/benf/repos/tetra/python/tests/test_minimal.py:24:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/util/_decorators.py:333: in wrapper
    return func(*args, **kwargs)
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/frame.py:3113: in to_parquet
    return to_parquet(
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:476: in to_parquet
    impl = get_engine(engine)
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:63: in get_engine
    return engine_class()
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:169: in __init__
    import pandas.core.arrays.arrow.extension_types  # pyright: ignore[reportUnusedImport] # noqa: F401
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/arrays/arrow/extension_types.py:59: in <module>
    pyarrow.register_extension_type(_period_type)
pyarrow/types.pxi:1954: in pyarrow.lib.register_extension_type
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined

pyarrow/error.pxi:91: ArrowKeyError
==== short test summary info ====
FAILED python/tests/test_minimal.py::test_pyarrow_issue_2 - pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined
```

It seems to have something to do with how pytest orchestrates its tests. Here is my minimal example, test_minimal.py:

```
import pytest
import pandas as pd

pytest_plugins = ["pytester"]

def test_pyarrow_issue(testdir, tmp_path):
    path = str(tmp_path / "test.tar")
    df = pd.DataFrame()
    df.to_parquet(path)

def test_pyarrow_issue_2(testdir, tmp_path):
    path = str(tmp_path / "test_2.tar")
    df = pd.DataFrame()
    df.to_parquet(path)
```

Running `pytest test_minimal.py` then triggers the error. Notably, the error does *not* occur if either test is run independently, and it does not occur if the `testdir` fixture is removed or replaced with some other fixture. So I guess it has something to do with whatever `testdir` is doing under the hood, presumably to do with how pandas/pyarrow get imported.

In my real case I would really quite like to keep using the `testdir` fixture, though I can probably find a different way to do things. Nonetheless this behaviour seemed worth reporting. I am not sure whether it is a pyarrow issue, more of a pytest issue, or maybe even a pandas one.

### Component(s)

Parquet, Python
Re: [I] arrow flight sql jdbc driver with Lz4Compression [arrow]
kou closed issue #41456: arrow flight sql jdbc driver with Lz4Compression URL: https://github.com/apache/arrow/issues/41456
[I] [CI][Packaging] Fix conda arrow-nightlies channel [arrow]
amoeba opened a new issue, #41856: URL: https://github.com/apache/arrow/issues/41856

### Describe the bug, including details regarding any error messages, version, and platform.

The Conda [arrow-nightlies channel is empty](https://anaconda.org/arrow-nightlies/repo/files?label=main=conda), which means you can't install Arrow C++ or PyArrow nightlies from it at the moment. I noticed this in CI on https://github.com/apache/arrow-cookbook/pull/352.

It's my understanding that the jobs that upload artifacts to this channel are running but failing; see the failing builds at http://crossbow.voltrondata.com/. From a quick look, the failures may just be due to Azure deprecations, based on this error I see in a few Azure Pipelines logs:

> The CondaEnvironment@1 (Conda environment) task has been deprecated since February 13, 2019 and will soon be retired. Use the Conda CLI ('conda') directly from a bash/pwsh/script task. Please visit https://aka.ms/azdo-deprecated-tasks to learn more about deprecated tasks.

### Component(s)

Continuous Integration, Packaging
[I] [R][CI]: Remove more defunct rhub containers [arrow]
jonkeane opened a new issue, #41841: URL: https://github.com/apache/arrow/issues/41841

### Describe the enhancement requested

While debugging a CRAN submission, I found another location where we are using the stale rhub containers.

### Component(s)

Continuous Integration, R
[I] [Format][FlightRPC] Flight SQL evolution [arrow]
lidavidm opened a new issue, #41840: URL: https://github.com/apache/arrow/issues/41840

### Describe the enhancement requested

From https://github.com/apache/arrow-rs/issues/5731#issuecomment-2133104504

Originally Flight RPC was implemented as a framework wrapping gRPC. This was especially expedient for the C++ implementation. By now it's mostly a weight dragging down Flight users, especially Flight SQL. If we have the chance to evolve Flight SQL and/or Flight RPC, some changes may include:

- Use a proper gRPC service definition, instead of opaque bytes payloads

### Component(s)

FlightRPC, Format
Re: [I] [C++] take into account orc's capabilities for finding tzdb [arrow]
kou closed issue #41755: [C++] take into account orc's capabilities for finding tzdb URL: https://github.com/apache/arrow/issues/41755
[I] Add support for FileIO [arrow-julia]
Beforerr opened a new issue, #507: URL: https://github.com/apache/arrow-julia/issues/507

It is registered in FileIO; however, neither `load` nor `fileio_load` is defined.
[I] Issue using open_dataset() in R 4.4.0 [arrow]
SHEvElynP opened a new issue, #41835: URL: https://github.com/apache/arrow/issues/41835

### Describe the usage question you have. Please include as many useful details as possible.

Hello,

My workplace has recently moved from R 4.3.2 to R 4.4.0. I used to be able to do `open_dataset(dir_name, format = "arrow", partitioning = hive_partition())`, but now I get an error saying "This build of the arrow package does not support Datasets". I attempted the workaround in https://github.com/apache/arrow/issues/40667#issuecomment-2007942987, but it broke my .proj file and RStudio would not open it, so I had to create a new one. Does anyone know any other workaround? I am fairly new to anything resembling coding.

Thank you!

### Component(s)

R
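As a first diagnostic, a hedged sketch based on the arrow R package's documented installation options (`NOT_CRAN=true` opts into a fully featured source build):

```r
# Check which optional features this build of arrow was compiled with;
# "dataset" should be TRUE under capabilities
arrow::arrow_info()

# If dataset support is missing, reinstalling a full-featured build may help
Sys.setenv(NOT_CRAN = "true")
install.packages("arrow")
```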
[I] Fields within a null struct are not initialized with null values [arrow]
timsaucer opened a new issue, #41833: URL: https://github.com/apache/arrow/issues/41833

### Describe the bug, including details regarding any error messages, version, and platform.

When creating an array from a Python dict, field entries of a null struct are initialized with default values rather than null, even if their field is nullable. In the minimal example below, you would expect the values of `inner_1` and `inner_2` in the 3rd row to be null.

```
import pyarrow as pa

print(pa.array([
    {"outer": {"inner_1": 1, "inner_2": 2}},
    {"outer": {"inner_1": 3, "inner_2": None}},
    {"outer": None},
]))
```

This generates the following output:

```
-- is_valid: all not null
-- child 0 type: struct<inner_1: int64, inner_2: int64>
  -- is_valid: [ true, true, false ]
  -- child 0 type: int64
    [ 1, 3, 0 ]
  -- child 1 type: int64
    [ 2, null, 0 ]
```

### Component(s)

Python
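A possibly useful contrast when consuming such data (to the best of my understanding of the compute kernel, worth verifying): `pyarrow.compute.struct_field` merges the parent struct's validity into the extracted child, whereas `Array.field` returns the raw child storage shown above:

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([
    {"outer": {"inner_1": 1, "inner_2": 2}},
    {"outer": None},
])

# Raw child storage: the masked-off slot holds a default value
print(arr.field("outer").field("inner_1"))         # values: [1, 0]
# struct_field folds the parent's validity into the result
print(pc.struct_field(arr, ["outer", "inner_1"]))  # values: [1, null]
```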
Re: [I] [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner [arrow]
kou closed issue #41749: [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner URL: https://github.com/apache/arrow/issues/41749
[I] [R] Update relative URLs in README to absolute paths to prevent CRAN check failures [arrow]
thisisnic opened a new issue, #41829: URL: https://github.com/apache/arrow/issues/41829

### Describe the bug, including details regarding any error messages, version, and platform.

In #40148, we updated the README, but some URLs in there pointed to relative links; we should update them to point to the absolute paths so we don't fail CRAN checks.

### Component(s)

R
Re: [I] [R] Update NEWS.md for 16.0.0 [arrow]
thisisnic closed issue #41420: [R] Update NEWS.md for 16.0.0 URL: https://github.com/apache/arrow/issues/41420
[I] [C++][Parquet][Benchmark] Adding benchmarking for reading Statistics [arrow]
mapleFU opened a new issue, #41826: URL: https://github.com/apache/arrow/issues/41826

### Describe the enhancement requested

This PR (https://github.com/apache/arrow/pull/41761) lays the groundwork for benchmarking metadata. We'd like to add more benchmarks of Statistics encoding/decoding.

The Parquet standard supports statistics (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244). In C++ Parquet, the statistics are decoded from thrift and converted to an `EncodedStatistics` or `Statistics` (see https://github.com/apache/arrow/blob/7c8ce4589ae9e3c4a9c0cd54cff81a54ac003079/cpp/src/parquet/statistics.h).

We'd like to add benchmarks for reading/writing Statistics, especially for BYTE_ARRAY, which can have long `min_value` and `max_value` values.

### Component(s)

C++
Re: [I] [CI][GLib] Suppress "`unlink': Permission denied" warnings in tests on Windows [arrow]
kou closed issue #41770: [CI][GLib] Suppress "`unlink': Permission denied" warnings in tests on Windows URL: https://github.com/apache/arrow/issues/41770
Re: [I] python/adbc_driver_postgresql ingest NOT_IMPLEMENTED when running adbc_ingest with json array [arrow-adbc]
lidavidm closed issue #1868: python/adbc_driver_postgresql ingest NOT_IMPLEMENTED when running adbc_ingest with json array URL: https://github.com/apache/arrow-adbc/issues/1868
Re: [I] [Java] Enhance the `copyFrom*` functionality in StringView [arrow]
lidavidm closed issue #40933: [Java] Enhance the `copyFrom*` functionality in StringView URL: https://github.com/apache/arrow/issues/40933
[I] [C++][Parquet] Unify dictionary encoding normalization handling [arrow]
mapleFU opened a new issue, #41818: URL: https://github.com/apache/arrow/issues/41818

### Describe the enhancement requested

This is mentioned here: https://github.com/apache/arrow/pull/40957#discussion_r1562703901

There are a few points:

1. https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L444-L445: the encoding is not passed in the Encoder.
2. Confusingly, it is RLE in the decoder: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L1607. It is detected and normalized elsewhere, like:
3. https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L876

We had better unify these.

### Component(s)

C++, Parquet
[I] Create Meson WrapDB Entry for Arrow [arrow]
WillAyd opened a new issue, #41816: URL: https://github.com/apache/arrow/issues/41816

### Describe the enhancement requested

Meson has a rather nice collection of projects in its WrapDB, which makes it easy to add dependencies to your project: https://mesonbuild.com/Wrapdb-projects.html

I do not believe this would require Arrow to implement the Meson build system; we would just have to provide Meson patch files as part of the WrapDB: https://mesonbuild.com/Adding-new-projects-to-wrapdb.html

This is also something I've explored for nanoarrow, with the only difference being that nanoarrow has Meson build files in the source tree.

Would this be something the Arrow team would be interested in? And if so, are there any thoughts on the dependencies we would like to provide? I was thinking something along the lines of the following (a hypothetical wrap entry is sketched below):

- arrow_core
- arrow_parquet
- arrow_flight
- arrow_gandiva
- arrow_acero
- arrow_dataset
- arrow_substrait

to match how @raulcd created the new conda packages for pyarrow: https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255#issuecomment-1920988437

### Component(s)

C++
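For illustration only, a hypothetical `arrow.wrap` following the WrapDB file format (version, URL, hash, and patch directory are placeholders; the dependency names are just the proposal above):

```ini
[wrap-file]
directory = apache-arrow-17.0.0
source_url = https://github.com/apache/arrow/releases/download/apache-arrow-17.0.0/apache-arrow-17.0.0.tar.gz
source_filename = apache-arrow-17.0.0.tar.gz
source_hash = <sha256 of the tarball>
patch_directory = arrow

[provide]
dependency_names = arrow_core, arrow_parquet, arrow_acero, arrow_dataset
```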
[I] `pyarrow.write_feather` can't be used in `atexit` contexts to write a `pandas.DataFrame` [arrow]
pjh40 opened a new issue, #41815: URL: https://github.com/apache/arrow/issues/41815

### Describe the bug, including details regarding any error messages, version, and platform.

When `pyarrow.write_feather()` is given a `pandas.DataFrame`, `write_feather()` unconditionally calls `Table.from_pandas()` with the default `nthreads=None` argument. This is then passed to `pandas_compat.dataframe_to_arrays()`, allowing it to heuristically use a `concurrent.futures.ThreadPoolExecutor` to convert columns. This causes a runtime error when `write_feather` is used in an `atexit` (or `weakref.finalize`) context on exit of the interpreter:

```
RuntimeError: cannot schedule new futures after interpreter shutdown
```

This scenario could be avoided by adding a `use_threads` parameter to `write_feather` that can be used to force serial operation.

### Component(s)

Python
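Until such a parameter exists, one possible workaround (a sketch; `flush_on_exit` and the file name are illustrative) is to perform the pandas conversion serially before calling `write_feather`:

```python
import atexit
import pandas as pd
import pyarrow as pa
from pyarrow import feather

def flush_on_exit(df: pd.DataFrame, path: str) -> None:
    # Convert serially so no ThreadPoolExecutor is created at shutdown
    table = pa.Table.from_pandas(df, nthreads=1)
    feather.write_feather(table, path)

atexit.register(flush_on_exit, pd.DataFrame({"a": [1.0, 2.0]}), "out.feather")
```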
Re: [I] [C++] Clean up Assorted Warnings to get a clean nanoarrow build [arrow]
bkietz closed issue #41478: [C++] Clean up Assorted Warnings to get a clean nanoarrow build URL: https://github.com/apache/arrow/issues/41478
[I] Segfault when collecting parquet dataset query results [arrow]
mrd0ll4r opened a new issue, #41813: URL: https://github.com/apache/arrow/issues/41813

### Describe the bug, including details regarding any error messages, version, and platform.

Hello! I've been using arrow with R for a while now to great success. Recently, I re-opened an old project (managed with renv, so I'm pretty confident all the package versions were the same). It is possible I upgraded the OS and/or OS packages in the meantime. Now, some of my queries on a gzip-compressed dataset of parquet files lead to a segfault:

```
*** caught segfault ***
address 0x7f54ce520898, cause 'memory not mapped'

Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(x)
12: collect.arrow_dplyr_query(.)
13: collect(.)
14: d_redacted %>% group_by(year, month, cid) %>% summarize(n = n()) %>% collect()
```

I have a core dump from that session, but it's 46GB. I'm not a professional at analyzing these things, but this is what I got:

```
Core was generated by `/usr/lib/R/bin/exec/R'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
[Current thread is 1 (Thread 0x7f6093fff640 (LWP 2273813))]
(gdb) bt
#0  0x7f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#1  0x7f612d4d7093 in void arrow::compute::KeyCompare::CompareBinaryColumnToRow(unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#2  0x7f612d4d6278 in arrow::compute::KeyCompare::CompareColumnsToRows(unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, unsigned int*, unsigned short*, std::vector > const&, arrow::compute::RowTableImpl const&, bool, unsigned char*) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#3  0x7f612d4d896e in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#4  0x7f612d3a98e6 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#5  0x7f612d3ab154 in arrow::compute::SwissTable::find(int, unsigned int const*, unsigned char*, unsigned char const*, unsigned int*, arrow::util::TempVectorStack*, std::function const&, void*) const ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#6  0x7f612d4df2d0 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#7  0x7f612d4dfb73 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#8  0x7f612cf8da83 in arrow::acero::aggregate::GroupByNode::Merge() () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#9  0x7f612cf8f8a3 in arrow::acero::aggregate::GroupByNode::OutputResult(bool) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#10 0x7f612cf941f6 in arrow::acero::aggregate::GroupByNode::InputReceived(arrow::acero::ExecNode*, arrow::compute::ExecBatch) ()
```
[I] Table.from_pandas can't import nan values into a non-null float column [arrow]
lord opened a new issue, #41812: URL: https://github.com/apache/arrow/issues/41812

### Describe the bug, including details regarding any error messages, version, and platform.

This small example fails with `ValueError: Field pyarrow.Field<a: double not null> was non-nullable but pandas column had 1 null values` on 16.1.0.

```
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1.0, float("nan")]})
schema = pa.schema([pa.field('a', pa.float64(), nullable=False)])
pa.Table.from_pandas(df, schema=schema)
```

I guess this seems like a bug to me, but I'm no pandas expert. It does feel like this makes round-tripping a non-null float column through pandas impossible?

### Component(s)

Python
Re: [I] [C++] Importing an extension type without `ARROW:extension:metadata` crashes [arrow]
paleolimbot closed issue #41741: [C++] Importing an extension type without `ARROW:extension:metadata` crashes URL: https://github.com/apache/arrow/issues/41741
[I] [C++] Add Compute Kernel for Casting from union to string [arrow]
llama90 opened a new issue, #41810: URL: https://github.com/apache/arrow/issues/41810

### Describe the enhancement requested

This is a sub-issue of the issue mentioned below.

- #35560

This issue aims to address #39182. A pull request (https://github.com/apache/arrow/pull/40237) has been submitted to resolve the issue, and additional features that need to be supported have emerged.

| From | To | Using Function |
|---|---|---|
| sparse_union | utf8 | cast_string |
| dense_union | utf8 | cast_string |

### Component(s)

C++
[I] [C++] Add Compute Kernel for Casting from map to string [arrow]
llama90 opened a new issue, #41809: URL: https://github.com/apache/arrow/issues/41809

### Describe the enhancement requested

This is a sub-issue of the issue mentioned below.

- #35560

This issue aims to address #39182. A pull request (https://github.com/apache/arrow/pull/40237) has been submitted to resolve the issue, and additional features that need to be supported have emerged.

| From | To | Using Function |
|---|---|---|
| map | utf8 | cast_string |

### Component(s)

C++
[I] [Java] JNI mvn generate-resources fails because arrow-bom is not generated [arrow]
jinchengchenghh opened a new issue, #41808: URL: https://github.com/apache/arrow/issues/41808

### Describe the bug, including details regarding any error messages, version, and platform.

arrow_ep/src/arrow_ep/java# mvn generate-resources -P generate-libs-cdata-all-os -Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR -Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -N
[INFO] Scanning for projects...
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/arrow/arrow-bom/15.0.0-gluten-3/arrow-bom-15.0.0-gluten-3.pom
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[ERROR] Non-resolvable import POM: The following artifacts could not be resolved: org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 (absent): Could not find artifact org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 in central (https://repo.maven.apache.org/maven2) @ line 601, column 20 @
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR] The project org.apache.arrow:arrow-java-root:15.0.0-gluten-3 (/mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/_build/release/third_party/arrow_ep/src/arrow_ep/java/pom.xml) has 1 error
[ERROR] Non-resolvable import POM: The following artifacts could not be resolved: org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 (absent): Could not find artifact org.apache.arrow:arrow-bom:pom:15.0.0-gluten-3 in central (https://repo.maven.apache.org/maven2) @ line 601, column 20 -> [Help 2]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

### Component(s)

Java
Re: [I] [Java] Adding `variadicBufferCounts` to `RecordBatch` [arrow]
lidavidm closed issue #41730: [Java] Adding `variadicBufferCounts` to `RecordBatch` URL: https://github.com/apache/arrow/issues/41730
Re: [I] [Java] Nullability of struct child vectors not preserved in TransferPair [arrow]
lidavidm closed issue #41686: [Java] Nullability of struct child vectors not preserved in TransferPair URL: https://github.com/apache/arrow/issues/41686
Re: [I] [Java] Transition from gradle-enterprise-maven-extension to develocity-maven-extension [arrow]
lidavidm closed issue #41799: [Java] Transition from gradle-enterprise-maven-extension to develocity-maven-extension URL: https://github.com/apache/arrow/issues/41799
Re: [I] [Java] Implement a function to retrieve reference buffers in StringView [arrow]
lidavidm closed issue #40930: [Java] Implement a function to retrieve reference buffers in StringView URL: https://github.com/apache/arrow/issues/40930
[I] [GLib][CI] Use vcpkg for C++ dependencies when building GLib libraries with MSVC [arrow]
adamreeve opened a new issue, #41806: URL: https://github.com/apache/arrow/issues/41806

### Describe the enhancement requested

This is a follow-up to #41134 and should hopefully allow building more of the GLib libraries with MSVC.

Context: https://github.com/apache/arrow/pull/41599#discussion_r1596163069 and https://github.com/apache/arrow/pull/41599#issuecomment-2126145170

### Component(s)

Continuous Integration, GLib
Re: [I] [Java] Use immutables value-annotations instead of value artifact [arrow]
lidavidm closed issue #41789: [Java] Use immutables value-annotations instead of value artifact URL: https://github.com/apache/arrow/issues/41789
Re: [I] [GLib] Add support for MSVC with vcpkg [arrow]
kou closed issue #41134: [GLib] Add support for MSVC with vcpkg URL: https://github.com/apache/arrow/issues/41134
Re: [I] [C++] Thirdparty: Bump xsimd to 13.0.0 [arrow]
kou closed issue #41547: [C++] Thirdparty: Bump xsimd to 13.0.0 URL: https://github.com/apache/arrow/issues/41547
Re: [I] [C++][Flight] Flight benchmark doesn't work anymore [arrow]
kou closed issue #41780: [C++][Flight] Flight benchmark doesn't work anymore URL: https://github.com/apache/arrow/issues/41780
[I] [Swift] Add Struct (Nested) types [arrow]
abandy opened a new issue, #41804: URL: https://github.com/apache/arrow/issues/41804

### Describe the enhancement requested

Struct (nested) types are currently not implemented in Swift. Adding nested types is required to implement other Arrow features.

### Component(s)

Swift
[I] [MATLAB] Add C Data Interface format import/export functionality for `arrow.tabular.RecordBatch` [arrow]
sgilmore10 opened a new issue, #41803: URL: https://github.com/apache/arrow/issues/41803

### Describe the enhancement requested

Now that #41656 has been closed, we should add MATLAB APIs for importing/exporting `arrow.tabular.RecordBatch`es using the C Data Interface format. The C Data Interface format import/export workflows would look like this:

### Import into MATLAB

```matlab
cArray = arrow.c.Array;
cSchema = arrow.c.Schema;
...
% Pass cArray and cSchema to export APIs of another Arrow language binding to fill in C struct details
...
% Import Arrow RecordBatch from pre-populated C Data Interface format C structs
rb = arrow.tabular.RecordBatch.importFromC(cArray, cSchema);
```

### Export from MATLAB

```matlab
...
% Create C Data Interface format ArrowArray and ArrowSchema C structs using APIs of another Arrow language binding
...
rb = arrow.recordBatch(table((1:10)'));
% Export Arrow RecordBatch from MATLAB to C Data Interface format and fill in C struct details
rb.exportToC(cArrayAddress, cSchemaAddress)
...
% Import Arrow RecordBatch from pre-populated C Data Interface format C structs using APIs of another Arrow language binding
```

We can implement this functionality using the C Data Interface format C++ APIs defined in https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.h.

### Component(s)

MATLAB
Re: [I] [Java] Java Cookbook fails on 16.0.0-SNAPSHOT [arrow-cookbook]
amoeba closed issue #347: [Java] Java Cookbook fails on 16.0.0-SNAPSHOT URL: https://github.com/apache/arrow-cookbook/issues/347
Re: [I] [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst [arrow]
pitrou closed issue #41186: [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst URL: https://github.com/apache/arrow/issues/41186
Re: [I] [C++][Acero] A useless parameter for QueryContext::Init called in hash_join_benchmark [arrow]
pitrou closed issue #41720: [C++][Acero] A useless parameter for QueryContext::Init called in hash_join_benchmark URL: https://github.com/apache/arrow/issues/41720
Re: [I] [C++] Add functionality to MemoryManager for copying a slice of a buffer [arrow]
pitrou closed issue #39858: [C++] Add functionality to MemoryManager for copying a slice of a buffer URL: https://github.com/apache/arrow/issues/39858
[I] [C++][S3] Remove GetBucketRegion hack for newer AWS SDK versions [arrow]
pitrou opened a new issue, #41797: URL: https://github.com/apache/arrow/issues/41797

### Describe the enhancement requested

In https://github.com/aws/aws-sdk-cpp/issues/1885#issuecomment-2118124214 it was pointed out that the "x-amz-bucket-region" header of successful HeadBucket responses is now accessible using `S3Model::HeadBucketResult::GetRegion`. We should use that API whenever possible.

### Component(s)

C++
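A minimal sketch of what that might look like, assuming an SDK version where `HeadBucketResult::GetRegion` is available (the helper name and error handling are illustrative):

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/HeadBucketRequest.h>

// Resolve a bucket's region from the HeadBucket response instead of
// extracting the "x-amz-bucket-region" header by hand.
Aws::String ResolveBucketRegion(const Aws::S3::S3Client& client,
                                const Aws::String& bucket) {
  Aws::S3::Model::HeadBucketRequest req;
  req.SetBucket(bucket);
  auto outcome = client.HeadBucket(req);
  if (!outcome.IsSuccess()) {
    return "";  // error handling elided in this sketch
  }
  return outcome.GetResult().GetRegion();
}
```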
[I] Deserialization as Vector{SubArray} breaks `push!` on DataFrame [arrow-julia]
maleadt opened a new issue, #506: URL: https://github.com/apache/arrow-julia/issues/506

I'm using Arrow v2.7.2 with DataFrames v1.6.1 on Julia 1.10, and am running into an issue that seems to stem from Arrow.jl deserializing my `Vector{Vector{T}}` columns as `Vector{SubArray{...}}`:

```julia
julia> using Arrow, DataFrames

julia> df = DataFrame(foo=Vector{Int}[]);

julia> push!(df, [[1,2,3]])
1×1 DataFrame
 Row │ foo
     │ Array…
─────┼───────────
   1 │ [1, 2, 3]

julia> Arrow.write("/tmp/test.arrow", df)
"/tmp/test.arrow"

julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")));

julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
```

This breaks certain `push!`es on the dataframe, which I haven't been able to reproduce in isolation, but which looks as follows:

```
MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type SubArray{Int64, 1, Arrow.Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}

Stacktrace:
 [1] push!(a::Vector{SubArray{Int64, 1, Arrow.Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}}, item::Vector{Int64})
   @ Base ./array.jl:1118
 [2] _row_inserter!(df::DataFrame, loc::Int64, row::Tuple{String, Vector{Int64}, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, Bool, Bool, Bool, Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, Float64}, mode::Val{:push}, promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:663
 [3] push!(df::DataFrame, row::Tuple{String, Vector{Int64}, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, Bool, Bool, Bool, Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, Float64})
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:457
```

It's possible I'm doing something wrong; first-time Arrow.jl user here.
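One possible workaround (not from the report; assumes the column should be materialized eagerly) is to collect each view into an owned vector after loading:

```julia
using Arrow, DataFrames

df2 = DataFrame(Arrow.Table("/tmp/test.arrow"))
# Collect each SubArray view into an owned Vector{Int64},
# so later push!(df2, ...) calls convert cleanly
df2.foo = map(collect, df2.foo)
```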
[I] [C++] [Python] Add functionality of `STSProfileCredentialsProvider` to default credentials chain for `S3FileSystem` [arrow]
fjetter opened a new issue, #41794: URL: https://github.com/apache/arrow/issues/41794

### Describe the enhancement requested

Given a typical AWS credentials setup that defines IAM roles like the following:

```
# ~/.aws/config
[default]
region=us-east-2
role_arn=arn:aws:iam::123456789012:role/RoleName
source_profile=default

# ~/.aws/credentials
[default]
aws_access_key_id=XXX
aws_secret_access_key=
```

almost all AWS SDKs interpret this correctly as an `assume-role` method that generates a temporary STS token pair. For example, in Python this looks like:

```python
import boto3

b3sess = boto3.Session()
creds = b3sess.get_credentials()
{
    "method": creds.method,
    "secret": creds.secret_key[:5] + "...",
    "token": creds.token[:5] + "...",
}

{'method': 'assume-role', 'secret': 'jALbI...', 'token': 'IQoJb...'}
```

The C++ SDK deviates from how the default credentials chain is implemented elsewhere and does not support this kind of configuration; instead, it uses the plain access key + secret key pair found in the configuration, which does not necessarily provide sufficient permissions.

Dask adopted the S3FileSystem as a more performant alternative to the existing default fsspec filesystem for its Parquet reader, but this lack of support in the C++ SDK is a bit of a nasty blocker for further adoption. We ended up writing a workaround for our benchmarking by using boto to read the credentials and initialize the [S3FileSystem manually](https://github.com/coiled/benchmarks/blob/934a69e0ed093ef7319a5034b87c03a53dc0c0d8/tests/tpch/conftest.py#L290-L301), but this has a couple of flaws. For starters, it is pretty unergonomic and nontrivial, but more importantly it prohibits refreshing the token after expiration (the max duration is 1 hour).

There's been some discussion on the aws-sdk-cpp repo about this, with a suggestion to implement an amended credentials chain (see [here](https://github.com/aws/aws-sdk-cpp/issues/150#issuecomment-538548438)) that includes the `STSProfileCredentialsProvider`, but it's also pointed out that this is flawed as well.

Also related:
- https://github.com/aws/aws-sdk-cpp/issues/2814
- https://github.com/aws/aws-sdk-cpp/pull/2815

I know this is ultimately an aws-sdk-cpp problem, but end users of the arrow `S3FileSystem` do not have this transparency and expect things to "just work", particularly when consuming the Python API, since they are used to how boto and other libraries parse credentials.

cc @pitrou since you've been poking in this area recently

### Component(s)

C++, Python
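For context, a sketch of the kind of manual workaround described above (parameter names follow `pyarrow.fs.S3FileSystem`; the region value mirrors the config snippet; note the frozen credentials never refresh):

```python
import boto3
from pyarrow.fs import S3FileSystem

# Let boto3 run the assume-role flow, then hand the frozen STS
# credentials to the Arrow filesystem.
session = boto3.Session()
frozen = session.get_credentials().get_frozen_credentials()
fs = S3FileSystem(
    access_key=frozen.access_key,
    secret_key=frozen.secret_key,
    session_token=frozen.token,
    region="us-east-2",
)
```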
Re: [I] Can get tables info with schema containing custom field type [arrow]
Curricane closed issue #41722: Can get tables info with schema containing custom field type URL: https://github.com/apache/arrow/issues/41722
Re: [I] [Java] fmpp-maven-plugin generates files directly under target/generated-sources [arrow]
lidavidm closed issue #41787: [Java] fmpp-maven-plugin generates files directly under target/generated-sources URL: https://github.com/apache/arrow/issues/41787
[I] [CI][Integration][Release] RC verification script failed [arrow]
kou opened a new issue, #41792: URL: https://github.com/apache/arrow/issues/41792

### Describe the bug, including details regarding any error messages, version, and platform.

verify-rc-source-integration-linux-almalinux-8-amd64: https://github.com/ursacomputing/crossbow/actions/runs/9191601776/job/25278362624#step:6:86624

```text
Traceback (most recent call last):
  File "/tmp/arrow-HEAD.WiViY/venv-source/bin/archery", line 8, in <module>
    sys.exit(archery())
  File "/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/arrow-HEAD.WiViY/venv-source/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/arrow/dev/archery/archery/cli.py", line 771, in integration
    from .integration.runner import write_js_test_json, run_all_tests
  File "/arrow/dev/archery/archery/integration/runner.py", line 36, in <module>
    from .tester_java import JavaTester
  File "/arrow/dev/archery/archery/integration/tester_java.py", line 53, in <module>
    _arrow_version = load_version_from_pom()
  File "/arrow/dev/archery/archery/integration/tester_java.py", line 37, in load_version_from_pom
    tree = ET.parse(os.path.join(ARROW_BUILD_ROOT, 'java', 'pom.xml'))
  File "/usr/lib64/python3.11/xml/etree/ElementTree.py", line 1218, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.11/xml/etree/ElementTree.py", line 569, in parse
    source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/java/pom.xml'
```

### Component(s)

Continuous Integration, Integration, Release
[I] [CI][Conda] The CondaEnvironment@1 (Conda environment) task has been deprecated since February 13, 2019 and will soon be retired [arrow]
kou opened a new issue, #41791: URL: https://github.com/apache/arrow/issues/41791 ### Describe the bug, including details regarding any error messages, version, and platform. conda-linux-aarch64-cpu-py3: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=64087=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=cf5ca333-5432-59fc-78af-35cb2d46743b=161

```text
##[error]The CondaEnvironment@1 (Conda environment) task has been deprecated since February 13, 2019 and will soon be retired. Use the Conda CLI ('conda') directly from a bash/pwsh/script task. Please visit https://aka.ms/azdo-deprecated-tasks to learn more about deprecated tasks.
```

`CondaEnvironment` is used here: https://github.com/apache/arrow/blob/9185d7dad773ed8768f90fb63ad3ef7e7a92f108/dev/tasks/conda-recipes/azure.linux.yml#L49-L53 ### Component(s) Continuous Integration, Packaging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++] Make git-dependent preprocessor definitions internal [arrow]
kou closed issue #41783: [C++] Make git-dependent preprocessor definitions internal URL: https://github.com/apache/arrow/issues/41783 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Java] fmpp-maven-plugin generates files directly under target/generated-sources [arrow]
laurentgo opened a new issue, #41787: URL: https://github.com/apache/arrow/issues/41787 ### Describe the bug, including details regarding any error messages, version, and platform. `fmpp-maven-plugin` is used in the `arrow-vector` module to generate source files before the compilation phase. Those files are generated directly under `target/generated-sources`, where they conflict with the `target/generated-sources/annotations` directory created by `javac`. By convention, each plugin generates files under its own directory to prevent the risk of conflicts. Although this doesn't cause a direct issue with the build, it may confuse some IDEs (Eclipse and VSCode notably) which detect overlapping source directories. ### Component(s) Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [CI][Python] AMD64 Conda Java C Data Interface Integration Failure building PyArrow trying to use PYARROW_PARQUET [arrow]
kou closed issue #41725: [CI][Python] AMD64 Conda Java C Data Interface Integration Failure building PyArrow trying to use PYARROW_PARQUET URL: https://github.com/apache/arrow/issues/41725 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Add option to disable exact matches in join_asof [arrow]
0x26res opened a new issue, #41786: URL: https://github.com/apache/arrow/issues/41786 ### Describe the enhancement requested I would like to do a `join_asof` that would exclude exact matches. This is supported in pandas via the `allow_exact_matches` option: https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html In the example below, I expect [1, 2, 2, 3] instead of [1, 2, 3, 3].

```python
left = pa.table({"left": [10, 20, 30, 40], "key": [1, 1, 1, 1]})
right = pa.table(
    {
        "right": [9, 12, 30, 41],
        "key": [1, 1, 1, 1],
        "value": [1, 2, 3, 4],
    }
)
assert left.join_asof(
    right, on="left", by="key", tolerance=-10, right_on="right", right_by="key"
) == pa.table(
    {
        "left": [10, 20, 30, 40],
        "key": [1, 1, 1, 1],
        "value_right": [1, 2, 3, 3],  # Should be [1, 2, 2, 3]
    }
)
```

### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
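A hedged workaround sketch while no such option exists: for integer keys, shifting the right-side key by one turns the window's closed upper bound into a strict one, so exact matches are excluded. Names mirror the example above; note that the third row then becomes null, since no strictly-earlier right key lies within 10 units of 30:

```python
import pyarrow as pa
import pyarrow.compute as pc

left = pa.table({"left": [10, 20, 30, 40], "key": [1, 1, 1, 1]})
right = pa.table(
    {"right": [9, 12, 30, 41], "key": [1, 1, 1, 1], "value": [1, 2, 3, 4]}
)

# right + 1 <= left  <=>  right < left: rows where right == left no longer match.
shifted = right.set_column(
    right.column_names.index("right"), "right", pc.add(right["right"], 1)
)
result = left.join_asof(
    shifted, on="left", by="key", tolerance=-10, right_on="right", right_by="key"
)
print(result)  # matched values: [1, 2, null, 3]
```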
[I] Mismatch between package version and library version in naming [arrow]
daeden opened a new issue, #41784: URL: https://github.com/apache/arrow/issues/41784 ### Describe the bug, including details regarding any error messages, version, and platform. **Version**: 16.1.0 **Platform**: Details about the operating system or environment where the bug was found **Summary**: The version number in the library name does not match the version of the package that is installed. This causes load-time issues where we fail to find library dependencies even when the version mismatch is only a minor version change, which should be backwards compatible. **Steps to Reproduce**: 1. Install the package for version 16.1.0 2. List the installed libraries (`ls /lib64/libarrow.so*`) **Expected Result**: The libraries should be labeled with the correct version number (16.1.0), with symlinks for the major version and a non-versioned name. For example, I would expect to find:

```
$ ls /lib64/libarrow.so*
/lib64/libarrow.so@  /lib64/libarrow.so.16@  /lib64/libarrow.so.16.1.0*
```

**Actual Result**: The libraries are labeled with an incorrect version number (1601.0.0):

```
$ ls /lib64/libarrow.so*
/lib64/libarrow.so@  /lib64/libarrow.so.1601@  /lib64/libarrow.so.1601.0.0*
```

### Component(s) Packaging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
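For context, and worth verifying against the versioning logic in `cpp/CMakeLists.txt`: Arrow derives its shared-library SO version from the release version as major * 100 + minor, precisely because minor releases are not guaranteed ABI-compatible, so `libarrow.so.1601` for 16.1.0 is intentional rather than a mislabeling. A sketch of the computation:

```python
def arrow_so_version(major: int, minor: int) -> str:
    # Assumption: mirrors Arrow's CMake logic, e.g. 16.1.0 -> libarrow.so.1601.
    return str(major * 100 + minor)

assert arrow_so_version(16, 1) == "1601"
assert arrow_so_version(16, 0) == "1600"
```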
Re: [I] [C++][Parquet] Thrift: generate template method to accelerate reading thrift [arrow]
pitrou closed issue #41702: [C++][Parquet] Thrift: generate template method to accelerate reading thrift URL: https://github.com/apache/arrow/issues/41702 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] ADBC Python Postgres - Stuck connections to the database [arrow-adbc]
gaspardc-met opened a new issue, #1881: URL: https://github.com/apache/arrow-adbc/issues/1881 ### What happened? Context before the bug (working): - Postgres database on Kubernetes with several tables - 4 services (webapp, machine learning inference, and FastAPI backend APIs) deployed on Kubernetes and fetching data from Postgres - 1 service, a data orchestrator, writing data to Postgres - Fetching data from PG with `pd.read_sql` from pandas and a SQLAlchemy engine - Been doing this for 1+ year without any Postgres issues Switching to ADBC: - Following my upgrade to pandas >2.0.0 I wanted to switch to `adbc_driver_postgresql`'s `dbapi` connection with `pd.read_sql` - Initial tests were great, it was faster than before - Deployed this to production on all aforementioned services twice (initially with connection caching, then with no caching and properly closing each and every connection) - Once again initially smooth, everything worked and was fast Problem: - In both instances of the deployment, within ~12 hours, the connections would be stuck - The webapp or another service would create an ADBC connection, run the SQL query with `pd.read_sql` (we know this through caching), and then wait indefinitely - Reloading the webapp, clearing the webapp cache, or recreating the connection would do nothing at all - The log on the Postgres pod indicated a password issue with the current database/user, which never happened before - Both SQLAlchemy and ADBC get the same Postgres URI to create the engine/connection with - Reverting to SQLAlchemy solved the problem, and the error has not been seen again ### How can we reproduce the bug? - The given URI was `"postgresql://{user}:{password}@{host}:{port}/{db}"` formatted with the proper values - This function was used to create the ADBC connection:

```python
def create_adbc_conn() -> Connection:
    logger_stdout.info(f"Creating a new ADBC connection at {pd.Timestamp.now()}.")
    uri = get_default_uri()  # URI shown above, formatted
    connection = dbapi.connect(uri=uri)
    logger_stdout.info("ADBC connection created")
    return connection
```

- The function to execute the SQL query was:

```python
def handle_sql_query(
    sql: str,
    index_col: Optional[str] = None,
    connection: Optional[Connection] = None,
    need_to_close: bool = False,
) -> pd.DataFrame:
    if connection is None:
        logger_stdout.info(f"Connection is None, creating a new ADBC connection at {pd.Timestamp.now()}.")
        connection = create_adbc_conn()
        need_to_close = True
    try:
        logger_stdout.info("Executing SQL query with connection")
        return pd.read_sql_query(sql=sql, con=connection, index_col=index_col, parse_dates=[index_col])
    finally:
        if need_to_close:
            logger_stdout.info("Closing the ADBC connection.")
            connection.close()
```

- The SQL queries ranged from `select * from TABLE_NAME` to selecting specific columns on a range of specific dates ### Environment/Setup

```
python 3.11
pandas==2.2.2
adbc_driver_postgresql==0.11.0
adbc-driver-manager==0.11.0
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
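A sketch of a connection-per-query pattern with a context manager, which guarantees the connection is closed even on error paths. It assumes the same URI helper as above and that the DBAPI connection object supports the context-manager protocol, as DBAPI-style drivers typically do:

```python
import pandas as pd
from adbc_driver_postgresql import dbapi

def fetch_df(sql: str, uri: str) -> pd.DataFrame:
    # A new short-lived connection per query: nothing can be left stuck open.
    with dbapi.connect(uri) as connection:
        return pd.read_sql_query(sql, con=connection)

# usage sketch:
# df = fetch_df("SELECT * FROM my_table", get_default_uri())
```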
Re: [I] [C++][Parquet] Minor: moving EncodedStats by default rather than copying [arrow]
mapleFU closed issue #41726: [C++][Parquet] Minor: moving EncodedStats by default rather than copying URL: https://github.com/apache/arrow/issues/41726 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++] Make git-dependent preprocessor definitions internal [arrow]
pitrou opened a new issue, #41783: URL: https://github.com/apache/arrow/issues/41783 ### Describe the enhancement requested The `ARROW_GIT_ID` and `ARROW_GIT_DESCRIPTION` preprocessor variables are currently exposed in `arrow/util/config.h` and included from `arrow/config.h`. This means that any file indirectly including these headers has to be recompiled whenever the git information changes - something which happens quite frequently during development. Using ccache with a properly tuned configuration can work around the issue, but it does not fully remove the overhead, and it requires users to think about the best ccache configuration. By making those two variables private, we should fix the problem entirely. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] ParquetDataset object fails with a .read() method due to hive partition schema columns. [arrow]
j0bekt01 closed issue #41779: ParquetDataset object fails with a .read() method due to hive partition schema columns. URL: https://github.com/apache/arrow/issues/41779 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Java][Flight] Flight SQL tests are flaky [arrow]
laurentgo opened a new issue, #41782: URL: https://github.com/apache/arrow/issues/41782 ### Describe the bug, including details regarding any error messages, version, and platform. Several test failures in the `flight-sql` module have been observed in multiple job executions: - https://github.com/apache/arrow/actions/runs/9185953424/job/25260768750 - https://github.com/apache/arrow/actions/runs/9185714602/job/25260156899?pr=41772 The reported issue is

```
Error: Errors:
Error: TestFlightSqlStreams.tearDown:224 » IllegalState Memory was leaked by query. Memory leaked: (250384)
Allocator(ROOT) 0/250384/250896/2147483647 (res/actual/peak/limit)
```

Note that there are also multiple messages about unclosed `ManagedChannels` in: - `org.apache.arrow.flight.auth2.TestBasicAuth2` - `org.apache.arrow.flight.core/org.apache.arrow.flight.TestFlightGrpcUtils.testMultipleGrpcServices` - `org.apache.arrow.flight.core/org.apache.arrow.flight.TestDoExchange.setUp` - `org.apache.arrow.flight.core/org.apache.arrow.flight.TestServerOptions.addHealthCheckService`

```
May 22, 2024 5:50:41 AM io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference cleanQueue
SEVERE: *~*~*~ Previous channel ManagedChannelImpl{logId=505, target=directaddress:///localhost/127.0.0.1:} was garbage collected without being shut down! ~*~*~*
    Make sure to call shutdown()/shutdownNow()
java.lang.RuntimeException: ManagedChannel allocation site
    at io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.<init>(ManagedChannelOrphanWrapper.java:102)
    at io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:60)
    at io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:51)
    at io.grpc.internal@1.63.0/io.grpc.internal.ManagedChannelImplBuilder.build(ManagedChannelImplBuilder.java:672)
    at io.grpc@1.63.0/io.grpc.ForwardingChannelBuilder2.build(ForwardingChannelBuilder2.java:260)
    at org.apache.arrow.flight.core/org.apache.arrow.flight.TestServerOptions.addHealthCheckService(TestServerOptions.java:191)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
```

but those seem to only cause warnings, not errors ### Component(s) FlightRPC, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++][Flight] Flight benchmark doesn't work anymore [arrow]
pitrou opened a new issue, #41780: URL: https://github.com/apache/arrow/issues/41780 ### Describe the bug, including details regarding any error messages, version, and platform. On my local build:

```console
$ /build/build-release/relwithdebinfo/arrow-flight-benchmark
Testing method: DoGet
Using spawned TCP server
Server running with pid 71195
Server host: localhost
Server port: 31337
Failed with error: << IOError: Flight returned unavailable error, with message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:31337: Failed to connect to remote host: Connection refused. Detail: Unavailable
```

### Component(s) Benchmarking, C++, FlightRPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] ParquetDataset object fails with a .read() method due to hive partition schema columns. [arrow]
j0bekt01 opened a new issue, #41779: URL: https://github.com/apache/arrow/issues/41779 ### Describe the bug, including details regarding any error messages, version, and platform. I'm trying to read parquet files from S3 that have a Hive partition '/year=YYYY/month=MM/day=DD/hour=HH/' using the .read() method, but it fails, stating that one of the partition columns doesn't exist. However, if I exclude the partition columns and provide a list of columns that are actually present in the file, it reads without any issues. According to the documentation, the read() method should ignore Hive partition columns.

```python
import datetime

import polars as pl
import pyarrow.parquet as pq
import s3fs

dt = datetime.datetime(2024, 5, 17)
path = f"{bucket}/folder-to-files/year={dt.year}/month={dt.month:02d}/"
dataset = pq.ParquetDataset(path, partitioning='hive', filesystem=s3fs.S3FileSystem())

# This fails
(
    pl.LazyFrame(dataset.read())
    .select(pl.all())
    .head(100)
    .collect()
)

# Remove the partition columns, then it works
cols = dataset.schema.names
[cols.remove(item) for item in ['year', 'month', 'day', 'hour'] if item in cols]
(
    pl.LazyFrame(dataset.read(columns=cols))
    .select(pl.all())
    .head(100)
    .collect()
)
```

windows 11 python 3.10 pyarrow 16.1.0 ### Component(s) Parquet, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
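A hedged workaround sketch while the `ParquetDataset.read()` behaviour is sorted out: go through the `pyarrow.dataset` API directly, which resolves hive partition fields against the discovered schema. The bucket and path are placeholders standing in for the ones in the report:

```python
import pyarrow.dataset as ds
import s3fs

bucket = "my-bucket"  # placeholder for the bucket in the report above

dataset = ds.dataset(
    f"{bucket}/folder-to-files/",
    format="parquet",
    partitioning="hive",
    filesystem=s3fs.S3FileSystem(),
)
# The hive fields (year/month/day/hour) are part of the unified schema here,
# and filters on them prune whole directories before any file is read.
table = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("month") == 5))
```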
Re: [I] [C++][Parquet] Add file metadata read/write benchmark [arrow]
pitrou closed issue #41760: [C++][Parquet] Add file metadata read/write benchmark URL: https://github.com/apache/arrow/issues/41760 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] pyarrow.fs.HadoopFileSystem Usage Problems [arrow]
deep826 opened a new issue, #41777: URL: https://github.com/apache/arrow/issues/41777 ### Describe the usage question you have. Please include as many useful details as possible. hi, I use the pyarrow.fs.HadoopFileSystem client to interact with HDFS. I write some bytes to a file in HDFS, then download it to the local filesystem. When I read the downloaded file using the native Python read API, the result is wrong; when I use the pyarrow HDFS client to read the file in HDFS, the result is right. I'm confused. Here is some pseudo-code:

```python
a = 1000
b = 64
with hdfs_client.open_output_stream(path) as f:
    f.write(a.to_bytes(8, sys.byteorder))
    f.write(b.to_bytes(4, sys.byteorder))
```

Here, I write 12 bytes to the file at `path`, then I download it from HDFS to `local_path` and read these bytes as follows:

```python
with open(local_path, 'rb') as f:
    bs = f.read(12)
a = int.from_bytes(bs[0:8], sys.byteorder)
b = int.from_bytes(bs[8:12], sys.byteorder)
print(f"a: {a}, b: {b}")
```

The printed result is a: 559903, b: 3158573824, but the expected values are a: 1000, b: 64. What is the problem? ### Component(s) C, C++, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
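A sketch for narrowing this down, under the assumption that if a read-back through the same pyarrow client matches the written values, the corruption happened in the download step rather than the write. The connection settings and path are placeholders:

```python
import sys
from pyarrow import fs

hdfs_client = fs.HadoopFileSystem(host="default")  # placeholder connection settings
path = "/tmp/test-bytes.bin"  # placeholder for the path in the question

with hdfs_client.open_input_stream(path) as f:
    bs = f.read(12)

a = int.from_bytes(bs[0:8], sys.byteorder)
b = int.from_bytes(bs[8:12], sys.byteorder)
print(f"a: {a}, b: {b}")  # if this prints 1000 and 64, suspect the download step
```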
[I] Wrong Length value in the example of ListView in Columnar specification document [arrow]
Jagdish-Motwani opened a new issue, #41774: URL: https://github.com/apache/arrow/issues/41774 ### Describe the bug, including details regarding any error messages, version, and platform. In the example Layout: ``ListView`` Array with 5 elements, the length is specified as 4. Shouldn't it be 5? ### Snippet from the website

We continue with the ListView type, but this instance illustrates out of order offsets and sharing of child array values. It is an array with length 5 having logical values:

[[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]

It may have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63  |
|--------------------------|-------------|
| 00011101                 | 0 (padding) |

...

### Component(s) Website -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [Python][Parquet] Documentation to parquet.write_table should be updated for new byte_stream_split encoding options [arrow]
jorisvandenbossche closed issue #41748: [Python][Parquet] Documentation to parquet.write_table should be updated for new byte_stream_split encoding options URL: https://github.com/apache/arrow/issues/41748 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++] Consuming or closing a RecordBatchReader created from a Dataset Scanner does not close underlying files [arrow]
adamreeve opened a new issue, #41771: URL: https://github.com/apache/arrow/issues/41771 ### Describe the bug, including details regarding any error messages, version, and platform. Code to reproduce as a unit test that I added to `cpp/src/arrow/dataset/dataset_test.cc`, which logs the open files in the dataset directory (only works on Linux). This needs some extra headers:

```C++
#include <filesystem>
#include <unistd.h>

#include "arrow/dataset/file_ipc.h"
#include "arrow/ipc/api.h"
```

Test methods:

```C++
void ListOpenFilesInDir(const std::string& directory, const std::string& context) {
  std::cout << "Open files in directory " << directory << " " << context << ":" << std::endl;
  auto open_files = std::filesystem::directory_iterator("/proc/self/fd");
  for (const auto& entry : open_files) {
    char target_path[PATH_MAX];
    ssize_t len = ::readlink(entry.path().c_str(), target_path, PATH_MAX - 1);
    if (len != -1) {
      target_path[len] = '\0';
      std::string open_file_path(target_path);
      if (open_file_path.find(directory) == 0) {
        std::cout << open_file_path << std::endl;
      }
    }
  }
}

TEST(TestDatasetScan, ScanToRecordBatchReader) {
  ASSERT_OK_AND_ASSIGN(auto tempdir,
                       arrow::internal::TemporaryDir::Make("dataset-scan-test-"));
  std::string tempdir_path = tempdir->path().ToString();

  auto schema = arrow::schema({field("x", int64()), field("y", int64())});
  auto table = TableFromJSON(schema, {R"([ [1, 2], [3, 4] ])"});

  auto format = std::make_shared<IpcFileFormat>();
  auto file_system = std::make_shared<arrow::fs::LocalFileSystem>();
  ASSERT_OK_AND_ASSIGN(auto file_path, tempdir->path().Join("data.arrow"));
  std::string file_path_str = file_path.ToString();

  {
    EXPECT_OK_AND_ASSIGN(auto out_stream, file_system->OpenOutputStream(file_path_str));
    ASSERT_OK_AND_ASSIGN(
        auto file_writer,
        MakeFileWriter(out_stream, schema, arrow::ipc::IpcWriteOptions::Defaults()));
    ASSERT_OK(file_writer->WriteTable(*table));
    ASSERT_OK(file_writer->Close());
  }

  std::vector<std::string> paths{file_path_str};
  FileSystemFactoryOptions options;
  ASSERT_OK_AND_ASSIGN(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                         file_system, paths, format, options));
  ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish());

  {
    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

    {
      ASSERT_OK_AND_ASSIGN(auto record_batch_reader, scanner->ToRecordBatchReader());
      ASSERT_OK_AND_ASSIGN(auto read_table, record_batch_reader->ToTable());
      ListOpenFilesInDir(tempdir_path, "after read");
      ASSERT_OK(record_batch_reader->Close());
      ListOpenFilesInDir(tempdir_path, "after close");
    }
    ListOpenFilesInDir(tempdir_path, "after reader destruct");
  }
  ListOpenFilesInDir(tempdir_path, "after scanner destruct");
}
```

When I run this (on Fedora 39, using GCC 13) I get output like:

```
Open files in directory /tmp/dataset-scan-test-268jyz3s/ after read:
/tmp/dataset-scan-test-268jyz3s/data.arrow
Open files in directory /tmp/dataset-scan-test-268jyz3s/ after close:
/tmp/dataset-scan-test-268jyz3s/data.arrow
Open files in directory /tmp/dataset-scan-test-268jyz3s/ after reader destruct:
Open files in directory /tmp/dataset-scan-test-268jyz3s/ after scanner destruct:
```

This shows that neither consuming the `RecordBatchReader` by reading it into a table nor calling the `Close` method results in the IPC file being closed; it's only closed after the reader is destroyed. The `Close` implementation doesn't do anything other than consume all the data: https://github.com/apache/arrow/blob/37e5240e2430564b1c2dfa5d1e6a7a6b58576f83/cpp/src/arrow/dataset/scanner.cc#L113-L120 For context, this causes errors when trying to remove the dataset directory on Windows when using the GLib bindings via Ruby, where there isn't a way to force destruction of the reader and we have to rely on GC (#41750). ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [CI][GLib] Suppress "`unlink': Permission denied" warnings in tests on Windows [arrow]
kou opened a new issue, #41770: URL: https://github.com/apache/arrow/issues/41770 ### Describe the enhancement requested https://github.com/apache/arrow/actions/runs/9183539981/job/25254413025#step:12:83

```text
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-8gmqb9.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-jhtj66.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-cm213m.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-9f22cw.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-l8mur.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-h2rr21.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-8wxgv6.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-n5khu0.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-whjqi1.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-nigllm.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-5d2aoc.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-moorbx.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-vq6f58.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
test/run-test.rb: warning: Exception in finalizer #> C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `unlink': Permission denied @ apply2files - D:/a/_temp/data20240522-5072-giv9k1.parquet (Errno::EACCES)
    from C:/hostedtoolcache/windows/Ruby/3.1.5/x64/lib/ruby/3.1.0/tempfile.rb:265:in `call'
```
[I] [Java] Rework how Java cookbooks are developed and built [arrow-cookbook]
amoeba opened a new issue, #351: URL: https://github.com/apache/arrow-cookbook/issues/351 In https://github.com/apache/arrow-cookbook/pull/350#issuecomment-2121850653 it was pointed out that the way the Java cookbooks work could be improved quite a bit. We might consider two more recent approaches: - https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py - https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/javadoc_inventory.py -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [GLib] Separate version macros for each GLib library [arrow]
kou closed issue #41681: [GLib] Separate version macros for each GLib library URL: https://github.com/apache/arrow/issues/41681 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++][Parquet] `parquet::arrow::FileWriter` does not propagate schema-level metadata when `ArrowWriterProperties::store_schema` is false [arrow]
TheNeuralBit opened a new issue, #41766: URL: https://github.com/apache/arrow/issues/41766 ### Describe the bug, including details regarding any error messages, version, and platform. When `store_schema` is true the `FileWriter` first copies any existing metadata before storing the serialized schema: https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/cpp/src/parquet/arrow/writer.cc#L537-L542 But when `store_schema` is false, the `FileWriter` just returns an empty metadata, and custom metadata is not copied: https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/cpp/src/parquet/arrow/writer.cc#L531-L534 Could someone confirm if this is intentional or not? It looks like an oversight to me and I have a patch ready to address it. ### Component(s) Parquet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
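If it helps triage, the behaviour should also be observable from Python. A hedged repro sketch, assuming the `store_schema` flag exposed by `pyarrow.parquet.write_table` maps onto `ArrowWriterProperties::store_schema`:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Attach custom schema-level metadata before writing.
table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"origin": "test"})

pq.write_table(table, "with_schema.parquet", store_schema=True)
pq.write_table(table, "without_schema.parquet", store_schema=False)

# The custom key should arguably survive in both files.
print(pq.read_metadata("with_schema.parquet").metadata)     # contains b'origin'
print(pq.read_metadata("without_schema.parquet").metadata)  # b'origin' missing
```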
[I] [Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader [arrow]
paleolimbot opened a new issue, #41764: URL: https://github.com/apache/arrow/issues/41764 ### Describe the enhancement requested In https://github.com/apache/parquet-format/pull/240 there is concern regarding the ability to add a new logical type (in this case GEOMETRY) in a backwards compatible way such that readers that don't yet implement support for the new logical type can still read the file. @jorisvandenbossche found the place where the error would be thrown: https://github.com/apache/arrow/blob/34f042762061f4e302e133c2d378ea444505049e/cpp/src/parquet/types.cc#L467 I'm not sure what the best behaviour would be here: it will help drive support for new logical types to actually be written to files if it's possible to know that older readers won't choke on them. There was some indication that this would be a bug ( https://github.com/apache/parquet-format/pull/240#issuecomment-2122972227 ); however, it is definitely safer for a reader in general to error when it encounters a type that it doesn't understand. On the other hand, Arrow C++ silently drops unregistered extension types which, if I'm understanding the issue, is roughly the same. It seems like returning `NoLogicalType::Make();` would fall back to the physical type here; however, it also seems like that should be opt-in somehow and I don't see an obvious route to "type inference" options or similar at that particular place in the code. ### Component(s) Parquet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++] Import/Export ArrowDeviceArrayStream [arrow]
zeroshade closed issue #40078: [C++] Import/Export ArrowDeviceArrayStream URL: https://github.com/apache/arrow/issues/40078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [R][CI] CRAN-style openssl not being picked up [arrow]
assignUser closed issue #41426: [R][CI] CRAN-style openssl not being picked up URL: https://github.com/apache/arrow/issues/41426 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++] Add a benchmark for grouper for preventing performance regression [arrow]
pitrou closed issue #41035: [C++] Add a benchmark for grouper for preventing performance regression URL: https://github.com/apache/arrow/issues/41035 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] adbc_ingest() is dropping rows in Snowflake [arrow-adbc]
zeroshade closed issue #1847: adbc_ingest() is dropping rows in Snowflake URL: https://github.com/apache/arrow-adbc/issues/1847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++][Parquet] Add file metadata read/write benchmark [arrow]
pitrou opened a new issue, #41760: URL: https://github.com/apache/arrow/issues/41760 ### Describe the enhancement requested Following the discussions on the Parquet ML (see [this thread](https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo) and [this thread](https://lists.apache.org/thread/vs3w2z5bk6s3c975rrkqdttr1dpsdn7h)), we should add a benchmark to measure the overhead of Parquet file metadata parsing or serialization for different numbers of row groups and columns. ### Component(s) C++, Parquet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
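A rough Python-level sketch of the kind of measurement involved (not the C++ benchmark the issue asks for; the file name and sizes are made up for illustration):

```python
import time

import pyarrow as pa
import pyarrow.parquet as pq

# Many columns and row groups make the Thrift metadata footer large.
table = pa.table({f"c{i}": range(100) for i in range(200)})
pq.write_table(table, "meta_bench.parquet", row_group_size=10)  # 10 row groups

start = time.perf_counter()
for _ in range(100):
    pq.read_metadata("meta_bench.parquet")  # parses the footer only
elapsed = time.perf_counter() - start
print(f"{elapsed / 100 * 1e3:.3f} ms per metadata parse")
```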
[I] [C++][Python] Segfault when reading a RecordBatchReader constructed from an Arrow Table [arrow]
Mytherin opened a new issue, #41758: URL: https://github.com/apache/arrow/issues/41758 ### Describe the bug, including details regarding any error messages, version, and platform. The following code snippet crashes for me when running PyArrow 16.1 in Python 3.12:

```py
import pyarrow as pa
print(pa.__version__)  # 16.1.0

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader(tbl.to_batches())
print(t.read_all())
# zsh: segmentation fault  python3
```

### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
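As far as I know, `pa.RecordBatchReader` is not meant to be called directly with a list of batches; the supported construction goes through the `from_batches` classmethod (the direct call arguably should raise instead of crash, which is presumably the bug here). A sketch of the working spelling:

```python
import pyarrow as pa

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
# Supported constructor: a schema plus an iterable of record batches.
reader = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())
print(reader.read_all())
```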
Re: [I] [Python] Expose bit_width and byte_width on Python Extension types with underlying fixed type [arrow]
jorisvandenbossche closed issue #41389: [Python] Expose bit_width and byte_width on Python Extension types with underlying fixed type URL: https://github.com/apache/arrow/issues/41389 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++][Python] SEGFAULT when casting FixedSizeTensorArray to storage type then back to FixedSizeTensorArray [arrow]
judahrand opened a new issue, #41756: URL: https://github.com/apache/arrow/issues/41756 ### Describe the bug, including details regarding any error messages, version, and platform. Minimum reproducible example:

```python
import pyarrow

tensor_type = pyarrow.fixed_shape_tensor(pyarrow.int32(), [4])
storage_type = pyarrow.list_(pyarrow.int32(), 4)

py_list = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
storage_arr = pyarrow.array(py_list, storage_type)
arr = pyarrow.ExtensionArray.from_storage(tensor_type, storage_arr)

arr.cast(
    storage_type,
).cast(
    tensor_type,
)
```

### Component(s) C++, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [C++] take into account orc's capabilities for finding tzdb [arrow]
h-vetinari opened a new issue, #41755: URL: https://github.com/apache/arrow/issues/41755 ### Describe the enhancement requested As one of the follow-ups to #36026, https://github.com/apache/orc/pull/1882 got merged into orc 2.0.1, which will use conda(-forge)'s `tzdata` also on windows, even if the `TZDIR` environment variable is not being set (inserting that variable into all user environments would have been very intrusive). Based on this new functionality, I've successfully added orc-on-python support to arrow v13-v15, but some of the _other_ checks introduced in the context of #36026 now fail in https://github.com/conda-forge/pyarrow-feedstock/pull/122, because they haven't yet been taught to allow the case that orc>=2.0.1 now handles. ### Component(s) C++, Packaging, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Hadoop v 2.6.0 [arrow]
dwp0980 opened a new issue, #41753: URL: https://github.com/apache/arrow/issues/41753 ### Describe the usage question you have. Please include as many useful details as possible. Hello, Is it folly to even attempt to connect pyarrow to Hadoop v 2.6.0? At the moment, I'm pinned to python version 3.6.10 and therefore pyarrow 6.0.1. Copilot is telling me that Hadoop 2.7.0 is the minimum supported version, but not specifically where it found that info, and so far my attempts to connect result in: `OSError: HDFS connection failed`. So I just wanted to check whether I'm fighting a losing battle before I go much further with troubleshooting. Many thanks ### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
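For reference, a minimal connection sketch for this legacy stack (pyarrow 6 era). All paths and hosts below are placeholders, and whether Hadoop 2.6 works at all is exactly the open question; this only shows the environment libhdfs typically needs before the `OSError` can be ruled out as a setup problem:

```python
import os

# libhdfs (the JNI bridge pyarrow loads) must be able to locate the JVM,
# the Hadoop native library, and the full Hadoop classpath.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk")
os.environ.setdefault("HADOOP_HOME", "/opt/hadoop")
os.environ.setdefault("ARROW_LIBHDFS_DIR", "/opt/hadoop/lib/native")
# CLASSPATH is usually populated via: export CLASSPATH=$(hadoop classpath --glob)

from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)  # placeholders
print(hdfs.get_file_info(fs.FileSelector("/", recursive=False)))
```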
Re: [I] [C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file [arrow]
mapleFU closed issue #41321: [C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file URL: https://github.com/apache/arrow/issues/41321 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [CI][Integration] Spark jobs are failing with problem on org.apache.arrow.flatbuf [arrow]
lidavidm closed issue #41571: [CI][Integration] Spark jobs are failing with problem on org.apache.arrow.flatbuf URL: https://github.com/apache/arrow/issues/41571 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [Packaging] soversion bumps on minor releases [arrow]
jorisvandenbossche closed issue #41659: [Packaging] soversion bumps on minor releases URL: https://github.com/apache/arrow/issues/41659 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [Python] 'pyarrow._parquet.SortingColumn' object has no attribute 'to_dict' [arrow]
AlenkaF closed issue #41699: [Python] 'pyarrow._parquet.SortingColumn' object has no attribute 'to_dict' URL: https://github.com/apache/arrow/issues/41699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner [arrow]
adamreeve opened a new issue, #41749: URL: https://github.com/apache/arrow/issues/41749 ### Describe the enhancement requested In order to allow efficient processing of large datasets, it should be possible to read a dataset or a scanner using a RecordBatchReader rather than using the `to_table` method. ### Component(s) GLib -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++] Corner case of temp vector stack overflow check [arrow]
felipecrv closed issue #41738: [C++] Corner case of temp vector stack overflow check URL: https://github.com/apache/arrow/issues/41738 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Python][Parquet] Documentation to parquet.write_table should be updated for new byte_stream_split encoding options [arrow]
etseidl opened a new issue, #41748: URL: https://github.com/apache/arrow/issues/41748 ### Describe the enhancement requested The docstring for `parquet.write_table` still says BYTE_STREAM_SPLIT encoding is valid only for floating-point data. This should be updated now that other fixed length types are supported. ### Component(s) Parquet, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
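For the docs update, an example of the wider applicability the issue describes (BYTE_STREAM_SPLIT on integer columns; hedged, and the exact supported-type list should be checked against the C++ docs):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ints": pa.array([1, 2, 3, 4], pa.int32()),
    "floats": pa.array([1.0, 2.0, 3.0, 4.0], pa.float32()),
})

# BYTE_STREAM_SPLIT is no longer limited to FLOAT/DOUBLE columns; dictionary
# encoding must be disabled when specifying explicit column encodings.
pq.write_table(
    table,
    "bss.parquet",
    use_dictionary=False,
    column_encoding={"ints": "BYTE_STREAM_SPLIT", "floats": "BYTE_STREAM_SPLIT"},
)
```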
Re: [I] [Java] arrow-vector 16.1.0 has a change that breaks Java 8 support [arrow]
lidavidm closed issue #41717: [Java] arrow-vector 16.1.0 has a change that breaks Java 8 support URL: https://github.com/apache/arrow/issues/41717 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [CI] GitHub bot cannot run Java CIs [arrow]
kou closed issue #41735: [CI] GitHub bot cannot run Java CIs URL: https://github.com/apache/arrow/issues/41735 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Java][FlightSQL] Arrow Flight Driver returns -1 for getUpdateCount() [arrow]
rcprcp opened a new issue, #41747: URL: https://github.com/apache/arrow/issues/41747 ### Describe the bug, including details regarding any error messages, version, and platform. Tested on Arrow Flight JDBC 15, 15.0.2 and with a locally-built Arrow Flight SQL JDBC Driver 17.0.0-SNAPSHOT. When using a JDBC statement to execute an UPDATE, the getUpdateCount() method seems to always return -1. This seems to be incorrect: getUpdateCount() should return the number of UPDATEd rows. For a data source, we're using the [Voltron docker container](https://hub.docker.com/r/voltrondata/flight-sql). In this debugger image, you can see the return counts from getUpdateCount() and from the actual result set (the update returns one row, with one column, that indicates the number of UPDATEd rows): https://github.com/apache/arrow/assets/17998205/5189462f-e608-42fb-8996-61c871da6360 Here is a screen grab of the output data: https://github.com/apache/arrow/assets/17998205/61f24a80-ee23-4e98-9bed-a281c233237d And, in this screengrab, I used the Postgres driver to update a table in a postgres:latest Docker image: https://github.com/apache/arrow/assets/17998205/bb7dcbfe-370a-4b5d-a1d3-6414e4b3b9ef The data in the Postgres table is different, but the getUpdateCount() method returned 1, which is correct for that data. The test program I used is checked into this GitHub repo: [ZD128958](https://github.com/rcprcp/ZD128958) Thank you! ### Component(s) Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org