[jira] [Created] (ARROW-16475) [Python] Publically expose Expression._call
Weston Pace created ARROW-16475:

Summary: [Python] Publically expose Expression._call
Key: ARROW-16475
URL: https://issues.apache.org/jira/browse/ARROW-16475
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Weston Pace

When writing a projection expression I can write something clean when using the builtin functions:

{noformat}
dataset.to_table(columns={'projected': pc.ascii_upper(ds.field('name'))})
{noformat}

However, if I am using a custom function (UDF) then there isn't a great solution today that I can find. The best I can come up with is:

{noformat}
dataset.to_table(columns={'projected': pc.Expression._call('my_udf', [ds.field('name')])})
{noformat}

I'd think one approach could be:

{noformat}
dataset.to_table(columns={'projected': pc.call('my_udf', [ds.field('name')])})
{noformat}

However, I'm open to other suggestions as well.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16474) [C++] Fix build package break with Scalar UDF Integration
Vibhatha Lakmal Abeykoon created ARROW-16474: Summary: [C++] Fix build package break with Scalar UDF Integration Key: ARROW-16474 URL: https://issues.apache.org/jira/browse/ARROW-16474 Project: Apache Arrow Issue Type: Bug Reporter: Vibhatha Lakmal Abeykoon The PR that resolved ARROW-15639 ([https://github.com/apache/arrow/pull/12590]) broke some package builds, which was discovered while 8.0.0 was being prepared. A summary of the broken package builds can be found here: [https://lists.apache.org/thread/6bdwrqnq8y5lrm61m9y1d4wz8slzfkz2] The fix was discussed in the PR thread itself. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16473) [Go] Memory leak in parquet page reading
Min-Young Wu created ARROW-16473:

Summary: [Go] Memory leak in parquet page reading
Key: ARROW-16473
URL: https://issues.apache.org/jira/browse/ARROW-16473
Project: Apache Arrow
Issue Type: Bug
Components: Go, Parquet
Reporter: Min-Young Wu
Assignee: Min-Young Wu

{code:go}
package main_test

import (
	"context"
	"os"
	"testing"

	"github.com/apache/arrow/go/v8/arrow/memory"
	"github.com/apache/arrow/go/v8/parquet"
	"github.com/apache/arrow/go/v8/parquet/file"
	"github.com/apache/arrow/go/v8/parquet/pqarrow"
)

func TestParquetReading(t *testing.T) {
	ctx := context.Background()

	mem := memory.NewCheckedAllocator(memory.DefaultAllocator)
	defer mem.AssertSize(t, 0)

	f, err := os.Open("test.parquet")
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()

	pf, err := file.NewParquetReader(
		f,
		// Note: use the provided memory allocator
		file.WithReadProps(parquet.NewReaderProperties(mem)),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer pf.Close()

	r, err := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, mem)
	if err != nil {
		t.Fatal(err)
	}

	table, err := r.ReadTable(ctx)
	if err != nil {
		t.Fatal(err)
	}
	defer table.Release()
}
{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16472) [Java] InaccessibleObjectException when using JDK16+
ZHUO ZHANG created ARROW-16472:

Summary: [Java] InaccessibleObjectException when using JDK16+
Key: ARROW-16472
URL: https://issues.apache.org/jira/browse/ARROW-16472
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 4.0.0
Reporter: ZHUO ZHANG

Caused by: java.lang.ExceptionInInitializerError
    at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1161)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:446)
    at org.apache.arrow.vector.BaseFixedWidthVector.handleSafe(BaseFixedWidthVector.java:836)
    at org.apache.arrow.vector.DecimalVector.setSafe(DecimalVector.java:446)
    at net.snowflake.ingest.streaming.internal.ArrowRowBuffer.convertRowToArrow(ArrowRowBuffer.java:698)
    at net.snowflake.ingest.streaming.internal.ArrowRowBuffer.insertRows(ArrowRowBuffer.java:282)
    ... 3 more
Caused by: java.lang.RuntimeException: Failed to initialize MemoryUtil.
    at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:136)
    ... 9 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module @24105dc5
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
    at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:180)
    at java.base/java.lang.reflect.Field.setAccessible(Field.java:174)
    at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:84)
    ... 9 more

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values
Phillip LeBlanc created ARROW-16471: --- Summary: RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values Key: ARROW-16471 URL: https://issues.apache.org/jira/browse/ARROW-16471 Project: Apache Arrow Issue Type: Bug Components: Go Affects Versions: 7.0.0 Reporter: Phillip LeBlanc The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included support for simple unknown fields with a single value. i.e. {code:javascript} {"region": "NY", "model": "3", "sales": 742.0, "extra": 1234} {code} However, nested objects or arrays are still not handled properly. {code:javascript} {"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], "extra_object": {"nested": ["deeply"]}} {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
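The behavior requested in ARROW-16471 can be illustrated with a short Python sketch. It only models the intended semantics (any key that is not in the schema is dropped, whatever the shape of its value) and is not the Go implementation; `KNOWN_FIELDS` and `decode_record` are hypothetical names chosen for this example.

```python
import json

# Hypothetical schema: the field names the record builder knows about.
KNOWN_FIELDS = ("region", "model", "sales")


def decode_record(line, known_fields=KNOWN_FIELDS):
    """Decode one JSON object, keeping only the fields named in the schema.

    Unknown keys are dropped whether their values are scalars, arrays,
    or nested objects. Known fields missing from the input become None.
    """
    obj = json.loads(line)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return {name: obj.get(name) for name in known_fields}


# The second example from the report: extra array and nested object.
row = decode_record(
    '{"region": "NY", "model": "3", "sales": 742.0, '
    '"extra_array": [1234], "extra_object": {"nested": ["deeply"]}}'
)
print(row)
```

The point of the sketch is that the decoder has to consume the whole unknown value, including every nested level, before moving on to the next key.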
[jira] [Created] (ARROW-16470) [Python][Doc] Document Table.filter capability in compute documentation
Alessandro Molina created ARROW-16470: - Summary: [Python][Doc] Document Table.filter capability in compute documentation Key: ARROW-16470 URL: https://issues.apache.org/jira/browse/ARROW-16470 Project: Apache Arrow Issue Type: Sub-task Components: Documentation Reporter: Alessandro Molina Fix For: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16469) [Python] Extend Table.filter to accept Expressions
Alessandro Molina created ARROW-16469: - Summary: [Python] Extend Table.filter to accept Expressions Key: ARROW-16469 URL: https://issues.apache.org/jira/browse/ARROW-16469 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Fix For: 9.0.0 If {{Table.filter}} receives an expression, it should invoke {{{}_exec_plan.filter_table{}}}. Also extend the docstring to reflect this change. -- This message was sent by Atlassian Jira (v8.20.7#820007)
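A minimal sketch of the dispatch described in ARROW-16469, using plain Python stand-ins rather than pyarrow. Everything here is hypothetical: `Expression` is modeled as a per-row predicate, a table as a list of dicts, and `filter_table` only imitates the role of the `_exec_plan.filter_table` helper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Union

Row = Dict[str, object]


@dataclass(frozen=True)
class Expression:
    """Hypothetical stand-in for a compute expression: a per-row predicate."""
    predicate: Callable[[Row], bool]


def filter_table(table: List[Row], expression: Expression) -> List[Row]:
    """Stand-in for the exec-plan helper: evaluate the expression per row."""
    return [row for row in table if expression.predicate(row)]


def table_filter(table: List[Row], mask: Union[Expression, Sequence[bool]]) -> List[Row]:
    """Accept either an Expression or a boolean mask, as the issue proposes."""
    if isinstance(mask, Expression):
        # New path: hand the expression to the exec-plan helper.
        return filter_table(table, mask)
    if len(mask) != len(table):
        raise ValueError("boolean mask must have one entry per row")
    # Existing path: keep the rows whose mask entry is true.
    return [row for row, keep in zip(table, mask) if keep]


table = [{"name": "a", "n": 1}, {"name": "b", "n": 5}]
by_expression = table_filter(table, Expression(lambda row: row["n"] > 2))
by_mask = table_filter(table, [True, False])
print(by_expression, by_mask)
```

The type check keeps the existing boolean-mask behavior intact while routing expressions to the new helper.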
[jira] [Created] (ARROW-16468) [Python] Test the _exec_plan.filter_table helper with complex expressions
Alessandro Molina created ARROW-16468: - Summary: [Python] Test the _exec_plan.filter_table helper with complex expressions Key: ARROW-16468 URL: https://issues.apache.org/jira/browse/ARROW-16468 Project: Apache Arrow Issue Type: Sub-task Reporter: Alessandro Molina Create a comprehensive test suite for {{_exec_plan.filter_table}} with the primary purpose of testing its convenience and ease of use. (PS: {{pc.field}} and {{pc.scalar}} should be used when building expressions, not {{Expression._field}} etc.) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16467) [Python] Allow execplan to handle Filter nodes
Alessandro Molina created ARROW-16467: - Summary: [Python] Allow execplan to handle Filter nodes Key: ARROW-16467 URL: https://issues.apache.org/jira/browse/ARROW-16467 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Fix For: 9.0.0 Create a {{filter_table}} helper function in {{_exec_plan}} that allows passing a {{Table}} and an {{Expression}} to filter the table with the provided expression. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16466) Bundle DLLs for JNI interfaces into Maven Jars
Larry White created ARROW-16466: --- Summary: Bundle DLLs for JNI interfaces into Maven Jars Key: ARROW-16466 URL: https://issues.apache.org/jira/browse/ARROW-16466 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 8.0.0 Reporter: Larry White -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16465) Create build scripts and documentation for producing DLLs for JNI interfaces
Larry White created ARROW-16465: --- Summary: Create build scripts and documentation for producing DLLs for JNI interfaces Key: ARROW-16465 URL: https://issues.apache.org/jira/browse/ARROW-16465 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 8.0.0 Reporter: Larry White -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16464) [C++][CI][GPU] Add CUDA CI
Antoine Pitrou created ARROW-16464: -- Summary: [C++][CI][GPU] Add CUDA CI Key: ARROW-16464 URL: https://issues.apache.org/jira/browse/ARROW-16464 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration, GPU Reporter: Antoine Pitrou Fix For: 9.0.0 Arrow C++, PyArrow and perhaps other bindings have CUDA support, but none is currently tested on CI, and I think few of the contributors enable CUDA on their local builds. We should definitely exercise CUDA support, at least in the nightly builds where we may have more flexibility to use custom machines. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16463) [C++] Add support for non-local filesystem URIs in the Substrait consumer
Weston Pace created ARROW-16463: --- Summary: [C++] Add support for non-local filesystem URIs in the Substrait consumer Key: ARROW-16463 URL: https://issues.apache.org/jira/browse/ARROW-16463 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Currently the Substrait consumer only accepts URIs that use the {{file}} scheme. We should add support for URI schemes that we support ({{s3}}, {{gcfs}}) similar to the way pyarrow can create filesystems from URIs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
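The scheme dispatch asked for in ARROW-16463 could look roughly like the following Python sketch. It is not the C++ consumer code: `SUPPORTED_SCHEMES`, the filesystem labels, and `resolve_filesystem` are assumptions made for illustration, and the scheme names are simply the ones mentioned in the issue.

```python
from urllib.parse import urlsplit

# Schemes named in the issue, mapped to placeholder filesystem labels.
SUPPORTED_SCHEMES = {
    "file": "LocalFileSystem",
    "s3": "S3FileSystem",
    "gcfs": "GcsFileSystem",
}


def resolve_filesystem(uri):
    """Return (filesystem_label, path) for a URI.

    Raises ValueError for schemes that are not in SUPPORTED_SCHEMES,
    including URIs with no scheme at all.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    if scheme == "file":
        # Local paths live entirely in the path component.
        path = parts.path
    else:
        # For object stores the authority component is the bucket name.
        path = parts.netloc + parts.path
    return SUPPORTED_SCHEMES[scheme], path


print(resolve_filesystem("file:///tmp/data.parquet"))
print(resolve_filesystem("s3://bucket/key/data.parquet"))
```

Rejecting unknown schemes explicitly keeps the failure mode clear, instead of silently treating every URI as a local file.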
[jira] [Created] (ARROW-16462) [C++] CMake cannot find CUDA toolkit
Antoine Pitrou created ARROW-16462:

Summary: [C++] CMake cannot find CUDA toolkit
Key: ARROW-16462
URL: https://issues.apache.org/jira/browse/ARROW-16462
Project: Apache Arrow
Issue Type: Bug
Components: C++, GPU
Reporter: Antoine Pitrou

For some reason, after a conda update it seems that CMake is not able to find the CUDA toolkit anymore:

{code}
-- Unable to find cudart library.
CMake Error at /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find CUDAToolkit (missing: CUDA_CUDART) (found version "10.1.243")
Call Stack (most recent call first):
  /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindCUDAToolkit.cmake:818 (find_package_handle_standard_args)
  src/arrow/gpu/CMakeLists.txt:40 (find_package)

-- Configuring incomplete, errors occurred!
{code}

which is weird as the CUDA toolkit is installed as an Ubuntu package.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16461) [C++] Sporadic thread sanitizer failure in TaskGroup in debug mode
Antoine Pitrou created ARROW-16461: -- Summary: [C++] Sporadic thread sanitizer failure in TaskGroup in debug mode Key: ARROW-16461 URL: https://issues.apache.org/jira/browse/ARROW-16461 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 9.0.0 See https://github.com/ursacomputing/crossbow/runs/6291615923?check_suite_focus=true#step:5:6272 The {{ThreadedTaskGroup::finished_}} member can be accessed for debug purposes without the internal lock held. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16460) [Python] Some dataset tests using PyFileSystem are failing on Windows
Joris Van den Bossche created ARROW-16460:

Summary: [Python] Some dataset tests using PyFileSystem are failing on Windows
Key: ARROW-16460
URL: https://issues.apache.org/jira/browse/ARROW-16460
Project: Apache Arrow
Issue Type: Test
Components: Python
Reporter: Joris Van den Bossche

We have some dataset tests that are skipped on Windows, because they are failing with FileNotFound errors.

* https://github.com/apache/arrow/blob/3c3e68c194ca6ac07086ddc1bb44fe153970213e/python/pyarrow/tests/test_dataset.py#L3261-L3264
* https://github.com/apache/arrow/blob/893faa741f34ee450070503566dafb7291e24d9f/python/pyarrow/tests/test_dataset.py#L3124-L3145

(and see https://github.com/apache/arrow/pull/13033#issuecomment-1116180259 for some analysis)

In the second case, it seems that for some reason, the file paths of the fragments are relative paths to the root of the dataset (while locally for me this gives absolute paths).

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16459) [C++] Update GetFileInfo in FromProto to use async filesystem APIs
Ariana Villegas created ARROW-16459:

Summary: [C++] Update GetFileInfo in FromProto to use async filesystem APIs
Key: ARROW-16459
URL: https://issues.apache.org/jira/browse/ARROW-16459
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Ariana Villegas

The GetGlobFiles function in {{arrow/engine/substrait/relation_internal.cc}} discovers directories with sync APIs. However, it would be more efficient to use async APIs to avoid blocking calls.

{code:c++}
for (auto res : results) {
  if (res.type() != fs::FileType::Directory) continue;
  selector.base_dir = res.path() + cur;
  ARROW_ASSIGN_OR_RAISE(auto entries, filesystem->GetFileInfo(selector));
}
{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
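The difference between listing directories one at a time and issuing all listings together can be sketched in Python with `asyncio`. This is only an analogy for the change proposed in ARROW-16459, not Arrow's C++ filesystem API; `list_dir` and `list_dirs_concurrently` are hypothetical helpers, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import os


async def list_dir(path):
    """List one directory in a worker thread so the event loop is not blocked."""
    names = await asyncio.to_thread(os.listdir, path)
    return sorted(os.path.join(path, name) for name in names)


async def list_dirs_concurrently(paths):
    """Start every listing at once, then collect the results in input order."""
    listings = await asyncio.gather(*(list_dir(path) for path in paths))
    return [entry for listing in listings for entry in listing]


if __name__ == "__main__":
    # Usage: list the current directory; pass several paths to overlap the I/O.
    print(asyncio.run(list_dirs_concurrently(["."])))
```

With a remote filesystem, where each listing is a network round trip, overlapping the requests this way is where the benefit would be expected.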
[jira] [Created] (ARROW-16458) [Python] Run S3 tests in the nightly dask integration build
Joris Van den Bossche created ARROW-16458: - Summary: [Python] Run S3 tests in the nightly dask integration build Key: ARROW-16458 URL: https://issues.apache.org/jira/browse/ARROW-16458 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Python Reporter: Joris Van den Bossche As a follow-up on https://github.com/apache/arrow/pull/13033 (ARROW-16413), we should update the {{integration_dask.sh}} script to also run the S3 tests from the dask test suite. See https://github.com/apache/arrow/pull/13033/commits/1bca56e932434d6b0dc947dd51915d83f9dd3a43 (in that commit I removed that again, because it was still failing due to some moto timeout) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16457) [Python] Support AWS S3 Web identity credentials
Antoine Pitrou created ARROW-16457: -- Summary: [Python] Support AWS S3 Web identity credentials Key: ARROW-16457 URL: https://issues.apache.org/jira/browse/ARROW-16457 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou Fix For: 9.0.0 ARROW-10675 added support for AWS S3 Web identity credentials on the C++ side. We should bind that functionality on the Python side. To avoid a proliferation of authentication arguments to the {{S3FileSystem}} constructor, some of which are mutually exclusive (but not all), we should probably instead add a more flexible {{auth}} argument that could represent the different authentication kinds. There is a bit of API design necessary. IMHO it's probably best if the {{auth}} argument is a dedicated {{S3Auth}} object with several constructors, but perhaps we can instead admit some kind of dict? -- This message was sent by Atlassian Jira (v8.20.7#820007)
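One possible shape for the dedicated {{S3Auth}} object with several constructors is sketched below. This is purely a design illustration: the class, its fields, and the constructor names are all hypothetical and do not exist in pyarrow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class S3Auth:
    """Hypothetical value object describing exactly one authentication kind."""
    kind: str
    access_key: Optional[str] = None
    secret_key: Optional[str] = None
    session_token: Optional[str] = None

    @classmethod
    def anonymous(cls):
        """No credentials at all."""
        return cls(kind="anonymous")

    @classmethod
    def from_access_key(cls, access_key, secret_key, session_token=None):
        """Explicit key pair, with an optional session token."""
        if not access_key or not secret_key:
            raise ValueError("access_key and secret_key are both required")
        return cls("access_key", access_key, secret_key, session_token)

    @classmethod
    def from_web_identity(cls):
        """Web identity credentials; no explicit keys are passed."""
        return cls(kind="web_identity")


auth = S3Auth.from_web_identity()
print(auth.kind)
```

Because each named constructor accepts only the arguments that make sense for its kind, mutually exclusive combinations cannot be expressed, which is the problem the issue raises with a flat list of constructor arguments.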
[jira] [Created] (ARROW-16456) RecordBuilder UnmarshalJSON does not handle extra unknown fields
Phillip LeBlanc created ARROW-16456: --- Summary: RecordBuilder UnmarshalJSON does not handle extra unknown fields Key: ARROW-16456 URL: https://issues.apache.org/jira/browse/ARROW-16456 Project: Apache Arrow Issue Type: Bug Components: Go Affects Versions: 7.0.0 Reporter: Phillip LeBlanc When calling array.RecordBuilder.UnmarshalJSON with a JSON object that contains fields that are unknown to the RecordBuilder's schema, it fails to decode the JSON object properly and will panic. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16455) [CI] [Packaging] Anaconda storage size exceeded for linux-ppc64le
Raúl Cumplido created ARROW-16455: - Summary: [CI] [Packaging] Anaconda storage size exceeded for linux-ppc64le Key: ARROW-16455 URL: https://issues.apache.org/jira/browse/ARROW-16455 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Packaging Reporter: Raúl Cumplido Assignee: Raúl Cumplido Our Anaconda storage size for nightlies is exceeded: {code:java} "[ERROR] ('Storage requirements exceeded (3221225472 bytes). Payment is required to add a file. Please go to https://anaconda.org/binstar.settings/billing to update your plan', 402)" {code} It seems we forgot to add *linux-ppc64le* to the architectures list on this fix: [https://github.com/apache/arrow/pull/12604] See original issue: https://issues.apache.org/jira/browse/ARROW-15898 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16454) [C++][CI] Sporadic timeouts in arrow-gcsfs-test
Antoine Pitrou created ARROW-16454: -- Summary: [C++][CI] Sporadic timeouts in arrow-gcsfs-test Key: ARROW-16454 URL: https://issues.apache.org/jira/browse/ARROW-16454 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou It seems that {{arrow-gcsfs-test}} might have become less reliable recently, as some timeouts have started appearing in some builds, e.g.: https://github.com/ursacomputing/crossbow/runs/6286469507?check_suite_focus=true#step:5:3464 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16453) [C++] Thread sanitizer failure in arrow-ipc-read-write-test
Antoine Pitrou created ARROW-16453: -- Summary: [C++] Thread sanitizer failure in arrow-ipc-read-write-test Key: ARROW-16453 URL: https://issues.apache.org/jira/browse/ARROW-16453 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou This seems to be a sporadic error that happened in {{PreBuffering.MixedAccess}} on an unrelated PR: https://github.com/ursacomputing/crossbow/runs/6286476904?check_suite_focus=true#step:5:4985 -- This message was sent by Atlassian Jira (v8.20.7#820007)