[jira] [Created] (ARROW-16475) [Python] Publicly expose Expression._call

2022-05-04 Thread Weston Pace (Jira)
Weston Pace created ARROW-16475:
---

 Summary: [Python] Publicly expose Expression._call
 Key: ARROW-16475
 URL: https://issues.apache.org/jira/browse/ARROW-16475
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Weston Pace


When writing a projection expression I can write something clean when using the 
builtin functions:

{noformat}
dataset.to_table(columns={'projected': pc.ascii_upper(ds.field('name'))})
{noformat}

However, if I am using a custom function (UDF) then there isn't a great 
solution today that I can find.  The best I can come up with is:

{noformat}
dataset.to_table(columns={'projected': pc.Expression._call('my_udf', [ds.field('name')])})
{noformat}

I'd think one approach could be:

{noformat}
dataset.to_table(columns={'projected': pc.call('my_udf', [ds.field('name')])})
{noformat}

However, I'm open to other suggestions as well.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16474) [C++] Fix build package break with Scalar UDF Integration

2022-05-04 Thread Vibhatha Lakmal Abeykoon (Jira)
Vibhatha Lakmal Abeykoon created ARROW-16474:


 Summary: [C++] Fix build package break with Scalar UDF Integration
 Key: ARROW-16474
 URL: https://issues.apache.org/jira/browse/ARROW-16474
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Vibhatha Lakmal Abeykoon


ARROW-15639, resolved by PR [https://github.com/apache/arrow/pull/12590], broke 
some build packages; this was discovered while 8.0.0 was being prepared. A summary 
of the broken build packages can be found here: 
[https://lists.apache.org/thread/6bdwrqnq8y5lrm61m9y1d4wz8slzfkz2] 

The fix was discussed in the PR thread itself.





[jira] [Created] (ARROW-16473) [Go] Memory leak in parquet page reading

2022-05-04 Thread Min-Young Wu (Jira)
Min-Young Wu created ARROW-16473:


 Summary: [Go] Memory leak in parquet page reading
 Key: ARROW-16473
 URL: https://issues.apache.org/jira/browse/ARROW-16473
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Reporter: Min-Young Wu
Assignee: Min-Young Wu


{code:go}
package main_test

import (
	"context"
	"os"
	"testing"

	"github.com/apache/arrow/go/v8/arrow/memory"
	"github.com/apache/arrow/go/v8/parquet"
	"github.com/apache/arrow/go/v8/parquet/file"
	"github.com/apache/arrow/go/v8/parquet/pqarrow"
)

func TestParquetReading(t *testing.T) {
	ctx := context.Background()
	mem := memory.NewCheckedAllocator(memory.DefaultAllocator)
	defer mem.AssertSize(t, 0)

	f, err := os.Open("test.parquet")
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()

	pf, err := file.NewParquetReader(
		f,
		// Note: use the provided memory allocator
		file.WithReadProps(parquet.NewReaderProperties(mem)),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer pf.Close()

	r, err := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, mem)
	if err != nil {
		t.Fatal(err)
	}

	table, err := r.ReadTable(ctx)
	if err != nil {
		t.Fatal(err)
	}
	defer table.Release()
}
{code}






[jira] [Created] (ARROW-16472) [Java] InaccessibleObjectException when using JDK16+

2022-05-04 Thread ZHUO ZHANG (Jira)
ZHUO ZHANG created ARROW-16472:
--

 Summary: [Java] InaccessibleObjectException when using JDK16+
 Key: ARROW-16472
 URL: https://issues.apache.org/jira/browse/ARROW-16472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 4.0.0
Reporter: ZHUO ZHANG


Caused by: java.lang.ExceptionInInitializerError
    at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1161)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:446)
    at org.apache.arrow.vector.BaseFixedWidthVector.handleSafe(BaseFixedWidthVector.java:836)
    at org.apache.arrow.vector.DecimalVector.setSafe(DecimalVector.java:446)
    at net.snowflake.ingest.streaming.internal.ArrowRowBuffer.convertRowToArrow(ArrowRowBuffer.java:698)
    at net.snowflake.ingest.streaming.internal.ArrowRowBuffer.insertRows(ArrowRowBuffer.java:282)
    ... 3 more
Caused by: java.lang.RuntimeException: Failed to initialize MemoryUtil.
    at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:136)
    ... 9 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module @24105dc5
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
    at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:180)
    at java.base/java.lang.reflect.Field.setAccessible(Field.java:174)
    at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:84)
    ... 9 more





[jira] [Created] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values

2022-05-04 Thread Phillip LeBlanc (Jira)
Phillip LeBlanc created ARROW-16471:
---

 Summary: RecordBuilder UnmarshalJSON does not handle extra unknown 
fields with complex values
 Key: ARROW-16471
 URL: https://issues.apache.org/jira/browse/ARROW-16471
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 7.0.0
Reporter: Phillip LeBlanc


The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included 
support for simple unknown fields with a single value.

For example, the following is handled:
{code:javascript}
{"region": "NY", "model": "3", "sales": 742.0, "extra": 1234}
{code}
However, nested objects or arrays are still not handled properly.
{code:javascript}
{"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], 
"extra_object": {"nested": ["deeply"]}}
{code}
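
One way to picture the missing behaviour: when a streaming decoder meets an unknown key, it has to consume the whole value, which for objects and arrays means tracking nesting depth until the matching close. The sketch below is an illustration in Python only, not the Go implementation. The event stream, {{skip_value}} and {{decode_row}} are hypothetical names invented for this example, and the event stream merely stands in for a real streaming tokenizer.

```python
import json


def events(value):
    """Flatten a parsed JSON value into (kind, payload) events.

    This stands in for a streaming tokenizer; it is an assumption of the
    example, not how the Go decoder produces tokens.
    """
    if isinstance(value, dict):
        yield ("start", "{")
        for key, item in value.items():
            yield ("key", key)
            yield from events(item)
        yield ("end", "}")
    elif isinstance(value, list):
        yield ("start", "[")
        for item in value:
            yield from events(item)
        yield ("end", "]")
    else:
        yield ("scalar", value)


def skip_value(stream):
    """Consume exactly one value from the stream, however deeply nested."""
    depth = 0
    for kind, _ in stream:
        if kind == "start":
            depth += 1
        elif kind == "end":
            depth -= 1
        if depth == 0:  # a scalar, or the close matching the first open
            return


def decode_row(line, known_fields):
    """Keep only fields in known_fields; known fields are scalars here."""
    stream = events(json.loads(line))
    next(stream)  # opening "{" of the row object
    row = {}
    for kind, payload in stream:
        if kind == "end":  # closing "}" of the row object
            break
        if payload in known_fields:
            _, row[payload] = next(stream)
        else:
            skip_value(stream)  # drop the unknown value, nested or not
    return row


line = ('{"region": "NY", "model": "3", "sales": 742.0, '
        '"extra_array": [1234], "extra_object": {"nested": ["deeply"]}}')
row = decode_row(line, {"region", "model", "sales"})
```

Reading a single token after an unknown key is enough for {{"extra": 1234}}, but it leaves the decoder in the middle of {{extra_array}} or {{extra_object}}; counting depth handles all three shapes the same way.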





[jira] [Created] (ARROW-16470) [Python][Doc] Document Table.filter capability in compute documentation

2022-05-04 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16470:
-

 Summary: [Python][Doc] Document Table.filter capability in compute 
documentation
 Key: ARROW-16470
 URL: https://issues.apache.org/jira/browse/ARROW-16470
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation
Reporter: Alessandro Molina
 Fix For: 9.0.0








[jira] [Created] (ARROW-16469) [Python] Extend Table.filter to accept Expressions

2022-05-04 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16469:
-

 Summary: [Python] Extend Table.filter to accept Expressions
 Key: ARROW-16469
 URL: https://issues.apache.org/jira/browse/ARROW-16469
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alessandro Molina
 Fix For: 9.0.0


If {{Table.filter}} receives an expression, it should invoke 
{{_exec_plan.filter_table}}.

Also extend the docstring to reflect this change.





[jira] [Created] (ARROW-16468) [Python] Test the _exec_plan.filter_table helper with complex expressions

2022-05-04 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16468:
-

 Summary: [Python] Test the _exec_plan.filter_table helper with 
complex expressions
 Key: ARROW-16468
 URL: https://issues.apache.org/jira/browse/ARROW-16468
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Alessandro Molina


Create a comprehensive test suite for {{_exec_plan.filter_table}} with the 
primary purpose of testing its convenience and ease of use.

(PS: {{pc.field}} and {{pc.scalar}} should be used when building expressions, 
not {{Expression._field}} etc.)





[jira] [Created] (ARROW-16467) [Python] Allow execplan to handle Filter nodes

2022-05-04 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16467:
-

 Summary: [Python] Allow execplan to handle Filter nodes
 Key: ARROW-16467
 URL: https://issues.apache.org/jira/browse/ARROW-16467
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alessandro Molina
 Fix For: 9.0.0


Create a {{filter_table}} helper function in {{_exec_plan}} that allows passing 
a {{Table}} and an {{Expression}} to filter the table with the provided 
expression.





[jira] [Created] (ARROW-16466) Bundle DLLs for JNI interfaces into Maven Jars

2022-05-04 Thread Larry White (Jira)
Larry White created ARROW-16466:
---

 Summary: Bundle DLLs for JNI interfaces into Maven Jars
 Key: ARROW-16466
 URL: https://issues.apache.org/jira/browse/ARROW-16466
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 8.0.0
Reporter: Larry White








[jira] [Created] (ARROW-16465) Create build scripts and documentation for producing DLLs for JNI interfaces

2022-05-04 Thread Larry White (Jira)
Larry White created ARROW-16465:
---

 Summary: Create build scripts and documentation for producing DLLs 
for JNI interfaces
 Key: ARROW-16465
 URL: https://issues.apache.org/jira/browse/ARROW-16465
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 8.0.0
Reporter: Larry White








[jira] [Created] (ARROW-16464) [C++][CI][GPU] Add CUDA CI

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16464:
--

 Summary: [C++][CI][GPU] Add CUDA CI
 Key: ARROW-16464
 URL: https://issues.apache.org/jira/browse/ARROW-16464
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration, GPU
Reporter: Antoine Pitrou
 Fix For: 9.0.0


Arrow C++, PyArrow and perhaps other bindings have CUDA support, but none is 
currently tested on CI, and I think few of the contributors enable CUDA on 
their local builds.

We should definitely exercise CUDA support, at least in the nightly builds 
where we may have more flexibility to use custom machines.





[jira] [Created] (ARROW-16463) [C++] Add support for non-local filesystem URIs in the Substrait consumer

2022-05-04 Thread Weston Pace (Jira)
Weston Pace created ARROW-16463:
---

 Summary: [C++] Add support for non-local filesystem URIs in the 
Substrait consumer
 Key: ARROW-16463
 URL: https://issues.apache.org/jira/browse/ARROW-16463
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Currently the Substrait consumer only accepts URIs that use the {{file}} 
scheme.  We should add support for URI schemes that we support ({{s3}}, 
{{gcfs}}) similar to the way pyarrow can create filesystems from URIs.
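
As a rough illustration of the idea (not the Substrait consumer's code, which is C++), dispatching on the URI scheme could look like the Python sketch below. The scheme names are the ones mentioned above; the {{SUPPORTED_SCHEMES}} table, its placeholder labels and the {{filesystem_kind_and_path}} helper are assumptions made up for this example.

```python
from urllib.parse import urlsplit

# Schemes named in this issue. The values are placeholder labels for the
# example, not real Arrow class names.
SUPPORTED_SCHEMES = {"file": "local", "s3": "s3", "gcfs": "gcs"}


def filesystem_kind_and_path(uri):
    """Split a URI into (filesystem label, path inside that filesystem)."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URI scheme: {parts.scheme!r}")
    if parts.scheme == "file":
        path = parts.path
    else:
        # For object stores the bucket is parsed as the network location.
        path = parts.netloc + parts.path
    return SUPPORTED_SCHEMES[parts.scheme], path


local = filesystem_kind_and_path("file:///tmp/data.parquet")
remote = filesystem_kind_and_path("s3://bucket/data/part-0.parquet")
```

Keeping the scheme table in one place means a consumer that only accepts {{file}} today can be extended by adding entries rather than new code paths.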





[jira] [Created] (ARROW-16462) [C++] CMake cannot find CUDA toolkit

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16462:
--

 Summary: [C++] CMake cannot find CUDA toolkit
 Key: ARROW-16462
 URL: https://issues.apache.org/jira/browse/ARROW-16462
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, GPU
Reporter: Antoine Pitrou


For some reason, after a conda update it seems that CMake is not able to find 
the CUDA toolkit anymore:
{code}
-- Unable to find cudart library.
CMake Error at /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find CUDAToolkit (missing: CUDA_CUDART) (found version
  "10.1.243")
Call Stack (most recent call first):
  /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /home/antoine/miniconda3/envs/pyarrow/share/cmake-3.23/Modules/FindCUDAToolkit.cmake:818 (find_package_handle_standard_args)
  src/arrow/gpu/CMakeLists.txt:40 (find_package)


-- Configuring incomplete, errors occurred!
{code}

This is odd, as the CUDA toolkit is installed as an Ubuntu package.





[jira] [Created] (ARROW-16461) [C++] Sporadic thread sanitizer failure in TaskGroup in debug mode

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16461:
--

 Summary: [C++] Sporadic thread sanitizer failure in TaskGroup in 
debug mode
 Key: ARROW-16461
 URL: https://issues.apache.org/jira/browse/ARROW-16461
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 9.0.0


See 
https://github.com/ursacomputing/crossbow/runs/6291615923?check_suite_focus=true#step:5:6272

The {{ThreadedTaskGroup::finished_}} member can be accessed for debug purposes 
without the internal lock held.





[jira] [Created] (ARROW-16460) [Python] Some dataset tests using PyFileSystem are failing on Windows

2022-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16460:
-

 Summary: [Python] Some dataset tests using PyFileSystem are 
failing on Windows
 Key: ARROW-16460
 URL: https://issues.apache.org/jira/browse/ARROW-16460
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


We have some dataset tests that are skipped on Windows, because they are 
failing with FileNotFound errors.

* https://github.com/apache/arrow/blob/3c3e68c194ca6ac07086ddc1bb44fe153970213e/python/pyarrow/tests/test_dataset.py#L3261-L3264
* https://github.com/apache/arrow/blob/893faa741f34ee450070503566dafb7291e24d9f/python/pyarrow/tests/test_dataset.py#L3124-L3145
  (see https://github.com/apache/arrow/pull/13033#issuecomment-1116180259 
  for some analysis)

In the second case, it seems that for some reason the file paths of the 
fragments are relative to the root of the dataset (while locally for me 
they are absolute paths). 





[jira] [Created] (ARROW-16459) [C++] Update GetFileInfo in FromProto to use async filesystem APIs

2022-05-04 Thread Ariana Villegas (Jira)
Ariana Villegas created ARROW-16459:
---

 Summary: [C++] Update GetFileInfo in FromProto to use async 
filesystem APIs
 Key: ARROW-16459
 URL: https://issues.apache.org/jira/browse/ARROW-16459
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ariana Villegas


The GetGlobFiles function in {{arrow/engine/substrait/relation_internal.cc}} 
discovers directories with sync APIs. However, it would be more efficient to 
use async APIs to avoid blocking calls.
{code:c++}
for (auto res : results) {
    if (res.type() != fs::FileType::Directory) continue;
    selector.base_dir = res.path() + cur;
    ARROW_ASSIGN_OR_RAISE(auto entries, filesystem->GetFileInfo(selector));
}
{code}
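
The difference between the two styles can be sketched in Python with {{asyncio}}. This is only an analogy for the loop above, not Arrow's C++ API: the in-memory {{FAKE_TREE}} and the {{list_dir}} and {{list_all}} helpers are hypothetical and stand in for real filesystem calls.

```python
import asyncio

# Hypothetical in-memory "filesystem": directory -> entries.
FAKE_TREE = {"a": ["a/1.parquet", "a/2.parquet"], "b": ["b/3.parquet"]}


async def list_dir(path):
    """Pretend to list one directory; the sleep stands in for I/O latency."""
    await asyncio.sleep(0.01)
    return FAKE_TREE[path]


async def list_all(directories):
    # Start every listing at once and wait for all of them together,
    # instead of blocking on each directory in turn.
    results = await asyncio.gather(*(list_dir(d) for d in directories))
    # gather() returns results in the order the awaitables were passed.
    return [entry for entries in results for entry in entries]


files = asyncio.run(list_all(["a", "b"]))
```

With the blocking version the total wait grows with the number of directories; with the concurrent version it is bounded by the slowest single listing, which matters most on high-latency filesystems.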





[jira] [Created] (ARROW-16458) [Python] Run S3 tests in the nightly dask integration build

2022-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16458:
-

 Summary: [Python] Run S3 tests in the nightly dask integration 
build
 Key: ARROW-16458
 URL: https://issues.apache.org/jira/browse/ARROW-16458
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


As a follow-up on https://github.com/apache/arrow/pull/13033 (ARROW-16413), we 
should update the {{integration_dask.sh}} script to also run the S3 tests from 
the dask test suite. 

See 
https://github.com/apache/arrow/pull/13033/commits/1bca56e932434d6b0dc947dd51915d83f9dd3a43
 (in that commit I removed that again, because it was still failing due to some 
moto timeout)





[jira] [Created] (ARROW-16457) [Python] Support AWS S3 Web identity credentials

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16457:
--

 Summary: [Python] Support AWS S3 Web identity credentials
 Key: ARROW-16457
 URL: https://issues.apache.org/jira/browse/ARROW-16457
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Antoine Pitrou
 Fix For: 9.0.0


ARROW-10675 added support for AWS S3 Web identity credentials on the C++ side. 
We should bind that functionality on the Python side.

To avoid a proliferation of authentication arguments to the {{S3FileSystem}} 
constructor, some of which are mutually exclusive (but not all), we should probably 
instead add a more flexible {{auth}} argument that could represent the different 
authentication kinds.

There is a bit of API design necessary. IMHO it's probably best if the {{auth}} 
argument is a dedicated {{S3Auth}} object with several constructors, but 
perhaps we can instead admit some kind of dict?
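
To make the first option concrete, here is a minimal sketch of what a dedicated object with several constructors might look like. Everything in it is hypothetical: the {{S3Auth}} class, its field names and the three named constructors are assumptions for discussion, not an existing pyarrow API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class S3Auth:
    """Hypothetical value object describing one way to authenticate."""

    kind: str
    access_key: Optional[str] = None
    secret_key: Optional[str] = None
    session_token: Optional[str] = None

    @classmethod
    def anonymous(cls):
        return cls(kind="anonymous")

    @classmethod
    def from_keys(cls, access_key, secret_key, session_token=None):
        return cls("access_key", access_key, secret_key, session_token)

    @classmethod
    def web_identity(cls):
        # Web identity credentials are assumed to be picked up from the
        # environment, so no arguments are taken in this sketch.
        return cls(kind="web_identity")


auth = S3Auth.web_identity()
keyed = S3Auth.from_keys("my-access-key", "my-secret-key")
```

Named constructors make the mutually exclusive combinations unrepresentable at the call site, which a plain dict would only catch through validation at runtime.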






[jira] [Created] (ARROW-16456) RecordBuilder UnmarshalJSON does not handle extra unknown fields

2022-05-04 Thread Phillip LeBlanc (Jira)
Phillip LeBlanc created ARROW-16456:
---

 Summary: RecordBuilder UnmarshalJSON does not handle extra unknown 
fields
 Key: ARROW-16456
 URL: https://issues.apache.org/jira/browse/ARROW-16456
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 7.0.0
Reporter: Phillip LeBlanc


When calling array.RecordBuilder.UnmarshalJSON with a JSON object that contains 
fields that are unknown to the RecordBuilder's schema, it fails to decode the 
JSON object properly and will panic.





[jira] [Created] (ARROW-16455) [CI] [Packaging] Anaconda storage size exceeded for linux-ppc64le

2022-05-04 Thread Raúl Cumplido (Jira)
Raúl Cumplido created ARROW-16455:
-

 Summary: [CI] [Packaging] Anaconda storage size exceeded for 
linux-ppc64le 
 Key: ARROW-16455
 URL: https://issues.apache.org/jira/browse/ARROW-16455
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Packaging
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido


Our Anaconda storage size for nightlies is exceeded:
{code:java}
"[ERROR] ('Storage requirements exceeded (3221225472 bytes). Payment is 
required to add a file. Please go to 
https://anaconda.org/binstar.settings/billing to update your plan', 402)" {code}
It seems we forgot to add *linux-ppc64le* to the architectures list in this 
fix: [https://github.com/apache/arrow/pull/12604]

See original issue: https://issues.apache.org/jira/browse/ARROW-15898





[jira] [Created] (ARROW-16454) [C++][CI] Sporadic timeouts in arrow-gcsfs-test

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16454:
--

 Summary: [C++][CI] Sporadic timeouts in arrow-gcsfs-test
 Key: ARROW-16454
 URL: https://issues.apache.org/jira/browse/ARROW-16454
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


It seems that {{arrow-gcsfs-test}} might have become less reliable recently, as 
some timeouts have started appearing in some builds, e.g.:
https://github.com/ursacomputing/crossbow/runs/6286469507?check_suite_focus=true#step:5:3464







[jira] [Created] (ARROW-16453) [C++] Thread sanitizer failure in arrow-ipc-read-write-test

2022-05-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16453:
--

 Summary: [C++] Thread sanitizer failure in 
arrow-ipc-read-write-test
 Key: ARROW-16453
 URL: https://issues.apache.org/jira/browse/ARROW-16453
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


This seems to be a sporadic error that happened in {{PreBuffering.MixedAccess}} 
on an unrelated PR:
https://github.com/ursacomputing/crossbow/runs/6286476904?check_suite_focus=true#step:5:4985




