[jira] [Created] (ARROW-9368) [Python] Rename predicate argument to filter in split_by_row_group()
Joris Van den Bossche created ARROW-9368:

Summary: [Python] Rename predicate argument to filter in split_by_row_group()
Key: ARROW-9368
URL: https://issues.apache.org/jira/browse/ARROW-9368
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
Fix For: 1.0.0

For consistency with to_table() and get_fragments().

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9369) Can't convert dictionary type using table.from_pandas
Tomas Remes created ARROW-9369:

Summary: Can't convert dictionary type using table.from_pandas
Key: ARROW-9369
URL: https://issues.apache.org/jira/browse/ARROW-9369
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.17.1
Reporter: Tomas Remes

Hello, I am trying to do the following (please correct me if I am doing some nonsense):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fields = [pa.field("object", pa.dictionary(pa.int64(), pa.string()))]
data = {"object": { "a": "a", "b": "b", "c": "c", "s": "d" }}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df, pa.schema(fields))
pq.write_table(table, "test.parquet")
{code}

and I am getting:

{noformat}
Traceback (most recent call last):
  File "pa_test.py", line 17, in <module>
    table = pa.Table.from_pandas(df, pa.schema(fields))
  File "pyarrow/table.pxi", line 1451, in pyarrow.lib.Table.from_pandas
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 566, in convert_column
    raise e
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 560, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('Sequence converter for type dictionary not implemented', 'Conversion failed for column object with type object')
{noformat}

A workaround is to use {{df.to_parquet("test.parquet")}}.
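For readers unfamiliar with the dictionary type requested in the schema above: it stores each value as an integer index into a list of unique values. The following is a minimal pure-Python sketch of that idea, for illustration only. It is not pyarrow's implementation, and the helper names `encode_as_dictionary` and `decode_dictionary` are hypothetical.

```python
def encode_as_dictionary(values):
    """Split a sequence into (indices, dictionary).

    Each non-null value is replaced by the position of its first
    occurrence in `dictionary`; nulls (None) stay None in `indices`.
    """
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def decode_dictionary(indices, dictionary):
    """Rebuild the original sequence from (indices, dictionary)."""
    return [None if i is None else dictionary[i] for i in indices]


column = ["a", "b", "a", "d", None, "b"]
indices, dictionary = encode_as_dictionary(column)
print(indices)      # [0, 1, 0, 2, None, 1]
print(dictionary)   # ['a', 'b', 'd']
```

Decoding the pair with `decode_dictionary` gives back the original column, which is why the encoding is lossless for repeated string values.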
[jira] [Created] (ARROW-9370) [Java] Bump Netty version
Ryan Murray created ARROW-9370:

Summary: [Java] Bump Netty version
Key: ARROW-9370
URL: https://issues.apache.org/jira/browse/ARROW-9370
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray

As per https://github.com/apache/arrow/pull/7619#issuecomment-655246147 there is a security vulnerability in the current version of Netty. This will upgrade to the latest version.
[jira] [Created] (ARROW-9371) [Java] Run vector tests for both allocators
Ryan Murray created ARROW-9371:

Summary: [Java] Run vector tests for both allocators
Key: ARROW-9371
URL: https://issues.apache.org/jira/browse/ARROW-9371
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray

As per https://github.com/apache/arrow/pull/7619#discussion_r451140735 the vector tests should be run for both the Netty and unsafe allocators.
[jira] [Created] (ARROW-9372) [Dev][Archery] conda-python Docker image fails running
Antoine Pitrou created ARROW-9372:

Summary: [Dev][Archery] conda-python Docker image fails running
Key: ARROW-9372
URL: https://issues.apache.org/jira/browse/ARROW-9372
Project: Apache Arrow
Issue Type: Bug
Components: Archery, Developer Tools
Reporter: Antoine Pitrou

I tried this:

{code}
archery docker run -e PYTHON=3.6 conda-python
{code}

And after the Docker image was built, running it failed with:

{code}
+ pushd /arrow/python
/arrow/python /
++ realpath --relative-to=. /build/python
+ relative_build_dir=../../build/python
+ 3.6 setup.py build --build-base /build/python install --single-version-externally-managed --record ../../build/python/record.txt
/arrow/ci/scripts/python_build.sh: line 50: 3.6: command not found
{code}

Yet the comments in the {{docker-compose.yml}} say:

{code}
  conda-python:
    # Usage:
    #   docker-compose build conda-cpp
    #   docker-compose build conda-python
    #   docker-compose run --rm conda-python
    # Parameters:
    #   ARCH: amd64, arm32v7
    #   PYTHON: 3.6, 3.7, 3.8
{code}
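The log above suggests that the value passed as PYTHON (a version number) ended up in the command position, where an interpreter name was expected. As a rough illustration of how a script could tell the two meanings apart, here is a small sketch. It is an assumption about one possible approach, not the fix applied in Arrow, and the helper name `interpreter_command` is hypothetical.

```python
import re

# A bare version such as "3.6" or "3" (digits separated by dots).
_VERSION_RE = re.compile(r"\d+(\.\d+)*")


def interpreter_command(python_value, default="python"):
    """Map a PYTHON setting to a command name.

    - empty or unset      -> the default interpreter name
    - bare version "3.6"  -> "python3.6"
    - anything else       -> returned unchanged (assumed to be a command or path)
    """
    if not python_value:
        return default
    if _VERSION_RE.fullmatch(python_value):
        return f"python{python_value}"
    return python_value


print(interpreter_command("3.6"))               # python3.6
print(interpreter_command("/usr/bin/python3"))  # /usr/bin/python3
print(interpreter_command(""))                  # python
```

Whether the resulting command exists in the image is a separate question; this only shows the disambiguation step.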
[jira] [Created] (ARROW-9373) [C++] Fix Parquet crash on invalid input (OSS-Fuzz)
Antoine Pitrou created ARROW-9373:

Summary: [C++] Fix Parquet crash on invalid input (OSS-Fuzz)
Key: ARROW-9373
URL: https://issues.apache.org/jira/browse/ARROW-9373
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
Fix For: 1.0.0
[GitHub] [arrow-testing] pitrou opened a new pull request #36: ARROW-9373: Add Parquet fuzz regression file
pitrou opened a new pull request #36: URL: https://github.com/apache/arrow-testing/pull/36 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-testing] pitrou merged pull request #36: ARROW-9373: Add Parquet fuzz regression file
pitrou merged pull request #36: URL: https://github.com/apache/arrow-testing/pull/36
[jira] [Created] (ARROW-9374) [C++][Python] Expose MakeArrayFromScalar
Krisztian Szucs created ARROW-9374:

Summary: [C++][Python] Expose MakeArrayFromScalar
Key: ARROW-9374
URL: https://issues.apache.org/jira/browse/ARROW-9374
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
Fix For: 1.0.0

Currently there is no efficient way to create a pyarrow array with identical values.
[jira] [Created] (ARROW-9375) [FlightRPC][Integration] Add support for setting metadata version for integration tests
David Li created ARROW-9375:

Summary: [FlightRPC][Integration] Add support for setting metadata version for integration tests
Key: ARROW-9375
URL: https://issues.apache.org/jira/browse/ARROW-9375
Project: Apache Arrow
Issue Type: Improvement
Components: FlightRPC, Integration
Reporter: David Li
Assignee: David Li
[jira] [Created] (ARROW-9376) [Python]
Athanassios Hatzis created ARROW-9376:

Summary: [Python]
Key: ARROW-9376
URL: https://issues.apache.org/jira/browse/ARROW-9376
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.17.1
Reporter: Athanassios Hatzis

h3. First try

{code:python}
data = [pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])]
batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
{code}

Hi, I use the PyCharm IDE for development, and I get the following inspection description when I write the code above in the editor:

_Expected type 'RecordBatch', got 'List[Union[Union[ChunkedArray, Array], Any]]' instead_

_Inspection info: This inspection detects type errors in function call expressions. Due to dynamic dispatch and duck typing, this is possible in a limited but useful number of cases. Types of function parameters can be specified in docstrings or in Python 3 function annotations._

h3. Second try

{code:python}
batch = pa.RecordBatch.from_arrays(data, names=['f0', 'f1', 'f2'])
{code}

Then you get these inspection descriptions:

_Parameter 'list_arrays' unfilled_

_Passing list instead of pyarrow.lib.RecordBatch.RecordBatch. Is this intentional?_

h3. Third try

{code:python}
batch = pa.RecordBatch.from_arrays(list_arrays=data, names=['f0', 'f1', 'f2'])
{code}

Then you get an inspection description and a type error:

_Parameter 'self' unfilled_

_TypeError: from_arrays() takes at least 1 positional argument (0 given)_

Similar behaviour occurs with pa.Table.from_arrays.
[jira] [Created] (ARROW-9377) [Java] Support unsigned dictionary indices
Wes McKinney created ARROW-9377:

Summary: [Java] Support unsigned dictionary indices
Key: ARROW-9377
URL: https://issues.apache.org/jira/browse/ARROW-9377
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Wes McKinney

Child of ARROW-9259.
[jira] [Created] (ARROW-9378) [Go] Support unsigned dictionary indices
Wes McKinney created ARROW-9378:

Summary: [Go] Support unsigned dictionary indices
Key: ARROW-9378
URL: https://issues.apache.org/jira/browse/ARROW-9378
Project: Apache Arrow
Issue Type: Improvement
Components: Go
Reporter: Wes McKinney
[jira] [Created] (ARROW-9379) [Rust] Support unsigned dictionary indices
Wes McKinney created ARROW-9379:

Summary: [Rust] Support unsigned dictionary indices
Key: ARROW-9379
URL: https://issues.apache.org/jira/browse/ARROW-9379
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Wes McKinney
[jira] [Created] (ARROW-9380) [C++] Segfaults in compute::CallFunction
Neal Richardson created ARROW-9380:

Summary: [C++] Segfaults in compute::CallFunction
Key: ARROW-9380
URL: https://issues.apache.org/jira/browse/ARROW-9380
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Neal Richardson

I triggered these from R, so that's what the reproducers are in.

1. Calling "filter" with no args segfaults.

{code:r}
arrow:::compute__CallFunction("filter", list(), list(keep_na = FALSE))
{code}

Top of the backtrace from lldb:

{code}
* frame #0: 0x000109e1c2c7 libarrow.100.dylib`arrow::Datum::type() const + 7
  frame #1: 0x00010a14a232 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 66
  frame #2: 0x000109fc32c9 libarrow.100.dylib`arrow::compute::MetaFunction::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 41
  frame #3: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
  frame #4: 0x000109fb3c47 libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 599
{code}

This is not the case with at least some other functions. If I try to call "sum" with no args, I get {{Invalid: Function accepts 1 arguments but passed 0}} and no segfault.

2. Something is strange with is_null. It creates what appears to be a valid boolean array, but if I pass it to filter, it segfaults. I'm adding bindings for this in ARROW-9187 but this should run on current master:

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("is_null", a))
a$Filter(b)
{code}

Backtrace:

{code}
* frame #0: 0x00010a120bb6 libarrow.100.dylib`arrow::compute::internal::GetFilterOutputSize(arrow::ArrayData const&, arrow::compute::FilterOptions::NullSelectionBehavior) + 38
  frame #1: 0x00010a125659 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::PrimitiveFilter(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*) + 121
  frame #2: 0x000109fbbea4 libarrow.100.dylib`arrow::compute::detail::VectorExecutor::ExecuteBatch(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) + 996
  frame #3: 0x000109fba3e6 libarrow.100.dylib`arrow::compute::detail::VectorExecutor::Execute(std::__1::vector > const&, arrow::compute::detail::ExecListener*) + 150
  frame #4: 0x000109fc0948 libarrow.100.dylib`arrow::compute::Function::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 1016
  frame #5: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
  frame #6: 0x00010a14a9b5 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 1989
  frame #7: 0x000109fc32c9 libarrow.100.dylib`arrow::compute::MetaFunction::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 41
  frame #8: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
  frame #9: 0x000109fb3c47 libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 599
{code}

BUT: if I call {{as.vector}} on {{b}} before using it as a Filter, it works--even though I've discarded the as.vector result and am still using the Array to filter.

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("is_null", a))
as.vector(b)
a$Filter(b)
{code}

Just printing (calling {{ToString}}) on {{b}} doesn't prevent the segfault. And I have not observed this with other boolean kernels. E.g. this does not segfault:

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("greater", a, Scalar$create(3L)))
a$Filter(b)
{code}
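The contrast in the first case ("sum" reports an error, "filter" crashes) comes down to whether the argument count is checked before the arguments are touched. The sketch below shows that check in a toy function registry. It is a simplified pure-Python model, not Arrow's C++ code: the names `Function`, `register`, `call_function` and `InvalidArguments` are hypothetical, and only the error message wording is taken from the report.

```python
class InvalidArguments(Exception):
    """Raised instead of crashing when a call has the wrong arity."""


class Function:
    def __init__(self, name, arity, impl):
        self.name = name
        self.arity = arity
        self.impl = impl

    def execute(self, args):
        # Validate the argument count before the implementation
        # reads args[0], args[1], ...
        if len(args) != self.arity:
            raise InvalidArguments(
                f"Function accepts {self.arity} arguments but passed {len(args)}"
            )
        return self.impl(*args)


REGISTRY = {}


def register(name, arity, impl):
    REGISTRY[name] = Function(name, arity, impl)


def call_function(name, args):
    if name not in REGISTRY:
        raise KeyError(f"No function registered with name: {name}")
    return REGISTRY[name].execute(args)


register("sum", 1, lambda values: sum(values))
register(
    "filter",
    2,
    lambda values, mask: [v for v, keep in zip(values, mask) if keep],
)

print(call_function("filter", [[1, 2, 3, 4], [True, False, True, False]]))  # [1, 3]
try:
    call_function("filter", [])
except InvalidArguments as exc:
    print(exc)  # Function accepts 2 arguments but passed 0
```

Putting the check in the shared `execute` path means every registered function gets it, rather than relying on each implementation to guard itself.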
[jira] [Created] (ARROW-9381) [Python] test_dataset_schema_metadata fails on AppVeyor fork
Antoine Pitrou created ARROW-9381:

Summary: [Python] test_dataset_schema_metadata fails on AppVeyor fork
Key: ARROW-9381
URL: https://issues.apache.org/jira/browse/ARROW-9381
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Antoine Pitrou

I have this consistent error on all builds on my AppVeyor account:
https://ci.appveyor.com/project/pitrou/arrow/builds/33985399/job/mxb95s5u6f0aoaxj#L1756

{code}
    raise ImportError(
>       "Unable to find a usable engine; "
        "tried using: 'pyarrow', 'fastparquet'.\n"
        "pyarrow or fastparquet is required for parquet "
        "support"
    )
E   ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
E   pyarrow or fastparquet is required for parquet support
{code}

It never happens on the Apache AppVeyor account, for some unknown reason.
[jira] [Created] (ARROW-9382) Add boolean to valid keys of groupBy
Jorge created ARROW-9382:

Summary: Add boolean to valid keys of groupBy
Key: ARROW-9382
URL: https://issues.apache.org/jira/browse/ARROW-9382
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Jorge

Currently we do not support boolean columns on groupBy.
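To make the request concrete, this is what grouping on a boolean key column means, shown as a small pure-Python sketch. It only illustrates the semantics and is not DataFusion code; the function name `group_by_sum` is hypothetical, and treating null keys as their own group is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_sum(keys, values):
    """Sum `values` per distinct key; keys may be True, False or None."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


is_active = [True, False, True, None, False]
amount = [10, 20, 30, 40, 50]
print(group_by_sum(is_active, amount))  # {True: 40, False: 70, None: 40}
```

A boolean key has at most three groups (true, false, null), so it is among the simplest key types to support.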
[jira] [Created] (ARROW-9383) [Python] Support fsspec filesystems in Dataset API through fs handler
Joris Van den Bossche created ARROW-9383:

Summary: [Python] Support fsspec filesystems in Dataset API through fs handler
Key: ARROW-9383
URL: https://issues.apache.org/jira/browse/ARROW-9383
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
Fix For: 1.0.0