[jira] [Created] (ARROW-9382) Add boolean to valid keys of groupBy

2020-07-08 Thread Jorge (Jira)
Jorge created ARROW-9382:


 Summary: Add boolean to valid keys of groupBy
 Key: ARROW-9382
 URL: https://issues.apache.org/jira/browse/ARROW-9382
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Currently, boolean columns are not supported as groupBy keys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9381) [Python] test_dataset_schema_metadata fails on AppVeyor fork

2020-07-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9381:
-

 Summary: [Python] test_dataset_schema_metadata fails on AppVeyor fork
 Key: ARROW-9381
 URL: https://issues.apache.org/jira/browse/ARROW-9381
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou


I have this consistent error on all builds on my AppVeyor account:
https://ci.appveyor.com/project/pitrou/arrow/builds/33985399/job/mxb95s5u6f0aoaxj#L1756

{code}
        raise ImportError(
>           "Unable to find a usable engine; "
            "tried using: 'pyarrow', 'fastparquet'.\n"
            "pyarrow or fastparquet is required for parquet "
            "support"
        )
E       ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
E       pyarrow or fastparquet is required for parquet support
{code}

It never happens on the Apache AppVeyor account, for some unknown reason.





[jira] [Created] (ARROW-9380) [C++] Segfaults in compute::CallFunction

2020-07-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9380:
--

 Summary: [C++] Segfaults in compute::CallFunction
 Key: ARROW-9380
 URL: https://issues.apache.org/jira/browse/ARROW-9380
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


I triggered these from R, so the reproducers are in R.

1. Calling "filter" with no args segfaults.

{code:r}
arrow:::compute__CallFunction("filter", list(), list(keep_na = FALSE))
{code}

Top of the backtrace from lldb:

{code}
  * frame #0: 0x000109e1c2c7 libarrow.100.dylib`arrow::Datum::type() const + 7
    frame #1: 0x00010a14a232 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 66
    frame #2: 0x000109fc32c9 libarrow.100.dylib`arrow::compute::MetaFunction::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 41
    frame #3: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
    frame #4: 0x000109fb3c47 libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 599
{code}

This is not the case with at least some other functions. If I try to call "sum" 
with no args, I get {{Invalid: Function accepts 1 arguments but passed 0}} and 
no segfault.
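
The difference between the two behaviours can be illustrated with a toy dispatcher. The Python sketch below is not Arrow's implementation: `ToyFunction`, `REGISTRY`, `call_function` and `InvalidError` are hypothetical names. It only shows the general idea of validating the argument count before a function body reads its first argument, which is what the {{sum}} error message suggests happens on that path.

```python
class InvalidError(Exception):
    """Stands in for an 'Invalid' status in this toy example."""


class ToyFunction:
    def __init__(self, name, arity, impl):
        self.name = name
        self.arity = arity
        self.impl = impl

    def execute(self, args):
        # Check the argument count before the implementation touches args[0].
        if len(args) != self.arity:
            raise InvalidError(
                f"Function accepts {self.arity} arguments but passed {len(args)}"
            )
        return self.impl(*args)


# Hypothetical registry with two toy functions of different arity.
REGISTRY = {
    "sum": ToyFunction("sum", 1, lambda values: sum(values)),
    "filter": ToyFunction(
        "filter",
        2,
        lambda values, mask: [v for v, keep in zip(values, mask) if keep],
    ),
}


def call_function(name, args):
    return REGISTRY[name].execute(args)
```

With a check like this in the shared dispatch path, calling the toy "filter" with an empty argument list raises an error instead of reading a missing argument.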

2. Something is strange with is_null. It creates what appears to be a valid 
boolean array, but if I pass it to filter, it segfaults. I'm adding bindings 
for this in ARROW-9187 but this should run on current master:

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("is_null", a))
a$Filter(b)
{code}

Backtrace:

{code}
  * frame #0: 0x00010a120bb6 libarrow.100.dylib`arrow::compute::internal::GetFilterOutputSize(arrow::ArrayData const&, arrow::compute::FilterOptions::NullSelectionBehavior) + 38
    frame #1: 0x00010a125659 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::PrimitiveFilter(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*) + 121
    frame #2: 0x000109fbbea4 libarrow.100.dylib`arrow::compute::detail::VectorExecutor::ExecuteBatch(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) + 996
    frame #3: 0x000109fba3e6 libarrow.100.dylib`arrow::compute::detail::VectorExecutor::Execute(std::__1::vector > const&, arrow::compute::detail::ExecListener*) + 150
    frame #4: 0x000109fc0948 libarrow.100.dylib`arrow::compute::Function::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 1016
    frame #5: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
    frame #6: 0x00010a14a9b5 libarrow.100.dylib`arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 1989
    frame #7: 0x000109fc32c9 libarrow.100.dylib`arrow::compute::MetaFunction::Execute(std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const + 41
    frame #8: 0x000109fb3d3c libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 844
    frame #9: 0x000109fb3c47 libarrow.100.dylib`arrow::compute::CallFunction(std::__1::basic_string, std::__1::allocator > const&, std::__1::vector > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) + 599
{code}

However, if I call {{as.vector}} on {{b}} before using it as a filter, it
works, even though I discard the {{as.vector}} result and still use the
Array to filter.

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("is_null", a))
as.vector(b)
a$Filter(b)
{code}

Just printing (calling {{ToString}}) on {{b}} doesn't prevent the segfault. And 
I have not observed this with other boolean kernels. E.g. this does not 
segfault:

{code:r}
library(arrow)
a <- Array$create(1:4)
b <- arrow:::shared_ptr(Array, arrow:::call_function("greater", a, 
Scalar$create(3L)))
a$Filter(b)
{code}





[jira] [Created] (ARROW-9379) [Rust] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9379:
---

 Summary: [Rust] Support unsigned dictionary indices
 Key: ARROW-9379
 URL: https://issues.apache.org/jira/browse/ARROW-9379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Wes McKinney








[jira] [Created] (ARROW-9378) [Go] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9378:
---

 Summary: [Go] Support unsigned dictionary indices
 Key: ARROW-9378
 URL: https://issues.apache.org/jira/browse/ARROW-9378
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney








[jira] [Created] (ARROW-9377) [Java] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9377:
---

 Summary: [Java] Support unsigned dictionary indices
 Key: ARROW-9377
 URL: https://issues.apache.org/jira/browse/ARROW-9377
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Wes McKinney


child of ARROW-9259





[jira] [Created] (ARROW-9376) [Python]

2020-07-08 Thread Athanassios Hatzis (Jira)
Athanassios Hatzis created ARROW-9376:
-

 Summary: [Python]
 Key: ARROW-9376
 URL: https://issues.apache.org/jira/browse/ARROW-9376
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: Athanassios Hatzis


h3. First try
{code:python}
import pyarrow as pa

data = [pa.array([1, 2, 3, 4]), pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])]
batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
{code}
Hi, I use the PyCharm IDE for development, and I get the following inspection
message when I write the code above in the editor.

_Expected type 'RecordBatch', got 'List[Union[Union[ChunkedArray, Array], 
Any]]' instead_

_Inspection info: This inspection detects type errors in function call 
expressions. Due to dynamic dispatch and duck typing, this is possible in a 
limited but useful number of cases. Types of function parameters can be 
specified in docstrings or in Python 3 function annotations._
h3. Second try
{code:python}
batch = pa.RecordBatch.from_arrays(data, names=['f0', 'f1', 'f2'])
{code}
Then you get these inspection messages:

_Parameter 'list_arrays' unfilled_

_Passing list instead of pyarrow.lib.RecordBatch.RecordBatch. Is this 
intentional?_ 
h3. Third try
{code:python}
batch = pa.RecordBatch.from_arrays(list_arrays=data, names=['f0', 'f1', 'f2'])
{code}
Then you get an inspection message and a type error:


 _Parameter 'self' unfilled_ 
 _TypeError: from_arrays() takes at least 1 positional argument (0 given)_ 

 

Similar behaviour occurs with pa.Table.from_arrays.

 





[jira] [Created] (ARROW-9375) [FlightRPC][Integration] Add support for setting metadata version for integration tests

2020-07-08 Thread David Li (Jira)
David Li created ARROW-9375:
---

 Summary: [FlightRPC][Integration] Add support for setting metadata version for integration tests
 Key: ARROW-9375
 URL: https://issues.apache.org/jira/browse/ARROW-9375
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Integration
Reporter: David Li
Assignee: David Li








[jira] [Created] (ARROW-9374) [C++][Python] Expose MakeArrayFromScalar

2020-07-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9374:
--

 Summary: [C++][Python] Expose MakeArrayFromScalar
 Key: ARROW-9374
 URL: https://issues.apache.org/jira/browse/ARROW-9374
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Currently there is no efficient way to create a pyarrow array with identical 
values.





[GitHub] [arrow-testing] pitrou opened a new pull request #36: ARROW-9373: Add Parquet fuzz regression file

2020-07-08 Thread GitBox


pitrou opened a new pull request #36:
URL: https://github.com/apache/arrow-testing/pull/36


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-testing] pitrou merged pull request #36: ARROW-9373: Add Parquet fuzz regression file

2020-07-08 Thread GitBox


pitrou merged pull request #36:
URL: https://github.com/apache/arrow-testing/pull/36


   







[jira] [Created] (ARROW-9373) [C++] Fix Parquet crash on invalid input (OSS-Fuzz)

2020-07-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9373:
-

 Summary: [C++] Fix Parquet crash on invalid input (OSS-Fuzz)
 Key: ARROW-9373
 URL: https://issues.apache.org/jira/browse/ARROW-9373
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0








[jira] [Created] (ARROW-9372) [Dev][Archery] conda-python Docker image fails running

2020-07-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9372:
-

 Summary: [Dev][Archery] conda-python Docker image fails running
 Key: ARROW-9372
 URL: https://issues.apache.org/jira/browse/ARROW-9372
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery, Developer Tools
Reporter: Antoine Pitrou


I tried this:
{code}
archery docker run  -e PYTHON=3.6 conda-python
{code}

And after the Docker image was built, running it failed with:
{code}
+ pushd /arrow/python
/arrow/python /
++ realpath --relative-to=. /build/python
+ relative_build_dir=../../build/python
+ 3.6 setup.py build --build-base /build/python install 
--single-version-externally-managed --record ../../build/python/record.txt
/arrow/ci/scripts/python_build.sh: line 50: 3.6: command not found
{code}

Yet the comments in the {{docker-compose.yml}} say:
{code}
  conda-python:
# Usage:
#   docker-compose build conda-cpp
#   docker-compose build conda-python
#   docker-compose run --rm conda-python
# Parameters:
#   ARCH: amd64, arm32v7
#   PYTHON: 3.6, 3.7, 3.8
{code}
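
The trace is consistent with the build script using the {{PYTHON}} environment variable as the interpreter command, while the compose file documents the same variable as a version number. The Python sketch below only simulates that collision. The function names, the `python` fallback and the `PYTHON_EXECUTABLE` variable are assumptions for illustration, not the contents of {{python_build.sh}}.

```python
def build_command(env):
    # Assumed behaviour: treat $PYTHON as the interpreter, defaulting to "python".
    interpreter = env.get("PYTHON", "python")
    return [interpreter, "setup.py", "build"]


def build_command_version_aware(env):
    # One possible fix (hypothetical): never use the version value as the
    # interpreter; read the executable from a separate variable instead.
    interpreter = env.get("PYTHON_EXECUTABLE", "python")
    return [interpreter, "setup.py", "build"]


# PYTHON=3.6 is meant as a version, but ends up in command position.
broken = " ".join(build_command({"PYTHON": "3.6"}))
fixed = " ".join(build_command_version_aware({"PYTHON": "3.6"}))
print(broken)
print(fixed)
```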






[jira] [Created] (ARROW-9371) [Java] Run vector tests for both allocators

2020-07-08 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9371:
--

 Summary: [Java] Run vector tests for both allocators
 Key: ARROW-9371
 URL: https://issues.apache.org/jira/browse/ARROW-9371
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


As per https://github.com/apache/arrow/pull/7619#discussion_r451140735, the
vector tests should be run for both the Netty and unsafe allocators.





[jira] [Created] (ARROW-9370) [Java] Bump Netty version

2020-07-08 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9370:
--

 Summary: [Java] Bump Netty version
 Key: ARROW-9370
 URL: https://issues.apache.org/jira/browse/ARROW-9370
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


As per https://github.com/apache/arrow/pull/7619#issuecomment-655246147, there
is a security vulnerability in the current version of Netty. This issue will
upgrade Netty to the latest version.





[jira] [Created] (ARROW-9369) Can't convert dictionary type using table.from_pandas

2020-07-08 Thread Tomas Remes (Jira)
Tomas Remes created ARROW-9369:
--

 Summary: Can't convert dictionary type using table.from_pandas
 Key: ARROW-9369
 URL: https://issues.apache.org/jira/browse/ARROW-9369
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.17.1
Reporter: Tomas Remes


Hello, I am trying to do the following (please correct me if I am doing
something nonsensical):
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fields = [pa.field("object", pa.dictionary(pa.int64(), pa.string()))]
data = {"object": {
    "a": "a",
    "b": "b",
    "c": "c",
    "s": "d"}}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df, pa.schema(fields))
pq.write_table(table, "test.parquet") 
{code}
and I am getting:
{noformat}
Traceback (most recent call last):
  File "pa_test.py", line 17, in <module>
    table = pa.Table.from_pandas(df, pa.schema(fields))
  File "pyarrow/table.pxi", line 1451, in pyarrow.lib.Table.from_pandas
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 575, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 566, in convert_column
    raise e
  File "/home/tremes/GITHUB/data-pipeline/venv/lib64/python3.7/site-packages/pyarrow/pandas_compat.py", line 560, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('Sequence converter for type dictionary not implemented', 'Conversion failed for column object with type object')
{noformat}
A workaround is to use {{df.to_parquet("test.parquet")}}.





[jira] [Created] (ARROW-9368) [Python] Rename predicate argument to filter in split_by_row_group()

2020-07-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9368:


 Summary: [Python] Rename predicate argument to filter in split_by_row_group()
 Key: ARROW-9368
 URL: https://issues.apache.org/jira/browse/ARROW-9368
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 1.0.0


For consistency with to_table() and get_fragments()


