[jira] [Created] (ARROW-9310) Use feature enum in java

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9310:
--

 Summary: Use feature enum in java
 Key: ARROW-9310
 URL: https://issues.apache.org/jira/browse/ARROW-9310
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9311) [Javascript] Use feature enum in javascript

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9311:
--

 Summary: [Javascript] Use feature enum in javascript
 Key: ARROW-9311
 URL: https://issues.apache.org/jira/browse/ARROW-9311
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield








[jira] [Created] (ARROW-9315) [Java] Fix the failure of testAllocationManagerType

2020-07-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-9315:
---

 Summary: [Java] Fix the failure of testAllocationManagerType
 Key: ARROW-9315
 URL: https://issues.apache.org/jira/browse/ARROW-9315
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


It appears sometimes in the CI build. 





[jira] [Created] (ARROW-9309) Start writing out feature enums to value (umbrella issue)

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9309:
--

 Summary: Start writing out feature enums to value (umbrella issue)
 Key: ARROW-9309
 URL: https://issues.apache.org/jira/browse/ARROW-9309
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield


Proposed logic:

1.  Where appropriate, add the flag for dictionary replacement support if there 
is a possibility it can be used.

2.  Only add compressed buffers when requested.
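
A minimal sketch of the two rules above, assuming the enum has members for dictionary replacement and compressed buffers. The member names, their values, and the `features_to_declare` helper are illustrative assumptions, not the final schema or any library API.

```python
from enum import IntEnum


class Feature(IntEnum):
    # Assumed member names and values, for illustration only.
    UNUSED = 0
    DICTIONARY_REPLACEMENT = 1
    COMPRESSED_BODY = 2


def features_to_declare(may_replace_dictionaries, compression_requested):
    """Return the features a writer would record for a stream (hypothetical helper)."""
    features = []
    # Rule 1: declare dictionary replacement whenever it could possibly be used.
    if may_replace_dictionaries:
        features.append(Feature.DICTIONARY_REPLACEMENT)
    # Rule 2: declare compressed buffers only when the caller asked for them.
    if compression_requested:
        features.append(Feature.COMPRESSED_BODY)
    return features
```

A writer that may replace dictionaries but was not asked to compress would then declare only the dictionary replacement feature.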





[jira] [Created] (ARROW-9314) [Go] Use Feature enum

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9314:
--

 Summary: [Go] Use Feature enum
 Key: ARROW-9314
 URL: https://issues.apache.org/jira/browse/ARROW-9314
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Micah Kornfield








[jira] [Created] (ARROW-9313) [Rust] Use feature enum

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9313:
--

 Summary: [Rust] Use feature enum
 Key: ARROW-9313
 URL: https://issues.apache.org/jira/browse/ARROW-9313
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Micah Kornfield








[jira] [Created] (ARROW-9312) [C++] Use feature enum

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9312:
--

 Summary: [C++] Use feature enum
 Key: ARROW-9312
 URL: https://issues.apache.org/jira/browse/ARROW-9312
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-9308) Add Feature enum to schema.fbs for forward compatibility

2020-07-02 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9308:
--

 Summary: Add Feature enum to schema.fbs for forward compatibility
 Key: ARROW-9308
 URL: https://issues.apache.org/jira/browse/ARROW-9308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Micah Kornfield
Assignee: Micah Kornfield








[jira] [Created] (ARROW-9307) [Ruby] Add Arrow::RecordBatchIterator#to_a

2020-07-02 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9307:
---

 Summary: [Ruby] Add Arrow::RecordBatchIterator#to_a
 Key: ARROW-9307
 URL: https://issues.apache.org/jira/browse/ARROW-9307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-9306) [Ruby] Add support for Arrow::RecordBatch.new(raw_table)

2020-07-02 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9306:
---

 Summary: [Ruby] Add support for Arrow::RecordBatch.new(raw_table)
 Key: ARROW-9306
 URL: https://issues.apache.org/jira/browse/ARROW-9306
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-9305) [Python] Dependency load failure in Windows wheel build

2020-07-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9305:
---

 Summary: [Python] Dependency load failure in Windows wheel build
 Key: ARROW-9305
 URL: https://issues.apache.org/jira/browse/ARROW-9305
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


The Windows wheels are experiencing a DLL load failure, probably due to one of 
the dependencies.





[jira] [Created] (ARROW-9304) [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull

2020-07-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9304:
---

 Summary: [C++] Add "AppendEmptyValue" builder APIs for use inside 
StructBuilder::AppendNull
 Key: ARROW-9304
 URL: https://issues.apache.org/jira/browse/ARROW-9304
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


StructBuilder should probably also add "UnsafeAppendNull" so that there is the 
option of using the Unsafe* operations on the children.





[jira] [Created] (ARROW-9303) Can't install R arrow on CentOS 7.6.1810

2020-07-02 Thread Nathan TeBlunthuis (Jira)
Nathan TeBlunthuis created ARROW-9303:
-

 Summary: Can't install R arrow on CentOS 7.6.1810
 Key: ARROW-9303
 URL: https://issues.apache.org/jira/browse/ARROW-9303
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.17.1
 Environment: CentOS 7.6.1810
R 4.0.2
Reporter: Nathan TeBlunthuis


I'm following the instructions here: https://arrow.apache.org/install/

arrow::install_arrow()

gives error:

{{./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or 
directory}}

This leaves me without a working {{arrow::read_feather}}.





[jira] [Created] (ARROW-9302) Specifying columns in a dataset drops the index (pandas) metadata.

2020-07-02 Thread Troy Zimmerman (Jira)
Troy Zimmerman created ARROW-9302:
-

 Summary: Specifying columns in a dataset drops the index (pandas) 
metadata.
 Key: ARROW-9302
 URL: https://issues.apache.org/jira/browse/ARROW-9302
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Troy Zimmerman


I'm not sure if this is a missing feature, or just undocumented, or perhaps not 
even something I should expect to work.

Let's start with a multi-index dataframe.

{code}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00

>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
{code}

This of course works as expected, so let's write the table to disk, and read it 
with a {{dataset}}.

{code}
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
{code}

The filter also works as expected, and the dataframe is reconstructed properly. 
Let's do it again, but this time with a column selection.

{code}
>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
{code}

Hmm, not quite what I was thinking, but excluding the indices from the columns 
seems like a dumb move on my part, so let's try again, and this time include 
all columns to be safe.

{code}
>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
{code}

It seems that when I specify any or all columns, the schema metadata is lost 
along the way, so {{to_pandas}} doesn't reconstruct the dataframe to match the 
original.
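
To illustrate why the lost metadata matters, here is a small standard-library sketch. It assumes the schema metadata is a mapping with a `pandas` key holding JSON that lists `index_columns`, as the truncated schema output earlier suggests. The `index_columns` helper and the sample dictionary are hypothetical and are not part of pyarrow.

```python
import json


def index_columns(schema_metadata):
    """Return the index column names recorded under the b"pandas" key.

    An empty list means nothing is recorded, in which case a reader has
    no way to know which columns formed the original index.
    """
    if not schema_metadata or b"pandas" not in schema_metadata:
        return []
    return json.loads(schema_metadata[b"pandas"]).get("index_columns", [])


# Hypothetical metadata mirroring the start of the schema output shown earlier.
with_metadata = {
    b"pandas": json.dumps({"index_columns": ["letter", "number"]}).encode()
}

print(index_columns(with_metadata))  # metadata present
print(index_columns(None))           # metadata dropped by the column selection
```

If the metadata is the only thing missing, re-attaching the dataset's schema metadata with `Table.replace_schema_metadata` before calling `to_pandas` might serve as a workaround, though that is untested here.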

Here are my relevant versions:

- arrow-cpp: 0.17.1
- pyarrow: 0.17.1
- parquet-cpp: 1.5.1
- python: 3.7.6
- thrift-cpp: 0.13.0





[jira] [Created] (ARROW-9301) [R] Cannot open parquet files with binary arrays

2020-07-02 Thread Steve Jacobs (Jira)
Steve Jacobs created ARROW-9301:
---

 Summary: [R] Cannot open parquet files with binary arrays
 Key: ARROW-9301
 URL: https://issues.apache.org/jira/browse/ARROW-9301
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 1.0.0
 Environment: apache arrow 0.17.1
Reporter: Steve Jacobs


When trying to open a parquet file with a binary column the following error is 
returned:

{code}
Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
  cannot handle Array of type binary
{code}





[jira] [Created] (ARROW-9300) [Java] Separate Netty Memory to its own module

2020-07-02 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9300:
--

 Summary: [Java] Separate Netty Memory to its own module
 Key: ARROW-9300
 URL: https://issues.apache.org/jira/browse/ARROW-9300
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


Finish the work started in ARROW-8230





[jira] [Created] (ARROW-9299) Expose ORC metadata() in Python ORCFile

2020-07-02 Thread Jeremy Dyer (Jira)
Jeremy Dyer created ARROW-9299:
--

 Summary: Expose ORC metadata() in Python ORCFile
 Key: ARROW-9299
 URL: https://issues.apache.org/jira/browse/ARROW-9299
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 0.17.1
Reporter: Jeremy Dyer


There is currently no way for a user to directly access the underlying ORC 
metadata of a given file. It seems the C++ functions and objects already 
exist, and only the plumbing to Cython/Python, plus potentially a few C++ 
shims, is missing. Giving users the ability to retrieve the metadata 
without first reading the entire file could help numerous applications 
increase their query performance by allowing them to intelligently determine 
which ORC stripes should be read.

This would allow for something like 
{code:python}
import pyarrow.orc as orc
orc_metadata = orc.ORCFile(filename).metadata()
{code}
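
As a rough sketch of the stripe selection this could enable, assume each stripe exposes min/max statistics for a column. `StripeStats` and `stripes_to_read` are hypothetical names used only for illustration; they are not existing pyarrow or ORC APIs.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    # Hypothetical per-stripe summary for one integer column.
    index: int
    min_value: int
    max_value: int


def stripes_to_read(stripes, lower, upper):
    """Return indices of stripes whose [min, max] range overlaps [lower, upper]."""
    return [
        s.index
        for s in stripes
        if s.max_value >= lower and s.min_value <= upper
    ]


# Three made-up stripes covering consecutive value ranges.
stripes = [StripeStats(0, 0, 99), StripeStats(1, 100, 199), StripeStats(2, 200, 299)]
print(stripes_to_read(stripes, 150, 250))
```

A query for values between 150 and 250 would then skip the first stripe without reading its data.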





[GitHub] [arrow-testing] pitrou merged pull request #33: Add IPC fuzz regression files

2020-07-02 Thread GitBox


pitrou merged pull request #33:
URL: https://github.com/apache/arrow-testing/pull/33


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-testing] pitrou opened a new pull request #33: Add IPC fuzz regression files

2020-07-02 Thread GitBox


pitrou opened a new pull request #33:
URL: https://github.com/apache/arrow-testing/pull/33


   







[GitHub] [arrow-testing] pitrou merged pull request #32: Add IPC fuzz regression files

2020-07-02 Thread GitBox


pitrou merged pull request #32:
URL: https://github.com/apache/arrow-testing/pull/32


   







[GitHub] [arrow-testing] pitrou opened a new pull request #32: Add IPC fuzz regression files

2020-07-02 Thread GitBox


pitrou opened a new pull request #32:
URL: https://github.com/apache/arrow-testing/pull/32


   







[jira] [Created] (ARROW-9298) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-07-02 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9298:
-

 Summary: [C++] Fix crashes on invalid input (OSS-Fuzz)
 Key: ARROW-9298
 URL: https://issues.apache.org/jira/browse/ARROW-9298
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0








[jira] [Created] (ARROW-9297) [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)

2020-07-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9297:


 Summary: [C++][Dataset] Dataset scanner cannot handle large binary 
column (> 2 GB)
 Key: ARROW-9297
 URL: https://issues.apache.org/jira/browse/ARROW-9297
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Related to ARROW-3762 (the parquet issue which has been solved), and discovered 
in ARROW-9139.

When creating a Parquet file with a large binary column (larger than 
BinaryArray capacity):

{code}
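# code from the test_parquet.py::test_binary_array_overflow_to_chunked test
import pyarrow as pa
import pyarrow.parquet as pq

# one 1-byte value followed by 2 * 1024 values of 1 MiB each
values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

table = pa.table({'byte_col': values})
pq.write_table(table, "test_large_binary.parquet")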
{code}
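
The size arithmetic behind this reproduction can be sketched with the standard library alone. The sketch assumes the limit comes from the 32-bit offsets of a binary array, so a single array can hold at most 2**31 - 1 bytes of values. `split_into_chunks` is a hypothetical helper showing how the data would have to be divided; it is not how the Parquet reader is implemented.

```python
INT32_MAX = 2**31 - 1  # largest byte offset a 32-bit offsets buffer can address


def split_into_chunks(lengths, limit=INT32_MAX):
    """Group consecutive value lengths so each chunk's byte total fits the limit."""
    chunks, current, current_bytes = [], [], 0
    for n in lengths:
        if n > limit:
            raise ValueError("a single value exceeds the chunk limit")
        # Start a new chunk when adding this value would overflow the limit.
        if current and current_bytes + n > limit:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(n)
        current_bytes += n
    if current:
        chunks.append(current)
    return chunks


# Lengths of the values built above: one 1-byte value plus 2048 values of 1 MiB.
lengths = [1] + [1 << 20] * (2 * (1 << 10))
total = sum(lengths)
chunks = split_into_chunks(lengths)

print(total > INT32_MAX)         # the column does not fit in one array
print([len(c) for c in chunks])  # number of values per chunk
```

Under that assumption the column totals 2**31 + 1 bytes, two bytes over the limit, so it has to be represented as more than one chunk.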

then reading this with the parquet API works (fixed by ARROW-3762):

{code}
In [3]: pq.read_table("test_large_binary.parquet")
Out[3]: 
pyarrow.Table
byte_col: binary
{code}

but with the Datasets API this still fails:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")

In [4]: dataset.to_table()
---
ArrowNotImplementedError  Traceback (most recent call last)
 in 
----> 1 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: This class cannot yet iterate chunked arrays

{code}





[jira] [Created] (ARROW-9296) [CI][Rust] Enable more clippy lint checks

2020-07-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9296:
--

 Summary: [CI][Rust] Enable more clippy lint checks
 Key: ARROW-9296
 URL: https://issues.apache.org/jira/browse/ARROW-9296
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Rust
Reporter: Krisztian Szucs


Currently only {{clippy::redundant_field_names}} is allowed, so we should 
incrementally extend the list of enabled lints.





[jira] [Created] (ARROW-9295) [Archery] Support rust clippy in the lint command

2020-07-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9295:
--

 Summary: [Archery] Support rust clippy in the lint command
 Key: ARROW-9295
 URL: https://issues.apache.org/jira/browse/ARROW-9295
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery
Reporter: Krisztian Szucs
 Fix For: 2.0.0


https://github.com/apache/arrow/pull/7501 introduces clippy support which we 
should move to the main linting job.


