[jira] [Assigned] (ARROW-6887) [Java] Create prose documentation for using ValueVectors

2019-10-14 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-6887:
-

Assignee: Ji Liu

> [Java] Create prose documentation for using ValueVectors
> 
>
> Key: ARROW-6887
> URL: https://issues.apache.org/jira/browse/ARROW-6887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>
> We should create documentation for the library that demonstrates:
> 1.  Basic construction of ValueVectors.  Highlighting:
>     * ValueVector lifecycle
>     * Reading by rows using Readers (mentioning that it is not as efficient 
> as direct access).
>     * Populating with Writers
> 2.  Reading and writing IPC stream format and file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6452) [Java] Override ValueVector toString() method

2019-10-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6452.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5271
[https://github.com/apache/arrow/pull/5271]

> [Java] Override ValueVector toString() method
> -
>
> Key: ARROW-6452
> URL: https://issues.apache.org/jira/browse/ARROW-6452
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently the C++ {{Array#ToString}} method returns a human-readable 
> formatted string like:
> [
>   1,
>   2,
>   3
> ]
> But the Java {{ValueVector}} classes do not currently implement anything like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6184) [Java] Provide hash table based dictionary encoder

2019-10-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6184.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5058
[https://github.com/apache/arrow/pull/5058]

> [Java] Provide hash table based dictionary encoder
> --
>
> Key: ARROW-6184
> URL: https://issues.apache.org/jira/browse/ARROW-6184
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> This is the second part of ARROW-5917. We provide a sort-based encoder, as 
> well as a hash-table-based encoder, to solve the problems with the current 
> dictionary encoder.
> In particular, we solve the following problems with the current encoder:
>  # There are repeated conversions between Java objects and bytes (e.g. 
> vector.getObject(i)).
>  # Unnecessary memory copies (the vector data must be copied to the hash table).
>  # The hash table cannot be reused for encoding multiple vectors (other data 
> structures & results cannot be reused either).
>  # The output vector should not be created/managed by the encoder (just like 
> in the out-of-place sorter).
>  # The hash table requires that the hashCode & equals methods be implemented 
> appropriately, but this is not guaranteed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2019-10-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951627#comment-16951627
 ] 

Micah Kornfield commented on ARROW-6799:


As part of this, we should make sure we have this component running in CI.

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2892) [Plasma] Implement interface to get Java arrow objects from Plasma

2019-10-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-2892:
---
Component/s: Java

> [Plasma] Implement interface to get Java arrow objects from Plasma
> --
>
> Key: ARROW-2892
> URL: https://issues.apache.org/jira/browse/ARROW-2892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma, Java
>Reporter: Philipp Moritz
>Priority: Major
>
> Currently we have a low-level interface to access bytes stored in Plasma from 
> Java, using JNI: [https://github.com/apache/arrow/pull/2065/]
>  
> As a follow-up, we should implement reading (and writing) Java Arrow objects 
> from Plasma, if possible using zero-copy.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-2892) [Plasma] Implement interface to get Java arrow objects from Plasma

2019-10-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reopened ARROW-2892:


> [Plasma] Implement interface to get Java arrow objects from Plasma
> --
>
> Key: ARROW-2892
> URL: https://issues.apache.org/jira/browse/ARROW-2892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Priority: Major
>
> Currently we have a low-level interface to access bytes stored in Plasma from 
> Java, using JNI: [https://github.com/apache/arrow/pull/2065/]
>  
> As a follow-up, we should implement reading (and writing) Java Arrow objects 
> from Plasma, if possible using zero-copy.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6887) [Java] Create prose documentation for using ValueVectors

2019-10-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6887:
--

 Summary: [Java] Create prose documentation for using ValueVectors
 Key: ARROW-6887
 URL: https://issues.apache.org/jira/browse/ARROW-6887
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Java
Reporter: Micah Kornfield


We should create documentation for the library that demonstrates:

1.  Basic construction of ValueVectors.  Highlighting:

    * ValueVector lifecycle

    * Reading by rows using Readers (mentioning that it is not as efficient as 
direct access).

    * Populating with Writers

2.  Reading and writing IPC stream format and file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6877.
-
Fix Version/s: (was: 0.15.1)
   Resolution: Fixed

Issue resolved by pull request 5654
[https://github.com/apache/arrow/pull/5654]

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> My local dev build started to fail, due to cmake finding the wrong Boost (it 
> found {{-- Found Boost 1.70.0 at 
> /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) while building in a different 
> conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6882:
--
Fix Version/s: 0.15.1

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6882.
---
Fix Version/s: (was: 0.15.1)
   1.0.0
   Resolution: Fixed

Issue resolved by pull request 5656
[https://github.com/apache/arrow/pull/5656]

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6283) [Rust] [DataFusion] Implement operator to write query results to partitioned CSV

2019-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6283.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5640
[https://github.com/apache/arrow/pull/5640]

> [Rust] [DataFusion] Implement operator to write query results to partitioned 
> CSV
> 
>
> Key: ARROW-6283
> URL: https://issues.apache.org/jira/browse/ARROW-6283
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-4219) [Rust] [Parquet] Implement ArrowReader

2019-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-4219.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5523
[https://github.com/apache/arrow/pull/5523]

> [Rust] [Parquet] Implement ArrowReader
> --
>
> Key: ARROW-4219
> URL: https://issues.apache.org/jira/browse/ARROW-4219
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> ArrowReader reads Parquet into Arrow. In this ticket our goal is to 
> implement get_schema and to read row groups into a record batch iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6884) [Python][Flight] Make server-side RPC exceptions more friendly?

2019-10-14 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951421#comment-16951421
 ] 

David Li commented on ARROW-6884:
-

Ah, yeah, that would be a good improvement. (Especially the gRPC bits 
could/should probably go away.)

> [Python][Flight] Make server-side RPC exceptions more friendly?
> ---
>
> Key: ARROW-6884
> URL: https://issues.apache.org/jira/browse/ARROW-6884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Here is what an error looks like when a client RPC fails on the server:
> {code}
> E   pyarrow.lib.ArrowException: Unknown error: gRPC returned unknown error, 
> with message: a bytes-like object is required, not 'str'
> E   In ../src/arrow/python/flight.cc, line 201, code: CheckPyError(). Detail: 
> Python exception: TypeError
> {code}
> The "line 201, code:" business is added by -DARROW_EXTRA_ERROR_CONTEXT=ON so 
> the normal use won't see this
> It might be nice to re-raise the same exception type in the client with some 
> extra context added to make clear that it is a server-side error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6884) [Python][Flight] Make server-side RPC exceptions more friendly?

2019-10-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951414#comment-16951414
 ] 

Wes McKinney commented on ARROW-6884:
-

It might be as simple as showing

{code}
ServerException("TypeError: a bytes-like object is required, not 'str'")
{code}

It looks like all of that information is already there; it's just a matter of 
formatting it in a more friendly way. 

> [Python][Flight] Make server-side RPC exceptions more friendly?
> ---
>
> Key: ARROW-6884
> URL: https://issues.apache.org/jira/browse/ARROW-6884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Here is what an error looks like when a client RPC fails on the server:
> {code}
> E   pyarrow.lib.ArrowException: Unknown error: gRPC returned unknown error, 
> with message: a bytes-like object is required, not 'str'
> E   In ../src/arrow/python/flight.cc, line 201, code: CheckPyError(). Detail: 
> Python exception: TypeError
> {code}
> The "line 201, code:" business is added by -DARROW_EXTRA_ERROR_CONTEXT=ON so 
> the normal use won't see this
> It might be nice to re-raise the same exception type in the client with some 
> extra context added to make clear that it is a server-side error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6844) [C++][Parquet][Python] List columns read broken with 0.15.0

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6844:
--
Labels: parquet pull-request-available  (was: parquet)

> [C++][Parquet][Python] List columns read broken with 0.15.0
> 
>
> Key: ARROW-6844
> URL: https://issues.apache.org/jira/browse/ARROW-6844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Benoit Rostykus
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: dbg_sample.gz.parquet, dbg_sample2.gz.parquet
>
>
> Columns of array type (such as `array<int64>`, 
> `array<string>`...) are not readable anymore using {{pyarrow == 0.15.0}} (but 
> were with {{pyarrow == 0.14.1}}) when the original writer of the parquet file 
> is {{parquet-mr 1.9.1}}.
> {code}
> import pyarrow.parquet as pq
> pf = pq.ParquetFile('sample.gz.parquet')
> print(pf.read(columns=['profile_ids']))
> {code}
> with 0.14.1:
> {code}
> pyarrow.Table
> profile_ids: list<element: int64>
>  child 0, element: int64
> ...
> {code}
> with 0.15.0:
> {code}
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1131, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int64> 
> is inconsistent with schema list<element: int64>
> {code}
> I've tested parquet files coming from multiple tables (with various schemas) 
> created with `parquet-mr`, and couldn't read any such `array` column 
> anymore.
>  
> I _think_ the bug was introduced with [this 
> commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5].
> I think the root of the issue comes from the fact that `parquet-mr` writes 
> the inner struct name as `"element"` by default (see 
> [here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]),
>  whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example 
> [this 
> test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]).
>  The round-tripping tests, which write/read in pyarrow only, obviously won't 
> catch this.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6884) [Python][Flight] Make server-side RPC exceptions more friendly?

2019-10-14 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951408#comment-16951408
 ] 

David Li commented on ARROW-6884:
-

I'm a little wary of automatically mirroring server-side exceptions, as that 
leaks implementation details into the client (also, you don't want an exception 
with a sensitive repr getting propagated by accident, though I guess that ship 
has sailed). But we could do more on the server (I think we don't log 
exceptions or anything, so by default there is nothing on the server side to 
tell you what happened).

> [Python][Flight] Make server-side RPC exceptions more friendly?
> ---
>
> Key: ARROW-6884
> URL: https://issues.apache.org/jira/browse/ARROW-6884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Here is what an error looks like when a client RPC fails on the server:
> {code}
> E   pyarrow.lib.ArrowException: Unknown error: gRPC returned unknown error, 
> with message: a bytes-like object is required, not 'str'
> E   In ../src/arrow/python/flight.cc, line 201, code: CheckPyError(). Detail: 
> Python exception: TypeError
> {code}
> The "line 201, code:" business is added by -DARROW_EXTRA_ERROR_CONTEXT=ON so 
> the normal use won't see this
> It might be nice to re-raise the same exception type in the client with some 
> extra context added to make clear that it is a server-side error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6886) [C++] arrow::io header nvcc compiler warnings

2019-10-14 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-6886:
--

 Summary: [C++] arrow::io header nvcc compiler warnings
 Key: ARROW-6886
 URL: https://issues.apache.org/jira/browse/ARROW-6886
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.15.0
Reporter: Paul Taylor


Seeing the following compiler warnings when statically linking the arrow::io 
headers with nvcc:

{noformat}
arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"

arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"

arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6885) [Python] Remove superfluous skipped timedelta test

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6885:
--
Labels: pull-request-available  (was: )

> [Python] Remove superfluous skipped timedelta test
> --
>
> Key: ARROW-6885
> URL: https://issues.apache.org/jira/browse/ARROW-6885
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Now that we support timedelta / duration type, there is an old xfailed test 
> that can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6885) [Python] Remove superfluous skipped timedelta test

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6885:


 Summary: [Python] Remove superfluous skipped timedelta test
 Key: ARROW-6885
 URL: https://issues.apache.org/jira/browse/ARROW-6885
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Now that we support timedelta / duration type, there is an old xfailed test 
that can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6882:
--
Labels: pull-request-available  (was: )

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-6882:


Assignee: Joris Van den Bossche

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6837:
--
Labels: pull-request-available  (was: )

> [C++/Python] access File Footer custom_metadata
> ---
>
> Key: ARROW-6837
> URL: https://issues.apache.org/jira/browse/ARROW-6837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>
> Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6884) [Python][Flight] Make server-side RPC exceptions more friendly?

2019-10-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6884:
---

 Summary: [Python][Flight] Make server-side RPC exceptions more 
friendly?
 Key: ARROW-6884
 URL: https://issues.apache.org/jira/browse/ARROW-6884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Here is what an error looks like when a client RPC fails on the server:

{code}
E   pyarrow.lib.ArrowException: Unknown error: gRPC returned unknown error, 
with message: a bytes-like object is required, not 'str'
E   In ../src/arrow/python/flight.cc, line 201, code: CheckPyError(). Detail: 
Python exception: TypeError
{code}

The "line 201, code:" business is added by -DARROW_EXTRA_ERROR_CONTEXT=ON so 
the normal use won't see this

It might be nice to re-raise the same exception type in the client with some 
extra context added to make clear that it is a server-side error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6857) [Python][C++] Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6857.
-
Resolution: Fixed

Issue resolved by pull request 5650
[https://github.com/apache/arrow/pull/5650]

> [Python][C++] Segfault for dictionary_encode on empty chunked_array (edge 
> case)
> ---
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow=0.14, I could not reproduce. 
>  I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951364#comment-16951364
 ] 

Joris Van den Bossche edited comment on ARROW-6882 at 10/14/19 9:12 PM:


Although, it is only a regression because we now validate the resulting array 
automatically. On 0.14.1 manually validating the resulting array gives the same 
error.

And you get the same when actually validating the indices array:

{code}
In [23]: fca.indices.validate()   
...
ArrowInvalid: Unexpected dictionary values in array of type int32
{code}


was (Author: jorisvandenbossche):
Although, it is only a regression because we now validate the resulting array 
automatically. On 0.14.1 manually validating the resulting array gives the same 
error.

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6857) [Python][C++] Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6857:

Summary: [Python][C++] Segfault for dictionary_encode on empty 
chunked_array (edge case)  (was: Segfault for dictionary_encode on empty 
chunked_array (edge case))

> [Python][C++] Segfault for dictionary_encode on empty chunked_array (edge 
> case)
> ---
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow=0.14, I could not reproduce. 
>  I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951364#comment-16951364
 ] 

Joris Van den Bossche commented on ARROW-6882:
--

Although, it is only a regression because we now validate the resulting array 
automatically. On 0.14.1 manually validating the resulting array gives the same 
error.

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951362#comment-16951362
 ] 

Joris Van den Bossche commented on ARROW-6882:
--

Thanks for the report. Labeling as 0.15.1 for now, as it seems to be a 
regression compared to 0.14.

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([fca.indices.view(fca.indices.type)]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Artem KOZHEVNIKOV (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem KOZHEVNIKOV updated ARROW-6882:
-
Description: 
I've experienced a strange error being raised when trying to apply 
`pa.chunked_array` directly on the indices of a dictionary_encode result (code 
is below). Making a memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
   
fca = ca.dictionary_encode()
   
fca.indices 
   

[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices]) 
   
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
 in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's  OK
pa.chunked_array([fca.indices.view(fca.indices.type)]) 
Out[45]: 

[
  [
0,
0,
1,
1,
2
  ]
]
 {code}

  was:
I've experienced a strange error being raised when trying to apply 
`pa.chunked_array` directly on the indices of a dictionary_encode result (code 
is below). Making a memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
   
fca = ca.dictionary_encode()
   
fca.indices 
   

[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices]) 
   
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
 in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's  OK
pa.chunked_array([pa.Array.from_buffers(type=pa.int32(), 
length=len(fca.indices), buffers=fca.indices.buffers())]) 
Out[45]: 

[
  [
0,
0,
1,
1,
2
  ]
]
 {code}


> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with 

[jira] [Updated] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6882:
-
Fix Version/s: 0.15.1

> cannot create a chunked_array from dictionary_encoding result
> -
>
> Key: ARROW-6882
> URL: https://issues.apache.org/jira/browse/ARROW-6882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> I've experienced a strange error being raised when trying to apply 
> `pa.chunked_array` directly on the indices of a dictionary_encode result (code 
> is below). Making a memory view solves the problem.
> {code:python}
> import pyarrow as pa
> ca = pa.array(['a', 'a', 'b', 'b', 'c'])  
>  
> fca = ca.dictionary_encode()  
>  
> fca.indices   
>  
> 
> [
>   0,
>   0,
>   1,
>   1,
>   2
> ]
> pa.chunked_array([fca.indices])   
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.chunked_array([fca.indices])
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.chunked_array()
> ~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: Unexpected dictionary values in array of type int32
> # with another memory view it's  OK
> pa.chunked_array([pa.Array.from_buffers(type=pa.int32(), 
> length=len(fca.indices), buffers=fca.indices.buffers())]) 
> Out[45]: 
> 
> [
>   [
> 0,
> 0,
> 1,
> 1,
> 2
>   ]
> ]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6877:
--
Labels: pull-request-available  (was: )

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>
> My local dev build started to fail, due to cmake finding the wrong Boost (it 
> found {{-- Found Boost 1.70.0 at 
> /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) while building in a different 
> conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6877:
---

Assignee: Wes McKinney

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> My local dev build started to fail, due to cmake finding the wrong Boost (it 
> found {{-- Found Boost 1.70.0 at 
> /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) while building in a different 
> conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6883) [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in IPC stream writer class

2019-10-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6883:
---

 Summary: [C++] Support sending delta DictionaryBatch or 
replacement DictionaryBatch in IPC stream writer class
 Key: ARROW-6883
 URL: https://issues.apache.org/jira/browse/ARROW-6883
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I didn't see other JIRA issues about this, but this is one significant item 
remaining for complete columnar format coverage in the C++ library.

This functionality will flow through to the various bindings, so it would be 
helpful to add unit tests to assert that things work correctly, e.g. in Python 
from an end-user perspective.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-14 Thread John Muehlhausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951353#comment-16951353
 ] 

John Muehlhausen commented on ARROW-6837:
-

Initially proposed API:
{noformat}
static Status RecordBatchFileWriter::Open(io::OutputStream* sink,
const std::shared_ptr<Schema>& schema, 
std::shared_ptr<RecordBatchWriter>* out,
const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

static Result<std::shared_ptr<RecordBatchWriter>> RecordBatchFileWriter::Open(
io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

std::shared_ptr<const KeyValueMetadata> RecordBatchFileReader::metadata() const;
{noformat}

> [C++/Python] access File Footer custom_metadata
> ---
>
> Key: ARROW-6837
> URL: https://issues.apache.org/jira/browse/ARROW-6837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: John Muehlhausen
>Priority: Minor
>
> Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6417) [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have slowed down since 0.11.x

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6417.
---
Fix Version/s: 0.15.0
   Resolution: Fixed

This was fixed in 0.15.0 by the jemalloc toolchain work and other optimizations.

> [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have 
> slowed down since 0.11.x
> -
>
> Key: ARROW-6417
> URL: https://issues.apache.org/jira/browse/ARROW-6417
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: 20190903_parquet_benchmark.py, 
> 20190903_parquet_read_perf.png
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In doing some benchmarking, I have found that binary reads seem to be slower 
> on the master branch than in Arrow 0.11.1. It would be a good idea to do some 
> basic profiling to see where we might improve our memory allocation strategy 
> (or whatever the bottleneck turns out to be).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2019-10-14 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951347#comment-16951347
 ] 

Kyle McCarthy commented on ARROW-6666:
--

Does this require Rust's Arrow to implement a 
[StringType|https://arrow.apache.org/docs/cpp/api/datatype.html#classarrow_1_1_string_type]
 similar to the one in the C++ implementation?

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executes directly from the logical plan, 
> so it should largely be a copy-and-paste exercise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6882) cannot create a chunked_array from dictionary_encoding result

2019-10-14 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-6882:


 Summary: cannot create a chunked_array from dictionary_encoding 
result
 Key: ARROW-6882
 URL: https://issues.apache.org/jira/browse/ARROW-6882
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Artem KOZHEVNIKOV


I've experienced a strange error being raised when trying to apply 
`pa.chunked_array` directly on the indices of a dictionary_encode result (code 
is below). Making a memory view solves the problem.
{code:python}
import pyarrow as pa
ca = pa.array(['a', 'a', 'b', 'b', 'c'])
   
fca = ca.dictionary_encode()
   
fca.indices 
   

[
  0,
  0,
  1,
  1,
  2
]

pa.chunked_array([fca.indices]) 
   
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 pa.chunked_array([fca.indices])

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/table.pxi
 in pyarrow.lib.chunked_array()

~/Projects/miniconda3/envs/pyarrow/lib/python3.7/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()

ArrowInvalid: Unexpected dictionary values in array of type int32

# with another memory view it's  OK
pa.chunked_array([pa.Array.from_buffers(type=pa.int32(), 
length=len(fca.indices), buffers=fca.indices.buffers())]) 
Out[45]: 

[
  [
0,
0,
1,
1,
2
  ]
]
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) [Python] Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951339#comment-16951339
 ] 

Wes McKinney commented on ARROW-6876:
-

Marked this for 0.15.1

> [Python] Reading parquet file becomes really slow for 0.15.0
> 
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-14 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-6659:
-
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>  Labels: pull-request-available
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial per-partition aggregate, then explicitly calls MergeExec and 
> creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) [Python] Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6876:
--
Labels: pull-request-available  (was: )

> [Python] Reading parquet file becomes really slow for 0.15.0
> 
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) [Python] Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6876:

Fix Version/s: 0.15.1
   1.0.0

> [Python] Reading parquet file becomes really slow for 0.15.0
> 
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951314#comment-16951314
 ] 

Joris Van den Bossche commented on ARROW-6874:
--

This seems to be caused by ARROW-6570 
(https://github.com/apache/arrow/commit/19545f878d17f99a07e51e818eddc8c77f38f56b).
The problem comes up for object-dtype arrays (so for list, struct and string 
dtypes).
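
Per that diagnosis, any conversion that produces object-dtype columns should 
reproduce it; here is a minimal string-column variant (an untested sketch based 
on the diagnosis above, not a confirmed reproducer):

{code:python}
import pyarrow as pa

# A plain string column also converts to an object-dtype pandas column.
table = pa.table({'s': pa.array(['x', 'y'] * 1000)})
for i in range(10000):
    df = table.to_pandas()  # per the diagnosis, memory should grow on 0.15.0
{code}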

> [Python] Memory leak in Table.to_pandas() when nested columns are present
> -
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000  # assumed count; large enough to make the growth visible
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}
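> One way to watch for the growth without an external profiler is Arrow's own 
> allocation counter (a sketch; {{pa.total_allocated_bytes()}} only tracks 
> Arrow's default memory pool, so a leak held on the Python/pandas side may 
> only be visible in process RSS, e.g. via psutil):
> {code:python}
> import numpy as np
> import pyarrow as pa
> 
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> for i in range(5):
>     df = table.to_pandas()
>     # On a healthy build this stays flat across iterations.
>     print(i, pa.total_allocated_bytes())
> {code}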



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6878:
--
Labels: pull-request-available  (was: )

> [Python] pa.array() does not handle list of dicts with bytes keys correctly 
> under python3
> -
>
> Key: ARROW-6878
> URL: https://issues.apache.org/jira/browse/ARROW-6878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>
> It creates sub-arrays with nulls filled, instead of the provided values.
> $ python
> Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
> [GCC 8.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.15.0'
> >>> a = pa.array([{b"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  null
>  ]
> >>> a = pa.array([{"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  [
>  1,
>  2,
>  3
>  ]
>  ]
>  
> It works under Python 2.
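> Until the fix lands, one workaround is to decode the keys before conversion 
> (a sketch, not part of the fix in this ticket):
> {code:python}
> import pyarrow as pa
> 
> rows = [{b"a": [1, 2, 3]}]
> # Decode bytes keys to str so they match the inferred struct field names.
> decoded = [{k.decode("utf-8"): v for k, v in row.items()} for row in rows]
> a = pa.array(decoded)  # child values are preserved instead of nulled out
> {code}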



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6878:
--
Fix Version/s: 0.15.1

> [Python] pa.array() does not handle list of dicts with bytes keys correctly 
> under python3
> -
>
> Key: ARROW-6878
> URL: https://issues.apache.org/jira/browse/ARROW-6878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> It creates sub-arrays with nulls filled, instead of the provided values.
> $ python
> Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
> [GCC 8.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.15.0'
> >>> a = pa.array([{b"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  null
>  ]
> >>> a = pa.array([{"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  [
>  1,
>  2,
>  3
>  ]
>  ]
>  
> It works under Python 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6878:
--
Fix Version/s: 1.0.0

> [Python] pa.array() does not handle list of dicts with bytes keys correctly 
> under python3
> -
>
> Key: ARROW-6878
> URL: https://issues.apache.org/jira/browse/ARROW-6878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> It creates sub-arrays with nulls filled, instead of the provided values.
> $ python
> Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
> [GCC 8.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.15.0'
> >>> a = pa.array([{b"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  null
>  ]
> >>> a = pa.array([{"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  [
>  1,
>  2,
>  3
>  ]
>  ]
>  
> It works under Python 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6881) [Rust] Remove "array_ops" in favor of the "compute" sub-module

2019-10-14 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-6881:
--

 Summary: [Rust] Remove "array_ops" in favor of the "compute" 
sub-module
 Key: ARROW-6881
 URL: https://issues.apache.org/jira/browse/ARROW-6881
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.15.0
Reporter: Paddy Horan
Assignee: Paddy Horan
 Fix For: 1.0.0


Once ARROW-4591 (https://issues.apache.org/jira/browse/ARROW-4591) is complete, 
only filter and limit will remain in the "array_ops" module, and they can be 
moved under the "compute" sub-module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6880) [Rust] Add explicit SIMD for min/max kernel

2019-10-14 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-6880:
--

 Summary: [Rust] Add explicit SIMD for min/max kernel
 Key: ARROW-6880
 URL: https://issues.apache.org/jira/browse/ARROW-6880
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 0.15.0
Reporter: Paddy Horan
Assignee: Paddy Horan
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6879) [Rust] Add explicit SIMD for sum kernel

2019-10-14 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan updated ARROW-6879:
---
Parent: ARROW-4591
Issue Type: Sub-task  (was: Improvement)

> [Rust] Add explicit SIMD for sum kernel
> ---
>
> Key: ARROW-6879
> URL: https://issues.apache.org/jira/browse/ARROW-6879
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 0.15.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Minor
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6879) [Rust] Add explicit SIMD for sum kernel

2019-10-14 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-6879:
--

 Summary: [Rust] Add explicit SIMD for sum kernel
 Key: ARROW-6879
 URL: https://issues.apache.org/jira/browse/ARROW-6879
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.15.0
Reporter: Paddy Horan
Assignee: Paddy Horan
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-6878:
-

Assignee: Antoine Pitrou

> [Python] pa.array() does not handle list of dicts with bytes keys correctly 
> under python3
> -
>
> Key: ARROW-6878
> URL: https://issues.apache.org/jira/browse/ARROW-6878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Antoine Pitrou
>Priority: Major
>
> It creates sub-arrays with nulls filled, instead of the provided values.
> $ python
> Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
> [GCC 8.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.15.0'
> >>> a = pa.array([{b"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  null
>  ]
> >>> a = pa.array([{"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  [
>  1,
>  2,
>  3
>  ]
>  ]
>  
> It works under Python 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6789) [Python] Automatically box bytes/buffer-like values yielded from `FlightServerBase.do_action` in Result values

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6789:
--
Labels: pull-request-available  (was: )

> [Python] Automatically box bytes/buffer-like values yielded from 
> `FlightServerBase.do_action` in Result values
> --
>
> Key: ARROW-6789
> URL: https://issues.apache.org/jira/browse/ARROW-6789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This will help reduce boilerplate for server implementations.
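> For illustration, a sketch of what the change would allow (hypothetical 
> EchoServer; the commented line assumes the auto-boxing described here is in 
> place):
> {code:python}
> import pyarrow as pa
> import pyarrow.flight as flight
> 
> class EchoServer(flight.FlightServerBase):
>     def do_action(self, context, action):
>         # Today every payload must be wrapped explicitly in a Result.
>         yield flight.Result(pa.py_buffer(action.body.to_pybytes()))
>         # With auto-boxing, yielding the raw bytes/buffer would suffice:
>         # yield action.body.to_pybytes()
> {code}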



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) [Python] Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6876:

Summary: [Python] Reading parquet file becomes really slow for 0.15.0  
(was: Reading parquet file becomes really slow for 0.15.0)

> [Python] Reading parquet file becomes really slow for 0.15.0
> 
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6789) [Python] Automatically box bytes/buffer-like values yielded from `FlightServerBase.do_action` in Result values

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6789:
---

Assignee: Wes McKinney

> [Python] Automatically box bytes/buffer-like values yielded from 
> `FlightServerBase.do_action` in Result values
> --
>
> Key: ARROW-6789
> URL: https://issues.apache.org/jira/browse/ARROW-6789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This will help reduce boilerplate for server implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6874) Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6874:

Fix Version/s: 1.0.0

> Memory leak in Table.to_pandas() when nested columns are present
> 
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000  # assumed count; large enough to make the growth visible
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6878:
--
Component/s: Python

> [Python] pa.array() does not handle list of dicts with bytes keys correctly 
> under python3
> -
>
> Key: ARROW-6878
> URL: https://issues.apache.org/jira/browse/ARROW-6878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Zhuo Peng
>Priority: Major
>
> It creates sub-arrays with nulls filled, instead of the provided values.
> $ python
> Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
> [GCC 8.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.15.0'
> >>> a = pa.array([{b"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  null
>  ]
> >>> a = pa.array([{"a": [1, 2, 3]}])
> >>> a
> <pyarrow.lib.StructArray object at 0x...>
> -- is_valid: all not null
> -- child 0 type: list<item: int64>
>  [
>  [
>  1,
>  2,
>  3
>  ]
>  ]
>  
> It works under Python 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6874:

Summary: [Python] Memory leak in Table.to_pandas() when nested columns are 
present  (was: Memory leak in Table.to_pandas() when nested columns are present)

> [Python] Memory leak in Table.to_pandas() when nested columns are present
> -
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000  # assumed count; large enough to make the growth visible
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6878) [Python] pa.array() does not handle list of dicts with bytes keys correctly under python3

2019-10-14 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6878:


 Summary: [Python] pa.array() does not handle list of dicts with 
bytes keys correctly under python3
 Key: ARROW-6878
 URL: https://issues.apache.org/jira/browse/ARROW-6878
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


It creates sub-arrays with nulls filled, instead of the provided values.

$ python

Python 3.6.8 (default, Jan 3 2019, 03:42:36) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'0.15.0'
>>> a = pa.array([{b"a": [1, 2, 3]}])
>>> a
<pyarrow.lib.StructArray object at 0x...>
-- is_valid: all not null
-- child 0 type: list<item: int64>
 [
 null
 ]
>>> a = pa.array([{"a": [1, 2, 3]}])
>>> a
<pyarrow.lib.StructArray object at 0x...>
-- is_valid: all not null
-- child 0 type: list<item: int64>
 [
 [
 1,
 2,
 3
 ]
 ]

It works under Python 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951214#comment-16951214
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

Small reproducer:

{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Build a very wide table (n columns >> n rows); 10000 columns is an assumed
# count matching the "very wide dataframe" described in this issue.
table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
{code}
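
Timing the read with the standard library (instead of %timeit) should show the 
regression directly:

{code:python}
import time
import pyarrow.parquet as pq

t0 = time.perf_counter()
res = pq.read_table("test_wide.parquet")  # file written by the snippet above
print("read_table took %.2fs" % (time.perf_counter() - t0))
{code}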

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6877:
--
Component/s: C++

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> My local dev build started to fail because CMake found the wrong Boost (it 
> reported {{-- Found Boost 1.70.0 at /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) 
> while building in a different conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6877:
--
Fix Version/s: 0.15.1
   1.0.0

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> My local dev build started to fail because CMake found the wrong Boost (it 
> reported {{-- Found Boost 1.70.0 at /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) 
> while building in a different conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951207#comment-16951207
 ] 

Antoine Pitrou commented on ARROW-6877:
---

cc [~wesm]

> [C++] Boost not found from the correct environment
> --
>
> Key: ARROW-6877
> URL: https://issues.apache.org/jira/browse/ARROW-6877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> My local dev build started to fail because CMake found the wrong Boost (it 
> reported {{-- Found Boost 1.70.0 at /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) 
> while building in a different conda environment.
> I can reproduce this by creating a new conda env from scratch following our 
> documentation.
> By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
> works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6877:


 Summary: [C++] Boost not found from the correct environment
 Key: ARROW-6877
 URL: https://issues.apache.org/jira/browse/ARROW-6877
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


My local dev build started to fail because CMake found the wrong Boost (it 
reported {{-- Found Boost 1.70.0 at /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) 
while building in a different conda environment.

I can reproduce this by creating a new conda env from scratch following our 
documentation.

By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6857:
--
Fix Version/s: 1.0.0

> Segfault for dictionary_encode on empty chunked_array (edge case)
> -
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> a reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow 0.14, I could not reproduce it.
> I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"
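> Until the fix, callers can guard the zero-length case themselves (a sketch; 
> note the guarded branch returns the unencoded chunked array, so its type 
> differs from a dictionary-encoded result):
> {code:python}
> import pyarrow as pa
> 
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> sliced = aa[:0]
> # Only call dictionary_encode() on non-empty input to avoid the segfault.
> encoded = sliced.dictionary_encode() if len(sliced) > 0 else sliced
> {code}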



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951177#comment-16951177
 ] 

Bob commented on ARROW-6876:


I also tried fastparquet as an engine and it just threw an error when reading 
the file. It seems it simply cannot decode the file.

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6873) [Python] Stale CColumn reference break Cython cimport pyarrow

2019-10-14 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn updated ARROW-6873:

Fix Version/s: 0.15.1

> [Python] Stale CColumn reference break Cython cimport pyarrow
> -
>
> Key: ARROW-6873
> URL: https://issues.apache.org/jira/browse/ARROW-6873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Traceback:
> {code}
> Error compiling Cython file:
> 
> ...
> # under the License.
> from __future__ import absolute_import
> from libcpp.memory cimport shared_ptr
> from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType,
> ^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:21:0: 
> 'pyarrow/includes/libarrow/CColumn.pxd' not found
> Error compiling Cython file:
> 
> ...
> cdef object wrap_tensor(const shared_ptr[CTensor]& sp_tensor)
> cdef object wrap_sparse_tensor_coo(
> const shared_ptr[CSparseTensorCOO]& sp_sparse_tensor)
> cdef object wrap_sparse_tensor_csr(
> const shared_ptr[CSparseTensorCSR]& sp_sparse_tensor)
> cdef object wrap_column(const shared_ptr[CColumn]& ccolumn)
>^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:39:52: unknown type in 
> template argument
> Error compiling Cython file:
> 
> ...
> from pyarrow cimport Int64ArrayBuilder
> ^
> 
> /Users/uwe/.ipython/cython/_cython_magic_3eb31dd63fb578b618cc8e98a60dbdf5.pyx:2:0:
>  'pyarrow/Int64ArrayBuilder.pxd' not found
> ---
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951176#comment-16951176
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] thanks, let me know if I can help. Our case is quite 
unusual, I think. Also, I am not sure whether the multilevel columns add any 
complexity; it seems Parquet does not handle them very well?

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951175#comment-16951175
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

Thanks, if it is just floats, I'll try to reproduce based on that description. 
But it's probably related to the fact that you have a very wide dataframe (n 
columns >> n rows). In general, the Parquet format is not well suited for that 
kind of data (even in 0.14, taking 2 seconds to read is very slow). That said, 
it's still a performance regression compared to 0.14 that is worth looking into.

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172
 ] 

Bob edited comment on ARROW-6876 at 10/14/19 5:18 PM:
--

[~jorisvandenbossche] it seems you started calling this function, which caused 
the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]


was (Author: dorafmon):
[~jorisvandenbossche] it seems you added this function, which caused the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] it seems you added this function, which caused the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951168#comment-16951168
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] sorry, I cannot share the data with you because it 
contains our IP. Something I can share is:

 

In [6]: df.shape
Out[6]: (61, 31835)

 

All fields are just plain floats; I believe you can create a dataframe just 
like this without difficulty.

 

One thing to note is that our dataframe uses multilevel columns, but I suppose 
that is not an issue?

 

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951164#comment-16951164
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

Thanks for the report. Would you be able to share a script that reproduces it 
(that writes a parquet file that has the issue, or otherwise share a file)? 
What's the schema of the data?

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Description: 
Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
 In [4]: %timeit df = pd.read_parquet(path)
 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
 In [5]: %timeit df = pd.read_parquet(path)
 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 

Edit1:

Some profiling I did:

0.14.1:

!image-2019-10-14-18-12-07-652.png!

 

0.15.0:

!image-2019-10-14-18-10-42-850.png!

 

  was:
Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 


> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-12-07-652.png

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-10-42-850.png

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6874) Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951148#comment-16951148
 ] 

Joris Van den Bossche edited comment on ARROW-6874 at 10/14/19 5:09 PM:


Thanks for the report! 

EDIT:  -tried to reproduce this, but don't see the issue with pyarrow 0.15 
(installed with conda) or master on py 3.7 on Linux (ubuntu)- I can indeed 
reproduce this




was (Author: jorisvandenbossche):
Thanks for the report! 

I tried to reproduce this, but don't see the issue with pyarrow 0.15 (installed 
with conda) or master on py 3.7 on Linux (ubuntu)

> Memory leak in Table.to_pandas() when nested columns are present
> 
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 10000  # assumed count; large enough to make the growth visible
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)
Bob created ARROW-6876:
--

 Summary: Reading parquet file becomes really slow for 0.15.0
 Key: ARROW-6876
 URL: https://issues.apache.org/jira/browse/ARROW-6876
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
 Environment: python3.7
Reporter: Bob


Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6873) [Python] Stale CColumn reference break Cython cimport pyarrow

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-6873:
-

Assignee: Uwe Korn

> [Python] Stale CColumn reference break Cython cimport pyarrow
> -
>
> Key: ARROW-6873
> URL: https://issues.apache.org/jira/browse/ARROW-6873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Traceback:
> {code}
> Error compiling Cython file:
> 
> ...
> # under the License.
> from __future__ import absolute_import
> from libcpp.memory cimport shared_ptr
> from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType,
> ^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:21:0: 
> 'pyarrow/includes/libarrow/CColumn.pxd' not found
> Error compiling Cython file:
> 
> ...
> cdef object wrap_tensor(const shared_ptr[CTensor]& sp_tensor)
> cdef object wrap_sparse_tensor_coo(
> const shared_ptr[CSparseTensorCOO]& sp_sparse_tensor)
> cdef object wrap_sparse_tensor_csr(
> const shared_ptr[CSparseTensorCSR]& sp_sparse_tensor)
> cdef object wrap_column(const shared_ptr[CColumn]& ccolumn)
>^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:39:52: unknown type in 
> template argument
> Error compiling Cython file:
> 
> ...
> from pyarrow cimport Int64ArrayBuilder
> ^
> 
> /Users/uwe/.ipython/cython/_cython_magic_3eb31dd63fb578b618cc8e98a60dbdf5.pyx:2:0:
>  'pyarrow/Int64ArrayBuilder.pxd' not found
> ---
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6873) [Python] Stale CColumn reference break Cython cimport pyarrow

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6873.
---
Fix Version/s: (was: 0.15.1)
   1.0.0
   Resolution: Fixed

Issue resolved by pull request 5646
[https://github.com/apache/arrow/pull/5646]

> [Python] Stale CColumn reference break Cython cimport pyarrow
> -
>
> Key: ARROW-6873
> URL: https://issues.apache.org/jira/browse/ARROW-6873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Traceback:
> {code}
> Error compiling Cython file:
> 
> ...
> # under the License.
> from __future__ import absolute_import
> from libcpp.memory cimport shared_ptr
> from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType,
> ^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:21:0: 
> 'pyarrow/includes/libarrow/CColumn.pxd' not found
> Error compiling Cython file:
> 
> ...
> cdef object wrap_tensor(const shared_ptr[CTensor]& sp_tensor)
> cdef object wrap_sparse_tensor_coo(
> const shared_ptr[CSparseTensorCOO]& sp_sparse_tensor)
> cdef object wrap_sparse_tensor_csr(
> const shared_ptr[CSparseTensorCSR]& sp_sparse_tensor)
> cdef object wrap_column(const shared_ptr[CColumn]& ccolumn)
>^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:39:52: unknown type in 
> template argument
> Error compiling Cython file:
> 
> ...
> from pyarrow cimport Int64ArrayBuilder
> ^
> 
> /Users/uwe/.ipython/cython/_cython_magic_3eb31dd63fb578b618cc8e98a60dbdf5.pyx:2:0:
>  'pyarrow/Int64ArrayBuilder.pxd' not found
> ---
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951152#comment-16951152
 ] 

Antoine Pitrou commented on ARROW-6857:
---

Thanks for the report. Indeed it seems like there's a regression here.

> Segfault for dictionary_encode on empty chunked_array (edge case)
> -
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> a reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow 0.14, I could not reproduce it.
> I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6857:
--
Labels: pull-request-available  (was: )

> Segfault for dictionary_encode on empty chunked_array (edge case)
> -
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>
> A reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow=0.14, I could not reproduce. 
>  I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6874) Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951148#comment-16951148
 ] 

Joris Van den Bossche commented on ARROW-6874:
--

Thanks for the report! 

I tried to reproduce this, but I don't see the issue with pyarrow 0.15 (installed 
with conda) or master, on Python 3.7 on Linux (Ubuntu).
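
For anyone else trying to reproduce: one way to watch Arrow's allocations across
iterations is the sketch below, assuming the default memory pool is in use (the
iteration count here is an arbitrary choice):

{code:python}
import numpy as np
import pyarrow as pa

nested_array = pa.array([np.random.rand(1000) for i in range(500)])
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# pa.total_allocated_bytes() reports what Arrow's default memory pool
# currently holds; it should return to the baseline if to_pandas()
# releases everything it allocated.
baseline = pa.total_allocated_bytes()
for i in range(100):
    df = table.to_pandas()
print(pa.total_allocated_bytes() - baseline)
{code}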

> Memory leak in Table.to_pandas() when nested columns are present
> 
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 1
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6836:
--
Labels: pull-request-available  (was: )

> [Format] add a custom_metadata:[KeyValue] field to the Footer table in 
> File.fbs
> ---
>
> Key: ARROW-6836
> URL: https://issues.apache.org/jira/browse/ARROW-6836
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>
> add a custom_metadata:[KeyValue] field to the Footer table in File.fbs
> Use case:
> If a file is expanded with additional record batches and the custom_metadata 
> changes, the Schema is no longer an appropriate place to record this change, 
> since the two copies of the Schema (at the beginning and end of the file) 
> would then be ambiguous.
> cf 
> https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6874) Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6874:
-
Fix Version/s: 0.15.1

> Memory leak in Table.to_pandas() when nested columns are present
> 
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 1
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6875) [Python][Flight] Implement Criteria for ListFlights RPC / list_flights method

2019-10-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6875:
---

 Summary: [Python][Flight] Implement Criteria for ListFlights RPC / 
list_flights method
 Key: ARROW-6875
 URL: https://issues.apache.org/jira/browse/ARROW-6875
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


We should work through how to pass a custom Criteria to ListFlights.
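
As a starting point for discussion, the sketch below shows one possible shape for
this in Python, assuming the criteria surfaces as opaque bytes on both sides
(mirroring the Flight protocol's Criteria message). The server port and the
_flights registry are hypothetical, and these signatures are not a final API:

{code:python}
import pyarrow.flight as flight

class MyFlightServer(flight.FlightServerBase):
    def list_flights(self, context, criteria):
        # criteria: raw bytes sent by the client; here treated as a
        # path prefix used to filter the advertised flights
        prefix = criteria.decode('utf-8') if criteria else ''
        for name, info in self._flights.items():  # hypothetical registry
            if name.startswith(prefix):
                yield info

# Client side: forward the criteria bytes with the ListFlights RPC
# (server wiring omitted; assumes a server listening on port 8815)
client = flight.FlightClient('grpc://localhost:8815')
for info in client.list_flights(criteria=b'sales/'):
    print(info.descriptor)
{code}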



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-6857:
-

Assignee: Antoine Pitrou

> Segfault for dictionary_encode on empty chunked_array (edge case)
> -
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.15.1
>
>
> A reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow=0.14, I could not reproduce. 
>  I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6874) Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Sergey Mozharov (Jira)
Sergey Mozharov created ARROW-6874:
--

 Summary: Memory leak in Table.to_pandas() when nested columns are 
present
 Key: ARROW-6874
 URL: https://issues.apache.org/jira/browse/ARROW-6874
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
 Environment: Operating system: Windows 10
pyarrow installed via conda
both python environments were identical except pyarrow: 
python: 3.6.7
numpy: 1.17.2
pandas: 0.25.1
Reporter: Sergey Mozharov


I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
interpreter ran out of memory.

I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears 
to have a memory leak in the latest version. See details below to reproduce 
this issue.

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# create a table with one nested array column
nested_array = pa.array([np.random.rand(1000) for i in range(500)])
nested_array.type  # ListType(list<item: double>)
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# convert it to a pandas DataFrame in a loop to monitor memory consumption
num_iterations = 1
# pyarrow v0.14.1: Memory allocation does not grow during loop execution
# pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)


# When the table column is not nested, no memory leak is observed
array = pa.array(np.random.rand(500 * 1000))
table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
# no memory leak:
for i in range(num_iterations):
    df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6852) [C++] memory-benchmark build failed on Arm64

2019-10-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6852.
---
Resolution: Fixed

Issue resolved by pull request 5624
[https://github.com/apache/arrow/pull/5624]

> [C++] memory-benchmark build failed on Arm64
> 
>
> Key: ARROW-6852
> URL: https://issues.apache.org/jira/browse/ARROW-6852
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> After the new commit ARROW-6381 was merged to master, the build fails on 
> Arm64 when -DARROW_BUILD_BENCHMARKS is enabled:
> 
> {code:cpp}
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: 
> 'kMemoryPerCore' was not declared in this scope
>  const int64_t buffer_size = kMemoryPerCore;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: 
> 'Buffer' was not declared in this scope
>  std::shared_ptr<Buffer> src, dst;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: 
> suggested alternative:
> In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0,
>  from /home/builder/arrow/cpp/src/arrow/api.h:23,
>  from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20:
> /home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer'
>  class ARROW_EXPORT Buffer {
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: 
> template argument 1 is invalid
>  std::shared_ptr<Buffer> src, dst;
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6870) [C#] Add Support for Dictionary Arrays and Dictionary Encoding

2019-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6870:
-
Summary: [C#] Add Support for Dictionary Arrays and Dictionary Encoding  
(was: Add Support for Dictionary Arrays and Dictionary Encoding)

> [C#] Add Support for Dictionary Arrays and Dictionary Encoding
> --
>
> Key: ARROW-6870
> URL: https://issues.apache.org/jira/browse/ARROW-6870
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Daniel Parubotchy
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow doesn't support dictionary arrays, 
> serialization/deserialization of dictionary batches, or dictionary encoding.
> Dictionary arrays and dictionary encoding could provide a huge performance 
> benefit for certain data sets.
> I propose creating dictionary array types that correspond to the existing 
> array types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6857:
-
Fix Version/s: 0.15.1

> Segfault for dictionary_encode on empty chunked_array (edge case)
> -
>
> Key: ARROW-6857
> URL: https://issues.apache.org/jira/browse/ARROW-6857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.15.1
>
>
> A reproducer is here:
> {code:python}
> import pyarrow as pa
> aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
> aa[:0].dictionary_encode()  
> # Segmentation fault: 11
> {code}
> For pyarrow=0.14, I could not reproduce. 
>  I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5971) [Website] Blog post introducing Arrow Flight

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5971:
--
Labels: pull-request-available  (was: )

> [Website] Blog post introducing Arrow Flight
> 
>
> Key: ARROW-5971
> URL: https://issues.apache.org/jira/browse/ARROW-5971
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> I think it's a good time to be bringing more attention to our work over the 
> last 12-14 months on Arrow Flight. 
> I would be OK to draft an initial version of the blog post, and I can 
> circulate it to others for review / edit / comment. If there are particular 
> benchmarks you would like to see included, contributing code for that would 
> also be helpful. My plan would be to show TCP throughput on localhost, and 
> node-to-node throughput on a local gigabit ethernet network. I think the 
> localhost throughput is important to show that Flight is a tool you would 
> want to reach for to get faster throughput in high-performance networking 
> (e.g. 10/40 gigabit).
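
For reference, a localhost pull-throughput measurement could be as simple as the
sketch below; the server address and the b'bench' ticket are hypothetical
placeholders for whatever the benchmark harness ends up serving:

{code:python}
import time
import pyarrow.flight as flight

# Pull a stream from a local Flight server and report end-to-end
# throughput; assumes a server on port 8815 serving the ticket b'bench'.
client = flight.FlightClient('grpc://localhost:8815')

start = time.perf_counter()
table = client.do_get(flight.Ticket(b'bench')).read_all()
elapsed = time.perf_counter() - start

print('%.1f MB/s' % (table.nbytes / elapsed / 2**20))
{code}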



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6873) [Python] Stale CColumn reference breaks Cython cimport pyarrow

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6873:
--
Labels: pull-request-available  (was: )

> [Python] Stale CColumn reference breaks Cython cimport pyarrow
> -
>
> Key: ARROW-6873
> URL: https://issues.apache.org/jira/browse/ARROW-6873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>
> Traceback:
> {code}
> Error compiling Cython file:
> 
> ...
> # under the License.
> from __future__ import absolute_import
> from libcpp.memory cimport shared_ptr
> from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType,
> ^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:21:0: 
> 'pyarrow/includes/libarrow/CColumn.pxd' not found
> Error compiling Cython file:
> 
> ...
> cdef object wrap_tensor(const shared_ptr[CTensor]& sp_tensor)
> cdef object wrap_sparse_tensor_coo(
> const shared_ptr[CSparseTensorCOO]& sp_sparse_tensor)
> cdef object wrap_sparse_tensor_csr(
> const shared_ptr[CSparseTensorCSR]& sp_sparse_tensor)
> cdef object wrap_column(const shared_ptr[CColumn]& ccolumn)
>^
> 
> …/lib/python3.7/site-packages/pyarrow/__init__.pxd:39:52: unknown type in 
> template argument
> Error compiling Cython file:
> 
> ...
> from pyarrow cimport Int64ArrayBuilder
> ^
> 
> /Users/uwe/.ipython/cython/_cython_magic_3eb31dd63fb578b618cc8e98a60dbdf5.pyx:2:0:
>  'pyarrow/Int64ArrayBuilder.pxd' not found
> ---
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6873) [Python] Stale CColumn reference breaks Cython cimport pyarrow

2019-10-14 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-6873:
---

 Summary: [Python] Stale CColumn reference breaks Cython cimport 
pyarrow
 Key: ARROW-6873
 URL: https://issues.apache.org/jira/browse/ARROW-6873
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Uwe Korn
 Fix For: 0.15.1


Traceback:

{code}
Error compiling Cython file:

...
# under the License.

from __future__ import absolute_import

from libcpp.memory cimport shared_ptr
from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType,
^


…/lib/python3.7/site-packages/pyarrow/__init__.pxd:21:0: 
'pyarrow/includes/libarrow/CColumn.pxd' not found

Error compiling Cython file:

...
cdef object wrap_tensor(const shared_ptr[CTensor]& sp_tensor)
cdef object wrap_sparse_tensor_coo(
const shared_ptr[CSparseTensorCOO]& sp_sparse_tensor)
cdef object wrap_sparse_tensor_csr(
const shared_ptr[CSparseTensorCSR]& sp_sparse_tensor)
cdef object wrap_column(const shared_ptr[CColumn]& ccolumn)
   ^


…/lib/python3.7/site-packages/pyarrow/__init__.pxd:39:52: unknown type in 
template argument

Error compiling Cython file:

...

from pyarrow cimport Int64ArrayBuilder
^


/Users/uwe/.ipython/cython/_cython_magic_3eb31dd63fb578b618cc8e98a60dbdf5.pyx:2:0:
 'pyarrow/Int64ArrayBuilder.pxd' not found
---
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6872) [C++][Python] Empty table with dictionary-columns raises ArrowNotImplementedError

2019-10-14 Thread Marco Neumann (Jira)
Marco Neumann created ARROW-6872:


 Summary: [C++][Python] Empty table with dictionary-columns raises 
ArrowNotImplementedError
 Key: ARROW-6872
 URL: https://issues.apache.org/jira/browse/ARROW-6872
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.0
Reporter: Marco Neumann


h2. Abstract
As a pyarrow user, I would expect that I can create an empty table out of every 
schema that I created via pandas. This does not work for dictionary types (e.g. 
{{"category"}} dtypes).

h2. Test Case
This code:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
table = pa.Table.from_pandas(df)
schema = table.schema
table_empty = schema.empty_table()  # boom
{code}

produces this exception:

{noformat}
Traceback (most recent call last):
  File "arrow_bug.py", line 8, in 
table_empty = schema.empty_table()
  File "pyarrow/types.pxi", line 860, in __iter__
  File "pyarrow/array.pxi", line 211, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Sequence converter for type 
dictionary<values=string, indices=int8, ordered=0> not implemented
{noformat}
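
A possible interim workaround (a sketch, not an endorsed API path) is to build
the empty table by slicing the populated one, which keeps the schema but
bypasses the Python sequence converters entirely:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
table = pa.Table.from_pandas(df)

# slice(0, 0) keeps the schema (including the dictionary column)
# but carries zero rows, so no sequence conversion is involved
table_empty = table.slice(0, 0)
assert table_empty.num_rows == 0
assert table_empty.schema.equals(table.schema)
{code}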



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6871) [Java] Enhance TransferPair related parameters check and tests

2019-10-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6871:
--
Labels: pull-request-available  (was: )

> [Java] Enhance TransferPair related parameters check and tests
> --
>
> Key: ARROW-6871
> URL: https://issues.apache.org/jira/browse/ARROW-6871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
>
> {{TransferPair}}-related parameter checks in different classes have potential 
> problems:
> i. {{copyValueSafe}} does not check the from index; if from > valueCount, no 
> error is raised.
> ii. {{splitAndTransfer}} has no index checks in classes like {{VarCharVector}}.
> iii. The {{splitAndTransfer}} index check in classes like {{UnionVector}} is not 
> correct ({{Preconditions.checkArgument(startIndex + length <= valueCount)}}); 
> the parameters should be checked separately.
> iv. Some {{assert}} usages should be replaced with {{Preconditions}} checks.
> v. More unit tests should be added to cover corner cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6871) [Java] Enhance TransferPair related parameters check and tests

2019-10-14 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950754#comment-16950754
 ] 

Ji Liu commented on ARROW-6871:
---

Thanks for your reminder. I will also add a benchmark; if there is not much 
regression, the parameter checks should be added/corrected to avoid potential 
problems.

> [Java] Enhance TransferPair related parameters check and tests
> --
>
> Key: ARROW-6871
> URL: https://issues.apache.org/jira/browse/ARROW-6871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> {{TransferPair}}-related parameter checks in different classes have potential 
> problems:
> i. {{copyValueSafe}} does not check the from index; if from > valueCount, no 
> error is raised.
> ii. {{splitAndTransfer}} has no index checks in classes like {{VarCharVector}}.
> iii. The {{splitAndTransfer}} index check in classes like {{UnionVector}} is not 
> correct ({{Preconditions.checkArgument(startIndex + length <= valueCount)}}); 
> the parameters should be checked separately.
> iv. Some {{assert}} usages should be replaced with {{Preconditions}} checks.
> v. More unit tests should be added to cover corner cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)