[jira] [Commented] (ARROW-6445) [CI][Crossbow] Nightly Gandiva jar trusty job fails

2020-01-14 Thread Prudhvi Porandla (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015604#comment-17015604
 ] 

Prudhvi Porandla commented on ARROW-6445:
-

Yes, we can close this ticket. Thanks.

> [CI][Crossbow] Nightly Gandiva jar trusty job fails
> ---
>
> Key: ARROW-6445
> URL: https://issues.apache.org/jira/browse/ARROW-6445
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Packaging
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> https://travis-ci.org/ursa-labs/crossbow/builds/580192384. Error is due to 
> use of {{std::regex}}; replace with RE2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6445) [CI][Crossbow] Nightly Gandiva jar trusty job fails

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6445:
---
Fix Version/s: (was: 0.15.0)
   0.16.0

> [CI][Crossbow] Nightly Gandiva jar trusty job fails
> ---
>
> Key: ARROW-6445
> URL: https://issues.apache.org/jira/browse/ARROW-6445
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Packaging
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> https://travis-ci.org/ursa-labs/crossbow/builds/580192384. Error is due to 
> use of {{std::regex}}; replace with RE2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6445) [CI][Crossbow] Nightly Gandiva jar trusty job fails

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6445.

Resolution: Fixed

> [CI][Crossbow] Nightly Gandiva jar trusty job fails
> ---
>
> Key: ARROW-6445
> URL: https://issues.apache.org/jira/browse/ARROW-6445
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Packaging
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> https://travis-ci.org/ursa-labs/crossbow/builds/580192384. Error is due to 
> use of {{std::regex}}; replace with RE2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7582) [Rust][Flight] Unable to compile arrow.flight.protocol.rs

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7582.
-
Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6198
[https://github.com/apache/arrow/pull/6198]

> [Rust][Flight] Unable to compile arrow.flight.protocol.rs
> -
>
> Key: ARROW-7582
> URL: https://issues.apache.org/jira/browse/ARROW-7582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Not sure exactly why, perhaps it has something to do with the recently 
> updated dependencies: https://github.com/apache/arrow/runs/389937707
> cc [~andygrove] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7582) [Rust][Flight] Unable to compile arrow.flight.protocol.rs

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7582:
---

Assignee: Krisztian Szucs

> [Rust][Flight] Unable to compile arrow.flight.protocol.rs
> -
>
> Key: ARROW-7582
> URL: https://issues.apache.org/jira/browse/ARROW-7582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Not sure exactly why, perhaps it has something to do with the recently 
> updated dependencies: https://github.com/apache/arrow/runs/389937707
> cc [~andygrove] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7571) [Java] Correct minimal java version on README

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7571.
-
Resolution: Fixed

Issue resolved by pull request 6190
[https://github.com/apache/arrow/pull/6190]

> [Java] Correct minimal java version on README
> -
>
> Key: ARROW-7571
> URL: https://issues.apache.org/jira/browse/ARROW-7571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7571) [Java] Correct minimal java version on README

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7571:
---

Assignee: Fokko Driesprong

> [Java] Correct minimal java version on README
> -
>
> Key: ARROW-7571
> URL: https://issues.apache.org/jira/browse/ARROW-7571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6663) [C++] Use software __builtin_popcountll when building without SSE4.2

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6663:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Use software __builtin_popcountll when building without SSE4.2
> 
>
> Key: ARROW-6663
> URL: https://issues.apache.org/jira/browse/ARROW-6663
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> This is to be extra safe in the context of ARROW-5381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4359) [Python] Column metadata is not saved or loaded in parquet

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4359:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python] Column metadata is not saved or loaded in parquet
> --
>
> Key: ARROW-4359
> URL: https://issues.apache.org/jira/browse/ARROW-4359
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Seb Fru
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Hi all,
> a while ago I posted this issue: ARROW-3866
> While working with Pyarrow I encountered another potential bug related to 
> column metadata: If I create a table containing columns with metadata 
> everything is fine. But after I save the table to parquet and load it back as 
> a table using pq.read_table, the column metadata is gone.
>  
> As of now I cannot yet say whether the metadata is not saved correctly or 
> not loaded correctly, as I have no idea how to verify it. Unfortunately I 
> also don't have the time to try a lot, but I wanted to let you know anyway. 
>  
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> path = 'metadata_test.parquet'  # any writable path
> 
> field0 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
> field1 = pa.field('field2', pa.int64(), nullable=False)
> columns = [
>     pa.column(field0, pa.array([1, 2])),
>     pa.column(field1, pa.array([3, 4]))
> ]
> table = pa.Table.from_arrays(columns)
> pq.write_table(table, path)
> tab2 = pq.read_table(path)
> tab2.column(0).field.metadata  # empty after the round trip
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6390:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API
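
As a starting point for such a tutorial, here is a minimal, untested sketch of a Flight service using {{pyarrow.flight}}; the location, port, and ticket handling below are placeholder choices for illustration, not anything prescribed by the library.

{code:python}
import pyarrow as pa
import pyarrow.flight as flight


class TinyFlightServer(flight.FlightServerBase):
    """Serves one in-memory table, regardless of the ticket contents."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._table = pa.table({"x": [1, 2, 3]})

    def do_get(self, context, ticket):
        # Stream the table back to the caller as record batches.
        return flight.RecordBatchStream(self._table)


# Server process: TinyFlightServer().serve()  # blocks
# Client process:
#   client = flight.connect("grpc://localhost:8815")
#   table = client.do_get(flight.Ticket(b"")).read_all()
{code}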



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7256) [C++] Remove ARROW_MEMORY_POOL_DEFAULT option

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7256.
-
  Assignee: Francois Saint-Jacques
Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/df613bcb6a44a3f7be53faf8d368bf529436c50a

> [C++] Remove ARROW_MEMORY_POOL_DEFAULT option
> -
>
> Key: ARROW-7256
> URL: https://issues.apache.org/jira/browse/ARROW-7256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.16.0
>
>
> As mentioned in another JIRA I recall, we aren't adequately testing the 
> CMake option for "no default memory pool", so it would be better either to 
> require explicit memory pools or to pass the default, rather than having a 
> build-time option to set whether a default will be passed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7091) [C++] Move all factories to type_fwd.h

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7091:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Move all factories to type_fwd.h
> --
>
> Key: ARROW-7091
> URL: https://issues.apache.org/jira/browse/ARROW-7091
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> There's no particular reason why parameter-less factories are in 
> {{type_fwd.h}}, but the others in their respective implementation headers. By 
> putting more factories in {{type_fwd.h}}, we may be able to avoid importing 
> the heavier headers in some places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6799:

Fix Version/s: (was: 0.16.0)

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6821) [C++][Parquet] Do not require Thrift compiler when building (but still require library)

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6821:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++][Parquet] Do not require Thrift compiler when building (but still 
> require library)
> ---
>
> Key: ARROW-6821
> URL: https://issues.apache.org/jira/browse/ARROW-6821
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Building Thrift from source carries extra toolchain dependencies (bison and 
> flex). If we check in the files produced by compiling parquet.thrift, then 
> the EP can be simplified to only build the Thrift C++ library and not the 
> compiler. This also results in a simpler build for third parties



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6917) [Developer] Implement Python script to generate git cherry-pick commands needed to create patch build branch for maint releases

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6917:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Developer] Implement Python script to generate git cherry-pick commands 
> needed to create patch build branch for maint releases
> ---
>
> Key: ARROW-6917
> URL: https://issues.apache.org/jira/browse/ARROW-6917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> For 0.14.1, I maintained this script by hand. It would be less failure-prone 
> (maybe) to generate it based on the fix versions set in JIRA
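
A rough sketch of what such a generator could look like, assuming the list of issue keys carrying the maintenance fix version has already been exported from JIRA to a text file (the JIRA query itself is out of scope here):

{code:python}
import re
import subprocess
import sys

# One issue key per line, e.g. produced by a JIRA query for the maintenance
# release's fix version (fetching that list is outside this sketch).
with open(sys.argv[1]) as f:
    issue_keys = {line.strip() for line in f if line.strip()}

log = subprocess.run(
    ["git", "log", "--oneline", "--reverse", "master"],
    capture_output=True, text=True, check=True,
).stdout

for entry in log.splitlines():
    sha, _, subject = entry.partition(" ")
    m = re.match(r"(ARROW-\d+)", subject)
    if m and m.group(1) in issue_keys:
        print(f"git cherry-pick {sha}  # {subject}")
{code}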



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6915:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6547) [C++] valgrind errors in diff-test

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6547:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] valgrind errors in diff-test
> --
>
> Key: ARROW-6547
> URL: https://issues.apache.org/jira/browse/ARROW-6547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> Not sure when these crept in, but I encountered them when looking into a segfault 
> in a build today
> https://gist.github.com/wesm/b388dda4f0e2e38a8aa77dfc9bd91914



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6645) [Python] Dictionary indices are boundschecked unconditionally in CategoricalBlock.to_pandas

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6645:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python] Dictionary indices are boundschecked unconditionally in 
> CategoricalBlock.to_pandas
> ---
>
> Key: ARROW-6645
> URL: https://issues.apache.org/jira/browse/ARROW-6645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This was added at some point to fix a bug. I suspect we might want to move 
> this check somewhere else rather than do it every time {{to_pandas}} is called



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6110:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Support LargeList Type and add integration test with C++
> ---
>
> Key: ARROW-6110
> URL: https://issues.apache.org/jira/browse/ARROW-6110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6281) [Python] Produce chunked arrays for nested types in pyarrow.array

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6281:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python] Produce chunked arrays for nested types in pyarrow.array
> -
>
> Key: ARROW-6281
> URL: https://issues.apache.org/jira/browse/ARROW-6281
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As follow up to ARROW-5028 and other issues, in a case like
> {code}
> vals = [['x' * 1024]] * ((2 << 20) + 1)
> arr = pa.array(vals)
> {code}
> The child array of the ListArray cannot hold all of the string data. After 
> the patch for ARROW-5028, an exception is raised rather than returning a 
> malformed array. We could (with some effort) instead produce a chunked array 
> of list type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6404) [C++] CMake build of arrow libraries fails on Windows

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6404:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] CMake build of arrow libraries fails on Windows
> -
>
> Key: ARROW-6404
> URL: https://issues.apache.org/jira/browse/ARROW-6404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: ARF
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: build, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> I am trying to build the python pyarrow extension on Windows 10 using Visual 
> Studio 2015 Build Tools and the current stable CMake.
> Following [the 
> instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]
>  to the letter, CMake fails with the error:
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):}}
> {{ string no output variable specified}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> 
> Complete output:
> {{(pyarrow-dev) Z:\devel\arrow\cpp\build>cmake -G "Visual Studio 14 2015 
> Win64" ^}}
> {{More? -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^}}
> {{More? -DARROW_CXXFLAGS="/WX /MP" ^}}
> {{More? -DARROW_GANDIVA=on ^}}
> {{More? -DARROW_PARQUET=on ^}}
> {{More? -DARROW_PYTHON=on ..}}
> {{-- Building using CMake version: 3.15.2}}
> {{CMake Error at CMakeLists.txt:30 (string):}}
> {{ string no output variable specified}}
> {{-- Selecting Windows SDK version to target Windows 10.0.17763.}}
> {{-- The C compiler identification is MSVC 19.0.24210.0}}
> {{-- The CXX compiler identification is MSVC 19.0.24210.0}}
> {{-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe -- works}}
> {{-- Detecting C compiler ABI info}}
> {{-- Detecting C compiler ABI info - done}}
> {{-- Detecting C compile features}}
> {{-- Detecting C compile features - done}}
> {{-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe -- works}}
> {{-- Detecting CXX compiler ABI info}}
> {{-- Detecting CXX compiler ABI info - done}}
> {{-- Detecting CXX compile features}}
> {{-- Detecting CXX compile features - done}}
> {{-- Arrow version: 0.15.0 (full: '0.15.0-SNAPSHOT')}}
> {{-- Arrow SO version: 15 (full: 15.0.0)}}
> {{-- Found PkgConfig: 
> Z:/Systemdateien/Miniconda3/envs/pyarrow-dev/Library/bin/pkg-config.exe 
> (found version "0.29.2")}}
> {{-- clang-tidy not found}}
> {{-- clang-format not found}}
> {{-- infer not found}}
> {{-- Found PythonInterp: 
> Z:/Systemdateien/Miniconda3/envs/pyarrow-dev/python.exe (found version 
> "3.7.3")}}
> {{-- Found cpplint executable at Z:/devel/arrow/cpp/build-support/cpplint.py}}
> {{-- Compiler command: C:/Program Files (x86)/Microsoft Visual Studio 
> 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Compiler version:}}
> {{-- Compiler id: MSVC}}
> {{Selected compiler msvc}}
> {{-- Performing Test CXX_SUPPORTS_SSE4_2}}
> {{-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ALTIVEC}}
> {{-- Performing Test CXX_SUPPORTS_ALTIVEC - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ARMCRC}}
> {{-- Performing Test CXX_SUPPORTS_ARMCRC - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO}}
> {{-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO - Failed}}
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):}}
> {{ string no output variable specified}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> {{-- Arrow build warning level: CHECKIN}}
> {{Configured for build (set with cmake 
> -DCMAKE_BUILD_TYPE=\{release,debug,...})}}
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:438 (message):}}
> {{ Unknown build type:}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> {{-- Configuring incomplete, errors occurred!}}
> {{See also "Z:/devel/arrow/cpp/build/CMakeFiles/CMakeOutput.log".}}
> {{See also "Z:/devel/arrow/cpp/build/CMakeFiles/CMakeError.log".}}
> {{(pyarrow-dev) Z:\devel\arrow\cpp\build>}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6111:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4965:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python] Timestamp array type detection should use tzname of 
> datetime.datetime objects
> --
>
> Key: ARROW-4965
> URL: https://issues.apache.org/jira/browse/ARROW-4965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
> Environment: $ python --version
> Python 3.7.2
> $ pip freeze
> numpy==1.16.2
> pyarrow==0.12.1
> pytz==2018.9
> six==1.12.0
> $ sw_vers
> ProductName:Mac OS X
> ProductVersion: 10.14.3
> BuildVersion:   18D109
> (pyarrow) 
>Reporter: Tim Swast
>Priority: Major
> Fix For: 1.0.0
>
>
> The type detection from datetime objects to array appears to ignore the 
> presence of a tzinfo on the datetime object, instead storing them as naive 
> timestamp columns.
> Python code:
> {code:python}
> import datetime
> import pytz
> import pyarrow as pa
> naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
> utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
> tzaware_datetime = 
> utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))
> def inspect(varname):
> print(varname)
> arr = globals()[varname]
> print(arr.type)
> print(arr)
> print()
> auto_naive_arr = pa.array([naive_datetime])
> inspect("auto_naive_arr")
> auto_utc_arr = pa.array([utc_datetime])
> inspect("auto_utc_arr")
> auto_tzaware_arr = pa.array([tzaware_datetime])
> inspect("auto_tzaware_arr")
> auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
> inspect("auto_mixed_arr")
> naive_type = pa.timestamp("us", naive_datetime.tzname())
> utc_type = pa.timestamp("us", utc_datetime.tzname())
> tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())
> naive_arr = pa.array([naive_datetime], type=naive_type)
> inspect("naive_arr")
> utc_arr = pa.array([utc_datetime], type=utc_type)
> inspect("utc_arr")
> tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
> inspect("tzaware_arr")
> mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
> inspect("mixed_arr")
> {code}
> This prints:
> {noformat}
> $ python detect_timezone.py
> auto_naive_arr
> timestamp[us]
> [
>   154738147000
> ]
> auto_utc_arr
> timestamp[us]
> [
>   154738147000
> ]
> auto_tzaware_arr
> timestamp[us]
> [
>   154735267000
> ]
> auto_mixed_arr
> timestamp[us]
> [
>   154738147000,
>   154735267000
> ]
> naive_arr
> timestamp[us]
> [
>   154738147000
> ]
> utc_arr
> timestamp[us, tz=UTC]
> [
>   154738147000
> ]
> tzaware_arr
> timestamp[us, tz=PST]
> [
>   154735267000
> ]
> mixed_arr
> timestamp[us, tz=UTC]
> [
>   154738147000,
>   154735267000
> ]
> {noformat}
> But I would expect the following types instead:
> * {{naive_datetime}}: {{timestamp[us]}}
> * {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
> * {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe 
> {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as 
> the {{tzname}})
> * {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}
> Also, in the "mixed" case, I'd expect the actual stored microseconds to be 
> the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both 
> refer to the same point in time. It seems reasonable for any naive datetime 
> objects mixed in with tz-aware datetimes to be interpreted as UTC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5106) [Packaging] [C++/Python] Add conda package verification scripts

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5106:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Packaging] [C++/Python] Add conda package verification scripts
> ---
>
> Key: ARROW-5106
> URL: https://issues.apache.org/jira/browse/ARROW-5106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> Following the conventions of apt/yum verification script: 
> https://github.com/apache/arrow/pull/4098



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5213:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Format] Script for updating various checked-in Flatbuffers files
> -
>
> Key: ARROW-5213
> URL: https://issues.apache.org/jira/browse/ARROW-5213
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Format, Go, Rust
>Reporter: Wes McKinney
>Assignee: Andy Grove
>Priority: Minor
> Fix For: 1.0.0
>
>
> Some subprojects have begun checking in generated Flatbuffers files to source 
> control. This presents a maintainability issue when there are additions or 
> changes made to the .fbs sources. It would be useful to be able to automate 
> the update of these files so it doesn't have to happen on a manual / 
> case-by-case basis
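
A minimal sketch of such an automation, assuming {{flatc}} is on the PATH; the directory layout and output locations below are illustrative guesses, not the repository's actual paths:

{code:python}
import pathlib
import subprocess

# Illustrative layout -- the real .fbs sources and per-language output
# directories would need to be confirmed against the repository.
FBS_DIR = pathlib.Path("format")
TARGETS = {
    "rust": ["--rust", "-o", "rust/arrow/src/ipc/gen"],
    "go": ["--go", "-o", "go/arrow/internal/flatbuf"],
}

for fbs in sorted(FBS_DIR.glob("*.fbs")):
    for lang, flags in TARGETS.items():
        # flatc regenerates the checked-in bindings for each language.
        subprocess.run(["flatc", *flags, str(fbs)], check=True)
        print(f"regenerated {fbs.name} for {lang}")
{code}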



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6459:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment then it 
> confuses CMake like
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6275) [C++] Deprecate RecordBatchReader::ReadNext

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6275:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Deprecate RecordBatchReader::ReadNext
> ---
>
> Key: ARROW-6275
> URL: https://issues.apache.org/jira/browse/ARROW-6275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> After 6161, RecordBatchReader is a refinement of util::Iterator and the 
> ReadNext method is redundant and can be deprecated. (util::Iterator provides 
> Next, which has identical semantics.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6529) [C++] Feather: slow writing of NullArray

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6529:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Feather: slow writing of NullArray
> 
>
> Key: ARROW-6529
> URL: https://issues.apache.org/jira/browse/ARROW-6529
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: feather
> Fix For: 1.0.0
>
>
> From 
> https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none
> A smaller example using just pyarrow: it seems that writing an array of 
> nulls takes much longer than, for example, an array of ints, which seems a bit 
> strange:
> {code}
> In [93]: arr = pa.array([None]*1000, type='int64')
> In [94]: %%timeit 
> ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
> ...: w.writer.write_array('x', arr) 
> ...: w.writer.close() 
> 31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)
> In [95]: arr = pa.array([None]*1000)  
> In [96]: arr
> Out[96]: 
> 
> 1000 nulls
> In [97]: %%timeit 
> ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
> ...: w.writer.write_array('x', arr) 
> ...: w.writer.close() 
> 3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> {code}
> So writing a NullArray of the same length takes ca. 100x more time than an 
> all-null array with an integer type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6529) [C++] Feather: slow writing of NullArray

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6529:
---

Assignee: (was: Wes McKinney)

> [C++] Feather: slow writing of NullArray
> 
>
> Key: ARROW-6529
> URL: https://issues.apache.org/jira/browse/ARROW-6529
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: feather
> Fix For: 1.0.0
>
>
> From 
> https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none
> A smaller example using just pyarrow: it seems that writing an array of 
> nulls takes much longer than, for example, an array of ints, which seems a bit 
> strange:
> {code}
> In [93]: arr = pa.array([None]*1000, type='int64')
> In [94]: %%timeit 
> ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
> ...: w.writer.write_array('x', arr) 
> ...: w.writer.close() 
> 31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)
> In [95]: arr = pa.array([None]*1000)  
> In [96]: arr
> Out[96]: 
> 
> 1000 nulls
> In [97]: %%timeit 
> ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
> ...: w.writer.write_array('x', arr) 
> ...: w.writer.close() 
> 3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> {code}
> So writing a NullArray of the same length takes ca. 100x more time than an 
> all-null array with an integer type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6899:
---

Assignee: Wes McKinney

> [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
> -
>
> Key: ARROW-6899
> URL: https://issues.apache.org/jira/browse/ARROW-6899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.15.0
>Reporter: Razvan Chitu
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: encoded.arrow
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi,
> {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data 
> vector is of type "dictionary encoded string". Here is the table schema as 
> printed by pyarrow:
> {code:java}
> pyarrow.Table
> encodedList: list<$data$: dictionary 
> not null> not null
>   child 0, $data$: dictionary not 
> null
> metadata
> 
> OrderedDict() {code}
> and the data (also attached in a file to this ticket)
> {code:java}
> 
> [
>   [
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 1,
> 2
>   ],
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 3
>   ]
>   ]
> ] {code}
> and the exception I got
> {code:java}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 df.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi
>  in pyarrow.lib._PandasConvertible.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table._to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in table_to_blockmanager(options, table, categories, ignore_metadata)
> 700 
> 701 _check_data_column_metadata_consistency(all_columns)
> --> 702 blocks = _table_to_blocks(options, table, categories)
> 703 columns = _deserialize_column_index(table, all_columns, 
> column_indexes)
> 704 
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in _table_to_blocks(options, block_table, categories)
> 972 
> 973 # Convert an arrow table to Block from the internal pandas API
> --> 974 result = pa.lib.table_to_blocks(options, block_table, categories)
> 975 
> 976 # Defined above
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.table_to_blocks()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> dictionary {code}
> Note that the data vector itself can be loaded successfully by to_pandas.
> It'd be great if this would be addressed in the next version of pyarrow. For 
> now, is there anything I can do on my end to bypass this unimplemented 
> conversion?
> Thanks,
> Razvan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6899:

Fix Version/s: (was: 1.0.0)
   0.16.0

> [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
> -
>
> Key: ARROW-6899
> URL: https://issues.apache.org/jira/browse/ARROW-6899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.15.0
>Reporter: Razvan Chitu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: encoded.arrow
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi,
> {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data 
> vector is of type "dictionary encoded string". Here is the table schema as 
> printed by pyarrow:
> {code:java}
> pyarrow.Table
> encodedList: list<$data$: dictionary 
> not null> not null
>   child 0, $data$: dictionary not 
> null
> metadata
> 
> OrderedDict() {code}
> and the data (also attached in a file to this ticket)
> {code:java}
> 
> [
>   [
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 1,
> 2
>   ],
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 3
>   ]
>   ]
> ] {code}
> and the exception I got
> {code:java}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 df.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi
>  in pyarrow.lib._PandasConvertible.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table._to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in table_to_blockmanager(options, table, categories, ignore_metadata)
> 700 
> 701 _check_data_column_metadata_consistency(all_columns)
> --> 702 blocks = _table_to_blocks(options, table, categories)
> 703 columns = _deserialize_column_index(table, all_columns, 
> column_indexes)
> 704 
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in _table_to_blocks(options, block_table, categories)
> 972 
> 973 # Convert an arrow table to Block from the internal pandas API
> --> 974 result = pa.lib.table_to_blocks(options, block_table, categories)
> 975 
> 976 # Defined above
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.table_to_blocks()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> dictionary {code}
> Note that the data vector itself can be loaded successfully by to_pandas.
> It'd be great if this would be addressed in the next version of pyarrow. For 
> now, is there anything I can do on my end to bypass this unimplemented 
> conversion?
> Thanks,
> Razvan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6899) [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6899:
--
Labels: pull-request-available  (was: )

> [Python] to_pandas() not implemented on list<dictionary<values=string, indices=int32>>
> -
>
> Key: ARROW-6899
> URL: https://issues.apache.org/jira/browse/ARROW-6899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.15.0
>Reporter: Razvan Chitu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: encoded.arrow
>
>
> Hi,
> {{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data 
> vector is of type "dictionary encoded string". Here is the table schema as 
> printed by pyarrow:
> {code:java}
> pyarrow.Table
> encodedList: list<$data$: dictionary 
> not null> not null
>   child 0, $data$: dictionary not 
> null
> metadata
> 
> OrderedDict() {code}
> and the data (also attached in a file to this ticket)
> {code:java}
> 
> [
>   [
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 1,
> 2
>   ],
> -- dictionary:
>   [
> "a",
> "b",
> "c",
> "d"
>   ]
> -- indices:
>   [
> 0,
> 3
>   ]
>   ]
> ] {code}
> and the exception I got
> {code:java}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 df.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi
>  in pyarrow.lib._PandasConvertible.to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table._to_pandas()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in table_to_blockmanager(options, table, categories, ignore_metadata)
> 700 
> 701 _check_data_column_metadata_consistency(all_columns)
> --> 702 blocks = _table_to_blocks(options, table, categories)
> 703 columns = _deserialize_column_index(table, all_columns, 
> column_indexes)
> 704 
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
>  in _table_to_blocks(options, block_table, categories)
> 972 
> 973 # Convert an arrow table to Block from the internal pandas API
> --> 974 result = pa.lib.table_to_blocks(options, block_table, categories)
> 975 
> 976 # Defined above
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.table_to_blocks()
> ~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> dictionary {code}
> Note that the data vector itself can be loaded successfully by to_pandas.
> It'd be great if this would be addressed in the next version of pyarrow. For 
> now, is there anything I can do on my end to bypass this unimplemented 
> conversion?
> Thanks,
> Razvan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3789) [Python] Enable calling object in Table.to_pandas to "self-destruct" for improved memory use

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3789.
-
Resolution: Fixed

Issue resolved by pull request 6067
[https://github.com/apache/arrow/pull/6067]

> [Python] Enable calling object in Table.to_pandas to "self-destruct" for 
> improved memory use
> 
>
> Key: ARROW-3789
> URL: https://issues.apache.org/jira/browse/ARROW-3789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> One issue with using {{Table.to_pandas}} is that it results in a memory 
> doubling (at least, more if there are a lot of Python objects created). It 
> would be useful if there was an option to destroy the {{arrow::Column}} 
> references once they've been transferred into the target data frame. This 
> would render the {{pyarrow.Table}} object useless afterward
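
For reference, a minimal usage sketch of such an option, assuming it is exposed as a {{self_destruct}} keyword on {{to_pandas}} (paired here with {{split_blocks}}); the exact option names should be checked against the merged API:

{code:python}
import pyarrow as pa

table = pa.table({"x": list(range(1_000_000))})

# Release each Arrow column as soon as it has been converted, so peak
# memory stays near 1x instead of 2x. The Table must not be used afterwards.
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table
{code}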



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata

2020-01-14 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015493#comment-17015493
 ] 

Neal Richardson commented on ARROW-7063:


IMO the extra metadata should be accessible by some property/method on the 
schema, so you can get at it and view it and parse it or whatever you want, but 
it doesn't belong in the simple pretty print method.

I'm fine with writing it my way in R (i.e. schema print only prints its fields, 
assuming I can iterate over the Fields in a Schema and print each), and if 
y'all like how that looks, we can consider making that the C++ behavior.
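
A small pyarrow sketch of the split being proposed, for illustration only (the ticket itself concerns the C++ schema printer; the field names and metadata blob below are made up):

{code:python}
import pyarrow as pa

schema = pa.schema(
    [pa.field("vendor_id", pa.string()), pa.field("fare_amount", pa.float32())],
    metadata={b"pandas": b"{...large JSON blob elided...}"},
)

# Pretty printing could show just the fields...
for field in schema:
    print(f"{field.name}: {field.type}")

# ...while the key/value metadata stays reachable on demand.
print(schema.metadata)
{code}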

> [C++] Schema print method prints too much metadata
> --
>
> Key: ARROW-7063
> URL: https://issues.apache.org/jira/browse/ARROW-7063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: dataset, parquet
> Fix For: 0.16.0
>
>
> I loaded some taxi data in a Dataset and printed the schema. This is what was 
> printed:
> {code}
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> pickup_longitude: float
> pickup_latitude: float
> rate_code_id: null
> store_and_fwd_flag: string
> dropoff_longitude: float
> dropoff_latitude: float
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> total_amount: float
> -- metadata --
> pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, 
> "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, 
> "field_name": null, "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", 
> "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", 
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, 
> {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", 
> "numpy_type": "datetime64[ns]", "metadata": null}, {"name": 
> "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", 
> "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", 
> "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": 
> "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "pickup_latitude", "field_name": 
> "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", 
> "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": 
> "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null}, {"name": 
> "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "payment_type", "field_name": "payment_type", "pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null}, {"name": "fare_amount", 
> "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "extra", "field_name": "extra", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, 
> {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", 
> "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", 
> "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "tolls_amount", "field_name": 
> "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "total_amount", "field_name": "total_amount", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], 
> "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": 
> "0.25.3"}
> ARROW:schema: 
> 

[jira] [Resolved] (ARROW-7575) [R] Linux binary packaging followup

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7575.

Resolution: Fixed

Issue resolved by pull request 6194
[https://github.com/apache/arrow/pull/6194]

> [R] Linux binary packaging followup
> ---
>
> Key: ARROW-7575
> URL: https://issues.apache.org/jira/browse/ARROW-7575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> After ARROW-6793 merged, I set up some nightly binary building CI and need to 
> iterate on the install script and documentation to reflect what is available 
> there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7582) [Rust][Flight] Unable to compile arrow.flight.protocol.rs

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7582:
--
Labels: pull-request-available  (was: )

> [Rust][Flight] Unable to compile arrow.flight.protocol.rs
> -
>
> Key: ARROW-7582
> URL: https://issues.apache.org/jira/browse/ARROW-7582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> Not sure exactly why, perhaps it has something to do with the recently 
> updated dependencies: https://github.com/apache/arrow/runs/389937707
> cc [~andygrove] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7582) [Rust][Flight] Unable to compile arrow.flight.protocol.rs

2020-01-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7582:
--

 Summary: [Rust][Flight] Unable to compile arrow.flight.protocol.rs
 Key: ARROW-7582
 URL: https://issues.apache.org/jira/browse/ARROW-7582
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Krisztian Szucs


Not sure exactly why, perhaps it has something to do with the recently updated 
dependencies: https://github.com/apache/arrow/runs/389937707

cc [~andygrove] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7243) [Docs] Add common "implementation status" table to the README of each native language implementation, as well as top level README

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7243:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Docs] Add common "implementation status" table to the README of each native 
> language implementation, as well as top level README
> -
>
> Key: ARROW-7243
> URL: https://issues.apache.org/jira/browse/ARROW-7243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This will help us accurately set user expectations about what level of 
> functional / testing completeness each native Arrow library has, per mailing 
> list discussion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4412) [DOCUMENTATION] Add explicit version numbers to the arrow specification documents.

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-4412:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [DOCUMENTATION] Add explicit version numbers to the arrow specification 
> documents.
> --
>
> Key: ARROW-4412
> URL: https://issues.apache.org/jira/browse/ARROW-4412
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Micah Kornfield
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2019-10-10-15-10-53-261.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Based on conversation on the mailing list it might pay to include 
> version/revision numbers on the specification document.  One way is to 
> include the "release" version, another might be to only update versioning on 
> changes to the document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6164) [Docs][Format] Document project versioning schema and forward/backward compatibility policies

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6164.
-
Fix Version/s: (was: 0.16.0)
   0.15.0
   Resolution: Fixed

Yes, this was done in 0.15.0

> [Docs][Format] Document project versioning schema and forward/backward 
> compatibility policies
> -
>
> Key: ARROW-6164
> URL: https://issues.apache.org/jira/browse/ARROW-6164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Based on policy adopted via vote on mailing list



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6829) [Docs] Migrate integration test docs to Sphinx, fix instructions after ARROW-6466

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6829:
---
  Component/s: Integration
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Docs] Migrate integration test docs to Sphinx, fix instructions after 
> ARROW-6466
> -
>
> Key: ARROW-6829
> URL: https://issues.apache.org/jira/browse/ARROW-6829
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-6466.
> Also, the readme uses out of date archery flags
> https://github.com/apache/arrow/blob/master/integration/README.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6062) [FlightRPC] Allow timeouts on all stream reads

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6062:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [FlightRPC] Allow timeouts on all stream reads
> --
>
> Key: ARROW-6062
> URL: https://issues.apache.org/jira/browse/ARROW-6062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> Anywhere where we offer reading from a stream in Flight, we need to offer a 
> timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5405) [Documentation] Move integration testing documentation to Sphinx docs, add instructions for JavaScript

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5405:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Documentation] Move integration testing documentation to Sphinx docs, add 
> instructions for JavaScript
> --
>
> Key: ARROW-5405
> URL: https://issues.apache.org/jira/browse/ARROW-5405
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I noticed that JavaScript information is not in integration/README.md. It 
> would be a good opportunity to migrate this over to the 
> docs/source/developers directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6602) [Doc] Add feature / implementation matrix

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6602:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Doc] Add feature / implementation matrix
> -
>
> Key: ARROW-6602
> URL: https://issues.apache.org/jira/browse/ARROW-6602
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> We have many different implementations and each implementation makes a 
> different set of features available. It would be nice to have a top-level doc 
> page making it clear which implementation supports what.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6456:

Fix Version/s: (was: 1.0.0)

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 
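
For illustration only, here is a minimal sketch (standalone C++, not Arrow's actual
take implementation) of one common way to shrink generated code: instantiate the
hot copy loop per physical byte width and map logical types onto those widths at
runtime, so that many logical types share a single instantiation.

{code:cpp}
#include <cstdint>
#include <cstring>
#include <stdexcept>

// One instantiation per byte width, shared by all logical types of that width.
template <int kByteWidth>
void TakeFixedWidth(const uint8_t* values, const int32_t* indices,
                    int64_t n_indices, uint8_t* out) {
  for (int64_t i = 0; i < n_indices; ++i) {
    std::memcpy(out + i * kByteWidth,
                values + static_cast<int64_t>(indices[i]) * kByteWidth,
                kByteWidth);
  }
}

// Runtime dispatch: int32, float32, date32, ... all route to TakeFixedWidth<4>.
void TakeDispatch(int byte_width, const uint8_t* values, const int32_t* indices,
                  int64_t n_indices, uint8_t* out) {
  switch (byte_width) {
    case 1: return TakeFixedWidth<1>(values, indices, n_indices, out);
    case 2: return TakeFixedWidth<2>(values, indices, n_indices, out);
    case 4: return TakeFixedWidth<4>(values, indices, n_indices, out);
    case 8: return TakeFixedWidth<8>(values, indices, n_indices, out);
    default: throw std::runtime_error("unsupported byte width");
  }
}
{code}

Whether this particular technique applies to take.cc is an open question; the
sketch only shows the general trade-off (fewer instantiations, same inner loop).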



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6165:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Integration] Use multiprocessing to run integration tests on multiple CPU 
> cores
> 
>
> Key: ARROW-6165
> URL: https://issues.apache.org/jira/browse/ARROW-6165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The stdout/stderr will have to be captured appropriately so that the console 
> output is still readable when the tests run in parallel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6075) [FlightRPC] Handle uncaught exceptions in middleware

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6075:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [FlightRPC] Handle uncaught exceptions in middleware
> 
>
> Key: ARROW-6075
> URL: https://issues.apache.org/jira/browse/ARROW-6075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> For some discussion on the java side see 
> [https://github.com/apache/arrow/pull/4916]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6507) [C++] Add ExtensionArray::ExtensionValidate for custom validation?

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6507:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Add ExtensionArray::ExtensionValidate for custom validation?
> --
>
> Key: ARROW-6507
> URL: https://issues.apache.org/jira/browse/ARROW-6507
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> From discussing ARROW-6506, [~bkietz] said: an extension type might place 
> more constraints on an array than those implicit in its storage type, and 
> users will probably expect to be able to plug those into {{Validate}}.
> So we could have a {{ExtensionArray::ExtensionValidate}} that the visitor for 
> {{ExtensionArray}} can call, similarly like there is also an 
> {{ExtensionType::ExtensionEquals}} that the visitor calls when extension 
> types are checked for equality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6407) [C++] Stop duplicating dependency URLs between thirdparty/versions.txt and ThirdpartyToolchain.cmake

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6407:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Stop duplicating dependency URLs between thirdparty/versions.txt and 
> ThirdpartyToolchain.cmake
> 
>
> Key: ARROW-6407
> URL: https://issues.apache.org/jira/browse/ARROW-6407
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This will prevent issues like ARROW-6406



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6164) [Docs][Format] Document project versioning schema and forward/backward compatibility policies

2020-01-14 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015464#comment-17015464
 ] 

Neal Richardson commented on ARROW-6164:


This is done, right? 
https://github.com/apache/arrow/blob/master/docs/source/format/Versioning.rst 
[~wesm]

> [Docs][Format] Document project versioning schema and forward/backward 
> compatibility policies
> -
>
> Key: ARROW-6164
> URL: https://issues.apache.org/jira/browse/ARROW-6164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Based on policy adopted via vote on mailing list



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4096) [C++] Preserve "ordered" metadata in some special cases in dictionary unification

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-4096:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Preserve "ordered" metadata in some special cases in dictionary 
> unification
> -
>
> Key: ARROW-4096
> URL: https://issues.apache.org/jira/browse/ARROW-4096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In the event that all dictionaries are prefixes of a common dictionary, and 
> all have ordered=true (note: this is not the same thing as being sorted), the 
> resulting unified dictionary can also have ordered=true
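
As a small illustration of the special case above (standalone C++, not Arrow's
unification code): if every ordered input dictionary is an exact prefix of the
unified dictionary, no indices need remapping, so ordered=true can be carried over.

{code:cpp}
#include <algorithm>
#include <string>
#include <vector>

// True if `prefix` is an exact prefix of `full` (same values, same order).
bool IsPrefixOf(const std::vector<std::string>& prefix,
                const std::vector<std::string>& full) {
  return prefix.size() <= full.size() &&
         std::equal(prefix.begin(), prefix.end(), full.begin());
}

// Assuming all inputs had ordered=true, the unified dictionary may keep it
// only when every input passes the prefix check.
bool CanPreserveOrdered(const std::vector<std::vector<std::string>>& inputs,
                        const std::vector<std::string>& unified) {
  for (const auto& dict : inputs) {
    if (!IsPrefixOf(dict, unified)) return false;
  }
  return true;
}
{code}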



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5563) [Format] Update integration test JSON format documentation in Metadata.rst

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5563:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Format] Update integration test JSON format documentation in Metadata.rst
> --
>
> Key: ARROW-5563
> URL: https://issues.apache.org/jira/browse/ARROW-5563
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This has slipped behind what is in the integration tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5936) [C++] [FlightRPC] user_metadata is not present in fields read from flight

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5936:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] [FlightRPC] user_metadata is not present in fields read from flight
> -
>
> Key: ARROW-5936
> URL: https://issues.apache.org/jira/browse/ARROW-5936
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> Should this go in the arrow::Field::metadata property somewhere? Does 
> user_metadata round trip through some other channel?
> https://github.com/apache/arrow/pull/4841#discussion_r302623241



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5082) [Python][Packaging] Reduce size of macOS and manylinux1 wheels

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5082:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Python][Packaging] Reduce size of macOS and manylinux1 wheels
> --
>
> Key: ARROW-5082
> URL: https://issues.apache.org/jira/browse/ARROW-5082
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available, wheel
> Fix For: 1.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The wheels more than tripled in size from 0.12.0 to 0.13.0. I think this is 
> mostly because of LLVM but we should take a closer look to see if the size 
> can be reduced



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5143) [Flight] Enable integration testing of batches with dictionaries

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5143:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Flight] Enable integration testing of batches with dictionaries
> 
>
> Key: ARROW-5143
> URL: https://issues.apache.org/jira/browse/ARROW-5143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Integration
>Reporter: David Li
>Priority: Major
>  Labels: flight
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4484) [Java] improve Flight DoPut busy wait

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-4484:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] improve Flight DoPut busy wait
> -
>
> Key: ARROW-4484
> URL: https://issues.apache.org/jira/browse/ARROW-4484
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: David Li
>Priority: Major
>  Labels: flight
> Fix For: 1.0.0
>
>
> Currently the implementation of putNext in FlightClient.java busy-waits until 
> gRPC indicates that the server can receive a message. We should either 
> improve the busy-wait (e.g. add sleep times), or rethink the API and make it 
> non-blocking.
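
The issue concerns the Java client, but as a language-neutral sketch of the
"add sleep times" option, a bounded exponential backoff could replace the pure
busy-wait. The predicate stands in for "gRPC reports the stream as writable";
the names and constants are illustrative only.

{code:cpp}
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Wait for is_ready() to become true, sleeping with capped exponential backoff
// instead of spinning; returns false if the deadline is reached first.
bool WaitUntilReady(const std::function<bool()>& is_ready,
                    std::chrono::milliseconds max_total_wait) {
  auto delay = std::chrono::microseconds(50);
  const auto deadline = std::chrono::steady_clock::now() + max_total_wait;
  while (!is_ready()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, std::chrono::microseconds(5000));  // cap the backoff
  }
  return true;
}
{code}

A non-blocking API (the other option mentioned) would avoid the wait entirely by
surfacing readiness to the caller instead.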



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6399) [C++] More extensive attributes usage could improve debugging

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6399:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] More extensive attributes usage could improve debugging
> -
>
> Key: ARROW-6399
> URL: https://issues.apache.org/jira/browse/ARROW-6399
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> Wrapping raw or smart pointer parameters and other declarations with 
> {{gsl::not_null}} will assert they are not null. The check is dropped for 
> release builds.
> Status is tagged with ARROW_MUST_USE_RESULT, which emits warnings (when 
> compiling with clang) if a Status might be ignored; Result<> should probably 
> be tagged with this too.
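
A brief sketch of both ideas, assuming the Microsoft GSL is available;
{{MyResult}} is a stand-in for Result<>, and the example otherwise uses only
standard C++17.

{code:cpp}
#include <gsl/pointers>  // assumes the Microsoft GSL is on the include path

// [[nodiscard]] is the standard spelling of the "must use result" idea.
struct [[nodiscard]] MyResult {
  bool ok;
};

// not_null documents the precondition at the call boundary and checks it
// (how strictly depends on the GSL enforcement configuration).
MyResult DoWork(gsl::not_null<int*> out) {
  *out = 42;  // no null check needed in the body
  return {true};
}

int main() {
  int x = 0;
  DoWork(&x);  // compilers that honor [[nodiscard]] warn that the result is ignored
  return x == 42 ? 0 : 1;
}
{code}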



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6033) [C++] Provide an initialization and/or compatibility check function

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6033:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Provide an initialization and/or compatibility check function
> ---
>
> Key: ARROW-6033
> URL: https://issues.apache.org/jira/browse/ARROW-6033
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> Some Arrow functions will fail if e.g. the CPU doesn't have the right 
> instruction set extensions (e.g. POPCNT on x86 - see ARROW-5381). We may want 
> to provide a global function that checks requirements and/or otherwise 
> initializes Arrow internal structures (the single one I can think of is 
> `InitializeUTF8()` in `util/utf8.h`).
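
A hedged sketch of what such an entry point could look like; the function name
and the "empty string means OK" convention are assumptions for illustration,
not an existing Arrow API. The CPU probe uses the GCC/Clang builtin
__builtin_cpu_supports.

{code:cpp}
#include <string>

std::string CheckArrowRuntimeRequirements() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  // Reject CPUs lacking an instruction set extension some kernels rely on.
  if (!__builtin_cpu_supports("popcnt")) {
    return "CPU lacks POPCNT, which some Arrow kernels require";
  }
#endif
  // One-time initialization (e.g. the UTF-8 tables mentioned above) could
  // also be triggered here.
  return "";  // all requirements satisfied
}
{code}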



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6479) [C++] inline errors from external projects' build logs

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6479:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

This would be nice; one alternative is to turn on 
ARROW_VERBOSE_THIRDPARTY_BUILD when you have a failure (though it's usually too 
noisy to leave on all the time).

> [C++] inline errors from external projects' build logs
> --
>
> Key: ARROW-6479
> URL: https://issues.apache.org/jira/browse/ARROW-6479
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> Currently when an external project build fails, we get a very uninformative 
> message:
> {code}
> [88/543] Performing build step for 'flatbuffers_ep'
> FAILED: flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build 
> flatbuffers_ep-prefix/src/flatbuffers_ep-install/bin/flatc 
> flatbuffers_ep-prefix/src/flatbuffers_ep-install/lib/libflatbuffers.a 
> cd /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-build && 
> /usr/bin/cmake -P 
> /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-DEBUG.cmake
>  && /usr/bin/cmake -E touch 
> /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build
> CMake Error at 
> /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-DEBUG.cmake:16
>  (message):
>   Command failed: 1
>'/usr/bin/cmake' '--build' '.'
>   See also
> 
> /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-*.log
> {code}
> It would be far more useful if the error were caught and the relevant section 
> (or even the entirety) of {{ 
> /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-*.log}}
>  were output instead. This is doubly the case on CI, where accessing those 
> logs is non-trivial.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6456:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6872) [C++][Python] Empty table with dictionary-columns raises ArrowNotImplementedError

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6872:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++][Python] Empty table with dictionary-columns raises 
> ArrowNotImplementedError
> -
>
> Key: ARROW-6872
> URL: https://issues.apache.org/jira/browse/ARROW-6872
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Marco Neumann
>Priority: Minor
> Fix For: 1.0.0
>
>
> h2. Abstract
> As a pyarrow user, I would expect that I can create an empty table out of 
> every schema that I created via pandas. This does not work for dictionary 
> types (e.g. {{"category"}} dtypes).
> h2. Test Case
> This code:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
> table = pa.Table.from_pandas(df)
> schema = table.schema
> table_empty = schema.empty_table()  # boom
> {code}
> produces this exception:
> {noformat}
> Traceback (most recent call last):
>   File "arrow_bug.py", line 8, in <module>
> table_empty = schema.empty_table()
>   File "pyarrow/types.pxi", line 860, in __iter__
>   File "pyarrow/array.pxi", line 211, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7256) [C++] Remove ARROW_MEMORY_POOL_DEFAULT option

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7256:
---
Summary: [C++] Remove ARROW_MEMORY_POOL_DEFAULT option  (was: [C++] Remove 
ARROW_DEFAULT_MEMORY_POOL option)

> [C++] Remove ARROW_MEMORY_POOL_DEFAULT option
> -
>
> Key: ARROW-7256
> URL: https://issues.apache.org/jira/browse/ARROW-7256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> As mentioned in another JIRA I recall, we aren't adequately testing the 
> CMake option for "no default memory pool", so it would be better either to 
> require explicit memory pools or to always pass the default, rather than 
> having a build-time option to control whether a default is passed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6858) [C++] Create Python script to handle transitive component dependencies

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6858:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Create Python script to handle transitive component dependencies
> --
>
> Key: ARROW-6858
> URL: https://issues.apache.org/jira/browse/ARROW-6858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In the C++ build system, we are handling relationships between optional 
> components in an ad hoc fashion
> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L266
> This doesn't seem ideal. 
> As discussed on the mailing list, I suggest declaring dependencies in a 
> Python data structure and then generating and checking in a .cmake file that 
> can be {{include}}d. This will be a bit easier than maintaining it on an ad 
> hoc basis. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6941) [C++] Unpin gtest in build environment

2020-01-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6941:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Unpin gtest in build environment
> --
>
> Key: ARROW-6941
> URL: https://issues.apache.org/jira/browse/ARROW-6941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to failure triaged in ARROW-6834



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7581) [R] Documentation/polishing for 0.16 release

2020-01-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7581:
--

 Summary: [R] Documentation/polishing for 0.16 release
 Key: ARROW-7581
 URL: https://issues.apache.org/jira/browse/ARROW-7581
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.16.0


Includes updating NEWS.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7510) [C++] Array::null_count() is not thread-compatible

2020-01-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7510.
---
Fix Version/s: (was: 1.0.0)
   0.16.0
   Resolution: Fixed

Issue resolved by pull request 6184
[https://github.com/apache/arrow/pull/6184]

> [C++] Array::null_count() is not thread-compatible
> --
>
> Key: ARROW-7510
> URL: https://issues.apache.org/jira/browse/ARROW-7510
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Zhuo Peng
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ArrayData has a mutable member null_count that can be updated in a const 
> function. However, null_count is not atomic, so it is subject to a data race.
>  
> I guess Arrays are not thread-safe (which is reasonable), but at least they 
> should be thread-compatible, so that concurrent access to const member 
> functions is fine.
> (The race looks "benign", but see [1][2])
> [https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/cpp/src/arrow/array.cc#L123]
>  
> [1][https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong]
> [2][https://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/]
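
A minimal sketch of the thread-compatibility fix described here, with assumed
names rather than Arrow's real classes: cache the lazily computed count in an
atomic so that concurrent const readers race only on atomic loads and stores.

{code:cpp}
#include <atomic>
#include <cstdint>

class CachedNullCount {
 public:
  int64_t null_count() const {
    int64_t cached = null_count_.load(std::memory_order_relaxed);
    if (cached == kUnknown) {
      // The computation is idempotent, so two threads racing here at worst
      // compute the same value twice.
      cached = ComputeNullCount();
      null_count_.store(cached, std::memory_order_relaxed);
    }
    return cached;
  }

 private:
  static constexpr int64_t kUnknown = -1;
  int64_t ComputeNullCount() const { return 0; }  // placeholder: popcount the validity bitmap
  mutable std::atomic<int64_t> null_count_{kUnknown};
};
{code}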



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7573) [Rust] Reduce boxing and cleanup

2020-01-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7573.

Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6192
[https://github.com/apache/arrow/pull/6192]

> [Rust] Reduce boxing and cleanup
> 
>
> Key: ARROW-7573
> URL: https://issues.apache.org/jira/browse/ARROW-7573
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Gurwinder Singh
>Assignee: Gurwinder Singh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7580) [Website] 0.16 release post

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7580:
--
Labels: pull-request-available  (was: )

> [Website] 0.16 release post
> ---
>
> Key: ARROW-7580
> URL: https://issues.apache.org/jira/browse/ARROW-7580
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7580) [Website] 0.16 release post

2020-01-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7580:
--

 Summary: [Website] 0.16 release post
 Key: ARROW-7580
 URL: https://issues.apache.org/jira/browse/ARROW-7580
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7579) [FlightRPC] Make Handshake optional

2020-01-14 Thread David Li (Jira)
David Li created ARROW-7579:
---

 Summary: [FlightRPC] Make Handshake optional
 Key: ARROW-7579
 URL: https://issues.apache.org/jira/browse/ARROW-7579
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC
Reporter: David Li
 Fix For: 1.0.0


We should make it possible to _not_ invoke Handshake for services that don't 
want it. Especially when using it with flight-grpc, where the standard gRPC 
authentication mechanisms don't know about Flight and try to authenticate the 
Handshake endpoint - it's easy to forget to configure this endpoint to bypass 
authentication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-7059.
-
Resolution: Fixed

Issue resolved by pull request 6181
[https://github.com/apache/arrow/pull/6181]

> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, performance, pull-request-available
> Fix For: 0.16.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test used in ARROW-6876, 
> except I set {{use_threads=False}} to make for an apples-to-apples comparison 
> with respect to # of CPUs.
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'c' + str(i): np.random.randn(10) for i in range(1)})
> pq.write_table(table, "test_wide.parquet")
> res = pq.read_table("test_wide.parquet")
> print(pa.__version__)
> %time res = pq.read_table("test_wide.parquet", use_threads=False)
> {code}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> **
> *In 0.15.1 with* *use_threads=False**:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-01-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6895:
-

Assignee: Francois Saint-Jacques  (was: Wes McKinney)

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ---
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.16.0
>
> Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet
>
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
> int n = std::min(100, nRowsRemaining);
> std::shared_ptr<arrow::ChunkedArray> chunkedArray;
> auto status = columnReader->NextBatch(n, &chunkedArray);
> // ... and then use `chunkedArray`
> nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7545) [C++] [Dataset] Scanning dataset with dictionary type hangs

2020-01-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7545.
---
Resolution: Duplicate

> [C++] [Dataset] Scanning dataset with dictionary type hangs
> ---
>
> Key: ARROW-7545
> URL: https://issues.apache.org/jira/browse/ARROW-7545
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 0.16.0
>
>
> I assume it is an issue on the C++ side of the datasets code, but the 
> reproducer is in Python. 
> I create a small parquet file with a single column of dictionary type. 
> Reading it with {{pq.read_table}} works fine, reading it with the datasets 
> machinery hangs when scanning:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
> arrow_table = pa.Table.from_pandas(df)
> filename = "test.parquet"
> pq.write_table(arrow_table, filename)
> from pyarrow.fs import LocalFileSystem
> from pyarrow.dataset import ParquetFileFormat, Dataset, 
> FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions
> filesystem = LocalFileSystem()
> format = ParquetFileFormat()
> options = FileSystemDiscoveryOptions()
> discovery = FileSystemDataSourceDiscovery(
> filesystem, [filename], format, options)
> inspected_schema = discovery.inspect()
> dataset = Dataset([discovery.finish()], inspected_schema)
> # dataset.schema works fine and gives correct schema
> dataset.schema
> scanner_builder = dataset.new_scan()
> scanner = scanner_builder.finish()
> # this hangs
> scanner.to_table()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-7059:
---

Assignee: Wes McKinney

> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, performance, pull-request-available
> Fix For: 0.16.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test used in ARROW-6876, 
> except I set {{use_threads=False}} to make for an apples-to-apples comparison 
> with respect to # of CPUs.
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'c' + str(i): np.random.randn(10) for i in range(1)})
> pq.write_table(table, "test_wide.parquet")
> res = pq.read_table("test_wide.parquet")
> print(pa.__version__)
> %time res = pq.read_table("test_wide.parquet", use_threads=False)
> {code}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> **
> *In 0.15.1 with* *use_threads=False**:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources

2020-01-14 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015359#comment-17015359
 ] 

Neal Richardson commented on ARROW-7578:


Thanks for the heads up. That's not what this issue is about, but we'll keep it 
in mind.

> [R] Add support for datasets with IPC files and with multiple sources
> -
>
> Key: ARROW-7578
> URL: https://issues.apache.org/jira/browse/ARROW-7578
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.16.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources

2020-01-14 Thread Tyler Brown (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015346#comment-17015346
 ] 

Tyler Brown commented on ARROW-7578:


For SQL dump file parsing, you may want to reference: 
[https://github.com/hyrise/sql-parser]

> [R] Add support for datasets with IPC files and with multiple sources
> -
>
> Key: ARROW-7578
> URL: https://issues.apache.org/jira/browse/ARROW-7578
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.16.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7578) [R] Add support for datasets with IPC files and with multiple sources

2020-01-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7578:
--

 Summary: [R] Add support for datasets with IPC files and with 
multiple sources
 Key: ARROW-7578
 URL: https://issues.apache.org/jira/browse/ARROW-7578
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.16.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7532) [CI] Unskip brew test after Homebrew fixes it upstream

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7532:
--
Labels: pull-request-available  (was: )

> [CI] Unskip brew test after Homebrew fixes it upstream
> --
>
> Key: ARROW-7532
> URL: https://issues.apache.org/jira/browse/ARROW-7532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> Followup to ARROW-7492. See https://github.com/Homebrew/brew/issues/6908.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7281) [C++] AdaptiveIntBuilder::length() does not consider pending_pos_.

2020-01-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7281.
---
Resolution: Fixed

Issue resolved by pull request 6174
[https://github.com/apache/arrow/pull/6174]

> [C++] AdaptiveIntBuilder::length() does not consider pending_pos_.
> --
>
> Key: ARROW-7281
> URL: https://issues.apache.org/jira/browse/ARROW-7281
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Adam Hooper
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {code:c++}
> arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
> builder.Append(1);
> std::cout << builder.length() << std::endl;
> {code}
> Expected output: {{1}}
> Actual output: {{0}}
> I imagine this regression came with https://github.com/apache/arrow/pull/3040
> My use case: I'm building a JSON parser that appends "records" (JSON Objects 
> mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all 
> JSON Objects contain all keys; so {{builder.Append()}} isn't always called. 
> So on a subsequent row, I want to add nulls for every append that was 
> skipped: {{builder.AppendNulls(row - builder.length()); 
> builder.Append(value)}}. This fails because {{builder.length()}} is wrong.
> Annoying but simple workaround: I maintain a separate {{length}} value 
> alongside {{builder}}.
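
A sketch of that workaround (tracking the length alongside the builder),
assuming the Arrow C++ headers are available and the 0.15.x behaviour reported
above.

{code:cpp}
#include <cstdint>
#include <iostream>

#include <arrow/builder.h>
#include <arrow/memory_pool.h>

int main() {
  arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
  int64_t appended = 0;  // tracked manually instead of trusting builder.length()

  auto append_at_row = [&](int64_t row, int64_t value) {
    if (row > appended) {
      (void)builder.AppendNulls(row - appended);  // pad rows that had no value
      appended = row;
    }
    (void)builder.Append(value);
    ++appended;
  };

  append_at_row(0, 1);
  append_at_row(3, 7);  // rows 1 and 2 become null
  std::cout << "tracked length: " << appended << std::endl;  // prints 4
  return 0;
}
{code}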



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7577) [C++][CI] Check fuzzer setup in CI

2020-01-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-7577:
--
Description: 
It is desirable to check that there is no regression for compiling and running 
the fuzz targets and assorted utilities.
Perhaps as a cron job.

  was:Perhaps as a cron job.


> [C++][CI] Check fuzzer setup in CI
> --
>
> Key: ARROW-7577
> URL: https://issues.apache.org/jira/browse/ARROW-7577
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> It is desirable to check that there is no regression for compiling and 
> running the fuzz targets and assorted utilities.
> Perhaps as a cron job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7577) [C++][CI] Check fuzzer setup in CI

2020-01-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7577:
-

 Summary: [C++][CI] Check fuzzer setup in CI
 Key: ARROW-7577
 URL: https://issues.apache.org/jira/browse/ARROW-7577
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


Perhaps as a cron job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7576) [C++][Dev] Improve fuzzing setup

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7576:
--
Labels: pull-request-available  (was: )

> [C++][Dev] Improve fuzzing setup
> 
>
> Key: ARROW-7576
> URL: https://issues.apache.org/jira/browse/ARROW-7576
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7576) [C++][Dev] Improve fuzzing setup

2020-01-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7576:
-

 Summary: [C++][Dev] Improve fuzzing setup
 Key: ARROW-7576
 URL: https://issues.apache.org/jira/browse/ARROW-7576
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++, Developer Tools
Reporter: Antoine Pitrou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7567) [Java] Bump Checkstyle from 6.19 to 8.18

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7567:
---

Assignee: Fokko Driesprong

> [Java] Bump Checkstyle from 6.19 to 8.18
> 
>
> Key: ARROW-7567
> URL: https://issues.apache.org/jira/browse/ARROW-7567
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7568) [Java] Bump Apache Avro from 1.9.0 to 1.9.1

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7568:
---

Assignee: Fokko Driesprong

> [Java] Bump Apache Avro from 1.9.0 to 1.9.1
> ---
>
> Key: ARROW-7568
> URL: https://issues.apache.org/jira/browse/ARROW-7568
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Apache Avro 1.9.1 contains some bugfixes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7570) [Java] Fix high severity issues

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7570:
---

Assignee: Fokko Driesprong

> [Java] Fix high severity issues
> ---
>
> Key: ARROW-7570
> URL: https://issues.apache.org/jira/browse/ARROW-7570
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fixes high severity issues reported by LGTM:
> [https://lgtm.com/projects/g/apache/arrow/?mode=list=java=error]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7570) [Java] Fix high severity issues reported by LGTM

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7570:

Summary: [Java] Fix high severity issues reported by LGTM  (was: [Java] Fix 
high severity issues)

> [Java] Fix high severity issues reported by LGTM
> 
>
> Key: ARROW-7570
> URL: https://issues.apache.org/jira/browse/ARROW-7570
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fixes high severity issues reported by LGTM:
> [https://lgtm.com/projects/g/apache/arrow/?mode=list=java=error]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7575) [R] Linux binary packaging followup

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7575:
--
Labels: pull-request-available  (was: )

> [R] Linux binary packaging followup
> ---
>
> Key: ARROW-7575
> URL: https://issues.apache.org/jira/browse/ARROW-7575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>
> After ARROW-6793 merged, I set up some nightly binary building CI and need to 
> iterate on the install script and documentation to reflect what is available 
> there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7219) [CI][Python] Install pickle5 in the conda-python docker image for python version 3.6

2020-01-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7219.

Resolution: Fixed

Issue resolved by pull request 6183
[https://github.com/apache/arrow/pull/6183]

> [CI][Python] Install pickle5 in the conda-python docker image for python 
> version 3.6
> 
>
> Key: ARROW-7219
> URL: https://issues.apache.org/jira/browse/ARROW-7219
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See conversation 
> https://github.com/apache/arrow/pull/5873#discussion_r348510729



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7575) [R] Linux binary packaging followup

2020-01-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7575:
--

 Summary: [R] Linux binary packaging followup
 Key: ARROW-7575
 URL: https://issues.apache.org/jira/browse/ARROW-7575
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.16.0


After ARROW-6793 merged, I set up some nightly binary building CI and need to 
iterate on the install script and documentation to reflect what is available 
there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5757) [Python] Stop supporting Python 2.7

2020-01-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015226#comment-17015226
 ] 

Wes McKinney commented on ARROW-5757:
-

I'm also +1 for dropping after 0.16.0. We aren't doing ourselves or the users 
any favors by continuing 2.7 support. 

> [Python] Stop supporting Python 2.7
> ---
>
> Key: ARROW-5757
> URL: https://issues.apache.org/jira/browse/ARROW-5757
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> By the end of 2019 many scientific Python projects will stop supporting 
> Python 2 altogether:
> https://python3statement.org/
> We'll certainly support Python 2 in Arrow 1.0 but we could perhaps drop 
> support in 1.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7063) [C++] Schema print method prints too much metadata

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-7063:
---

Assignee: Ben Kietzman

> [C++] Schema print method prints too much metadata
> --
>
> Key: ARROW-7063
> URL: https://issues.apache.org/jira/browse/ARROW-7063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: dataset, parquet
> Fix For: 0.16.0
>
>
> I loaded some taxi data in a Dataset and printed the schema. This is what was 
> printed:
> {code}
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> pickup_longitude: float
> pickup_latitude: float
> rate_code_id: null
> store_and_fwd_flag: string
> dropoff_longitude: float
> dropoff_latitude: float
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> total_amount: float
> -- metadata --
> pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, 
> "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, 
> "field_name": null, "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", 
> "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", 
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, 
> {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", 
> "numpy_type": "datetime64[ns]", "metadata": null}, {"name": 
> "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", 
> "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", 
> "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": 
> "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "pickup_latitude", "field_name": 
> "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", 
> "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": 
> "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null}, {"name": 
> "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "payment_type", "field_name": "payment_type", "pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null}, {"name": "fare_amount", 
> "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "extra", "field_name": "extra", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, 
> {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", 
> "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", 
> "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "tolls_amount", "field_name": 
> "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "total_amount", "field_name": "total_amount", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], 
> "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": 
> "0.25.3"}
> ARROW:schema: 
> 

[jira] [Commented] (ARROW-7555) [Python] Drop support for python 2.7

2020-01-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015224#comment-17015224
 ] 

Wes McKinney commented on ARROW-7555:
-

I think pyarrow is being picked up by a lot of automated container builds on 
GCE. We should ask Apache Beam (which is a proxy in part for Google Cloud 
DataFlow) what their plan is regarding Python 2.7.

> [Python] Drop support for python 2.7
> 
>
> Key: ARROW-7555
> URL: https://issues.apache.org/jira/browse/ARROW-7555
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> After the 0.16 release we should consider dropping support for Python 2.7 
> because it is not maintained anymore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7544) [Python] pyarrow.lib.ArrowNotImplementedError: gRPC returned unimplemented error, with message: clear is not implemented

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-7544.
---
Resolution: Not A Bug

If you'd like to propose adding things to the example Python client/server, 
please open a JIRA issue and describe what you're looking for (PRs welcome!). 

> [Python] pyarrow.lib.ArrowNotImplementedError: gRPC returned unimplemented 
> error, with message: clear is not implemented
> 
>
> Key: ARROW-7544
> URL: https://issues.apache.org/jira/browse/ARROW-7544
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ji Wong Park
>Priority: Minor
>
> /arrow/python/examples/flight$ py client.py do 0.0.0.0:5005 clear
>  
> Running action clear
> Traceback (most recent call last):
>  File "client.py", line 162, in <module>
>  main()
>  File "client.py", line 158, in main
>  commands[args.action](args, client)
>  File "client.py", line 69, in do_action
>  for result in client.do_action(action):
>  File "pyarrow/_flight.pyx", line 1068, in do_action
>  File "pyarrow/_flight.pyx", line 75, in pyarrow._flight.check_flight_status
>  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: gRPC returned unimplemented error, with 
> message: clear is not implemented.. Detail: Python exception: 
> NotImplementedError



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6398) [C++] Consolidate ScanOptions and ScanContext

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman closed ARROW-6398.
---
Resolution: Won't Fix

> [C++] Consolidate ScanOptions and ScanContext
> -
>
> Key: ARROW-6398
> URL: https://issues.apache.org/jira/browse/ARROW-6398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: dataset
> Fix For: 0.16.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently ScanOptions has two distinct responsibilities: it contains the data 
> selector (and eventually the projection schema) for the current scan, and it 
> serves as the base class for format-specific scan options.
> In addition, we have ScanContext, which holds the memory pool for the current 
> scan.
> I think these classes should be rearranged as follows: ScanOptions will be 
> removed and FileScanOptions will become the abstract base class for 
> format-specific scan options. ScanContext will be a concrete struct containing 
> the data selector, the projection schema, a vector of FileScanOptions, and any 
> other shared scan state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6386) [C++][Documentation] Explicit documentation of null slot interpretation

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman closed ARROW-6386.
---
Resolution: Resolved

Interpretation of nulls will be left open for individual kernels. 
https://github.com/apache/arrow/pull/5771 adds explicit Kleene logic overloads 
of the boolean kernels, allowing SQL-compatible behavior in dataset filter 
expressions.
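
For illustration, here is how the two behaviors compare using pyarrow.compute in 
a recent release, which exposes both the null-propagating kernels and their 
Kleene counterparts (the Python binding names {{and_}} and {{and_kleene}} are 
those of current pyarrow and may postdate this discussion):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

left = pa.array([True, False, None])
right = pa.array([False, False, False])

# Null-propagating behavior: a null input always yields a null output.
print(pc.and_(left, right))        # [false, false, null]

# Kleene (SQL-style) behavior: null means "unknown", and (unknown AND false)
# is false because the result does not depend on the unknown operand.
print(pc.and_kleene(left, right))  # [false, false, false]
{code}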

> [C++][Documentation] Explicit documentation of null slot interpretation
> ---
>
> Key: ARROW-6386
> URL: https://issues.apache.org/jira/browse/ARROW-6386
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> To my knowledge, there isn't explicit documentation on how null slots in an 
> array should be interpreted. SQL uses Kleene logic, wherein a null is 
> explicitly an unknown rather than a special value. This yields, for example, 
> `(null AND false) -> false`, since `(x AND false) -> false` for all possible 
> values of x. This is also the behavior of Gandiva's boolean expressions.
> By contrast, the boolean kernels implement something closer to the behavior 
> of NaN: `(null AND false) -> null`. I think this is simply an error in the 
> boolean kernels, but in any case explicit documentation should be added to 
> prevent future confusion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7412) [C++][Dataset] Ensure that dataset code is robust to schemas with duplicate field names

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7412:

Fix Version/s: 1.0.0

> [C++][Dataset] Ensure that dataset code is robust to schemas with duplicate 
> field names
> ---
>
> Key: ARROW-7412
> URL: https://issues.apache.org/jira/browse/ARROW-7412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Fields in a schema don't have to have unique names, so we should make sure 
> we're not assuming that.
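
For illustration, a minimal pyarrow sketch (field names and values are 
arbitrary) showing that duplicate field names are legal, so only positional 
access is guaranteed to be unambiguous:

{code:python}
import pyarrow as pa

# A schema may contain several fields with the same name.
schema = pa.schema([
    pa.field("x", pa.int32()),
    pa.field("x", pa.float64()),
])

table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array([0.5, 1.5])],
    schema=schema,
)

# Positional access is unambiguous; looking the column up by name "x" is not.
print(table.column(0))
print(table.column(1))
{code}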



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7412) [C++][Dataset] Ensure that dataset code is robust to schemas with duplicate field names

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-7412:
---

Assignee: Ben Kietzman

> [C++][Dataset] Ensure that dataset code is robust to schemas with duplicate 
> field names
> ---
>
> Key: ARROW-7412
> URL: https://issues.apache.org/jira/browse/ARROW-7412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>
> Fields in a schema don't have to have unique names, so we should make sure 
> we're not assuming that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7547) [C++] [Python] [Dataset] Additional reader options in ParquetFileFormat

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-7547:
---

Assignee: Ben Kietzman

> [C++] [Python] [Dataset] Additional reader options in ParquetFileFormat
> ---
>
> Key: ARROW-7547
> URL: https://issues.apache.org/jira/browse/ARROW-7547
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, Python
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 0.16.0
>
>
> [looking into using the datasets machinery in the current Python parquet code]
> In the current Python API, we expose several options that influence how the 
> parquet file is read (e.g. {{read_dictionary}} to read certain BYTE_ARRAY 
> columns directly into a dictionary type, {{memory_map}}, and {{buffer_size}}).
> Those could be added to {{ParquetFileFormat}}.
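
For reference, a sketch of how those options are passed per read call through 
the current Python API (the file path and column name are placeholders; the 
keywords are the existing {{pyarrow.parquet.read_table}} options mentioned 
above):

{code:python}
import pyarrow.parquet as pq

table = pq.read_table(
    "data.parquet",                    # placeholder path
    read_dictionary=["payment_type"],  # read this BYTE_ARRAY column as dictionary
    memory_map=True,                   # memory-map the file instead of buffering it
    buffer_size=64 * 1024,             # buffer size (in bytes) for read buffering
)
{code}

The proposal is to expose the same knobs on {{ParquetFileFormat}} so that 
dataset-based reads can honor them as well.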



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7373) [C++][Dataset] Remove FileSource

2020-01-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman closed ARROW-7373.
---
Resolution: Won't Fix

> [C++][Dataset] Remove FileSource
> 
>
> Key: ARROW-7373
> URL: https://issues.apache.org/jira/browse/ARROW-7373
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> FileSource doesn't do enough and should be removed. Methods in 
> {{FileFormat}} etc. that reference the class should be refactored to take a 
> {{RandomAccessFile}}, with convenience overloads provided that take a buffer 
> or a (path, filesystem) pair.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

