[jira] [Updated] (ARROW-3729) [C++] Support for writing TIMESTAMP_NANOS Parquet metadata

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3729:
--
Labels: parquet pull-request-available  (was: parquet)

> [C++] Support for writing TIMESTAMP_NANOS Parquet metadata
> --
>
> Key: ARROW-3729
> URL: https://issues.apache.org/jira/browse/ARROW-3729
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: TP Boudreau
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> This was brought up on the mailing list.
> We will also need to do corresponding work in the parquet-cpp library to opt 
> in to writing nanosecond timestamps instead of casting to micro- or 
> milliseconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3104) [Python] Python bindings for HiveServer2 client interface

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3104:

Fix Version/s: (was: 0.14.0)

> [Python] Python bindings for HiveServer2 client interface
> -
>
> Key: ARROW-3104
> URL: https://issues.apache.org/jira/browse/ARROW-3104
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: HiveServer2
>
> These will be a 1:1 mapping to the current C++ classes, with support for 
> yielding Arrow record batches or tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3134) [C++] Implement n-ary iterator for a collection of chunked arrays with possibly different chunking layouts

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3134:

Labels: dataframe  (was: )

> [C++] Implement n-ary iterator for a collection of chunked arrays with 
> possibly different chunking layouts
> --
>
> Key: ARROW-3134
> URL: https://issues.apache.org/jira/browse/ARROW-3134
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 0.14.0
>
>
> This is a common pattern that will result in kernel invocation on chunked 
> arrays
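For context, here is a rough Python sketch of the chunk-alignment logic such an 
iterator needs (names are illustrative; it assumes all inputs have equal total 
length and no empty chunks):

{code:python}
import pyarrow as pa

def iter_aligned(*chunked):
    # Walk all inputs in lockstep, cutting at the union of their chunk
    # boundaries, so each yielded tuple holds equal-length contiguous slices.
    total = len(chunked[0])
    pos = [[0, 0] for _ in chunked]  # per input: [chunk index, offset in chunk]
    done = 0
    while done < total:
        step = min(len(a.chunk(p[0])) - p[1] for a, p in zip(chunked, pos))
        yield tuple(a.chunk(p[0]).slice(p[1], step)
                    for a, p in zip(chunked, pos))
        done += step
        for a, p in zip(chunked, pos):
            p[1] += step
            if p[1] == len(a.chunk(p[0])):
                p[0], p[1] = p[0] + 1, 0

# Chunk layouts (2, 3) and (4, 1) yield aligned slices of length 2, 2, 1.
x = pa.chunked_array([[1, 2], [3, 4, 5]])
y = pa.chunked_array([[10, 20, 30, 40], [50]])
for xs, ys in iter_aligned(x, y):
    assert len(xs) == len(ys)
{code}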



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3097) [Format] Interval type is not documented

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3097:

Fix Version/s: (was: 0.14.0)
   1.0.0

> [Format] Interval type is not documented
> 
>
> Key: ARROW-3097
> URL: https://issues.apache.org/jira/browse/ARROW-3097
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Konstantin Shaposhnikov
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 1.0.0
>
>
> All types except Interval are documented in Metadata.md. Information about 
> Interval is missing, in particular its size (64 bits?) and the meaning of 
> IntervalUnit values. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3103) [C++] Conversion to Arrow record batch for HiveServer2 ColumnarRowSet

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3103:

Fix Version/s: (was: 0.14.0)

> [C++] Conversion to Arrow record batch for HiveServer2 ColumnarRowSet
> -
>
> Key: ARROW-3103
> URL: https://issues.apache.org/jira/browse/ARROW-3103
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: HiveServer2, database
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4515) [C++, lint] Use clang-format more efficiently in `check-format` target

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4515:

Fix Version/s: (was: 0.14.0)

> [C++, lint] Use clang-format more efficiently in `check-format` target
> --
>
> Key: ARROW-4515
> URL: https://issues.apache.org/jira/browse/ARROW-4515
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>
> `clang-format` supports command line option `-output-replacements-xml` which 
> (in the case of no required changes) outputs:
> ```
> <?xml version='1.0'?>
> <replacements xml:space='preserve' incomplete_format='false'>
> </replacements>
> ```
> Using this option during `check-format` instead of using python to compute a 
> diff between formatted and on-disk should speed up that target significantly
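A rough Python sketch of that approach (the file list and invocation details 
are illustrative assumptions):

{code:python}
import subprocess

def needs_reformatting(paths):
    # One clang-format run over all files; any <replacement ...> element
    # in the emitted XML means at least one file would be changed.
    result = subprocess.run(
        ["clang-format", "-output-replacements-xml", *paths],
        capture_output=True, text=True, check=True)
    return "<replacement " in result.stdout
{code}

This keeps the check to a single process invocation and a substring scan, 
rather than diffing formatted output against the files on disk.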



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4534) [Rust] Build JSON reader for reading record batches from line-delimited JSON files

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4534:

Fix Version/s: (was: 0.14.0)

> [Rust] Build JSON reader for reading record batches from line-delimited JSON 
> files
> --
>
> Key: ARROW-4534
> URL: https://issues.apache.org/jira/browse/ARROW-4534
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>
> Similar to ARROW-694, this is an umbrella issue for supporting reading JSON 
> line-delimited files in Arrow.
> I have a reference implementation at 
> https://github.com/nevi-me/rust-dataframe/blob/io/json/src/io/json.rs where 
> I'm building a Rust-based dataframe library using Arrow.
> I'd like us to have feature parity with CPP at some point.
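For comparison, a minimal sketch of the C++-backed Python reader this aims for 
parity with (assumes a newline-delimited file {{data.jsonl}} exists):

{code:python}
import pyarrow.json as pj

# Each input line holds one JSON object; column types are inferred.
table = pj.read_json("data.jsonl")
print(table.schema)
{code}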



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4838) [C++] Implement safe Make constructor

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4838:

Fix Version/s: (was: 0.14.0)

> [C++] Implement safe Make constructor
> -
>
> Key: ARROW-4838
> URL: https://issues.apache.org/jira/browse/ARROW-4838
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The following classes need validating constructors (see the sketch after this 
> list):
> * ArrayData
> * ChunkedArray
> * RecordBatch
> * Column
> * Table
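A minimal sketch of the validate-on-construction pattern, illustrated from 
Python rather than the proposed C++ API:

{code:python}
import pyarrow as pa

def make_table(batches, schema):
    # Validate immediately so malformed inputs raise ArrowInvalid here,
    # rather than crashing later inside a kernel.
    table = pa.Table.from_batches(batches, schema=schema)
    table.validate()
    return table
{code}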



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4845) [C++] Compiler warnings on Windows MingW64

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4845:

Summary: [C++] Compiler warnings on Windows MingW64  (was: Compiler 
warnings on Windows)

> [C++] Compiler warnings on Windows MingW64
> --
>
> Key: ARROW-4845
> URL: https://issues.apache.org/jira/browse/ARROW-4845
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
> Fix For: 0.14.0
>
>
> I am seeing the warnings below when compiling the R bindings on Windows. Most 
> of these seem easy to fix (comparing int with size_t or int32 with int64).
> {code}
> array.cpp: In function 'Rcpp::LogicalVector Array__Mask(const 
> std::shared_ptr<arrow::Array>&)':
> array.cpp:102:24: warning: comparison of integer expressions of different 
> signedness: 'size_t' {aka 'long long unsigned int'} and 'int64_t' {aka 'long 
> long int'} [-Wsign-compare]
>for (size_t i = 0; i < array->length(); i++, bitmap_reader.Next()) {
>   ~~^
> /mingw64/bin/g++  -std=gnu++11 -I"C:/PROGRA~1/R/R-testing/include" -DNDEBUG 
> -DARROW_STATIC -I"C:/R/library/Rcpp/include"-O2 -Wall  -mtune=generic 
> -c array__to_vector.cpp -o array__to_vector.o
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Boolean::Ingest_some_nulls(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:254:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, data_reader.Next(), null_reader.Next(), 
> ++p_data) {
>   ~~^~~
> array__to_vector.cpp:258:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, data_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Decimal::Ingest_some_nulls(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:473:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp:478:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Int64::Ingest_some_nulls(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:515:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status 
> arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, 
> const std::shared_ptr<arrow::Array>&, Lambda) [with int RTYPE = 14; 
> array_value_type = long long int; Lambda = 
> arrow::r::Converter_Date64::Ingest_some_nulls(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const::; 
> SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:366:77:   required from here
> array__to_vector.cpp:116:26: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>  for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data, 
> ++p_values) {
> ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status 
> arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, 
> const std::shared_ptr<arrow::Array>&, Lambda) [with int RTYPE = 13; 
> array_value_type = unsigned char; Lambda = 
> arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const [with Type = 
> arrow::UInt8Type; SEXP = SEXPREC*; R_xlen_t = long long 
> int]::; SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:341:47:   required from 'arrow::Status 
> arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const 
> std::shared_ptr<arrow::Array>&, R_xlen_t, R_xlen_t) const [with Type = 
> arrow::UInt8Type; 

[jira] [Commented] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit

2019-05-30 Thread TP Boudreau (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852680#comment-16852680
 ] 

TP Boudreau commented on ARROW-1957:


Yes, thanks for assigning it.

> [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
> 
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1 .
>Reporter: Jordan Samuels
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.
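Until nanosecond storage is supported, the explicit-cast workaround looks like 
this (reusing {{df}} from the snippet above; both keyword arguments exist in 
pyarrow.parquet.write_table):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Truncate ns -> us deliberately instead of raising ArrowInvalid.
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)
{code}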



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5448) [CI] MinGW build failures on AppVeyor

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5448:
--
Labels: pull-request-available  (was: )

> [CI] MinGW build failures on AppVeyor
> -
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Blocker
>  Labels: pull-request-available
>
> Apparently the Numpy package is broken. See 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 
> 40, in <module>
>   from . import multiarray
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", 
> line 12, in <module>
>   from . import overrides
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 
> 6, in <module>
>   from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
>   
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build

2019-05-30 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5453:
---

 Summary: [C++] Just-released cmake-format 0.5.2 breaks the build
 Key: ARROW-5453
 URL: https://issues.apache.org/jira/browse/ARROW-5453
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.14.0


It seems we should always pin the cmake-format version until the developers 
stop changing the formatting algorithm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2164) [C++] Clean up unnecessary decimal module refs

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2164:

Fix Version/s: (was: 0.14.0)

> [C++] Clean up unnecessary decimal module refs
> --
>
> Key: ARROW-2164
> URL: https://issues.apache.org/jira/browse/ARROW-2164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> See this comment: 
> https://github.com/apache/arrow/pull/1610#discussion_r168533239



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2248) [Python] Nightly or on-demand HDFS test builds

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852498#comment-16852498
 ] 

Wes McKinney commented on ARROW-2248:
-

cc [~npr]

> [Python] Nightly or on-demand HDFS test builds
> --
>
> Key: ARROW-2248
> URL: https://issues.apache.org/jira/browse/ARROW-2248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> We continue to acquire more functionality related to HDFS and Parquet. 
> Testing this, including tests that involve interoperability with other 
> systems, like Spark, will require some work outside of our normal CI 
> infrastructure.
> I suggest we start with testing the C++/Python HDFS integration, which will 
> help with validating patches like ARROW-1643 
> https://github.com/apache/arrow/pull/1668



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2249) [Java/Python] in-process vector sharing from Java to Python

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2249:

Fix Version/s: (was: 0.14.0)

> [Java/Python] in-process vector sharing from Java to Python
> ---
>
> Key: ARROW-2249
> URL: https://issues.apache.org/jira/browse/ARROW-2249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: beginner
>
> Currently we seem to use in all applications of Arrow the IPC capabilities to 
> move data between a Java process and a Python process. While this is 
> 0-serialization, it is not zero-copy. By taking the address and offset, we 
> can already create Python buffers from Java buffers: 
> https://github.com/apache/arrow/pull/1693. This is still a very low-level 
> interface and we should provide the user with:
> * A guide on how to load the Apache Arrow Java libraries in Python (either 
> through a fat JAR shipped with Arrow, or guidance on integrating them into 
> the user's own Java packaging)
> * {{pyarrow.Array.from_jvm}}, {{pyarrow.RecordBatch.from_jvm}}, … functions 
> that take the respective Java objects and emit Python objects. These Python 
> objects should also ensure that the underlying memory regions are kept alive 
> as long as the Python objects exist.
> This issue can also be used as a tracker for the various sub-tasks that will 
> need to be done to complete this rather large milestone.
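The low-level building block on the Python side is wrapping a raw address; a 
sketch of what the proposed {{from_jvm}} helpers could build on (the function 
name and arguments here are hypothetical):

{code:python}
import pyarrow as pa

def buffer_from_jvm(address, size, java_ref):
    # Zero-copy view over memory owned by a JVM ArrowBuf; java_ref must
    # keep the Java object alive as long as this buffer is in use.
    return pa.foreign_buffer(address, size, base=java_ref)
{code}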



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2260) [C++][Plasma] plasma_store should show usage

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2260:

Fix Version/s: (was: 0.14.0)

> [C++][Plasma] plasma_store should show usage
> 
>
> Key: ARROW-2260
> URL: https://issues.apache.org/jira/browse/ARROW-2260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Currently the options exposed by the {{plasma_store}} executable aren't very 
> discoverable:
> {code:bash}
> $ plasma_store -h
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store 
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store --help
> plasma_store: invalid option -- '-'
> {code}
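For reference, the switches that do work today are -s (socket path) and -m 
(memory budget in bytes); launched from Python here purely for illustration:

{code:python}
import subprocess

# 1 GB store listening on /tmp/plasma.
store = subprocess.Popen(
    ["plasma_store", "-s", "/tmp/plasma", "-m", str(10**9)])
{code}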



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2888) [Plasma] Several GPU-related APIs are used in places where errors cannot be appropriately handled

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2888:

Fix Version/s: (was: 0.14.0)

> [Plasma] Several GPU-related APIs are used in places where errors cannot be 
> appropriately handled
> -
>
> Key: ARROW-2888
> URL: https://issues.apache.org/jira/browse/ARROW-2888
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Wes McKinney
>Priority: Major
>
> I'm adding {{DCHECK_OK}} statements for ARROW-2883 to fix the unchecked 
> Status warnings, but this code should be refactored so that these errors can 
> bubble up properly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2879) [Python] Arrow plasma can only use a small part of specified shared memory

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852513#comment-16852513
 ] 

Wes McKinney commented on ARROW-2879:
-

Want to submit a pull request?

> [Python] Arrow plasma can only use a small part of specified shared memory
> --
>
> Key: ARROW-2879
> URL: https://issues.apache.org/jira/browse/ARROW-2879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: chineking
>Priority: Major
> Fix For: 0.14.0
>
>
> Hi, thanks for the great job of arrow, it helps us a lot.
> However, we encounter a problem when we were using plasma.
> The sample code:
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.plasma as plasma
> client = plasma.connect("/tmp/plasma", "", 0)
> puts = []
> nbytes = 0
> while True:
>     a = np.ones((1000, 1000))
>     try:
>         oid = client.put(a)
>         puts.append(client.get(oid))
>         nbytes += a.nbytes
>     except pa.lib.PlasmaStoreFull:
>         print('use nbytes', nbytes)
>         break
> {code}
> We start a plasma store with 1G memory, but the nbytes output above is only 
> 49600, which cannot even reach half of the memory we specified.
> I cannot figure out why plasma can only use such a small part of shared 
> memory. Could anybody help me? Thanks a lot.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2912) [Website] Build more detailed Community landing page a la Apache Spark

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852514#comment-16852514
 ] 

Wes McKinney commented on ARROW-2912:
-

cc [~npr]

> [Website] Build more detailed Community landing page a la Apache Spark
> ---
>
> Key: ARROW-2912
> URL: https://issues.apache.org/jira/browse/ARROW-2912
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> It would be useful to have some prose descriptions of where to get help and 
> where to direct questions. See example:
> http://spark.apache.org/community.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2910) [Packaging] Build from official apache archive

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2910:

Fix Version/s: (was: 0.14.0)

> [Packaging] Build from official apache archive
> --
>
> Key: ARROW-2910
> URL: https://issues.apache.org/jira/browse/ARROW-2910
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2887) [Plasma] Methods in plasma/store.h returning PlasmaError should return Status instead

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2887:

Fix Version/s: (was: 0.14.0)

> [Plasma] Methods in plasma/store.h returning PlasmaError should return Status 
> instead
> -
>
> Key: ARROW-2887
> URL: https://issues.apache.org/jira/browse/ARROW-2887
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Wes McKinney
>Priority: Major
>
> These functions are not able to return other kinds of errors (e.g. 
> CUDA-related errors) as a result of this. I encountered this while working on 
> ARROW-2883



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2939) [Python] API documentation version doesn't match latest on PyPI

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852516#comment-16852516
 ] 

Wes McKinney commented on ARROW-2939:
-

The published docs are now for the latest released version. I think it would be 
useful to have an archive of old documentation versions, though. Removing this 
from the 0.14 Fix Version

> [Python] API documentation version doesn't match latest on PyPI
> ---
>
> Key: ARROW-2939
> URL: https://issues.apache.org/jira/browse/ARROW-2939
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ian Robertson
>Priority: Minor
>  Labels: documentation
> Fix For: 0.14.0
>
>
> Hey folks, apologies if this isn't the right place to raise this.  In poking 
> around the web documentation (for pyarrow specifically), it looks like the 
> auto-generated API docs contain commits past the release of 0.9.0.  For 
> example:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.column]
>  * Contains differences merged here: 
> [https://github.com/apache/arrow/pull/1923]
>  * But latest pypi/conda versions of pyarrow are 0.9.0, which don't include 
> that change.
> Not sure if the docs are auto-built off master somewhere; I couldn't find 
> anything about building the docs in the docs themselves.  I would guess that you may 
> want some of the usage docs to be published in between releases if they're 
> not about new functionality, but the API reference being out of date can be 
> confusing.  Is it possible to anchor the API docs to the latest released 
> version?  Or even something like how Pandas has a whole bunch of old versions 
> still available? (e.g. [https://pandas.pydata.org/pandas-docs/stable/] vs. 
> old versions like [http://pandas.pydata.org/pandas-docs/version/0.17.0/])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2853) [Python] Implementing support for zero copy NumPy arrays in libarrow_python

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2853:

Fix Version/s: (was: 0.14.0)

> [Python] Implementing support for zero copy NumPy arrays in libarrow_python
> ---
>
> Key: ARROW-2853
> URL: https://issues.apache.org/jira/browse/ARROW-2853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Florian Rathgeber
>Priority: Major
>
> Implementing support for zero copy NumPy arrays in libarrow_python (i.e. in 
> C++). We can utilize common code paths with {{to_pandas}} and toggle 
> between NumPy-for-pandas and NumPy-for-NumPy behavior (and use the 
> `zero_copy_only` flag where needed).
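For reference, the zero-copy path that already exists at the Array level (a 
sketch; zero-copy only holds for primitive types without nulls):

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(5, dtype="float64"))
view = arr.to_numpy()  # no copy; raises if a copy would be required
{code}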



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2858) [Packaging] Add unit tests for crossbow

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2858:

Fix Version/s: (was: 0.14.0)

> [Packaging] Add unit tests for crossbow
> ---
>
> Key: ARROW-2858
> URL: https://issues.apache.org/jira/browse/ARROW-2858
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Phillip Cloud
>Priority: Major
>
> As this code grows we should start adding unit tests to make sure we can make 
> changes safely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2818) [Python] Better error message when passing SparseDataFrame into Table.from_pandas

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852510#comment-16852510
 ] 

Wes McKinney commented on ARROW-2818:
-

[~jorisvandenbossche]

> [Python] Better error message when passing SparseDataFrame into 
> Table.from_pandas
> -
>
> Key: ARROW-2818
> URL: https://issues.apache.org/jira/browse/ARROW-2818
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This can be a rough edge for users. Note that pandas sparse support is being 
> considered for deprecation
> original issue https://github.com/apache/arrow/issues/1894



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2870) [Python] Define API for handling null markers from Array.to_numpy

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852511#comment-16852511
 ] 

Wes McKinney commented on ARROW-2870:
-

[~jorisvandenbossche]

> [Python] Define API for handling null markers from Array.to_numpy
> -
>
> Key: ARROW-2870
> URL: https://issues.apache.org/jira/browse/ARROW-2870
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This is follow-up work for {{Array.to_numpy}} started in ARROW-564



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2446) [C++] SliceBuffer on CudaBuffer should return CudaBuffer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2446:

Fix Version/s: (was: 0.14.0)

> [C++] SliceBuffer on CudaBuffer should return CudaBuffer
> 
>
> Key: ARROW-2446
> URL: https://issues.apache.org/jira/browse/ARROW-2446
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GPU
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> Currently {{SliceBuffer}} on a {{CudaBuffer}} returns a plain {{Buffer}} 
> instance, which is dangerous for unsuspecting consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5344) [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in compute/kernels/cast.cc

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5344:

Fix Version/s: (was: 0.14.0)

> [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in 
> compute/kernels/cast.cc
> ---
>
> Key: ARROW-5344
> URL: https://issues.apache.org/jira/browse/ARROW-5344
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow-up to code review from ARROW-3144



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5334) [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for consistency

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5334:
---

Assignee: Wes McKinney

> [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for 
> consistency
> -
>
> Key: ARROW-5334
> URL: https://issues.apache.org/jira/browse/ARROW-5334
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> These intermediate classes used for template metaprogramming (in particular, 
> with {{std::is_base_of}}) have names inconsistent with the rest of the data 
> types. For clarity, I think we should add "Type" to these class names and 
> others like them.
> Please do after ARROW-3144



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5073) [C++] Build toolchain support for libcurl

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5073:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Build toolchain support for libcurl
> -
>
> Key: ARROW-5073
> URL: https://issues.apache.org/jira/browse/ARROW-5073
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> libcurl can be used in a number of different situations (e.g. TensorFlow uses 
> it for GCS interactions 
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc)
>  so this will likely be required once we begin to tackle that problem



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5420) [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector

2019-05-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5420.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4390
[https://github.com/apache/arrow/pull/4390]

> [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector
> ---
>
> Key: ARROW-5420
> URL: https://issues.apache.org/jira/browse/ARROW-5420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Now VariableWidthVector#getCurrentSizeInBytes doesn't seem to have been 
> implemented. We should implement it or just remove it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2671) [Python] Run ASV suite in nightly build, only run in Travis CI on demand

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2671:

Fix Version/s: (was: 0.14.0)

> [Python] Run ASV suite in nightly build, only run in Travis CI on demand
> 
>
> Key: ARROW-2671
> URL: https://issues.apache.org/jira/browse/ARROW-2671
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: nightly
>
> Lately the main Travis CI build is running nearly 40 minutes long, e.g. here 
> is the latest commit on master
> https://travis-ci.org/apache/arrow/builds/387326546
> A fair chunk of the long runtime is spent running the Python benchmarks at 
> the end of the test suite. We should absolutely keep these running smoothly. 
> However:
> * It may be just as valuable to run them on master nightly, and report in if 
> they are broken
> * We could add a check to look at the commit message and run them in Travis 
> CI if requested
> If others agree, I suggest that as soon as the packaging bot / nightly build 
> tool is working properly, that we make these changes in the interest of 
> improving CI build times



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2667) [C++/Python] Add pandas-like take method to Array

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2667:

Summary: [C++/Python] Add pandas-like take method to Array  (was: 
[C++/Python] Add pandas-like take method to Array/Column/ChunkedArray)

> [C++/Python] Add pandas-like take method to Array
> -
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a 
> list of indices and returns a reordered array.
> For reference, see Pandas' interface: 
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466
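A sketch of the intended semantics, mirroring the pandas interface (the exact 
null-handling behavior here is an assumption):

{code:python}
import pyarrow as pa

arr = pa.array(["a", "b", "c", "d"])
idx = pa.array([3, 0, 0, 2])
print(arr.take(idx))  # expected: ["d", "a", "a", "c"]
{code}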



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2801:

Labels: datasets parquet pull-request-available  (was: parquet 
pull-request-available)

> [Python] Implement split_row_groups for ParquetDataset
> -
>
> Key: ARROW-2801
> URL: https://issues.apache.org/jira/browse/ARROW-2801
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Robbie Gruener
>Assignee: Robbie Gruener
>Priority: Minor
>  Labels: datasets, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently the split_row_groups argument in ParquetDataset raises a 
> NotImplementedError. An easy and efficient way to implement this is to use 
> the summary metadata file instead of opening every footer file.
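What split_row_groups amounts to, sketched manually against the existing API 
(one piece per row group; the file name is illustrative):

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("part-0.parquet")
for i in range(pf.metadata.num_row_groups):
    piece = pf.read_row_group(i)  # a Table covering just row group i
{code}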



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2671) [Python] Run ASV suite in nightly build, only run in Travis CI on demand

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852509#comment-16852509
 ] 

Wes McKinney commented on ARROW-2671:
-

Well, 6 months later we still aren't running these. I hope we are running them 
by the end of 2019.

cc [~npr] 

> [Python] Run ASV suite in nightly build, only run in Travis CI on demand
> 
>
> Key: ARROW-2671
> URL: https://issues.apache.org/jira/browse/ARROW-2671
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: nightly
>
> Lately the main Travis CI build is running nearly 40 minutes long, e.g. here 
> is the latest commit on master
> https://travis-ci.org/apache/arrow/builds/387326546
> A fair chunk of the long runtime is spent running the Python benchmarks at 
> the end of the test suite. We should absolutely keep these running smoothly. 
> However:
> * It may be just as valuable to run them on master nightly, and report in if 
> they are broken
> * We could add a check to look at the commit message and run them in Travis 
> CI if requested
> If others agree, I suggest that as soon as the packaging bot / nightly build 
> tool is working properly, that we make these changes in the interest of 
> improving CI build times



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2702) [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc to see if we are using the right error type in each instance

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2702:

Fix Version/s: (was: 0.14.0)

> [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc 
> to see if we are using the right error type in each instance
> -
>
> Key: ARROW-2702
> URL: https://issues.apache.org/jira/browse/ARROW-2702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion in [https://github.com/apache/arrow/pull/2075]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

2019-05-30 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5454:
---

 Summary: [C++] Implement Take on ChunkedArray for DataFrame use
 Key: ARROW-5454
 URL: https://issues.apache.org/jira/browse/ARROW-5454
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


Follow up to ARROW-2667



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3246:
---

Assignee: Wes McKinney

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
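A sketch of the round trip under discussion, assuming a {{read_dictionary}} 
style opt-in for keeping columns dictionary-encoded on read:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"c": pd.Categorical(["x", "y", "x"])})
pq.write_table(pa.Table.from_pandas(df), "cats.parquet")
table = pq.read_table("cats.parquet", read_dictionary=["c"])
roundtrip = table.to_pandas()  # "c" should come back as a categorical
{code}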



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3232) [Python] Return an ndarray from Column.to_pandas

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3232:

Fix Version/s: (was: 0.14.0)

> [Python] Return an ndarray from Column.to_pandas
> 
>
> Key: ARROW-3232
> URL: https://issues.apache.org/jira/browse/ARROW-3232
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> See discussion: 
> https://github.com/apache/arrow/pull/2535#discussion_r216299243



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3192) [Java] Implement "ArrowBufReadChannel" abstraction and alternate MessageSerializer that uses this

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3192:

Fix Version/s: (was: 0.14.0)

> [Java] Implement "ArrowBufReadChannel" abstraction and alternate 
> MessageSerializer that uses this
> -
>
> Key: ARROW-3192
> URL: https://issues.apache.org/jira/browse/ARROW-3192
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
>
> The current MessageSerializer implementation is wasteful when used to read an 
> IPC payload that is already in-memory in an {{ArrowBuf}}. In particular, 
> reads out of a {{ReadChannel}} require memory allocation
> * 
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L569
> * 
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L290
> In C++, we have abstracted memory allocation out of the IPC read path so that 
> zero-copy is possible. I suggest that a similar mechanism can be developed 
> for Java to improve deserialization performance for in-memory messages. The 
> new interface would return {{ArrowBuf}} when performing reads, which could be 
> zero-copy when possible, but when not the current strategy of allocate-copy 
> could be used



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3185) [C++] Address libparquet SO version convention in unified build

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3185:

Fix Version/s: (was: 0.14.0)

> [C++] Address libparquet SO version convention in unified build
> ---
>
> Key: ARROW-3185
> URL: https://issues.apache.org/jira/browse/ARROW-3185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow up work to ARROW-3075



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852543#comment-16852543
 ] 

Wes McKinney commented on ARROW-3543:
-

cc [~npr]

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Critical
> Fix For: 0.14.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>  X1 string_time_utc timestamp_est 
>  <int> <dttm> <dttm> 
>  1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
>  2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>  mytimezone
>  <chr> 
>  1 UTC 
>  2 UTC 
>  3 UTC  {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>  string_time_utc timestamp_est mytimezone
>  <dttm> <dttm> <chr> 
>  1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" 
>  2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" 
>  3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3579) [Crossbow] Unintuitive error message when remote branch has not been pushed

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3579:

Fix Version/s: (was: 0.14.0)

> [Crossbow] Unintuitive error message when remote branch has not been pushed
> ---
>
> Key: ARROW-3579
> URL: https://issues.apache.org/jira/browse/ARROW-3579
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> {code}
> $ python dev/tasks/crossbow.py submit -g linux --arrow-version 0.11.1-rc0
> Traceback (most recent call last):
>   File "dev/tasks/crossbow.py", line 796, in 
> crossbow(obj={}, auto_envvar_prefix='CROSSBOW')
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/core.py",
>  line 764, in __call__
> return self.main(*args, **kwargs)
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/core.py",
>  line 717, in main
> rv = self.invoke(ctx)
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/core.py",
>  line 1137, in invoke
> return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/core.py",
>  line 956, in invoke
> return ctx.invoke(self.callback, **ctx.params)
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/core.py",
>  line 555, in invoke
> return callback(*args, **kwargs)
>   File 
> "/home/wesm/miniconda/envs/arrow-release/lib/python3.6/site-packages/click/decorators.py",
>  line 17, in new_func
> return f(get_current_context(), *args, **kwargs)
>   File "dev/tasks/crossbow.py", line 596, in submit
> target = Target.from_repo(arrow)
>   File "dev/tasks/crossbow.py", line 407, in from_repo
> remote=repo.remote_url,
>   File "dev/tasks/crossbow.py", line 235, in remote_url
> return self.remote.url.replace(
>   File "dev/tasks/crossbow.py", line 225, in remote
> return self.repo.remotes[self.branch.upstream.remote_name]
> AttributeError: 'NoneType' object has no attribute 'remote_name'
> {code}
> The fix was to make sure the local branch and the reference branch for the 
> build in my fork wesm/arrow was the same



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852544#comment-16852544
 ] 

Wes McKinney commented on ARROW-3571:
-

cc [~npr]

> [Wiki] Release management guide does not explain how to set up Crossbow or 
> where to find instructions
> -
>
> Key: ARROW-3571
> URL: https://issues.apache.org/jira/browse/ARROW-3571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> If you follow the guide, at one point it says "Launch a Crossbow build" but 
> provides no link to the setup instructions for this



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3650) [Python] Mixed column indexes are read back as strings

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3650:

Fix Version/s: (was: 0.14.0)

> [Python] Mixed column indexes are read back as strings 
> ---
>
> Key: ARROW-3650
> URL: https://issues.apache.org/jira/browse/ARROW-3650
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Consider the following example: 
> {code:java}
> df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a 
> string', pd.to_datetime('2018/01/02')])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> print(df.columns)
> # Index(['a string', 2018-01-02 00:00:00], dtype='object')
> print(ref_df.columns)
> # Index(['a string', '2018-01-02 00:00:00'], dtype='object')
> {code}
> The serialized data frame has an index with a string and a datetime field 
> (happened when resetting the index of a formerly datetime only column).
> When reading the string back the datetime is converted into a string.
> When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": 
> "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": 
> "object"}} after reading back. So the schema was aware of the mixed type but 
> did not store the actual types.
> The same happens with other types like numbers as well. One can produce 
> interesting situations:
> {{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} 
> can be written but fails to be read back, as the index is no longer unique 
> with '1' showing up twice.
> If this is not a bug but expected behavior, maybe the user should somehow be 
> warned that information is lost? Like a {{NotImplemented}} exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3538:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: datasets, features, parquet
> Fix For: 0.15.0
>
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as 
> dataset using pyarrow parquet, I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}│   └── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}├── a=2{color}
>  {color:#14892c}│   └── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I can override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.
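One possible shape for such an override, reusing {{table}} and {{some_path}} 
from the snippet above (the filename callback and naming scheme are 
assumptions, not a committed API):

{code:python}
import pyarrow.parquet as pq

pq.write_to_dataset(
    table, root_path=some_path, partition_cols=['a'],
    # Hypothetical: derive a stable name from the partition key values so
    # rewriting the dataset replaces the old file instead of adding a UUID.
    partition_filename_cb=lambda keys: '-'.join(map(str, keys)) + '.parquet')
{code}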



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3435) [C++] Add option to use dynamic linking with re2

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3435.
-
   Resolution: Fixed
Fix Version/s: (was: 0.14.0)
   0.13.0

If libre2.so is available, it is used now instead of static linking

{code}
$ ldd ~/local/lib/libgandiva.so
linux-vdso.so.1 (0x7ffe46d76000)
libarrow.so.14 => /home/wesm/local/lib/libarrow.so.14 
(0x7f59ce1ac000)
libre2.so.0 => /home/wesm/cpp-runtime-toolchain/lib/libre2.so.0 
(0x7f59ce13a000)
libglog.so.0 => /home/wesm/cpp-runtime-toolchain/lib/libglog.so.0 
(0x7f59ce106000)
libz.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libz.so.1 
(0x7f59ce0ec000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f59ce0c4000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 
(0x7f59ce096000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f59ce073000)
libstdc++.so.6 => /home/wesm/cpp-runtime-toolchain/lib/libstdc++.so.6 
(0x7f59cdf31000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f59cdde3000)
libgcc_s.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libgcc_s.so.1 
(0x7f59cddcf000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f59cdbe4000)
/lib64/ld-linux-x86-64.so.2 (0x7f59d1291000)
libbrotlienc.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlienc.so.1 
(0x7f59cdb56000)
libbrotlidec.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlidec.so.1 
(0x7f59cdb45000)
libbz2.so.1.0 => /home/wesm/cpp-runtime-toolchain/lib/libbz2.so.1.0 
(0x7f59cdb31000)
liblz4.so.1 => /home/wesm/cpp-runtime-toolchain/lib/liblz4.so.1 
(0x7f59cd921000)
libsnappy.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libsnappy.so.1 
(0x7f59cd916000)
libzstd.so.1.3.8 => 
/home/wesm/cpp-runtime-toolchain/lib/libzstd.so.1.3.8 (0x7f59cd868000)
libboost_system.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_system.so.1.68.0 
(0x7f59cd861000)
libboost_filesystem.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_filesystem.so.1.68.0 
(0x7f59cd841000)
libboost_regex.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_regex.so.1.68.0 
(0x7f59cd738000)
libbrotlicommon.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlicommon.so.1 
(0x7f59cd715000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f59cd70a000)
libicudata.so.58 => 
/home/wesm/cpp-runtime-toolchain/lib/./libicudata.so.58 (0x7f59cbe06000)
libicui18n.so.58 => 
/home/wesm/cpp-runtime-toolchain/lib/./libicui18n.so.58 (0x7f59cbb87000)
libicuuc.so.58 => /home/wesm/cpp-runtime-toolchain/lib/./libicuuc.so.58 
(0x7f59cb9d4000)
{code}

> [C++] Add option to use dynamic linking with re2
> 
>
> Key: ARROW-3435
> URL: https://issues.apache.org/jira/browse/ARROW-3435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.13.0
>
>
> Initial support for re2 uses static linking -- some applications may wish to 
> use dynamic linking



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3538:

Labels: datasets features parquet  (was: features parquet)

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: datasets, features, parquet
> Fix For: 0.14.0
>
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as 
> a dataset using pyarrow parquet. I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}│   └── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}└── a=2{color}
>  {color:#14892c}    └── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I could override the auto-assignment of 
> the long UUID as the filename somehow during *dataset* writing. My purpose 
> is to be able to overwrite the dataset on disk when I have a new version of 
> {{df}}. Currently if I try to write the dataset again, another uniquely 
> named [UUID].parquet file is placed next to the old one, with the same, 
> redundant data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3495) [Java] Optimize bit operations performance

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3495:

Fix Version/s: (was: 0.14.0)

> [Java] Optimize bit operations performance
> --
>
> Key: ARROW-3495
> URL: https://issues.apache.org/jira/browse/ARROW-3495
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.11.0
>Reporter: Li Jin
>Assignee: Animesh Trivedi
>Priority: Major
>
> From [~atrivedi]'s benchmark findings:
> 2) Materialize values from Validity and Value direct buffers instead of
> calling getInt() function on the IntVector. This is implemented as a new
> Unsafe reader type (
> [https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31]
> )
> 3) Optimize bitmap operation to check if a bit is set or not (
> [https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23]
> )
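> For reference, a minimal Python illustration of the bitmap check being 
> optimized (Arrow validity bitmaps are LSB bit ordered, 1 = valid; this is a 
> sketch of the technique, not the Java implementation):
> {code}
> def is_set(validity_bytes, index):
>     # byte index >> 3 holds the bit; (index & 7) selects it within the byte
>     return (validity_bytes[index >> 3] >> (index & 7)) & 1 == 1
>
> bitmap = bytes([0b00000101])  # slots 0 and 2 valid, slot 1 null
> assert is_set(bitmap, 0) and not is_set(bitmap, 1) and is_set(bitmap, 2)
> {code}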



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3496) [Java] Add microbenchmark code to Java

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3496:

Fix Version/s: (was: 0.14.0)

> [Java] Add microbenchmark code to Java
> --
>
> Key: ARROW-3496
> URL: https://issues.apache.org/jira/browse/ARROW-3496
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.11.0
>Reporter: Li Jin
>Assignee: Animesh Trivedi
>Priority: Major
>
> [~atrivedi] has done some microbenchmarking with the Java API. Let's consider 
> adding them to the codebase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3471) [C++][Gandiva] Investigate caching isomorphic expressions

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3471:

Fix Version/s: (was: 0.14.0)

> [C++][Gandiva] Investigate caching isomorphic expressions
> -
>
> Key: ARROW-3471
> URL: https://issues.apache.org/jira/browse/ARROW-3471
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Praveen Kumar Desabandu
>Priority: Major
>  Labels: gandiva
>
> Two expressions, say add(a+b) and add(c+d), could potentially be reused if 
> the only things differing are the names.
> Test E2E.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3503) [Python] Allow config hadoop_bin in pyarrow hdfs.py

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3503:

Fix Version/s: (was: 0.14.0)

> [Python] Allow config hadoop_bin in pyarrow hdfs.py 
> 
>
> Key: ARROW-3503
> URL: https://issues.apache.org/jira/browse/ARROW-3503
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wenbo Zhao
>Priority: Major
>  Labels: filesystem, pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the hadoop_bin is either from `HADOOP_HOME` or the `hadoop` 
> command. 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L130]
> However, in some environment setups, hadoop_bin could be in some other 
> location. Can we do something like 
>  
> {code:java}
> if 'HADOOP_BIN' in os.environ:
>     hadoop_bin = os.environ['HADOOP_BIN']
> elif 'HADOOP_HOME' in os.environ:
>     hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
> else:
>     hadoop_bin = 'hadoop'
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3444) [Python] Table.nbytes attribute

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3444:

Fix Version/s: (was: 0.14.0)

> [Python] Table.nbytes attribute
> ---
>
> Key: ARROW-3444
> URL: https://issues.apache.org/jira/browse/ARROW-3444
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Dave Hirschfeld
>Priority: Minor
>
> As it says in the title, I think this would be a very handy attribute to 
> have available in Python. You can get it by converting to pandas and using 
> `DataFrame.nbytes`, but this is wasteful of both time and memory, so it 
> would be good to have this information on the `pyarrow.Table` object itself.
> This could be implemented using the 
> [__sizeof__|https://docs.python.org/3/library/sys.html#sys.getsizeof] protocol
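> A hedged sketch of the wasteful workaround described above:
> {code}
> # Measuring a Table's footprint today means materializing a full pandas
> # copy; a Table.nbytes attribute would avoid both the time and the memory.
> import pandas as pd
> import pyarrow as pa
>
> table = pa.Table.from_pandas(pd.DataFrame({'x': list(range(1000))}))
> nbytes = table.to_pandas().memory_usage(deep=True).sum()  # copies all data
> {code}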



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5455:

Priority: Blocker  (was: Major)

> [Rust] Build broken by 2019-05-30 Rust nightly
> --
>
> Key: ARROW-5455
> URL: https://issues.apache.org/jira/browse/ARROW-5455
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.14.0
>
>
> See example failed build:
> https://travis-ci.org/apache/arrow/jobs/539477452



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4398) [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and write)

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4398:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and 
> write)
> 
>
> Key: ARROW-4398
> URL: https://issues.apache.org/jira/browse/ARROW-4398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> This is follow-on work to PARQUET-1508, so we can monitor the performance of 
> this operation over time



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4369) [Packaging] Release verification script should test linux packages via docker

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852579#comment-16852579
 ] 

Wes McKinney commented on ARROW-4369:
-

[~kszucs] any thoughts about this for 0.14? We can also postpone

> [Packaging] Release verification script should test linux packages via docker
> -
>
> Key: ARROW-4369
> URL: https://issues.apache.org/jira/browse/ARROW-4369
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> It shouldn't be too hard to create a verification script which checks the 
> linux packages. This could prevent issues like [ARROW-4368] / 
> [https://github.com/apache/arrow/issues/3476]
> I suggest separating the current verification script into one which 
> verifies the source release artifact and another which verifies the 
> binaries:
>  * checksum and signatures as is right now
>  * install linux packages on multiple distros via docker
> We could test wheels and conda packages as well, but in follow-up PRs.
>  
> cc [~kou]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4419) [Flight] Deal with body buffers in FlightData

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852580#comment-16852580
 ] 

Wes McKinney commented on ARROW-4419:
-

[~lidavidm] where does this issue stand?

> [Flight] Deal with body buffers in FlightData
> -
>
> Key: ARROW-4419
> URL: https://issues.apache.org/jira/browse/ARROW-4419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Minor
>  Labels: flight
> Fix For: 0.14.0
>
>
> The Java implementation will fail to decode a schema message if the message 
> also contains (empty) body buffers (see ArrowMessage.asSchema's precondition 
> checks). However, clients using default Protobuf serialization will likely 
> write an empty body buffer by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4409) [C++] Enable arrow::ipc internal JSON reader to read from a file path

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4409:

Fix Version/s: (was: 0.14.0)

> [C++] Enable arrow::ipc internal JSON reader to read from a file path
> -
>
> Key: ARROW-4409
> URL: https://issues.apache.org/jira/browse/ARROW-4409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>
> This may make tests easier to write. Currently an input buffer is required, 
> so reading from a file requires some boilerplate



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5448) [CI] MinGW build failures on AppVeyor

2019-05-30 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-5448:
---

Assignee: Kouhei Sutou

> [CI] MinGW build failures on AppVeyor
> -
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> Apparently the Numpy package is broken. See 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 
> 40, in <module>
>   from . import multiarray
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", 
> line 12, in <module>
>   from . import overrides
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 
> 6, in <module>
>   from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
>   
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5453:
---

Assignee: Wes McKinney

> [C++] Just-released cmake-format 0.5.2 breaks the build
> ---
>
> Key: ARROW-5453
> URL: https://issues.apache.org/jira/browse/ARROW-5453
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.14.0
>
>
> It seems we should always pin the cmake-format version until the developers 
> stop changing the formatting algorithm



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2365) [Plasma] Return status codes instead of crashing

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2365:

Fix Version/s: (was: 0.14.0)

> [Plasma] Return status codes instead of crashing
> 
>
> Key: ARROW-2365
> URL: https://issues.apache.org/jira/browse/ARROW-2365
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Antoine Pitrou
>Priority: Major
>
> When certain {{PlasmaClient}} methods are called with bad arguments, 
> PlasmaClient crashes instead of returning an error Status. For example, try 
> calling {{Seal()}} with a non-existent object id.
> This is hostile towards users of high-level languages such as Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2339) [Python] Add a fast path for int hashing

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2339:

Fix Version/s: (was: 0.14.0)

> [Python] Add a fast path for int hashing
> 
>
> Key: ARROW-2339
> URL: https://issues.apache.org/jira/browse/ARROW-2339
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alex Hagerman
>Priority: Minor
>
> Create a __hash__ fast path for Int scalars that avoids using as_py().
>  
> https://issues.apache.org/jira/browse/ARROW-640
> [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2366) [Python] Support reading Parquet files having a permutation of column order

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852499#comment-16852499
 ] 

Wes McKinney commented on ARROW-2366:
-

This will need to be addressed as part of general schema conformance in the C++ 
Datasets API

cc [~pitrou] [~npr]

> [Python] Support reading Parquet files having a permutation of column order
> ---
>
> Key: ARROW-2366
> URL: https://issues.apache.org/jira/browse/ARROW-2366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: datasets, parquet
> Fix For: 0.14.0
>
>
> See discussion in https://github.com/dask/fastparquet/issues/320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2367) [Python] ListArray has trouble with sizes greater than kMaximumCapacity

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2367:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] ListArray has trouble with sizes greater than kMaximumCapacity
> ---
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types; both cause
> # the same error
> v1 = b'x' * 100000000  # 20 * 100000000 + 147483646 + 1 = 2147483647 bytes
> v2 = b'x' * 147483646
> # v1 = 'x' * 100000000
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2410) [JS] Add DataFrame.scanAsync

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852500#comment-16852500
 ] 

Wes McKinney commented on ARROW-2410:
-

[~bhulette] [~paul.e.taylor] of interest for 0.14?

> [JS] Add DataFrame.scanAsync
> 
>
> Key: ARROW-2410
> URL: https://issues.apache.org/jira/browse/ARROW-2410
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.14.0
>
>
> Add a version of `DataFrame.scan`, `scanAsync` that yields periodically. The 
> yield frequency could be specified either as a number of record batches, or a 
> number of records.
> This scan should also be cancellable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2379) [Plasma] PlasmaClient::Info() should return whether an object is in use

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2379:

Fix Version/s: (was: 0.14.0)

> [Plasma] PlasmaClient::Info() should return whether an object is in use
> ---
>
> Key: ARROW-2379
> URL: https://issues.apache.org/jira/browse/ARROW-2379
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Antoine Pitrou
>Priority: Major
>
> It can be useful to know whether a given object is already in use by the 
> local client.
> See https://github.com/apache/arrow/pull/1807#discussion_r178611472



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2366) [Python] Support reading Parquet files having a permutation of column order

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2366:

Labels: datasets parquet  (was: parquet)

> [Python] Support reading Parquet files having a permutation of column order
> ---
>
> Key: ARROW-2366
> URL: https://issues.apache.org/jira/browse/ARROW-2366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: datasets, parquet
> Fix For: 0.14.0
>
>
> See discussion in https://github.com/dask/fastparquet/issues/320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4343) [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to docker-compose setup

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852576#comment-16852576
 ] 

Wes McKinney commented on ARROW-4343:
-

What does it mean now that Ubuntu Trusty is no longer an LTS release?

> [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to 
> docker-compose setup
> -
>
> Key: ARROW-4343
> URL: https://issues.apache.org/jira/browse/ARROW-4343
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Until we formally stop supporting Trusty it would be useful to be able to 
> verify in Docker that builds work there. I still have an Ubuntu 14.04 machine 
> that I use (and I've been filing bugs that I find on it) but not sure for how 
> much longer



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4350) [Python] nested numpy arrays

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852577#comment-16852577
 ] 

Wes McKinney commented on ARROW-4350:
-

[~jorisvandenbossche] could you take a look and maybe clarify the issue title 
etc.?

> [Python] nested numpy arrays
> 
>
> Key: ARROW-4350
> URL: https://issues.apache.org/jira/browse/ARROW-4350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
>Reporter: yu peng
>Priority: Major
> Fix For: 0.14.0
>
>
> {code:java}
> In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})
> In [20]: df.iloc[0].to_dict()
> Out[20]: {'a': [[1], [2]], 'b': 1}
> In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
> Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}
> In [24]: np.array(df.iloc[0].to_dict()['a']).shape
> Out[24]: (2, 1)
> In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
> Out[25]: (2,)
> {code}
> The extra level of array nesting does not round-trip as expected. 
>  
> More importantly, this would fail
>  
> {code:java}
> In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': 
> [[1, 2],[2, 3]]})
> In [109]: df
> Out[109]:
> a b
> 0 [[1, 2], [2, 3]] [1, 2]
> 1 [[1, 2], [2, 3]] [2, 3]
> In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> ---
> ArrowTypeError Traceback (most recent call last)
>  in ()
> > 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table.from_pandas()
> 1215 
> 1216 """
> -> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
> 1218 df,
> 1219 schema=schema,
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 379 arrays = [convert_column(c, t)
> 380 for c, t in zip(columns_to_convert,
> --> 381 convert_types)]
> 382 else:
> 383 from concurrent import futures
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in convert_column(col, ty)
> 374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
> 375 .format(col.name, col.dtype),)
> --> 376 raise e
> 377
> 378 if nthreads == 1:
> ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
> 'Conversion failed for column a with type object')
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4333:

Fix Version/s: (was: 0.14.0)

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory? How 
> to communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable across 
> a ChunkedArray? How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control? 
> Where should synchronization happen?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4337) [C#] Array / RecordBatch Builder Fluent API

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4337:

Fix Version/s: (was: 0.14.0)

> [C#] Array / RecordBatch Builder Fluent API
> ---
>
> Key: ARROW-4337
> URL: https://issues.apache.org/jira/browse/ARROW-4337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Chris Hutchinson
>Assignee: Chris Hutchinson
>Priority: Major
>  Labels: c#, pull-request-available
>   Original Estimate: 12h
>  Time Spent: 5h 10m
>  Remaining Estimate: 6h 50m
>
> Implement a fluent API for building arrays and record batches from Arrow 
> buffers, flat arrays, spans, enumerables, etc.
> A future implementation could extend this API with support for ADO.NET 
> DataTables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4324) [Python] Array dtype inference incorrect when created from list of mixed numpy scalars

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852574#comment-16852574
 ] 

Wes McKinney commented on ARROW-4324:
-

[~jorisvandenbossche] could you take a look?

> [Python] Array dtype inference incorrect when created from list of mixed 
> numpy scalars
> --
>
> Key: ARROW-4324
> URL: https://issues.apache.org/jira/browse/ARROW-4324
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Keith Kraus
>Priority: Minor
> Fix For: 0.14.0
>
>
> Minimal reproducer:
> {code:python}
> import pyarrow as pa
> import numpy as np
> test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]
> test_array = pa.array(test_list)
> # Expected
> # test_array
> # 
> # [
> #   10,
> #   0.5
> # ]
> # Got
> # test_array
> # 
> # [
> #   10,
> #   0
> # ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization

2019-05-30 Thread Yuqi Gu (JIRA)
Yuqi Gu created ARROW-5458:
--

 Summary: Apache Arrow parallel CRC32c computation optimization
 Key: ARROW-5458
 URL: https://issues.apache.org/jira/browse/ARROW-5458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yuqi Gu


ARMv8 defines the VMULL/PMULL crypto instructions.
This patch optimizes the crc32c calculation by using these instructions when
available, rather than the original linear CRC instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5453:
--
Labels: pull-request-available  (was: )

> [C++] Just-released cmake-format 0.5.2 breaks the build
> ---
>
> Key: ARROW-5453
> URL: https://issues.apache.org/jira/browse/ARROW-5453
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> It seems we should always pin the cmake-format version until the developers 
> stop changing the formatting algorithm



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2652) [C++/Python] Document how to provide information on segfaults

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2652:

Fix Version/s: (was: 0.14.0)

> [C++/Python] Document how to provide information on segfaults
> -
>
> Key: ARROW-2652
> URL: https://issues.apache.org/jira/browse/ARROW-2652
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> We often have users that report segmentation faults in {{pyarrow}}. This will 
> sadly keep reappearing as we also don't have the magical ability of writing 
> 100%-bug-free code. Thus we should have a small section in our documentation 
> on how people can give us the relevant information in the case of a 
> segmentation fault. Preferably the documentation covers {{gdb}} and {{lldb}}. 
> They both have similar commands but differ in some minor flags.
> For an example of the kind of comment I gave to a user in a ticket, see 
> https://github.com/apache/arrow/issues/2089#issuecomment-393477116



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2606) [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2606:
---

Assignee: (was: Uwe L. Korn)

> [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm
> -
>
> Key: ARROW-2606
> URL: https://issues.apache.org/jira/browse/ARROW-2606
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249. We need to 
> find the correct code to construct Java decimals and fill them into a 
> {{DecimalVector}}. Afterwards, we should activate the decimal128 type on 
> {{test_jvm_array}} and ensure that we load them correctly from Java into 
> Python.
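> On the Python side, a minimal sketch of the target array (assuming the 
> usual pyarrow decimal API; the JVM-side construction is the open part):
> {code}
> # The decimal128 array that the Java -> Python round trip should match.
> from decimal import Decimal
> import pyarrow as pa
>
> arr = pa.array([Decimal('12.34'), None], type=pa.decimal128(5, 2))
> {code}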



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2610) [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2610:
---

Assignee: (was: Uwe L. Korn)

> [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2610
> URL: https://issues.apache.org/jira/browse/ARROW-2610
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> The DictionaryType is a bit more complex as it also references the dictionary 
> values itself. This also needs to be integrated into 
> {{pyarrow.Field.from_jvm}}, but the work to make DictionaryType work may 
> also depend on {{pyarrow.Array.from_jvm}} first supporting non-primitive 
> arrays.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2619) [Rust] Move JSON serde code to separate file/module

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852507#comment-16852507
 ] 

Wes McKinney commented on ARROW-2619:
-

Still of interest for 0.14?

> [Rust] Move JSON serde code to separate file/module
> ---
>
> Key: ARROW-2619
> URL: https://issues.apache.org/jira/browse/ARROW-2619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2587) [Python] Unable to write StructArrays with multiple children to parquet

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852504#comment-16852504
 ] 

Wes McKinney commented on ARROW-2587:
-

Nested Parquet is not yet on my immediate critical path, but it will be 
eventually (hopefully in 2019)

> [Python] Unable to write StructArrays with multiple children to parquet
> ---
>
> Key: ARROW-2587
> URL: https://issues.apache.org/jira/browse/ARROW-2587
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: Screen Shot 2018-05-16 at 12.24.39.png
>
>
> Although I am able to read StructArray from parquet, I am still unable to 
> write it back from pa.Table to parquet.
> I get an "ArrowInvalid: Nested column branch had multiple children"
> Here is a quick example:
> {noformat}
> In [2]: import pyarrow.parquet as pq
> In [3]: table = pq.read_table('test.parquet')
> In [4]: table
>  Out[4]: 
>  pyarrow.Table
>  weight: double
>  animal_type: string
>  animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [5]: table.schema
>  Out[5]: 
>  weight: double
>  animal_type: string
>  animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [6]: pq.write_table(table,"test_write.parquet")
>  ---
>  ArrowInvalid  Traceback (most recent call last)
>   in ()
>  > 1 pq.write_table(table,"test_write.parquet")
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(table, where, row_group_size, version, use_dictionary, 
> compression, use_deprecated_int96_timestamps, coerce_timestamps, flavor, 
> **kwargs)
>      982 use_deprecated_int96_timestamps=use_int96,
>      983 **kwargs) as writer:
>  --> 984 writer.write_table(table, row_group_size=row_group_size)
>      985 except Exception:
>      986 if is_path(where):
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(self, table, row_group_size)
>      325 table = _sanitize_table(table, self.schema, self.flavor)
>      326 assert self.is_open
>  --> 327 self.writer.write_table(table, row_group_size=row_group_size)
>      328 
>      329 def close(self):
> /usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so in 
> pyarrow._parquet.ParquetWriter.write_table()
> /usr/local/lib/python2.7/dist-packages/pyarrow/lib.so in 
> pyarrow.lib.check_status()
> ArrowInvalid: Nested column branch had multiple children
> {noformat}
>  
> I would really appreciate a fix on this.
> Best,
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2610) [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852506#comment-16852506
 ] 

Wes McKinney commented on ARROW-2610:
-

This should be less complex now after the recent DictionaryType changes

> [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2610
> URL: https://issues.apache.org/jira/browse/ARROW-2610
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> The DictionaryType is a bit more complex as it also references the dictionary 
> values itself. This also needs to be integrated into 
> {{pyarrow.Field.from_jvm}}, but the work to make DictionaryType work may 
> also depend on {{pyarrow.Array.from_jvm}} first supporting non-primitive 
> arrays.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2609) [Java/Python] Complex type conversion in pyarrow.Field.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2609:

Fix Version/s: (was: 0.14.0)

> [Java/Python] Complex type conversion in pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2609
> URL: https://issues.apache.org/jira/browse/ARROW-2609
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> The converter {{pyarrow.Field.from_jvm}} currently only works for primitive 
> types. Types like List, Struct or Union that have children in their 
> definition are not supported. We should add the needed recursion for these 
> types and enable the respective tests.
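> A hedged sketch of the needed recursion (helper and JVM accessors 
> hypothetical; only the struct case is shown, List and Union would use 
> {{pa.list_}} / {{pa.union}} with their converted children):
> {code}
> import pyarrow as pa
>
> def field_from_jvm_nested(jvm_field):
>     # jvm_field stands for an org.apache.arrow...pojo.Field handle;
>     # primitive_field_from_jvm is the existing primitive-only converter.
>     children = [field_from_jvm_nested(c) for c in jvm_field.getChildren()]
>     if children:
>         return pa.field(str(jvm_field.getName()), pa.struct(children))
>     return primitive_field_from_jvm(jvm_field)
> {code}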



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2606) [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2606:

Fix Version/s: (was: 0.14.0)

> [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm
> -
>
> Key: ARROW-2606
> URL: https://issues.apache.org/jira/browse/ARROW-2606
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249. We need to 
> find the correct code to construct Java decimals and fill them into a 
> {{DecimalVector}}. Afterwards, we should activate the decimal128 type on 
> {{test_jvm_array}} and ensure that we load them correctly from Java into 
> Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2605) [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852505#comment-16852505
 ] 

Wes McKinney commented on ARROW-2605:
-

cc [~jorisvandenbossche]

> [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm
> -
>
> Key: ARROW-2605
> URL: https://issues.apache.org/jira/browse/ARROW-2605
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249 as we are 
> missing the necessary methods to construct these arrays conveniently on the 
> Python side.
> Once there is a path to construct {{pyarrow.Array}} instances from a Python 
> list of {{datetime.time}} for the various time types, we should activate the 
> time types on {{test_jvm_array}} and ensure that we load them correctly from 
> Java into Python.
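> Once such a path exists, the target construction might look like this 
> minimal sketch (behavior assumed, since this conversion is exactly what is 
> missing at the time of writing):
> {code}
> # Building a time32 array from datetime.time values.
> import datetime
> import pyarrow as pa
>
> arr = pa.array([datetime.time(12, 34, 56), None], type=pa.time32('s'))
> {code}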



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2587) [Python] Unable to write StructArrays with multiple children to parquet

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2587:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Unable to write StructArrays with multiple children to parquet
> ---
>
> Key: ARROW-2587
> URL: https://issues.apache.org/jira/browse/ARROW-2587
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: Screen Shot 2018-05-16 at 12.24.39.png
>
>
> Although I am able to read StructArray from parquet, I am still unable to 
> write it back from pa.Table to parquet.
> I get an "ArrowInvalid: Nested column branch had multiple children"
> Here is a quick example:
> {noformat}
> In [2]: import pyarrow.parquet as pq
> In [3]: table = pq.read_table('test.parquet')
> In [4]: table
>  Out[4]: 
>  pyarrow.Table
>  weight: double
>  animal_type: string
>  animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [5]: table.schema
>  Out[5]: 
>  weight: double
>  animal_type: string
>  animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [6]: pq.write_table(table,"test_write.parquet")
>  ---
>  ArrowInvalid  Traceback (most recent call last)
>   in ()
>  > 1 pq.write_table(table,"test_write.parquet")
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(table, where, row_group_size, version, use_dictionary, 
> compression, use_deprecated_int96_timestamps, coerce_timestamps, flavor, 
> **kwargs)
>      982 use_deprecated_int96_timestamps=use_int96,
>      983 **kwargs) as writer:
>  --> 984 writer.write_table(table, row_group_size=row_group_size)
>      985 except Exception:
>      986 if is_path(where):
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(self, table, row_group_size)
>      325 table = _sanitize_table(table, self.schema, self.flavor)
>      326 assert self.is_open
>  --> 327 self.writer.write_table(table, row_group_size=row_group_size)
>      328 
>      329 def close(self):
> /usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so in 
> pyarrow._parquet.ParquetWriter.write_table()
> /usr/local/lib/python2.7/dist-packages/pyarrow/lib.so in 
> pyarrow.lib.check_status()
> ArrowInvalid: Nested column branch had multiple children
> {noformat}
>  
> I would really appreciate a fix on this.
> Best,
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2501) [Java] Remove Jackson from compile-time dependencies for arrow-vector

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852501#comment-16852501
 ] 

Wes McKinney commented on ARROW-2501:
-

Could this be done in 0.14? cc [~pravindra] [~siddteotia]

> [Java] Remove Jackson from compile-time dependencies for arrow-vector
> -
>
> Key: ARROW-2501
> URL: https://issues.apache.org/jira/browse/ARROW-2501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.9.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I would like to upgrade Jackson to the latest version (2.9.5). If there are 
> no objections I will create a PR (it is literally just changing the version 
> number in the pom - no code changes required).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2512) [Python] Enable direct interaction of GPU Objects in Python

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2512:

Fix Version/s: (was: 0.14.0)

> [Python] Enable direct interaction of GPU Objects in Python
> ---
>
> Key: ARROW-2512
> URL: https://issues.apache.org/jira/browse/ARROW-2512
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma, GPU, Python
>Reporter: William Paul
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Plasma can now manage objects on the GPU, but in order to use this 
> functionality in Python, there needs to be some way to represent these GPU 
> objects in Python that allows computation on the GPU.
> The easiest way to enable this is to rely on a third-party library, such as 
> PyTorch, which will allow us to use all of its existing functionality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2532) [C++] Add chunked builder classes

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2532:

Fix Version/s: (was: 0.14.0)

> [C++] Add chunked builder classes
> -
>
> Key: ARROW-2532
> URL: https://issues.apache.org/jira/browse/ARROW-2532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> I think it would be useful to have chunked builders for list, string and 
> binary types. A chunked builder would produce a chunked array as output, 
> circumventing the 32-bit offset limit of those types (illustrated below). 
> There's some special-casing scattered around our NumPy conversion routines 
> right now.
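> A short illustration of the idea (hedged; chunk boundaries in practice 
> would be chosen by the builder as offsets approach 2^31):
> {code}
> # Each chunk carries its own int32 offsets, so a ChunkedArray's total
> # size is not bound by a single array's 2**31-byte offset limit.
> import pyarrow as pa
>
> chunked = pa.chunked_array([pa.array([b'a', b'b']), pa.array([b'c'])])
> assert chunked.num_chunks == 2 and len(chunked) == 3
> {code}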



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2607) [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2607:
---

Assignee: (was: Uwe L. Korn)

> [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
> ---
>
> Key: ARROW-2607
> URL: https://issues.apache.org/jira/browse/ARROW-2607
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249: Currently 
> only primitive arrays are supported in {{pyarrow.Array.from_jvm}} as it uses 
> {{pyarrow.Array.from_buffers}} underneath. We should extend one of the two 
> functions to be able to deal with string arrays. There is a currently failing 
> unit test {{test_jvm_string_array}} in {{pyarrow/tests/test_jvm.py}} to 
> verify the implementation.
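> For comparison, a minimal sketch of the primitive path that already works 
> (int32 buffer layout assumed: validity bitmap, then data; a string array 
> additionally needs an offsets buffer):
> {code}
> # Building a primitive array from raw buffers, the route from_jvm uses.
> import struct
> import pyarrow as pa
>
> data = pa.py_buffer(struct.pack('<2i', 10, 20))
> arr = pa.Array.from_buffers(pa.int32(), 2, [None, data])  # -> [10, 20]
> {code}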



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2609) [Java/Python] Complex type conversion in pyarrow.Field.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2609:
---

Assignee: (was: Uwe L. Korn)

> [Java/Python] Complex type conversion in pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2609
> URL: https://issues.apache.org/jira/browse/ARROW-2609
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> The converter {{pyarrow.Field.from_jvm}} currently only works for primitive 
> types. Types like List, Struct or Union that have children in their 
> definition are not supported. We should add the needed recursion for these 
> types and enable the respective tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2605) [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2605:

Fix Version/s: (was: 0.14.0)

> [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm
> -
>
> Key: ARROW-2605
> URL: https://issues.apache.org/jira/browse/ARROW-2605
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249 as we are 
> missing the necessary methods to construct these arrays conveniently on the 
> Python side.
> Once there is a path to construct {{pyarrow.Array}} instances from a Python 
> list of {{datetime.time}} for the various time types, we should activate the 
> time types on {{test_jvm_array}} and ensure that we load them correctly from 
> Java into Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2610) [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2610:

Fix Version/s: (was: 0.14.0)

> [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2610
> URL: https://issues.apache.org/jira/browse/ARROW-2610
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>
> The DictionaryType is a bit more complex as it also references the dictionary 
> values itself. This also needs to be integrated into 
> {{pyarrow.Field.from_jvm}}, but the work to make DictionaryType work may 
> also depend on {{pyarrow.Array.from_jvm}} first supporting non-primitive 
> arrays.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2605) [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2605:
---

Assignee: (was: Uwe L. Korn)

> [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm
> -
>
> Key: ARROW-2605
> URL: https://issues.apache.org/jira/browse/ARROW-2605
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249 as we are 
> missing the necessary methods to construct these arrays conveniently on the 
> Python side.
> Once there is a path to construct {{pyarrow.Array}} instances from a Python 
> list of {{datetime.time}} for the various time types, we should activate the 
> time types on {{test_jvm_array}} and ensure that we load them correctly from 
> Java into Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2607) [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2607:

Fix Version/s: (was: 0.14.0)

> [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
> ---
>
> Key: ARROW-2607
> URL: https://issues.apache.org/jira/browse/ARROW-2607
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249: Currently 
> only primitive arrays are supported in {{pyarrow.Array.from_jvm}}, as it uses 
> {{pyarrow.Array.from_buffers}} underneath. We should extend one of the two 
> functions to handle string arrays. There is a currently failing unit test, 
> {{test_jvm_string_array}} in {{pyarrow/tests/test_jvm.py}}, that verifies the 
> implementation.
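For context, an Arrow string array is backed by three buffers (validity bitmap, int32 offsets, character data). A minimal sketch of assembling one via {{pyarrow.Array.from_buffers}}, assuming a pyarrow version where non-primitive types are accepted there:

{code:python}
import struct
import pyarrow as pa

# Two strings, "ab" and "c": offsets [0, 2, 3] over the data b"abc".
offsets = pa.py_buffer(struct.pack('=3i', 0, 2, 3))  # native-endian int32
data = pa.py_buffer(b'abc')

# No nulls, so the validity buffer is None and null_count is 0.
arr = pa.Array.from_buffers(pa.string(), 2, [None, offsets, data], null_count=0)
assert arr.to_pylist() == ['ab', 'c']
{code}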



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2600:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Add additional LocalFileSystem filesystem methods
> --
>
> Key: ARROW-2600
> URL: https://issues.apache.org/jira/browse/ARROW-2600
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alex Hagerman
>Priority: Minor
>  Labels: filesystem, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to https://issues.apache.org/jira/browse/ARROW-1319, I noticed that 
> the methods Martin listed are also not part of the LocalFileSystem class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3277) [Python] Validate manylinux1 builds with crossbow instead of each Travis CI build

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3277:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Validate manylinux1 builds with crossbow instead of each Travis CI 
> build
> -
>
> Key: ARROW-3277
> URL: https://issues.apache.org/jira/browse/ARROW-3277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> The recent manylinux1 timeouts bring up a bigger question, which is 
> centralizing the validation of packaging builds. We definitely want the 
> project to be notified in a timely way when there is some problem with a 
> packaging build -- since the manylinux1 build can be run locally in Docker, 
> it is easier to debug and need not necessarily run on every commit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3378) [C++] Implement whitespace CSV tokenizer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3378:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Implement whitespace CSV tokenizer
> 
>
> Key: ARROW-3378
> URL: https://issues.apache.org/jira/browse/ARROW-3378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3332) [Gandiva] Remove usages of mutable reference out arguments

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3332:

Fix Version/s: (was: 0.14.0)

> [Gandiva] Remove usages of mutable reference out arguments
> --
>
> Key: ARROW-3332
> URL: https://issues.apache.org/jira/browse/ARROW-3332
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Priority: Major
>
> I have noticed several usages of mutable reference out arguments, e.g. in 
> gandiva/regex_util.h. We should change these to conform to the style guide 
> (out arguments passed as pointers).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3359) [Gandiva][C++] Nest gandiva inside arrow namespace?

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3359:

Fix Version/s: (was: 0.14.0)

> [Gandiva][C++] Nest gandiva inside arrow namespace?
> ---
>
> Key: ARROW-3359
> URL: https://issues.apache.org/jira/browse/ARROW-3359
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Priority: Major
>
> This would make for more readable code by making symbols from the outer 
> {{arrow}} namespace visible without qualification.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3379) [C++] Implement regex/multichar delimiter tokenizer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3379:

Labels: csv datasets  (was: csv)

> [C++] Implement regex/multichar delimiter tokenizer
> ---
>
> Key: ARROW-3379
> URL: https://issues.apache.org/jira/browse/ARROW-3379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, datasets
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-5445) [Website] Remove language that encourages pinning a version

2019-05-30 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-5445.
---
Resolution: Won't Fix

https://github.com/apache/arrow/pull/4411#discussion_r288957237

{quote}
Version pinning is commonplace in the Python world -- I don't think API 
stability has much to do with it (we will still have some API changes or 
deprecations after 1.0 I would guess)
{quote}

> [Website] Remove language that encourages pinning a version
> ---
>
> Key: ARROW-5445
> URL: https://issues.apache.org/jira/browse/ARROW-5445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Priority: Minor
> Fix For: 1.0.0
>
>
> See [https://github.com/apache/arrow/pull/4411#discussion_r288804415]. 
> Whenever we decide to stop threatening to break APIs (1.0 release or 
> otherwise), purge any recommendations like this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   >