[jira] [Commented] (ARROW-4446) [Python] Run Gandiva tests on Windows and Appveyor

2019-02-05 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761542#comment-16761542
 ] 

Pindikura Ravindra commented on ARROW-4446:
---

[~shyamsingh] can you open a new Jira for fixing the date/tz issue on Windows? 
Until that's fixed, we could disable the test on Windows.

> [Python] Run Gandiva tests on Windows and Appveyor
> --
>
> Key: ARROW-4446
> URL: https://issues.apache.org/jira/browse/ARROW-4446
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4446) [Python] Run Gandiva tests on Windows and Appveyor

2019-02-05 Thread shyam narayan singh (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761540#comment-16761540
 ] 

shyam narayan singh commented on ARROW-4446:


For the timezone implementation, we are using the operating system's time zone 
database. On Windows, the expectation is that the IANA database has been 
downloaded into the required location, as specified below.

REF: [https://howardhinnant.github.io/date/tz.html]

1. USE_OS_TZDB : If {{USE_OS_TZDB}} is {{1}} then this library will use the 
zic-compiled time zone database provided by your OS. This option relieves you 
of having to install the IANA time zone database, either manually, or 
automatically with {{AUTO_DOWNLOAD}}. This option is not available on Windows.

2. If the macro {{INSTALL}} is not defined, the default location of the 
database is {{~/Downloads/tzdata}} ({{%homedrive%\%homepath%\downloads\tzdata}} 
on Windows).

3. On Windows, {{HAS_REMOTE_API}} defaults to {{0}}. Everywhere else it 
defaults to {{1}}. This is because [{{libcurl}}|https://curl.haxx.se/libcurl/] 
comes preinstalled everywhere but Windows, but it is available for Windows.

4. If you want to enable {{HAS_REMOTE_API}} and/or {{AUTO_DOWNLOAD}} on Windows 
you will have to manually install [curl|https://curl.haxx.se/libcurl/] and 
[7-zip|http://www.7-zip.org/] into their default locations.

5. If you do not enable {{HAS_REMOTE_API}}, you will also need to install 
[http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml] into 
your {{install}} location. This will be done for you if you have enabled 
{{HAS_REMOTE_API}} and let {{AUTO_DOWNLOAD}} default to {{1}}.
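
Until then, a minimal sketch of a test-side guard that skips the Gandiva 
timezone tests on Windows when the tzdata directory from item 2 is missing. 
This is not code from the Arrow repo; the helper name and pytest usage are 
illustrative only.

{code:python}
import os
import sys

import pytest


def windows_tzdata_present():
    # Outside Windows the date library can use the OS tz database
    # (USE_OS_TZDB); on Windows it expects the IANA data to have been
    # downloaded to %homedrive%%homepath%\Downloads\tzdata (item 2 above).
    if not sys.platform.startswith("win"):
        return True
    home = os.environ.get("HOMEDRIVE", "") + os.environ.get("HOMEPATH", "")
    return os.path.isdir(os.path.join(home, "Downloads", "tzdata"))


skip_without_tzdata = pytest.mark.skipif(
    not windows_tzdata_present(),
    reason="IANA tzdata not installed; see the date/tz issue on Windows")


@skip_without_tzdata
def test_timestamp_functions():
    ...  # the Gandiva date/tz test body would go here
{code}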

> [Python] Run Gandiva tests on Windows and Appveyor
> --
>
> Key: ARROW-4446
> URL: https://issues.apache.org/jira/browse/ARROW-4446
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2058) [Packaging] Add wheels for Alpine Linux

2019-02-05 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn closed ARROW-2058.
--
Resolution: Won't Fix

> [Packaging] Add wheels for Alpine Linux
> ---
>
> Key: ARROW-2058
> URL: https://issues.apache.org/jira/browse/ARROW-2058
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging, Python
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.4.1, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 
> 0.8.0
>Reporter: Omer Katz
>Priority: Major
>  Labels: alpine
>
> Currently when attempting to install pyarrow using pip on Alpine Linux you 
> get the following error message from pip:
>  
> {code:java}
> pip install pyarrow --verbose
> Collecting pyarrow
>   1 location(s) to search for versions of pyarrow:
>   * https://pypi.python.org/simple/pyarrow/
>   Getting page https://pypi.python.org/simple/pyarrow/
>   Looking up "https://pypi.python.org/simple/pyarrow/; in the cache
>   Current age based on date: 596
>   Freshness lifetime from max-age: 600
>   Freshness lifetime from request max-age: 600
>   The response is "fresh", returning cached response
>   600 > 596
>   Analyzing links from page https://pypi.python.org/simple/pyarrow/
>     Skipping link 
> https://pypi.python.org/packages/03/fe/d3e86d9a534093f84ec6bb92c5285796feca7713f9328cc2b607ee9fc158/pyarrow-0.2.0-cp35-cp35m-manylinux1_x86_64.whl#md5=283d6d42277a07f724c0d944ff031c0c
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/06/e9/ac196752b306732afedf415805d327752bd85fb1e4517b97085129b5d02e/pyarrow-0.4.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=884433983d1482e9eba7cdedd82201e5
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/1c/c7e00871d85f091cbe4b71dd6ff2ce393b6e736d6defd806f571da87280c/pyarrow-0.5.0-cp36-cp36m-win_amd64.whl#md5=d7e3d8b9d17e7a964c058f879e11e733
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/e8/e907b7e597981e488d60ea8554db0c6b55a4ddc01ad31bb18156f1dc1526/pyarrow-0.5.0.post2-cp34-cp34m-manylinux1_x86_64.whl#md5=9353e2bcfc77a2b40daa5d31cb9c5dac
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/01/2e283b8fae727c4932a4335e2b1980a65c2ef754c69a7d97e39b0157627d/pyarrow-0.7.0-cp34-cp34m-manylinux1_x86_64.whl#md5=6d8ec243f77a382667b6f9b0aa434fd2
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/19/805aa541740279bc8a198eeeb57509de5551f55f0cbc6371fa897bfc3245/pyarrow-0.8.0-cp35-cp35m-manylinux1_x86_64.whl#md5=382cb788fd740b0e25be3b305ab46142
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0d/39/b0e21b10b53386f3dad906a8b734074cc0008c5af6a31d2e441569633d94/pyarrow-0.6.0-cp36-cp36m-manylinux1_x86_64.whl#md5=392930f4ace76ac65965258f5da99e9d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0f/22/97ba96f7178a52f257b45eac079d7a47dc4bc22d0961e828f10a76c254a7/pyarrow-0.4.1-cp35-cp35m-macosx_10_6_intel.whl#md5=96db8da8ee09952e62731ef8afd1f20d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/15/5c/20192ab842b291d889f12f7013a5ac5c4416e231377024ad6823fc42a456/pyarrow-0.8.0-cp35-cp35m-win_amd64.whl#md5=8123173dc4905e7186ecf35ba180817a
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/20/b6/50f42a2dd53e0679ffe5fb74bdc745fcad3b5e0975e9ae356256c0370d06/pyarrow-0.7.1-cp35-cp35m-macosx_10_6_intel.whl#md5=5d06b3332b5bac0682d55f20ab2cb556
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/22/2f/7bf9225142d9db6e67e74cff8a18aa98514159cb5c96b15d15044db9ea5f/pyarrow-0.7.1-cp35-cp35m-win_amd64.whl#md5=111be7aac9a73210c2b1ae8e1e459819
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/23/60/f3db27c6a201994a5b1afb4f263afdfa22f5292380379900d7af325d679f/pyarrow-0.5.0-cp35-cp35m-win_amd64.whl#md5=cf45b4190ba1079cc2532c1a9fd09285
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> 

[jira] [Commented] (ARROW-2058) [Packaging] Add wheels for Alpine Linux

2019-02-05 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761518#comment-16761518
 ] 

Uwe L. Korn commented on ARROW-2058:


You cannot provide these wheels on PyPI. There is no platform tag yet that 
indicates the use of musl libc. The musl/Alpine community must first invest in 
a standard like manylinux1 (which is only for glibc-based distros).

If there were an alternative package repository for wheels on Alpine Linux, we 
could upload wheels there, but that does not seem to exist either.

Closing as "Won't Fix" until the Alpine community has addressed this.

> [Packaging] Add wheels for Alpine Linux
> ---
>
> Key: ARROW-2058
> URL: https://issues.apache.org/jira/browse/ARROW-2058
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging, Python
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.4.1, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 
> 0.8.0
>Reporter: Omer Katz
>Priority: Major
>  Labels: alpine
>
> Currently when attempting to install pyarrow using pip on Alpine Linux you 
> get the following error message from pip:
>  
> {code:java}
> pip install pyarrow --verbose
> Collecting pyarrow
>   1 location(s) to search for versions of pyarrow:
>   * https://pypi.python.org/simple/pyarrow/
>   Getting page https://pypi.python.org/simple/pyarrow/
>   Looking up "https://pypi.python.org/simple/pyarrow/; in the cache
>   Current age based on date: 596
>   Freshness lifetime from max-age: 600
>   Freshness lifetime from request max-age: 600
>   The response is "fresh", returning cached response
>   600 > 596
>   Analyzing links from page https://pypi.python.org/simple/pyarrow/
>     Skipping link 
> https://pypi.python.org/packages/03/fe/d3e86d9a534093f84ec6bb92c5285796feca7713f9328cc2b607ee9fc158/pyarrow-0.2.0-cp35-cp35m-manylinux1_x86_64.whl#md5=283d6d42277a07f724c0d944ff031c0c
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/06/e9/ac196752b306732afedf415805d327752bd85fb1e4517b97085129b5d02e/pyarrow-0.4.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=884433983d1482e9eba7cdedd82201e5
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/1c/c7e00871d85f091cbe4b71dd6ff2ce393b6e736d6defd806f571da87280c/pyarrow-0.5.0-cp36-cp36m-win_amd64.whl#md5=d7e3d8b9d17e7a964c058f879e11e733
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/e8/e907b7e597981e488d60ea8554db0c6b55a4ddc01ad31bb18156f1dc1526/pyarrow-0.5.0.post2-cp34-cp34m-manylinux1_x86_64.whl#md5=9353e2bcfc77a2b40daa5d31cb9c5dac
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/01/2e283b8fae727c4932a4335e2b1980a65c2ef754c69a7d97e39b0157627d/pyarrow-0.7.0-cp34-cp34m-manylinux1_x86_64.whl#md5=6d8ec243f77a382667b6f9b0aa434fd2
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/19/805aa541740279bc8a198eeeb57509de5551f55f0cbc6371fa897bfc3245/pyarrow-0.8.0-cp35-cp35m-manylinux1_x86_64.whl#md5=382cb788fd740b0e25be3b305ab46142
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0d/39/b0e21b10b53386f3dad906a8b734074cc0008c5af6a31d2e441569633d94/pyarrow-0.6.0-cp36-cp36m-manylinux1_x86_64.whl#md5=392930f4ace76ac65965258f5da99e9d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0f/22/97ba96f7178a52f257b45eac079d7a47dc4bc22d0961e828f10a76c254a7/pyarrow-0.4.1-cp35-cp35m-macosx_10_6_intel.whl#md5=96db8da8ee09952e62731ef8afd1f20d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/15/5c/20192ab842b291d889f12f7013a5ac5c4416e231377024ad6823fc42a456/pyarrow-0.8.0-cp35-cp35m-win_amd64.whl#md5=8123173dc4905e7186ecf35ba180817a
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/20/b6/50f42a2dd53e0679ffe5fb74bdc745fcad3b5e0975e9ae356256c0370d06/pyarrow-0.7.1-cp35-cp35m-macosx_10_6_intel.whl#md5=5d06b3332b5bac0682d55f20ab2cb556
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> 

[jira] [Assigned] (ARROW-1572) [C++] Implement "value counts" kernels for tabulating value frequencies

2019-02-05 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-1572:
--

Assignee: Micah Kornfield

> [C++] Implement "value counts" kernels for tabulating value frequencies
> ---
>
> Key: ARROW-1572
> URL: https://issues.apache.org/jira/browse/ARROW-1572
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> This is related to "match", "isin", and "unique" since hashing is generally 
> required
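
For reference, the intended tabulation behavior, with pandas shown purely as 
the semantic model; the eventual C++ kernel and its API are not implied here.

{code:python}
import pandas as pd

# Tabulate the frequency of each distinct value; a hash table over the
# values is what ties this to "match", "isin", and "unique".
counts = pd.Series(["a", "b", "a", "c", "a"]).value_counts()
# a    3
# b    1
# c    1
{code}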



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1636) Integration tests for null type

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761514#comment-16761514
 ] 

Wes McKinney commented on ARROW-1636:
-

We observe data that is all nulls, so in such cases we refuse to guess about 
the type. It might be a good idea for Gandiva to support null at some point 
(which can cast to anything)
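
For illustration, this is how an all-null input comes out of pyarrow's type 
inference, a small sketch of the behavior described above:

{code:python}
import pyarrow as pa

# With no non-null values to inspect, inference yields the null type
# rather than guessing a concrete type.
arr = pa.array([None, None, None])
print(arr.type)  # null
{code}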

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1636) Integration tests for null type

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761514#comment-16761514
 ] 

Wes McKinney edited comment on ARROW-1636 at 2/6/19 6:03 AM:
-

We observe (un-schema'd) data that is all nulls, so in such cases we refuse to 
guess about the type. It might be a good idea for Gandiva to support null at 
some point (which can cast to anything)


was (Author: wesmckinn):
We observe data that is all nulls, so in such cases we refuse to guess about 
the type. It might be a good idea for Gandiva to support null at some point 
(which can cast to anything)

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-590) Add integration tests for Union types

2019-02-05 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761512#comment-16761512
 ] 

Pindikura Ravindra commented on ARROW-590:
--

[~wesmckinn], will do. Will work on this next week.

> Add integration tests for Union types
> -
>
> Key: ARROW-590
> URL: https://issues.apache.org/jira/browse/ARROW-590
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: columnar-format-1.0, pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4491) [Python] Remove usage of std::to_string and std::stoi

2019-02-05 Thread Philipp Moritz (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761501#comment-16761501
 ] 

Philipp Moritz commented on ARROW-4491:
---

Ok, I think I understand this now. On some implementations, {{int8_t}} seems to 
be a typedef for {{char}}, and the conversion in that case produces a character 
and not a number.

> [Python] Remove usage of std::to_string and std::stoi
> -
>
> Key: ARROW-4491
> URL: https://issues.apache.org/jira/browse/ARROW-4491
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Not sure why this is happening, but for some older compilers I'm seeing
> {code:java}
> terminate called after throwing an instance of 'std::invalid_argument'
>   what():  stoi{code}
> since [https://github.com/apache/arrow/pull/3423].
> Possible cause is that there is no int8_t version of 
> [std::to_string|https://en.cppreference.com/w/cpp/string/basic_string/to_string], 
> so it might not convert it to a proper string representation of the number.
> Any insight on why this could be happening is appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1636) Integration tests for null type

2019-02-05 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761503#comment-16761503
 ] 

Pindikura Ravindra commented on ARROW-1636:
---

Gandiva doesn't support null types. It's a bit convoluted, but the list of 
supported types in Gandiva is here:

[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/llvm_types.cc#L24]

Curious, what's the use case for a null type?

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1636) Integration tests for null type

2019-02-05 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reassigned ARROW-1636:
-

Assignee: Pindikura Ravindra

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2913) [Python] Exported buffers don't expose type information

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2913.
---
   Resolution: Won't Fix
Fix Version/s: (was: 0.13.0)

> [Python] Exported buffers don't expose type information
> ---
>
> Key: ARROW-2913
> URL: https://issues.apache.org/jira/browse/ARROW-2913
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> Using the {{buffers()}} method on an array gives you a list of buffers backing 
> the array, but those buffers lose typing information:
> {code:python}
> >>> a = pa.array(range(10))
> >>> a.type
> DataType(int64)
> >>> buffers = a.buffers()
> >>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
> [('b', (2,)), ('b', (80,))]
> {code}
> Conversely, Numpy exposes type information in the Python buffer protocol:
> {code:python}
> >>> a = pa.array(range(10))
> >>> memoryview(a.to_numpy()).format
> 'l'
> >>> memoryview(a.to_numpy()).shape
> (10,)
> {code}
> Exposing type information on buffers could be important for third-party 
> systems, such as Dask/distributed, for type-based data compression when 
> serializing.
> Since our C++ buffers are not typed, it's not obvious how to solve this. 
> Should we return tensors instead?
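
In the meantime, the type information can be reapplied by hand when viewing the 
data buffer, since the array itself still carries it. A sketch, with 
validity-bitmap handling omitted:

{code:python}
import numpy as np
import pyarrow as pa

a = pa.array(range(10))
validity, data = a.buffers()
# The exported buffer is untyped; reinterpret it using the type carried
# by the array itself.
values = np.frombuffer(data, dtype="int64")[:len(a)]
{code}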



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3016) [C++] Add ability to enable call stack logging for each memory allocation

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3016:

Fix Version/s: (was: 0.13.0)

> [C++] Add ability to enable call stack logging for each memory allocation
> -
>
> Key: ARROW-3016
> URL: https://issues.apache.org/jira/browse/ARROW-3016
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It is possible to gain programmatic access to the call stack in C/C++, e.g.
> https://eli.thegreenplace.net/2015/programmatic-access-to-the-call-stack-in-c/
> It would be valuable to have a debugging option to log the sizes of memory 
> allocations as well as showing the call stack where that allocation is 
> performed. In complex programs, this could help determine the origin of a 
> memory leak
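
As an illustration of the technique (in Python rather than the proposed C++ 
option), {{tracemalloc}} records exactly this kind of per-allocation call stack:

{code:python}
import tracemalloc

tracemalloc.start(10)  # keep up to 10 stack frames per allocation
buffers = [bytearray(1 << 20) for _ in range(5)]
snapshot = tracemalloc.take_snapshot()
# Report the largest allocation sites along with where they occurred.
for stat in snapshot.statistics("traceback")[:3]:
    print(stat.size, "bytes in", stat.count, "blocks")
    for line in stat.traceback.format():
        print(line)
{code}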



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2904) [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2904.
---
Resolution: Won't Fix

This is now handled by {{TypedBufferBuilder}}

> [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc
> ---
>
> Key: ARROW-2904
> URL: https://issues.apache.org/jira/browse/ARROW-2904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> See discussion in patch for ARROW-2826



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2853) [Python] Implementing support for zero copy NumPy arrays in libarrow_python

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2853:

Fix Version/s: 0.14.0

> [Python] Implementing support for zero copy NumPy arrays in libarrow_python
> ---
>
> Key: ARROW-2853
> URL: https://issues.apache.org/jira/browse/ARROW-2853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Florian Rathgeber
>Priority: Major
> Fix For: 0.14.0
>
>
> Implementing support for zero-copy NumPy arrays in libarrow_python (i.e. in 
> C++). We can utilize common code paths with {{to_pandas}} and toggle 
> between NumPy-for-pandas and NumPy-for-NumPy behavior (and use the 
> {{zero_copy_only}} flag where needed).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3032:

Fix Version/s: (was: 0.13.0)

> [Python] Clean up NumPy-related C++ headers
> ---
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3052) [C++] Support ORC, GRPC, Thrift, and Protobuf when using $ARROW_BUILD_TOOLCHAIN

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3052:

Fix Version/s: (was: 0.13.0)

> [C++] Support ORC, GRPC, Thrift, and Protobuf when using 
> $ARROW_BUILD_TOOLCHAIN
> ---
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be good to support these additional toolchain components without 
> having to set extra environment variables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3014) [C++] Minimal writer adapter for ORC file format

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3014:

Fix Version/s: (was: 0.13.0)

> [C++] Minimal writer adapter for ORC file format
> 
>
> Key: ARROW-3014
> URL: https://issues.apache.org/jira/browse/ARROW-3014
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: orc
>
> See also ARROW-3009, ARROW-1968



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3052) [C++] Support ORC, GRPC, Thrift, and Protobuf when using $ARROW_BUILD_TOOLCHAIN

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761480#comment-16761480
 ] 

Wes McKinney commented on ARROW-3052:
-

I think we are mostly there on this. We don't have ORC in conda-forge yet, 
though

> [C++] Support ORC, GRPC, Thrift, and Protobuf when using 
> $ARROW_BUILD_TOOLCHAIN
> ---
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> It would be good to support these additional toolchain components without 
> having to set extra environment variables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2719) [Python/C++] ArrowSchema not hashable

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2719:

Fix Version/s: 0.14.0

> [Python/C++] ArrowSchema not hashable
> -
>
> Key: ARROW-2719
> URL: https://issues.apache.org/jira/browse/ARROW-2719
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> The arrow schema is immutable and should provide a way of hashing itself. 
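
Until {{Schema}} grows a {{__hash__}}, a stable textual rendering can stand in 
as a dictionary key. A workaround sketch, not a pyarrow API:

{code:python}
import pyarrow as pa

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])
# Hash a deterministic string form of the schema instead of the
# (unhashable) Schema object itself.
cache = {hash(str(schema)): "value keyed by schema"}
{code}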



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2709) [Python] write_to_dataset poor performance when splitting

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2709:

Fix Version/s: 0.14.0

> [Python] write_to_dataset poor performance when splitting
> -
>
> Key: ARROW-2709
> URL: https://issues.apache.org/jira/browse/ARROW-2709
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Olaf
>Priority: Critical
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> [https://github.com/apache/arrow/issues/2138]
>  
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000',
>                     freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx){code}
>  
> {code:python}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'],
>                     flavor='spark'){code}
>  
> This works but is inefficient memory-wise. The Arrow table is a copy of the 
> large pandas dataframe and quickly saturates the RAM.
>  
> Thanks!
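
One way to bound the overhead until this is addressed is to convert and write 
the frame in slices, so only one slice's Arrow copy is alive at a time. A 
sketch building on the snippet above; the chunk size is an arbitrary choice:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

chunk = 1000000
for start in range(0, len(df), chunk):
    # Only this slice is materialized as an Arrow table at any moment;
    # write_to_dataset appends new files under each partition directory.
    piece = pa.Table.from_pandas(df.iloc[start:start + chunk])
    pq.write_to_dataset(piece, root_path='dataset_name',
                        partition_cols=['dt'], flavor='spark')
{code}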



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2651) [Python] Build & Test with PyPy

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761473#comment-16761473
 ] 

Wes McKinney commented on ARROW-2651:
-

If a contributor with an interest in PyPy wants to do the work, and maintain 
it, that's fine with me

> [Python] Build & Test with PyPy
> ---
>
> Key: ARROW-2651
> URL: https://issues.apache.org/jira/browse/ARROW-2651
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: outline_for_beginners
>
> At the moment, we only build with CPython in our CI matrix and only do 
> releases for it. As reported in 
> https://github.com/apache/arrow/issues/2089#issuecomment-393126040 not 
> everything is working yet. This may be due to problems on our side or on 
> PyPy's.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2681) [C++] Use source releases when building ORC instead of using GitHub tag snapshots

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2681:

Fix Version/s: (was: 0.13.0)

> [C++] Use source releases when building ORC instead of using GitHub tag 
> snapshots
> -
>
> Key: ARROW-2681
> URL: https://issues.apache.org/jira/browse/ARROW-2681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See related discussion in ORC-374. It would be better to use the release 
> artifacts that have been voted on by the ORC PMC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (ARROW-2904) [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-2904:
-

> [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc
> ---
>
> Key: ARROW-2904
> URL: https://issues.apache.org/jira/browse/ARROW-2904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> See discussion in patch for ARROW-2826



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2904) [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2904.
---
Resolution: Not A Problem

> [C++] Use FirstTimeBitmapWriter instead of SetBit functions in builder.h/cc
> ---
>
> Key: ARROW-2904
> URL: https://issues.apache.org/jira/browse/ARROW-2904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> See discussion in patch for ARROW-2826



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2667) [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2667:

Fix Version/s: 0.14.0

> [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
> -
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a 
> list of indices and returns a reordered array.
> For reference, see Pandas' interface: 
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466
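
Until a native kernel exists, the intended semantics can be emulated 
element-wise in Python; a slow, illustration-only sketch:

{code:python}
import pyarrow as pa

arr = pa.array(["a", "b", "c", "d"])
indices = [3, 0, 2]
# Gather the requested positions; a real take kernel would do this in C++
# without the per-element Python round trip.
taken = pa.array([arr[i].as_py() for i in indices])  # ["d", "a", "c"]
{code}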



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2981:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [C++] Support scripts / documentation for running clang-tidy on codebase
> 
>
> Key: ARROW-2981
> URL: https://issues.apache.org/jira/browse/ARROW-2981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Related to ARROW-2952, ARROW-2980



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2912) [Website] Build more detailed Community landing patch a la Apache Spark

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2912:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Website] Build more detailed Community landing patch a la Apache Spark
> ---
>
> Key: ARROW-2912
> URL: https://issues.apache.org/jira/browse/ARROW-2912
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> It would be useful to have some prose descriptions of where to get help and 
> where to direct questions. See example:
> http://spark.apache.org/community.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2880) [Packaging] Script like verify-release-candidate.sh for automated testing of conda and wheel Python packages in ASF dist

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2880:

Fix Version/s: 0.14.0

> [Packaging] Script like verify-release-candidate.sh for automated testing of 
> conda and wheel Python packages in ASF dist
> 
>
> Key: ARROW-2880
> URL: https://issues.apache.org/jira/browse/ARROW-2880
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> We have a script for verifying a source release candidate. We should make a 
> similar script to test out the wheels and conda packages for the supported 
> Python versions (2.7, 3.5, 3.6, soon 3.7) in an automated fashion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2702) [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc to see if we are using the right error type in each instance

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2702:

Fix Version/s: 0.14.0

> [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc 
> to see if we are using the right error type in each instance
> -
>
> Key: ARROW-2702
> URL: https://issues.apache.org/jira/browse/ARROW-2702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See discussion in [https://github.com/apache/arrow/pull/2075]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2873) [Python] Micro-optimize scalar value instantiation

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2873:

Fix Version/s: 0.14.0

> [Python] Micro-optimize scalar value instantiation
> --
>
> Key: ARROW-2873
> URL: https://issues.apache.org/jira/browse/ARROW-2873
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Minor
> Fix For: 0.14.0
>
>
> This led to a 20% time increase in {{__getitem__}}: 
> https://pandas.pydata.org/speed/arrow/#array_ops.ScalarAccess.time_getitem
> See conversation: 
> https://github.com/apache/arrow/commit/dc80a768c0a15e62998ccd32d8353d2035302cb6#r29746119



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2587) [Python] Unable to write StructArrays with multiple children to parquet

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2587:

Fix Version/s: 0.14.0

> [Python] Unable to write StructArrays with multiple children to parquet
> ---
>
> Key: ARROW-2587
> URL: https://issues.apache.org/jira/browse/ARROW-2587
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
> Attachments: Screen Shot 2018-05-16 at 12.24.39.png
>
>
> Although I am able to read StructArray from parquet, I am still unable to 
> write it back from pa.Table to parquet.
> I get an "ArrowInvalid: Nested column branch had multiple children"
> Here is a quick example:
> {noformat}
> In [2]: import pyarrow.parquet as pq
> In [3]: table = pq.read_table('test.parquet')
> In [4]: table
>  Out[4]: 
>  pyarrow.Table
>  weight: double
>  animal_type: string
>  animal_interpretation: struct
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [5]: table.schema
>  Out[5]: 
>  weight: double
>  animal_type: string
>  animal_interpretation: struct
>    child 0, is_large_animal: bool
>    child 1, is_mammal: bool
>  metadata
>  
>  {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
> In [6]: pq.write_table(table,"test_write.parquet")
>  ---
>  ArrowInvalid  Traceback (most recent call last)
>   in ()
>  > 1 pq.write_table(table,"test_write.parquet")
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(table, where, row_group_size, version, use_dictionary, 
> compression, use_deprecated_int96_timestamps, coerce_timestamps, flavor, 
> **kwargs)
>      982 use_deprecated_int96_timestamps=use_int96,
>      983 **kwargs) as writer:
>  --> 984 writer.write_table(table, row_group_size=row_group_size)
>      985 except Exception:
>      986 if is_path(where):
> /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
> write_table(self, table, row_group_size)
>      325 table = _sanitize_table(table, self.schema, self.flavor)
>      326 assert self.is_open
>  --> 327 self.writer.write_table(table, row_group_size=row_group_size)
>      328 
>      329 def close(self):
> /usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so in 
> pyarrow._parquet.ParquetWriter.write_table()
> /usr/local/lib/python2.7/dist-packages/pyarrow/lib.so in 
> pyarrow.lib.check_status()
> ArrowInvalid: Nested column branch had multiple children
> {noformat}
>  
> I would really appreciate a fix on this.
> Best,
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761478#comment-16761478
 ] 

Wes McKinney commented on ARROW-2801:
-

hi [~rgruener] would you like to complete this for 0.13?

> [Python] Implement split_row_groups for ParquetDataset
> -
>
> Key: ARROW-2801
> URL: https://issues.apache.org/jira/browse/ARROW-2801
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Robbie Gruener
>Assignee: Robbie Gruener
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently the split_row_groups argument in ParquetDataset yields a not 
> implemented error. An easy and efficient way to implement this is by using 
> the summary metadata file instead of opening every footer file
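
Roughly, the summary-metadata approach reads the dataset-level {{_metadata}} 
file once and enumerates row groups from it; a sketch with an illustrative path:

{code:python}
import pyarrow.parquet as pq

# The _metadata summary file aggregates row-group metadata for the whole
# dataset, so row groups can be listed with a single file read.
meta = pq.read_metadata("dataset_name/_metadata")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)
{code}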



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2796) [C++] Simplify symbols.map file, use when building libarrow_python

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2796:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [C++] Simplify symbols.map file, use when building libarrow_python
> --
>
> Key: ARROW-2796
> URL: https://issues.apache.org/jira/browse/ARROW-2796
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> I did a little work on this in https://github.com/apache/arrow/pull/2096. 
> While that patch was not merged, the changes related to symbol visibility 
> ought to be plucked into a new patch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2633) [Python] Parquet file not accesible to write after first read using PyArrow

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2633.
---
Resolution: Cannot Reproduce

> [Python] Parquet file not accesible to write after first read using PyArrow
> ---
>
> Key: ARROW-2633
> URL: https://issues.apache.org/jira/browse/ARROW-2633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Suman
>Priority: Major
>  Labels: parquet
>
>  
> I am trying to read a parquet file into a pandas dataframe, do some 
> manipulation, and write it back to the same file; however, it seems the file 
> is not accessible for writing after the first read in the same function.
> It only works if I don't perform STEP 1 below. Is there any way to unlock the 
> file?
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> #STEP 1: Read entire parquet file
> pq_file = pq.ParquetFile('\dev\abc.parquet')
> exp_df = pq_file.read(nthreads=1, use_pandas_metadata=True).to_pandas()
> #STEP 2: Change some data in dataframe
> #
> #STEP 3: write merged dataframe
> pyarrow_table = pa.Table.from_pandas(exp_df)
> pq.write_table(pyarrow_table, '\dev\abc.parquet', compression='none')
> {code}
> Error:
> {code}
> File "C:\Python36\lib\site-packages\pyarrow\parquet.py", line 943, in 
> write_table
>  **kwargs)
> File "C:\Python36\lib\site-packages\pyarrow\parquet.py", line 286, in __init__
>  **options)
> File "_parquet.pyx", line 832, in pyarrow._parquet.ParquetWriter.__cinit__
> File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Failed to open local file: \dev\abc.parquet , 
> error: Invalid argument
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1628) [Python] Incorrect serialization of numpy datetimes.

2019-02-05 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761477#comment-16761477
 ] 

Robert Nishihara commented on ARROW-1628:
-

[~wesmckinn] It'd still be good to fix, so I think we should leave the issue 
open, but I don't think it needs to be prioritized at the moment.

> [Python] Incorrect serialization of numpy datetimes.
> 
>
> Key: ARROW-1628
> URL: https://issues.apache.org/jira/browse/ARROW-1628
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> See https://github.com/ray-project/ray/issues/1041.
> The issue can be reproduced as follows.
> {code}
> import datetime
> import pyarrow as pa
> import numpy as np
> t = np.datetime64(datetime.datetime.now())
> print(type(t), t)  # <class 'numpy.datetime64'> 2017-09-30T09:50:46.089952
> t_new = pa.deserialize(pa.serialize(t).to_buffer())
> print(type(t_new), t_new)  #  0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2651) [Python] Build & Test with PyPy

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2651:

Fix Version/s: (was: 0.13.0)

> [Python] Build & Test with PyPy
> ---
>
> Key: ARROW-2651
> URL: https://issues.apache.org/jira/browse/ARROW-2651
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: outline_for_beginners
>
> At the moment, we only build with CPython in our CI matrix and only do 
> releases for it. As reported in 
> https://github.com/apache/arrow/issues/2089#issuecomment-393126040 not 
> everything is working yet. This may be due to problems on our side or on 
> PyPy's.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2606) [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2606:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python]  Add unit test for pyarrow.decimal128 in Array.from_jvm
> -
>
> Key: ARROW-2606
> URL: https://issues.apache.org/jira/browse/ARROW-2606
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249. We need to 
> find the correct code to construct Java decimals and fill them into a 
> {{DecimalVector}}. Afterwards, we should activate the decimal128 type on 
> {{test_jvm_array}} and ensure that we load them correctly from Java into 
> Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2769) [Python] Deprecate and rename add_metadata methods

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2769:

Fix Version/s: 0.13.0

> [Python] Deprecate and rename add_metadata methods
> --
>
> Key: ARROW-2769
> URL: https://issues.apache.org/jira/browse/ARROW-2769
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Minor
> Fix For: 0.13.0
>
>
> Deprecate and replace {{pyarrow.Field.add_metadata}} (and other similarly 
> named methods) with {{replace_metadata}}, {{set_metadata}}, or 
> {{with_metadata}}. Knowing Spark's immutable API, I would have chosen 
> {{with_metadata}}, but I guess this is probably not what the average Python 
> user would expect as a name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2730) [C++] Set up CMAKE_C_FLAGS more thoughtfully instead of using CMAKE_CXX_FLAGS

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2730.
---
Resolution: Fixed

This was resolved in passing by several other build system patches

> [C++] Set up CMAKE_C_FLAGS more thoughtfully instead of using CMAKE_CXX_FLAGS
> -
>
> Key: ARROW-2730
> URL: https://issues.apache.org/jira/browse/ARROW-2730
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion on GitHub for ARROW-2676. We are setting CMAKE_C_FLAGS to the 
> value of CMAKE_CXX_FLAGS, which is rather heavy-handed. Most of the stuff we 
> are putting in the CXX_FLAGS is C++-specific related to warning suppressions, 
> etc. The number of C-specific flags we need should be much smaller, probably 
> just the optimization level, position-independent code setting, 
> -fno-strict-aliasing, and some other standard stuff



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2728) [Python] Support partitioned Parquet datasets using glob-style file paths

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2728:

Fix Version/s: 0.14.0

> [Python] Support partitioned Parquet datasets using glob-style file paths
> -
>
> Key: ARROW-2728
> URL: https://issues.apache.org/jira/browse/ARROW-2728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: pyarrow : 0.9.0.post1
> dask : 0.17.1
> Mac OS
>Reporter: pranav kohli
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> I am saving a dask dataframe to parquet with two partition columns using the 
> pyarrow engine. The problem arises in scanning the partition columns. When I 
> scan using the directory path, I get the partition columns in the output 
> dataframe, whereas if I scan using the glob path, I don't get these columns.
>  
> https://github.com/apache/arrow/issues/2147



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2707) [C++] Implement Table::Slice methods using Column::Slice

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2707:

Fix Version/s: 0.14.0

> [C++] Implement Table::Slice methods using Column::Slice
> 
>
> Key: ARROW-2707
> URL: https://issues.apache.org/jira/browse/ARROW-2707
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> see discussion in ARROW-2358



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761475#comment-16761475
 ] 

Wes McKinney commented on ARROW-2703:
-

It would be useful to provide CMake options to do this, but it should not be the default.

> [C++] Always use statically-linked Boost with private namespace
> ---
>
> Key: ARROW-2703
> URL: https://issues.apache.org/jira/browse/ARROW-2703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> We have recently added tooling to ship Python wheels with a bundled, private 
> Boost (using the bcp tool). We might consider statically-linking a private 
> Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to 
> avoid any conflicts with other libraries that may use a different version of 
> Boost



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2719) [Python/C++] ArrowSchema not hashable

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2719:

Component/s: Python
 C++

> [Python/C++] ArrowSchema not hashable
> -
>
> Key: ARROW-2719
> URL: https://issues.apache.org/jira/browse/ARROW-2719
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Florian Jetter
>Priority: Minor
>
> The arrow schema is immutable and should provide a way of hashing itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2703:

Fix Version/s: 0.14.0

> [C++] Always use statically-linked Boost with private namespace
> ---
>
> Key: ARROW-2703
> URL: https://issues.apache.org/jira/browse/ARROW-2703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> We have recently added tooling to ship Python wheels with a bundled, private 
> Boost (using the bcp tool). We might consider statically-linking a private 
> Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to 
> avoid any conflicts with other libraries that may use a different version of 
> Boost



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2476) [Python/Question] Maximum length of an Array created from ndarray

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2476.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

This was clarified in the format documentation

> [Python/Question] Maximum length of an Array created from ndarray
> -
>
> Key: ARROW-2476
> URL: https://issues.apache.org/jira/browse/ARROW-2476
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Minor
> Fix For: 0.12.0
>
>
> So the format 
> [describes|https://github.com/apache/arrow/blob/master/format/Layout.md#array-lengths]
>  that an array's max length is 2^31 - 1; however, the following Python snippet 
> creates a 2**32-length Arrow array:
> {code:python}
> import numpy as np
> import pyarrow as pa
> a = np.ones((2**32,), dtype='int8')
> A = pa.Array.from_pandas(a)
> type(A)
> {code}
> {code}pyarrow.lib.Int8Array{code}
> Based on the layout specification I'd expect a ChunkedArray of three Int8Arrays 
> with lengths [2^31 - 1, 2^31 - 1, 2], or should this raise an exception?
> If that's the expectation, is there any documentation for it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2625) [Python] Serialize timedelta64 values from pandas to Arrow interval types

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2625:

Fix Version/s: 0.14.0

> [Python] Serialize timedelta64 values from pandas to Arrow interval types
> -
>
> Key: ARROW-2625
> URL: https://issues.apache.org/jira/browse/ARROW-2625
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This work is blocked on ARROW-835



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2627) [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2627:

Fix Version/s: 0.13.0

> [Python] Add option (or some equivalent) to toggle memory mapping 
> functionality when using parquet.ParquetFile or other read entry points
> -
>
> Key: ARROW-2627
> URL: https://issues.apache.org/jira/browse/ARROW-2627
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> See issue described in https://github.com/apache/arrow/issues/1946. When 
> passing a filename to {{parquet.ParquetFile}}, one cannot control what kind 
> of file reader internally is created (OSFile or MemoryMappedFile)
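Until such an option exists, one workaround is to open the file explicitly and hand the file object to {{parquet.ParquetFile}}, which makes the reader type an explicit choice (a sketch; the file name is illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Plain OS-level reads:
with pa.OSFile('data.parquet', 'rb') as f:
    table_os = pq.ParquetFile(f).read()

# Memory-mapped reads:
with pa.memory_map('data.parquet', 'rb') as f:
    table_mm = pq.ParquetFile(f).read()
{code}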



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2627) [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761472#comment-16761472
 ] 

Wes McKinney commented on ARROW-2627:
-

I think I addressed this, but I will confirm

> [Python] Add option (or some equivalent) to toggle memory mapping 
> functionality when using parquet.ParquetFile or other read entry points
> -
>
> Key: ARROW-2627
> URL: https://issues.apache.org/jira/browse/ARROW-2627
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> See issue described in https://github.com/apache/arrow/issues/1946. When 
> passing a filename to {{parquet.ParquetFile}}, one cannot control what kind 
> of file reader internally is created (OSFile or MemoryMappedFile)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2628:

Fix Version/s: 0.14.0

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 
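One mitigation pattern, sketched under the assumption that the input table can simply be written in row slices (the function and parameter names are illustrative):

{code:python}
import pyarrow.parquet as pq

def write_in_slices(table, root_path, partition_cols, rows_per_slice=1000000):
    # Write a large table slice by slice to bound peak memory usage.
    for start in range(0, table.num_rows, rows_per_slice):
        pq.write_to_dataset(table.slice(start, rows_per_slice),
                            root_path=root_path,
                            partition_cols=partition_cols)
{code}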



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2622) [C++] Array methods IsNull and IsValid are not complementary

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2622:

Fix Version/s: 0.13.0

> [C++] Array methods IsNull and IsValid are not complementary
> 
>
> Key: ARROW-2622
> URL: https://issues.apache.org/jira/browse/ARROW-2622
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.13.0
>
>
> Hi, not sure if this is a bug or if I misinterpret the spec. According to the 
> latter, "Arrays having a 0 null count may choose to not allocate the null 
> bitmap". From this I'd infer that the statement also holds in the other 
> direction, i.e. non-allocated bitmaps imply a 0 null count. This would mean 
> that if a bitmap is not allocated, IsValid() should always return true. But 
> at the moment it's doing this:
> {code:java}
> bool IsNull(int64_t i) const {
>   return null_bitmap_data_ != NULLPTR &&
> BitUtil::BitNotSet(null_bitmap_data_, i + data_->offset);
> }
> bool IsValid(int64_t i) const {
>   return null_bitmap_data_ != NULLPTR &&
> BitUtil::GetBit(null_bitmap_data_, i + data_->offset);
> }
> {code}
> Which leads to a situation where in the case of non-allocated bitmaps values 
> are neither Null nor Valid. Shouldn't it rather be:
> {code:java}
> bool IsValid(int64_t i) const {
>   return null_bitmap_data_ == NULLPTR ||
> BitUtil::GetBit(null_bitmap_data_, i + data_->offset);
> }{code}
> ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2605) [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2605:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python] Add unit test for pyarrow.timeX types in Array.from_jvm
> -
>
> Key: ARROW-2605
> URL: https://issues.apache.org/jira/browse/ARROW-2605
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249 as we are 
> missing the necessary methods to construct these arrays conveniently on the 
> Python side.
> Once there is a path to construct {{pyarrow.Array}} instances from a Python 
> list of {{datetime.time}} for the various time types, we should activate the 
> time types on {{test_jvm_array}} and ensure that we load them correctly from 
> Java into Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2610) [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2610:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python] Add support for dictionary type to pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2610
> URL: https://issues.apache.org/jira/browse/ARROW-2610
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> The DictionaryType is a bit more complex as it also references the dictionary 
> values themselves. This also needs to be integrated into 
> {{pyarrow.Field.from_jvm}}, but the work to make DictionaryType work may 
> also depend on {{pyarrow.Array.from_jvm}} first supporting non-primitive 
> arrays.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2621) [Python/CI] Use pep8speaks for Python PRs

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2621:

Fix Version/s: 0.14.0

> [Python/CI] Use pep8speaks for Python PRs
> -
>
> Key: ARROW-2621
> URL: https://issues.apache.org/jira/browse/ARROW-2621
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 0.14.0
>
>
> It would be nice to get automated comments from 
> [https://pep8speaks.com/] on the Python PRs. These should be much more 
> readable than the current `flake8` output in the Travis logs. This issue is 
> split up into two tasks:
>  * Create an issue with INFRA kindly asking them to activate pep8speaks 
> for Arrow
>  * Set up {{.pep8speaks.yml}} to align with our {{flake8}} config. For 
> reference, see Pandas' config: 
> [https://github.com/pandas-dev/pandas/blob/master/.pep8speaks.yml] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2607) [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2607:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
> ---
>
> Key: ARROW-2607
> URL: https://issues.apache.org/jira/browse/ARROW-2607
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249: Currently 
> only primitive arrays are supported in {{pyarrow.Array.from_jvm}} as it uses 
> {{pyarrow.Array.from_buffers}} underneath. We should extend one of the two 
> functions to be able to deal with string arrays. There is a currently failing 
> unit test {{test_jvm_string_array}} in {{pyarrow/tests/test_jvm.py}} to 
> verify the implementation.
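For orientation, a hedged sketch of the raw pieces a string array is made of (validity bitmap, int32 offsets, UTF-8 data); teaching {{from_buffers}} to accept these is exactly what this issue asks for:

{code:python}
import struct
import pyarrow as pa

values = [b'foo', b'barbaz']

# int32 offsets [0, 3, 9] delimit the two values in the data buffer.
offsets_buf = pa.py_buffer(struct.pack('<3i', 0, 3, 9))
data_buf = pa.py_buffer(b''.join(values))

# Hypothetical call once string support lands (not available yet here):
# arr = pa.Array.from_buffers(pa.string(), 2, [None, offsets_buf, data_buf])
{code}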



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2609) [Java/Python] Complex type conversion in pyarrow.Field.from_jvm

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2609:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python] Complex type conversion in pyarrow.Field.from_jvm
> ---
>
> Key: ARROW-2609
> URL: https://issues.apache.org/jira/browse/ARROW-2609
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> The converter {{pyarrow.Field.from_jvm}} currently only works for primitive 
> types. Types like List, Struct or Union that have children in their 
> definition are not supported. We should add the needed recursion for these 
> types and enable the respective tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2598) [Python] table.to_pandas segfault

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2598:

Fix Version/s: 0.13.0

> [Python]  table.to_pandas segfault
> --
>
> Key: ARROW-2598
> URL: https://issues.apache.org/jira/browse/ARROW-2598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: jacques
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> Here is a small snippet which produces a segfault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[], []])
> In [4]: pq.write_table(
>    ...: table=pa.Table.from_arrays([pa_ar],["test"]),
>    ...: where="test5.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> In [5]: pq.read_table("test5.parquet")
> Out[5]: 
> pyarrow.Table
> test: list
>   child 0, item: null
> In [6]: pq.read_table("test5.parquet").to_pydict()
> Out[6]: OrderedDict([(u'test', [None, None])])
> In [7]: pq.read_table("test5.parquet").to_pandas()
> Segmentation fault
> {noformat}
> I thank you in advance for having this fixed.
> Best, 
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1959) [Python] Add option for "lossy" conversions (overflow -> null) from timestamps to datetime.datetime / pandas.Timestamp

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1959:

Fix Version/s: 0.14.0

> [Python] Add option for "lossy" conversions (overflow -> null) from 
> timestamps to datetime.datetime / pandas.Timestamp
> --
>
> Key: ARROW-1959
> URL: https://issues.apache.org/jira/browse/ARROW-1959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See discussion in 
> https://stackoverflow.com/questions/47946038/overflow-error-using-datetimes-with-pyarrow
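A minimal illustration of the overflow in question (the year 3000 fits a microsecond-resolution Arrow timestamp but overflows pandas' nanosecond-based {{Timestamp}}); the proposed option would turn such values into nulls instead of raising:

{code:python}
import datetime
import pyarrow as pa

arr = pa.array([datetime.datetime(3000, 1, 1)], type=pa.timestamp('us'))

try:
    arr.to_pandas()
except Exception:
    # Overflows today; the exact exception type varies by version.
    # With a "lossy" option this value would come back as NaT/null.
    pass
{code}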



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1562) [C++] Numeric kernel implementations for add (+)

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1562:

Fix Version/s: 0.14.0

> [C++] Numeric kernel implementations for add (+)
> 
>
> Key: ARROW-1562
> URL: https://issues.apache.org/jira/browse/ARROW-1562
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.14.0
>
>
> This function should respect consistent type promotions between types of 
> different sizes and signed and unsigned integers
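A sketch of the promotion behaviour this implies, using NumPy's casting rules as a stand-in (Arrow's final rules may differ):

{code:python}
import pyarrow as pa

def add(left, right):
    # Illustrative only: lean on NumPy for the promotion, wrap back in Arrow.
    return pa.array(left.to_numpy() + right.to_numpy())

a = pa.array([1, 2, 3], type=pa.int32())
b = pa.array([10, 20, 30], type=pa.uint32())
# int32 + uint32 promotes to int64 under NumPy's rules.
print(add(a, b).type)
{code}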



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2579) [Python] Appending to streamable table file format doesn't seem to work

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2579:

Summary: [Python] Appending to streamable table file format doesn't seem to 
work  (was: Appending to streamable table file format doesn't seem to work)

> [Python] Appending to streamable table file format doesn't seem to work
> ---
>
> Key: ARROW-2579
> URL: https://issues.apache.org/jira/browse/ARROW-2579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Rob Ambalu
>Priority: Major
>
> As far as I can tell it looks like appending to a streaming file format isn’t 
> currently supported, is that right?
> RecordBatchStreamWriter always writes the schema up front, and it doesn’t 
> look like a schema is expected mid-file (assuming I’m doing this append test 
> correctly, this is the error I hit when I try to read back this file into 
> python):
>  Traceback (most recent call last):
>   File "/home/ra7293/rba_arrow_mmap.py", line 9, in <module>
>     table = reader.read_all()
>   File "ipc.pxi", line 302, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Message not expected type: record batch, was: 1
>  
> This reader script works fine if I write once / don’t append.
> Seeing as IO interfaces support Append, streaming should support it as well (if 
> for whatever reason this can’t be supported, RecordBatchStreamWriter should 
> throw if configured with an OutputStreamer that is attempting to append).
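A hedged sketch of the append experiment being described (the file name is illustrative):

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['f0'])

# Open the sink in append mode twice; the second pass writes a second
# schema message mid-file, which the stream reader does not expect.
for _ in range(2):
    with pa.OSFile('stream.arrow', 'ab') as sink:
        writer = pa.RecordBatchStreamWriter(sink, batch.schema)
        writer.write_batch(batch)
        writer.close()

with pa.OSFile('stream.arrow', 'rb') as source:
    pa.RecordBatchStreamReader(source).read_all()  # fails as shown above
{code}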



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2579) [Python] Appending to streamable table file format doesn't seem to work

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2579:

Fix Version/s: 0.14.0

> [Python] Appending to streamable table file format doesn't seem to work
> ---
>
> Key: ARROW-2579
> URL: https://issues.apache.org/jira/browse/ARROW-2579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Rob Ambalu
>Priority: Major
> Fix For: 0.14.0
>
>
> As far as I can tell it looks like appending to a streaming file format isn’t 
> currently supported, is that right?
> RecordBatchStreamWriter always writes the schema up front, and it doesn’t 
> look like a schema is expected mid-file (assuming I’m doing this append test 
> correctly, this is the error I hit when I try to read back this file into 
> python):
>  Traceback (most recent call last):
>   File "/home/ra7293/rba_arrow_mmap.py", line 9, in <module>
>     table = reader.read_all()
>   File "ipc.pxi", line 302, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Message not expected type: record batch, was: 1
>  
> This reader script works fine if I write once / don’t append.
> Seeing as IO interfaces support Append, streaming should support it as well (if 
> for whatever reason this can’t be supported, RecordBatchStreamWriter should 
> throw if configured with an OutputStreamer that is attempting to append).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2512) [Python] Enable direct interaction of GPU Objects in Python

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2512:

Fix Version/s: 0.14.0

> [Python] Enable direct interaction of GPU Objects in Python
> ---
>
> Key: ARROW-2512
> URL: https://issues.apache.org/jira/browse/ARROW-2512
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma, GPU, Python
>Reporter: William Paul
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Plasma can now manage objects on the GPU, but in order to use this 
> functionality in Python, there needs to be some way to represent these GPU 
> objects in Python that allows computation on the GPU.
> The easiest way to enable this is to rely on a third party library, such as 
> Pytorch, which will allow us to use all of its existing functionality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2490) [C++] input stream locking inconsistent

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2490:

Fix Version/s: 0.14.0

> [C++] input stream locking inconsistent
> ---
>
> Key: ARROW-2490
> URL: https://issues.apache.org/jira/browse/ARROW-2490
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> Reading from the current file pointer is inherently thread-unsafe, since the 
> file pointer may be updated by another thread (either before or during the 
> operation). However, currently, we have:
> * {{ReadableFile::Read}} takes a lock
> * {{MemoryMappedFile::Read}} doesn't take a lock
> * {{BufferReader::Read}} doesn't take a lock
> We could always take a lock in {{Read}}. But I don't think there's a pattern 
> where it's useful to call {{Read}} from multiple threads at once (since 
> you're not sure where the file pointer will be exactly when the read starts). 
> So we could just as well specify that {{Read}} isn't thread-safe and let people 
> make sure they don't call it from multiple threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2532) [C++] Add chunked builder classes

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2532:

Fix Version/s: 0.14.0

> [C++] Add chunked builder classes
> -
>
> Key: ARROW-2532
> URL: https://issues.apache.org/jira/browse/ARROW-2532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> I think it would be useful to have chunked builders for list, string and 
> binary types. A chunked builder would produce a chunked array as output, 
> circumventing the 32-bit offset limit of those types. There's some 
> special-casing scattered around our NumPy conversion routines right now.
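In Python terms, the idea looks roughly like this (the real C++ builders would do this incrementally; the names and byte-limit handling are illustrative):

{code:python}
import pyarrow as pa

def build_chunked_strings(values, max_chunk_bytes=2**31 - 1):
    # Accumulate strings into chunks, starting a new chunk before the
    # 32-bit offset limit of a single StringArray would be exceeded.
    chunks, current, size = [], [], 0
    for v in values:
        if size + len(v) > max_chunk_bytes and current:
            chunks.append(pa.array(current))
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(pa.array(current))
    return pa.chunked_array(chunks, type=pa.string())
{code}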



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2446) [C++] SliceBuffer on CudaBuffer should return CudaBuffer

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2446:

Fix Version/s: 0.14.0

> [C++] SliceBuffer on CudaBuffer should return CudaBuffer
> 
>
> Key: ARROW-2446
> URL: https://issues.apache.org/jira/browse/ARROW-2446
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GPU
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> Currently {{SliceBuffer}} on a {{CudaBuffer}} returns a plain {{Buffer}} 
> instance, which is dangerous for unsuspecting consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2412) [Integration] Add nested dictionary integration test

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761470#comment-16761470
 ] 

Wes McKinney commented on ARROW-2412:
-

[~bhulette] is there a patch for this?

> [Integration] Add nested dictionary integration test
> 
>
> Key: ARROW-2412
> URL: https://issues.apache.org/jira/browse/ARROW-2412
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.13.0
>
>
> Add nested dictionary generator to the integration test. The tests will 
> probably fail at first but can serve as a starting point for developing this 
> capability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2444) [Python] Better handle reading empty parquet files

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2444:

Fix Version/s: 0.14.0

> [Python] Better handle reading empty parquet files
> --
>
> Key: ARROW-2444
> URL: https://issues.apache.org/jira/browse/ARROW-2444
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jim Crist
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>  
> Currently pyarrow reads empty parts as float64, even if the underlying 
> columns have other dtypes. This can cause problems for pandas downstream, as 
> certain operations are only valid on certain dtypes, even if the columns are 
> empty.
>  
> Copying the comment Uwe over:
>  
> {quote}This is the expected behaviour as an empty string column in Pandas 
> is simply an empty column of type object. Sadly object does not tell us much 
> about the type of the column at all. We return numpy.float64 in this case as 
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas 
> is treated differently in many operations than an empty float64 column (str 
> accessor is available, excluded from numeric operations, etc..). Having an 
> empty file read in as a different dtype than was written could lead to errors 
> in processing code downstream. Would arrow be willing to change this 
> behavior?{quote}
> We should probably use a method other than `field.type.to_pandas_dtype()` in 
> this case. The column saved in Parquet is stored with `NA` as its type, 
> which sadly does not provide enough information. 
> We also store the original dtype in the Pandas metadata that is used for the 
> actual DataFrame reconstruction later on. If we also picked up that 
> metadata when reading, we should be able to correctly reconstruct the 
> dtype.
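A short repro of the round-trip in question (the file name is illustrative; exact dtypes may vary by version):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'s': pd.Series([], dtype=object)})
pq.write_table(pa.Table.from_pandas(df), 'empty.parquet')

out = pq.read_table('empty.parquet').to_pandas()
# Reported behaviour: dtype comes back float64 rather than object.
print(out['s'].dtype)
{code}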



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2412) [Integration] Add nested dictionary integration test

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2412:

Fix Version/s: (was: 0.14.0)
   0.13.0

> [Integration] Add nested dictionary integration test
> 
>
> Key: ARROW-2412
> URL: https://issues.apache.org/jira/browse/ARROW-2412
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.13.0
>
>
> Add nested dictionary generator to the integration test. The tests will 
> probably fail at first but can serve as a starting point for developing this 
> capability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2447) [C++] Create a device abstraction

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2447:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [C++] Create a device abstraction
> -
>
> Key: ARROW-2447
> URL: https://issues.apache.org/jira/browse/ARROW-2447
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GPU
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> Right now, a plain Buffer doesn't carry information about where it actually 
> lies. That information also cannot be passed around, so you get APIs like 
> {{PlasmaClient}} which take or return device number integers, and have 
> implementations which hardcode operations on CUDA buffers. Also, unsuspecting 
> receivers of a {{Buffer}} pointer may try to act on the underlying memory 
> without knowing whether it's CPU-reachable or not.
> Here is a sketch for a proposed Device abstraction:
> {code}
> class Device {
>   enum DeviceKind { KIND_CPU, KIND_CUDA };
>   virtual DeviceKind kind() const;
>   //MemoryPool* default_memory_pool() const;
>   //std::shared_ptr<Buffer> Allocate(...);
> };
> class CpuDevice : public Device {};
> class CudaDevice : public Device {
>   int device_num() const;
> };
> class Buffer {
>   virtual DeviceKind device_kind() const;
>   virtual std::shared_ptr<Device> device() const;
>   virtual bool on_cpu() const {
>     return true;
>   }
>   const uint8_t* cpu_data() const {
>     return on_cpu() ? data() : nullptr;
>   }
>   uint8_t* cpu_mutable_data() {
>     return on_cpu() ? mutable_data() : nullptr;
>   }
>   virtual CopyToCpu(std::shared_ptr<Buffer> dest) const;
>   virtual CopyFromCpu(std::shared_ptr<Buffer> src);
> };
> class CudaBuffer : public Buffer {
>   virtual bool on_cpu() const {
>     return false;
>   }
> };
> CopyBuffer(std::shared_ptr<Buffer> dest, const std::shared_ptr<Buffer> src);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2379) [Plasma] PlasmaClient::Info() should return whether an object is in use

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2379:

Fix Version/s: 0.14.0

> [Plasma] PlasmaClient::Info() should return whether an object is in use
> ---
>
> Key: ARROW-2379
> URL: https://issues.apache.org/jira/browse/ARROW-2379
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> It can be useful to know whether a given object is already in use by the 
> local client.
> See https://github.com/apache/arrow/pull/1807#discussion_r178611472



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2412) [Integration] Add nested dictionary integration test

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2412:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Integration] Add nested dictionary integration test
> 
>
> Key: ARROW-2412
> URL: https://issues.apache.org/jira/browse/ARROW-2412
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.14.0
>
>
> Add nested dictionary generator to the integration test. The tests will 
> probably fail at first but can serve as a starting point for developing this 
> capability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2392) [Python] pyarrow RecordBatchStreamWriter allows writing batches with different schemas

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2392:

Component/s: Python

> [Python] pyarrow RecordBatchStreamWriter allows writing batches with 
> different schemas
> --
>
> Key: ARROW-2392
> URL: https://issues.apache.org/jira/browse/ARROW-2392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Ernesto Ocampo
>Priority: Minor
> Fix For: 0.13.0
>
>
> A RecordBatchStreamWriter initialised with a given schema will still allow 
> writing RecordBatches that have different schemas. Example:
>  
> {code:java}
> schema = pa.schema([pa.field('some_field', pa.int64())])
> stream = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(stream, schema)
> data = [pa.array([1.234])]
> batch = pa.RecordBatch.from_arrays(data, ['some_field'])  
> # batch does not conform to schema
> assert batch.schema != schema
> writer.write_batch(batch)  # no exception raised
> writer.close()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2392) [Python] pyarrow RecordBatchStreamWriter allows writing batches with different schemas

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2392:

Fix Version/s: 0.13.0

> [Python] pyarrow RecordBatchStreamWriter allows writing batches with 
> different schemas
> --
>
> Key: ARROW-2392
> URL: https://issues.apache.org/jira/browse/ARROW-2392
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ernesto Ocampo
>Priority: Minor
> Fix For: 0.13.0
>
>
> A RecordBatchStreamWriter initialised with a given schema will still allow 
> writing RecordBatches that have different schemas. Example:
>  
> {code:java}
> schema = pa.schema([pa.field('some_field', pa.int64())])
> stream = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(stream, schema)
> data = [pa.array([1.234])]
> batch = pa.RecordBatch.from_arrays(data, ['some_field'])  
> # batch does not conform to schema
> assert batch.schema != schema
> writer.write_batch(batch)  # no exception raised
> writer.close()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2392) [Python] pyarrow RecordBatchStreamWriter allows writing batches with different schemas

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2392:

Summary: [Python] pyarrow RecordBatchStreamWriter allows writing batches 
with different schemas  (was: pyarrow RecordBatchStreamWriter allows writing 
batches with different schemas)

> [Python] pyarrow RecordBatchStreamWriter allows writing batches with 
> different schemas
> --
>
> Key: ARROW-2392
> URL: https://issues.apache.org/jira/browse/ARROW-2392
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ernesto Ocampo
>Priority: Minor
>
> A RecordBatchStreamWriter initialised with a given schema will still allow 
> writing RecordBatches that have different schemas. Example:
>  
> {code:java}
> schema = pa.schema([pa.field('some_field', pa.int64())])
> stream = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(stream, schema)
> data = [pa.array([1.234])]
> batch = pa.RecordBatch.from_arrays(data, ['some_field'])  
> # batch does not conform to schema
> assert batch.schema != schema
> writer.write_batch(batch)  # no exception raised
> writer.close()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2367:

Fix Version/s: 0.13.0

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
> Fix For: 0.13.0
>
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in <module>
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types; both cause the
> # same error. Sizes are chosen so the total (20 * len(v1) + len(v2) + 1)
> # is 2147483647, one byte past the 2147483646-byte BinaryArray limit.
> v1 = b'x' * 100000000
> v2 = b'x' * 147483646
> # v1 = 'x' * 100000000
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761468#comment-16761468
 ] 

Wes McKinney commented on ARROW-2367:
-

There were some improvements around chunked data handling. I'll take a peek at 
this to see what needs to be done

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
> Fix For: 0.13.0
>
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in <module>
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types; both cause the
> # same error. Sizes are chosen so the total (20 * len(v1) + len(v2) + 1)
> # is 2147483647, one byte past the 2147483646-byte BinaryArray limit.
> v1 = b'x' * 100000000
> v2 = b'x' * 147483646
> # v1 = 'x' * 100000000
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2221) [C++] Nightly build with "infer" tool

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2221:

Fix Version/s: 0.14.0

> [C++] Nightly build with "infer" tool
> -
>
> Key: ARROW-2221
> URL: https://issues.apache.org/jira/browse/ARROW-2221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> As a follow-up to ARROW-1626, we ought to periodically look at the output of 
> the "infer" tool to fix issues as they come up. This is probably too 
> heavyweight to run in each CI build
> cc [~renesugar]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2339) [Python] Add a fast path for int hashing

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2339:

Fix Version/s: 0.14.0

> [Python] Add a fast path for int hashing
> 
>
> Key: ARROW-2339
> URL: https://issues.apache.org/jira/browse/ARROW-2339
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alex Hagerman
>Priority: Minor
> Fix For: 0.14.0
>
>
> Create a __hash__ fast path for Int scalars that avoids using as_py().
>  
> https://issues.apache.org/jira/browse/ARROW-640
> [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604]
>  
>  
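Conceptually, the fast path looks like this (a plain-Python sketch, not pyarrow's actual Cython implementation):

{code:python}
class Int64Scalar:
    """Stand-in for pyarrow's integer scalar wrapper."""

    def __init__(self, value):
        self._value = value       # raw integer held by the scalar

    def as_py(self):
        return int(self._value)   # slow path: full Python conversion

    def __hash__(self):
        # Fast path: hash the raw value directly, skipping as_py().
        return hash(self._value)
{code}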



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2294) [Java] Fix splitAndTransfer for variable width vector

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2294:

Summary: [Java] Fix splitAndTransfer for variable width vector  (was: Fix 
splitAndTransfer for variable width vector)

> [Java] Fix splitAndTransfer for variable width vector
> -
>
> Key: ARROW-2294
> URL: https://issues.apache.org/jira/browse/ARROW-2294
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>
> When we splitAndTransfer a vector, the value count to set for the target 
> vector should be equal to the split length and not the value count of the 
> source vector. 
> We have seen cases, in operators like FLATTEN and under low-memory conditions, 
> where we end up allocating a lot more memory for the target vector because a 
> large value is passed to setValueCount after the split and transfer is done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4468) [Rust] Implement AND/OR kernels for Buffer (with SIMD)

2019-02-05 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4468:
--
Labels: pull-request-available  (was: )

> [Rust] Implement AND/OR kernels for Buffer (with SIMD)
> --
>
> Key: ARROW-4468
> URL: https://issues.apache.org/jira/browse/ARROW-4468
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2058) [Packaging] Add wheels for Alpine Linux

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761456#comment-16761456
 ] 

Wes McKinney commented on ARROW-2058:
-

Do many projects provide alpine wheels? What is their build toolchain like?

> [Packaging] Add wheels for Alpine Linux
> ---
>
> Key: ARROW-2058
> URL: https://issues.apache.org/jira/browse/ARROW-2058
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging, Python
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.4.1, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 
> 0.8.0
>Reporter: Omer Katz
>Priority: Major
>  Labels: alpine
>
> Currently when attempting to install pyarrow using pip on Alpine Linux you 
> get the following error message from pip:
>  
> {code:java}
> pip install pyarrow --verbose
> Collecting pyarrow
>   1 location(s) to search for versions of pyarrow:
>   * https://pypi.python.org/simple/pyarrow/
>   Getting page https://pypi.python.org/simple/pyarrow/
  Looking up "https://pypi.python.org/simple/pyarrow/" in the cache
>   Current age based on date: 596
>   Freshness lifetime from max-age: 600
>   Freshness lifetime from request max-age: 600
>   The response is "fresh", returning cached response
>   600 > 596
>   Analyzing links from page https://pypi.python.org/simple/pyarrow/
>     Skipping link 
> https://pypi.python.org/packages/03/fe/d3e86d9a534093f84ec6bb92c5285796feca7713f9328cc2b607ee9fc158/pyarrow-0.2.0-cp35-cp35m-manylinux1_x86_64.whl#md5=283d6d42277a07f724c0d944ff031c0c
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/06/e9/ac196752b306732afedf415805d327752bd85fb1e4517b97085129b5d02e/pyarrow-0.4.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=884433983d1482e9eba7cdedd82201e5
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/1c/c7e00871d85f091cbe4b71dd6ff2ce393b6e736d6defd806f571da87280c/pyarrow-0.5.0-cp36-cp36m-win_amd64.whl#md5=d7e3d8b9d17e7a964c058f879e11e733
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0b/e8/e907b7e597981e488d60ea8554db0c6b55a4ddc01ad31bb18156f1dc1526/pyarrow-0.5.0.post2-cp34-cp34m-manylinux1_x86_64.whl#md5=9353e2bcfc77a2b40daa5d31cb9c5dac
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/01/2e283b8fae727c4932a4335e2b1980a65c2ef754c69a7d97e39b0157627d/pyarrow-0.7.0-cp34-cp34m-manylinux1_x86_64.whl#md5=6d8ec243f77a382667b6f9b0aa434fd2
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0c/19/805aa541740279bc8a198eeeb57509de5551f55f0cbc6371fa897bfc3245/pyarrow-0.8.0-cp35-cp35m-manylinux1_x86_64.whl#md5=382cb788fd740b0e25be3b305ab46142
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0d/39/b0e21b10b53386f3dad906a8b734074cc0008c5af6a31d2e441569633d94/pyarrow-0.6.0-cp36-cp36m-manylinux1_x86_64.whl#md5=392930f4ace76ac65965258f5da99e9d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/0f/22/97ba96f7178a52f257b45eac079d7a47dc4bc22d0961e828f10a76c254a7/pyarrow-0.4.1-cp35-cp35m-macosx_10_6_intel.whl#md5=96db8da8ee09952e62731ef8afd1f20d
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/15/5c/20192ab842b291d889f12f7013a5ac5c4416e231377024ad6823fc42a456/pyarrow-0.8.0-cp35-cp35m-win_amd64.whl#md5=8123173dc4905e7186ecf35ba180817a
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/20/b6/50f42a2dd53e0679ffe5fb74bdc745fcad3b5e0975e9ae356256c0370d06/pyarrow-0.7.1-cp35-cp35m-macosx_10_6_intel.whl#md5=5d06b3332b5bac0682d55f20ab2cb556
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/22/2f/7bf9225142d9db6e67e74cff8a18aa98514159cb5c96b15d15044db9ea5f/pyarrow-0.7.1-cp35-cp35m-win_amd64.whl#md5=111be7aac9a73210c2b1ae8e1e459819
>  (from https://pypi.python.org/simple/pyarrow/); it is not compatible with 
> this Python
>     Skipping link 
> https://pypi.python.org/packages/23/60/f3db27c6a201994a5b1afb4f263afdfa22f5292380379900d7af325d679f/pyarrow-0.5.0-cp35-cp35m-win_amd64.whl#md5=cf45b4190ba1079cc2532c1a9fd09285
>  (from https://pypi.python.org/simple/pyarrow/); 

[jira] [Updated] (ARROW-2119) [C++][Java] Handle Arrow stream with zero record batch

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2119:

Component/s: Java
 C++

> [C++][Java] Handle Arrow stream with zero record batch
> --
>
> Key: ARROW-2119
> URL: https://issues.apache.org/jira/browse/ARROW-2119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.13.0
>
>
> It looks like many places in the code currently assume that there is at least 
> one record batch in the streaming format. Are zero-record-batch streams not 
> supported by design?
> e.g. 
> [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]
> {code:none}
>   public static void convert(InputStream in, OutputStream out) throws 
> IOException {
> BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
> try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>   VectorSchemaRoot root = reader.getVectorSchemaRoot();
>   // load the first batch before instantiating the writer so that we have 
> any dictionaries
>   if (!reader.loadNextBatch()) {
> throw new IOException("Unable to read first record batch");
>   }
>   ...
> {code}
> Pyarrow-0.8.0 does not load 0-recordbatch stream either. It would throw an 
> exception originated from 
> [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309:]
> {code:none}
> Status Table::FromRecordBatches(
>     const std::vector<std::shared_ptr<RecordBatch>>& batches,
>     std::shared_ptr<Table>* table) {
>   if (batches.size() == 0) {
>     return Status::Invalid("Must pass at least one record batch");
>   }
>   ...{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2266) [CI] Improve runtime of integration tests in Travis CI

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2266.
---
Resolution: Won't Fix

The integration tests are taking 20 minutes in Travis CI right now. That seems 
acceptable for now

> [CI] Improve runtime of integration tests in Travis CI
> --
>
> Key: ARROW-2266
> URL: https://issues.apache.org/jira/browse/ARROW-2266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Integration
>Reporter: Wes McKinney
>Priority: Major
>
> I was surprised to see that travis_script_integration.sh is taking over 25 
> minutes to run (https://travis-ci.org/apache/arrow/jobs/349493491). My only 
> real guess about what's going on is that JVM startup time on these hosts is 
> super slow.
> I can think of some things we could do to make things better:
> * Add debugging output so we can see what's slow
> * Write a Java integration test handler that validates multiple files at once
> * Generate a single set of binary files for each producer rather than 
> regenerating them each time (so Java would only need to produce binary files 
> once instead of 3 times like now)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2249) [Java/Python] in-process vector sharing from Java to Python

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2249:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java/Python] in-process vector sharing from Java to Python
> ---
>
> Key: ARROW-2249
> URL: https://issues.apache.org/jira/browse/ARROW-2249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 0.14.0
>
>
> Currently, in all applications of Arrow, we seem to use the IPC capabilities to 
> move data between a Java process and a Python process. While this involves zero 
> serialization, it is not zero-copy. By taking the address and offset, we 
> can already create Python buffers from Java buffers: 
> https://github.com/apache/arrow/pull/1693. This is still a very low-level 
> interface and we should provide the user with:
> * A guide on how to load the Apache Arrow Java libraries in Python (either 
> through a fat-jar shipped with Arrow or guidance on integrating them into 
> their own Java packaging)
> * {{pyarrow.Array.from_jvm}}, {{pyarrow.RecordBatch.from_jvm}}, … functions 
> that take the respective Java objects and emit Python objects. These Python 
> objects should also ensure that the underlying memory regions are kept alive 
> as long as the Python objects exist.
> This issue can also be used as a tracker for the various sub-tasks that will 
> need to be done to complete this rather large milestone.
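The existing low-level path, sketched with {{pyarrow.foreign_buffer}} (assuming that function is available in your pyarrow version; the JVM-bridge wiring, e.g. via jpype, is left out):

{code:python}
import pyarrow as pa

def jvm_buffer_to_pyarrow(arrow_buf):
    # `arrow_buf` is assumed to be an org.apache.arrow.memory.ArrowBuf
    # reached through a JVM bridge such as jpype.
    address = arrow_buf.memoryAddress()
    size = arrow_buf.capacity()
    # Passing `base` keeps the Java object alive while the Python
    # buffer exists, so the memory region is not freed under us.
    return pa.foreign_buffer(address, size, base=arrow_buf)
{code}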



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2260) [C++][Plasma] plasma_store should show usage

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2260:

Fix Version/s: 0.14.0

> [C++][Plasma] plasma_store should show usage
> 
>
> Key: ARROW-2260
> URL: https://issues.apache.org/jira/browse/ARROW-2260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 0.14.0
>
>
> Currently the options exposed by the {{plasma_store}} executable aren't very 
> discoverable:
> {code:bash}
> $ plasma_store -h
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store 
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store --help
> plasma_store: invalid option -- '-'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2260) [C++][Plasma] plasma_store should show usage

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2260:

Summary: [C++][Plasma] plasma_store should show usage  (was: plasma_store 
should show usage)

> [C++][Plasma] plasma_store should show usage
> 
>
> Key: ARROW-2260
> URL: https://issues.apache.org/jira/browse/ARROW-2260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Currently the options exposed by the {{plasma_store}} executable aren't very 
> discoverable:
> {code:bash}
> $ plasma_store -h
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store 
> please specify socket for incoming connections with -s switch
> Abandon
> (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting 
> *)$ plasma_store --help
> plasma_store: invalid option -- '-'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2248) [Python] Nightly or on-demand HDFS test builds

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2248:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Python] Nightly or on-demand HDFS test builds
> --
>
> Key: ARROW-2248
> URL: https://issues.apache.org/jira/browse/ARROW-2248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> We continue to acquire more functionality related to HDFS and Parquet. 
> Testing this, including tests that involve interoperability with other 
> systems, like Spark, will require some work outside of our normal CI 
> infrastructure.
> I suggest we start with testing the C++/Python HDFS integration, which will 
> help with validating patches like ARROW-1643 
> https://github.com/apache/arrow/pull/1668



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2243) [C++] Enable IPO/LTO

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2243.
---
Resolution: Won't Fix

Closing as Won't Fix for now. At some point it may make sense to do some LTO in 
some select portions of the codebase where there are performance benefits

> [C++] Enable IPO/LTO
> 
>
> Key: ARROW-2243
> URL: https://issues.apache.org/jira/browse/ARROW-2243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Minor
> Fix For: 0.13.0
>
>
> We should enable interprocedural/link-time optimization. CMake >= 3.9.4 
> supports a generic way of doing this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1975) [C++] Add abi-compliance-checker to build process

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1975:

Fix Version/s: (was: 0.14.0)

> [C++] Add abi-compliance-checker to build process
> -
>
> Key: ARROW-1975
> URL: https://issues.apache.org/jira/browse/ARROW-1975
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I would like to check our baseline modules with 
> https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades 
> are much smoother and that we don't break the ABI in patch releases. 
> As we're still pre-1.0, I accept that there will be breakage, but I would like 
> to keep it to a minimum. Currently the biggest pain with Arrow is that you 
> always need to pin it in Python with {{==0.x.y}}, otherwise segfaults are 
> inevitable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2051) [Python] Support serializing UUID objects to tables

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2051:

Summary: [Python] Support serializing UUID objects to tables  (was: Support 
serializing UUID objects to tables)

> [Python] Support serializing UUID objects to tables
> ---
>
> Key: ARROW-2051
> URL: https://issues.apache.org/jira/browse/ARROW-2051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Omer Katz
>Priority: Major
>
> UUID objects can be easily supported and can be represented as 128-bit 
> integers or a stream of bytes.
> The fastest way I know to construct a UUID object is by using its 128-bit 
> (16-byte) integer representation.
>  
> {code:java}
> %timeit uuid.UUID(int=24197857161011715162171839636988778104)
> 611 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> %timeit uuid.UUID(bytes=b'\x124Vx\x124Vx\x124Vx\x124Vx')
> 1.17 µs ± 7.5 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> %timeit uuid.UUID('12345678-1234-5678-1234-567812345678')
> 1.47 µs ± 6.08 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> {code}
>  
> Right now I have to do this manually, which is pretty tedious.
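
As a sketch of a manual approach (not a committed API), the 16-byte 
representation can be stored in a fixed-size binary column and the UUIDs 
rebuilt on read:

{code:python}
# Sketch of a manual workaround: store UUIDs as fixed-size binary(16).
import uuid
import pyarrow as pa

ids = [uuid.uuid4() for _ in range(3)]

# each UUID is exactly 16 bytes, so binary(16) preserves the fixed width
arr = pa.array([u.bytes for u in ids], type=pa.binary(16))
table = pa.Table.from_arrays([arr], names=["id"])

# round-trip back to UUID objects
restored = [uuid.UUID(bytes=v.as_py()) for v in arr]
assert restored == ids
{code}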



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2006:

Fix Version/s: 0.14.0

> [C++] Add option to trim excess padding when writing IPC messages
> -
>
> Key: ARROW-2006
> URL: https://issues.apache.org/jira/browse/ARROW-2006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This will help with situations like 
> [https://github.com/apache/arrow/issues/1467] where we don't really need the 
> extra padding bytes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2221) [C++] Nightly build with "infer" tool

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761463#comment-16761463
 ] 

Wes McKinney commented on ARROW-2221:
-

If we dockerize this and put it into the UL buildbot, then we can look at 
these whenever we want.

> [C++] Nightly build with "infer" tool
> -
>
> Key: ARROW-2221
> URL: https://issues.apache.org/jira/browse/ARROW-2221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> As a follow-up to ARROW-1626, we ought to periodically look at the output of 
> the "infer" tool and fix issues as they come up. This is probably too 
> heavyweight to run in each CI build.
> cc [~renesugar]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2183) [C++] Add helper CMake function for globbing the right header files

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2183.
-
   Resolution: Fixed
 Assignee: Wes McKinney
Fix Version/s: 0.12.0

I did this in 0.12

> [C++] Add helper CMake function for globbing the right header files 
> 
>
> Key: ARROW-2183
> URL: https://issues.apache.org/jira/browse/ARROW-2183
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> Brought up by discussion in https://github.com/apache/arrow/pull/1631 on 
> ARROW-2179. We should collect header files but not install those matching 
> patterns for non-public headers, like {{-internal}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2187) RFC: Organize language implementations in a top-level lib/ directory

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2187.
---
Resolution: Won't Fix

> RFC: Organize language implementations in a top-level lib/ directory
> 
>
> Key: ARROW-2187
> URL: https://issues.apache.org/jira/browse/ARROW-2187
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
>
> As we acquire more Arrow implementations, the number of top-level directories 
> may grow significantly. We might consider nesting these implementations under 
> a new top-level directory, similar to Apache Thrift: 
> https://github.com/apache/thrift (see the "lib/" directory)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2148) [Python] to_pandas() on struct array returns object array

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2148.
---
Resolution: Won't Fix

pandas doesn't support NumPy structured dtypes

> [Python] to_pandas() on struct array returns object array
> -
>
> Key: ARROW-2148
> URL: https://issues.apache.org/jira/browse/ARROW-2148
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Antoine Pitrou
>Priority: Major
>
> This should probably return a NumPy struct array instead:
> {code:python}
> >>> arr = pa.array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}],
> ...                type=pa.struct([pa.field('a', pa.int32()), pa.field('b', pa.float64())]))
> >>> arr.type
> StructType(struct<a: int32, b: double>)
> >>> arr.to_pandas()
> array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}], dtype=object)
> {code}
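
Until such a conversion exists, one manual workaround (a sketch that assumes 
exactly these two fields and no nulls) is to rebuild a structured array from 
the object-dtype result:

{code:python}
# Sketch: convert the object-dtype result into a NumPy structured array.
# Assumes fields 'a' (int32) and 'b' (float64) and no null entries.
import numpy as np
import pyarrow as pa

arr = pa.array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}],
               type=pa.struct([pa.field('a', pa.int32()),
                               pa.field('b', pa.float64())]))

records = arr.to_pandas()  # object array of dicts, as shown above
structured = np.array([(d['a'], d['b']) for d in records],
                      dtype=[('a', 'i4'), ('b', 'f8')])
{code}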



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2136) [Python] Non-nullable schema fields not checked in conversions from pandas

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2136:

Fix Version/s: 0.13.0

> [Python] Non-nullable schema fields not checked in conversions from pandas
> --
>
> Key: ARROW-2136
> URL: https://issues.apache.org/jira/browse/ARROW-2136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Priority: Major
> Fix For: 0.13.0
>
>
> If you provide a schema with {{nullable=False}} but pass a {{DataFrame}} 
> that in fact has nulls, it appears the schema is ignored. I would expect an 
> error here.
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a": [1.2, 2.1, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> <Column name='a' type=DataType(double)>
> chunk 0: 
> [
>   1.2,
>   2.1,
>   NA
> ]
> {code}
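
Until such a check is implemented, callers can validate nullability 
themselves; a minimal sketch:

{code:python}
# Sketch of a manual pre-check until from_pandas enforces nullability.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1.2, 2.1, None]})
schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])

# raise before conversion if a non-nullable column contains nulls
for field in schema:
    if not field.nullable and df[field.name].isnull().any():
        raise ValueError("column %r is non-nullable but contains nulls"
                         % field.name)
{code}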



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2130) [Python] Support converting pandas.Timestamp in pyarrow.array

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2130:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Python] Support converting pandas.Timestamp in pyarrow.array
> -
>
> Key: ARROW-2130
> URL: https://issues.apache.org/jira/browse/ARROW-2130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> This is follow-up work to ARROW-2106; since pandas.Timestamp supports 
> nanoseconds, this will require a slightly different code path. Tests should 
> also include using {{Table.from_pandas}}.
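
A sketch of the requested behavior follows; the explicit type below is an 
assumption, since automatic inference of {{timestamp[ns]}} is part of what the 
issue proposes:

{code:python}
# Sketch of what this issue asks for: pandas.Timestamp values accepted
# by pyarrow.array with nanosecond precision preserved.
import pandas as pd
import pyarrow as pa

ts = pd.Timestamp("2019-02-05 12:00:00.123456789")  # nanosecond precision

# passing the type explicitly; inferring timestamp[ns] automatically is
# the behavior the issue proposes
arr = pa.array([ts], type=pa.timestamp("ns"))
{code}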



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2119) Handle Arrow stream with zero record batch

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2119:

Fix Version/s: 0.13.0

> Handle Arrow stream with zero record batch
> --
>
> Key: ARROW-2119
> URL: https://issues.apache.org/jira/browse/ARROW-2119
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.13.0
>
>
> It looks like many places in the code currently assume that a stream 
> contains at least one record batch. Are zero-record-batch streams 
> unsupported by design?
> e.g. 
> [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]
> {code:none}
>   public static void convert(InputStream in, OutputStream out) throws IOException {
>     BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>     try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>       VectorSchemaRoot root = reader.getVectorSchemaRoot();
>       // load the first batch before instantiating the writer so that we have any dictionaries
>       if (!reader.loadNextBatch()) {
>         throw new IOException("Unable to read first record batch");
>       }
>       ...
> {code}
> Pyarrow 0.8.0 does not load a zero-record-batch stream either. It throws an 
> exception originating from 
> [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309:]
> {code:none}
> Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
>                                 std::shared_ptr<Table>* table) {
>   if (batches.size() == 0) {
>     return Status::Invalid("Must pass at least one record batch");
>   }
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2119) [C++][Java] Handle Arrow stream with zero record batch

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2119:

Summary: [C++][Java] Handle Arrow stream with zero record batch  (was: 
Handle Arrow stream with zero record batch)

> [C++][Java] Handle Arrow stream with zero record batch
> --
>
> Key: ARROW-2119
> URL: https://issues.apache.org/jira/browse/ARROW-2119
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.13.0
>
>
> It looks like many places in the code currently assume that a stream 
> contains at least one record batch. Are zero-record-batch streams 
> unsupported by design?
> e.g. 
> [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]
> {code:none}
>   public static void convert(InputStream in, OutputStream out) throws IOException {
>     BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>     try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>       VectorSchemaRoot root = reader.getVectorSchemaRoot();
>       // load the first batch before instantiating the writer so that we have any dictionaries
>       if (!reader.loadNextBatch()) {
>         throw new IOException("Unable to read first record batch");
>       }
>       ...
> {code}
> Pyarrow 0.8.0 does not load a zero-record-batch stream either. It throws an 
> exception originating from 
> [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309:]
> {code:none}
> Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
>                                 std::shared_ptr<Table>* table) {
>   if (batches.size() == 0) {
>     return Status::Invalid("Must pass at least one record batch");
>   }
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2119) [C++][Java] Handle Arrow stream with zero record batch

2019-02-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761458#comment-16761458
 ] 

Wes McKinney commented on ARROW-2119:
-

I think this might be fixed. I added it to 0.13 to verify, at least for C++.
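
A minimal check on the Python side, assuming the stream writer/reader API, is 
to write a stream containing a schema but no batches and read it back:

{code:python}
# Minimal check: a stream with a schema and zero record batches.
import pyarrow as pa

schema = pa.schema([pa.field("x", pa.int64())])
sink = pa.BufferOutputStream()

writer = pa.RecordBatchStreamWriter(sink, schema)
writer.close()  # no batches written

reader = pa.RecordBatchStreamReader(sink.getvalue())
table = reader.read_all()  # should yield an empty table once this works
assert table.num_rows == 0
{code}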

> [C++][Java] Handle Arrow stream with zero record batch
> --
>
> Key: ARROW-2119
> URL: https://issues.apache.org/jira/browse/ARROW-2119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Jingyuan Wang
>Priority: Major
> Fix For: 0.13.0
>
>
> It looks like many places in the code currently assume that a stream 
> contains at least one record batch. Are zero-record-batch streams 
> unsupported by design?
> e.g. 
> [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]
> {code:none}
>   public static void convert(InputStream in, OutputStream out) throws IOException {
>     BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>     try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>       VectorSchemaRoot root = reader.getVectorSchemaRoot();
>       // load the first batch before instantiating the writer so that we have any dictionaries
>       if (!reader.loadNextBatch()) {
>         throw new IOException("Unable to read first record batch");
>       }
>       ...
> {code}
> Pyarrow 0.8.0 does not load a zero-record-batch stream either. It throws an 
> exception originating from 
> [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309:]
> {code:none}
> Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
>                                 std::shared_ptr<Table>* table) {
>   if (batches.size() == 0) {
>     return Status::Invalid("Must pass at least one record batch");
>   }
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

