[jira] [Resolved] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda

2020-03-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8153.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6658
[https://github.com/apache/arrow/pull/6658]

> [Packaging] Update the conda feedstock files and upload artifacts to Anaconda
> -
>
> Key: ARROW-8153
> URL: https://issues.apache.org/jira/browse/ARROW-8153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Windows builds were failing, so the feedstock files must be updated.
> At the same time, add support for uploading the produced artifacts to
> Anaconda, labeled as nightly.





[jira] [Created] (ARROW-8158) Getting length of data buffer and base variable width vector

2020-03-18 Thread Gaurangi Saxena (Jira)
Gaurangi Saxena created ARROW-8158:
--

 Summary: Getting length of data buffer and base variable width 
vector
 Key: ARROW-8158
 URL: https://issues.apache.org/jira/browse/ARROW-8158
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Gaurangi Saxena


For the string data buffer and base variable-width vector, can we have a way to
get the length of the data?

For instance, in ArrowColumnVector's StringAccessor we use stringResult.start
and stringResult.end; instead, we would like to get the length of the data
through an exposed function.
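For reference, the arithmetic such an accessor would wrap is just a difference of adjacent entries in the offsets buffer. A minimal illustration in C++ (the request above is about the Java API; this sketch only shows the computation itself):

{code:cpp}
#include <cstdint>

// In Arrow's variable-width layout, value i occupies
// [offsets[i], offsets[i+1]) in the data buffer, so its length is:
int32_t ValueLength(const int32_t* offsets, int64_t i) {
  return offsets[i + 1] - offsets[i];
}
{code}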





[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7365:
--
Labels: pull-request-available  (was: )

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-7261: we still need to add support for FixedSizeListType in
> the Arrow -> Python conversion (arrow_to_pandas.cc).





[jira] [Commented] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062241#comment-17062241
 ] 

Wes McKinney commented on ARROW-8141:
-

[~frank.du] please leave completed issues in Resolved state

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|
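For context, the kind of transformation being vectorized: expanding bit-packed booleans into one 32-bit value per bit. A hedged sketch using AVX512 mask broadcasts (this is not the code from the pull request; the function name and layout are illustrative only):

{code:cpp}
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: expand each 32-bit word of packed booleans into
// 32 uint32 values (0 or 1) via AVX512 mask-to-vector broadcasts.
void Unpack1To32(const uint32_t* packed, uint32_t* out, size_t num_words) {
  for (size_t i = 0; i < num_words; ++i) {
    const uint32_t bits = packed[i];
    // Each mask bit selects a lane that receives the broadcast 1;
    // unselected lanes are zeroed.
    const __m512i lo = _mm512_maskz_set1_epi32(static_cast<__mmask16>(bits), 1);
    const __m512i hi =
        _mm512_maskz_set1_epi32(static_cast<__mmask16>(bits >> 16), 1);
    _mm512_storeu_si512(out + i * 32, lo);
    _mm512_storeu_si512(out + i * 32 + 16, hi);
  }
}
{code}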





[jira] [Resolved] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8141.
-
Resolution: Fixed

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Reopened] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-8141:
-

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Commented] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062240#comment-17062240
 ] 

Wes McKinney commented on ARROW-8080:
-

[~frank.du] please leave completed issues in "Resolved" state

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Resolved] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8080.
-
Resolution: Fixed

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Reopened] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-8080:
-

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062197#comment-17062197
 ] 

Clark Zinzow commented on ARROW-1231:
-

Sorry for the confusion.  My current plan is to tackle the packaging/toolchain 
issues ARROW-8147 and ARROW-8148, along with anything else required for the 
Arrow C++ build system to be able to build and link against the GCP C++ SDK.
Once that is working, I'm planning on developing an external store GCS 
implementation for Plasma, ARROW-8031, so that objects can be evicted to GCS.  
AFAICT, this shouldn't involve much more than implementing the {{Put}} and 
{{Get}} interfaces using the C++ GCS client {{WriteObject}} and {{ReadObject}} 
APIs, respectively.
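As a rough illustration of that plan, a hedged sketch of a Put/Get wrapper over the google-cloud-cpp storage client. The class name is made up, and the real Plasma {{ExternalStore}} interface (batched object IDs, Status returns) differs from this simplified shape:

{code:cpp}
#include <iterator>
#include <string>
#include <utility>

#include "google/cloud/storage/client.h"

namespace gcs = google::cloud::storage;

// Hypothetical sketch only: store and fetch objects as opaque GCS blobs.
class GcsBlobStore {
 public:
  GcsBlobStore(gcs::Client client, std::string bucket)
      : client_(std::move(client)), bucket_(std::move(bucket)) {}

  // Upload an object's bytes under its ID via the client's WriteObject API.
  void Put(const std::string& object_id, const std::string& data) {
    auto writer = client_.WriteObject(bucket_, object_id);
    writer << data;
    writer.Close();
  }

  // Read the blob back via the client's ReadObject API.
  std::string Get(const std::string& object_id) {
    auto reader = client_.ReadObject(bucket_, object_id);
    return std::string(std::istreambuf_iterator<char>(reader), {});
  }

 private:
  gcs::Client client_;
  std::string bucket_;
};
{code}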

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Closed] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Frank Du (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Du closed ARROW-8080.
---

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Closed] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Frank Du (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Du closed ARROW-8141.
---

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062191#comment-17062191
 ] 

Wes McKinney edited comment on ARROW-1231 at 3/19/20, 1:05 AM:
---

I guess we may be talking past each other. The Arrow C++ build system needs to 
be informed about how to build and/or link to the Google C++ client libraries. 
In other words, adding an option to the build system like {{-DARROW_GCS=ON}} 
like we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem 
in any order you wish. I will wait for your pull requests


was (Author: wesmckinn):
I guess we may be talking past each other. The Arrow C++ build system needs to 
be informed about how to build and link to the Google C++ client libraries. In 
other words, adding an option to the build system like {{-DARROW_GCS=ON}} like 
we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem in 
any order you wish. I will wait for your pull requests

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062191#comment-17062191
 ] 

Wes McKinney commented on ARROW-1231:
-

I guess we may be talking past each other. The Arrow C++ build system needs to 
be informed about how to build and link to the Google C++ client libraries. In 
other words, adding an option to the build system like {{-DARROW_GCS=ON}} like 
we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem in 
any order you wish. I will wait for your pull requests

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188
 ] 

Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM:


Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]and
 the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?


was (Author: clarkzinzow):
Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]]
 and the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188
 ] 

Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM:


Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]
 and the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?


was (Author: clarkzinzow):
Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]and
 the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188
 ] 

Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM:


Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]
 and the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?


was (Author: clarkzinzow):
Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]
 and the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188
 ] 

Clark Zinzow commented on ARROW-1231:
-

Maybe I don't have the correct understanding of the external store interface 
and semantics.  It was my impression after looking at the 
[interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]]
 and the [first pass at an S3 external store 
implementation|[https://github.com/apache/arrow/pull/3559/files#diff-c17d56d3503f18faacf739e160958f6e]]
 that essentially only a put and get interface has to be implemented, where the 
Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ 
GCS client.  Am I understanding that correctly?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188
 ] 

Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:56 AM:


Maybe I don't have a correct understanding of the external store interface and 
semantics.  It was my impression after looking at the 
[interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]]
 and the [first pass at an S3 external store 
implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a 
put and get interface has to be implemented, where the Plasma objects can be 
put/get to/from GCS buckets as opaque blobs using the C++ GCS client.  Am I 
understanding that correctly?


was (Author: clarkzinzow):
Maybe I don't have the correct understanding of the external store interface 
and semantics.  It was my impression after looking at the 
[interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]]
 and the [first pass at an S3 external store 
implementation|[https://github.com/apache/arrow/pull/3559/files#diff-c17d56d3503f18faacf739e160958f6e]]
 that essentially only a put and get interface has to be implemented, where the 
Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ 
GCS client.  Am I understanding that correctly?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Resolved] (ARROW-8122) [Python] Empty numpy arrays with shape cannot be deserialized

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8122.
-
Resolution: Fixed

Issue resolved by pull request 6624
[https://github.com/apache/arrow/pull/6624]

> [Python] Empty numpy arrays with shape cannot be deserialized
> -
>
> Key: ARROW-8122
> URL: https://issues.apache.org/jira/browse/ARROW-8122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Wenjun Si
>Assignee: Wenjun Si
>Priority: Major
>  Labels: pull-request-available, serialization
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In PyArrow 0.16.0, when we try to deserialize a serialized empty NumPy array
> with a shape, for instance np.array([[], []]), an ArrowInvalid is raised.
> Code reproducing this error:
> {code:python}
> import numpy as np
> import pyarrow
> arr = np.array([[], []])
> pyarrow.deserialize(pyarrow.serialize(arr).to_buffer())  # this line cannot work
> {code}
> and the error stack is
> {code:python}
> Traceback (most recent call last):
>   File 
> "/Users/wenjun/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py",
>  line 3326, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 1, in 
> pyarrow.deserialize(pyarrow.serialize(arr).to_buffer())
>   File "pyarrow/serialization.pxi", line 476, in pyarrow.lib.deserialize
>   File "pyarrow/serialization.pxi", line 438, in pyarrow.lib.deserialize_from
>   File "pyarrow/serialization.pxi", line 414, in pyarrow.lib.read_serialized
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: strides must not involve buffer over run
> {code}
> The same code works in PyArrow 0.15.x





[jira] [Assigned] (ARROW-7996) [Python] Error serializing empty pandas DataFrame with pyarrow

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7996:
---

Assignee: Wenjun Si

> [Python] Error serializing empty pandas DataFrame with pyarrow
> --
>
> Key: ARROW-7996
> URL: https://issues.apache.org/jira/browse/ARROW-7996
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Juan David Agudelo
>Assignee: Wenjun Si
>Priority: Major
>  Labels: serialization
> Fix For: 0.17.0
>
>
> The following code does not work:
>  
> {code:python}
> import pandas
> import pyarrow
> df = pandas.DataFrame({"timestamp": [], "value_123": [], "context_123": []})
> data = [df]
> context = pyarrow.default_serialization_context()  
> serialized_data = context.serialize(data)  
> file_path = "file.txt"
> with open(file_path, "wb") as f:  
> serialized_data.write_to(f)
> with open(file_path, "rb") as f:  
> context = pyarrow.default_serialization_context()  
> decoded_data = context.deserialize(f.read())
> {code}
> Throws the following error:
> {code:java}
> ArrowInvalid: strides must not involve buffer over run{code}
> I am using Python 3.6.9 in Ubuntu 18.04 and the version of pyarrow is 0.16.0.





[jira] [Resolved] (ARROW-7996) [Python] Error serializing empty pandas DataFrame with pyarrow

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7996.
-
Resolution: Fixed

Resolved by 
https://github.com/apache/arrow/commit/7916fb49a0e4c125a02f8c13afbe1f749e6b41d7

> [Python] Error serializing empty pandas DataFrame with pyarrow
> --
>
> Key: ARROW-7996
> URL: https://issues.apache.org/jira/browse/ARROW-7996
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Juan David Agudelo
>Assignee: Wenjun Si
>Priority: Major
>  Labels: serialization
> Fix For: 0.17.0
>
>
> The following code does not work:
>  
> {code:python}
> import pandas
> import pyarrow
> df = pandas.DataFrame({"timestamp": [], "value_123": [], "context_123": []})
> data = [df]
> context = pyarrow.default_serialization_context()  
> serialized_data = context.serialize(data)  
> file_path = "file.txt"
> with open(file_path, "wb") as f:  
> serialized_data.write_to(f)
> with open(file_path, "rb") as f:  
> context = pyarrow.default_serialization_context()  
> decoded_data = context.deserialize(f.read())
> {code}
> Throws the following error:
> {code:java}
> ArrowInvalid: strides must not involve buffer over run{code}
> I am using Python 3.6.9 in Ubuntu 18.04 and the version of pyarrow is 0.16.0.





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062175#comment-17062175
 ] 

Wes McKinney commented on ARROW-1231:
-

Perhaps I'm not understanding ARROW-8031. Are you proposing to use the generic 
Filesystem API (instead of a GCS implementation thereof) to offload objects 
from Plasma? If that's the case then I agree. Otherwise if you need to 
read/write to GCS in particular, without this issue being resolved I'm not sure 
how you can proceed. 

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062166#comment-17062166
 ] 

Clark Zinzow commented on ARROW-1231:
-

[~wesm] Ah I don't think I was very clear, sorry about that. I'm mostly 
interested (as an Arrow user) in being able to use GCS as an external store for 
Plasma, ARROW-8031; I was offering to work on the GCS filesystem implementation 
issue since I thought that it was a prerequisite for the Plasma external store 
issue. AFAICT, the only real blockers for working on the external store
implementation for Plasma are adding a google-cloud-cpp conda-forge recipe and
adding google-cloud-cpp to ThirdPartyToolchain, ARROW-8147 and ARROW-8148; 
i.e., if those packaging/toolchain issues are broken out as separate from the 
GCS filesystem issue, then this GCS filesystem implementation is _not_ a 
prerequisite for the external store implementation for Plasma.

If that is the case, I'm asking if I (or someone else, if they are interested) 
could take on the packaging/toolchain issues ARROW-8147 and ARROW-8148, and 
once those are finished, I could work on the GCS external store implementation 
for Plasma.  This would leave the much larger effort around the GCS filesystem 
implementation for later.

Does that make sense?  And is my judgement of the actual GCS filesystem 
implementation _not_ being a prerequisite for the GCS external store 
implementation for Plasma correct?

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Resolved] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7858.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6633
[https://github.com/apache/arrow/pull/6633]

> [C++][Python] Support casting an Extension type to its storage type
> ---
>
> Key: ARROW-7858
> URL: https://issues.apache.org/jira/browse/ARROW-7858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Currently, casting an extension type will always fail: "No cast implemented 
> from extension to ...".
> However, for casting, we could fall back to the storage array's casting rules?
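A hedged sketch of that fallback idea (not the code from the pull request; the helper name is invented, and it assumes a {{Cast}} entry point like the one in the C++ compute API):

{code:cpp}
#include <memory>

#include "arrow/compute/api.h"
#include "arrow/extension_type.h"
#include "arrow/result.h"

// Hypothetical helper: cast an extension array by delegating to the
// casting rules of its underlying storage array.
arrow::Result<arrow::Datum> CastViaStorage(
    const std::shared_ptr<arrow::ExtensionArray>& ext,
    const std::shared_ptr<arrow::DataType>& to_type) {
  return arrow::compute::Cast(arrow::Datum(ext->storage()), to_type);
}
{code}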





[jira] [Created] (ARROW-8157) [C++] Upgrade to LLVM 9

2020-03-18 Thread Jun NAITOH (Jira)
Jun NAITOH created ARROW-8157:
-

 Summary: [C++] Upgrade to LLVM 9
 Key: ARROW-8157
 URL: https://issues.apache.org/jira/browse/ARROW-8157
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jun NAITOH


LLVM 9 has already been released.

LLVM branch 10 has been created on https://apt.llvm.org/, and LLVM branch 9 has
already been promoted to the old-stable branch. We should upgrade the C++ build
to LLVM 9.





[jira] [Closed] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8154.
---

> [Python] HDFS Filesystem does not set environment variables in  pyarrow 
> 0.16.0 release
> --
>
> Key: ARROW-8154
> URL: https://issues.apache.org/jira/browse/ARROW-8154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Eric Henry
>Priority: Major
> Fix For: 0.17.0
>
>
> In pyarrow 0.15.x, HDFS filesystem works as follows:
> If you set HADOOP_HOME env var, it looks for libhdfs.so in 
> $HADOOP_HOME/lib/native.
> In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
> $HADOOP_HOME, which is incorrect behaviour on all systems I am using.
> Also, CLASSPATH no longer gets set automatically, which was very convenient.
> The issue here is that I need to set HADOOP_HOME correctly to be able to use
> other libraries, but have to reset it to use Apache Arrow, e.g.
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> ..do stuff here..
> ...then connect to arrow...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
> hdfs = pyarrow.hdfs.connect(host, port)
> ...then reset my hadoop home...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> etc.
>  
> Example:
> >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> >>> hdfs = pyarrow.hdfs.connect(host, port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
>     extra_conf=extra_conf)
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 89, in 
> pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open 
> shared object file: No such file or directory
>  





[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062135#comment-17062135
 ] 

Wes McKinney commented on ARROW-7854:
-

I'm not totally satisfied with the global memory-mapping option in
LocalFilesystem, so I opened ARROW-8156.

> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format, it would be interesting to be able to memory-map the IPC
> files.
> cc [~fsaintjacques] [~bkietz]





[jira] [Created] (ARROW-8156) [C++] Add variant of Filesystem::OpenInputFile that has memory-map like behavior if it is possible

2020-03-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8156:
---

 Summary: [C++] Add variant of Filesystem::OpenInputFile that has 
memory-map like behavior if it is possible
 Key: ARROW-8156
 URL: https://issues.apache.org/jira/browse/ARROW-8156
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


TensorFlow has the notion of a ReadOnlyMappedRegion

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h#L106

Rather than toggling memory mapping globally at the LocalFilesystem level, it
would be useful for code to be able to request a memory-mapped
{{RandomAccessFile}} if memory mapping is possible.

See also ARROW-7854
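A hedged sketch of what such a variant could look like; the function name is hypothetical, and it leans on the existing {{MemoryMappedFile}} and {{FileSystem::OpenInputFile}} APIs:

{code:cpp}
#include <memory>
#include <string>
#include <utility>

#include "arrow/filesystem/filesystem.h"
#include "arrow/io/file.h"
#include "arrow/result.h"

// Hypothetical variant: return a memory-mapped file when the filesystem is
// local, otherwise fall back to a regular RandomAccessFile.
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> OpenInputFileMapped(
    arrow::fs::FileSystem* fs, const std::string& path) {
  if (fs->type_name() == "local") {
    ARROW_ASSIGN_OR_RAISE(
        auto mapped,
        arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));
    return std::shared_ptr<arrow::io::RandomAccessFile>(std::move(mapped));
  }
  return fs->OpenInputFile(path);
}
{code}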





[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos

2020-03-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062124#comment-17062124
 ] 

Kouhei Sutou commented on ARROW-8145:
-

Thanks!

> [C++] Rename GetTargetInfos
> ---
>
> Key: ARROW-8145
> URL: https://issues.apache.org/jira/browse/ARROW-8145
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Trivial
> Fix For: 0.17.0
>
>
> Sorry, but I think I'm irked by the new "GetTargetInfos" spelling.
> I suggest either "GetTargetInfo" or "GetFileInfo" (both singular).





[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062123#comment-17062123
 ] 

Wes McKinney commented on ARROW-1231:
-

[~clarkzinzow] well, adding the thirdparty dependencies is a necessary 
condition to be able to add a Filesystem implementation that wraps 
google-cloud-cpp, like 

https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc

If you want to work on a filesystem implementation for GCS without dealing with 
the packaging / toolchain issues, you are welcome to do that also. At some 
point all of this work (the filesystem wrapper and thirdparty toolchain 
support) has to be done properly so that we can package and deploy the software 
all the places it needs to go.

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Updated] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8154:

Fix Version/s: 0.17.0

> [Python] HDFS Filesystem does not set environment variables in  pyarrow 
> 0.16.0 release
> --
>
> Key: ARROW-8154
> URL: https://issues.apache.org/jira/browse/ARROW-8154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Eric Henry
>Priority: Major
> Fix For: 0.17.0
>
>
> In pyarrow 0.15.x, HDFS filesystem works as follows:
> If you set HADOOP_HOME env var, it looks for libhdfs.so in 
> $HADOOP_HOME/lib/native.
> In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
> $HADOOP_HOME, which is incorrect behaviour on all systems I am using.
> Also, CLASSPATH no longer gets set automatically, which was very convenient.
> The issue here is that I need to set HADOOP_HOME correctly to be able to use
> other libraries, but have to reset it to use Apache Arrow, e.g.
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> ..do stuff here..
> ...then connect to arrow...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
> hdfs = pyarrow.hdfs.connect(host, port)
> ...then reset my hadoop home...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> etc.
>  
> Example:
> >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> >>> hdfs = pyarrow.hdfs.connect(host, port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
>     extra_conf=extra_conf)
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 89, in 
> pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open 
> shared object file: No such file or directory
>  





[jira] [Resolved] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8154.
-
Resolution: Duplicate

I think so too.

Sorry for the regression.
Please wait for 0.17.0.

> [Python] HDFS Filesystem does not set environment variables in  pyarrow 
> 0.16.0 release
> --
>
> Key: ARROW-8154
> URL: https://issues.apache.org/jira/browse/ARROW-8154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Eric Henry
>Priority: Major
>
> In pyarrow 0.15.x, HDFS filesystem works as follows:
> If you set HADOOP_HOME env var, it looks for libhdfs.so in 
> $HADOOP_HOME/lib/native.
> In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
> $HADOOP_HOME, which is incorrect behaviour on all systems I am using.
> Also, CLASSPATH no longer gets set automatically, which was very convenient.
> The issue here is that I need to set HADOOP_HOME correctly to be able to use
> other libraries, but have to reset it to use Apache Arrow, e.g.
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> ..do stuff here..
> ...then connect to arrow...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
> hdfs = pyarrow.hdfs.connect(host, port)
> ...then reset my hadoop home...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> etc.
>  
> Example:
> >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> >>> hdfs = pyarrow.hdfs.connect(host, port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
>     extra_conf=extra_conf)
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 89, in 
> pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open 
> shared object file: No such file or directory
>  





[jira] [Commented] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062119#comment-17062119
 ] 

Wes McKinney commented on ARROW-8152:
-

What do you think about introducing the read parallelism as an option in 

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/caching.h

? It might make sense to introduce an options struct so we aren't cramming a
bunch of parameters into the {{Cache}} method here.
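For instance, such a struct could hold both the existing coalescing knobs and the new split/concurrency parameters; a hedged sketch (field names here are illustrative, not a committed API):

{code:cpp}
#include <cstdint>

// Hypothetical options struct so new knobs don't keep widening the
// Cache() signature.
struct CacheReadOptions {
  // Coalesce ranges separated by a gap smaller than this.
  int64_t hole_size_limit = 8 * 1024;
  // Split coalesced reads larger than this into multiple requests.
  int64_t range_size_limit = 32 * 1024 * 1024;
  // Maximum number of split requests issued concurrently.
  int max_concurrency = 4;
};
{code}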

> [C++] IO: split large coalesced reads into smaller ones
> ---
>
> Key: ARROW-8152
> URL: https://issues.apache.org/jira/browse/ARROW-8152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Priority: Major
>
> We have a facility to coalesce small reads, but remote filesystems may also 
> benefit from splitting large reads to take advantage of concurrency.





[jira] [Created] (ARROW-8155) [C++] Add "ON only if system dependencies available" build mode for certain optional Arrow components

2020-03-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8155:
---

 Summary: [C++] Add "ON only if system dependencies available" 
build mode for certain optional Arrow components
 Key: ARROW-8155
 URL: https://issues.apache.org/jira/browse/ARROW-8155
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Related to ARROW-8103, there is a need for a static build script to be able to 
build the C++ library in a system environment where some dependencies may not 
be available.

Currently we have toolchain options AUTO, SYSTEM, and BUNDLED:

* AUTO: use system packages if possible, else fall back to BUNDLED
* SYSTEM: use only system packages, failing otherwise
* BUNDLED: build using the ExternalProject facility

There is a case that may not be accounted for. Suppose we want to build with 
LZ4 support _only if_ LZ4 is available on the system. So then something like

{code}
-DARROW_WITH_LZ4=IF_AVAILABLE
{code}

(not sure what this should be called).

The idea is that this would use SYSTEM dependency resolution, and if LZ4 is not 
found then the component would be disabled.

The ROI on this feature might be low, but it would be useful to packagers who
are building from source on an uncertain system (and where downloading a
tarball and building an EP may not be an option).





[jira] [Resolved] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8080.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6585
[https://github.com/apache/arrow/pull/6585]

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Assigned] (ARROW-8080) [C++] Add AVX512 build option

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8080:
---

Assignee: Frank Du

> [C++] Add AVX512 build option
> -
>
> Key: ARROW-8080
> URL: https://issues.apache.org/jira/browse/ARROW-8080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Introduce a build option (ARROW_AVX512) to utilize the compiler features for
> AVX512 machines.
>  





[jira] [Resolved] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8141.
-
Resolution: Fixed

Issue resolved by pull request 6650
[https://github.com/apache/arrow/pull/6650]

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Updated] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8141:

Fix Version/s: 0.17.0

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Assigned] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8141:
---

Assignee: Frank Du

> [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
> --
>
> Key: ARROW-8141
> URL: https://issues.apache.org/jira/browse/ARROW-8141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: image-2020-03-18-11-08-38-201.png
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as
> the major hotspot for the BM_PlainDecodingBoolean indicator.
> Implementing this function with intrinsics shows big improvements. See below
> the results on a CLX 8280 CPU, which is capable of AVX512.
> |Indicator|default SSE build|AVX512 build|AVX512 build + intrinsics|Intrinsics improvement|
> |BM_PlainDecodingBoolean/1024 (G/s)|1.55394|3.77701|5.02805|1.331224964|
> |BM_PlainDecodingBoolean/4096 (G/s)|1.83472|5.3826|8.3443|1.550235945|
> |BM_PlainDecodingBoolean/32768 (G/s)|2.00957|6.1258|10.3793|1.694358288|
> |BM_PlainDecodingBoolean/65536 (G/s)|2.02249|6.20035|10.5778|1.706000468|





[jira] [Commented] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062082#comment-17062082
 ] 

Wes McKinney commented on ARROW-8154:
-

I think this is a dup of ARROW-7841, a regression that has been fixed since 
0.16.0 was released

> [Python] HDFS Filesystem does not set environment variables in  pyarrow 
> 0.16.0 release
> --
>
> Key: ARROW-8154
> URL: https://issues.apache.org/jira/browse/ARROW-8154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Eric Henry
>Priority: Major
>
> In pyarrow 0.15.x, HDFS filesystem works as follows:
> If you set HADOOP_HOME env var, it looks for libhdfs.so in 
> $HADOOP_HOME/lib/native.
> In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
> $HADOOP_HOME, which is incorrect behaviour on all systems I am using.
> Also, CLASSPATH no longer gets set automatically, which was very convenient.
> The issue here is that I need to set HADOOP_HOME correctly to be able to use
> other libraries, but have to reset it to use Apache Arrow, e.g.
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> ..do stuff here..
> ...then connect to arrow...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
> hdfs = pyarrow.hdfs.connect(host, port)
> ...then reset my hadoop home...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> etc.
>  
> Example:
> >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> >>> hdfs = pyarrow.hdfs.connect(host, port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
>     extra_conf=extra_conf)
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 89, in 
> pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open 
> shared object file: No such file or directory
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8154:

Summary: [Python] HDFS Filesystem does not set environment variables in  
pyarrow 0.16.0 release  (was:  HDFS Filesystem does not set environment 
variables in  pyarrow 0.16.0 release)

> [Python] HDFS Filesystem does not set environment variables in  pyarrow 
> 0.16.0 release
> --
>
> Key: ARROW-8154
> URL: https://issues.apache.org/jira/browse/ARROW-8154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Eric Henry
>Priority: Major
>
> In pyarrow 0.15.x, HDFS filesystem works as follows:
> If you set HADOOP_HOME env var, it looks for libhdfs.so in 
> $HADOOP_HOME/lib/native.
> In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
> $HADOOP_HOME, which is incorrect behaviour on all systems I am using.
> Also, CLASSPATH no longer gets set automatically, which used to be very convenient. 
> The issue here is that I need to set hadoop home correctly to be able to use 
> other libraries, but have to reset it to use apache arrow. e.g.
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> ..do stuff here..
> ...then connect to arrow...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
> hdfs = pyarrow.hdfs.connect(host, port)
> ...then reset my hadoop home...
> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> etc.
>  
> Example:
> >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
> >>> hdfs = pyarrow.hdfs.connect(host, port)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
>     extra_conf=extra_conf)
>   File 
> "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
>  line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 89, in 
> pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open 
> shared object file: No such file or directory
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7966:
--
Labels: pull-request-available  (was: )

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 
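For illustration, a minimal Python sketch of the per-batch check this ticket asks for (the actual client is C++; `json_batches` and `flight_batches` are assumed to be parallel lists of RecordBatches):

{code:python}
def verify_batches(json_batches, flight_batches):
    # Compare batch-by-batch instead of concatenating into Tables,
    # so empty batches and batch boundaries are not lost.
    assert len(json_batches) == len(flight_batches), "batch count mismatch"
    for i, (expected, actual) in enumerate(zip(json_batches, flight_batches)):
        assert expected.equals(actual), f"batch {i} differs"
{code}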



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-03-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062043#comment-17062043
 ] 

David Li edited comment on ARROW-7966 at 3/18/20, 8:57 PM:
---

Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig 
deeper...

(On a closer look, it's likely because I neglected to rebuild the Java JARs 
after rebasing!)


was (Author: lidavidm):
Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig 
deeper...

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Assignee: David Li
>Priority: Major
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-03-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062043#comment-17062043
 ] 

David Li commented on ARROW-7966:
-

Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig 
deeper...

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Assignee: David Li
>Priority: Major
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-7966:
---

Assignee: David Li

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Assignee: David Li
>Priority: Major
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8154) HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release

2020-03-18 Thread Eric Henry (Jira)
Eric Henry created ARROW-8154:
-

 Summary:  HDFS Filesystem does not set environment variables in  
pyarrow 0.16.0 release
 Key: ARROW-8154
 URL: https://issues.apache.org/jira/browse/ARROW-8154
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Eric Henry


In pyarrow 0.15.x, HDFS filesystem works as follows:

If you set HADOOP_HOME env var, it looks for libhdfs.so in 
$HADOOP_HOME/lib/native.

In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in 
$HADOOP_HOME, which is incorrect behaviour on all systems I am using.

Also, CLASSPATH no longer gets set automatically, which used to be very convenient. The 
issue here is that I need to set hadoop home correctly to be able to use other 
libraries, but have to reset it to use apache arrow. e.g.

os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"

..do stuff here..

...then connect to arrow...

os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"

hdfs = pyarrow.hdfs.connect(host, port)

...then reset my hadoop home...

os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"

etc.

 

Example:

>>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"

>>> hdfs = pyarrow.hdfs.connect(host, port)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File 
"/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
 line 215, in connect

    extra_conf=extra_conf)

  File 
"/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py",
 line 40, in __init__

    self._connect(host, port, user, kerb_ticket, driver, extra_conf)

  File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect

  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status

OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open shared 
object file: No such file or directory
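A small helper capturing the workaround described above: temporarily point HADOOP_HOME at the lib/native directory only while connecting, then restore it. This is a sketch using the reporter's paths; `host` and `port` are placeholders.

{code:python}
import os
from contextlib import contextmanager

import pyarrow

@contextmanager
def hadoop_home(path):
    # Temporarily override HADOOP_HOME, restoring the old value afterwards.
    old = os.environ.get("HADOOP_HOME")
    os.environ["HADOOP_HOME"] = path
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("HADOOP_HOME", None)
        else:
            os.environ["HADOOP_HOME"] = old

host, port = "namenode", 8020  # placeholder values
with hadoop_home("/usr/lib/hadoop/lib/native"):
    hdfs = pyarrow.hdfs.connect(host, port)
{code}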

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8144) [CI] Cmake 3.2 nightly build fails

2020-03-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8144.
-
Resolution: Fixed

Issue resolved by pull request 6654
[https://github.com/apache/arrow/pull/6654]

> [CI] Cmake 3.2 nightly build fails
> --
>
> Key: ARROW-8144
> URL: https://issues.apache.org/jira/browse/ARROW-8144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In the LLVM 8 Migration PR wget was 
> [removed|https://github.com/apache/arrow/commit/58ec1bc3984b8453011ba6ca45c727ff6ceed78c#diff-0a4bf63085865017969bbbdac6f66880L29]
>  so the build is 
> [missing|https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-18-0-circle-test-ubuntu-18.04-cpp-cmake32]
>  wget.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8145) [C++] Rename GetTargetInfos

2020-03-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8145:
--
Fix Version/s: 0.17.0

> [C++] Rename GetTargetInfos
> ---
>
> Key: ARROW-8145
> URL: https://issues.apache.org/jira/browse/ARROW-8145
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Trivial
> Fix For: 0.17.0
>
>
> Sorry, but I think I'm irked by the new "GetTargetInfos" spelling.
> I suggest either "GetTargetInfo" or "GetFileInfo" (both singular).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos

2020-03-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062036#comment-17062036
 ] 

Antoine Pitrou commented on ARROW-8145:
---

I can do it. There's no hurry in any case.

> [C++] Rename GetTargetInfos
> ---
>
> Key: ARROW-8145
> URL: https://issues.apache.org/jira/browse/ARROW-8145
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Trivial
> Fix For: 0.17.0
>
>
> Sorry, but I think I'm irked by the new "GetTargetInfos" spelling.
> I suggest either "GetTargetInfo" or "GetFileInfo" (both singular).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos

2020-03-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062035#comment-17062035
 ] 

Kouhei Sutou commented on ARROW-8145:
-

Oh, sorry.
I'm OK with either. Should I create a pull request for it? Or do you want to create 
a pull request?

> [C++] Rename GetTargetInfos
> ---
>
> Key: ARROW-8145
> URL: https://issues.apache.org/jira/browse/ARROW-8145
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Trivial
>
> Sorry, but I think I'm irked by the new "GetTargetInfos" spelling.
> I suggest either "GetTargetInfo" or "GetFileInfo" (both singular).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8153:
--
Labels: pull-request-available  (was: )

> [Packaging] Update the conda feedstock files and upload artifacts to Anaconda
> -
>
> Key: ARROW-8153
> URL: https://issues.apache.org/jira/browse/ARROW-8153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> The Windows builds were failing, so the feedstock files must be updated.
> At the same time, add support for uploading the produced artifacts to 
> Anaconda, labeled as nightly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8144) [CI] Cmake 3.2 nightly build fails

2020-03-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8144:

Component/s: Continuous Integration

> [CI] Cmake 3.2 nightly build fails
> --
>
> Key: ARROW-8144
> URL: https://issues.apache.org/jira/browse/ARROW-8144
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the LLVM 8 Migration PR wget was 
> [removed|https://github.com/apache/arrow/commit/58ec1bc3984b8453011ba6ca45c727ff6ceed78c#diff-0a4bf63085865017969bbbdac6f66880L29]
>  so the build is 
> [missing|https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-18-0-circle-test-ubuntu-18.04-cpp-cmake32]
>  wget.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda

2020-03-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8153:
--

 Summary: [Packaging] Update the conda feedstock files and upload 
artifacts to Anaconda
 Key: ARROW-8153
 URL: https://issues.apache.org/jira/browse/ARROW-8153
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


The Windows builds were failing, so the feedstock files must be updated.

At the same time, add support for uploading the produced artifacts to Anaconda, 
labeled as nightly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques closed ARROW-7854.
-
Resolution: Not A Problem

Already supported.

> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format it would be interesting to be able to memory map the IPC 
> files?
> cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7390) [C++][Dataset] Concurrency race in Projector::Project

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7390:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] Concurrency race in Projector::Project 
> --
>
> Key: ARROW-7390
> URL: https://issues.apache.org/jira/browse/ARROW-7390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> When a DataFragment is invoked by 2 scan tasks of the same DataFragment, 
> there's a race to invoke SetInputSchema. Note that ResizeMissingColumns also 
> suffers from this race. The ideal goal is to make Project a const method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8150) [Rust] Allow writing custom FileMetaData k/v pairs

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8150:
--
Labels: pull-request-available  (was: )

> [Rust] Allow writing custom FileMetaData k/v pairs
> --
>
> Key: ARROW-8150
> URL: https://issues.apache.org/jira/browse/ARROW-8150
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: David Kegley
>Priority: Minor
>  Labels: pull-request-available
>
> It would be nice to be able to write custom k/v metadata in the Rust 
> implementation of Parquet. It looks like there was a plan for this previously, 
> but it has not been implemented yet.
> [https://github.com/apache/arrow/blob/a1eb440c92ae1c8edc93bc9ae646a8e371d756c6/rust/parquet/src/file/writer.rs#L182]
> I have a working implementation that adds a `key_value_metadata` field to the 
> `WriterProperties` struct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061994#comment-17061994
 ] 

Clark Zinzow commented on ARROW-1231:
-

[~apitrou] [~wesm] Great, thanks! Is it safe to say that adding the 
google-cloud-cpp conda-forge recipe and adding google-cloud-cpp to 
ThirdPartyToolchain are the only true blockers for adding the GCS external 
store implementation for Plasma? If that's the case and if this issue isn't of 
high priority for anyone ATM, then I would probably prefer to work on 
ARROW-8031 instead of this issue after ARROW-8147 and ARROW-8148 are done, if 
that's acceptable.

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-8152:

Issue Type: Improvement  (was: Bug)

> [C++] IO: split large coalesced reads into smaller ones
> ---
>
> Key: ARROW-8152
> URL: https://issues.apache.org/jira/browse/ARROW-8152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Priority: Major
>
> We have a facility to coalesce small reads, but remote filesystems may also 
> benefit from splitting large reads to take advantage of concurrency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8151) [Benchmarking][Dataset] Benchmark Parquet read performance with S3File

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-8151:

Issue Type: Improvement  (was: Bug)

> [Benchmarking][Dataset] Benchmark Parquet read performance with S3File
> --
>
> Key: ARROW-8151
> URL: https://issues.apache.org/jira/browse/ARROW-8151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking, C++ - Dataset
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> We should establish a performance baseline with the current S3File 
> implementation and Parquet reader before proceeding with work like 
> PARQUET-1698.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones

2020-03-18 Thread David Li (Jira)
David Li created ARROW-8152:
---

 Summary: [C++] IO: split large coalesced reads into smaller ones
 Key: ARROW-8152
 URL: https://issues.apache.org/jira/browse/ARROW-8152
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: David Li


We have a facility to coalesce small reads, but remote filesystems may also 
benefit from splitting large reads to take advantage of concurrency.
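To make the idea concrete, here is a minimal sketch (not the Arrow API) of splitting one large read into fixed-size sub-ranges that could then be fetched concurrently; the 8 MiB split size is an arbitrary illustration:

{code:python}
def split_range(offset: int, length: int, split_size: int = 8 * 1024 * 1024):
    """Yield (offset, length) sub-ranges covering [offset, offset + length)."""
    end = offset + length
    while offset < end:
        n = min(split_size, end - offset)
        yield (offset, n)
        offset += n

# A 20 MiB read becomes three requests: 8 MiB, 8 MiB, 4 MiB.
print(list(split_range(0, 20 * 1024 * 1024)))
{code}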



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8151) [Benchmarking][Dataset] Benchmark Parquet read performance with S3File

2020-03-18 Thread David Li (Jira)
David Li created ARROW-8151:
---

 Summary: [Benchmarking][Dataset] Benchmark Parquet read 
performance with S3File
 Key: ARROW-8151
 URL: https://issues.apache.org/jira/browse/ARROW-8151
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, C++ - Dataset
Reporter: David Li
Assignee: David Li


We should establish a performance baseline with the current S3File 
implementation and Parquet reader before proceeding with work like PARQUET-1698.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8150) [Rust] Allow writing custom FileMetaData k/v pairs

2020-03-18 Thread David Kegley (Jira)
David Kegley created ARROW-8150:
---

 Summary: [Rust] Allow writing custom FileMetaData k/v pairs
 Key: ARROW-8150
 URL: https://issues.apache.org/jira/browse/ARROW-8150
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: David Kegley


It would be nice to be able to write custom k/v metadata in the Rust 
implementation of Parquet. It looks like there was a plan for this previously, 
but it has not been implemented yet.

[https://github.com/apache/arrow/blob/a1eb440c92ae1c8edc93bc9ae646a8e371d756c6/rust/parquet/src/file/writer.rs#L182]

I have a working implementation that adds a `key_value_metadata` field to the 
`WriterProperties` struct.
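For comparison, pyarrow already round-trips user key/value pairs through the Parquet FileMetaData via schema metadata; the proposed `key_value_metadata` field would expose the same capability in Rust. A minimal pyarrow sketch:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# Attach custom key/value metadata; it is persisted in the file footer.
table = table.replace_schema_metadata({"writer": "example", "run_id": "42"})
pq.write_table(table, "example.parquet")

meta = pq.read_metadata("example.parquet").metadata
print(meta[b"writer"], meta[b"run_id"])  # b'example' b'42'
{code}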



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [arrow-testing] pitrou merged pull request #21: PARQUET-1819: [C++] Add parquet fuzz files

2020-03-18 Thread GitBox
pitrou merged pull request #21: PARQUET-1819: [C++] Add parquet fuzz files
URL: https://github.com/apache/arrow-testing/pull/21


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [arrow-testing] pitrou opened a new pull request #21: PARQUET-1819: [C++] Add parquet fuzz files

2020-03-18 Thread GitBox
pitrou opened a new pull request #21: PARQUET-1819: [C++] Add parquet fuzz files
URL: https://github.com/apache/arrow-testing/pull/21


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-03-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7673:
---
Fix Version/s: (was: 0.17.0)
   1.0.0

> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Ben Kietzman  (was: Joris Van den Bossche)

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
> Key: ARROW-8088
> URL: https://issues.apache.org/jira/browse/ARROW-8088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a 
> dictionary type, the materialization of the partition keys goes wrong: you 
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
> 'foo': np.array(foo_keys, dtype='i4').repeat(15),
> 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
> 'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ... ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ... ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Joris Van den Bossche

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
> Key: ARROW-8088
> URL: https://issues.apache.org/jira/browse/ARROW-8088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a 
> dictionary type, the materialization of the partition keys goes wrong: you 
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
> 'foo': np.array(foo_keys, dtype='i4').repeat(15),
> 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
> 'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ... ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ... ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8088:
-
Fix Version/s: 0.17.0

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
> Key: ARROW-8088
> URL: https://issues.apache.org/jira/browse/ARROW-8088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a 
> dictionary type, the materialization of the partition keys goes wrong: you 
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
> 'foo': np.array(foo_keys, dtype='i4').repeat(15),
> 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
> 'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ... ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ... ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8127.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6637
[https://github.com/apache/arrow/pull/6637]

> [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
> --
>
> Key: ARROW-8127
> URL: https://issues.apache.org/jira/browse/ARROW-8127
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: TP Boudreau
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: multipage-batch-write.cc
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When writing to a buffered column writer using PLAIN encoding, if the size of 
> the batch supplied for writing exceeds the page size for the writer, the 
> resulting file has an incorrect data_page_offset set in its column chunk 
> metadata.  This causes an exception to be thrown when reading the file (file 
> appears to be too short to the reader).
> For example, the attached code, which attempts to write a batch of 262145 
> Int32's (= 1048576 + 4 bytes) using the default page size of 1048576 bytes 
> (with buffered writer, PLAIN encoding), fails on reading, throwing the error: 
> "Tried reading 1048678 bytes starting at position 1048633 from file but only 
> got 333".
> The error is caused by the second page write tripping the conditional here 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302,
>  in the serialized in-memory writer wrapped by the buffered writer.
> The fix builds the metadata with offsets from the terminal sink rather than 
> the in-memory buffered sink. A PR is coming.
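The size arithmetic that spills onto a second page, as a quick check:

{code:python}
page_size = 1048576  # default data page size, in bytes
values = 262145      # Int32 values in the batch
assert values * 4 == page_size + 4  # one full page plus 4 bytes -> second page
{code}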



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Joris Van den Bossche  (was: Ben Kietzman)

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
> Key: ARROW-8088
> URL: https://issues.apache.org/jira/browse/ARROW-8088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a 
> dictionary type, the materialization of the partition keys goes wrong: you 
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
> 'foo': np.array(foo_keys, dtype='i4').repeat(15),
> 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
> 'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ... ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ... ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-8088:


Assignee: Ben Kietzman  (was: Joris Van den Bossche)

> [C++][Dataset] Partition columns with specified dictionary type result in all 
> nulls
> ---
>
> Key: ARROW-8088
> URL: https://issues.apache.org/jira/browse/ARROW-8088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a 
> dictionary type, the materialization of the partition keys goes wrong: you 
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
> 'foo': np.array(foo_keys, dtype='i4').repeat(15),
> 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
> 'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit 
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ... ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ... ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5572) [Python] raise error message when passing invalid filter in parquet reading

2020-03-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061915#comment-17061915
 ] 

Joris Van den Bossche commented on ARROW-5572:
--

This now works correctly with the new Datasets API, since we can filter on both 
partition keys and "normal" columns. 

So once we use the datasets API under the hood in pyarrow.parquet (ARROW-8039), 
this issue will be resolved.


> [Python] raise error message when passing invalid filter in parquet reading
> ---
>
> Key: ARROW-5572
> URL: https://issues.apache.org/jira/browse/ARROW-5572
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset-parquet-read, parquet
>
> From 
> https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset
> For example, when specifying a column in the filter which is a normal column 
> and not a key in your partitioned folder hierarchy, the filter gets silently 
> ignored. It would be nice to get an error message for this.  
> Reproducible example:
> {code:python}
> df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]})
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a'])
> # filter on 'a' (partition column) -> works
> pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas()
> # filter on normal column (in future could do row group filtering) -> silently does nothing
> pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6122) [C++] ArgSort kernel must support FixedSizeBinary

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6122:
--
Fix Version/s: (was: 0.17.0)

> [C++] ArgSort kernel must support FixedSizeBinary
> -
>
> Key: ARROW-6122
> URL: https://issues.apache.org/jira/browse/ARROW-6122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061882#comment-17061882
 ] 

Joris Van den Bossche commented on ARROW-7854:
--

[~fsaintjacques] this actually already turned out to be possible from the 
Python side, by specifying this option when creating the LocalFileSystem 
object. 


> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format it would be interesting to be able to memory map the IPC 
> files?
> cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7820) [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7820:
--
Fix Version/s: (was: 0.17.0)

> [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library
> ---
>
> Key: ARROW-7820
> URL: https://issues.apache.org/jira/browse/ARROW-7820
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> We should be able to inject LLVM IR into libraries, assuming that `llc` is 
> found on the platform.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7818) [C++][Gandiva] Generate Filter kernels from gandiva code at compile time

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7818:
--
Fix Version/s: (was: 0.17.0)

> [C++][Gandiva] Generate Filter kernels from gandiva code at compile time
> 
>
> Key: ARROW-7818
> URL: https://issues.apache.org/jira/browse/ARROW-7818
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, C++ - Gandiva
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> The goal of this feature is to support generating kernels at compile time 
> (and possibly runtime if gandiva is linked) to avoid rewriting C++ kernels 
> that gandiva knows how to compile. The generated kernels would be linked in 
> the compute module. 
> This is an experimental task that will guide future development, notably 
> implementing aggregate kernels once in Gandiva instead of maintaining both 
> C++ and Gandiva implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit

2020-03-18 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5744:

Fix Version/s: (was: 0.17.0)
   1.0.0

> [C++] Do not error in Table::CombineChunks for BinaryArray types that 
> overflow 2GB limit
> 
>
> Key: ARROW-5744
> URL: https://issues.apache.org/jira/browse/ARROW-5744
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Discovered during ARROW-5635 code review



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7854:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Option to memory map when reading IPC format
> ---
>
> Key: ARROW-7854
> URL: https://issues.apache.org/jira/browse/ARROW-7854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> For the IPC format it would be interesting to be able to memory map the IPC 
> files?
> cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8149) [C++/Python] Enable CUDA Support in conda recipes

2020-03-18 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8149:
---

 Summary: [C++/Python] Enable CUDA Support in conda recipes
 Key: ARROW-8149
 URL: https://issues.apache.org/jira/browse/ARROW-8149
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Packaging
Reporter: Uwe Korn
 Fix For: 0.17.0


See the changes in 
[https://github.com/conda-forge/arrow-cpp-feedstock/pull/123], we need to copy 
this into the Arrow repository and also test CUDA in these recipes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7673:
--
Fix Version/s: (was: 1.0.0)
   0.17.0

> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.17.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7579) [FlightRPC] Make Handshake optional

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7579:

Fix Version/s: (was: 0.17.0)

> [FlightRPC] Make Handshake optional
> ---
>
> Key: ARROW-7579
> URL: https://issues.apache.org/jira/browse/ARROW-7579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>
> We should make it possible to _not_ invoke Handshake for services that don't 
> want it. Especially when using it with flight-grpc, where the standard gRPC 
> authentication mechanisms don't know about Flight and try to authenticate the 
> Handshake endpoint - it's easy to forget to configure this endpoint to bypass 
> authentication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6062) [FlightRPC] Allow timeouts on all stream reads

2020-03-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061876#comment-17061876
 ] 

David Li commented on ARROW-6062:
-

Removing from 0.17. 

> [FlightRPC] Allow timeouts on all stream reads
> --
>
> Key: ARROW-6062
> URL: https://issues.apache.org/jira/browse/ARROW-6062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>
> Anywhere we offer reading from a stream in Flight, we need to offer a 
> timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7579) [FlightRPC] Make Handshake optional

2020-03-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061874#comment-17061874
 ] 

David Li commented on ARROW-7579:
-

Not a blocker for 0.17, removing from fix versions.

> [FlightRPC] Make Handshake optional
> ---
>
> Key: ARROW-7579
> URL: https://issues.apache.org/jira/browse/ARROW-7579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>
> We should make it possible to _not_ invoke Handshake for services that don't 
> want it. Especially when using it with flight-grpc, where the standard gRPC 
> authentication mechanisms don't know about Flight and try to authenticate the 
> Handshake endpoint - it's easy to forget to configure this endpoint to bypass 
> authentication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6062) [FlightRPC] Allow timeouts on all stream reads

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-6062:

Fix Version/s: (was: 0.17.0)

> [FlightRPC] Allow timeouts on all stream reads
> --
>
> Key: ARROW-6062
> URL: https://issues.apache.org/jira/browse/ARROW-6062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>
> Anywhere we offer reading from a stream in Flight, we need to offer a 
> timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5745) [C++] properties of Map(Array|Type) are confusingly named

2020-03-18 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5745:

Fix Version/s: (was: 0.17.0)
   1.0.0

> [C++] properties of Map(Array|Type) are confusingly named
> -
>
> Key: ARROW-5745
> URL: https://issues.apache.org/jira/browse/ARROW-5745
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> In the context of ListArrays, "values" indicates the elements in a slot of 
> the ListArray. Since MapArray is a ListArray, "values" indicates the same 
> thing and the elements are key-item pairs. This naming scheme is not 
> idiomatic; these *should* be called key-value pairs but that would require 
> propagating the renaming down to ListArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7673:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8058) [C++][Python][Dataset] Provide an option to toggle validation and schema inference in FileSystemDatasetFactoryOptions

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8058:
--
Fix Version/s: (was: 1.0.0)
   0.17.0

> [C++][Python][Dataset] Provide an option to toggle validation and schema 
> inference in FileSystemDatasetFactoryOptions
> -
>
> Key: ARROW-8058
> URL: https://issues.apache.org/jira/browse/ARROW-8058
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.17.0
>
>
> This can be costly and is not always necessary.
> At the same time we could move file validation into the scan tasks; currently 
> all files are inspected as the dataset is constructed, which can be expensive 
> if the filesystem is slow. We'll be performing the validation multiple times 
> but the check will be cheap since at scan time we'll be reading the file into 
> memory anyway.
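
A sketch of what such a toggle could look like (the `validate_fragments` field 
is hypothetical, named here only for illustration):

    #include "arrow/dataset/discovery.h"

    void ConfigureFactory() {
      arrow::dataset::FileSystemFactoryOptions options;
      options.exclude_invalid_files = true;  // existing flag
      // Hypothetical flag: skip the up-front inspection of every file and
      // let each scan task validate a file as it reads it into memory.
      // options.validate_fragments = false;
    }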



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8058) [C++][Python][Dataset] Provide an option to toggle validation and schema inference in FileSystemDatasetFactoryOptions

2020-03-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8058:
-

Assignee: Francois Saint-Jacques

> [C++][Python][Dataset] Provide an option to toggle validation and schema 
> inference in FileSystemDatasetFactoryOptions
> -
>
> Key: ARROW-8058
> URL: https://issues.apache.org/jira/browse/ARROW-8058
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> This can be costly and is not always necessary.
> At the same time we could move file validation into the scan tasks; currently 
> all files are inspected as the dataset is constructed, which can be expensive 
> if the filesystem is slow. We'll be performing the validation multiple times 
> but the check will be cheap since at scan time we'll be reading the file into 
> memory anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4484) [Java] improve Flight DoPut busy wait

2020-03-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061873#comment-17061873
 ] 

David Li commented on ARROW-4484:
-

Not a blocker for any version.

> [Java] improve Flight DoPut busy wait
> -
>
> Key: ARROW-4484
> URL: https://issues.apache.org/jira/browse/ARROW-4484
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: David Li
>Priority: Major
>  Labels: flight
>
> Currently the implementation of putNext in FlightClient.java busy-waits until 
> gRPC indicates that the server can receive a message. We should either 
> improve the busy-wait (e.g. add sleep times), or rethink the API and make it 
> non-blocking.
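
The code in question is Java (FlightClient.putNext), but purely to illustrate 
the first option, a C++ sketch of a bounded exponential backoff in place of a 
hot spin:

    #include <algorithm>
    #include <chrono>
    #include <functional>
    #include <thread>

    // Sleep between readiness checks, doubling the delay up to a 10 ms cap,
    // instead of spinning in a tight loop.
    void WaitUntilReady(const std::function<bool()>& is_ready) {
      auto delay = std::chrono::microseconds(10);
      while (!is_ready()) {
        std::this_thread::sleep_for(delay);
        delay = std::min(delay * 2, std::chrono::microseconds(10000));
      }
    }

The non-blocking alternative would avoid sleeping altogether, e.g. by 
completing a callback or future when gRPC signals that the sink is ready.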



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4484) [Java] improve Flight DoPut busy wait

2020-03-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-4484:

Fix Version/s: (was: 0.17.0)

> [Java] improve Flight DoPut busy wait
> -
>
> Key: ARROW-4484
> URL: https://issues.apache.org/jira/browse/ARROW-4484
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: David Li
>Priority: Major
>  Labels: flight
>
> Currently the implementation of putNext in FlightClient.java busy-waits until 
> gRPC indicates that the server can receive a message. We should either 
> improve the busy-wait (e.g. add sleep times), or rethink the API and make it 
> non-blocking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8147) [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8147:

Summary: [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain  
(was: [Packaging] Add google-cloud-cpp to ThirdpartyToolchain)

> [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain
> 
>
> Key: ARROW-8147
> URL: https://issues.apache.org/jira/browse/ARROW-8147
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is a requirement to be able to make progress on ARROW-1231



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8147) [C++] Add google-cloud-cpp to ThirdpartyToolchain

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8147:

Summary: [C++] Add google-cloud-cpp to ThirdpartyToolchain  (was: 
[C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain)

> [C++] Add google-cloud-cpp to ThirdpartyToolchain
> -
>
> Key: ARROW-8147
> URL: https://issues.apache.org/jira/browse/ARROW-8147
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is a requirement to be able to make progress on ARROW-1231



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge

2020-03-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8148:
---

 Summary: [Packaging][C++] Add google-cloud-cpp to conda-forge
 Key: ARROW-8148
 URL: https://issues.apache.org/jira/browse/ARROW-8148
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Wes McKinney


This is a requirement for ARROW-1231 to be able to move forward



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8147) [Packaging] Add google-cloud-cpp to ThirdpartyToolchain

2020-03-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8147:
---

 Summary: [Packaging] Add google-cloud-cpp to ThirdpartyToolchain
 Key: ARROW-8147
 URL: https://issues.apache.org/jira/browse/ARROW-8147
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This is a requirement to be able to make progress on ARROW-1231



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2020-03-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061849#comment-17061849
 ] 

Wes McKinney commented on ARROW-1231:
-

[~clarkzinzow] note that google-cloud-cpp does not seem to be available in 
conda-forge yet, so I'm opening a child issue about dealing with that. I don't 
know anyone else for whom this is a short-term priority until later this year, 
so we are happy to help and give advice / code review.

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8146:
--
Labels: pull-request-available  (was: )

> [C++] Add per-filesystem facility to sanitize a path
> 
>
> Key: ARROW-8146
> URL: https://issues.apache.org/jira/browse/ARROW-8146
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path

2020-03-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8146:
-

 Summary: [C++] Add per-filesystem facility to sanitize a path
 Key: ARROW-8146
 URL: https://issues.apache.org/jira/browse/ARROW-8146
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-18 Thread Jacek Pliszka (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Pliszka reopened ARROW-3329:
--

It is resolved in C++.

Now I need to work on the Python part.

Thank you for your work! What is left of my contribution is now mostly braces. :)
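
For reference, a rough C++ sketch of the cast that now works (modern 
compute-API shape; exact signatures have shifted between releases):

    #include "arrow/compute/api.h"
    #include "arrow/datum.h"
    #include "arrow/result.h"

    arrow::Result<arrow::Datum> CastToInt64(const arrow::Datum& decimals) {
      // Safe() errors out on overflow or truncation instead of wrapping.
      auto options = arrow::compute::CastOptions::Safe();
      return arrow::compute::Cast(decimals, arrow::int64(), options);
    }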

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast a pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> Is it not implemented yet, or am I not using it correctly? If it is not 
> implemented yet, is there any workaround to cast such columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8145) [C++] Rename GetTargetInfos

2020-03-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8145:
-

 Summary: [C++] Rename GetTargetInfos
 Key: ARROW-8145
 URL: https://issues.apache.org/jira/browse/ARROW-8145
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++, Python
Reporter: Antoine Pitrou


Sorry, but I think I'm irked by the new "GetTargetInfos" spelling.
I suggest either "GetTargetInfo" or "GetFileInfo" (both singular).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

