[jira] [Resolved] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda
[ https://issues.apache.org/jira/browse/ARROW-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8153. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6658 [https://github.com/apache/arrow/pull/6658] > [Packaging] Update the conda feedstock files and upload artifacts to Anaconda > - > > Key: ARROW-8153 > URL: https://issues.apache.org/jira/browse/ARROW-8153 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Windows builds were failing, so the feedstock files must be updated. > At the same time, add support for uploading the produced artifacts to > Anaconda, labeled as nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8158) Getting length of data buffer and base variable width vector
Gaurangi Saxena created ARROW-8158: -- Summary: Getting length of data buffer and base variable width vector Key: ARROW-8158 URL: https://issues.apache.org/jira/browse/ARROW-8158 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Gaurangi Saxena For the string data buffer and base variable-width vector, can we have a way to get the length of the data? For instance, in ArrowColumnVector's StringAccessor we use stringResult.start and stringResult.end; instead, we would like to get the length of the data through an exposed function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7365: -- Labels: pull-request-available (was: ) > [Python] Support FixedSizeList type in conversion to numpy/pandas > - > > Key: ARROW-7365 > URL: https://issues.apache.org/jira/browse/ARROW-7365 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > > Follow-up on ARROW-7261, still need to add support for FixedSizeListType in > the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062241#comment-17062241 ] Wes McKinney commented on ARROW-8141: - [~frank.du] please leave completed issues in Resolved state > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
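For readers parsing the table, the "Intrinsics improvements" column appears to be the throughput ratio of the AVX512 + intrinsics build over the plain AVX512 build; a quick sanity check on the first row (values copied from the table):

```python
# Sanity-check the "Intrinsics improvements" column of the benchmark table:
# it is the throughput ratio of the AVX512+intrinsics build to the plain
# AVX512 build. Values are from the BM_PlainDecodingBoolean/1024 row.
avx512_gs = 3.77701       # avx512 build (G/s)
intrinsics_gs = 5.02805   # avx512 build + intrinsics (G/s)

ratio = intrinsics_gs / avx512_gs
print(f"{ratio:.9f}")  # ~1.331224964, matching the reported factor
```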
[jira] [Resolved] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8141. - Resolution: Fixed > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-8141: - > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062240#comment-17062240 ] Wes McKinney commented on ARROW-8080: - [~frank.du] please leave completed issues in "Resolved" state > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Introduce a build option (ARROW_AVX512) to utilize the compiler features for > AVX512 machines. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8080. - Resolution: Fixed > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Introduce a build option (ARROW_AVX512) to utilize the compiler features for > AVX512 machines. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-8080: - > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Introduce a build option (ARROW_AVX512) to utilize the compiler features for > AVX512 machines. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062197#comment-17062197 ] Clark Zinzow commented on ARROW-1231: - Sorry for the confusion. My current plan is to tackle the packaging/toolchain issues ARROW-8147 and ARROW-8148, along with anything else required for the Arrow C++ build system to be able to build and link against the GCP C++ sdk. Once that is working, I'm planning on developing an external store GCS implementation for Plasma, ARROW-8031, so that objects can be evicted to GCS. AFAICT, this shouldn't involve much more than implementing the {{Put}} and {{Get}} interfaces using the C++ GCS client {{WriteObject}} and {{ReadObject}} APIs, respectively. > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
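To make the scope concrete, here is a minimal, self-contained Python sketch of the put/get external-store surface described above. FakeGCSStore and its dict-backed bucket are hypothetical stand-ins; a real C++ implementation would call the GCS client's WriteObject/ReadObject instead:

```python
# Minimal sketch of an external-store interface in the spirit of Plasma's
# ExternalStore: evicted objects are written out as opaque blobs and read
# back on demand. All names here are hypothetical illustrations.

class ExternalStore:
    """Abstract put/get surface an external store must implement."""

    def put(self, object_id, data):
        raise NotImplementedError

    def get(self, object_id):
        raise NotImplementedError


class FakeGCSStore(ExternalStore):
    """Dict-backed stand-in for a GCS bucket (no real network I/O)."""

    def __init__(self):
        self._bucket = {}

    def put(self, object_id, data):
        # A real implementation would use the GCS client's WriteObject.
        self._bucket[object_id] = data

    def get(self, object_id):
        # A real implementation would use the GCS client's ReadObject.
        return self._bucket[object_id]


store = FakeGCSStore()
store.put(b"obj-1", b"evicted plasma object")
print(store.get(b"obj-1"))  # b'evicted plasma object'
```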
[jira] [Closed] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Du closed ARROW-8080. --- > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Introduce a build option (ARROW_AVX512) to utilize the compiler features for > AVX512 machines. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Du closed ARROW-8141. --- > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > We are running benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062191#comment-17062191 ] Wes McKinney edited comment on ARROW-1231 at 3/19/20, 1:05 AM: --- I guess we may be talking past each other. The Arrow C++ build system needs to be informed about how to build and/or link to the Google C++ client libraries. In other words, adding an option to the build system like {{-DARROW_GCS=ON}} like we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem in any order you wish. I will wait for your pull requests was (Author: wesmckinn): I guess we may be talking past each other. The Arrow C++ build system needs to be informed about how to build and link to the Google C++ client libraries. In other words, adding an option to the build system like {{-DARROW_GCS=ON}} like we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem in any order you wish. I will wait for your pull requests > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062191#comment-17062191 ] Wes McKinney commented on ARROW-1231: - I guess we may be talking past each other. The Arrow C++ build system needs to be informed about how to build and link to the Google C++ client libraries. In other words, adding an option to the build system like {{-DARROW_GCS=ON}} like we currently have {{-DARROW_S3=ON}}. You are welcome to tackle the problem in any order you wish. I will wait for your pull requests > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
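As a configuration sketch only: the existing S3 toggle and the proposed GCS toggle would sit side by side in the CMake invocation. -DARROW_S3=ON exists today; -DARROW_GCS=ON is hypothetical here and not implemented yet.

```shell
# Hypothetical configure step for an Arrow C++ checkout, run from a build
# directory. ARROW_S3 is a real option today; ARROW_GCS is the proposed
# analogue and does not exist yet.
cmake .. \
  -DARROW_FILESYSTEM=ON \
  -DARROW_S3=ON \
  -DARROW_GCS=ON
```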
[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188 ] Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM: Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? was (Author: clarkzinzow): Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]] and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188 ] Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM: Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h] and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? was (Author: clarkzinzow): Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188 ] Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:57 AM: Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h] and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? was (Author: clarkzinzow): Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h] and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188 ] Clark Zinzow commented on ARROW-1231: - Maybe I don't have the correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]] and the [first pass at an S3 external store implementation|[https://github.com/apache/arrow/pull/3559/files#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062188#comment-17062188 ] Clark Zinzow edited comment on ARROW-1231 at 3/19/20, 12:56 AM: Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]] and the [first pass at an S3 external store implementation|#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? was (Author: clarkzinzow): Maybe I don't have the correct understanding of the external store interface and semantics. It was my impression after looking at the [interface|[https://github.com/apache/arrow/blob/master/cpp/src/plasma/external_store.h]] and the [first pass at an S3 external store implementation|[https://github.com/apache/arrow/pull/3559/files#diff-c17d56d3503f18faacf739e160958f6e]] that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8122) [Python] Empty numpy arrays with shape cannot be deserialized
[ https://issues.apache.org/jira/browse/ARROW-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8122. - Resolution: Fixed Issue resolved by pull request 6624 [https://github.com/apache/arrow/pull/6624] > [Python] Empty numpy arrays with shape cannot be deserialized > - > > Key: ARROW-8122 > URL: https://issues.apache.org/jira/browse/ARROW-8122 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Wenjun Si >Assignee: Wenjun Si >Priority: Major > Labels: pull-request-available, serialization > Fix For: 0.17.0 > > Time Spent: 1h > Remaining Estimate: 0h > > In PyArrow 0.16.0, when we try to deserialize a serialized empty Numpy Array > with shape, for instance, np.array([[], []]), an ArrowInvalid is raised. > Code reproducing this error: > {code:python} > import numpy as np > import pyarrow > arr = np.array([[], []]) > pyarrow.deserialize(pyarrow.serialize(arr).to_buffer()) # this line cannot > work > {code} > and the error stack is > {code:python} > Traceback (most recent call last): > File > "/Users/wenjun/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", > line 3326, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > pyarrow.deserialize(pyarrow.serialize(arr).to_buffer()) > File "pyarrow/serialization.pxi", line 476, in pyarrow.lib.deserialize > File "pyarrow/serialization.pxi", line 438, in pyarrow.lib.deserialize_from > File "pyarrow/serialization.pxi", line 414, in pyarrow.lib.read_serialized > File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: strides must not involve buffer over run > {code} > The same code works in PyArrow 0.15.x -- This message was sent by Atlassian Jira (v8.3.4#803005)
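For context on the failure mode: the problematic inputs are arrays that are empty yet still carry a multi-dimensional shape, so their strides are nonzero while the underlying buffer holds zero bytes, which appears to be what the 0.16.0 stride validation rejects:

```python
import numpy as np

# An empty array can still have a non-trivial shape and nonzero strides,
# even though its data buffer contains zero bytes.
arr = np.array([[], []])
print(arr.shape)    # (2, 0)
print(arr.size)     # 0
print(arr.strides)  # nonzero byte strides over an empty buffer
```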
[jira] [Assigned] (ARROW-7996) [Python] Error serializing empty pandas DataFrame with pyarrow
[ https://issues.apache.org/jira/browse/ARROW-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-7996: --- Assignee: Wenjun Si > [Python] Error serializing empty pandas DataFrame with pyarrow > -- > > Key: ARROW-7996 > URL: https://issues.apache.org/jira/browse/ARROW-7996 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Juan David Agudelo >Assignee: Wenjun Si >Priority: Major > Labels: serialization > Fix For: 0.17.0 > > > The following code does not work: > > {code:python} > import pandas > import pyarrow > df = pandas.DataFrame({"timestamp": [], "value_123": [], "context_123": []}) > data = [df] > context = pyarrow.default_serialization_context() > serialized_data = context.serialize(data) > file_path = "file.txt" > with open(file_path, "wb") as f: > serialized_data.write_to(f) > with open(file_path, "rb") as f: > context = pyarrow.default_serialization_context() > decoded_data = context.deserialize(f.read()) > {code} > Throws the following error: > {code:java} > ArrowInvalid: strides must not involve buffer over run{code} > I am using Python 3.6.9 in Ubuntu 18.04 and the version of pyarrow is 0.16.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7996) [Python] Error serializing empty pandas DataFrame with pyarrow
[ https://issues.apache.org/jira/browse/ARROW-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7996. - Resolution: Fixed Resolved by https://github.com/apache/arrow/commit/7916fb49a0e4c125a02f8c13afbe1f749e6b41d7 > [Python] Error serializing empty pandas DataFrame with pyarrow > -- > > Key: ARROW-7996 > URL: https://issues.apache.org/jira/browse/ARROW-7996 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Juan David Agudelo >Assignee: Wenjun Si >Priority: Major > Labels: serialization > Fix For: 0.17.0 > > > The following code does not work: > > {code:python} > import pandas > import pyarrow > df = pandas.DataFrame({"timestamp": [], "value_123": [], "context_123": []}) > data = [df] > context = pyarrow.default_serialization_context() > serialized_data = context.serialize(data) > file_path = "file.txt" > with open(file_path, "wb") as f: > serialized_data.write_to(f) > with open(file_path, "rb") as f: > context = pyarrow.default_serialization_context() > decoded_data = context.deserialize(f.read()) > {code} > Throws the following error: > {code:java} > ArrowInvalid: strides must not involve buffer over run{code} > I am using Python 3.6.9 in Ubuntu 18.04 and the version of pyarrow is 0.16.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062175#comment-17062175 ] Wes McKinney commented on ARROW-1231: - Perhaps I'm not understanding ARROW-8031. Are you proposing to use the generic Filesystem API (instead of a GCS implementation thereof) to offload objects from Plasma? If that's the case then I agree. Otherwise if you need to read/write to GCS in particular, without this issue being resolved I'm not sure how you can proceed. > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062166#comment-17062166 ] Clark Zinzow commented on ARROW-1231: - [~wesm] Ah I don't think I was very clear, sorry about that. I'm mostly interested (as an Arrow user) in being able to use GCS as an external store for Plasma, ARROW-8031; I was offering to work on the GCS filesystem implementation issue since I thought that it was a prerequisite for the Plasma external store issue. AFAICT, the only real blockers for working on the external store implementation for Plasma is adding a google-cloud-cpp conda-forge recipe and adding google-cloud-cpp to ThirdPartyToolchain, ARROW-8147 and ARROW-8148; i.e., if those packaging/toolchain issues are broken out as separate from the GCS filesystem issue, then this GCS filesystem implementation is _not_ a prerequisite for the external store implementation for Plasma. If that is the case, I'm asking if I (or someone else, if they are interested) could take on the packaging/toolchain issues ARROW-8147 and ARROW-8148, and once those are finished, I could work on the GCS external store implementation for Plasma. This would leave the much larger effort around the GCS filesystem implementation for later. Does that make sense? And is my judgement of the actual GCS filesystem implementation _not_ being a prerequisite for the GCS external store implementation for Plasma correct? > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type
[ https://issues.apache.org/jira/browse/ARROW-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7858. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6633 [https://github.com/apache/arrow/pull/6633] > [C++][Python] Support casting an Extension type to its storage type > --- > > Key: ARROW-7858 > URL: https://issues.apache.org/jira/browse/ARROW-7858 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 4h > Remaining Estimate: 0h > > Currently, casting an extension type will always fail: "No cast implemented > from extension to ...". > However, for casting, we could fall back to the storage array's casting rules? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8157) [C++] Upgrade to LLVM 9
Jun NAITOH created ARROW-8157: - Summary: [C++] Upgrade to LLVM 9 Key: ARROW-8157 URL: https://issues.apache.org/jira/browse/ARROW-8157 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jun NAITOH LLVM 9 has already been released: the LLVM 10 branch has been created on https://apt.llvm.org/, and LLVM 9 has been promoted to the old-stable branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
[ https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8154. --- > [Python] HDFS Filesystem does not set environment variables in pyarrow > 0.16.0 release > -- > > Key: ARROW-8154 > URL: https://issues.apache.org/jira/browse/ARROW-8154 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Eric Henry >Priority: Major > Fix For: 0.17.0 > > > In pyarrow 0.15.x, HDFS filesystem works as follows: > If you set HADOOP_HOME env var, it looks for libhdfs.so in > $HADOOP_HOME/lib/native. > In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in > $HADOOP_HOME, which is incorrect behaviour on all systems I am using. > Also, CLASSPATH no longer gets set automatically, which is very convenient. > The issue here is that I need to set hadoop home correctly to be able to use > other libraries, but have to reset it to use apache arrow. e.g. > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > ..do stuff here.. > ...then connect to arrow... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" > hdfs = pyarrow.hdfs.connect(host, port) > ...then reset my hadoop home... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > etc. 
> > Example: > >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > >>> hdfs = pyarrow.hdfs.connect(host, port) > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 40, in __init__ > self._connect(host, port, user, kerb_ticket, driver, extra_conf) > File "pyarrow/io-hdfs.pxi", line 89, in > pyarrow.lib.HadoopFileSystem._connect > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open > shared object file: No such file or directory > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062135#comment-17062135 ] Wes McKinney commented on ARROW-7854: - I'm not totally satisfied by the global memory mapping option in LocalFilesystem so I opened ARROW-8156 > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8156) [C++] Add variant of Filesystem::OpenInputFile that has memory-map like behavior if it is possible
Wes McKinney created ARROW-8156: --- Summary: [C++] Add variant of Filesystem::OpenInputFile that has memory-map like behavior if it is possible Key: ARROW-8156 URL: https://issues.apache.org/jira/browse/ARROW-8156 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney TensorFlow has the notion of a ReadOnlyMappedRegion https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h#L106 Rather than toggling memory mapping globally at the LocalFilesystem level, it would be useful for code to be able to request a memory-mapped {{RandomAccessFile}} if memory mapping is possible. See also ARROW-7854 -- This message was sent by Atlassian Jira (v8.3.4#803005)
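The "memory-map if possible" behavior requested in ARROW-8156 can be sketched with the Python standard library. This is an illustrative helper only (the function name `open_input` is hypothetical and not part of Arrow's API):

```python
import mmap

def open_input(path):
    """Open a file for reading, returning a read-only memory-mapped view
    when the platform and file allow it, and falling back to a regular
    buffered file object otherwise."""
    f = open(path, "rb")
    try:
        # ACCESS_READ gives a read-only, zero-copy view of the file.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (ValueError, OSError):
        # e.g. an empty file, or a filesystem that cannot be mapped;
        # the caller still gets a readable object either way.
        return f
```

Callers can then slice the returned object for zero-copy reads when mapping succeeded, without having to toggle mapping globally.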
[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos
[ https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062124#comment-17062124 ] Kouhei Sutou commented on ARROW-8145: - Thanks! > [C++] Rename GetTargetInfos > --- > > Key: ARROW-8145 > URL: https://issues.apache.org/jira/browse/ARROW-8145 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Trivial > Fix For: 0.17.0 > > > Sorry, but I think I'm irked by the new "GetTargetInfos" spelling. > I suggest either "GetTargetInfo" or "GetFileInfo" (both singular). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062123#comment-17062123 ] Wes McKinney commented on ARROW-1231: - [~clarkzinzow] well, adding the thirdparty dependencies is a necessary condition to be able to add a Filesystem implementation that wraps google-cloud-cpp, like https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc If you want to work on a filesystem implementation for GCS without dealing with the packaging / toolchain issues, you are welcome to do that also. At some point all of this work (the filesystem wrapper and thirdparty toolchain support) has to be done properly so that we can package and deploy the software all the places it needs to go. > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
[ https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8154: Fix Version/s: 0.17.0 > [Python] HDFS Filesystem does not set environment variables in pyarrow > 0.16.0 release > -- > > Key: ARROW-8154 > URL: https://issues.apache.org/jira/browse/ARROW-8154 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Eric Henry >Priority: Major > Fix For: 0.17.0 > > > In pyarrow 0.15.x, HDFS filesystem works as follows: > If you set HADOOP_HOME env var, it looks for libhdfs.so in > $HADOOP_HOME/lib/native. > In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in > $HADOOP_HOME, which is incorrect behaviour on all systems I am using. > Also, CLASSPATH no longer gets set automatically, which is very convenient. > The issue here is that I need to set hadoop home correctly to be able to use > other libraries, but have to reset it to use apache arrow. e.g. > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > ..do stuff here.. > ...then connect to arrow... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" > hdfs = pyarrow.hdfs.connect(host, port) > ...then reset my hadoop home... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > etc. 
> > Example: > >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > >>> hdfs = pyarrow.hdfs.connect(host, port) > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 40, in __init__ > self._connect(host, port, user, kerb_ticket, driver, extra_conf) > File "pyarrow/io-hdfs.pxi", line 89, in > pyarrow.lib.HadoopFileSystem._connect > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open > shared object file: No such file or directory > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
[ https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8154. - Resolution: Duplicate I think so too. Sorry for the regression. Please wait for 0.17.0. > [Python] HDFS Filesystem does not set environment variables in pyarrow > 0.16.0 release > -- > > Key: ARROW-8154 > URL: https://issues.apache.org/jira/browse/ARROW-8154 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Eric Henry >Priority: Major > > In pyarrow 0.15.x, HDFS filesystem works as follows: > If you set HADOOP_HOME env var, it looks for libhdfs.so in > $HADOOP_HOME/lib/native. > In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in > $HADOOP_HOME, which is incorrect behaviour on all systems I am using. > Also, CLASSPATH no longer gets set automatically, which is very convenient. > The issue here is that I need to set hadoop home correctly to be able to use > other libraries, but have to reset it to use apache arrow. e.g. > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > ..do stuff here.. > ...then connect to arrow... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" > hdfs = pyarrow.hdfs.connect(host, port) > ...then reset my hadoop home... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > etc. 
> > Example: > >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > >>> hdfs = pyarrow.hdfs.connect(host, port) > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 40, in __init__ > self._connect(host, port, user, kerb_ticket, driver, extra_conf) > File "pyarrow/io-hdfs.pxi", line 89, in > pyarrow.lib.HadoopFileSystem._connect > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open > shared object file: No such file or directory > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones
[ https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062119#comment-17062119 ] Wes McKinney commented on ARROW-8152: - What do you think about introducing the read parallelism as an option in https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/caching.h ? It might make sense to introduce an options struct so we aren't cramming a bunch of parameters into the {{Cache}} method here > [C++] IO: split large coalesced reads into smaller ones > --- > > Key: ARROW-8152 > URL: https://issues.apache.org/jira/browse/ARROW-8152 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > > We have a facility to coalesce small reads, but remote filesystems may also > benefit from splitting large reads to take advantage of concurrency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
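The splitting direction discussed in ARROW-8152 amounts to breaking one large `(offset, length)` read into bounded sub-reads that a remote filesystem can issue concurrently. A minimal sketch (the helper `split_read` and its `max_chunk` parameter are hypothetical, not Arrow API):

```python
def split_read(offset, length, max_chunk):
    """Split one large (offset, length) read into sub-reads of at most
    max_chunk bytes each, preserving order and total coverage."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    chunks = []
    end = offset + length
    while offset < end:
        n = min(max_chunk, end - offset)
        chunks.append((offset, n))
        offset += n
    return chunks
```

Packing `max_chunk` (and any read-parallelism knob) into an options struct, as suggested above, keeps the `Cache` signature stable as more parameters are added.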
[jira] [Created] (ARROW-8155) [C++] Add "ON only if system dependencies available" build mode for certain optional Arrow components
Wes McKinney created ARROW-8155: --- Summary: [C++] Add "ON only if system dependencies available" build mode for certain optional Arrow components Key: ARROW-8155 URL: https://issues.apache.org/jira/browse/ARROW-8155 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Related to ARROW-8103, there is a need for a static build script to be able to build the C++ library in a system environment where some dependencies may not be available. Currently we have toolchain options AUTO, SYSTEM, and BUNDLED: * AUTO: use system packages if possible, else BUNDLE * SYSTEM: use only system packages, failing otherwise * BUNDLED: build using the ExternalProject facility There is a case that may not be accounted for. Suppose we want to build with LZ4 support _only if_ LZ4 is available on the system. So then something like {code} -DARROW_WITH_LZ4=IF_AVAILABLE {code} (not sure what this should be called). The idea is that this would use SYSTEM dependency resolution, and if LZ4 is not found then the component would be disabled. The ROI on this feature might be low, but it would be useful to packagers who are building from source on an uncertain system (and where downloading a tarball and building an EP may not be an option) -- This message was sent by Atlassian Jira (v8.3.4#803005)
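The resolution semantics proposed in ARROW-8155 (the three existing toolchain modes plus the new one) can be summarized in a small decision function. This is illustrative only; the real logic would live in CMake, and the mode name `IF_AVAILABLE` is the ticket's tentative suggestion:

```python
def resolve_dependency(mode, system_has_it):
    """Return how a dependency is satisfied under each toolchain mode:
    AUTO falls back to BUNDLED, SYSTEM fails hard when missing, BUNDLED
    always builds via ExternalProject, and the proposed IF_AVAILABLE
    disables the component instead of failing or bundling."""
    if mode == "AUTO":
        return "SYSTEM" if system_has_it else "BUNDLED"
    if mode == "SYSTEM":
        if not system_has_it:
            raise RuntimeError("required system package not found")
        return "SYSTEM"
    if mode == "BUNDLED":
        return "BUNDLED"
    if mode == "IF_AVAILABLE":
        return "SYSTEM" if system_has_it else "DISABLED"
    raise ValueError("unknown mode: " + mode)
```

The key difference from AUTO is the missing-package branch: DISABLED rather than BUNDLED, which is what a source-build packager on an uncertain system would want.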
[jira] [Resolved] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8080. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6585 [https://github.com/apache/arrow/pull/6585] > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Introduce a build option(ARROW_AVX512) to utilize the compiler feature for > AVX512 machine. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8080) [C++] Add AVX512 build option
[ https://issues.apache.org/jira/browse/ARROW-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8080: --- Assignee: Frank Du > [C++] Add AVX512 build option > - > > Key: ARROW-8080 > URL: https://issues.apache.org/jira/browse/ARROW-8080 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Introduce a build option(ARROW_AVX512) to utilize the compiler feature for > AVX512 machine. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8141. - Resolution: Fixed Issue resolved by pull request 6650 [https://github.com/apache/arrow/pull/6650] > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1.5h > Remaining Estimate: 0h > > We ran benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8141: Fix Version/s: 0.17.0 > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1.5h > Remaining Estimate: 0h > > We ran benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
[ https://issues.apache.org/jira/browse/ARROW-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8141: --- Assignee: Frank Du > [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API > -- > > Key: ARROW-8141 > URL: https://issues.apache.org/jira/browse/ARROW-8141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Frank Du >Assignee: Frank Du >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: image-2020-03-18-11-08-38-201.png > > Time Spent: 1.5h > Remaining Estimate: 0h > > We ran benchmarks on the Arrow AVX512 build; perf shows unpack1_32 as > the major hotspot for the BM_PlainDecodingBoolean indicator. > Implementing this function with intrinsics shows big improvements. See below the > results on a CLX 8280 CPU, which is capable of AVX512. > |Indicator|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics > improvements| > |BM_PlainDecodingBoolean/1024(G/s)|1.55394|3.77701|5.02805|1.331224964| > |BM_PlainDecodingBoolean/4096(G/s)|1.83472|5.3826|8.3443|1.550235945| > |BM_PlainDecodingBoolean/32768(G/s)|2.00957|6.1258|10.3793|1.694358288| > |BM_PlainDecodingBoolean/65536(G/s)|2.02249|6.20035|10.5778|1.706000468| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
[ https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062082#comment-17062082 ] Wes McKinney commented on ARROW-8154: - I think this is a dup of ARROW-7841, a regression that has been fixed since 0.16.0 was released > [Python] HDFS Filesystem does not set environment variables in pyarrow > 0.16.0 release > -- > > Key: ARROW-8154 > URL: https://issues.apache.org/jira/browse/ARROW-8154 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Eric Henry >Priority: Major > > In pyarrow 0.15.x, HDFS filesystem works as follows: > If you set HADOOP_HOME env var, it looks for libhdfs.so in > $HADOOP_HOME/lib/native. > In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in > $HADOOP_HOME, which is incorrect behaviour on all systems I am using. > Also, CLASSPATH no longer gets set automatically, which is very convenient. > The issue here is that I need to set hadoop home correctly to be able to use > other libraries, but have to reset it to use apache arrow. e.g. > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > ..do stuff here.. > ...then connect to arrow... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" > hdfs = pyarrow.hdfs.connect(host, port) > ...then reset my hadoop home... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > etc. 
> > Example: > >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > >>> hdfs = pyarrow.hdfs.connect(host, port) > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 40, in __init__ > self._connect(host, port, user, kerb_ticket, driver, extra_conf) > File "pyarrow/io-hdfs.pxi", line 89, in > pyarrow.lib.HadoopFileSystem._connect > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open > shared object file: No such file or directory > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8154) [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
[ https://issues.apache.org/jira/browse/ARROW-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8154: Summary: [Python] HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release (was: HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release) > [Python] HDFS Filesystem does not set environment variables in pyarrow > 0.16.0 release > -- > > Key: ARROW-8154 > URL: https://issues.apache.org/jira/browse/ARROW-8154 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Eric Henry >Priority: Major > > In pyarrow 0.15.x, HDFS filesystem works as follows: > If you set HADOOP_HOME env var, it looks for libhdfs.so in > $HADOOP_HOME/lib/native. > In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in > $HADOOP_HOME, which is incorrect behaviour on all systems I am using. > Also, CLASSPATH no longer gets set automatically, which is very convenient. > The issue here is that I need to set hadoop home correctly to be able to use > other libraries, but have to reset it to use apache arrow. e.g. > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > ..do stuff here.. > ...then connect to arrow... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" > hdfs = pyarrow.hdfs.connect(host, port) > ...then reset my hadoop home... > os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > etc. 
> > Example: > >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" > >>> hdfs = pyarrow.hdfs.connect(host, port) > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", > line 40, in __init__ > self._connect(host, port, user, kerb_ticket, driver, extra_conf) > File "pyarrow/io-hdfs.pxi", line 89, in > pyarrow.lib.HadoopFileSystem._connect > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open > shared object file: No such file or directory > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7966: -- Labels: pull-request-available (was: ) > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Assignee: David Li >Priority: Major > Labels: pull-request-available > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062043#comment-17062043 ] David Li edited comment on ARROW-7966 at 3/18/20, 8:57 PM: --- Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig deeper... (On a closer look, it's likely because I neglected to rebuild the Java JARs after rebasing!) was (Author: lidavidm): Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig deeper... > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Assignee: David Li >Priority: Major > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062043#comment-17062043 ] David Li commented on ARROW-7966: - Fixing this causes the test I added in ARROW-7899 to fail again. I'll dig deeper... > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Assignee: David Li >Priority: Major > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-7966: --- Assignee: David Li > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Assignee: David Li >Priority: Major > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8154) HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
Eric Henry created ARROW-8154: - Summary: HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release Key: ARROW-8154 URL: https://issues.apache.org/jira/browse/ARROW-8154 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Eric Henry In pyarrow 0.15.x, HDFS filesystem works as follows: If you set HADOOP_HOME env var, it looks for libhdfs.so in $HADOOP_HOME/lib/native. In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in $HADOOP_HOME, which is incorrect behaviour on all systems I am using. Also, CLASSPATH no longer gets set automatically, which is very inconvenient. The issue here is that I need to set hadoop home correctly to be able to use other libraries, but have to reset it to use apache arrow. e.g. os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" ..do stuff here.. ...then connect to arrow... os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native" hdfs = pyarrow.hdfs.connect(host, port) ...then reset my hadoop home... os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" etc. Example: >>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop" >>> hdfs = pyarrow.hdfs.connect(host, port) Traceback (most recent call last): File "", line 1, in File "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", line 215, in connect extra_conf=extra_conf) File "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", line 40, in __init__ self._connect(host, port, user, kerb_ticket, driver, extra_conf) File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open shared object file: No such file or directory -- This message was sent by Atlassian Jira (v8.3.4#803005)
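The set/connect/reset dance in the report can be made less error-prone with a small context manager that restores HADOOP_HOME automatically. This is a workaround sketch, not part of pyarrow; `temp_env` is a hypothetical helper:

```python
import os
from contextlib import contextmanager

@contextmanager
def temp_env(name, value):
    """Temporarily set an environment variable, restoring the previous
    value (or removing the variable) when the block exits."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            del os.environ[name]
        else:
            os.environ[name] = old
```

With this, the connection becomes `with temp_env("HADOOP_HOME", "/usr/lib/hadoop/lib/native"): hdfs = pyarrow.hdfs.connect(host, port)`, and the original HADOOP_HOME is restored even if the connect call raises.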
[jira] [Resolved] (ARROW-8144) [CI] Cmake 3.2 nightly build fails
[ https://issues.apache.org/jira/browse/ARROW-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8144. - Resolution: Fixed Issue resolved by pull request 6654 [https://github.com/apache/arrow/pull/6654] > [CI] Cmake 3.2 nightly build fails > -- > > Key: ARROW-8144 > URL: https://issues.apache.org/jira/browse/ARROW-8144 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 50m > Remaining Estimate: 0h > > In the LLVM 8 Migration PR wget was > [removed|https://github.com/apache/arrow/commit/58ec1bc3984b8453011ba6ca45c727ff6ceed78c#diff-0a4bf63085865017969bbbdac6f66880L29] > so the build is > [missing|https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-18-0-circle-test-ubuntu-18.04-cpp-cmake32] > wget. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8145) [C++] Rename GetTargetInfos
[ https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8145: -- Fix Version/s: 0.17.0 > [C++] Rename GetTargetInfos > --- > > Key: ARROW-8145 > URL: https://issues.apache.org/jira/browse/ARROW-8145 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Trivial > Fix For: 0.17.0 > > > Sorry, but I think I'm irked by the new "GetTargetInfos" spelling. > I suggest either "GetTargetInfo" or "GetFileInfo" (both singular). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos
[ https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062036#comment-17062036 ] Antoine Pitrou commented on ARROW-8145: --- I can do it. There's no hurry in any case. > [C++] Rename GetTargetInfos > --- > > Key: ARROW-8145 > URL: https://issues.apache.org/jira/browse/ARROW-8145 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Trivial > Fix For: 0.17.0 > > > Sorry, but I think I'm irked by the new "GetTargetInfos" spelling. > I suggest either "GetTargetInfo" or "GetFileInfo" (both singular). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8145) [C++] Rename GetTargetInfos
[ https://issues.apache.org/jira/browse/ARROW-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062035#comment-17062035 ] Kouhei Sutou commented on ARROW-8145: - Oh, sorry. I'm OK with either. Should I create a pull request for it? Or do you want to create a pull request? > [C++] Rename GetTargetInfos > --- > > Key: ARROW-8145 > URL: https://issues.apache.org/jira/browse/ARROW-8145 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Trivial > > Sorry, but I think I'm irked by the new "GetTargetInfos" spelling. > I suggest either "GetTargetInfo" or "GetFileInfo" (both singular). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda
[ https://issues.apache.org/jira/browse/ARROW-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8153: -- Labels: pull-request-available (was: ) > [Packaging] Update the conda feedstock files and upload artifacts to Anaconda > - > > Key: ARROW-8153 > URL: https://issues.apache.org/jira/browse/ARROW-8153 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > The windows builds were failing, so the feedstock files must be updated. > Under the same hat add support for uploading the produced artifacts to > Anaconda labeled as nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8144) [CI] Cmake 3.2 nightly build fails
[ https://issues.apache.org/jira/browse/ARROW-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8144: Component/s: Continuous Integration > [CI] Cmake 3.2 nightly build fails > -- > > Key: ARROW-8144 > URL: https://issues.apache.org/jira/browse/ARROW-8144 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 40m > Remaining Estimate: 0h > > In the LLVM 8 Migration PR wget was > [removed|https://github.com/apache/arrow/commit/58ec1bc3984b8453011ba6ca45c727ff6ceed78c#diff-0a4bf63085865017969bbbdac6f66880L29] > so the build is > [missing|https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-18-0-circle-test-ubuntu-18.04-cpp-cmake32] > wget. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8153) [Packaging] Update the conda feedstock files and upload artifacts to Anaconda
Krisztian Szucs created ARROW-8153: -- Summary: [Packaging] Update the conda feedstock files and upload artifacts to Anaconda Key: ARROW-8153 URL: https://issues.apache.org/jira/browse/ARROW-8153 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs The windows builds were failing, so the feedstock files must be updated. Under the same hat add support for uploading the produced artifacts to Anaconda labeled as nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques closed ARROW-7854. - Resolution: Not A Problem Already supported. > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7390) [C++][Dataset] Concurrency race in Projector::Project
[ https://issues.apache.org/jira/browse/ARROW-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7390: -- Labels: pull-request-available (was: ) > [C++][Dataset] Concurrency race in Projector::Project > -- > > Key: ARROW-7390 > URL: https://issues.apache.org/jira/browse/ARROW-7390 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > > When a DataFragment is invoked by 2 scan tasks of the same DataFragment, > there's a race to invoke SetInputSchema. Note that ResizeMissingColumns also > suffers from this race. The ideal goal is to make Project a const method. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8150) [Rust] Allow writing custom FileMetaData k/v pairs
[ https://issues.apache.org/jira/browse/ARROW-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8150: -- Labels: pull-request-available (was: ) > [Rust] Allow writing custom FileMetaData k/v pairs > -- > > Key: ARROW-8150 > URL: https://issues.apache.org/jira/browse/ARROW-8150 > Project: Apache Arrow > Issue Type: Improvement >Reporter: David Kegley >Priority: Minor > Labels: pull-request-available > > It would be nice to be able to write custom k/v metadata in the rust > implementation of parquet. It looks like there was a plan for this previously > but it has not been implemented yet. > [https://github.com/apache/arrow/blob/a1eb440c92ae1c8edc93bc9ae646a8e371d756c6/rust/parquet/src/file/writer.rs#L182] > I have a working implementation that adds a `key_value_metadata` field to the > `WriterProperties` struct -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061994#comment-17061994 ] Clark Zinzow commented on ARROW-1231: - [~apitrou] [~wesm] Great, thanks! Is it safe to say that adding the google-cloud-cpp conda-forge recipe and adding google-cloud-cpp to ThirdPartyToolchain are the only true blockers for adding the GCS external store implementation for Plasma? If that's the case and if this issue isn't of high priority for anyone ATM, then I would probably prefer to work on ARROW-8031 instead of this issue after ARROW-8147 and ARROW-8148 are done, if that's acceptable. > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones
[ https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-8152: Issue Type: Improvement (was: Bug) > [C++] IO: split large coalesced reads into smaller ones > --- > > Key: ARROW-8152 > URL: https://issues.apache.org/jira/browse/ARROW-8152 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > > We have a facility to coalesce small reads, but remote filesystems may also > benefit from splitting large reads to take advantage of concurrency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8151) [Benchmarking][Dataset] Benchmark Parquet read performance with S3File
[ https://issues.apache.org/jira/browse/ARROW-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-8151: Issue Type: Improvement (was: Bug) > [Benchmarking][Dataset] Benchmark Parquet read performance with S3File > -- > > Key: ARROW-8151 > URL: https://issues.apache.org/jira/browse/ARROW-8151 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking, C++ - Dataset >Reporter: David Li >Assignee: David Li >Priority: Major > > We should establish a performance baseline with the current S3File > implementation and Parquet reader before proceeding with work like > PARQUET-1698. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones
David Li created ARROW-8152: --- Summary: [C++] IO: split large coalesced reads into smaller ones Key: ARROW-8152 URL: https://issues.apache.org/jira/browse/ARROW-8152 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li We have a facility to coalesce small reads, but remote filesystems may also benefit from splitting large reads to take advantage of concurrency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
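The idea in ARROW-8152 — the complement of read coalescing — is easy to see in miniature: one large read range is cut into fixed-size subranges that can be issued concurrently. The following is a hedged sketch of that range-splitting step only (the function name and the chunk-size policy are hypothetical; the real facility is C++ and its heuristics may differ):

```python
def split_read(offset, length, max_chunk):
    """Split one large read (offset, length) into at most max_chunk-sized
    subranges, so a remote filesystem can fetch them concurrently."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    ranges = []
    while length > 0:
        n = min(length, max_chunk)  # last chunk may be shorter
        ranges.append((offset, n))
        offset += n
        length -= n
    return ranges

# e.g. a 10 MiB read split into 4 MiB chunks:
# split_read(0, 10 * 2**20, 4 * 2**20)
# -> [(0, 4194304), (4194304, 4194304), (8388608, 2097152)]
```

Each returned `(offset, length)` pair could then be handed to a separate request, which is where the concurrency benefit mentioned in the issue comes from.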
[jira] [Created] (ARROW-8151) [Benchmarking][Dataset] Benchmark Parquet read performance with S3File
David Li created ARROW-8151: --- Summary: [Benchmarking][Dataset] Benchmark Parquet read performance with S3File Key: ARROW-8151 URL: https://issues.apache.org/jira/browse/ARROW-8151 Project: Apache Arrow Issue Type: Bug Components: Benchmarking, C++ - Dataset Reporter: David Li Assignee: David Li We should establish a performance baseline with the current S3File implementation and Parquet reader before proceeding with work like PARQUET-1698. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8150) [Rust] Allow writing custom FileMetaData k/v pairs
David Kegley created ARROW-8150: --- Summary: [Rust] Allow writing custom FileMetaData k/v pairs Key: ARROW-8150 URL: https://issues.apache.org/jira/browse/ARROW-8150 Project: Apache Arrow Issue Type: Improvement Reporter: David Kegley It would be nice to be able to write custom k/v metadata in the rust implementation of parquet. It looks like there was a plan for this previously but it has not been implemented yet. [https://github.com/apache/arrow/blob/a1eb440c92ae1c8edc93bc9ae646a8e371d756c6/rust/parquet/src/file/writer.rs#L182] I have a working implementation that adds a `key_value_metadata` field to the `WriterProperties` struct -- This message was sent by Atlassian Jira (v8.3.4#803005)
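The shape of the change proposed in ARROW-8150 — an optional `key_value_metadata` field on the writer properties that gets merged into the file's FileMetaData — can be sketched in Python terms. This is an illustration of the design only; the real change is in the Rust parquet crate, and the struct layout and merge policy here are assumptions, not its actual API:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class WriterProperties:
    # Hypothetical mirror of the proposed Rust field: optional custom
    # key/value pairs to be written into the parquet FileMetaData.
    key_value_metadata: Optional[Dict[str, str]] = None

def build_file_metadata(props: WriterProperties) -> Dict[str, str]:
    """Merge writer-generated entries with user-supplied pairs.

    Here user keys win on conflict; the real implementation would have
    to pick (and document) one policy.
    """
    meta = {"created_by": "parquet writer (sketch)"}
    if props.key_value_metadata:
        meta.update(props.key_value_metadata)
    return meta
```

The point of threading the field through `WriterProperties` rather than the write call itself is that the metadata is fixed for the lifetime of the writer, matching how the other writer options are configured.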
[GitHub] [arrow-testing] pitrou merged pull request #21: PARQUET-1819: [C++] Add parquet fuzz files
pitrou merged pull request #21: PARQUET-1819: [C++] Add parquet fuzz files URL: https://github.com/apache/arrow-testing/pull/21 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [arrow-testing] pitrou opened a new pull request #21: PARQUET-1819: [C++] Add parquet fuzz files
pitrou opened a new pull request #21: PARQUET-1819: [C++] Add parquet fuzz files URL: https://github.com/apache/arrow-testing/pull/21 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Updated] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode
[ https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7673: --- Fix Version/s: (was: 0.17.0) 1.0.0 > [C++][Dataset] Revisit File discovery failure mode > -- > > Key: ARROW-7673 > URL: https://issues.apache.org/jira/browse/ARROW-7673 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will > silently ignore unsupported files (either IO error, not of the valid format, > corruption, missing compression codecs, etc...) when creating a > `FileSystemSource`. > We should change this behavior to propagate an error in the Inspect/Finish > calls by default and allow the user to toggle `exclude_invalid_files`. The > error should contain at least the file path and a decipherable error (if > possible). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-8088: Assignee: Ben Kietzman (was: Joris Van den Bossche) > [C++][Dataset] Partition columns with specified dictionary type result in all > nulls > --- > > Key: ARROW-8088 > URL: https://issues.apache.org/jira/browse/ARROW-8088 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When specifying an explicit schema for the Partitioning, and when using a > dictionary type, the materialization of the partition keys goes wrong: you > don't get an error, but you get columns with all nulls. > Python example: > {code:python} > foo_keys = [0, 1] > bar_keys = ['a', 'b', 'c'] > N = 30 > df = pd.DataFrame({ > 'foo': np.array(foo_keys, dtype='i4').repeat(15), > 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), > 'values': np.random.randn(N) > }) > pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar']) > {code} > When reading with discovery, all is fine: > {code:python} > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().schema > values: double > bar: string > foo: int32 > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().to_pandas().head(2) > values bar foo > 0 2.505903 a0 > 1 -1.760135 a0 > {code} > But when specifying the partition columns to be dictionary type with explicit > {{HivePartitioning}}, you get no error but all null values: > {code:python} > >>> partitioning = ds.HivePartitioning(pa.schema([ > ... ("foo", pa.dictionary(pa.int32(), pa.int64())), > ... ("bar", pa.dictionary(pa.int32(), pa.string())) > ... 
])) > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().schema > values: double > foo: dictionary > bar: dictionary > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().to_pandas().head(2) > values foo bar > 0 2.505903 NaN NaN > 1 -1.760135 NaN NaN > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-8088: Assignee: Joris Van den Bossche > [C++][Dataset] Partition columns with specified dictionary type result in all > nulls > --- > > Key: ARROW-8088 > URL: https://issues.apache.org/jira/browse/ARROW-8088 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When specifying an explicit schema for the Partitioning, and when using a > dictionary type, the materialization of the partition keys goes wrong: you > don't get an error, but you get columns with all nulls. > Python example: > {code:python} > foo_keys = [0, 1] > bar_keys = ['a', 'b', 'c'] > N = 30 > df = pd.DataFrame({ > 'foo': np.array(foo_keys, dtype='i4').repeat(15), > 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), > 'values': np.random.randn(N) > }) > pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar']) > {code} > When reading with discovery, all is fine: > {code:python} > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().schema > values: double > bar: string > foo: int32 > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().to_pandas().head(2) > values bar foo > 0 2.505903 a0 > 1 -1.760135 a0 > {code} > But when specifying the partition columns to be dictionary type with explicit > {{HivePartitioning}}, you get no error but all null values: > {code:python} > >>> partitioning = ds.HivePartitioning(pa.schema([ > ... ("foo", pa.dictionary(pa.int32(), pa.int64())), > ... ("bar", pa.dictionary(pa.int32(), pa.string())) > ... 
])) > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().schema > values: double > foo: dictionary > bar: dictionary > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().to_pandas().head(2) > values foo bar > 0 2.505903 NaN NaN > 1 -1.760135 NaN NaN > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8088: - Fix Version/s: 0.17.0 > [C++][Dataset] Partition columns with specified dictionary type result in all > nulls > --- > > Key: ARROW-8088 > URL: https://issues.apache.org/jira/browse/ARROW-8088 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When specifying an explicit schema for the Partitioning, and when using a > dictionary type, the materialization of the partition keys goes wrong: you > don't get an error, but you get columns with all nulls. > Python example: > {code:python} > foo_keys = [0, 1] > bar_keys = ['a', 'b', 'c'] > N = 30 > df = pd.DataFrame({ > 'foo': np.array(foo_keys, dtype='i4').repeat(15), > 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), > 'values': np.random.randn(N) > }) > pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar']) > {code} > When reading with discovery, all is fine: > {code:python} > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().schema > values: double > bar: string > foo: int32 > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().to_pandas().head(2) > values bar foo > 0 2.505903 a0 > 1 -1.760135 a0 > {code} > But when specifying the partition columns to be dictionary type with explicit > {{HivePartitioning}}, you get no error but all null values: > {code:python} > >>> partitioning = ds.HivePartitioning(pa.schema([ > ... ("foo", pa.dictionary(pa.int32(), pa.int64())), > ... ("bar", pa.dictionary(pa.int32(), pa.string())) > ... 
])) > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().schema > values: double > foo: dictionary > bar: dictionary > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().to_pandas().head(2) > values foo bar > 0 2.505903 NaN NaN > 1 -1.760135 NaN NaN > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
[ https://issues.apache.org/jira/browse/ARROW-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-8127. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6637 [https://github.com/apache/arrow/pull/6637] > [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes > -- > > Key: ARROW-8127 > URL: https://issues.apache.org/jira/browse/ARROW-8127 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: TP Boudreau >Assignee: TP Boudreau >Priority: Minor > Labels: pull-request-available > Fix For: 0.17.0 > > Attachments: multipage-batch-write.cc > > Time Spent: 1.5h > Remaining Estimate: 0h > > When writing to a buffered column writer using PLAIN encoding, if the size of > the batch supplied for writing exceeds the page size for the writer, the > resulting file has an incorrect data_page_offset set in its column chunk > metadata. This causes an exception to be thrown when reading the file (file > appears to be too short to the reader). > For example, the attached code, which attempts to write a batch of 262145 > Int32's (= 1048576 + 4 bytes) using the default page size of 1048576 bytes > (with buffered writer, PLAIN encoding), fails on reading, throwing the error: > "Tried reading 1048678 bytes starting at position 1048633 from file but only > got 333". > The error is caused by the second page write tripping the conditional here > https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302, > in the serialized in-memory writer wrapped by the buffered writer. > The fix builds the metadata with offsets from the terminal sink rather than > the in memory buffered sink. A PR is coming. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-8088: Assignee: Joris Van den Bossche (was: Ben Kietzman) > [C++][Dataset] Partition columns with specified dictionary type result in all > nulls > --- > > Key: ARROW-8088 > URL: https://issues.apache.org/jira/browse/ARROW-8088 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When specifying an explicit schema for the Partitioning, and when using a > dictionary type, the materialization of the partition keys goes wrong: you > don't get an error, but you get columns with all nulls. > Python example: > {code:python} > foo_keys = [0, 1] > bar_keys = ['a', 'b', 'c'] > N = 30 > df = pd.DataFrame({ > 'foo': np.array(foo_keys, dtype='i4').repeat(15), > 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), > 'values': np.random.randn(N) > }) > pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar']) > {code} > When reading with discovery, all is fine: > {code:python} > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().schema > values: double > bar: string > foo: int32 > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().to_pandas().head(2) > values bar foo > 0 2.505903 a0 > 1 -1.760135 a0 > {code} > But when specifying the partition columns to be dictionary type with explicit > {{HivePartitioning}}, you get no error but all null values: > {code:python} > >>> partitioning = ds.HivePartitioning(pa.schema([ > ... ("foo", pa.dictionary(pa.int32(), pa.int64())), > ... ("bar", pa.dictionary(pa.int32(), pa.string())) > ... 
])) > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().schema > values: double > foo: dictionary > bar: dictionary > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().to_pandas().head(2) > values foo bar > 0 2.505903 NaN NaN > 1 -1.760135 NaN NaN > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-8088: Assignee: Ben Kietzman (was: Joris Van den Bossche) > [C++][Dataset] Partition columns with specified dictionary type result in all > nulls > --- > > Key: ARROW-8088 > URL: https://issues.apache.org/jira/browse/ARROW-8088 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When specifying an explicit schema for the Partitioning, and when using a > dictionary type, the materialization of the partition keys goes wrong: you > don't get an error, but you get columns with all nulls. > Python example: > {code:python} > foo_keys = [0, 1] > bar_keys = ['a', 'b', 'c'] > N = 30 > df = pd.DataFrame({ > 'foo': np.array(foo_keys, dtype='i4').repeat(15), > 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), > 'values': np.random.randn(N) > }) > pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar']) > {code} > When reading with discovery, all is fine: > {code:python} > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().schema > values: double > bar: string > foo: int32 > >>> ds.dataset("test_order", format="parquet", > >>> partitioning="hive").to_table().to_pandas().head(2) > values bar foo > 0 2.505903 a0 > 1 -1.760135 a0 > {code} > But when specifying the partition columns to be dictionary type with explicit > {{HivePartitioning}}, you get no error but all null values: > {code:python} > >>> partitioning = ds.HivePartitioning(pa.schema([ > ... ("foo", pa.dictionary(pa.int32(), pa.int64())), > ... ("bar", pa.dictionary(pa.int32(), pa.string())) > ... 
])) > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().schema > values: double > foo: dictionary > bar: dictionary > >>> ds.dataset("test_order", format="parquet", > >>> partitioning=partitioning).to_table().to_pandas().head(2) > values foo bar > 0 2.505903 NaN NaN > 1 -1.760135 NaN NaN > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5572) [Python] raise error message when passing invalid filter in parquet reading
[ https://issues.apache.org/jira/browse/ARROW-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061915#comment-17061915 ] Joris Van den Bossche commented on ARROW-5572: -- This works now correctly with the new Datasets API, since we can filter on both partition keys and "normal" columns. So once we use the datasets API under the hood in pyarrow.parquet (ARROW-8039), this issue will be resolved. > [Python] raise error message when passing invalid filter in parquet reading > --- > > Key: ARROW-5572 > URL: https://issues.apache.org/jira/browse/ARROW-5572 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset-parquet-read, parquet > > From > https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset > For example, when specifying a column in the filter which is a normal column > and not a key in your partitioned folder hierarchy, the filter gets silently > ignored. It would be nice to get an error message for this. > Reproducible example: > {code:python} > df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]}) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a']) > # filter on 'a' (partition column) -> works > pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas() > # filter on normal column (in future could do row group filtering) -> > silently does nothing > pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas() > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6122) [C++] ArgSort kernel must support FixedSizeBinary
[ https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6122: -- Fix Version/s: (was: 0.17.0) > [C++] ArgSort kernel must support FixedSizeBinary > - > > Key: ARROW-6122 > URL: https://issues.apache.org/jira/browse/ARROW-6122 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061882#comment-17061882 ] Joris Van den Bossche commented on ARROW-7854: -- [~fsaintjacques] this actually already turned out to be possible from the python side, by specifying this option when creating the LocalFileSystem object. > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7820) [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library
[ https://issues.apache.org/jira/browse/ARROW-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7820: -- Fix Version/s: (was: 0.17.0) > [C++][Gandiva] Add CMake support for compiling LLVM's IR into a library > --- > > Key: ARROW-7820 > URL: https://issues.apache.org/jira/browse/ARROW-7820 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > We should be able to inject LLVM IR into libraries, assuming that `llc` is > found on the platform. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7818) [C++][Gandiva] Generate Filter kernels from gandiva code at compile time
[ https://issues.apache.org/jira/browse/ARROW-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7818: -- Fix Version/s: (was: 0.17.0) > [C++][Gandiva] Generate Filter kernels from gandiva code at compile time > > > Key: ARROW-7818 > URL: https://issues.apache.org/jira/browse/ARROW-7818 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, C++ - Gandiva >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > The goal of this feature is to support generating kernels at compile time > (and possibly runtime if gandiva is linked) to avoid rewriting C++ kernels > that gandiva knows how to compile. The generated kernels would be linked into > the compute module. > This is an experimental task that will guide future development, notably > implementing aggregate kernels in gandiva once instead of in both C++ and > gandiva implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5744) [C++] Do not error in Table::CombineChunks for BinaryArray types that overflow 2GB limit
[ https://issues.apache.org/jira/browse/ARROW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-5744: Fix Version/s: (was: 0.17.0) 1.0.0 > [C++] Do not error in Table::CombineChunks for BinaryArray types that > overflow 2GB limit > > > Key: ARROW-5744 > URL: https://issues.apache.org/jira/browse/ARROW-5744 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > Discovered during ARROW-5635 code review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
[ https://issues.apache.org/jira/browse/ARROW-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7854: - Assignee: Francois Saint-Jacques > [C++][Dataset] Option to memory map when reading IPC format > --- > > Key: ARROW-7854 > URL: https://issues.apache.org/jira/browse/ARROW-7854 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Joris Van den Bossche >Assignee: Francois Saint-Jacques >Priority: Major > > For the IPC format it would be interesting to be able to memory map the IPC > files? > cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8149) [C++/Python] Enable CUDA Support in conda recipes
Uwe Korn created ARROW-8149: --- Summary: [C++/Python] Enable CUDA Support in conda recipes Key: ARROW-8149 URL: https://issues.apache.org/jira/browse/ARROW-8149 Project: Apache Arrow Issue Type: New Feature Components: C++, Packaging Reporter: Uwe Korn Fix For: 0.17.0 See the changes in [https://github.com/conda-forge/arrow-cpp-feedstock/pull/123], we need to copy this into the Arrow repository and also test CUDA in these recipes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode
[ https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7673: -- Fix Version/s: (was: 1.0.0) 0.17.0 > [C++][Dataset] Revisit File discovery failure mode > -- > > Key: ARROW-7673 > URL: https://issues.apache.org/jira/browse/ARROW-7673 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 0.17.0 > > > Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will > silently ignore unsupported files (either IO error, not of the valid format, > corruption, missing compression codecs, etc...) when creating a > `FileSystemSource`. > We should change this behavior to propagate an error in the Inspect/Finish > calls by default and allow the user to toggle `exclude_invalid_files`. The > error should contain at least the file path and a decipherable error (if > possible). -- This message was sent by Atlassian Jira (v8.3.4#803005)
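The behavior change proposed above (fail loudly with the offending path unless the caller opts into skipping) can be sketched in plain Python. The `discover` helper, its `.parquet`-based validity check, and the directory layout are all hypothetical stand-ins for the dataset factory, not the actual API.

```python
import os
import tempfile

def discover(root, exclude_invalid_files=False):
    """Collect data files under root; raise on the first invalid file
    (including its path) unless the caller opts into silently skipping."""
    found = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Hypothetical format check: treat only .parquet files as valid.
        if not name.endswith(".parquet"):
            if exclude_invalid_files:
                continue  # current behavior: silently ignore
            raise ValueError(f"Error parsing file: {path}: unsupported format")
        found.append(path)
    return found

root = tempfile.mkdtemp()
for name in ("a.parquet", "b.parquet", "notes.txt"):
    open(os.path.join(root, name), "w").close()

try:
    discover(root)                 # proposed default: propagate the error
except ValueError as e:
    err = str(e)                   # message names the offending file

ok = discover(root, exclude_invalid_files=True)  # explicit opt-out skips it
```

The key property is that the error message carries the file path, so a user scanning thousands of files can locate the bad one.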
[jira] [Updated] (ARROW-7579) [FlightRPC] Make Handshake optional
[ https://issues.apache.org/jira/browse/ARROW-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7579: Fix Version/s: (was: 0.17.0) > [FlightRPC] Make Handshake optional > --- > > Key: ARROW-7579 > URL: https://issues.apache.org/jira/browse/ARROW-7579 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC >Reporter: David Li >Priority: Major > > We should make it possible to _not_ invoke Handshake for services that don't > want it. Especially when using it with flight-grpc, where the standard gRPC > authentication mechanisms don't know about Flight and try to authenticate the > Handshake endpoint - it's easy to forget to configure this endpoint to bypass > authentication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6062) [FlightRPC] Allow timeouts on all stream reads
[ https://issues.apache.org/jira/browse/ARROW-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061876#comment-17061876 ] David Li commented on ARROW-6062: - Removing from 0.17. > [FlightRPC] Allow timeouts on all stream reads > -- > > Key: ARROW-6062 > URL: https://issues.apache.org/jira/browse/ARROW-6062 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Major > > Anywhere where we offer reading from a stream in Flight, we need to offer a > timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7579) [FlightRPC] Make Handshake optional
[ https://issues.apache.org/jira/browse/ARROW-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061874#comment-17061874 ] David Li commented on ARROW-7579: - Not a blocker for 0.17, removing from fix versions. > [FlightRPC] Make Handshake optional > --- > > Key: ARROW-7579 > URL: https://issues.apache.org/jira/browse/ARROW-7579 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC >Reporter: David Li >Priority: Major > > We should make it possible to _not_ invoke Handshake for services that don't > want it. Especially when using it with flight-grpc, where the standard gRPC > authentication mechanisms don't know about Flight and try to authenticate the > Handshake endpoint - it's easy to forget to configure this endpoint to bypass > authentication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6062) [FlightRPC] Allow timeouts on all stream reads
[ https://issues.apache.org/jira/browse/ARROW-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-6062: Fix Version/s: (was: 0.17.0) > [FlightRPC] Allow timeouts on all stream reads > -- > > Key: ARROW-6062 > URL: https://issues.apache.org/jira/browse/ARROW-6062 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Major > > Anywhere where we offer reading from a stream in Flight, we need to offer a > timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
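The request above (every blocking stream read should accept a timeout) can be illustrated with a stdlib queue standing in for a Flight stream. The `TimedReader` class and its method names are invented for the sketch.

```python
import queue
import threading

class TimedReader:
    """Toy message stream whose read() takes a deadline instead of
    blocking forever."""
    def __init__(self):
        self._q = queue.Queue()

    def put(self, msg):
        self._q.put(msg)

    def read(self, timeout=None):
        # queue.Queue.get raises queue.Empty once the timeout elapses;
        # surface that to the caller as TimeoutError.
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("stream read timed out")

stream = TimedReader()
threading.Timer(0.05, stream.put, args=("batch-0",)).start()

msg = stream.read(timeout=1.0)    # arrives well before the deadline
try:
    stream.read(timeout=0.05)     # nothing else coming: times out
    timed_out = False
except TimeoutError:
    timed_out = True
```

The same shape applies to any blocking read in Flight: the caller supplies a deadline and gets a distinguishable timeout error rather than hanging.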
[jira] [Updated] (ARROW-5745) [C++] properties of Map(Array|Type) are confusingly named
[ https://issues.apache.org/jira/browse/ARROW-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-5745: Fix Version/s: (was: 0.17.0) 1.0.0 > [C++] properties of Map(Array|Type) are confusingly named > - > > Key: ARROW-5745 > URL: https://issues.apache.org/jira/browse/ARROW-5745 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > In the context of ListArrays, "values" indicates the elements in a slot of > the ListArray. Since MapArray isa ListArray, "values" indicates the same > thing and the elements are key-item pairs. This naming scheme is not > idiomatic; these *should* be called key-value pairs but that would require > propagating the renaming down to ListArray. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode
[ https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7673: - Assignee: Francois Saint-Jacques > [C++][Dataset] Revisit File discovery failure mode > -- > > Key: ARROW-7673 > URL: https://issues.apache.org/jira/browse/ARROW-7673 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will > silently ignore unsupported files (either IO error, not of the valid format, > corruption, missing compression codecs, etc...) when creating a > `FileSystemSource`. > We should change this behavior to propagate an error in the Inspect/Finish > calls by default and allow the user to toggle `exclude_invalid_files`. The > error should contain at least the file path and a decipherable error (if > possible). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8058) [C++][Python][Dataset] Provide an option to toggle validation and schema inference in FileSystemDatasetFactoryOptions
[ https://issues.apache.org/jira/browse/ARROW-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-8058: -- Fix Version/s: (was: 1.0.0) 0.17.0 > [C++][Python][Dataset] Provide an option to toggle validation and schema > inference in FileSystemDatasetFactoryOptions > - > > Key: ARROW-8058 > URL: https://issues.apache.org/jira/browse/ARROW-8058 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, Python >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 0.17.0 > > > This can be costly and is not always necessary. > At the same time we could move file validation into the scan tasks; currently > all files are inspected as the dataset is constructed, which can be expensive > if the filesystem is slow. We'll be performing the validation multiple times > but the check will be cheap since at scan time we'll be reading the file into > memory anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8058) [C++][Python][Dataset] Provide an option to toggle validation and schema inference in FileSystemDatasetFactoryOptions
[ https://issues.apache.org/jira/browse/ARROW-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-8058: - Assignee: Francois Saint-Jacques > [C++][Python][Dataset] Provide an option to toggle validation and schema > inference in FileSystemDatasetFactoryOptions > - > > Key: ARROW-8058 > URL: https://issues.apache.org/jira/browse/ARROW-8058 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset, Python >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > This can be costly and is not always necessary. > At the same time we could move file validation into the scan tasks; currently > all files are inspected as the dataset is constructed, which can be expensive > if the filesystem is slow. We'll be performing the validation multiple times > but the check will be cheap since at scan time we'll be reading the file into > memory anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4484) [Java] improve Flight DoPut busy wait
[ https://issues.apache.org/jira/browse/ARROW-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061873#comment-17061873 ] David Li commented on ARROW-4484: - Not a blocker for any version. > [Java] improve Flight DoPut busy wait > - > > Key: ARROW-4484 > URL: https://issues.apache.org/jira/browse/ARROW-4484 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Java >Reporter: David Li >Priority: Major > Labels: flight > > Currently the implementation of putNext in FlightClient.java busy-waits until > gRPC indicates that the server can receive a message. We should either > improve the busy-wait (e.g. add sleep times), or rethink the API and make it > non-blocking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4484) [Java] improve Flight DoPut busy wait
[ https://issues.apache.org/jira/browse/ARROW-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-4484: Fix Version/s: (was: 0.17.0) > [Java] improve Flight DoPut busy wait > - > > Key: ARROW-4484 > URL: https://issues.apache.org/jira/browse/ARROW-4484 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Java >Reporter: David Li >Priority: Major > Labels: flight > > Currently the implementation of putNext in FlightClient.java busy-waits until > gRPC indicates that the server can receive a message. We should either > improve the busy-wait (e.g. add sleep times), or rethink the API and make it > non-blocking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
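One of the two fixes suggested above, softening the busy-wait with sleeps, is typically done with a capped exponential backoff. A minimal sketch follows; the `is_ready` callable stands in for gRPC's readiness check and the function is hypothetical, not FlightClient's actual code.

```python
import threading
import time

def wait_until_ready(is_ready, initial=0.001, cap=0.05, deadline=5.0):
    """Poll is_ready() with exponentially growing sleeps instead of
    spinning, up to a hard deadline. Returns True once ready."""
    delay = initial
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        if is_ready():
            return True
        time.sleep(delay)
        delay = min(delay * 2, cap)  # back off, but cap the sleep length
    return False

ready = threading.Event()
threading.Timer(0.1, ready.set).start()  # becomes ready after 100 ms
became_ready = wait_until_ready(ready.is_set)
```

The other option the issue mentions, a non-blocking API, would instead hand the caller a callback or future when the channel becomes writable.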
[jira] [Updated] (ARROW-8147) [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain
[ https://issues.apache.org/jira/browse/ARROW-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8147: Summary: [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain (was: [Packaging] Add google-cloud-cpp to ThirdpartyToolchain) > [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain > > > Key: ARROW-8147 > URL: https://issues.apache.org/jira/browse/ARROW-8147 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > This is a requirement to be able to make progress on ARROW-1231 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8147) [C++] Add google-cloud-cpp to ThirdpartyToolchain
[ https://issues.apache.org/jira/browse/ARROW-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8147: Summary: [C++] Add google-cloud-cpp to ThirdpartyToolchain (was: [C++][Packaging] Add google-cloud-cpp to ThirdpartyToolchain) > [C++] Add google-cloud-cpp to ThirdpartyToolchain > - > > Key: ARROW-8147 > URL: https://issues.apache.org/jira/browse/ARROW-8147 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > This is a requirement to be able to make progress on ARROW-1231 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge
Wes McKinney created ARROW-8148: --- Summary: [Packaging][C++] Add google-cloud-cpp to conda-forge Key: ARROW-8148 URL: https://issues.apache.org/jira/browse/ARROW-8148 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging Reporter: Wes McKinney This is a requirement for ARROW-1231 to be able to move forward -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8147) [Packaging] Add google-cloud-cpp to ThirdpartyToolchain
Wes McKinney created ARROW-8147: --- Summary: [Packaging] Add google-cloud-cpp to ThirdpartyToolchain Key: ARROW-8147 URL: https://issues.apache.org/jira/browse/ARROW-8147 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This is a requirement to be able to make progress on ARROW-1231 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061849#comment-17061849 ] Wes McKinney commented on ARROW-1231: - [~clarkzinzow] note that google-cloud-cpp does not seem to be available in conda-forge yet, so I'm opening a child issue about dealing with that. I don't know of anyone else for whom this is a short-term priority until later this year, so we are happy to help and give advice / code review > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path
[ https://issues.apache.org/jira/browse/ARROW-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8146: -- Labels: pull-request-available (was: ) > [C++] Add per-filesystem facility to sanitize a path > > > Key: ARROW-8146 > URL: https://issues.apache.org/jira/browse/ARROW-8146 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8146) [C++] Add per-filesystem facility to sanitize a path
Antoine Pitrou created ARROW-8146: - Summary: [C++] Add per-filesystem facility to sanitize a path Key: ARROW-8146 URL: https://issues.apache.org/jira/browse/ARROW-8146 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64
[ https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Pliszka reopened ARROW-3329: -- It is resolved in C++. Now I need to work on the Python part. Thank you for your work! What is left of my contribution is now mostly braces. :) > [Python] Error casting decimal(38, 4) to int64 > -- > > Key: ARROW-3329 > URL: https://issues.apache.org/jira/browse/ARROW-3329 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Python version : 3.6.5 > Pyarrow version : 0.10.0 >Reporter: Kavita Sheth >Assignee: Jacek Pliszka >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 7h > Remaining Estimate: 0h > > Git issue link : https://github.com/apache/arrow/issues/2627 > I want to cast a pyarrow table column from decimal(38,4) to int64. > col.cast(pa.int64()) > Error: > File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast > File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, > 4) to int64 > Python version : 3.6.5 > Pyarrow version : 0.10.0 > Is it not implemented yet, or am I not using it correctly? If not implemented > yet, is there any workaround to cast columns? -- This message was sent by Atlassian Jira (v8.3.4#803005)
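The cast the reporter asks for, decimal(38, 4) to int64, has semantics that can be shown with the stdlib `decimal` module: accept exact integral values and reject fractional or out-of-range input. This is only a sketch of the expected behavior under those assumptions, not pyarrow's actual implementation.

```python
from decimal import Decimal

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def decimal_to_int64(value: Decimal) -> int:
    """Cast a Decimal to an int64-range int, failing on fractional
    or out-of-range input rather than silently losing data."""
    if value != value.to_integral_value():
        raise ValueError(f"{value} has a fractional part; cast would be lossy")
    as_int = int(value)
    if not INT64_MIN <= as_int <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in int64")
    return as_int

# A decimal(38, 4) value whose fractional digits are all zero casts cleanly.
result = decimal_to_int64(Decimal("1234.0000"))
```

A safe cast must make both failure modes explicit, since decimal(38, 4) can represent both fractions and magnitudes far beyond int64.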
[jira] [Created] (ARROW-8145) [C++] Rename GetTargetInfos
Antoine Pitrou created ARROW-8145: - Summary: [C++] Rename GetTargetInfos Key: ARROW-8145 URL: https://issues.apache.org/jira/browse/ARROW-8145 Project: Apache Arrow Issue Type: Wish Components: C++, Python Reporter: Antoine Pitrou Sorry, but I think I'm irked by the new "GetTargetInfos" spelling. I suggest either "GetTargetInfo" or "GetFileInfo" (both singular). -- This message was sent by Atlassian Jira (v8.3.4#803005)