[jira] [Commented] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188973#comment-17188973
 ] 

Micah Kornfield commented on ARROW-9821:


[~alamb] would you mind creating sub-issues? The preferred workflow is one Jira 
item per pull request (I forget exactly what this affects downstream; it might 
be release notes).

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google Doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9794:


Assignee: Apache Arrow JIRA Bot  (was: Frank Du)

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.
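The vendor check described above can be sketched as follows. This is only an illustration in Rust, not the C++ `cpu_info` change itself: the `Vendor` enum and the functions `vendor_from_registers`, `detect_vendor`, and `use_pext_path` are hypothetical names. The sketch assumes the standard x86 convention that CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX, and it takes the "perform poorly on AMD" statement from the issue at face value.

```rust
/// CPU vendors this sketch distinguishes (hypothetical enum, not Arrow's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Vendor {
    Intel,
    Amd,
    Unknown,
}

/// Decode the vendor from the EBX, EDX, ECX registers of CPUID leaf 0.
/// The 12-byte vendor string is stored in exactly that register order.
pub fn vendor_from_registers(ebx: u32, edx: u32, ecx: u32) -> Vendor {
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&ebx.to_le_bytes());
    id[4..8].copy_from_slice(&edx.to_le_bytes());
    id[8..12].copy_from_slice(&ecx.to_le_bytes());
    match &id {
        b"GenuineIntel" => Vendor::Intel,
        b"AuthenticAMD" => Vendor::Amd,
        _ => Vendor::Unknown,
    }
}

/// Query the running CPU. Only x86_64 is handled in this sketch.
#[cfg(target_arch = "x86_64")]
pub fn detect_vendor() -> Vendor {
    // CPUID is available on every x86_64 CPU; older toolchains mark the
    // intrinsic as unsafe, newer ones may not, hence the lint allowance.
    #[allow(unused_unsafe)]
    let r = unsafe { std::arch::x86_64::__cpuid(0) };
    vendor_from_registers(r.ebx, r.edx, r.ecx)
}

/// Fallback for every other architecture.
#[cfg(not(target_arch = "x86_64"))]
pub fn detect_vendor() -> Vendor {
    Vendor::Unknown
}

/// Runtime dispatch decision: take the pext/pdep path only when BMI2 is
/// present and the vendor is Intel (assumption taken from the issue text).
pub fn use_pext_path(vendor: Vendor, has_bmi2: bool) -> bool {
    has_bmi2 && vendor == Vendor::Intel
}
```

A caller would evaluate `use_pext_path(detect_vendor(), has_bmi2)` once at startup and cache the result, so the vendor check stays out of the hot loop.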





[jira] [Updated] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9794:
--
Labels: pull-request-available  (was: )

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.





[jira] [Assigned] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9794:


Assignee: Frank Du  (was: Apache Arrow JIRA Bot)

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.





[jira] [Commented] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-09-01 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188914#comment-17188914
 ] 

Frank Du commented on ARROW-9794:
-

Sure, I will take a look.

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Frank Du
>Priority: Major
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.





[jira] [Assigned] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-09-01 Thread Frank Du (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Du reassigned ARROW-9794:
---

Assignee: Frank Du

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Frank Du
>Priority: Major
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.





[jira] [Commented] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2020-09-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188792#comment-17188792
 ] 

Wes McKinney commented on ARROW-5972:
-

You can certainly set up a nightly in 
https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml

> [Rust] Installing cargo-tarpaulin and generating coverage report takes over 
> 20 minutes
> --
>
> Key: ARROW-5972
> URL: https://issues.apache.org/jira/browse/ARROW-5972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Major
>
> See example build:
> https://travis-ci.org/apache/arrow/jobs/558986931
> Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
> takes another 7m40s. 
> Given the Travis CI build queue issues we're having, this might be worth 
> optimizing or moving to Docker/Buildbot





[jira] [Reopened] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reopened ARROW-9821:


I am somewhat cheating: I have several related PRs that I want to categorize 
under this single Jira issue, so I am reopening it.

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google Doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Closed] (ARROW-9894) [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...

2020-09-01 Thread Keith Hughitt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Hughitt closed ARROW-9894.

Resolution: Workaround

(See suggested work-around in issue comments)

> [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: 
> ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)'...
> 
>
> Key: ARROW-9894
> URL: https://issues.apache.org/jira/browse/ARROW-9894
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: - Arch Linux 5.8.3
> - Arrow 1.0.1
> - CMake 3.18.2
>Reporter: Keith Hughitt
>Priority: Major
>
> When attempting to build Arrow 1.0.1 for Arch Linux, an error is encountered 
> that appears to be related to a warning generated in a test:
> ```
> [  0%] Performing build step for 'googletest_ep'
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:37
>  (message):
>   Command failed: 2
>'make'
>   See also
> 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-*.log
> -- stdout output is:
> [ 12%] Building CXX object 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
> -- stderr output is:
> In file included from 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest-all.cc:41:0:
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:
>  In constructor 
> 'testing::internal::ScopedPrematureExitFile::ScopedPrematureExitFile(const 
> char*)':
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4388:13:
>  error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)', declared with attribute warn_unused_result [-Werror=unused-result]
>fwrite("0", 1, 1, pfile);
>~~^~
> cc1plus: all warnings being treated as errors
> make[5]: *** [googlemock/gtest/CMakeFiles/gtest.dir/build.make:82: 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
> make[4]: *** [CMakeFiles/Makefile2:219: 
> googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
> make[3]: *** [Makefile:160: all] Error 2
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:47
>  (message):
>   Stopping after outputting logs.
> make[2]: *** [CMakeFiles/googletest_ep.dir/build.make:131: 
> googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1035: CMakeFiles/googletest_ep.dir/all] 
> Error 2
> make: *** [Makefile:160: all] Error 2
> ```
> Attempted installation via PKGBUILD: https://aur.archlinux.org/packages/arrow/





[jira] [Commented] (ARROW-9894) [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...

2020-09-01 Thread Keith Hughitt (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188735#comment-17188735
 ] 

Keith Hughitt commented on ARROW-9894:
--

Hi Neal,

Thanks for the quick response and suggestions!

I tried your first suggestion: I installed gtest (and also gmock) and removed 
the "-DGTest_SOURCE=BUNDLED" flag, and that did the trick!

I'll report the issue downstream to the AUR package maintainer.

I'm going to close this issue since the original problem appears to relate more 
to a build dependency than to Arrow itself. Feel free to re-open it if you 
think it's worth tracking or trying to add a check or workaround to the 
CMake files, though.

> [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: 
> ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)'...
> 
>
> Key: ARROW-9894
> URL: https://issues.apache.org/jira/browse/ARROW-9894
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: - Arch Linux 5.8.3
> - Arrow 1.0.1
> - CMake 3.18.2
>Reporter: Keith Hughitt
>Priority: Major
>
> When attempting to build Arrow 1.0.1 for Arch Linux, an error is encountered 
> that appears to be related to a warning generated in a test:
> ```
> [  0%] Performing build step for 'googletest_ep'
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:37
>  (message):
>   Command failed: 2
>'make'
>   See also
> 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-*.log
> -- stdout output is:
> [ 12%] Building CXX object 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
> -- stderr output is:
> In file included from 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest-all.cc:41:0:
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:
>  In constructor 
> 'testing::internal::ScopedPrematureExitFile::ScopedPrematureExitFile(const 
> char*)':
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4388:13:
>  error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)', declared with attribute warn_unused_result [-Werror=unused-result]
>fwrite("0", 1, 1, pfile);
>~~^~
> cc1plus: all warnings being treated as errors
> make[5]: *** [googlemock/gtest/CMakeFiles/gtest.dir/build.make:82: 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
> make[4]: *** [CMakeFiles/Makefile2:219: 
> googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
> make[3]: *** [Makefile:160: all] Error 2
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:47
>  (message):
>   Stopping after outputting logs.
> make[2]: *** [CMakeFiles/googletest_ep.dir/build.make:131: 
> googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1035: CMakeFiles/googletest_ep.dir/all] 
> Error 2
> make: *** [Makefile:160: all] Error 2
> ```
> Attempted installation via PKGBUILD: https://aur.archlinux.org/packages/arrow/





[jira] [Updated] (ARROW-9895) [RUST] Improve sort kernels

2020-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9895:
--
Labels: pull-request-available  (was: )

> [RUST] Improve sort kernels
> ---
>
> Key: ARROW-9895
> URL: https://issues.apache.org/jira/browse/ARROW-9895
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Followup from my mailing list post:
> {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treat NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly two
> options:
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
> {quote}
> My plan is to open a draft PR with the following changes:
>  - {{sort_to_indices}} further splits up float64/float32 inputs into 
> nulls/non-NaN/NaN, sorts the non-NaN values and then concatenates those 3 
> slices according to the sort options. NaNs are distinct from null and sort 
> greater than any other valid value
>  - implement a sort method for dictionary arrays with string values. This 
> kernel checks the {{is_ordered}} flag and sorts just by the keys if it is 
> set; otherwise it looks up the string values
>  - for the lexical sort use case the above kernels are not used; instead the 
> {{OrdArray}} trait is used. To make that more flexible and allow wrapping 
> arrays with different ordering behavior I will make it no longer extend 
> {{Array}} and instead only contain the {{cmp_value}} method
>  - string dictionary sorting can then be implemented with a wrapper struct 
> {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
>  - NaN-aware sorting of floats can also be implemented with a wrapper struct 
> and trait implementation





[jira] [Assigned] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-9821:
-

Assignee: Andrew Lamb

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google Doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Resolved] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9821.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8085
[https://github.com/apache/arrow/pull/8085]

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google Doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Created] (ARROW-9895) [RUST] Improve sort kernels

2020-09-01 Thread Jira
Jörn Horstmann created ARROW-9895:
-

 Summary: [RUST] Improve sort kernels
 Key: ARROW-9895
 URL: https://issues.apache.org/jira/browse/ARROW-9895
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 1.0.0
Reporter: Jörn Horstmann


Followup from my mailing list post:
{quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
and Float64 data types are not supported because the implementation
relies on the OrdArray trait. This trait is not implemented because
f64/f32 only implements PartialOrd. The sort function for a single
column (sort_to_indices) has some special logic which looks like it
wants to treat NaN the same as null, but I'm also not convinced this
is the correct way. For example postgres does the following
(https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)

"In order to allow floating-point values to be sorted and used in
tree-based indexes, PostgreSQL treats NaN values as equal, and greater
than all non-NaN values."

I propose to do the same in an OrdArray impl for
Float64Array/Float32Array and then simplifying the sort_to_indices
function accordingly.

2. Sorting for dictionary encoded strings. The problem here is that
DictionaryArray does not have a generic parameter for the value type
so it is not currently possible to only implement OrdArray for string
dictionaries. Again for the single column case, the value data type
could be checked and a sort could be implemented by looking up each
key in the dictionary. An optimization could be to check the is_sorted
flag of DictionaryArray (which does not seem to be used really) and
then directly sort by the keys. For the general case I see roughly two
options:

- Somehow implement an OrdArray view of the dictionary array. This
could be easier if OrdArray did not extend Array but was a completely
separate trait.
- Change the lexicographic sort impl to not use dynamic calls but
instead sort multiple times. So for a query `ORDER BY a, b`, first
sort by b and afterwards sort again by a. With a stable sort
implementation this should result in the same ordering. I'm curious
about the performance, it could avoid dynamic method calls for each
comparison, but it would process the indices vector multiple times.
{quote}
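
The second option in the quoted post (one stable sort per column, least significant column first) can be checked with a small sketch. It uses plain `i32` slices instead of Arrow arrays, and the function names `lexsort_by_passes` and `lexsort_single_pass` are made up for this example. It relies on `slice::sort_by_key` and `slice::sort_by` being stable, which the Rust standard library documents.

```rust
/// ORDER BY a, b implemented as two stable passes over the index vector.
pub fn lexsort_by_passes(a: &[i32], b: &[i32]) -> Vec<usize> {
    assert_eq!(a.len(), b.len());
    let mut idx: Vec<usize> = (0..a.len()).collect();
    // Least significant column first.
    idx.sort_by_key(|&i| b[i]);
    // The sort is stable, so rows with equal `a` keep their `b` order.
    idx.sort_by_key(|&i| a[i]);
    idx
}

/// Reference implementation: one pass with a combined comparator.
pub fn lexsort_single_pass(a: &[i32], b: &[i32]) -> Vec<usize> {
    assert_eq!(a.len(), b.len());
    let mut idx: Vec<usize> = (0..a.len()).collect();
    idx.sort_by(|&i, &j| a[i].cmp(&a[j]).then(b[i].cmp(&b[j])));
    idx
}
```

Both functions return the same indices; the multi-pass version trades per-comparison dispatch for extra passes over the index vector, which is the performance question raised in the post.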

My plan is to open a draft PR with the following changes:

 - {{sort_to_indices}} further splits up float64/float32 inputs into 
nulls/non-NaN/NaN, sorts the non-NaN values and then concatenates those 3 
slices according to the sort options. NaNs are distinct from null and sort 
greater than any other valid value
 - implement a sort method for dictionary arrays with string values. This kernel 
checks the {{is_ordered}} flag and sorts just by the keys if it is set; 
otherwise it looks up the string values
 - for the lexical sort use case the above kernels are not used; instead the 
{{OrdArray}} trait is used. To make that more flexible and allow wrapping 
arrays with different ordering behavior I will make it no longer extend 
{{Array}} and instead only contain the {{cmp_value}} method
 - string dictionary sorting can then be implemented with a wrapper struct 
{{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
 - NaN-aware sorting of floats can also be implemented with a wrapper struct and 
trait implementation
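
The first item of the plan (split into nulls, non-NaN values, and NaNs, sort only the middle group, then concatenate) might look roughly like this. It is a sketch over `Option<f64>` slices, not the actual kernel: the function name `sort_f64_to_indices` and the `descending` and `nulls_first` parameters are assumptions standing in for the real sort options.

```rust
/// Sort nullable floats to indices; NaN sorts greater than any other value.
pub fn sort_f64_to_indices(
    values: &[Option<f64>],
    descending: bool,
    nulls_first: bool,
) -> Vec<usize> {
    let mut nulls = Vec::new();
    let mut nans = Vec::new();
    let mut valid = Vec::new();
    // Partition the row indices into the three groups.
    for (i, v) in values.iter().enumerate() {
        match *v {
            None => nulls.push(i),
            Some(x) if x.is_nan() => nans.push(i),
            Some(_) => valid.push(i),
        }
    }
    // NaN was filtered out above, so partial_cmp always returns Some here.
    valid.sort_by(|&i, &j| {
        let ord = values[i]
            .unwrap()
            .partial_cmp(&values[j].unwrap())
            .unwrap();
        if descending { ord.reverse() } else { ord }
    });
    let mut out = Vec::with_capacity(values.len());
    if nulls_first {
        out.extend(nulls.iter().copied());
    }
    if descending {
        // NaN is the greatest value, so it comes first when descending.
        out.extend(nans);
        out.extend(valid);
    } else {
        out.extend(valid);
        out.extend(nans);
    }
    if !nulls_first {
        out.extend(nulls);
    }
    out
}
```

Only the non-NaN group is ever compared, so the comparator stays a plain `partial_cmp` and the NaN and null handling reduces to concatenation order.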





[jira] [Commented] (ARROW-8040) [Python][Packaging] Add Parquet encryption / OpenSSL to Python wheels

2020-09-01 Thread Itamar Turner-Trauring (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188633#comment-17188633
 ] 

Itamar Turner-Trauring commented on ARROW-8040:
---

Just checking in again—could someone answer my question above re the need for 
Python bindings in addition to packaging wheels? Thanks!

> [Python][Packaging] Add Parquet encryption / OpenSSL to Python wheels
> -
>
> Key: ARROW-8040
> URL: https://issues.apache.org/jira/browse/ARROW-8040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>






[jira] [Resolved] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays

2020-09-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann resolved ARROW-9853.
---
Resolution: Fixed

> [RUST] Implement "take" kernel for dictionary arrays
> 
>
> Key: ARROW-9853
> URL: https://issues.apache.org/jira/browse/ARROW-9853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays

2020-09-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188622#comment-17188622
 ] 

Jörn Horstmann commented on ARROW-9853:
---

This was fixed in commit 
https://github.com/apache/arrow/commit/b4063cc9b1c01201e2ab0158c9baa0ecfef9362c

> [RUST] Implement "take" kernel for dictionary arrays
> 
>
> Key: ARROW-9853
> URL: https://issues.apache.org/jira/browse/ARROW-9853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-9885) [Rust] Simplify code of type coercion for binary types

2020-09-01 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188574#comment-17188574
 ] 

Jorge commented on ARROW-9885:
--

I am sorry, [~wesm], I forgot it on this one.

> [Rust] Simplify code of type coercion for binary types
> --
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".
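The refactoring described above amounts to separating the coercion rule from the error message. A minimal sketch, with loudly hypothetical pieces: the `NumType` enum is a tiny stand-in for a real data type enum, the widening rules are illustrative rather than DataFusion's actual rules, and `coerce_binary` is an invented caller that shows where the operator would still be needed.

```rust
/// Tiny stand-in for a data type enum (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumType {
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

/// Coerce two types to a common numeric type, or None if that is impossible.
/// Note that no operator is needed to make this decision.
pub fn numerical_coercion(lhs: NumType, rhs: NumType) -> Option<NumType> {
    if lhs == NumType::Utf8 || rhs == NumType::Utf8 {
        return None;
    }
    // Illustrative widening order: Float64 > Float32 > Int64 > Int32.
    let wider = match (lhs, rhs) {
        (NumType::Float64, _) | (_, NumType::Float64) => NumType::Float64,
        (NumType::Float32, _) | (_, NumType::Float32) => NumType::Float32,
        (NumType::Int64, _) | (_, NumType::Int64) => NumType::Int64,
        _ => NumType::Int32,
    };
    Some(wider)
}

/// The caller knows the operator and is the only place that formats errors.
pub fn coerce_binary(op: &str, lhs: NumType, rhs: NumType) -> Result<NumType, String> {
    numerical_coercion(lhs, rhs)
        .ok_or_else(|| format!("cannot coerce {:?} {} {:?} to a numeric type", lhs, op, rhs))
}
```

With this split, `numerical_coercion` can be reused anywhere two types must be unified, and only `coerce_binary` mentions the operator.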





[jira] [Updated] (ARROW-9885) [Rust] [DataFusion] Simplify code of type coercion for binary types

2020-09-01 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-9885:
-
Summary: [Rust] [DataFusion] Simplify code of type coercion for binary 
types  (was: [Rust] Simplify code of type coercion for binary types)

> [Rust] [DataFusion] Simplify code of type coercion for binary types
> ---
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".





[jira] [Updated] (ARROW-9885) [Rust] [DataFusion] Simplify code of type coercion for binary types

2020-09-01 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-9885:
-
Component/s: Rust

> [Rust] [DataFusion] Simplify code of type coercion for binary types
> ---
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".





[jira] [Commented] (ARROW-9885) [Rust] Simplify code of type coercion for binary types

2020-09-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188566#comment-17188566
 ] 

Wes McKinney commented on ARROW-9885:
-

You can help keep the changelog clean by adding the "[Rust]" tag to the issue 
titles

> [Rust] Simplify code of type coercion for binary types
> --
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".





[jira] [Updated] (ARROW-9885) [Rust] Simplify code of type coercion for binary types

2020-09-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9885:

Summary: [Rust] Simplify code of type coercion for binary types  (was: 
Simplify code of type coercion for binary types)

> [Rust] Simplify code of type coercion for binary types
> --
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".





[jira] [Updated] (ARROW-9894) [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...

2020-09-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9894:

Summary: [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: 
error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
FILE*)'...  (was: Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: 
error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
FILE*)'...)

> [C++] Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: 
> ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)'...
> 
>
> Key: ARROW-9894
> URL: https://issues.apache.org/jira/browse/ARROW-9894
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: - Arch Linux 5.8.3
> - Arrow 1.0.1
> - CMake 3.18.2
>Reporter: Keith Hughitt
>Priority: Major
>
> When attempting to build Arrow 1.0.1 for Arch Linux, an error is encountered 
> that appears to be related to a warning generated in a test:
> ```
> [  0%] Performing build step for 'googletest_ep'
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:37
>  (message):
>   Command failed: 2
>'make'
>   See also
> 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-*.log
> -- stdout output is:
> [ 12%] Building CXX object 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
> -- stderr output is:
> In file included from 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest-all.cc:41:0:
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:
>  In constructor 
> 'testing::internal::ScopedPrematureExitFile::ScopedPrematureExitFile(const 
> char*)':
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4388:13:
>  error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)', declared with attribute warn_unused_result [-Werror=unused-result]
>fwrite("0", 1, 1, pfile);
>~~^~
> cc1plus: all warnings being treated as errors
> make[5]: *** [googlemock/gtest/CMakeFiles/gtest.dir/build.make:82: 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
> make[4]: *** [CMakeFiles/Makefile2:219: 
> googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
> make[3]: *** [Makefile:160: all] Error 2
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:47
>  (message):
>   Stopping after outputting logs.
> make[2]: *** [CMakeFiles/googletest_ep.dir/build.make:131: 
> googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1035: CMakeFiles/googletest_ep.dir/all] 
> Error 2
> make: *** [Makefile:160: all] Error 2
> ```
> Attempted installation via PKGBUILD: https://aur.archlinux.org/packages/arrow/





[jira] [Commented] (ARROW-9894) Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...

2020-09-01 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188535#comment-17188535
 ] 

Neal Richardson commented on ARROW-9894:


I'm not sure why that external project fails to build, but you could either (1) 
add gtest as a dependency in the PKGBUILD so that it doesn't try to build it 
from source (and remove -DGTest_SOURCE=BUNDLED), or (2) turn off building the 
tests in the arrow cmake (-DARROW_BUILD_TESTS=OFF)

> Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring 
> return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...
> --
>
> Key: ARROW-9894
> URL: https://issues.apache.org/jira/browse/ARROW-9894
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: - Arch Linux 5.8.3
> - Arrow 1.0.1
> - CMake 3.18.2
>Reporter: Keith Hughitt
>Priority: Major
>
> When attempting to build Arrow 1.0.1 for Arch Linux, an error is encountered 
> that appears to be related to a warning generated in a test:
> ```
> [  0%] Performing build step for 'googletest_ep'
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:37
>  (message):
>   Command failed: 2
>'make'
>   See also
> 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-*.log
> -- stdout output is:
> [ 12%] Building CXX object 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
> -- stderr output is:
> In file included from 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest-all.cc:41:0:
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:
>  In constructor 
> 'testing::internal::ScopedPrematureExitFile::ScopedPrematureExitFile(const 
> char*)':
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4388:13:
>  error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
> FILE*)', declared with attribute warn_unused_result [-Werror=unused-result]
>fwrite("0", 1, 1, pfile);
>~~^~
> cc1plus: all warnings being treated as errors
> make[5]: *** [googlemock/gtest/CMakeFiles/gtest.dir/build.make:82: 
> googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
> make[4]: *** [CMakeFiles/Makefile2:219: 
> googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
> make[3]: *** [Makefile:160: all] Error 2
> CMake Error at 
> /mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:47
>  (message):
>   Stopping after outputting logs.
> make[2]: *** [CMakeFiles/googletest_ep.dir/build.make:131: 
> googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1035: CMakeFiles/googletest_ep.dir/all] 
> Error 2
> make: *** [Makefile:160: all] Error 2
> ```
> Attempted installation via PKGBUILD: https://aur.archlinux.org/packages/arrow/





[jira] [Assigned] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9873:


Assignee: Yibo Cai  (was: Apache Arrow JIRA Bot)

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Attachments: mode-range-skylake.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It's possible to improve mode kernel performance for integers within a limited 
> value range by using a value-indexed array instead of a general hash table.
> A similar trick is used in the sorting kernel (ARROW-1571).
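
The idea in the description above can be illustrated with a small sketch. This is not the Arrow C++ kernel; the function names `mode_counting` and `mode_hash` and the tie-breaking rule (smallest value wins) are assumptions made only for illustration. When every value is known to lie in a small range `[lo, hi]`, a plain array indexed by `value - lo` can replace the hash table:

```python
from collections import Counter


def mode_counting(values, lo, hi):
    """Mode via a value-indexed count array.

    Assumes lo <= v <= hi for every v in values (hypothetical helper).
    Returns (mode, count); ties go to the smallest value.
    """
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1  # direct index, no hashing
    # max() returns the first maximal index, i.e. the smallest value on ties
    best = max(range(len(counts)), key=counts.__getitem__)
    return best + lo, counts[best]


def mode_hash(values):
    """Mode via a general hash table, for comparison."""
    counts = Counter(values)
    value, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value, count
```

The counting version uses memory proportional to the value range rather than to the number of distinct values, so it only pays off when the range is small, which matches the "limited value range" condition in the issue title.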





[jira] [Assigned] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9873:


Assignee: Apache Arrow JIRA Bot  (was: Yibo Cai)

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Attachments: mode-range-skylake.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It's possible to improve mode kernel performance for integers within a limited 
> value range by using a value-indexed array instead of a general hash table.
> A similar trick is used in the sorting kernel (ARROW-1571).





[jira] [Assigned] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9892:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Add support for concat
> --
>
> Key: ARROW-9892
> URL: https://issues.apache.org/jira/browse/ARROW-9892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> So that we can concatenate strings together.
> {{pub fn concat(args: Vec) -> Expr}}
>  





[jira] [Assigned] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9891:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Make math functions support f32
> ---
>
> Key: ARROW-9891
> URL: https://issues.apache.org/jira/browse/ARROW-9891
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).
> The goal of this issue is to make the operation be cast(g(f32) AS f64) 
> instead.
> Since computations on f32 are faster than on f64, this is a simple 
> optimization.
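
To make the two orderings in the description concrete, here is a rough sketch in Python rather than Rust. Python has no native f32, so the f32 evaluation is only emulated by rounding to the nearest f32 with `struct`; the helper names `to_f32`, `g_widen_first` and `g_narrow_then_widen`, and the choice of `sqrt` as the math function `g`, are assumptions for illustration and not DataFusion code:

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def g_widen_first(x_f32: float) -> float:
    # Behaviour described in the issue: g(cast(f32 AS f64))
    return math.sqrt(x_f32)


def g_narrow_then_widen(x_f32: float) -> float:
    # Proposed behaviour: cast(g(f32) AS f64).
    # The f32 computation is emulated by rounding the result to f32.
    return to_f32(math.sqrt(x_f32))


x = to_f32(2.0)
wide = g_widen_first(x)
narrow = g_narrow_then_widen(x)
```

In this sketch the two results agree only to f32 precision, so the proposed ordering trades the extra f64 digits for cheaper f32 arithmetic.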
>  





[jira] [Assigned] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9892:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Add support for concat
> --
>
> Key: ARROW-9892
> URL: https://issues.apache.org/jira/browse/ARROW-9892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> So that we can concatenate strings together.
> {{pub fn concat(args: Vec) -> Expr}}
>  





[jira] [Assigned] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9891:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Make math functions support f32
> ---
>
> Key: ARROW-9891
> URL: https://issues.apache.org/jira/browse/ARROW-9891
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).
> The goal of this issue is to make the operation be cast(g(f32) AS f64) 
> instead.
> Since computations on f32 are faster than on f64, this is a simple 
> optimization.
>  





[jira] [Resolved] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-09-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9874.
---
Resolution: Fixed

Issue resolved by pull request 8084
[https://github.com/apache/arrow/pull/8084]

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Resolved] (ARROW-9642) [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-09-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9642.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7898
[https://github.com/apache/arrow/pull/7898]

> [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the 
> starting bit width of the indices
> --
>
> Key: ARROW-9642
> URL: https://issues.apache.org/jira/browse/ARROW-9642
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9875) [Python] Let FileSystem.get_file_info accept a single path

2020-09-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9875:
-

Assignee: Joris Van den Bossche

> [Python] Let FileSystem.get_file_info accept a single path
> --
>
> Key: ARROW-9875
> URL: https://issues.apache.org/jira/browse/ARROW-9875
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently you need to do {{fs.get_file_info([path])[0]}} to get the info of a 
> single path. We can make the function also accept that directly (instead of a 
> list): {{fs.get_file_info(path)}}.
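
The calling convention proposed above can be sketched with a small stand-in function. This is not pyarrow's implementation: the function below is a hypothetical pure-Python helper that returns plain dictionaries, and it only shows the "single path in, single result out; list in, list out" dispatch:

```python
import os


def get_file_info(paths):
    """Hypothetical stand-in: accept one path or an iterable of paths."""
    single = isinstance(paths, (str, os.PathLike))
    items = [paths] if single else list(paths)
    infos = [
        {"path": os.fspath(p), "base_name": os.path.basename(os.fspath(p))}
        for p in items
    ]
    # Mirror the input shape: a single path yields a single result
    return infos[0] if single else infos
```

With this shape, `get_file_info("a/b.txt")` and `get_file_info(["a/b.txt"])[0]` return the same value, so existing list-based callers keep working.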





[jira] [Resolved] (ARROW-9875) [Python] Let FileSystem.get_file_info accept a single path

2020-09-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9875.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8067
[https://github.com/apache/arrow/pull/8067]

> [Python] Let FileSystem.get_file_info accept a single path
> --
>
> Key: ARROW-9875
> URL: https://issues.apache.org/jira/browse/ARROW-9875
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Currently you need to do {{fs.get_file_info([path])[0]}} to get the info of a 
> single path. We can make the function also accept that directly (instead of a 
> list): {{fs.get_file_info(path)}}.





[jira] [Created] (ARROW-9894) Arrow 1.0.1 fails to build on Arch Linux (gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)'...

2020-09-01 Thread Keith Hughitt (Jira)
Keith Hughitt created ARROW-9894:


 Summary: Arrow 1.0.1 fails to build on Arch Linux 
(gtest.cc:4388:13: error: ignoring return value of 'size_t fwrite(const void*, 
size_t, size_t, FILE*)'...
 Key: ARROW-9894
 URL: https://issues.apache.org/jira/browse/ARROW-9894
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 1.0.1
 Environment: 
- Arch Linux 5.8.3
- Arrow 1.0.1
- CMake 3.18.2
Reporter: Keith Hughitt


When attempting to build Arrow 1.0.1 for Arch Linux, an error is encountered 
that appears to be related to a warning generated in a test:

```
[  0%] Performing build step for 'googletest_ep'
CMake Error at 
/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:37
 (message):
  Command failed: 2

   'make'

  See also


/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-*.log


-- stdout output is:
[ 12%] Building CXX object 
googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o

-- stderr output is:
In file included from 
/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest-all.cc:41:0:
/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:
 In constructor 
'testing::internal::ScopedPrematureExitFile::ScopedPrematureExitFile(const 
char*)':
/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4388:13:
 error: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, 
FILE*)', declared with attribute warn_unused_result [-Werror=unused-result]
   fwrite("0", 1, 1, pfile);
   ~~^~
cc1plus: all warnings being treated as errors
make[5]: *** [googlemock/gtest/CMakeFiles/gtest.dir/build.make:82: 
googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[4]: *** [CMakeFiles/Makefile2:219: 
googlemock/gtest/CMakeFiles/gtest.dir/all] Error 2
make[3]: *** [Makefile:160: all] Error 2

CMake Error at 
/mnt/storage/software/arrow/src/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-RELEASE.cmake:47
 (message):
  Stopping after outputting logs.


make[2]: *** [CMakeFiles/googletest_ep.dir/build.make:131: 
googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build] Error 1
make[1]: *** [CMakeFiles/Makefile2:1035: CMakeFiles/googletest_ep.dir/all] 
Error 2
make: *** [Makefile:160: all] Error 2
```

Attempted installation via PKGBUILD: https://aur.archlinux.org/packages/arrow/





[jira] [Commented] (ARROW-9520) [Rust] [DataFusion] Can't alias an aggregate expression

2020-09-01 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188500#comment-17188500
 ] 

Andrew Lamb commented on ARROW-9520:


Here is another example:

{code}
> create external table sales(customer_id varchar, sales bigint) stored as CSV 
> location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.

> SELECT customer_id, sum(sales) FROM sales ORDER BY sum(sales);
General("Projection references non-aggregate values")
{code}

> [Rust] [DataFusion] Can't alias an aggregate expression
> ---
>
> Key: ARROW-9520
> URL: https://issues.apache.org/jira/browse/ARROW-9520
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>
> The following test (on execute) fails:
> {code}
> #[test]
> fn aggregate_with_alias() -> Result<()> {
>     let results = execute("SELECT c1, COUNT(c2) AS count FROM test GROUP BY c1", 4)?;
>     assert_eq!(field_names(batch), vec!["c1", "count"]);
>     let expected = vec!["0,10", "1,10", "2,10", "3,10"];
>     let mut rows = test::format_batch();
>     rows.sort();
>     assert_eq!(rows, expected);
>     Ok(())
> }
> {code}
> The root cause is that, in {{sql::planner}}, we interpret {{COUNT(c2) AS 
> count}} as an {{Expr::Alias}}, which fails the {{is_aggregate_expr}} 
> condition and is therefore treated as a grouping expression instead of an 
> aggregate expression. This raises the error
> {{General("Projection references non-aggregate values")}}
> The planner could interpret the statement above as two steps: an aggregation 
> followed by a projection. Alternatively, we can allow aliases to be valid 
> aggregation expressions.





[jira] [Updated] (ARROW-9881) [C++] Add support for dictionary type to arrow::read_csv

2020-09-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9881:

Summary: [C++] Add support for dictionary type to arrow::read_csv  (was: 
Add support for dictionary type to arrow::read_csv)

> [C++] Add support for dictionary type to arrow::read_csv
> 
>
> Key: ARROW-9881
> URL: https://issues.apache.org/jira/browse/ARROW-9881
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Dmitry Chigarev
>Priority: Minor
>
> For now, the only way to dictionary-encode a column is to set 
> {{auto_dict_encode = True}} in {{arrow::csv::ConvertOptions}}.
> However, users should be able to specify a column's type as dictionary 
> explicitly via the {{column_types}} parameter. (Currently this raises an 
> exception saying that the dict type is not supported.)





[jira] [Updated] (ARROW-9881) [C++] Add support for dictionary type to arrow::read_csv

2020-09-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9881:

Component/s: C++

> [C++] Add support for dictionary type to arrow::read_csv
> 
>
> Key: ARROW-9881
> URL: https://issues.apache.org/jira/browse/ARROW-9881
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Dmitry Chigarev
>Priority: Minor
>
> For now, the only way to dictionary-encode a column is to set 
> {{auto_dict_encode = True}} in {{arrow::csv::ConvertOptions}}.
> However, users should be able to specify a column's type as dictionary 
> explicitly via the {{column_types}} parameter. (Currently this raises an 
> exception saying that the dict type is not supported.)





[jira] [Resolved] (ARROW-9629) [Python] Kartothek integration tests failing due to missing freezegun module

2020-09-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-9629.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7891
[https://github.com/apache/arrow/pull/7891]

> [Python] Kartothek integration tests failing due to missing freezegun module
> 
>
> Key: ARROW-9629
> URL: https://issues.apache.org/jira/browse/ARROW-9629
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> See eg https://github.com/ursa-labs/crossbow/runs/939266052
> {code}
>  ERRORS 
> 
>  ERROR collecting test session 
> _
> /opt/conda/envs/arrow/lib/python3.7/importlib/__init__.py:127: in import_module
>     return _bootstrap._gcd_import(name[level:], package, level)
> <frozen importlib._bootstrap>:1006: in _gcd_import
>     ???
> <frozen importlib._bootstrap>:983: in _find_and_load
>     ???
> <frozen importlib._bootstrap>:967: in _find_and_load_unlocked
>     ???
> <frozen importlib._bootstrap>:677: in _load_unlocked
>     ???
> /opt/conda/envs/arrow/lib/python3.7/site-packages/_pytest/assertion/rewrite.py:170: in exec_module
>     exec(co, module.__dict__)
> tests/cli/conftest.py:11: in <module>
>     from freezegun import freeze_time
> E   ModuleNotFoundError: No module named 'freezegun'
> {code}





[jira] [Updated] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9889:
--
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Resolved] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9889.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8083
[https://github.com/apache/arrow/pull/8083]

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Commented] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

2020-09-01 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188426#comment-17188426
 ] 

Krisztian Szucs commented on ARROW-4965:


This has possibly been fixed in ARROW-9528; we need to check whether the issue 
persists.

> [Python] Timestamp array type detection should use tzname of 
> datetime.datetime objects
> --
>
> Key: ARROW-4965
> URL: https://issues.apache.org/jira/browse/ARROW-4965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
> Environment: $ python --version
> Python 3.7.2
> $ pip freeze
> numpy==1.16.2
> pyarrow==0.12.1
> pytz==2018.9
> six==1.12.0
> $ sw_vers
> ProductName:Mac OS X
> ProductVersion: 10.14.3
> BuildVersion:   18D109
> (pyarrow) 
>Reporter: Tim Swast
>Priority: Major
> Fix For: 2.0.0
>
>
> The type detection from datetime objects to array appears to ignore the 
> presence of a tzinfo on the datetime object, instead storing them as naive 
> timestamp columns.
> Python code:
> {code:python}
> import datetime
> import pytz
> import pyarrow as pa
> naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
> utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
> tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))
> def inspect(varname):
>     print(varname)
>     arr = globals()[varname]
>     print(arr.type)
>     print(arr)
>     print()
> auto_naive_arr = pa.array([naive_datetime])
> inspect("auto_naive_arr")
> auto_utc_arr = pa.array([utc_datetime])
> inspect("auto_utc_arr")
> auto_tzaware_arr = pa.array([tzaware_datetime])
> inspect("auto_tzaware_arr")
> auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
> inspect("auto_mixed_arr")
> naive_type = pa.timestamp("us", naive_datetime.tzname())
> utc_type = pa.timestamp("us", utc_datetime.tzname())
> tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())
> naive_arr = pa.array([naive_datetime], type=naive_type)
> inspect("naive_arr")
> utc_arr = pa.array([utc_datetime], type=utc_type)
> inspect("utc_arr")
> tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
> inspect("tzaware_arr")
> mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
> inspect("mixed_arr")
> {code}
> This prints:
> {noformat}
> $ python detect_timezone.py
> auto_naive_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> auto_utc_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> auto_tzaware_arr
> timestamp[us]
> [
>   1547352670000000
> ]
> auto_mixed_arr
> timestamp[us]
> [
>   1547381470000000,
>   1547352670000000
> ]
> naive_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> utc_arr
> timestamp[us, tz=UTC]
> [
>   1547381470000000
> ]
> tzaware_arr
> timestamp[us, tz=PST]
> [
>   1547352670000000
> ]
> mixed_arr
> timestamp[us, tz=UTC]
> [
>   1547381470000000,
>   1547352670000000
> ]
> {noformat}
> But I would expect the following types instead:
> * {{naive_datetime}}: {{timestamp[us]}}
> * {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
> * {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe 
> {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as 
> the {{tzname}})
> * {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}
> Also, in the "mixed" case, I'd expect the actual stored microseconds to be 
> the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both 
> refer to the same point in time. It seems reasonable for any naive datetime 
> objects mixed in with tz-aware datetimes to be interpreted as UTC.





[jira] [Assigned] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

2020-09-01 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-4965:
--

Assignee: Krisztian Szucs

> [Python] Timestamp array type detection should use tzname of 
> datetime.datetime objects
> --
>
> Key: ARROW-4965
> URL: https://issues.apache.org/jira/browse/ARROW-4965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
> Environment: $ python --version
> Python 3.7.2
> $ pip freeze
> numpy==1.16.2
> pyarrow==0.12.1
> pytz==2018.9
> six==1.12.0
> $ sw_vers
> ProductName:Mac OS X
> ProductVersion: 10.14.3
> BuildVersion:   18D109
> (pyarrow) 
>Reporter: Tim Swast
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> The type detection from datetime objects to array appears to ignore the 
> presence of a tzinfo on the datetime object, instead storing them as naive 
> timestamp columns.
> Python code:
> {code:python}
> import datetime
> import pytz
> import pyarrow as pa
> naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
> utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
> tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))
> def inspect(varname):
>     print(varname)
>     arr = globals()[varname]
>     print(arr.type)
>     print(arr)
>     print()
> auto_naive_arr = pa.array([naive_datetime])
> inspect("auto_naive_arr")
> auto_utc_arr = pa.array([utc_datetime])
> inspect("auto_utc_arr")
> auto_tzaware_arr = pa.array([tzaware_datetime])
> inspect("auto_tzaware_arr")
> auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
> inspect("auto_mixed_arr")
> naive_type = pa.timestamp("us", naive_datetime.tzname())
> utc_type = pa.timestamp("us", utc_datetime.tzname())
> tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())
> naive_arr = pa.array([naive_datetime], type=naive_type)
> inspect("naive_arr")
> utc_arr = pa.array([utc_datetime], type=utc_type)
> inspect("utc_arr")
> tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
> inspect("tzaware_arr")
> mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
> inspect("mixed_arr")
> {code}
> This prints:
> {noformat}
> $ python detect_timezone.py
> auto_naive_arr
> timestamp[us]
> [
>   154738147000
> ]
> auto_utc_arr
> timestamp[us]
> [
>   154738147000
> ]
> auto_tzaware_arr
> timestamp[us]
> [
>   154735267000
> ]
> auto_mixed_arr
> timestamp[us]
> [
>   154738147000,
>   154735267000
> ]
> naive_arr
> timestamp[us]
> [
>   154738147000
> ]
> utc_arr
> timestamp[us, tz=UTC]
> [
>   154738147000
> ]
> tzaware_arr
> timestamp[us, tz=PST]
> [
>   154735267000
> ]
> mixed_arr
> timestamp[us, tz=UTC]
> [
>   154738147000,
>   154735267000
> ]
> {noformat}
> But I would expect the following types instead:
> * {{auto_naive_arr}}: {{timestamp[us]}}
> * {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
> * {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe 
> {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as 
> the {{tzname}})
> * {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}
> Also, in the "mixed" case, I'd expect the actual stored microseconds to be 
> the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both 
> refer to the same point in time. It seems reasonable for any naive datetime 
> objects mixed in with tz-aware datetimes to be interpreted as UTC.
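
The expectation about the "mixed" case can be checked without pyarrow. The
following is a minimal sketch, not pyarrow's implementation: it uses only the
standard library, substitutes a fixed -08:00 offset for pytz's
America/Los_Angeles zone (an assumption that holds for the January date used in
the report), and the helper names to_utc_micros and wall_clock_micros are
hypothetical. It shows that both datetimes map to the same UTC microsecond
value, and that ignoring tzinfo produces an 8-hour gap like the one between
the two rows in the printed output.

```python
import datetime

UTC = datetime.timezone.utc
# Assumption: a fixed -08:00 offset stands in for pytz's America/Los_Angeles,
# which is valid for a January date (no daylight saving time in effect).
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=UTC)


def to_utc_micros(dt):
    """Microseconds since the Unix epoch; naive datetimes are treated as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=UTC)
    return (dt - EPOCH) // datetime.timedelta(microseconds=1)


def wall_clock_micros(dt):
    """Microseconds computed from the wall-clock fields only, ignoring tzinfo."""
    return to_utc_micros(dt.replace(tzinfo=None))


utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=UTC)
tzaware_datetime = utc_datetime.astimezone(PST)

print(to_utc_micros(utc_datetime))          # 1547381470000000
print(to_utc_micros(tzaware_datetime))      # 1547381470000000 (same instant)
print(wall_clock_micros(tzaware_datetime))  # 1547352670000000 (8 hours earlier)
```

The last value differs from the first two by 28,800,000,000 microseconds, which
is exactly the 8-hour UTC offset that is lost when tzinfo is ignored.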





[jira] [Commented] (ARROW-4046) [Python/CI] Run nightly large memory tests

2020-09-01 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188424#comment-17188424
 ] 

Krisztian Szucs commented on ARROW-4046:


I'm not sure about the memory requirements, but I assume the easiest would be 
to enable the large memory tests in the ursabot builders. Assigning it to 
myself.

> [Python/CI] Run nightly large memory tests
> --
>
> Key: ARROW-4046
> URL: https://issues.apache.org/jira/browse/ARROW-4046
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: nightly
> Fix For: 2.0.0
>
>
> See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646





[jira] [Assigned] (ARROW-4046) [Python/CI] Run nightly large memory tests

2020-09-01 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-4046:
--

Assignee: Krisztian Szucs

> [Python/CI] Run nightly large memory tests
> --
>
> Key: ARROW-4046
> URL: https://issues.apache.org/jira/browse/ARROW-4046
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: nightly
> Fix For: 2.0.0
>
>
> See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646





[jira] [Updated] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9821:
---
Description: 
The basic goal is to allow users to implement their own PlanNodes. I will 
provide a Google doc opened for comments shortly.

Proposal: 
https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#

See also mailing list discussion here: 
https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E


  was:
The basic goal is to  allow users to implement their own PlanNodes. I will 
provide a google doc opened for comments shortly.

Proposal: 
https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#



> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Commented] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2020-09-01 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188380#comment-17188380
 ] 

Krisztian Szucs commented on ARROW-3080:


After a quick look, the list-of-structs roundtrip works nicely; it still needs 
to be covered with tests.

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}





[jira] [Updated] (ARROW-9893) [Python] Bindings for writing datasets to Parquet

2020-09-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9893:
-
Fix Version/s: 2.0.0

> [Python] Bindings for writing datasets to Parquet
> -
>
> Key: ARROW-9893
> URL: https://issues.apache.org/jira/browse/ARROW-9893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658





[jira] [Assigned] (ARROW-9893) [Python] Bindings for writing datasets to Parquet

2020-09-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-9893:


Assignee: Joris Van den Bossche

> [Python] Bindings for writing datasets to Parquet
> -
>
> Key: ARROW-9893
> URL: https://issues.apache.org/jira/browse/ARROW-9893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>
> Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658





[jira] [Created] (ARROW-9893) [Python] Bindings for writing datasets to Parquet

2020-09-01 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9893:


 Summary: [Python] Bindings for writing datasets to Parquet
 Key: ARROW-9893
 URL: https://issues.apache.org/jira/browse/ARROW-9893
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658





[jira] [Resolved] (ARROW-9658) [Python][Dataset] Bindings for dataset writing

2020-09-01 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-9658.
--
Resolution: Fixed

Issue resolved by pull request 7921
[https://github.com/apache/arrow/pull/7921]

> [Python][Dataset] Bindings for dataset writing
> --
>
> Key: ARROW-9658
> URL: https://issues.apache.org/jira/browse/ARROW-9658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9873:
--
Labels: pull-request-available  (was: )

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Attachments: mode-range-skylake.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's possible to improve mode kernel performance for integers within a 
> limited value range by using a value-indexed array instead of a general hash 
> table. A similar trick is used in the sorting kernel (ARROW-1571).
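
The idea in the description can be sketched as follows. This is an illustration
in Python, not the C++ kernel from the pull request; the function name
mode_small_range, the explicit lo/hi bounds, and the tie-breaking rule
(smallest value wins) are all assumptions made for this example.

```python
def mode_small_range(values, lo, hi):
    """Return (mode, count) for integers known to lie in [lo, hi]."""
    if not values:
        raise ValueError("mode of an empty sequence is undefined")
    # One counter per possible value: the value itself (offset by lo) is the
    # index, so no hashing or collision handling is needed.
    counts = [0] * (hi - lo + 1)
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"value {v} is outside [{lo}, {hi}]")
        counts[v - lo] += 1
    # max() returns the first maximal index, so ties go to the smallest value.
    best = max(range(len(counts)), key=counts.__getitem__)
    return lo + best, counts[best]


print(mode_small_range([3, 1, 3, 2, 1, 3], lo=1, hi=3))  # (3, 3)
```

The counting array needs hi - lo + 1 slots, so this approach only pays off when
the value range is narrow relative to the number of input values.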





[jira] [Updated] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9892:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Add support for concat
> --
>
> Key: ARROW-9892
> URL: https://issues.apache.org/jira/browse/ARROW-9892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> So that we can concatenate strings together.
> {{pub fn concat(args: Vec) -> Expr}}
>  





[jira] [Created] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-01 Thread Jorge (Jira)
Jorge created ARROW-9892:


 Summary: [Rust] [DataFusion] Add support for concat
 Key: ARROW-9892
 URL: https://issues.apache.org/jira/browse/ARROW-9892
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


So that we can concatenate strings together.

{{pub fn concat(args: Vec) -> Expr}}

 





[jira] [Updated] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9891:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Make math functions support f32
> ---
>
> Key: ARROW-9891
> URL: https://issues.apache.org/jira/browse/ARROW-9891
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).
> The goal of this issue is to make the operation be cast(g(f32) AS f64) 
> instead.
> Since computations on f32 are faster than on f64, this is a simple 
> optimization.
>  





[jira] [Created] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-01 Thread Jorge (Jira)
Jorge created ARROW-9891:


 Summary: [Rust] [DataFusion] Make math functions support f32
 Key: ARROW-9891
 URL: https://issues.apache.org/jira/browse/ARROW-9891
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).

The goal of this issue is to make the operation be cast(g(f32) AS f64) instead.

Since computations on f32 are faster than on f64, this is a simple optimization.

 


