[jira] [Commented] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-08-31 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188142#comment-17188142
 ] 

Yibo Cai commented on ARROW-9873:
-

Array size in bytes looks like a more reasonable crossover point than array size 
in items. In tests with int32 and int64, int32 needs more items than int64 to 
benefit from this optimization.
So the thresholds are:
- array size in bytes >= 8192
- array value range <= 16384
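
As a rough illustration of the two thresholds above, here is a minimal Python sketch of the dispatch decision. It is not the Arrow C++ code: the function names are hypothetical, and "value range" is assumed to mean max minus min. It also shows why int32 needs more items than int64 to reach the byte threshold.

```python
# Thresholds quoted in the comment above. The names below are hypothetical.
MIN_ARRAY_BYTES = 8192
MAX_VALUE_RANGE = 16384


def use_counting_path(num_items, item_width_bytes, min_value, max_value):
    """Return True when both thresholds are met."""
    array_bytes = num_items * item_width_bytes
    value_range = max_value - min_value  # assumed definition of "value range"
    return array_bytes >= MIN_ARRAY_BYTES and value_range <= MAX_VALUE_RANGE


def min_items_for(item_width_bytes):
    """Smallest item count that reaches the byte threshold for a given width."""
    return -(-MIN_ARRAY_BYTES // item_width_bytes)  # ceiling division
```

Under these assumptions, `min_items_for(4)` is 2048 for int32 and `min_items_for(8)` is 1024 for int64.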

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
> Attachments: mode-range-skylake.png
>
>
> It's possible to improve mode kernel performance for integers within a limited 
> value range by using a value-indexed array instead of a general hash table.
> A similar trick is used in the sorting kernel (ARROW-1571).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-08-31 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186213#comment-17186213
 ] 

Yibo Cai edited comment on ARROW-9873 at 9/1/20, 5:10 AM:
--

Maybe we can use the counting method as a first step, then scan the counter array 
and insert the results into a map at the end. I guess this won't cause much 
performance loss, as the map is small and we can reserve buckets first. Will do 
some tests.

Test results with the current Arrow benchmark (values within -100~100, array size 
of 1M bytes):
- Small performance drop (< 10%) for Boolean and Int8.
- About 2x performance improvement for Int16/32/64 with limited value range.

Adjusting value range and array size leads to consistent performance uplift.
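
The two-step idea in this comment (count first, then scan the counter array into a map) can be sketched as follows. This is only a Python model of the approach, not the kernel itself; the function name is hypothetical and a plain dict stands in for the hash map.

```python
def count_then_map(values):
    """Count with a value-indexed array, then copy nonzero counters into a dict."""
    if not values:
        return {}
    lo, hi = min(values), max(values)
    counters = [0] * (hi - lo + 1)  # one slot per possible value in [lo, hi]
    for v in values:
        counters[v - lo] += 1       # direct indexing, no hashing per element
    # Final scan: the map holds at most (hi - lo + 1) entries, so it stays small.
    return {i + lo: c for i, c in enumerate(counters) if c}
```

The per-element work is a single array increment, and the map is only touched once per distinct value, which is why the final conversion is expected to be cheap.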


was (Author: yibo):
Maybe we can use the counting method as a first step, then scan the counter array 
and insert the results into a map at the end. I guess this won't cause much 
performance loss, as the map is small and we can reserve buckets first. Will do 
some tests.

Test results with the existing benchmark (values within -100~100, array size of 
1M bytes):
- Small performance drop (< 10%) for Boolean and Int8.
- About 2x performance improvement for Int16/32/64 with limited value range.

Adjusting value range and array size leads to consistent performance uplift.

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
> Attachments: mode-range-skylake.png
>
>
> It's possible to improve mode kernel performance for integers within a limited 
> value range by using a value-indexed array instead of a general hash table.
> A similar trick is used in the sorting kernel (ARROW-1571).





[jira] [Comment Edited] (ARROW-9873) [C++][Compute] Improve mode kernel for integers within limited value range

2020-08-31 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186213#comment-17186213
 ] 

Yibo Cai edited comment on ARROW-9873 at 9/1/20, 5:02 AM:
--

Maybe we can use the counting method as a first step, then scan the counter array 
and insert the results into a map at the end. I guess this won't cause much 
performance loss, as the map is small and we can reserve buckets first. Will do 
some tests.

Test results with the existing benchmark (values within -100~100, array size of 
1M bytes):
- Small performance drop (< 10%) for Boolean and Int8.
- About 2x performance improvement for Int16/32/64 with limited value range.

Adjusting value range and array size leads to consistent performance uplift.


was (Author: yibo):
Maybe we can use the counting method as a first step, then scan the counter array 
and insert the results into a map at the end. I guess this won't cause much 
performance loss, as the map is small and we can reserve buckets first. Will do 
some tests.

> [C++][Compute] Improve mode kernel for integers within limited value range
> ---
>
> Key: ARROW-9873
> URL: https://issues.apache.org/jira/browse/ARROW-9873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
> Attachments: mode-range-skylake.png
>
>
> It's possible to improve mode kernel performance for integers within a limited 
> value range by using a value-indexed array instead of a general hash table.
> A similar trick is used in the sorting kernel (ARROW-1571).





[jira] [Commented] (ARROW-9794) [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86

2020-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188134#comment-17188134
 ] 

Micah Kornfield commented on ARROW-9794:


[~frank.du] do you have any interest in looking at this?

> [C++] Add functionality to cpu_info to discriminate between Intel vs AMD x86
> 
>
> Key: ARROW-9794
> URL: https://issues.apache.org/jira/browse/ARROW-9794
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> This is needed to do runtime dispatches for places where pext/pdep can be 
> used.  These perform poorly on AMD.





[jira] [Commented] (ARROW-7731) [C++][Parquet] Support LargeListArray

2020-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188129#comment-17188129
 ] 

Micah Kornfield commented on ARROW-7731:


Support for reading is tracked in ARROW-1644 and its children.

> [C++][Parquet] Support LargeListArray
> -
>
> Key: ARROW-7731
> URL: https://issues.apache.org/jira/browse/ARROW-7731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: marc abboud
>Priority: Major
>  Labels: parquet
> Fix For: 2.0.0
>
>
> For now it's not possible to write a pyarrow.Table containing a 
> LargeListArray in parquet. The lines
> {code:java}
> from pyarrow import parquet
> import pyarrow as pa
> indices = [1, 2, 3]
> indptr = [0, 1, 2, 3]
> q = pa.lib.LargeListArray.from_arrays(indptr, indices) 
> table = pa.Table.from_arrays([q], names=['no']) 
> parquet.write_table(table, '/test'){code}
> yield the error 
> {code:java}
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
> conversion: large_list
> {code}
>  





[jira] [Resolved] (ARROW-7731) [C++][Parquet] Support LargeListArray

2020-08-31 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-7731.

Fix Version/s: (was: 2.0.0)
   0.17.1
 Assignee: Micah Kornfield
   Resolution: Duplicate

> [C++][Parquet] Support LargeListArray
> -
>
> Key: ARROW-7731
> URL: https://issues.apache.org/jira/browse/ARROW-7731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: marc abboud
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet
> Fix For: 0.17.1
>
>
> For now it's not possible to write a pyarrow.Table containing a 
> LargeListArray in parquet. The lines
> {code:java}
> from pyarrow import parquet
> import pyarrow as pa
> indices = [1, 2, 3]
> indptr = [0, 1, 2, 3]
> q = pa.lib.LargeListArray.from_arrays(indptr, indices) 
> table = pa.Table.from_arrays([q], names=['no']) 
> parquet.write_table(table, '/test'){code}
> yield the error 
> {code:java}
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
> conversion: large_list
> {code}
>  





[jira] [Commented] (ARROW-5569) [C++] import avro C++ code to code base.

2020-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188128#comment-17188128
 ] 

Micah Kornfield commented on ARROW-5569:


I think so. I probably won't get to this until sometime in 2021 at the 
earliest, though, so we can possibly close this and I can reopen it later.

> [C++] import avro C++ code to code base.
> 
>
> Key: ARROW-5569
> URL: https://issues.apache.org/jira/browse/ARROW-5569
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> The goal here is to take the code as is, without compiling it, but flatten it 
> to conform with Arrow's code base standards. This will give a basis for 
> future PRs.





[jira] [Resolved] (ARROW-9887) [Rust] [DataFusion] Add support for complex return types of built-in functions

2020-08-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9887.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8080
[https://github.com/apache/arrow/pull/8080]

> [Rust] [DataFusion] Add support for complex return types of built-in functions
> --
>
> Key: ARROW-9887
> URL: https://issues.apache.org/jira/browse/ARROW-9887
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-9886) [Rust] [DataFusion] Simplify code to test cast

2020-08-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9886.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8077
[https://github.com/apache/arrow/pull/8077]

> [Rust] [DataFusion] Simplify code to test cast
> --
>
> Key: ARROW-9886
> URL: https://issues.apache.org/jira/browse/ARROW-9886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We have 3 tests with similar functionality that vary only in the types 
> they test. Let's create a macro to apply to all of them, so that the tests 
> are equivalent and DRY.





[jira] [Closed] (ARROW-3265) [C++] Restore CPACK support for Parquet libraries

2020-08-31 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-3265.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Not A Problem

> [C++] Restore CPACK support for Parquet libraries
> -
>
> Key: ARROW-3265
> URL: https://issues.apache.org/jira/browse/ARROW-3265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>  Labels: parquet
>
> See https://github.com/apache/parquet-cpp/blob/master/CMakeLists.txt#L32





[jira] [Commented] (ARROW-3265) [C++] Restore CPACK support for Parquet libraries

2020-08-31 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188094#comment-17188094
 ] 

Kouhei Sutou commented on ARROW-3265:
-

Yes.
I'm closing this.

> [C++] Restore CPACK support for Parquet libraries
> -
>
> Key: ARROW-3265
> URL: https://issues.apache.org/jira/browse/ARROW-3265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 2.0.0
>
>
> See https://github.com/apache/parquet-cpp/blob/master/CMakeLists.txt#L32





[jira] [Assigned] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2020-08-31 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-3080:
--

Assignee: Krisztian Szucs

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}





[jira] [Commented] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2020-08-31 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188049#comment-17188049
 ] 

Krisztian Szucs commented on ARROW-3080:


Since the scalar refactor we should have better nested type support now. I'm 
going to investigate this.

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}





[jira] [Assigned] (ARROW-9868) [C++] Provide utility for copying files between filesystems

2020-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9868:
--

Assignee: Ben Kietzman

> [C++] Provide utility for copying files between filesystems
> ---
>
> Key: ARROW-9868
> URL: https://issues.apache.org/jira/browse/ARROW-9868
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: filesystem, s3
> Fix For: 2.0.0
>
>
> {{CopyStream}} in arrow/filesystem/util_internal.h does this, but we should 
> expose it, multithread it (can read in one thread while the other thread 
> writes), and further see if there are filesystem-specific optimizations (e.g. 
> S3 multipart uploading/downloading). We may also want a version that takes a 
> FileSelector or vector of paths and parallelizes the operations on them.





[jira] [Created] (ARROW-9890) [R] Add zstandard compression codec in macOS build

2020-08-31 Thread Liang-Bo Wang (Jira)
Liang-Bo Wang created ARROW-9890:


 Summary: [R] Add zstandard compression codec in macOS build
 Key: ARROW-9890
 URL: https://issues.apache.org/jira/browse/ARROW-9890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 1.0.0, 1.0.1
 Environment: macOS
Reporter: Liang-Bo Wang


I am using the default macOS build of R arrow 1.0.1 (R 4.0.2) and it doesn't 
support zstandard/zstd for compression:
{code:r}
> arrow::write_parquet(cars, '~/Downloads/cars.parquet', compression = 'zstd')
Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
  NotImplemented: ZSTD codec support not built
> arrow::codec_is_available('zstd')
[1] FALSE
{code}
Like ARROW-6960, which added lz4/zstd support on Windows, it'd be great to 
have zstd support by default on macOS as well.

I don't know if I have the right knowledge to add such support, but let me know 
how I can help. Thank you for making this great package!





[jira] [Updated] (ARROW-7179) [C++][Compute] Array support for fill_null

2020-08-31 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7179:

Description: 
Add kernels which support replacing null values in an array with values 
taken from corresponding slots in another array:

{code}
fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
{code}
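
The slot-by-slot semantics in the example above can be modeled with plain Python lists, using `None` for null. This is only an illustration of the intended behavior, not the Arrow kernel; the length check is an assumption about how mismatched inputs might be handled.

```python
def fill_null(values, fill_values):
    """Replace each null (None) slot with the value from the same slot in fill_values."""
    if len(values) != len(fill_values):
        raise ValueError("both arrays must have the same length")
    # A slot stays null when the corresponding fill slot is also null.
    return [f if v is None else v for v, f in zip(values, fill_values)]
```

Note that the third slot stays null in the example because the fill array is null there as well.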

  was:
Add a kernel which replaces null values in an array with a scalar value or with 
values taken from another array:

{code}
coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
{code}

The code in {{take_internal.h}} should be of some use with a bit of refactoring.

A filter Expression should be added at the same time.


> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> Add kernels which support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}





[jira] [Updated] (ARROW-7179) [C++][Compute] Array support for fill_null

2020-08-31 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7179:

Summary: [C++][Compute] Array support for fill_null  (was: [C++][Compute] 
Coalesce kernel)

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> Add a kernel which replaces null values in an array with a scalar value or 
> with values taken from another array:
> {code}
> coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
> coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
> The code in {{take_internal.h}} should be of some use with a bit of 
> refactoring.
> A filter Expression should be added at the same time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8493) [C++] Create unified schema resolution code for Array reconstruction.

2020-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8493.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7973
[https://github.com/apache/arrow/pull/7973]

> [C++] Create unified schema resolution code for Array reconstruction.
> -
>
> Key: ARROW-8493
> URL: https://issues.apache.org/jira/browse/ARROW-8493
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Augment SchemaField in SchemaManifest to track repeated ancestor definition 
> level.





[jira] [Commented] (ARROW-5569) [C++] import avro C++ code to code base.

2020-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187856#comment-17187856
 ] 

Antoine Pitrou commented on ARROW-5569:
---

Is this still something we want to pursue?

> [C++] import avro C++ code to code base.
> 
>
> Key: ARROW-5569
> URL: https://issues.apache.org/jira/browse/ARROW-5569
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> The goal here is to take the code as is, without compiling it, but flatten it 
> to conform with Arrow's code base standards. This will give a basis for 
> future PRs.





[jira] [Assigned] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-08-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9859:
-

Assignee: Antoine Pitrou

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.
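
To make point 1 concrete, here is a small sketch using only Python's standard `urllib.parse`, not Arrow's own URI parser. The access key, secret key, and helper names are made up for illustration. It shows that an unescaped `/` cuts the authority section short, and that a percent-encoded secret survives a round trip when it is decoded exactly once.

```python
from urllib.parse import quote, unquote, urlsplit


def build_s3_uri(access_key, secret_key, bucket_path):
    """Build an s3:// URI with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() encode "/" too, which it leaves alone by default.
    user = quote(access_key, safe="")
    password = quote(secret_key, safe="")
    return f"s3://{user}:{password}@{bucket_path}"


def secret_from_uri(uri):
    """Recover the original secret by decoding the password field exactly once."""
    return unquote(urlsplit(uri).password)


secret = "abc/def+ghi"  # made-up key containing special characters
uri = build_s3_uri("ACCESSKEY", secret, "bucket/path")

# Without encoding, the first "/" ends the authority part of the URI.
broken_netloc = urlsplit("s3://ACCESSKEY:abc/def+ghi@bucket/path").netloc
```

If the signature mismatch in point 2 comes from a missing or doubled decoding step, a round-trip check like `secret_from_uri(uri) == secret` is the kind of property a regression test could assert; that diagnosis is only a guess based on the error message quoted above.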





[jira] [Commented] (ARROW-3265) [C++] Restore CPACK support for Parquet libraries

2020-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187855#comment-17187855
 ] 

Antoine Pitrou commented on ARROW-3265:
---

Should this be closed [~kou] ?

> [C++] Restore CPACK support for Parquet libraries
> -
>
> Key: ARROW-3265
> URL: https://issues.apache.org/jira/browse/ARROW-3265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 2.0.0
>
>
> See https://github.com/apache/parquet-cpp/blob/master/CMakeLists.txt#L32





[jira] [Resolved] (ARROW-9884) [R] Bindings for writing datasets to Parquet

2020-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9884.

Resolution: Fixed

Issue resolved by pull request 8075
[https://github.com/apache/arrow/pull/8075]

> [R] Bindings for writing datasets to Parquet
> 
>
> Key: ARROW-9884
> URL: https://issues.apache.org/jira/browse/ARROW-9884
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Depends on ARROW-9646





[jira] [Assigned] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-08-31 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-9814:
---

Assignee: Ben Kietzman

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Ben Kietzman
>Priority: Critical
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir
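
The mismatch described above, and the two proposed fixes, can be sketched in Python. This is a simplified model, not the code in discovery.cc; all function names are hypothetical, and the scheme stripping is a naive stand-in for proper URI handling.

```python
def remove_ancestor(base_dir, path):
    """Return path relative to base_dir, or None if path is not inside it."""
    base = base_dir.rstrip("/")
    if path == base:
        return ""
    if path.startswith(base + "/"):
        return path[len(base) + 1:]
    return None


def strip_scheme(uri):
    """Drop a leading 'scheme://' if present (naive, for illustration only)."""
    _, sep, rest = uri.partition("://")
    return rest if sep else uri


def relative_or_error(base_dir, path):
    """Fix 1: raise an error that includes both path strings instead of crashing."""
    relative = remove_ancestor(base_dir, path)
    if relative is None:
        raise ValueError(f"path {path!r} is outside base_dir {base_dir!r}")
    return relative
```

With the full URI as base_dir, `remove_ancestor("s3://bucket/path", "bucket/path/foo/bar")` finds no common prefix, which corresponds to the failed check. Passing `strip_scheme("s3://bucket/path")` instead (fix 2) yields the relative path `foo/bar`.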





[jira] [Resolved] (ARROW-9646) [C++][Dataset] Add support for writing parquet datasets

2020-08-31 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-9646.
-
Resolution: Fixed

Issue resolved by pull request 8066
[https://github.com/apache/arrow/pull/8066]

> [C++][Dataset] Add support for writing parquet datasets
> ---
>
> Key: ARROW-9646
> URL: https://issues.apache.org/jira/browse/ARROW-9646
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> IpcFileFormat is currently the only format which supports writing.





[jira] [Resolved] (ARROW-9882) [C++/Python] Update conda-forge-pinning to 3 for OSX conda packages

2020-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9882.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8073
[https://github.com/apache/arrow/pull/8073]

> [C++/Python] Update conda-forge-pinning to 3 for OSX conda packages
> ---
>
> Key: ARROW-9882
> URL: https://issues.apache.org/jira/browse/ARROW-9882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging, Python
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9888) [Rust] [DataFusion] ExecutionContext can not be shared between threads again

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9888:
---
Summary: [Rust] [DataFusion] ExecutionContext can not be shared between 
threads again  (was: [Rust] [DataFusion] Allow ExecutionContext to be shared 
between threads again)

> [Rust] [DataFusion] ExecutionContext can not be shared between threads again
> 
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on https://github.com/apache/arrow/pull/8079
> The high-level idea is to allow ExecutionContext to be used in multi-threaded 
> environments such as Python.
> The two use cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to be formulated (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. When creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread-safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec<JoinHandle<Result<_>>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) with the following error, because 
> several `dyn` objects are not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option 

[jira] [Comment Edited] (ARROW-9888) [Rust] [DataFusion] ExecutionContext can not be shared between threads

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187633#comment-17187633
 ] 

Andrew Lamb edited comment on ARROW-9888 at 8/31/20, 2:49 PM:
--

This was previously attempted on ARROW-9425, but it seems the behavior 
regressed in some subsequent PR.

It would be good to have a test to prevent such reversions.
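
A test of this kind could be sketched as follows. This is only an illustration under stated assumptions: `Planner`, `SimplePlanner`, `Context`, `assert_send_sync` and `plan_on_threads` are hypothetical stand-ins, not DataFusion types. The point is that declaring the trait with `Send + Sync` supertraits makes every trait object of it shareable, and a small generic function turns that property into a compile-time check.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical planner trait. The `Send + Sync` supertraits make every
// `dyn Planner` safe to move and share between threads.
trait Planner: Send + Sync {
    fn plan(&self, sql: &str) -> String;
}

struct SimplePlanner;

impl Planner for SimplePlanner {
    fn plan(&self, sql: &str) -> String {
        format!("plan({})", sql)
    }
}

// Hypothetical context that stores an optional trait object, similar in
// shape to the `Option<Arc<dyn ...>>` named in the compiler error.
struct Context {
    planner: Option<Arc<dyn Planner>>,
}

impl Context {
    fn create_plan(&self, sql: &str) -> Option<String> {
        self.planner.as_ref().map(|p| p.plan(sql))
    }
}

// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

// Spawns `n` threads that each lock the shared context and create a plan.
fn plan_on_threads(n: usize) -> Vec<Option<String>> {
    // Fails to compile if a non-thread-safe field is ever added to `Context`.
    assert_send_sync::<Context>();

    let ctx = Arc::new(Mutex::new(Context {
        planner: Some(Arc::new(SimplePlanner)),
    }));
    let handles: Vec<thread::JoinHandle<Option<String>>> = (0..n)
        .map(|_| Arc::clone(&ctx))
        .map(|ctx| {
            thread::spawn(move || {
                let guard = ctx.lock().expect("lock poisoned");
                guard.create_plan("SELECT 1")
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("thread panicked"))
        .collect()
}

fn main() {
    for plan in plan_on_threads(2) {
        println!("{:?}", plan);
    }
}
```

If the `Send + Sync` supertraits are removed from the hypothetical `Planner` trait, the `assert_send_sync::<Context>()` line should stop compiling, which is the kind of early failure such a test is meant to give.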


was (Author: alamb):
This was previously attempted on ARROW-9425 but it seems like the behavior was 
reverted in some subsequent PR

It will be good to have a test to prevent such reversions

> [Rust] [DataFusion] ExecutionContext can not be shared between threads
> --
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to formulate themselves (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because several `dyn` objects are 
> not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = 

[jira] [Updated] (ARROW-9888) [Rust] [DataFusion] Allow ExecutionContext to be shared between threads again

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9888:
---
Priority: Minor  (was: Major)

> [Rust] [DataFusion] Allow ExecutionContext to be shared between threads again
> -
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to formulate themselves (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because several `dyn` objects are 
> not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option<std::sync::Arc<(dyn execution::physical_plan::PhysicalPlanner + 'static)>>`
> = note: required because it appears within the type 
> `execution::context::ExecutionConfig`
>  

[jira] [Updated] (ARROW-9888) [Rust] [DataFusion] Allow ExecutionContext to be shared between threads again

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9888:
---
Issue Type: Bug  (was: New Feature)

> [Rust] [DataFusion] Allow ExecutionContext to be shared between threads again
> -
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to formulate themselves (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because several `dyn` objects are 
> not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option<std::sync::Arc<(dyn execution::physical_plan::PhysicalPlanner + 'static)>>`
> = note: required because it appears within the type 
> 

[jira] [Updated] (ARROW-9888) [Rust] [DataFusion] ExecutionContext can not be shared between threads

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9888:
---
Summary: [Rust] [DataFusion] ExecutionContext can not be shared between 
threads  (was: [Rust] [DataFusion] ExecutionContext can not be shared between 
threads again)

> [Rust] [DataFusion] ExecutionContext can not be shared between threads
> --
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to formulate themselves (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because several `dyn` objects are 
> not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option 

[jira] [Resolved] (ARROW-8383) [Rust] Easier random access to DictionaryArray keys and values

2020-08-31 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan resolved ARROW-8383.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8063
[https://github.com/apache/arrow/pull/8063]

> [Rust] Easier random access to DictionaryArray keys and values
> --
>
> Key: ARROW-8383
> URL: https://issues.apache.org/jira/browse/ARROW-8383
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently it's not that clear how to access DictionaryArray keys and values 
> using random indices.
> The `DictionaryArray::keys` method exposes an Iterator with an `nth` method, 
> but this requires a mut reference and feels a little bit out of place 
> compared to other methods of accessing arrow data.
> Another alternative seems to be to use the `From for 
> PrimitiveArray` conversion like so `let keys : Int16Array = 
> dictionary_array.data().into()`. This seems to work fine but is not easily 
> discoverable and also needs to be done outside of any loops for performance 
> reasons.
> I'd like methods on `DictionaryArray` to directly get the key at some index
> ```
>  pub fn key(&self, i: usize) -> 
> ```
> Ideally I'd also like an easier way to directly access values at some index, 
> at least when those are primitive or string types.
> ```
> pub fn value(&self, i: usize) -> 
> ```
> I'm not sure how or if that would be possible to implement with rust generics.
>  
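
One way the requested accessors might be expressed with generics is sketched below. It is a simplified model under stated assumptions, not the arrow crate's `DictionaryArray`: `DictArray`, its `keys`/`values` fields and `sample_dict` are hypothetical, with null keys modelled as `None` and values kept in a plain `Vec`.

```rust
use std::convert::TryInto;

// Hypothetical, simplified dictionary-encoded array: each slot stores a key
// that indexes into a shared `values` vector. `None` marks a null slot.
struct DictArray<K, V> {
    keys: Vec<Option<K>>,
    values: Vec<V>,
}

impl<K, V> DictArray<K, V>
where
    K: Copy + TryInto<usize>,
{
    // Key stored at slot `i`; `None` if the slot is null or out of range.
    fn key(&self, i: usize) -> Option<K> {
        self.keys.get(i).copied().flatten()
    }

    // Value referenced by the key at slot `i`; `None` if the slot is null,
    // out of range, or the key does not index into `values`.
    fn value(&self, i: usize) -> Option<&V> {
        let idx: usize = self.key(i)?.try_into().ok()?;
        self.values.get(idx)
    }
}

// Small example array: slot 0 -> "b", slot 1 -> null, slot 2 -> "a".
fn sample_dict() -> DictArray<i16, &'static str> {
    DictArray {
        keys: vec![Some(1), None, Some(0)],
        values: vec!["a", "b"],
    }
}

fn main() {
    let dict = sample_dict();
    for i in 0..3 {
        println!("slot {}: key={:?} value={:?}", i, dict.key(i), dict.value(i));
    }
}
```

Bounding the key type by `TryInto<usize>` keeps the sketch generic over signed and unsigned integer keys; a negative key simply yields `None` from `value`.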





[jira] [Assigned] (ARROW-9888) [Rust] [DataFusion] Allow ExecutionContext to be shared between threads

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-9888:
--

Assignee: Andrew Lamb

> [Rust] [DataFusion] Allow ExecutionContext to be shared between threads
> ---
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. When a project is planning a large number of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to formulate themselves (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because several `dyn` objects are 
> not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option<std::sync::Arc<(dyn execution::physical_plan::PhysicalPlanner + 'static)>>`
> = note: required because it appears within the type 
> `execution::context::ExecutionConfig`
> = 

[jira] [Assigned] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-08-31 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9874:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Assigned] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-08-31 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9874:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Updated] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9874:
--
Labels: pull-request-available  (was: )

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Assigned] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-08-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9874:
-

Assignee: Antoine Pitrou

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Updated] (ARROW-9874) [C++] NewStreamWriter / NewFileWriter don't own output stream

2020-08-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9874:
--
Priority: Minor  (was: Major)

> [C++] NewStreamWriter / NewFileWriter don't own output stream
> -
>
> Key: ARROW-9874
> URL: https://issues.apache.org/jira/browse/ARROW-9874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 2.0.0
>
>
> Also the naming doesn't follow usual conventions for factories (e.g. 
> {{MakeStreamWriter}}).





[jira] [Updated] (ARROW-9382) [Rust] [DataFusion] Can not group by boolean columns (add boolean to valid keys of groupBy)

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9382:
---
Description: 
Currently we do not support boolean columns on groupBy.

Here is a reproducer:

{code}
alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
 Running `target/debug/datafusion-cli`
> create external table test(c1 boolean) stored as CSV location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(c1), c1 from test group by c1;
ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
{code}

The expected result is 
{code}
1, false
{code}

  was:Currently we do not support boolean columns on groupBy.


> [Rust] [DataFusion] Can not group by boolean columns (add  boolean to valid 
> keys of groupBy)
> 
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.
> Here is a reproducer:
> {code}
> alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
> alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
> Finished dev [unoptimized + debuginfo] target(s) in 0.14s
>  Running `target/debug/datafusion-cli`
> > create external table test(c1 boolean) stored as CSV location 
> > '/tmp/foo.csv';
> 0 rows in set. Query took 0 seconds.
> > select count(c1), c1 from test group by c1;
> ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
> {code}
> The expected result is 
> {code}
> 1, false
> {code}
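
The semantics expected from the query above can be sketched in plain Rust. This is an illustration only, assuming a column modelled as a slice of `Option<bool>` (`None` for a null cell); `count_group_by_bool` is a hypothetical helper and says nothing about how DataFusion builds its group-by keys internally.

```rust
use std::collections::HashMap;

// Models `SELECT count(c1), c1 FROM test GROUP BY c1` for a boolean column.
// Each distinct key (true, false, or null) forms one group, and `count(c1)`
// counts only the non-null cells in that group.
fn count_group_by_bool(column: &[Option<bool>]) -> HashMap<Option<bool>, u64> {
    let mut groups = HashMap::new();
    for cell in column {
        let count = groups.entry(*cell).or_insert(0u64);
        if cell.is_some() {
            *count += 1;
        }
    }
    groups
}

fn main() {
    // Same data as the reproducer: a single row containing `false`.
    let groups = count_group_by_bool(&[Some(false)]);
    for (key, count) in &groups {
        println!("{}, {:?}", count, key);
    }
}
```

Because `bool` already implements `Hash` and `Eq`, it can serve directly as a hash-map key here, which is why a boolean grouping column seems like a natural key type to support.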





[jira] [Comment Edited] (ARROW-9382) [Rust] [DataFusion] Can not group by boolean columns (add boolean to valid keys of groupBy)

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187717#comment-17187717
 ] 

Andrew Lamb edited comment on ARROW-9382 at 8/31/20, 1:02 PM:
--

Here is a reproducer:

{code}
alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
 Running `target/debug/datafusion-cli`
> create external table test(c1 boolean) stored as CSV location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(c1), c1 from test group by c1;
ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
{code}


was (Author: alamb):
Here is a reproducer:

{code}
alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
 Running `target/debug/datafusion-cli`
> create external table test(c1 boolean) stored as CSV location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(c1), c1 from test group by c1;
ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
{code}

> [Rust] [DataFusion] Can not group by boolean columns (add  boolean to valid 
> keys of groupBy)
> 
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.





[jira] [Issue Comment Deleted] (ARROW-9382) [Rust] [DataFusion] Can not group by boolean columns (add boolean to valid keys of groupBy)

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9382:
---
Comment: was deleted

(was: Here is a reproducer:

{code}
alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
 Running `target/debug/datafusion-cli`
> create external table test(c1 boolean) stored as CSV location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(c1), c1 from test group by c1;
ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
{code})

> [Rust] [DataFusion] Can not group by boolean columns (add  boolean to valid 
> keys of groupBy)
> 
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.





[jira] [Updated] (ARROW-9382) [Rust] Add boolean to valid keys of groupBy

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9382:
---
Issue Type: Bug  (was: Improvement)

> [Rust] Add boolean to valid keys of groupBy
> ---
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.





[jira] [Commented] (ARROW-9382) [Rust] Add boolean to valid keys of groupBy

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187717#comment-17187717
 ] 

Andrew Lamb commented on ARROW-9382:


Here is a reproducer:

{code}
alamb@MacBook-Pro:~/Software/arrow/rust$ echo "false" > /tmp/foo.csv
alamb@MacBook-Pro:~/Software/arrow/rust$ cargo run --bin datafusion-cli
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
 Running `target/debug/datafusion-cli`
> create external table test(c1 boolean) stored as CSV location '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(c1), c1 from test group by c1;
ArrowError(ExternalError(ExecutionError("Unsupported GROUP BY data type")))
{code}

> [Rust] Add boolean to valid keys of groupBy
> ---
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.





[jira] [Updated] (ARROW-9382) [Rust] [DataFusion] Can not group by boolean columns (add boolean to valid keys of groupBy)

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9382:
---
Summary: [Rust] [DataFusion] Can not group by boolean columns (add  boolean 
to valid keys of groupBy)  (was: [Rust] Add boolean to valid keys of groupBy)

> [Rust] [DataFusion] Can not group by boolean columns (add  boolean to valid 
> keys of groupBy)
> 
>
> Key: ARROW-9382
> URL: https://issues.apache.org/jira/browse/ARROW-9382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently we do not support boolean columns on groupBy.





[jira] [Commented] (ARROW-7731) [C++][Parquet] Support LargeListArray

2020-08-31 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187716#comment-17187716
 ] 

Artem KOZHEVNIKOV commented on ARROW-7731:
--

Yes, I confirm that writing is OK now, but reading is still broken!

> [C++][Parquet] Support LargeListArray
> -
>
> Key: ARROW-7731
> URL: https://issues.apache.org/jira/browse/ARROW-7731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: marc abboud
>Priority: Major
>  Labels: parquet
> Fix For: 2.0.0
>
>
> For now it's not possible to write a pyarrow.Table containing a 
> LargeListArray in parquet. The lines
> {code:java}
> from pyarrow import parquet
> import pyarrow as pa
> indices = [1, 2, 3]
> indptr = [0, 1, 2, 3]
> q = pa.lib.LargeListArray.from_arrays(indptr, indices) 
> table = pa.Table.from_arrays([q], names=['no']) 
> parquet.write_table(table, '/test'){code}
> yields the error 
> {code:java}
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
> conversion: large_list
> {code}
>  





[jira] [Updated] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9889:
--
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Updated] (ARROW-9587) [FlightRPC][Java] Clean up DoPut/FlightStream memory handling

2020-08-31 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-9587:

Description: 
We've been running into issues with DoPut in Java. In particular:
 * Closing a FlightStream without draining it should not send a cancellation to 
the other side (or should send the cancellation, but also drain the queue). A 
server will have sent an explicit error message, or will simply just not want 
to read the entire stream. A client should explicitly cancel/gRPC will cancel 
for you anyways when  you end the call. Also, the gRPC call may already have 
ended and cancelling the call may result in a runtime exception.
 * Cancelling a FlightStream explicitly should not immediately mark the stream 
as completed - it should wait for gRPC to acknowledge the cancellation as there 
may be undelivered messages.
 * Make sure there is no race between the gRPC observer in the FlightStream and 
the consumer. (Ideally the only way for a FlightStream to end is for the 
observer to end the stream; that does open us up to the possibility of a 
FlightStream being stuck forever for servers that do not respect cancellation.)
 * The server should close/clean up things properly in DoPut (it should act 
like DoExchange and tie closing of the stream to the onCompleted/onError 
callbacks). Otherwise trying to use it with ARROW-9586 becomes impossible (you 
need to close the FlightStream before ending the call, or you'll close the 
per-call allocator before you close the FlightStream)

I think this also ties into flakiness in unit tests.

  was:
We've been running into issues with DoPut in Java. In particular:
 * Closing a FlightStream without draining it should not send a cancellation to 
the other side. A server will have sent an explicit error message, or will 
simply just not want to read the entire stream. A client should explicitly 
cancel/gRPC will cancel for you anyways when  you end the call. Also, the gRPC 
call may already have ended and cancelling the call may result in a runtime 
exception.
 * The server should close/clean up things properly in DoPut (it should act 
like DoExchange and tie closing of the stream to the onCompleted/onError 
callbacks). Otherwise trying to use it with ARROW-9586 becomes impossible (you 
need to close the FlightStream before ending the call, or you'll close the 
per-call allocator before you close the FlightStream)

I think this also ties into flakiness in unit tests.


> [FlightRPC][Java] Clean up DoPut/FlightStream memory handling
> -
>
> Key: ARROW-9587
> URL: https://issues.apache.org/jira/browse/ARROW-9587
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We've been running into issues with DoPut in Java. In particular:
>  * Closing a FlightStream without draining it should not send a cancellation 
> to the other side (or should send the cancellation, but also drain the 
> queue). A server will have sent an explicit error message, or will simply 
> just not want to read the entire stream. A client should explicitly 
> cancel/gRPC will cancel for you anyways when  you end the call. Also, the 
> gRPC call may already have ended and cancelling the call may result in a 
> runtime exception.
>  * Cancelling a FlightStream explicitly should not immediately mark the 
> stream as completed - it should wait for gRPC to acknowledge the cancellation 
> as there may be undelivered messages.
>  * Make sure there is no race between the gRPC observer in the FlightStream 
> and the consumer. (Ideally the only way for a FlightStream to end is for the 
> observer to end the stream; that does open us up to the possibility of a 
> FlightStream being stuck forever for servers that do not respect 
> cancellation.)
>  * The server should close/clean up things properly in DoPut (it should act 
> like DoExchange and tie closing of the stream to the onCompleted/onError 
> callbacks). Otherwise trying to use it with ARROW-9586 becomes impossible 
> (you need to close the FlightStream before ending the call, or you'll close 
> the per-call allocator before you close the FlightStream)
> I think this also ties into flakiness in unit tests.





[jira] [Commented] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187680#comment-17187680
 ] 

Andrew Lamb commented on ARROW-9889:


I think the issue is that the CLI actually tries to execute the "create table" 
logical plan, and the physical planner hasn't implemented that type.
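
One possible shape for a fix, shown only as a sketch under stated assumptions: intercept the DDL plan before it reaches the physical planner. The `LogicalPlan` variants, `plan_physical`, and `run` below are hypothetical toy definitions written for this example, not DataFusion's API.

```rust
/// Toy logical plan; the variant names are illustrative, not DataFusion's.
#[derive(Debug)]
pub enum LogicalPlan {
    CreateExternalTable { name: String, location: String },
    Select { sql: String },
}

/// Toy physical planner that, like the one described above, only
/// understands queries and rejects everything else.
pub fn plan_physical(plan: &LogicalPlan) -> Result<String, String> {
    match plan {
        LogicalPlan::Select { sql } => Ok(format!("execute: {}", sql)),
        _ => Err("Unsupported logical plan variant".to_string()),
    }
}

/// Handle DDL up front; only queries are passed on to physical planning.
/// Registered table names are collected in `catalog`.
pub fn run(plan: &LogicalPlan, catalog: &mut Vec<String>) -> Result<String, String> {
    match plan {
        LogicalPlan::CreateExternalTable { name, location } => {
            catalog.push(name.clone());
            Ok(format!("registered {} at {}", name, location))
        }
        other => plan_physical(other),
    }
}
```

In this toy model, passing the "create table" plan straight to `plan_physical` reproduces the reported error string, while `run` registers the table instead.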

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Comment Edited] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187680#comment-17187680
 ] 

Andrew Lamb edited comment on ARROW-9889 at 8/31/20, 12:15 PM:
---

I think the issue is that the CLI actually tries to execute (collect) the 
"create table" logical plan, and the physical planner hasn't implemented that 
type.


was (Author: alamb):
I think the issue is that the CLI actually tries to execute the "create table" 
logical plan, and the physical planner hasn't implemented that type.

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Assigned] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-9889:
--

Assignee: Andrew Lamb

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Resolved] (ARROW-9677) [Rust] [DataFusion] Aggregate queries that don't include group column in select list error with "Projection references non-aggregate values"

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-9677.

Resolution: Duplicate

This is a dupe of ARROW-9520

> [Rust] [DataFusion] Aggregate queries that don't include group column in 
> select list error with "Projection references non-aggregate values"
> 
>
> Key: ARROW-9677
> URL: https://issues.apache.org/jira/browse/ARROW-9677
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>
> .





[jira] [Updated] (ARROW-9677) [Rust] [DataFusion] Aggregate queries that don't include group column in select list error with "Projection references non-aggregate values"

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9677:
---
Description: .  (was: Reproducer (using the aggregate_test_100.csv data 
from the tests):

{code}
CREATE EXTERNAL TABLE aggregate_test_100 (
c1  VARCHAR NOT NULL,
c2  INT NOT NULL,
c3  SMALLINT NOT NULL,
c4  SMALLINT NOT NULL,
c5  INT NOT NULL,
c6  BIGINT NOT NULL,
c7  SMALLINT NOT NULL,
c8  INT NOT NULL,
c9  BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION 'arrow/testing/data/csv/aggregate_test_100.csv';
{code}

And then run this query:
{code}
> select min(c3) from aggregate_test_100 group by c1;
{code}

h2. Actual behavior: Error

{code}
> select min(c3) from aggregate_test_100 group by c1;
General("Projection references non-aggregate values")
{code}

h2. Expected behavior: Results

{code}
+-+
| min(c3) |
+-+
| -101|
| -95 |
| -99 |
| -117|
| -117|
+-+
{code}


Note, If you include the group key, c1, in the select list, then it does work:
{code}
> select min(c3), c1 from aggregate_test_100 group by c1;
+-++
| min(c3) | c1 |
+-++
| -101| a  |
| -95 | e  |
| -99 | d  |
| -117| c  |
| -117| b  |
+-++
5 row in set. Query took 0 seconds.
{code}

Typically, handling this kind of query requires that c1 is carried up in the 
plan but hidden in the final selection.
)
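
The idea of carrying the group key through the plan and hiding it in the final selection can be sketched as two steps. This is a toy model over in-memory rows, not DataFusion code; `min_by_group` and `project_min_only` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Aggregate step: min(c3) per c1. The group key is kept in the output
/// even though the query's select list does not mention it.
pub fn min_by_group(rows: &[(&str, i32)]) -> BTreeMap<String, i32> {
    let mut mins: BTreeMap<String, i32> = BTreeMap::new();
    for (c1, c3) in rows {
        mins.entry(c1.to_string())
            .and_modify(|m| *m = (*m).min(*c3))
            .or_insert(*c3);
    }
    mins
}

/// Final projection: keep only the aggregate column and drop the group key.
pub fn project_min_only(grouped: &BTreeMap<String, i32>) -> Vec<i32> {
    grouped.values().copied().collect()
}
```

Here `min_by_group` corresponds to `select min(c3), c1 ... group by c1`, and applying `project_min_only` afterwards yields the result expected for `select min(c3) ... group by c1`.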

> [Rust] [DataFusion] Aggregate queries that don't include group column in 
> select list error with "Projection references non-aggregate values"
> 
>
> Key: ARROW-9677
> URL: https://issues.apache.org/jira/browse/ARROW-9677
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>
> .





[jira] [Commented] (ARROW-9753) [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait

2020-08-31 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187657#comment-17187657
 ] 

Andrew Lamb commented on ARROW-9753:


The Partition trait has been renamed / combined to be called ExecutionPlan, so 
I updated this ticket's description to reflect this

> [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait
> --
>
> Key: ARROW-9753
> URL: https://issues.apache.org/jira/browse/ARROW-9753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The ExecutionPlan trait should not return Arc<Mutex<...>> but 
> just Arc<...> since most operators do not need to be mutable. 
> Those that do can use interior mutability.
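
A minimal sketch of the interior-mutability pattern mentioned above. The `Operator` trait and the `Constant` and `Countdown` types are hypothetical stand-ins written for this example (the `Send + Sync` bound is also an assumption); they are not DataFusion types.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an operator trait. Methods take `&self`, so the
/// trait object can be shared as a plain `Arc` with no outer `Mutex`.
pub trait Operator: Send + Sync {
    fn next_value(&self) -> Option<u32>;
}

/// A stateless operator needs no locking at all.
pub struct Constant(pub u32);

impl Operator for Constant {
    fn next_value(&self) -> Option<u32> {
        Some(self.0)
    }
}

/// A stateful operator keeps its own `Mutex` internally (interior mutability).
pub struct Countdown {
    remaining: Mutex<u32>,
}

impl Countdown {
    pub fn new(start: u32) -> Self {
        Countdown { remaining: Mutex::new(start) }
    }
}

impl Operator for Countdown {
    fn next_value(&self) -> Option<u32> {
        let mut remaining = self.remaining.lock().expect("lock poisoned");
        if *remaining == 0 {
            None
        } else {
            *remaining -= 1;
            Some(*remaining)
        }
    }
}

/// Callers hold `Arc<dyn Operator>` whether or not the operator is stateful.
pub fn drain(op: Arc<dyn Operator>, limit: usize) -> Vec<u32> {
    let mut out = Vec::new();
    while out.len() < limit {
        match op.next_value() {
            Some(v) => out.push(v),
            None => break,
        }
    }
    out
}
```

Only `Countdown` pays for a lock; `Constant` is shared through the same `Arc<dyn Operator>` type without one.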





[jira] [Updated] (ARROW-9753) [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9753:
---
Summary: [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait 
 (was: [Rust] [DataFusion] Remove the use of Mutex in Partition trait)

> [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait
> --
>
> Key: ARROW-9753
> URL: https://issues.apache.org/jira/browse/ARROW-9753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The Partition trait should not return Arc<Mutex<...>> but 
> just Arc<...> since most operators do not need to be mutable. 
> Those that do can use interior mutability.





[jira] [Updated] (ARROW-9753) [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9753:
---
Description: The ExecutionPlan trait should not return 
Arc<Mutex<...>> but just Arc<...> since most 
operators do not need to be mutable. Those that do can use interior mutability. 
 (was: The Partition trait should not return Arc<Mutex<...>> 
but just Arc<...> since most operators do not need to be 
mutable. Those that do can use interior mutability.)

> [Rust] [DataFusion] Remove the use of Mutex in ExecutionPlan trait
> --
>
> Key: ARROW-9753
> URL: https://issues.apache.org/jira/browse/ARROW-9753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The ExecutionPlan trait should not return Arc<Mutex<...>> but 
> just Arc<...> since most operators do not need to be mutable. 
> Those that do can use interior mutability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9838) [Rust] [DataFusion] DefaultPhysicalPlanner should insert explicit MergeExec nodes

2020-08-31 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9838:
---
Summary: [Rust] [DataFusion] DefaultPhysicalPlanner should insert explicit 
MergeExec nodes  (was: [Rust] [DataFusion] Physical planner should insert 
explicit MergeExec nodes)

> [Rust] [DataFusion] DefaultPhysicalPlanner should insert explicit MergeExec 
> nodes
> -
>
> Key: ARROW-9838
> URL: https://issues.apache.org/jira/browse/ARROW-9838
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Operators such as GlobalLimitExec, SortExec, and HashAggregateExec (in some 
> cases) require a single input partition. Rather than have these operators 
> perform their own merging of input partitions, the planner should insert 
> explicit MergeExec nodes into the physical plan, when needed.
>  
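
The planner-side merging described above can be sketched with a toy plan tree. The `Plan` enum and the `single_partition` and `plan_sort` helpers are hypothetical and only echo the operator names in the ticket; they are not DataFusion's types.

```rust
/// Toy physical plan; illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { partitions: usize },
    Merge(Box<Plan>),
    Sort(Box<Plan>),
}

impl Plan {
    /// Number of output partitions produced by this node.
    pub fn partitions(&self) -> usize {
        match self {
            Plan::Scan { partitions } => *partitions,
            Plan::Merge(_) => 1,
            Plan::Sort(input) => input.partitions(),
        }
    }
}

/// Planner helper: wrap `input` in a `Merge` only when it has more than
/// one partition.
pub fn single_partition(input: Plan) -> Plan {
    if input.partitions() > 1 {
        Plan::Merge(Box::new(input))
    } else {
        input
    }
}

/// Build a sort, letting the planner (not the sort operator) do the merging.
pub fn plan_sort(input: Plan) -> Plan {
    Plan::Sort(Box::new(single_partition(input)))
}
```

A four-partition scan becomes `Sort(Merge(Scan))`, while a single-partition scan becomes `Sort(Scan)` with no merge node inserted.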





[jira] [Updated] (ARROW-9888) [Rust] [DataFusion] Allow ExecutionContext to be shared between threads

2020-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9888:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Allow ExecutionContext to be shared between threads
> ---
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high-level idea is to allow ExecutionContext to be used in 
> multi-threaded environments such as Python.
> The two use-cases:
> 1. when a project is planning a complex set of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when the plans require reading 
> remote metadata to be formulated (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>      use arrow::array::{ArrayRef, Int32Array};
>      use arrow::compute::add;
>      use std::fs::File;
> -    use std::io::prelude::*;
> +    use std::{sync::Mutex, io::prelude::*};
>      use tempdir::TempDir;
>      use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>          Ok(())
>      }
>  
> +    #[test]
> +    fn send_context_to_threads() -> Result<()> {
> +        // ensure that ExecutionContext's can be read by multiple threads concurrently
> +        let tmp_dir = TempDir::new("send_context_to_threads")?;
> +        let partition_count = 4;
> +        let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, partition_count)?));
> +
> +        let threads: Vec<JoinHandle<Result<LogicalPlan>>> = (0..2)
> +            .map(|_| { ctx.clone() })
> +            .map(|ctx_clone| thread::spawn(move || {
> +                let ctx = ctx_clone.lock().expect("Locked context");
> +                // Ensure we can create logical plan code on a separate thread.
> +                ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 0 AND c1 < 3")
> +            }))
> +            .collect();
> +
> +        for thread in threads {
> +            thread.join().expect("Failed to join thread")?;
> +        }
> +        Ok(())
> +    }
> +
>      #[test]
>      fn scalar_udf() -> Result<()> {
>          let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because of the following error: 
> several `dyn` objects are not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option<std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>>`
> = note: required because it appears within the type 
> `execution::context::ExecutionConfig`
> = note: required 

[jira] [Created] (ARROW-9888) [Rust] [DataFusion] Allow ExecutionContext to be shared between threads

2020-08-31 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9888:
--

 Summary: [Rust] [DataFusion] Allow ExecutionContext to be shared 
between threads
 Key: ARROW-9888
 URL: https://issues.apache.org/jira/browse/ARROW-9888
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb


As suggested by Jorge on  https://github.com/apache/arrow/pull/8079

The high-level idea is to allow ExecutionContext to be used in multi-threaded 
environments such as Python.

The two use-cases:

1. when a project is planning a complex set of plans that depend on a common 
set of sources and UDFs, it would be nice to be able to multi-thread the 
planning. This is particularly important when the plans require reading remote 
metadata to be formulated (e.g. when the source is in S3 with many 
partitions). Metadata reading is often slow and network-bound, which makes 
threads suitable for these workloads. If multi-threading is not possible, 
either each plan needs to read the metadata independently (one context per 
plan) or planning must be sequential (with lots of network waiting).

2. when creating bindings to programming languages that support 
multi-threading, it would be nice for the ExecutionContext to be thread safe, 
so that we can more easily integrate with those languages.
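
To make the thread-safety requirement concrete, here is a minimal, self-contained sketch of a context that holds a trait object and can be shared between threads. Everything in it (`Planner`, `DefaultPlanner`, `Context`, `plan_on_threads`) is a hypothetical toy written for this example, not DataFusion code; the key assumption is that the trait object is declared `Send + Sync`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical planner trait. The `Send + Sync` supertraits are what make
/// `Arc<dyn Planner>` shareable across threads.
pub trait Planner: Send + Sync {
    fn plan(&self, sql: &str) -> String;
}

pub struct DefaultPlanner;

impl Planner for DefaultPlanner {
    fn plan(&self, sql: &str) -> String {
        format!("plan({})", sql)
    }
}

/// Hypothetical context that owns a planner trait object.
pub struct Context {
    planner: Arc<dyn Planner>,
}

impl Context {
    pub fn new() -> Self {
        Context { planner: Arc::new(DefaultPlanner) }
    }

    pub fn create_logical_plan(&self, sql: &str) -> String {
        self.planner.plan(sql)
    }
}

/// Plan the same query on `n` threads that share one context.
pub fn plan_on_threads(n: usize, sql: &'static str) -> Vec<String> {
    let ctx = Arc::new(Mutex::new(Context::new()));
    let handles: Vec<thread::JoinHandle<String>> = (0..n)
        .map(|_| ctx.clone())
        .map(|ctx_clone| {
            thread::spawn(move || {
                let ctx = ctx_clone.lock().expect("lock poisoned");
                ctx.create_logical_plan(sql)
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("thread panicked"))
        .collect()
}
```

Removing `Send + Sync` from the `Planner` trait makes `thread::spawn` reject the closure, because `Arc<dyn Planner>` is then no longer `Send`.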

The code might look like:
{code}
alamb@MacBook-Pro rust % git diff
diff --git a/rust/datafusion/src/execution/context.rs b/rust/datafusion/src/execution/context.rs
index 5f8aa342e..7374b0a78 100644
--- a/rust/datafusion/src/execution/context.rs
+++ b/rust/datafusion/src/execution/context.rs
@@ -460,7 +460,7 @@ mod tests {
     use arrow::array::{ArrayRef, Int32Array};
     use arrow::compute::add;
     use std::fs::File;
-    use std::io::prelude::*;
+    use std::{sync::Mutex, io::prelude::*};
     use tempdir::TempDir;
     use test::*;
 
@@ -928,6 +928,28 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn send_context_to_threads() -> Result<()> {
+        // ensure that ExecutionContext's can be read by multiple threads concurrently
+        let tmp_dir = TempDir::new("send_context_to_threads")?;
+        let partition_count = 4;
+        let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, partition_count)?));
+
+        let threads: Vec<JoinHandle<Result<LogicalPlan>>> = (0..2)
+            .map(|_| { ctx.clone() })
+            .map(|ctx_clone| thread::spawn(move || {
+                let ctx = ctx_clone.lock().expect("Locked context");
+                // Ensure we can create logical plan code on a separate thread.
+                ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 0 AND c1 < 3")
+            }))
+            .collect();
+
+        for thread in threads {
+            thread.join().expect("Failed to join thread")?;
+        }
+        Ok(())
+    }
+
     #[test]
     fn scalar_udf() -> Result<()> {
         let schema = Schema::new(vec![
{code}


At the moment, Rust refuses to compile this example (and also refuses to share 
ExecutionContexts between threads) because of the following error: several 
`dyn` objects are not marked as Send + Sync:

{code}
   Compiling datafusion v2.0.0-SNAPSHOT 
(/Users/alamb/Software/arrow/rust/datafusion)
error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
cannot be sent between threads safely
   --> datafusion/src/execution/context.rs:940:30
|
940 | .map(|ctx_clone| thread::spawn(move || {
|  ^ `(dyn 
execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
threads safely
| 
   ::: 
/Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
|
616 | F: Send + 'static,
| required by this bound in `std::thread::spawn`
|
= help: the trait `std::marker::Send` is not implemented for `(dyn 
execution::physical_plan::PhysicalPlanner + 'static)`
= note: required because of the requirements on the impl of 
`std::marker::Send` for `std::sync::Arc<(dyn 
execution::physical_plan::PhysicalPlanner + 'static)>`
= note: required because it appears within the type 
`std::option::Option<std::sync::Arc<(dyn 
execution::physical_plan::PhysicalPlanner + 'static)>>`
= note: required because it appears within the type 
`execution::context::ExecutionConfig`
= note: required because it appears within the type 
`execution::context::ExecutionContextState`
= note: required because it appears within the type 
`execution::context::ExecutionContext`
= note: required because of the requirements on the impl of 
`std::marker::Send` for `std::sync::Mutex<execution::context::ExecutionContext>`
= note: required because of the requirements on the impl of 
`std::marker::Send` for 
`std::sync::Arc<std::sync::Mutex<execution::context::ExecutionContext>>`
= note: required because it appears within the type 
`[closure@datafusion/src/execution/context.rs:940:44: 944:14 
ctx_clone:std::sync::Arc<std::sync::Mutex<execution::context::ExecutionContext>>]`

error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner +