[jira] [Created] (ARROW-10683) [Rust] Remove Array.data method in favor of .data_ref to make performance impact of clone more obvious

2020-11-21 Thread Jira
Jörn Horstmann created ARROW-10683:
--

 Summary: [Rust] Remove Array.data method in favor of .data_ref to 
make performance impact of clone more obvious
 Key: ARROW-10683
 URL: https://issues.apache.org/jira/browse/ARROW-10683
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jörn Horstmann


The `Array.data()` method is a real performance foot-gun since it clones an 
`Arc`, which is not obvious to users. When used in inner loops, this can have 
a big performance impact. The cloning itself might not be the problem, but I 
think it sometimes prohibits other compiler optimizations.

It would be better to remove this method and let users call `data_ref()` and 
only clone when really needed.

Most of the current usages seem to be in test assertions which should be easy 
to refactor.
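
As a rough illustration of the difference described above, here is a minimal 
sketch using only the standard library. `SharedData` and `SharedArray` are 
hypothetical stand-ins, not the actual Arrow types; they only mirror the shape 
of a `data()` method that clones an `Arc` and a `data_ref()` method that 
borrows it.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for Arrow's `ArrayData` (assumption, not the real type).
struct SharedData {
    values: Vec<i64>,
}

/// Hypothetical stand-in for an array that shares its data behind an `Arc`.
struct SharedArray {
    data: Arc<SharedData>,
}

impl SharedArray {
    fn new(values: Vec<i64>) -> Self {
        SharedArray {
            data: Arc::new(SharedData { values }),
        }
    }

    /// Shaped like `data()`: each call clones the `Arc`, which is an atomic
    /// reference-count increment, followed by a decrement when the clone drops.
    fn data(&self) -> Arc<SharedData> {
        self.data.clone()
    }

    /// Shaped like `data_ref()`: borrows the `Arc` without touching the count.
    fn data_ref(&self) -> &Arc<SharedData> {
        &self.data
    }
}

/// Clones the `Arc` once per element, the pattern the issue warns about.
fn sum_with_clone(array: &SharedArray) -> i64 {
    let mut total = 0;
    for i in 0..array.data_ref().values.len() {
        total += array.data().values[i];
    }
    total
}

/// Borrows once outside the loop; no reference counting inside it.
fn sum_with_ref(array: &SharedArray) -> i64 {
    let data = array.data_ref();
    data.values.iter().sum()
}
```

Both functions return the same sum; the difference is only in how often the 
reference count is touched. A caller that really needs shared ownership can 
still write `array.data_ref().clone()`, which makes the cost visible at the 
call site.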





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10682) [Rust] Sort kernel performance tuning

2020-11-21 Thread Jira
Jörn Horstmann created ARROW-10682:
--

 Summary: [Rust] Sort kernel performance tuning
 Key: ARROW-10682
 URL: https://issues.apache.org/jira/browse/ARROW-10682
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jörn Horstmann


The `is_valid` calls inside the lexical comparator are not being inlined 
because calls go through a `dyn Array`. We can instead call the `is_valid` 
function of `ArrayData` directly.
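
The idea can be sketched roughly as follows. All names here (`ValidityData`, 
`DynArray`, `ToyInt32Array`, `compare_rows`, `sort_to_indices`) are 
hypothetical and only mimic the shape of the real types; the nulls-first 
ordering is also an assumption made for the example. The point is that the 
comparator calls `is_valid` on a concrete struct rather than through a trait 
object.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for `ArrayData`: values plus a validity mask.
struct ValidityData {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ValidityData {
    /// Inherent method on a concrete type: statically dispatched,
    /// so the compiler is free to inline it into the comparator.
    #[inline]
    fn is_valid(&self, i: usize) -> bool {
        self.validity[i]
    }
}

/// Hypothetical stand-in for the `Array` trait.
trait DynArray {
    fn data_ref(&self) -> &ValidityData;

    /// When called on a `dyn DynArray`, this goes through the vtable.
    fn is_valid(&self, i: usize) -> bool {
        self.data_ref().is_valid(i)
    }
}

struct ToyInt32Array {
    data: ValidityData,
}

impl DynArray for ToyInt32Array {
    fn data_ref(&self) -> &ValidityData {
        &self.data
    }
}

/// Compares two rows, nulls first, checking validity on the concrete data.
fn compare_rows(data: &ValidityData, a: usize, b: usize) -> Ordering {
    match (data.is_valid(a), data.is_valid(b)) {
        (true, true) => data.values[a].cmp(&data.values[b]),
        (false, true) => Ordering::Less,
        (true, false) => Ordering::Greater,
        (false, false) => Ordering::Equal,
    }
}

/// Returns row indices in sorted order. The only dynamic call is the single
/// `data_ref()` made before sorting; the comparator itself uses static dispatch.
fn sort_to_indices(array: &dyn DynArray) -> Vec<usize> {
    let data = array.data_ref();
    let mut indices: Vec<usize> = (0..data.values.len()).collect();
    indices.sort_by(|&a, &b| compare_rows(data, a, b));
    indices
}
```

Whether the call is actually inlined is up to the compiler, so any speedup 
would need to be confirmed with a benchmark.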





[jira] [Created] (ARROW-10681) [Rust] [DataFusion] TPC-H Query 12 fails with scheduler error

2020-11-21 Thread Andy Grove (Jira)
Andy Grove created ARROW-10681:
--

 Summary: [Rust] [DataFusion] TPC-H Query 12 fails with scheduler 
error
 Key: ARROW-10681
 URL: https://issues.apache.org/jira/browse/ARROW-10681
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


 
{code:java}
Running benchmarks with the following options: BenchmarkOpt { query: 12, debug: 
false, iterations: 1, concurrency: 2, batch_size: 4096, path: 
"/mnt/tpch/tbl-sf1/", file_format: "tbl", mem_table: false }

thread 'main' panicked at 'must be called from the context of Tokio runtime 
configured with either `basic_scheduler` or `threaded_scheduler`', 
datafusion/src/physical_plan/hash_aggregate.rs:368:9
 {code}





[jira] [Created] (ARROW-10680) [Rust] [DataFusion] Implement TPC-H Query 12

2020-11-21 Thread Andy Grove (Jira)
Andy Grove created ARROW-10680:
--

 Summary: [Rust] [DataFusion] Implement TPC-H Query 12
 Key: ARROW-10680
 URL: https://issues.apache.org/jira/browse/ARROW-10680
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove


Implement TPC-H Query 12 so that we can test JOIN support more fully. We will 
need to fake some parts for now because we don't support all the expressions in 
this query.





[jira] [Created] (ARROW-10679) [Rust] [DataFusion] Implement SQL CASE WHEN expression

2020-11-21 Thread Andy Grove (Jira)
Andy Grove created ARROW-10679:
--

 Summary: [Rust] [DataFusion] Implement SQL CASE WHEN expression
 Key: ARROW-10679
 URL: https://issues.apache.org/jira/browse/ARROW-10679
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Implement SQL CASE WHEN expression so that we can support TPC-H query 12 fully.

 

Postgres: [https://www.postgresqltutorial.com/postgresql-case/]

Spark: [http://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-case.html]
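
For reference, the row-level semantics of a searched CASE expression can be 
sketched as below. This is not DataFusion code: `WhenThen`, `eval_case` and 
`high_line_count` are hypothetical names, and the sketch evaluates one row at 
a time with pre-computed conditions, whereas a real engine would work on 
whole columns. It assumes standard SQL behaviour: the first arm whose 
condition is true wins, a NULL condition does not match, and a missing ELSE 
yields NULL.

```rust
/// One `WHEN condition THEN result` arm, already evaluated for a single row.
/// `condition` is `None` when the predicate evaluated to SQL NULL.
struct WhenThen<T> {
    condition: Option<bool>,
    result: Option<T>,
}

/// Evaluates `CASE WHEN ... THEN ... [ELSE ...] END` for one row.
/// Pass `None` as `else_result` to model a CASE without an ELSE branch.
fn eval_case<T: Clone>(arms: &[WhenThen<T>], else_result: Option<T>) -> Option<T> {
    for arm in arms {
        // Only a condition that is definitely true selects the arm;
        // false and NULL both fall through to the next one.
        if arm.condition == Some(true) {
            return arm.result.clone();
        }
    }
    else_result
}

/// Row-by-row version of the kind of aggregate used in TPC-H query 12:
/// SUM(CASE WHEN p = '1-URGENT' OR p = '2-HIGH' THEN 1 ELSE 0 END).
fn high_line_count(priorities: &[&str]) -> i64 {
    priorities
        .iter()
        .map(|p| {
            let arms = [WhenThen {
                condition: Some(*p == "1-URGENT" || *p == "2-HIGH"),
                result: Some(1i64),
            }];
            eval_case(&arms, Some(0)).unwrap_or(0)
        })
        .sum()
}
```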

 





[jira] [Created] (ARROW-10678) [Python] pyarrow2.0.0 flight test crash on macOS

2020-11-21 Thread BinbinLiang (Jira)
BinbinLiang created ARROW-10678:
---

 Summary: [Python] pyarrow2.0.0 flight test crash on macOS
 Key: ARROW-10678
 URL: https://issues.apache.org/jira/browse/ARROW-10678
 Project: Apache Arrow
  Issue Type: Bug
 Environment: OS:mac os  catalina 10.15.6;
Python version: python3.7.2;
pyarrow: 2.0.0
Reporter: BinbinLiang
 Attachments: image-2020-11-21-23-31-40-211.png, 
image-2020-11-21-23-32-35-346.png, image-2020-11-21-23-33-30-170.png

 * When I use the pyarrow Flight client to get remote data, I encounter an 
"assertion failed" problem:  
!image-2020-11-21-23-32-35-346.png|width=1101,height=33!
 * I use the Flight example code (client.py & server.py): 
[https://github.com/apache/arrow/tree/master/python/examples/flight]
 * It seems to be a gRPC bug, similar to a previous problem (also on macOS): 
https://issues.apache.org/jira/browse/ARROW-7689 .
The corresponding gRPC issue is: [https://github.com/grpc/grpc/issues/20311]
 * In my project, the version of pyarrow is 2.0.0 and the version of grpcio is 
1.33.2. Both are the latest versions.
!image-2020-11-21-23-33-30-170.png|width=228,height=186! 
 * When I change the version of pyarrow from 2.0.0 to 0.17.1, the problem 
disappears.





[jira] [Created] (ARROW-10677) [Rust] Add tests as documentation showing supported csv parsing

2020-11-21 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10677:
---

 Summary: [Rust] Add tests as documentation showing supported csv 
parsing
 Key: ARROW-10677
 URL: https://issues.apache.org/jira/browse/ARROW-10677
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb


ARROW-10654 (https://github.com/apache/arrow/pull/8714/files#) added some 
specialized parsing for the csv reader and, among other things, added 
additional boolean parsing support. 

We should add some tests that document which boolean types are supported.





[jira] [Created] (ARROW-10676) [Python] pickle error occurs when using pyarrow._plasma.PlasmaClient with multiprocess on mac (python3.8.5)

2020-11-21 Thread BinbinLiang (Jira)
BinbinLiang created ARROW-10676:
---

 Summary: [Python] pickle error occurs when using 
pyarrow._plasma.PlasmaClient with multiprocess on mac (python3.8.5)
 Key: ARROW-10676
 URL: https://issues.apache.org/jira/browse/ARROW-10676
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0, 0.17.1
 Environment: OS: mac os catalina 10.15.6;
Python version: python3.8.5;
Reporter: BinbinLiang


* The environment is:
 ** OS: macOS Catalina 10.15.6;
 ** Python version: python3.8.5;
 * It works with python3.7 (3.7.2 or 3.7.3) and python3.6 (3.6.8). 
The error occurs only with python3.8.5.
 * The error occurs when I use 'pyarrow._plasma.PlasmaClient' in a 
'multiprocessing.context.Process'.

 ** First, I have a subclass of 'multiprocessing.context.Process' named 
Executor, defined like this: 'class Executor(Process):' .
 ** Then, I create a 'PlasmaClient' object in the '__init__' function of 
'class Executor' .
 ** Finally, I create an 'Executor' object and call its 'start()' function. 
 * The details of the traceback are as follows:
 Traceback (most recent call last):
   File "/Users/liangbinbin/PycharmProjects/waimai_data_cubeinsight/python/taf/server/taf_server.py", line 100, in LaunchExecutor
     executor.start()
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/process.py", line 121, in start
     self._popen = self._Popen(self)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
     return _default_context.get_context().Process._Popen(process_obj)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
     return Popen(process_obj)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
     super().__init__(process_obj)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
     self._launch(process_obj)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
     reduction.dump(process_obj, fp)
   File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
     ForkingPickler(file, protocol).dump(obj)
   File "stringsource", line 2, in pyarrow._plasma.PlasmaClient.__reduce_cython__
 TypeError: no default __reduce__ due to non-trivial __cinit__





[jira] [Created] (ARROW-10675) [C++][Python] Support AWS S3 Web identity credentials

2020-11-21 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-10675:


 Summary: [C++][Python] Support AWS S3 Web identity credentials
 Key: ARROW-10675
 URL: https://issues.apache.org/jira/browse/ARROW-10675
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0, 1.0.1
Reporter: Paul Balanca


It seems to me that Arrow currently supports only the "AssumeRole" AWS STS 
API, but not the other options offered: 
[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html#stsapi_comparison]

[https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_s_t_s_assume_role_web_identity_credentials_provider.html]

 

I am clearly no security/infra expert, but it seems that the 
"AssumeRoleWithWebIdentity" configuration is commonly used in Kubernetes 
setups, and I believe it would be beneficial for the Arrow C++ & Python 
libraries to support it.

At the moment, a workaround is to call `aws sts` directly to generate a 
temporary session, but it is fairly painful as the session expires: all 
PyArrow objects with an S3 filesystem (datasets, ...) need to be rebuilt with 
new credentials. 





[jira] [Created] (ARROW-10674) [Rust] Add integration tests for Decimal type

2020-11-21 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10674:
--

 Summary: [Rust] Add integration tests for Decimal type
 Key: ARROW-10674
 URL: https://issues.apache.org/jira/browse/ARROW-10674
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


We have basic decimal support, but we have not yet included decimals in the 
integration testing.


