[jira] [Created] (ARROW-10683) [Rust] Remove Array.data method in favor of .data_ref to make performance impact of clone more obvious
Jörn Horstmann created ARROW-10683:
--
Summary: [Rust] Remove Array.data method in favor of .data_ref to make performance impact of clone more obvious
Key: ARROW-10683
URL: https://issues.apache.org/jira/browse/ARROW-10683
Project: Apache Arrow
Issue Type: Improvement
Reporter: Jörn Horstmann

The `Array.data()` method is a real performance foot-gun, since it involves cloning an `Arc`, which is not obvious to users. In inner loops this can have a big performance impact. The clone itself might not be the problem, but I think it sometimes prohibits other compiler optimizations. It would be better to remove this method and let users call `data_ref()`, cloning only when really needed. Most of the current usages seem to be in test assertions, which should be easy to refactor.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
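The hidden `Arc` clone can be sketched with stand-in types (illustrative only, not the real arrow-rs definitions):

```rust
use std::sync::Arc;

// Minimal stand-ins for the arrow-rs types, to show the cost difference.
struct ArrayData {
    len: usize,
}

struct Int32Array {
    data: Arc<ArrayData>,
}

impl Int32Array {
    // Every call bumps the Arc's atomic reference count, which is easy
    // to overlook at the call site and hard for the compiler to remove.
    fn data(&self) -> Arc<ArrayData> {
        self.data.clone()
    }

    // Borrowing instead costs nothing; the caller clones explicitly
    // only when ownership is really needed.
    fn data_ref(&self) -> &ArrayData {
        &self.data
    }
}
```

`Arc::strong_count` makes the cost visible: each `data()` call performs an atomic increment, while `data_ref()` is a free borrow.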
[jira] [Created] (ARROW-10682) [Rust] Sort kernel performance tuning
Jörn Horstmann created ARROW-10682:
--
Summary: [Rust] Sort kernel performance tuning
Key: ARROW-10682
URL: https://issues.apache.org/jira/browse/ARROW-10682
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Jörn Horstmann

The `is_valid` calls inside the lexical comparator are not being inlined because the calls go through a `dyn Array`. We can instead call the `is_valid` function of `ArrayData` directly.
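The inlining problem can be illustrated with stand-in types (names are illustrative, not the actual arrow-rs definitions):

```rust
// Minimal stand-in for ArrayData.
struct ArrayData {
    null_bits: Vec<bool>,
}

impl ArrayData {
    // Concrete method: trivially inlinable in a hot loop.
    #[inline]
    fn is_valid(&self, i: usize) -> bool {
        self.null_bits[i]
    }
}

trait Array {
    fn data_ref(&self) -> &ArrayData;

    // Calling this through `&dyn Array` is a virtual call, so the
    // compiler generally cannot inline it.
    fn is_valid(&self, i: usize) -> bool {
        self.data_ref().is_valid(i)
    }
}

struct Int32Array {
    data: ArrayData,
}

impl Array for Int32Array {
    fn data_ref(&self) -> &ArrayData {
        &self.data
    }
}

// Hoist the concrete `ArrayData` out of the trait object once, so the
// per-row validity check is a plain method call instead of a virtual one.
fn count_valid(array: &dyn Array) -> usize {
    let data = array.data_ref();
    (0..data.null_bits.len()).filter(|&i| data.is_valid(i)).count()
}
```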
[jira] [Created] (ARROW-10681) [Rust] [DataFusion] TPC-H Query 12 fails with scheduler error
Andy Grove created ARROW-10681:
--
Summary: [Rust] [DataFusion] TPC-H Query 12 fails with scheduler error
Key: ARROW-10681
URL: https://issues.apache.org/jira/browse/ARROW-10681
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Andy Grove
Fix For: 3.0.0

{code:java}
Running benchmarks with the following options: BenchmarkOpt { query: 12, debug: false, iterations: 1, concurrency: 2, batch_size: 4096, path: "/mnt/tpch/tbl-sf1/", file_format: "tbl", mem_table: false }
thread 'main' panicked at 'must be called from the context of Tokio runtime configured with either `basic_scheduler` or `threaded_scheduler`', datafusion/src/physical_plan/hash_aggregate.rs:368:9
{code}
[jira] [Created] (ARROW-10680) [Rust] [DataFusion] Implement TPC-H Query 12
Andy Grove created ARROW-10680:
--
Summary: [Rust] [DataFusion] Implement TPC-H Query 12
Key: ARROW-10680
URL: https://issues.apache.org/jira/browse/ARROW-10680
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove

Implement TPC-H Query 12 so that we can test JOIN support more fully. We will need to fake some parts for now because we don't support all the expressions in this query.
[jira] [Created] (ARROW-10679) [Rust] [DataFusion] Implement SQL CASE WHEN expression
Andy Grove created ARROW-10679:
--
Summary: [Rust] [DataFusion] Implement SQL CASE WHEN expression
Key: ARROW-10679
URL: https://issues.apache.org/jira/browse/ARROW-10679
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Andy Grove

Implement SQL CASE WHEN expression so that we can support TPC-H query 12 fully.

Postgres: https://www.postgresqltutorial.com/postgresql-case/
Spark: http://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-case.html
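As a rough sketch of the required evaluation semantics (a hypothetical helper, not DataFusion's actual expression API): the first WHEN branch whose condition holds wins; otherwise the ELSE value is used, or None, mirroring SQL NULL.

```rust
// Hypothetical scalar sketch of CASE WHEN evaluation order.
// `whens` holds (condition, value) pairs in declaration order.
fn eval_case<T: Copy>(whens: &[(bool, T)], else_value: Option<T>) -> Option<T> {
    for &(condition, value) in whens {
        if condition {
            // First true WHEN wins; later branches are not evaluated.
            return Some(value);
        }
    }
    // No WHEN matched: fall back to ELSE, or NULL (None) if absent.
    else_value
}
```

A vectorized implementation over Arrow arrays would apply the same first-match-wins rule row by row.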
[jira] [Created] (ARROW-10678) [Python] pyarrow 2.0.0 flight test crash on macOS
BinbinLiang created ARROW-10678:
---
Summary: [Python] pyarrow 2.0.0 flight test crash on macOS
Key: ARROW-10678
URL: https://issues.apache.org/jira/browse/ARROW-10678
Project: Apache Arrow
Issue Type: Bug
Environment: OS: macOS Catalina 10.15.6; Python version: 3.7.2; pyarrow: 2.0.0
Reporter: BinbinLiang
Attachments: image-2020-11-21-23-31-40-211.png, image-2020-11-21-23-32-35-346.png, image-2020-11-21-23-33-30-170.png

* When I use the pyarrow flight client to get remote data, I encounter an "assertion failed" problem: !image-2020-11-21-23-32-35-346.png|width=1101,height=33!
* I use the flight example code (client.py & server.py): https://github.com/apache/arrow/tree/master/python/examples/flight
* It seems to be a grpc bug, similar to a previous problem (also on macOS): https://issues.apache.org/jira/browse/ARROW-7689 . The corresponding grpc issue is: https://github.com/grpc/grpc/issues/20311
* In my project, the version of pyarrow is 2.0.0 and the version of grpcio is 1.33.2; both are the latest versions. !image-2020-11-21-23-33-30-170.png|width=228,height=186!
* When I change the version of pyarrow from 2.0.0 to 0.17.1, the problem disappears.
[jira] [Created] (ARROW-10677) [Rust] Add tests as documentation showing supported csv parsing
Andrew Lamb created ARROW-10677:
---
Summary: [Rust] Add tests as documentation showing supported csv parsing
Key: ARROW-10677
URL: https://issues.apache.org/jira/browse/ARROW-10677
Project: Apache Arrow
Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb

https://github.com/apache/arrow/pull/8714 (ARROW-10654) added some specialized parsing for the csv reader and, among other things, additional boolean parsing support. We should add some tests as documentation of which boolean representations are supported.
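Such a test could document the accepted spellings directly. The helper below is hypothetical; the exact set of boolean representations accepted by the arrow-rs CSV reader may differ:

```rust
// Hypothetical lenient boolean parser, roughly in the spirit of what the
// CSV reader's specialized parsing added. The match arms double as the
// documentation of which spellings are supported.
fn parse_bool(s: &str) -> Option<bool> {
    match s.to_ascii_lowercase().as_str() {
        "true" | "t" => Some(true),
        "false" | "f" => Some(false),
        // Anything else is not a boolean; the reader would treat it as null
        // or an error depending on its configuration.
        _ => None,
    }
}
```

A `#[test]` asserting each accepted and rejected spelling would then serve as the executable documentation the issue asks for.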
[jira] [Created] (ARROW-10676) [Python] pickle error occurs when using pyarrow._plasma.PlasmaClient with multiprocess on mac (python3.8.5)
BinbinLiang created ARROW-10676:
---
Summary: [Python] pickle error occurs when using pyarrow._plasma.PlasmaClient with multiprocess on mac (python3.8.5)
Key: ARROW-10676
URL: https://issues.apache.org/jira/browse/ARROW-10676
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0, 0.17.1
Environment: OS: macOS Catalina 10.15.6; Python version: 3.8.5
Reporter: BinbinLiang

* The environment is:
** OS: macOS Catalina 10.15.6
** Python version: 3.8.5
* It works fine with python3.7 (3.7.2 or 3.7.3) and python3.6 (3.6.8); the error occurs only in python3.8.5.
* The error occurs when I use 'pyarrow._plasma.PlasmaClient' in a 'multiprocessing.context.Process':
** First, I have a subclass of 'multiprocessing.context.Process', named Executor, defined like this: 'class Executor(Process):'.
** Then, I create a 'PlasmaClient' object in the '__init__' function of 'class Executor'.
** Finally, I create an 'Executor' object and call its 'start()' function.
* The full traceback is as follows:

Traceback (most recent call last):
  File "/Users/liangbinbin/PycharmProjects/waimai_data_cubeinsight/python/taf/server/taf_server.py", line 100, in LaunchExecutor
    executor.start()
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/liangbinbin/Applications/anaconda3/envs/python3_8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "stringsource", line 2, in pyarrow._plasma.PlasmaClient.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
[jira] [Created] (ARROW-10675) [C++][Python] Support AWS S3 Web identity credentials
Paul Balanca created ARROW-10675:
---
Summary: [C++][Python] Support AWS S3 Web identity credentials
Key: ARROW-10675
URL: https://issues.apache.org/jira/browse/ARROW-10675
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 2.0.0, 1.0.1
Reporter: Paul Balanca

It seems to me that Arrow currently only supports the "AssumeRole" AWS STS API, but not the other options offered:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html#stsapi_comparison
https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_s_t_s_assume_role_web_identity_credentials_provider.html

I am clearly no security/infra expert, but the "AssumeRoleWithWebIdentity" configuration is commonly used in Kubernetes setups, and I believe it would be beneficial for the Arrow C++ & Python libraries to support it. At the moment, a workaround is to call `aws sts` directly to generate a temporary session, but that is fairly painful as the session expires: all PyArrow objects with an S3 filesystem (datasets, ...) need to be rebuilt with new credentials.
[jira] [Created] (ARROW-10674) [Rust] Add integration tests for Decimal type
Neville Dipale created ARROW-10674:
--
Summary: [Rust] Add integration tests for Decimal type
Key: ARROW-10674
URL: https://issues.apache.org/jira/browse/ARROW-10674
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust
Reporter: Neville Dipale

We have basic decimal support, but we have not yet included decimals in the integration testing.