[jira] [Updated] (ARROW-10263) [C++][Compute] Improve numerical stability of variances merging
[ https://issues.apache.org/jira/browse/ARROW-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10263:
---
Labels: pull-request-available (was: )

> [C++][Compute] Improve numerical stability of variances merging
> ---
>
> Key: ARROW-10263
> URL: https://issues.apache.org/jira/browse/ARROW-10263
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yibo Cai
> Assignee: Yibo Cai
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> For a chunked array, the variance kernel needs to merge the variances of the
> individual chunks.
> Tested with two single-value chunks, [400800490] and [400800400].
> The merged variance is 3872. If they are treated as a single array with two
> values, the variance is 3904, the same as numpy outputs.
> So the current merging method is not stable in extreme cases where chunks are
> very short and have nearly equal mean values.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
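The merge step described in ARROW-10263 can be illustrated with a minimal Python sketch. It assumes each chunk is summarized as (count, mean, m2), where m2 is the sum of squared deviations from the chunk mean, and it combines two summaries with the pairwise update usually attributed to Chan et al. The names `VarState`, `state_from_chunk`, `merge` and `variance` are hypothetical and are not Arrow's API. This is not the kernel's actual implementation, and it does not attempt to reproduce the figures quoted in the issue.

```python
from dataclasses import dataclass


@dataclass
class VarState:
    """Per-chunk summary (hypothetical name, not an Arrow type)."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def state_from_chunk(values):
    """Two-pass summary of a single chunk."""
    n = len(values)
    if n == 0:
        return VarState(0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return VarState(n, mean, m2)


def merge(a, b):
    """Combine two summaries without revisiting the underlying values."""
    n = a.count + b.count
    if n == 0:
        return VarState(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return VarState(n, mean, m2)


def variance(state, ddof=0):
    """Population variance by default; the caller must ensure count > ddof."""
    return state.m2 / (state.count - ddof)


if __name__ == "__main__":
    a = state_from_chunk([1.0, 2.0, 3.0])
    b = state_from_chunk([10.0, 11.0])
    whole = state_from_chunk([1.0, 2.0, 3.0, 10.0, 11.0])
    # The merged result should agree with the single-array result
    # up to floating-point rounding.
    print(variance(merge(a, b)), variance(whole))
```

Working with means and squared deviations, rather than raw sums of squares, is the design choice that matters here: it avoids subtracting two very large, nearly equal numbers when the values are large and close together.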
[jira] [Created] (ARROW-10278) [Cmake] Failures when building Arrow unittests from source
Andrew Wieteska created ARROW-10278:
---
Summary: [Cmake] Failures when building Arrow unittests from source
Key: ARROW-10278
URL: https://issues.apache.org/jira/browse/ARROW-10278
Project: Apache Arrow
Issue Type: Test
Components: C++
Reporter: Andrew Wieteska

I've started to get errors while building the unit tests from source. Following the developer docs, I run this:

{code:java}
cd arrow/cpp/debug
cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
make unittest
{code}

On current master I get a number of failures:

{code:java}
The following tests FAILED:
1 - arrow-array-test (Failed)
2 - arrow-buffer-test (Failed)
4 - arrow-misc-test (Failed)
6 - arrow-scalar-test (Failed)
7 - arrow-type-test (Failed)
8 - arrow-table-test (Failed)
9 - arrow-tensor-test (Failed)
10 - arrow-sparse-tensor-test (Failed)
11 - arrow-stl-test (Failed)
12 - arrow-json-integration-test (Failed)
13 - arrow-concatenate-test (Failed)
14 - arrow-diff-test (Failed)
15 - arrow-c-bridge-test (Failed)
17 - arrow-io-compressed-test (Failed)
19 - arrow-io-memory-test (Failed)
20 - arrow-utility-test (Failed)
21 - arrow-threading-utility-test (Failed)
23 - arrow-compute-scalar-test (Failed)
24 - arrow-compute-vector-test (Failed)
26 - arrow-feather-test (Failed)
27 - arrow-ipc-json-simple-test (Failed)
28 - arrow-ipc-read-write-test (Failed)
29 - arrow-ipc-tensor-test (Failed)
30 - arrow-json-test (Failed)
Errors while running CTest
make[3]: *** [CMakeFiles/unittest.dir/build.make:76: CMakeFiles/unittest] Error 8
make[2]: *** [CMakeFiles/Makefile2:572: CMakeFiles/unittest.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:579: CMakeFiles/unittest.dir/rule] Error 2
make: *** [Makefile:246: unittest] Error 2
{code}

Scrolling up I see that these all fail with this message:

{code:java}
18/30 Test #23: arrow-compute-scalar-test ***Failed 0.11 sec
Running arrow-compute-scalar-test, redirecting output into /home/andrew/git_repo/arrow/cpp/debug/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1)
/home/andrew/git_repo/arrow/cpp/debug/debug/arrow-compute-scalar-test: symbol lookup error: /home/andrew/git_repo/arrow/cpp/debug/debug/arrow-compute-scalar-test: undefined symbol: _ZN5arrow6Status14AddContextLineEPKciS2_
cat: /home/andrew/git_repo/arrow/cpp/debug/build/test-logs/arrow-compute-scalar-test.txt.raw: No such file or directory
~/git_repo/arrow/cpp/debug/src/arrow/compute/kernels
{code}

I appreciate any comments/ideas on how to fix this!
[jira] [Created] (ARROW-10277) [C++] Support comparing scalars approximately
Liya Fan created ARROW-10277:
---
Summary: [C++] Support comparing scalars approximately
Key: ARROW-10277
URL: https://issues.apache.org/jira/browse/ARROW-10277
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Liya Fan
Assignee: Liya Fan

As discussed in [https://github.com/apache/arrow/pull/7748#discussion_r469997286], we need to compare scalars approximately in some scenarios.
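As an illustration of what approximate scalar comparison could mean, here is a minimal Python sketch. The issue does not specify the semantics, so the absolute tolerance, the default `epsilon`, and the `nans_equal` switch are all assumptions, and `approx_equal` is a hypothetical helper rather than an Arrow function.

```python
import math


def approx_equal(left, right, epsilon=1e-5, nans_equal=False):
    """Return True if two float scalars differ by at most `epsilon`.

    Hypothetical helper: the tolerance model (absolute difference) and the
    NaN handling are assumptions made for this sketch.
    """
    if math.isnan(left) or math.isnan(right):
        # NaN never compares equal unless the caller opts in.
        return nans_equal and math.isnan(left) and math.isnan(right)
    if left == right:
        # Exact match, which also covers two infinities of the same sign.
        return True
    return abs(left - right) <= epsilon
```

An absolute tolerance is the simplest choice. A relative tolerance (as in `math.isclose`) may suit values of widely varying magnitude better.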
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Description: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated. Thanks Spark Version: 2.4.5. was: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated. Thanks Spark Version: 2.4.5 > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. 
> I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Description: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated. Thanks Spark Version: 2.4.5 was: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated. Thanks > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. 
> I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below > Any help would be appreciated. Thanks > Spark Version: 2.4.5 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Description: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated was: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. Any help would be appreciated > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. 
I have attached images below > > Any help would be appreciated > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Description: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated. Thanks was: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. I have attached images below Any help would be appreciated > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. 
I have attached images below > Any help would be appreciated. Thanks > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Summary: Armv7 orc and flight not supported for build. Compat error on using with spark (was: Armv7 orc and flight not supported for build. Compat error) > Armv7 orc and flight not supported for build. Compat error on using with spark > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. > > Any help would be appreciated > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Description: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and flight flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. Any help would be appreciated was: I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have tried to use it for the raspberry pi 3 without luck in previous posts. I figured out how to successfully build it for armv7 using the script below but cannot use orc and parquet flags. People had looked into it in ARROW-8420 but I don't know if they faced these issues. I tried converting a spark dataframe to pandas using pyarrow but now it complains about a compat feature. Any help would be appreciated > Armv7 orc and flight not supported for build. Compat error > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. > > Any help would be appreciated > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] utsav updated ARROW-10276: -- Summary: Armv7 orc and flight not supported for build. Compat error (was: Armv7 orc and parquet not supported for build. Compat error) > Armv7 orc and flight not supported for build. Compat error > -- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_armv7, arrow_compat_error > > > I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the raspberry pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use orc and parquet flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a spark dataframe to pandas using pyarrow but now it > complains about a compat feature. > > Any help would be appreciated > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10276) Armv7 orc and parquet not supported for build. Compat error
utsav created ARROW-10276:
---
Summary: Armv7 orc and parquet not supported for build. Compat error
Key: ARROW-10276
URL: https://issues.apache.org/jira/browse/ARROW-10276
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.17.0
Reporter: utsav
Attachments: arrow_armv7, arrow_compat_error

I'm using an Arm Cortex-A9 processor on the Xilinx Pynq Z2 board. In previous posts, people have tried to build Arrow for the Raspberry Pi 3 without luck.

I figured out how to build it successfully for armv7 using the script below, but I cannot use the orc and parquet flags. People looked into this in ARROW-8420, but I don't know if they faced these issues.

I tried converting a Spark DataFrame to pandas using pyarrow, but now it complains about a compat feature.

Any help would be appreciated.
[jira] [Resolved] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10271. Resolution: Fixed Issue resolved by pull request 8433 [https://github.com/apache/arrow/pull/8433] > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel
[ https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10251. Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8428 [https://github.com/apache/arrow/pull/8428] > [Rust] [DataFusion] MemTable::load() should load partitions in parallel > --- > > Key: ARROW-10251 > URL: https://issues.apache.org/jira/browse/ARROW-10251 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: beginner, pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > MemTable::load() should load partitions in parallel using async tasks, rather > than loading one partition at a time. > Also, we should make batch size configurable. It is currently hard-coded to > 1024*1024 which can be quite inefficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
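The two requests in ARROW-10251 (one task per partition, and a configurable batch size) can be sketched with Python's `asyncio`. This only shows the task structure: `asyncio` interleaves tasks on a single thread, so it does not provide real CPU parallelism, and it says nothing about how DataFusion implements `MemTable::load()`. The functions `load_partition` and `load_all`, and the representation of a partition as a list of rows, are assumptions made for this sketch.

```python
import asyncio


async def load_partition(partition, batch_size):
    """Split one partition (a list of rows) into batches of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, len(partition), batch_size):
        batches.append(partition[start:start + batch_size])
        # Yield to the event loop so that other partitions can make progress.
        await asyncio.sleep(0)
    return batches


async def load_all(partitions, batch_size=1024):
    """Start one task per partition and wait for all of them."""
    tasks = [
        asyncio.create_task(load_partition(p, batch_size))
        for p in partitions
    ]
    # gather() returns the results in the same order as the tasks.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    loaded = asyncio.run(load_all([[1, 2, 3], [4, 5]], batch_size=2))
    print(loaded)
```

Passing `batch_size` as a parameter, rather than fixing it inside the loader, is the part that corresponds to the second request in the issue.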
[jira] [Assigned] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel
[ https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-10251: -- Assignee: Andy Grove > [Rust] [DataFusion] MemTable::load() should load partitions in parallel > --- > > Key: ARROW-10251 > URL: https://issues.apache.org/jira/browse/ARROW-10251 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: beginner, pull-request-available > Fix For: 3.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > MemTable::load() should load partitions in parallel using async tasks, rather > than loading one partition at a time. > Also, we should make batch size configurable. It is currently hard-coded to > 1024*1024 which can be quite inefficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8038) [C++][Packaging] Add OpenSSL / encryption support to C++ packages
[ https://issues.apache.org/jira/browse/ARROW-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211935#comment-17211935 ]

Akshay commented on ARROW-8038:
---
Hello [~wesm],

I am new to the Parquet file format. I was trying to encrypt some Parquet files using the encryption example code in the GitHub repository (apache/arrow/cpp/examples). After installing the dependencies using the CMake file, I was able to perform read operations on the Parquet file. The encryption example compiles successfully, but at runtime it fails with the error "Build without SSL".

Is this related to this issue? That is, does the error occur because OpenSSL/encryption support has not yet been added to the C++ packages for Parquet? If so, is there any other way I can try and test encryption on Parquet files in C++?

> [C++][Packaging] Add OpenSSL / encryption support to C++ packages
> ---
>
> Key: ARROW-8038
> URL: https://issues.apache.org/jira/browse/ARROW-8038
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> This is an umbrella issue for tackling encryption support in the various
> packaging targets (Linux deb/rpm, Homebrew, conda, etc.)
[jira] [Updated] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
[ https://issues.apache.org/jira/browse/ARROW-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Taylor updated ARROW-10275:
---
Description:
A GROUP BY with a high cardinality (columns with lots of unique values) doesn't seem to finish.

I've tried with both datafusion-cli and this: [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]

When grouping by O_ORDERKEY there are ~15 000 000 unique records, so it seems to stall. I've tried with limit but it doesn't work either.

My parquet file: [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]

datafusion-cli:
{code:java}
CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
select O_ORDERKEY from something group by O_ORDERKEY;
{code}

was:
A GROUP BY with a high cardinality (columns with lots of unique values) doesn't seem to finish.

I've tried with both datafusion-cli and this: [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]

When grouping by O_ORDERKEY there are ~15 000 000 unique records, so it seems to stall. I've tried with limit but it doesn't work either.

My parquet file: [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]

datafusion-cli:
{code:java}
CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
select O_ORDERKEY from something limit 20
{code}

> [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
> ---
>
> Key: ARROW-10275
> URL: https://issues.apache.org/jira/browse/ARROW-10275
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust - DataFusion
> Affects Versions: 2.0.0
> Environment: Ubuntu 20.04
> Reporter: Josh Taylor
> Priority: Minor
>
> A GROUP BY with a high cardinality (columns with lots of unique values)
> doesn't seem to finish.
> I've tried with both datafusion-cli and this:
> [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]
> When grouping by O_ORDERKEY there are ~15 000 000 unique records, so it seems
> to stall. I've tried with limit but it doesn't work either.
> My parquet file:
> [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]
> datafusion-cli:
> {code:java}
> CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
> select O_ORDERKEY from something group by O_ORDERKEY;
> {code}
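For reference, a GROUP BY on a single column with no aggregates should return each distinct key once. The Python sketch below shows that semantics with one pass over a hash table. It assumes rows are dictionaries, `group_by_key` is a hypothetical helper, and nothing here describes how DataFusion implements aggregation.

```python
def group_by_key(rows, key):
    """Return the distinct values of `key` in first-seen order.

    Hypothetical helper: one pass over the rows with a hash table, so the
    work grows linearly with the number of rows.
    """
    seen = {}
    for row in rows:
        # dict keeps insertion order, so the first occurrence wins.
        seen.setdefault(row[key], None)
    return list(seen)


if __name__ == "__main__":
    rows = [{"O_ORDERKEY": 1}, {"O_ORDERKEY": 2}, {"O_ORDERKEY": 1}]
    print(group_by_key(rows, "O_ORDERKEY"))
```

SQL does not guarantee any output order for such a query. The first-seen order here is only a property of this sketch.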
[jira] [Updated] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
[ https://issues.apache.org/jira/browse/ARROW-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Taylor updated ARROW-10275: Description: Group by with a high cardinality (columns with lots of unique values) don't seem to finish. I've tried with both datafusion-cli and this: [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs] When doing O_ORDERKEY there are ~15 000 000 unique records, so it seems to stall. I've tried with limit but it doesn't work either. My parquet file: [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing] datafusion-cli: {code:java} CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet'; select O_ORDERKEY from something limit 20 {code} was: Group by with a high cardinality (columns with lots of unique values) don't seem to finish. I've tried with both datafusion-cli and this: [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs] When doing O_ORDERKEY there are ~15 000 000 unique records, so it seems to stall. I've tried with limit but it doesn't work either. My parquet file: https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing > [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish > --- > > Key: ARROW-10275 > URL: https://issues.apache.org/jira/browse/ARROW-10275 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Affects Versions: 2.0.0 > Environment: Ubuntu 20.04 >Reporter: Josh Taylor >Priority: Minor > > Group by with a high cardinality (columns with lots of unique values) don't > seem to finish. > I've tried with both datafusion-cli and this: > [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs] > When doing O_ORDERKEY there are ~15 000 000 unique records, so it seems to > stall. I've tried with limit but it doesn't work either. 
> My parquet file: > [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing] > datafusion-cli: > > {code:java} > CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet'; > select O_ORDERKEY from something limit 20 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10274) [Rust] arithmetic without SIMD does unnecessary copy
[ https://issues.apache.org/jira/browse/ARROW-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211853#comment-17211853 ]

Jorge Leitão commented on ARROW-10274:
---
> Maybe we could directly write the arithmetic result to a mutable buffer and
> prevent this redundant copy?

Yes :)

> [Rust] arithmetic without SIMD does unnecessary copy
> ---
>
> Key: ARROW-10274
> URL: https://issues.apache.org/jira/browse/ARROW-10274
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Ritchie
> Priority: Minor
>
> The arithmetic kernels that don't use SIMD create a `vec` in memory and later
> copy that data into a Buffer. Maybe we could directly write the arithmetic
> result to a mutable buffer and prevent this redundant copy?
>
> {code:java}
> let values = (0..left.len())
>     .map(|i| op(left.value(i), right.value(i)))
>     .collect::<Vec<T::Native>>();
>
> let data = ArrayData::new(
>     T::get_data_type(),
>     left.len(),
>     None,
>     null_bit_buffer,
>     0,
>     vec![Buffer::from(values.to_byte_slice())],
>     vec![],
> );
> {code}
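The difference between the two approaches discussed in ARROW-10274 can be shown with a small Python analogue. It is only an analogy for the Rust kernel: `add_via_intermediate` builds a temporary list and then copies it into a byte buffer, while `add_into_buffer` allocates the output buffer once and writes each result straight into it. Both function names are hypothetical, and the element type (native 64-bit floats) is an assumption.

```python
import struct

ITEM_SIZE = struct.calcsize("d")  # size in bytes of one native double


def add_via_intermediate(left, right):
    """Compute into a temporary list, then copy the values into a buffer."""
    values = [l + r for l, r in zip(left, right)]
    return bytearray(struct.pack(f"{len(values)}d", *values))


def add_into_buffer(left, right):
    """Allocate the output once and write each result directly into it."""
    n = min(len(left), len(right))
    buf = bytearray(ITEM_SIZE * n)
    # View the bytes as doubles so that each store lands in the final buffer.
    out = memoryview(buf).cast("d")
    for i in range(n):
        out[i] = left[i] + right[i]
    return buf


if __name__ == "__main__":
    a, b = [1.0, 2.0], [3.0, 4.0]
    print(add_via_intermediate(a, b) == add_into_buffer(a, b))
```

Both functions produce the same bytes. The second one skips the intermediate container, which is the saving the issue proposes.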
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10271: --- Labels: pull-request-available (was: ) > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10274) [Rust] arithmetic without SIMD does unnecessary copy
[ https://issues.apache.org/jira/browse/ARROW-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale updated ARROW-10274:
---
Component/s: Rust

> [Rust] arithmetic without SIMD does unnecessary copy
> ---
>
> Key: ARROW-10274
> URL: https://issues.apache.org/jira/browse/ARROW-10274
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Ritchie
> Priority: Minor
>
> The arithmetic kernels that don't use SIMD create a `vec` in memory and later
> copy that data into a Buffer. Maybe we could directly write the arithmetic
> result to a mutable buffer and prevent this redundant copy?
>
> {code:java}
> let values = (0..left.len())
>     .map(|i| op(left.value(i), right.value(i)))
>     .collect::<Vec<T::Native>>();
>
> let data = ArrayData::new(
>     T::get_data_type(),
>     left.len(),
>     None,
>     null_bit_buffer,
>     0,
>     vec![Buffer::from(values.to_byte_slice())],
>     vec![],
> );
> {code}
[jira] [Assigned] (ARROW-10112) [Rust] Implement conversion of ArrowArray to array::Array
[ https://issues.apache.org/jira/browse/ARROW-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-10112: Assignee: Jorge Leitão > [Rust] Implement conversion of ArrowArray to array::Array > - > > Key: ARROW-10112 > URL: https://issues.apache.org/jira/browse/ARROW-10112 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10113) [Rust] Implement conversion of array::Array to ArrowArray
[ https://issues.apache.org/jira/browse/ARROW-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-10113: Assignee: Jorge Leitão > [Rust] Implement conversion of array::Array to ArrowArray > - > > Key: ARROW-10113 > URL: https://issues.apache.org/jira/browse/ARROW-10113 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
Josh Taylor created ARROW-10275: --- Summary: [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish Key: ARROW-10275 URL: https://issues.apache.org/jira/browse/ARROW-10275 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Affects Versions: 2.0.0 Environment: Ubuntu 20.04 Reporter: Josh Taylor A GROUP BY on a high-cardinality column (one with lots of unique values) doesn't seem to finish. I've tried with both datafusion-cli and this: [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs] When grouping by O_ORDERKEY there are ~15 000 000 unique values, so it seems to stall. I've also tried adding a LIMIT, but that doesn't help either. My parquet file: https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211839#comment-17211839 ] Neville Dipale commented on ARROW-10271: I was planning on doing a pass to check whether there are dependencies that we could bump. I'm aware of the packed_simd_2 change, and was planning on addressing it. Although we use an old nightly (call it a six-monthly at this stage), this issue will definitely break a lot of code for users. > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Component/s: Rust > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10271: -- Assignee: Neville Dipale > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Fix Version/s: 2.0.0 > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Priority: Blocker (was: Major) > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Blocker > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Affects Version/s: 1.0.1 > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Major > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)