[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198118#comment-17198118 ] Paul Taylor commented on ARROW-8394:

[~pprice] [~timconkling] [~Costa] PR is up @ https://github.com/apache/arrow/pull/8216

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
> Issue Type: Bug
> Components: JavaScript
> Affects Versions: 0.16.0
> Reporter: Shyamal Shukla
> Priority: Blocker
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Attempting to use apache-arrow within a web application, but the typescript compiler throws the following errors in some of arrow's .d.ts files:
>
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
>   .
>   .
>   constructor() {
>     const t = Table.from('');
>   }
> }
>
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: Class static side 'typeof Column' incorrectly extends base class static side 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: Subsequent property declarations must have the same type. Property 'schema' must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. The types of 'slice(...).clone' are incompatible between these types.
>
> The tsconfig.json file looks like:
>
> {
>   "compilerOptions": {
>     "target": "ES6",
>     "outDir": "dist",
>     "baseUrl": "src/"
>   },
>   "exclude": ["dist"],
>   "include": ["src/*.ts"]
> }

-- This message was sent by Atlassian Jira (v8.3.4#803005)
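Until the d.ts fixes land, one common consumer-side workaround for type errors that originate inside a library's declaration files is TypeScript's standard `skipLibCheck` option, which skips type checking of all .d.ts files. Shown here merged into the reporter's tsconfig.json (this mutes the symptom only; it is not the fix the PR provides):

```json
{
  "compilerOptions": {
    "target": "ES6",
    "outDir": "dist",
    "baseUrl": "src/",
    "skipLibCheck": true
  },
  "exclude": ["dist"],
  "include": ["src/*.ts"]
}
```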
[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195668#comment-17195668 ] Paul Taylor edited comment on ARROW-8394 at 9/18/20, 5:30 AM:

I've started work on a branch in my fork here[1], but have been occupied the last few weeks (work, moving, back injury, etc.). There's not much left to do, so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/fix/typescript-3.9-errors

was (Author: paul.e.taylor):
I've started work on a branch in my fork here[1], but have been occupied the last few weeks (work, moving, back injury, etc.). There's not much left to do, so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9
[jira] [Updated] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8394:

Labels: pull-request-available (was: )
[jira] [Reopened] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand reopened ARROW-10002:

My first PR only removes {{default fn}} from one trait.

> [Rust] Trait-specialization requires nightly
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Rust
> Reporter: Kyle Strand
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of {{default fn}} in the codebase:
>
> {code}
> $> rg -c 'default fn' ../arrow/rust/
> ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
> ../arrow/rust/parquet/src/column/writer.rs:2
> ../arrow/rust/parquet/src/encodings/encoding.rs:16
> ../arrow/rust/parquet/src/arrow/record_reader.rs:1
> ../arrow/rust/parquet/src/encodings/decoding.rs:13
> ../arrow/rust/parquet/src/file/statistics.rs:1
> ../arrow/rust/arrow/src/array/builder.rs:7
> ../arrow/rust/arrow/src/array/array.rs:3
> ../arrow/rust/arrow/src/array/equal.rs:3
> {code}
>
> This feature requires nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.)
>
> If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust.
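The usual stable-Rust alternative to a specialized blanket impl is to drop the blanket `default fn` and implement the trait for each concrete type, stamping out the repetitive impls with a macro. A minimal sketch of that pattern (hypothetical trait and names, not Arrow's actual code):

```rust
// Nightly-only specialization (what the codebase currently relies on):
//
//   impl<T> Describe for T   { default fn describe(&self) -> String { ... } }
//   impl     Describe for i32 {         fn describe(&self) -> String { ... } }
//
// Stable alternative: no blanket impl; a macro generates one impl per type.

trait Describe {
    fn describe(&self) -> String;
}

// The macro stamps out one concrete impl per type, replacing the
// specialized blanket impl without requiring nightly.
macro_rules! impl_describe {
    ($($t:ty => $name:expr),* $(,)?) => {
        $(impl Describe for $t {
            fn describe(&self) -> String {
                format!("{}({})", $name, self)
            }
        })*
    };
}

impl_describe!(i32 => "Int32", f64 => "Float64");

fn main() {
    assert_eq!(5i32.describe(), "Int32(5)");
    assert_eq!(2.5f64.describe(), "Float64(2.5)");
    println!("ok");
}
```

The trade-off is that every supported type must be listed explicitly, which is exactly why the removal is being done one trait at a time.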
[jira] [Created] (ARROW-10036) [Rust] [DataFusion] Test that the final schema is expected in integration tests
Jorge created ARROW-10036:

Summary: [Rust] [DataFusion] Test that the final schema is expected in integration tests
Key: ARROW-10036
URL: https://issues.apache.org/jira/browse/ARROW-10036
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Jorge

Currently, our integration tests convert a RecordBatch to a string, which we use for testing, but they do not test that the final schema matches our expectations. We should add a test for this, which checks, for every field in the schema:
# field name
# field type
# field nullability
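The three checks above could be expressed as a helper that compares each produced field against an expected (name, type, nullable) triple. A sketch using simplified stand-in types (Arrow's real `Schema`/`Field` live in `arrow::datatypes` and their APIs differ):

```rust
// Simplified stand-in for arrow's Field, just to show the shape of the
// assertion; the real type holds a DataType enum rather than a String.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Assert name, type, and nullability for every field in the schema.
fn assert_schema(actual: &[Field], expected: &[(&str, &str, bool)]) {
    assert_eq!(actual.len(), expected.len(), "field count mismatch");
    for (f, (name, ty, nullable)) in actual.iter().zip(expected) {
        assert_eq!(f.name, *name, "field name");
        assert_eq!(f.data_type, *ty, "field type");
        assert_eq!(f.nullable, *nullable, "field nullability");
    }
}

fn main() {
    let schema = vec![
        Field { name: "c1".into(), data_type: "Utf8".into(), nullable: false },
        Field { name: "c2".into(), data_type: "UInt32".into(), nullable: true },
    ];
    assert_schema(&schema, &[("c1", "Utf8", false), ("c2", "UInt32", true)]);
    println!("schema ok");
}
```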
[jira] [Resolved] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10002.

Fix Version/s: 2.0.0
Resolution: Fixed

Issue resolved by pull request 8206 [https://github.com/apache/arrow/pull/8206]
[jira] [Updated] (ARROW-9965) [Java] Buffer capacity calculations are slow for fixed-width vectors
[ https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9965:

Labels: pull-request-available (was: )

> [Java] Buffer capacity calculations are slow for fixed-width vectors
>
> Key: ARROW-9965
> URL: https://issues.apache.org/jira/browse/ARROW-9965
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Josiah
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.0.0
> Attachments: after_patch_profile_prof_perfasm_unsafe_true, before_patch_profile_prof_perfasm_unsafe_true
> Time Spent: 10m
> Remaining Estimate: 0h
>
> It turns out that setSafe performs a very expensive integer division when trying to compute buffer capacity; specifically, it divides by the field size, which isn't hardcoded. Although the field size is typically a power of 2 for alignment reasons, the division doesn't compile down to a bitshift.
>
> This is done here: https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
>
> Forcing a bitshift operation results in a large speedup in benchmarks. With bounds checks turned off (which affects another portion of set), microbenchmarks indicate that the rate of setting vector elements via setSafe increases by ~174% (almost 3 times faster). With bounds checks on, the gain drops to a 73% increase (Amdahl's law).
>
> We use setSafe right now in a hot loop to set Arrow vectors in an internal data-intensive service (for now), although in the future we would prefer a more specialized vector-append interface that skips the other indirection and bit-manipulation instructions while not directly manipulating the exposed (native) memory.
>
> Here is the detailed analysis. Tests were run on a machine with an Intel 8700k, compiled with JDK 8, and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
> {code}
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
> IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
> IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op
>
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark                         Mode  Cnt  Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15  7.964 ± 0.030  us/op
> IntBenchmarks.setWithValueHolder  avgt   15  9.145 ± 0.031  us/op
> IntBenchmarks.setWithWriter       avgt   15  8.029 ± 0.051  us/op
>
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
> IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
> IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op
>
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 it
> {code}
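The optimization described in this ticket is language-agnostic: dividing a byte count by an element width is slow when the compiler cannot prove the width is a power of two, whereas a right shift by the precomputed log2 of the width is cheap. A sketch of the idea in Rust (hypothetical helpers illustrating the principle, not Arrow's Java code):

```rust
// Capacity computed with a division by the element width; slow when the
// width is a runtime value the compiler cannot reduce to a shift.
fn value_capacity_div(buffer_bytes: usize, type_width: usize) -> usize {
    buffer_bytes / type_width
}

// The same computation as a shift, valid when the width is a power of two
// (e.g. 4-byte ints => shift right by 2).
fn value_capacity_shift(buffer_bytes: usize, log2_width: u32) -> usize {
    buffer_bytes >> log2_width
}

fn main() {
    let buffer_bytes = 4096;
    assert_eq!(value_capacity_div(buffer_bytes, 4), 1024);
    assert_eq!(value_capacity_shift(buffer_bytes, 2), 1024);
    println!("capacities agree");
}
```

The patch effectively hardcodes the shift amount per fixed-width vector type so the JIT emits the cheap form.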
[jira] [Created] (ARROW-10035) [C++] Bump versions of vendored code
Antoine Pitrou created ARROW-10035: -- Summary: [C++] Bump versions of vendored code Key: ARROW-10035 URL: https://issues.apache.org/jira/browse/ARROW-10035 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Fix For: 2.0.0
[jira] [Reopened] (ARROW-9977) [Rust] Add min/max for [Large]String
[ https://issues.apache.org/jira/browse/ARROW-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reopened ARROW-9977:

Re-opening this because I had to revert the PR due to conflicts.

> [Rust] Add min/max for [Large]String
>
> Key: ARROW-9977
> URL: https://issues.apache.org/jira/browse/ARROW-9977
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Jorge
> Assignee: Jorge
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> Strings are ordered and thus we can apply min/max to them as with other types.
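Because `&str` compares lexicographically, a min/max kernel for string arrays has the same null-skipping shape as the numeric ones. A simplified sketch over `Option<&str>` values (Arrow's real kernel operates on `StringArray`/`LargeStringArray` with validity bitmaps):

```rust
// Minimum of a string "array", skipping nulls; None if all values are null.
fn min_string<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values.iter().flatten().copied().min()
}

// Maximum, same null-skipping behavior.
fn max_string<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values.iter().flatten().copied().max()
}

fn main() {
    let arr = [Some("b"), None, Some("a"), Some("c")];
    assert_eq!(min_string(&arr), Some("a"));
    assert_eq!(max_string(&arr), Some("c"));
    println!("min/max ok");
}
```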
[jira] [Resolved] (ARROW-10034) [Rust] Master build broken
[ https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10034.

Resolution: Fixed

Issue resolved by pull request 8213 [https://github.com/apache/arrow/pull/8213]

> [Rust] Master build broken
>
> Key: ARROW-10034
> URL: https://issues.apache.org/jira/browse/ARROW-10034
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I merged quite a few PRs today. There was a conflict and I need to revert one of them. I am working on it.
[jira] [Assigned] (ARROW-10034) [Rust] Master build broken
[ https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10034:

Assignee: Andy Grove (was: Apache Arrow JIRA Bot)
[jira] [Assigned] (ARROW-10034) [Rust] Master build broken
[ https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10034:

Assignee: Apache Arrow JIRA Bot (was: Andy Grove)
[jira] [Updated] (ARROW-10034) [Rust] Master build broken
[ https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10034:

Labels: pull-request-available (was: )
[jira] [Created] (ARROW-10034) [Rust] Master build broken
Andy Grove created ARROW-10034: -- Summary: [Rust] Master build broken Key: ARROW-10034 URL: https://issues.apache.org/jira/browse/ARROW-10034 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 2.0.0 I merged quite a few PRs today. There was a conflict and I need to revert one of them. I am working on it.
[jira] [Resolved] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
[ https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10001. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8186 [https://github.com/apache/arrow/pull/8186] > [Rust] [DataFusion] Add developer guide to README > - > > Key: ARROW-10001 > URL: https://issues.apache.org/jira/browse/ARROW-10001 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Resolved] (ARROW-9987) [Rust] [DataFusion] Improve docs of `Expr`.
[ https://issues.apache.org/jira/browse/ARROW-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9987. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8181 [https://github.com/apache/arrow/pull/8181] > [Rust] [DataFusion] Improve docs of `Expr`. > --- > > Key: ARROW-9987 > URL: https://issues.apache.org/jira/browse/ARROW-9987 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Resolved] (ARROW-9977) [Rust] Add min/max for [Large]String
[ https://issues.apache.org/jira/browse/ARROW-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9977.

Fix Version/s: 2.0.0
Resolution: Fixed

Issue resolved by pull request 8171 [https://github.com/apache/arrow/pull/8171]
[jira] [Resolved] (ARROW-10028) [Rust] Simplify macro def_numeric_from_vec
[ https://issues.apache.org/jira/browse/ARROW-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-10028. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8207 [https://github.com/apache/arrow/pull/8207] > [Rust] Simplify macro def_numeric_from_vec > -- > > Key: ARROW-10028 > URL: https://issues.apache.org/jira/browse/ARROW-10028 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently we need to pass too many parameters to it, when they can be > inferred.
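The "parameters can be inferred" idea can be sketched in general terms: rather than making each macro invocation spell out the array type, the native type, and the data-type tag, a trait can carry the associated information so the compiler infers the rest. A toy illustration (hypothetical trait and types, not DataFusion's actual `def_numeric_from_vec`):

```rust
// Before (schematically): every detail passed to the macro per invocation:
//   def_numeric_from_vec!(Int32Array, i32, DataType::Int32);
// After: a trait carries the per-type info, so one blanket impl suffices.

trait ArrowNative: Copy {
    const NAME: &'static str; // stand-in for the DataType tag
}

impl ArrowNative for i32 {
    const NAME: &'static str = "Int32";
}
impl ArrowNative for u64 {
    const NAME: &'static str = "UInt64";
}

// Stand-in for an Arrow primitive array built from a Vec.
struct PrimitiveArray<T: ArrowNative> {
    values: Vec<T>,
}

// One generic impl replaces N macro invocations: T is inferred at the call
// site, and the data-type name comes from the trait.
impl<T: ArrowNative> From<Vec<T>> for PrimitiveArray<T> {
    fn from(values: Vec<T>) -> Self {
        PrimitiveArray { values }
    }
}

fn main() {
    let a: PrimitiveArray<i32> = vec![1, 2, 3].into();
    assert_eq!(a.values.len(), 3);
    assert_eq!(i32::NAME, "Int32");
    println!("from-vec ok");
}
```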
[jira] [Resolved] (ARROW-9990) [Rust] [DataFusion] NOT is not plannable
[ https://issues.apache.org/jira/browse/ARROW-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9990. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8183 [https://github.com/apache/arrow/pull/8183] > [Rust] [DataFusion] NOT is not plannable > > > Key: ARROW-9990 > URL: https://issues.apache.org/jira/browse/ARROW-9990 > Project: Apache Arrow > Issue Type: Bug >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > We have the physical operator, but it is not usable in the logical planning.
[jira] [Resolved] (ARROW-9971) [Rust] Speedup take
[ https://issues.apache.org/jira/browse/ARROW-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9971. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8170 [https://github.com/apache/arrow/pull/8170] > [Rust] Speedup take > --- > > Key: ARROW-9971 > URL: https://issues.apache.org/jira/browse/ARROW-9971 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 5h > Remaining Estimate: 0h >
[jira] [Updated] (ARROW-10033) ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False
[ https://issues.apache.org/jira/browse/ARROW-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Hooper updated ARROW-10033:

Description:
{{ArrowReaderProperties}} has a {{::arrow::io::AsyncContext async_context_;}} member. Its constructor creates a thread pool -- regardless of options. As a caller, I expect {{!use_threads}} to prevent the creation of a thread pool. (Maybe there should be an exception if {{pre_buffer && !use_threads}}?)

Stack trace:
{noformat}
#0 arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
#1 0x008e4747 in arrow::internal::ThreadPool::Make (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
#2 0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252
#3 0x008a20ac in arrow::io::internal::MakeIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326
#4 0x008a21dd in arrow::io::internal::GetIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334
#5 0x008a064f in arrow::io::AsyncContext::AsyncContext ( this=0xea6bb0 ) at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49
#6 0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties ( this=0xea6b60 , use_threads=false) at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579
#7 0x005e1b98 in parquet::default_arrow_reader_properties () at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53
#8 0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder (this=0x7fffb31f0c60) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930
#9 0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0 , reader=0x7fffb31f0e08) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957
{noformat}

was:
`ArrowReaderProperties` has a `::arrow::io::AsyncContext async_context_;` member. Its ctor creates a thread pool. As a caller, I expect `use_threads=False` to prevent the creation of threads. (Maybe there should be an exception if `pre_buffer && !use_threads`?)
[jira] [Created] (ARROW-10033) ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False
Adam Hooper created ARROW-10033: --- Summary: ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False Key: ARROW-10033 URL: https://issues.apache.org/jira/browse/ARROW-10033 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 1.0.1 Reporter: Adam Hooper `ArrowReaderProperties` has a `::arrow::io::AsyncContext async_context_;` member. Its ctor creates a thread pool. Stack trace: ``` #0 arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121 #1 0x008e4747 in arrow::internal::ThreadPool::Make (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246 #2 0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8) at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252 #3 0x008a20ac in arrow::io::internal::MakeIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326 #4 0x008a21dd in arrow::io::internal::GetIOThreadPool () at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334 #5 0x008a064f in arrow::io::AsyncContext::AsyncContext ( this=0xea6bb0 ) at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49 #6 0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties ( this=0xea6b60 , use_threads=false) at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579 #7 0x005e1b98 in parquet::default_arrow_reader_properties () at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53 #8 0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder (this=0x7fffb31f0c60) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930 #9 0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0 , reader=0x7fffb31f0e08) at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957 ``` As a caller, I expect `use_threads=False` to prevent the creation of threads. (Maybe there should be an exception if `pre_buffer && !use_threads`?) 
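The natural remedy for "constructor eagerly builds a pool nobody asked for" is the lazy-initialization pattern: defer construction until first use. A sketch of the pattern in Rust rather than Arrow's C++ (stand-in `ThreadPool` type; the real fix would live in arrow's I/O internals):

```rust
use std::sync::OnceLock;

// Stand-in for a thread pool; real code would spawn worker threads here.
struct ThreadPool {
    threads: usize,
}

impl ThreadPool {
    fn new(threads: usize) -> Self {
        ThreadPool { threads }
    }
}

// The pool is only built the first time someone actually asks for it, so
// constructing reader properties with use_threads=false never pays the
// cost of spawning threads.
fn io_thread_pool() -> &'static ThreadPool {
    static POOL: OnceLock<ThreadPool> = OnceLock::new();
    POOL.get_or_init(|| ThreadPool::new(8))
}

fn main() {
    // Nothing is created until this first call.
    assert_eq!(io_thread_pool().threads, 8);
    println!("pool initialized lazily");
}
```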
-- This message was sent by Atlassian Jira (v8.3.4#803005)
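One way to avoid paying for threads at construction time is lazy initialization: defer building the pool until something actually submits work. A minimal Python sketch of that pattern (the `LazyPool` class and its names are illustrative, not Arrow's C++ API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class LazyPool:
    """Defer thread-pool construction until the first task is submitted.

    A properties object holding a LazyPool spawns no threads at
    construction time, matching the expectation that use_threads=False
    (or simply never reading) should cost nothing.
    """

    def __init__(self, max_workers):
        self._max_workers = max_workers
        self._pool = None
        self._lock = threading.Lock()

    @property
    def started(self):
        # True only once the underlying executor (and its threads) exists.
        return self._pool is not None

    def submit(self, fn, *args, **kwargs):
        with self._lock:
            if self._pool is None:  # first use: pay for the threads now
                self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
        return self._pool.submit(fn, *args, **kwargs)
```

A caller that never calls `submit` never triggers thread creation, which is the behavior the reporter expected from `use_threads=False`.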
[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date
[ https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li updated ARROW-10032:
-
Description:
"Replicating AppVeyor Builds" needs the following changes: [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
* The recommended VM does not include the C++ compiler; we should link to the build tools and describe which of them need installation
* Boost: the b2 script now requires --with, not -with, flags
* The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
* Prefer JOB=Build_Debug, as otherwise the build forces clcache
* BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
* Use conda manually to install gtest gflags ninja rapidjson grpc-cpp protobuf

Even with this:
* The developer prompt can't find cl.exe (the compiler). (You must restart the VM!)
* The PowerShell prompt can't use conda (it complains that a config file isn't signed). Solution: run a PowerShell instance as administrator and run "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

was:
"Replicating AppVeyor Builds" needs the following changes: [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
* The recommended VM does not include the C++ compiler; we should link to the build tools and describe which of them need installation
* Boost: the b2 script now requires --with, not -with, flags
* The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
* Prefer JOB=Build_Debug, as otherwise the build forces clcache
* BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
* Use conda manually to install gtest gflags ninja rapidjson

Even with this:
* The developer prompt can't find cl.exe (the compiler). (You must restart the VM!)
* The PowerShell prompt can't use conda (it complains that a config file isn't signed). Solution: run a PowerShell instance as administrator and run "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

> [Documentation] C++ Windows docs are out of date
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation
> Reporter: David Li
> Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes:
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
> * The recommended VM does not include the C++ compiler; we should link to the build tools and describe which of them need installation
> * Boost: the b2 script now requires --with, not -with, flags
> * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
> * Prefer JOB=Build_Debug, as otherwise the build forces clcache
> * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
> * Use conda manually to install gtest gflags ninja rapidjson grpc-cpp protobuf
>
> Even with this:
> * The developer prompt can't find cl.exe (the compiler). (You must restart the VM!)
> * The PowerShell prompt can't use conda (it complains that a config file isn't signed)
> Solution: run a PowerShell instance as administrator and run "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva
[ https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197892#comment-17197892 ] Maarten Breddels commented on ARROW-10023: -- It's gonna be in C++, I can push an initial version when I find the time, so you can take a look. I do a split into a list of strings, with a pattern separator, whitespace (ascii and utf8), and still need to finish reverse utf8 whitespace. You want a version that splits, and only takes the n-th part right? > [Gandiva][C++] Implementing Split part function in gandiva > -- > > Key: ARROW-10023 > URL: https://issues.apache.org/jira/browse/ARROW-10023 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
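The "splits, and only takes the n-th part" behavior discussed above is the SQL-style split_part (as in PostgreSQL and Hive). Its semantics can be modeled in a few lines of Python; this is a behavioral sketch only, and Gandiva's eventual signature may differ:

```python
def split_part(s: str, sep: str, n: int) -> str:
    """Split s on sep and return the 1-based n-th part.

    An out-of-range n yields an empty string, mirroring the common
    SQL behavior rather than raising an error.
    """
    parts = s.split(sep)
    return parts[n - 1] if 1 <= n <= len(parts) else ""
```

For example, `split_part("a,b,c", ",", 2)` returns `"b"`, and `split_part("a,b,c", ",", 5)` returns `""`.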
[jira] [Resolved] (ARROW-3757) [R] R bindings for Flight RPC client
[ https://issues.apache.org/jira/browse/ARROW-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-3757. Resolution: Fixed Issue resolved by pull request 7875 [https://github.com/apache/arrow/pull/7875] > [R] R bindings for Flight RPC client > > > Key: ARROW-3757 > URL: https://issues.apache.org/jira/browse/ARROW-3757 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, R >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva
[ https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197846#comment-17197846 ] Naman Udasi edited comment on ARROW-10023 at 9/17/20, 5:14 PM: --- [~maartenbreddels] Where will the split functions mentioned in ARROW-9991 be implemented? I think if possible we can make them reusable ? was (Author: namanu): [~maartenbreddels] Where will the split functions mentioned be implemented? I think if possible we can make them reusable ? > [Gandiva][C++] Implementing Split part function in gandiva > -- > > Key: ARROW-10023 > URL: https://issues.apache.org/jira/browse/ARROW-10023 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva
[ https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197846#comment-17197846 ] Naman Udasi commented on ARROW-10023: - [~maartenbreddels] Where will the split functions mentioned be implemented? I think if possible we can make them reusable ? > [Gandiva][C++] Implementing Split part function in gandiva > -- > > Key: ARROW-10023 > URL: https://issues.apache.org/jira/browse/ARROW-10023 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10029) [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10029: - Summary: [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset (was: Deadlock in the interaction of pyarrow FileSystem and ParquetDataset) > [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset > - > > Key: ARROW-10029 > URL: https://issues.apache.org/jira/browse/ARROW-10029 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: David McGuire >Priority: Major > Attachments: repro.py > > > @martindurant good news (for you): I have a repro test case that is 100% > {{pyarrow}}, so it looks like {{s3fs}} is not involved. > @jorisvandenbossche how should I follow up with this, based on > {{pyarrow.filesystem.LocalFileSystem}}? > Viewing the File System *directories* as a tree, one thread is required for > every non-leaf node, in order to avoid deadlock. 
> 1) dataset > 2) dataset/foo=1 > 3) dataset/foo=1/bar=2 > 4) dataset/foo=1/bar=2/baz=0 > 5) dataset/foo=1/bar=2/baz=1 > 6) dataset/foo=1/bar=2/baz=2 > *) dataset/foo=1/bar=2/baz=0/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=true > *) dataset/foo=1/bar=2/baz=0/qux=true > *) dataset/foo=1/bar=2/baz=2/qux=false > *) dataset/foo=1/bar=2/baz=2/qux=true > {code} > import pyarrow.parquet as pq > import pyarrow.filesystem as fs > class LoggingLocalFileSystem(fs.LocalFileSystem): > def walk(self, path): > print(path) > return super().walk(path) > fs = LoggingLocalFileSystem() > dataset_url = "dataset" > threads = 6 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > threads = 5 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > {code} > *_Call with 6 threads completes._* > *_Call with 5 threads hangs indefinitely._* > {code} > $ python repro.py > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > dataset/foo=1/bar=2/baz=0/qux=false > dataset/foo=1/bar=2/baz=0/qux=true > dataset/foo=1/bar=2/baz=1/qux=false > dataset/foo=1/bar=2/baz=1/qux=true > dataset/foo=1/bar=2/baz=2/qux=false > dataset/foo=1/bar=2/baz=2/qux=true > 6 > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > ^C > ... > KeyboardInterrupt > ^C > ... > KeyboardInterrupt > {code} > **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and > when omitting the {{filesystem}} argument. -- This message was sent by Atlassian Jira (v8.3.4#803005)
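The "one thread per non-leaf node" rule above can be reproduced without pyarrow at all. In the sketch below (pure Python; the `walk`/`count_pieces` helpers are hypothetical and model the blocking fan-out, not pyarrow's actual implementation), each non-leaf directory task submits its children to the pool and blocks on their futures. The main thread walks the root, so the pool must simultaneously hold the other five non-leaf nodes plus one free worker to drain the leaf directories: six threads in total, matching the report. With five, every worker ends up blocked and the walk hangs, so only the completing case is run here:

```python
from concurrent.futures import ThreadPoolExecutor

# Directory tree from the report; baz dirs are the deepest non-leaf
# nodes, and each qux dir is a leaf (contains only files).
BASE = "dataset/foo=1/bar=2"
TREE = {
    "dataset": ["dataset/foo=1"],
    "dataset/foo=1": [BASE],
    BASE: [f"{BASE}/baz={i}" for i in range(3)],
}
for baz in list(TREE[BASE]):
    TREE[baz] = [f"{baz}/qux=false", f"{baz}/qux=true"]

def walk(pool, path):
    children = TREE.get(path, [])
    if not children:
        return 1  # leaf directory: count one piece, never block
    # Non-leaf: fan out to the pool and *block* until all children
    # finish -- this blocking wait is what pins one worker per
    # non-leaf node and exhausts an undersized pool.
    futures = [pool.submit(walk, pool, child) for child in children]
    return sum(f.result() for f in futures)

def count_pieces(threads):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return walk(pool, "dataset")  # root runs on the calling thread
```

`count_pieces(6)` completes and returns 6 (the six qux leaves); calling `count_pieces(5)` would leave all five workers blocked with the leaf tasks still queued, the same indefinite hang seen with `metadata_nthreads=5`.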
[jira] [Resolved] (ARROW-8678) [C++][Parquet] Remove legacy arrow to level translation.
[ https://issues.apache.org/jira/browse/ARROW-8678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8678. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8184 [https://github.com/apache/arrow/pull/8184] > [C++][Parquet] Remove legacy arrow to level translation. > > > Key: ARROW-8678 > URL: https://issues.apache.org/jira/browse/ARROW-8678 > Project: Apache Arrow > Issue Type: Task > Components: C++, Python >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9636) [Python] Update documentation about 'LZO' compression in parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9636: -- Labels: beginner-friendly doc pull-request-available (was: beginner-friendly doc) > [Python] Update documentation about 'LZO' compression in parquet.write_table > > > Key: ARROW-9636 > URL: https://issues.apache.org/jira/browse/ARROW-9636 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Pierre >Priority: Trivial > Labels: beginner-friendly, doc, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > When trying to use 'LZO' codec in `pyarrow.parquet.write_table()` with below > code, I get an error message indicating that 'LZO' is not available. However, > this codec is mentioned as available in the doc > [[https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html]]. > > Code > {code:python} > from pyarrow import parquet as pq > pq.write_table(data, file, compression='LZO') > {code} > > Error message > {code:bash} > File "pyarrow/_parquet.pyx", line 1374, in > pyarrow._parquet.ParquetWriter.write_table > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Codec type LZO not supported in Parquet format > {code} > > I would suggest correcting the documentation, or making this codec available? > Thanks for your support. > Bests, > -- This message was sent by Atlassian Jira (v8.3.4#803005)
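Until the documentation and the writer agree, a caller can fail fast with a pre-check instead of hitting the late OSError. A hedged sketch: the `SUPPORTED_CODECS` set below reflects the codecs the Parquet C++ writer generally implements, but it should be verified against the installed pyarrow build:

```python
# Codecs the Parquet C++ writer generally implements; LZO is defined
# in the Parquet format but, as the error message shows, not
# supported by this implementation.
SUPPORTED_CODECS = {"NONE", "SNAPPY", "GZIP", "BROTLI", "LZ4", "ZSTD"}

def check_codec(name: str) -> str:
    """Normalize a codec name, raising early if it cannot work."""
    codec = name.upper()
    if codec not in SUPPORTED_CODECS:
        raise ValueError(
            f"Codec {name!r} not supported in Parquet format; "
            f"choose one of {sorted(SUPPORTED_CODECS)}"
        )
    return codec
```

Calling `check_codec("LZO")` raises immediately with an actionable message, rather than failing deep inside `ParquetWriter.write_table`.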
[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter
[ https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10030: --- Labels: pull-request-available (was: ) > [Rust] Support fromIter and toIter > -- > > Key: ARROW-10030 > URL: https://issues.apache.org/jira/browse/ARROW-10030 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Proposal for comments: > [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] > (dump of the document above) > Rust Arrow supports two main computational models: > # Batch Operations, that leverage some form of vectorization > # Element-by-element operations, that emerge in more complex operations > This document concerns element-by-element operations, that are common outside > of the library (and sometimes in the library). > h2. Element-by-element operations > These operations are programmatically written as: > # Downcast the array to its specific type > # Initialize buffers > # Iterate over indices and perform the operation, appending to the buffers > accordingly > # Create ArrayData with the required null bitmap, buffers, childs, etc. 
> # return ArrayRef from ArrayData
>
> We can split this process in 3 parts:
> # Initialization (1 and 2)
> # Iteration (3)
> # Finalization (4 and 5)
> Currently, the API that we offer to our users is:
> # as_any() to downcast the array based on its DataType
> # Builders for all types, that users can initialize, matching the downcasted array
> # Iterate
> ## Use for i in (0..array.len())
> ## Use {{Array::value(i)}} and {{Array::is_valid(i)/is_null(i)}}
> ## Use builder.append_value(new_value) or builder.append_null()
> # Finish the builder and wrap the result in an Arc
> This API has some issues:
> # value(i) +is unsafe+, even though it is not marked as such
> # Builders are usually slow due to the checks that they need to perform
> # The API is not intuitive
> h2. Proposal
> This proposal aims at improving this API in 2 specific ways:
> * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
> * Implement FromIterator<Item=T> and FromIterator<Item=Option<T>>
> so that users can write:
> {code:java}
> // incoming array
> let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
> let array = Arc::new(array) as ArrayRef;
> let array = array.as_any().downcast_ref::<Int32Array>().unwrap();
> // to and from iter, with a +1
> let result: Int32Array = array
>     .iter()
>     .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
>     .collect();
> let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
> assert_eq!(result, expected);
> {code}
>
> This results in an API that is:
> # Efficient, as it is our responsibility to create `FromIterator` implementations that efficiently populate the buffers, child arrays, etc. from an iterator
> # Safe, as it does not allow segfaults
> # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
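The FromIterator/IntoIterator pair in the proposal is easiest to see as buffer-building from an iterator of optionals. A pure-Python model (plain lists stand in for Arrow's values buffer and validity bitmap; this is an illustration of the collect semantics, not the Rust implementation):

```python
def from_iter(items):
    """Collect an iterator of optional values into (values, validity).

    None contributes a placeholder slot plus a cleared validity bit --
    the same shape an Arrow FromIterator impl fills its buffers with.
    """
    values, validity = [], []
    for item in items:
        if item is None:
            values.append(0)       # placeholder; never read back
            validity.append(False)
        else:
            values.append(item)
            validity.append(True)
    return values, validity

def to_iter(values, validity):
    """The inverse: yield optional values, mirroring IntoIterator."""
    for v, ok in zip(values, validity):
        yield v if ok else None
```

Round-tripping `[0, None, 2, None, 4]` through `to_iter`, mapping `+1` over the non-null slots, and collecting with `from_iter` mirrors the `map(...).collect()` example in the Rust snippet above.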
[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date
[ https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-10032: - Description: "Replicating AppVeyor Builds" needs the following changes: https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds * The recommended VM does not include the C++ compiler - we should link to the build tools and describe which of them needs installation * Boost: the b2 script now requires --with not -with flags Even with this: * The developer prompt can't find cl.exe (the compiler) * The PowerShell prompt can't use conda (it complains a config file isn't signed) was: * The recommended VM does not include the C++ compiler - we should link to the build tools and describe which of them needs installation * Boost: the b2 script now requires --with not -with flags Even with this: * The developer prompt can't find cl.exe (the compiler) * The PowerShell prompt can't use conda (it complains a config file isn't signed) > [Documentation] C++ Windows docs are out of date > > > Key: ARROW-10032 > URL: https://issues.apache.org/jira/browse/ARROW-10032 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: David Li >Priority: Major > > "Replicating AppVeyor Builds" needs the following changes: > https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds > * The recommended VM does not include the C++ compiler - we should link to > the build tools and describe which of them needs installation > * Boost: the b2 script now requires --with not -with flags > Even with this: > * The developer prompt can't find cl.exe (the compiler) > * The PowerShell prompt can't use conda (it complains a config file isn't > signed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10032) [Documentation] C++ Windows docs are out of date
David Li created ARROW-10032: Summary: [Documentation] C++ Windows docs are out of date Key: ARROW-10032 URL: https://issues.apache.org/jira/browse/ARROW-10032 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: David Li * The recommended VM does not include the C++ compiler - we should link to the build tools and describe which of them needs installation * Boost: the b2 script now requires --with not -with flags Even with this: * The developer prompt can't find cl.exe (the compiler) * The PowerShell prompt can't use conda (it complains a config file isn't signed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10029) Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197784#comment-17197784 ] David McGuire edited comment on ARROW-10029 at 9/17/20, 4:09 PM: - If it's not multi-threaded, then there won't be deadlock: {code} # This completes threads = 6 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) # This also completes threads = 5 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) {code} Running: {code} $ python repro.py 6 6 {code} was (Author: dmcguire): If it's not multi-threaded, then there won't be deadlock: {code} # This completes threads = 6 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) # This also completes threads = 5 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) {code} > Deadlock in the interaction of pyarrow FileSystem and ParquetDataset > > > Key: ARROW-10029 > URL: https://issues.apache.org/jira/browse/ARROW-10029 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: David McGuire >Priority: Major > Attachments: repro.py > > > @martindurant good news (for you): I have a repro test case that is 100% > {{pyarrow}}, so it looks like {{s3fs}} is not involved. > @jorisvandenbossche how should I follow up with this, based on > {{pyarrow.filesystem.LocalFileSystem}}? > Viewing the File System *directories* as a tree, one thread is required for > every non-leaf node, in order to avoid deadlock. 
> 1) dataset > 2) dataset/foo=1 > 3) dataset/foo=1/bar=2 > 4) dataset/foo=1/bar=2/baz=0 > 5) dataset/foo=1/bar=2/baz=1 > 6) dataset/foo=1/bar=2/baz=2 > *) dataset/foo=1/bar=2/baz=0/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=true > *) dataset/foo=1/bar=2/baz=0/qux=true > *) dataset/foo=1/bar=2/baz=2/qux=false > *) dataset/foo=1/bar=2/baz=2/qux=true > {code} > import pyarrow.parquet as pq > import pyarrow.filesystem as fs > class LoggingLocalFileSystem(fs.LocalFileSystem): > def walk(self, path): > print(path) > return super().walk(path) > fs = LoggingLocalFileSystem() > dataset_url = "dataset" > threads = 6 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > threads = 5 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > {code} > *_Call with 6 threads completes._* > *_Call with 5 threads hangs indefinitely._* > {code} > $ python repro.py > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > dataset/foo=1/bar=2/baz=0/qux=false > dataset/foo=1/bar=2/baz=0/qux=true > dataset/foo=1/bar=2/baz=1/qux=false > dataset/foo=1/bar=2/baz=1/qux=true > dataset/foo=1/bar=2/baz=2/qux=false > dataset/foo=1/bar=2/baz=2/qux=true > 6 > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > ^C > ... > KeyboardInterrupt > ^C > ... > KeyboardInterrupt > {code} > **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and > when omitting the {{filesystem}} argument. -- This message was sent by Atlassian Jira (v8.3.4#803005)
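The thread-count claim above (one thread per non-leaf directory) can be checked against the repro's partition tree with a short plain-Python sketch. This is an illustrative model only — the helper `non_leaf_dirs` is hypothetical and does not call pyarrow; it just counts the directories that `walk` must hold a thread on while their children are being listed.

```python
# Model of the repro's partition tree: each key is a directory name,
# each value is the dict of its subdirectories (empty dict = leaf).
# Assumption (hedged): the legacy ParquetDataset discovery blocks one
# pool thread per directory that still has children being walked, so the
# minimum safe metadata_nthreads equals the number of non-leaf directories.

def non_leaf_dirs(children):
    """Count directories in the tree that have at least one subdirectory."""
    count = 0
    for sub in children.values():
        if sub:  # non-empty children dict -> this directory is non-leaf
            count += 1 + non_leaf_dirs(sub)
    return count

# dataset/foo=1/bar=2/baz={0,1,2}/qux={false,true}
qux_level = {"qux=false": {}, "qux=true": {}}
tree = {
    "dataset": {
        "foo=1": {
            "bar=2": {f"baz={b}": dict(qux_level) for b in range(3)},
        },
    },
}

min_threads = non_leaf_dirs(tree)
print(min_threads)  # 6: dataset, foo=1, bar=2, baz=0, baz=1, baz=2
```

This matches the observed behavior in the repro: `metadata_nthreads=6` completes, while 5 hangs.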
[jira] [Commented] (ARROW-10029) Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197784#comment-17197784 ] David McGuire commented on ARROW-10029: --- If it's not multi-threaded, then there won't be deadlock: {code} # This completes threads = 6 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) # This also completes threads = 5 dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False) print(len(dataset.pieces)) {code} > Deadlock in the interaction of pyarrow FileSystem and ParquetDataset > > > Key: ARROW-10029 > URL: https://issues.apache.org/jira/browse/ARROW-10029 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: David McGuire >Priority: Major > Attachments: repro.py > > > @martindurant good news (for you): I have a repro test case that is 100% > {{pyarrow}}, so it looks like {{s3fs}} is not involved. > @jorisvandenbossche how should I follow up with this, based on > {{pyarrow.filesystem.LocalFileSystem}}? > Viewing the File System *directories* as a tree, one thread is required for > every non-leaf node, in order to avoid deadlock. 
> 1) dataset > 2) dataset/foo=1 > 3) dataset/foo=1/bar=2 > 4) dataset/foo=1/bar=2/baz=0 > 5) dataset/foo=1/bar=2/baz=1 > 6) dataset/foo=1/bar=2/baz=2 > *) dataset/foo=1/bar=2/baz=0/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=false > *) dataset/foo=1/bar=2/baz=1/qux=true > *) dataset/foo=1/bar=2/baz=0/qux=true > *) dataset/foo=1/bar=2/baz=2/qux=false > *) dataset/foo=1/bar=2/baz=2/qux=true > {code} > import pyarrow.parquet as pq > import pyarrow.filesystem as fs > class LoggingLocalFileSystem(fs.LocalFileSystem): > def walk(self, path): > print(path) > return super().walk(path) > fs = LoggingLocalFileSystem() > dataset_url = "dataset" > threads = 6 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > threads = 5 > dataset = pq.ParquetDataset(dataset_url, filesystem=fs, > validate_schema=False, metadata_nthreads=threads) > print(len(dataset.pieces)) > {code} > *_Call with 6 threads completes._* > *_Call with 5 threads hangs indefinitely._* > {code} > $ python repro.py > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > dataset/foo=1/bar=2/baz=0/qux=false > dataset/foo=1/bar=2/baz=0/qux=true > dataset/foo=1/bar=2/baz=1/qux=false > dataset/foo=1/bar=2/baz=1/qux=true > dataset/foo=1/bar=2/baz=2/qux=false > dataset/foo=1/bar=2/baz=2/qux=true > 6 > dataset > dataset/foo=1 > dataset/foo=1/bar=2 > dataset/foo=1/bar=2/baz=0 > dataset/foo=1/bar=2/baz=1 > dataset/foo=1/bar=2/baz=2 > ^C > ... > KeyboardInterrupt > ^C > ... > KeyboardInterrupt > {code} > **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and > when omitting the {{filesystem}} argument. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10031) Support Java benchmark in Ursabot
[ https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10031: - Assignee: Apache Arrow JIRA Bot (was: Kazuaki Ishizaki) > Support Java benchmark in Ursabot > - > > Key: ARROW-10031 > URL: https://issues.apache.org/jira/browse/ARROW-10031 > Project: Apache Arrow > Issue Type: New Feature > Components: CI, Java >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Based on [the > suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e], > Ursabot will support Java benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10031) Support Java benchmark in Ursabot
[ https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10031: - Assignee: Kazuaki Ishizaki (was: Apache Arrow JIRA Bot) > Support Java benchmark in Ursabot > - > > Key: ARROW-10031 > URL: https://issues.apache.org/jira/browse/ARROW-10031 > Project: Apache Arrow > Issue Type: New Feature > Components: CI, Java >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Based on [the > suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e], > Ursabot will support Java benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10031) Support Java benchmark in Ursabot
[ https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10031: --- Labels: pull-request-available (was: ) > Support Java benchmark in Ursabot > - > > Key: ARROW-10031 > URL: https://issues.apache.org/jira/browse/ARROW-10031 > Project: Apache Arrow > Issue Type: New Feature > Components: CI, Java >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Based on [the > suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e], > Ursabot will support Java benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10024) [C++][Parquet] Create nested reading benchmarks
[ https://issues.apache.org/jira/browse/ARROW-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-10024. - Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8203 [https://github.com/apache/arrow/pull/8203] > [C++][Parquet] Create nested reading benchmarks > --- > > Key: ARROW-10024 > URL: https://issues.apache.org/jira/browse/ARROW-10024 > Project: Apache Arrow > Issue Type: Sub-task > Components: Benchmarking, C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10026) [C++] Improve kernel performance on small batches
[ https://issues.apache.org/jira/browse/ARROW-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197739#comment-17197739 ] Wes McKinney edited comment on ARROW-10026 at 9/17/20, 3:03 PM: IMHO we should consider a slimmed down data structure for the implementation of {{ExecBatch}} that does not use {{arrow::util::variant}}, considering that we only ever will have either {{ArrayData}} or {{Scalar}} as value types. The overhead of slicing {{ArrayData}} objects is also non-trivial was (Author: wesmckinn): IMHO we should consider a slimmed down data structure for {{ExecBatch}} that does not use {{arrow::util::variant}}, considering that we only ever will have either {{ArrayData}} or {{Scalar}} as value types. The overhead of slicing {{ArrayData}} objects is also non-trivial > [C++] Improve kernel performance on small batches > - > > Key: ARROW-10026 > URL: https://issues.apache.org/jira/browse/ARROW-10026 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > > It seems that invoking some kernels on smallish batches has quite an overhead: > {code} > ArrayArrayKernel/32768/100 2860 ns 2859 ns 245195 bytes_per_second=10.6727G/s items_per_second=2.86494G/s null_percent=1 size=32.768k > ArrayArrayKernel/32768/0 2752 ns 2751 ns 249316 bytes_per_second=11.093G/s items_per_second=2.97775G/s null_percent=0 size=32.768k > ArrayArrayKernel/524288/100 18633 ns 18630 ns 36548 bytes_per_second=26.2097G/s items_per_second=7.03561G/s null_percent=1 size=524.288k > ArrayArrayKernel/524288/0 18260 ns 18257 ns 38245 bytes_per_second=26.7451G/s items_per_second=7.17933G/s null_percent=0 size=524.288k > {code} > We should investigate and try to lighten the overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10026) [C++] Improve kernel performance on small batches
[ https://issues.apache.org/jira/browse/ARROW-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197739#comment-17197739 ] Wes McKinney commented on ARROW-10026: -- IMHO we should consider a slimmed down data structure for {{ExecBatch}} that does not use {{arrow::util::variant}}, considering that we only ever will have either {{ArrayData}} or {{Scalar}} as value types. The overhead of slicing {{ArrayData}} objects is also non-trivial > [C++] Improve kernel performance on small batches > - > > Key: ARROW-10026 > URL: https://issues.apache.org/jira/browse/ARROW-10026 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > > It seems that invoking some kernels on smallish batches has quite an overhead: > {code} > ArrayArrayKernel/32768/100 2860 ns 2859 ns 245195 bytes_per_second=10.6727G/s items_per_second=2.86494G/s null_percent=1 size=32.768k > ArrayArrayKernel/32768/0 2752 ns 2751 ns 249316 bytes_per_second=11.093G/s items_per_second=2.97775G/s null_percent=0 size=32.768k > ArrayArrayKernel/524288/100 18633 ns 18630 ns 36548 bytes_per_second=26.2097G/s items_per_second=7.03561G/s null_percent=1 size=524.288k > ArrayArrayKernel/524288/0 18260 ns 18257 ns 38245 bytes_per_second=26.7451G/s items_per_second=7.17933G/s null_percent=0 size=524.288k > {code} > We should investigate and try to lighten the overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197720#comment-17197720 ] Troy Zimmerman commented on ARROW-10027: You rock! Thank you for the fast turnaround. > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 
'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
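The invariant the ARROW-10027 report violates can be stated with a short plain-Python sketch: applying a filter must shrink every column of a table, including an all-null one. The helper `filter_columns` below is hypothetical (not the pyarrow dataset API); it models what `to_table(filter=...)` should guarantee, and what the report shows going wrong for the {{other}} column.

```python
# Hedged sketch: a table modeled as a dict of equal-length column lists.
# A correct filter applies the boolean mask to *every* column; the reported
# bug left the null column at its original length 10 while 'id' and 'name'
# were correctly reduced to 3 rows.

def filter_columns(columns, mask):
    """Apply a boolean mask row-wise to every column of a table-like dict."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }

table = {
    "id": list(range(10)),
    "name": ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"],
    "other": [None] * 10,  # the all-null column from the report
}
mask = [value in (1, 4, 7) for value in table["id"]]  # ds.field("id").isin([1, 4, 7])

filtered = filter_columns(table, mask)

# Every column ends up with the filtered length; a column of a different
# length is exactly the malformed result that crashed to_pandas().
assert {len(column) for column in filtered.values()} == {3}
```

With this invariant in place, {{to_pydict}} on the filtered table would report 3 values in each of {{id}}, {{name}}, and {{other}}.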
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197703#comment-17197703 ] Joris Van den Bossche commented on ARROW-10027: --- Don't worry about the crash (I actually saw a crash when closing my python session afterwards). The malformed dataframe should be solved by fixing the filter bug, which is tackled by my PR https://github.com/apache/arrow/pull/8209 > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. 
> {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197685#comment-17197685 ] Troy Zimmerman commented on ARROW-10027: [~jorisvandenbossche] Thank you for the quick & detailed response. I'll take a closer look at the core that is dumped to see if I can narrow down what's causing the crash since it just seems to be on my end. > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. 
> {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter
[ https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-10030: -- Description: Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] (dump of the document above) Rust Arrow supports two main computational models: # Batch Operations, that leverage some form of vectorization # Element-by-element operations, that emerge in more complex operations This document concerns element-by-element operations, that are common outside of the library (and sometimes in the library). h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, childs, etc. # return ArrayRef from ArrayData We can split this process in 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, that users can initialize, matching the downcasted array # Iterate ## Use for i in (0..array.len()) ## Use {{Array::value(i)}} and {{Array::is_valid(i)/is_null(i)}} ## use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. 
Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator Iterator and Iterator> * Implement FromIterator and Item=Option so that users can write: {code:java} // incoming array let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); let array = Arc::new(array) as ArrayRef; let array = array.as_any().downcast_ref::().unwrap(); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # efficient, as it is our responsibility to create `FromIterator` that are efficient in populating the buffers/child etc from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc, only native Rust. was: Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] (dump of the document above) Rust Arrow supports two main computational models: # Batch Operations, that leverage some form of vectorization # Element-by-element operations, that emerge in more complex operations This document concerns element-by-element operations, that are common outside of the library (and sometimes in the library). h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, childs, etc. 
# return ArrayRef from ArrayData We can split this process in 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, that users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i)` # use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator Iterator and Iterator> * Implement FromIterator and Item=Option so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # efficient, as it is our responsibility to create `FromIterator` that are efficient in populating the buffers/child etc from an iterator # Safe, as it does not al
[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter
[ https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-10030: -- Description: Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] (dump of the document above) Rust Arrow supports two main computational models: # Batch Operations, that leverage some form of vectorization # Element-by-element operations, that emerge in more complex operations This document concerns element-by-element operations, that are common outside of the library (and sometimes in the library). h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, childs, etc. # return ArrayRef from ArrayData We can split this process in 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, that users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i)` # use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. 
Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator Iterator and Iterator> * Implement FromIterator and Item=Option so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # efficient, as it is our responsibility to create `FromIterator` that are efficient in populating the buffers/child etc from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc, only native Rust. was: Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] (dump of the document above) Rust Arrow supports two main computational models: # Batch Operations, that leverage some form of vectorization # Element-by-element operations, that emerge in more complex operations This document concerns element-by-element operations, that are the most common operations outside of the library. h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, childs, etc. 
# return ArrayRef from ArrayData We can split this process in 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, that users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i)` # use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator Iterator and Iterator> * Implement FromIterator and Item=Option so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # efficient, as it is our responsibility to create `FromIterator` that are efficient in populating the buffers/child etc from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc, only native Rust. > [Rust] Support fromIte
[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10027: - Assignee: Joris Van den Bossche (was: Apache Arrow JIRA Bot) > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': 
[None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10027: - Assignee: Apache Arrow JIRA Bot (was: Joris Van den Bossche) > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': 
[None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10027: --- Labels: pull-request-available (was: ) > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, 
None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-10027: - Assignee: Joris Van den Bossche > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Assignee: Joris Van den Bossche >Priority: Major > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights 
the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197535#comment-17197535 ] Joris Van den Bossche commented on ARROW-10027: --- So it seems this is a bug not directly in the Dataset code, but in the filter operation. Also when manually filtering a RecordBatch, it incorrectly returns a batch with the null column not being filtered: {code} table = pa.Table.from_arrays( arrays=[ pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]), pa.array([None, None, None, None, None, None, None, None, None, None]), ], names=["id", "name", "other"], ) batch = table.to_batches()[0] {code} {code} In [32]: batch Out[32]: pyarrow.RecordBatch id: int64 name: string other: null In [33]: batch.num_rows Out[33]: 10 In [34]: filtered_batch = batch.filter(pa.array([True, False]*5)) In [35]: filtered_batch.num_rows Out[35]: 5 In [36]: filtered_batch.column(2) Out[36]: 10 nulls In [37]: len(filtered_batch.column(2)) Out[37]: 10 {code} Directly filtering on the array or chunked array or on a Table seems to work, though: {code} In [38]: filtered_table = table.filter(pa.array([True, False]*5)) In [39]: filtered_table.num_rows Out[39]: 5 In [40]: filtered_table['other'] Out[40]: [ 5 nulls ] In [41]: chunked_array = table['other'] In [42]: chunked_array Out[42]: [ 10 nulls ] In [43]: chunked_array.filter(pa.array([True, False]*5)) Out[43]: [ 5 nulls ] In [44]: chunked_array.chunks[0].filter(pa.array([True, False]*5)) Out[44]: 5 nulls {code} > [Python] Incorrect null column returned when using a dataset filter > expression. 
> --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Priority: Major > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. 
Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197516#comment-17197516 ] Joris Van den Bossche commented on ARROW-10027: --- Also selecting the null column from the filtered table indicates it still has 10 elements: {code} In [9]: table['other'] Out[9]: [ 10 nulls ] {code} so it seems the null column doesn't get properly filtered (which means for a NullArray: change the length) > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Priority: Major > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. > {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > 
name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
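The correct filter semantics Joris describes (a boolean mask selects matching rows, and even an all-null column must shrink to the mask's selection count) can be sketched with stdlib types. Here {{Vec<Option<i32>>}} stands in for an Arrow array; the {{filter}} helper is illustrative, not the library's kernel:

```rust
// Mask-based filtering: keep only the slots where the mask is true.
fn filter<T: Clone>(values: &[Option<T>], mask: &[bool]) -> Vec<Option<T>> {
    values
        .iter()
        .zip(mask.iter())
        .filter_map(|(v, &keep)| if keep { Some(v.clone()) } else { None })
        .collect()
}

fn main() {
    // An all-null "other" column of length 10, as in the bug report.
    let other: Vec<Option<i32>> = vec![None; 10];
    // Mask equivalent to the report's id.isin([1, 4, 7]) filter.
    let mask: Vec<bool> = (0..10).map(|i| [1, 4, 7].contains(&i)).collect();
    let filtered = filter(&other, &mask);
    // Correct behavior: the null column shrinks with the mask (the bug kept all 10).
    assert_eq!(filtered.len(), 3);
}
```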
[jira] [Comment Edited] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197516#comment-17197516 ] Joris Van den Bossche edited comment on ARROW-10027 at 9/17/20, 8:53 AM: - Also selecting the null column from the filtered table indicates it still has 10 elements: {code} In [9]: table['other'] Out[9]: [ 10 nulls ] {code} so it seems the null column doesn't get properly filtered (which means for a NullArray: change the length) was (Author: jorisvandenbossche): Also selecting the null column from the filtered table indicates it still has 10 elements: {code} In [9]: table['other'] Out[9]: [ 10 nulls ] {code} so it seems the null column doesn't get propertly filtered (which means for a NullArray: change the length) > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Priority: Major > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. 
> {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.
[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197515#comment-17197515 ] Joris Van den Bossche commented on ARROW-10027: --- [~tazimmerman] thanks for the report! I don't see the crash in {{to_pandas}} (using master): {code} In [7]: table.to_pandas() Out[7]: id name other 0 1one None 1 4 four None 2 7 seven None {code} but also see the wrong behaviour of {{to_pydict}}, so there is certainly something fishy going on. > [Python] Incorrect null column returned when using a dataset filter > expression. > --- > > Key: ARROW-10027 > URL: https://issues.apache.org/jira/browse/ARROW-10027 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Troy Zimmerman >Priority: Major > > When using dataset filter expressions (which I <3) with Parquet files, entire > {{null}} columns are returned, rather than rows that matched other columns in > the filter. > Here's an example. 
> {code:python} > In [7]: import pyarrow as pa > In [8]: import pyarrow.dataset as ds > In [9]: import pyarrow.parquet as pq > In [10]: table = pa.Table.from_arrays( > ...: arrays=[ > ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), > ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", > "seven", "eight", "nine"]), > ...: pa.array([None, None, None, None, None, None, None, None, None, > None]), > ...: ], > ...: names=["id", "name", "other"], > ...: ) > In [11]: table > Out[11]: > pyarrow.Table > id: int64 > name: string > other: null > In [12]: table.to_pandas() > Out[12]: >id name other > 0 0 zero None > 1 1one None > 2 2two None > 3 3 three None > 4 4 four None > 5 5 five None > 6 6six None > 7 7 seven None > 8 8 eight None > 9 9 nine None > In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0") > In [14]: data = ds.dataset("/tmp/test.parquet") > In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7])) > In [16]: table > Out[16]: > pyarrow.Table > id: int64 > name: string > other: null > In [17]: table.to_pydict() > Out[17]: > {'id': [1, 4, 7], > 'name': ['one', 'four', 'seven'], > 'other': [None, None, None, None, None, None, None, None, None, None]} > {code} > The {{to_pydict}} method highlights the strange behavior: the {{id}} and > {{name}} columns have 3 elements, but the {{other}} column has all 10. When I > call {{to_pandas}} on the filtered table, the program crashes. > This could be a C++ issue, but, since my examples are in Python, I > categorized it as a Python issue. Let me know if that's wrong and I'll note > that for the future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10031) Support Java benchmark in Ursabot
Kazuaki Ishizaki created ARROW-10031: Summary: Support Java benchmark in Ursabot Key: ARROW-10031 URL: https://issues.apache.org/jira/browse/ARROW-10031 Project: Apache Arrow Issue Type: New Feature Components: CI, Java Affects Versions: 2.0.0 Reporter: Kazuaki Ishizaki Assignee: Kazuaki Ishizaki Based on [the suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e], Ursabot will support Java benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9862) Throw an exception in UnsafeDirectLittleEndian on Big-Endian platform
[ https://issues.apache.org/jira/browse/ARROW-9862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-9862: --- Assignee: Kazuaki Ishizaki > Throw an exception in UnsafeDirectLittleEndian on Big-Endian platform > - > > Key: ARROW-9862 > URL: https://issues.apache.org/jira/browse/ARROW-9862 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The current code throws an intended exception on a big-endian platform while > this class supports native endianness for the primitive data types (up to > 64-bit). > {code:java} > throw new IllegalStateException("Arrow only runs on LittleEndian systems."); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9861) [Java] Failed Arrow Vector on big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-9861: --- Assignee: Kazuaki Ishizaki > [Java] Failed Arrow Vector on big-endian platform > - > > Key: ARROW-9861 > URL: https://issues.apache.org/jira/browse/ARROW-9861 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The following test failure occurs on a big-endian platform > {code:java} > mvn -B -Drat.skip=true > -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn > -Dflatc.download.skip=true -rf :arrow-vector test > ... > [INFO] Running org.apache.arrow.vector.TestDecimalVector > [ERROR] Tests run: 9, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.008 > s <<< FAILURE! - in org.apache.arrow.vector.TestDecimalVector > [ERROR] setUsingArrowBufOfLEInts Time elapsed: 0.001 s <<< FAILURE! > java.lang.AssertionError: expected:<705.32> but was:<-20791293.44> > at > org.apache.arrow.vector.TestDecimalVector.setUsingArrowBufOfLEInts(TestDecimalVector.java:295) > [ERROR] setUsingArrowLongLEBytes Time elapsed: 0.001 s <<< FAILURE! > java.lang.AssertionError: expected:<9223372036854775807> but was:<-129> > at > org.apache.arrow.vector.TestDecimalVector.setUsingArrowLongLEBytes(TestDecimalVector.java:322) > [ERROR] testLongReadWrite Time elapsed: 0.001 s <<< FAILURE! > java.lang.AssertionError: expected:<-2> but was:<-72057594037927937> > at > org.apache.arrow.vector.TestDecimalVector.testLongReadWrite(TestDecimalVector.java:176) > ... 
> [ERROR] Failures: > [ERROR] TestDecimalVector.setUsingArrowBufOfLEInts:295 expected:<705.32> > but was:<-20791293.44> > [ERROR] TestDecimalVector.setUsingArrowLongLEBytes:322 > expected:<9223372036854775807> but was:<-129> > [ERROR] TestDecimalVector.testLongReadWrite:176 expected:<-2> but > was:<-72057594037927937> > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter
[ https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-10030: -- Component/s: Rust Description: Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing] (dump of the document above) Rust Arrow supports two main computational models: # Batch Operations, which leverage some form of vectorization # Element-by-element operations, which emerge in more complex operations This document concerns element-by-element operations, which are the most common operations outside of the library. h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, children, etc. # Return ArrayRef from ArrayData We can split this process into 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, which users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i) # Use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. 
Proposal This proposal aims to improve this API in two specific ways: * Implement IntoIterator, yielding an iterator over Option values * Implement FromIterator over Item=Option, so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # Efficient, as it is our responsibility to provide `FromIterator` implementations that populate the buffers/children etc. efficiently from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust. was: Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing (dump of the proposal:) Rust Arrow supports two main computational models: # Batch Operations, that leverage some form of vectorization # Element-by-element operations, that emerge in more complex operations This document concerns element-by-element operations, that are the most common operations outside of the library. h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, childs, etc. 
# return ArrayRef from ArrayData We can split this process in 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, that users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i)` # use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator Iterator and Iterator> * Implement FromIterator and Item=Option so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # efficient, as it is our responsibility to create `FromIterator` that are efficient in populating the buffers/child etc from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc, only native Rust. > [Rust] Support
[jira] [Created] (ARROW-10030) [Rust] Support fromIter and toIter
Jorge created ARROW-10030: - Summary: [Rust] Support fromIter and toIter Key: ARROW-10030 URL: https://issues.apache.org/jira/browse/ARROW-10030 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing (dump of the proposal:) Rust Arrow supports two main computational models: # Batch Operations, which leverage some form of vectorization # Element-by-element operations, which emerge in more complex operations This document concerns element-by-element operations, which are the most common operations outside of the library. h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, children, etc. # Return ArrayRef from ArrayData We can split this process into 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, which users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i) # Use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # builders are usually slow due to the checks that they need to perform # The API is not intuitive h2. Proposal This proposal aims to improve this API in two specific ways: * Implement IntoIterator, yielding an iterator over Option values * Implement FromIterator over Item=Option, so that users can write: {code:java} let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]); // to and from iter, with a +1 let result: Int32Array = array .iter() .map(|e| if let Some(r) = e { Some(r + 1) } else { None }) .collect(); let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); assert_eq!(result, expected); {code} This results in an API that is: # Efficient, as it is our responsibility to provide `FromIterator` implementations that populate the buffers/children etc. efficiently from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
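The collect() ergonomics the proposal asks for mirror std's own FromIterator. A stdlib-only sketch of the pattern, with Vec<Option<i32>> standing in for the proposed Int32Array iterator support (an assumption for illustration, since the Int32Array impls do not exist yet):

```rust
fn main() {
    let array: Vec<Option<i32>> = vec![Some(0), None, Some(2), None, Some(4)];
    // Map over validity-aware elements, as the proposed Int32Array iterator would,
    // and rebuild the array via FromIterator with a plain .collect().
    let result: Vec<Option<i32>> = array.iter().map(|e| e.map(|r| r + 1)).collect();
    assert_eq!(result, vec![Some(1), None, Some(3), None, Some(5)]);
}
```

Because nulls are just None, the +1 is applied through Option::map with no builder, buffer, or bitmap handling at the call site, which is the ergonomic win the proposal targets.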