[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-17 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198118#comment-17198118
 ] 

Paul Taylor commented on ARROW-8394:


[~pprice] [~timconkling] [~Costa] PR is up @ 
https://github.com/apache/arrow/pull/8216

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When attempting to use apache-arrow within a web application, the TypeScript 
> compiler throws the following errors in some of Arrow's .d.ts files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> // ...
> constructor() {
> const t = Table.from('');
> }
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> The tsconfig.json file looks like:
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }
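
The TS2417 error quoted above is easier to see in isolation. Below is a 
hypothetical reduction (not Arrow's actual declarations): a static method 
literally named {{new}}, as Arrow's vector classes declare, whose parameter 
type changes incompatibly in the subclass.

{code}
class Chunked {
  static new(value: number): Chunked { return new Chunked(); }
}

class Column extends Chunked {
  // error TS2417: Class static side 'typeof Column' incorrectly extends
  // base class static side 'typeof Chunked'.
  // Types of property 'new' are incompatible.
  static new(value: string): Column { return new Column(); }
}
{code}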



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-17 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195668#comment-17195668
 ] 

Paul Taylor edited comment on ARROW-8394 at 9/18/20, 5:30 AM:
--

I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/fix/typescript-3.9-errors


was (Author: paul.e.taylor):
I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When attempting to use apache-arrow within a web application, the TypeScript 
> compiler throws the following errors in some of Arrow's .d.ts files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> // ...
> constructor() {
> const t = Table.from('');
> }
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> The tsconfig.json file looks like:
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8394:
--
Labels: pull-request-available  (was: )

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When attempting to use apache-arrow within a web application, the TypeScript 
> compiler throws the following errors in some of Arrow's .d.ts files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> // ...
> constructor() {
> const t = Table.from('');
> }
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> The tsconfig.json file looks like:
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-17 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand reopened ARROW-10002:
-

My first PR only removes {{default fn}} from one trait.

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], 
> primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.
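
For readers unfamiliar with the feature, here is a minimal sketch (not Arrow 
code) of what {{default fn}} looks like. It compiles only on nightly with the 
feature gate enabled, which is exactly why its use blocks a move to stable 
Rust.

{code}
#![feature(specialization)]

use std::fmt::Debug;

trait Encode {
    fn encode(&self) -> String;
}

// Blanket impl: `default fn` marks the method as specializable.
impl<T: Debug> Encode for T {
    default fn encode(&self) -> String {
        format!("{:?}", self)
    }
}

// A more specific impl may then override the default.
impl Encode for String {
    fn encode(&self) -> String {
        self.clone()
    }
}
{code}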



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10036) [Rust] [DataFusion] Test that the final schema is expected in integration tests

2020-09-17 Thread Jorge (Jira)
Jorge created ARROW-10036:
-

 Summary: [Rust] [DataFusion] Test that the final schema is 
expected in integration tests
 Key: ARROW-10036
 URL: https://issues.apache.org/jira/browse/ARROW-10036
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Currently, our integration tests convert a RecordBatch to a string, which we 
use for testing, but they do not verify that the final schema matches our 
expectations.

We should add a test for this, which includes:
 # field name
 # field type
 # field nullability

for every field in the schema.
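
A minimal sketch of such an assertion, assuming the Rust arrow crate's 
{{Schema}}/{{Field}} accessors ({{fields()}}, {{name()}}, {{data_type()}}, 
{{is_nullable()}}); the helper name is hypothetical:

{code}
use arrow::datatypes::{DataType, Schema};

// Hypothetical helper: checks name, type, and nullability of every field.
fn assert_schema(actual: &Schema, expected: &[(&str, DataType, bool)]) {
    assert_eq!(actual.fields().len(), expected.len());
    for (field, (name, data_type, nullable)) in
        actual.fields().iter().zip(expected.iter())
    {
        assert_eq!(field.name().as_str(), *name);    // field name
        assert_eq!(field.data_type(), data_type);    // field type
        assert_eq!(field.is_nullable(), *nullable);  // field nullability
    }
}
{code}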



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10002.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8206
[https://github.com/apache/arrow/pull/8206]

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], 
> primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9965) [Java] Buffer capacity calculations are slow for fixed-width vectors

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9965:
--
Labels: pull-request-available  (was: )

> [Java] Buffer capacity calculations are slow for fixed-width vectors
> 
>
> Key: ARROW-9965
> URL: https://issues.apache.org/jira/browse/ARROW-9965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Josiah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: after_patch_profile_prof_perfasm_unsafe_true, 
> before_patch_profile_prof_perfasm_unsafe_true
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It turns out that setSafe performs a very expensive integer division when 
> trying to compute buffer capacity; specifically, it divides by the field 
> size, which isn't hardcoded. Although the field size is typically a power of 
> 2 for alignment reasons, the division doesn't compile down to a bit shift, 
> because the divisor isn't a compile-time constant.
> This is done here: 
> https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
>  
> Forcing a bit-shift operation results in a large speedup in benchmarks. When 
> turning off bounds checks (which affects another portion of set), 
> microbenchmarks indicate that the throughput of setting the elements of a 
> vector via setSafe increases by ~174% (almost 3 times faster). With bounds 
> checks on, this is reduced to a 73% increase (Amdahl's law).
> We use setSafe right now in a hot loop to set Arrow vectors in an internal 
> data-intensive service (for now), although in the future, we would prefer a 
> more specialized vector append interface to skip all the other indirection 
> and bit manipulation instructions, while not directly manipulating the 
> exposed (native) memory.
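
A self-contained sketch of the optimization being described (not the actual 
patch): precompute log2 of the element width once, then replace the per-call 
division with a shift, which is valid because the width is a power of two.

{code:java}
final class CapacityCalc {
    private final int typeWidth;      // element width in bytes, e.g. 4 for int
    private final int typeWidthLog2;  // precomputed log2(typeWidth)

    CapacityCalc(int typeWidth) {
        this.typeWidth = typeWidth;
        this.typeWidthLog2 = Integer.numberOfTrailingZeros(typeWidth);
    }

    // Before: an integer division the JIT cannot strength-reduce, because
    // typeWidth is a field rather than a compile-time constant.
    int capacityByDivision(long bufferSize) {
        return (int) (bufferSize / typeWidth);
    }

    // After: a cheap shift, valid whenever typeWidth is a power of two.
    int capacityByShift(long bufferSize) {
        return (int) (bufferSize >> typeWidthLog2);
    }
}
{code}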
> Here is the detailed analysis:
> Tests were run on a machine with an Intel 8700k. Compiled with JDK 8, and run 
> with the latest repo-provided JDK 14 on Ubuntu 20.04.
> {code}
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT 
> applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark Mode Cnt Score Error Units
> IntBenchmarks.setIntDirectly avgt 15 13.853 ± 0.058 us/op
> IntBenchmarks.setWithValueHolder avgt 15 15.045 ± 0.040 us/op
> IntBenchmarks.setWithWriter avgt 15 21.621 ± 0.197 us/op
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark Mode Cnt Score Error Units
> IntBenchmarks.setIntDirectly avgt 15 7.964 ± 0.030 us/op
> IntBenchmarks.setWithValueHolder avgt 15 9.145 ± 0.031 us/op
> IntBenchmarks.setWithWriter avgt 15 8.029 ± 0.051 us/op
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT 
> applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> Benchmark Mode Cnt Score Error Units
> IntBenchmarks.setIntDirectly avgt 15 9.563 ± 0.335 us/op
> IntBenchmarks.setWithValueHolder avgt 15 9.266 ± 0.064 us/op
> IntBenchmarks.setWithWriter avgt 15 18.806 ± 0.154 us/op
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 it

[jira] [Created] (ARROW-10035) [C++] Bump versions of vendored code

2020-09-17 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10035:
--

 Summary: [C++] Bump versions of vendored code
 Key: ARROW-10035
 URL: https://issues.apache.org/jira/browse/ARROW-10035
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 2.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-9977) [Rust] Add min/max for [Large]String

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reopened ARROW-9977:
---

Re-opening this because I had to revert the PR due to conflicts.

> [Rust] Add min/max for [Large]String
> 
>
> Key: ARROW-9977
> URL: https://issues.apache.org/jira/browse/ARROW-9977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Strings are ordered, and thus we can apply min/max to them as to other types.
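
A sketch of the kernel's semantics in plain Rust (not the actual arrow compute 
kernel): nulls are skipped, and {{Ord}} on {{&str}} supplies the lexicographic 
ordering that makes min/max well defined.

{code}
fn min_string<'a>(
    values: impl IntoIterator<Item = Option<&'a str>>,
) -> Option<&'a str> {
    values.into_iter().flatten().min()
}

fn main() {
    let col = vec![Some("banana"), None, Some("apple")];
    assert_eq!(min_string(col), Some("apple")); // None entries are ignored
}
{code}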



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10034) [Rust] Master build broken

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10034.

Resolution: Fixed

Issue resolved by pull request 8213
[https://github.com/apache/arrow/pull/8213]

> [Rust] Master build broken
> --
>
> Key: ARROW-10034
> URL: https://issues.apache.org/jira/browse/ARROW-10034
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I merged quite a few PRs today. There was a conflict and I need to revert one 
> of them. I am working on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10034) [Rust] Master build broken

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10034:
-

Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust] Master build broken
> --
>
> Key: ARROW-10034
> URL: https://issues.apache.org/jira/browse/ARROW-10034
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I merged quite a few PRs today. There was a conflict and I need to revert one 
> of them. I am working on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10034) [Rust] Master build broken

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10034:
-

Assignee: Apache Arrow JIRA Bot  (was: Andy Grove)

> [Rust] Master build broken
> --
>
> Key: ARROW-10034
> URL: https://issues.apache.org/jira/browse/ARROW-10034
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I merged quite a few PRs today. There was a conflict and I need to revert one 
> of them. I am working on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10034) [Rust] Master build broken

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10034:
---
Labels: pull-request-available  (was: )

> [Rust] Master build broken
> --
>
> Key: ARROW-10034
> URL: https://issues.apache.org/jira/browse/ARROW-10034
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I merged quite a few PRs today. There was a conflict and I need to revert one 
> of them. I am working on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10034) [Rust] Master build broken

2020-09-17 Thread Andy Grove (Jira)
Andy Grove created ARROW-10034:
--

 Summary: [Rust] Master build broken
 Key: ARROW-10034
 URL: https://issues.apache.org/jira/browse/ARROW-10034
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


I merged quite a few PRs today. There was a conflict and I need to revert one 
of them. I am working on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10001.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8186
[https://github.com/apache/arrow/pull/8186]

> [Rust] [DataFusion] Add developer guide to README
> -
>
> Key: ARROW-10001
> URL: https://issues.apache.org/jira/browse/ARROW-10001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9987) [Rust] [DataFusion] Improve docs of `Expr`.

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9987.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8181
[https://github.com/apache/arrow/pull/8181]

> [Rust] [DataFusion] Improve docs of `Expr`.
> ---
>
> Key: ARROW-9987
> URL: https://issues.apache.org/jira/browse/ARROW-9987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9977) [Rust] Add min/max for [Large]String

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9977.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8171
[https://github.com/apache/arrow/pull/8171]

> [Rust] Add min/max for [Large]String
> 
>
> Key: ARROW-9977
> URL: https://issues.apache.org/jira/browse/ARROW-9977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Strings are ordered, and thus we can apply min/max to them as to other types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10028) [Rust] Simplify macro def_numeric_from_vec

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10028.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8207
[https://github.com/apache/arrow/pull/8207]

> [Rust] Simplify macro def_numeric_from_vec
> --
>
> Key: ARROW-10028
> URL: https://issues.apache.org/jira/browse/ARROW-10028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently we need to pass too many parameters to it, even though they can be 
> inferred.
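
A self-contained sketch (not DataFusion's actual macro) of the simplification 
idea: move what used to be extra macro parameters into an associated type, so 
the macro takes only the array type and infers the rest.

{code}
struct Int32Array(Vec<i32>);

trait NumericArray {
    type Native;
    fn from_vec(v: Vec<Self::Native>) -> Self;
}

impl NumericArray for Int32Array {
    type Native = i32;
    fn from_vec(v: Vec<i32>) -> Self { Int32Array(v) }
}

// One macro parameter instead of several: the native type is inferred
// from the trait's associated type.
macro_rules! def_numeric_from_vec {
    ($array:ty) => {
        impl From<Vec<<$array as NumericArray>::Native>> for $array {
            fn from(v: Vec<<$array as NumericArray>::Native>) -> Self {
                <$array as NumericArray>::from_vec(v)
            }
        }
    };
}

def_numeric_from_vec!(Int32Array);
{code}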



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9990) [Rust] [DataFusion] NOT is not plannable

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9990.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8183
[https://github.com/apache/arrow/pull/8183]

> [Rust] [DataFusion] NOT is not plannable
> 
>
> Key: ARROW-9990
> URL: https://issues.apache.org/jira/browse/ARROW-9990
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have the physical operator, but it is not usable in logical planning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9971) [Rust] Speedup take

2020-09-17 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9971.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8170
[https://github.com/apache/arrow/pull/8170]

> [Rust] Speedup take
> ---
>
> Key: ARROW-9971
> URL: https://issues.apache.org/jira/browse/ARROW-9971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10033) ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False

2020-09-17 Thread Adam Hooper (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Hooper updated ARROW-10033:

Description: 
{{ArrowReaderProperties}} has a {{::arrow::io::AsyncContext async_context_;}} 
member. Its constructor creates a thread pool -- regardless of options.

As a caller, I expect {{!use_threads}} to prevent the creation of a thread 
pool. (Maybe there should be an exception if {{pre_buffer && !use_threads}}?)

Stack trace:


{noformat}
#0  arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at 
/src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
#1  0x008e4747 in arrow::internal::ThreadPool::Make (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
#2  0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252
#3  0x008a20ac in arrow::io::internal::MakeIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326
#4  0x008a21dd in arrow::io::internal::GetIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334
#5  0x008a064f in arrow::io::AsyncContext::AsyncContext (
this=0xea6bb0 
)
at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49
#6  0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties 
(
this=0xea6b60 
, 
use_threads=false)
at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579
#7  0x005e1b98 in parquet::default_arrow_reader_properties () at 
/src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53
#8  0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder 
(this=0x7fffb31f0c60)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930
#9  0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0 
, reader=0x7fffb31f0e08)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957
{noformat}
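
A minimal, self-contained sketch (not Arrow's actual code) of the behaviour a 
caller would expect: construct the I/O thread pool lazily, on first use, so 
that building reader properties with {{use_threads=false}} never spawns 
threads. All names below are stand-ins.

{code}
#include <memory>
#include <mutex>

struct ThreadPool {  // stand-in for arrow::internal::ThreadPool
  explicit ThreadPool(int threads) { /* would spawn `threads` workers */ }
};

class AsyncContext {
 public:
  // The pool is created on first request, not in the constructor.
  ThreadPool* pool() {
    std::call_once(once_, [this] { pool_ = std::make_unique<ThreadPool>(8); });
    return pool_.get();
  }

 private:
  std::once_flag once_;
  std::unique_ptr<ThreadPool> pool_;
};

int main() {
  AsyncContext ctx;  // cheap: no threads are created here
  (void)ctx;         // pool() is never called, so no pool is ever built
}
{code}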

  was:
`ArrowReaderProperties` has a `::arrow::io::AsyncContext async_context_;` 
member. Its ctor creates a thread pool.

Stack trace:

```
#0  arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at 
/src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
#1  0x008e4747 in arrow::internal::ThreadPool::Make (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
#2  0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252
#3  0x008a20ac in arrow::io::internal::MakeIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326
#4  0x008a21dd in arrow::io::internal::GetIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334
#5  0x008a064f in arrow::io::AsyncContext::AsyncContext (
this=0xea6bb0 
)
at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49
#6  0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties 
(
this=0xea6b60 
, 
use_threads=false)
at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579
#7  0x005e1b98 in parquet::default_arrow_reader_properties () at 
/src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53
#8  0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder 
(this=0x7fffb31f0c60)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930
#9  0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0 
, reader=0x7fffb31f0e08)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957
```

As a caller, I expect `use_threads=False` to prevent the creation of threads. 
(Maybe there should be an exception if `pre_buffer && !use_threads`?)


> ArrowReaderProperties creates thread pool, even when use_threads=False and 
> pre_buffer=False
> ---
>
> Key: ARROW-10033
> URL: https://issues.apache.org/jira/browse/ARROW-10033
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Adam Hooper
>Priority: Major
>
> {{ArrowReaderProperties}} has a {{::arrow::io::AsyncContext async_context_;}} 
> member. Its constructor creates a thread pool -- regardless of options.
> As a caller, I expect {{!use_threads}} to prevent the creation of a thread 
> pool. (Maybe there should be an exception if {{pre_buffer && !use_threads}}?)
> Stack trace:
> {noformat}
> #0  arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at 
> /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
> #1  0x008e4747 in arrow::internal::ThreadPool::Make (threads=8)
> at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
> #2  0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (th

[jira] [Created] (ARROW-10033) ArrowReaderProperties creates thread pool, even when use_threads=False and pre_buffer=False

2020-09-17 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-10033:
---

 Summary: ArrowReaderProperties creates thread pool, even when 
use_threads=False and pre_buffer=False
 Key: ARROW-10033
 URL: https://issues.apache.org/jira/browse/ARROW-10033
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.1
Reporter: Adam Hooper


`ArrowReaderProperties` has a `::arrow::io::AsyncContext async_context_;` 
member. Its ctor creates a thread pool.

Stack trace:

```
#0  arrow::internal::ThreadPool::ThreadPool (this=0x232fa90) at 
/src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:121
#1  0x008e4747 in arrow::internal::ThreadPool::Make (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:246
#2  0x008e48c9 in arrow::internal::ThreadPool::MakeEternal (threads=8)
at /src/apache-arrow-1.0.1/cpp/src/arrow/util/thread_pool.cc:252
#3  0x008a20ac in arrow::io::internal::MakeIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:326
#4  0x008a21dd in arrow::io::internal::GetIOThreadPool () at 
/src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:334
#5  0x008a064f in arrow::io::AsyncContext::AsyncContext (
this=0xea6bb0 
)
at /src/apache-arrow-1.0.1/cpp/src/arrow/io/interfaces.cc:49
#6  0x0048893e in parquet::ArrowReaderProperties::ArrowReaderProperties 
(
this=0xea6b60 
, 
use_threads=false)
at /src/apache-arrow-1.0.1/cpp/src/parquet/properties.h:579
#7  0x005e1b98 in parquet::default_arrow_reader_properties () at 
/src/apache-arrow-1.0.1/cpp/src/parquet/properties.cc:53
#8  0x00414843 in parquet::arrow::FileReaderBuilder::FileReaderBuilder 
(this=0x7fffb31f0c60)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:930
#9  0x00414b10 in parquet::arrow::OpenFile (file=..., pool=0xea6cf0 
, reader=0x7fffb31f0e08)
at /src/apache-arrow-1.0.1/cpp/src/parquet/arrow/reader.cc:957
```

As a caller, I expect `use_threads=False` to prevent the creation of threads. 
(Maybe there should be an exception if `pre_buffer && !use_threads`?)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache
 * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
 * Use conda manually to install gtest gflags ninja rapidjson grpc-cpp protobuf

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache
 * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
 * Use conda manually to install gtest gflags ninja rapidjson

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
>  * Prefer JOB=Build_Debug as otherwise it forces clcache
>  * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
>  * Use conda manually to install gtest gflags ninja rapidjson grpc-cpp 
> protobuf
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache
 * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
 * Use conda manually to install gtest gflags ninja rapidjson

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache
 * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
>  * Prefer JOB=Build_Debug as otherwise it forces clcache
>  * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
>  * Use conda manually to install gtest gflags ninja rapidjson
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache
 * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
>  * Prefer JOB=Build_Debug as otherwise it forces clcache
>  * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
 * Prefer JOB=Build_Debug as otherwise it forces clcache

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
>  * Prefer JOB=Build_Debug as otherwise it forces clcache
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags
 * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler). (You must restart the 
VM!)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
 Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
[https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)
Solution: run a PowerShell instance as administrator and run 
"Set-ExecutionPolicy -ExecutionPolicy Unrestricted"

  was:
"Replicating AppVeyor Builds" needs the following changes: 
https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
> Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva

2020-09-17 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197892#comment-17197892
 ] 

Maarten Breddels commented on ARROW-10023:
--

It's going to be in C++; I can push an initial version when I find the time, so 
you can take a look. I do a split into a list of strings, with a pattern 
separator or whitespace (ASCII and UTF-8), and still need to finish reverse 
UTF-8 whitespace. You want a version that splits and only takes the n-th part, 
right?
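
For reference, a minimal sketch of the "split part" semantics in plain C++ 
(not Gandiva's actual function registry API): split {{input}} on {{sep}} and 
return the 1-based {{index}}-th part, or an empty string when out of range.

{code}
#include <string>

std::string SplitPart(const std::string& input, const std::string& sep,
                      int index) {
  if (sep.empty() || index < 1) return "";
  std::size_t start = 0;
  for (int part = 1;; ++part) {
    std::size_t end = input.find(sep, start);
    if (part == index) {
      // The last part runs to the end of the string.
      return input.substr(
          start, end == std::string::npos ? std::string::npos : end - start);
    }
    if (end == std::string::npos) return "";  // fewer parts than requested
    start = end + sep.size();
  }
}
{code}

For example, SplitPart("a,b,c", ",", 2) returns "b".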

> [Gandiva][C++] Implementing Split part function in gandiva
> --
>
> Key: ARROW-10023
> URL: https://issues.apache.org/jira/browse/ARROW-10023
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3757) [R] R bindings for Flight RPC client

2020-09-17 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-3757.

Resolution: Fixed

Issue resolved by pull request 7875
[https://github.com/apache/arrow/pull/7875]

> [R] R bindings for Flight RPC client
> 
>
> Key: ARROW-3757
> URL: https://issues.apache.org/jira/browse/ARROW-3757
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, R
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva

2020-09-17 Thread Naman Udasi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197846#comment-17197846
 ] 

Naman Udasi edited comment on ARROW-10023 at 9/17/20, 5:14 PM:
---

[~maartenbreddels] Where will the split functions mentioned in ARROW-9991 be 
implemented? I think we could make them reusable, if possible.


was (Author: namanu):
[~maartenbreddels] Where will the split functions mentioned be implemented? I 
think we could make them reusable, if possible.

> [Gandiva][C++] Implementing Split part function in gandiva
> --
>
> Key: ARROW-10023
> URL: https://issues.apache.org/jira/browse/ARROW-10023
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva

2020-09-17 Thread Naman Udasi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197846#comment-17197846
 ] 

Naman Udasi commented on ARROW-10023:
-

[~maartenbreddels] Where will the split functions mentioned be implemented? If 
possible, I think we can make them reusable.

> [Gandiva][C++] Implementing Split part function in gandiva
> --
>
> Key: ARROW-10023
> URL: https://issues.apache.org/jira/browse/ARROW-10023
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10029) [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

2020-09-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10029:
-
Summary: [Python] Deadlock in the interaction of pyarrow FileSystem and 
ParquetDataset  (was: Deadlock in the interaction of pyarrow FileSystem and 
ParquetDataset)

> [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> -
>
> Key: ARROW-10029
> URL: https://issues.apache.org/jira/browse/ARROW-10029
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: David McGuire
>Priority: Major
> Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100% 
> {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on 
> {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the File System *directories* as a tree, one thread is required for 
> every non-leaf node, in order to avoid deadlock.
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> import pyarrow.filesystem as fs
> class LoggingLocalFileSystem(fs.LocalFileSystem):
> def walk(self, path):
> print(path)
> return super().walk(path)
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py 
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and 
> when omitting the {{filesystem}} argument.
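
A minimal sketch of the scheduling hazard described above (an illustration 
only, not pyarrow's implementation; the TREE and walk names are hypothetical): 
every non-leaf walk occupies a pool worker while it blocks on its children, so 
a pool with no worker left over for the leaf tasks can never make progress.

{code:python}
from concurrent.futures import ThreadPoolExecutor

# Parent -> children; paths absent from the dict are leaf directories.
TREE = {
    "dataset": ["dataset/foo=1"],
    "dataset/foo=1": ["dataset/foo=1/bar=2"],
    "dataset/foo=1/bar=2": ["dataset/foo=1/bar=2/baz=%d" % i for i in range(3)],
}

def walk(pool, path):
    children = TREE.get(path, [])
    if not children:
        return 1  # leaf: nothing further to schedule
    # Each non-leaf walk holds a worker for its whole duration...
    futures = [pool.submit(walk, pool, child) for child in children]
    # ...because it blocks here until every child task has finished.
    return 1 + sum(f.result() for f in futures)

# Four workers: three sit blocked in the chain of non-leaf walks while the
# fourth executes the leaf tasks, so the call completes (prints 6).
with ThreadPoolExecutor(max_workers=4) as pool:
    print(pool.submit(walk, pool, "dataset").result())

# With max_workers=3, all workers would be parked in f.result() and the
# queued leaf tasks could never run: the same indefinite hang as above.
{code}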



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8678) [C++][Parquet] Remove legacy arrow to level translation.

2020-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-8678.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8184
[https://github.com/apache/arrow/pull/8184]

> [C++][Parquet] Remove legacy arrow to level translation.
> 
>
> Key: ARROW-8678
> URL: https://issues.apache.org/jira/browse/ARROW-8678
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9636) [Python] Update documentation about 'LZO' compression in parquet.write_table

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9636:
--
Labels: beginner-friendly doc pull-request-available  (was: 
beginner-friendly doc)

> [Python] Update documentation about 'LZO' compression in parquet.write_table
> 
>
> Key: ARROW-9636
> URL: https://issues.apache.org/jira/browse/ARROW-9636
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Pierre
>Priority: Trivial
>  Labels: beginner-friendly, doc, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> When trying to use the 'LZO' codec in `pyarrow.parquet.write_table()` with 
> the code below, I get an error message indicating that 'LZO' is not 
> available. However, this codec is listed as available in the documentation at 
> [[https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html]].
>  
> Code
> {code:python}
> from pyarrow import parquet as pq
> pq.write_table(data, file, compression='LZO')
> {code}
>  
> Error message
> {code:bash}
>  File "pyarrow/_parquet.pyx", line 1374, in 
> pyarrow._parquet.ParquetWriter.write_table
> File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Codec type LZO not supported in Parquet format
> {code}
>  
> I would suggest correcting the documentation, or else making this codec 
> available.
> Thanks for your support.
> Bests,
>  
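
A minimal sketch of a workaround in the meantime (assuming only the OSError 
shown in the report above; the fallback codec chosen here is illustrative):

{code:python}
import pyarrow as pa
from pyarrow import parquet as pq

table = pa.table({"x": [1, 2, 3]})
try:
    pq.write_table(table, "/tmp/test_lzo.parquet", compression="LZO")
except OSError:
    # LZO is defined by the Parquet format but not implemented by the
    # C++ writer, so fall back to a codec that is built in.
    pq.write_table(table, "/tmp/test_lzo.parquet", compression="snappy")
{code}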



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10030:
---
Labels: pull-request-available  (was: )

> [Rust] Support fromIter and toIter
> --
>
> Key: ARROW-10030
> URL: https://issues.apache.org/jira/browse/ARROW-10030
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Proposal for comments: 
> [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]
> (dump of the document above)
> Rust Arrow supports two main computational models:
>  # Batch Operations, that leverage some form of vectorization
>  # Element-by-element operations, that emerge in more complex operations
> This document concerns element-by-element operations, that are common outside 
> of the library (and sometimes in the library).
> h2. Element-by-element operations
> These operations are programmatically written as:
>  # Downcast the array to its specific type
>  # Initialize buffers
>  # Iterate over indices and perform the operation, appending to the buffers 
> accordingly
>  # Create ArrayData with the required null bitmap, buffers, children, etc.
>  # return ArrayRef from ArrayData
>  
> We can split this process in 3 parts:
>  # Initialization (1 and 2)
>  # Iteration (3)
>  # Finalization (4 and 5)
> Currently, the API that we offer to our users is:
>  # as_any() to downcast the array based on its DataType
>  # Builders for all types, that users can initialize, matching the downcasted 
> array
>  # Iterate
>  ## Use for i in (0..array.len())
>  ## Use {{Array::value(i)}} and {{Array::is_valid(i)/is_null(i)}}
>  ## use builder.append_value(new_value) or builder.append_null()
>  # Finish the builder and wrap the result in an Arc
> This API has some issues:
>  # value(i) +is unsafe+, even though it is not marked as such
>  # builders are usually slow due to the checks that they need to perform
>  # The API is not intuitive
> h2. Proposal
> This proposal aims at improving this API in 2 specific ways:
>  * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
>  * Implement FromIterator (Item=T and Item=Option<T>)
> so that users can write:
> {code:java}
> // incoming array
> let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
> let array = Arc::new(array) as ArrayRef;
> let array = array.as_any().downcast_ref::<Int32Array>().unwrap();
> // to and from iter, with a +1
> let result: Int32Array = array
>     .iter()
>     .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
>     .collect();
> let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 
> assert_eq!(result, expected);
> {code}
>  
> This results in an API that is:
>  # efficient, as it is our responsibility to create `FromIterator` that are 
> efficient in populating the buffers/children etc. from an iterator
>  # Safe, as it does not allow segfaults
>  # Simple, as users do not need to worry about Builders, buffers, etc, only 
> native Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10032:
-
Description: 
"Replicating AppVeyor Builds" needs the following changes: 
https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds
 * The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)

  was:
* The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)


> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Priority: Major
>
> "Replicating AppVeyor Builds" needs the following changes: 
> https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-09-17 Thread David Li (Jira)
David Li created ARROW-10032:


 Summary: [Documentation] C++ Windows docs are out of date
 Key: ARROW-10032
 URL: https://issues.apache.org/jira/browse/ARROW-10032
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: David Li


* The recommended VM does not include the C++ compiler - we should link to the 
build tools and describe which of them needs installation
 * Boost: the b2 script now requires --with not -with flags

Even with this:
 * The developer prompt can't find cl.exe (the compiler)
 * The PowerShell prompt can't use conda (it complains a config file isn't 
signed)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10029) Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

2020-09-17 Thread David McGuire (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197784#comment-17197784
 ] 

David McGuire edited comment on ARROW-10029 at 9/17/20, 4:09 PM:
-

If it's not multi-threaded, then there won't be deadlock:

{code}
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))
{code}

Running:

{code}
$ python repro.py 
6
6
{code}


was (Author: dmcguire):
If it's not multi-threaded, then there won't be deadlock:

{code}
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))
{code}

> Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> 
>
> Key: ARROW-10029
> URL: https://issues.apache.org/jira/browse/ARROW-10029
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: David McGuire
>Priority: Major
> Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100% 
> {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on 
> {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the File System *directories* as a tree, one thread is required for 
> every non-leaf node, in order to avoid deadlock.
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> import pyarrow.filesystem as fs
> class LoggingLocalFileSystem(fs.LocalFileSystem):
> def walk(self, path):
> print(path)
> return super().walk(path)
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py 
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and 
> when omitting the {{filesystem}} argument.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10029) Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

2020-09-17 Thread David McGuire (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197784#comment-17197784
 ] 

David McGuire commented on ARROW-10029:
---

If it's not multi-threaded, then there won't be deadlock:

{code}
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
use_legacy_dataset=False)
print(len(dataset.pieces))
{code}

> Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> 
>
> Key: ARROW-10029
> URL: https://issues.apache.org/jira/browse/ARROW-10029
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: David McGuire
>Priority: Major
> Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100% 
> {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on 
> {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the File System *directories* as a tree, one thread is required for 
> every non-leaf node, in order to avoid deadlock.
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> import pyarrow.filesystem as fs
> class LoggingLocalFileSystem(fs.LocalFileSystem):
> def walk(self, path):
> print(path)
> return super().walk(path)
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, 
> validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py 
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and 
> when omitting the {{filesystem}} argument.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10031) Support Java benchmark in Ursabot

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10031:
-

Assignee: Apache Arrow JIRA Bot  (was: Kazuaki Ishizaki)

> Support Java benchmark in Ursabot
> -
>
> Key: ARROW-10031
> URL: https://issues.apache.org/jira/browse/ARROW-10031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: CI, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on [the 
> suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e],
>  Ursabot will support Java benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10031) Support Java benchmark in Ursabot

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10031:
-

Assignee: Kazuaki Ishizaki  (was: Apache Arrow JIRA Bot)

> Support Java benchmark in Ursabot
> -
>
> Key: ARROW-10031
> URL: https://issues.apache.org/jira/browse/ARROW-10031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: CI, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on [the 
> suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e],
>  Ursabot will support Java benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10031) Support Java benchmark in Ursabot

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10031:
---
Labels: pull-request-available  (was: )

> Support Java benchmark in Ursabot
> -
>
> Key: ARROW-10031
> URL: https://issues.apache.org/jira/browse/ARROW-10031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: CI, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on [the 
> suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e],
>  Ursabot will support Java benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10024) [C++][Parquet] Create nested reading benchmarks

2020-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-10024.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8203
[https://github.com/apache/arrow/pull/8203]

> [C++][Parquet] Create nested reading benchmarks
> ---
>
> Key: ARROW-10024
> URL: https://issues.apache.org/jira/browse/ARROW-10024
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Benchmarking, C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10026) [C++] Improve kernel performance on small batches

2020-09-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197739#comment-17197739
 ] 

Wes McKinney edited comment on ARROW-10026 at 9/17/20, 3:03 PM:


IMHO we should consider a slimmed-down data structure for the implementation of 
{{ExecBatch}} that does not use {{arrow::util::variant}}, considering that we 
will only ever have either {{ArrayData}} or {{Scalar}} as value types. The 
overhead of slicing {{ArrayData}} objects is also non-trivial.


was (Author: wesmckinn):
IMHO we should consider a slimmed-down data structure for {{ExecBatch}} that 
does not use {{arrow::util::variant}}, considering that we will only ever have 
either {{ArrayData}} or {{Scalar}} as value types. The overhead of slicing 
{{ArrayData}} objects is also non-trivial.

> [C++] Improve kernel performance on small batches
> -
>
> Key: ARROW-10026
> URL: https://issues.apache.org/jira/browse/ARROW-10026
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> It seems that invoking some kernels on smallish batches has quite an overhead:
> {code}
> ArrayArrayKernel/32768/100     2860 ns   2859 ns   245195  bytes_per_second=10.6727G/s items_per_second=2.86494G/s null_percent=1 size=32.768k
> ArrayArrayKernel/32768/0       2752 ns   2751 ns   249316  bytes_per_second=11.093G/s items_per_second=2.97775G/s null_percent=0 size=32.768k
> ArrayArrayKernel/524288/100   18633 ns  18630 ns    36548  bytes_per_second=26.2097G/s items_per_second=7.03561G/s null_percent=1 size=524.288k
> ArrayArrayKernel/524288/0     18260 ns  18257 ns    38245  bytes_per_second=26.7451G/s items_per_second=7.17933G/s null_percent=0 size=524.288k
> {code}
> We should investigate and try to lighten the overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10026) [C++] Improve kernel performance on small batches

2020-09-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197739#comment-17197739
 ] 

Wes McKinney commented on ARROW-10026:
--

IMHO we should consider a slimmed-down data structure for {{ExecBatch}} that 
does not use {{arrow::util::variant}}, considering that we will only ever have 
either {{ArrayData}} or {{Scalar}} as value types. The overhead of slicing 
{{ArrayData}} objects is also non-trivial.

> [C++] Improve kernel performance on small batches
> -
>
> Key: ARROW-10026
> URL: https://issues.apache.org/jira/browse/ARROW-10026
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> It seems that invoking some kernels on smallish batches has quite an overhead:
> {code}
> ArrayArrayKernel/32768/100     2860 ns   2859 ns   245195  bytes_per_second=10.6727G/s items_per_second=2.86494G/s null_percent=1 size=32.768k
> ArrayArrayKernel/32768/0       2752 ns   2751 ns   249316  bytes_per_second=11.093G/s items_per_second=2.97775G/s null_percent=0 size=32.768k
> ArrayArrayKernel/524288/100   18633 ns  18630 ns    36548  bytes_per_second=26.2097G/s items_per_second=7.03561G/s null_percent=1 size=524.288k
> ArrayArrayKernel/524288/0     18260 ns  18257 ns    38245  bytes_per_second=26.7451G/s items_per_second=7.17933G/s null_percent=0 size=524.288k
> {code}
> We should investigate and try to lighten the overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Troy Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197720#comment-17197720
 ] 

Troy Zimmerman commented on ARROW-10027:


You rock! Thank you for the fast turnaround.

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197703#comment-17197703
 ] 

Joris Van den Bossche commented on ARROW-10027:
---

Don't worry about the crash (I actually saw a crash when closing my python 
session afterwards). The malformed dataframe should be solved by fixing the 
filter bug, which is tackled by my PR https://github.com/apache/arrow/pull/8209

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Troy Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197685#comment-17197685
 ] 

Troy Zimmerman commented on ARROW-10027:


[~jorisvandenbossche] Thank you for the quick & detailed response.

I'll take a closer look at the core that is dumped to see if I can narrow down 
what's causing the crash since it just seems to be on my end.

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-10030:
--
Description: 
Proposal for comments: 
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch Operations, that leverage some form of vectorization
 # Element-by-element operations, that emerge in more complex operations

This document concerns element-by-element operations, that are common outside 
of the library (and sometimes in the library).
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # return ArrayRef from ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize, matching the downcasted 
array
 # Iterate
 ## Use for i in (0..array.len())
 ## Use {{Array::value(i)}} and {{Array::is_valid(i)/is_null(i)}}
 ## use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
 * Implement FromIterator (Item=T and Item=Option<T>)

so that users can write:
{code:java}
// incoming array
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
let array = Arc::new(array) as ArrayRef;
let array = array.as_any().downcast_ref::<Int32Array>().unwrap();

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 

assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # efficient, as it is our responsibility to create `FromIterator` that are 
efficient in populating the buffers/children etc. from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc, only 
native Rust.

  was:
Proposal for comments: 
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch Operations, that leverage some form of vectorization
 # Element-by-element operations, that emerge in more complex operations

This document concerns element-by-element operations, that are common outside 
of the library (and sometimes in the library).
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # return ArrayRef from ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize, matching the downcasted 
array
 # Iterate
 # Use for i in (0..array.len())
 # Use Array::value(i) and Array::is_valid(i)/is_null(i)
 # use builder.append_value(new_value) or builder.append_null()

 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
 * Implement FromIterator (Item=T and Item=Option<T>)

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();
let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # efficient, as it is our responsibility to create `FromIterator` that are 
efficient in populating the buffers/children etc. from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc, only 
native Rust.

[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-10030:
--
Description: 
Proposal for comments: 
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch Operations, that leverage some form of vectorization
 # Element-by-element operations, that emerge in more complex operations

This document concerns element-by-element operations, that are common outside 
of the library (and sometimes in the library).
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # return ArrayRef from ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize, matching the downcasted 
array
 # Iterate
 # Use for i in (0..array.len())
 # Use Array::value(i) and Array::is_valid(i)/is_null(i)
 # use builder.append_value(new_value) or builder.append_null()

 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
 * Implement FromIterator (Item=T and Item=Option<T>)

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();
let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # efficient, as it is our responsibility to create `FromIterator` that are 
efficient in populating the buffers/children etc. from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc, only 
native Rust.

  was:
Proposal for comments: 
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch Operations, that leverage some form of vectorization
 # Element-by-element operations, that emerge in more complex operations

This document concerns element-by-element operations, that are the most common 
operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # return ArrayRef from ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize, matching the downcasted 
array
 # Iterate
 # Use for i in (0..array.len())
 # Use Array::value(i) and Array::is_valid(i)/is_null(i)
 # use builder.append_value(new_value) or builder.append_null()

 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator (Iterator<Item=T> and Iterator<Item=Option<T>>)
 * Implement FromIterator (Item=T and Item=Option<T>)

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();
let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # efficient, as it is our responsibility to create `FromIterator` that are 
efficient in populating the buffers/children etc. from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc, only 
native Rust.


> [Rust] Support fromIter and toIter

[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10027:
-

Assignee: Joris Van den Bossche  (was: Apache Arrow JIRA Bot)

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10027:
-

Assignee: Apache Arrow JIRA Bot  (was: Joris Van den Bossche)

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10027:
---
Labels: pull-request-available  (was: )

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-10027:
-

Assignee: Joris Van den Bossche

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Assignee: Joris Van den Bossche
>Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197535#comment-17197535
 ] 

Joris Van den Bossche commented on ARROW-10027:
---

So it seems this is a bug not in the Dataset code directly, but in the filter 
operation: also when manually filtering a RecordBatch, it incorrectly returns a 
batch whose null column has not been filtered:

{code}
table = pa.Table.from_arrays(
arrays=[
pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
pa.array(["zero", "one", "two", "three", "four", "five", "six", 
"seven", "eight", "nine"]),
pa.array([None, None, None, None, None, None, None, None, None, None]),
],
names=["id", "name", "other"],
)

batch = table.to_batches()[0]
{code}

{code}
In [32]: batch
Out[32]: 
pyarrow.RecordBatch
id: int64
name: string
other: null

In [33]: batch.num_rows
Out[33]: 10

In [34]: filtered_batch = batch.filter(pa.array([True, False]*5))

In [35]: filtered_batch.num_rows
Out[35]: 5

In [36]: filtered_batch.column(2)
Out[36]: 
<pyarrow.lib.NullArray object at 0x...>
10 nulls

In [37]: len(filtered_batch.column(2))
Out[37]: 10
{code}


Directly filtering the array, the chunked array, or the Table does seem to work, 
though:

{code}
In [38]: filtered_table = table.filter(pa.array([True, False]*5))

In [39]: filtered_table.num_rows
Out[39]: 5

In [40]: filtered_table['other']
Out[40]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
5 nulls
]

In [41]: chunked_array = table['other']

In [42]: chunked_array
Out[42]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
10 nulls
]

In [43]: chunked_array.filter(pa.array([True, False]*5))
Out[43]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
5 nulls
]

In [44]: chunked_array.chunks[0].filter(pa.array([True, False]*5))
Out[44]: 
<pyarrow.lib.NullArray object at 0x...>
5 nulls

{code}
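
In other words, of the four paths above only the RecordBatch one mishandles the 
NullArray. A minimal sketch condensing this, reusing the {{table}} and {{batch}} 
objects defined above (the final assertion documents the buggy behaviour):

{code:python}
import pyarrow as pa

# One boolean mask pushed through all four filter paths
# (table/batch as defined above).
mask = pa.array([True, False] * 5)

assert len(table.filter(mask)['other']) == 5            # Table: filtered correctly
assert len(table['other'].filter(mask)) == 5            # ChunkedArray: correct
assert len(table['other'].chunks[0].filter(mask)) == 5  # NullArray: correct
assert len(batch.filter(mask).column(2)) == 10          # RecordBatch: still 10 (bug)
{code}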

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197516#comment-17197516
 ] 

Joris Van den Bossche commented on ARROW-10027:
---

Also selecting the null column from the filtered table indicates it still has 
10 elements:

{code}
In [9]: table['other']
Out[9]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
10 nulls
]
{code}

So it seems the null column doesn't get properly filtered (which, for a 
NullArray, means changing its length).
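
Until that is fixed, a possible workaround is to rebuild any null-typed column 
whose length disagrees with the filtered table's row count. A minimal sketch, 
assuming {{pa.nulls()}} (available in recent pyarrow); {{fix_null_columns}} is a 
hypothetical helper, not part of the library:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

def fix_null_columns(table):
    """Replace mis-sized null columns with ones of the correct length."""
    columns = []
    for i, field in enumerate(table.schema):
        col = table.column(i)
        if pa.types.is_null(field.type) and len(col) != table.num_rows:
            col = pa.nulls(table.num_rows)
        columns.append(col)
    return pa.Table.from_arrays(columns, schema=table.schema)

data = ds.dataset("/tmp/test.parquet")
filtered = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
fixed = fix_null_columns(filtered)
assert len(fixed['other']) == fixed.num_rows  # 3, matching the other columns
{code}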

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197516#comment-17197516
 ] 

Joris Van den Bossche edited comment on ARROW-10027 at 9/17/20, 8:53 AM:
-

Also selecting the null column from the filtered table indicates it still has 
10 elements:

{code}
In [9]: table['other']
Out[9]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
10 nulls
]
{code}

So it seems the null column doesn't get properly filtered (which, for a 
NullArray, means changing its length).


was (Author: jorisvandenbossche):
Also selecting the null column from the filtered table indicates it still has 
10 elements:

{code}
In [9]: table['other']
Out[9]: 
<pyarrow.lib.ChunkedArray object at 0x...>
[
10 nulls
]
{code}

so it seems the null column doesn't get propertly filtered (which means for a 
NullArray: change the length)

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

2020-09-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197515#comment-17197515
 ] 

Joris Van den Bossche commented on ARROW-10027:
---

[~tazimmerman] thanks for the report!

I don't see the crash in {{to_pandas}} (using master):

{code}
In [7]: table.to_pandas()
Out[7]: 
   id   name other
0   1one  None
1   4   four  None
2   7  seven  None
{code}

but I also see the wrong behaviour of {{to_pydict}}, so there is certainly 
something fishy going on.
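
For reference, a quick sanity check surfaces the same mismatch that 
{{to_pydict}} exposes; this is a sketch, and {{check_column_lengths}} is a 
hypothetical helper:

{code:python}
import pyarrow.dataset as ds

def check_column_lengths(table):
    """Assert that every column has exactly table.num_rows elements."""
    for name, col in zip(table.column_names, table.columns):
        assert len(col) == table.num_rows, (
            f"column {name!r} has {len(col)} elements, expected {table.num_rows}")

data = ds.dataset("/tmp/test.parquet")
filtered = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
check_column_lengths(filtered)  # raises for 'other' while the bug is present
{code}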

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> ---
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Troy Zimmerman
>Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...: arrays=[
>  ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", 
> "seven", "eight", "nine"]),
>  ...: pa.array([None, None, None, None, None, None, None, None, None, 
> None]),
>  ...: ],
>  ...: names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>id   name other
> 0   0   zero  None
> 1   1one  None
> 2   2two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10031) Support Java benchmark in Ursabot

2020-09-17 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-10031:


 Summary: Support Java benchmark in Ursabot
 Key: ARROW-10031
 URL: https://issues.apache.org/jira/browse/ARROW-10031
 Project: Apache Arrow
  Issue Type: New Feature
  Components: CI, Java
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki


Based on [the 
suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e],
 Ursabot will support Java benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9862) Throw an exception in UnsafeDirectLittleEndian on Big-Endian platform

2020-09-17 Thread Kazuaki Ishizaki (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reassigned ARROW-9862:
---

Assignee: Kazuaki Ishizaki

> Throw an exception in UnsafeDirectLittleEndian on Big-Endian platform
> -
>
> Key: ARROW-9862
> URL: https://issues.apache.org/jira/browse/ARROW-9862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current code intentionally throws an exception on a big-endian platform, 
> even though this class supports native endianness for the primitive data types 
> (up to 64-bit).
> {code:java}
> throw new IllegalStateException("Arrow only runs on LittleEndian systems.");
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9861) [Java] Failed Arrow Vector on big-endian platform

2020-09-17 Thread Kazuaki Ishizaki (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reassigned ARROW-9861:
---

Assignee: Kazuaki Ishizaki

> [Java] Failed Arrow Vector on big-endian platform
> -
>
> Key: ARROW-9861
> URL: https://issues.apache.org/jira/browse/ARROW-9861
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The following test failure occurs on a big-endian platform
> {code:java}
> mvn -B -Drat.skip=true 
> -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn
>  -Dflatc.download.skip=true -rf :arrow-vector test
> ...
> [INFO] Running org.apache.arrow.vector.TestDecimalVector
> [ERROR] Tests run: 9, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.008 
> s <<< FAILURE! - in org.apache.arrow.vector.TestDecimalVector
> [ERROR] setUsingArrowBufOfLEInts  Time elapsed: 0.001 s  <<< FAILURE!
> java.lang.AssertionError: expected:<705.32> but was:<-20791293.44>
>   at 
> org.apache.arrow.vector.TestDecimalVector.setUsingArrowBufOfLEInts(TestDecimalVector.java:295)
> [ERROR] setUsingArrowLongLEBytes  Time elapsed: 0.001 s  <<< FAILURE!
> java.lang.AssertionError: expected:<9223372036854775807> but was:<-129>
>   at 
> org.apache.arrow.vector.TestDecimalVector.setUsingArrowLongLEBytes(TestDecimalVector.java:322)
> [ERROR] testLongReadWrite  Time elapsed: 0.001 s  <<< FAILURE!
> java.lang.AssertionError: expected:<-2> but was:<-72057594037927937>
>   at 
> org.apache.arrow.vector.TestDecimalVector.testLongReadWrite(TestDecimalVector.java:176)
> ...
> [ERROR] Failures: 
> [ERROR]   TestDecimalVector.setUsingArrowBufOfLEInts:295 expected:<705.32> 
> but was:<-20791293.44>
> [ERROR]   TestDecimalVector.setUsingArrowLongLEBytes:322 
> expected:<9223372036854775807> but was:<-129>
> [ERROR]   TestDecimalVector.testLongReadWrite:176 expected:<-2> but 
> was:<-72057594037927937>
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-10030:
--
Component/s: Rust
Description: 
Proposal for comments: 
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch operations, which leverage some form of vectorization
 # Element-by-element operations, which emerge in more complex operations

This document concerns element-by-element operations, which are the most common 
operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # Return an ArrayRef from the ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize to match the downcasted array
 # Iterate:
 ## use for i in (0..array.len())
 ## use Array::value(i) and Array::is_valid(i)/is_null(i)
 ## use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator and Iterator over Item=Option<T>
 * Implement FromIterator with Item=Option<T>

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # Efficient, as it is our responsibility to make the FromIterator implementations populate the buffers/children efficiently from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust.

  was:
Proposal for comments: 
https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing

 

(dump of the proposal:)

Rust Arrow supports two main computational models:
 # Batch operations, which leverage some form of vectorization
 # Element-by-element operations, which emerge in more complex operations

This document concerns element-by-element operations, which are the most common 
operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # Return an ArrayRef from the ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize to match the downcasted array
 # Iterate:
 ## use for i in (0..array.len())
 ## use Array::value(i) and Array::is_valid(i)/is_null(i)
 ## use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator and Iterator over Item=Option<T>
 * Implement FromIterator with Item=Option<T>

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # Efficient, as it is our responsibility to make the FromIterator implementations populate the buffers/children efficiently from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust.


> [Rust] Support

[jira] [Created] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread Jorge (Jira)
Jorge created ARROW-10030:
-

 Summary: [Rust] Support fromIter and toIter
 Key: ARROW-10030
 URL: https://issues.apache.org/jira/browse/ARROW-10030
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge


Proposal for comments: 
https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing

 

(dump of the proposal:)

Rust Arrow supports two main computational models:
 # Batch operations, which leverage some form of vectorization
 # Element-by-element operations, which emerge in more complex operations

This document concerns element-by-element operations, which are the most common 
operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers 
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # Return an ArrayRef from the ArrayData

 

We can split this process in 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize to match the downcasted array
 # Iterate:
 ## use for i in (0..array.len())
 ## use Array::value(i) and Array::is_valid(i)/is_null(i)
 ## use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator and Iterator over Item=Option<T>
 * Implement FromIterator with Item=Option<T>

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # Efficient, as it is our responsibility to make the FromIterator implementations populate the buffers/children efficiently from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)