[jira] [Updated] (ARROW-8426) [Rust] [Parquet] Add support for writing dictionary types

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-8426:

Component/s: Rust

> [Rust] [Parquet] Add support for writing dictionary types
> -
>
> Key: ARROW-8426
> URL: https://issues.apache.org/jira/browse/ARROW-8426
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10162) [Rust] Support display of DictionaryArrays in pretty printing

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10162:
-
Component/s: Rust

> [Rust] Support display of DictionaryArrays in pretty printing
> -
>
> Key: ARROW-10162
> URL: https://issues.apache.org/jira/browse/ARROW-10162
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When I try to display a DictionaryArray's values, I get an "Unsupported {:?} type 
> for repl." error in rust/arrow/src/util/pretty.rs.
> This ticket tracks adding proper support for printing these types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10169) [Rust] Nulls should be rendered as "" rather than default value when pretty printing arrays

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10169:
-
Component/s: Rust

> [Rust] Nulls should be rendered as "" rather than default value when pretty 
> printing arrays
> ---
>
> Key: ARROW-10169
> URL: https://issues.apache.org/jira/browse/ARROW-10169
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Null values should be printed as "" when pretty printing. However, as of now, 
> null values in primitive arrays are rendered as the type's default value.
> For example:
> {code}
> fn test_pretty_format_batches() -> Result<()> {
>     // define a schema.
>     let schema = Arc::new(Schema::new(vec![
>         Field::new("a", DataType::Utf8, true),
>         Field::new("b", DataType::Int32, true),
>     ]));
>     // define data.
>     let batch = RecordBatch::try_new(
>         schema,
>         vec![
>             Arc::new(array::StringArray::from(vec![
>                 Some("a"), Some("b"), None, Some("d"),
>             ])),
>             Arc::new(array::Int32Array::from(vec![
>                 Some(1), None, Some(10), Some(100),
>             ])),
>         ],
>     )?;
>     println!("{}", pretty_format_batches(&[batch])?);
>     Ok(())
> }
> {code}
> Outputs:
> {code}
> +---+-----+
> | a | b   |
> +---+-----+
> | a | 1   |
> | b | 0   |
> |   | 10  |
> | d | 100 |
> +---+-----+
> {code}
> The second row of b should be '', not 0. The third row of a should also be 
> '' (which it is, though I think only by accident).
> Thanks to [~jhorstmann] for pointing this out on 
> https://github.com/apache/arrow/pull/8331#issuecomment-702964608



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10189) [C] C data interface example for i32 uses `l`, not `i`, in the format

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10189:
---
Labels: pull-request-available  (was: )

> [C] C data interface example for i32 uses `l`, not `i`, in the format
> -
>
> Key: ARROW-10189
> URL: https://issues.apache.org/jira/browse/ARROW-10189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The 
> [specification|https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings]
>  uses "i" to represent i32, but the 
> [example|https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-simple-int32-array]
>  uses "l" (i64).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10189) [C] C data interface example for i32 uses `l`, not `i`, in the format

2020-10-05 Thread Jira
Jorge Leitão created ARROW-10189:


 Summary: [C] C data interface example for i32 uses `l`, not `i`, 
in the format
 Key: ARROW-10189
 URL: https://issues.apache.org/jira/browse/ARROW-10189
 Project: Apache Arrow
  Issue Type: Bug
  Components: C
Reporter: Jorge Leitão
Assignee: Jorge Leitão


The 
[specification|https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings]
 uses "i" to represent i32, but the 
[example|https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-simple-int32-array]
 uses "l" (i64).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10139) [C++] Add support for building arrow_testing without building tests

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10139:
---
Labels: pull-request-available  (was: )

> [C++] Add support for building arrow_testing without building tests
> ---
>
> Key: ARROW-10139
> URL: https://issues.apache.org/jira/browse/ARROW-10139
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuri
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{ARROW_BUILD_TESTS}} installs the following arrow_testing related files 
> implicitly:
> {noformat}
> lib/cmake/arrow/ArrowTestingConfig.cmake
> lib/cmake/arrow/ArrowTestingConfigVersion.cmake
> lib/cmake/arrow/ArrowTestingTargets-%%CMAKE_BUILD_TYPE%%.cmake
> lib/cmake/arrow/ArrowTestingTargets.cmake
> lib/cmake/arrow/FindArrowTesting.cmake
> lib/libarrow_testing.so
> lib/libarrow_testing.so.100
> lib/libarrow_testing.so.100.1.0
> libdata/pkgconfig/arrow-testing.pc
> {noformat}
> If we add an {{ARROW_TESTING}} option or similar, users can enable this 
> explicitly.
> The original GitHub bug report: [https://github.com/apache/arrow/issues/8306]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10187:
---
Fix Version/s: (was: 2.0.0)

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected, and perhaps we can't really 
> support 32-bit?
>  
> {code:java}
> ---- array::array::tests::test_primitive_array_from_vec stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> ---- array::array::tests::test_primitive_array_from_vec_option stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> ---- array::null::tests::test_null_array stdout ----
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> ---- array::union::tests::test_dense_union_i32 stdout ----
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> ---- memory::tests::test_allocate stdout ----
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}
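
One plausible (unconfirmed) reading of these numbers: the asserted byte counts 
include pointer-width-dependent sizes, which come out smaller on a 32-bit 
target. A trivial Rust illustration:

{code}
fn main() {
    // Prints 8 on the 64-bit targets the expected constants were written
    // for, but 4 on 32-bit ARM, dragging derived sizes down with it.
    println!("usize is {} bytes", std::mem::size_of::<usize>());
}
{code}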



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208480#comment-17208480
 ] 

Andy Grove commented on ARROW-10187:


I was able to run the DataFusion examples though, despite these test failures.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Perhaps these failures are to be expected, and perhaps we can't really 
> support 32-bit?
>  
> {code:java}
> ---- array::array::tests::test_primitive_array_from_vec stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> ---- array::array::tests::test_primitive_array_from_vec_option stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> ---- array::null::tests::test_null_array stdout ----
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> ---- array::union::tests::test_dense_union_i32 stdout ----
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> ---- memory::tests::test_allocate stdout ----
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208479#comment-17208479
 ] 

Andy Grove commented on ARROW-10188:


Thanks [~jorgecarleitao] ... my mistake, I had set the PARQUET_TEST_DATA path 
relative to the wrong directory in the terminal window where I was running the 
client. The flight example works for me now.

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10041) [Rust] Possible to create LargeStringArray with DataType::Utf8

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10041.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8296
[https://github.com/apache/arrow/pull/8296]

> [Rust] Possible to create LargeStringArray with DataType::Utf8
> --
>
> Key: ARROW-10041
> URL: https://issues.apache.org/jira/browse/ARROW-10041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We don't perform enough checks on ArrayData when creating StringArray and 
> LargeStringArray. As they use different integer sizes for offsets, this can 
> create a problem where i32 offsets could be incorrectly reinterpreted as i64 
> offsets and vice versa.
> We should add checks that prevent this. The same might apply to List and 
> LargeList.
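
A minimal sketch of the kind of check being asked for (toy types, not the 
arrow crate's API):

{code}
// Toy stand-in for the crate's DataType, reduced to the two variants at issue.
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,      // StringArray: i32 offsets
    LargeUtf8, // LargeStringArray: i64 offsets
}

// Refuse to build a LargeStringArray from data declared as Utf8, so i32
// offsets can never be silently reinterpreted as i64 offsets.
fn validate_large_string(declared: &DataType) -> Result<(), String> {
    if *declared != DataType::LargeUtf8 {
        return Err(format!("expected LargeUtf8, got {:?}", declared));
    }
    Ok(())
}

fn main() {
    assert!(validate_large_string(&DataType::Utf8).is_err());
    assert!(validate_large_string(&DataType::LargeUtf8).is_ok());
}
{code}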



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10188:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208471#comment-17208471
 ] 

Jorge Leitão commented on ARROW-10188:
--

I was unable to reproduce the first one.

On one shell:

```
export PARQUET_TEST_DATA=../../cpp/submodules/parquet-testing/data
cargo run --example flight_server
```

On a second shell:

```
export PARQUET_TEST_DATA=../../cpp/submodules/parquet-testing/data
cargo run --example flight_client
```

This works as intended on my computer. [~andygrove], could you describe how 
you ran it on yours?

I will work on the other two.

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
> Fix For: 2.0.0
>
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-10188:


Assignee: Jorge Leitão

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
> Fix For: 2.0.0
>
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8205) [Rust] [DataFusion] DataFusion should enforce unique field names in a schema

2020-10-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-8205.
-
Resolution: Fixed

Issue resolved by pull request 8334
[https://github.com/apache/arrow/pull/8334]

> [Rust] [DataFusion] DataFusion should enforce unique field names in a schema
> 
>
> Key: ARROW-8205
> URL: https://issues.apache.org/jira/browse/ARROW-8205
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There does not seem to be any validation to avoid schemas being created with 
> duplicate field names. We should add this along with unit tests.
> This will require changing the signature of the constructors to try_new with 
> a Result return type.
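
A minimal sketch of the try_new-style validation described above (toy types, 
not the DataFusion code):

{code}
use std::collections::HashSet;

// Toy schema: just enough structure to show the duplicate-name check.
struct Schema {
    field_names: Vec<String>,
}

impl Schema {
    // Fallible constructor, as the issue suggests: duplicates become an Err.
    fn try_new(field_names: Vec<String>) -> Result<Self, String> {
        let mut seen = HashSet::new();
        for name in &field_names {
            if !seen.insert(name.as_str()) {
                return Err(format!("duplicate field name: {name}"));
            }
        }
        Ok(Schema { field_names })
    }
}

fn main() {
    let ok = Schema::try_new(vec!["a".into(), "b".into()]).unwrap();
    println!("{} fields", ok.field_names.len());
    assert!(Schema::try_new(vec!["a".into(), "a".into()]).is_err());
}
{code}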



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6972) [C#] Should support StructField arrays

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6972:
--

Assignee: Neal Richardson  (was: Prashanth Govindarajan)

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
> I notice ARROW-6870 addresses Dictionary arrays; however, these are not as 
> flexible as structs (for example, they cannot mix data types).
> The source does have a stub for StructArray, but there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6972) [C#] Should support StructField arrays

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6972:
--

Assignee: Prashanth Govindarajan  (was: Neal Richardson)

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Assignee: Prashanth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
> I notice ARROW-6870 addresses Dictionary arrays; however, these are not as 
> flexible as structs (for example, they cannot mix data types).
> The source does have a stub for StructArray, but there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10178:
---

Assignee: Bryan Cutler  (was: Neal Richardson)

> [CI] Fix spark master integration test build setup
> --
>
> Key: ARROW-10178
> URL: https://issues.apache.org/jira/browse/ARROW-10178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10178:
---

Assignee: Neal Richardson  (was: Krisztian Szucs)

> [CI] Fix spark master integration test build setup
> --
>
> Key: ARROW-10178
> URL: https://issues.apache.org/jira/browse/ARROW-10178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10181:
---
Labels: pull-request-available  (was: )

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Raspberry Pi still tends to use 32-bit operating systems, although there is a 
> beta 64-bit version of Raspbian. It would be nice to be able to at least 
> disable these tests when running on 32-bit. 
> {code:java}
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:421:25
>     |
> 421 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: `#[deny(overflowing_literals)]` on by default
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:422:29
>     |
> 422 |         assert_eq!(ceil(10, 100), 1);
>     |                             ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:423:25
>     |
> 423 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> {code}
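
One way to "disable these tests when running on 32-bit", sketched with an 
assumed ceil helper and an illustrative literal (the literals shown in the log 
above are truncated, so the value below is invented):

{code}
// Assumed stand-in for bit_util::ceil (ceiling division).
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

fn main() {
    assert_eq!(ceil(10, 3), 4); // fits on every target
    #[cfg(target_pointer_width = "64")]
    {
        // Illustrative literal above u32::MAX: compiled only where usize
        // is 64 bits, so 32-bit builds skip it instead of failing.
        assert_eq!(ceil(10_000_000_000, 10), 1_000_000_000);
    }
    println!("ok");
}
{code}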



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10188:
--

 Summary: [Rust] [DataFusion] Some examples are broken
 Key: ARROW-10188
 URL: https://issues.apache.org/jira/browse/ARROW-10188
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 2.0.0


The flight server example fails with "No such file or directory".

The dataframe example produces an empty result set.

The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10187:
--

 Summary: [Rust] Test failures on 32 bit ARM (Raspberry Pi)
 Key: ARROW-10187
 URL: https://issues.apache.org/jira/browse/ARROW-10187
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


Perhaps these failures are to be expected, and perhaps we can't really support 
32-bit?

 
{code:java}
---- array::array::tests::test_primitive_array_from_vec stdout ----
thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
'assertion failed: `(left == right)`
  left: `144`,
 right: `104`', arrow/src/array/array.rs:2383:9 
---- array::array::tests::test_primitive_array_from_vec_option stdout ----
thread 'array::array::tests::test_primitive_array_from_vec_option' panicked at 
'assertion failed: `(left == right)`
  left: `224`,
 right: `176`', arrow/src/array/array.rs:2409:9 
---- array::null::tests::test_null_array stdout ----
thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
`(left == right)`
  left: `64`,
 right: `32`', arrow/src/array/null.rs:134:9 
---- array::union::tests::test_dense_union_i32 stdout ----
thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
failed: `(left == right)`
  left: `1024`,
 right: `768`', arrow/src/array/union.rs:704:9 
---- memory::tests::test_allocate stdout ----
thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left == 
right)`
  left: `0`,
 right: `32`', arrow/src/memory.rs:243:13
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10186:
--

 Summary: [Rust] Tests fail when following instructions in README
 Key: ARROW-10186
 URL: https://issues.apache.org/jira/browse/ARROW-10186
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".

```bash
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data
```

If I change them to relative paths as follows then the tests pass:

```bash
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data
```
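
A hedged aside on why the first form breaks (illustrative Rust with invented 
logic; the actual test code is not shown here): the test binary resolves the 
environment variable against its own working directory, which need not be the 
directory where the variable was exported.

```rust
use std::{env, path::PathBuf};

fn main() {
    // Falls back to the README's relative value if the variable is unset.
    let data = env::var("ARROW_TEST_DATA").unwrap_or_else(|_| "../testing/data".into());
    // A relative value resolves against the *test binary's* working
    // directory, not the shell's, hence "no such file or directory".
    let resolved: PathBuf = env::current_dir().unwrap().join(&data);
    println!("test data resolves to {}", resolved.display());
}
```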

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Description: 
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data
{code}
If I change them to absolute paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data
{code}
 

  was:
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data
{code}
If I change them to relative paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data
{code}
 


> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data
> {code}
> If I change them to absolute paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Description: 
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data
{code}
If I change them to relative paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data
{code}
 

  was:
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".

```bash
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data
```

If I change them to relative paths as follows then the tests pass:

```bash
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data
```

 


> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data
> {code}
> If I change them to relative paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10178:
---
Labels: pull-request-available  (was: )

> [CI] Fix spark master integration test build setup
> --
>
> Key: ARROW-10178
> URL: https://issues.apache.org/jira/browse/ARROW-10178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9870) [R] Friendly interface for filesystems (S3)

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9870:
--
Labels: pull-request-available  (was: )

> [R] Friendly interface for filesystems (S3)
> ---
>
> Key: ARROW-9870
> URL: https://issues.apache.org/jira/browse/ARROW-9870
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Filesystem methods don't provide a human-friendly interface for basic 
> operations like ls, mkdir, etc. Since we provide access to S3 and potentially 
> other cloud storage, it would be nice to have simple methods for exploring it.
> Additional ideas:
> * S3Bucket class/constructor: it's basically a SubTreeFileSystem containing 
> an S3FS and a path, except that we can auto-detect a bucket's region.
> * Add a class like the FileLocator C++ struct, i.e. list(fs, path). It's 
> _also_ kinda like a SubTreeFileSystem, but with different methods and 
> intents. Aside from use in ls/mkdir/cp, it could be used in file 
> readers/writers instead of having an extra {{filesystem}} argument added 
> everywhere, e.g. {{fs$path("path/to/file")}}. See 
> https://github.com/apache/arrow/pull/8197#discussion_r494325934



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10014) [C++] TaskGroup::Finish should execute tasks

2020-10-05 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208391#comment-17208391
 ] 

Weston Pace commented on ARROW-10014:
-

I'm going to continue from the email discussion and investigation, and have 
added sub-tasks for my planned approach. It's a slightly different approach 
than the one laid out in the description (instead of Finish running tasks, a 
FinishAsync method will be added which just returns immediately and gets off 
the thread pool). If anyone wants me to open a new issue for my alternate 
approach instead of taking over this one, please let me know.

> [C++] TaskGroup::Finish should execute tasks
> 
>
> Key: ARROW-10014
> URL: https://issues.apache.org/jira/browse/ARROW-10014
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently ThreadedTaskGroup::Finish locks the current thread waiting for 
> worker threads to execute tasks. Instead it could pop tasks from the queue 
> and execute them, using the Finish-ing thread as a worker. This would enable 
> basic nested parallelism cases using TaskGroup::MakeSubGroup() without danger 
> of accumulating a thread deadlock.
> For example, in the case of reading multiple parquet files we would like to 
> parallelize both across files to read and across columns within each file. We 
> could support this basic nested parallelism by rewriting ParquetFileReader to 
> accept any TaskGroup across which to scatter its column reading tasks (rather 
> than instantiating its own ThreadPool based on a boolean flag). Then file 
> reading tasks could be scattered across a ThreadedTaskGroup, each of these 
> creating a subgroup which runs all column reading tasks.
> However, the above would currently deadlock when reading with {{(# files) * 
> (# columns) >= (# threads)}}, since every task of the root TaskGroup will be 
> locked by its subgroup's call to Finish. In order to use 
> TaskGroup::MakeSubGroup for basic nested parallelism, the Finish-ing thread 
> must perform work in addition to checking for group completion.
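
The C++ change itself is not shown in this thread, but the core idea can be 
sketched generically; below is a hedged Rust illustration (invented names) of 
a Finish that drains the queue on the calling thread:

{code}
use std::collections::VecDeque;
use std::sync::Mutex;

// Toy task group: a queue of pending tasks behind a lock.
struct TaskGroup {
    queue: Mutex<VecDeque<Box<dyn FnOnce() + Send>>>,
}

impl TaskGroup {
    fn new() -> Self {
        TaskGroup { queue: Mutex::new(VecDeque::new()) }
    }

    fn append(&self, task: impl FnOnce() + Send + 'static) {
        self.queue.lock().unwrap().push_back(Box::new(task));
    }

    // Instead of blocking until workers drain the queue, the finishing
    // thread pops and runs tasks itself, so a nested group never parks a
    // pool thread just to wait for completion.
    fn finish(&self) {
        loop {
            let task = self.queue.lock().unwrap().pop_front();
            match task {
                Some(task) => task(),
                None => break,
            }
        }
    }
}

fn main() {
    let group = TaskGroup::new();
    group.append(|| println!("read column 0"));
    group.append(|| println!("read column 1"));
    group.finish();
}
{code}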



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10185) Add FinishAsync to TaskGroup

2020-10-05 Thread Weston Pace (Jira)
Weston Pace created ARROW-10185:
---

 Summary: Add FinishAsync to TaskGroup
 Key: ARROW-10185
 URL: https://issues.apache.org/jira/browse/ARROW-10185
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Weston Pace


As an alternative to Finish, FinishAsync will return a future immediately.  
This can then be added to a parent task group, allowing for nested task groups. 
 Since the future is returned immediately, the child task group is not 
occupying a thread on the thread pool and there is no concern for thread 
starvation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10184) Allow futures to be added as tasks to a TaskGroup

2020-10-05 Thread Weston Pace (Jira)
Weston Pace created ARROW-10184:
---

 Summary: Allow futures to be added as tasks to a TaskGroup
 Key: ARROW-10184
 URL: https://issues.apache.org/jira/browse/ARROW-10184
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Weston Pace


Once basic continuation support is in, if we can add futures to a task group, 
and a task group completion can be expressed as a future, then we can use task 
groups in such a way that they will not deadlock.

This task focuses on the ability to add a future to a task group instead of 
spawning a new task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10183) Create an async ParallelForEach that runs on an iterator

2020-10-05 Thread Weston Pace (Jira)
Weston Pace created ARROW-10183:
---

 Summary: Create an async ParallelForEach that runs on an iterator
 Key: ARROW-10183
 URL: https://issues.apache.org/jira/browse/ARROW-10183
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Weston Pace


This method should take in an iterator and spawn N threads to pull items off 
the iterator and start working on them.  It should return a future which will 
complete when all N threads have finished and the iterator is exhausted.
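
A hedged Rust sketch of the described shape (invented signature; the issue 
targets the C++ library, and thread joins stand in for the returned future):

{code}
use std::sync::{Arc, Mutex};
use std::thread;

// Spawn n_threads workers that pull items off a shared iterator until it
// is exhausted, then wait for all of them to finish.
fn parallel_for_each<I, F>(iter: I, n_threads: usize, f: F)
where
    I: Iterator + Send + 'static,
    I::Item: Send,
    F: Fn(I::Item) + Send + Sync + 'static,
{
    let iter = Arc::new(Mutex::new(iter));
    let f = Arc::new(f);
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let iter = Arc::clone(&iter);
            let f = Arc::clone(&f);
            thread::spawn(move || loop {
                // Take the next item under the lock, then release the
                // lock before doing the work.
                let item = iter.lock().unwrap().next();
                match item {
                    Some(item) => f(item),
                    None => break, // iterator exhausted
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap(); // "completes" once every worker is done
    }
}

fn main() {
    parallel_for_each(0..8, 3, |i| println!("item {i}"));
}
{code}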



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10182) Add basic continuation support to futures

2020-10-05 Thread Weston Pace (Jira)
Weston Pace created ARROW-10182:
---

 Summary: Add basic continuation support to futures
 Key: ARROW-10182
 URL: https://issues.apache.org/jira/browse/ARROW-10182
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace
 Fix For: 3.0.0


Add support for Then, WhenAny, and WhenAll. This will allow for expressing 
dependencies between tasks and eliminate threads in the thread pool that are 
simply waiting for other tasks to complete.
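
A toy illustration of the continuation idea (invented types; the real work is 
on the Arrow C++ Future): Then registers work to run when the value arrives, 
so no pool thread has to block waiting for it.

{code}
// Toy future: callbacks registered via then() run when complete() delivers
// the value, instead of a thread sleeping until the value exists.
struct Future<T> {
    callbacks: Vec<Box<dyn FnOnce(&T)>>,
}

impl<T> Future<T> {
    fn new() -> Self {
        Future { callbacks: Vec::new() }
    }

    fn then(&mut self, f: impl FnOnce(&T) + 'static) {
        self.callbacks.push(Box::new(f));
    }

    fn complete(self, value: T) {
        for cb in self.callbacks {
            cb(&value);
        }
    }
}

fn main() {
    let mut f = Future::new();
    f.then(|v: &i32| println!("continuation saw {v}"));
    f.complete(42);
}
{code}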



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default

2020-10-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-10147.
--
Resolution: Fixed

Issue resolved by pull request 8314
[https://github.com/apache/arrow/pull/8314]

> [Python] Constructing pandas metadata fails if an Index name is not 
> JSON-serializable by default
> 
>
> Key: ARROW-10147
> URL: https://issues.apache.org/jira/browse/ARROW-10147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Diana Clarke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> originally reported in https://github.com/apache/arrow/issues/8270
> here's a minimal reproduction:
> {code}
> In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))  
>  
> In [25]: df = pd.DataFrame(index=idx) 
>  
> In [26]: pa.table(df) 
>  
> ---
> TypeError Traceback (most recent call last)
> <ipython-input-26> in <module>
> ----> 1 pa.table(df)
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
> ~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, 
> schema, preserve_index, nthreads, columns, safe)
> 604 pandas_metadata = construct_metadata(df, column_names, 
> index_columns,
> 605  index_descriptors, 
> preserve_index,
> --> 606  types)
> 607 metadata = deepcopy(schema.metadata) if schema.metadata else 
> dict()
> 608 metadata.update(pandas_metadata)
> ~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, 
> column_names, index_levels, index_descriptors, preserve_index, types)
> 243 'version': pa.__version__
> 244 },
> --> 245 'pandas_version': _pandas_api.version
> 246 }).encode('utf8')
> 247 }
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, 
> skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, 
> default, sort_keys, **kw)
> 229 cls is None and indent is None and separators is None and
> 230 default is None and not sort_keys and not kw):
> --> 231 return _default_encoder.encode(obj)
> 232 if cls is None:
> 233 cls = JSONEncoder
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
> 197 # exceptions aren't as detailed.  The list call should be 
> roughly
> 198 # equivalent to the PySequence_Fast that ''.join() would do.
> --> 199 chunks = self.iterencode(o, _one_shot=True)
> 200 if not isinstance(chunks, (list, tuple)):
> 201 chunks = list(chunks)
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, 
> o, _one_shot)
> 255 self.key_separator, self.item_separator, 
> self.sort_keys,
> 256 self.skipkeys, _one_shot)
> --> 257 return _iterencode(o, 0)
> 258 
> 259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
> 177 
> 178 """
> --> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
> 180 f'is not JSON serializable')
> 181 
> TypeError: Object of type int64 is not JSON serializable
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208367#comment-17208367
 ] 

Bryan Cutler commented on ARROW-10178:
--

I'll check it out

> [CI] Fix spark master integration test build setup
> --
>
> Key: ARROW-10178
> URL: https://issues.apache.org/jira/browse/ARROW-10178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10181:
--

Assignee: Andy Grove

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
>  
> {code:java}
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:421:25
>     |
> 421 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: `#[deny(overflowing_literals)]` on by default
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:422:29
>     |
> 422 |         assert_eq!(ceil(10, 100), 1);
>     |                             ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:423:25
>     |
> 423 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10181:
---
Description: 
Raspberry Pi still tends to use 32-bit operating systems, although there is a 
beta 64-bit version of Raspbian. It would be nice to be able to at least 
disable these tests when running on 32-bit. 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
    |
421 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: `#[deny(overflowing_literals)]` on by default
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
    |
422 |         assert_eq!(ceil(10, 100), 1);
    |                             ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
    |
423 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
{code}

  was:
 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
    |
421 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: `#[deny(overflowing_literals)]` on by default
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
    |
422 |         assert_eq!(ceil(10, 100), 1);
    |                             ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
    |
423 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
{code}


> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Raspberry Pi still tends to use 32-bit operating systems, although there is a 
> beta 64-bit version of Raspbian. It would be nice to be able to at least 
> disable these tests when running on 32-bit. 
> {code:java}
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:421:25
>     |
> 421 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: `#[deny(overflowing_literals)]` on by default
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:422:29
>     |
> 422 |         assert_eq!(ceil(10, 100), 1);
>     |                             ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:423:25
>     |
> 423 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10181:
---
Summary: [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)  (was: 
[Rust] Arrow tests fail to compile on Raspberry Pi (ARM))

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
>  
> {code:java}
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:421:25
>     |
> 421 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: `#[deny(overflowing_literals)]` on by default
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:422:29
>     |
> 422 |         assert_eq!(ceil(10, 100), 1);
>     |                             ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> error: literal out of range for `usize`
>    --> arrow/src/util/bit_util.rs:423:25
>     |
> 423 |         assert_eq!(ceil(100, 10), 10);
>     |                         ^^^
>     |
>     = note: the literal `100` does not fit into the type `usize`
>             whose range is `0..=4294967295`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (ARM)

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10181:
--

 Summary: [Rust] Arrow tests fail to compile on Raspberry Pi (ARM)
 Key: ARROW-10181
 URL: https://issues.apache.org/jira/browse/ARROW-10181
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
 Fix For: 2.0.0


 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
    |
421 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: `#[deny(overflowing_literals)]` on by default
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
    |
422 |         assert_eq!(ceil(10, 100), 1);
    |                             ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
    |
423 |         assert_eq!(ceil(100, 10), 10);
    |                         ^^^
    |
    = note: the literal `100` does not fit into the type `usize`
            whose range is `0..=4294967295`
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10068) [C++] Add bundled external project for aws-sdk-cpp

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10068.
-
Resolution: Fixed

Issue resolved by pull request 8304
[https://github.com/apache/arrow/pull/8304]

> [C++] Add bundled external project for aws-sdk-cpp
> --
>
> Key: ARROW-10068
> URL: https://issues.apache.org/jira/browse/ARROW-10068
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> Currently {{build_awssdk}} errors with a FIXME message. We should fix it. 
> aws-sdk-cpp is not widely available on package managers, and in some cases 
> (like Homebrew) its cmake config is broken. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10180) [C++][Doc] Update dependency management docs following aws-sdk-cpp addition

2020-10-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10180:
---

 Summary: [C++][Doc] Update dependency management docs following 
aws-sdk-cpp addition
 Key: ARROW-10180
 URL: https://issues.apache.org/jira/browse/ARROW-10180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Neal Richardson
Assignee: Kouhei Sutou
 Fix For: 2.0.0


https://arrow.apache.org/docs/developers/cpp/building.html#build-dependency-management
 needs updating after (especially) ARROW-10068: for example, aws-sdk-cpp can be 
"bundled", but it still has system dependencies that cannot be.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6883) [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in IPC stream writer class

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208309#comment-17208309
 ] 

Antoine Pitrou commented on ARROW-6883:
---

Presumably this can be controlled by a flag in {{IpcWriteOptions}}.

> [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in 
> IPC stream writer class
> -
>
> Key: ARROW-6883
> URL: https://issues.apache.org/jira/browse/ARROW-6883
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
>
> I didn't see other JIRA issues about this, but it is one significant matter 
> remaining for complete columnar format coverage in the C++ library.
> This functionality will flow through to the various bindings, so it would be 
> helpful to add unit tests asserting that things work correctly, e.g. in 
> Python, from an end-user perspective.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10121) [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream

2020-10-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-10121.
--
Resolution: Fixed

Issue resolved by pull request 8302
[https://github.com/apache/arrow/pull/8302]

> [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream
> --
>
> Key: ARROW-10121
> URL: https://issues.apache.org/jira/browse/ARROW-10121
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Failing test case (from dev@ 
> https://lists.apache.org/thread.html/r338942b4e9f9316b48e87aab41ac49c7ffedd45733d4a6349523b7eb%40%3Cdev.arrow.apache.org%3E)
> {code}
> import pyarrow as pa
> from io import BytesIO
> pa.__version__
> schema = pa.schema([pa.field('foo', pa.int32()),
>                     pa.field('bar', pa.dictionary(pa.int32(), pa.string()))])
> r1 = pa.record_batch(
>     [
>         [1, 2, 3, 4, 5],
>         pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
>     ],
>     schema
> )
> r1.validate()
> r2 = pa.record_batch(
>     [
>         [1, 2, 3, 4, 5],
>         pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
>     ],
>     schema
> )
> r2.validate()
> assert r1.column(1).dictionary != r2.column(1).dictionary
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, schema)
> writer.write(r1)
> writer.write(r2)
> serialized = BytesIO(sink.getvalue().to_pybytes())
> stream = pa.ipc.open_stream(serialized)
> deserialized = []
> while True:
>     try:
>         deserialized.append(stream.read_next_batch())
>     except StopIteration:
>         break
> assert deserialized[1][1].to_pylist() == r2[1].to_pylist()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10008) [Python] pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False

2020-10-05 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10008.
--
Resolution: Fixed

Issue resolved by pull request 8311
[https://github.com/apache/arrow/pull/8311]

> [Python] pyarrow.parquet.read_table fails with predicate pushdown on 
> categorical data with use_legacy_dataset=False
> ---
>
> Key: ARROW-10008
> URL: https://issues.apache.org/jira/browse/ARROW-10008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.17.1, 1.0.1
> Environment: Platform: 
> Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
>Reporter: Caleb Hattingh
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: categorical, category, dataset, filters, parquet, 
> predicate, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I apologise if this is a known issue; I looked both in this issue tracker and 
> on GitHub and I didn't find it.
> There seems to be a problem reading a dataset with predicate pushdown 
> (filters) on columns with categorical data. The problem only occurs with 
> `use_legacy_dataset=False` (if that's True, the filter has no effect unless 
> the column is a partition key).
> Reproducer:
> {code:python}
> import shutil
> import sys, platform
> from pathlib import Path
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # Settings
> CATEGORICAL_DTYPE = True
> USE_LEGACY_DATASET = False
> print('Platform:', platform.platform())
> print('Python version:', sys.version)
> print('Pandas version:', pd.__version__)
> print('pyarrow version:', pa.__version__)
> print('categorical enabled:', CATEGORICAL_DTYPE)
> print('use_legacy_dataset:', USE_LEGACY_DATASET)
> print()
> # Clean up test dataset if present
> path = Path('blah.parquet')
> if path.exists():
>     shutil.rmtree(str(path))
> # Simple data
> d = dict(col1=['a', 'b'], col2=[1, 2])
> # Either categorical or not
> if CATEGORICAL_DTYPE:
>     df = pd.DataFrame(data=d, dtype='category')
> else:
>     df = pd.DataFrame(data=d)
> # Write dataset
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, str(path))
> # Load dataset
> table = pq.read_table(
>     str(path),
>     filters=[('col1', '=', 'a')],
>     use_legacy_dataset=USE_LEGACY_DATASET,
> )
> df = table.to_pandas()
> print(df.dtypes)
> print(repr(df))
> {code}
>  Output:
> {code:java}
> $ python categorical_predicate_pushdown.py 
> Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
> categorical enabled: True
> use_legacy_dataset: False
> /arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: 
> Cannot compare scalars of differing type: dictionary<values=string, 
> indices=int32, ordered=0> vs string
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
> 

[jira] [Updated] (ARROW-10179) [Rust] Labeler is not labeling

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10179:
---
Labels: pull-request-available  (was: )

> [Rust] Labeler is not labeling
> --
>
> Key: ARROW-10179
> URL: https://issues.apache.org/jira/browse/ARROW-10179
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The labeler is not doing its job and erroring. There is a bug on its 
> declaration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10179) [Rust] Labeler is not labeling

2020-10-05 Thread Jira
Jorge Leitão created ARROW-10179:


 Summary: [Rust] Labeler is not labeling
 Key: ARROW-10179
 URL: https://issues.apache.org/jira/browse/ARROW-10179
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jorge Leitão
Assignee: Jorge Leitão


The labeler is not doing its job and erroring. There is a bug on its 
declaration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6972) [C#] Should support StructField arrays

2020-10-05 Thread Prashanth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208284#comment-17208284
 ] 

Prashanth Govindarajan commented on ARROW-6972:
---

Opened [https://github.com/apache/arrow/pull/8348/files] for this

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Assignee: Prashanth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
>  I notice ARROW-6870 addresses Dictionary arrays, however these are not as 
> flexible as structs (for example, they cannot mix data types).
> The source does have a stub for StructArray, however there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8735) [Rust] [Parquet] Parquet crate fails to compile on Arm architecture

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8735.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8338
[https://github.com/apache/arrow/pull/8338]

> [Rust] [Parquet] Parquet crate fails to compile on Arm architecture
> ---
>
> Key: ARROW-8735
> URL: https://issues.apache.org/jira/browse/ARROW-8735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm trying to compile the project in Raspbian, on a Raspberry Pi and the 
> build fails:
> {code:java}
> error[E0308]: mismatched types
>   --> /home/pi/git/arrow/rust/parquet/src/util/hash_util.rs:26:37
>    |
> 26 | fn hash_(data: &[u8], seed: u32) -> u32 {
>    |    -----                            ^^^ expected `u32`, found `()`
>    |    |
>    |    implicitly returns `()` as its body has no tail or `return` expression
>  {code}
> This method is only implemented for x86, x86_64 and aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3080:
--
Labels: pull-request-available  (was: )

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}
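> 
> A minimal repro of that failure mode (hypothetical; the exact error depends on 
> the pyarrow version):
> {code:python}
> import pyarrow as pa
> 
> # list<struct<x: int64>> inferred from nested Python objects
> arr = pa.array([[{"x": 1}], [{"x": 2}, {"x": 3}]])
> table = pa.table({"col": arr})
> 
> # converting the nested type back to Python objects is the path this
> # issue covers; historically this raised ArrowNotImplementedError
> df = table.to_pandas()
> {code}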



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6972) [C#] Should support StructField arrays

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6972:
--
Labels: pull-request-available  (was: )

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Assignee: Prashanth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
>  I notice ARROW-6870 addresses Dictionary arrays, however these are not as 
> flexible as structs (for example, they cannot mix data types).
> The source does have a stub for StructArray, however there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10164:
---
Labels: pull-request-available  (was: )

> [Rust] Add support for DictionaryArray types to cast kernels
> 
>
> Key: ARROW-10164
> URL: https://issues.apache.org/jira/browse/ARROW-10164
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my 
> use case is DictionaryArrays with a Utf8 dictionary). 
> There is prototype work on 
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary
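> 
> For reference, a minimal pyarrow sketch of the semantics the Rust kernels 
> would mirror (illustrative only; it assumes pyarrow's cast already unpacks 
> dictionary arrays):
> {code:python}
> import pyarrow as pa
> 
> arr = pa.array(["a", "b", "a", None])
> dict_arr = arr.dictionary_encode()     # utf8 -> dictionary<int32, utf8>
> unpacked = dict_arr.cast(pa.string())  # dictionary -> utf8, the cast at issue
> assert unpacked.equals(arr)
> {code}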



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10153) [Java] Adding values to VarCharVector beyond 2GB results in IndexOutOfBoundsException

2020-10-05 Thread Samarth Jain (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208238#comment-17208238
 ] 

Samarth Jain commented on ARROW-10153:
--

Ah! Thanks, [~emkornfi...@gmail.com]! Looks like this was recently added.

 

Are there any performance implications of using LargeVarCharVector by default? 
Alternatively, is there a way to detect that a regular VarCharVector has run 
out of capacity, so that we know to copy its contents over to a 
LargeVarCharVector?

[~bryanc], [~liyafan] - maybe one of you know? 

> [Java] Adding values to VarCharVector beyond 2GB results in 
> IndexOutOfBoundsException
> -
>
> Key: ARROW-10153
> URL: https://issues.apache.org/jira/browse/ARROW-10153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.0
>Reporter: Samarth Jain
>Priority: Major
>
> On executing the below test case, one can see that on adding the 2049th 
> string of size 1MB, it fails.  
> {code:java}
> int length = 1024 * 1024;
> StringBuilder sb = new StringBuilder(length);
> for (int i = 0; i < length; i++) {
>  sb.append("a");
> }
> byte[] str = sb.toString().getBytes();
> VarCharVector vector = new VarCharVector("v", new 
> RootAllocator(Long.MAX_VALUE));
> vector.allocateNew(3000);
> for (int i = 0; i < 3000; i++) {
>  vector.setSafe(i, str);
> }{code}
>  
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 
> -2147483648, length: 1048576 (expected: range(0, 2147483648))
>   at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>   at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:762)
>   at 
> org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1212)
>   at 
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1011)
> {code}
> Stepping through the code, 
>  
> [https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L425]
> returns the negative index `-2147483648`
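> 
> For what it's worth, the failing point lines up exactly with the signed 32-bit 
> offset limit; a quick arithmetic check (plain Python, just to illustrate):
> {code:python}
> # VarCharVector offsets are signed 32-bit ints; after 2048 appends of
> # 1 MiB each, the next write starts at exactly 2 GiB = 2**31
> offset_after_2048 = 2048 * 1024 * 1024
> assert offset_after_2048 == 2**31  # one past Integer.MAX_VALUE
> 
> # interpreted as a signed 32-bit int, that offset wraps to the index
> # reported in the IndexOutOfBoundsException above
> wrapped = (offset_after_2048 + 2**31) % 2**32 - 2**31
> assert wrapped == -2147483648
> {code}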



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10175:
---
Summary: [CI] Nightly hdfs integration test job fails  (was: [CI] Nightly 
hdfs integration test job crashes)

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10175:
---
Description: 
Two tests fail:
https://github.com/ursa-labs/crossbow/runs/1204680589

[removed bogus investigation]

  was:
This started failing July 16: 
https://github.com/ursa-labs/crossbow/runs/876346225

July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
to fix segfault on gcc 7.x in -O3 builds

July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
files with LZ4 codec


> [CI] Nightly hdfs integration test job crashes
> --
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208209#comment-17208209
 ] 

Antoine Pitrou commented on ARROW-10175:


Ok, the issue description is really misleading. Your link points to a crash 
which doesn't exist anymore.
Instead, the current builds don't crash, they fail two tests.
https://github.com/ursa-labs/crossbow/runs/1204680589

Please don't use "crash" when there's no crash, because it's misleading.

> [CI] Nightly hdfs integration test job crashes
> --
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> This started failing July 16: 
> https://github.com/ursa-labs/crossbow/runs/876346225
> July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
> to fix segfault on gcc 7.x in -O3 builds
> July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
> files with LZ4 codec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5338) [Format][Integration] Define how to test for delta dictionary support in the JSON integration test data format

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5338:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Format][Integration] Define how to test for delta dictionary support in the 
> JSON integration test data format
> --
>
> Key: ARROW-5338
> URL: https://issues.apache.org/jira/browse/ARROW-5338
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the integration test JSON format assumes that dictionaries remain 
> constant throughout the stream. It might be better to change the JSON format 
> to more closely mimic the IPC protocol (a sequence of messages tagged with 
> the message type)
> follow on to ARROW-3144



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208193#comment-17208193
 ] 

Antoine Pitrou commented on ARROW-10175:


I tried to reproduce and it didn't crash. Then I noticed it was using Python 
3.6, and CI was using Python 3.7 (er... why not), and then it failed building:
{code:java}
/arrow/ci/scripts/cpp_build.sh /arrow /build
{code}

So it looks like the uploaded images are not up to date.

> [CI] Nightly hdfs integration test job crashes
> --
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> This started failing July 16: 
> https://github.com/ursa-labs/crossbow/runs/876346225
> July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
> to fix segfault on gcc 7.x in -O3 builds
> July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
> files with LZ4 codec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10147:
---

Assignee: Diana Clarke

> [Python] Constructing pandas metadata fails if an Index name is not 
> JSON-serializable by default
> 
>
> Key: ARROW-10147
> URL: https://issues.apache.org/jira/browse/ARROW-10147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Diana Clarke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> originally reported in https://github.com/apache/arrow/issues/8270
> here's a minimal reproduction:
> {code}
> In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))  
>  
> In [25]: df = pd.DataFrame(index=idx) 
>  
> In [26]: pa.table(df) 
>  
> ---
> TypeError                                 Traceback (most recent call last)
> <ipython-input-26-...> in <module>
> > 1 pa.table(df)
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
> ~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, 
> schema, preserve_index, nthreads, columns, safe)
> 604 pandas_metadata = construct_metadata(df, column_names, 
> index_columns,
> 605  index_descriptors, 
> preserve_index,
> --> 606  types)
> 607 metadata = deepcopy(schema.metadata) if schema.metadata else 
> dict()
> 608 metadata.update(pandas_metadata)
> ~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, 
> column_names, index_levels, index_descriptors, preserve_index, types)
> 243 'version': pa.__version__
> 244 },
> --> 245 'pandas_version': _pandas_api.version
> 246 }).encode('utf8')
> 247 }
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, 
> skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, 
> default, sort_keys, **kw)
> 229 cls is None and indent is None and separators is None and
> 230 default is None and not sort_keys and not kw):
> --> 231 return _default_encoder.encode(obj)
> 232 if cls is None:
> 233 cls = JSONEncoder
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
> 197 # exceptions aren't as detailed.  The list call should be 
> roughly
> 198 # equivalent to the PySequence_Fast that ''.join() would do.
> --> 199 chunks = self.iterencode(o, _one_shot=True)
> 200 if not isinstance(chunks, (list, tuple)):
> 201 chunks = list(chunks)
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, 
> o, _one_shot)
> 255 self.key_separator, self.item_separator, 
> self.sort_keys,
> 256 self.skipkeys, _one_shot)
> --> 257 return _iterencode(o, 0)
> 258 
> 259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
> ~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
> 177 
> 178 """
> --> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
> 180 f'is not JSON serializable')
> 181 
> TypeError: Object of type int64 is not JSON serializable
> {code}
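> 
> A hypothetical workaround until the fix lands: coerce the numpy scalar index 
> name to a built-in int so that {{json.dumps}} can serialize it.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> 
> idx = pd.RangeIndex(0, 4, name=np.int64(6))
> df = pd.DataFrame(index=idx)
> 
> # rename the index with a plain Python int before converting
> df.index = df.index.rename(int(df.index.name))
> table = pa.table(df)  # no longer raises
> {code}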



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8394.

Resolution: Fixed

Issue resolved by pull request 8216
[https://github.com/apache/arrow/pull/8216]

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Assignee: Paul Taylor
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>   "compilerOptions": {
>     "target": "ES6",
>     "outDir": "dist",
>     "baseUrl": "src/"
>   },
>   "exclude": ["dist"],
>   "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208178#comment-17208178
 ] 

Krisztian Szucs commented on ARROW-10178:
-

cc [~bryanc]

> [CI] Fix spark master integration test build setup
> --
>
> Key: ARROW-10178
> URL: https://issues.apache.org/jira/browse/ARROW-10178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10178) [CI] Fix spark master integration test build setup

2020-10-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10178:
---

 Summary: [CI] Fix spark master integration test build setup
 Key: ARROW-10178
 URL: https://issues.apache.org/jira/browse/ARROW-10178
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Krisztian Szucs
 Fix For: 2.0.0


https://github.com/ursa-labs/crossbow/runs/1204690363



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10176) [CI] Nightly valgrind job fails

2020-10-05 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208174#comment-17208174
 ] 

Ben Kietzman commented on ARROW-10176:
--

https://github.com/ursa-labs/crossbow/runs/1204693039#step:6:3783 Looks like 
this originates from GTest's pretty printing utilities

> [CI] Nightly valgrind job fails
> ---
>
> Key: ARROW-10176
> URL: https://issues.apache.org/jira/browse/ARROW-10176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/ursa-labs/crossbow/runs/1204693039



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10177) [CI][Gandiva] Nightly gandiva-jar-xenial fails

2020-10-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10177:
---

 Summary: [CI][Gandiva] Nightly gandiva-jar-xenial fails
 Key: ARROW-10177
 URL: https://issues.apache.org/jira/browse/ARROW-10177
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva, Continuous Integration
Reporter: Neal Richardson
 Fix For: 2.0.0


The following tests FAILED:

 27 - gandiva-projector-test (Failed)

 42 - gandiva-projector-test-static (Failed)

https://travis-ci.org/github/ursa-labs/crossbow/builds/732659880



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10176) [CI] Nightly valgrind job fails

2020-10-05 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10176:


Assignee: Ben Kietzman

> [CI] Nightly valgrind job fails
> ---
>
> Key: ARROW-10176
> URL: https://issues.apache.org/jira/browse/ARROW-10176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/ursa-labs/crossbow/runs/1204693039



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10176) [CI] Nightly valgrind job fails

2020-10-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10176:
---

 Summary: [CI] Nightly valgrind job fails
 Key: ARROW-10176
 URL: https://issues.apache.org/jira/browse/ARROW-10176
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, CI
Reporter: Neal Richardson
 Fix For: 2.0.0


https://github.com/ursa-labs/crossbow/runs/1204693039



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208171#comment-17208171
 ] 

Antoine Pitrou commented on ARROW-10175:


I'll take a look.

> [CI] Nightly hdfs integration test job crashes
> --
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> This started failing July 16: 
> https://github.com/ursa-labs/crossbow/runs/876346225
> July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
> to fix segfault on gcc 7.x in -O3 builds
> July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
> files with LZ4 codec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10175:

Description: 
This started failing July 16: 
https://github.com/ursa-labs/crossbow/runs/876346225

July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
to fix segfault on gcc 7.x in -O3 builds

July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
files with LZ4 codec

  was:This started failing July 16: 
https://github.com/ursa-labs/crossbow/runs/876346225


> [CI] Nightly hdfs integration test job crashes
> --
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> This started failing July 16: 
> https://github.com/ursa-labs/crossbow/runs/876346225
> July 16: HEAD is now at 04d25fb75 ARROW-9500: [C++] Do not use std::to_string 
> to fix segfault on gcc 7.x in -O3 builds
> July 15: HEAD is now at 3586292d6 ARROW-9424: [C++][Parquet] Disable writing 
> files with LZ4 codec



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208167#comment-17208167
 ] 

Antoine Pitrou commented on ARROW-9974:
---

The error message ("OSError: Out of memory: malloc of size 131072 failed") 
tells us that the failure is returned by the glibc memory allocator, not by the 
jemalloc allocator which is used by Arrow for array data. Also, the failed 
allocation is tiny (128 kB). This hints at a possible heap fragmentation 
problem.

I recommend trying the glibc malloc tunables, especially the 
{{MALLOC_MMAP_THRESHOLD_}} environment variable (note the trailing underscore), 
for example {{MALLOC_MMAP_THRESHOLD_=65536}}. See 
[https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html]
 for reference.
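
A hypothetical wrapper (the script name is made up) showing one way to apply 
it; glibc only reads the tunable at process startup, so it has to be in the 
environment before Python launches:
{code:python}
import os
import subprocess

# MALLOC_MMAP_THRESHOLD_ is read by glibc when the process starts, so set
# it in the child's environment rather than from inside the workload
env = dict(os.environ, MALLOC_MMAP_THRESHOLD_="65536")
subprocess.run(["python", "read_parquet_files.py"], env=env, check=True)
{code}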

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Assignee: Ben Kietzman
>Priority: Critical
>  Labels: dataset
> Fix For: 2.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated pyarrow from 0.13.0 
> to the latest version, 1.0.1, and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
> # create a big dataframe
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> def read_works():
> # below code works to read
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> def read_errors():
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10175) [CI] Nightly hdfs integration test job crashes

2020-10-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10175:
---

 Summary: [CI] Nightly hdfs integration test job crashes
 Key: ARROW-10175
 URL: https://issues.apache.org/jira/browse/ARROW-10175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Neal Richardson
 Fix For: 2.0.0


This started failing July 16: 
https://github.com/ursa-labs/crossbow/runs/876346225



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9228) [Python][CI] Always run pytest verbosely

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208162#comment-17208162
 ] 

Antoine Pitrou commented on ARROW-9228:
---

Can we close this as won't fix?

> [Python][CI] Always run pytest verbosely
> 
>
> Key: ARROW-9228
> URL: https://issues.apache.org/jira/browse/ARROW-9228
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Python
>Affects Versions: 0.17.1
>Reporter: Ben Kietzman
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> running pytest -v everywhere will ensure that CI logs are maximally 
> informative



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9006) [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9006:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo
> 
>
> Key: ARROW-9006
> URL: https://issues.apache.org/jira/browse/ARROW-9006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> We should not maintain distinct (and possibly differently behaving) 
> implementations of elementwise array casting and scalar casting. The new 
> kernels framework makes it relatively easy to generate kernels that can 
> process arrays or scalars. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10093) [R] Add ability to opt-out of int64 -> int demotion

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10093:
---

Assignee: Romain Francois  (was: Neal Richardson)

> [R] Add ability to opt-out of int64 -> int demotion
> ---
>
> Key: ARROW-10093
> URL: https://issues.apache.org/jira/browse/ARROW-10093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Kyle Kavanagh
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, if arrow detects that every value in an int64 column can fit in a 
> 32-bit int, it will downcast the column and set the type to integer instead of 
> integer64.  Not having a mechanism to disable this optimization makes it 
> tricky if you have many parallel processes (think HPC use case) performing 
> the same calculation but potentially outputting different result values, some 
> being >2^32 and others not.  When you go to collect the resulting feather 
> files from the parallel processes, the types across the files may not line up.
> Feature request is to provide an option to disable this demotion and maintain 
> the source column type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10093) [R] Add ability to opt-out of int64 -> int demotion

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10093:
---

Assignee: Neal Richardson

> [R] Add ability to opt-out of int64 -> int demotion
> ---
>
> Key: ARROW-10093
> URL: https://issues.apache.org/jira/browse/ARROW-10093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Kyle Kavanagh
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, if arrow detects that every value in an int64 column can fit in a 
> 32-bit int, it will downcast the column and set the type to integer instead of 
> integer64.  Not having a mechanism to disable this optimization makes it 
> tricky if you have many parallel processes (think HPC use case) performing 
> the same calculation but potentially outputting different result values, some 
> being >2^32 and others not.  When you go to collect the resulting feather 
> files from the parallel processes, the types across the files may not line up.
> Feature request is to provide an option to disable this demotion and maintain 
> the source column type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9044) [Go][Packaging] Revisit the license file attachment to the go packages

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9044:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Go][Packaging] Revisit the license file attachment to the go packages
> --
>
> Key: ARROW-9044
> URL: https://issues.apache.org/jira/browse/ARROW-9044
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go, Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Minor
> Fix For: 3.0.0
>
>
> As per https://github.com/apache/arrow/pull/7355#issuecomment-639560475
> A nicer solution would be to rename the top level LICENSE.txt to LICENSE, so 
> we wouldn't need to maintain another copy of it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7853) [CI][Packaging] Add nightly test that pip-installs nightly wheels

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7853:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [CI][Packaging] Add nightly test that pip-installs nightly wheels
> -
>
> Key: ARROW-7853
> URL: https://issues.apache.org/jira/browse/ARROW-7853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration, Packaging, Python
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> This would catch issues with wheels that we only encountered during release 
> verification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9941) [Python] Better string representation for extension types

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9941:
-

Assignee: Diana Clarke

> [Python] Better string representation for extension types
> -
>
> Key: ARROW-9941
> URL: https://issues.apache.org/jira/browse/ARROW-9941
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Diana Clarke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When one defines an extension type in Python (by subclassing 
> {{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed 
> schema looks like this:
> {code}
> pyarrow.Table
> a: extension
> b: extension
> {code}
> ... which isn't very informative. PyExtensionType could perhaps override 
> ToString() and call {{str}} on the type instance.
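> 
> A minimal sketch of the kind of type involved (the UuidType here is 
> hypothetical, just to illustrate; {{__reduce__}} is required so 
> PyExtensionType instances are picklable):
> {code:python}
> import pyarrow as pa
> 
> class UuidType(pa.PyExtensionType):
>     def __init__(self):
>         super().__init__(pa.binary(16))  # storage type
> 
>     def __reduce__(self):
>         return UuidType, ()
> 
>     def __str__(self):
>         return "extension<uuid>"
> 
> schema = pa.schema([pa.field("a", UuidType())])
> print(schema)  # the type currently renders as just "extension"
> {code}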



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8999:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" 
> build
> 
>
> Key: ARROW-8999
> URL: https://issues.apache.org/jira/browse/ARROW-8999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> I've been seeing this segfault periodically the last week, does anyone have 
> an idea what might be wrong?
> https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6459:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Minor
> Fix For: 3.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment then it 
> confuses CMake like
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9435) [CI] Push docker images from nightly builds

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9435:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [CI] Push docker images from nightly builds
> ---
>
> Key: ARROW-9435
> URL: https://issues.apache.org/jira/browse/ARROW-9435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> The nightly builds test more docker configurations than the github actions 
> jobs, but those images are not pushed to the apache dockerhub account. 
> We should push the images from the nightly builds as well, by setting up 
> credentials of our own.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9941) [Python] Better string representation for extension types

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9941.
---
Resolution: Fixed

Issue resolved by pull request 8312
[https://github.com/apache/arrow/pull/8312]

> [Python] Better string representation for extension types
> -
>
> Key: ARROW-9941
> URL: https://issues.apache.org/jira/browse/ARROW-9941
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> When one defines an extension type in Python (by subclassing 
> {{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed 
> schema looks like this:
> {code}
> pyarrow.Table
> a: extension
> b: extension
> {code}
> ... which isn't very informative. PyExtensionType could perhaps override 
> ToString() and call {{str}} on the type instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9228) [Python][CI] Always run pytest verbosely

2020-10-05 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9228:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python][CI] Always run pytest verbosely
> 
>
> Key: ARROW-9228
> URL: https://issues.apache.org/jira/browse/ARROW-9228
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Python
>Affects Versions: 0.17.1
>Reporter: Ben Kietzman
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> running pytest -v everywhere will ensure that CI logs are maximally 
> informative



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He closed ARROW-10152.
-

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Assignee: Uwe Korn
>Priority: Minor
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn updated ARROW-10152:
-
Priority: Minor  (was: Blocker)

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Priority: Minor
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved ARROW-10152.
--
  Assignee: Uwe Korn
Resolution: Cannot Reproduce

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Assignee: Uwe Korn
>Priority: Minor
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208154#comment-17208154
 ] 

Pac A. He edited comment on ARROW-10152 at 10/5/20, 4:03 PM:
-

There is nothing wrong with `environment.yml`. The issue was fixed on Friday 
by someone via a new release of pyarrow at 
[https://anaconda.org/conda-forge/pyarrow/files]. It now works.

The default `anaconda` channel is on v0.15.1, so there is no conflict.


was (Author: apacman):
There is nothing wrong with `environment.yml`. The issue was fixed on Friday 
by someone via a new release of pyarrow at 
[https://anaconda.org/conda-forge/pyarrow/files]. It now works.

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Assignee: Uwe Korn
>Priority: Minor
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208154#comment-17208154
 ] 

Pac A. He commented on ARROW-10152:
---

There is nothing wrong with `environment.yml`. The issue was fixed on Friday 
by someone via a new release of pyarrow at 
[https://anaconda.org/conda-forge/pyarrow/files]. It now works.

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Priority: Blocker
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208151#comment-17208151
 ] 

Uwe Korn commented on ARROW-10152:
--

Note that the environment.yml is faulty here. If you use conda-forge, you 
should give it a higher priority than defaults, i.e. list it before defaults 
under channels.

> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Priority: Blocker
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<stdin>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10058) [C++] Investigate performance of LevelsToBitmap without BMI2

2020-10-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10058:
--

Assignee: Antoine Pitrou

> [C++] Investigate performance of LevelsToBitmap without BMI2
> 
>
> Key: ARROW-10058
> URL: https://issues.apache.org/jira/browse/ARROW-10058
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Attachments: opt-level-conv.diff
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently, when Parquet nested data involves repetition levels, converting 
> the levels to a bitmap goes through a slow scalar path unless the BMI2 
> instruction set is available and efficient (in which case the PEXT 
> instruction processes 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup 
> table, making it possible to process 5-6 levels at once.
> (also, it would be good to add nested reading benchmarks for non-trivial 
> nesting; currently we only benchmark one-level struct and one-level list)
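> A rough illustration of that lookup-table idea (Python here purely for 
> brevity; the real code would be C++, and none of these names exist in 
> Arrow):
> {code:python}
> # Software fallback for PEXT: gather the bits of `value` selected by `mask`.
> def pext_soft(value: int, mask: int) -> int:
>     result, out_bit = 0, 0
>     while mask:
>         lowest = mask & -mask        # isolate the lowest set mask bit
>         if value & lowest:
>             result |= 1 << out_bit
>         out_bit += 1
>         mask &= mask - 1             # clear that mask bit
>     return result
> 
> WIDTH = 6
> # 64 x 64 = 4096 entries, small enough to stay cache-resident; one table
> # probe then compacts up to 6 levels' worth of bits at a time.
> PEXT_TABLE = [[pext_soft(v, m) for v in range(1 << WIDTH)]
>               for m in range(1 << WIDTH)]
> 
> # A probe replaces the bit-by-bit loop above:
> assert PEXT_TABLE[0b101010][0b111111] == 0b111 == pext_soft(0b111111, 0b101010)
> {code}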



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10152) "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"

2020-10-05 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-10152:
--
Description: 
I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
{quote}File "<string>", line 1, in <module>
 File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
line 62, in <module>
 import pyarrow.lib as _lib
 ImportError: liborc.so: cannot open shared object file: No such file or 
directory
{quote}

To reproduce, use:

Dockerfile:
{code:java}
FROM continuumio/miniconda3:latest
COPY environment.yml .
RUN conda install -n base -c defaults conda=4.*
RUN conda env create -n condaenv  # Installs environment.yml
RUN conda list -n condaenv  # Just for comparison
ENV PATH /opt/conda/envs/condaenv/bin:$PATH
RUN python -c "import pyarrow"
{code}
environment.yml:
{code:java}
name: condaenv
channels:
  - defaults
  - conda-forge
  - conda-forge/label/rc
dependencies:
  - pyarrow==1.0.1{code}

  was:
I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:

{quote}
  File "", line 1, in 
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
line 62, in 
import pyarrow.lib as _lib
ImportError: liborc.so: cannot open shared object file: No such file or 
directory
{quote}



To reproduce, use:

Dockerfile:
{code}
FROM continuumio/miniconda3:latest
COPY environment.yml .
RUN conda install -n base -c defaults conda=4.* && \
conda env create -n condaenv  # Installs environment.yml
ENV PATH /opt/conda/envs/condaenv/bin:$PATH
RUN python -c "import pyarrow"
{code}

environment.yml:
{code}
name: condaenv
channels:
  - defaults
  - conda-forge
  - conda-forge/label/rc
dependencies:
  - pyarrow==1.0.1
{code}



> "ImportError: liborc.so" with miniconda pyarrow=1.0.1 when "import pyarrow"
> ---
>
> Key: ARROW-10152
> URL: https://issues.apache.org/jira/browse/ARROW-10152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Pac A. He
>Priority: Blocker
>
> I cannot run "{{import pyarrow}}" with {{pyarrow=1.0.1}} in dockerized 
> miniconda. It works fine with {{pyarrow=1.0.0}} though. The error is:
> {quote}File "<string>", line 1, in <module>
>  File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/__init__.py", 
> line 62, in <module>
>  import pyarrow.lib as _lib
>  ImportError: liborc.so: cannot open shared object file: No such file or 
> directory
> {quote}
> 
> To reproduce, use:
> Dockerfile:
> {code:java}
> FROM continuumio/miniconda3:latest
> COPY environment.yml .
> RUN conda install -n base -c defaults conda=4.*
> RUN conda env create -n condaenv  # Installs environment.yml
> RUN conda list -n condaenv  # Just for comparison
> ENV PATH /opt/conda/envs/condaenv/bin:$PATH
> RUN python -c "import pyarrow"
> {code}
> environment.yml:
> {code:java}
> name: condaenv
> channels:
>   - defaults
>   - conda-forge
>   - conda-forge/label/rc
> dependencies:
>   - pyarrow==1.0.1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-10-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208147#comment-17208147
 ] 

Antoine Pitrou commented on ARROW-9943:
---

Well... ideally we start the release process by the end of the week, so I don't 
think that's workable.

> [C++] Arrow metadata not applied recursively when reading Parquet file
> --
>
> Key: ARROW-9943
> URL: https://issues.apache.org/jira/browse/ARROW-9943
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is 
> only applied for the top-level node of each schema field. Nested metadata 
> (such as dicts-inside-lists, etc.) will not be applied.
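> A hedged repro sketch of the consequence (the file name is made up; the 
> writer may densify the dictionary on write, which is exactly why the read 
> path has to restore it from the stored Arrow schema):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> # A dictionary array nested below the top-level node, inside a list
> values = pa.array(["a", "b", "a", "c"]).dictionary_encode()
> col = pa.ListArray.from_arrays(pa.array([0, 2, 4], type=pa.int32()), values)
> pq.write_table(pa.table({"col": col}), "nested_dict.parquet")
> 
> # With only top-level metadata applied, the inner field would come back as
> # plain string rather than dictionary<values=string, ...>
> print(pq.read_table("nested_dict.parquet").schema)
> {code}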



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9147) [C++][Dataset] Support null -> other type promotion in Dataset scanning

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9147:
--
Labels: dataset dataset-dask-integration pull-request-available  (was: 
dataset dataset-dask-integration)

> [C++][Dataset] Support null -> other type promotion in Dataset scanning
> ---
>
> Key: ARROW-9147
> URL: https://issues.apache.org/jira/browse/ARROW-9147
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Regarding schema evolution / normalization, we support inserting nulls 
> for a missing column or changing nullability, or normalizing column order, 
> but we do not yet seem to support promotion of null type to any other type.
> Small python example:
> {code}
> In [11]: df = pd.DataFrame({"col": np.array([None, None, None, None], 
> dtype='object')})
> ...: df.to_parquet("test_filter_schema.parquet", engine="pyarrow")
> ...:
> ...: import pyarrow.dataset as ds
> ...: dataset = ds.dataset("test_filter_schema.parquet", format="parquet", 
> schema=pa.schema([("col", pa.int64())]))
> ...: dataset.to_table()
> ...
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Scanner.to_table()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: fields had matching names but differing types. From: col: 
> null To: col: int64
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10120) [C++][Parquet] Create reading benchmarks for 2-level nested data

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10120:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Create reading benchmarks for 2-level nested data
> 
>
> Key: ARROW-10120
> URL: https://issues.apache.org/jira/browse/ARROW-10120
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We already have benchmarks for reading one-level list and one-level struct. 
> It would be nice to add list-of-list, list-of-struct, struct-of-struct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9786) [R] Unvendor cpp11 before release

2020-10-05 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9786.

Resolution: Fixed

Issue resolved by pull request 8339
[https://github.com/apache/arrow/pull/8339]

> [R] Unvendor cpp11 before release
> -
>
> Key: ARROW-9786
> URL: https://issues.apache.org/jira/browse/ARROW-9786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7960) [C++][Parquet] Add support for schema translation from parquet nodes back to arrow for missing types

2020-10-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208112#comment-17208112
 ] 

Micah Kornfield commented on ARROW-7960:


this won't include large binary or large string for now (I had not realized 
they were not implemented).

> [C++][Parquet] Add support for schema translation from parquet nodes back to 
> arrow for missing types
> 
>
> Key: ARROW-7960
> URL: https://issues.apache.org/jira/browse/ARROW-7960
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Map seems to be the most obvious one missing.  Without additional metadata I 
> don't think FixedSizeList is possible.  LargeList could probably be 
> determined empirically while parsing, if there are any entries that exceed 
> the int32 range (or via metadata).  Need to also double-check that struct is 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-10-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208110#comment-17208110
 ] 

Micah Kornfield commented on ARROW-9943:


If you could do it based off of 
https://github.com/emkornfield/arrow/tree/read_most_types that could save me 
some merge conflicts.  The branch is my work in progress for ARROW-7960 (I need 
to add tests to verify my code for maps works, but other round trips seem to 
work; I had to fix some bugs for FixedSizeList).

> [C++] Arrow metadata not applied recursively when reading Parquet file
> --
>
> Key: ARROW-9943
> URL: https://issues.apache.org/jira/browse/ARROW-9943
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is 
> only applied for the top-level node of each schema field. Nested metadata 
> (such as dicts-inside-lists, etc.) will not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-10-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-9943:
--

Assignee: Antoine Pitrou  (was: Micah Kornfield)

> [C++] Arrow metadata not applied recursively when reading Parquet file
> --
>
> Key: ARROW-9943
> URL: https://issues.apache.org/jira/browse/ARROW-9943
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is 
> only applied for the top-level node of each schema field. Nested metadata 
> (such as dicts-inside-lists, etc.) will not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-10-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208108#comment-17208108
 ] 

Micah Kornfield commented on ARROW-9943:


Please go ahead and pick this up.  I think getting it into 2.0 will be good 
because it enables things like reading LargeList/FixedSizeList as nested 
elements.

> [C++] Arrow metadata not applied recursively when reading Parquet file
> --
>
> Key: ARROW-9943
> URL: https://issues.apache.org/jira/browse/ARROW-9943
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is 
> only applied for the top-level node of each schema field. Nested metadata 
> (such as dicts-inside-lists, etc.) will not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10080) [R] Arrow does not release unused memory

2020-10-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208107#comment-17208107
 ] 

András Svraka commented on ARROW-10080:
---

The problem still exists on {{arrow_1.0.1.20201004}}.

> [R] Arrow does not release unused memory
> 
>
> Key: ARROW-10080
> URL: https://issues.apache.org/jira/browse/ARROW-10080
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Linux, Windows
>Reporter: András Svraka
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: sessioninfo.txt
>
>
> I’m having problems when {{collect()}}-ing Arrow data sources into data 
> frames that are close in size to the available memory on the machine. 
> Consider the following workflow. I have a dataset which I want to query so 
> that at some point it needs to be {{collect()}}-ed, but at the same time I’m also 
> reducing the result. During the intermediate step the entire data frame fits 
> into memory, and the following code runs without any problems.
> {code:r}
> test_ds <- "memory_test"
> ds1 <- open_dataset(test_ds) %>%
>   collect() %>%
>   dim()
> {code}
> However, running the same code in the same R session again fails with R 
> running out of memory.
> {code:r}
> ds1 <- open_dataset(test_ds) %>%
>   collect() %>%
>   dim()
> {code}
> The example might be a bit contrived, but you can easily imagine a workflow 
> where different queries are run on a dataset and the reduced results are 
> stored.
> As far as I understand, R is a garbage collected language, and in this case 
> there aren’t any references left to large objects in memory. And indeed, the 
> second query succeeds when manually forcing a garbage collection.
> Is this the expected behaviour from Arrow?
> I know, this is quite hard to reproduce, as the exact dataset size required 
> to trigger this behaviour depends on the particular machine but I prepared a 
> reproducible example in [this 
> gist|https://gist.github.com/svraka/c63fca51c6cc50020551e2319ff652b7], that 
> should give the same result on Ubuntu 20.04 with 1GB RAM and no swap. See 
> attachment for {{sessionInfo()}} output. I ran it on a Digitalocean 
> {{s-1vcpu-1gb}} droplet.
> First, let’s create a partitioned Arrow dataset:
> {code:java}
> $ Rscript ds_prep.R 100 5
> {code}
> The first command line argument gives the number of rows in each partition, 
> and the second gives the number of partitions. The parameters are set so that the 
> entire dataset should fit into memory.
> Then running the two queries fails:
> {code:java}
> $ Rscript ds_read.R
> Running query, 1st try...
> ds size, 1st run: 56
> Running query, 2nd try...
> [1]    11151 killed     Rscript ds_read.R
> {code}
> However, when forcing a {{gc()}} (which I’m controlling here with a command 
> line argument), it succeeds:
> {code:java}
> $ Rscript ds_read.R 1
> Running query, 1st try...
> ds size, 1st run: 56
> running gc() ...
>            used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells   703052 37.6    1571691  84.0  1038494  55.5
> Vcells  1179578  9.0   36405636 277.8 41188956 314.3
> Running query, 2nd try...
> ds size, 2nd run: 56
> {code}
> In general, [one shouldn’t have to use {{gc()}} 
> manually|https://adv-r.hadley.nz/names-values.html#gc]. Interestingly, 
> setting R’s garbage collection to be more aggressive (see {{?Memory}}) doesn’t help 
> either:
> {code:java}
> $ R_GC_MEM_GROW=0 Rscript ds_read.R
> Running query, 1st try...
> ds size, 1st run: 56
> Running query, 2nd try...
> [1]    11422 killed     Rscript ds_read.R
> {code}
> I didn’t try to reproduce this problem on macOS, as my Mac would probably 
> start swapping furiously but I managed to reproduce it on a Windows 7 machine 
> with practically no swap. Of course the parameters are different, and the 
> error messages are presumably system specific.
> {code:java}
> $ Rscript ds_prep.R 100 40
> $ Rscript ds_read.R
> Running query, 1st try...
> ds size, 1st run: 56
> Running query, 2nd try...
> Error in dataset___Scanner__ToTable(self) :
>   IOError: Out of memory: malloc of size 524288 failed
> Calls: collect ... shared_ptr -> shared_ptr_is_null -> 
> dataset___Scanner__ToTable
> Execution halted
> $ Rscript ds_read.R 1
> Running query, 1st try...
> ds size, 1st run: 56
> running gc() ...
>            used (Mb) gc trigger   (Mb)  max used (Mb)
> Ncells   688789 36.8    1198030   64.0   1198030   64
> Vcells  1109451  8.5  271538343 2071.7 321118845 2450
> Running query, 2nd try...
> ds size, 2nd run: 56
> $ R_GC_MEM_GROW=0 Rscript ds_read.R
> Running query, 1st try...
> ds size, 1st run: 56
> Running query, 2nd try...
> Error in dataset___Scanner__ToTable(self) :
>   IOError: Out of memory: malloc of size 524288 failed
> Calls: collect ... shared_ptr -> shared_ptr_is_null -> 
> dataset___Scanner__ToTable
> Execution halted
> {code}

[jira] [Commented] (ARROW-10172) concat_arrays requires upcast for large array

2020-10-05 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208105#comment-17208105
 ] 

Artem KOZHEVNIKOV commented on ARROW-10172:
---

btw, casting to large_string is not supported either (it's maybe related):
{code:python}
str_array.cast(pa.large_string())
ArrowNotImplementedError  Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 str_array.cast(pa.large_string())

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/table.pxi in 
pyarrow.lib.ChunkedArray.cast()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/compute.py in 
cast(arr, target_type, safe)
 85 else:
 86 options = _pc.CastOptions.unsafe(target_type)
---> 87 return call_function("cast", [arr], options)
 88 
 89 

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/_compute.pyx in 
pyarrow._compute.call_function()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/_compute.pyx in 
pyarrow._compute.Function.call()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in 
pyarrow.lib.check_status()

ArrowNotImplementedError: Unsupported cast from string to large_utf8 using 
function cast_large_string
{code}

> concat_arrays requires upcast for large array
> -
>
> Key: ARROW-10172
> URL: https://issues.apache.org/jira/browse/ARROW-10172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue in 
> concatenation of large arrays:
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose this should be handled by an upcast to large_string.
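> In the meantime, one possible workaround (a reduced-scale sketch, assuming 
> concatenation is implemented for large_string in this version) is to build 
> the inputs with 64-bit offsets up front so they cannot overflow:
> {code:python}
> import pyarrow as pa
> 
> # large_string carries 64-bit offsets, so concatenation cannot overflow the
> # int32 offset buffer the way plain string does
> str_array = pa.array(["a" * 128] * 10**4, type=pa.large_string())
> large_array = pa.concat_arrays([str_array] * 50)
> print(large_array.type, len(large_array))
> {code}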



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10174) [Java] Reading of Dictionary encoded struct vector fails

2020-10-05 Thread Benjamin Wilhelm (Jira)
Benjamin Wilhelm created ARROW-10174:


 Summary: [Java] Reading of Dictionary encoded struct vector fails 
 Key: ARROW-10174
 URL: https://issues.apache.org/jira/browse/ARROW-10174
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 1.0.1
Reporter: Benjamin Wilhelm


Write an index vector together with a dictionary whose dictionary vector is of 
type {{Struct}} using an {{ArrowStreamWriter}}. Reading it back fails with an 
exception.

Code to reproduce:

{code:java}
final RootAllocator allocator = new RootAllocator();

// Create the dictionary
final StructVector dict = StructVector.empty("Dict", allocator);
final NullableStructWriter dictWriter = dict.getWriter();
final IntWriter dictA = dictWriter.integer("a");
final IntWriter dictB = dictWriter.integer("b");
for (int i = 0; i < 3; i++) {
dictWriter.start();
dictA.writeInt(i);
dictB.writeInt(i);
dictWriter.end();
}
dict.setValueCount(3);
final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, 
false, null));

// Create the vector
final Random random = new Random();
final StructVector vector = StructVector.empty("Dict", allocator);
final NullableStructWriter vectorWriter = vector.getWriter();
final IntWriter vectorA = vectorWriter.integer("a");
final IntWriter vectorB = vectorWriter.integer("b");
for (int i = 0; i < 10; i++) {
int v = random.nextInt(3);
vectorWriter.start();
vectorA.writeInt(v);
vectorB.writeInt(v);
vectorWriter.end();
}
vector.setValueCount(10);

// Encode the vector using the dictionary
final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, 
dictionary);

// Write the vector to out
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final VectorSchemaRoot root = new 
VectorSchemaRoot(Collections.singletonList(indexVector.getField()),
Collections.singletonList(indexVector));
final ArrowStreamWriter writer = new ArrowStreamWriter(root, new 
MapDictionaryProvider(dictionary),
Channels.newChannel(out));
writer.start();
writer.writeBatch();
writer.end();

// Read the vector from out
try (final ArrowStreamReader reader = new ArrowStreamReader(new 
ByteArrayInputStream(out.toByteArray()),
allocator)) {
reader.loadNextBatch();
final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
final FieldVector readIndexVector = readRoot.getVector(0);

// Get the dictionary and decode
final Map<Long, Dictionary> readDictionaryMap = 
reader.getDictionaryVectors();
final Dictionary readDictionary = 
readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
final ValueVector readVector = 
DictionaryEncoder.decode(readIndexVector, readDictionary);
}
{code}

Exception:
{code}
java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, 
nullCount=0]] buffers: [ArrowBuf[21], address:140118352739688, length:1, 
ArrowBuf[22], address:140118352739696, length:12, ArrowBuf[23], 
address:140118352739712, length:1, ArrowBuf[24], address:140118352739720, 
length:12]
at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
at 
org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
at 
org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
{code}

If I see it correctly, the error happens in 
{{DictionaryUtilities#toMessageFormat}}. When a dictionary-encoded vector is 
encountered, the children of the memory-format field are still used (none, 
because the index vector is an Int). However, the children of the dictionary 
vector's field should be mapped to the message format and set as children.

I can create a fix and open a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10093) [R] Add ability to opt-out of int64 -> int demotion

2020-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10093:
---
Labels: pull-request-available  (was: )

> [R] Add ability to opt-out of int64 -> int demotion
> ---
>
> Key: ARROW-10093
> URL: https://issues.apache.org/jira/browse/ARROW-10093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Kyle Kavanagh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, if arrow detects that every value in an int64 column can fit in a 
> 32-bit int, it will downcast the column and set the type to integer instead of 
> integer64.  Not having a mechanism to disable this optimization makes it 
> tricky if you have many parallel processes (think HPC use case) performing 
> the same calculation but potentially outputting different result values, some 
> being >2^32 and others not.  When you go to collect the resulting feather 
> files from the parallel processes, the types across the files may not line up.
> Feature request is to provide an option to disable this demotion and maintain 
> the source column type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

2020-10-05 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-10159:
---

Assignee: Andrew Lamb

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> --
>
> Key: ARROW-10159
> URL: https://issues.apache.org/jira/browse/ARROW-10159
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low-cardinality string data (aka there 
> are only a few distinct values, but there are many millions of values).
> Using a `StringArray` is very expensive as the same string value is copied 
> over and over again. The `DictionaryArray` was designed exactly to handle 
> this situation: rather than repeating each string, it uses indexes into a 
> dictionary and thus repeats integer values. 
> Sadly, DataFusion does not support processing on `DictionaryArray` types for 
> several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I 
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
> // ensure that data fusion can operate on dictionary types
> // Use StringDictionary (32 bit indexes = keys)
> let field_type = DataType::Dictionary(
> Box::new(DataType::Int32),
> Box::new(DataType::Utf8),
> );
> let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, 
> true)]));
> let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
> let values_builder = StringBuilder::new(10);
> let mut builder = StringDictionaryBuilder::new(
> keys_builder, values_builder
> );
> builder.append("one")?;
> builder.append_null()?;
> builder.append("three")?;
> let array = Arc::new(builder.finish());
> let data = RecordBatch::try_new(
> schema.clone(),
> vec![array],
> )?;
> let table = MemTable::new(schema, vec![vec![data]])?;
> let mut ctx = ExecutionContext::new();
> ctx.register_table("test", Box::new(table));
> // Basic SELECT
> let sql = "SELECT * FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\nNULL\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // basic filtering
> let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // filtering with constant
> let sql = "SELECT * FROM test WHERE d1 = 'three'";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"three\"".to_string();
> assert_eq!(expected, actual);
> // Expression evaluation
> let sql = "SELECT concat(d1, '-foo') FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
> assert_eq!(expected, actual);
> // aggregation
> let sql = "SELECT COUNT(d1) FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "2".to_string();
> assert_eq!(expected, actual);
> Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == 
> right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. 
> I will break the work down into several smaller subtasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

