[jira] [Updated] (ARROW-11223) [Java] BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and getBufferSizeFor is buggy

2021-01-18 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated ARROW-11223:
---
Summary: [Java] BaseVariableWidthVector/BaseLargeVariableWidthVector 
setNull and getBufferSizeFor is buggy  (was: [Java] BaseVariableWidthVector 
setNull and getBufferSizeFor is buggy)

> [Java] BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and 
> getBufferSizeFor is buggy
> -
>
> Key: ARROW-11223
> URL: https://issues.apache.org/jira/browse/ARROW-11223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We may get the error java.lang.IndexOutOfBoundsException: index: 15880, 
> length: 4 (expected: range(0, 15880)).
> I tested this on Arrow 2.0.0.
> Reproduction code in Scala:
> {code}
> import org.apache.arrow.vector.VarCharVector
> import org.apache.arrow.memory.RootAllocator
> val rootAllocator = new RootAllocator(Long.MaxValue)
> val v1 = new VarCharVector("var1", rootAllocator)
> v1.allocateNew()
> val valueCount = 3970 // any value >= 3970 produces a similar error
> for (idx <- 0 until valueCount) {
>   v1.setNull(idx)
> }
> v1.getBufferSizeFor(valueCount) // fails with java.lang.IndexOutOfBoundsException:
> // index: 15880, length: 4 (expected: range(0, 15880))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11314) [Release][APT][Yum] Add support for verifying arm64 packages

2021-01-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11314:


 Summary:  [Release][APT][Yum] Add support for verifying arm64 
packages
 Key: ARROW-11314
 URL: https://issues.apache.org/jira/browse/ARROW-11314
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11314) [Release][APT][Yum] Add support for verifying arm64 packages

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11314:
---
Labels: pull-request-available  (was: )

>  [Release][APT][Yum] Add support for verifying arm64 packages
> -
>
> Key: ARROW-11314
> URL: https://issues.apache.org/jira/browse/ARROW-11314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11313) [Rust] Size hint of iterators is incorrect

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11313:
---
Labels: pull-request-available  (was: )

> [Rust] Size hint of iterators is incorrect
> --
>
> Key: ARROW-11313
> URL: https://issues.apache.org/jira/browse/ARROW-11313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11313) [Rust] Size hint of iterators is incorrect

2021-01-18 Thread Jira
Jorge Leitão created ARROW-11313:


 Summary: [Rust] Size hint of iterators is incorrect
 Key: ARROW-11313
 URL: https://issues.apache.org/jira/browse/ARROW-11313
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) [Rust] unset_bit is toggling bits, not unsetting them

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Component/s: Rust

> [Rust] unset_bit is toggling bits, not unsetting them
> -
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.
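
For reference, a minimal self-contained sketch of the difference (illustrative code, not the arrow crate's actual {{bit_util}} implementation): unsetting must AND with the inverted mask, while the reported behaviour corresponds to an XOR, which only toggles the bit.

{code}
// Illustrative only -- not the arrow crate's actual bit_util code.
fn unset_bit(data: &mut [u8], i: usize) {
    data[i / 8] &= !(1u8 << (i % 8)); // clear: the bit always ends up 0
}

fn toggle_bit(data: &mut [u8], i: usize) {
    data[i / 8] ^= 1u8 << (i % 8); // toggle: flips whatever value is there
}

fn main() {
    let mut buf = [0b0000_0001u8];
    unset_bit(&mut buf, 0);
    assert_eq!(buf[0], 0); // stays 0 no matter how many times it is called
    toggle_bit(&mut buf, 0);
    assert_eq!(buf[0], 1); // toggling a cleared bit sets it again
}
{code}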



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11312) [Rust] Implement FromIter for timestamps, that includes timezone info

2021-01-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11312:
--

 Summary: [Rust] Implement FromIter for timestamps, that includes 
timezone info
 Key: ARROW-11312
 URL: https://issues.apache.org/jira/browse/ARROW-11312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


We currently have TimestampArray::from_vec and TimestampArray::from_opt_vec in 
order to provide timezone information. We do not have an option that uses 
FromIter.

When implementing this, we should search the codebase (esp Parquet) and replace 
the vector-based methods above with iterators where it makes sense.
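
A minimal sketch of what the requested iterator-based construction could look like, layered on the existing {{from_opt_vec}}; the helper name and the {{Option<String>}} timezone parameter are assumptions for illustration, not the final API:

{code}
use arrow::array::TimestampNanosecondArray;

// Hypothetical helper illustrating the requested API: build a timezone-aware
// timestamp array from any iterator of optional values. Internally it simply
// falls back to the existing from_opt_vec constructor.
fn timestamp_ns_from_iter<I>(iter: I, tz: Option<String>) -> TimestampNanosecondArray
where
    I: IntoIterator<Item = Option<i64>>,
{
    TimestampNanosecondArray::from_opt_vec(iter.into_iter().collect(), tz)
}

fn main() {
    let array = timestamp_ns_from_iter((0i64..3).map(Some), Some("UTC".to_string()));
    println!("{:?}", array);
}
{code}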



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11310) [Rust] Implement arrow JSON writer

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11310:
---
Summary: [Rust] Implement arrow JSON writer  (was: implement arrow JSON 
writer)

> [Rust] Implement arrow JSON writer
> --
>
> Key: ARROW-11310
> URL: https://issues.apache.org/jira/browse/ARROW-11310
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) [Rust] unset_bit is toggling bits, not unsetting them

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11311:
---
Summary: [Rust] unset_bit is toggling bits, not unsetting them  (was: 
unset_bit is toggling bits, not unsetting them)

> [Rust] unset_bit is toggling bits, not unsetting them
> -
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7396) [Format] Register media types (MIME types) for Apache Arrow formats to IANA

2021-01-18 Thread QP Hou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267660#comment-17267660
 ] 

QP Hou commented on ARROW-7396:
---

Any update on this task? Should we start a vote for what [~maartenbreddels] 
proposed to move this forward?

> [Format] Register media types (MIME types) for Apache Arrow formats to IANA
> ---
>
> Key: ARROW-7396
> URL: https://issues.apache.org/jira/browse/ARROW-7396
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Kouhei Sutou
>Priority: Major
>
> See "MIME types" thread for details: 
> https://lists.apache.org/thread.html/b15726d0c0da2223ba1b45a226ef86263f688b20532a30535cd5e267%40%3Cdev.arrow.apache.org%3E
> Summary:
>   * If we don't register our media types for the Apache Arrow formats (IPC File 
> Format and IPC Streaming Format) with IANA, we should use the "x-" prefix, such 
> as "application/x-apache-arrow-file".
>   * It may be better to follow the same convention as Apache Thrift, which 
> registers its media types as "application/vnd.apache.thrift.XXX". If we follow 
> that convention, we would use "application/vnd.apache.arrow.file" or something 
> similar.
> TODO:
>   * Decide which media types we should register. (Do we need a vote?)
>   * Register our media types with IANA.
>   ** Media types page: 
> https://www.iana.org/assignments/media-types/media-types.xhtml
>   ** Application form for new media types: 
> https://www.iana.org/form/media-types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) unset_bit is toggling bits, not unsetting them

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11311:
---
Labels: pull-request-available  (was: )

> unset_bit is toggling bits, not unsetting them
> --
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) unset_bit is incorrect

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Description: The functions {{bit_util::unset_bit[_raw]}} are currently 
toggling bits, not unsetting them.  (was: The functions 
{{bit_util::set_bit[_raw]}} are currently toggling bits, not setting them.)

> unset_bit is incorrect
> --
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) unset_bit is toggling bits, not unsetting them

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Summary: unset_bit is toggling bits, not unsetting them  (was: unset_bit is 
incorrect)

> unset_bit is toggling bits, not unsetting them
> --
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> The functions {{bit_util::unset_bit[_raw]}} are currently toggling bits, not 
> unsetting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) unset_bit is incorrect

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Summary: unset_bit is incorrect  (was: set_bit is incorrect)

> unset_bit is incorrect
> --
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> The functions {{bit_util::set_bit[_raw]}} are currently toggling bits, not 
> setting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) set_bit is incorrect

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Description: The functions {{bit_util::set_bit[_raw]}} are currently 
toggling bits, not setting them.  (was: The functions 
{{bit_util::[un]set_bit[_raw]}} are currently flipping a bit, not setting or 
unsetting it.)

> set_bit is incorrect
> 
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> The functions {{bit_util::set_bit[_raw]}} are currently toggling bits, not 
> setting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11311) set_bit is incorrect

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11311:
-
Summary: set_bit is incorrect  (was: set_bit and unset_bit are incorrect)

> set_bit is incorrect
> 
>
> Key: ARROW-11311
> URL: https://issues.apache.org/jira/browse/ARROW-11311
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> The functions {{bit_util::[un]set_bit[_raw]}} are currently flipping a bit, 
> not setting or unsetting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11311) set_bit and unset_bit are incorrect

2021-01-18 Thread Jira
Jorge Leitão created ARROW-11311:


 Summary: set_bit and unset_bit are incorrect
 Key: ARROW-11311
 URL: https://issues.apache.org/jira/browse/ARROW-11311
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jorge Leitão
Assignee: Jorge Leitão


The functions {{bit_util::[un]set_bit[_raw]}} are currently flipping a bit, not 
setting or unsetting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11310) implement arrow JSON writer

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11310:
---
Labels: pull-request-available  (was: )

> implement arrow JSON writer
> ---
>
> Key: ARROW-11310
> URL: https://issues.apache.org/jira/browse/ARROW-11310
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11310) implement arrow JSON writer

2021-01-18 Thread QP Hou (Jira)
QP Hou created ARROW-11310:
--

 Summary: implement arrow JSON writer
 Key: ARROW-11310
 URL: https://issues.apache.org/jira/browse/ARROW-11310
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: QP Hou
Assignee: QP Hou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11309) [Release][C#] Use .NET 3.1 for verification

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11309.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9254
[https://github.com/apache/arrow/pull/9254]

> [Release][C#] Use .NET 3.1 for verification
> ---
>
> Key: ARROW-11309
> URL: https://issues.apache.org/jira/browse/ARROW-11309
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11309) [Release][C#] Use .NET 3.1 for verification

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11309:
---
Labels: pull-request-available  (was: )

> [Release][C#] Use .NET 3.1 for verification
> ---
>
> Key: ARROW-11309
> URL: https://issues.apache.org/jira/browse/ARROW-11309
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11309) [Release][C#] Use .NET 3.1 for verification

2021-01-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11309:


 Summary: [Release][C#] Use .NET 3.1 for verification
 Key: ARROW-11309
 URL: https://issues.apache.org/jira/browse/ARROW-11309
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11061) [Rust] Validate array properties against schema

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11061:
---
Fix Version/s: 4.0.0

> [Rust] Validate array properties against schema
> ---
>
> Key: ARROW-11061
> URL: https://issues.apache.org/jira/browse/ARROW-11061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 4.0.0
>
>
> We have a problem when it comes to nested arrays, where one could create a 
> list array whose child fields can't be null, but the list itself can have 
> null slots.
> This creates a lot of work when working with such nested arrays, because we 
> have to create work-arounds to account for this, and take unnecessarily 
> slow paths.
> I propose that we prevent this problem at the source, by:
>  * checking that a batch can't be created with arrays that have incompatible 
> null contracts (a minimal sketch of such a check follows after this list)
>  * preventing list and struct children from being non-null if any descendant 
> of such children is null (might be less of an issue for structs)
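
A minimal, self-contained sketch of the first check (a hypothetical helper for illustration, not the actual validation the crate would ship), assuming it runs over the (field, column) pairs at batch-construction time:

{code}
use arrow::array::{Array, ArrayRef};
use arrow::datatypes::Schema;
use arrow::error::{ArrowError, Result};

// Reject a batch whose columns contain nulls even though the corresponding
// schema field is declared non-nullable.
fn check_null_contract(schema: &Schema, columns: &[ArrayRef]) -> Result<()> {
    for (field, column) in schema.fields().iter().zip(columns) {
        if !field.is_nullable() && column.null_count() > 0 {
            return Err(ArrowError::InvalidArgumentError(format!(
                "non-nullable field '{}' contains {} null values",
                field.name(),
                column.null_count()
            )));
        }
    }
    Ok(())
}

fn main() {
    use arrow::array::Int32Array;
    use arrow::datatypes::{DataType, Field};
    use std::sync::Arc;

    let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
    let column: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None]));
    // The null in a non-nullable column is reported instead of silently accepted.
    assert!(check_null_contract(&schema, &[column]).is_err());
}
{code}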



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11077) [Rust] ParquetFileArrowReader panicks when trying to read nested list

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11077:
---
Fix Version/s: 4.0.0

> [Rust] ParquetFileArrowReader panicks when trying to read nested list
> -
>
> Key: ARROW-11077
> URL: https://issues.apache.org/jira/browse/ARROW-11077
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Ben Sully
>Assignee: Neville Dipale
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: small-nested-lists.parquet
>
>
> I think this is documented in the code, but I can't be 100% sure.
> When trying to execute a DataFusion query over a Parquet file where one field 
> is a struct with a nested list, the thread panics due to unwrapping on an 
> `Option::None` [at this 
> point|https://github.com/apache/arrow/blob/36d80e37373ab49454eb47b2a89c10215ca1b67e/rust/parquet/src/arrow/array_reader.rs#L1334-L1337].
>  This `None` is returned by 
> [`visit_primitive`|https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/array_reader.rs#L1243-L1245],
>  but I can't quite make sense of _why_ it returns a `None` rather than an 
> error?
> I added a couple of dbg! calls to see what the item_type and list_type are:
> {code}
> [/home/ben/repos/rust/arrow/rust/parquet/src/arrow/array_reader.rs:1339] 
> _type = PrimitiveType {
> basic_info: BasicTypeInfo {
> name: "item",
> repetition: Some(
> OPTIONAL,
> ),
> logical_type: UTF8,
> id: None,
> },
> physical_type: BYTE_ARRAY,
> type_length: -1,
> scale: -1,
> precision: -1,
> }
> [/home/ben/repos/rust/arrow/rust/parquet/src/arrow/array_reader.rs:1340] 
> _type = GroupType {
> basic_info: BasicTypeInfo {
> name: "tags",
> repetition: Some(
> OPTIONAL,
> ),
> logical_type: LIST,
> id: None,
> },
> fields: [
> GroupType {
> basic_info: BasicTypeInfo {
> name: "list",
> repetition: Some(
> REPEATED,
> ),
> logical_type: NONE,
> id: None,
> },
> fields: [
> PrimitiveType {
> basic_info: BasicTypeInfo {
> name: "item",
> repetition: Some(
> OPTIONAL,
> ),
> logical_type: UTF8,
> id: None,
> },
> physical_type: BYTE_ARRAY,
> type_length: -1,
> scale: -1,
> precision: -1,
> },
> ],
> },
> ],
> }{code}
> I guess we should at least use `.expect` here instead of `.unwrap` so it's 
> more clear why this is happening!
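
For reference, a tiny self-contained illustration of the suggested {{.expect}} change; the variable name and message text are made up here, the real code would name the schema node being visited:

{code}
fn main() {
    // Stand-in for the Option returned by visit_primitive in array_reader.rs
    // (illustrative only, not the real reader code).
    let maybe_reader: Option<&str> = Some("primitive item reader");

    // Before: maybe_reader.unwrap() -- on None this panics with the generic
    // "called `Option::unwrap()` on a `None` value".
    // After: on None the panic message explains what actually went wrong.
    let reader = maybe_reader
        .expect("visit_primitive returned None while building a nested list reader");
    println!("{}", reader);
}
{code}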



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10550) [Rust] [Parquet] Write nested types (struct, list)

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10550:
---
Fix Version/s: 4.0.0

> [Rust] [Parquet] Write nested types (struct, list)
> --
>
> Key: ARROW-10550
> URL: https://issues.apache.org/jira/browse/ARROW-10550
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 4.0.0
>
>
> After being able to compute arbitrarily nested definition and repetitions, we 
> should be able to write structs and lists
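
For reference, a hand-worked example (computed by hand, not output from the parquet crate) of the definition/repetition levels mentioned above, for a nullable list-of-int32 column holding [[1, null], null, []]:

{code}
fn main() {
    // Column: nullable list of nullable int32, values [[1, null], null, []].
    // Max definition level 3 (optional list, repeated element, optional item),
    // max repetition level 1 (one repeated level). Hand-computed:
    let def_levels: Vec<i16> = vec![3, 2, 0, 1]; // 1 -> 3, null item -> 2, null list -> 0, [] -> 1
    let rep_levels: Vec<i16> = vec![0, 1, 0, 0]; // rep 0 starts a new list
    assert_eq!(def_levels.len(), rep_levels.len());
    println!("def: {:?}, rep: {:?}", def_levels, rep_levels);
}
{code}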



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10927) [Rust] [Parquet] Add Decimal to ArrayBuilderReader for physical type fixed size binary

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10927:
---
Summary: [Rust] [Parquet] Add Decimal to ArrayBuilderReader for physical 
type fixed size binary  (was: Add Decimal to ArrayBuilderReader for physical 
type fixed size binary)

> [Rust] [Parquet] Add Decimal to ArrayBuilderReader for physical type fixed 
> size binary
> --
>
> Key: ARROW-10927
> URL: https://issues.apache.org/jira/browse/ARROW-10927
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Florian Müller
>Assignee: Florian Müller
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11308) [Rust] [Parquet] Add Arrow decimal array writer

2021-01-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11308:
--

 Summary: [Rust] [Parquet] Add Arrow decimal array writer
 Key: ARROW-11308
 URL: https://issues.apache.org/jira/browse/ARROW-11308
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10926) Add parquet reader / writer for decimal types

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10926:
---
Fix Version/s: 4.0.0

> Add parquet reader / writer for decimal types
> -
>
> Key: ARROW-10926
> URL: https://issues.apache.org/jira/browse/ARROW-10926
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Florian Müller
>Priority: Major
> Fix For: 4.0.0
>
>
> Decimal values, stored physically as e.g. Fixed Size Binary should be 
> represented by DecimalArray when the logical type indicates decimal.
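
A minimal sketch of the core conversion this implies, assuming the usual Parquet encoding of decimals as big-endian two's-complement bytes (illustrative code, not the parquet crate's reader):

{code}
// Illustrative only; a real reader would also carry precision and scale
// from the Parquet logical type into the resulting DecimalArray.
fn fixed_size_binary_to_i128(bytes: &[u8]) -> i128 {
    // Sign-extend from the most significant byte, then fold in the rest.
    let mut value: i128 = if bytes[0] & 0x80 != 0 { -1 } else { 0 };
    for &b in bytes {
        value = (value << 8) | b as i128;
    }
    value
}

fn main() {
    // Two-byte example: 0xFF38 is -200 in two's complement,
    // i.e. -2.00 for a decimal with scale 2.
    assert_eq!(fixed_size_binary_to_i128(&[0xFF, 0x38]), -200);
    assert_eq!(fixed_size_binary_to_i128(&[0x00, 0xC8]), 200);
}
{code}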



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11269) [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-11269:
--

Assignee: Neville Dipale

> [Rust] Unable to read Parquet file because of mismatch in column-derived and 
> embedded schemas
> -
>
> Key: ARROW-11269
> URL: https://issues.apache.org/jira/browse/ARROW-11269
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Max Burke
>Assignee: Neville Dipale
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet 
> reader where the embedded arrow schema is used instead of deriving the schema 
> from the Parquet columns.
>  
> However it seems like some cases still derive the schema type from the column 
> types, leading to the Arrow record batch reader erroring out that the column 
> types must match the schema types.
>  
> In our case, the column type is an int96 datetime (ns) type, and the Arrow 
> type in the embedded schema is DataType::Timestamp(TimeUnit::Nanoseconds, 
> Some("UTC")). However, the code that constructs the Arrays seems to re-derive 
> this column type as DataType::Timestamp(TimeUnit::Nanoseconds, None) (because 
> the Parquet schema has no timezone information). And so, Parquet files that 
> we were able to read successfully with our branch of Arrow circa October are 
> now unreadable.
>  
> I've attached an example of a Parquet file that demonstrates the problem. 
> This file was created in Python (as most of our Parquet files are).
>  
> I've also attached a sample Rust program that will demonstrate the error.
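
A tiny self-contained illustration of the mismatch described above (not the reader code itself; the {{Option<String>}} timezone shown here is assumed to match the arrow crate's {{DataType::Timestamp}} definition):

{code}
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // What the embedded Arrow schema says vs. what gets re-derived from the
    // Parquet column (which carries no timezone information).
    let embedded = DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".to_string()));
    let derived = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // A strict equality check between the two fails, which is why the record
    // batch reader errors out with a schema/column type mismatch.
    assert_ne!(embedded, derived);
}
{code}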



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10984) [Rust] Document use of unsafe in parquet crate

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10984:
---
Fix Version/s: 4.0.0

> [Rust] Document use of unsafe in parquet crate
> --
>
> Key: ARROW-10984
> URL: https://issues.apache.org/jira/browse/ARROW-10984
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> There are ~64 uses of unsafe in the parquet crate
> {code:java}
> ./parquet/src/util/hash_util.rs:6
> ./parquet/src/util/bit_packing.rs:34
> ./parquet/src/util/bit_util.rs:1
> ./parquet/src/data_type.rs:12
> ./parquet/src/arrow/record_reader.rs:5
> ./parquet/src/arrow/array_reader.rs:8 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8796) [Rust] Allow parquet to be written directly to memory

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8796:
--
Fix Version/s: 4.0.0

> [Rust] Allow parquet to be written directly to memory
> -
>
> Key: ARROW-8796
> URL: https://issues.apache.org/jira/browse/ARROW-8796
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Markus Westerlind
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The `TryClone` bound currently needed in `ParquetWriter` makes it awkward to 
> write parquet to memory, forcing either an `Rc` + `RefCell` wrapper or 
> writing to a `File` first.
> By explicitly threading lifetimes around, the underlying writer can be passed 
> mutably through all parts of the writer, allowing a `Vec` of bytes or any 
> other implementor of the basic io traits to be used directly.
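
For context, a minimal sketch of the `Rc` + `RefCell` workaround mentioned above. It only illustrates the pattern of a cheaply cloneable in-memory sink; the type name is made up, and this is not the parquet crate's actual writer bound or types:

{code}
use std::cell::RefCell;
use std::io::{self, Write};
use std::rc::Rc;

// A cloneable handle to an in-memory buffer that still implements Write,
// which is what a writer API that insists on cloning its sink forces you into.
#[derive(Clone, Default)]
struct SharedBuffer(Rc<RefCell<Vec<u8>>>);

impl Write for SharedBuffer {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.borrow_mut().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let sink = SharedBuffer::default();
    let mut handle = sink.clone(); // "try_clone" boils down to this
    handle.write_all(b"parquet bytes would go here").unwrap();
    assert_eq!(sink.0.borrow().len(), 27);
}
{code}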



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10553) [Rust] [Parquet] Panic when reading Parquet file produced with parquet-cpp

2021-01-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10553:
---
Fix Version/s: 4.0.0

> [Rust] [Parquet] Panic when reading Parquet file produced with parquet-cpp
> --
>
> Key: ARROW-10553
> URL: https://issues.apache.org/jira/browse/ARROW-10553
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 2.0.0
> Environment: Windows 10 x86_64
> Cargo nightly
>Reporter: Michael Spector
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: 3072786907765935896_0_3.snappy.Parquet
>
>
> See attached Parquet file that was created with parquet-cpp.
> The file metadata is:
>  
>  {color:#dcdfe4}creator: parquet-cpp version 1.5.1-SNAPSHOT
>  
>  file schema: schema
>  
> 
>  __sys_isSystemRelocated: OPTIONAL INT64 R:0 D:1
>  __sys_schemaId: OPTIONAL INT64 R:0 D:1
>  __sys_invOffsetLSID: OPTIONAL INT64 R:0 D:1
>  __sys_invOffsetGroupIdx: OPTIONAL INT64 R:0 D:1
>  __sys_invOffsetRecordIdx: OPTIONAL INT64 R:0 D:1
>  _rid: OPTIONAL BINARY L:STRING R:0 D:1
>  __sys_sequenceNumber: OPTIONAL INT64 R:0 D:1
>  __sys_recordIndex: OPTIONAL INT64 R:0 D:1
>  __sys_isTombstone: OPTIONAL INT64 R:0 D:1
>  _ts: OPTIONAL INT64 R:0 D:1
>  partitionKey: OPTIONAL BINARY L:STRING R:0 D:1
>  entityType: OPTIONAL BINARY L:STRING R:0 D:1
>  ttl: OPTIONAL INT64 R:0 D:1
>  tripId: OPTIONAL INT32 R:0 D:1
>  vin: OPTIONAL BINARY L:STRING R:0 D:1
>  state: OPTIONAL BINARY L:STRING R:0 D:1
>  region: OPTIONAL INT32 R:0 D:1
>  outsideTemperature: OPTIONAL INT64 R:0 D:1
>  engineTemperature: OPTIONAL INT64 R:0 D:1
>  speed: OPTIONAL INT64 R:0 D:1
>  fuel: OPTIONAL INT64 R:0 D:1
>  fuelRate: OPTIONAL DOUBLE R:0 D:1
>  engineoil: OPTIONAL INT64 R:0 D:1
>  tirepressure: OPTIONAL INT64 R:0 D:1
>  odometer: OPTIONAL DOUBLE R:0 D:1
>  accelerator_pedal_position: OPTIONAL INT64 R:0 D:1
>  parking_brake_status: OPTIONAL BOOLEAN R:0 D:1
>  brake_pedal_status: OPTIONAL BOOLEAN R:0 D:1
>  headlamp_status: OPTIONAL BOOLEAN R:0 D:1
>  transmission_gear_position: OPTIONAL INT64 R:0 D:1
>  ignition_status: OPTIONAL BOOLEAN R:0 D:1
>  windshield_wiper_status: OPTIONAL BOOLEAN R:0 D:1
>  abs: OPTIONAL BOOLEAN R:0 D:1
>  refrigerationUnitKw: OPTIONAL DOUBLE R:0 D:1
>  refrigerationUnitTemp: OPTIONAL DOUBLE R:0 D:1
>  timestamp: OPTIONAL BINARY L:STRING R:0 D:1
>  id: OPTIONAL BINARY L:STRING R:0 D:1
>  _etag: OPTIONAL BINARY L:STRING R:0 D:1
>  __sys_value: OPTIONAL BINARY L:STRING R:0 D:1
>  
>  row group 1: RC:27150 TS:2481123 OFFSET:4
>  
> 
>  __sys_isSystemRelocated: INT64 SNAPPY DO:4 FPO:28 SZ:102/98/0.96 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 0, num_nulls: 0]
>  __sys_schemaId: INT64 SNAPPY DO:205 FPO:220 SZ:51/48/0.94 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]
>  __sys_invOffsetLSID: INT64 SNAPPY DO:308 FPO:323 SZ:51/48/0.94 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]
>  __sys_invOffsetGroupIdx: INT64 SNAPPY DO:416 FPO:431 SZ:51/48/0.94 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]
>  __sys_invOffsetRecordIdx: INT64 SNAPPY DO:528 FPO:543 SZ:51/48/0.94 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]
>  _rid: BINARY SNAPPY DO:641 FPO:137000 SZ:187417/811272/4.33 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: o9dcAMA1y14+BA==, max: 
> o9dcAMA1y17zaQAABA==, num_nulls: 0]
>  __sys_sequenceNumber: INT64 SNAPPY DO:188156 FPO:296856 
> SZ:159746/268260/1.68 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 3, 
> max: 27152, num_nulls: 0]
>  __sys_recordIndex: INT64 SNAPPY DO:348005 FPO:456699 SZ:159740/268260/1.68 
> VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 27149, num_nulls: 0]
>  __sys_isTombstone: INT64 SNAPPY DO:507845 FPO:507860 SZ:51/48/0.94 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]
>  _ts: INT64 SNAPPY DO:507954 FPO:510167 SZ:3974/6137/1.54 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 1597365315, max: 1597365859, 
> num_nulls: 0]
>  partitionKey: BINARY SNAPPY DO:512012 FPO:512256 SZ:13967/14026/1.00 
> VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0A4SMSAGR5CA4LAY6-2020-08, 
> max: YKO1Q8RX7Z20BVBG0-2020-08, num_nulls: 0]
>  entityType: BINARY SNAPPY DO:526088 FPO:526124 SZ:110/106/0.96 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: VehicleTelemetry, max: 
> VehicleTelemetry, num_nulls: 0]
>  ttl: INT64 SNAPPY DO:526285 FPO:526309 SZ:102/98/0.96 VC:27150 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE 

[jira] [Updated] (ARROW-11269) [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11269:
---
Labels: pull-request-available  (was: )

> [Rust] Unable to read Parquet file because of mismatch in column-derived and 
> embedded schemas
> -
>
> Key: ARROW-11269
> URL: https://issues.apache.org/jira/browse/ARROW-11269
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Max Burke
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet 
> reader where the embedded arrow schema is used instead of deriving the schema 
> from the Parquet columns.
>  
> However it seems like some cases still derive the schema type from the column 
> types, leading to the Arrow record batch reader erroring out that the column 
> types must match the schema types.
>  
> In our case, the column type is an int96 datetime (ns) type, and the Arrow 
> type in the embedded schema is DataType::Timestamp(TimeUnit::Nanoseconds, 
> Some("UTC")). However, the code that constructs the Arrays seems to re-derive 
> this column type as DataType::Timestamp(TimeUnit::Nanoseconds, None) (because 
> the Parquet schema has no timezone information). And so, Parquet files that 
> we were able to read successfully with our branch of Arrow circa October are 
> now unreadable.
>  
> I've attached an example of a Parquet file that demonstrates the problem. 
> This file was created in Python (as most of our Parquet files are).
>  
> I've also attached a sample Rust program that will demonstrate the error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11303) [Release][C++] Enable mimalloc in the windows verification script

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-11303:


Assignee: Krisztian Szucs

> [Release][C++] Enable mimalloc in the windows verification script
> -
>
> Key: ARROW-11303
> URL: https://issues.apache.org/jira/browse/ARROW-11303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11303) [Release][C++] Enable mimalloc in the windows verification script

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11303.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9247
[https://github.com/apache/arrow/pull/9247]

> [Release][C++] Enable mimalloc in the windows verification script
> -
>
> Key: ARROW-11303
> URL: https://issues.apache.org/jira/browse/ARROW-11303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11183) [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing

2021-01-18 Thread Ivan Smirnov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267559#comment-17267559
 ] 

Ivan Smirnov commented on ARROW-11183:
--

[~nevi_me] Yea, I think I could give it a go in all three if you give a brief 
outline of what needs to be done and where.

> [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing
> -
>
> Key: ARROW-11183
> URL: https://issues.apache.org/jira/browse/ARROW-11183
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ivan Smirnov
>Priority: Major
>
> There's UnitTime::NANOS in parquet-format, but no nanosecond timestamp 
> support (seemingly) in schema's LogicalType. What is needed to add support 
> for nanosecond timestamps in Rust Parquet?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11302) [Release][Python] Remove verification of python 3.5 wheel on macOS

2021-01-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11302:

Fix Version/s: (was: 4.0.0)
   3.0.0

> [Release][Python] Remove verification of python 3.5 wheel on macOS
> --
>
> Key: ARROW-11302
> URL: https://issues.apache.org/jira/browse/ARROW-11302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11307) [Release][Ubuntu][20.10] Add workaround for dependency issue

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11307.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9252
[https://github.com/apache/arrow/pull/9252]

> [Release][Ubuntu][20.10] Add workaround for dependency issue
> 
>
> Key: ARROW-11307
> URL: https://issues.apache.org/jira/browse/ARROW-11307
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11307) [Release][Ubuntu][20.10] Add workaround for dependency issue

2021-01-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11307:


 Summary: [Release][Ubuntu][20.10] Add workaround for dependency 
issue
 Key: ARROW-11307
 URL: https://issues.apache.org/jira/browse/ARROW-11307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11307) [Release][Ubuntu][20.10] Add workaround for dependency issue

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11307:
---
Labels: pull-request-available  (was: )

> [Release][Ubuntu][20.10] Add workaround for dependency issue
> 
>
> Key: ARROW-11307
> URL: https://issues.apache.org/jira/browse/ARROW-11307
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11257) [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet

2021-01-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267527#comment-17267527
 ] 

Joris Van den Bossche commented on ARROW-11257:
---

I am not really sure what the exact bug was, but ARROW-10493 was one of the 
nested-parquet-related bugs reported after pyarrow 2.0.0 (note that the ability 
to write the data you have was new in 2.0.0)

bq. And when is the next release containing the update scheduled for?

There is a release candidate out right now. So if all goes well by the end of 
this week.



> [C++][Parquet] PyArrow Table contains different data after writing and 
> reloading from Parquet
> -
>
> Key: ARROW-11257
> URL: https://issues.apache.org/jira/browse/ARROW-11257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Kari Schoonbee
>Priority: Critical
> Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
> * I'm loading a JSONLines file into a table using 
> {code}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
>  * I select a row by key and inspect its nested dictionary.
>  * I write the table to Parquet.
>  * I load the table again from the Parquet file.
>  * I check the same key and the nested dictionary is not the same.
>  
> To reproduce:
>  
> Find the attached JSONLines file and Jupyter Notebook. 
> The JSON file contains entries per customer with a generated `msisdn`, 
> `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` 
> consists of a list of feature objects, all with the value the same as the 
> `msisdn`, and a score.
> The notebook reads the file and demonstrates the issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11306) [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev dependency

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11306.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9251
[https://github.com/apache/arrow/pull/9251]

> [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev dependency
> -
>
> Key: ARROW-11306
> URL: https://issues.apache.org/jira/browse/ARROW-11306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11135) Using Maven Central artifacts as dependencies produce runtime errors

2021-01-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267509#comment-17267509
 ] 

Kouhei Sutou edited comment on ARROW-11135 at 1/18/21, 8:57 PM:


My mistake. I missed that the JAR was macOS only. The tests pass and all is 
fine on openJDK 11. However, on openJDK 8 and 15, I get the following error 
after the tests complete which still causes the build to fail. However, I 
assume this may be because I did not correctly close some Gandiva resources. I 
have edited the code to try to properly free all resources, and I'll watch what 
happens with this build.

{noformat}
pure virtual method called
 terminate called without an active exception
{noformat}


was (Author: michaelmior):
My mistake. I missed that the JAR was macOS only. The tests pass and all is 
fine on openJDK 11. However, on openJDK 8 and 15, I get the following error 
after the tests complete which still causes the build to fail. However, I 
assume this may be because I did not correctly close some Gandiva resources. I 
have edited the code to try to properly free all resources, and I'll watch what 
happens with this build.

{{pure virtual method called}}
{{ terminate called without an active exception}}

> Using Maven Central artifacts as dependencies produce runtime errors
> 
>
> Key: ARROW-11135
> URL: https://issues.apache.org/jira/browse/ARROW-11135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Michael Mior
>Priority: Major
>
> I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the 
> integration is working well, but I'm having issues. As [suggested on the 
> mailing 
> list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E],
>  using Dremio's public artifacts solves the problem. Between two Apache 
> projects however, there would be strong preference to use Apache artifacts as 
> a dependency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11301) [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet implementation

2021-01-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-11301.
-
Resolution: Fixed

Issue resolved by pull request 9244
[https://github.com/apache/arrow/pull/9244]

> [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet 
> implementation
> --
>
> Key: ARROW-11301
> URL: https://issues.apache.org/jira/browse/ARROW-11301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We slightly misunderstood the Hadoop LZ4 format. A compressed buffer can 
> actually contain several "frames", each prefixed with (de)compressed size.
> See 
> https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058
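
A stand-alone sketch of the framing as described here (an illustration of the byte layout only, not the actual C++ fix): each frame is prefixed with a big-endian decompressed size and compressed size, and a single compressed buffer may hold several such frames back to back.

{code}
use std::convert::TryInto;

// Walk the frames in a Hadoop-LZ4-style buffer: 4-byte big-endian decompressed
// size, 4-byte big-endian compressed size, then the compressed bytes, repeated
// until the buffer is exhausted. Actual LZ4 decompression is out of scope here.
fn split_hadoop_lz4_frames(mut input: &[u8]) -> Vec<(usize, &[u8])> {
    let mut frames = Vec::new();
    while input.len() >= 8 {
        let decompressed = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
        let compressed = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
        frames.push((decompressed, &input[8..8 + compressed]));
        input = &input[8 + compressed..];
    }
    frames
}

fn main() {
    // Two tiny fake frames (the "compressed" payloads are just placeholders).
    let mut buf = Vec::new();
    for payload in [&b"abc"[..], &b"de"[..]] {
        buf.extend_from_slice(&(payload.len() as u32).to_be_bytes()); // decompressed size
        buf.extend_from_slice(&(payload.len() as u32).to_be_bytes()); // compressed size
        buf.extend_from_slice(payload);
    }
    assert_eq!(split_hadoop_lz4_frames(&buf).len(), 2);
}
{code}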



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11306) [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev dependency

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11306:
---
Labels: pull-request-available  (was: )

> [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev dependency
> -
>
> Key: ARROW-11306
> URL: https://issues.apache.org/jira/browse/ARROW-11306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11306) [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev dependency

2021-01-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11306:


 Summary: [Packaging][Ubuntu][16.04] Add missing libprotobuf-dev 
dependency
 Key: ARROW-11306
 URL: https://issues.apache.org/jira/browse/ARROW-11306
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11135) Using Maven Central artifacts as dependencies produce runtime errors

2021-01-18 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267509#comment-17267509
 ] 

Michael Mior commented on ARROW-11135:
--

My mistake. I missed that the JAR was macOS only. The tests pass and all is 
fine on openJDK 11. However, on openJDK 8 and 15, I get the following error 
after the tests complete which still causes the build to fail. However, I 
assume this may be because I did not correctly close some Gandiva resources. I 
have edited the code to try to properly free all resources, and I'll watch what 
happens with this build.

{{pure virtual method called}}
{{ terminate called without an active exception}}

> Using Maven Central artifacts as dependencies produce runtime errors
> 
>
> Key: ARROW-11135
> URL: https://issues.apache.org/jira/browse/ARROW-11135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Michael Mior
>Priority: Major
>
> I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the 
> integration is working well, but I'm having issues. As [suggested on the 
> mailing 
> list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E],
>  using Dremio's public artifacts solves the problem. Between two Apache 
> projects however, there would be strong preference to use Apache artifacts as 
> a dependency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11223) [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-11223:


Assignee: Weichen Xu

> [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy
> 
>
> Key: ARROW-11223
> URL: https://issues.apache.org/jira/browse/ARROW-11223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We may get the error java.lang.IndexOutOfBoundsException: index: 15880, 
> length: 4 (expected: range(0, 15880)).
> I tested this on Arrow 2.0.0.
> Reproduction code in Scala:
> {code}
> import org.apache.arrow.vector.VarCharVector
> import org.apache.arrow.memory.RootAllocator
> val rootAllocator = new RootAllocator(Long.MaxValue)
> val v1 = new VarCharVector("var1", rootAllocator)
> v1.allocateNew()
> val valueCount = 3970 // any value >= 3970 produces a similar error
> for (idx <- 0 until valueCount) {
>   v1.setNull(idx)
> }
> v1.getBufferSizeFor(valueCount) // fails with java.lang.IndexOutOfBoundsException:
> // index: 15880, length: 4 (expected: range(0, 15880))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11223) [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11223:
-
Description: 
We may get the error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 
(expected: range(0, 15880)).

I tested this on Arrow 2.0.0.

Reproduction code in Scala:

{code}

import org.apache.arrow.vector.VarCharVector
import org.apache.arrow.memory.RootAllocator
val rootAllocator = new RootAllocator(Long.MaxValue)
val v1 = new VarCharVector("var1", rootAllocator)
v1.allocateNew()

val valueCount = 3970 // any value >= 3970 produces a similar error

for (idx <- 0 until valueCount) {
  v1.setNull(idx)
}
v1.getBufferSizeFor(valueCount) // fails with java.lang.IndexOutOfBoundsException:
// index: 15880, length: 4 (expected: range(0, 15880))

{code}

  was:
I test on arrow 2.0.0

Reproduce code in scala:

{code}

import org.apache.arrow.vector.VarCharVector
import org.apache.arrow.memory.RootAllocator
val rootAllocator = new RootAllocator(Long.MaxValue)
val v1 = new VarCharVector("var1", rootAllocator)
v1.allocateNew()

val valueCount = 3970 // use any number >= 3970 will get similar error

for (idx <- 0 until valueCount) {
  v1.setNull(idx)
}
v1.getBufferSizeFor(valueCount) # failed, get error 
java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: 
range(0, 15880))

{code}


> [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy
> 
>
> Key: ARROW-11223
> URL: https://issues.apache.org/jira/browse/ARROW-11223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We may get the error java.lang.IndexOutOfBoundsException: index: 15880, 
> length: 4 (expected: range(0, 15880)).
> I tested this on Arrow 2.0.0.
> Reproduction code in Scala:
> {code}
> import org.apache.arrow.vector.VarCharVector
> import org.apache.arrow.memory.RootAllocator
> val rootAllocator = new RootAllocator(Long.MaxValue)
> val v1 = new VarCharVector("var1", rootAllocator)
> v1.allocateNew()
> val valueCount = 3970 // any value >= 3970 produces a similar error
> for (idx <- 0 until valueCount) {
>   v1.setNull(idx)
> }
> v1.getBufferSizeFor(valueCount) // fails with java.lang.IndexOutOfBoundsException:
> // index: 15880, length: 4 (expected: range(0, 15880))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11223) [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy

2021-01-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11223:
-
Summary: [Java] BaseVariableWidthVector setNull and getBufferSizeFor is 
buggy  (was: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may 
get error  java.lang.IndexOutOfBoundsException: index: 15880, length: 4 
(expected: range(0, 15880)))

> [Java] BaseVariableWidthVector setNull and getBufferSizeFor is buggy
> 
>
> Key: ARROW-11223
> URL: https://issues.apache.org/jira/browse/ARROW-11223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I tested this on Arrow 2.0.0.
> Reproduction code in Scala:
> {code}
> import org.apache.arrow.vector.VarCharVector
> import org.apache.arrow.memory.RootAllocator
> val rootAllocator = new RootAllocator(Long.MaxValue)
> val v1 = new VarCharVector("var1", rootAllocator)
> v1.allocateNew()
> val valueCount = 3970 // any value >= 3970 produces a similar error
> for (idx <- 0 until valueCount) {
>   v1.setNull(idx)
> }
> v1.getBufferSizeFor(valueCount) // fails with java.lang.IndexOutOfBoundsException:
> // index: 15880, length: 4 (expected: range(0, 15880))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10344) [Python] Get all columns names (or schema) from Feather file, before loading whole Feather file

2021-01-18 Thread al-hadi boublenza (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267494#comment-17267494
 ] 

al-hadi boublenza commented on ARROW-10344:
---

Facing the same issue and wondering how to know if you're dealing with a 
Feather V1 or Feather V2 file? (Using pyarrow)

> [Python]  Get all columns names (or schema) from Feather file, before loading 
> whole Feather file
> 
>
> Key: ARROW-10344
> URL: https://issues.apache.org/jira/browse/ARROW-10344
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Gert Hulselmans
>Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before 
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are 
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11183) [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing

2021-01-18 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267491#comment-17267491
 ] 

Neville Dipale commented on ARROW-11183:


Hey [~aldanor], I had a look at the commit. The nanosecond type is part of the 
2.6 format, which we don't fully support yet.

There are 3 tasks:
 # Wire up the 2.6 changes from parquet-format; it's clear that we're missing 
an enum (there might be more missing)
 # Add a reader for ts-nano
 # Add a writer for ts-nano that uses the old int96 writer if legacy support 
is requested, or the new ts-nano type otherwise

Would you like to contribute some of the above? I can help out with 1 as a 
start.

Thanks

> [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing
> -
>
> Key: ARROW-11183
> URL: https://issues.apache.org/jira/browse/ARROW-11183
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ivan Smirnov
>Priority: Major
>
> There's UnitTime::NANOS in parquet-format, but no nanosecond timestamp 
> support (seemingly) in schema's LogicalType. What is needed to add support 
> for nanosecond timestamps in Rust Parquet?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11183) [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing

2021-01-18 Thread Ivan Smirnov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267452#comment-17267452
 ] 

Ivan Smirnov commented on ARROW-11183:
--

[~nevi_me] See this commit: 
[https://github.com/apache/parquet-format/commit/b879065ac1bee3fe1d770eb3c4b60ab4267044d7]

(PARQUET-1387)

> [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing
> -
>
> Key: ARROW-11183
> URL: https://issues.apache.org/jira/browse/ARROW-11183
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ivan Smirnov
>Priority: Major
>
> There's UnitTime::NANOS in parquet-format, but no nanosecond timestamp 
> support (seemingly) in schema's LogicalType. What is needed to add support 
> for nanosecond timestamps in Rust Parquet?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11183) [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing

2021-01-18 Thread Ivan Smirnov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267449#comment-17267449
 ] 

Ivan Smirnov commented on ARROW-11183:
--

[~csun] Here's a Python example that works:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(dict(x=[pd.Timestamp.now() for _ in range(10)]))
table = pa.table(df)

pq.write_table(table, 'timestamps.parquet', version='2.0')
assert (pq.read_table('timestamps.parquet').to_pandas() == df).all().all()
{code}
What is the Rust equivalent then?

Note: the table's schema shows up as
{code:java}
pyarrow.Table
x: timestamp[ns]
{code}

> [Rust] [Parquet] LogicalType::TIMESTAMP_NANOS missing
> -
>
> Key: ARROW-11183
> URL: https://issues.apache.org/jira/browse/ARROW-11183
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ivan Smirnov
>Priority: Major
>
> There's UnitTime::NANOS in parquet-format, but no nanosecond timestamp 
> support (seemingly) in schema's LogicalType. What is needed to add support 
> for nanosecond timestamps in Rust Parquet?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11305) [Rust]: parquet-rowcount binary tries to open itself as a parquet file

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11305:
---
Labels: pull-request-available  (was: )

> [Rust]: parquet-rowcount binary tries to open itself as a parquet file
> --
>
> Key: ARROW-11305
> URL: https://issues.apache.org/jira/browse/ARROW-11305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Introduced accidentally during clippy warning cleanups in 
> https://github.com/apache/arrow/pull/8687/files#diff-f3f978052bd519af87898fa196715ddb445c327045c09ed07be600ca4e1703b6R60



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11305) [Rust]: parquet-rowcount binary tries to open itself as a parquet file

2021-01-18 Thread Jira
Jörn Horstmann created ARROW-11305:
--

 Summary: [Rust]: parquet-rowcount binary tries to open itself as a 
parquet file
 Key: ARROW-11305
 URL: https://issues.apache.org/jira/browse/ARROW-11305
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jörn Horstmann
Assignee: Jörn Horstmann


Introduced accidentally during clippy warning cleanups in 
https://github.com/apache/arrow/pull/8687/files#diff-f3f978052bd519af87898fa196715ddb445c327045c09ed07be600ca4e1703b6R60



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11305) [Rust]: parquet-rowcount binary tries to open itself as a parquet file

2021-01-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann updated ARROW-11305:
---
Component/s: Rust

> [Rust]: parquet-rowcount binary tries to open itself as a parquet file
> --
>
> Key: ARROW-11305
> URL: https://issues.apache.org/jira/browse/ARROW-11305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>
> Introduced accidentally during clippy warning cleanups in 
> https://github.com/apache/arrow/pull/8687/files#diff-f3f978052bd519af87898fa196715ddb445c327045c09ed07be600ca4e1703b6R60



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11302) [Release][Python] Remove verification of python 3.5 wheel on macOS

2021-01-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11302:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Release][Python] Remove verification of python 3.5 wheel on macOS
> --
>
> Key: ARROW-11302
> URL: https://issues.apache.org/jira/browse/ARROW-11302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11302) [Release][Python] Remove verification of python 3.5 wheel on macOS

2021-01-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-11302.
-
Fix Version/s: (was: 4.0.0)
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 9246
[https://github.com/apache/arrow/pull/9246]

> [Release][Python] Remove verification of python 3.5 wheel on macOS
> --
>
> Key: ARROW-11302
> URL: https://issues.apache.org/jira/browse/ARROW-11302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11304) Add casts from / to DecimalArray

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11304:
---
Labels: pull-request-available  (was: )

> Add casts from / to DecimalArray
> 
>
> Key: ARROW-11304
> URL: https://issues.apache.org/jira/browse/ARROW-11304
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Florian Müller
>Assignee: Florian Müller
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in [https://github.com/apache/arrow/pull/8880], several compute 
> implementations will be required. This task deals with casts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11304) Add casts from / to DecimalArray

2021-01-18 Thread Jira
Florian Müller created ARROW-11304:
--

 Summary: Add casts from / to DecimalArray
 Key: ARROW-11304
 URL: https://issues.apache.org/jira/browse/ARROW-11304
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Florian Müller
Assignee: Florian Müller


As discussed in [https://github.com/apache/arrow/pull/8880], several compute 
implementations will be required. This task deals with casts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11171) [Go] Build fails on s390x with noasm tag

2021-01-18 Thread Jonathan Albrecht (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Albrecht closed ARROW-11171.
-

verified `make test-noasm` on master on s390x

thx [~kou]!

> [Go] Build fails on s390x with noasm tag
> 
>
> Key: ARROW-11171
> URL: https://issues.apache.org/jira/browse/ARROW-11171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
> Environment: linux on s390x with -tags='noasm'
>Reporter: Jonathan Albrecht
>Assignee: Jonathan Albrecht
>Priority: Minor
>  Labels: pull-request-available, s390x
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Initial support for s390x was added in 
> [aca707086160afd92da62aa2f9537a284528e48a|https://github.com/apache/arrow/commit/aca707086160afd92da62aa2f9537a284528e48a]
>  but if building with -tags='noasm' it fails with:
> {code:go}
> # github.com/apache/arrow/go/arrow/math
> math/float64_s390x.go:21:6: initFloat64Go redeclared in this block
> previous declaration at math/float64_noasm.go:23:6
> math/int64_s390x.go:21:6: initInt64Go redeclared in this block
> previous declaration at math/int64_noasm.go:23:6
> math/math_s390x.go:24:6: initGo redeclared in this block
> previous declaration at math/math_noasm.go:25:6
> math/uint64_s390x.go:21:6: initUint64Go redeclared in this block
> previous declaration at math/uint64_noasm.go:23:6
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11303) [Release][C++] Enable mimalloc in the windows verification script

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11303:
---
Labels: pull-request-available  (was: )

> [Release][C++] Enable mimalloc in the windows verification script
> -
>
> Key: ARROW-11303
> URL: https://issues.apache.org/jira/browse/ARROW-11303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11303) [Release][C++] Enable mimalloc in the windows verification script

2021-01-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-11303:
---

 Summary: [Release][C++] Enable mimalloc in the windows 
verification script
 Key: ARROW-11303
 URL: https://issues.apache.org/jira/browse/ARROW-11303
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Krisztian Szucs






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11301) [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet implementation

2021-01-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11301:
---
Priority: Critical  (was: Blocker)

> [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet 
> implementation
> --
>
> Key: ARROW-11301
> URL: https://issues.apache.org/jira/browse/ARROW-11301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We slightly misunderstood the Hadoop LZ4 format. A compressed buffer can 
> actually contain several "frames", each prefixed with (de)compressed size.
> See 
> https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058
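
For illustration only, here is a rough Python sketch of the multi-frame layout 
described above, assuming the third-party `lz4` package and that each frame is a 
4-byte big-endian decompressed size, a 4-byte big-endian compressed size, and then 
a raw LZ4 block. This is not the Arrow C++ implementation, and it glosses over 
Hadoop blocks that contain more than one compressed chunk:

{code:python}
import struct
import lz4.block  # third-party "lz4" package

def decode_hadoop_lz4(buf: bytes) -> bytes:
    # Walk the buffer frame by frame instead of assuming a single frame,
    # which is the misunderstanding this issue fixes.
    out = []
    pos = 0
    while pos < len(buf):
        decompressed_size, compressed_size = struct.unpack_from('>II', buf, pos)
        pos += 8
        raw = buf[pos:pos + compressed_size]
        out.append(lz4.block.decompress(raw, uncompressed_size=decompressed_size))
        pos += compressed_size
    return b''.join(out)
{code}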



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11301) [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet implementation

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11301:
---
Labels: pull-request-available  (was: )

> [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet 
> implementation
> --
>
> Key: ARROW-11301
> URL: https://issues.apache.org/jira/browse/ARROW-11301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We slightly misunderstood the Hadoop LZ4 format. A compressed buffer can 
> actually contain several "frames", each prefixed with (de)compressed size.
> See 
> https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

2021-01-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267338#comment-17267338
 ] 

Antoine Pitrou commented on ARROW-9177:
---

Great, thank you. I can confirm that the PR for ARROW-11301 reads the file 
properly.

Depending on specifics of the release procedure, it may go into 3.0.0 or 4.0.0 
(or perhaps a hypothetical 3.0.1).

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> 
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

2021-01-18 Thread Steve M. Kim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267334#comment-17267334
 ] 

Steve M. Kim commented on ARROW-9177:
-

These are the decoded values as line-delimited JSON:

 

https://github.com/chairmank/arrow-9177-example/blob/3a169e32701939de64a8ecafb155cb0b730cd8d8/561120a3094ee4513ba619b518c7a6093fe4e38398219ad172fb75373c3360b8_decoded.jsonl

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> 
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11302) [Release][Python] Remove verification of python 3.5 wheel on macOS

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11302:
---
Labels: pull-request-available  (was: )

> [Release][Python] Remove verification of python 3.5 wheel on macOS
> --
>
> Key: ARROW-11302
> URL: https://issues.apache.org/jira/browse/ARROW-11302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11302) [Release][Python] Remove verification of python 3.5 wheel on macOS

2021-01-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-11302:
---

 Summary: [Release][Python] Remove verification of python 3.5 wheel 
on macOS
 Key: ARROW-11302
 URL: https://issues.apache.org/jira/browse/ARROW-11302
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11299) [Python] build warning in python

2021-01-18 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-11299:
-
Component/s: Python
 C++

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Yibo Cai
>Priority: Major
>
> Many warnings about compute kernel options when building Arrow Python.
> Removing the line below suppresses the warnings.
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the structure 
> non-standard-layout (not C-compatible), so the offsetof macro cannot be used 
> on it safely. As function options are straightforward, the destructor looks 
> unnecessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
>  In file included from 
> /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
>  from /usr/include/signal.h:303,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:5,
>  from 
> /home/cyb/arrow/cpp/release/_install/include/arrow/python/numpy_interop.h:41,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/helpers.h:27,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/api.h:24,
>  from /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:696:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp: In function 
> ‘int __Pyx_modinit_type_init_code()’:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26066:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__FilterOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__FilterOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__FilterOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26082:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__TakeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__TakeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__TakeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26130:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__MinMaxOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__MinMaxOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__MinMaxOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26146:148: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CountOptions’ is undefined [-Winvalid-offsetof]
>  _type_7pyarrow_8_compute__CountOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CountOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^ 
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26162:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__ModeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__ModeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__ModeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26210:154: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__VarianceOptions’ is undefined 
> [-Winvalid-offsetof]
>  pe_7pyarrow_8_compute__VarianceOptions.tp_weaklistoffset = offsetof(struct 
> 

[jira] [Commented] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

2021-01-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267275#comment-17267275
 ] 

Antoine Pitrou commented on ARROW-9177:
---

[~chairmank] Can you post the decoded values in the file somewhere? At least 
the first and last N.

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> 
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

2021-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9177:
--
Labels: pull-request-available  (was: )

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> 
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11301) [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet implementation

2021-01-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11301:
---
Description: 
We slightly misunderstood the Hadoop LZ4 format. A compressed buffer can 
actually contain several "frames", each prefixed with (de)compressed size.

See 
https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058

  was:See 
https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058


> [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet 
> implementation
> --
>
> Key: ARROW-11301
> URL: https://issues.apache.org/jira/browse/ARROW-11301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Blocker
> Fix For: 3.0.0
>
>
> We slightly misunderstood the Hadoop LZ4 format. A compressed buffer can 
> actually contain several "frames", each prefixed with (de)compressed size.
> See 
> https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11301) [C++] Fix reading LZ4-compressed Parquet files produced by Java Parquet implementation

2021-01-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-11301:
--

 Summary: [C++] Fix reading LZ4-compressed Parquet files produced 
by Java Parquet implementation
 Key: ARROW-11301
 URL: https://issues.apache.org/jira/browse/ARROW-11301
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0


See 
https://issues.apache.org/jira/browse/ARROW-9177?focusedCommentId=17267058=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17267058



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

2021-01-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267244#comment-17267244
 ] 

Antoine Pitrou commented on ARROW-9177:
---

Thank you [~chairmank], this helps a lot! Indeed it seems we misunderstood the 
undocumented Hadoop-LZ4-framing format :-(

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> 
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: 2.0.0
>
>
> per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11257) [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet

2021-01-18 Thread Kari Schoonbee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267198#comment-17267198
 ] 

Kari Schoonbee commented on ARROW-11257:


Hey Joris. Do we know what caused the bug? I'm a bit worried as this has led to 
data corruption in production for us and it's possible that it has affected 
others without them being aware. And when is the next release containing the 
update scheduled for?

> [C++][Parquet] PyArrow Table contains different data after writing and 
> reloading from Parquet
> -
>
> Key: ARROW-11257
> URL: https://issues.apache.org/jira/browse/ARROW-11257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Kari Schoonbee
>Priority: Critical
> Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
> * I'm loading a JSONlines object into a table using 
> {code:python}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
>  * I select a row by key and inspect its nested dictionary.
>  * I write the table to parquet 
>  * I load the table again from the parquet file 
>  * I check the same key and the nested dictionary is not the same.
>  
> To reproduce:
>  
> Find the attached JSONLines file and Jupyter Notebook. 
> The json file contains entries per customer with a generated `msisdn`, 
> `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` 
> consists of a list of feature objects, each with a value equal to the 
> `msisdn` and a score.
> The notebook reads the file and demonstrates the issue (a minimal sketch of 
> the same round trip also follows below).
>  
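
For reference, a minimal sketch of the round trip described above; the file name 
matches the attachment and the column name comes from the description, but the 
exact inspection done in the notebook is an assumption:

{code:python}
import pyarrow.json as pj
import pyarrow.parquet as pq

# Load the JSONLines attachment; 'scorecard_result' holds the nested objects.
table = pj.read_json('anonymised.jsonl')
before = table.column('scorecard_result').to_pylist()[0]

# Write and reload through Parquet, then compare the same nested value.
pq.write_table(table, 'roundtrip.parquet')
after = pq.read_table('roundtrip.parquet').column('scorecard_result').to_pylist()[0]

assert before == after, 'nested data changed after the Parquet round trip'
{code}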



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11300) [Rust][DataFusion] Improve hash aggregate performance with large number of groups in

2021-01-18 Thread Jira
Daniël Heres created ARROW-11300:


 Summary: [Rust][DataFusion] Improve hash aggregate performance 
with large number of groups in 
 Key: ARROW-11300
 URL: https://issues.apache.org/jira/browse/ARROW-11300
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
 Attachments: image-2021-01-18-13-00-36-685.png

Currently, hash aggregates perform well with a small number of output groups, 
but the db-benchmark results [https://github.com/h2oai/db-benchmark/pull/182] 
are poor on data with a high number of output groups.
[https://github.com/apache/arrow/pull/9234] improved the situation a bit, but 
DataFusion is still much slower than even the slowest of the published results.

This seems mostly to do with the way we handle individual keys/groups.
For each new key, we _take_ the indices of that group, resulting in lots of 
small allocations, cache unfriendliness and other overhead when a batch has 
many keys with only a small (just 1-2) number of rows per group. Also, the 
indices are converted from a Vec to an Array, which makes the situation worse 
(it accounts for ~22% of the instructions on the master branch!); other 
profiling hotspots seem to come from related allocations too.

To make it efficient for tiny groups, we should probably change the hash 
aggregate algorithm to _take_ based on _all_ indices from the batch in one go, 
and "slice" into the resulting array for the individual accumulators.
 
Here is some profiling info of the db-benchmark questions 1-5 against master:

!image-2021-01-18-13-00-36-685.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11299) [Python] build warning in python

2021-01-18 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-11299:


 Summary: [Python] build warning in python
 Key: ARROW-11299
 URL: https://issues.apache.org/jira/browse/ARROW-11299
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yibo Cai


Many warnings about compute kernel options when building Arrow Python.
Removing the line below suppresses the warnings.
https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
I think the reason is that the virtual destructor makes the structure 
non-standard-layout (not C-compatible), so the offsetof macro cannot be used on 
it safely. As function options are straightforward, the destructor looks 
unnecessary.
[~bkietz]

*Steps to reproduce*
build arrow cpp
{code:bash}
 ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
-DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
-DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
-DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
-DCMAKE_C_COMPILER=/usr/bin/clang-9 ..

~/arrow/cpp/release $ ninja install
{code}

build arrow python
{code:bash}
 ~/arrow/python $ python --version
 Python 3.6.9

~/arrow/python $ python setup.py build_ext --inplace
 ..
 [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
 In file included from /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
 from /usr/include/signal.h:303,
 from 
/home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
 from 
/home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:5,
 from 
/home/cyb/arrow/cpp/release/_install/include/arrow/python/numpy_interop.h:41,
 from /home/cyb/arrow/cpp/release/_install/include/arrow/python/helpers.h:27,
 from /home/cyb/arrow/cpp/release/_install/include/arrow/python/api.h:24,
 from /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:696:
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp: In function 
‘int __Pyx_modinit_type_init_code()’:
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
 x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26066:150: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__FilterOptions’ is undefined [-Winvalid-offsetof]
 type_7pyarrow_8_compute__FilterOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__FilterOptions, __pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26082:146: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__TakeOptions’ is undefined [-Winvalid-offsetof]
 x_type_7pyarrow_8_compute__TakeOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__TakeOptions, __pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26130:150: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__MinMaxOptions’ is undefined [-Winvalid-offsetof]
 type_7pyarrow_8_compute__MinMaxOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__MinMaxOptions, __pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26146:148: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__CountOptions’ is undefined [-Winvalid-offsetof]
 _type_7pyarrow_8_compute__CountOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__CountOptions, __pyx_base.__pyx_base.__weakref__);
 ^ 
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26162:146: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__ModeOptions’ is undefined [-Winvalid-offsetof]
 x_type_7pyarrow_8_compute__ModeOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__ModeOptions, __pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26210:154: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__VarianceOptions’ is undefined 
[-Winvalid-offsetof]
 pe_7pyarrow_8_compute__VarianceOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__VarianceOptions, 
__pyx_base.__pyx_base.__weakref__);
 ^
 /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26258:156: 
warning: offsetof within non-standard-layout type 
‘__pyx_obj_7pyarrow_8_compute__ArraySortOptions’ is undefined 
[-Winvalid-offsetof]
 e_7pyarrow_8_compute__ArraySortOptions.tp_weaklistoffset = offsetof(struct 
__pyx_obj_7pyarrow_8_compute__ArraySortOptions,