[jira] [Updated] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10215:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader
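For reference, a minimal Rust sketch of what one of the proposed names could look like, assuming the alias wraps a sendable iterator of RecordBatch results (the generics in the quoted snippet were stripped, so the exact type is an assumption; `RecordBatch` and the error type below are stubs, not the real Arrow ones):

```rust
// Sketch only: stubs stand in for the real Arrow types.
struct RecordBatch;
type ArrowResult<T> = Result<T, String>;

// One of the proposed names, assuming the alias wraps a sendable iterator of
// RecordBatch results.
type SendableRecordBatchReader = Box<dyn Iterator<Item = ArrowResult<RecordBatch>> + Send>;

fn consume(src: SendableRecordBatchReader) -> usize {
    // Count the batches that were read successfully.
    src.filter(|r| r.is_ok()).count()
}

fn main() {
    let batches: Vec<ArrowResult<RecordBatch>> = vec![Ok(RecordBatch), Ok(RecordBatch)];
    println!("{}", consume(Box::new(batches.into_iter()))); // prints 2
}
```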



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210625#comment-17210625
 ] 

Jorge Leitão commented on ARROW-10215:
--

I agree. I used it for iterating during the PR, but it was not intended to 
remain like that. Any of your suggestions is fine by me.

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
> Fix For: 3.0.0
>
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader





[jira] [Assigned] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-08 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-9553:
--

Assignee: Krisztian Szucs  (was: Andy Grove)

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> After rebasing on master, the Rust builds started to fail.
> The solution is to bump a version number here:
> https://github.com/apache/arrow/pull/7829





[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Josh Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210605#comment-17210605
 ] 

Josh Taylor commented on ARROW-10242:
-

Hi [~andygrove],

I'm not sure if I'm using a nested type; they should all be pretty primitive 
types. I'll start by removing all the fields and field types, adding one back 
at a time, and see what causes it to explode.

Thanks for the swift response!

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of around ~210 parquet files (3.2gb total, 
> each file around 13-18mb), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Comment Edited] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Josh Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210605#comment-17210605
 ] 

Josh Taylor edited comment on ARROW-10242 at 10/9/20, 4:21 AM:
---

Hi [~andygrove],

I'm not sure if I'm using a nested type; they should all be pretty primitive 
types. I'll start by removing all the fields and field types, adding one back 
at a time, and see what causes it to explode.

I'm not seeing any other errors.

Thanks for the swift response!


was (Author: joshx):
Hi [~andygrove],

I'm not sure if I'm using a nested type; they should all be pretty primitive 
types. I'll start by removing all the fields and field types, adding one back 
at a time, and see what causes it to explode.

Thanks for the swift response!

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of around ~210 parquet files (3.2gb total, 
> each file around 13-18mb), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210603#comment-17210603
 ] 

Andy Grove commented on ARROW-9553:
---

Actually, it has two separate dependencies on arrow, in [dependencies] and 
[dev-dependencies], with a different format in each.

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> After rebasing on master, the Rust builds started to fail.
> The solution is to bump a version number here:
> https://github.com/apache/arrow/pull/7829





[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210602#comment-17210602
 ] 

Andy Grove commented on ARROW-9553:
---

The release-test script is looking for this pattern:
{code:java}
["-arrow = { path = \"../arrow\", version = \"#{@snapshot_version}\" }",
 "+arrow = { path = \"../arrow\", version = \"#{@release_version}\" }"]
{code}
The parquet Cargo.toml does not match:
{code:java}
arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true } 
{code}
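The release script itself appears to be Ruby (judging by the {{#{@snapshot_version}}} interpolation); as a sketch of the underlying idea, a more permissive match could key on the dependency name and extract the version regardless of extra keys such as `optional = true`. A minimal illustration in Rust, with a hypothetical helper rather than the actual script:

```rust
// Hypothetical helper, not the actual release script: match the arrow
// dependency line by its key and pull out the version string, so extra keys
// such as `optional = true` no longer break the match.
fn extract_version(line: &str) -> Option<&str> {
    if !line.trim_start().starts_with("arrow = {") {
        return None;
    }
    let marker = "version = \"";
    let start = line.find(marker)? + marker.len();
    let end = start + line[start..].find('"')?;
    Some(&line[start..end])
}

fn main() {
    let plain = r#"arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }"#;
    let optional = r#"arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }"#;
    println!("{:?}", extract_version(plain));    // Some("2.0.0-SNAPSHOT")
    println!("{:?}", extract_version(optional)); // Some("2.0.0-SNAPSHOT")
}
```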

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> After rebasing on master, the Rust builds started to fail.
> The solution is to bump a version number here:
> https://github.com/apache/arrow/pull/7829





[jira] [Updated] (ARROW-9847) [Rust] Inconsistent use of import arrow:: vs crate::arrow::

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9847:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Inconsistent use of import arrow:: vs crate::arrow::
> ---
>
> Key: ARROW-9847
> URL: https://issues.apache.org/jira/browse/ARROW-9847
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Both the DataFusion and Parquet crates have a mix of "import arrow::" and 
> "import crate::arrow::" and we should standardize on one or the other.
>  
> Whichever standard we use should be enforced in build.rs so CI fails on PRs 
> that do not follow the standard.
>  
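As an illustration of the kind of check build.rs could run, here is a hypothetical lint that flags one of the two styles, assuming `use arrow::` were chosen as the standard (the choice and the helper are both assumptions, not anything decided in the issue):

```rust
// Hypothetical lint of the kind build.rs could run, assuming `use arrow::` is
// the agreed standard and `use crate::arrow::` is the style to reject.
fn violates_style(line: &str) -> bool {
    line.trim_start().starts_with("use crate::arrow::")
}

fn main() {
    println!("{}", violates_style("use crate::arrow::datatypes::Schema;")); // true
    println!("{}", violates_style("use arrow::datatypes::Schema;"));        // false
}
```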





[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data  {code}
> If I change them to absolute paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
>  
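One plausible reading of the failure is that some tests run with a different working directory than the one the env vars were set from, so relative paths stop resolving. A hypothetical helper that canonicalizes the env var would sidestep this; it is a sketch, not part of the actual test suite:

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper, not part of the test suite: resolve a test-data path to
// an absolute one so tests do not depend on the directory they run from.
fn absolute(dir: &str) -> PathBuf {
    let p = Path::new(dir);
    if p.is_absolute() {
        p.to_path_buf()
    } else {
        // current_dir() is absolute, so the joined path is too.
        std::env::current_dir().expect("cwd").join(p)
    }
}

fn main() {
    println!("{}", absolute("../testing/data").is_absolute()); // true
    println!("{}", absolute("/tmp/data").is_absolute());       // true
}
```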





[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210600#comment-17210600
 ] 

Andy Grove commented on ARROW-10242:


Hi [~joshx], and thanks for the bug report. I was unable to reproduce the issue 
on any of the parquet data sets that I usually test with, but they are simple 
data sets containing primitive types. My first guess here is that there is 
something in the files that DataFusion doesn't support and the error message is 
being suppressed, but this is just a guess. Do your files contain nested types?

 

Do you see any other errors before the disconnected channel error?

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of around ~210 parquet files (3.2gb total, 
> each file around 13-18mb), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Assigned] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10242:
--

Assignee: Andy Grove

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of around ~210 parquet files (3.2gb total, 
> each file around 13-18mb), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Created] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Josh Taylor (Jira)
Josh Taylor created ARROW-10242:
---

 Summary: Parquet reader thread terminated due to error: 
ExecutionError("sending on a disconnected channel")
 Key: ARROW-10242
 URL: https://issues.apache.org/jira/browse/ARROW-10242
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 2.0.0
Reporter: Josh Taylor


*Running the latest code from github for datafusion & parquet.*

When trying to read a directory of around ~210 parquet files (3.2gb total, each 
file around 13-18mb), doing the following:
{code:java}
let mut ctx = ExecutionContext::new();
// register parquet file with the execution context
ctx.register_parquet(
 "something",
 "/home/josh/dev/pat/fff/"
)?;

// execute the query
let df = ctx.sql(
 "select * from something",
)?;
let results = df.collect().await?;
 
{code}
I get the following error shown ~204 times:
{code:java}
Parquet reader thread terminated due to error: ExecutionError("sending on a 
disconnected channel"){code}
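The error text matches the behavior of Rust's standard mpsc channels when the receiving end has been dropped; a minimal sketch of that failure mode (illustrative only, not DataFusion's actual channel code):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative only, not DataFusion's channel code: a sender hitting a channel
// whose receiver was dropped gets a "sending on a disconnected channel"-style
// error, once per reader thread still trying to send.
fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    drop(rx); // the consuming side goes away early
    let reader = thread::spawn(move || tx.send(1).is_err());
    println!("{}", reader.join().unwrap()); // true: the send fails
}
```

This pattern would be consistent with the guess above that a suppressed error caused the consuming side to bail out early, after which every remaining reader thread reports the disconnect.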





[jira] [Resolved] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10239.
--
Resolution: Fixed

Issue resolved by pull request 8406
[https://github.com/apache/arrow/pull/8406]

> [C++] aws-sdk-cpp apparently requires zlib too
> --
>
> Key: ARROW-10239
> URL: https://issues.apache.org/jira/browse/ARROW-10239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled 
> aws-sdk-cpp build fails: 
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930





[jira] [Resolved] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet

2020-10-08 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10237.
--
Resolution: Fixed

Issue resolved by pull request 8403
[https://github.com/apache/arrow/pull/8403]

> [C++] Duplicate values in a dictionary result in corrupted parquet
> --
>
> Key: ARROW-10237
> URL: https://issues.apache.org/jira/browse/ARROW-10237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Initial discussion: 
> https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E





[jira] [Commented] (ARROW-10153) [Java] Adding values to VarCharVector beyond 2GB results in IndexOutOfBoundsException

2020-10-08 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210562#comment-17210562
 ] 

Liya Fan commented on ARROW-10153:
--

[~samarthjain] Thanks for reporting the problem.

As indicated by [~emkornfi...@gmail.com], a LargeVarCharVector employs an 
8-byte offset buffer, so data locality can be worse, since less data can be 
loaded into the cache.

You can check the vector capacity by calling the 
{{BaseVariableWidthVector#getValueCapacity}} API. 

Finally, it could be expensive to copy the data from a VarCharVector to a 
LargeVarCharVector, so if the data size may exceed 2GB, maybe you should 
consider using a LargeVarCharVector from the very beginning?
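The arithmetic behind the quoted 2GB failure can be checked directly: assuming 4-byte signed offsets (which matches the negative index in the stack trace), the running byte offset for the 2049th 1MiB value is 2048MiB, just past i32's range. A small sketch, in Rust rather than Java for illustration:

```rust
// Assumption matching the stack trace: VarCharVector uses 4-byte signed
// offsets, so the byte offset after 2048 strings of 1 MiB exceeds i32::MAX
// and wraps negative, producing the -2147483648 index in the report.
fn offset_after(values: i64) -> i64 {
    values * 1024 * 1024 // offset in bytes after `values` 1 MiB strings
}

fn main() {
    println!("{}", offset_after(2048));                   // 2147483648
    println!("{}", offset_after(2048) > i32::MAX as i64); // true: overflows i32
    println!("{}", offset_after(2048) as i32);            // -2147483648 (wrapped)
}
```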

> [Java] Adding values to VarCharVector beyond 2GB results in 
> IndexOutOfBoundsException
> -
>
> Key: ARROW-10153
> URL: https://issues.apache.org/jira/browse/ARROW-10153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.0
>Reporter: Samarth Jain
>Priority: Major
>
> On executing the below test case, one can see that on adding the 2049th 
> string of size 1MB, it fails.  
> {code:java}
> int length = 1024 * 1024;
> StringBuilder sb = new StringBuilder(length);
> for (int i = 0; i < length; i++) {
>  sb.append("a");
> }
> byte[] str = sb.toString().getBytes();
> VarCharVector vector = new VarCharVector("v", new 
> RootAllocator(Long.MAX_VALUE));
> vector.allocateNew(3000);
> for (int i = 0; i < 3000; i++) {
>  vector.setSafe(i, str);
> }{code}
>  
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 
> -2147483648, length: 1048576 (expected: range(0, 2147483648))Exception in 
> thread "main" java.lang.IndexOutOfBoundsException: index: -2147483648, 
> length: 1048576 (expected: range(0, 2147483648)) at 
> org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) at 
> org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:762) at 
> org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1212)
>  at 
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1011)
> {code}
> Stepping through the code, 
>  
> [https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L425]
> returns the negative index `-2147483648`





[jira] [Resolved] (ARROW-10238) [C#] List is broken

2020-10-08 Thread Eric Erhardt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Erhardt resolved ARROW-10238.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8404
[https://github.com/apache/arrow/pull/8404]

> [C#] List is broken
> ---
>
> Key: ARROW-10238
> URL: https://issues.apache.org/jira/browse/ARROW-10238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 2.0.0
>Reporter: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This code is currently broken:
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147]
>  
> We need to use the `childFields` parameter when creating the ListType, so 
> that if there are recursive nested types, like List, the correct types 
> get propagated down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10241) [C++][Compute] Add variance kernel benchmark

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10241:
---
Labels: pull-request-available  (was: )

> [C++][Compute] Add variance kernel benchmark
> 
>
> Key: ARROW-10241
> URL: https://issues.apache.org/jira/browse/ARROW-10241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-10241) [C++][Compute] Add variance kernel benchmark

2020-10-08 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-10241:


 Summary: [C++][Compute] Add variance kernel benchmark
 Key: ARROW-10241
 URL: https://issues.apache.org/jira/browse/ARROW-10241
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai








[jira] [Updated] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10239:
---
Labels: pull-request-available  (was: )

> [C++] aws-sdk-cpp apparently requires zlib too
> --
>
> Key: ARROW-10239
> URL: https://issues.apache.org/jira/browse/ARROW-10239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled 
> aws-sdk-cpp build fails: 
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930





[jira] [Commented] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210538#comment-17210538
 ] 

Kouhei Sutou commented on ARROW-10239:
--

OK. I'll take a look at this.

> [C++] aws-sdk-cpp apparently requires zlib too
> --
>
> Key: ARROW-10239
> URL: https://issues.apache.org/jira/browse/ARROW-10239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled 
> aws-sdk-cpp build fails: 
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930





[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

2020-10-08 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210537#comment-17210537
 ] 

Rok Mihevc commented on ARROW-1614:
---

[~bryanc] Text Extensions for Pandas looks like a good start for the Python 
part.  We'd probably want to base it on pyarrow.Tensor instead of np.ndarray? 
Would you like to do this or shall I start and ask you for review?

> [C++] Add a Tensor logical value type with constant dimensions, implemented 
> using ExtensionType
> ---
>
> Key: ARROW-1614
> URL: https://issues.apache.org/jira/browse/ARROW-1614
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
>
> In an Arrow table, we would like to add support for a column whose cells 
> each contain a tensor value, with all tensors having the same dimensions. 
> These would be stored as a binary value, plus some metadata to store type 
> and shape/strides.





[jira] [Assigned] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10239:


Assignee: Kouhei Sutou

> [C++] aws-sdk-cpp apparently requires zlib too
> --
>
> Key: ARROW-10239
> URL: https://issues.apache.org/jira/browse/ARROW-10239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled 
> aws-sdk-cpp build fails: 
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930





[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210531#comment-17210531
 ] 

Andy Grove commented on ARROW-10240:


On second thoughts, I might not be able to get to this right away.

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by CSV parsing code, and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations, it should be easier to profile 
> and find bottlenecks.
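The benchmarking shape being proposed can be sketched as: pay the parsing cost once, then loop over query execution so profiles show query work rather than parsing. The functions below are stand-ins, not the tpch harness:

```rust
use std::time::Instant;

// Stand-in functions, not the tpch harness.
fn parse() -> Vec<i64> {
    (0..1_000).collect() // stand-in for loading CSV into memory
}

fn query(data: &[i64]) -> i64 {
    data.iter().sum() // stand-in for executing the benchmark query
}

fn main() {
    let data = parse(); // loaded once, outside the timed region
    let start = Instant::now();
    let mut total = 0;
    for _ in 0..100 {
        total += query(&data); // profiles now show query work, not parsing
    }
    println!("{} in {:?}", total, start.elapsed());
}
```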





[jira] [Resolved] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10164.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8346
[https://github.com/apache/arrow/pull/8346]

> [Rust] Add support for DictionaryArray types to cast kernels
> 
>
> Key: ARROW-10164
> URL: https://issues.apache.org/jira/browse/ARROW-10164
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my 
> use case is DictionaryArrays with a Utf8 dictionary). 
> There is prototype work on 
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary





[jira] [Updated] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10164:
---
Component/s: Rust

> [Rust] Add support for DictionaryArray types to cast kernels
> 
>
> Key: ARROW-10164
> URL: https://issues.apache.org/jira/browse/ARROW-10164
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my 
> use case is DictionaryArrays with a Utf8 dictionary). 
> There is prototype work on 
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary





[jira] [Resolved] (ARROW-10043) [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing COUNT(DISTINCT)

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10043.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8222
[https://github.com/apache/arrow/pull/8222]

> [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing 
> COUNT(DISTINCT)
> 
>
> Key: ARROW-10043
> URL: https://issues.apache.org/jira/browse/ARROW-10043
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust, Rust - DataFusion
>Reporter: Daniel Russo
>Assignee: Daniel Russo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> I am unsure where support for {{DISTINCT}} may be on the DataFusion roadmap, 
> so I've filed this with the "Wish" type and "Minor" priority to reflect that 
> this is a proposal:
> Introduce {{DISTINCT}} into DataFusion by partially implementing 
> {{COUNT(DISTINCT)}}. The ultimate goal is to fully support the {{DISTINCT}} 
> keyword, but to get implementation started, limit the scope of this work to:
>  * the {{COUNT()}} aggregate function
>  * a single expression in {{COUNT()}}, i.e., {{COUNT(DISTINCT c1)}}, but not 
> {{COUNT(DISTINCT c1, c2)}}
>  * only queries with a {{GROUP BY}} clause
>  * integer types
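Within the scope above, the grouped COUNT(DISTINCT c1) semantics amount to one set of seen values per group key. A minimal sketch, hypothetical code rather than the DataFusion implementation:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical code, not the DataFusion implementation: grouped
// COUNT(DISTINCT c1) as one HashSet of seen values per group key.
fn count_distinct(rows: &[(&str, i64)]) -> HashMap<String, usize> {
    let mut seen: HashMap<String, HashSet<i64>> = HashMap::new();
    for (key, value) in rows {
        seen.entry((*key).to_string()).or_default().insert(*value);
    }
    seen.into_iter().map(|(k, set)| (k, set.len())).collect()
}

fn main() {
    // SELECT key, COUNT(DISTINCT value) FROM t GROUP BY key
    let rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)];
    let counts = count_distinct(&rows);
    println!("a={} b={}", counts["a"], counts["b"]); // a=2 b=1
}
```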





[jira] [Resolved] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10015.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8370
[https://github.com/apache/arrow/pull/8370]

> [Rust] Implement SIMD for aggregate kernel sum
> --
>
> Key: ARROW-10015
> URL: https://issues.apache.org/jira/browse/ARROW-10015
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, our aggregations are made in a simple loop. However, as described 
> [here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html], 
> horizontal operations can also be SIMDed, with reported speedups of 2.7x.
> The goal of this improvement is to support SIMD for the "sum", for primitive 
> types.
> The code to modify is in 
> [here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs].
>  A good indication that this issue is completed is when the script
> {{cargo bench --bench aggregate_kernels && cargo bench --bench 
> aggregate_kernels --features simd}}
> yields a speed-up.
>  
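The vertical-vs-horizontal pattern from the linked perf guide can be illustrated in std-only Rust: accumulate into a fixed number of independent lanes (vertical adds, which the compiler can keep in SIMD registers) and perform a single horizontal reduction at the end. This is a sketch of the technique, not the actual aggregate.rs implementation:

```rust
/// Sum with 8 independent accumulator "lanes": the loop body has no
/// cross-iteration dependency per lane, so each lane can live in one
/// SIMD register slot (vertical adds); only the final reduction is a
/// horizontal operation.
fn simd_style_sum(values: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder(); // tail shorter than LANES
    for chunk in chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += v; // vertical, lane-wise add
        }
    }
    // single horizontal reduction at the end, plus the scalar tail
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}

fn main() {
    let v: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    assert_eq!(simd_style_sum(&v), 55.0);
}
```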





[jira] [Resolved] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support

2020-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-9414.
-
Resolution: Fixed

Issue resolved by pull request 8394
[https://github.com/apache/arrow/pull/8394]

> [C++] apt package includes headers for S3 interface, but no support
> ---
>
> Key: ARROW-9414
> URL: https://issues.apache.org/jira/browse/ARROW-9414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04.04 LTS
>Reporter: Simon Bertron
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: test.cpp
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I believe that the apt package is built without S3 support. But s3fs.h is 
> exported in filesystem/api.h anyway. This creates undefined reference errors 
> when trying to link to the package.





[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210516#comment-17210516
 ] 

Andy Grove commented on ARROW-10240:


Great idea [~jhorstmann]. Do you want me to take care of this, or are you 
planning on working on it? I could do this tonight.

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations, it should be easier to profile 
> and find bottlenecks.





[jira] [Commented] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210500#comment-17210500
 ] 

Neal Richardson commented on ARROW-10239:
-

[~kou] I'm not going to get to this today; maybe you can during your day? If not, I 
can take a look tomorrow. 

It might be as simple as adding {{if ARROW_S3 then ARROW_WITH_ZLIB}} and then 
registering zlib as a dependency in build_awssdk in case both are being bundled. 

> [C++] aws-sdk-cpp apparently requires zlib too
> --
>
> Key: ARROW-10239
> URL: https://issues.apache.org/jira/browse/ARROW-10239
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled 
> aws-sdk-cpp build fails: 
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930





[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

2020-10-08 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210489#comment-17210489
 ] 

Bryan Cutler commented on ARROW-1614:
-

I just wanted to let you all know I have been working on a similar Tensor 
extension type. I currently have a Pandas extension type for a tensor with 
conversion to/from an Arrow extension type, just for Python/PyArrow right now, 
and zero-copy conversion with numpy.ndarrays. It's part of the project [Text 
Extensions for Pandas|https://github.com/CODAIT/text-extensions-for-pandas] 
where we use it for NLP feature vectors, but it's really general purpose. You 
can check it out at

[https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py]
 
[https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/arrow_conversion.py]
Or install the package if you like via {{pip install 
text-extensions-for-pandas}} (it's currently in alpha).

We would love to help out with this effort and contribute what we have to 
Arrow, if it fits the bill!

> [C++] Add a Tensor logical value type with constant dimensions, implemented 
> using ExtensionType
> ---
>
> Key: ARROW-1614
> URL: https://issues.apache.org/jira/browse/ARROW-1614
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
>
> In an Arrow table, we would like to add support for a column that has values 
> cells each containing a tensor value, with all tensors having the same 
> dimensions. These would be stored as a binary value, plus some metadata to 
> store type and shape/strides.
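As a sketch of the storage layout described above — every cell holding a tensor of the same dimensions, stored as one flat binary value plus shape metadata — assuming illustrative names that are not part of any Arrow API:

```rust
/// A column of fixed-shape tensors: one flat row-major buffer, cell
/// after cell, plus shared shape metadata (illustrative, not Arrow API).
struct FixedShapeTensorColumn {
    shape: (usize, usize), // every cell has the same dimensions
    data: Vec<f64>,        // row-major, cells stored back to back
}

impl FixedShapeTensorColumn {
    fn cell_len(&self) -> usize {
        self.shape.0 * self.shape.1
    }

    /// Element (i, j) of the tensor stored in `cell`.
    fn get(&self, cell: usize, i: usize, j: usize) -> f64 {
        self.data[cell * self.cell_len() + i * self.shape.1 + j]
    }
}

fn main() {
    // two 2x2 tensor cells in one buffer
    let col = FixedShapeTensorColumn {
        shape: (2, 2),
        data: vec![1., 2., 3., 4., 5., 6., 7., 8.],
    };
    assert_eq!(col.get(0, 1, 0), 3.0); // first cell, row 1, col 0
    assert_eq!(col.get(1, 0, 1), 6.0); // second cell, row 0, col 1
}
```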





[jira] [Created] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-08 Thread Jira
Jörn Horstmann created ARROW-10240:
--

 Summary: [Rust] [Datafusion] Optionally load tpch data into memory 
before running benchmark query
 Key: ARROW-10240
 URL: https://issues.apache.org/jira/browse/ARROW-10240
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jörn Horstmann


The tpch benchmark runtime seems to be dominated by csv parsing code and it is 
really difficult to see any performance hotspots related to actual query 
execution in a flamegraph.

With the data in memory and more iterations, it should be easier to profile and 
find bottlenecks.
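The intended benchmarking pattern — pay the parse cost once, then time many query iterations over the in-memory data — can be sketched with stand-in types (not DataFusion's API):

```rust
use std::time::Instant;

/// Stand-in for the expensive CSV parse step.
fn parse_csv(lines: &[&str]) -> Vec<i64> {
    lines.iter().map(|l| l.trim().parse().unwrap()).collect()
}

/// Stand-in for query execution over an in-memory batch.
fn query(batch: &[i64]) -> i64 {
    batch.iter().sum()
}

fn main() {
    let csv = ["1", "2", "3", "4"];
    let batch = parse_csv(&csv); // load once, outside the timed loop
    let start = Instant::now();
    let mut result = 0;
    for _ in 0..1000 {
        result = query(&batch); // only execution shows up in the profile
    }
    println!("sum = {} in {:?}", result, start.elapsed());
    assert_eq!(result, 10);
}
```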





[jira] [Assigned] (ARROW-10109) [Rust] Add support to produce a C Data interface

2020-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10109:


Assignee: Jorge Leitão

> [Rust] Add support to produce a C Data interface
> 
>
> Key: ARROW-10109
> URL: https://issues.apache.org/jira/browse/ARROW-10109
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The goal of this issue is to support producing C Data arrays from Rust.
> The use case that motivated this issue was the possibility of running 
> DataFusion from Python and supporting moving arrays between DataFusion and 
> Python/PyArrow.
> In particular, so that users can write Python UDFs that expect arrow arrays 
> and return arrow arrays, in the same spirit as pandas-udfs in Spark work for 
> Pandas.
> The brute-force way of writing these arrays is by converting element by 
> element from and to native types. The efficient way of doing it is to pass 
> the memory address from and to each implementation, which is zero-copy.
> To support the latter, we need an FFI implementation in Rust that produces 
> and consumes the [C Data 
> Interface|https://arrow.apache.org/docs/format/CDataInterface.html]
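The struct such an FFI layer has to produce and consume is fixed by the C Data Interface spec; a Rust mirror of the spec's published {{ArrowArray}} C definition looks roughly like this (a sketch following the spec's layout, not code from any Arrow PR):

```rust
use std::os::raw::c_void;

/// Rust mirror of `ArrowArray` from the Arrow C Data Interface spec;
/// field order and types follow the published C definition.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    /// Called by the consumer to free producer-owned memory; a null
    /// release callback marks the structure as released.
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

fn main() {
    // An empty/released marker: no buffers, no release callback.
    let array = ArrowArray {
        length: 3,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers: std::ptr::null_mut(),
        children: std::ptr::null_mut(),
        dictionary: std::ptr::null_mut(),
        release: None,
        private_data: std::ptr::null_mut(),
    };
    assert!(array.release.is_none());
    assert_eq!(array.length, 3);
}
```

Passing a pointer to such a struct across the language boundary is what makes the exchange zero-copy: only addresses move, never the array data.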





[jira] [Updated] (ARROW-10238) [C#] List is broken

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10238:
---
Labels: pull-request-available  (was: )

> [C#] List is broken
> ---
>
> Key: ARROW-10238
> URL: https://issues.apache.org/jira/browse/ARROW-10238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 2.0.0
>Reporter: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This code is currently broken:
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147]
>  
> We need to use the `childFields` parameter when creating the ListType; that 
> way, if there are recursive nested types, like List, the correct types 
> get passed down.





[jira] [Updated] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10237:
---
Labels: pull-request-available  (was: )

> [C++] Duplicate values in a dictionary result in corrupted parquet
> --
>
> Key: ARROW-10237
> URL: https://issues.apache.org/jira/browse/ARROW-10237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Initial discussion: 
> https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E





[jira] [Assigned] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet

2020-10-08 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10237:


Assignee: Ben Kietzman  (was: Antoine Pitrou)

> [C++] Duplicate values in a dictionary result in corrupted parquet
> --
>
> Key: ARROW-10237
> URL: https://issues.apache.org/jira/browse/ARROW-10237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> Initial discussion: 
> https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E





[jira] [Created] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too

2020-10-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10239:
---

 Summary: [C++] aws-sdk-cpp apparently requires zlib too
 Key: ARROW-10239
 URL: https://issues.apache.org/jira/browse/ARROW-10239
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Reporter: Neal Richardson
 Fix For: 2.0.0


https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9

If you happen to be building on a bare system without zlib, the bundled 
aws-sdk-cpp build fails: 
https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930







[jira] [Created] (ARROW-10238) [C#] List is broken

2020-10-08 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-10238:


 Summary: [C#] List is broken
 Key: ARROW-10238
 URL: https://issues.apache.org/jira/browse/ARROW-10238
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 2.0.0
Reporter: Eric Erhardt


This code is currently broken:

[https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147]

 

We need to use the `childFields` parameter when creating the ListType; that way, 
if there are recursive nested types, like List, the correct types get 
passed down.





[jira] [Created] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet

2020-10-08 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10237:


 Summary: [C++] Duplicate values in a dictionary result in 
corrupted parquet
 Key: ARROW-10237
 URL: https://issues.apache.org/jira/browse/ARROW-10237
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Antoine Pitrou
 Fix For: 2.0.0


Initial discussion: 
https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E





[jira] [Updated] (ARROW-10220) Cache javascript utf-8 dictionary keys?

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10220:
---
Labels: pull-request-available  (was: )

> Cache javascript utf-8 dictionary keys?
> ---
>
> Key: ARROW-10220
> URL: https://issues.apache.org/jira/browse/ARROW-10220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 1.0.1
>Reporter: Ben Schmidt
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> String decoding from arrow tables is a major bottleneck in using arrow in 
> Javascript–it can take a second to decode a million rows. For utf-8 types, 
> I'm not sure what could be done; but some memoization would help utf-8 
> dictionary types.
> Currently, the javascript implementation decodes a utf-8 string every time 
> you request an item from a dictionary with utf-8 data. If arrow cached the 
> decoded strings to a native js Map, routine operations like looping over all 
> the entries in a text column might be on the order of 10x faster. Here's an 
> observable notebook [benchmarking that and a couple other 
> strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].
> I would file a pull request, but 1) I would have to learn some typescript to 
> do so, and 2) this idea may be undesirable because it creates new objects 
> that will increase the memory footprint of a table, rather than just using 
> the typed arrays.
> Some discussion of how the real-world issues here affect the arquero project 
> is [here|https://github.com/uwdata/arquero/issues/1].
>  
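The memoization strategy proposed above — decode each dictionary entry once and serve repeat accesses from a cache — can be sketched as follows (in Rust rather than the JavaScript implementation; all names are illustrative):

```rust
use std::collections::HashMap;

/// Dictionary column whose values are utf-8 bytes; decoded strings are
/// memoized so each dictionary entry is decoded at most once, instead
/// of on every row access.
struct MemoizedDict {
    raw: Vec<Vec<u8>>,             // encoded dictionary values
    cache: HashMap<usize, String>, // decoded on first use
    decodes: usize,                // counts actual decode work done
}

impl MemoizedDict {
    fn get(&mut self, key: usize) -> &str {
        let raw = &self.raw;
        let decodes = &mut self.decodes;
        self.cache.entry(key).or_insert_with(|| {
            *decodes += 1; // only runs on a cache miss
            String::from_utf8(raw[key].clone()).unwrap()
        })
    }
}

fn main() {
    let mut dict = MemoizedDict {
        raw: vec![b"foo".to_vec(), b"bar".to_vec()],
        cache: HashMap::new(),
        decodes: 0,
    };
    // a million row accesses trigger only two decodes
    for _ in 0..1_000_000 {
        dict.get(0);
        dict.get(1);
    }
    assert_eq!(dict.decodes, 2);
}
```

The trade-off the reporter notes applies here too: the cache holds native strings alongside the typed arrays, so memory footprint grows with the number of distinct dictionary entries.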





[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210379#comment-17210379
 ] 

Neal Richardson commented on ARROW-10175:
-

In the link Antoine shared, 

{code}
FAILED 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri
FAILED 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_write_parquet_files_with_uri
{code}

[~jorisvandenbossche] can you take a look?

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]





[jira] [Assigned] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10175:
---

Assignee: Joris Van den Bossche

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]





[jira] [Assigned] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9414:
--

Assignee: Kouhei Sutou  (was: Neal Richardson)

> [C++] apt package includes headers for S3 interface, but no support
> ---
>
> Key: ARROW-9414
> URL: https://issues.apache.org/jira/browse/ARROW-9414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04.04 LTS
>Reporter: Simon Bertron
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: test.cpp
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I believe that the apt package is built without S3 support. But s3fs.h is 
> exported in filesystem/api.h anyway. This creates undefined reference errors 
> when trying to link to the package.





[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5845:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] Implement converter between Arrow record batches and Avro records
> 
>
> Key: ARROW-5845
> URL: https://issues.apache.org/jira/browse/ARROW-5845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
> Fix For: 3.0.0
>
>
> It would be useful for applications which need to convert Avro data to Arrow 
> data.
> This is an adapter which converts data with an existing API (like the JDBC 
> adapter) rather than a native reader (like ORC).
> We implement this function through the Avro Java project, receiving params 
> like Avro's Decoder/Schema/DatumReader and returning a VectorSchemaRoot. For 
> each data type we have a consumer class, as below, that reads Avro data and 
> writes it into a vector to avoid boxing/unboxing (e.g. GenericRecord#get 
> returns Object):
> {code:java}
> public class AvroIntConsumer implements Consumer {
>
>   private final IntWriter writer;
>
>   public AvroIntConsumer(IntVector vector) {
>     this.writer = new IntWriterImpl(vector);
>   }
>
>   @Override
>   public void consume(Decoder decoder) throws IOException {
>     writer.writeInt(decoder.readInt());
>     writer.setPosition(writer.getPosition() + 1);
>   }
> }
> {code}
> We intend to support primitive and complex types (null values represented 
> via a union type with a null type); size limits and field selection could be 
> optional for users. 





[jira] [Assigned] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9414:
--

Assignee: Neal Richardson  (was: Kouhei Sutou)

> [C++] apt package includes headers for S3 interface, but no support
> ---
>
> Key: ARROW-9414
> URL: https://issues.apache.org/jira/browse/ARROW-9414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04.04 LTS
>Reporter: Simon Bertron
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: test.cpp
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I believe that the apt package is built without S3 support. But s3fs.h is 
> exported in filesystem/api.h anyway. This creates undefined reference errors 
> when trying to link to the package.





[jira] [Closed] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-10226.
--
Resolution: Fixed

Although Spark produces the correct result when I run an aggregate query 
against this parquet file, it too shows bad values when I just query the 
l_returnflag column, so it appears that the files are corrupt and Spark skips 
the bad rows when building the aggregate. I will keep looking into this, but I 
no longer think this is a bug that we need to spend time on.

 

fyi [~jorgecarleitao]

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210358#comment-17210358
 ] 

Andy Grove commented on ARROW-10226:


Here is a test case to reproduce the issue. I uploaded the parquet file to 
dropbox. It is ~100MB.

[https://www.dropbox.com/s/6cpz1h9juxl4c7t/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet?dl=0]

[~jorgecarleitao] Thanks for the offer of help. I don't know how much time we 
should spend on this, but if you have the time to take a look, at least to 
confirm the test also fails for you, that would be an extra data point. 
{code:java}
#[test]
fn foo() {
    use arrow::array::{Array, StringArray};
    use crate::arrow::arrow_reader::ArrowReader;

    let file = std::fs::File::open(
        "/mnt/tpch/debug/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet").unwrap();
    let file_reader = Rc::new(SerializedFileReader::new(file).unwrap());
    let metadata = file_reader
        .metadata()
        .file_metadata()
        .key_value_metadata()
        .as_ref()
        .unwrap();

    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
    let schema = arrow_reader.get_schema().unwrap();
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let mut batch_reader =
        arrow_reader.get_record_reader_by_columns(projection, 40960).unwrap();

    while let Some(batch) = batch_reader.next() {
        let batch = batch.unwrap();

        let mut n = 0;
        match batch.column(4).as_any().downcast_ref::<StringArray>() {
            Some(l_returnflag) => {
                for i in 0..batch.num_rows() {
                    if l_returnflag.is_valid(i) {
                        if l_returnflag.value(i).len() > 1 {
                            n = n + 1;
                        }
                    }
                }
            }
            None => println!("l_returnflag is not a string"),
        }
        println!("{} bad values in batch", n);
        assert_eq!(n, 0);
    }
}
{code}

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Summary: [Rust] [Parquet] Parquet reader reading wrong columns in some 
batches within a parquet file  (was: [Rust] [DataFusion] TPC-H query 1 no 
longer completes for 100GB dataset)

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210341#comment-17210341
 ] 

Andy Grove commented on ARROW-10226:


{code:java}
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49880 bad values in batch

part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49979 bad values in batch

part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 374998 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50031 bad values in batch

part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375002 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50110 bad values in batch {code}

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation

2020-10-08 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210329#comment-17210329
 ] 

James Duong commented on ARROW-10206:
-

In the PR above, we support building against multiple versions of gRPC:
1. Prior to 1.27, the features needed in gRPC to support this don't exist; the 
option to disable server verification fails at runtime if used.
2. Between 1.27 and 1.31 (inclusive), the needed features are in the 
grpc_impl::experimental namespace, and the Flight client code is compiled 
using that namespace.
3. From 1.32 onward, the features are in the grpc::experimental namespace, and 
the Flight client code is compiled using that namespace.

> [Python][C++][FlightRPC] Add client option to disable server validation
> ---
>
> Key: ARROW-10206
> URL: https://issues.apache.org/jira/browse/ARROW-10206
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Note that this requires using grpc-cpp version 1.25 or higher.
> This requires using GRPC's TlsCredentials class, which is in a different 
> namespace for 1.25-1.31 vs. 1.32+ as well.
> This class and its related options provide an option to disable server 
> certificate checks and require the caller to supply a callback to be used 
> instead.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210328#comment-17210328
 ] 

Andy Grove commented on ARROW-10226:


Just tracking progress with debugging this. The issue is that the projection is 
behaving differently PER BATCH within these Parquet files. We expect 
l_returnflag to be a single char, but sometimes the parquet reader returns 
the contents of the l_comment field instead.
{code:java}
 
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: A
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: s among the fluffily r
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: eposits a
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: y ironic foxes above t
{code}
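The per-batch sanity check shown above (any value longer than one character means the reader pulled data from the wrong column, e.g. l_comment) can be sketched in plain Rust. This is an illustrative stand-in, not the actual benchmark instrumentation; `count_bad_values` and the `Option<&str>` batch representation are hypothetical.

```rust
// Count values in a batch that cannot be a valid l_returnflag
// (TPC-H defines it as a single character: 'A', 'N' or 'R').
// `batch` stands in for one column of a RecordBatch; None = null.
fn count_bad_values(batch: &[Option<&str>]) -> usize {
    batch
        .iter()
        .filter(|v| match v {
            // Nulls are not "bad", just absent.
            None => false,
            // Anything longer than one char means the reader returned
            // data from the wrong column (e.g. l_comment).
            Some(s) => s.chars().count() != 1,
        })
        .count()
}

fn main() {
    let good = vec![Some("N"), Some("A"), None, Some("R")];
    let corrupt = vec![Some("N"), Some("s among the fluffily r"), Some("eposits a")];
    println!("{} bad values", count_bad_values(&good));    // 0 bad values
    println!("{} bad values", count_bad_values(&corrupt)); // 2 bad values
}
```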
 

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210317#comment-17210317
 ] 

Neal Richardson commented on ARROW-10226:
-

Sounds good, thanks. Good luck!

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Priority: Major  (was: Blocker)

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210314#comment-17210314
 ] 

Andy Grove commented on ARROW-10226:


[~npr] Sure, I changed it to Major, but my plan was to resolve the issue before 
we release tomorrow.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10109) [Rust] Add support to produce a C Data interface

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10109:
---
Labels: pull-request-available  (was: )

> [Rust] Add support to produce a C Data interface
> 
>
> Key: ARROW-10109
> URL: https://issues.apache.org/jira/browse/ARROW-10109
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The goal of this issue is to support producing C Data interface arrays from 
> Rust.
> The use-case that motivated this issue was the possibility of running 
> DataFusion from Python and supporting moving arrays from DataFusion to 
> Python/pyarrow and vice-versa.
> In particular, so that users can write Python UDFs that expect arrow arrays 
> and return arrow arrays, in the same spirit as pandas-udfs in Spark work for 
> Pandas.
> The brute-force way of writing these arrays is by converting element by 
> element from and to native types. The efficient way is to pass the memory 
> address from and to each implementation, which is zero-copy.
> To support the latter, we need an FFI implementation in Rust that produces 
> and consumes the [C Data 
> Interface|https://arrow.apache.org/docs/format/CDataInterface.html]
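For reference, the C Data interface linked above pins down a fixed C struct for arrays. A Rust declaration mirroring the published spec (not necessarily the layout the eventual arrow-rs implementation chose) could look like:

```rust
use std::os::raw::c_void;

// Mirror of the ArrowArray C struct from the Arrow C Data interface
// spec. Producing an array means filling this struct and handing its
// address across the FFI boundary -- no data is copied.
#[repr(C)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    // Called by the consumer to free producer-owned memory;
    // the spec marks a released struct with a NULL release pointer.
    pub release: Option<unsafe extern "C" fn(array: *mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}

fn main() {
    // An empty, already-released array per the spec: release = NULL.
    let empty = FFI_ArrowArray {
        length: 0,
        null_count: 0,
        offset: 0,
        n_buffers: 0,
        n_children: 0,
        buffers: std::ptr::null_mut(),
        children: std::ptr::null_mut(),
        dictionary: std::ptr::null_mut(),
        release: None,
        private_data: std::ptr::null_mut(),
    };
    println!("length = {}", empty.length); // length = 0
}
```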





[jira] [Updated] (ARROW-10228) [Julia] Donate Julia Implementation

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10228:

Component/s: Julia

> [Julia] Donate Julia Implementation
> ---
>
> Key: ARROW-10228
> URL: https://issues.apache.org/jira/browse/ARROW-10228
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Julia
>Reporter: Jacob Quinn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Contribute a pure Julia implementation supporting arrow array types and 
> reading/writing streams/files with the arrow format.





[jira] [Updated] (ARROW-10228) [Julia] Donate Julia Implementation

2020-10-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10228:
-
Summary: [Julia] Donate Julia Implementation  (was: Donate Julia 
Implementation)

> [Julia] Donate Julia Implementation
> ---
>
> Key: ARROW-10228
> URL: https://issues.apache.org/jira/browse/ARROW-10228
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Jacob Quinn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Contribute a pure Julia implementation supporting arrow array types and 
> reading/writing streams/files with the arrow format.





[jira] [Closed] (ARROW-5440) [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale closed ARROW-5440.
-
Resolution: Cannot Reproduce

From the comments, it sounds like this is no longer an issue.

> [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos
> -
>
> Key: ARROW-5440
> URL: https://issues.apache.org/jira/browse/ARROW-5440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
> Environment: CentOS Linux release 7.6.1810 (Core) 
>Reporter: Tenzin Rigden
>Priority: Major
> Attachments: parquet-test-libstd.tar.gz, serde_json_test.tar.gz
>
>
> Hello,
> In the rust parquet implementation ([https://github.com/sunchao/parquet-rs]) 
> on centos, the binary created has a `libstd-hash.so` shared library 
> dependency that is causing issues since it's a shared library found in the 
> rustup directory. This `libstd-hash.so` dependency isn't there on any other 
> rust binaries I've made before. This dependency means that I can't run this 
> binary anywhere where rustup isn't installed with that exact libstd library.
> This is not an issue on Mac.
> I've attached the rust files and here is the command line output below.
> {code:java|title=cli-output|borderStyle=solid}
> [centos@_ parquet-test]$ cat /etc/centos-release
> CentOS Linux release 7.6.1810 (Core)
> [centos@_ parquet-test]$ rustc --version
> rustc 1.36.0-nightly (e70d5386d 2019-05-27)
> [centos@_ parquet-test]$ ldd target/release/parquet-test
> linux-vdso.so.1 =>  (0x7ffd02fee000)
> libstd-44988553032616b2.so => not found
> librt.so.1 => /lib64/librt.so.1 (0x7f6ecd209000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f6eccfed000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f6eccdd7000)
> libc.so.6 => /lib64/libc.so.6 (0x7f6ecca0a000)
> libm.so.6 => /lib64/libm.so.6 (0x7f6ecc708000)
> /lib64/ld-linux-x86-64.so.2 (0x7f6ecd8b1000)
> [centos@_ parquet-test]$ ls -l 
> ~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
> -rw-r--r--. 1 centos centos 5623568 May 27 21:46 
> /home/centos/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
> {code}





[jira] [Closed] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale closed ARROW-5352.
-
Resolution: Duplicate

> [Rust] BinaryArray filter replaces nulls with empty strings
> ---
>
> Key: ARROW-5352
> URL: https://issues.apache.org/jira/browse/ARROW-5352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Minor
>
> The filter implementation for BinaryArray discards nullness of data. 
> BinaryArrays that are null (seem to) always return an empty string slice when 
> getting a value, so the way filter works might be a bug depending on what 
> Arrow developers' or users' intentions are.
> I think we should either preserve nulls (and their count) or document this as 
> intended behaviour.
> Below is a test case that reproduces the bug.
> {code:java}
> #[test]
> fn test_filter_binary_array_with_nulls() {
> let mut a: BinaryBuilder = BinaryBuilder::new(100);
> a.append_null().unwrap();
> a.append_string("a string").unwrap();
> a.append_null().unwrap();
> a.append_string("with nulls").unwrap();
> let array = a.finish();
> let b = BooleanArray::from(vec![true, true, true, true]);
> let c = filter(&array, &b).unwrap();
> let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
> // I didn't expect this behaviour
> assert_eq!("", d.get_string(0));
> // fails here
> assert!(d.is_null(0));
> assert_eq!(4, d.len());
> // fails here
> assert_eq!(2, d.null_count());
> assert_eq!("a string", d.get_string(1));
> // fails here
> assert!(d.is_null(2));
> assert_eq!("with nulls", d.get_string(3));
> }
> {code}
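The null-preserving behaviour the ticket asks for can be illustrated without arrow itself by modelling a nullable column as a slice of `Option` values. `filter_preserving_nulls` below is a hypothetical sketch of the desired semantics, not the arrow-rs filter kernel:

```rust
// A filter that keeps nullness: a null input element selected by the
// mask stays null in the output, instead of becoming an empty string.
fn filter_preserving_nulls<'a>(
    values: &[Option<&'a str>],
    mask: &[bool],
) -> Vec<Option<&'a str>> {
    values
        .iter()
        .zip(mask)
        .filter(|&(_, &keep)| keep)
        .map(|(v, _)| *v)
        .collect()
}

fn main() {
    // Same data as the test case in the ticket.
    let values = vec![None, Some("a string"), None, Some("with nulls")];
    let mask = vec![true, true, true, true];
    let filtered = filter_preserving_nulls(&values, &mask);
    // Nulls survive the filter, so the null count is still 2.
    let null_count = filtered.iter().filter(|v| v.is_none()).count();
    println!("len = {}, nulls = {}", filtered.len(), null_count); // len = 4, nulls = 2
}
```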





[jira] [Resolved] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10199.

Fix Version/s: 2.0.0
   Resolution: Fixed

This has been resolved and will be fixed in the next release, in about a week or two.

> [Rust][Parquet] Release Parquet at crates.io to remove debug prints
> ---
>
> Key: ARROW-10199
> URL: https://issues.apache.org/jira/browse/ARROW-10199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Krzysztof Stanisławek
>Priority: Critical
> Fix For: 2.0.0
>
>
> Version of Parquet released to docs.rs & crates.io has debug prints in 
> [https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60].
>  They were pretty hard to track down, so I suggest considering a logging crate 
> in the future. When is the new version going to be released? Is there some 
> stable schedule I can expect?
> Is it recommended to use the current snapshot straight from github instead of 
> crates.io?





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210280#comment-17210280
 ] 

Neal Richardson commented on ARROW-10226:
-

[~andygrove] can you explain why this is a release blocker, given that our 
release target date is tomorrow? It certainly sounds bad, but if this is not 
due to a recent change, and perhaps something that never worked, I'm curious 
why this should hold up 2.0.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10225:
---
Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests  
(was: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests)

> [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we were comparing 
> `None` with a 100% populated bitmap.





[jira] [Resolved] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10225.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8388
[https://github.com/apache/arrow/pull/8388]

> [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we were comparing 
> `None` with a 100% populated bitmap.





[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10236:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the set of casts that DataFusion considers valid is only a subset of 
> what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with arrow's type casting and allow DataFusion to plan 
> all casts that the arrow cast kernel supports.
> (I want this in particular so that when I add support for DictionaryArray 
> casts in Arrow, they are also available in DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao]





[jira] [Assigned] (ARROW-9164) [C++] Provide APIs for adding "docstrings" to arrow::compute::Function classes that can be accessed by bindings

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9164:
-

Assignee: Antoine Pitrou

> [C++] Provide APIs for adding "docstrings" to arrow::compute::Function 
> classes that can be accessed by bindings
> ---
>
> Key: ARROW-9164
> URL: https://issues.apache.org/jira/browse/ARROW-9164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Updated] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds

2020-10-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10233:
-
Component/s: Rust

> [Rust] Make array_value_to_string available in all Arrow builds
> ---
>
> Key: ARROW-10233
> URL: https://issues.apache.org/jira/browse/ARROW-10233
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Make array_value_to_string available in all Arrow builds
> Currently the array_value_to_string function is only available if the 
> `prettyprint` feature is enabled. 
> The rationale for making this change is that I want to be able to use 
> `array_value_to_string` to write tests (such as on 
> https://github.com/apache/arrow/pull/8346) but currently it is only available 
> when `feature = "prettyprint"` is enabled.
> It appears that [~nevi_me] made prettyprint compilation optional so that 
> arrow could be compiled for wasm in 
> https://github.com/apache/arrow/pull/7400. 
> https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to 
> a dependency of pretty-table; `array_value_to_string` itself needs no extra 
> dependencies.





[jira] [Resolved] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds

2020-10-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10233.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8397
[https://github.com/apache/arrow/pull/8397]

> [Rust] Make array_value_to_string available in all Arrow builds
> ---
>
> Key: ARROW-10233
> URL: https://issues.apache.org/jira/browse/ARROW-10233
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Make array_value_to_string available in all Arrow builds
> Currently the array_value_to_string function is only available if the 
> `prettyprint` feature is enabled. 
> The rationale for making this change is that I want to be able to use 
> `array_value_to_string` to write tests (such as on 
> https://github.com/apache/arrow/pull/8346) but currently it is only available 
> when `feature = "prettyprint"` is enabled.
> It appears that [~nevi_me] made prettyprint compilation optional so that 
> arrow could be compiled for wasm in 
> https://github.com/apache/arrow/pull/7400. 
> https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to 
> a dependency of pretty-table; `array_value_to_string` itself needs no extra 
> dependencies.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210273#comment-17210273
 ] 

Jorge Leitão commented on ARROW-10226:
--

I am really sorry to hear that. Let me know if there is anything I can help 
with ahead of the release. I can take time over the weekend to bootstrap an 
environment on the cloud to run this and debug it.

I can also easily write some Terraform to bootstrap an environment, so that we 
have a procedure to run these tests on an independent and "immutable" 
environment.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue, so I don't think this is related to a recent code change.





[jira] [Resolved] (ARROW-6537) [R] Pass column_types to CSV reader

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6537.

Resolution: Fixed

Issue resolved by pull request 7807
[https://github.com/apache/arrow/pull/7807]

> [R] Pass column_types to CSV reader
> ---
>
> Key: ARROW-6537
> URL: https://issues.apache.org/jira/browse/ARROW-6537
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>  Labels: csv, dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See also ARROW-6536. It may be that the csv reader accepts a Schema now (I 
> think I saw that), but otherwise it takes an unordered_map. 
> {{read_csv_arrow}} should take for {{col_types}} either a Schema, a named 
> list of Types, or the "compact string representation" that {{readr}} 
> supports. Per its docs, "c = character, i = integer, n = number, d = double, 
> l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- 
> to skip the column." So, c = utf8(), i = int32(), d = float64(), l = bool(), 
> f = dictionary(int32(), utf8()), D = date32(), T = timestamp(), t = time32(), 
> etc. I'm not sure if ? and - are supported, and/or what exactly happens if 
> you don't specify types for all columns, but I guess we'll find out, and we 
> can make JIRAs if important features are missing. 
> Following the existing conventions in csv.R, that compact string 
> representation would be encapsulated in {{read_csv_arrow}}, so CsvTableReader 
> and the various Csv*Options would only deal with the Arrow C++ interface. 
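The readr-style compact codes described above map onto Arrow types roughly as follows. This sketch is in Rust purely for illustration; the real logic would live in the R package's `read_csv_arrow`, and `compact_to_arrow` is a hypothetical name:

```rust
// readr-style compact column spec -> Arrow type name, following the
// mapping proposed in the ticket. Illustrative only.
fn compact_to_arrow(code: char) -> Option<&'static str> {
    match code {
        'c' => Some("utf8"),
        'i' => Some("int32"),
        'n' | 'd' => Some("float64"),
        'l' => Some("bool"),
        'f' => Some("dictionary(int32, utf8)"),
        'D' => Some("date32"),
        'T' => Some("timestamp"),
        't' => Some("time32"),
        '?' => Some("infer"),
        '_' | '-' => Some("skip"),
        _ => None, // unknown code: caller decides whether to error
    }
}

fn main() {
    // "cid" = character, integer, double -- one code per column.
    for code in "cid".chars() {
        println!("{} -> {:?}", code, compact_to_arrow(code));
    }
}
```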





[jira] [Resolved] (ARROW-8712) [R] Expose strptime timestamp parsing in read_csv conversion options

2020-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8712.

Resolution: Fixed

> [R] Expose strptime timestamp parsing in read_csv conversion options
> 
>
> Key: ARROW-8712
> URL: https://issues.apache.org/jira/browse/ARROW-8712
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain Francois
>Priority: Major
> Fix For: 2.0.0
>
>
> Follow up to ARROW-8111
> It appears that CsvConvertOptions has a {{timestamp_converters}} vector: 
> https://github.com/apache/arrow/pull/6631/files#diff-06f0ffdc5cae9f7e40e1a80b250dce47R95





[jira] [Created] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel rules

2020-10-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10236:
---

 Summary: [Rust] [DataFusion] Make DataFusion casting rules 
consistent with cast kernel rules 
 Key: ARROW-10236
 URL: https://issues.apache.org/jira/browse/ARROW-10236
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb


There are plan-time checks for valid type casts in DataFusion that are designed 
to catch errors early, before plan execution.

Sadly, the set of casts that DataFusion considers valid is only a subset of 
what the arrow cast kernel supports. The goal of this ticket is to bring 
DataFusion to parity with arrow's type casting and allow DataFusion to plan all 
casts that the arrow cast kernel supports.

(I want this in particular so that when I add support for DictionaryArray 
casts in Arrow, they are also available in DataFusion.)

Previously the notions of coercion and casting were somewhat conflated. I have 
tried to clarify them in https://github.com/apache/arrow/pull/8399 as well.

For more detail, see 
https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
[~jorgecarleitao]
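The direction described above can be sketched as a plan-time check that simply delegates to the kernel's own support matrix instead of maintaining a second, narrower table in the planner. The names and the toy type enum below are hypothetical; this is not DataFusion's actual API:

```rust
// Hypothetical sketch: the planner's plan-time cast check delegates to the
// kernel's own support matrix instead of duplicating (a subset of) it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Int32,
    Float64,
    Utf8,
}

// Stand-in for the arrow cast kernel's support matrix (illustrative only).
fn kernel_can_cast(from: DataType, to: DataType) -> bool {
    use DataType::*;
    from == to || matches!((from, to), (Int32, Float64) | (Int32, Utf8) | (Float64, Utf8))
}

// Plan-time check: a cast is plannable exactly when the kernel can execute it,
// so the two can never drift apart.
fn planner_check_cast(from: DataType, to: DataType) -> Result<(), String> {
    if kernel_can_cast(from, to) {
        Ok(())
    } else {
        Err(format!("cannot cast {:?} to {:?}", from, to))
    }
}

fn main() {
    assert!(planner_check_cast(DataType::Int32, DataType::Utf8).is_ok());
    assert!(planner_check_cast(DataType::Utf8, DataType::Float64).is_err());
    println!("planner casts consistent with kernel casts");
}
```

Any cast later added to `kernel_can_cast` becomes plannable with no planner change, which is the parity property the ticket asks for.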




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-08 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10236:

Summary: [Rust] [DataFusion] Make DataFusion casting rules consistent with 
cast kernel  (was: [Rust] [DataFusion] Make DataFusion casting rules consistent 
with cast kernel rules )

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the set of casts that DataFusion considers valid is a strict subset of 
> what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that the arrow cast kernel supports.
> (I want this in particular so that when I add support for DictionaryArray 
> casts in Arrow, they also become part of DataFusion)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9930) [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-9930.
-
Resolution: Invalid

I don't remember why I opened this, probably a duplicate of another issue.

> [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)
> ---
>
> Key: ARROW-9930
> URL: https://issues.apache.org/jira/browse/ARROW-9930
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion

2020-10-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10235.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8399
[https://github.com/apache/arrow/pull/8399]

> [Rust][DataFusion] Improve documentation for type coercion
> --
>
> Key: ARROW-10235
> URL: https://issues.apache.org/jira/browse/ARROW-10235
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The code / comments for type coercion are a little confusing and don't make 
> the distinction between coercion and casting clear -- we could improve the 
> documentation to clarify this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10040:
--

Assignee: Jörn Horstmann  (was: Neville Dipale)

> [Rust] Create a way to slice unaligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;
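A minimal sketch of that idea, assuming Arrow's LSB-first bitmap layout and using hypothetical names (this is not Arrow's actual `Buffer` API): copy a run of bits starting at an arbitrary bit offset into a fresh buffer whose data begins at bit 0, so that byte-wise kernels can operate on it.

```rust
// Hedged sketch, not Arrow's actual Buffer API: copy `len` bits starting at an
// arbitrary bit `offset` (LSB-first, as in Arrow validity bitmaps) into a new
// buffer that is aligned to bit 0.
fn slice_bits(src: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        // Read bit (offset + i) from the source...
        let bit = (src[(offset + i) / 8] >> ((offset + i) % 8)) & 1;
        // ...and write it at position i in the re-aligned output.
        if bit == 1 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // Bits 3..11 of this bitmap are all set; re-aligning them to offset 0
    // yields a single 0xFF byte.
    let src = [0b1111_1000u8, 0b0000_0111u8];
    assert_eq!(slice_bits(&src, 3, 8), vec![0xFFu8]);
    println!("re-aligned byte: {:#010b}", slice_bits(&src, 3, 8)[0]);
}
```

A real implementation would copy whole bytes with shifts rather than bit-by-bit, but the observable result is the same.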



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10040:
--

Assignee: Neville Dipale

> [Rust] Create a way to slice unaligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10040.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8262
[https://github.com/apache/arrow/pull/8262]

> [Rust] Create a way to slice unaligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3122) [C++] Incremental Variance, Standard Deviation aggregators

2020-10-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210226#comment-17210226
 ] 

Antoine Pitrou commented on ARROW-3122:
---

Isn't this fixed by ARROW-10070?

> [C++] Incremental Variance, Standard Deviation aggregators
> --
>
> Key: ARROW-3122
> URL: https://issues.apache.org/jira/browse/ARROW-3122
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> These must provide for a degrees-of-freedom adjustment when yielding results
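An incremental (single-pass) variance aggregator with the requested degrees-of-freedom adjustment is commonly built on Welford's algorithm; the sketch below is illustrative Rust, not the Arrow C++ aggregator API:

```rust
// Welford's online algorithm: numerically stable single-pass mean/variance.
struct VarianceAgg {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the current mean
}

impl VarianceAgg {
    fn new() -> Self {
        VarianceAgg { count: 0, mean: 0.0, m2: 0.0 }
    }

    fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    // ddof = 0 gives the population variance, ddof = 1 the sample variance.
    fn variance(&self, ddof: u64) -> Option<f64> {
        if self.count <= ddof {
            None // not enough observations for this adjustment
        } else {
            Some(self.m2 / (self.count - ddof) as f64)
        }
    }
}

fn main() {
    let mut agg = VarianceAgg::new();
    for x in [1.0, 2.0, 3.0, 4.0] {
        agg.update(x);
    }
    assert_eq!(agg.variance(0), Some(1.25)); // population variance
    println!("sample variance: {:?}", agg.variance(1));
}
```

The standard deviation follows by taking the square root of the adjusted variance.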



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9967) [Python] Add compute module docs

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9967.
---
Resolution: Fixed

Issue resolved by pull request 8145
[https://github.com/apache/arrow/pull/8145]

> [Python] Add compute module docs
> 
>
> Key: ARROW-9967
> URL: https://issues.apache.org/jira/browse/ARROW-9967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Andrew Wieteska
>Assignee: Andrew Wieteska
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10234) [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva

2020-10-08 Thread Sagnik Chakraborty (Jira)
Sagnik Chakraborty created ARROW-10234:
--

 Summary: [C++][Gandiva] Fix logic of round() for floats/decimals 
in Gandiva
 Key: ARROW-10234
 URL: https://issues.apache.org/jira/browse/ARROW-10234
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Sagnik Chakraborty
Assignee: Sagnik Chakraborty


round() for floats/doubles returns incorrect results for some edge cases: for 
example, round(cast(1.55 as float), 1) gives 1.6, but it should give 1.5, since 
the value after casting to float is 1.5499999523162842, due to the inexact 
representation of floating-point numbers in memory. Removing an intermediate 
explicit cast-to-float statement on a double value used in subsequent 
computations minimises the error introduced by this representation.
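The effect of the intermediate precision can be reproduced outside Gandiva. The sketch below (illustrative Rust, not Gandiva code) shows why the expected answer is 1.5: the float operand really holds ~1.5499999523162842, while an implementation that works from the exact decimal 1.55 in double precision gets 1.6 instead.

```rust
// Illustrative repro of the rounding discrepancy: 1.55 has no exact binary
// representation, and the nearest f32 is ~1.5499999523162842, which rounds
// down to 1.5; the nearest f64 sits at or above 15.5/10 and rounds up to 1.6.
fn round_to(v: f64, digits: i32) -> f64 {
    let factor = 10f64.powi(digits);
    (v * factor).round() / factor
}

fn main() {
    assert_eq!(round_to(1.55f32 as f64, 1), 1.5); // via the float operand
    assert_eq!(round_to(1.55f64, 1), 1.6);        // double all the way
    println!("float operand rounds to {}", round_to(1.55f32 as f64, 1));
}
```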



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10235:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Improve documentation for type coercion
> --
>
> Key: ARROW-10235
> URL: https://issues.apache.org/jira/browse/ARROW-10235
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The code / comments for type coercion are a little confusing and don't make 
> the distinction between coercion and casting clear -- we could improve the 
> documentation to clarify this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10234) [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10234:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva
> --
>
> Key: ARROW-10234
> URL: https://issues.apache.org/jira/browse/ARROW-10234
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> round() for floats/doubles returns incorrect results for some edge cases: for 
> example, round(cast(1.55 as float), 1) gives 1.6, but it should give 1.5, 
> since the value after casting to float is 1.5499999523162842, due to the 
> inexact representation of floating-point numbers in memory. Removing an 
> intermediate explicit cast-to-float statement on a double value used in 
> subsequent computations minimises the error introduced by this representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva

2020-10-08 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-10023.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8231
[https://github.com/apache/arrow/pull/8231]

> [Gandiva][C++] Implementing Split part function in gandiva
> --
>
> Key: ARROW-10023
> URL: https://issues.apache.org/jira/browse/ARROW-10023
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion

2020-10-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10235:
---

 Summary: [Rust][DataFusion] Improve documentation for type coercion
 Key: ARROW-10235
 URL: https://issues.apache.org/jira/browse/ARROW-10235
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andrew Lamb
Assignee: Andrew Lamb


The code / comments for type coercion are a little confusing and don't make the 
distinction between coercion and casting clear -- we could improve the 
documentation to clarify this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10165) [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel

2020-10-08 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb closed ARROW-10165.
---
Resolution: Duplicate

> [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported 
> by Arrow cast kernel
> -
>
> Key: ARROW-10165
> URL: https://issues.apache.org/jira/browse/ARROW-10165
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> When the DataFusion planner inserts casts, today it relies on special logic 
> to determine the set of valid casts. 
> The actual arrow cast kernels support a much wider range of data types, and 
> thus DataFusion is artificially limiting the casts it supports for no 
> particularly good reason that I can see.
> This ticket tracks the work to remove the extra cast checking in the 
> datafusion planner and instead rely on a runtime check by the arrow cast 
> compute kernel.
> The potential downside of this approach is that the error may be generated 
> later in the execution process (rather than in the planner), and may have a 
> less specific error message; the upside is that there is less code and we get 
> several conversions immediately (like timestamp predicate casting).
> I also plan to add DictionaryArray support to the casting kernels, and I 
> would like to avoid having to replicate part of that logic in DataFusion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10165) [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel

2020-10-08 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210151#comment-17210151
 ] 

Andrew Lamb commented on ARROW-10165:
-

Per comments on the PR, we have decided on a different approach here. I expect 
the code will be done under the aegis of 
https://issues.apache.org/jira/browse/ARROW-10163. 

Closing this one for now

> [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported 
> by Arrow cast kernel
> -
>
> Key: ARROW-10165
> URL: https://issues.apache.org/jira/browse/ARROW-10165
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> When the DataFusion planner inserts casts, today it relies on special logic 
> to determine the set of valid casts. 
> The actual arrow cast kernels support a much wider range of data types, and 
> thus DataFusion is artificially limiting the casts it supports for no 
> particularly good reason that I can see.
> This ticket tracks the work to remove the extra cast checking in the 
> datafusion planner and instead rely on a runtime check by the arrow cast 
> compute kernel.
> The potential downside of this approach is that the error may be generated 
> later in the execution process (rather than in the planner), and may have a 
> less specific error message; the upside is that there is less code and we get 
> several conversions immediately (like timestamp predicate casting).
> I also plan to add DictionaryArray support to the casting kernels, and I 
> would like to avoid having to replicate part of that logic in DataFusion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds

2020-10-08 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-10233:
---

Assignee: Andrew Lamb

> [Rust] Make array_value_to_string available in all Arrow builds
> ---
>
> Key: ARROW-10233
> URL: https://issues.apache.org/jira/browse/ARROW-10233
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Make array_value_to_string available in all Arrow builds
> Currently the array_value_to_string function is only available when 
> `feature = "prettyprint"` is enabled. 
> The rationale for making this change is that I want to be able to use 
> `array_value_to_string` to write tests (such as on 
> https://github.com/apache/arrow/pull/8346) but currently it is only available 
> when `feature = "prettyprint"` is enabled.
> It appears that [~nevi_me] made prettyprint compilation optional so that 
> arrow could be compiled for wasm in 
> https://github.com/apache/arrow/pull/7400. 
> https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to 
> some dependency of pretty-table;   `array_value_to_string` has no needed 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10233:
---
Labels: pull-request-available  (was: )

> [Rust] Make array_value_to_string available in all Arrow builds
> ---
>
> Key: ARROW-10233
> URL: https://issues.apache.org/jira/browse/ARROW-10233
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Make array_value_to_string available in all Arrow builds
> Currently the array_value_to_string function is only available when 
> `feature = "prettyprint"` is enabled. 
> The rationale for making this change is that I want to be able to use 
> `array_value_to_string` to write tests (such as on 
> https://github.com/apache/arrow/pull/8346) but currently it is only available 
> when `feature = "prettyprint"` is enabled.
> It appears that [~nevi_me] made prettyprint compilation optional so that 
> arrow could be compiled for wasm in 
> https://github.com/apache/arrow/pull/7400. 
> https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to 
> some dependency of pretty-table;   `array_value_to_string` has no needed 
> dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-10232.
--
Resolution: Duplicate

> FixedSizeListArray is incorrectly written/read to/from parquet
> --
>
> Key: ARROW-10232
> URL: https://issues.apache.org/jira/browse/ARROW-10232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Simon Perkins
>Priority: Major
> Fix For: 2.0.0
>
>
> FixedSizeListArrays seem to be either incorrectly written to or read from 
> Parquet files.
>  
> When reading the parquet file, nulls/Nones are returned where the original 
> values should be.
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> np_data = np.arange(20*4).reshape(20, 4).astype(np.float64)
> pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4)
> assert np_data.tolist() == pa_data.tolist()
> schema = pa.schema([pa.field("rectangle", pa_data.type)])
> table = pa.table({"rectangle": pa_data}, schema=schema)
> pq.write_table(table, "test.parquet")
> in_table = pq.read_table("test.parquet")   
> # rectangle is filled with nulls
> assert in_table.column("rectangle").to_pylist() == pa_data.tolist()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10232:
---
Fix Version/s: 2.0.0

> FixedSizeListArray is incorrectly written/read to/from parquet
> --
>
> Key: ARROW-10232
> URL: https://issues.apache.org/jira/browse/ARROW-10232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Simon Perkins
>Priority: Major
> Fix For: 2.0.0
>
>
> FixedSizeListArrays seem to be either incorrectly written to or read from 
> Parquet files.
>  
> When reading the parquet file, nulls/Nones are returned where the original 
> values should be.
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> np_data = np.arange(20*4).reshape(20, 4).astype(np.float64)
> pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4)
> assert np_data.tolist() == pa_data.tolist()
> schema = pa.schema([pa.field("rectangle", pa_data.type)])
> table = pa.table({"rectangle": pa_data}, schema=schema)
> pq.write_table(table, "test.parquet")
> in_table = pq.read_table("test.parquet")   
> # rectangle is filled with nulls
> assert in_table.column("rectangle").to_pylist() == pa_data.tolist()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet

2020-10-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210139#comment-17210139
 ] 

Antoine Pitrou commented on ARROW-10232:


Thanks for the report. I can confirm this fails on 1.0.1, but it was fixed in 
git master (we hope to release 2.0.0 in a week or two).

> FixedSizeListArray is incorrectly written/read to/from parquet
> --
>
> Key: ARROW-10232
> URL: https://issues.apache.org/jira/browse/ARROW-10232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Simon Perkins
>Priority: Major
>
> FixedSizeListArrays seem to be either incorrectly written to or read from 
> Parquet files.
>  
> When reading the parquet file, nulls/Nones are returned where the original 
> values should be.
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> np_data = np.arange(20*4).reshape(20, 4).astype(np.float64)
> pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4)
> assert np_data.tolist() == pa_data.tolist()
> schema = pa.schema([pa.field("rectangle", pa_data.type)])
> table = pa.table({"rectangle": pa_data}, schema=schema)
> pq.write_table(table, "test.parquet")
> in_table = pq.read_table("test.parquet")   
> # rectangle is filled with nulls
> assert in_table.column("rectangle").to_pylist() == pa_data.tolist()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds

2020-10-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10233:
---

 Summary: [Rust] Make array_value_to_string available in all Arrow 
builds
 Key: ARROW-10233
 URL: https://issues.apache.org/jira/browse/ARROW-10233
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb


Make array_value_to_string available in all Arrow builds

Currently the array_value_to_string function is only available when 
`feature = "prettyprint"` is enabled. 

The rationale for making this change is that I want to be able to use 
`array_value_to_string` to write tests (such as on 
https://github.com/apache/arrow/pull/8346) but currently it is only available 
when `feature = "prettyprint"` is enabled.

It appears that [~nevi_me] made prettyprint compilation optional so that arrow 
could be compiled for wasm in https://github.com/apache/arrow/pull/7400. 
https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to 
some dependency of pretty-table;   `array_value_to_string` has no needed 
dependencies.
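The proposed split can be sketched with `cfg` feature gating. The layout below is hypothetical, not the actual arrow crate: the dependency-free per-value formatter stays unconditional, and only the table renderer that needs the optional dependency sits behind the `prettyprint` feature.

```rust
// Hypothetical layout (not the actual arrow crate): the per-value formatter
// needs nothing beyond the standard library, so it is always compiled.
fn array_value_to_string(values: &[i64], index: usize) -> String {
    values[index].to_string()
}

// Only the table-style renderer would pull in the optional heavy dependency,
// so only it is gated behind the feature flag.
#[cfg(feature = "prettyprint")]
#[allow(dead_code)]
fn pretty_print(values: &[i64]) -> String {
    values.iter().map(|v| v.to_string()).collect::<Vec<_>>().join(" | ")
}

fn main() {
    // Usable in tests even when the crate is built without "prettyprint".
    assert_eq!(array_value_to_string(&[10, 20, 30], 1), "20");
    println!("{}", array_value_to_string(&[10, 20, 30], 2));
}
```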



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet

2020-10-08 Thread Simon Perkins (Jira)
Simon Perkins created ARROW-10232:
-

 Summary: FixedSizeListArray is incorrectly written/read to/from 
parquet
 Key: ARROW-10232
 URL: https://issues.apache.org/jira/browse/ARROW-10232
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.1
Reporter: Simon Perkins


FixedSizeListArrays seem to be either incorrectly written to or read from 
Parquet files.

 

When reading the parquet file, nulls/Nones are returned where the original 
values should be.

 
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

np_data = np.arange(20*4).reshape(20, 4).astype(np.float64)
pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4)
assert np_data.tolist() == pa_data.tolist()

schema = pa.schema([pa.field("rectangle", pa_data.type)])
table = pa.table({"rectangle": pa_data}, schema=schema)
pq.write_table(table, "test.parquet")

in_table = pq.read_table("test.parquet")   
# rectangle is filled with nulls
assert in_table.column("rectangle").to_pylist() == pa_data.tolist()

{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10231:
---
Labels: pull-request-available  (was: )

> [CI] Unable to download minio in arm32v7 docker image
> -
>
> Key: ARROW-10231
> URL: https://issues.apache.org/jira/browse/ARROW-10231
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image

2020-10-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10231:
---

 Summary: [CI] Unable to download minio in arm32v7 docker image
 Key: ARROW-10231
 URL: https://issues.apache.org/jira/browse/ARROW-10231
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build

2020-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10230:
---
Labels: pull-request-available  (was: )

> [JS][Doc] JavaScript documentation fails to build
> -
>
> Key: ARROW-10230
> URL: https://issues.apache.org/jira/browse/ARROW-10230
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, JavaScript
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Probably because of typedoc updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build

2020-10-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10230:
---

 Summary: [JS][Doc] JavaScript documentation fails to build
 Key: ARROW-10230
 URL: https://issues.apache.org/jira/browse/ARROW-10230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, JavaScript
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


Probably because of typedoc updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build

2020-10-08 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10230:

Issue Type: Bug  (was: Improvement)

> [JS][Doc] JavaScript documentation fails to build
> -
>
> Key: ARROW-10230
> URL: https://issues.apache.org/jira/browse/ARROW-10230
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, JavaScript
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Probably because of typedoc updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10229:
---
Component/s: C++

> [C++][Parquet] Remove left over ARROW_LOG statement.
> 
>
> Key: ARROW-10229
> URL: https://issues.apache.org/jira/browse/ARROW-10229
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.

2020-10-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10229.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8392
[https://github.com/apache/arrow/pull/8392]

> [C++][Parquet] Remove left over ARROW_LOG statement.
> 
>
> Key: ARROW-10229
> URL: https://issues.apache.org/jira/browse/ARROW-10229
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

