[jira] [Created] (ARROW-11285) [Release][APT] Add support for Ubuntu Groovy

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11285:


 Summary: [Release][APT] Add support for Ubuntu Groovy
 Key: ARROW-11285
 URL: https://issues.apache.org/jira/browse/ARROW-11285
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11284) [R] Support dplyr verb transmute()

2021-01-16 Thread Ian Cook (Jira)
Ian Cook created ARROW-11284:


 Summary: [R] Support dplyr verb transmute()
 Key: ARROW-11284
 URL: https://issues.apache.org/jira/browse/ARROW-11284
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook
Assignee: Ian Cook


Add support for the dplyr verb {{transmute()}}





[jira] [Created] (ARROW-11283) [Julia] Fix install link

2021-01-16 Thread Jacob Quinn (Jira)
Jacob Quinn created ARROW-11283:

 Summary: [Julia] Fix install link
 Key: ARROW-11283
 URL: https://issues.apache.org/jira/browse/ARROW-11283
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jacob Quinn
Assignee: Jacob Quinn
 Fix For: 3.0.0








[jira] [Created] (ARROW-11282) [Packaging][deb] Add missing libgflags-dev dependency

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11282:


 Summary: [Packaging][deb] Add missing libgflags-dev dependency
 Key: ARROW-11282
 URL: https://issues.apache.org/jira/browse/ARROW-11282
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11281) [C++] Remove needless runtime RapidJSON dependency

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11281:


 Summary: [C++] Remove needless runtime RapidJSON dependency
 Key: ARROW-11281
 URL: https://issues.apache.org/jira/browse/ARROW-11281
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11280) [Release][APT] Add a workaround for C++ and packaging bugs

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11280:


 Summary: [Release][APT] Add a workaround for C++ and packaging bugs
 Key: ARROW-11280
 URL: https://issues.apache.org/jira/browse/ARROW-11280
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11279) [Rust][Parquet]

2021-01-16 Thread R J (Jira)
R J created ARROW-11279:

 Summary: [Rust][Parquet]
 Key: ARROW-11279
 URL: https://issues.apache.org/jira/browse/ARROW-11279
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: R J


In the Rust implementation of an Arrow RecordBatch writer to Parquet 
(3.0.0-SNAPSHOT), the ArrowWriter::write call potentially allocates more memory 
than required.

For a RecordBatch with m rows and n columns, ArrowWriter::write allocates m*n 
definition levels, leading to m times the required memory usage.





[jira] [Created] (ARROW-11278) [Release][NodeJS] Don't touch ~/.bash_profile

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11278:


 Summary: [Release][NodeJS] Don't touch ~/.bash_profile
 Key: ARROW-11278
 URL: https://issues.apache.org/jira/browse/ARROW-11278
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11277) [C++] Fix compilation error in dataset expressions on macOS 10.11

2021-01-16 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11277:

 Summary: [C++] Fix compilation error in dataset expressions on 
macOS 10.11
 Key: ARROW-11277
 URL: https://issues.apache.org/jira/browse/ARROW-11277
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson
Assignee: Ben Kietzman


See https://github.com/autobrew/homebrew-core/pull/61#issuecomment-761605455

R binary packages for macOS are built with an old SDK, so this is needed. 





[jira] [Created] (ARROW-11276) [Rust] [DataFusion] Make MemoryStream public

2021-01-16 Thread Andy Grove (Jira)
Andy Grove created ARROW-11276:

 Summary: [Rust] [DataFusion] Make MemoryStream public
 Key: ARROW-11276
 URL: https://issues.apache.org/jira/browse/ARROW-11276
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


I found the need to take a copy of MemoryStream for use in another project.

It would be nice if we could expose this as a supported public API so that 
other projects building physical operators can re-use it.





[jira] [Created] (ARROW-11275) [Packaging][wheel][Linux] Fix paths for Gemfury

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11275:


 Summary: [Packaging][wheel][Linux] Fix paths for Gemfury
 Key: ARROW-11275
 URL: https://issues.apache.org/jira/browse/ARROW-11275
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11274) [Packaging][wheel][Windows] Fix wheels path for Gemfury

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11274:


 Summary: [Packaging][wheel][Windows] Fix wheels path for Gemfury
 Key: ARROW-11274
 URL: https://issues.apache.org/jira/browse/ARROW-11274
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11273) [Release][deb] Remove unsupported Debian GNU/Linux stretch

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11273:


 Summary: [Release][deb] Remove unsupported Debian GNU/Linux stretch
 Key: ARROW-11273
 URL: https://issues.apache.org/jira/browse/ARROW-11273
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11272) [Release][wheel] Remove unsupported Python 3.5 and manylinux1

2021-01-16 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-11272:


 Summary: [Release][wheel] Remove unsupported Python 3.5 and 
manylinux1
 Key: ARROW-11272
 URL: https://issues.apache.org/jira/browse/ARROW-11272
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-11271) [Rust] [Parquet] List schema to Arrow parser misinterpreting child nullability

2021-01-16 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11271:

 Summary: [Rust] [Parquet] List schema to Arrow parser 
misinterpreting child nullability
 Key: ARROW-11271
 URL: https://issues.apache.org/jira/browse/ARROW-11271
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale
Assignee: Neville Dipale


We currently do not propagate child nullability correctly when reading Parquet 
files from Spark 3.0.1 (parquet-mr 1.10.1).

For example, the schema below, taken from 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md], is 
currently interpreted incorrectly:

 
{code:java}
// List (list nullable, elements non-null)
optional group my_list (LIST) {
  repeated group list {
    required binary element (UTF8);
  }
}{code}
The Arrow type should be:
{code:java}
Field::new(
    "my_list",
    DataType::List(Box::new(
        // element field: nullable = false
        Field::new("element", DataType::Utf8, false),
    )),
    true, // list field: nullable = true
){code}
but we currently end up with 
{code:java}
Field::new(
    "my_list",
    DataType::List(Box::new(
        // child is named "list" and marked nullable = true
        Field::new("list", DataType::Utf8, true),
    )),
    true, // list field: nullable = true
)
{code}
This doesn't seem to be an issue with the master branch as of opening this 
issue, so it might not be severe enough to try to force into the 3.0.0 release.

I tested null and non-null Spark files, and was able to read them correctly. 
This becomes an issue with nested lists, which I'm working on.
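
To make the expected mapping concrete, here is a minimal sketch of how list and element nullability follow from the Parquet repetition levels in the three-level LIST layout quoted above. It does not use the parquet or arrow crates; `Repetition`, `ToyListSchema` and `list_nullability` are hypothetical names invented for this illustration.

```rust
/// Parquet field repetition, reduced to a toy enum for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical summary of a three-level LIST group:
/// `<outer> group my_list (LIST) { <middle> group list { <element> ... element; } }`
struct ToyListSchema {
    outer: Repetition,
    middle: Repetition,
    element: Repetition,
}

/// Returns (list_nullable, element_nullable), or None when the layout is not
/// the standard three-level form (outer not repeated, middle repeated).
fn list_nullability(schema: &ToyListSchema) -> Option<(bool, bool)> {
    if schema.outer == Repetition::Repeated || schema.middle != Repetition::Repeated {
        return None;
    }
    Some((
        schema.outer == Repetition::Optional,
        schema.element == Repetition::Optional,
    ))
}

fn main() {
    // optional group my_list (LIST) { repeated group list { required binary element; } }
    let schema = ToyListSchema {
        outer: Repetition::Optional,
        middle: Repetition::Repeated,
        element: Repetition::Required,
    };
    // The list itself is nullable, its elements are not.
    assert_eq!(list_nullability(&schema), Some((true, false)));
    println!("{:?}", list_nullability(&schema));
}
```

The point of the sketch is only that element nullability comes from the innermost `element` field, not from the repeated middle group.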

 





[jira] [Created] (ARROW-11270) [Rust] Use slices for simple array data buffer access

2021-01-16 Thread Tyrel Rink (Jira)
Tyrel Rink created ARROW-11270:

 Summary: [Rust] Use slices for simple array data buffer access
 Key: ARROW-11270
 URL: https://issues.apache.org/jira/browse/ARROW-11270
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Tyrel Rink
Assignee: Tyrel Rink


Using an approach similar to ARROW-10989, migrate typed array APIs to use 
slices where they can.

This impacts the API of:
 * GenericBinaryArray<>
 * GenericListArray<>
 * GenericStringArray<>

This also adds bounds checking to the value() function on each of the above 
arrays (as well as PrimitiveArray<>).

The new PrimitiveArray bounds checks have a negative performance impact 
on various benchmarks that still use the .value(...) function on 
PrimitiveArray. But that should be resolvable by using 
PrimitiveArray.values() instead (whether within this PR or a future PR).
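
As a rough illustration of the trade-off described above, the sketch below contrasts per-element, bounds-checked access with access through a slice. `ToyPrimitiveArray` is a hypothetical stand-in written for this example; it is not the arrow crate's `PrimitiveArray`, and the real signatures may differ.

```rust
/// Toy stand-in for a primitive array; not the arrow crate's type.
struct ToyPrimitiveArray {
    data: Vec<i32>,
}

impl ToyPrimitiveArray {
    /// Per-element access: every call checks the index and panics when it is out of range.
    fn value(&self, i: usize) -> i32 {
        assert!(i < self.data.len(), "index {} out of bounds", i);
        self.data[i]
    }

    /// Whole-buffer access: hands out a slice that callers can iterate directly.
    fn values(&self) -> &[i32] {
        &self.data
    }
}

/// Sums through value(), paying for one explicit bounds check per element.
fn sum_with_value(arr: &ToyPrimitiveArray) -> i64 {
    let mut total = 0i64;
    for i in 0..arr.data.len() {
        total += i64::from(arr.value(i));
    }
    total
}

/// Sums through the slice iterator, which needs no per-element index check.
fn sum_with_values(arr: &ToyPrimitiveArray) -> i64 {
    arr.values().iter().map(|&v| i64::from(v)).sum()
}

fn main() {
    let arr = ToyPrimitiveArray { data: vec![1, 2, 3, 4] };
    assert_eq!(sum_with_value(&arr), sum_with_values(&arr));
    println!("sum = {}", sum_with_values(&arr));
}
```

Both functions return the same result; hot loops would use the slice form, while value() remains the safe choice for isolated lookups.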





[jira] [Created] (ARROW-11269) [Rust] Unable to read Parquet file because of mismatch

2021-01-16 Thread Max Burke (Jira)
Max Burke created ARROW-11269:

 Summary: [Rust] Unable to read Parquet file because of mismatch 
 Key: ARROW-11269
 URL: https://issues.apache.org/jira/browse/ARROW-11269
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 3.0.0
Reporter: Max Burke
 Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet

The issue seems to stem from the new(-ish) behavior of the Arrow Parquet reader 
where the embedded arrow schema is used instead of deriving the schema from the 
Parquet columns.

 

However, it seems like some cases still derive the schema type from the column 
types, leading to the Arrow record batch reader erroring out that the column 
types must match the schema types.

 

In our case, the column type is an int96 datetime (ns) type, and the Arrow type 
in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond, 
Some("UTC")). However, the code that constructs the Arrays seems to re-derive 
this column type as DataType::Timestamp(TimeUnit::Nanosecond, None) (because 
the Parquet schema has no timezone information). And so, Parquet files that we 
were able to read successfully with our branch of Arrow circa October are now 
unreadable.

 

I've attached an example of a Parquet file that demonstrates the problem. This 
file was created in Python (as most of our Parquet files are).
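
The mismatch can be reproduced in miniature without the arrow or parquet crates. In the sketch below, `ToyDataType` and `ToyTimeUnit` are hypothetical stand-ins for the Arrow types, and `match_ignoring_timezone` is only one conceivable relaxation, not something the reader is known to implement.

```rust
/// Toy time unit; only the variant needed for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum ToyTimeUnit {
    Nanosecond,
}

/// Toy data type with an optional timezone on timestamps.
#[derive(Debug, Clone, PartialEq)]
enum ToyDataType {
    Timestamp(ToyTimeUnit, Option<String>),
    Int64,
}

/// Strict check: the types must be equal, timezone included.
fn strict_match(embedded: &ToyDataType, derived: &ToyDataType) -> bool {
    embedded == derived
}

/// Hypothetical relaxed check: timestamps match when their units agree,
/// regardless of timezone; every other type still needs strict equality.
fn match_ignoring_timezone(embedded: &ToyDataType, derived: &ToyDataType) -> bool {
    match (embedded, derived) {
        (ToyDataType::Timestamp(a, _), ToyDataType::Timestamp(b, _)) => a == b,
        _ => embedded == derived,
    }
}

fn main() {
    // Type recorded in the embedded Arrow schema.
    let embedded = ToyDataType::Timestamp(ToyTimeUnit::Nanosecond, Some("UTC".to_string()));
    // Type re-derived from the Parquet column, which carries no timezone.
    let derived = ToyDataType::Timestamp(ToyTimeUnit::Nanosecond, None);

    assert!(!strict_match(&embedded, &derived));
    assert!(match_ignoring_timezone(&embedded, &derived));
    assert!(match_ignoring_timezone(&ToyDataType::Int64, &ToyDataType::Int64));
    println!("strict: {}", strict_match(&embedded, &derived));
    println!("relaxed: {}", match_ignoring_timezone(&embedded, &derived));
}
```

The strict comparison fails purely because one side carries `Some("UTC")` and the other `None`, which is the situation described above.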





[jira] [Created] (ARROW-11268) [Rust][DataFusion] Support specifying repartitions in mem table

2021-01-16 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-11268:


 Summary: [Rust][DataFusion] Support specifying repartitions in mem 
table 
 Key: ARROW-11268
 URL: https://issues.apache.org/jira/browse/ARROW-11268
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11267) [Rust]: Comparison of list arrays with differing offsets fails

2021-01-16 Thread Jörn Horstmann (Jira)
Jörn Horstmann created ARROW-11267:

 Summary: [Rust]: Comparison of list arrays with differing offsets 
fails
 Key: ARROW-11267
 URL: https://issues.apache.org/jira/browse/ARROW-11267
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jörn Horstmann


Found this while reviewing the fix for ARROW-11239. The reason for the failure 
seems to be related to the combining of null bitmaps of parent/child data. When 
I changed `create_list_array` to not include null buffers, the test passed.
{code:java}
#[test]
fn test_list_different_offsets() {
    let a =
        create_list_array(&[Some(&[0, 0]), Some(&[1, 2]), Some(&[3, 4])]);
    let b =
        create_list_array(&[Some(&[1, 2]), Some(&[3, 4]), Some(&[5, 6])]);
    // Both slices logically hold [[1, 2], [3, 4]] but start at different offsets.
    let a_slice = a.slice(1, 2);
    let b_slice = b.slice(0, 2);
    test_equal(&a_slice, &b_slice, true);
} {code}
[~jorgecarleitao] [~nevi_me] FYI
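
For reference, here is a self-contained sketch of what "logically equal despite differing offsets" means for the test above. `ToyListArray` is a hypothetical structure written for this example, not the arrow crate's list array, and it deliberately sidesteps the parent/child null bitmap combination suspected in this issue.

```rust
/// Toy list array: list i holds values[offsets[start + i]..offsets[start + i + 1]].
#[derive(Clone)]
struct ToyListArray {
    values: Vec<i32>,
    offsets: Vec<usize>,
    validity: Vec<bool>,
    start: usize,
    len: usize,
}

impl ToyListArray {
    /// Builds the flat values buffer, the offsets and the validity flags.
    fn from_lists(lists: &[Option<&[i32]>]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0];
        let mut validity = Vec::new();
        for list in lists {
            if let Some(items) = list {
                values.extend_from_slice(items);
            }
            offsets.push(values.len());
            validity.push(list.is_some());
        }
        ToyListArray { values, offsets, validity, start: 0, len: lists.len() }
    }

    /// Slicing only moves start/len; the buffers keep their original contents
    /// (they are cloned here for simplicity rather than shared).
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        ToyListArray { start: self.start + offset, len, ..self.clone() }
    }

    /// Returns the i-th list of this (possibly sliced) array, or None if it is null.
    fn get(&self, i: usize) -> Option<&[i32]> {
        let idx = self.start + i;
        if self.validity[idx] {
            Some(&self.values[self.offsets[idx]..self.offsets[idx + 1]])
        } else {
            None
        }
    }
}

/// Logical equality: compare list by list through each array's own offsets.
fn logical_equal(a: &ToyListArray, b: &ToyListArray) -> bool {
    a.len == b.len && (0..a.len).all(|i| a.get(i) == b.get(i))
}

fn main() {
    let a_lists: [Option<&[i32]>; 3] = [Some(&[0, 0]), Some(&[1, 2]), Some(&[3, 4])];
    let b_lists: [Option<&[i32]>; 3] = [Some(&[1, 2]), Some(&[3, 4]), Some(&[5, 6])];
    let a = ToyListArray::from_lists(&a_lists);
    let b = ToyListArray::from_lists(&b_lists);

    let a_slice = a.slice(1, 2);
    let b_slice = b.slice(0, 2);
    assert!(logical_equal(&a_slice, &b_slice));
    assert!(!logical_equal(&a, &b));
    println!("slices equal: {}", logical_equal(&a_slice, &b_slice));
}
```

The two slices start at different positions in their values buffers, yet compare equal because each lookup goes through the owning array's offsets and validity.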





[jira] [Created] (ARROW-11266) [Rust][DataFusion] Implement vectorized hashing for hash aggregate

2021-01-16 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-11266:


 Summary: [Rust][DataFusion] Implement vectorized hashing for hash 
aggregate
 Key: ARROW-11266
 URL: https://issues.apache.org/jira/browse/ARROW-11266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres





