[jira] [Commented] (ARROW-7476) [Python] Arrow error: IOError: Error reading bytes from file: No error

2020-01-07 Thread gaurav vashisth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010445#comment-17010445
 ] 

gaurav vashisth commented on ARROW-7476:


UPDATE: This error can occur in a file having 1 million records as 
well. 

> [Python] Arrow error: IOError: Error reading bytes from file: No error
> --
>
> Key: ARROW-7476
> URL: https://issues.apache.org/jira/browse/ARROW-7476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: windows
>Reporter: gaurav vashisth
>Priority: Major
>
> When I try to read a parquet file using either Pandas or Dask, I get the 
> following error:
> Arrow error: IOError: Error reading bytes from file: No error. However, when 
> I try again to read the file, sometimes I'm able to read it. Below are 
> the commands I used to read the parquet file.
> With dask:
> dd.read_parquet('my.parquet', engine='pyarrow', compression='snappy').compute()
> With pandas:
> pd.read_parquet('my.parquet', 
> engine='pyarrow', compression='snappy')
>  
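Since the failure is intermittent and a second attempt often succeeds, one pragmatic workaround is to retry the read. A minimal sketch in plain Python; the `flaky_read` function below is a made-up stand-in for the `pd.read_parquet` call, not something from the report:

```python
import time

def read_with_retry(read_fn, attempts=3, delay_s=1.0):
    """Call read_fn(), retrying on OSError up to `attempts` times."""
    last_err = None
    for i in range(attempts):
        try:
            return read_fn()
        except OSError as err:  # Python 3's IOError is an alias of OSError
            last_err = err
            time.sleep(delay_s * (i + 1))  # simple linear backoff
    raise last_err

# Demo with a stand-in reader that fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Error reading bytes from file: No error")
    return "dataframe"

result = read_with_retry(flaky_read, attempts=5, delay_s=0.0)
print(result)  # dataframe
```

In practice `read_fn` would be a lambda wrapping the pandas or dask call shown above.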



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7514) [C#] Make GetValueOffset Obsolete

2020-01-07 Thread Takashi Hashida (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takashi Hashida updated ARROW-7514:
---
Description: 
[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.

We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in a future release.

 

See this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]

  was:
[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.



We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in the future release.

 

Show this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]


> [C#] Make GetValueOffset Obsolete
> -
>
> Key: ARROW-7514
> URL: https://issues.apache.org/jira/browse/ARROW-7514
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Takashi Hashida
>Priority: Major
>
> [BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
>  and 
> [ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
>  no longer have value.
> We should add an `Obsolete` attribute to these methods in the next release, 
> then remove these methods in a future release.
>  
> See this discussion: 
> [https://github.com/apache/arrow/pull/6029#discussion_r361505788]





[jira] [Assigned] (ARROW-7045) [R] Factor type not preserved in Parquet roundtrip

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7045:
--

Assignee: Hiroaki Yutani

> [R] Factor type not preserved in Parquet roundtrip
> --
>
> Key: ARROW-7045
> URL: https://issues.apache.org/jira/browse/ARROW-7045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Hiroaki Yutani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
> Fails:
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
> This has to do with the translation with Parquet and not the R <--> Arrow 
> type mapping (unlike ARROW-7028). If you write_feather and read_feather, the 
> test passes.





[jira] [Resolved] (ARROW-7045) [R] Factor type not preserved in Parquet roundtrip

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7045.

Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6135
[https://github.com/apache/arrow/pull/6135]

> [R] Factor type not preserved in Parquet roundtrip
> --
>
> Key: ARROW-7045
> URL: https://issues.apache.org/jira/browse/ARROW-7045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
> Fails:
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
> This has to do with the translation with Parquet and not the R <--> Arrow 
> type mapping (unlike ARROW-7028). If you write_feather and read_feather, the 
> test passes.





[jira] [Created] (ARROW-7514) [C#] Make GetValueOffset Obsolete

2020-01-07 Thread Takashi Hashida (Jira)
Takashi Hashida created ARROW-7514:
--

 Summary: [C#] Make GetValueOffset Obsolete
 Key: ARROW-7514
 URL: https://issues.apache.org/jira/browse/ARROW-7514
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Takashi Hashida


[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.



We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in a future release.

 

See this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]





[jira] [Created] (ARROW-7513) [JS] Arrow Tutorial: Common data types

2020-01-07 Thread Leo Meyerovich (Jira)
Leo Meyerovich created ARROW-7513:
-

 Summary: [JS] Arrow Tutorial: Common data types
 Key: ARROW-7513
 URL: https://issues.apache.org/jira/browse/ARROW-7513
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Leo Meyerovich
Assignee: Leo Meyerovich


The JS client lacks basic introductory material around creating the common 
data types, such as turning JS arrays into ints, dicts, etc. There is no 
equivalent of Python's [https://arrow.apache.org/docs/python/data.html]. This 
has made usage difficult for me, and I bet for others.

 

As with previous tutorials, I started sketching on 
[https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit]
 . When we're happy, it can make sense to export it as HTML or something to the 
repo, or just link from the main readme.

I believe the target topics worth covering are:
 * Common user data types: Ints, Dicts, Struct, Time
 * Common column types: Data, Vector, Column
 * Going from individual & arrays & buffers of JS values to Arrow-wrapped 
forms, and basic inspection of the result

Not worth going into here is Tables vs. RecordBatches, which is the other 
tutorial.

 

1. Ideas of what to add/edit/remove?

2. And anyone up for helping with discussion of Data vs. Vector, and ingest of 
Time & Struct?

3. ... Should we be encouraging Struct or Map? I saw some PRs changing stuff 
here.

 

cc [~wesm] [~bhulette] [~paul.e.taylor]

 

 

 





[jira] [Commented] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

2020-01-07 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010293#comment-17010293
 ] 

Anthony Abate commented on ARROW-7511:
--

Now I remember why I thought Memory and Span can't support more than 2 gigs:

The *.Slice()* function only takes int32

https://docs.microsoft.com/en-us/dotnet/api/system.memory-1.slice?view=netcore-3.1#System_Memory_1_Slice_System_Int32_System_Int32_
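Int32 offsets cap a single contiguous slice at 2^31 - 1 bytes. The "move the limit lower" idea from the issue amounts to splitting a large buffer into slices that each fit under that cap; a Python sketch for illustration, with a made-up helper name:

```python
INT32_MAX = 2**31 - 1  # the cap implied by Slice(int, int)

def chunk_ranges(total_bytes, chunk_limit=INT32_MAX):
    """Split a large buffer into (offset, length) pairs, each under the limit."""
    ranges = []
    offset = 0
    while offset < total_bytes:
        length = min(chunk_limit, total_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 5 GiB buffer needs three slices under the ~2 GiB cap.
five_gib = 5 * 2**30
parts = chunk_ranges(five_gib)
print(len(parts))  # 3
```

Each (offset, length) pair stays representable with int32 lengths, at the cost of code that must iterate over the pieces.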

 

> [C#] - Batch / Data Size Can't Exceed 2 gigs
> 
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> While the Arrow spec does not forbid batches larger than 2 gigs, the C# 
> library can not support this in its current form due to limits on managed 
> memory, as it tries to put the whole batch into a single 
> Span/Memory
> It is possible to fix this by not trying to use Memory/Span/byte[] for the 
> entire batch, and instead move the memory mapping to the ArrowBuffers.  This 
> only moves the problem 'lower', as it would then still limit a column's 
> data in a single batch to 2 gigs.  
> This seems like plenty of memory... but if you think of string columns, the 
> data is just one giant string appended together with offsets, and it can 
> get very large quickly.
> I think the unfortunate problem is that memory management in the C# managed 
> world is always going to hit the 2 gig limit somewhere. (Please correct me if 
> I am wrong on this statement, but I thought I read somewhere that Memory 
> / Span are limited to int and changing to long would require major 
> framework rewrites - but I may be conflating that with array.)
> That ultimately means the C# library either has to reject files with certain 
> characteristics (i.e. validation checks on opening), or the spec needs to put 
> upper limits on certain internal arrow constructs (i.e. arrow buffer) to 
> eliminate the need for more than 2 gigs of contiguous memory for the 
> smallest arrow object.
> However, if the spec was indeed designed for the smallest buffer object to be 
> larger than 2 gigs, or for the entire memory buffer of arrow to be 
> contiguous, one has to wonder if at some point it might just make sense for 
> the C# library to use the C++ library as its memory manager, as replicating 
> very large blocks of memory is more work than it's worth.
> In any case, this issue is more about 'deferring' the 2 gig size problem by 
> moving it down to the buffer objects... This might require some rewrite of 
> the batch data structures.
>  
>  





[jira] [Updated] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7501:

Fix Version/s: 0.16.0

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.16.0
>
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on Linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010250#comment-17010250
 ] 

Wes McKinney commented on ARROW-7501:
-

On reviewing what Neal's patch does, I think I see the conflict: we want 
to enable a seamless install on Linux systems that do not have these packages 
installed. It might be better to address this with ARROW-6821; the question is 
what kind of forward/backward compatibility there is in generated Thrift sources.

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on Linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.





[jira] [Updated] (ARROW-7475) [Rust] Create Arrow Stream writer

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7475:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] Create Arrow Stream writer
> -
>
> Key: ARROW-7475
> URL: https://issues.apache.org/jira/browse/ARROW-7475
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Commented] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010211#comment-17010211
 ] 

Wes McKinney commented on ARROW-7376:
-

This should ideally be fixed for the next major release.

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
> Key: ARROW-7376
> URL: https://issues.apache.org/jira/browse/ARROW-7376
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Pierre Belzile
>Priority: Major
>  Labels: parquet
> Fix For: 0.16.0
>
>
> There is a bug in the doubles column statistics computation when writing to 
> parquet an array with only NaNs and nulls. It loops endlessly if the last 
> cell of a write group is a Null. The line in error is 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
>  which checks for NaN but not for Null. The code then falls through, loops 
> endlessly, and causes the program to appear frozen.
> This code snippet reproduces the problem:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(),
>       &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(
>       std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
>   std::shared_ptr<arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);  /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props =
>       writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat}
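For comparison, the intended statistics behaviour is to skip both NaN and null while scanning for comparable values, and to emit no min/max when nothing comparable remains. A language-neutral sketch in Python (illustrative only, not the actual C++ scan in statistics.cc):

```python
import math

def min_max_stats(values):
    """Compute (min, max) over values, skipping None (null) and NaN.

    Returns None when no comparable value exists -- the all-NaN/null case
    that the buggy scan failed to terminate on.
    """
    comparable = [v for v in values
                  if v is not None and not math.isnan(v)]
    if not comparable:
        return None  # all-NaN/null column: no statistics to write
    return (min(comparable), max(comparable))

print(min_max_stats([float("nan"), None]))            # None
print(min_max_stats([1.0, float("nan"), None, 3.0]))  # (1.0, 3.0)
```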





[jira] [Updated] (ARROW-7384) [Website] Fix search indexing warning reported by Google

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7384:

Fix Version/s: (was: 0.16.0)

> [Website] Fix search indexing warning reported by Google
> 
>
> Key: ARROW-7384
> URL: https://issues.apache.org/jira/browse/ARROW-7384
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> I received the following e-mail from Google regarding arrow.apache.org (since 
> I'm an admin on the Analytics account)
> {code}
> Top Warnings
> Warnings are suggestions for improvement. Some warnings can affect your 
> appearance on Search; some might be reclassified as errors in the future. The 
> following warnings were found on your site:
> Indexed, though blocked by robots.txt
> We recommend that you fix these issues when possible to enable the best 
> experience and coverage in Google Search.
> {code}





[jira] [Updated] (ARROW-7313) [C++] Add function for retrieving a scalar from an array slot

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7313:

Fix Version/s: (was: 0.16.0)

> [C++] Add function for retrieving a scalar from an array slot
> -
>
> Key: ARROW-7313
> URL: https://issues.apache.org/jira/browse/ARROW-7313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> It'd be useful to construct scalar values given an array and an index.
> {code}
> /* static */ std::shared_ptr<Scalar> Scalar::FromArray(const Array&, int64_t);
> {code}
> Since this is much less efficient than unboxing the entire array and 
> accessing its buffers directly, it should not be used in hot loops.
> [~kszucs] [~fsaintjacques]





[jira] [Updated] (ARROW-7285) [C++] ensure C++ implementation meets clarified dictionary spec

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7285:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] ensure C++ implementation meets clarified dictionary spec
> ---
>
> Key: ARROW-7285
> URL: https://issues.apache.org/jira/browse/ARROW-7285
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> see parent issue.
>  
> CC [~tianchen92]





[jira] [Updated] (ARROW-7283) Ensure dictionary IPC implementations match spec clarifications

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7283:

Fix Version/s: (was: 0.16.0)
   1.0.0

> Ensure dictionary IPC implementations match spec clarifications
> ---
>
> Key: ARROW-7283
> URL: https://issues.apache.org/jira/browse/ARROW-7283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Parent tracking issue to ensure the clarifications in PR 
> [https://github.com/apache/arrow/pull/5585#pullrequestreview-324979419] are 
> correctly implemented.
>  
> Specifically:
> 1.  dictionary replacement in streams.
> 2.  Not requiring dictionaries be present at the beginning of the stream for 
> all null columns.
> 3.  Dictionary replacement isn't supported in the file format.
>  
> Some implementations might already have some or all of these.  This specific 
> Jira covers adding integration tests (children tasks cover language specific 
> implementations).





[jira] [Updated] (ARROW-7284) [Java] ensure java implementation meets clarified dictionary spec

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7284:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] ensure java implementation meets clarified dictionary spec
> -
>
> Key: ARROW-7284
> URL: https://issues.apache.org/jira/browse/ARROW-7284
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> see parent issue.
>  
> CC [~tianchen92]





[jira] [Commented] (ARROW-7269) [C++] Fix arrow::parquet compiler warning

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010209#comment-17010209
 ] 

Wes McKinney commented on ARROW-7269:
-

We can fix this once the next parquet-format release comes out

> [C++] Fix arrow::parquet compiler warning
> -
>
> Key: ARROW-7269
> URL: https://issues.apache.org/jira/browse/ARROW-7269
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jiajia Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Encountered the compiler warning when building:
> [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a 
> compatibility alias for "i8". Use "i8" to emphasize the signedness of this 
> type.





[jira] [Updated] (ARROW-7269) [C++] Fix arrow::parquet compiler warning

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7269:

Fix Version/s: (was: 0.16.0)

> [C++] Fix arrow::parquet compiler warning
> -
>
> Key: ARROW-7269
> URL: https://issues.apache.org/jira/browse/ARROW-7269
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jiajia Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Encountered the compiler warning when building:
> [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a 
> compatibility alias for "i8". Use "i8" to emphasize the signedness of this 
> type.





[jira] [Updated] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7265:

Fix Version/s: (was: 0.16.0)

> [Format][C++] Clarify the usage of typeIds in Union type documentation
> --
>
> Key: ARROW-7265
> URL: https://issues.apache.org/jira/browse/ARROW-7265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The documentation is unclear.





[jira] [Updated] (ARROW-7221) [C++][Documentation] Document how to set installed location for individual toolchain components

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7221:

Fix Version/s: (was: 0.16.0)

> [C++][Documentation] Document how to set installed location for individual 
> toolchain components
> ---
>
> Key: ARROW-7221
> URL: https://issues.apache.org/jira/browse/ARROW-7221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is not well documented in 
> http://arrow.apache.org/docs/developers/cpp.html#build-dependency-management
> the CMake variables are {{$DEPENDENCY_NAME_ROOT}}





[jira] [Updated] (ARROW-7191) [CI][Crossbow] Nightly build email should distinguish between new failures and still failing builds

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7191:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Nightly build email should distinguish between new failures 
> and still failing builds
> ---
>
> Key: ARROW-7191
> URL: https://issues.apache.org/jira/browse/ARROW-7191
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Neal Richardson
>Priority: Major
>
> It would help with triaging the nightly build status if it were more readily 
> visible which builds broke today vs. are still broken (and were triaged 
> yesterday).





[jira] [Updated] (ARROW-7182) [CI][Crossbow] Nightly fuzzit build broken in docker-compose refactor

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7182:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Nightly fuzzit build broken in docker-compose refactor
> -
>
> Key: ARROW-7182
> URL: https://issues.apache.org/jira/browse/ARROW-7182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Priority: Major
>
> See https://circleci.com/gh/ursa-labs/crossbow/4959 for example. 
> {code}
> /arrow/ci/scripts/fuzzit_build.sh: line 26: pushd: 
> /arrow/cpp/build/relwithdebinfo: No such file or directory
> {code}
> Scrolling up in the logs, it looks like the build dir is actually 
> {{/build/cpp}}, so given that, we should {{pushd /build/cpp/relwithdebinfo}}.
> cc [~kszucs]





[jira] [Updated] (ARROW-7184) [C++][Dataset] Nightly ubuntu 14.04 fails because of dataset filter tests

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7184:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] Nightly ubuntu 14.04 fails because of dataset filter tests
> -
>
> Key: ARROW-7184
> URL: https://issues.apache.org/jira/browse/ARROW-7184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Continuous Integration
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>
> See https://circleci.com/gh/ursa-labs/crossbow/4958





[jira] [Updated] (ARROW-7204) [C++][Dataset] In expression should not require exact type match

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7204:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] In expression should not require exact type match
> 
>
> Key: ARROW-7204
> URL: https://issues.apache.org/jira/browse/ARROW-7204
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>
> Similar to ARROW-7047. I encountered this on ARROW-7185 
> (https://github.com/apache/arrow/pull/5858/files#diff-1d8a97ca966e8446ef2ae4b7b5a96ed1R125)





[jira] [Updated] (ARROW-7130) [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX sdk

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7130:

Fix Version/s: (was: 0.16.0)

> [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX 
> sdk
> ---
>
> Key: ARROW-7130
> URL: https://issues.apache.org/jira/browse/ARROW-7130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Krisztian Szucs
>Priority: Major
>
> ARROW_GANDIVA_PC_CXX_FLAGS requires special treatment based on the platforms, 
> see:
> - https://github.com/apache/arrow/blob/master/ci/scripts/cpp_build.sh#L27-L32
> - 
> https://github.com/conda-forge/arrow-cpp-feedstock/blob/master/recipe/build.sh#L12-L15
> We should integrate this logic into CMake by default.





[jira] [Updated] (ARROW-7190) [CI][Crossbow] Add extra testing groups for on-demand builds

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7190:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Add extra testing groups for on-demand builds
> 
>
> Key: ARROW-7190
> URL: https://issues.apache.org/jira/browse/ARROW-7190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Neal Richardson
>Priority: Minor
>
> We should have an easy way to trigger an extra set of builds but not the full 
> matrix that runs nightly. For example, we should have a python group that 
> runs one conda osx build, one wheel build, etc., not across all versions. 
> Most of the time when we have a nightly build failure on python packaging, we 
> get 5 failures for the same error, so we should be able to detect these more 
> cheaply.
> We already have build groups in crossbow, so this means adding new groups.





[jira] [Updated] (ARROW-7179) [C++][Compute] Coalesce kernel

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7179:

Fix Version/s: (was: 0.16.0)

> [C++][Compute] Coalesce kernel
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Add a kernel which replaces null values in an array with a scalar value or 
> with values taken from another array:
> {code}
> coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
> coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
> The code in {{take_internal.h}} should be of some use with a bit of 
> refactoring.
> A filter Expression should be added at the same time.
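The proposed semantics can be sketched in plain Python, with None standing in for null (an illustration of the contract only, not the C++ kernel):

```python
# Plain-Python sketch of the proposed coalesce semantics; None stands in
# for null. Illustrative only -- not the actual Arrow C++ kernel.
def coalesce(values, fill):
    # `fill` may be a scalar or a second sequence of equal length.
    if isinstance(fill, (list, tuple)):
        return [f if v is None else v for v, f in zip(values, fill)]
    return [fill if v is None else v for v in values]

print(coalesce([1, 2, None, 3], 5))                   # [1, 2, 5, 3]
print(coalesce([1, None, None, 3], [5, 6, None, 8]))  # [1, 6, None, 3]
```

Note that a null in the fill array leaves the slot null, matching the second example in the description.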





[jira] [Updated] (ARROW-7151) [C++][Dataset] Refactor ExpressionEvaluator to yield Arrays

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7151:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] Refactor ExpressionEvaluator to yield Arrays
> ---
>
> Key: ARROW-7151
> URL: https://issues.apache.org/jira/browse/ARROW-7151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Currently expressions can be evaluated to scalars or arrays, mostly to 
> accommodate ScalarExpression. Instead let all expressions be evaluable to 
> Array only. ScalarExpression will evaluate to an array of repeated values, 
> but expressions whose corresponding kernels can accept a scalar directly 
> (comparison, for example) can avoid materializing this array.





[jira] [Updated] (ARROW-7122) [CI][Documentation] docker-compose developer guide in the sphinx documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7122:

Fix Version/s: (was: 0.16.0)

> [CI][Documentation] docker-compose developer guide in the sphinx documentation
> -
>
> Key: ARROW-7122
> URL: https://issues.apache.org/jira/browse/ARROW-7122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> We have a short guide in the sphinx documentation under integration.rst
> It needs to be updated with the recent docker-compose changes.





[jira] [Commented] (ARROW-7128) [CI] Fedora cron jobs are failing because of wrong fedora version

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010207#comment-17010207
 ] 

Wes McKinney commented on ARROW-7128:
-

Is this resolved?

> [CI] Fedora cron jobs are failing because of wrong fedora version
> -
>
> Key: ARROW-7128
> URL: https://issues.apache.org/jira/browse/ARROW-7128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The requested fedora version is 10 (Debian) instead of 29: 
> https://github.com/apache/arrow/runs/299223601





[jira] [Commented] (ARROW-7121) [C++][CI][Windows] Enable more features on the windows GHA build

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010206#comment-17010206
 ] 

Wes McKinney commented on ARROW-7121:
-

[~kszucs] this is a significant blind spot relative to Travis CI. Can we fix 
this before releasing?

> [C++][CI][Windows] Enable more features on the windows GHA build
> 
>
> Key: ARROW-7121
> URL: https://issues.apache.org/jira/browse/ARROW-7121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.16.0
>
>
> Like `ARROW_GANDIVA: ON`, `ARROW_FLIGHT: ON`, `ARROW_PARQUET: ON`





[jira] [Updated] (ARROW-7093) [R] Support creating ScalarExpressions for more data types

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7093:

Fix Version/s: (was: 0.16.0)

> [R] Support creating ScalarExpressions for more data types
> --
>
> Key: ARROW-7093
> URL: https://issues.apache.org/jira/browse/ARROW-7093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Critical
>
> See 
> https://github.com/apache/arrow/blob/master/r/src/expression.cpp#L93-L107. 
> ARROW-6340 was limited to integer/double/logical. This will let us make 
> dataset filter expressions with all those other types.





[jira] [Updated] (ARROW-7075) [C++] Boolean kernels should not allocate in Call()

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7075:

Fix Version/s: (was: 0.16.0)

> [C++] Boolean kernels should not allocate in Call()
> ---
>
> Key: ARROW-7075
> URL: https://issues.apache.org/jira/browse/ARROW-7075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> The boolean kernels currently allocate their value buffers ahead of time but 
> not their null bitmaps.





[jira] [Updated] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7071:

Fix Version/s: (was: 0.16.0)

> [Python] Add Array convenience method to create "masked" view with different 
> validity bitmap
> 
>
> Key: ARROW-7071
> URL: https://issues.apache.org/jira/browse/ARROW-7071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> NB: I'm not sure what kind of pitfalls there might be when replacing an 
> existing validity bitmap and exposing some previously-null values
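The idea, and the pitfall the description mentions, can be sketched in plain Python (a hypothetical toy model where an "array" is a values list plus a validity list, not the real Arrow Array):

```python
# Toy model (hypothetical) of a "masked" view: the new array shares the
# original value buffer and only swaps in a different validity bitmap.
class SimpleArray:
    def __init__(self, values, validity):
        self.values = values      # shared, never copied
        self.validity = validity  # True = valid, False = null

    def with_validity(self, new_validity):
        # The "masked" view: same values, new validity bitmap.
        return SimpleArray(self.values, new_validity)

    def to_pylist(self):
        return [v if ok else None for v, ok in zip(self.values, self.validity)]

arr = SimpleArray([1, 2, 3, 4], [True, True, False, True])
masked = arr.with_validity([True, False, True, True])
print(masked.to_pylist())  # [1, None, 3, 4] -- the previously-null slot 2 is now exposed
```

The last line shows the concern from the description: replacing the bitmap can expose a value that was previously null and whose contents may be undefined.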





[jira] [Updated] (ARROW-7114) [JS][CI] NodeJS build fails on Github Actions Windows node

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7114:

Fix Version/s: (was: 0.16.0)

> [JS][CI] NodeJS build fails on Github Actions Windows node
> --
>
> Key: ARROW-7114
> URL: https://issues.apache.org/jira/browse/ARROW-7114
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, JavaScript
>Reporter: Krisztian Szucs
>Priority: Major
>
> I had an attempt to install 
> [cross-env|https://github.com/apache/arrow/blob/master/.github/workflows/js.yml#L108]
>  as suggested by [~paul.e.taylor] but I guess it requires a bit more work.
> {code:java}
> > NODE_NO_WARNINGS=1 gulp build
> # 'NODE_NO_WARNINGS' is not recognized as an internal or external command,
> # operable program or batch file.
> # npm ERR! code ELIFECYCLE
> # npm ERR! errno 1
> # npm ERR! apache-arrow@1.0.0-SNAPSHOT build: `NODE_NO_WARNINGS=1 gulp build`
> # npm ERR! Exit status 1
> # npm ERR!
> # npm ERR! Failed at the apache-arrow@1.0.0-SNAPSHOT build script.
> # npm ERR! This is probably not a problem with npm. There is likely 
> additional logging output above. {code}





[jira] [Commented] (ARROW-7032) [Release] Verify python wheels in the release verification script

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010204#comment-17010204
 ] 

Wes McKinney commented on ARROW-7032:
-

I'm not sure using virtualenv is practical because we need to use different 
versions of Python. Thoughts?

> [Release] Verify python wheels in the release verification script
> -
>
> Key: ARROW-7032
> URL: https://issues.apache.org/jira/browse/ARROW-7032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.16.0
>
>
> For linux wheels use docker, otherwise setup a virtualenv and install the 
> wheel supported on the host's platform. 
> Testing should include the imports for the optional modules and perhaps 
> running the unit tests, but the import testing should catch most of the wheel 
> issues.





[jira] [Updated] (ARROW-7051) [C++] Improve MakeArrayOfNull to support creation of multiple arrays

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7051:

Fix Version/s: (was: 0.16.0)

> [C++] Improve MakeArrayOfNull to support creation of multiple arrays
> 
>
> Key: ARROW-7051
> URL: https://issues.apache.org/jira/browse/ARROW-7051
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>
> MakeArrayOfNull reuses a single buffer of {{0}} for all buffers in the array 
> it creates. It could be extended to reuse that same buffer for all buffers in 
> multiple arrays. This optimization will make RecordBatchProjector and 
> ConcatenateTablesWithPromotion more memory-efficient.
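The buffer-sharing idea can be sketched in plain Python (a toy model, not the C++ implementation):

```python
# Toy sketch of the proposed optimization: one shared zero-filled buffer
# backs the buffers of several all-null "arrays" at once, instead of a
# fresh zeroed allocation per array.
shared_zeros = bytes(64)  # a single zeroed allocation

def make_null_array(length):
    # Both the validity bitmap and the value buffer point at the same
    # zeros; an all-zero validity bitmap marks every slot null.
    return {"length": length, "validity": shared_zeros, "values": shared_zeros}

a = make_null_array(4)
b = make_null_array(8)
assert a["validity"] is b["validity"]  # memory is shared, not duplicated
```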





[jira] [Updated] (ARROW-6940) [C++] Expose Message-level IPC metadata in both read and write interfaces

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6940:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Expose Message-level IPC metadata in both read and write interfaces
> -
>
> Key: ARROW-6940
> URL: https://issues.apache.org/jira/browse/ARROW-6940
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> the Message flatbuffer type has {{custom_metadata}} but there is no API 
> support for reading and writing values to this field. 





[jira] [Updated] (ARROW-7010) [C++] Support lossy casts from decimal128 to float32 and float64/double

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7010:

Fix Version/s: (was: 0.16.0)

> [C++] Support lossy casts from decimal128 to float32 and float64/double
> ---
>
> Key: ARROW-7010
> URL: https://issues.apache.org/jira/browse/ARROW-7010
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I do not believe such casts are implemented. This can be helpful for people 
> analyzing data where the precision of decimal128 is not needed
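Why such a cast is necessarily lossy can be seen with Python's stdlib decimal type: float64 carries roughly 15-17 significant decimal digits, while decimal128 values can have up to 38 digits of precision.

```python
# Illustration of the precision loss inherent in decimal -> float64 casts,
# using Python's stdlib Decimal as a stand-in for Arrow's decimal128.
from decimal import Decimal

d = Decimal("12345678901234567890.123456789012345678")  # 38 significant digits
f = float(d)  # the cast keeps only ~16 significant digits
assert Decimal(f) != d  # round-tripping does not recover the original value
```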





[jira] [Updated] (ARROW-6982) [R] Add bindings for compare and boolean kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6982:

Fix Version/s: (was: 0.16.0)

> [R] Add bindings for compare and boolean kernels
> 
>
> Key: ARROW-6982
> URL: https://issues.apache.org/jira/browse/ARROW-6982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>
> See cpp/src/arrow/compute/kernels/compare.h and boolean.h. ARROW-6980 
> introduces an Expression class that works on Arrow Arrays, but to evaluate 
> the expressions, it has to pull the data into R first. This would enable us 
> to do the work in C++ and only pull in the result.





[jira] [Updated] (ARROW-6959) [C++] Clarify what signatures are preferred for compute kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6959:

Fix Version/s: (was: 0.16.0)

> [C++] Clarify what signatures are preferred for compute kernels
> ---
>
> Key: ARROW-6959
> URL: https://issues.apache.org/jira/browse/ARROW-6959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>
> Many of the compute kernels feature functions which accept only array inputs 
> in addition to functions which accept Datums. The former seems implicitly 
> like a convenience wrapper around the latter but I don't think this is 
> explicit anywhere. Is there a preferred overload for bindings to use? Is it 
> preferred that C++ implementers provide convenience wrappers for different 
> permutations of argument type? (for example, Filter now provides an overload 
> for record batch input as well as array input)





[jira] [Updated] (ARROW-6978) [R] Add bindings for sum and mean compute kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6978:

Fix Version/s: (was: 0.16.0)

> [R] Add bindings for sum and mean compute kernels
> -
>
> Key: ARROW-6978
> URL: https://issues.apache.org/jira/browse/ARROW-6978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>






[jira] [Updated] (ARROW-6945) [Rust] Enable integration tests

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6945:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] Enable integration tests
> ---
>
> Key: ARROW-6945
> URL: https://issues.apache.org/jira/browse/ARROW-6945
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Use docker-compose to generate test files using the Java implementation and 
> then have Rust tests read them.





[jira] [Commented] (ARROW-6917) [Developer] Implement Python script to generate git cherry-pick commands needed to create patch build branch for maint releases

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010202#comment-17010202
 ] 

Wes McKinney commented on ARROW-6917:
-

Do we want to add this to the repo?

> [Developer] Implement Python script to generate git cherry-pick commands 
> needed to create patch build branch for maint releases
> ---
>
> Key: ARROW-6917
> URL: https://issues.apache.org/jira/browse/ARROW-6917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> For 0.14.1, I maintained this script by hand. It would be less failure-prone 
> (maybe) to generate it based on the fix versions set in JIRA
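A minimal sketch of such a generator (the names and inputs are hypothetical; the real script would query JIRA for issues carrying the maintenance fix version):

```python
# Hypothetical sketch: given commits on master as (sha, issue) pairs in
# commit order, and the set of issues tagged with the maintenance fix
# version in JIRA, emit the cherry-pick commands for the patch branch.
def cherry_pick_commands(commits, maint_issues):
    return [f"git cherry-pick {sha}"
            for sha, issue in commits
            if issue in maint_issues]

cmds = cherry_pick_commands(
    [("abc1234", "ARROW-100"), ("def5678", "ARROW-200"), ("0123abc", "ARROW-300")],
    {"ARROW-200", "ARROW-300"},
)
print(cmds)  # ['git cherry-pick def5678', 'git cherry-pick 0123abc']
```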





[jira] [Commented] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010200#comment-17010200
 ] 

Wes McKinney commented on ARROW-6895:
-

Was this never fixed? I must have gotten sidetracked. Leaving in 0.16.0

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ---
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
> Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet
>
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.
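The reported behavior can be reproduced in a pure-Python simulation (illustrative only, not the C++ reader): a reader that appends each decoded chunk to an internal list and never clears it between batches.

```python
# Pure-Python simulation of the reported bug: the record reader appends
# decoded chunks but never resets the list, so every next_batch() call
# also returns all previously returned chunks.
class BuggyDictionaryRecordReader:
    def __init__(self, values):
        self.values = values
        self.pos = 0
        self.chunks = []  # the bug: never cleared between batches

    def next_batch(self, n):
        self.chunks.append(self.values[self.pos:self.pos + n])
        self.pos += n
        return list(self.chunks)  # a "ChunkedArray" of ALL chunks so far

reader = BuggyDictionaryRecordReader(list(range(6)))
print(reader.next_batch(3))  # [[0, 1, 2]]             -- as expected
print(reader.next_batch(3))  # [[0, 1, 2], [3, 4, 5]]  -- wrong: first chunk repeated
```

The fix sketched in the attached diff amounts to resetting the accumulated chunks at the start of each read.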





[jira] [Updated] (ARROW-6890) [Rust] [Parquet] ArrowReader fails with seg fault

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6890:

Fix Version/s: (was: 0.16.0)

> [Rust] [Parquet] ArrowReader fails with seg fault
> -
>
> Key: ARROW-6890
> URL: https://issues.apache.org/jira/browse/ARROW-6890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Renjie Liu
>Priority: Major
>
> ArrowReader fails with seg fault when trying to read an unsupported type, 
> like Utf8. We should have it return an Err instead of causing a segmentation 
> fault.
>  
> See [https://github.com/apache/arrow/pull/5641] for a reproducible test.





[jira] [Updated] (ARROW-6892) [Rust] [DataFusion] Implement optimizer rule to remove redundant projections

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6892:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Implement optimizer rule to remove redundant projections
> 
>
> Key: ARROW-6892
> URL: https://issues.apache.org/jira/browse/ARROW-6892
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
>
> Currently we have code in the SQL query planner that wraps aggregate queries 
> in a projection (if needed) to preserve the order of the final results. This 
> is needed because the aggregate query execution always returns a result with 
> grouping expressions first and then aggregate expressions.
> It would be better (simpler, more readable code) to always wrap aggregates in 
> projections and have an optimizer rule to remove redundant projections. There 
> are likely other use cases where redundant projections might exist too.
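The proposed rule can be sketched as follows (a toy model with plans as nested dicts, not DataFusion code): a projection is redundant when it selects exactly its child's output columns in order, so the rule replaces it with the child plan.

```python
# Toy sketch (not DataFusion code) of removing a redundant projection
# from a logical plan modeled as nested dicts.
def remove_redundant_projection(plan):
    if plan.get("op") == "projection":
        child = remove_redundant_projection(plan["input"])
        if plan["exprs"] == child.get("schema"):
            return child  # drop the no-op projection
        return {**plan, "input": child}
    return plan

scan = {"op": "scan", "schema": ["a", "b"]}
wrapped = {"op": "projection", "exprs": ["a", "b"], "input": scan}
assert remove_redundant_projection(wrapped) == scan
```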





[jira] [Updated] (ARROW-6875) [Python][Flight] Implement Criteria for ListFlights RPC / list_flights method

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6875:

Fix Version/s: (was: 0.16.0)

> [Python][Flight] Implement Criteria for ListFlights RPC / list_flights method
> -
>
> Key: ARROW-6875
> URL: https://issues.apache.org/jira/browse/ARROW-6875
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We should work through how to pass a custom Criteria to ListFlights





[jira] [Updated] (ARROW-6883) [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in IPC stream writer class

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6883:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in 
> IPC stream writer class
> -
>
> Key: ARROW-6883
> URL: https://issues.apache.org/jira/browse/ARROW-6883
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I didn't see other JIRA issues about this, but this is one significant matter 
> to have complete columnar format coverage in the C++ library.
> This functionality will flow through to the various bindings, so it would be 
> helpful to add unit tests to assert that things work correctly e.g. in Python 
> from an end-user perspective





[jira] [Commented] (ARROW-6841) [C++] Upgrade to LLVM 8

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010198#comment-17010198
 ] 

Wes McKinney commented on ARROW-6841:
-

[~ravindra] thoughts about this?

> [C++] Upgrade to LLVM 8
> ---
>
> Key: ARROW-6841
> URL: https://issues.apache.org/jira/browse/ARROW-6841
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Now that LLVM 9 has been released, LLVM 8 has been promoted to stable 
> according to 
> http://apt.llvm.org/





[jira] [Updated] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6856:

Fix Version/s: (was: 0.16.0)

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}





[jira] [Commented] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010196#comment-17010196
 ] 

Wes McKinney commented on ARROW-6799:
-

If it's not being maintained then I agree we should delete it

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575





[jira] [Commented] (ARROW-6821) [C++][Parquet] Do not require Thrift compiler when building (but still require library)

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010197#comment-17010197
 ] 

Wes McKinney commented on ARROW-6821:
-

cc [~npr]

> [C++][Parquet] Do not require Thrift compiler when building (but still 
> require library)
> ---
>
> Key: ARROW-6821
> URL: https://issues.apache.org/jira/browse/ARROW-6821
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Building Thrift from source carries extra toolchain dependencies (bison and 
> flex). If we check in the files produced by compiling parquet.thrift, then 
> the EP can be simplified to only build the Thrift C++ library and not the 
> compiler. This also results in a simpler build for third parties





[jira] [Updated] (ARROW-6800) [C++] Add CMake option to build libraries targeting a C++14 or C++17 toolchain environment

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6800:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Add CMake option to build libraries targeting a C++14 or C++17 
> toolchain environment
> --
>
> Key: ARROW-6800
> URL: https://issues.apache.org/jira/browse/ARROW-6800
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Such option would cause public APIs involving e.g. {{string_view}} to use the 
> STL versions rather than our vendored backports





[jira] [Commented] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010195#comment-17010195
 ] 

Wes McKinney commented on ARROW-6788:
-

[~kszucs] can this be closed? is test_merge_arrow_pr.py being run now?

> [CI] Migrate Travis CI lint job to GitHub Actions
> -
>
> Key: ARROW-6788
> URL: https://issues.apache.org/jira/browse/ARROW-6788
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or 
> less immediately so this will give more prompt feedback to contributors





[jira] [Updated] (ARROW-6783) [C++] Provide API for reconstruction of RecordBatch from Flatbuffer containing process memory addresses instead of relative offsets into an IPC message

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6783:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Provide API for reconstruction of RecordBatch from Flatbuffer 
> containing process memory addresses instead of relative offsets into an IPC 
> message
> ---
>
> Key: ARROW-6783
> URL: https://issues.apache.org/jira/browse/ARROW-6783
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A lot of our development has focused on _inter_process communication rather 
> than _in_process. We should start by making sure we have disassembly and 
> reassembly implemented where the Buffer Flatbuffers values contain process 
> memory addresses rather than offsets. This may require a bit of refactoring 
> so we can use the same reassembly code path for both use cases





[jira] [Commented] (ARROW-6759) [JS] Run less comprehensive every-commit build, relegate multi-target builds perhaps to nightlies

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010194#comment-17010194
 ] 

Wes McKinney commented on ARROW-6759:
-

GHA is taking about 25 minutes, but since it runs more promptly there seems to 
be less urgency to address this for now

> [JS] Run less comprehensive every-commit build, relegate multi-target builds 
> perhaps to nightlies
> -
>
> Key: ARROW-6759
> URL: https://issues.apache.org/jira/browse/ARROW-6759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>
> The JavaScript CI build is taking 25-30 minutes nowadays. This could be 
> abbreviated by testing fewer deployment targets. We obviously still need to 
> test all the deployment targets but we could do that nightly instead of on 
> every commit





[jira] [Updated] (ARROW-6753) [Release] Document environment configuration to run release verification on macOS

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6753:

Fix Version/s: (was: 0.16.0)

> [Release] Document environment configuration to run release verification on 
> macOS
> -
>
> Key: ARROW-6753
> URL: https://issues.apache.org/jira/browse/ARROW-6753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
>
> Since I don't use macOS as a primary OS I don't have all-the-things set up 
> for usual Arrow development. A guide for Homebrew users to be able to run the 
> release verification starting from very little pre-installed beyond Xcode 
> would be nice





[jira] [Updated] (ARROW-6720) [JAVA][C++]Support Parquet Read and Write in Java

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6720:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [JAVA][C++]Support Parquet Read and Write in Java
> -
>
> Key: ARROW-6720
> URL: https://issues.apache.org/jira/browse/ARROW-6720
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Affects Versions: 0.15.0
>Reporter: Chendi.Xue
>Assignee: Chendi.Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 37h 10m
>  Remaining Estimate: 0h
>
> We added a new Java interface to support Parquet read and write from HDFS or 
> local files.
> The motivation is that when loading and dumping Parquet data in Java, we can 
> currently only use row-based put and get methods. Since Arrow already has a 
> C++ implementation to load and dump Parquet, we wrapped that code as Java 
> APIs.
> In our tests we noticed that, for our workload, performance improved more 
> than 2x compared with row-based load and dump, so we want to contribute the 
> code to Arrow.
> Since this is a completely independent change, no existing Arrow code is 
> modified. We added two folders as listed: java/adapter/parquet and 
> cpp/src/jni/parquet





[jira] [Updated] (ARROW-6697) [Rust] [DataFusion] Validate that all parquet partitions have the same schema

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6697:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Validate that all parquet partitions have the same schema
> -
>
> Key: ARROW-6697
> URL: https://issues.apache.org/jira/browse/ARROW-6697
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> When reading a partitioned Parquet file in DataFusion, the schema is read 
> from the first partition and it is assumed that all other partitions have the 
> same schema.
> It would be better to actually validate that all of the partitions have the 
> same schema since there is no support for schema merging yet.
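The proposed check can be sketched in plain Python. The pair-of-tuples schema model and the function name below are illustrative assumptions, not the DataFusion API, which would compare the actual parquet schema types:

```python
# Hypothetical sketch of the proposed validation: until schema merging
# exists, every partition's schema must match the first partition's
# schema exactly. Schemas are modeled here as tuples of
# (column_name, type_name) pairs.

def validate_partition_schemas(partition_schemas):
    """Raise ValueError if any partition schema differs from the first."""
    if not partition_schemas:
        return
    reference = partition_schemas[0]
    for i, schema in enumerate(partition_schemas[1:], start=1):
        if schema != reference:
            raise ValueError(
                f"partition {i} schema {schema!r} does not match "
                f"partition 0 schema {reference!r}"
            )

# Matching partitions pass silently; any mismatch raises.
validate_partition_schemas([
    (("id", "INT64"), ("name", "BYTE_ARRAY")),
    (("id", "INT64"), ("name", "BYTE_ARRAY")),
])
```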





[jira] [Updated] (ARROW-6699) [C++] Add Parquet docs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6699:

Fix Version/s: (was: 0.16.0)

> [C++] Add Parquet docs
> --
>
> Key: ARROW-6699
> URL: https://issues.apache.org/jira/browse/ARROW-6699
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> There is currently zero Sphinx doc for Parquet. I'm adding a stub in 
> ARROW-6630 but we should do more, especially as Arrow benefits from tight 
> integration with Parquet.





[jira] [Updated] (ARROW-6738) [Java] Fix problems with current union comparison logic

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6738:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Fix problems with current union comparison logic
> ---
>
> Key: ARROW-6738
> URL: https://issues.apache.org/jira/browse/ARROW-6738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> There are some problems with the current union comparison logic. For example:
> 1. For the type check, we should not require fields to be equal. It is 
> possible that two vectors' value ranges are equal even though their fields 
> differ.
> 2. We should not compare the number of sub-vectors, as it is possible that 
> two union vectors have different numbers of sub-vectors but equal values in 
> the range.





[jira] [Updated] (ARROW-6759) [JS] Run less comprehensive every-commit build, relegate multi-target builds perhaps to nightlies

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6759:

Fix Version/s: (was: 0.16.0)

> [JS] Run less comprehensive every-commit build, relegate multi-target builds 
> perhaps to nightlies
> -
>
> Key: ARROW-6759
> URL: https://issues.apache.org/jira/browse/ARROW-6759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>
> The JavaScript CI build is taking 25-30 minutes nowadays. This could be 
> shortened by testing fewer deployment targets on every commit. We still need 
> to test all the deployment targets, but we could do that nightly instead.





[jira] [Updated] (ARROW-6691) [Rust] [DataFusion] Use tokio and Futures instead of spawning threads

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6691:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] [DataFusion] Use tokio and Futures instead of spawning threads
> -
>
> Key: ARROW-6691
> URL: https://issues.apache.org/jira/browse/ARROW-6691
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: image-2019-12-07-17-54-57-862.png
>
>
> The current implementation of the physical query plan uses "thread::spawn", 
> which is expensive. We should switch to Futures, async!/await!, and tokio so 
> that tasks are launched in a thread pool instead, writing idiomatic Rust 
> code with futures combinators to chain actions together.





[jira] [Updated] (ARROW-6689) [Rust] [DataFusion] Query execution enhancements for 1.0.0 release

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6689:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] [DataFusion] Query execution enhancements for 1.0.0 release
> --
>
> Key: ARROW-6689
> URL: https://issues.apache.org/jira/browse/ARROW-6689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There a number of optimizations that can be made to the new query execution 
> and this is a top level story to track them all.





[jira] [Updated] (ARROW-6680) [Python] Add Array ctor microbenchmarks

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6680:

Fix Version/s: (was: 0.16.0)

> [Python] Add Array ctor microbenchmarks
> ---
>
> Key: ARROW-6680
> URL: https://issues.apache.org/jira/browse/ARROW-6680
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Microbenchmarks would help track the cost of the additional unavoidable 
> validation being added in, e.g., https://github.com/apache/arrow/pull/5488





[jira] [Updated] (ARROW-6076) [C++][Parquet] RecordReader::Reset logic is inefficient for small reads

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6076:

Fix Version/s: (was: 0.16.0)

> [C++][Parquet] RecordReader::Reset logic is inefficient for small reads
> ---
>
> Key: ARROW-6076
> URL: https://issues.apache.org/jira/browse/ARROW-6076
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have a unit test 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L933
> that reads 1 record at a time from a Parquet-Arrow column reader. There is 
> logic on RecordReader that advances the definition/repetition levels based on 
> consumed data from previous records, but this is inefficient for this case:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1011
> This should be refactored to not require this copying, or at least to only 
> "shift" the levels occasionally 





[jira] [Updated] (ARROW-6103) [Java] Do we really want to use the maven release plugin?

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6103:

Fix Version/s: (was: 0.16.0)

> [Java] Do we really want to use the maven release plugin?
> -
>
> Key: ARROW-6103
> URL: https://issues.apache.org/jira/browse/ARROW-6103
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Java
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> For reference .. I'm filing this issue to track investigation work around 
> this ..
> {code:java}
> The biggest problem for the Git commit is our Java package
> requires "apache-arrow-${VERSION}" tag on
> https://github.com/apache/arrow . (Right?)
> I think that "mvn release:perform" in
> dev/release/01-perform.sh does so but I don't know the
> details of "mvn release:perform"...{code}





[jira] [Updated] (ARROW-6071) [C++] Implement casting Binary <-> LargeBinary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6071:

Fix Version/s: (was: 0.16.0)

> [C++] Implement casting Binary <-> LargeBinary
> --
>
> Key: ARROW-6071
> URL: https://issues.apache.org/jira/browse/ARROW-6071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> We should implement bidirectional casts between Binary and LargeBinary, and 
> likewise between String and LargeString.
> In the narrowing direction, the offset width should be checked.
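The narrowing check can be sketched outside of Arrow. This is an illustration of the idea, not the Arrow C++ cast kernel: LargeBinary stores 64-bit value offsets, Binary stores 32-bit ones, so before narrowing every offset must fit in a signed 32-bit integer.

```python
# Sketch of the narrowing check for a LargeBinary -> Binary cast.
# Offsets are monotonically non-decreasing, so checking the last
# offset is sufficient to detect overflow.

INT32_MAX = 2**31 - 1

def narrow_offsets(offsets64):
    """Convert 64-bit offsets to 32-bit, failing if any would overflow."""
    if offsets64 and offsets64[-1] > INT32_MAX:
        raise OverflowError("offsets exceed int32 range; cast would truncate")
    return [int(o) for o in offsets64]

# Small offsets narrow without error.
print(narrow_offsets([0, 3, 7, 12]))
```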





[jira] [Updated] (ARROW-6072) [C++] Implement casting List <-> LargeList

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6072:

Fix Version/s: (was: 0.16.0)

> [C++] Implement casting List <-> LargeList
> --
>
> Key: ARROW-6072
> URL: https://issues.apache.org/jira/browse/ARROW-6072
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> We should implement bidirectional casts from List to LargeList and vice-versa.
> In the narrowing direction, the offset width should be checked.





[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010186#comment-17010186
 ] 

Wes McKinney commented on ARROW-6055:
-

Where do things stand on this?

> [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
> ---
>
> Key: ARROW-6055
> URL: https://issues.apache.org/jira/browse/ARROW-6055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> As part of this refactor, the FileSystem-related classes in 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 
> should be removed. The files should probably be moved also to arrow/filesystem





[jira] [Updated] (ARROW-6064) [FlightRPC] [C++] Clean up IWYU

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6064:

Fix Version/s: (was: 0.16.0)

> [FlightRPC] [C++] Clean up IWYU
> ---
>
> Key: ARROW-6064
> URL: https://issues.apache.org/jira/browse/ARROW-6064
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Priority: Major
>
> As reported by Wes 
> https://gist.github.com/wesm/af59c7cc8f35c6fd806b0d041b816da8





[jira] [Updated] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6052:

Fix Version/s: (was: 0.16.0)

> [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to 
> builder files
> 
>
> Key: ARROW-6052
> URL: https://issues.apache.org/jira/browse/ARROW-6052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Since these files are getting larger, this would improve codebase 
> navigability. Probably should use the same naming scheme as builder_* e.g. 
> {{arrow/array/array_dict.h}}
> I recommend also putting the unit test files related to these in there for 
> better semantic organization. 





[jira] [Closed] (ARROW-5982) [C++] Add methods to append dictionary values and dictionary indices directly into DictionaryBuilder

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5982.
---
Fix Version/s: (was: 0.16.0)
   Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/38b01764da445ce6383b60a50d1e9b313857a3d7#diff-ce752fd9d1926a96cbc426de3d32d3ca

> [C++] Add methods to append dictionary values and dictionary indices directly 
> into DictionaryBuilder
> 
>
> Key: ARROW-5982
> URL: https://issues.apache.org/jira/browse/ARROW-5982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> In scenarios where a developer already has an array of dictionary indices 
> that reference a known dictionary, it is useful to be able to insert the 
> indices directly, circumventing the hash table lookup. The developer is then 
> responsible for keeping the indices and dictionary consistent.
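The two append paths can be contrasted with a toy model. The class and method names below are illustrative assumptions, not the Arrow C++ `DictionaryBuilder` API:

```python
# An illustrative model of the proposal: the normal path hashes each
# value to find its dictionary index, while the proposed fast path
# accepts indices directly and skips the hash table entirely.

class DictBuilder:
    def __init__(self, dictionary):
        self.dictionary = list(dictionary)
        # Hash table mapping value -> index, used by the slow path.
        self.lookup = {v: i for i, v in enumerate(self.dictionary)}
        self.indices = []

    def append_value(self, value):
        # Normal path: one hash-table lookup per value.
        self.indices.append(self.lookup[value])

    def append_indices(self, indices):
        # Proposed fast path: the caller already holds valid indices,
        # so insert them directly; consistency is the caller's problem.
        self.indices.extend(indices)

b = DictBuilder(["red", "green", "blue"])
b.append_value("blue")        # hash lookup -> index 2
b.append_indices([0, 2, 1])   # direct insertion, no lookup
print(b.indices)              # [2, 0, 2, 1]
```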





[jira] [Updated] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7501:
---
Fix Version/s: (was: 0.16.0)

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.





[jira] [Updated] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7265:
---
Component/s: Format
 C++

> [Format][C++] Clarify the usage of typeIds in Union type documentation
> --
>
> Key: ARROW-7265
> URL: https://issues.apache.org/jira/browse/ARROW-7265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.16.0
>
>
> The documentation is unclear.





[jira] [Updated] (ARROW-7503) [Rust] Rust builds are failing on master

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7503:
---
Priority: Blocker  (was: Major)

> [Rust] Rust builds are failing on master
> 
>
> Key: ARROW-7503
> URL: https://issues.apache.org/jira/browse/ARROW-7503
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 0.16.0
>
>
> See [https://github.com/apache/arrow/runs/374130594#step:5:1506] for example:
> {code}
> ...
---- schema::types::tests::test_schema_type_thrift_conversion_err stdout ----
> thread 'schema::types::tests::test_schema_type_thrift_conversion_err' 
> panicked at 'assertion failed: `(left == right)`
>   left: `"description() is deprecated; use Display"`,
>  right: `"Root schema must be Group type"`', 
> parquet/src/schema/types.rs:1760:13
> failures:
> 
> column::writer::tests::test_column_writer_error_when_writing_disabled_dictionary
> column::writer::tests::test_column_writer_inconsistent_def_rep_length
> column::writer::tests::test_column_writer_invalid_def_levels
> column::writer::tests::test_column_writer_invalid_rep_levels
> column::writer::tests::test_column_writer_not_enough_values_to_write
> file::writer::tests::test_file_writer_error_after_close
> file::writer::tests::test_row_group_writer_error_after_close
> file::writer::tests::test_row_group_writer_error_not_all_columns_written
> file::writer::tests::test_row_group_writer_num_records_mismatch
> schema::types::tests::test_primitive_type
> schema::types::tests::test_schema_type_thrift_conversion_err
> test result: FAILED. 325 passed; 11 failed; 0 ignored; 0 measured; 0 filtered 
> out
> {code}





[jira] [Updated] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5972:

Fix Version/s: (was: 0.16.0)

> [Rust] Installing cargo-tarpaulin and generating coverage report takes over 
> 20 minutes
> --
>
> Key: ARROW-5972
> URL: https://issues.apache.org/jira/browse/ARROW-5972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Major
>
> See example build:
> https://travis-ci.org/apache/arrow/jobs/558986931
> Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
> takes another 7m40s. 
> Given the Travis CI build queue issues we're having, this might be worth 
> optimizing or moving to Docker/Buildbot





[jira] [Updated] (ARROW-5981) [C++] DictionaryBuilder initialization with Array can fail silently

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5981:

Fix Version/s: (was: 0.16.0)

> [C++] DictionaryBuilder initialization with Array can fail silently
> --
>
> Key: ARROW-5981
> URL: https://issues.apache.org/jira/browse/ARROW-5981
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_dict.cc#L267
> I think it would be better to expose {{InsertValues}} on 
> {{DictionaryBuilder}} and initialize from a known dictionary that way





[jira] [Commented] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010179#comment-17010179
 ] 

Wes McKinney commented on ARROW-5972:
-

We aren't running the coverage report anymore. Changing to some kind of 
nightly build might be a good option.

> [Rust] Installing cargo-tarpaulin and generating coverage report takes over 
> 20 minutes
> --
>
> Key: ARROW-5972
> URL: https://issues.apache.org/jira/browse/ARROW-5972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Major
>
> See example build:
> https://travis-ci.org/apache/arrow/jobs/558986931
> Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
> takes another 7m40s. 
> Given the Travis CI build queue issues we're having, this might be worth 
> optimizing or moving to Docker/Buildbot





[jira] [Updated] (ARROW-6537) [R] Pass column_types to CSV reader

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6537:
---
Fix Version/s: (was: 0.16.0)

> [R] Pass column_types to CSV reader
> ---
>
> Key: ARROW-6537
> URL: https://issues.apache.org/jira/browse/ARROW-6537
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: csv, dataset
>
> See also ARROW-6536. 





[jira] [Updated] (ARROW-6543) [R] Support LargeBinary and LargeString types

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6543:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [R] Support LargeBinary and LargeString types
> -
>
> Key: ARROW-6543
> URL: https://issues.apache.org/jira/browse/ARROW-6543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> See ARROW-750





[jira] [Assigned] (ARROW-6537) [R] Pass column_types to CSV reader

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6537:
--

Assignee: (was: Neal Richardson)

> [R] Pass column_types to CSV reader
> ---
>
> Key: ARROW-6537
> URL: https://issues.apache.org/jira/browse/ARROW-6537
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: csv, dataset
> Fix For: 0.16.0
>
>
> See also ARROW-6536. 





[jira] [Updated] (ARROW-5954) [Developer][Documentation] Organize source and binary dependency licenses into directories

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5954:

Fix Version/s: (was: 0.16.0)

> [Developer][Documentation] Organize source and binary dependency licenses 
> into directories
> --
>
> Key: ARROW-5954
> URL: https://issues.apache.org/jira/browse/ARROW-5954
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>
> Similar to what Spark does; see this comment: 
> https://github.com/apache/arrow/pull/4880/files/b839964a2a43123991b5b291607ff1cb026fe8a4#diff-61e0bdf7e1b43c5c93d9488b22e04170





[jira] [Updated] (ARROW-5931) [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5931:

Fix Version/s: (was: 0.16.0)

> [C++] Extend extension types facility to provide for serialization and 
> deserialization in IPC roundtrips
> 
>
> Key: ARROW-5931
> URL: https://issues.apache.org/jira/browse/ARROW-5931
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> A use case here is when an array needs to reference some external data. For 
> example, suppose that we wanted to implement an array that references a 
> sequence of Python objects as {{PyObject*}}. Obviously, a {{PyObject*}} must 
> be managed by the Python interpreter.
> For a vector of some {{T*}} to be sent through the IPC machinery, it must be 
> embedded in some Arrow type on the wire. For example, the memory-resident 
> representation of {{PyObject*}} might be 8 bytes per value (one pointer per 
> value), but when serialized to the binary IPC protocol such {{PyObject*}} 
> values must be encoded as an Arrow Binary type.





[jira] [Updated] (ARROW-5928) [JS] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5928:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [JS] Test fuzzer inputs
> ---
>
> Key: ARROW-5928
> URL: https://issues.apache.org/jira/browse/ARROW-5928
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The JavaScript implementation should also test against these to verify that 
> the correct kind of exception is raised





[jira] [Updated] (ARROW-5933) [C++] [Documentation] add discussion of Union.typeIds to Layout.rst

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5933:

Fix Version/s: (was: 0.16.0)

> [C++] [Documentation] add discussion of Union.typeIds to Layout.rst 
> 
>
> Key: ARROW-5933
> URL: https://issues.apache.org/jira/browse/ARROW-5933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Union.typeIds is poorly documented and the corresponding property in 
> UnionType is confusingly named type_codes. In particular, Layout.rst doesn't 
> include an explanation of Union.typeIds and implies that an element of a 
> union array's type_ids buffer is always the index of a child array.





[jira] [Updated] (ARROW-5950) [Rust] [DataFusion] Add logger dependency

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5950:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Add logger dependency
> -
>
> Key: ARROW-5950
> URL: https://issues.apache.org/jira/browse/ARROW-5950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
>
> It would be nice to be able to turn on debug logging at runtime and see how 
> query plans are built and optimized. I propose adding a dependency on the 
> log crate.





[jira] [Updated] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5915:

Fix Version/s: (was: 0.16.0)

> [C++] [Python] Set up testing for backwards compatibility of the parquet 
> reader
> ---
>
> Key: ARROW-5915
> URL: https://issues.apache.org/jira/browse/ARROW-5915
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet
>
> Given the recent parquet compat problems, we should have better testing for 
> this.
> For easy testing of backwards compatibility, we could add some files (with 
> different types) written with older versions, and ensure they are read 
> correctly with the current version.
> Similarly as what Kartothek is doing: 
> https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat
> An easy way would be to do that in pyarrow and add them to 
> /pyarrow/tests/data/parquet (we already have some files from 0.7 there). 
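A minimal harness for such a layout could look as follows. The directory structure (one subdirectory per writer version, as in the kartothek reference data) and the function name are assumptions for illustration; the real test would read each file with the current pyarrow and compare against expected data:

```python
# Sketch of collecting version-tagged reference files for a parquet
# backwards-compatibility test suite, e.g. root/0.7.0/ints.parquet.
import pathlib

def collect_reference_files(root):
    """Map writer version -> sorted parquet files under root/<version>/."""
    root = pathlib.Path(root)
    return {
        d.name: sorted(d.glob("*.parquet"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```

The compat test would then iterate over the mapping, read each file with the current reader, and assert the result matches the data the old version was asked to write.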





[jira] [Commented] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010175#comment-17010175
 ] 

Wes McKinney commented on ARROW-5914:
-

[~fsaintjacques] is this still an issue?

> [CI] Build bundled dependencies in docker build step
> 
>
> Key: ARROW-5914
> URL: https://issues.apache.org/jira/browse/ARROW-5914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 0.16.0
>
>
> In the recently introduced ARROW-5803, some heavy dependencies (thrift, 
> protobuf, flatbuffers, grpc) are built at each invocation of docker-compose 
> build (thus on each Travis test).
> We should aim to build the third-party dependencies in the docker build 
> phase instead, to exploit caching and docker-compose pull, so that the CI 
> step doesn't need to build these dependencies each time.





[jira] [Resolved] (ARROW-7500) [C++][Dataset] regex_error in hive partition on centos7 and opensuse42

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7500.

Resolution: Fixed

Issue resolved by pull request 6137
[https://github.com/apache/arrow/pull/6137]

> [C++][Dataset] regex_error in hive partition on centos7 and opensuse42
> --
>
> Key: ARROW-7500
> URL: https://issues.apache.org/jira/browse/ARROW-7500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/runs/373769666#step:5:3301] and 
> [https://github.com/apache/arrow/runs/373769676#step:5:3297]:
>  {code}
> ══ Failed 
> ══
> ── 1. Error: Hive partitioning (@test-dataset.R#89)  
> ───
> regex_error
> Backtrace:
>   1. arrow::open_dataset(...) testthat/test-dataset.R:89:2
>  12. dsd$Finish(schema)
>  15. arrow:::dataset___DSDiscovery__Finish2(self, schema)
> {code}
>  





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5916:

Fix Version/s: (was: 0.16.0)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: test.arrow_ipc
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> // out->length = node->length();    // previous behavior
> out->length = metadata_->length();  // proposed change
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
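The proposed reader semantics can be sketched in plain Python. This is an illustration of the truncation rule only, not the pyarrow or Arrow C++ API; the function and names are hypothetical:

```python
# Sketch of the proposed semantics: a record batch may declare a length
# smaller than its arrays' lengths, and readers expose only that prefix.
def read_columns(batch_length, arrays):
    """Return each column truncated to the declared batch length."""
    for values in arrays:
        if batch_length > len(values):
            raise ValueError("RecordBatch.length exceeds array length")
    return [values[:batch_length] for values in arrays]

# A partially populated batch: arrays hold 3 slots, but only 1 row is valid.
columns = read_columns(1, [[10, 20, 30], ["a", "b", "c"]])
print(columns)  # [[10], ['a']]
```

The attached test file corresponds to the second case here: batch length 1 over arrays of length 3.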



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5927) [Go] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5927:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Go] Test fuzzer inputs
> ---
>
> Key: ARROW-5927
> URL: https://issues.apache.org/jira/browse/ARROW-5927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Go implementation should also test against these to verify that the 
> correct kind of exception is raised



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5926) [Java] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5926:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Test fuzzer inputs
> -
>
> Key: ARROW-5926
> URL: https://issues.apache.org/jira/browse/ARROW-5926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Java implementation should also test against these to verify that the 
> correct kind of exception is raised



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5890:

Fix Version/s: (was: 0.16.0)

> [C++][Python] Support ExtensionType arrays in more kernels
> --
>
> Key: ARROW-5890
> URL: https://issues.apache.org/jira/browse/ARROW-5890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From a quick test (through Python), it seems that {{slice}} and {{take}} 
> work, but the following do not:
> - {{cast}}: it could rely on the casting rules for the storage type. Or do we 
> want to require explicitly taking the storage array before casting?
> - {{dictionary_encode}} / {{unique}}
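The "cast via storage" option can be sketched in plain Python. The classes here are hypothetical stand-ins, not the pyarrow API: the point is only that an extension array delegates casting to its underlying storage values.

```python
# Hypothetical sketch: an extension array that reuses its storage's cast rules.
class ExtensionArray:
    def __init__(self, storage):
        self.storage = storage  # plain list standing in for the storage array

    def cast(self, target):
        # Delegate to the storage type's casting rules rather than
        # requiring the caller to unwrap the storage array first.
        return [target(v) for v in self.storage]

arr = ExtensionArray([1, 2, 3])
print(arr.cast(float))  # [1.0, 2.0, 3.0]
```

The open question in the issue is whether this delegation should be implicit (as above) or whether users should have to take `.storage` explicitly before casting.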



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5912) [Python] conversion from datetime objects with mixed timezones should normalize to UTC

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5912:

Fix Version/s: (was: 0.16.0)

> [Python] conversion from datetime objects with mixed timezones should 
> normalize to UTC
> --
>
> Key: ARROW-5912
> URL: https://issues.apache.org/jira/browse/ARROW-5912
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: beginner
>
> Currently, when given datetime objects with mixed timezones, each is 
> interpreted separately as its local time:
> {code:python}
> >>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
> >>> ts_pd_paris
> Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris')
> >>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
> >>> ts_pd_helsinki
> Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki')
> >>> a = pa.array([ts_pd_paris, ts_pd_helsinki])
> >>> a
> 
> [
>   1970-01-01 01:00:00.00,
>   1970-01-01 02:00:00.00
> ]
> >>> a.type
> TimestampType(timestamp[us])
> {code}
> So both timestamps actually refer to the same moment in time (the same value 
> in UTC; in pandas their stored {{value}} is also the same), but once converted 
> to pyarrow they become tz-naive and no longer denote the same time. That seems 
> rather unexpected and a source of bugs.
> I think a better option would be to normalize to UTC, and result in a 
> tz-aware TimestampArray with UTC as timezone. 
> That is also the behaviour of pandas if you force the conversion to result in 
> datetimes (by default pandas will keep them as an object array, preserving the 
> different timezones).
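The problem and the proposed fix can be illustrated with the standard library alone (fixed UTC offsets stand in for Europe/Paris and Europe/Helsinki on 1970-01-01; this is not the pyarrow conversion code itself):

```python
from datetime import datetime, timedelta, timezone

# Two wall-clock times in different timezones that denote the same instant.
paris = datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1)))
helsinki = datetime(1970, 1, 1, 2, 0, tzinfo=timezone(timedelta(hours=2)))

# Dropping the tzinfo (what the current conversion effectively does) makes
# two equal instants look different:
assert paris.replace(tzinfo=None) != helsinki.replace(tzinfo=None)

# Normalizing to UTC first (the proposed behaviour) keeps them equal:
paris_utc = paris.astimezone(timezone.utc)
helsinki_utc = helsinki.astimezone(timezone.utc)
assert paris_utc == helsinki_utc == datetime(1970, 1, 1, tzinfo=timezone.utc)
```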



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5879) [C++][Python] Clean up linking of optional libraries within C++ and to Python extensions

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5879.
---
Fix Version/s: (was: 0.16.0)
   Resolution: Duplicate

This was done in 
https://github.com/apache/arrow/commit/102acc47287c37a01ac11a5cb6bd1da3f1f0712d#diff-79b695dff65b8b0a69bfed14e824cb18

> [C++][Python] Clean up linking of optional libraries within C++ and to Python 
> extensions
> 
>
> Key: ARROW-5879
> URL: https://issues.apache.org/jira/browse/ARROW-5879
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> Optional modules such as
> * Flight (and its dependents, including OpenSSL)
> * Parquet
> * Gandiva
> are all linked unconditionally to {{pyarrow.lib}}. It would be better IMHO to 
> only link these libraries to the corresponding Cython extension rather than 
> link everything to every extension.
> Relatedly, libraries like OpenSSL are being included in linking with all 
> shared libraries. We should clean this up to only link to the relevant shared 
> libraries where it is required, like {{libparquet}} (for encryption support) 
> and {{libarrow_flight}} (for using gRPC with TLS)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5845:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Implement converter between Arrow record batches and Avro records
> 
>
> Key: ARROW-5845
> URL: https://issues.apache.org/jira/browse/ARROW-5845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> It would be useful for applications which need to convert Avro data to Arrow 
> data.
> This is an adapter which converts data through an existing API (like the JDBC 
> adapter) rather than a native reader (like ORC).
> We implement this function through the Avro Java project, receiving params such 
> as Avro's Decoder/Schema/DatumReader and returning a VectorSchemaRoot. For each 
> data type we have a consumer class, as below, that reads Avro data and writes 
> it into a vector to avoid boxing/unboxing (e.g. GenericRecord#get returns 
> Object):
> {code:java}
> public class AvroIntConsumer implements Consumer {
>
>   private final IntWriter writer;
>
>   public AvroIntConsumer(IntVector vector) {
>     this.writer = new IntWriterImpl(vector);
>   }
>
>   @Override
>   public void consume(Decoder decoder) throws IOException {
>     writer.writeInt(decoder.readInt());
>     writer.setPosition(writer.getPosition() + 1);
>   }
> }
> {code}
> We intend to support primitive and complex types (null values represented via 
> a union type containing a null type); a size limit and field selection could 
> be optional for users.
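The per-type consumer pattern described above can be sketched language-neutrally; the following is a Python illustration only (the real implementation is Java, and these class names are hypothetical stand-ins for Avro's Decoder and Arrow's vectors):

```python
# Sketch of the consumer pattern: one consumer per data type pulls typed
# values straight from a decoder into a vector, avoiding boxing each value
# as a generic Object.
class IntConsumer:
    """Reads one int per call from a decoder and appends it to a vector."""
    def __init__(self, vector):
        self.vector = vector

    def consume(self, decoder):
        self.vector.append(decoder.read_int())


class ListDecoder:
    """Stand-in for Avro's Decoder, yielding pre-parsed ints."""
    def __init__(self, values):
        self._values = iter(values)

    def read_int(self):
        return next(self._values)


vector = []
consumer = IntConsumer(vector)
decoder = ListDecoder([1, 2, 3])
for _ in range(3):   # one consume() call per incoming record
    consumer.consume(decoder)
print(vector)  # [1, 2, 3]
```

In the Java adapter, a schema-driven dispatch would pick the matching consumer (int, string, list, ...) for each field.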



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5858) [Doc] Better document the Tensor classes in the prose documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5858:

Fix Version/s: (was: 0.16.0)

> [Doc] Better document the Tensor classes in the prose documentation
> ---
>
> Key: ARROW-5858
> URL: https://issues.apache.org/jira/browse/ARROW-5858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From a comment from [~wesmckinn] in ARROW-2714:
> {quote}The Tensor classes are independent from the columnar data structures, 
> though they reuse pieces of metadata, metadata serialization, memory 
> management, and IPC.
> The purpose of adding these to the library was to have in-memory data 
> structures for handling Tensor/ndarray data and metadata that "plug in" to 
> the rest of the Arrow C++ system (Plasma store, IO subsystem, memory pools, 
> buffers, etc.).
> Theoretically you could return a Tensor when creating a non-contiguous slice 
> of an Array; in light of the above, I don't think that would be intuitive.
> When we started the project, our focus was creating an open standard for 
> in-memory columnar data, a hitherto unsolved problem. The project's scope has 
> expanded into peripheral problems in the same domain in the meantime (with 
> the mantra of creating interoperable components, a use-what-you-need 
> development platform for system developers). I think this aspect of the 
> project could be better documented / advertised, since the project's initial 
> focus on the columnar standard has given some the mistaken impression that we 
> are not interested in any work outside of that.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5764) [Java] Failed to build document with OpenJDK 11

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5764:

Fix Version/s: (was: 0.16.0)

> [Java] Failed to build document with OpenJDK 11
> ---
>
> Key: ARROW-5764
> URL: https://issues.apache.org/jira/browse/ARROW-5764
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Kouhei Sutou
>Priority: Major
>
> It reports the following error:
> {noformat}
> [ERROR] Exit code: 1 - javadoc: error - The code being documented uses 
> modules but the packages defined in http://docs.oracle.com/javase/8/docs/api/ 
> are in the unnamed module.
> {noformat}
> See also: https://travis-ci.org/kou/arrow/jobs/551254733#L1453
> This branch just enables Javadoc with OpenJDK 11: 
> https://github.com/kou/arrow/commit/1eeded4b9d18d474721733751f57392cee766004.diff
> {noformat}
> diff --git a/.travis.yml b/.travis.yml
> index 5dc901561e8..1d6ba86dc2d 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -225,6 +225,7 @@ matrix:
>  - if [ $ARROW_CI_JAVA_AFFECTED != "1" ]; then exit; fi
>  script:
>  - $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
> +- $TRAVIS_BUILD_DIR/ci/travis_script_javadoc.sh
>- name: "Integration w/ OpenJDK 8, conda-forge toolchain"
>  language: java
>  os: linux
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

