[jira] [Resolved] (ARROW-8634) [Java] Create an example
[ https://issues.apache.org/jira/browse/ARROW-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8634. Resolution: Fixed Issue resolved by pull request 7066 [https://github.com/apache/arrow/pull/7066] > [Java] Create an example > > > Key: ARROW-8634 > URL: https://issues.apache.org/jira/browse/ARROW-8634 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The Java implementation doesn't seem to have any documentation or examples on > how to get started with basic operations such as creating an array. Javadocs > exist but how do new users even know which class to look for? > I would like to create an examples module and one simple example as a > starting point. I hope to have a PR soon. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost
[ https://issues.apache.org/jira/browse/ARROW-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8660: -- Labels: pull-request-available (was: ) > [C++][Gandiva] Reduce dependence on Boost > - > > Key: ARROW-8660 > URL: https://issues.apache.org/jira/browse/ARROW-8660 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Gandiva >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Remove Boost usages aside from Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers
[ https://issues.apache.org/jira/browse/ARROW-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8661: Description: I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. At present there are over 100 .h/.cc files in just src/gandiva, not considering subdirectories. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h was: I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h > [C++][Gandiva] Reduce number of files and headers > - > > Key: ARROW-8661 > URL: https://issues.apache.org/jira/browse/ARROW-8661 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Gandiva >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I feel that the Gandiva subpackage is more Java-like in its code organization > than the rest of the Arrow codebase, and it might be easier to navigate and > develop with closely related code condensed into some larger headers and > compilation units. 
At present there are over 100 .h/.cc files in just > src/gandiva, not considering subdirectories. > Additionally, it's not necessary to have a header file for each component of > the function registry -- the registration functions can be declared in > function_registry.h or function_registry_internal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers
Wes McKinney created ARROW-8661: --- Summary: [C++][Gandiva] Reduce number of files and headers Key: ARROW-8661 URL: https://issues.apache.org/jira/browse/ARROW-8661 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Fix For: 1.0.0 I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost
Wes McKinney created ARROW-8660: --- Summary: [C++][Gandiva] Reduce dependence on Boost Key: ARROW-8660 URL: https://issues.apache.org/jira/browse/ARROW-8660 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Remove Boost usages aside from Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-300) [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-300. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6707 [https://github.com/apache/arrow/pull/6707] > [Format] Add body buffer compression option to IPC message protocol using LZ4 > or ZSTD > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as they're being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8447: --- Assignee: Francois Saint-Jacques > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-8447. - Resolution: Fixed Issue resolved by pull request 7075 [https://github.com/apache/arrow/pull/7075] > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
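The scheme sketched in ARROW-8447 (index each ScanTask, collect its batches into the slot for that index, flatten at the end) is an ordered gather over unordered task completion. A language-neutral sketch in Python, with hypothetical names rather than the Arrow C++ API:

```python
# Ordered gather over unordered task completion: each task writes its
# batches into the slot for its index, so flattening the slots restores
# submission order regardless of which tasks finish first.
# Hypothetical sketch of the ARROW-8447 scheme, not the Arrow C++ API.
import random
import time
from concurrent.futures import ThreadPoolExecutor

def scan_task(task_id):
    """Simulate a ScanTask yielding several 'record batches'."""
    time.sleep(random.random() * 0.01)  # tasks complete out of order
    return [f"task{task_id}-batch{i}" for i in range(3)]

def to_table(num_tasks):
    slots = [None] * num_tasks  # the `batches` vector, resized up front
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(scan_task, i): i for i in range(num_tasks)}
        for future, index in futures.items():
            slots[index] = future.result()  # move local vector into its slot
    # Flatten in slot order; equivalent to the final FromRecordBatch step.
    return [batch for task_batches in slots for batch in task_batches]

batches = to_table(8)
assert batches == [f"task{t}-batch{i}" for t in range(8) for i in range(3)]
```

The C++ version guards the slot vector with a mutex since TaskGroup closures run concurrently; here `future.result()` provides the equivalent synchronization.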
[jira] [Updated] (ARROW-8659) [Rust] ListBuilder and FixedSizeListBuilder capacity
[ https://issues.apache.org/jira/browse/ARROW-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8659: -- Labels: pull-request-available (was: ) > [Rust] ListBuilder and FixedSizeListBuilder capacity > > > Key: ARROW-8659 > URL: https://issues.apache.org/jira/browse/ARROW-8659 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Both ListBuilder and FixedSizeListBuilder accept a values_builder as a > constructor argument and then set the capacity of their internal builders > based off the length of this values_builder. Unfortunately at construction > time this values_builder is normally empty, and consequently programs spend > an unnecessary amount of time reallocating memory. > > This should be addressed by adding new constructor methods that allow > specifying the desired capacity upfront. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8659) ListBuilder and FixedSizeListBuilder capacity
Raphael Taylor-Davies created ARROW-8659: Summary: ListBuilder and FixedSizeListBuilder capacity Key: ARROW-8659 URL: https://issues.apache.org/jira/browse/ARROW-8659 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Both ListBuilder and FixedSizeListBuilder accept a values_builder as a constructor argument and then set the capacity of their internal builders based off the length of this values_builder. Unfortunately at construction time this values_builder is normally empty, and consequently programs spend an unnecessary amount of time reallocating memory. This should be addressed by adding new constructor methods that allow specifying the desired capacity upfront. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8659) [Rust] ListBuilder and FixedSizeListBuilder capacity
[ https://issues.apache.org/jira/browse/ARROW-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-8659: - Summary: [Rust] ListBuilder and FixedSizeListBuilder capacity (was: ListBuilder and FixedSizeListBuilder capacity) > [Rust] ListBuilder and FixedSizeListBuilder capacity > > > Key: ARROW-8659 > URL: https://issues.apache.org/jira/browse/ARROW-8659 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Minor > > Both ListBuilder and FixedSizeListBuilder accept a values_builder as a > constructor argument and then set the capacity of their internal builders > based off the length of this values_builder. Unfortunately at construction > time this values_builder is normally empty, and consequently programs spend > an unnecessary amount of time reallocating memory. > > This should be addressed by adding new constructor methods that allow > specifying the desired capacity upfront. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8653) [C++] Add support for gflags version detection
[ https://issues.apache.org/jira/browse/ARROW-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096962#comment-17096962 ] Kouhei Sutou commented on ARROW-8653: - We'll be able to implement this by checking {{gflags.pc}}. We can't detect the version from {{gflags/*.h}}. > [C++] Add support for gflags version detection > -- > > Key: ARROW-8653 > URL: https://issues.apache.org/jira/browse/ARROW-8653 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Priority: Major > > Missing functionality from FindgflagsAlt, follow-up for > https://github.com/apache/arrow/pull/7067/files#diff-bc36ca94c3abd969dcdbaec7125fed65R18 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
Ben Kietzman created ARROW-8658: --- Summary: [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments Key: ARROW-8658 URL: https://issues.apache.org/jira/browse/ARROW-8658 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 This is a very handy optimization for large datasets with multiple partition fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a filter {{"a"_ == 2}} none of its files or subdirectories need be examined. After ARROW-8318 FileSystemDataset stores only files so subtree pruning (whose implementation depended on the presence of directories to represent subtrees) was disabled. It should be possible to reintroduce this without reference to directories by examining partition expressions directly and extracting a tree structure from their subexpressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
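The pruning described above can be illustrated with partition expressions alone, without reference to directories. A sketch in the spirit of ARROW-8658, with hypothetical helper names (this is not the arrow::dataset API):

```python
# Partition-expression pruning for hive-style paths: a path whose
# expression binds the filtered key to a different value can be skipped
# entirely. Hypothetical helpers, not the arrow::dataset API.

def partition_expr(path):
    """Extract {key: value} pairs from hive-style path segments."""
    expr = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            expr[key] = value
    return expr

def prune(paths, key, value):
    """Keep only paths whose partition expression can satisfy key == value."""
    return [
        p for p in paths
        if partition_expr(p).get(key, value) == value  # unbound keys pass
    ]

paths = [
    "base/a=2/f1.parquet",
    "base/a=3/f2.parquet",
    "base/a=2/b=x/f3.parquet",
    "base/f4.parquet",
]
kept = prune(paths, "a", "2")
assert kept == ["base/a=2/f1.parquet", "base/a=2/b=x/f3.parquet", "base/f4.parquet"]
```

The subtree optimization in the issue goes one step further: by grouping paths under shared expression prefixes into a tree, a single conflicting comparison eliminates an entire group rather than testing each file as this flat sketch does.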
[jira] [Resolved] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8648. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7072 [https://github.com/apache/arrow/pull/7072] > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Assignee: Mark Hildreth >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-8648: - Assignee: Mark Hildreth > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Assignee: Mark Hildreth >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8592) [C++] Docs still list LLVM 7 as compiler used
[ https://issues.apache.org/jira/browse/ARROW-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8592. --- Resolution: Fixed Issue resolved by pull request 7068 [https://github.com/apache/arrow/pull/7068] > [C++] Docs still list LLVM 7 as compiler used > - > > Key: ARROW-8592 > URL: https://issues.apache.org/jira/browse/ARROW-8592 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > should be LLVM 8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Description: With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are not forward compatible (cannot be read with 0.16.0). That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. was: With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are now unreadable. That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. 
> [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > Fix For: 0.17.1 > > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are not forward compatible (cannot be read with 0.16.0). That's TBs of > data in my case. > Those two concerns should be separated. Given that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096866#comment-17096866 ] Wes McKinney commented on ARROW-8657: - For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but make a future warning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096866#comment-17096866 ] Wes McKinney edited comment on ARROW-8657 at 4/30/20, 6:37 PM: --- For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but issue a FutureWarning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 was (Author: wesmckinn): For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but make a future warning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. 
That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Fix Version/s: 0.17.1 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > Fix For: 0.17.1 > > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Summary: [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0' (was: Distinguish parquet version 2 logical type vs DataPageV2) > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096862#comment-17096862 ] Wes McKinney commented on ARROW-8657: - > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. To clarify, they are _not_ unreadable, but rather they are not _forward compatible_ (files written by 0.17.0 with {{version='2.0'}} cannot be read with 0.16.0 at the moment). In general, forward compatibility should be approached carefully. **All** files written by 0.16.0 are readable in 0.17.0 > Distinguish parquet version 2 logical type vs DataPageV2 > > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8447: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096859#comment-17096859 ] Wes McKinney commented on ARROW-8654: - Also, the perf of reading very wide Parquet files won't be very good. > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return impl.read(path, columns=columns, kwargs) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 125, in read > path, columns=columns, kwargs > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1530, in read_table > partitioning=partitioning) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1189, in __init__ > self.validate_schemas() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1217, in validate_schemas > self.schema = self.pieces[0].get_metadata().schema > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 662, in get_metadata > f = self.open() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 669, in open > reader = self.open_file_func(self.path) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1040, in _open_dataset_file > buffer_size=dataset.buffer_size > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 210, in __init__ > 
read_dictionary=read_dictionary, metadata=metadata) > File "pyarrow/_parquet.pyx", line 1023, in > pyarrow._parquet.ParquetReader.open > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit > {noformat} > This is pandas 1.0.3, and pyarrow 0.17.0. > > I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. > > I also tried with 40,000 columns instead of 45,000 as above, and that does work with > 0.17.0. > > Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096858#comment-17096858 ] Wes McKinney commented on ARROW-8654: - FWIW, "large" metadata from very wide tables is a problematic area for the Parquet format in general. We'll have to have a closer look to why the metadata got bigger from 0.16.0 to 0.17.0, but there will always be some point where it's too big. I would guess if you keep increasing the number of columns that 0.16.0 will fail, too. > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return impl.read(path, columns=columns, kwargs) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 125, in read > path, columns=columns, kwargs > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1530, in read_table > partitioning=partitioning) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1189, in __init__ > self.validate_schemas() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1217, in validate_schemas > self.schema = self.pieces[0].get_metadata().schema > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 662, in get_metadata > f = self.open() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 669, in open > reader = 
self.open_file_func(self.path) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1040, in _open_dataset_file > buffer_size=dataset.buffer_size > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 210, in __init__ > read_dictionary=read_dictionary, metadata=metadata) > File "pyarrow/_parquet.pyx", line 1023, in > pyarrow._parquet.ParquetReader.open > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit > {noformat} > This is pandas 1.0.3, and pyarrow 0.17.0. > > I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. > > I also tried with 40,000 columns instead of 45,000 as above, and that does work with > 0.17.0. > > Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Macpherson updated ARROW-8654: --- Description: {code:java} import pandas as pd import numpy as np num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! was: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line >
[jira] [Assigned] (ARROW-7759) [C++][Dataset] Add CsvFileFormat for CSV support
[ https://issues.apache.org/jira/browse/ARROW-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-7759: --- Assignee: Ben Kietzman (was: Antoine Pitrou) > [C++][Dataset] Add CsvFileFormat for CSV support > > > Key: ARROW-7759 > URL: https://issues.apache.org/jira/browse/ARROW-7759 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > This should be a minimal implementation that binds 1-1 file and ScanTask for > now. Streaming optimizations can be done in ARROW-3410. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7759) [C++][Dataset] Add CsvFileFormat for CSV support
[ https://issues.apache.org/jira/browse/ARROW-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-7759. - Resolution: Fixed Issue resolved by pull request 7033 [https://github.com/apache/arrow/pull/7033] > [C++][Dataset] Add CsvFileFormat for CSV support > > > Key: ARROW-7759 > URL: https://issues.apache.org/jira/browse/ARROW-7759 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > This should be a minimal implementation that binds 1-1 file and ScanTask for > now. Streaming optimizations can be done in ARROW-3410. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2
Pierre Belzile created ARROW-8657: - Summary: Distinguish parquet version 2 logical type vs DataPageV2 Key: ARROW-8657 URL: https://issues.apache.org/jira/browse/ARROW-8657 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.0 Reporter: Pierre Belzile With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are now unreadable. That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds
[ https://issues.apache.org/jira/browse/ARROW-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8656: -- Labels: pull-request-available (was: ) > [Python] Switch to VS2017 in the windows wheel builds > - > > Key: ARROW-8656 > URL: https://issues.apache.org/jira/browse/ARROW-8656 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Since the recent conda-forge compiler migrations the wheel builds are failing > https://mail.google.com/mail/u/0/#label/ARROW/FMfcgxwHNCsqSGKQRMZxGlWWsfmGpKdC -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds
Krisztian Szucs created ARROW-8656: -- Summary: [Python] Switch to VS2017 in the windows wheel builds Key: ARROW-8656 URL: https://issues.apache.org/jira/browse/ARROW-8656 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Since the recent conda-forge compiler migrations the wheel builds are failing https://mail.google.com/mail/u/0/#label/ARROW/FMfcgxwHNCsqSGKQRMZxGlWWsfmGpKdC -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Macpherson updated ARROW-8654: --- Description: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! was: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(fout) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return
[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
Joris Van den Bossche created ARROW-8655: Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset Key: ARROW-8655 URL: https://issues.apache.org/jira/browse/ARROW-8655 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0 Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} classes that describe a partitioning used in the discovery phase. But once a dataset object is created, it doesn't know any more about this, it just has partition expressions for the fragments. And the partition keys are added to the schema, but you can't directly know which columns of the schema originated from the partitions. However, there can be use cases where it would be useful that a dataset still "knows" from what kind of partitioning it was created: - The "read CSV write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for parquet (kind of roundtripping the partitioning on read/write) - To convert the dataset to other representation, eg conversion to pandas, it can be useful to know what columns were partition columns (eg for pandas, those columns might be good candidates to be set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems can use similar information. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8639) [C++][Plasma] Require gflags
[ https://issues.apache.org/jira/browse/ARROW-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-8639. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7067 [https://github.com/apache/arrow/pull/7067] > [C++][Plasma] Require gflags > > > Key: ARROW-8639 > URL: https://issues.apache.org/jira/browse/ARROW-8639 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
Mike Macpherson created ARROW-8654: -- Summary: [Python] pyarrow 0.17.0 fails reading "wide" parquet files Key: ARROW-8654 URL: https://issues.apache.org/jira/browse/ARROW-8654 Project: Apache Arrow Issue Type: Bug Reporter: Mike Macpherson {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(fout) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. 
I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8653) [C++] Add support for gflags version detection
Krisztian Szucs created ARROW-8653: -- Summary: [C++] Add support for gflags version detection Key: ARROW-8653 URL: https://issues.apache.org/jira/browse/ARROW-8653 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Missing functionality from FindgflagsAlt, follow-up for https://github.com/apache/arrow/pull/7067/files#diff-bc36ca94c3abd969dcdbaec7125fed65R18 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files
[ https://issues.apache.org/jira/browse/ARROW-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8652: - Labels: dataset (was: ) > [Python] Test error message when discovering dataset with invalid files > --- > > Key: ARROW-8652 > URL: https://issues.apache.org/jira/browse/ARROW-8652 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset > > There is a comment in test_parquet.py about the Dataset API needing a > better error message for invalid files: > https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648 > Although, this seems to work now: > {code} > import tempfile > import pathlib > import pyarrow.dataset as ds > > > tempdir = pathlib.Path(tempfile.mkdtemp()) > with open(str(tempdir / "data.parquet"), 'wb') as f: > pass > In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet") > > > ... > OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': > Invalid: Parquet file size is 0 bytes > {code} > So we need to update the test to actually test it instead of skipping. > The only difference with the python ParquetDataset implementation is that the > datasets API raises an OSError and not an ArrowInvalid error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files
Joris Van den Bossche created ARROW-8652: Summary: [Python] Test error message when discovering dataset with invalid files Key: ARROW-8652 URL: https://issues.apache.org/jira/browse/ARROW-8652 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche There is a comment in test_parquet.py about the Dataset API needing a better error message for invalid files: https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648 Although, this seems to work now: {code} import tempfile import pathlib import pyarrow.dataset as ds tempdir = pathlib.Path(tempfile.mkdtemp()) with open(str(tempdir / "data.parquet"), 'wb') as f: pass In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet") ... OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes {code} So we need to update the test to actually test it instead of skipping. The only difference with the python ParquetDataset implementation is that the datasets API raises an OSError and not an ArrowInvalid error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8318) [C++][Dataset] Dataset should instantiate Fragment
[ https://issues.apache.org/jira/browse/ARROW-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8318: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Dataset should instantiate Fragment > -- > > Key: ARROW-8318 > URL: https://issues.apache.org/jira/browse/ARROW-8318 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Fragments are created on the fly when invoking a Scan. This means that a lot > of the auxilliary/ancilliary data must be stored by the specialised Dataset, > e.g. the FileSystemDataset must hold the path and partition expression. With > the venue of more complex Fragment, e.g. ParquetFileFragment, more data must > be stored. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8647: - Labels: dataset (was: ) > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects
[ https://issues.apache.org/jira/browse/ARROW-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8651: - Labels: dataset (was: ) > [Python][Dataset] Support pickling of Dataset objects > - > > Key: ARROW-8651 > URL: https://issues.apache.org/jira/browse/ARROW-8651 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > We already made several parts of a Dataset serializable (the formats, the > expressions, the filesystem). With those, it should also be possible to > pickle FileFragments, and with that also Dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects
Joris Van den Bossche created ARROW-8651: Summary: [Python][Dataset] Support pickling of Dataset objects Key: ARROW-8651 URL: https://issues.apache.org/jira/browse/ARROW-8651 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 We already made several parts of a Dataset serializable (the formats, the expressions, the filesystem). With those, it should also be possible to pickle FileFragments, and with that also Dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
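The strategy in the ticket (make the parts serializable, then the container follows) is the standard `__reduce__` pattern. A library-free sketch with stand-in classes, not Arrow's actual implementation:

```python
# Library-free sketch of the strategy in ARROW-8651: once an object's parts
# (format, expression, filesystem) pickle cleanly, the container pickles
# itself by recording "reconstruct me from these parts".
import pickle

class Fragment:
    """Stand-in for a FileFragment: a path plus a partition expression."""
    def __init__(self, path, expression):
        self.path = path
        self.expression = expression

    def __reduce__(self):
        # Rebuild from already-picklable parts.
        return (Fragment, (self.path, self.expression))

class Dataset:
    """Stand-in for a Dataset: just a list of fragments."""
    def __init__(self, fragments):
        self.fragments = fragments

    def __reduce__(self):
        return (Dataset, (self.fragments,))

original = Dataset([Fragment("part-0.parquet", "year == 2020")])
restored = pickle.loads(pickle.dumps(original))
```

This is what makes datasets usable with multiprocessing-based schedulers such as Dask, where the object must cross process boundaries.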
[jira] [Updated] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8648: -- Labels: pull-request-available (was: ) > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
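The two tweaks could look like the following in a CI script (a sketch with illustrative targets, not the actual apache/arrow scripts). Cargo keys its build cache on `RUSTFLAGS`, so exporting it once lets every invocation reuse the same artifacts:

```shell
#!/usr/bin/env bash
# Sketch of the proposed CI tweaks (targets illustrative, not the actual
# apache/arrow scripts). RUSTFLAGS is exported once so every cargo command
# sees the same flags and shares one build cache instead of rebuilding.
set -e
export RUSTFLAGS="-D warnings"

cargo build --all-targets    # one debug build, shared by the steps below
cargo test                   # reuses the debug artifacts
cargo run --example builder  # also debug mode: no --release, no rebuild
```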
[jira] [Created] (ARROW-8650) [Rust] [Website] Add documentation to Arrow website
Andy Grove created ARROW-8650: - Summary: [Rust] [Website] Add documentation to Arrow website Key: ARROW-8650 URL: https://issues.apache.org/jira/browse/ARROW-8650 Project: Apache Arrow Issue Type: Improvement Components: Rust, Website Reporter: Andy Grove Fix For: 1.0.0 The documentation page [1] on the Arrow site has links for C, C++, Java, Python, JavaScript, and R. It would be good to add Rust here as well, even if the docs are brief and link to the rustdocs on docs.rs [2] (which are currently broken due to ARROW-8536 [3]). [1] [https://arrow.apache.org/docs/] [2] https://docs.rs/crate/arrow/0.17.0 [3] https://issues.apache.org/jira/browse/ARROW-8536 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8649) [Java] [Website] Java documentation on website is hidden
[ https://issues.apache.org/jira/browse/ARROW-8649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-8649: -- Component/s: Website Java > [Java] [Website] Java documentation on website is hidden > > > Key: ARROW-8649 > URL: https://issues.apache.org/jira/browse/ARROW-8649 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Website >Reporter: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > There is some excellent Java documentation on the web site that is hard to > find because the Java documentation link [1] goes straight to the generated > javadocs. > > [1] https://arrow.apache.org/docs/java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hildreth updated ARROW-8648: - Component/s: Rust > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Priority: Major > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8649) [Java] [Website] Java documentation on website is hidden
Andy Grove created ARROW-8649: - Summary: [Java] [Website] Java documentation on website is hidden Key: ARROW-8649 URL: https://issues.apache.org/jira/browse/ARROW-8649 Project: Apache Arrow Issue Type: Bug Reporter: Andy Grove Fix For: 1.0.0 There is some excellent Java documentation on the web site that is hard to find because the Java documentation link [1] goes straight to the generated javadocs. [1] https://arrow.apache.org/docs/java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8648) [Rust] Optimize Rust CI Build Times
Mark Hildreth created ARROW-8648: Summary: [Rust] Optimize Rust CI Build Times Key: ARROW-8648 URL: https://issues.apache.org/jira/browse/ARROW-8648 Project: Apache Arrow Issue Type: Improvement Reporter: Mark Hildreth In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build options used that are at odds with each other, resulting in multiple redundant builds where a smaller number could do the same job. The following tweaks, at minimum, could reduce this, speeding up build times: * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. Currently, it's only used for a single command (the {{build --all-targets}} in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, since RUSTFLAGS has changed. * Don't run examples in release mode, as that would force a new (and slower) rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8638. --- Resolution: Information Provided Closing since there isn't a bug to fix; further discussion can take place here or on the mailing list > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096569#comment-17096569 ] Anish Biswas commented on ARROW-8642: - Okay, I will do that from now on. > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and it's a > numpy.int8 type. How do I convert it to a pyarrow.DataType intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.DataType. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096567#comment-17096567 ] Wes McKinney commented on ARROW-8642: - [~trickarcher] if you have questions it's better to use the mailing list than to open JIRA issues > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and it's a > numpy.int8 type. How do I convert it to a pyarrow.DataType intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.DataType. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
[ https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096566#comment-17096566 ] Wes McKinney commented on ARROW-8641: - Too bad this was not tested > [Python] Regression in feather: no longer supports permutation in column > selection > -- > > Key: ARROW-8641 > URL: https://issues.apache.org/jira/browse/ARROW-8641 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > A quite annoying regression (original report from > https://github.com/pandas-dev/pandas/issues/33878), is that when specifying > {{columns}} to read, this now fails if the order of the columns is not > exactly the same as in the file: > {code: python} > In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', > 'c']) > In [29]: from pyarrow import feather > In [30]: feather.write_feather(table, "test.feather") > # this works fine > In [32]: feather.read_table("test.feather", columns=['a', 'b']) > > > Out[32]: > pyarrow.Table > a: int64 > b: int64 > In [33]: feather.read_table("test.feather", columns=['b', 'a']) > > > --- > ArrowInvalid Traceback (most recent call last) > in > > 1 feather.read_table("test.feather", columns=['b', 'a']) > ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, > memory_map) > 237 return reader.read_indices(columns) > 238 elif all(map(lambda t: t == str, column_types)): > --> 239 return reader.read_names(columns) > 240 > 241 column_type_names = [t.__name__ for t in column_types] > ~/scipy/repos/arrow/python/pyarrow/feather.pxi in > pyarrow.lib.FeatherReader.read_names() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Schema at index 0 was different: > b: int64 > a: int64 > vs > a: int64 > b: int64 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
[ https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8641: Fix Version/s: 1.0.0 > [Python] Regression in feather: no longer supports permutation in column > selection > -- > > Key: ARROW-8641 > URL: https://issues.apache.org/jira/browse/ARROW-8641 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > A quite annoying regression (original report from > https://github.com/pandas-dev/pandas/issues/33878), is that when specifying > {{columns}} to read, this now fails if the order of the columns is not > exactly the same as in the file: > {code: python} > In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', > 'c']) > In [29]: from pyarrow import feather > In [30]: feather.write_feather(table, "test.feather") > # this works fine > In [32]: feather.read_table("test.feather", columns=['a', 'b']) > > > Out[32]: > pyarrow.Table > a: int64 > b: int64 > In [33]: feather.read_table("test.feather", columns=['b', 'a']) > > > --- > ArrowInvalid Traceback (most recent call last) > in > > 1 feather.read_table("test.feather", columns=['b', 'a']) > ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, > memory_map) > 237 return reader.read_indices(columns) > 238 elif all(map(lambda t: t == str, column_types)): > --> 239 return reader.read_names(columns) > 240 > 241 column_type_names = [t.__name__ for t in column_types] > ~/scipy/repos/arrow/python/pyarrow/feather.pxi in > pyarrow.lib.FeatherReader.read_names() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Schema at index 0 was different: > b: int64 > a: int64 > vs > a: int64 > b: int64 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8647: - Description: In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 was: In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). 
But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
Joris Van den Bossche created ARROW-8647: Summary: [C++][Dataset] Optionally encode partition field values as dictionary type Key: ARROW-8647 URL: https://issues.apache.org/jira/browse/ARROW-8647 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0 In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096500#comment-17096500 ] Vibhatha Lakmal Abeykoon commented on ARROW-8638: - I tried the LD_LIBRARY_PATH approach and it worked fine. But I think I need to adopt a neater setup, as you point out. Thank you for this response. I have another thing in mind. Think of an instance where Arrow is compiled from source. In such cases, is there a best practice that can be adopted? > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8622) [Rust] Parquet crate does not compile on aarch64
[ https://issues.apache.org/jira/browse/ARROW-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8622. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7059 [https://github.com/apache/arrow/pull/7059] > [Rust] Parquet crate does not compile on aarch64 > > > Key: ARROW-8622 > URL: https://issues.apache.org/jira/browse/ARROW-8622 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: R. Tyler Croy >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096493#comment-17096493 ] Uwe Korn commented on ARROW-8638: - You either need to extend the environment variable `LD_LIBRARY_PATH` to point to the directory where `libarrow.so.16` resides or (a bit more complicated in setup.py but the preferred approach) set the RPATH on the generated `example.so` Python module to also include the directory where `libarrow.so.16` resides; see turbodbc for an example: https://github.com/blue-yonder/turbodbc/blob/8e2db0d0a26b620ad3e687e56a88fdab3117e09c/setup.py#L186-L189 > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7955) [Java] Support large buffer for file/stream IPC
[ https://issues.apache.org/jira/browse/ARROW-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7955: -- Labels: pull-request-available (was: ) > [Java] Support large buffer for file/stream IPC > --- > > Key: ARROW-7955 > URL: https://issues.apache.org/jira/browse/ARROW-7955 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After supporting 64-bit ArrowBuf, we need to make file/stream IPC work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8646) Allow UnionListWriter to write null values
[ https://issues.apache.org/jira/browse/ARROW-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8646: -- Labels: pull-request-available (was: ) > Allow UnionListWriter to write null values > -- > > Key: ARROW-8646 > URL: https://issues.apache.org/jira/browse/ARROW-8646 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Thippana Vamsi Kalyan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > UnionListWriter has no provision to skip an index to write a null value into > the list. > It should allow calling writeNull -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8646) Allow UnionListWriter to write null values
Thippana Vamsi Kalyan created ARROW-8646: Summary: Allow UnionListWriter to write null values Key: ARROW-8646 URL: https://issues.apache.org/jira/browse/ARROW-8646 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Thippana Vamsi Kalyan UnionListWriter has no provision to skip an index to write a null value into the list. It should allow calling writeNull -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8645) [C++] Missing gflags dependency for plasma
[ https://issues.apache.org/jira/browse/ARROW-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8645: -- Labels: pull-request-available (was: ) > [C++] Missing gflags dependency for plasma > -- > > Key: ARROW-8645 > URL: https://issues.apache.org/jira/browse/ARROW-8645 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The documentation build fails because gflags is not installed and CMake > doesn't build the bundled version of it. > Introduced by > https://github.com/apache/arrow/commit/dfc14ef24ed54ff757c10a26663a629ce5e8cebf -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8645) [C++] Missing gflags dependency for plasma
Krisztian Szucs created ARROW-8645: -- Summary: [C++] Missing gflags dependency for plasma Key: ARROW-8645 URL: https://issues.apache.org/jira/browse/ARROW-8645 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The documentation build fails because gflags is not installed and CMake doesn't build the bundled version of it. Introduced by https://github.com/apache/arrow/commit/dfc14ef24ed54ff757c10a26663a629ce5e8cebf -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns
Joris Van den Bossche created ARROW-8644: Summary: [Python] Dask integration tests failing due to change in not including partition columns Key: ARROW-8644 URL: https://issues.apache.org/jira/browse/ARROW-8644 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche In ARROW-3861 (https://github.com/apache/arrow/pull/7050), I "fixed" a bug that the partition columns are always included even when the user did a manual column selection. But apparently, this behaviour was being relied upon by dask. See the failing nightly integration tests: https://circleci.com/gh/ursa-labs/crossbow/11854?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link So the best option might be to just keep the "old" behaviour for the legacy ParquetDataset, when using the new datasets API ({{use_legacy_datasets=False}}), you get the new / correct behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion
Joris Van den Bossche created ARROW-8643: Summary: [Python] Tests with pandas master failing due to freq assertion Key: ARROW-8643 URL: https://issues.apache.org/jira/browse/ARROW-8643 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Nightly pandas master tests are failing, eg https://circleci.com/gh/ursa-labs/crossbow/11858?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link This is caused by a change in pandas, see https://github.com/pandas-dev/pandas/pull/33815#issuecomment-620820134 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096278#comment-17096278 ] Anish Biswas commented on ARROW-8642: - Oh okay! That's neat! Thanks! > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anish Biswas closed ARROW-8642. --- Resolution: Fixed > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096272#comment-17096272 ] Joris Van den Bossche commented on ARROW-8642: -- There is a {{from_numpy_dtype}} function for this: {code} In [42]: pa.from_numpy_dtype(np.dtype("int8")) Out[42]: DataType(int8) {code} It's included in the API docs here: https://arrow.apache.org/docs/python/api/datatypes.html > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
Anish Biswas created ARROW-8642: --- Summary: Is there a good way to convert data types from numpy types to pyarrow DataType? Key: ARROW-8642 URL: https://issues.apache.org/jira/browse/ARROW-8642 Project: Apache Arrow Issue Type: Wish Reporter: Anish Biswas Pretty much what the title says. Suppose I have a numpy array and its a numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I thought a Dictionary lookup table might work but perhaps there is some better way? Why do I need this? I am trying to make pyarrow arrays with from_buffers(). The first parameter is essentially a pyarrow.Datatype. So that's why. I have validity_bitmaps as a buffer of uint8 and that's why I am using from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8504) [C++] Add a method that takes an RLE visitor for a bitmap.
[ https://issues.apache.org/jira/browse/ARROW-8504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-8504: -- Assignee: Micah Kornfield > [C++] Add a method that takes an RLE visitor for a bitmap. > -- > > Key: ARROW-8504 > URL: https://issues.apache.org/jira/browse/ARROW-8504 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > > For nullability data, nulls are in many cases not evenly distributed. In these cases it would be beneficial to have a mechanism for understanding when runs of set/unset bits are encountered. One example is translating a bitmap to Parquet definition levels. > > An implementation path could be to add this as a method on Bitmap that makes an adaptor callback for VisitWords, but I think at least for Parquet an iterator API might be more appropriate (something that is easily stoppable/resumable). -- This message was sent by Atlassian Jira (v8.3.4#803005)
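The run-based visiting described above can be sketched in Python (a toy illustration of the idea, not the proposed C++ API; the function name `bit_runs` is made up):

```python
from itertools import groupby

def bit_runs(bits):
    """Yield (value, run_length) pairs for consecutive runs of set/unset bits."""
    for value, group in groupby(bits):
        yield value, sum(1 for _ in group)

# A validity bitmap with clustered nulls: a run-aware consumer (e.g. a
# Parquet definition-level writer) can emit whole runs at once instead
# of inspecting every bit individually.
bitmap = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
print(list(bit_runs(bitmap)))  # [(1, 3), (0, 2), (1, 4), (0, 1)]
```

An iterator like this is naturally stoppable/resumable, which is the property the issue wants for the Parquet writer.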
[jira] [Created] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
Joris Van den Bossche created ARROW-8641: Summary: [Python] Regression in feather: no longer supports permutation in column selection Key: ARROW-8641 URL: https://issues.apache.org/jira/browse/ARROW-8641 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche A quite annoying regression (original report from https://github.com/pandas-dev/pandas/issues/33878) is that when specifying {{columns}} to read, this now fails if the order of the columns is not exactly the same as in the file:

{code:python}
In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', 'c'])

In [29]: from pyarrow import feather

In [30]: feather.write_feather(table, "test.feather")

# this works fine
In [32]: feather.read_table("test.feather", columns=['a', 'b'])
Out[32]:
pyarrow.Table
a: int64
b: int64

In [33]: feather.read_table("test.feather", columns=['b', 'a'])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 feather.read_table("test.feather", columns=['b', 'a'])

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240
    241     column_type_names = [t.__name__ for t in column_types]

~/scipy/repos/arrow/python/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different:
b: int64
a: int64
vs
a: int64
b: int64
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8592) [C++] Docs still list LLVM 7 as compiler used
[ https://issues.apache.org/jira/browse/ARROW-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8592: -- Labels: pull-request-available (was: ) > [C++] Docs still list LLVM 7 as compiler used > - > > Key: ARROW-8592 > URL: https://issues.apache.org/jira/browse/ARROW-8592 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > should be LLVM 8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8640) pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not match the passed number (3)
[ https://issues.apache.org/jira/browse/ARROW-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anish Biswas closed ARROW-8640. --- > pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not > match the passed number (3) > > > Key: ARROW-8640 > URL: https://issues.apache.org/jira/browse/ARROW-8640 > Project: Apache Arrow > Issue Type: Bug >Reporter: Anish Biswas >Priority: Major >
> {code:python}
> arr1 = pa.array([1, 2, 3, 4, 5])
> arr1.buffers()
> arr2 = pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([1, 0, 0, 2, 1, 2, 3], type='int32')
> value_offsets.buffers()
> arr = pa.UnionArray.from_dense(types, value_offsets, [arr1, arr2])
> arr4 = pa.UnionArray.from_buffers(
>     pa.struct([pa.field("0", arr1.type), pa.field("1", arr2.type)]),
>     5, arr.buffers()[0:3], children=[arr1, arr2])
> {code}
> The problem arises when I try to produce the Union Array via buffers: according to the Columnar documentation I need 3 buffers to produce a dense Union Array, but when I try this, I get the error `Type's expected number of buffers (1) did not match the passed number (3)`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8640) pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not match the passed number (3)
[ https://issues.apache.org/jira/browse/ARROW-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096164#comment-17096164 ] Anish Biswas commented on ARROW-8640: - Ah, I see. Yes, that makes more sense. Thanks for the help! I'll close this issue now. > pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not > match the passed number (3) > > > Key: ARROW-8640 > URL: https://issues.apache.org/jira/browse/ARROW-8640 > Project: Apache Arrow > Issue Type: Bug >Reporter: Anish Biswas >Priority: Major >
> {code:python}
> arr1 = pa.array([1, 2, 3, 4, 5])
> arr1.buffers()
> arr2 = pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([1, 0, 0, 2, 1, 2, 3], type='int32')
> value_offsets.buffers()
> arr = pa.UnionArray.from_dense(types, value_offsets, [arr1, arr2])
> arr4 = pa.UnionArray.from_buffers(
>     pa.struct([pa.field("0", arr1.type), pa.field("1", arr2.type)]),
>     5, arr.buffers()[0:3], children=[arr1, arr2])
> {code}
> The problem arises when I try to produce the Union Array via buffers: according to the Columnar documentation I need 3 buffers to produce a dense Union Array, but when I try this, I get the error `Type's expected number of buffers (1) did not match the passed number (3)`. -- This message was sent by Atlassian Jira (v8.3.4#803005)