[jira] [Resolved] (ARROW-8634) [Java] Create an example
[ https://issues.apache.org/jira/browse/ARROW-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8634. Resolution: Fixed Issue resolved by pull request 7066 [https://github.com/apache/arrow/pull/7066] > [Java] Create an example > > > Key: ARROW-8634 > URL: https://issues.apache.org/jira/browse/ARROW-8634 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The Java implementation doesn't seem to have any documentation or examples on > how to get started with basic operations such as creating an array. Javadocs > exist but how do new users even know which class to look for? > I would like to create an examples module and one simple example as a > starting point. I hope to have a PR soon. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost
[ https://issues.apache.org/jira/browse/ARROW-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8660: -- Labels: pull-request-available (was: ) > [C++][Gandiva] Reduce dependence on Boost > - > > Key: ARROW-8660 > URL: https://issues.apache.org/jira/browse/ARROW-8660 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Gandiva >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Remove Boost usages aside from Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers
[ https://issues.apache.org/jira/browse/ARROW-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8661: Description: I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. At present there are over 100 .h/.cc files in just src/gandiva, not considering subdirectories. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h was: I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h > [C++][Gandiva] Reduce number of files and headers > - > > Key: ARROW-8661 > URL: https://issues.apache.org/jira/browse/ARROW-8661 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Gandiva >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I feel that the Gandiva subpackage is more Java-like in its code organization > than the rest of the Arrow codebase, and it might be easier to navigate and > develop with closely related code condensed into some larger headers and > compilation units. 
At present there are over 100 .h/.cc files in just > src/gandiva, not considering subdirectories. > Additionally, it's not necessary to have a header file for each component of > the function registry -- the registration functions can be declared in > function_registry.h or function_registry_internal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers
Wes McKinney created ARROW-8661: --- Summary: [C++][Gandiva] Reduce number of files and headers Key: ARROW-8661 URL: https://issues.apache.org/jira/browse/ARROW-8661 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Fix For: 1.0.0 I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost
Wes McKinney created ARROW-8660: --- Summary: [C++][Gandiva] Reduce dependence on Boost Key: ARROW-8660 URL: https://issues.apache.org/jira/browse/ARROW-8660 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Remove Boost usages aside from Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-300) [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-300. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6707 [https://github.com/apache/arrow/pull/6707] > [Format] Add body buffer compression option to IPC message protocol using LZ4 > or ZSTD > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as they're being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8447: --- Assignee: Francois Saint-Jacques > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-8447. - Resolution: Fixed Issue resolved by pull request 7075 [https://github.com/apache/arrow/pull/7075] > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
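The scheme sketched in ARROW-8447 (index each ScanTask, collect its batches into the slot for that index, flatten at the end) is an ordered gather over unordered task completion. A language-neutral sketch in Python, with hypothetical names rather than the Arrow C++ API:

```python
# Ordered gather over unordered task completion: each task writes its
# batches into the slot for its index, so flattening the slots restores
# submission order regardless of which tasks finish first.
# Hypothetical sketch of the ARROW-8447 scheme, not the Arrow C++ API.
import random
import time
from concurrent.futures import ThreadPoolExecutor

def scan_task(task_id):
    """Simulate a ScanTask yielding several 'record batches'."""
    time.sleep(random.random() * 0.01)  # tasks complete out of order
    return [f"task{task_id}-batch{i}" for i in range(3)]

def to_table(num_tasks):
    slots = [None] * num_tasks  # the `batches` vector, resized up front
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(scan_task, i): i for i in range(num_tasks)}
        for future, index in futures.items():
            slots[index] = future.result()  # move local vector into its slot
    # Flatten in slot order; equivalent to the final FromRecordBatch step.
    return [batch for task_batches in slots for batch in task_batches]

batches = to_table(8)
assert batches == [f"task{t}-batch{i}" for t in range(8) for i in range(3)]
```

The C++ version guards the slot vector with a mutex since TaskGroup closures run concurrently; here `future.result()` provides the equivalent synchronization.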
[jira] [Updated] (ARROW-8659) [Rust] ListBuilder and FixedSizeListBuilder capacity
[ https://issues.apache.org/jira/browse/ARROW-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8659: -- Labels: pull-request-available (was: ) > [Rust] ListBuilder and FixedSizeListBuilder capacity > > > Key: ARROW-8659 > URL: https://issues.apache.org/jira/browse/ARROW-8659 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Both ListBuilder and FixedSizeListBuilder accept a values_builder as a > constructor argument and then set the capacity of their internal builders > based off the length of this values_builder. Unfortunately at construction > time this values_builder is normally empty, and consequently programs spend > an unnecessary amount of time reallocating memory. > > This should be addressed by adding new constructor methods that allow > specifying the desired capacity upfront. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8659) ListBuilder and FixedSizeListBuilder capacity
Raphael Taylor-Davies created ARROW-8659: Summary: ListBuilder and FixedSizeListBuilder capacity Key: ARROW-8659 URL: https://issues.apache.org/jira/browse/ARROW-8659 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Both ListBuilder and FixedSizeListBuilder accept a values_builder as a constructor argument and then set the capacity of their internal builders based off the length of this values_builder. Unfortunately at construction time this values_builder is normally empty, and consequently programs spend an unnecessary amount of time reallocating memory. This should be addressed by adding new constructor methods that allow specifying the desired capacity upfront. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8659) [Rust] ListBuilder and FixedSizeListBuilder capacity
[ https://issues.apache.org/jira/browse/ARROW-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-8659: - Summary: [Rust] ListBuilder and FixedSizeListBuilder capacity (was: ListBuilder and FixedSizeListBuilder capacity) > [Rust] ListBuilder and FixedSizeListBuilder capacity > > > Key: ARROW-8659 > URL: https://issues.apache.org/jira/browse/ARROW-8659 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Minor > > Both ListBuilder and FixedSizeListBuilder accept a values_builder as a > constructor argument and then set the capacity of their internal builders > based off the length of this values_builder. Unfortunately at construction > time this values_builder is normally empty, and consequently programs spend > an unnecessary amount of time reallocating memory. > > This should be addressed by adding new constructor methods that allow > specifying the desired capacity upfront. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8653) [C++] Add support for gflags version detection
[ https://issues.apache.org/jira/browse/ARROW-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096962#comment-17096962 ] Kouhei Sutou commented on ARROW-8653: - We'll be able to implement this by checking {{gflags.pc}}. We can't detect the version from {{gflags/*.h}}. > [C++] Add support for gflags version detection > -- > > Key: ARROW-8653 > URL: https://issues.apache.org/jira/browse/ARROW-8653 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Priority: Major > > Missing functionality from FindgflagsAlt, follow-up for > https://github.com/apache/arrow/pull/7067/files#diff-bc36ca94c3abd969dcdbaec7125fed65R18 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
Ben Kietzman created ARROW-8658: --- Summary: [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments Key: ARROW-8658 URL: https://issues.apache.org/jira/browse/ARROW-8658 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 This is a very handy optimization for large datasets with multiple partition fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a filter {{"a"_ == 2}} none of its files or subdirectories need be examined. After ARROW-8318 FileSystemDataset stores only files so subtree pruning (whose implementation depended on the presence of directories to represent subtrees) was disabled. It should be possible to reintroduce this without reference to directories by examining partition expressions directly and extracting a tree structure from their subexpressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
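The pruning described above can be illustrated with partition expressions alone, without reference to directories. A sketch in the spirit of ARROW-8658, with hypothetical helper names (this is not the arrow::dataset API):

```python
# Partition-expression pruning for hive-style paths: a path whose
# expression binds the filtered key to a different value can be skipped
# entirely. Hypothetical helpers, not the arrow::dataset API.

def partition_expr(path):
    """Extract {key: value} pairs from hive-style path segments."""
    expr = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            expr[key] = value
    return expr

def prune(paths, key, value):
    """Keep only paths whose partition expression can satisfy key == value."""
    return [
        p for p in paths
        if partition_expr(p).get(key, value) == value  # unbound keys pass
    ]

paths = [
    "base/a=2/f1.parquet",
    "base/a=3/f2.parquet",
    "base/a=2/b=x/f3.parquet",
    "base/f4.parquet",
]
kept = prune(paths, "a", "2")
assert kept == ["base/a=2/f1.parquet", "base/a=2/b=x/f3.parquet", "base/f4.parquet"]
```

The subtree optimization in the issue goes one step further: by grouping paths under shared expression prefixes into a tree, a single conflicting comparison eliminates an entire group rather than testing each file as this flat sketch does.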
[jira] [Resolved] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8648. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7072 [https://github.com/apache/arrow/pull/7072] > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Assignee: Mark Hildreth >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-8648: - Assignee: Mark Hildreth > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Assignee: Mark Hildreth >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8592) [C++] Docs still list LLVM 7 as compiler used
[ https://issues.apache.org/jira/browse/ARROW-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8592. --- Resolution: Fixed Issue resolved by pull request 7068 [https://github.com/apache/arrow/pull/7068] > [C++] Docs still list LLVM 7 as compiler used > - > > Key: ARROW-8592 > URL: https://issues.apache.org/jira/browse/ARROW-8592 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > should be LLVM 8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Description: With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are not forward compatible (cannot be read with 0.16.0). That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. was: With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are now unreadable. That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. 
> [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > Fix For: 0.17.1 > > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are not forward compatible (cannot be read with 0.16.0). That's TBs of > data in my case. > Those two concerns should be separated. Given that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096866#comment-17096866 ] Wes McKinney commented on ARROW-8657: - For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but make a future warning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096866#comment-17096866 ] Wes McKinney edited comment on ARROW-8657 at 4/30/20, 6:37 PM: --- For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but issue a FutureWarning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 was (Author: wesmckinn): For the record, I think we need to introduce a new flag to toggle the use of newer logical types and associated casting/metadata behavior, and leave the 1.0/2.0 flag for its intended use, i.e. the DataPageV1 vs DataPageV2 So my suggested fix is: * Add the new flag that is separate from switching version 1.0/2.0 * Revert the behavior in Python of version='2.0' to use DataPageV1, **but make a future warning to get people to use the new flag** * In a future release (maybe 2 releases from now), {{version='2.0'}} will again write DataPageV2 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. 
That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Fix Version/s: 0.17.1 > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > Fix For: 0.17.1 > > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8657) [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0'
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8657: Summary: [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when using version='2.0' (was: Distinguish parquet version 2 logical type vs DataPageV2) > [Python][C++][Parquet] Forward compatibility issue from 0.16 to 0.17 when > using version='2.0' > - > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2
[ https://issues.apache.org/jira/browse/ARROW-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096862#comment-17096862 ] Wes McKinney commented on ARROW-8657: - > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. To clarify, they are _not_ unreadable, but rather they are not _forward compatible_ (files written by 0.17.0 with {{version='2.0'}} cannot be read with 0.16.0 at the moment). In general, forward compatibility should be approached carefully. **All** files written by 0.16.0 are readable in 0.17.0 > Distinguish parquet version 2 logical type vs DataPageV2 > > > Key: ARROW-8657 > URL: https://issues.apache.org/jira/browse/ARROW-8657 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.17.0 >Reporter: Pierre Belzile >Priority: Major > > With the recent release of 0.17, the ParquetVersion is used to define the > logical type interpretation of fields and the selection of the DataPage > format. > As a result all parquet files that were created with ParquetVersion::V2 to > get features such as unsigned int32s, timestamps with nanosecond resolution, > etc are now unreadable. That's TBs of data in my case. > Those two concerns should be separated. Given that that DataPageV2 pages were > not written prior to 0.17 and in order to allow reading existing files, the > existing version property should continue to operate as in 0.16 and inform > the logical type mapping. > Some consideration should be given to issue a release 0.17.1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8447: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks > - > > Key: ARROW-8447 > URL: https://issues.apache.org/jira/browse/ARROW-8447 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This can be refactored with a little effort in Scanner::ToTable: > # Change `batches` to `std::vector` > # When pushing the closure to the TaskGroup, also track an incrementing > integer, e.g. scan_task_id > # In the closure, store the RecordBatches for this ScanTask in a local > vector, when all batches are consumed, move the local vector in the `batches` > at the right index, resizing and emplacing with mutex > # After waiting for the task group completion either > * Flatten into a single vector and call `Table::FromRecordBatch` or > * Write a RecordBatchReader that supports vector and add > method `Table::FromRecordBatchReader` > The latter involves more work but is the clean way, the other FromRecordBatch > method can be implemented from it and support "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096859#comment-17096859 ] Wes McKinney commented on ARROW-8654: - Also, the perf of reading very wide Parquet files won't be very good. > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return impl.read(path, columns=columns, kwargs) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 125, in read > path, columns=columns, kwargs > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1530, in read_table > partitioning=partitioning) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1189, in __init__ > self.validate_schemas() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1217, in validate_schemas > self.schema = self.pieces[0].get_metadata().schema > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 662, in get_metadata > f = self.open() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 669, in open > reader = self.open_file_func(self.path) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1040, in _open_dataset_file > buffer_size=dataset.buffer_size > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 210, in __init__ > 
read_dictionary=read_dictionary, metadata=metadata) > File "pyarrow/_parquet.pyx", line 1023, in > pyarrow._parquet.ParquetReader.open > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit > {noformat} > This is pandas 1.0.3, and pyarrow 0.17.0. > > I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. > > I also tried with 40,000 columns instead of 45,000 as above, and that does work with > 0.17.0. > > Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096858#comment-17096858 ] Wes McKinney commented on ARROW-8654: - FWIW, "large" metadata from very wide tables is a problematic area for the Parquet format in general. We'll have to have a closer look to why the metadata got bigger from 0.16.0 to 0.17.0, but there will always be some point where it's too big. I would guess if you keep increasing the number of columns that 0.16.0 will fail, too. > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return impl.read(path, columns=columns, kwargs) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 125, in read > path, columns=columns, kwargs > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1530, in read_table > partitioning=partitioning) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1189, in __init__ > self.validate_schemas() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1217, in validate_schemas > self.schema = self.pieces[0].get_metadata().schema > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 662, in get_metadata > f = self.open() > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 669, in open > reader = 
self.open_file_func(self.path) > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1040, in _open_dataset_file > buffer_size=dataset.buffer_size > File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line > 210, in __init__ > read_dictionary=read_dictionary, metadata=metadata) > File "pyarrow/_parquet.pyx", line 1023, in > pyarrow._parquet.ParquetReader.open > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit > {noformat} > This is pandas 1.0.3, and pyarrow 0.17.0. > > I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. > > I also tried with 40,000 columns instead of 45,000 as above, and that does work with > 0.17.0. > > Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Macpherson updated ARROW-8654: --- Description: {code:java} import pandas as pd import numpy as np num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! was: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > import numpy as np > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line >
[jira] [Assigned] (ARROW-7759) [C++][Dataset] Add CsvFileFormat for CSV support
[ https://issues.apache.org/jira/browse/ARROW-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-7759: --- Assignee: Ben Kietzman (was: Antoine Pitrou) > [C++][Dataset] Add CsvFileFormat for CSV support > > > Key: ARROW-7759 > URL: https://issues.apache.org/jira/browse/ARROW-7759 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > This should be a minimal implementation that binds 1-1 file and ScanTask for > now. Streaming optimizations can be done in ARROW-3410. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7759) [C++][Dataset] Add CsvFileFormat for CSV support
[ https://issues.apache.org/jira/browse/ARROW-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-7759. - Resolution: Fixed Issue resolved by pull request 7033 [https://github.com/apache/arrow/pull/7033] > [C++][Dataset] Add CsvFileFormat for CSV support > > > Key: ARROW-7759 > URL: https://issues.apache.org/jira/browse/ARROW-7759 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > This should be a minimal implementation that binds 1-1 file and ScanTask for > now. Streaming optimizations can be done in ARROW-3410. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2
Pierre Belzile created ARROW-8657: - Summary: Distinguish parquet version 2 logical type vs DataPageV2 Key: ARROW-8657 URL: https://issues.apache.org/jira/browse/ARROW-8657 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.0 Reporter: Pierre Belzile With the recent release of 0.17, the ParquetVersion is used to define the logical type interpretation of fields and the selection of the DataPage format. As a result all parquet files that were created with ParquetVersion::V2 to get features such as unsigned int32s, timestamps with nanosecond resolution, etc are now unreadable. That's TBs of data in my case. Those two concerns should be separated. Given that DataPageV2 pages were not written prior to 0.17 and in order to allow reading existing files, the existing version property should continue to operate as in 0.16 and inform the logical type mapping. Some consideration should be given to issue a release 0.17.1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds
[ https://issues.apache.org/jira/browse/ARROW-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8656: -- Labels: pull-request-available (was: ) > [Python] Switch to VS2017 in the windows wheel builds > - > > Key: ARROW-8656 > URL: https://issues.apache.org/jira/browse/ARROW-8656 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Since the recent conda-forge compiler migrations the wheel builds are failing > https://mail.google.com/mail/u/0/#label/ARROW/FMfcgxwHNCsqSGKQRMZxGlWWsfmGpKdC -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds
Krisztian Szucs created ARROW-8656: -- Summary: [Python] Switch to VS2017 in the windows wheel builds Key: ARROW-8656 URL: https://issues.apache.org/jira/browse/ARROW-8656 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Since the recent conda-forge compiler migrations the wheel builds are failing https://mail.google.com/mail/u/0/#label/ARROW/FMfcgxwHNCsqSGKQRMZxGlWWsfmGpKdC -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
[ https://issues.apache.org/jira/browse/ARROW-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Macpherson updated ARROW-8654: --- Description: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(outfile) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! was: {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(fout) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. 
I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! > [Python] pyarrow 0.17.0 fails reading "wide" parquet files > -- > > Key: ARROW-8654 > URL: https://issues.apache.org/jira/browse/ARROW-8654 > Project: Apache Arrow > Issue Type: Bug >Reporter: Mike Macpherson >Priority: Major > > {code:java} > import pandas as pd > num_rows, num_cols = 1000, 45000 > df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, > num_cols)).astype(np.uint8)) > outfile = "test.parquet" > df.to_parquet(outfile) > del df > df = pd.read_parquet(outfile) > {code} > Yields: > {noformat} > df = pd.read_parquet(outfile) > File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line > 310, in read_parquet > return
[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
Joris Van den Bossche created ARROW-8655: Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset Key: ARROW-8655 URL: https://issues.apache.org/jira/browse/ARROW-8655 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0 Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} classes that describe a partitioning used in the discovery phase. But once a dataset object is created, it doesn't know any more about this, it just has partition expressions for the fragments. And the partition keys are added to the schema, but you can't directly know which columns of the schema originated from the partitions. However, there can be use cases where it would be useful that a dataset still "knows" from what kind of partitioning it was created: - The "read CSV write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for parquet (kind of roundtripping the partitioning on read/write) - To convert the dataset to other representation, eg conversion to pandas, it can be useful to know what columns were partition columns (eg for pandas, those columns might be good candidates to be set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems can use similar information. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8639) [C++][Plasma] Require gflags
[ https://issues.apache.org/jira/browse/ARROW-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-8639. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7067 [https://github.com/apache/arrow/pull/7067] > [C++][Plasma] Require gflags > > > Key: ARROW-8639 > URL: https://issues.apache.org/jira/browse/ARROW-8639 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files
Mike Macpherson created ARROW-8654: -- Summary: [Python] pyarrow 0.17.0 fails reading "wide" parquet files Key: ARROW-8654 URL: https://issues.apache.org/jira/browse/ARROW-8654 Project: Apache Arrow Issue Type: Bug Reporter: Mike Macpherson {code:java} import pandas as pd num_rows, num_cols = 1000, 45000 df = pd.DataFrame(np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)) outfile = "test.parquet" df.to_parquet(outfile) del df df = pd.read_parquet(fout) {code} Yields: {noformat} df = pd.read_parquet(outfile) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 310, in read_parquet return impl.read(path, columns=columns, kwargs) File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 125, in read path, columns=columns, kwargs File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, in read_table partitioning=partitioning) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, in __init__ self.validate_schemas() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, in validate_schemas self.schema = self.pieces[0].get_metadata().schema File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, in get_metadata f = self.open() File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, in open reader = self.open_file_func(self.path) File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, in _open_dataset_file buffer_size=dataset.buffer_size File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__ read_dictionary=read_dictionary, metadata=metadata) File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit {noformat} This is pandas 1.0.3, and pyarrow 0.17.0. 
I tried this with pyarrow 0.16.0, and it works. 0.15.1 did as well. I also tried with 40,000 columns instead of 45,000 as above, and that does work with 0.17.0. Thanks for all your work on this project! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8653) [C++] Add support for gflags version detection
Krisztian Szucs created ARROW-8653: -- Summary: [C++] Add support for gflags version detection Key: ARROW-8653 URL: https://issues.apache.org/jira/browse/ARROW-8653 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Missing functionality from FindgflagsAlt, follow-up for https://github.com/apache/arrow/pull/7067/files#diff-bc36ca94c3abd969dcdbaec7125fed65R18 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files
[ https://issues.apache.org/jira/browse/ARROW-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8652: - Labels: dataset (was: ) > [Python] Test error message when discovering dataset with invalid files > --- > > Key: ARROW-8652 > URL: https://issues.apache.org/jira/browse/ARROW-8652 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset > > There is a comment in test_parquet.py about the Dataset API needing a > better error message for invalid files: > https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648 > Although, this seems to work now: > {code} > import tempfile > import pathlib > import pyarrow.dataset as ds > > > tempdir = pathlib.Path(tempfile.mkdtemp()) > with open(str(tempdir / "data.parquet"), 'wb') as f: > pass > In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet") > > > ... > OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': > Invalid: Parquet file size is 0 bytes > {code} > So we need to update the test to actually test it instead of skipping. > The only difference with the python ParquetDataset implementation is that the > datasets API raises an OSError and not an ArrowInvalid error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files
Joris Van den Bossche created ARROW-8652: Summary: [Python] Test error message when discovering dataset with invalid files Key: ARROW-8652 URL: https://issues.apache.org/jira/browse/ARROW-8652 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche There is a comment in test_parquet.py about the Dataset API needing a better error message for invalid files: https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648 Although, this seems to work now: {code} import tempfile import pathlib import pyarrow.dataset as ds tempdir = pathlib.Path(tempfile.mkdtemp()) with open(str(tempdir / "data.parquet"), 'wb') as f: pass In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet") ... OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes {code} So we need to update the test to actually test it instead of skipping. The only difference with the python ParquetDataset implementation is that the datasets API raises an OSError and not an ArrowInvalid error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8318) [C++][Dataset] Dataset should instantiate Fragment
[ https://issues.apache.org/jira/browse/ARROW-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8318: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Dataset should instantiate Fragment > -- > > Key: ARROW-8318 > URL: https://issues.apache.org/jira/browse/ARROW-8318 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Fragments are created on the fly when invoking a Scan. This means that a lot > of the auxilliary/ancilliary data must be stored by the specialised Dataset, > e.g. the FileSystemDataset must hold the path and partition expression. With > the venue of more complex Fragment, e.g. ParquetFileFragment, more data must > be stored. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8647: - Labels: dataset (was: ) > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects
[ https://issues.apache.org/jira/browse/ARROW-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8651: - Labels: dataset (was: ) > [Python][Dataset] Support pickling of Dataset objects > - > > Key: ARROW-8651 > URL: https://issues.apache.org/jira/browse/ARROW-8651 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > We already made several parts of a Dataset serializable (the formats, the > expressions, the filesystem). With those, it should also be possible to > pickle FileFragments, and with that also Dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects
Joris Van den Bossche created ARROW-8651: Summary: [Python][Dataset] Support pickling of Dataset objects Key: ARROW-8651 URL: https://issues.apache.org/jira/browse/ARROW-8651 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 We already made several parts of a Dataset serializable (the formats, the expressions, the filesystem). With those, it should also be possible to pickle FileFragments, and with that also Dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
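The strategy in the ticket (make the parts serializable, then the container follows) is the standard `__reduce__` pattern. A library-free sketch with stand-in classes, not Arrow's actual implementation:

```python
# Library-free sketch of the strategy in ARROW-8651: once an object's parts
# (format, expression, filesystem) pickle cleanly, the container pickles
# itself by recording "reconstruct me from these parts".
import pickle

class Fragment:
    """Stand-in for a FileFragment: a path plus a partition expression."""
    def __init__(self, path, expression):
        self.path = path
        self.expression = expression

    def __reduce__(self):
        # Rebuild from already-picklable parts.
        return (Fragment, (self.path, self.expression))

class Dataset:
    """Stand-in for a Dataset: just a list of fragments."""
    def __init__(self, fragments):
        self.fragments = fragments

    def __reduce__(self):
        return (Dataset, (self.fragments,))

original = Dataset([Fragment("part-0.parquet", "year == 2020")])
restored = pickle.loads(pickle.dumps(original))
```

This is what makes datasets usable with multiprocessing-based schedulers such as Dask, where the object must cross process boundaries.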
[jira] [Updated] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8648: -- Labels: pull-request-available (was: ) > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at a minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
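The two tweaks could look like the following in a CI script (a sketch with illustrative targets, not the actual apache/arrow scripts). Cargo keys its build cache on `RUSTFLAGS`, so exporting it once lets every invocation reuse the same artifacts:

```shell
#!/usr/bin/env bash
# Sketch of the proposed CI tweaks (targets illustrative, not the actual
# apache/arrow scripts). RUSTFLAGS is exported once so every cargo command
# sees the same flags and shares one build cache instead of rebuilding.
set -e
export RUSTFLAGS="-D warnings"

cargo build --all-targets    # one debug build, shared by the steps below
cargo test                   # reuses the debug artifacts
cargo run --example builder  # also debug mode: no --release, no rebuild
```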
[jira] [Created] (ARROW-8650) [Rust] [Website] Add documentation to Arrow website
Andy Grove created ARROW-8650: - Summary: [Rust] [Website] Add documentation to Arrow website Key: ARROW-8650 URL: https://issues.apache.org/jira/browse/ARROW-8650 Project: Apache Arrow Issue Type: Improvement Components: Rust, Website Reporter: Andy Grove Fix For: 1.0.0 The documentation page [1] on the Arrow site has links for C, C++, Java, Python, JavaScript, and R. It would be good to add Rust here as well, even if the docs are brief and link to the rustdocs on docs.rs [2] (which are currently broken due to ARROW-8536 [3]). [1] [https://arrow.apache.org/docs/] [2] https://docs.rs/crate/arrow/0.17.0 [3] https://issues.apache.org/jira/browse/ARROW-8536 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8649) [Java] [Website] Java documentation on website is hidden
[ https://issues.apache.org/jira/browse/ARROW-8649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-8649: -- Component/s: Website Java > [Java] [Website] Java documentation on website is hidden > > > Key: ARROW-8649 > URL: https://issues.apache.org/jira/browse/ARROW-8649 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Website >Reporter: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > There is some excellent Java documentation on the web site that is hard to > find because the Java documentation link [1] goes straight to the generated > javadocs. > > [1] https://arrow.apache.org/docs/java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8648) [Rust] Optimize Rust CI Build Times
[ https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hildreth updated ARROW-8648: - Component/s: Rust > [Rust] Optimize Rust CI Build Times > --- > > Key: ARROW-8648 > URL: https://issues.apache.org/jira/browse/ARROW-8648 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mark Hildreth >Priority: Major > > In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build > options used that are at odds with each other, resulting in multiple > redundant builds where a smaller number could do the same job. The following > tweaks, at minimum, could reduce this, speeding up build times: > * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. > Currently, it's only used for a single command (the {{build --all-targets}} > in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, > since RUSTFLAGS has changed. > * Don't run examples in release mode, as that would force a new (and slower) > rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8649) [Java] [Website] Java documentation on website is hidden
Andy Grove created ARROW-8649: - Summary: [Java] [Website] Java documentation on website is hidden Key: ARROW-8649 URL: https://issues.apache.org/jira/browse/ARROW-8649 Project: Apache Arrow Issue Type: Bug Reporter: Andy Grove Fix For: 1.0.0 There is some excellent Java documentation on the web site that is hard to find because the Java documentation link [1] goes straight to the generated javadocs. [1] https://arrow.apache.org/docs/java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8648) [Rust] Optimize Rust CI Build Times
Mark Hildreth created ARROW-8648: Summary: [Rust] Optimize Rust CI Build Times Key: ARROW-8648 URL: https://issues.apache.org/jira/browse/ARROW-8648 Project: Apache Arrow Issue Type: Improvement Reporter: Mark Hildreth In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build options used that are at odds with each other, resulting in multiple redundant builds where a smaller number could do the same job. The following tweaks, at minimum, could reduce this, speeding up build times: * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. Currently, it's only used for a single command (the {{build --all-targets}} in {{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, since RUSTFLAGS has changed. * Don't run examples in release mode, as that would force a new (and slower) rebuild, when the examples have already been built in debug mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8638. --- Resolution: Information Provided Closing since there isn't a bug to fix; further discussion can take place here or on the mailing list > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096569#comment-17096569 ] Anish Biswas commented on ARROW-8642: - Okay, I will do that from now on. > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and it's a > numpy.int8 type. How do I convert it to a pyarrow.DataType intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.DataType. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096567#comment-17096567 ] Wes McKinney commented on ARROW-8642: - [~trickarcher] if you have questions it's better to use the mailing list than to open JIRA issues > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and it's a > numpy.int8 type. How do I convert it to a pyarrow.DataType intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.DataType. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
[ https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096566#comment-17096566 ] Wes McKinney commented on ARROW-8641: - Too bad this was not tested > [Python] Regression in feather: no longer supports permutation in column > selection > -- > > Key: ARROW-8641 > URL: https://issues.apache.org/jira/browse/ARROW-8641 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > A quite annoying regression (original report from > https://github.com/pandas-dev/pandas/issues/33878), is that when specifying > {{columns}} to read, this now fails if the order of the columns is not > exactly the same as in the file: > {code: python} > In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', > 'c']) > In [29]: from pyarrow import feather > In [30]: feather.write_feather(table, "test.feather") > # this works fine > In [32]: feather.read_table("test.feather", columns=['a', 'b']) > > > Out[32]: > pyarrow.Table > a: int64 > b: int64 > In [33]: feather.read_table("test.feather", columns=['b', 'a']) > > > --- > ArrowInvalid Traceback (most recent call last) > in > > 1 feather.read_table("test.feather", columns=['b', 'a']) > ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, > memory_map) > 237 return reader.read_indices(columns) > 238 elif all(map(lambda t: t == str, column_types)): > --> 239 return reader.read_names(columns) > 240 > 241 column_type_names = [t.__name__ for t in column_types] > ~/scipy/repos/arrow/python/pyarrow/feather.pxi in > pyarrow.lib.FeatherReader.read_names() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Schema at index 0 was different: > b: int64 > a: int64 > vs > a: int64 > b: int64 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
[ https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8641: Fix Version/s: 1.0.0 > [Python] Regression in feather: no longer supports permutation in column > selection > -- > > Key: ARROW-8641 > URL: https://issues.apache.org/jira/browse/ARROW-8641 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > A quite annoying regression (original report from > https://github.com/pandas-dev/pandas/issues/33878), is that when specifying > {{columns}} to read, this now fails if the order of the columns is not > exactly the same as in the file: > {code: python} > In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', > 'c']) > In [29]: from pyarrow import feather > In [30]: feather.write_feather(table, "test.feather") > # this works fine > In [32]: feather.read_table("test.feather", columns=['a', 'b']) > > > Out[32]: > pyarrow.Table > a: int64 > b: int64 > In [33]: feather.read_table("test.feather", columns=['b', 'a']) > > > --- > ArrowInvalid Traceback (most recent call last) > in > > 1 feather.read_table("test.feather", columns=['b', 'a']) > ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, > memory_map) > 237 return reader.read_indices(columns) > 238 elif all(map(lambda t: t == str, column_types)): > --> 239 return reader.read_names(columns) > 240 > 241 column_type_names = [t.__name__ for t in column_types] > ~/scipy/repos/arrow/python/pyarrow/feather.pxi in > pyarrow.lib.FeatherReader.read_names() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Schema at index 0 was different: > b: int64 > a: int64 > vs > a: int64 > b: int64 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-8647: - Description: In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 was: In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). 
But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
Joris Van den Bossche created ARROW-8647: Summary: [C++][Dataset] Optionally encode partition field values as dictionary type Key: ARROW-8647 URL: https://issues.apache.org/jira/browse/ARROW-8647 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0 In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in {{Partitioning}} passed to the dataset factory). Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096500#comment-17096500 ] Vibhatha Lakmal Abeykoon commented on ARROW-8638: - I tried the LD_LIBRARY_PATH approach and it worked fine. But I think I need to adopt a neater setup, as you point out. Thank you for this response. I have another thing in mind. Think of an instance where Arrow is compiled from source. In such cases, is there a best practice that can be adopted? > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8622) [Rust] Parquet crate does not compile on aarch64
[ https://issues.apache.org/jira/browse/ARROW-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8622. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7059 [https://github.com/apache/arrow/pull/7059] > [Rust] Parquet crate does not compile on aarch64 > > > Key: ARROW-8622 > URL: https://issues.apache.org/jira/browse/ARROW-8622 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: R. Tyler Croy >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096493#comment-17096493 ] Uwe Korn commented on ARROW-8638: - You either need to extend the environment variable `LD_LIBRARY_PATH` to point to the directory where `libarrow.so.16` resides or (a bit more complicated in setup.py but the preferred approach) set the RPATH on the generated `example.so` Python module to also include the directory where `libarrow.so.16` resides; see turbodbc for an example: https://github.com/blue-yonder/turbodbc/blob/8e2db0d0a26b620ad3e687e56a88fdab3117e09c/setup.py#L186-L189 > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not a table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7955) [Java] Support large buffer for file/stream IPC
[ https://issues.apache.org/jira/browse/ARROW-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7955: -- Labels: pull-request-available (was: ) > [Java] Support large buffer for file/stream IPC > --- > > Key: ARROW-7955 > URL: https://issues.apache.org/jira/browse/ARROW-7955 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After supporting 64-bit ArrowBuf, we need to make file/stream IPC work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8646) Allow UnionListWriter to write null values
[ https://issues.apache.org/jira/browse/ARROW-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8646: -- Labels: pull-request-available (was: ) > Allow UnionListWriter to write null values > -- > > Key: ARROW-8646 > URL: https://issues.apache.org/jira/browse/ARROW-8646 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Thippana Vamsi Kalyan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > UnionListWriter has no provision to skip an index to write a null value into > the list. > It should allow calling writeNull -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8646) Allow UnionListWriter to write null values
Thippana Vamsi Kalyan created ARROW-8646: Summary: Allow UnionListWriter to write null values Key: ARROW-8646 URL: https://issues.apache.org/jira/browse/ARROW-8646 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Thippana Vamsi Kalyan UnionListWriter has no provision to skip an index to write a null value into the list. It should allow calling writeNull -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8645) [C++] Missing gflags dependency for plasma
[ https://issues.apache.org/jira/browse/ARROW-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8645: -- Labels: pull-request-available (was: ) > [C++] Missing gflags dependency for plasma > -- > > Key: ARROW-8645 > URL: https://issues.apache.org/jira/browse/ARROW-8645 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The documentation build fails because gflags is not installed and CMake > doesn't build the bundled version of it. > Introduced by > https://github.com/apache/arrow/commit/dfc14ef24ed54ff757c10a26663a629ce5e8cebf -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8645) [C++] Missing gflags dependency for plasma
Krisztian Szucs created ARROW-8645: -- Summary: [C++] Missing gflags dependency for plasma Key: ARROW-8645 URL: https://issues.apache.org/jira/browse/ARROW-8645 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The documentation build fails because gflags is not installed and CMake doesn't build the bundled version of it. Introduced by https://github.com/apache/arrow/commit/dfc14ef24ed54ff757c10a26663a629ce5e8cebf -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns
Joris Van den Bossche created ARROW-8644: Summary: [Python] Dask integration tests failing due to change in not including partition columns Key: ARROW-8644 URL: https://issues.apache.org/jira/browse/ARROW-8644 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche In ARROW-3861 (https://github.com/apache/arrow/pull/7050), I "fixed" a bug that the partition columns are always included even when the user did a manual column selection. But apparently, this behaviour was being relied upon by dask. See the failing nightly integration tests: https://circleci.com/gh/ursa-labs/crossbow/11854?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link So the best option might be to just keep the "old" behaviour for the legacy ParquetDataset, when using the new datasets API ({{use_legacy_datasets=False}}), you get the new / correct behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion
Joris Van den Bossche created ARROW-8643: Summary: [Python] Tests with pandas master failing due to freq assertion Key: ARROW-8643 URL: https://issues.apache.org/jira/browse/ARROW-8643 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Nightly pandas master tests are failing, eg https://circleci.com/gh/ursa-labs/crossbow/11858?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link This is caused by a change in pandas, see https://github.com/pandas-dev/pandas/pull/33815#issuecomment-620820134 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096278#comment-17096278 ] Anish Biswas commented on ARROW-8642: - Oh okay! That's neat! Thanks! > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anish Biswas closed ARROW-8642. --- Resolution: Fixed > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
[ https://issues.apache.org/jira/browse/ARROW-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096272#comment-17096272 ] Joris Van den Bossche commented on ARROW-8642: -- There is a {{from_numpy_dtype}} function for this: {code} In [42]: pa.from_numpy_dtype(np.dtype("int8")) Out[42]: DataType(int8) {code} It's included in the API docs here: https://arrow.apache.org/docs/python/api/datatypes.html > Is there a good way to convert data types from numpy types to pyarrow > DataType? > --- > > Key: ARROW-8642 > URL: https://issues.apache.org/jira/browse/ARROW-8642 > Project: Apache Arrow > Issue Type: Wish >Reporter: Anish Biswas >Priority: Major > > Pretty much what the title says. Suppose I have a numpy array and its a > numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I > thought a Dictionary lookup table might work but perhaps there is some better > way? > Why do I need this? I am trying to make pyarrow arrays with from_buffers(). > The first parameter is essentially a pyarrow.Datatype. So that's why. I have > validity_bitmaps as a buffer of uint8 and that's why I am using > from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?
Anish Biswas created ARROW-8642: --- Summary: Is there a good way to convert data types from numpy types to pyarrow DataType? Key: ARROW-8642 URL: https://issues.apache.org/jira/browse/ARROW-8642 Project: Apache Arrow Issue Type: Wish Reporter: Anish Biswas Pretty much what the title says. Suppose I have a numpy array and its a numpy.int8 type. How do I convert it to a pyarrow.Datatype intuitively? I thought a Dictionary lookup table might work but perhaps there is some better way? Why do I need this? I am trying to make pyarrow arrays with from_buffers(). The first parameter is essentially a pyarrow.Datatype. So that's why. I have validity_bitmaps as a buffer of uint8 and that's why I am using from_buffers() and not pyarrow.array(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8504) [C++] Add a method that takes an RLE visitor for a bitmap.
[ https://issues.apache.org/jira/browse/ARROW-8504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-8504: -- Assignee: Micah Kornfield > [C++] Add a method that takes an RLE visitor for a bitmap. > -- > > Key: ARROW-8504 > URL: https://issues.apache.org/jira/browse/ARROW-8504 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > > For nullability data, nulls are in many cases not evenly distributed. In these cases it would be beneficial to have a mechanism for understanding when runs of set/unset bits are encountered. One example is translating a bitmap to Parquet definition levels. > > An implementation path could be to add this as a method on Bitmap that makes an adaptor callback for VisitWords, but I think at least for Parquet an iterator API might be more appropriate (something that is easily stoppable/resumable). -- This message was sent by Atlassian Jira (v8.3.4#803005)
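The run-based visiting described above can be sketched in Python (a toy illustration of the idea, not the proposed C++ API; the function name `bit_runs` is made up):

```python
from itertools import groupby

def bit_runs(bits):
    """Yield (value, run_length) pairs for consecutive runs of set/unset bits."""
    for value, group in groupby(bits):
        yield value, sum(1 for _ in group)

# A validity bitmap with clustered nulls: a run-aware consumer (e.g. a
# Parquet definition-level writer) can emit whole runs at once instead
# of inspecting every bit individually.
bitmap = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
print(list(bit_runs(bitmap)))  # [(1, 3), (0, 2), (1, 4), (0, 1)]
```

An iterator like this is naturally stoppable/resumable, which is the property the issue wants for the Parquet writer.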
[jira] [Created] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
Joris Van den Bossche created ARROW-8641: Summary: [Python] Regression in feather: no longer supports permutation in column selection Key: ARROW-8641 URL: https://issues.apache.org/jira/browse/ARROW-8641 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche A quite annoying regression (original report from https://github.com/pandas-dev/pandas/issues/33878) is that when specifying {{columns}} to read, this now fails if the order of the columns is not exactly the same as in the file:

{code:python}
In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', 'c'])

In [29]: from pyarrow import feather

In [30]: feather.write_feather(table, "test.feather")

# this works fine
In [32]: feather.read_table("test.feather", columns=['a', 'b'])
Out[32]:
pyarrow.Table
a: int64
b: int64

In [33]: feather.read_table("test.feather", columns=['b', 'a'])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 feather.read_table("test.feather", columns=['b', 'a'])

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240
    241     column_type_names = [t.__name__ for t in column_types]

~/scipy/repos/arrow/python/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different:
b: int64
a: int64
vs
a: int64
b: int64
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8592) [C++] Docs still list LLVM 7 as compiler used
[ https://issues.apache.org/jira/browse/ARROW-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8592: -- Labels: pull-request-available (was: ) > [C++] Docs still list LLVM 7 as compiler used > - > > Key: ARROW-8592 > URL: https://issues.apache.org/jira/browse/ARROW-8592 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > should be LLVM 8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8640) pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not match the passed number (3)
[ https://issues.apache.org/jira/browse/ARROW-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anish Biswas closed ARROW-8640. --- > pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not > match the passed number (3) > > > Key: ARROW-8640 > URL: https://issues.apache.org/jira/browse/ARROW-8640 > Project: Apache Arrow > Issue Type: Bug >Reporter: Anish Biswas >Priority: Major >
> {code:python}
> arr1 = pa.array([1, 2, 3, 4, 5])
> arr1.buffers()
> arr2 = pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([1, 0, 0, 2, 1, 2, 3], type='int32')
> value_offsets.buffers()
> arr = pa.UnionArray.from_dense(types, value_offsets, [arr1, arr2])
> arr4 = pa.UnionArray.from_buffers(
>     pa.struct([pa.field("0", arr1.type), pa.field("1", arr2.type)]),
>     5, arr.buffers()[0:3], children=[arr1, arr2])
> {code}
> The problem arises when I try to produce the Union Array via buffers: according to the Columnar documentation I need 3 buffers to produce a dense Union Array, but when I try this, I get the error `Type's expected number of buffers (1) did not match the passed number (3)`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8640) pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not match the passed number (3)
[ https://issues.apache.org/jira/browse/ARROW-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096164#comment-17096164 ] Anish Biswas commented on ARROW-8640: - Ah, I see. Yes, that makes more sense. Thanks for the help! I'll close this issue now. > pyarrow.UnionArray.from_buffers() expected number of buffers (1) did not > match the passed number (3) > > > Key: ARROW-8640 > URL: https://issues.apache.org/jira/browse/ARROW-8640 > Project: Apache Arrow > Issue Type: Bug >Reporter: Anish Biswas >Priority: Major >
> {code:python}
> arr1 = pa.array([1, 2, 3, 4, 5])
> arr1.buffers()
> arr2 = pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([1, 0, 0, 2, 1, 2, 3], type='int32')
> value_offsets.buffers()
> arr = pa.UnionArray.from_dense(types, value_offsets, [arr1, arr2])
> arr4 = pa.UnionArray.from_buffers(
>     pa.struct([pa.field("0", arr1.type), pa.field("1", arr2.type)]),
>     5, arr.buffers()[0:3], children=[arr1, arr2])
> {code}
> The problem arises when I try to produce the Union Array via buffers: according to the Columnar documentation I need 3 buffers to produce a dense Union Array, but when I try this, I get the error `Type's expected number of buffers (1) did not match the passed number (3)`. -- This message was sent by Atlassian Jira (v8.3.4#803005)