[jira] [Commented] (ARROW-7513) [JS] Arrow Tutorial: Common data types
[ https://issues.apache.org/jira/browse/ARROW-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011451#comment-17011451 ]

Leo Meyerovich commented on ARROW-7513:
---

* Good: Updated the numerics section to use `VectorT.from(Array | Buffer)`
** Oddly, `arrow.Int64Vector.from((new Uint32Array([2,3, 555,0, 1,0])).buffer)` returns length 6, not 3 (0.15.0)
* Bad: `VectorDictionary.from(['hello', 'hello', null, 'carrot'])` did not seem to work, so kept as lower-level for now
* Bad: Still not sure how to do structs

> [JS] Arrow Tutorial: Common data types
> --
>
> Key: ARROW-7513
> URL: https://issues.apache.org/jira/browse/ARROW-7513
> Project: Apache Arrow
> Issue Type: Task
> Components: JavaScript
> Reporter: Leo Meyerovich
> Assignee: Leo Meyerovich
> Priority: Minor
>
> The JS client lacks basic introductory material around creating the common basic data types, such as turning JS arrays into ints, dicts, etc. There is no equivalent of Python's [https://arrow.apache.org/docs/python/data.html]. This has made use difficult for myself, and I bet for others.
>
> As with previous tutorials, I started sketching on [https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit]. When we're happy, it can make sense to export it as an HTML page or something to the repo, or just link from the main readme.
> I believe the target topics worth covering are:
> * Common user data types: Ints, Dicts, Struct, Time
> * Common column types: Data, Vector, Column
> * Going from individual & arrays & buffers of JS values to Arrow-wrapped forms, and basic inspection of the result
> Not worth going into here is Tables vs. RecordBatches, which is the other tutorial.
>
> 1. Ideas of what to add/edit/remove?
> 2. And anyone up for helping with discussion of Data vs. Vector, and ingest of Time & Struct?
> 3. ... Should we be encouraging Struct or Map? I saw some PRs changing stuff here.
>
> cc [~wesm] [~bhulette] [~paul.e.taylor]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
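The length-6 observation above has a simple arithmetic reading: six 32-bit words occupy 24 bytes, which is exactly three 64-bit values, so a length of 6 suggests the element count was taken from the 32-bit view of the buffer rather than the 64-bit one. The sketch below is illustrative Python using `struct`, not Arrow JS code, and assumes the six words were meant as little-endian (low word, high word) pairs:

```python
import struct

# Six 32-bit words, intended as three int64 values, low word first:
# (2, 3) -> 3 * 2**32 + 2, (555, 0) -> 555, (1, 0) -> 1.
words = [2, 3, 555, 0, 1, 0]
raw = struct.pack("<6I", *words)            # 24 bytes of little-endian uint32

# Reinterpreting the same 24 bytes as int64 yields three elements, not six.
as_int64 = list(struct.unpack("<3q", raw))
print(len(as_int64), as_int64)              # 3 [12884901890, 555, 1]
```

If the vector instead reports length 6, it is consistent with the element count being derived from the source TypedArray's 32-bit element size rather than the target 64-bit type.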
[jira] [Assigned] (ARROW-7523) [Tools] Relax clang-tidy check
[ https://issues.apache.org/jira/browse/ARROW-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-7523: --- Assignee: Francois Saint-Jacques > [Tools] Relax clang-tidy check > -- > > Key: ARROW-7523 > URL: https://issues.apache.org/jira/browse/ARROW-7523 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This is a very invasive check added in recent clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions
[ https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-6788. - Resolution: Fixed Issue resolved by pull request 6144 [https://github.com/apache/arrow/pull/6144] > [CI] Migrate Travis CI lint job to GitHub Actions > - > > Key: ARROW-6788 > URL: https://issues.apache.org/jira/browse/ARROW-6788 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or > less immediately so this will give more prompt feedback to contributors -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions
[ https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-6788: --- Assignee: Krisztian Szucs > [CI] Migrate Travis CI lint job to GitHub Actions > - > > Key: ARROW-6788 > URL: https://issues.apache.org/jira/browse/ARROW-6788 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or > less immediately so this will give more prompt feedback to contributors -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7429) [Java] Enhance code style checking for Java code (remove consecutive spaces)
[ https://issues.apache.org/jira/browse/ARROW-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-7429. Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6060 [https://github.com/apache/arrow/pull/6060] > [Java] Enhance code style checking for Java code (remove consecutive spaces) > > > Key: ARROW-7429 > URL: https://issues.apache.org/jira/browse/ARROW-7429 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This issue is opened in response to a discussion in > https://github.com/apache/arrow/pull/5861#discussion_r348917065. > We found the current style checking for Java code is not sufficient. So we > want to enhance it in a series of "small" steps, in order to avoid having to > change too many files at once. > In this issue, we remove consecutive spaces between tokens, so that tokens > are separated by single spaces. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7525) [Python][CI] Build PyArrow on VS2019
[ https://issues.apache.org/jira/browse/ARROW-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-7525: --- Component/s: (was: C++) Python > [Python][CI] Build PyArrow on VS2019 > - > > Key: ARROW-7525 > URL: https://issues.apache.org/jira/browse/ARROW-7525 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > > Enable ARROW_PARQUET cmake flag. Additional patching might be required, see > https://github.com/microsoft/vcpkg/pull/8263/files -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7525) [Python][CI] Build PyArrow on VS2019
Krisztian Szucs created ARROW-7525: -- Summary: [Python][CI] Build PyArrow on VS2019 Key: ARROW-7525 URL: https://issues.apache.org/jira/browse/ARROW-7525 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Krisztian Szucs Enable ARROW_PARQUET cmake flag. Additional patching might be required, see https://github.com/microsoft/vcpkg/pull/8263/files -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7524) [C++][CI] Build parquet support in the VS2019 GitHub Actions job
Krisztian Szucs created ARROW-7524: -- Summary: [C++][CI] Build parquet support in the VS2019 GitHub Actions job Key: ARROW-7524 URL: https://issues.apache.org/jira/browse/ARROW-7524 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Krisztian Szucs Enable ARROW_PARQUET cmake flag. Additional patching might be required, see https://github.com/microsoft/vcpkg/pull/8263/files -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7523) [Tools] Ignore modernize-use-trailing-return-type clang-tidy check
Francois Saint-Jacques created ARROW-7523: - Summary: [Tools] Ignore modernize-use-trailing-return-type clang-tidy check Key: ARROW-7523 URL: https://issues.apache.org/jira/browse/ARROW-7523 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Francois Saint-Jacques Fix For: 0.16.0 This is a very invasive check added in recent clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7523) [Tools] Relax clang-tidy check
[ https://issues.apache.org/jira/browse/ARROW-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7523: -- Summary: [Tools] Relax clang-tidy check (was: [Tools] Ignore modernize-use-trailing-return-type clang-tidy check) > [Tools] Relax clang-tidy check > -- > > Key: ARROW-7523 > URL: https://issues.apache.org/jira/browse/ARROW-7523 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This is a very invasive check added in recent clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7523) [Tools] Ignore modernize-use-trailing-return-type clang-tidy check
[ https://issues.apache.org/jira/browse/ARROW-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7523: -- Labels: pull-request-available (was: ) > [Tools] Ignore modernize-use-trailing-return-type clang-tidy check > -- > > Key: ARROW-7523 > URL: https://issues.apache.org/jira/browse/ARROW-7523 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > > This is a very invasive check added in recent clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7466) [CI][Java] Fix gandiva-jar-osx nightly build failure
[ https://issues.apache.org/jira/browse/ARROW-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011066#comment-17011066 ] Krisztian Szucs commented on ARROW-7466: Hey [~projjal]! Seems like it is a travis deployment issue. We can use another approach implemented in the crossbow script (basically an alternative for the deployment scripts on various CI services), see the usage here https://github.com/apache/arrow/blob/master/dev/tasks/conda-recipes/azure.osx.yml#L76-L91 Installing the dependencies on Travis could be more complicated than porting the CI template to Azure (travis.osx.yml to azure.osx.yml). > [CI][Java] Fix gandiva-jar-osx nightly build failure > > > Key: ARROW-7466 > URL: https://issues.apache.org/jira/browse/ARROW-7466 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Gandiva-jar-osx nightly build has been failing for the past few days. From > [https://github.com/google/error-prone/issues/1441] the issue seems to be > error-prone version 2.3.3 currently used is incompatible with java 13 that is > being used in the nightly build. Updating it to 2.3.4 should fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions
[ https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011060#comment-17011060 ] Krisztian Szucs commented on ARROW-6788: Just added a PR to test the merge script. > [CI] Migrate Travis CI lint job to GitHub Actions > - > > Key: ARROW-6788 > URL: https://issues.apache.org/jira/browse/ARROW-6788 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or > less immediately so this will give more prompt feedback to contributors -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files
[ https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011052#comment-17011052 ] Neville Dipale commented on ARROW-5213: --- Yes, we've documented how to generate the flatbuffer files as part of the README > [Format] Script for updating various checked-in Flatbuffers files > - > > Key: ARROW-5213 > URL: https://issues.apache.org/jira/browse/ARROW-5213 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools, Format, Go, Rust >Reporter: Wes McKinney >Assignee: Andy Grove >Priority: Minor > Fix For: 0.16.0 > > > Some subprojects have begun checking in generated Flatbuffers files to source > control. This presents a maintainability issue when there are additions or > changes made to the .fbs sources. It would be useful to be able to automate > the update of these files so it doesn't have to happen on a manual / > case-by-case basis -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6256) [Rust] parquet-format should be released by Apache process
[ https://issues.apache.org/jira/browse/ARROW-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011051#comment-17011051 ] Neville Dipale commented on ARROW-6256: --- Moving to 1.0.0, not urgent for 0.16.0 > [Rust] parquet-format should be released by Apache process > -- > > Key: ARROW-6256 > URL: https://issues.apache.org/jira/browse/ARROW-6256 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.14.1 >Reporter: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > The Arrow parquet crate depends on the parquet-format crate [1]. > Parquet-format 2.6.0 was recently released and has breaking changes compared > to 2.5.0. > This means that previously published Arrow Parquet/DataFusion crates are now > unusable out of the box [2]. > We should bring parquet-format into an Apache release process to avoid this > type of issue in the future. > > [1] [https://github.com/sunchao/parquet-format-rs] > [2] https://issues.apache.org/jira/browse/ARROW-6255 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6256) [Rust] parquet-format should be released by Apache process
[ https://issues.apache.org/jira/browse/ARROW-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-6256: -- Fix Version/s: (was: 0.16.0) 1.0.0 > [Rust] parquet-format should be released by Apache process > -- > > Key: ARROW-6256 > URL: https://issues.apache.org/jira/browse/ARROW-6256 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.14.1 >Reporter: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > The Arrow parquet crate depends on the parquet-format crate [1]. > Parquet-format 2.6.0 was recently released and has breaking changes compared > to 2.5.0. > This means that previously published Arrow Parquet/DataFusion crates are now > unusable out of the box [2]. > We should bring parquet-format into an Apache release process to avoid this > type of issue in the future. > > [1] [https://github.com/sunchao/parquet-format-rs] > [2] https://issues.apache.org/jira/browse/ARROW-6255 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7503) [Rust] Rust builds are failing on master
[ https://issues.apache.org/jira/browse/ARROW-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-7503: - Assignee: Neville Dipale > [Rust] Rust builds are failing on master > > > Key: ARROW-7503 > URL: https://issues.apache.org/jira/browse/ARROW-7503 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Neal Richardson >Assignee: Neville Dipale >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/runs/374130594#step:5:1506] for example: > {code} > ... > schema::types::tests::test_schema_type_thrift_conversion_err stdout > thread 'schema::types::tests::test_schema_type_thrift_conversion_err' > panicked at 'assertion failed: `(left == right)` > left: `"description() is deprecated; use Display"`, > right: `"Root schema must be Group type"`', > parquet/src/schema/types.rs:1760:13 > failures: > > column::writer::tests::test_column_writer_error_when_writing_disabled_dictionary > column::writer::tests::test_column_writer_inconsistent_def_rep_length > column::writer::tests::test_column_writer_invalid_def_levels > column::writer::tests::test_column_writer_invalid_rep_levels > column::writer::tests::test_column_writer_not_enough_values_to_write > file::writer::tests::test_file_writer_error_after_close > file::writer::tests::test_row_group_writer_error_after_close > file::writer::tests::test_row_group_writer_error_not_all_columns_written > file::writer::tests::test_row_group_writer_num_records_mismatch > schema::types::tests::test_primitive_type > schema::types::tests::test_schema_type_thrift_conversion_err > test result: FAILED. 325 passed; 11 failed; 0 ignored; 0 measured; 0 filtered > out > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions
[ https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6788: -- Labels: pull-request-available (was: ) > [CI] Migrate Travis CI lint job to GitHub Actions > - > > Key: ARROW-6788 > URL: https://issues.apache.org/jira/browse/ARROW-6788 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > > Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or > less immediately so this will give more prompt feedback to contributors -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-186) [Java] Make sure alignment and memory padding conform to spec
[ https://issues.apache.org/jira/browse/ARROW-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011044#comment-17011044 ] Micah Kornfield commented on ARROW-186: --- The spec has changed. I believe Java does do 8-byte alignment, but I would need to double-check. > [Java] Make sure alignment and memory padding conform to spec > - > > Key: ARROW-186 > URL: https://issues.apache.org/jira/browse/ARROW-186 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Micah Kornfield >Priority: Major > > Per spec 8 byte alignment and padding for buffers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-186) [Java] Make sure alignment and memory padding conform to spec
[ https://issues.apache.org/jira/browse/ARROW-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-186: -- Description: Per spec 8 byte alignment and padding for buffers. (was: Per spec 64 byte alignment and padding for buffers.) > [Java] Make sure alignment and memory padding conform to spec > - > > Key: ARROW-186 > URL: https://issues.apache.org/jira/browse/ARROW-186 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Micah Kornfield >Priority: Major > > Per spec 8 byte alignment and padding for buffers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
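The "8 byte alignment and padding" requirement above amounts to rounding each buffer length up to the next multiple of 8. A minimal illustrative sketch in Python (not Arrow's Java implementation; `padded_length` is a hypothetical helper name):

```python
def padded_length(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    return (nbytes + alignment - 1) & ~(alignment - 1)

for n in (0, 1, 8, 9, 13, 64):
    print(n, "->", padded_length(n))   # 0->0, 1->8, 8->8, 9->16, 13->16, 64->64
```

The bit-mask form works for any power-of-two alignment, so the same helper covers the 64-byte recommendation that older versions of the spec used.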
[jira] [Commented] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
[ https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011041#comment-17011041 ] Micah Kornfield commented on ARROW-6799: What criteria are we using for "maintained"? There might be some other code that would fall into this category (ORC JNI support comes to mind). > [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?) > - > > Key: ARROW-6799 > URL: https://issues.apache.org/jira/browse/ARROW-6799 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Java >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Does not appear to be tested in CI. Originally reported at > https://github.com/apache/arrow/issues/5575 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7503) [Rust] Rust builds are failing on master
[ https://issues.apache.org/jira/browse/ARROW-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7503: -- Labels: pull-request-available (was: ) > [Rust] Rust builds are failing on master > > > Key: ARROW-7503 > URL: https://issues.apache.org/jira/browse/ARROW-7503 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Neal Richardson >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > > See [https://github.com/apache/arrow/runs/374130594#step:5:1506] for example: > {code} > ... > schema::types::tests::test_schema_type_thrift_conversion_err stdout > thread 'schema::types::tests::test_schema_type_thrift_conversion_err' > panicked at 'assertion failed: `(left == right)` > left: `"description() is deprecated; use Display"`, > right: `"Root schema must be Group type"`', > parquet/src/schema/types.rs:1760:13 > failures: > > column::writer::tests::test_column_writer_error_when_writing_disabled_dictionary > column::writer::tests::test_column_writer_inconsistent_def_rep_length > column::writer::tests::test_column_writer_invalid_def_levels > column::writer::tests::test_column_writer_invalid_rep_levels > column::writer::tests::test_column_writer_not_enough_values_to_write > file::writer::tests::test_file_writer_error_after_close > file::writer::tests::test_row_group_writer_error_after_close > file::writer::tests::test_row_group_writer_error_not_all_columns_written > file::writer::tests::test_row_group_writer_num_records_mismatch > schema::types::tests::test_primitive_type > schema::types::tests::test_schema_type_thrift_conversion_err > test result: FAILED. 325 passed; 11 failed; 0 ignored; 0 measured; 0 filtered > out > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5767) [Format] Permit dictionary replacements in IPC protocol
[ https://issues.apache.org/jira/browse/ARROW-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011038#comment-17011038 ] Micah Kornfield commented on ARROW-5767: Yes, I think this duplicates ARROW-7283 > [Format] Permit dictionary replacements in IPC protocol > --- > > Key: ARROW-5767 > URL: https://issues.apache.org/jira/browse/ARROW-5767 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.16.0 > > > We permit dictionaries to grow using the {{isDelta}} property in the IPC > protocol. I think it should be allowed for the same dictionary ID to appear > in an IPC protocol stream but with {{isDelta=false}}. This would indicate > that the dictionary in that message is to replace any prior-observed ones in > subsequent record batches. > For example, we might have dictionary batches in a stream: > {code} > id: 0 isDelta: false values: [a, b, c] > id: 0 isDelta: true values [d] > id 0 isDelta: false values [c, a, b] > {code} > Such data could easily be produced by a stream producer that is creating > dictionaries in different execution threads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-5767) [Format] Permit dictionary replacements in IPC protocol
[ https://issues.apache.org/jira/browse/ARROW-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-5767. Resolution: Duplicate > [Format] Permit dictionary replacements in IPC protocol > --- > > Key: ARROW-5767 > URL: https://issues.apache.org/jira/browse/ARROW-5767 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.16.0 > > > We permit dictionaries to grow using the {{isDelta}} property in the IPC > protocol. I think it should be allowed for the same dictionary ID to appear > in an IPC protocol stream but with {{isDelta=false}}. This would indicate > that the dictionary in that message is to replace any prior-observed ones in > subsequent record batches. > For example, we might have dictionary batches in a stream: > {code} > id: 0 isDelta: false values: [a, b, c] > id: 0 isDelta: true values [d] > id 0 isDelta: false values [c, a, b] > {code} > Such data could easily be produced by a stream producer that is creating > dictionaries in different execution threads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
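The delta-vs-replacement semantics proposed in ARROW-5767 can be sketched as a fold over the dictionary batches seen for one dictionary id: a delta batch appends to the current dictionary, a non-delta batch replaces it. Illustrative Python, not the IPC implementation (`resolve_dictionary` is a hypothetical helper name):

```python
def resolve_dictionary(batches):
    """Fold a sequence of (is_delta, values) dictionary batches for one id.

    is_delta=True appends to the current dictionary; is_delta=False
    replaces it, discarding all prior entries.
    """
    current = []
    for is_delta, values in batches:
        if is_delta:
            current = current + list(values)   # delta: grow the dictionary
        else:
            current = list(values)             # replacement: start over
    return current

# The example stream from the issue description:
stream = [(False, ["a", "b", "c"]),   # initial dictionary
          (True,  ["d"]),             # delta: now a, b, c, d
          (False, ["c", "a", "b"])]   # replacement: discard and restart
print(resolve_dictionary(stream))     # ['c', 'a', 'b']
```

Record batches after the replacement would thus decode index 0 as "c", whereas before it they decoded it as "a".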
[jira] [Commented] (ARROW-7121) [C++][CI][Windows] Enable more features on the windows GHA build
[ https://issues.apache.org/jira/browse/ARROW-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011036#comment-17011036 ] Neal Richardson commented on ARROW-7121: Seems like this is more about being able to leave Appveyor, right? FWIW the R test job on Appveyor has parquet ON. > [C++][CI][Windows] Enable more features on the windows GHA build > > > Key: ARROW-7121 > URL: https://issues.apache.org/jira/browse/ARROW-7121 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.16.0 > > > Like `ARROW_GANDIVA: ON`, `ARROW_FLIGHT: ON`, `ARROW_PARQUET: ON` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7520) [R] Writing many batches causes a crash
[ https://issues.apache.org/jira/browse/ARROW-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian updated ARROW-7520:
---

Description:
Hi, when creating north of 200-300 batches, writing to the arrow file crashes R - it doesn't even show an error message. RStudio just aborts. I have the feeling that maybe each batch becomes a stream and R has issues with the connections, but that's a total guess. Any help would be appreciated.

## Here is the function. When running it with 3000 it crashes immediately. Before that I ran it with 100, then increased it slowly, and then it randomly crashed again.

## Now I received this error message after writing 30 batches:
Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) : Invalid: Invalid operation on closed file
Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) : Invalid: Invalid operation on closed file

write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)

write_arrow_custom <- function(df,targetarrow,nrbatches) {
  ct <- nrbatches
  idxs <- c(0:ct)/ct*nrow(df)
  idxs <- round(idxs,0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>% mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>% filter(!is.na(colto)) %>% mutate(R=row_number())
  stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>% sum()==nrow(df))
  table_df <- Table$create(name=rownames(df[1,]),df[1,])
  writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
  df_nav %>% dlply(c('R'),function(df_nav) {
    catl(glue('\{df_nav$colfrom[1]}:\{df_nav$colto[1]} / \{df_nav$R[1]}...'))
    tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
    writer$write_batch(record_batch(name = rownames(tmp), tmp))
    NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}

was:
Hi, when creating north of 200-300 batches, writing to the arrow file crashes R - it doesn't even show an error message. RStudio just aborts. I have the feeling that maybe each batch becomes a stream and R has issues with the connections, but that's a total guess. Any help would be appreciated.

## Here is the function. When running it with 3000 it crashes immediately. Before that I ran it with 100, then increased it slowly, and then it randomly crashed again.

write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)

write_arrow_custom <- function(df,targetarrow,nrbatches) {
  ct <- nrbatches
  idxs <- c(0:ct)/ct*nrow(df)
  idxs <- round(idxs,0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>% mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>% filter(!is.na(colto)) %>% mutate(R=row_number())
  stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>% sum()==nrow(df))
  table_df <- Table$create(name=rownames(df[1,]),df[1,])
  writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
  df_nav %>% dlply(c('R'),function(df_nav){
    catl(glue('\{df_nav$colfrom[1]}:\{df_nav$colto[1]} / \{df_nav$R[1]}...'))
    tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
    writer$write_batch(record_batch(name = rownames(tmp), tmp))
    NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}

> [R] Writing many batches causes a crash
> ---
>
> Key: ARROW-7520
> URL: https://issues.apache.org/jira/browse/ARROW-7520
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.1
> Environment:
> - Session info ---
> setting value
> version R version 3.6.1 (2019-07-05)
> os Windows 10 x64
> system x86_64, mingw32
> ui RStudio
> language (EN)
> collate English_United States.1252
> ctype English_United States.1252
> tz America/New_York
> date 2020-01-08
>
> - Packages ---
> ! package * version date lib source
> acepack 1.4.1 2016-10-29 [1] CRAN (R 3.6.1)
> arrow * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.2)
> askpass 1.1 2019-01-13 [1] CRAN (R 3.6.1)
> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
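The `write_arrow_custom` function above splits `nrow(df)` rows into `nrbatches` contiguous ranges by rounding evenly spaced index edges. That range computation can be sketched in Python to make the logic easier to inspect (`batch_ranges` is a hypothetical helper name, not part of the arrow package):

```python
def batch_ranges(nrows: int, nbatches: int):
    """1-based inclusive (from, to) row ranges, mirroring the idxs logic above."""
    # round(i * nrows / nbatches) for i = 0..nbatches gives the range edges;
    # the last edge is pinned to nrows, and empty ranges are dropped.
    edges = [round(i * nrows / nbatches) for i in range(nbatches + 1)]
    edges[-1] = nrows
    return [(lo + 1, hi) for lo, hi in zip(edges, edges[1:]) if hi >= lo + 1]

print(batch_ranges(10, 3))   # [(1, 3), (4, 7), (8, 10)]
```

The ranges partition the rows exactly, which is what the `stopifnot` check in the R function asserts before writing any batch.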
[jira] [Updated] (ARROW-7522) [C++][Plasma] Broken Record Batch returned from a function call
[ https://issues.apache.org/jira/browse/ARROW-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7522: --- Summary: [C++][Plasma] Broken Record Batch returned from a function call (was: Broken Record Batch returned from a function call)
> [C++][Plasma] Broken Record Batch returned from a function call
> ---
>
> Key: ARROW-7522
> URL: https://issues.apache.org/jira/browse/ARROW-7522
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, C++ - Plasma
> Affects Versions: 0.15.1
> Environment: macOS
> Reporter: Chengxin Ma
> Priority: Minor
>
> Scenario: retrieving a Record Batch from Plasma with a known Object ID.
> The following code snippet works well:
> {code:java}
> int main(int argc, char **argv)
> {
>   plasma::ObjectID object_id = plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
>   // Start up and connect a Plasma client.
>   plasma::PlasmaClient client;
>   ARROW_CHECK_OK(client.Connect("/tmp/store"));
>   plasma::ObjectBuffer object_buffer;
>   ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
>   // Retrieve object data.
>   auto buffer = object_buffer.data;
>   arrow::io::BufferReader buffer_reader(buffer);
>   std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
>   ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader, &record_batch_stream_reader));
>   std::shared_ptr<arrow::RecordBatch> record_batch;
>   arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);
>   std::cout << "record_batch->column_name(0): " << record_batch->column_name(0) << std::endl;
>   std::cout << "record_batch->num_columns(): " << record_batch->num_columns() << std::endl;
>   std::cout << "record_batch->num_rows(): " << record_batch->num_rows() << std::endl;
>   std::cout << "record_batch->column(0)->length(): " << record_batch->column(0)->length() << std::endl;
>   std::cout << "record_batch->column(0)->ToString(): " << record_batch->column(0)->ToString() << std::endl;
> }
> {code}
> {{record_batch->column(0)->ToString()}} would incur a segmentation fault if retrieving the Record Batch is wrapped in a function:
> {code:java}
> std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(plasma::ObjectID object_id)
> {
>   // Start up and connect a Plasma client.
>   plasma::PlasmaClient client;
>   ARROW_CHECK_OK(client.Connect("/tmp/store"));
>   plasma::ObjectBuffer object_buffer;
>   ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
>   // Retrieve object data.
>   auto buffer = object_buffer.data;
>   arrow::io::BufferReader buffer_reader(buffer);
>   std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
>   ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader, &record_batch_stream_reader));
>   std::shared_ptr<arrow::RecordBatch> record_batch;
>   arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);
>   // Disconnect the client.
>   ARROW_CHECK_OK(client.Disconnect());
>   return record_batch;
> }
> int main(int argc, char **argv)
> {
>   plasma::ObjectID object_id = plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
>   std::shared_ptr<arrow::RecordBatch> record_batch = GetRecordBatchFromPlasma(object_id);
>   std::cout << "record_batch->column_name(0): " << record_batch->column_name(0) << std::endl;
>   std::cout << "record_batch->num_columns(): " << record_batch->num_columns() << std::endl;
>   std::cout << "record_batch->num_rows(): " << record_batch->num_rows() << std::endl;
>   std::cout << "record_batch->column(0)->length(): " << record_batch->column(0)->length() << std::endl;
>   std::cout << "record_batch->column(0)->ToString(): " << record_batch->column(0)->ToString() << std::endl;
> }
> {code}
> The meta info of the Record Batch, such as the number of columns and rows, is still available, but I can't see the content of the columns.
> {{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the Record Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But why can I still see the meta info of this Record Batch?
> What is the proper way to get the Record Batch if we insist on using a function? -- This message was sent by Atlassian Jira (v8.3.4#803005)
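A plausible, hedged reading of the question above (the exact behavior depends on Plasma's buffer ownership in 0.15.1): {{ReadNext}} copies scalar metadata such as the schema and {{num_rows}} into the {{RecordBatch}} object itself, while the column buffers remain zero-copy views into the Plasma object's shared memory. {{client.Disconnect()}} releases that memory, so the copied metadata survives but the data pointers dangle, which is consistent with the {{EXC_BAD_ACCESS}} stop reason. Python's buffer protocol enforces exactly the lifetime rule the wrapped function violates, which makes for a compact stdlib-only analogy (no Arrow or Plasma involved):

```python
import mmap

# Analogy: a memoryview is a zero-copy "view" over memory owned by an mmap
# object, much as the RecordBatch's column data are zero-copy views over the
# Plasma object buffer. Plain metadata (here: the length) is copied out and
# survives; the data pointer must not outlive the owner.
m = mmap.mmap(-1, 16)     # anonymous shared-memory region (the "Plasma buffer")
view = memoryview(m)      # zero-copy view (the "RecordBatch columns")
num_bytes = len(view)     # metadata copied out (like num_rows/num_columns)

# Python refuses to release the memory while a view is live; C++ offers no
# such guard, so Disconnect() frees the memory and the views dangle.
try:
    m.close()
except BufferError as err:
    print("cannot close while a view is live:", err)

view.release()            # drop the view first...
m.close()                 # ...then the owner can release the memory
print(num_bytes)          # the copied metadata is still fine: 16
```

Translated back to C++: keep the owner alive at least as long as the batch, e.g. return the {{PlasmaClient}} (and buffer) together with the {{RecordBatch}}, or copy the batch's data, instead of calling {{Disconnect()}} inside {{GetRecordBatchFromPlasma}}. Treat this as a sketch of the lifetime rule rather than a definitive diagnosis.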
[jira] [Updated] (ARROW-7507) [Rust] Bump Thrift version to 0.13 in parquet-format and parquet
[ https://issues.apache.org/jira/browse/ARROW-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7507: --- Summary: [Rust] Bump Thrift version to 0.13 in parquet-format and parquet (was: Bump Thrift version to 0.13 in parquet-format and parquet) > [Rust] Bump Thrift version to 0.13 in parquet-format and parquet > > > Key: ARROW-7507 > URL: https://issues.apache.org/jira/browse/ARROW-7507 > Project: Apache Arrow > Issue Type: Bug > Components: Rust > Affects Versions: 0.15.1 > Reporter: Mahmut Bulut > Priority: Major > Labels: parquet > > *Problem Description* > Currently, the `byteorder` crate changes are not incorporated in either the `parquet-format` or the `parquet` crate. Both should be updated consistently to Thrift 0.13, in reverse dependency order (first parquet-format, then parquet), so that dependents still on older versions can upgrade. > This causes version clashes with other crates that follow upstream. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype
[ https://issues.apache.org/jira/browse/ARROW-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7521: -- Labels: pull-request-available (was: ) > [Rust] Remove tuple on FixedSizeList datatype > - > > Key: ARROW-7521 > URL: https://issues.apache.org/jira/browse/ARROW-7521 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust > Reporter: Neville Dipale > Assignee: Neville Dipale > Priority: Minor > Labels: pull-request-available > > The FixedSizeList datatype takes a tuple of Box<DataType> and length, but this could be simplified to take the two values without a tuple. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7507) Bump Thrift version to 0.13 in parquet-format and parquet
[ https://issues.apache.org/jira/browse/ARROW-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-7507: -- Affects Version/s: (was: 0.16.0) 0.15.1 > Bump Thrift version to 0.13 in parquet-format and parquet > - > > Key: ARROW-7507 > URL: https://issues.apache.org/jira/browse/ARROW-7507 > Project: Apache Arrow > Issue Type: Bug > Components: Rust > Affects Versions: 0.15.1 > Reporter: Mahmut Bulut > Priority: Major > Labels: parquet > > *Problem Description* > Currently, the `byteorder` crate changes are not incorporated in either the `parquet-format` or the `parquet` crate. Both should be updated consistently to Thrift 0.13, in reverse dependency order (first parquet-format, then parquet), so that dependents still on older versions can upgrade. > This causes version clashes with other crates that follow upstream. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype
[ https://issues.apache.org/jira/browse/ARROW-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-7521: - Assignee: Neville Dipale > [Rust] Remove tuple on FixedSizeList datatype > - > > Key: ARROW-7521 > URL: https://issues.apache.org/jira/browse/ARROW-7521 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust > Reporter: Neville Dipale > Assignee: Neville Dipale > Priority: Minor > > The FixedSizeList datatype takes a tuple of Box<DataType> and length, but this could be simplified to take the two values without a tuple. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype
[ https://issues.apache.org/jira/browse/ARROW-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-7521: -- Priority: Minor (was: Major) > [Rust] Remove tuple on FixedSizeList datatype > - > > Key: ARROW-7521 > URL: https://issues.apache.org/jira/browse/ARROW-7521 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust > Reporter: Neville Dipale > Priority: Minor > > The FixedSizeList datatype takes a tuple of Box<DataType> and length, but this could be simplified to take the two values without a tuple. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype
Neville Dipale created ARROW-7521: - Summary: [Rust] Remove tuple on FixedSizeList datatype Key: ARROW-7521 URL: https://issues.apache.org/jira/browse/ARROW-7521 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Neville Dipale The FixedSizeList datatype takes a tuple of Box<DataType> and length, but this could be simplified to take the two values without a tuple. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7516) [C#] .NET Benchmarks are broken
[ https://issues.apache.org/jira/browse/ARROW-7516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7516: -- Labels: pull-request-available (was: ) > [C#] .NET Benchmarks are broken > --- > > Key: ARROW-7516 > URL: https://issues.apache.org/jira/browse/ARROW-7516 > Project: Apache Arrow > Issue Type: Bug > Components: C# > Reporter: Eric Erhardt > Priority: Major > Labels: pull-request-available > Original Estimate: 2h > Remaining Estimate: 2h > > See [https://github.com/apache/arrow/pull/6030#issuecomment-571877721] > > It looks like the issue is that in the Benchmarks, `Length` is specified as `1_000_000`, and there have only been ~730,000 days since `DateTime.Min`, so this line fails: > https://github.com/apache/arrow/blob/4634c89fc77f70fb5b5d035d6172263a4604da82/csharp/test/Apache.Arrow.Tests/TestData.cs#L130 > A simple fix would be to cap what we pass into `AddDays` to some number like `100_000`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
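The ~730,000 figure is easy to sanity-check: Python's proleptic calendar uses the same 0001-01-01 epoch as .NET's `DateTime.MinValue`, so counting days to the filing date shows why a range of 1,000,000 distinct day offsets cannot fit. This is a sketch of the bound, not the C# benchmark code; the `100_000` cap mirrors the issue's suggestion:

```python
from datetime import date

# .NET's DateTime.MinValue is 0001-01-01, the same epoch as Python's date.min.
# Count the days from that epoch to the date this issue was filed.
days_since_min = (date(2020, 1, 8) - date.min).days
print(days_since_min)  # 737431 -- well under 1_000_000, so offsets overflow

# Capping the offset keeps every AddDays call inside the representable range.
cap = 100_000
offsets = [i % cap for i in range(1_000_000)]
assert max(offsets) < days_since_min
```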
[jira] [Comment Edited] (ARROW-7498) [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme
[ https://issues.apache.org/jira/browse/ARROW-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010947#comment-17010947 ] Francois Saint-Jacques edited comment on ARROW-7498 at 1/8/20 7:09 PM: --- For SchemaPartitioner (each directory is a partition value), I have * StackPartitioner * LevelPartitioner * HierarchyPartitioner * OrderedPartitioner * DirectoryPartitioner I'm tempted by DirectoryPartitioner. was (Author: fsaintjacques): For SchemaPartitioner (each directory is a partition value), I have * StackPartitioner * LevelPartitioner * HierarchyPartitioner * OrderedPartitioner > [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme > - > > Key: ARROW-7498 > URL: https://issues.apache.org/jira/browse/ARROW-7498 > Project: Apache Arrow > Issue Type: Wish > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > DataFragment -> Fragment > DataSource -> Source > PartitionScheme -> PartitionSchema > *Discovery -> *Manifest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7498) [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme
[ https://issues.apache.org/jira/browse/ARROW-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010947#comment-17010947 ] Francois Saint-Jacques commented on ARROW-7498: --- For SchemaPartitioner (each directory is a partition value), I have * StackPartitioner * LevelPartitioner * HierarchyPartitioner * OrderedPartitioner > [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme > - > > Key: ARROW-7498 > URL: https://issues.apache.org/jira/browse/ARROW-7498 > Project: Apache Arrow > Issue Type: Wish > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > DataFragment -> Fragment > DataSource -> Source > PartitionScheme -> PartitionSchema > *Discovery -> *Manifest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7520) [R] Writing many batches causes a crash
[ https://issues.apache.org/jira/browse/ARROW-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7520: Fix Version/s: (was: 0.15.1) > [R] Writing many batches causes a crash > --- > > Key: ARROW-7520 > URL: https://issues.apache.org/jira/browse/ARROW-7520 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 > Environment: - Session info > --- > setting value > version R version 3.6.1 (2019-07-05) > os Windows 10 x64 > system x86_64, mingw32 > ui RStudio > language (EN) > collate English_United States.1252 > ctype English_United States.1252 > tz America/New_York > date 2020-01-08 > > - Packages > --- > ! package * version date lib source > > acepack 1.4.1 2016-10-29 [1] CRAN (R 3.6.1) > > arrow * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.2) > > askpass 1.1 2019-01-13 [1] CRAN (R 3.6.1) > > assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1) > > backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1) > > base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.6.0) > > bit 1.1-14 2018-05-29 [1] CRAN (R 3.6.0) > > bit64 0.9-7 2017-05-08 [1] CRAN (R 3.6.0) > > blob 1.2.0 2019-07-09 [1] CRAN (R 3.6.1) > > callr 3.3.1 2019-07-18 [1] CRAN (R 3.6.1) > > cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.1) > > checkmate 1.9.4 2019-07-04 [1] CRAN (R 3.6.1) > > cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.1) > > cluster 2.1.0 2019-06-19 [2] CRAN (R 3.6.1) > > codetools 0.2-16 2018-12-24 [2] CRAN (R 3.6.1) > > colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1) > > commonmark 1.7 2018-12-01 [1] CRAN (R 3.6.1) > > crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1) > > credentials 1.1 2019-03-12 [1] CRAN (R 3.6.2) > > curl * 4.2 2019-09-24 [1] CRAN (R 3.6.1) > > data.table 1.12.2 2019-04-07 [1] CRAN (R 3.6.1) > > DBI * 1.0.0 2018-05-02 [1] CRAN (R 3.6.1) > > desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1) > > devtools * 2.2.0 2019-09-07 [1] CRAN (R 3.6.1) > > digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1) > > dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.1) > > DT 0.9 
2019-09-17 [1] CRAN (R 3.6.1) > > ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1) > > evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1) > > foreign 0.8-71 2018-07-20 [2] CRAN (R 3.6.1) > > Formula * 1.2-3 2018-05-03 [1] CRAN (R 3.6.0) > > fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1) > > fst * 0.9.0 2019-04-09 [1] CRAN (R 3.6.1) > > future * 1.15.0-9000 2019-11-19 [1] Github > (HenrikBengtsson/future@bc241c7) > ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.1) > > globals 0.12.4 2018-10-11 [1] CRAN (R 3.6.0) > > glue * 1.3.1 2019-03-12 [1] CRAN (R 3.6.1) > > gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.1) > > gt * 0.1.0 2019-11-27 [1] Github (rstudio/gt@284bbe5) > > gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1) > > Hmisc
[jira] [Updated] (ARROW-7520) [R] Writing many batches causes a crash
[ https://issues.apache.org/jira/browse/ARROW-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7520: Summary: [R] Writing many batches causes a crash (was: Arrow / R - too many batches causes a crash) > [R] Writing many batches causes a crash > --- > > Key: ARROW-7520 > URL: https://issues.apache.org/jira/browse/ARROW-7520 > Project: Apache Arrow > Issue Type: Bug > Components: R > Affects Versions: 0.15.1 > Environment: (session info as in the issue above)
[jira] [Created] (ARROW-7520) Arrow / R - too many batches causes a crash
Christian created ARROW-7520: Summary: Arrow / R - too many batches causes a crash Key: ARROW-7520 URL: https://issues.apache.org/jira/browse/ARROW-7520 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 0.15.1 Environment: (session info as in the issue above)
[jira] [Updated] (ARROW-7519) [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1
[ https://issues.apache.org/jira/browse/ARROW-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7519: Description: We should make sure our package builds work with this enabled (was: We should ) > [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1 > - > > Key: ARROW-7519 > URL: https://issues.apache.org/jira/browse/ARROW-7519 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Critical > Fix For: 0.16.0 > > > We should make sure our package builds work with this enabled -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7519) [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1
Wes McKinney created ARROW-7519: --- Summary: [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1 Key: ARROW-7519 URL: https://issues.apache.org/jira/browse/ARROW-7519 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.16.0 We should -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
Wes McKinney created ARROW-7518: --- Summary: [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages Key: ARROW-7518 URL: https://issues.apache.org/jira/browse/ARROW-7518 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 0.16.0 This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
[ https://issues.apache.org/jira/browse/ARROW-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010918#comment-17010918 ] Wes McKinney commented on ARROW-7518: - I assume this is being tested in GHA? > [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages > --- > > Key: ARROW-7518 > URL: https://issues.apache.org/jira/browse/ARROW-7518 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.16.0 > > > This new module is not enabled in the package builds -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7387) [C#] Support ListType Serialization
[ https://issues.apache.org/jira/browse/ARROW-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Erhardt resolved ARROW-7387. - Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6030 [https://github.com/apache/arrow/pull/6030] > [C#] Support ListType Serialization > --- > > Key: ARROW-7387 > URL: https://issues.apache.org/jira/browse/ARROW-7387 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Takashi Hashida >Assignee: Takashi Hashida >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Support ListType serialization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7477) [FlightRPC][Java] Flight gRPC service is missing reflection info
[ https://issues.apache.org/jira/browse/ARROW-7477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7477. - Resolution: Fixed Issue resolved by pull request 6114 [https://github.com/apache/arrow/pull/6114] > [FlightRPC][Java] Flight gRPC service is missing reflection info > > > Key: ARROW-7477 > URL: https://issues.apache.org/jira/browse/ARROW-7477 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Java >Affects Versions: 0.14.1 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > When setting up the gRPC service, we mangle the gRPC [service > descriptor|https://github.com/apache/arrow/blob/master/java/flight/src/main/java/org/apache/arrow/flight/FlightBindingService.java], > removing reflection information. This means things like gRPC reflection > don't work, which is necessary for debugging/development tools like > [grpcurl|https://github.com/fullstorydev/grpcurl/]. Reflection information is > also useful to do things like authorization/access control based on RPC > method. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7517) [C++] Builder does not honour dictionary type provided during initialization
Wamsi Viswanath created ARROW-7517: -- Summary: [C++] Builder does not honour dictionary type provided during initialization Key: ARROW-7517 URL: https://issues.apache.org/jira/browse/ARROW-7517 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.15.0 Reporter: Wamsi Viswanath Below is an example for reproducing the issue: [https://gist.github.com/wamsiv/d48ec37a9a9b5f4d484de6ff86a3870d] The builder automatically optimizes the dictionary type depending upon the number of unique values provided, which results in a schema mismatch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
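For readers without access to the gist, the reported mismatch is consistent with an adaptive builder picking the narrowest index type that fits the observed number of unique values, e.g. producing `dictionary<indices=int8, ...>` even though the caller declared `int32` indices at initialization. A minimal sketch of that selection logic (my own illustration, not Arrow's actual builder code):

```python
def smallest_index_type(num_unique: int) -> str:
    # Pick the narrowest signed integer able to index num_unique dictionary
    # entries, mimicking an adaptive dictionary builder.
    for bits in (8, 16, 32, 64):
        if num_unique <= 2 ** (bits - 1):
            return f"int{bits}"
    raise ValueError("dictionary too large")

declared = "int32"               # index type the user put in the schema
actual = smallest_index_type(3)  # 3 unique values -> "int8"
print(declared, actual, declared == actual)  # int32 int8 False
```

If the builder honours the declared type instead of adapting, `declared` and `actual` agree and the schema mismatch disappears, which appears to be what the issue is asking for.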
[jira] [Commented] (ARROW-7513) [JS] Arrow Tutorial: Common data types
[ https://issues.apache.org/jira/browse/ARROW-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010870#comment-17010870 ] Leo Meyerovich commented on ARROW-7513: --- Agreed, I'll see about forking this into Part I & Part II, where Part I is the high-level API, and move the Data stuff to Part II. I'm stumped on `structs` and `nested structs` though, any recs/examples? > [JS] Arrow Tutorial: Common data types > -- > > Key: ARROW-7513 > URL: https://issues.apache.org/jira/browse/ARROW-7513 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript > Reporter: Leo Meyerovich > Assignee: Leo Meyerovich > Priority: Minor > > The JS client lacks basic introductory material around creating the common basic data types such as turning JS arrays into ints, dicts, etc. There is no equivalent of Python's [https://arrow.apache.org/docs/python/data.html] . This has made it difficult for me to use, and I bet for others. > > As with prev tutorials, I started sketching on [https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit] . When we're happy, it can make sense to export it as an HTML page or something to the repo, or just link from the main readme. > I believe the target topics worth covering are: > * Common user data types: Ints, Dicts, Struct, Time > * Common column types: Data, Vector, Column > * Going from individual & arrays & buffers of JS values to Arrow-wrapped forms, and basic inspection of the result > Not worth going into here is Tables vs. RecordBatches, which is the other tutorial. > > 1. Ideas of what to add/edit/remove? > 2. And anyone up for helping with discussion of Data vs. Vector, and ingest of Time & Struct? > 3. ... Should we be encouraging Struct or Map? I saw some PRs changing stuff here. > > cc [~wesm] [~bhulette] [~paul.e.taylor] > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7516) [C#] .NET Benchmarks are broken
Eric Erhardt created ARROW-7516: --- Summary: [C#] .NET Benchmarks are broken Key: ARROW-7516 URL: https://issues.apache.org/jira/browse/ARROW-7516 Project: Apache Arrow Issue Type: Bug Components: C# Reporter: Eric Erhardt See [https://github.com/apache/arrow/pull/6030#issuecomment-571877721] It looks like the issue is that in the Benchmarks, `Length` is specified as `1_000_000`, and there have only been ~730,000 days since `DateTime.Min`, so this line fails: https://github.com/apache/arrow/blob/4634c89fc77f70fb5b5d035d6172263a4604da82/csharp/test/Apache.Arrow.Tests/TestData.cs#L130 A simple fix would be to cap what we pass into `AddDays` to some number like `100_000`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-7128) [CI] Fedora cron jobs are failing because of wrong fedora version
[ https://issues.apache.org/jira/browse/ARROW-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reopened ARROW-7128: > [CI] Fedora cron jobs are failing because of wrong fedora version > - > > Key: ARROW-7128 > URL: https://issues.apache.org/jira/browse/ARROW-7128 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The requested fedora version is 10 (Debian) instead of 29: > https://github.com/apache/arrow/runs/299223601 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7128) [CI] Fedora cron jobs are failing because of wrong fedora version
[ https://issues.apache.org/jira/browse/ARROW-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7128. Resolution: Fixed > [CI] Fedora cron jobs are failing because of wrong fedora version > - > > Key: ARROW-7128 > URL: https://issues.apache.org/jira/browse/ARROW-7128 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The requested fedora version is 10 (Debian) instead of 29: > https://github.com/apache/arrow/runs/299223601 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7128) [CI] Fedora cron jobs are failing because of wrong fedora version
[ https://issues.apache.org/jira/browse/ARROW-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010852#comment-17010852 ] Neal Richardson commented on ARROW-7128: Yes, it looks like it: https://github.com/apache/arrow/commit/4634c89fc77f70fb5b5d035d6172263a4604da82/checks?check_suite_id=389869709 > [CI] Fedora cron jobs are failing because of wrong fedora version > - > > Key: ARROW-7128 > URL: https://issues.apache.org/jira/browse/ARROW-7128 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The requested fedora version is 10 (Debian) instead of 29: > https://github.com/apache/arrow/runs/299223601 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop
[ https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7376: - Assignee: Francois Saint-Jacques > [C++] parquet NaN/null double statistics can result in endless loop > --- > > Key: ARROW-7376 > URL: https://issues.apache.org/jira/browse/ARROW-7376 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Pierre Belzile >Assignee: Francois Saint-Jacques >Priority: Major > Labels: parquet > Fix For: 0.16.0 > > > There is a bug in the doubles column statistics computation when writing to > parquet an array containing only NaNs and nulls. It loops endlessly if the last > cell of a write group is a null. The line in error is > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633] > which checks for NaN but not for null. The code then falls through and loops > endlessly, so the program appears frozen. > This code snippet reproduces the issue:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
>   std::shared_ptr<arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);
>   /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(*table->schema(), arrow::default_memory_pool(), os,
>                                    writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
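The failure mode described above is a value scan that knows to step past NaNs but not past nulls, so a trailing null leaves the index stuck forever. A minimal Python sketch of the corrected logic (purely illustrative; the function name and shape are hypothetical, not the actual statistics.cc code):

```python
import math

def first_valid_finite_index(values, valid):
    """Toy model of the statistics scan: advance to the first slot that
    is usable for min/max. Per the bug report, the original code only
    skipped NaNs; without the validity check below, a null slot never
    advanced the index and the loop spun endlessly."""
    i = 0
    while i < len(values):
        if not valid[i]:          # null slot: must also be skipped (the missing check)
            i += 1
            continue
        if math.isnan(values[i]):  # NaN: skipped, as the original code already did
            i += 1
            continue
        return i
    return None  # no valid, finite value in this group
```

With only NaNs and nulls in the group (the reported input), the scan now terminates and reports no usable value instead of hanging.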
[jira] [Commented] (ARROW-6821) [C++][Parquet] Do not require Thrift compiler when building (but still require library)
[ https://issues.apache.org/jira/browse/ARROW-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010830#comment-17010830 ] Neal Richardson commented on ARROW-6821: Yes please. I'm happy to help with this, though I'd need a little direction for what to put where (and then presumably there's a {{THRIFT_CMAKE_ARGS}} to add or remove). > [C++][Parquet] Do not require Thrift compiler when building (but still > require library) > --- > > Key: ARROW-6821 > URL: https://issues.apache.org/jira/browse/ARROW-6821 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.16.0 > > > Building Thrift from source carries extra toolchain dependencies (bison and > flex). If we check in the files produced by compiling parquet.thrift, then > the EP can be simplified to only build the Thrift C++ library and not the > compiler. This also results in a simpler build for third parties -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7204) [C++][Dataset] In expression should not require exact type match
[ https://issues.apache.org/jira/browse/ARROW-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7204: --- Fix Version/s: 0.16.0 > [C++][Dataset] In expression should not require exact type match > > > Key: ARROW-7204 > URL: https://issues.apache.org/jira/browse/ARROW-7204 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Neal Richardson >Assignee: Ben Kietzman >Priority: Major > Fix For: 0.16.0 > > > Similar to ARROW-7047. I encountered this on ARROW-7185 > (https://github.com/apache/arrow/pull/5858/files#diff-1d8a97ca966e8446ef2ae4b7b5a96ed1R125) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
[ https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7093: --- Fix Version/s: 0.16.0 > [R] Support creating ScalarExpressions for more data types > -- > > Key: ARROW-7093 > URL: https://issues.apache.org/jira/browse/ARROW-7093 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Critical > Fix For: 0.16.0 > > > See > https://github.com/apache/arrow/blob/master/r/src/expression.cpp#L93-L107. > ARROW-6340 was limited to integer/double/logical. This will let us make > dataset filter expressions with all those other types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7498) [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme
[ https://issues.apache.org/jira/browse/ARROW-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010783#comment-17010783 ] Ben Kietzman commented on ARROW-7498: - I'd say Partitioner. Partitioning sounds more like it describes the output. > [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme > - > > Key: ARROW-7498 > URL: https://issues.apache.org/jira/browse/ARROW-7498 > Project: Apache Arrow > Issue Type: Wish > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > DataFragment -> Fragment > DataSource -> Source > PartitionScheme -> PartitionSchema > *Discovery -> *Manifest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden
[ https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-3873: - Assignee: Antoine Pitrou > [C++] Build shared libraries consistently with -fvisibility=hidden > -- > > Key: ARROW-3873 > URL: https://issues.apache.org/jira/browse/ARROW-3873 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/2437 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7498) [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme
[ https://issues.apache.org/jira/browse/ARROW-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010773#comment-17010773 ] Francois Saint-Jacques commented on ARROW-7498: --- Partitioning or Partitioner? > [C++][Dataset] Rename DataFragment/DataSource/PartitionScheme > - > > Key: ARROW-7498 > URL: https://issues.apache.org/jira/browse/ARROW-7498 > Project: Apache Arrow > Issue Type: Wish > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > DataFragment -> Fragment > DataSource -> Source > PartitionScheme -> PartitionSchema > *Discovery -> *Manifest -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7513) [JS] Arrow Tutorial: Common data types
[ https://issues.apache.org/jira/browse/ARROW-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010761#comment-17010761 ] Brian Hulette commented on ARROW-7513: -- Thanks for doing this Leo! I just have one suggestion after a brief look this morning. I think Data should be considered a low-level API (and maybe even a private one?), and we should direct users to create Vectors directly with the builders, or with the {{from}} static initializers (which defer to the builders). > [JS] Arrow Tutorial: Common data types > -- > > Key: ARROW-7513 > URL: https://issues.apache.org/jira/browse/ARROW-7513 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Leo Meyerovich >Assignee: Leo Meyerovich >Priority: Minor > > The JS client lacks basic introductory material around creating the common > basic data types such as turning JS arrays into ints, dicts, etc. There is no > equivalent of Python's [https://arrow.apache.org/docs/python/data.html] . > This has made use for myself difficult, and I bet for others. > > As with prev tutorials, I started sketching on > [https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit] > . When we're happy can make sense to export as an html or something to the > repo, or just link from the main readme. > I believe the target topics worth covering are: > * Common user data types: Ints, Dicts, Struct, Time > * Common column types: Data, Vector, Column > * Going from individual & arrays & buffers of JS values to Arrow-wrapped > forms, and basic inspection of the result > Not worth going into here is Tables vs. RecordBatches, which is the other > tutorial. > > 1. Ideas of what to add/edit/remove? > 2. And anyone up for helping with discussion of Data vs. Vector, and ingest > of Time & Struct? > 3. ... Should we be encouraging Struct or Map? I saw some PRs changing stuff > here. 
> > cc [~wesm] [~bhulette] [~paul.e.taylor] > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
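Since the tutorial will cover dictionary columns, the encoding itself can be shown without committing to any API level: a toy Python sketch (purely illustrative; Arrow's builders do this with typed index and dictionary buffers) of what a dictionary builder produces for string input:

```python
def dictionary_encode(values):
    """Toy model of a dictionary builder: returns the dictionary of
    unique values plus one index per input slot (None marks a null),
    which is how Arrow's dictionary type stores the column."""
    dictionary = []   # unique values, in first-seen order
    index_of = {}     # value -> position in dictionary
    indices = []      # per-slot index into dictionary, or None for null
    for v in values:
        if v is None:
            indices.append(None)  # null slot: no dictionary index
            continue
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices
```

For example, `dictionary_encode(["hello", "hello", None, "carrot"])` yields `(["hello", "carrot"], [0, 0, None, 1])`, which is the shape a Vector builder or {{from}} initializer would materialize.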
[jira] [Commented] (ARROW-7413) [Python][Dataset] Add tests for PartitionSchemeDiscovery
[ https://issues.apache.org/jira/browse/ARROW-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010713#comment-17010713 ] Joris Van den Bossche commented on ARROW-7413: -- [~bkietz] I suppose you are not working on this right now? I am running into the python bindings for the partition discovery while updating my open_dataset() PR, so can probably also tackle this. > [Python][Dataset] Add tests for PartitionSchemeDiscovery > > > Key: ARROW-7413 > URL: https://issues.apache.org/jira/browse/ARROW-7413 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset > Fix For: 0.16.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7515) [C++] Rename nonexistent and non_existent to not_found
[ https://issues.apache.org/jira/browse/ARROW-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7515: -- Labels: pull-request-available (was: ) > [C++] Rename nonexistent and non_existent to not_found > -- > > Key: ARROW-7515 > URL: https://issues.apache.org/jira/browse/ARROW-7515 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7512) [C++] Dictionary memo missing elements in id_to_dictionary_ map after deserialization
[ https://issues.apache.org/jira/browse/ARROW-7512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7512: - Summary: [C++] Dictionary memo missing elements in id_to_dictionary_ map after deserialization (was: Dictionary memo missing elements in id_to_dictionary_ map after deserialization) > [C++] Dictionary memo missing elements in id_to_dictionary_ map after > deserialization > - > > Key: ARROW-7512 > URL: https://issues.apache.org/jira/browse/ARROW-7512 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 >Reporter: Wamsi Viswanath >Priority: Major > > The `id_to_dictionary_` map is empty after deserializing a schema with the > ReadSchema method. > An example for reproduction: > [https://gist.github.com/wamsiv/77dc1db44b5805828172e6c94d61d2d9] > I see that it is probably being missed here: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L804 > Please let me know if this behavior is expected, and if so, how the client > is expected to obtain the dictionary array values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7515) [C++] Rename nonexistent and non_existent to not_found
Kenta Murata created ARROW-7515: --- Summary: [C++] Rename nonexistent and non_existent to not_found Key: ARROW-7515 URL: https://issues.apache.org/jira/browse/ARROW-7515 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Kenta Murata Assignee: Kenta Murata -- This message was sent by Atlassian Jira (v8.3.4#803005)