[GitHub] [arrow] github-actions[bot] commented on issue #7009: ARROW-8552: [Rust] support iterate parquet row columns
github-actions[bot] commented on issue #7009: URL: https://github.com/apache/arrow/pull/7009#issuecomment-617555901 https://issues.apache.org/jira/browse/ARROW-8552 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp opened a new pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
houqp opened a new pull request #7009: URL: https://github.com/apache/arrow/pull/7009
[GitHub] [arrow] github-actions[bot] commented on issue #7008: ARROW-8551: [CI][Gandiva] Use LLVM 8 in gandiva linux build
github-actions[bot] commented on issue #7008: URL: https://github.com/apache/arrow/pull/7008#issuecomment-617523989 https://issues.apache.org/jira/browse/ARROW-8551
[GitHub] [arrow] pprudhvi opened a new pull request #7008: ARROW-8551: [CI][Gandiva] Use LLVM 8 to build gandiva linux jar
pprudhvi opened a new pull request #7008: URL: https://github.com/apache/arrow/pull/7008
[GitHub] [arrow] pprudhvi commented on issue #6990: ARROW-8528 : [CI][NIGHTLY:gandiva-jar-osx] fix gandiva osx build
pprudhvi commented on issue #6990: URL: https://github.com/apache/arrow/pull/6990#issuecomment-617519736 Resolved with https://github.com/Homebrew/homebrew-core/pull/53445/files. Closing this.
[GitHub] [arrow] cyb70289 commented on issue #6954: ARROW-8440: [C++] Refine SIMD header files
cyb70289 commented on issue #6954: URL: https://github.com/apache/arrow/pull/6954#issuecomment-617516947 > Maybe it was just a thought I had in my head but never expressed. Opened https://issues.apache.org/jira/browse/ARROW-8531. Updated this patch to remove ARROW_USE_SIMD.
[GitHub] [arrow] wesm commented on issue #6578: ARROW-7371: WIP: [GLib] Add GLib binding of Dataset
wesm commented on issue #6578: URL: https://github.com/apache/arrow/pull/6578#issuecomment-617515815 I haven't looked at the details of this binding too much, but I wanted to let you know that I'm taking a closer look at the way that filter expressions work in the datasets API in the context of being able to support more general purpose expression evaluation -- beyond the scope of just the datasets API -- i.e. with many more functions. In particular, I am concerned about having significant bindings for the `Expression` subclasses until we feel confident in the C++ API that we have an array-expression API that can accommodate the expanded scope of general purpose query processing. In general, the expressions here are not specific to datasets -- we should be working toward an expression API (closely tied to a kernel/function catalog) that can also be used in projections, hash aggregations, join predicates, and other query processing uses. So until then, I would recommend that you make minimal bindings of the factory functions needed to be able to form filters in the datasets API and avoid wrapping the expression subclasses if you can. This will save you work now and potentially spare us painful refactoring or API breaks later.
[GitHub] [arrow] github-actions[bot] commented on issue #7007: ARROW-8537: [C++] Revert Optimizing BitmapReader
github-actions[bot] commented on issue #7007: URL: https://github.com/apache/arrow/pull/7007#issuecomment-617494295 https://issues.apache.org/jira/browse/ARROW-8537
[GitHub] [arrow] cyb70289 opened a new pull request #7007: ARROW-8537: [C++] Revert Optimizing BitmapReader
cyb70289 opened a new pull request #7007: URL: https://github.com/apache/arrow/pull/7007 Revert PR https://github.com/apache/arrow/pull/6986 as it introduces a big performance regression in the BitmapAnd unaligned benchmark.
[GitHub] [arrow] github-actions[bot] commented on issue #7006: ARROW-8508 [Rust] FixedSizeListArray improper offset for value
github-actions[bot] commented on issue #7006: URL: https://github.com/apache/arrow/pull/7006#issuecomment-617490579 https://issues.apache.org/jira/browse/ARROW-8508
[GitHub] [arrow] markhildreth opened a new pull request #7006: ARROW-8508 [Rust] FixedSizeListArray improper offset for value
markhildreth opened a new pull request #7006: URL: https://github.com/apache/arrow/pull/7006 Potentially fixes ARROW-8508. Fixed-size list arrays sourced with a non-zero offset into their child data were respecting this offset when calculating value offsets in the `value_offset` method, but not in the `value` method. This would cause nested fixed-size list arrays that should have looked like this: ``` [ [ [0, 1] ], [ [2, 3], [4, 5] ] ] ``` ...to behave like this when looking directly at the values: ``` [ [ [0, 1] ], [ [0, 1], [2, 3] ] ] ``` This is different from how ListArray does things (it respects the offset in both methods). This PR makes the change. Additionally, it adds a test case for this (failing before the fix), as well as passing test cases for similar situations, including on ListArray.
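The offset arithmetic behind this fix can be sketched in plain Rust (illustrative types only, not the arrow crate's actual `FixedSizeListArray` API): with an array offset `o` and a fixed list size `k`, the values of logical list `i` start at child index `(o + i) * k`, and both accessors must apply the same shift.

```rust
// Illustrative model of a fixed-size list over a flat child buffer,
// with a non-zero array offset (NOT the arrow crate's actual API).
struct FixedSizeList {
    child: Vec<i32>,  // flat child values
    list_size: usize, // fixed length of each list
    offset: usize,    // number of leading lists sliced off
}

impl FixedSizeList {
    // Start of the i-th list's values; includes the array offset,
    // as `value_offset` already did before the fix.
    fn value_offset(&self, i: usize) -> usize {
        (self.offset + i) * self.list_size
    }

    // `value` must apply the same shift; indexing from `i * list_size`
    // alone reproduces the reported off-by-`offset` behavior.
    fn value(&self, i: usize) -> &[i32] {
        let start = self.value_offset(i);
        &self.child[start..start + self.list_size]
    }
}
```

With `child = [0, 1, 2, 3, 4, 5]`, `list_size = 2`, and `offset = 1`, `value(0)` yields `[2, 3]`; dropping the `offset` term from `value` alone would yield `[0, 1]`, which is exactly the mismatch the PR describes.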
[GitHub] [arrow] nealrichardson commented on issue #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks
nealrichardson commented on issue #7005: URL: https://github.com/apache/arrow/pull/7005#issuecomment-617477677 I believe the failure on the JIRA link job might be expected: it's possible that the pull-request token is not sufficiently authorized to run it. @kou, does that sound right? I just merged this into my fork's master branch and it did skip the cron builds (e.g. https://github.com/nealrichardson/arrow/actions/runs/84286607), so I think this is doing what it's intended to do.
[GitHub] [arrow] github-actions[bot] commented on issue #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks
github-actions[bot] commented on issue #7005: URL: https://github.com/apache/arrow/pull/7005#issuecomment-617456224 https://issues.apache.org/jira/browse/ARROW-8550
[GitHub] [arrow] nealrichardson opened a new pull request #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks
nealrichardson opened a new pull request #7005: URL: https://github.com/apache/arrow/pull/7005
[GitHub] [arrow] github-actions[bot] commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups
github-actions[bot] commented on issue #6995: URL: https://github.com/apache/arrow/pull/6995#issuecomment-617445300 Revision: e7dbd9c977b765e618a40e997039be773c9f16bf Submitted crossbow builds: [ursa-labs/crossbow @ actions-161](https://github.com/ursa-labs/crossbow/branches/all?query=actions-161)

|Task|Status|
|----|------|
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-161-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|test-conda-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-conda-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-conda-r-3.6)|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-161-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-161-github-test-r-linux-as-cran)|
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rhub-ubuntu-gcc-release)|
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rocker-r-base-latest)|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rstudio-r-base-3.6-bionic)|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse15)|
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse42)|
|test-ubuntu-18.04-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-ubuntu-18.04-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-ubuntu-18.04-r-3.6)|
|test-ubuntu-18.04-r-sanitizer|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-ubuntu-18.04-r-sanitizer.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-ubuntu-18.04-r-sanitizer)|
[GitHub] [arrow] nealrichardson commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups
nealrichardson commented on issue #6995: URL: https://github.com/apache/arrow/pull/6995#issuecomment-617444872 @github-actions crossbow submit -g r
[GitHub] [arrow] github-actions[bot] commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups
github-actions[bot] commented on issue #6995: URL: https://github.com/apache/arrow/pull/6995#issuecomment-617442386 https://issues.apache.org/jira/browse/ARROW-8549
[GitHub] [arrow] davidanthoff commented on a change in pull request #7001: Use lowercase ws2_32 everywhere
davidanthoff commented on a change in pull request #7001: URL: https://github.com/apache/arrow/pull/7001#discussion_r412497174 ## File path: cpp/cmake_modules/FindThrift.cmake ## @@ -100,7 +100,7 @@ if(Thrift_FOUND OR THRIFT_FOUND) INTERFACE_INCLUDE_DIRECTORIES "${THRIFT_INCLUDE_DIR}") if(WIN32 AND NOT MSVC) # We don't need this for Visual C++ because Thrift uses -# "#pragma comment(lib, "Ws2_32.lib")" in +# "#pragma comment(lib, "ws2_32.lib")" in Review comment: That makes sense, I adjusted this. I should probably double check that the cross compile still works without this, though. Will report back once I have tried that.
[GitHub] [arrow] kou commented on a change in pull request #7001: Use lowercase ws2_32 everywhere
kou commented on a change in pull request #7001: URL: https://github.com/apache/arrow/pull/7001#discussion_r412492656 ## File path: cpp/cmake_modules/FindThrift.cmake ## @@ -100,7 +100,7 @@ if(Thrift_FOUND OR THRIFT_FOUND) INTERFACE_INCLUDE_DIRECTORIES "${THRIFT_INCLUDE_DIR}") if(WIN32 AND NOT MSVC) # We don't need this for Visual C++ because Thrift uses -# "#pragma comment(lib, "Ws2_32.lib")" in +# "#pragma comment(lib, "ws2_32.lib")" in Review comment: We should not change this because this is the code in Thrift: https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/windows/config.h#L66
[GitHub] [arrow] paddyhoran commented on issue #6306: ARROW-7705: [Rust] Initial sort implementation
paddyhoran commented on issue #6306: URL: https://github.com/apache/arrow/pull/6306#issuecomment-617402123 @nevi-me this needs a rebase now. Once you do that, I'll take a look so we can get this merged.
[GitHub] [arrow] wesm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
wesm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r412474320 ## File path: cpp/src/parquet/file_reader.h ## @@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader { // Returns the file metadata. Only one instance is ever created std::shared_ptr<FileMetaData> metadata() const; + /// Pre-buffer the specified column indices in all row groups. + /// + /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set; + /// otherwise this is a no-op. The reader internally maintains a cache which is + /// overwritten each time this is called. Intended to increase performance on + /// high-latency filesystems (e.g. Amazon S3). + void PreBuffer(const std::vector<int>& row_groups, + const std::vector<int>& column_indices); Review comment: On second look it seems desirable to be able to have control over when this operation is invoked. There might be some other options relating to concurrent IO calls that you might want to pass here
[GitHub] [arrow] wesm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
wesm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r412465641 ## File path: cpp/src/parquet/properties.h ## @@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties { bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; } + bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; } + void enable_buffered_stream() { buffered_stream_enabled_ = true; } void disable_buffered_stream() { buffered_stream_enabled_ = false; } + /// Enable read coalescing. + /// + /// When enabled, readers can optionally call + /// ParquetFileReader.PreBuffer to cache the necessary slices of the + /// file in-memory before deserialization. Arrow readers + /// automatically do this. This is intended to increase performance + /// when reading from high-latency filesystems (e.g. Amazon S3). + /// + /// When this is enabled, it is NOT SAFE to concurrently create + /// RecordBatchReaders from the same file. Review comment: Could it be made safe? ## File path: cpp/src/parquet/file_reader.h ## @@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader { // Returns the file metadata. Only one instance is ever created std::shared_ptr<FileMetaData> metadata() const; + /// Pre-buffer the specified column indices in all row groups. + /// + /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set; + /// otherwise this is a no-op. The reader internally maintains a cache which is + /// overwritten each time this is called. Intended to increase performance on + /// high-latency filesystems (e.g. Amazon S3). + void PreBuffer(const std::vector<int>& row_groups, + const std::vector<int>& column_indices); Review comment: My biggest question is whether this detail should be exposed here or handled automatically under the hood.
I don't have a strong opinion ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -803,6 +809,9 @@ Status FileReaderImpl::ReadRowGroups(const std::vector<int>& row_groups, return Status::Invalid("Invalid column index"); } + // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled + parquet_reader()->PreBuffer(row_groups, indices); Review comment: Per comments below, it seems slightly odd that this would be a no-op sometimes. It seems like it would be better to have ``` if (properties_.prebuffering_enabled()) { parquet_reader()->PreBuffer(...); } ``` ## File path: cpp/src/parquet/properties.h ## @@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties { bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; } + bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; } Review comment: So rather than having this flag here, what if you put a prebuffer option in `ArrowReadProperties`? Then `FileReader::PreBuffer` will always prebuffer if it's called ## File path: cpp/src/parquet/file_reader.h ## @@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader { // Returns the file metadata. Only one instance is ever created std::shared_ptr<FileMetaData> metadata() const; + /// Pre-buffer the specified column indices in all row groups. + /// + /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set; Review comment: Does this detail need to be exposed? The first time `PreBuffer` is called, could the cached reader be created? ## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc ## @@ -141,16 +142,31 @@ class MinioFixture : public benchmark::Fixture { } /// Make an object with Parquet data. /// Appends integer columns to the beginning (to act as indices).
Status MakeParquetObject(const std::string& path, int num_columns, int num_rows) { -std::vector<std::shared_ptr<ChunkedArray>> columns(num_columns); -std::vector<std::shared_ptr<Field>> fields(num_columns); -for (int i = 0; i < num_columns; ++i) { +std::vector<std::shared_ptr<ChunkedArray>> columns; +std::vector<std::shared_ptr<Field>> fields; + +{ + arrow::random::RandomArrayGenerator generator(0); + std::shared_ptr<Array> values = generator.Int64(num_rows, 0, 1e10, 0); + columns.push_back(std::make_shared<ChunkedArray>(values)); + fields.push_back(::arrow::field("timestamp", values->type())); +} +{ + arrow::random::RandomArrayGenerator generator(1); + std::shared_ptr<Array> values = generator.Int32(num_rows, 0, 1e9, 0); + columns.push_back(std::make_shared<ChunkedArray>(values)); + fields.push_back(::arrow::field("val", values->type())); +} + +for (int i = 0; i < num_columns; i++) { arrow::random::RandomArrayGenerator generator(i); std::shared_ptr<Array> values = generator.Float64(num_rows, -1.e10, 1e10, 0); std::stringstream ss; ss << "col" << i; - columns[i] = std::make_shared<ChunkedArray>(values); - fields[i] = ::arrow::field(ss.str(), values->type()); +
[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
paddyhoran commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r412473292 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: I think he means using just a single call to `write_bytes`.
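The suggestion above -- one reserve plus a single bulk write instead of `n` individual writes -- can be sketched with a plain `Vec<u8>` (this is not the arrow `BufferBuilder` API; the function names here are illustrative):

```rust
// Loop version: one up-front reserve, then n small writes.
fn append_n(buf: &mut Vec<u8>, n: usize, v: i32) {
    let bytes = v.to_le_bytes();
    buf.reserve(n * bytes.len()); // single capacity check up front
    for _ in 0..n {
        buf.extend_from_slice(&bytes); // still one write per element
    }
}

// Single-write version: materialize the repeated bytes once,
// then hand the whole slice to one bulk write.
fn append_n_bulk(buf: &mut Vec<u8>, n: usize, v: i32) {
    let repeated = v.to_le_bytes().repeat(n);
    buf.extend_from_slice(&repeated);
}
```

Both produce identical bytes; the second trades a temporary allocation for a single copy into the destination buffer, which is the shape of change the reviewer appears to be asking for.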
[GitHub] [arrow] paddyhoran commented on issue #7004: ARROW-3827: [Rust] Implement UnionArray Updated
paddyhoran commented on issue #7004: URL: https://github.com/apache/arrow/pull/7004#issuecomment-617399915 @kszucs it's failing due to `rustfmt` not being installed before testing the flight crate, any idea why this would be the case? Sorry, I don't know much about GitHub actions yet...
[GitHub] [arrow] github-actions[bot] commented on issue #7004: ARROW-3827: [Rust] Implement UnionArray Updated
github-actions[bot] commented on issue #7004: URL: https://github.com/apache/arrow/pull/7004#issuecomment-617398778 https://issues.apache.org/jira/browse/ARROW-3827
[GitHub] [arrow] paddyhoran commented on issue #6209: ARROW-3827: [Rust] Implement UnionArray
paddyhoran commented on issue #6209: URL: https://github.com/apache/arrow/pull/6209#issuecomment-617389640 Closing and I'll open a new PR.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
fsaintjacques commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r412452967 ## File path: python/pyarrow/tests/test_dataset.py ## @@ -671,41 +669,29 @@ def test_fragments(tempdir): f = fragments[0] # file's schema does not include partition column -phys_schema = f.schema.remove(f.schema.get_field_index('part')) -assert f.format.inspect(f.path, f.filesystem) == phys_schema +assert f.physical_schema.names == ['f1', 'f2'] +assert f.format.inspect(f.path, f.filesystem) == f.physical_schema assert f.partition_expression.equals(ds.field('part') == 'a') # scanning fragment includes partition columns -result = f.to_table() -assert f.schema == result.schema +result = f.to_table(schema=dataset.schema) assert result.column_names == ['f1', 'f2', 'part'] -assert len(result) == 4 assert result.equals(table.slice(0, 4)) - -# scanning fragments follow column projection -fragments = list(dataset.get_fragments(columns=['f1', 'part'])) Review comment: Via kwargs to `self._scanner` then forwarding kwargs to `Scanner.from_fragment`
[GitHub] [arrow] working-estimate opened a new issue #7003: from pyarrow import parquet fails with AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute '__reduce_cython__'
working-estimate opened a new issue #7003: URL: https://github.com/apache/arrow/issues/7003 I have tried versions 0.15.1, 0.16.0, 0.17.0. Same error on all. I've seen in other issues that co-installations of tensorflow and numpy might be causing issues. I have tensorflow==1.14.0 and numpy==1.16.4

```
from pyarrow import parquet

~/python/lib/python3.6/site-packages/pyarrow/parquet.py in <module>
     32 import pyarrow as pa
     33 import pyarrow.lib as lib
---> 34 import pyarrow._parquet as _parquet
     35
     36 from pyarrow._parquet import (ParquetReader, Statistics,  # noqa

~/python/lib/python3.6/site-packages/pyarrow/_parquet.pyx in init pyarrow._parquet()

AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute '__reduce_cython__'
```
[GitHub] [arrow] fsaintjacques commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader
fsaintjacques commented on issue #6986: URL: https://github.com/apache/arrow/pull/6986#issuecomment-617368021 See either `archery benchmark diff --help` or the [benchmark](https://arrow.apache.org/docs/developers/benchmarks.html) section of the documentation. Archery can compare benchmarks across different scenarios:

- Commit
- Toolchain
- Compile flags

All it needs is two CMake build directories and it'll compare the benchmark binaries from there.
[GitHub] [arrow] davidanthoff commented on issue #7001: Use lowercase ws2_32 everywhere
davidanthoff commented on issue #7001: URL: https://github.com/apache/arrow/pull/7001#issuecomment-617365369 > How does BinaryBuilder compile Windows binaries on Linux? Using MinGW? Yes, it uses MinGW for Windows, but then it also cross-compiles to lots of other platforms. The PR that tries to get arrow to build is https://github.com/JuliaPackaging/Yggdrasil/pull/918. > (also, could you please open an issue on JIRA as explained above?) I'm not familiar with JIRA, so I'll have to find some time for that, hopefully soon. But no promises, right now I want to use the time I have for this project to get the build to work :)
[GitHub] [arrow] kszucs commented on issue #6883: Prepare for the release candidate
kszucs commented on issue #6883: URL: https://github.com/apache/arrow/pull/6883#issuecomment-617350272 The release is out, we can close this PR.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
fsaintjacques commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r412407398 ## File path: cpp/src/arrow/dataset/dataset.cc ## @@ -72,36 +78,15 @@ Result<std::shared_ptr<ScannerBuilder>> Dataset::NewScan() { return NewScan(std::make_shared<ScanContext>()); } -bool Dataset::AssumePartitionExpression( -const std::shared_ptr<ScanOptions>& scan_options, -std::shared_ptr<ScanOptions>* simplified_scan_options) const { - if (partition_expression_ == nullptr) { -if (simplified_scan_options != nullptr) { - *simplified_scan_options = scan_options; -} -return true; - } +FragmentIterator Dataset::GetFragments() { return GetFragments(scalar(true)); } - auto expr = scan_options->filter->Assume(*partition_expression_); - if (expr->IsNull() || expr->Equals(false)) { -// selector is not satisfiable; yield no fragments -return false; +FragmentIterator Dataset::GetFragments(std::shared_ptr<Expression> predicate) { + if (partition_expression_) { Review comment: A unit test triggered a segfault. I added this to move on to the next failure.
[GitHub] [arrow] pitrou commented on issue #7002: ARROW-8543: [C++] Single pass coalescing algorithm + Rebase
pitrou commented on issue #7002: URL: https://github.com/apache/arrow/pull/7002#issuecomment-617339230 The original PR message is slightly misleading: both algorithms have the same complexity (O(N) except for the sorting step, which is O(N log N)). However, it's true that the new algorithm is more compact and no less readable.
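The single-pass coalescing idea being discussed can be sketched as follows; the function name and parameters are hypothetical stand-ins, not the PR's actual API. After an O(N log N) sort by offset, one O(N) pass merges read ranges separated by a small enough gap:

```python
def coalesce_ranges(ranges, hole_size_limit=0, range_size_limit=1 << 20):
    """Merge nearby (offset, length) read ranges.

    Hypothetical sketch: sort ranges by offset (O(N log N)), then walk them
    once (O(N)), merging a range into its predecessor whenever the gap
    between them is at most hole_size_limit and the merged range stays
    within range_size_limit.
    """
    if not ranges:
        return []
    ranges = sorted(ranges)          # O(N log N) sorting step
    merged = [ranges[0]]
    for off, length in ranges[1:]:   # single O(N) pass
        last_off, last_len = merged[-1]
        gap = off - (last_off + last_len)
        new_len = off + length - last_off
        if gap <= hole_size_limit and new_len <= range_size_limit:
            # Extend the previous range; max() keeps fully contained ranges.
            merged[-1] = (last_off, max(last_len, new_len))
        else:
            merged.append((off, length))
    return merged
```

For example, adjacent ranges `(0, 10)` and `(10, 5)` coalesce into `(0, 15)`, while a distant `(30, 5)` stays separate.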
[GitHub] [arrow] bkietz commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
bkietz commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r412252930 ## File path: cpp/src/arrow/dataset/dataset.h ## @@ -84,13 +82,12 @@ class ARROW_DS_EXPORT Fragment { class ARROW_DS_EXPORT InMemoryFragment : public Fragment { public: InMemoryFragment(RecordBatchVector record_batches, - std::shared_ptr scan_options); + std::shared_ptr = NULLPTR); Review comment: `scalar` is declared such that you can use `scalar(true)` for default arguments ```suggestion std::shared_ptr = scalar(true)); ``` ## File path: cpp/src/arrow/dataset/dataset.h ## @@ -100,16 +97,20 @@ class ARROW_DS_EXPORT InMemoryFragment : public Fragment { RecordBatchVector record_batches_; }; -/// \brief A container of zero or more Fragments. A Dataset acts as a discovery mechanism -/// of Fragments and partitions, e.g. files deeply nested in a directory. +/// \brief A container of zero or more Fragments. +/// +/// A Dataset acts as a union of Fragments, e.g. files deeply nested in a +/// directory. A Dataset has a schema, also known as the "reader" schema. Review comment: Is this known as the "reader" schema outside Avro space? If not I think it's still acceptable to explain this by analogy to Avro, but we should be explicit whose jargon we're borrowing ## File path: cpp/src/arrow/dataset/dataset.h ## @@ -100,16 +97,20 @@ class ARROW_DS_EXPORT InMemoryFragment : public Fragment { RecordBatchVector record_batches_; }; -/// \brief A container of zero or more Fragments. A Dataset acts as a discovery mechanism -/// of Fragments and partitions, e.g. files deeply nested in a directory. +/// \brief A container of zero or more Fragments. +/// +/// A Dataset acts as a union of Fragments, e.g. files deeply nested in a +/// directory. A Dataset has a schema, also known as the "reader" schema. 
+/// class ARROW_DS_EXPORT Dataset : public std::enable_shared_from_this { public: /// \brief Begin to build a new Scan operation against this Dataset Result> NewScan(std::shared_ptr context); Result> NewScan(); - /// \brief GetFragments returns an iterator of Fragments given ScanOptions. - FragmentIterator GetFragments(std::shared_ptr options); + /// \brief GetFragments returns an iterator of Fragments given a predicate. + FragmentIterator GetFragments(std::shared_ptr predicate); + FragmentIterator GetFragments(); Review comment: ```suggestion FragmentIterator GetFragments(std::shared_ptr predicate = scalar(true)); ``` ## File path: cpp/src/arrow/dataset/dataset.cc ## @@ -30,34 +30,40 @@ namespace arrow { namespace dataset { -Fragment::Fragment(std::shared_ptr scan_options) -: scan_options_(std::move(scan_options)), partition_expression_(scalar(true)) {} +Fragment::Fragment(std::shared_ptr partition_expression) +: partition_expression_(partition_expression ? partition_expression : scalar(true)) {} -const std::shared_ptr& Fragment::schema() const { - return scan_options_->schema(); -} +Result> InMemoryFragment::ReadPhysicalSchema() { Review comment: Probably follow up: `InMemoryFragment` should be constructed with an explicit schema with convenience constructor to extract batches[0]->schema that fails if batches are empty. 
## File path: cpp/src/arrow/dataset/dataset.cc ## @@ -72,36 +78,15 @@ Result> Dataset::NewScan() { return NewScan(std::make_shared()); } -bool Dataset::AssumePartitionExpression( -const std::shared_ptr& scan_options, -std::shared_ptr* simplified_scan_options) const { - if (partition_expression_ == nullptr) { -if (simplified_scan_options != nullptr) { - *simplified_scan_options = scan_options; -} -return true; - } +FragmentIterator Dataset::GetFragments() { return GetFragments(scalar(true)); } - auto expr = scan_options->filter->Assume(*partition_expression_); - if (expr->IsNull() || expr->Equals(false)) { -// selector is not satisfiable; yield no fragments -return false; +FragmentIterator Dataset::GetFragments(std::shared_ptr predicate) { + if (partition_expression_) { Review comment: Isn't this initialized to `scalar(true)`? ## File path: cpp/src/arrow/dataset/filter.cc ## @@ -655,31 +655,32 @@ std::string ScalarExpression::ToString() const { return value_->ToString() + ":" + type_repr; } +using arrow::internal::JoinStrings; Review comment: :+1: ## File path: cpp/src/arrow/dataset/file_parquet_test.cc ## @@ -470,8 +466,8 @@ TEST_F(TestParquetFileFormat, PredicatePushdownRowGroupFragments) { CountRowGroupsInFragment(fragment, internal::Iota(5, static_cast(kNumRowGroups)), "i64"_ >= int64_t(6)); - CountRowGroupsInFragment(fragment, {5, 6, 7}, "i64"_ >= int64_t(6), - "i64"_ <
[GitHub] [arrow] github-actions[bot] commented on issue #7002: ARROW-8543: [C++] Single pass coalescing algorithm + Rebase
github-actions[bot] commented on issue #7002: URL: https://github.com/apache/arrow/pull/7002#issuecomment-617329923 https://issues.apache.org/jira/browse/ARROW-8543
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r412350484 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == {
-b'ARROW:extension:metadata': b'freq=D',
-b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: @jorisvandenbossche You may want to check whether this is ok for you.
[GitHub] [arrow] pitrou commented on issue #7001: Use lowercase ws2_32 everywhere
pitrou commented on issue #7001: URL: https://github.com/apache/arrow/pull/7001#issuecomment-617301935 (also, could you please open an issue on JIRA as explained above?)
[GitHub] [arrow] pitrou commented on issue #7001: Use lowercase ws2_32 everywhere
pitrou commented on issue #7001: URL: https://github.com/apache/arrow/pull/7001#issuecomment-617301716 How does `BinaryBuilder` compile Windows binaries on Linux? Using MinGW?
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r412310009 ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -756,10 +737,35 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, DictionaryMemo* dictiona RETURN_NOT_OK(IntFromFlatbuffer(int_data, _type)); ARROW_ASSIGN_OR_RAISE(type, DictionaryType::Make(index_type, type, encoding->isOrdered())); -*out = ::arrow::field(field_name, type, field->nullable(), metadata); -RETURN_NOT_OK(dictionary_memo->AddField(encoding->id(), *out)); - } else { -*out = ::arrow::field(field_name, type, field->nullable(), metadata); +dictionary_id = encoding->id(); + } + + // 4. Is it an extension type? + if (metadata != nullptr) { +// Look for extension metadata in custom_metadata field +int name_index = metadata->FindKey(kExtensionTypeKeyName); +if (name_index != -1) { + std::string type_name = metadata->value(name_index); + int data_index = metadata->FindKey(kExtensionMetadataKeyName); + std::string type_data = data_index == -1 ? "" : metadata->value(data_index); + + std::shared_ptr ext_type = GetExtensionType(type_name); + if (ext_type != nullptr) { +ARROW_ASSIGN_OR_RAISE(type, ext_type->Deserialize(type, type_data)); +// Remove the metadata, for faithful roundtripping +RETURN_NOT_OK(metadata->DeleteMany({name_index, data_index})); Review comment: @wesm Do you have an opinion on this?
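The lookup-and-strip logic in the diff above can be modeled in a few lines. `reconstruct_extension` and its `registry` argument are hypothetical stand-ins for `GetExtensionType` and the extension type registry; the metadata keys are the real ones discussed in the review:

```python
def reconstruct_extension(storage_type, metadata, registry):
    """Sketch of the IPC-side extension reconstruction in the diff above.

    Look up ARROW:extension:name in the field metadata; if a deserializer is
    registered for that name, rebuild the extension type from the storage
    type and strip both extension keys so the metadata roundtrips
    faithfully. Unknown extensions keep their metadata untouched.
    """
    name = metadata.get(b"ARROW:extension:name")
    if name is None:
        return storage_type, metadata
    deserialize = registry.get(name)
    if deserialize is None:
        # Unregistered extension type: keep metadata for roundtripping.
        return storage_type, metadata
    data = metadata.get(b"ARROW:extension:metadata", b"")
    ext_type = deserialize(storage_type, data)
    kept = {k: v for k, v in metadata.items()
            if k not in (b"ARROW:extension:name", b"ARROW:extension:metadata")}
    return ext_type, kept
```

This mirrors why the test above now expects `schema.field("ext").metadata == {}`: once the type can be reconstructed, the extension keys are removed.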
[GitHub] [arrow] github-actions[bot] commented on issue #7001: Use lowercase ws2_32 everywhere
github-actions[bot] commented on issue #7001: URL: https://github.com/apache/arrow/pull/7001#issuecomment-617271478 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r412305089 ## File path: dev/archery/archery/integration/datagen.py ## @@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case(): dictionaries=[dict0, dict1, dict2]) +def generate_extension_case(): +uuid_type = ExtensionType('uuid', 'uuid-serialization', + FixedSizeBinaryField('', 16)) + +fields = [ +ExtensionField('uuids', uuid_type), Review comment: Ok, it was more involved than I imagined, because the IPC layer had to be fixed as well.
[GitHub] [arrow] davidanthoff opened a new pull request #7001: Use lowercase ws2_32 everywhere
davidanthoff opened a new pull request #7001: URL: https://github.com/apache/arrow/pull/7001 With this patch I can cross-compile arrow from a Linux system, in particular I can compile Windows binaries on a Linux system (using https://binarybuilder.org/). I hope to eventually be able to use things from Julia with this. My best guess is that the inconsistent casing of `ws2_32` in the various build files/systems is no problem when compiling things on Windows because file systems there tend to be case insensitive.
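A quick way to see why this only bites on case-sensitive cross-compile hosts (such as MinGW on Linux) is to collect the casing variants a build tree actually uses for the library name. The helper below is a hypothetical illustration, not part of the patch:

```python
import re

def find_case_variants(lines, name="ws2_32"):
    """Report casing variants of a library name across build-file lines.

    On Windows the filesystem is usually case-insensitive, so `Ws2_32` and
    `ws2_32` both resolve to the same import library; on a case-sensitive
    cross-compile host they do not, so more than one variant in the result
    signals a likely link failure.
    """
    variants = set()
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    for line in lines:
        for match in pattern.finditer(line):
            variants.add(match.group(0))
    return variants
```

A build tree whose CMake files mix `Ws2_32` and `ws2_32` would return both variants, which is exactly the inconsistency this PR normalizes to lowercase.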
[GitHub] [arrow] wesm commented on issue #6970: ARROW-2714: [Python] Implement variable step slicing with Take
wesm commented on issue #6970: URL: https://github.com/apache/arrow/pull/6970#issuecomment-617250883 +1. Appveyor build looks good https://ci.appveyor.com/project/wesm/arrow/builds/32336612
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
jorisvandenbossche commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r412191161 ## File path: python/pyarrow/tests/test_dataset.py ## @@ -671,41 +669,29 @@ def test_fragments(tempdir): f = fragments[0] # file's schema does not include partition column -phys_schema = f.schema.remove(f.schema.get_field_index('part')) -assert f.format.inspect(f.path, f.filesystem) == phys_schema +assert f.physical_schema.names == ['f1', 'f2'] +assert f.format.inspect(f.path, f.filesystem) == f.physical_schema assert f.partition_expression.equals(ds.field('part') == 'a') # scanning fragment includes partition columns -result = f.to_table() -assert f.schema == result.schema +result = f.to_table(schema=dataset.schema) assert result.column_names == ['f1', 'f2', 'part'] -assert len(result) == 4 assert result.equals(table.slice(0, 4)) - -# scanning fragments follow column projection -fragments = list(dataset.get_fragments(columns=['f1', 'part'])) Review comment: Keep this but where the columns selection is passed to `to_table` ?
[GitHub] [arrow] github-actions[bot] commented on issue #6995: WIP DO NOT MERGE 0.17.0 R release prep
github-actions[bot] commented on issue #6995: URL: https://github.com/apache/arrow/pull/6995#issuecomment-617235575 Revision: f543317d36d39322bd339b49dd8867cbd3f2ad70 Submitted crossbow builds: [ursa-labs/crossbow @ actions-160](https://github.com/ursa-labs/crossbow/branches/all?query=actions-160)

|Task|Status|
|----|------|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-160-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-160-github-test-r-linux-as-cran)|
[GitHub] [arrow] nealrichardson commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula
nealrichardson commented on issue #6996: URL: https://github.com/apache/arrow/pull/6996#issuecomment-617232353 ¯\_(ツ)_/¯ maybe it's time to port this job to GHA
[GitHub] [arrow] wesm commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader
wesm commented on issue #6986: URL: https://github.com/apache/arrow/pull/6986#issuecomment-617228774 FWIW, we have some benchmark diffing code already written in https://github.com/apache/arrow/blob/master/dev/archery/archery/benchmark I'm not sure where this is documented / how to use it to check the output of a single executable cc @fsaintjacques
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r412227300 ## File path: dev/archery/archery/integration/datagen.py ## @@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case(): dictionaries=[dict0, dict1, dict2]) +def generate_extension_case(): +uuid_type = ExtensionType('uuid', 'uuid-serialization', + FixedSizeBinaryField('', 16)) + +fields = [ +ExtensionField('uuids', uuid_type), Review comment: dictionary(ext) is not possible as per https://mail-archives.apache.org/mod_mbox/arrow-dev/202004.mbox/%3CCAJPUwMAvxLYxJg_QRgdALSq3XS%2BY0zV_LYwsd6FVNYbA90RAAw%40mail.gmail.com%3E , but I'll try to add ext(dictionary).
[GitHub] [arrow] bkietz commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
bkietz commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r412194818 ## File path: dev/archery/archery/integration/datagen.py ## @@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case(): dictionaries=[dict0, dict1, dict2]) +def generate_extension_case(): +uuid_type = ExtensionType('uuid', 'uuid-serialization', + FixedSizeBinaryField('', 16)) + +fields = [ +ExtensionField('uuids', uuid_type), Review comment: Should we also test dictionary(ext) and ext(storage=dictionary)?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
jorisvandenbossche commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r412150862 ## File path: cpp/src/arrow/dataset/dataset.h ## @@ -30,12 +30,22 @@ namespace arrow { namespace dataset { -/// \brief A granular piece of a Dataset, such as an individual file, which can be -/// read/scanned separately from other fragments. +/// \brief A granular piece of a Dataset, such as an individual file. /// -/// A Fragment yields a collection of RecordBatch, encapsulated in one or more ScanTasks. +/// A Fragment can be read/scanned separately from other fragments. It yields a +/// collection of RecordBatch, encapsulated in one or more ScanTasks. +/// +/// A notable difference from Dataset is that Fragments have physical schemas +/// which may differ from Fragments. Review comment: ```suggestion /// which may differ from other Fragments. ``` ## File path: python/pyarrow/_dataset.pyx ## @@ -519,30 +500,69 @@ cdef class Fragment: """ return Expression.wrap(self.fragment.partition_expression()) -def to_table(self, use_threads=True, MemoryPool memory_pool=None): -"""Convert this Fragment into a Table. +def _scanner(self, **kwargs): +return Scanner.from_fragment(self, **kwargs) -Use this convenience utility with care. This will serially materialize -the Scan result in memory before creating the Table. +def scan(self, columns=None, filter=None, use_threads=True, + MemoryPool memory_pool=None, **kwargs): +"""Builds a scan operation against the dataset. + Review comment: When using Fragment.scan, it uses the Fragment's physical schema for the resulting table? (since the Fragment is not aware of the dataset "read" schema?) 
If so, we should note that here in the docstring I think ## File path: cpp/src/arrow/dataset/file_parquet.cc ## @@ -433,26 +430,22 @@ Result ParquetFileFormat::GetRowGroupFragments( } FragmentVector fragments(row_groups.size()); - auto new_options = std::make_shared(*fragment.scan_options()); - if (!extra_filter->Equals(true)) { -new_options->filter = and_(std::move(extra_filter), std::move(new_options->filter)); - } - - RowGroupSkipper skipper(std::move(metadata), std::move(arrow_properties), - new_options->filter, std::move(row_groups)); + RowGroupSkipper skipper(std::move(metadata), std::move(arrow_properties), extra_filter, Review comment: We should probably rename "extra_filter" to just "filter" or "predicate" as how it is called in Dataset::GetFragments, since it is no longer "extra" ? ## File path: python/pyarrow/tests/test_dataset.py ## @@ -671,41 +669,29 @@ def test_fragments(tempdir): f = fragments[0] # file's schema does not include partition column -phys_schema = f.schema.remove(f.schema.get_field_index('part')) -assert f.format.inspect(f.path, f.filesystem) == phys_schema +assert f.physical_schema.names == ['f1', 'f2'] +assert f.format.inspect(f.path, f.filesystem) == f.physical_schema assert f.partition_expression.equals(ds.field('part') == 'a') # scanning fragment includes partition columns -result = f.to_table() -assert f.schema == result.schema +result = f.to_table(schema=dataset.schema) Review comment: Can you also test without passing the dataset's schema, and assert that the column_names are [f1, f2] ? 
## File path: python/pyarrow/tests/test_dataset.py ## @@ -671,41 +669,29 @@ def test_fragments(tempdir): f = fragments[0] # file's schema does not include partition column -phys_schema = f.schema.remove(f.schema.get_field_index('part')) -assert f.format.inspect(f.path, f.filesystem) == phys_schema +assert f.physical_schema.names == ['f1', 'f2'] +assert f.format.inspect(f.path, f.filesystem) == f.physical_schema assert f.partition_expression.equals(ds.field('part') == 'a') # scanning fragment includes partition columns -result = f.to_table() -assert f.schema == result.schema +result = f.to_table(schema=dataset.schema) assert result.column_names == ['f1', 'f2', 'part'] -assert len(result) == 4 assert result.equals(table.slice(0, 4)) - -# scanning fragments follow column projection -fragments = list(dataset.get_fragments(columns=['f1', 'part'])) Review comment: Keep this but where the columns selection is passed to `to_table` ?
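The physical-schema vs. "reader" schema distinction debated in this review can be modeled without pyarrow at all. The function below is a hypothetical toy, with columns and partitions as plain dicts: a fragment's physical columns may lack partition columns, which scanning with the dataset's schema re-attaches.

```python
def scan_fragment(columns, partition, dataset_schema=None):
    """Toy model of scanning a fragment under the refactor above.

    `columns` is the fragment's physical data (name -> list of values) and
    `partition` its partition expression (name -> value); both are
    hypothetical simplifications. Without a dataset schema, only physical
    columns come back; with one, partition columns missing from the
    physical data are materialized as constant columns.
    """
    out = {name: list(values) for name, values in columns.items()}
    if dataset_schema is not None:
        length = len(next(iter(out.values()), []))
        for name in dataset_schema:
            if name not in out and name in partition:
                out[name] = [partition[name]] * length
    return out
```

This is the behavior the review asks to document: `f.to_table()` alone yields `['f1', 'f2']`, while `f.to_table(schema=dataset.schema)` yields `['f1', 'f2', 'part']`.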
[GitHub] [arrow] gramirezespinoza commented on issue #6977: Missing `take` method in pyarrow's `Table` class
gramirezespinoza commented on issue #6977: URL: https://github.com/apache/arrow/issues/6977#issuecomment-617178347 Waiting for #6970 to be approved/merged
[GitHub] [arrow] wesm commented on issue #6988: ARROW-8524: [CI] Free up space on github actions
wesm commented on issue #6988: URL: https://github.com/apache/arrow/pull/6988#issuecomment-617164152 The copy-pasta in the .yml files is a bummer. I hope one day for a higher level specification of these tasks
[GitHub] [arrow] github-actions[bot] commented on issue #6999: ARROW-8542: [Release] Fix checksum url in the website post release script
github-actions[bot] commented on issue #6999: URL: https://github.com/apache/arrow/pull/6999#issuecomment-617150142 https://issues.apache.org/jira/browse/ARROW-8542
[GitHub] [arrow] kszucs opened a new pull request #6999: ARROW-8542: [Release] Fix checksum url in the website post release script
kszucs opened a new pull request #6999: URL: https://github.com/apache/arrow/pull/6999
[GitHub] [arrow] pitrou commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian
pitrou commented on issue #6981: URL: https://github.com/apache/arrow/pull/6981#issuecomment-617126711 Perhaps. If the reader is compatible with those files, and roundtripping works, then the writer is probably compliant as well.
[GitHub] [arrow] pprudhvi commented on issue #6990: ARROW-8528 : [CI][NIGHTLY:gandiva-jar-osx] fix gandiva osx build
pprudhvi commented on issue #6990: URL: https://github.com/apache/arrow/pull/6990#issuecomment-617111538 Let's wait till https://github.com/Homebrew/homebrew-core/pull/53445/files is merged. See https://issues.apache.org/jira/browse/ARROW-8539
[GitHub] [arrow] pitrou opened a new pull request #6997: ARROW-8540: [C++] Add memory allocation benchmarks
pitrou opened a new pull request #6997: URL: https://github.com/apache/arrow/pull/6997 Example output:

```
-------------------------------------------------------------------------------
Benchmark                                           Time       CPU  Iterations
-------------------------------------------------------------------------------
TouchArea/size:4096/real_time                      20.1 ns   20.1 ns   34893671
TouchArea/size:65536/real_time                      483 ns    483 ns    1448647
TouchArea/size:1048576/real_time                   7670 ns   7669 ns      90816
TouchArea/size:16777216/real_time                124297 ns 124280 ns       5611
AllocateDeallocate/size:4096/real_time             18.6 ns   18.6 ns   37781939
AllocateDeallocate/size:65536/real_time             161 ns    161 ns    4360765
AllocateDeallocate/size:1048576/real_time           328 ns    328 ns    2131288
AllocateDeallocate/size:16777216/real_time          160 ns    160 ns    4366862
AllocateTouchDeallocate/size:4096/real_time        40.4 ns   40.4 ns   17333165
AllocateTouchDeallocate/size:65536/real_time        640 ns    640 ns    1092988
AllocateTouchDeallocate/size:1048576/real_time     7959 ns   7958 ns      87693
AllocateTouchDeallocate/size:16777216/real_time  124816 ns 124801 ns       5602
AllocateDeallocate/size:4096/real_time             22.2 ns   22.2 ns   31611774
AllocateDeallocate/size:65536/real_time             157 ns    157 ns    4460745
AllocateDeallocate/size:1048576/real_time           330 ns    330 ns    2113808
AllocateDeallocate/size:16777216/real_time          158 ns    158 ns    4439623
AllocateTouchDeallocate/size:4096/real_time        43.0 ns   43.0 ns   16252256
AllocateTouchDeallocate/size:65536/real_time        638 ns    638 ns    1091897
AllocateTouchDeallocate/size:1048576/real_time     7961 ns   7960 ns      87755
AllocateTouchDeallocate/size:16777216/real_time  124699 ns 124682 ns       5588
AllocateDeallocate/size:4096/real_time              232 ns    232 ns    3015215
AllocateDeallocate/size:65536/real_time             153 ns    153 ns    4527945
AllocateDeallocate/size:1048576/real_time           146 ns    146 ns    4720662
AllocateDeallocate/size:16777216/real_time          144 ns    144 ns    4859165
AllocateTouchDeallocate/size:4096/real_time         254 ns    254 ns    2750031
AllocateTouchDeallocate/size:65536/real_time        635 ns    635 ns    1100267
AllocateTouchDeallocate/size:1048576/real_time     7753 ns   7752 ns      89887
AllocateTouchDeallocate/size:16777216/real_time  124518 ns 124501 ns       5604
```
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r412070422

## File path: java/vector/src/test/java/org/apache/arrow/vector/TestLargeVector.java

```diff
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Integration test for a vector with a large (more than 2GB) {@link io.netty.buffer.ArrowBuf} as
+ * the data buffer.
+ * To run this test, please:
+ * 1. Make sure there are 4GB of memory available in the system.
+ * 2. Make sure the default allocation manager type is unsafe.
+ *    This can be achieved by an environment variable or a system property.
+ *    The details can be found in {@link DefaultAllocationManagerOption}.
+ */
+public class TestLargeVector {
+  private static void testLargeLongVector() {
+    System.out.println("Testing large big int vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / BigIntVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        BigIntVector largeVec = new BigIntVector("vec", allocator)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, i * 10L);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " + vecLength + " values");
+
+      for (int i = 0; i < vecLength; i++) {
+        long val = largeVec.get(i);
+        assertEquals(i * 10L, val);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully read " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully read " + vecLength + " values");
+    }
+    System.out.println("Successfully released the large vector.");
+  }
+
+  private static void testLargeIntVector() {
+    System.out.println("Testing large int vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / IntVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        IntVector largeVec = new IntVector("vec", allocator)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, i);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " + vecLength + " values");
+
+      for (int i = 0; i < vecLength; i++) {
+        long val = largeVec.get(i);
+        assertEquals(i, val);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully read " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully read " + vecLength + " values");
+    }
+    System.out.println("Successfully released the large vector.");
+  }
+
+  private static void testLargeDecimalVector() {
+    System.out.println("Testing large decimal vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / DecimalVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        DecimalVector largeVec = new DecimalVector("vec", allocator, 38, 16)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, 0);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " +
```
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r412070083

## File path: java/vector/src/test/java/org/apache/arrow/vector/TestLargeVector.java

```diff
+/**
+ * Integration test for a vector with a large (more than 2GB) {@link io.netty.buffer.ArrowBuf} as
+ * the data buffer.
+ * To run this test, please
+ * Make sure there are 4GB memory available in the system.
+ *
+ * Make sure the default allocation manager type is unsafe.
```

Review comment: Revised. Thank you.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r412067781

## File path: java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java

```diff
@@ -34,31 +33,24 @@
   static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty;
   static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize();

-  private final int allocatedSize;
-  private final UnsafeDirectLittleEndian memoryChunk;
+  private final long allocatedSize;

-  NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) {
-    super(accountingAllocator);
-    this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);
```

Review comment: Revised. Thank you for the good suggestion.
[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian
kiszk commented on issue #6981: URL: https://github.com/apache/arrow/pull/6981#issuecomment-617087421 I think the current test cases for the parquet writer do not verify the bit pattern of the generated parquet file. I will also create that test case in another PR, since it is important on platforms with a different native endianness.
[GitHub] [arrow] pitrou commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula
pitrou commented on issue #6996: URL: https://github.com/apache/arrow/pull/6996#issuecomment-617067706 Wow, that is compiling OpenSSL by hand?
[GitHub] [arrow] pitrou commented on a change in pull request #6991: ARROW-8529: [C++] Fix usage of NextCounts() on dictionary-encoded data
pitrou commented on a change in pull request #6991: URL: https://github.com/apache/arrow/pull/6991#discussion_r412025224

## File path: cpp/src/arrow/util/rle_encoding.h

```diff
@@ -414,6 +414,8 @@ static inline bool IndexInRange(int32_t idx, int32_t dictionary_length) {
 template <typename T>
 inline int RleDecoder::GetBatchWithDict(const T* dictionary, int32_t dictionary_length,
                                         T* values, int batch_size) {
+  using IndexType = int32_t;
```

Review comment: Note that this was previously simply `int`. The change here just makes things clearer and notes the index size explicitly.
[GitHub] [arrow] pitrou commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula
pitrou commented on issue #6996: URL: https://github.com/apache/arrow/pull/6996#issuecomment-617061541 "The job exceeded the maximum log length, and has been terminated." -- restarting
[GitHub] [arrow] pitrou commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader
pitrou commented on issue #6986: URL: https://github.com/apache/arrow/pull/6986#issuecomment-617033346 To be honest, `BitmapAnd` should probably be rewritten using `Bitmap::VisitWords`. But we can revert anyway if we fear regressions may appear in other workloads.
[GitHub] [arrow] cyb70289 commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader
cyb70289 commented on issue #6986: URL: https://github.com/apache/arrow/pull/6986#issuecomment-616978252 This change introduces severe branch misses under certain conditions. See the perf logs below. I changed the benchmark code to run only the problematic test case.

Without this patch
```bash
    807.415826      task-clock (msec)        #    0.979 CPUs utilized
            83      context-switches         #    0.103 K/sec
             0      cpu-migrations           #    0.000 K/sec
           427      page-faults              #    0.529 K/sec
 2,285,801,407      cycles                   #    2.831 GHz                      (83.17%)
     2,313,785      stalled-cycles-frontend  #    0.10% frontend cycles idle     (83.16%)
   915,631,177      stalled-cycles-backend   #   40.06% backend cycles idle      (82.93%)
 9,997,208,858      instructions             #    4.37  insn per cycle
                                             #    0.09  stalled cycles per insn  (83.66%)
 1,679,799,451      branches                 # 2080.464 M/sec                    (83.66%)
       106,599      branch-misses            #    0.01% of all branches          (83.41%)
```

With this patch
```bash
    902.557236      task-clock (msec)        #    0.980 CPUs utilized
            94      context-switches         #    0.104 K/sec
             0      cpu-migrations           #    0.000 K/sec
           427      page-faults              #    0.473 K/sec
 2,567,879,767      cycles                   #    2.845 GHz                      (83.17%)
    88,266,680      stalled-cycles-frontend  #    3.44% frontend cycles idle     (83.17%)
    20,826,862      stalled-cycles-backend   #    0.81% backend cycles idle      (83.03%)
 2,518,949,193      instructions             #    0.98  insn per cycle
                                             #    0.04  stalled cycles per insn  (83.62%)
   847,459,928      branches                 #  938.954 M/sec                    (83.61%)
    75,187,208      branch-misses            #    8.87% of all branches          (83.39%)
```

Absolute counts are not comparable, as gtest runs different loops for each test. The point is that branch misses jump from 0.01% to 8.87% of all branches, which causes a high frontend stall (the CPU waits to fetch code to execute), and IPC (instructions per cycle) drops from 4.37 to 0.98.

I haven't figured out which branch is mispredicted or why. My Haswell CPU is too old to support branch tracing. My guess is [this line](https://github.com/apache/arrow/blob/5093b809d63ac8db99aec9caa7ad7e723f277c46/cpp/src/arrow/util/bit_util.cc#L285), but I have no concrete justification.