[GitHub] [arrow] github-actions[bot] commented on issue #7009: ARROW-8552: [Rust] support iterate parquet row columns

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7009:
URL: https://github.com/apache/arrow/pull/7009#issuecomment-617555901


   https://issues.apache.org/jira/browse/ARROW-8552



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] houqp opened a new pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns

2020-04-21 Thread GitBox


houqp opened a new pull request #7009:
URL: https://github.com/apache/arrow/pull/7009


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7008: ARROW-8551: [CI][Gandiva] Use LLVM 8 in gandiva linux build

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7008:
URL: https://github.com/apache/arrow/pull/7008#issuecomment-617523989


   https://issues.apache.org/jira/browse/ARROW-8551



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pprudhvi opened a new pull request #7008: ARROW-8551: [CI][Gandiva] Use LLVM 8 to build gandiva linux jar

2020-04-21 Thread GitBox


pprudhvi opened a new pull request #7008:
URL: https://github.com/apache/arrow/pull/7008


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pprudhvi commented on issue #6990: ARROW-8528 : [CI][NIGHTLY:gandiva-jar-osx] fix gandiva osx build

2020-04-21 Thread GitBox


pprudhvi commented on issue #6990:
URL: https://github.com/apache/arrow/pull/6990#issuecomment-617519736


   Resolved with https://github.com/Homebrew/homebrew-core/pull/53445/files. 
Closing this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on issue #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-21 Thread GitBox


cyb70289 commented on issue #6954:
URL: https://github.com/apache/arrow/pull/6954#issuecomment-617516947


   > Maybe it was just a thought I had in my head but never expressed. Opened 
https://issues.apache.org/jira/browse/ARROW-8531
   
   Updated this patch to remove ARROW_USE_SIMD



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #6578: ARROW-7371: WIP: [GLib] Add GLib binding of Dataset

2020-04-21 Thread GitBox


wesm commented on issue #6578:
URL: https://github.com/apache/arrow/pull/6578#issuecomment-617515815


   I haven't looked at the details of this binding too much, but I wanted to 
let you know that I'm taking a closer look at the way that filter expressions 
work in the datasets API in the context of being able to support more general 
purpose expression evaluation -- beyond the scope of just the datasets API -- 
i.e. with many more functions. 
   
   In particular, I am concerned about having significant bindings for the 
`Expression` subclasses until we feel confident in the C++ API that we have an 
array-expression API that can accommodate the expanded scope of general purpose 
query processing. In general, the expressions here are not specific to datasets 
-- we should be working toward an expression API (closely tied to a 
kernel/function catalog) that can also be used in projections, hash 
aggregations, join predicates, and other query processing uses. 
   
   So until then, I would recommend that you make minimal bindings of the 
factory functions needed to be able to form filters in the datasets API and 
avoid wrapping the expression subclasses if you can. This will save you work 
now and potentially spare us painful refactoring or API breaks later.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7007: ARROW-8537: [C++] Revert Optimizing BitmapReader

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7007:
URL: https://github.com/apache/arrow/pull/7007#issuecomment-617494295


   https://issues.apache.org/jira/browse/ARROW-8537



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 opened a new pull request #7007: ARROW-8537: [C++] Revert Optimizing BitmapReader

2020-04-21 Thread GitBox


cyb70289 opened a new pull request #7007:
URL: https://github.com/apache/arrow/pull/7007


   Revert PR https://github.com/apache/arrow/pull/6986 as it introduces
   a big performance regression in the BitmapAnd unaligned benchmark.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7006: ARROW-8508 [Rust] FixedSizeListArray improper offset for value

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7006:
URL: https://github.com/apache/arrow/pull/7006#issuecomment-617490579


   https://issues.apache.org/jira/browse/ARROW-8508



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] markhildreth opened a new pull request #7006: ARROW-8508 [Rust] FixedSizeListArray improper offset for value

2020-04-21 Thread GitBox


markhildreth opened a new pull request #7006:
URL: https://github.com/apache/arrow/pull/7006


   Potentially Fixes ARROW-8508
   
   Fixed size list arrays sourced with a non-zero offset into their
   child data were respecting this offset when calculating value offsets
   in the `value_offset` method, but not in the `value` method. This would
   cause nested fixed size list arrays that should have looked like this:
   
   ```
   [
 [ [0, 1] ],
 [ [2, 3], [4, 5] ]
   ]
   ```
   
   ...to behave like this when looking directly at the values...
   
   ```
   [
 [ [0, 1] ],
 [ [0, 1], [2, 3] ],
   ]
   ```
   
   This differs from how ListArray does things (it respects the offset 
in both methods). This PR makes `value` respect the offset as well. Additionally, it 
adds a test case for this (failing before the fix), as well as passing test cases 
for similar situations, including on ListArray.
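   
   For concreteness, a minimal sketch of the arithmetic involved (written in 
C++ with hypothetical names; the actual fix is in the Rust `value` method). 
Both accessors must apply the array's offset:
   
   ```
   #include <cassert>
   #include <vector>
   
   // Hypothetical model of a fixed-size list array of width `list_size`
   // whose child data starts at element `offset` (non-zero for slices).
   struct FixedSizeListModel {
     std::vector<int> child_values;  // flat child data
     int list_size;
     int offset;
   
     // Correct: the array offset participates in the computation.
     int value_offset(int i) const { return (offset + i) * list_size; }
   
     // The bug was equivalent to `child_values.data() + i * list_size`,
     // which ignores `offset`; `value` must reuse the same arithmetic.
     const int* value(int i) const {
       return child_values.data() + value_offset(i);
     }
   };
   
   int main() {
     // Child data [0..5], width 2, sliced to start at list index 1.
     FixedSizeListModel a{{0, 1, 2, 3, 4, 5}, /*list_size=*/2, /*offset=*/1};
     assert(a.value(0)[0] == 2);  // first visible list is [2, 3], not [0, 1]
     assert(a.value(1)[0] == 4);  // second is [4, 5]
     return 0;
   }
   ```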



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on issue #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread GitBox


nealrichardson commented on issue #7005:
URL: https://github.com/apache/arrow/pull/7005#issuecomment-617477677


   I believe the failure on the JIRA link check might be expected: it's possible 
that the pull-request token is not sufficiently authorized to run it. @kou, does 
that sound right?
   
   I just merged this into my fork's master branch and it did skip the cron 
builds (e.g. https://github.com/nealrichardson/arrow/actions/runs/84286607), so 
I think this is doing what it intends.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7005:
URL: https://github.com/apache/arrow/pull/7005#issuecomment-617456224


   https://issues.apache.org/jira/browse/ARROW-8550



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson opened a new pull request #7005: ARROW-8550: [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread GitBox


nealrichardson opened a new pull request #7005:
URL: https://github.com/apache/arrow/pull/7005


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #6995:
URL: https://github.com/apache/arrow/pull/6995#issuecomment-617445300


   Revision: e7dbd9c977b765e618a40e997039be773c9f16bf
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-161](https://github.com/ursa-labs/crossbow/branches/all?query=actions-161)
   
   |Task|Status|
   |----|------|
   
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-161-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|test-conda-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-conda-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-conda-r-3.6)|
   |test-r-linux-as-cran|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-161-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-161-github-test-r-linux-as-cran)|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-161-azure-test-r-rstudio-r-base-3.6-opensuse42)|
   
|test-ubuntu-18.04-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-ubuntu-18.04-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-ubuntu-18.04-r-3.6)|
   
|test-ubuntu-18.04-r-sanitizer|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-161-circle-test-ubuntu-18.04-r-sanitizer.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-161-circle-test-ubuntu-18.04-r-sanitizer)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups

2020-04-21 Thread GitBox


nealrichardson commented on issue #6995:
URL: https://github.com/apache/arrow/pull/6995#issuecomment-617444872


   @github-actions crossbow submit -g r



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #6995: ARROW-8549: [R] Assorted post-0.17 release cleanups

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #6995:
URL: https://github.com/apache/arrow/pull/6995#issuecomment-617442386


   https://issues.apache.org/jira/browse/ARROW-8549



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] davidanthoff commented on a change in pull request #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


davidanthoff commented on a change in pull request #7001:
URL: https://github.com/apache/arrow/pull/7001#discussion_r412497174



##
File path: cpp/cmake_modules/FindThrift.cmake
##
@@ -100,7 +100,7 @@ if(Thrift_FOUND OR THRIFT_FOUND)
INTERFACE_INCLUDE_DIRECTORIES 
"${THRIFT_INCLUDE_DIR}")
   if(WIN32 AND NOT MSVC)
 # We don't need this for Visual C++ because Thrift uses
-# "#pragma comment(lib, "Ws2_32.lib")" in
+# "#pragma comment(lib, "ws2_32.lib")" in

Review comment:
   That makes sense, I adjusted this.
   
   I should probably double check that the cross compile still works without 
this, though. Will report back once I have tried that.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on a change in pull request #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


kou commented on a change in pull request #7001:
URL: https://github.com/apache/arrow/pull/7001#discussion_r412492656



##
File path: cpp/cmake_modules/FindThrift.cmake
##
@@ -100,7 +100,7 @@ if(Thrift_FOUND OR THRIFT_FOUND)
INTERFACE_INCLUDE_DIRECTORIES 
"${THRIFT_INCLUDE_DIR}")
   if(WIN32 AND NOT MSVC)
 # We don't need this for Visual C++ because Thrift uses
-# "#pragma comment(lib, "Ws2_32.lib")" in
+# "#pragma comment(lib, "ws2_32.lib")" in

Review comment:
   We should not change this because this is the code in Thrift: 
https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/windows/config.h#L66





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran commented on issue #6306: ARROW-7705: [Rust] Initial sort implementation

2020-04-21 Thread GitBox


paddyhoran commented on issue #6306:
URL: https://github.com/apache/arrow/pull/6306#issuecomment-617402123


   @nevi-me this needs a rebase now.  Once you do that, I'll take a look so we 
can get this merged.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-21 Thread GitBox


wesm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412474320



##
File path: cpp/src/parquet/file_reader.h
##
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;
 
+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is 
set;
+  /// otherwise this is a no-op. The reader internally maintains a cache which 
is
+  /// overwritten each time this is called. Intended to increase performance on
+  /// high-latency filesystems (e.g. Amazon S3).
+  void PreBuffer(const std::vector<int>& row_groups,
+ const std::vector<int>& column_indices);

Review comment:
   On second look it seems desirable to be able to have control over when 
this operation is invoked. There might be some other options relating to 
concurrent IO calls that you might want to pass here
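   
   For illustration, caller-controlled invocation could look roughly like the 
following sketch (the `enable_coalesced_stream()` setter name is an assumption 
inferred from the `is_coalesced_stream_enabled()` getter in this diff; extra 
IO/concurrency options are omitted):
   
   ```
   #include <memory>
   #include <utility>
   
   #include "arrow/io/interfaces.h"
   #include "parquet/file_reader.h"
   #include "parquet/properties.h"
   
   // Sketch only: open a file with coalescing enabled, then pre-buffer the
   // row groups / columns about to be read, before creating column readers.
   std::unique_ptr<parquet::ParquetFileReader> OpenPreBuffered(
       std::shared_ptr<arrow::io::RandomAccessFile> source) {
     parquet::ReaderProperties props;
     props.enable_coalesced_stream();  // assumed toggle paired with the getter
     auto reader = parquet::ParquetFileReader::Open(std::move(source), props);
     reader->PreBuffer(/*row_groups=*/{0}, /*column_indices=*/{0, 2});
     return reader;
   }
   ```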





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-21 Thread GitBox


wesm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412465641



##
File path: cpp/src/parquet/properties.h
##
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
 
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }
 
+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; 
}
+
   void enable_buffered_stream() { buffered_stream_enabled_ = true; }
 
   void disable_buffered_stream() { buffered_stream_enabled_ = false; }
 
+  /// Enable read coalescing.
+  ///
+  /// When enabled, readers can optionally call
+  /// ParquetFileReader.PreBuffer to cache the necessary slices of the
+  /// file in-memory before deserialization. Arrow readers
+  /// automatically do this. This is intended to increase performance
+  /// when reading from high-latency filesystems (e.g. Amazon S3).
+  ///
+  /// When this is enabled, it is NOT SAFE to concurrently create
+  /// RecordBatchReaders from the same file.

Review comment:
   Could it be made safe?

##
File path: cpp/src/parquet/file_reader.h
##
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;
 
+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is 
set;
+  /// otherwise this is a no-op. The reader internally maintains a cache which 
is
+  /// overwritten each time this is called. Intended to increase performance on
+  /// high-latency filesystems (e.g. Amazon S3).
+  void PreBuffer(const std::vector<int>& row_groups,
+ const std::vector<int>& column_indices);

Review comment:
   My biggest question is whether this detail should be exposed here or 
handled automatically under the hood. I don't have a strong opinion

##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -803,6 +809,9 @@ Status FileReaderImpl::ReadRowGroups(const 
std::vector<int>& row_groups,
 return Status::Invalid("Invalid column index");
   }
 
+  // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
+  parquet_reader()->PreBuffer(row_groups, indices);

Review comment:
   Per comments below, it seems slightly odd that this would be a no-op 
sometimes. It seems like it would be better to have
   
   ```
   if (properties_.prebuffering_enabled()) {
 parquet_reader()->PreBuffer(...);
   }
   ```

##
File path: cpp/src/parquet/properties.h
##
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
 
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }
 
+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; 
}

Review comment:
   So rather than having this flag here, what if you put a prebuffer option 
in `ArrowReadProperties`? Then `FileReader::PreBuffer` will always prebuffer if 
it's called

##
File path: cpp/src/parquet/file_reader.h
##
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;
 
+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is 
set;

Review comment:
   Does this detail need to be exposed? The first time `PreBuffer` is 
called, could the cached reader be created?

##
File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc
##
@@ -141,16 +142,31 @@ class MinioFixture : public benchmark::Fixture {
   }
 
   /// Make an object with Parquet data.
+  /// Appends integer columns to the beginning (to act as indices).
   Status MakeParquetObject(const std::string& path, int num_columns, int 
num_rows) {
-std::vector<std::shared_ptr<ChunkedArray>> columns(num_columns);
-std::vector<std::shared_ptr<Field>> fields(num_columns);
-for (int i = 0; i < num_columns; ++i) {
+std::vector<std::shared_ptr<ChunkedArray>> columns;
+std::vector<std::shared_ptr<Field>> fields;
+
+{
+  arrow::random::RandomArrayGenerator generator(0);
+  std::shared_ptr<Array> values = generator.Int64(num_rows, 0, 1e10, 0);
+  columns.push_back(std::make_shared<ChunkedArray>(values));
+  fields.push_back(::arrow::field("timestamp", values->type()));
+}
+{
+  arrow::random::RandomArrayGenerator generator(1);
+  std::shared_ptr<Array> values = generator.Int32(num_rows, 0, 1e9, 0);
+  columns.push_back(std::make_shared<ChunkedArray>(values));
+  fields.push_back(::arrow::field("val", values->type()));
+}
+
+for (int i = 0; i < num_columns; i++) {
   arrow::random::RandomArrayGenerator generator(i);
   std::shared_ptr<Array> values = generator.Float64(num_rows, -1.e10, 1e10, 0);
   std::stringstream ss;
   ss << "col" << i;
-  columns[i] = std::make_shared<ChunkedArray>(values);
-  fields[i] = ::arrow::field(ss.str(), values->type());
+  

[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance

2020-04-21 Thread GitBox


paddyhoran commented on a change in pull request #6980:
URL: https://github.com/apache/arrow/pull/6980#discussion_r412473292



##
File path: rust/arrow/src/array/builder.rs
##
@@ -236,6 +251,14 @@ impl<T: ArrowPrimitiveType> BufferBuilderTrait<T> for BufferBuilder<T> {
 self.write_bytes(v.to_byte_slice(), 1)
 }
 
+default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
+self.reserve(n)?;
+for _ in 0..n {
+self.write_bytes(v.to_byte_slice(), 1)?;
+}

Review comment:
   I think he means using just a single call to `write_bytes`.
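   
   That is, build the repeated bytes once and hand them to one bulk write 
instead of n small ones; sketched below in C++ as a language-neutral analogue 
(the real change would be in the Rust builder's `write_bytes` usage):
   
   ```
   #include <cstdint>
   #include <cstring>
   #include <vector>
   
   // Sketch: append `n` copies of `v` with a single bulk write, rather than
   // n separate write calls that each re-check capacity.
   void AppendN(std::vector<uint8_t>* buf, int32_t v, size_t n) {
     // Build the repeated value once...
     std::vector<uint8_t> bytes(n * sizeof(v));
     for (size_t i = 0; i < n; ++i) {
       std::memcpy(bytes.data() + i * sizeof(v), &v, sizeof(v));
     }
     // ...then issue one bulk write (the write_bytes analogue).
     buf->insert(buf->end(), bytes.begin(), bytes.end());
   }
   ```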





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran commented on issue #7004: ARROW-3827: [Rust] Implement UnionArray Updated

2020-04-21 Thread GitBox


paddyhoran commented on issue #7004:
URL: https://github.com/apache/arrow/pull/7004#issuecomment-617399915


   @kszucs it's failing because `rustfmt` is not installed before the 
flight crate is tested. Any idea why this would be the case? Sorry, I don't know 
much about GitHub Actions yet...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7004: ARROW-3827: [Rust] Implement UnionArray Updated

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7004:
URL: https://github.com/apache/arrow/pull/7004#issuecomment-617398778


   https://issues.apache.org/jira/browse/ARROW-3827



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran commented on issue #6209: ARROW-3827: [Rust] Implement UnionArray

2020-04-21 Thread GitBox


paddyhoran commented on issue #6209:
URL: https://github.com/apache/arrow/pull/6209#issuecomment-617389640


   Closing and I'll open a new PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation

2020-04-21 Thread GitBox


fsaintjacques commented on a change in pull request #7000:
URL: https://github.com/apache/arrow/pull/7000#discussion_r412452967



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -671,41 +669,29 @@ def test_fragments(tempdir):
 f = fragments[0]
 
 # file's schema does not include partition column
-phys_schema = f.schema.remove(f.schema.get_field_index('part'))
-assert f.format.inspect(f.path, f.filesystem) == phys_schema
+assert f.physical_schema.names == ['f1', 'f2']
+assert f.format.inspect(f.path, f.filesystem) == f.physical_schema
 assert f.partition_expression.equals(ds.field('part') == 'a')
 
 # scanning fragment includes partition columns
-result = f.to_table()
-assert f.schema == result.schema
+result = f.to_table(schema=dataset.schema)
 assert result.column_names == ['f1', 'f2', 'part']
-assert len(result) == 4
 assert result.equals(table.slice(0, 4))
-
-# scanning fragments follow column projection
-fragments = list(dataset.get_fragments(columns=['f1', 'part']))

Review comment:
   Via kwargs to `self._scanner` then forwarding kwargs to 
`Scanner.from_fragment`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] working-estimate opened a new issue #7003: from pyarrow import parquet fails with AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute '__reduce_cython__'

2020-04-21 Thread GitBox


working-estimate opened a new issue #7003:
URL: https://github.com/apache/arrow/issues/7003


   I have tried versions 0.15.1, 0.16.0, 0.17.0. Same error on all. I've seen 
in other issues that co-installations of tensorflow and numpy might be causing 
issues. I have tensorflow==1.14.0 and numpy==1.16.4
   
   ```
   from pyarrow import parquet

   ~/python/lib/python3.6/site-packages/pyarrow/parquet.py in <module>()
        32 import pyarrow as pa
        33 import pyarrow.lib as lib
   ---> 34 import pyarrow._parquet as _parquet
        35 
        36 from pyarrow._parquet import (ParquetReader, Statistics,  # noqa

   ~/python/lib/python3.6/site-packages/pyarrow/_parquet.pyx in init pyarrow._parquet()

   AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute
   '__reduce_cython__'
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader

2020-04-21 Thread GitBox


fsaintjacques commented on issue #6986:
URL: https://github.com/apache/arrow/pull/6986#issuecomment-617368021


   See either `archery benchmark diff --help` or the 
[benchmark](https://arrow.apache.org/docs/developers/benchmarks.html) section 
of the documentation. Archery can compare the same binary with different 
scenarios:
   - Commit
   - Toolchain
   - Compile flags
   
   All it needs is two cmake build directories, and it will compare the benchmark 
binaries from there.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] davidanthoff commented on issue #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


davidanthoff commented on issue #7001:
URL: https://github.com/apache/arrow/pull/7001#issuecomment-617365369


   > How does BinaryBuilder compile Windows binaries on Linux? Using MinGW?
   
   Yes, it uses MinGW for Windows, but then it also cross-compiles to lots of 
other platforms. The PR that tries to get arrow to build is 
https://github.com/JuliaPackaging/Yggdrasil/pull/918.
   
   > (also, could you please open an issue on JIRA as explained above?)
   
   I'm not familiar with JIRA, so I'll have to find some time for that, 
hopefully soon. But no promises; right now I want to use the time I have for 
this project to get the build to work :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on issue #6883: Prepare for the release candidate

2020-04-21 Thread GitBox


kszucs commented on issue #6883:
URL: https://github.com/apache/arrow/pull/6883#issuecomment-617350272


   The release is out, we can close this PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation

2020-04-21 Thread GitBox


fsaintjacques commented on a change in pull request #7000:
URL: https://github.com/apache/arrow/pull/7000#discussion_r412407398



##
File path: cpp/src/arrow/dataset/dataset.cc
##
@@ -72,36 +78,15 @@ Result<std::shared_ptr<ScannerBuilder>> Dataset::NewScan() {
   return NewScan(std::make_shared<ScanContext>());
 }
 
-bool Dataset::AssumePartitionExpression(
-const std::shared_ptr<ScanOptions>& scan_options,
-std::shared_ptr<ScanOptions>* simplified_scan_options) const {
-  if (partition_expression_ == nullptr) {
-if (simplified_scan_options != nullptr) {
-  *simplified_scan_options = scan_options;
-}
-return true;
-  }
+FragmentIterator Dataset::GetFragments() { return GetFragments(scalar(true)); }
 
-  auto expr = scan_options->filter->Assume(*partition_expression_);
-  if (expr->IsNull() || expr->Equals(false)) {
-// selector is not satisfiable; yield no fragments
-return false;
+FragmentIterator Dataset::GetFragments(std::shared_ptr<Expression> predicate) {
+  if (partition_expression_) {

Review comment:
   A unit test triggered a segfault. I added this to move on to the next 
failure.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #7002: ARROW-8543: [C++] Single pass coalescing algorithm + Rebase

2020-04-21 Thread GitBox


pitrou commented on issue #7002:
URL: https://github.com/apache/arrow/pull/7002#issuecomment-617339230


   The original PR message is slightly misleading: both algorithms have the 
same complexity (O(N), apart from the sorting step, which is O(N log N)). 
However, it's true that the new algorithm is more compact and no less readable.
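   
   For reference, the single-pass structure being described is roughly the 
following sketch (simplified; `hole_size_limit` and `range_size_limit` stand in 
for the actual coalescing parameters):
   
   ```
   #include <algorithm>
   #include <cstdint>
   #include <vector>
   
   struct ReadRange { int64_t offset, length; };
   
   // After an O(N log N) sort, one linear pass merges each range into its
   // predecessor when the gap between them is small enough and the merged
   // range stays within the size bound.
   std::vector<ReadRange> Coalesce(std::vector<ReadRange> ranges,
                                   int64_t hole_size_limit,
                                   int64_t range_size_limit) {
     std::sort(ranges.begin(), ranges.end(),
               [](const ReadRange& a, const ReadRange& b) {
                 return a.offset < b.offset;
               });
     std::vector<ReadRange> out;
     for (const auto& r : ranges) {
       if (!out.empty()) {
         ReadRange& last = out.back();
         int64_t gap = r.offset - (last.offset + last.length);
         int64_t merged = r.offset + r.length - last.offset;
         if (gap <= hole_size_limit && merged <= range_size_limit) {
           last.length = merged;  // extend the previous range
           continue;
         }
       }
       out.push_back(r);  // start a new coalesced range
     }
     return out;
   }
   ```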



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation

2020-04-21 Thread GitBox


bkietz commented on a change in pull request #7000:
URL: https://github.com/apache/arrow/pull/7000#discussion_r412252930



##
File path: cpp/src/arrow/dataset/dataset.h
##
@@ -84,13 +82,12 @@ class ARROW_DS_EXPORT Fragment {
 class ARROW_DS_EXPORT InMemoryFragment : public Fragment {
  public:
   InMemoryFragment(RecordBatchVector record_batches,
-   std::shared_ptr<ScanOptions> scan_options);
+   std::shared_ptr<Expression> = NULLPTR);

Review comment:
   `scalar` is declared such that you can use `scalar(true)` for default 
arguments
   ```suggestion
   std::shared_ptr<Expression> = scalar(true));
   ```

##
File path: cpp/src/arrow/dataset/dataset.h
##
@@ -100,16 +97,20 @@ class ARROW_DS_EXPORT InMemoryFragment : public Fragment {
   RecordBatchVector record_batches_;
 };
 
-/// \brief A container of zero or more Fragments. A Dataset acts as a 
discovery mechanism
-/// of Fragments and partitions, e.g. files deeply nested in a directory.
+/// \brief A container of zero or more Fragments.
+///
+/// A Dataset acts as a union of Fragments, e.g. files deeply nested in a
+/// directory. A Dataset has a schema, also known as the "reader" schema.

Review comment:
   Is this known as the "reader" schema outside Avro space? If not I think 
it's still acceptable to explain this by analogy to Avro, but we should be 
explicit whose jargon we're borrowing

##
File path: cpp/src/arrow/dataset/dataset.h
##
@@ -100,16 +97,20 @@ class ARROW_DS_EXPORT InMemoryFragment : public Fragment {
   RecordBatchVector record_batches_;
 };
 
-/// \brief A container of zero or more Fragments. A Dataset acts as a 
discovery mechanism
-/// of Fragments and partitions, e.g. files deeply nested in a directory.
+/// \brief A container of zero or more Fragments.
+///
+/// A Dataset acts as a union of Fragments, e.g. files deeply nested in a
+/// directory. A Dataset has a schema, also known as the "reader" schema.
+///
 class ARROW_DS_EXPORT Dataset : public std::enable_shared_from_this<Dataset> {
  public:
   /// \brief Begin to build a new Scan operation against this Dataset
   Result<std::shared_ptr<ScannerBuilder>> NewScan(std::shared_ptr<ScanContext> context);
   Result<std::shared_ptr<ScannerBuilder>> NewScan();
 
-  /// \brief GetFragments returns an iterator of Fragments given ScanOptions.
-  FragmentIterator GetFragments(std::shared_ptr<ScanOptions> options);
+  /// \brief GetFragments returns an iterator of Fragments given a predicate.
+  FragmentIterator GetFragments(std::shared_ptr<Expression> predicate);
+  FragmentIterator GetFragments();

Review comment:
   ```suggestion
  FragmentIterator GetFragments(std::shared_ptr<Expression> predicate = scalar(true));
   ```

##
File path: cpp/src/arrow/dataset/dataset.cc
##
@@ -30,34 +30,40 @@
 namespace arrow {
 namespace dataset {
 
-Fragment::Fragment(std::shared_ptr<ScanOptions> scan_options)
-: scan_options_(std::move(scan_options)), 
partition_expression_(scalar(true)) {}
+Fragment::Fragment(std::shared_ptr<Expression> partition_expression)
+: partition_expression_(partition_expression ? partition_expression : 
scalar(true)) {}
 
-const std::shared_ptr<Schema>& Fragment::schema() const {
-  return scan_options_->schema();
-}
+Result<std::shared_ptr<Schema>> InMemoryFragment::ReadPhysicalSchema() {

Review comment:
   Probably follow up: `InMemoryFragment` should be constructed with an 
explicit schema with convenience constructor to extract batches[0]->schema that 
fails if batches are empty.

##
File path: cpp/src/arrow/dataset/dataset.cc
##
@@ -72,36 +78,15 @@ Result<std::shared_ptr<ScannerBuilder>> Dataset::NewScan() {
   return NewScan(std::make_shared<ScanContext>());
 }
 
-bool Dataset::AssumePartitionExpression(
-const std::shared_ptr<ScanOptions>& scan_options,
-std::shared_ptr<ScanOptions>* simplified_scan_options) const {
-  if (partition_expression_ == nullptr) {
-if (simplified_scan_options != nullptr) {
-  *simplified_scan_options = scan_options;
-}
-return true;
-  }
+FragmentIterator Dataset::GetFragments() { return GetFragments(scalar(true)); }
 
-  auto expr = scan_options->filter->Assume(*partition_expression_);
-  if (expr->IsNull() || expr->Equals(false)) {
-// selector is not satisfiable; yield no fragments
-return false;
+FragmentIterator Dataset::GetFragments(std::shared_ptr<Expression> predicate) {
+  if (partition_expression_) {

Review comment:
   Isn't this initialized to `scalar(true)`?

##
File path: cpp/src/arrow/dataset/filter.cc
##
@@ -655,31 +655,32 @@ std::string ScalarExpression::ToString() const {
   return value_->ToString() + ":" + type_repr;
 }
 
+using arrow::internal::JoinStrings;

Review comment:
   :+1: 

##
File path: cpp/src/arrow/dataset/file_parquet_test.cc
##
@@ -470,8 +466,8 @@ TEST_F(TestParquetFileFormat, 
PredicatePushdownRowGroupFragments) {
   CountRowGroupsInFragment(fragment, internal::Iota(5, 
static_cast<int>(kNumRowGroups)),
"i64"_ >= int64_t(6));
 
-  CountRowGroupsInFragment(fragment, {5, 6, 7}, "i64"_ >= int64_t(6),
-   "i64"_ < 

[GitHub] [arrow] github-actions[bot] commented on issue #7002: ARROW-8543: [C++] Single pass coalescing algorithm + Rebase

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7002:
URL: https://github.com/apache/arrow/pull/7002#issuecomment-617329923


   https://issues.apache.org/jira/browse/ARROW-8543



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types

2020-04-21 Thread GitBox


pitrou commented on a change in pull request #6959:
URL: https://github.com/apache/arrow/pull/6959#discussion_r412350484



##
File path: python/pyarrow/tests/test_extension_type.py
##
@@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type):
 import base64
 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
 schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
-assert schema.field("ext").metadata == {
-b'ARROW:extension:metadata': b'freq=D',
-b'ARROW:extension:name': b'pandas.period'}
+# Since the type could be reconstructed, the extension type metadata is
+# absent.
+assert schema.field("ext").metadata == {}

Review comment:
   @jorisvandenbossche You may want to check whether this is ok for you.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


pitrou commented on issue #7001:
URL: https://github.com/apache/arrow/pull/7001#issuecomment-617301935


   (also, could you please open an issue on JIRA as explained above?)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


pitrou commented on issue #7001:
URL: https://github.com/apache/arrow/pull/7001#issuecomment-617301716


   How does `BinaryBuilder` compile Windows binaries on Linux? Using MinGW?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types

2020-04-21 Thread GitBox


pitrou commented on a change in pull request #6959:
URL: https://github.com/apache/arrow/pull/6959#discussion_r412310009



##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -756,10 +737,35 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, 
DictionaryMemo* dictionary_memo,
 RETURN_NOT_OK(IntFromFlatbuffer(int_data, &index_type));
 ARROW_ASSIGN_OR_RAISE(type,
   DictionaryType::Make(index_type, type, 
encoding->isOrdered()));
-*out = ::arrow::field(field_name, type, field->nullable(), metadata);
-RETURN_NOT_OK(dictionary_memo->AddField(encoding->id(), *out));
-  } else {
-*out = ::arrow::field(field_name, type, field->nullable(), metadata);
+dictionary_id = encoding->id();
+  }
+
+  // 4. Is it an extension type?
+  if (metadata != nullptr) {
+// Look for extension metadata in custom_metadata field
+int name_index = metadata->FindKey(kExtensionTypeKeyName);
+if (name_index != -1) {
+  std::string type_name = metadata->value(name_index);
+  int data_index = metadata->FindKey(kExtensionMetadataKeyName);
+  std::string type_data = data_index == -1 ? "" : 
metadata->value(data_index);
+
+  std::shared_ptr<ExtensionType> ext_type = GetExtensionType(type_name);
+  if (ext_type != nullptr) {
+ARROW_ASSIGN_OR_RAISE(type, ext_type->Deserialize(type, type_data));
+// Remove the metadata, for faithful roundtripping
+RETURN_NOT_OK(metadata->DeleteMany({name_index, data_index}));

Review comment:
   @wesm Do you have an opinion on this?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #7001:
URL: https://github.com/apache/arrow/pull/7001#issuecomment-617271478


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types

2020-04-21 Thread GitBox


pitrou commented on a change in pull request #6959:
URL: https://github.com/apache/arrow/pull/6959#discussion_r412305089



##
File path: dev/archery/archery/integration/datagen.py
##
@@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case():
   dictionaries=[dict0, dict1, dict2])
 
 
+def generate_extension_case():
+uuid_type = ExtensionType('uuid', 'uuid-serialization',
+  FixedSizeBinaryField('', 16))
+
+fields = [
+ExtensionField('uuids', uuid_type),

Review comment:
   Ok, it was more involved than I imagined, because the IPC layer had to 
be fixed as well.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] davidanthoff opened a new pull request #7001: Use lowercase ws2_32 everywhere

2020-04-21 Thread GitBox


davidanthoff opened a new pull request #7001:
URL: https://github.com/apache/arrow/pull/7001


   With this patch I can cross-compile arrow from a Linux system; in particular, 
I can compile Windows binaries on a Linux system (using 
https://binarybuilder.org/). I hope to eventually be able to use things from 
Julia with this.
   
   My best guess is that the inconsistent casing of `ws2_32` in the various 
build files/systems is no problem when compiling on Windows because file 
systems there tend to be case-insensitive.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #6970: ARROW-2714: [Python] Implement variable step slicing with Take

2020-04-21 Thread GitBox


wesm commented on issue #6970:
URL: https://github.com/apache/arrow/pull/6970#issuecomment-617250883


   +1. Appveyor build looks good 
https://ci.appveyor.com/project/wesm/arrow/builds/32336612



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation

2020-04-21 Thread GitBox


jorisvandenbossche commented on a change in pull request #7000:
URL: https://github.com/apache/arrow/pull/7000#discussion_r412191161



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -671,41 +669,29 @@ def test_fragments(tempdir):
 f = fragments[0]
 
 # file's schema does not include partition column
-phys_schema = f.schema.remove(f.schema.get_field_index('part'))
-assert f.format.inspect(f.path, f.filesystem) == phys_schema
+assert f.physical_schema.names == ['f1', 'f2']
+assert f.format.inspect(f.path, f.filesystem) == f.physical_schema
 assert f.partition_expression.equals(ds.field('part') == 'a')
 
 # scanning fragment includes partition columns
-result = f.to_table()
-assert f.schema == result.schema
+result = f.to_table(schema=dataset.schema)
 assert result.column_names == ['f1', 'f2', 'part']
-assert len(result) == 4
 assert result.equals(table.slice(0, 4))
-
-# scanning fragments follow column projection
-fragments = list(dataset.get_fragments(columns=['f1', 'part']))

Review comment:
   Keep this but where the columns selection is passed to `to_table` ?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #6995: WIP DO NOT MERGE 0.17.0 R release prep

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #6995:
URL: https://github.com/apache/arrow/pull/6995#issuecomment-617235575


   Revision: f543317d36d39322bd339b49dd8867cbd3f2ad70
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-160](https://github.com/ursa-labs/crossbow/branches/all?query=actions-160)
   
   |Task|Status|
   |----|------|
   |test-r-linux-as-cran|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-160-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-160-github-test-r-linux-as-cran)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula

2020-04-21 Thread GitBox


nealrichardson commented on issue #6996:
URL: https://github.com/apache/arrow/pull/6996#issuecomment-617232353


   ¯\_(ツ)_/¯ maybe it's time to port this job to GHA



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader

2020-04-21 Thread GitBox


wesm commented on issue #6986:
URL: https://github.com/apache/arrow/pull/6986#issuecomment-617228774


   FWIW, we have some benchmark diffing code already written in 
   
   https://github.com/apache/arrow/blob/master/dev/archery/archery/benchmark
   
   I'm not sure where this is documented / how to use it to check the output of 
a single executable 
   
   cc @fsaintjacques 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types

2020-04-21 Thread GitBox


pitrou commented on a change in pull request #6959:
URL: https://github.com/apache/arrow/pull/6959#discussion_r412227300



##
File path: dev/archery/archery/integration/datagen.py
##
@@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case():
   dictionaries=[dict0, dict1, dict2])
 
 
+def generate_extension_case():
+uuid_type = ExtensionType('uuid', 'uuid-serialization',
+  FixedSizeBinaryField('', 16))
+
+fields = [
+ExtensionField('uuids', uuid_type),

Review comment:
   dictionary(ext) is not possible as per 
https://mail-archives.apache.org/mod_mbox/arrow-dev/202004.mbox/%3CCAJPUwMAvxLYxJg_QRgdALSq3XS%2BY0zV_LYwsd6FVNYbA90RAAw%40mail.gmail.com%3E
 , but I'll try to add ext(dictionary).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types

2020-04-21 Thread GitBox


bkietz commented on a change in pull request #6959:
URL: https://github.com/apache/arrow/pull/6959#discussion_r412194818



##
File path: dev/archery/archery/integration/datagen.py
##
@@ -1401,6 +1437,18 @@ def generate_nested_dictionary_case():
   dictionaries=[dict0, dict1, dict2])
 
 
+def generate_extension_case():
+uuid_type = ExtensionType('uuid', 'uuid-serialization',
+  FixedSizeBinaryField('', 16))
+
+fields = [
+ExtensionField('uuids', uuid_type),

Review comment:
   Should we also test dictionary(ext) and ext(storage=dictionary)?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation

2020-04-21 Thread GitBox


jorisvandenbossche commented on a change in pull request #7000:
URL: https://github.com/apache/arrow/pull/7000#discussion_r412150862



##
File path: cpp/src/arrow/dataset/dataset.h
##
@@ -30,12 +30,22 @@
 namespace arrow {
 namespace dataset {
 
-/// \brief A granular piece of a Dataset, such as an individual file, which 
can be
-/// read/scanned separately from other fragments.
+/// \brief A granular piece of a Dataset, such as an individual file.
 ///
-/// A Fragment yields a collection of RecordBatch, encapsulated in one or more 
ScanTasks.
+/// A Fragment can be read/scanned separately from other fragments. It yields a
+/// collection of RecordBatch, encapsulated in one or more ScanTasks.
+///
+/// A notable difference from Dataset is that Fragments have physical schemas
+/// which may differ from Fragments.

Review comment:
   ```suggestion
   /// which may differ from other Fragments.
   ```

##
File path: python/pyarrow/_dataset.pyx
##
@@ -519,30 +500,69 @@ cdef class Fragment:
 """
 return Expression.wrap(self.fragment.partition_expression())
 
-def to_table(self, use_threads=True, MemoryPool memory_pool=None):
-"""Convert this Fragment into a Table.
+def _scanner(self, **kwargs):
+return Scanner.from_fragment(self, **kwargs)
 
-Use this convenience utility with care. This will serially materialize
-the Scan result in memory before creating the Table.
+def scan(self, columns=None, filter=None, use_threads=True,
+ MemoryPool memory_pool=None, **kwargs):
+"""Builds a scan operation against the dataset.
+

Review comment:
   When using Fragment.scan, it uses the Fragment's physical schema for the 
resulting table? (since the Fragment is not aware of the dataset "read" 
schema?) 
   If so, we should note that here in the docstring I think

##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -433,26 +430,22 @@ Result<FragmentIterator> ParquetFileFormat::GetRowGroupFragments(
   }
   FragmentVector fragments(row_groups.size());
 
-  auto new_options = std::make_shared<ScanOptions>(*fragment.scan_options());
-  if (!extra_filter->Equals(true)) {
-new_options->filter = and_(std::move(extra_filter), 
std::move(new_options->filter));
-  }
-
-  RowGroupSkipper skipper(std::move(metadata), std::move(arrow_properties),
-  new_options->filter, std::move(row_groups));
+  RowGroupSkipper skipper(std::move(metadata), std::move(arrow_properties), 
extra_filter,

Review comment:
   We should probably rename "extra_filter" to just "filter" or "predicate" 
as how it is called in Dataset::GetFragments, since it is no longer "extra" ?

##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -671,41 +669,29 @@ def test_fragments(tempdir):
     f = fragments[0]
 
     # file's schema does not include partition column
-    phys_schema = f.schema.remove(f.schema.get_field_index('part'))
-    assert f.format.inspect(f.path, f.filesystem) == phys_schema
+    assert f.physical_schema.names == ['f1', 'f2']
+    assert f.format.inspect(f.path, f.filesystem) == f.physical_schema
     assert f.partition_expression.equals(ds.field('part') == 'a')
 
     # scanning fragment includes partition columns
-    result = f.to_table()
-    assert f.schema == result.schema
+    result = f.to_table(schema=dataset.schema)

Review comment:
   Can you also test without passing the dataset's schema, and assert that the 
column_names are ['f1', 'f2']?
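   Something along these lines, reusing the names from this test (hedged, since the exact semantics are what this PR defines):
   ```python
   # without passing the dataset's schema, the physical schema should be used
   # and the partition column should be absent
   result = f.to_table()
   assert result.column_names == ['f1', 'f2']
   ```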

##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -671,41 +669,29 @@ def test_fragments(tempdir):
     f = fragments[0]
 
     # file's schema does not include partition column
-    phys_schema = f.schema.remove(f.schema.get_field_index('part'))
-    assert f.format.inspect(f.path, f.filesystem) == phys_schema
+    assert f.physical_schema.names == ['f1', 'f2']
+    assert f.format.inspect(f.path, f.filesystem) == f.physical_schema
     assert f.partition_expression.equals(ds.field('part') == 'a')
 
     # scanning fragment includes partition columns
-    result = f.to_table()
-    assert f.schema == result.schema
+    result = f.to_table(schema=dataset.schema)
     assert result.column_names == ['f1', 'f2', 'part']
-    assert len(result) == 4
     assert result.equals(table.slice(0, 4))
-
-    # scanning fragments follow column projection
-    fragments = list(dataset.get_fragments(columns=['f1', 'part']))

Review comment:
   Keep this, but with the columns selection passed to `to_table`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] gramirezespinoza commented on issue #6977: Missing `take` method in pyarrow's `Table` class

2020-04-21 Thread GitBox


gramirezespinoza commented on issue #6977:
URL: https://github.com/apache/arrow/issues/6977#issuecomment-617178347


   Waiting for #6970 to be approved/merged



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #6988: ARROW-8524: [CI] Free up space on github actions

2020-04-21 Thread GitBox


wesm commented on issue #6988:
URL: https://github.com/apache/arrow/pull/6988#issuecomment-617164152


   The copy-pasta in the .yml files is a bummer. I hope we'll one day have a 
higher-level specification of these tasks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on issue #6999: ARROW-8542: [Release] Fix checksum url in the website post release script

2020-04-21 Thread GitBox


github-actions[bot] commented on issue #6999:
URL: https://github.com/apache/arrow/pull/6999#issuecomment-617150142


   https://issues.apache.org/jira/browse/ARROW-8542



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #6999: ARROW-8542: [Release] Fix checksum url in the website post release script

2020-04-21 Thread GitBox


kszucs opened a new pull request #6999:
URL: https://github.com/apache/arrow/pull/6999


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


pitrou commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617126711


   Perhaps. If the reader is compatible with those files, and roundtripping 
works, then the writer is probably compliant as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pprudhvi commented on issue #6990: ARROW-8528 : [CI][NIGHTLY:gandiva-jar-osx] fix gandiva osx build

2020-04-21 Thread GitBox


pprudhvi commented on issue #6990:
URL: https://github.com/apache/arrow/pull/6990#issuecomment-617111538


   let's wait until https://github.com/Homebrew/homebrew-core/pull/53445/files is 
merged. See https://issues.apache.org/jira/browse/ARROW-8539



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou opened a new pull request #6997: ARROW-8540: [C++] Add memory allocation benchmarks

2020-04-21 Thread GitBox


pitrou opened a new pull request #6997:
URL: https://github.com/apache/arrow/pull/6997


   Example output:
   ```
   ------------------------------------------------------------------------------------
   Benchmark                                            Time            CPU   Iterations
   ------------------------------------------------------------------------------------
   TouchArea/size:4096/real_time                     20.1 ns        20.1 ns     34893671
   TouchArea/size:65536/real_time                     483 ns         483 ns      1448647
   TouchArea/size:1048576/real_time                  7670 ns        7669 ns        90816
   TouchArea/size:16777216/real_time               124297 ns      124280 ns         5611

   AllocateDeallocate/size:4096/real_time            18.6 ns        18.6 ns     37781939
   AllocateDeallocate/size:65536/real_time            161 ns         161 ns      4360765
   AllocateDeallocate/size:1048576/real_time          328 ns         328 ns      2131288
   AllocateDeallocate/size:16777216/real_time         160 ns         160 ns      4366862
   AllocateTouchDeallocate/size:4096/real_time       40.4 ns        40.4 ns     17333165
   AllocateTouchDeallocate/size:65536/real_time       640 ns         640 ns      1092988
   AllocateTouchDeallocate/size:1048576/real_time    7959 ns        7958 ns        87693
   AllocateTouchDeallocate/size:16777216/real_time 124816 ns      124801 ns         5602

   AllocateDeallocate/size:4096/real_time            22.2 ns        22.2 ns     31611774
   AllocateDeallocate/size:65536/real_time            157 ns         157 ns      4460745
   AllocateDeallocate/size:1048576/real_time          330 ns         330 ns      2113808
   AllocateDeallocate/size:16777216/real_time         158 ns         158 ns      4439623
   AllocateTouchDeallocate/size:4096/real_time       43.0 ns        43.0 ns     16252256
   AllocateTouchDeallocate/size:65536/real_time       638 ns         638 ns      1091897
   AllocateTouchDeallocate/size:1048576/real_time    7961 ns        7960 ns        87755
   AllocateTouchDeallocate/size:16777216/real_time 124699 ns      124682 ns         5588

   AllocateDeallocate/size:4096/real_time             232 ns         232 ns      3015215
   AllocateDeallocate/size:65536/real_time            153 ns         153 ns      4527945
   AllocateDeallocate/size:1048576/real_time          146 ns         146 ns      4720662
   AllocateDeallocate/size:16777216/real_time         144 ns         144 ns      4859165
   AllocateTouchDeallocate/size:4096/real_time        254 ns         254 ns      2750031
   AllocateTouchDeallocate/size:65536/real_time       635 ns         635 ns      1100267
   AllocateTouchDeallocate/size:1048576/real_time    7753 ns        7752 ns        89887
   AllocateTouchDeallocate/size:16777216/real_time 124518 ns      124501 ns         5604
   ```
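   For context, a minimal sketch of how such a benchmark can be written against Arrow's `MemoryPool` with Google Benchmark — this is not the PR's code; the name, sizes, and registration details are assumptions:
   ```cpp
   #include <cstring>

   #include <benchmark/benchmark.h>

   #include "arrow/memory_pool.h"
   #include "arrow/util/logging.h"

   // Allocate a buffer from the default pool, touch every byte so the pages
   // are actually faulted in, then free it.
   static void AllocateTouchDeallocate(benchmark::State& state) {
     const int64_t size = state.range(0);
     arrow::MemoryPool* pool = arrow::default_memory_pool();
     for (auto _ : state) {
       uint8_t* data;
       ARROW_CHECK_OK(pool->Allocate(size, &data));
       std::memset(data, 0x5a, static_cast<size_t>(size));  // touch every byte
       benchmark::DoNotOptimize(data);
       pool->Free(data, size);
     }
     state.SetBytesProcessed(state.iterations() * size);
   }

   BENCHMARK(AllocateTouchDeallocate)
       ->Arg(4096)->Arg(65536)->Arg(1 << 20)->Arg(16 << 20)
       ->UseRealTime();

   BENCHMARK_MAIN();
   ```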



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-21 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r412070422



##
File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestLargeVector.java
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Integration test for a vector with a large (more than 2GB) {@link io.netty.buffer.ArrowBuf} as
+ * the data buffer.
+ * To run this test, please
+ * <ul>
+ *   <li>Make sure there are 4GB memory available in the system.</li>
+ *   <li>Make sure the default allocation manager type is unsafe.
+ *   This can be achieved by the environmental variable or system property.
+ *   The details can be found in {@link DefaultAllocationManagerOption}.</li>
+ * </ul>
+ */
+public class TestLargeVector {
+  private static void testLargeLongVector() {
+    System.out.println("Testing large big int vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / BigIntVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        BigIntVector largeVec = new BigIntVector("vec", allocator)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, i * 10L);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " + vecLength + " values");
+
+      for (int i = 0; i < vecLength; i++) {
+        long val = largeVec.get(i);
+        assertEquals(i * 10L, val);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully read " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully read " + vecLength + " values");
+    }
+    System.out.println("Successfully released the large vector.");
+  }
+
+  private static void testLargeIntVector() {
+    System.out.println("Testing large int vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / IntVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        IntVector largeVec = new IntVector("vec", allocator)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, i);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " + vecLength + " values");
+
+      for (int i = 0; i < vecLength; i++) {
+        long val = largeVec.get(i);
+        assertEquals(i, val);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully read " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully read " + vecLength + " values");
+    }
+    System.out.println("Successfully released the large vector.");
+  }
+
+  private static void testLargeDecimalVector() {
+    System.out.println("Testing large decimal vector.");
+
+    final long bufSize = 4 * 1024 * 1024 * 1024L;
+    final int vecLength = (int) (bufSize / DecimalVector.TYPE_WIDTH);
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+        DecimalVector largeVec = new DecimalVector("vec", allocator, 38, 16)) {
+      largeVec.allocateNew(vecLength);
+
+      System.out.println("Successfully allocated a vector with capacity " + vecLength);
+
+      for (int i = 0; i < vecLength; i++) {
+        largeVec.set(i, 0);
+
+        if ((i + 1) % 1 == 0) {
+          System.out.println("Successfully written " + (i + 1) + " values");
+        }
+      }
+      System.out.println("Successfully written " + vecLength + " values");

[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-21 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r412070083



##
File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestLargeVector.java
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Integration test for a vector with a large (more than 2GB) {@link io.netty.buffer.ArrowBuf} as
+ * the data buffer.
+ * To run this test, please
+ * <ul>
+ *   <li>Make sure there are 4GB memory available in the system.</li>
+ *   <li>Make sure the default allocation manager type is unsafe.

Review comment:
   Revised. Thank you.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-21 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r412067781



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
##
@@ -34,31 +33,24 @@
   static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty;
   static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize();
 
-  private final int allocatedSize;
-  private final UnsafeDirectLittleEndian memoryChunk;
+  private final long allocatedSize;
 
-  NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) {
-    super(accountingAllocator);
-    this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);

Review comment:
   Revised. Thank you for the good suggestion. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617087421


   I think the current test cases for the parquet writer do not verify the bit 
pattern of the generated parquet file. I will create that test case in another 
PR, since such tests are important on platforms with different native 
endianness.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula

2020-04-21 Thread GitBox


pitrou commented on issue #6996:
URL: https://github.com/apache/arrow/pull/6996#issuecomment-617067706


   Wow, that is compiling OpenSSL by hand?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #6991: ARROW-8529: [C++] Fix usage of NextCounts() on dictionary-encoded data

2020-04-21 Thread GitBox


pitrou commented on a change in pull request #6991:
URL: https://github.com/apache/arrow/pull/6991#discussion_r412025224



##
File path: cpp/src/arrow/util/rle_encoding.h
##
@@ -414,6 +414,8 @@ static inline bool IndexInRange(int32_t idx, int32_t dictionary_length) {
 template <typename T>
 inline int RleDecoder::GetBatchWithDict(const T* dictionary, int32_t dictionary_length,
                                         T* values, int batch_size) {
+  using IndexType = int32_t;

Review comment:
   Note that this was previously simply `int`. The change here just makes the 
code clearer and notes the index size explicitly.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #6996: ARROW-8538: [Packaging] Remove boost from homebrew formula

2020-04-21 Thread GitBox


pitrou commented on issue #6996:
URL: https://github.com/apache/arrow/pull/6996#issuecomment-617061541


   "The job exceeded the maximum log length, and has been terminated." -- 
restarting



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader

2020-04-21 Thread GitBox


pitrou commented on issue #6986:
URL: https://github.com/apache/arrow/pull/6986#issuecomment-617033346


   To be honest, `BitmapAnd` should probably be rewritten using 
`Bitmap::VisitWords`.
   But we can revert anyway if we fear regressions may appear in other 
workloads.
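   For reference, a minimal sketch of the word-at-a-time shape that `Bitmap::VisitWords` enables — an illustration only, not Arrow's implementation; bitmap offsets and trailing bits are deliberately ignored:
   ```cpp
   #include <cstdint>

   // AND two bitmaps 64 bits at a time: no per-bit branch, so there is
   // nothing for the branch predictor to mispredict and the loop can
   // auto-vectorize.
   void BitmapAndWords(const uint64_t* left, const uint64_t* right,
                       int64_t num_words, uint64_t* out) {
     for (int64_t i = 0; i < num_words; ++i) {
       out[i] = left[i] & right[i];
     }
   }
   ```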



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on issue #6986: ARROW-8523: [C++] Optimize BitmapReader

2020-04-21 Thread GitBox


cyb70289 commented on issue #6986:
URL: https://github.com/apache/arrow/pull/6986#issuecomment-616978252


   This change introduces severe branch misses under certain conditions. See the 
perf logs below; I changed the benchmark code to run only the problematic test case.
   
   Without this patch
   ```bash
           807.415826      task-clock (msec)         #    0.979 CPUs utilized
                   83      context-switches          #    0.103 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  427      page-faults               #    0.529 K/sec
        2,285,801,407      cycles                    #    2.831 GHz                      (83.17%)
            2,313,785      stalled-cycles-frontend   #    0.10% frontend cycles idle     (83.16%)
          915,631,177      stalled-cycles-backend    #   40.06% backend cycles idle      (82.93%)
        9,997,208,858      instructions              #    4.37  insn per cycle
                                                     #    0.09  stalled cycles per insn  (83.66%)
        1,679,799,451      branches                  # 2080.464 M/sec                    (83.66%)
              106,599      branch-misses             #    0.01% of all branches          (83.41%)
   ```
   
   With this patch
   ```bash
           902.557236      task-clock (msec)         #    0.980 CPUs utilized
                   94      context-switches          #    0.104 K/sec
                    0      cpu-migrations            #    0.000 K/sec
                  427      page-faults               #    0.473 K/sec
        2,567,879,767      cycles                    #    2.845 GHz                      (83.17%)
           88,266,680      stalled-cycles-frontend   #    3.44% frontend cycles idle     (83.17%)
           20,826,862      stalled-cycles-backend    #    0.81% backend cycles idle      (83.03%)
        2,518,949,193      instructions              #    0.98  insn per cycle
                                                     #    0.04  stalled cycles per insn  (83.62%)
          847,459,928      branches                  #  938.954 M/sec                    (83.61%)
           75,187,208      branch-misses             #    8.87% of all branches          (83.39%)
   ```
   Absolute counts are not comparable, as gtest runs different loop counts for 
each test. The point is that branch-misses jump from 0.01% to 8.87%, which 
causes a high frontend stall (the CPU waits to fetch code to execute), and IPC 
(instructions per cycle) drops from 4.37 to 0.98.
   
   I haven't figured out which branch is mispredicted, or why. My Haswell CPU 
is too old to support branch tracing. My guess is [this 
line](https://github.com/apache/arrow/blob/5093b809d63ac8db99aec9caa7ad7e723f277c46/cpp/src/arrow/util/bit_util.cc#L285),
 but with no concrete justification.
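   For anyone reproducing this, counters like the above come from `perf stat`; a hedged sketch of the invocation (the binary path and filter name are assumptions, adjust to the local build):
   ```bash
   perf stat -d ./release/arrow-bit-util-benchmark \
       --benchmark_filter='BitmapReader'
   ```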



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org