[GitHub] [arrow] jianxind edited a comment on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind edited a comment on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649198815


   Here are the results: 
https://ci.ursalabs.org/#/builders/73/builds/90/steps/3/logs/result. Not sure 
why they did not paste here; every benchmark above the 0.01% threshold shows an improvement. 
   
   ```
   106  BM_PlainEncodingSpacedFloat/32768/1     2.355 GiB/sec    6.385 GiB/sec    171.068  {'run_name': 'BM_PlainEncodingSpacedFloat/3276...
   80   BM_PlainDecodingSpacedFloat/32768/1     3.312 GiB/sec    8.534 GiB/sec    157.683  {'run_name': 'BM_PlainDecodingSpacedFloat/3276...
   37   BM_PlainDecodingSpacedBoolean/32768/1   595.833 MiB/sec  1.035 GiB/sec     77.882  {'run_name': 'BM_PlainDecodingSpacedBoolean/32...
   92   BM_PlainEncodingSpacedDouble/32768/1    3.834 GiB/sec    5.799 GiB/sec     51.278  {'run_name': 'BM_PlainEncodingSpacedDouble/327...
   149  BM_PlainEncodingSpacedBoolean/32768/1   270.760 MiB/sec  393.746 MiB/sec   45.422  {'run_name': 'BM_PlainEncodingSpacedBoolean/32...
   75   BM_PlainDecodingSpacedDouble/32768/1    4.603 GiB/sec    5.475 GiB/sec     18.954  {'run_name': 'BM_PlainDecodingSpacedDouble/327...
   ```
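   As a sanity check, the percentage column is just the relative throughput change between baseline and contender. The figures below are copied from the table above (they are rounded, hence the small tolerance); this is only an illustrative sketch of how ursabot derives the column:

   ```python
   def pct_change(baseline, contender):
       # Relative throughput change in percent, as reported by the benchmark diff.
       return (contender / baseline - 1.0) * 100.0

   # BM_PlainEncodingSpacedFloat: 2.355 -> 6.385 GiB/sec, reported as 171.068
   encoding_float_gain = pct_change(2.355, 6.385)
   ```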
   
   > @ursabot benchmark --suite-filter=parquet-encoding-benchmark
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jianxind commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649197066


   @ursabot benchmark --suite-filter=parquet-encoding-benchmark --benchmark-filter=BM_Plain







[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-24 Thread GitBox


kiszk commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r445289548



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -427,6 +469,14 @@ class TestIpcRoundTrip : public ::testing::TestWithParam,
   void TearDown() { IpcTestFixture::TearDown(); }
 };
 
+class TestIpcDifferentEndian
+    : public ::testing::TestWithParam>,
+      public IpcTestFixture {
+ public:
+  void SetUp() { IpcTestFixture::SetUp(); }
+  void TearDown() { IpcTestFixture::TearDown(); }
+};

Review comment:
   Good point. My current thinking is to implement another version of 
[ArrayLoader](https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L107)
 (its `LoadType` method), which is called from `LoadRecordBatchSubset`: the 
existing version performs no endian conversion, and the other would perform it. 
At that level we know the type of each element, so I imagine swapping 
endianness is straightforward. What do you think?
   
   Adding a new API `Result<std::shared_ptr<RecordBatch>> 
RecordBatch::EnsureEndianness(Endianness)` is a good idea to avoid unnecessary 
conversion in the above scenario. Technically, this could work correctly.
   I have two questions:
   1. I imagine this API would be called explicitly by the user. If so, is 
that acceptable? Of course, most scenarios (communication between hosts of the 
same endianness) work fine without calling the new API.
   2. The new API needs to traverse all of the elements again when data 
conversion is required, which may add overhead.
   
   What do you think? cc @pitrou 
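   To make the discussion concrete, here is a toy model (in Python, with a hypothetical helper name; the real conversion would happen per element inside `ArrayLoader` or the proposed `EnsureEndianness`) of byte-swapping a buffer of fixed-width values once the element type is known:

   ```python
   import struct

   def ensure_little_endian(buf, fmt_char):
       # Reinterpret a big-endian packed buffer as little-endian.
       # fmt_char is a struct code such as "i" (int32) or "d" (float64);
       # because the element type is known, the swap is one pass per element.
       n = len(buf) // struct.calcsize(fmt_char)
       values = struct.unpack(">%d%s" % (n, fmt_char), buf)
       return struct.pack("<%d%s" % (n, fmt_char), *values)

   big = struct.pack(">3i", 1, 258, -7)      # big-endian int32 payload
   little = ensure_little_endian(big, "i")   # swapped per-element
   ```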









[GitHub] [arrow] wesm commented on pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled

2020-06-24 Thread GitBox


wesm commented on pull request #7439:
URL: https://github.com/apache/arrow/pull/7439#issuecomment-649195633


   The Java docs don't build at all
   
   ```
   + mvn -B -DskipTests -Drat.skip=true -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn install
   [INFO] Scanning for projects...
   [INFO] 
   [INFO] Detecting the operating system and CPU architecture
   [INFO] 
   [INFO] os.detected.name: linux
   [INFO] os.detected.arch: x86_64
   [INFO] os.detected.version: 5.3
   [INFO] os.detected.version.major: 5
   [INFO] os.detected.version.minor: 3
   [INFO] os.detected.release: ubuntu
   [INFO] os.detected.release.version: 18.04
   [INFO] os.detected.release.like.ubuntu: true
   [INFO] os.detected.release.like.debian: true
   [INFO] os.detected.classifier: linux-x86_64
   [INFO] 
   [INFO] Reactor Build Order:
   [INFO] 
   [INFO] Apache Arrow Java Root POM                                       [pom]
   [INFO] Arrow Format                                                     [jar]
   [INFO] Arrow Memory                                                     [jar]
   [INFO] Arrow Vectors                                                    [jar]
   [INFO] Arrow Tools                                                      [jar]
   [INFO] Arrow JDBC Adapter                                               [jar]
   [INFO] Arrow Plasma Client                                              [jar]
   [INFO] Arrow Flight Core                                                [jar]
   [INFO] Arrow Flight GRPC                                                [jar]
   [INFO] Arrow AVRO Adapter                                               [jar]
   [INFO] Arrow Algorithms                                                 [jar]
   [INFO] Arrow Performance Benchmarks                                     [jar]
   [INFO] 
   [INFO] --< org.apache.arrow:arrow-java-root >--
   [INFO] Building Apache Arrow Java Root POM 1.0.0-SNAPSHOT              [1/12]
   [INFO] [ pom ]-
   [INFO] 
   [INFO] --- apache-rat-plugin:0.13:check (rat-checks) @ arrow-java-root ---
   [INFO] RAT will not execute since it is configured to be skipped via system property 'rat.skip'.
   [INFO] 
   [INFO] --- maven-checkstyle-plugin:3.1.0:check (validate) @ arrow-java-root ---
   [INFO] Starting audit...
   Audit done.
   [INFO] 
   [INFO] --- git-commit-id-plugin:2.2.2:revision (for-jars) @ arrow-java-root ---
   [INFO] 
   [INFO] --- git-commit-id-plugin:2.2.2:revision (for-source-tarball) @ arrow-java-root ---
   [INFO] 
   [INFO] --- maven-remote-resources-plugin:1.5:process (process-resource-bundles) @ arrow-java-root ---
   [INFO] 
   [INFO] --- maven-site-plugin:3.5.1:attach-descriptor (attach-descriptor) @ arrow-java-root ---
   [INFO] 
   [INFO] --- maven-jar-plugin:3.0.0:test-jar (default) @ arrow-java-root ---
   [INFO] Skipping packaging of the test-jar
   [INFO] 
   [INFO] --- maven-enforcer-plugin:3.0.0-M2:enforce (validate_java_and_maven_version) @ arrow-java-root ---
   [INFO] 
   [INFO] --- maven-enforcer-plugin:3.0.0-M2:enforce (avoid_bad_dependencies) @ arrow-java-root ---
   [INFO] 
   [INFO] --- maven-dependency-plugin:3.0.1:analyze-only (analyze) @ arrow-java-root ---
   [INFO] Skipping pom project
   [INFO] 
   [INFO] --- maven-install-plugin:2.5.2:install (default-install) @ arrow-java-root ---
   [INFO] Installing /arrow/java/pom.xml to /root/.m2/repository/org/apache/arrow/arrow-java-root/1.0.0-SNAPSHOT/arrow-java-root-1.0.0-SNAPSHOT.pom
   [INFO] 
   [INFO] ---< org.apache.arrow:arrow-format >
   [INFO] Building Arrow Format 1.0.0-SNAPSHOT                            [2/12]
   [INFO] [ jar ]-
   [INFO] 
   [INFO] --- apache-rat-plugin:0.13:check (rat-checks) @ arrow-format ---
   [INFO] RAT will not execute since it is configured to be skipped via system property 'rat.skip'.
   [INFO] 
   [INFO] --- maven-checkstyle-plugin:3.1.0:check (validate) @ arrow-format ---
   [INFO] 
   [INFO] --- git-commit-id-plugin:2.2.2:revision (for-jars) @ arrow-format ---
   [INFO] 
   [INFO] --- maven-dependency-plugin:3.0.1:copy (copy-flatc) @ arrow-format ---
   [INFO] Configured Artifact: com.github.icexelloss:flatc-linux-x86_64:1.9.0:exe
   [INFO] Copying flatc-linux-x86_64-1.9.0.exe to /arrow/java/format/target/flatc-linux-x86_64-1.9.0.exe
   [INFO] 
   [INFO] --- exec-maven-plugin:1.4.0:exec (script-chmod) @ arrow-format ---
   [INFO] 
   [INFO] --- exec-maven-plugin:1.4.0:exec (default) @ arrow-format ---
   [INFO] 
   [INFO] --- 
   ```

[GitHub] [arrow] wesm closed pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled

2020-06-24 Thread GitBox


wesm closed pull request #7439:
URL: https://github.com/apache/arrow/pull/7439


   







[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445281773



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/align_util.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+/// \brief Compress the buffer to spaced, excluding the null entries.
+///
+/// \param[in] src the source buffer
+/// \param[in] num_values the size of source buffer
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \param[out] output the output buffer spaced
+/// \return The size of spaced buffer.
+template <typename T>
+inline int SpacedCompress(const T* src, int num_values, const uint8_t* valid_bits,
+                          int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  int idx_src = 0;
+
+  const auto p =
+      arrow::internal::BitmapWordAlign<1>(valid_bits, valid_bits_offset, num_values);
+  // First handle the leading bits
+  const int leading_bits = static_cast<int>(p.leading_bits);
+  while (idx_src < leading_bits) {
+    if (BitUtil::GetBit(valid_bits, valid_bits_offset)) {
+      output[num_valid_values] = src[idx_src];
+      num_valid_values++;
+    }
+    idx_src++;
+    valid_bits_offset++;
+  }
+
+  // The aligned parts scaned with BitBlockCounter
+  arrow::internal::BitBlockCounter data_counter(valid_bits, valid_bits_offset,
+                                                num_values - leading_bits);
+  auto current_block = data_counter.NextWord();
+  while (idx_src < num_values) {
+    if (current_block.AllSet()) {  // All true values
+      int run_length = 0;
+      // Scan forward until a block that has some false values (or the end)
+      while (current_block.length > 0 && current_block.AllSet()) {
+        run_length += current_block.length;
+        current_block = data_counter.NextWord();
+      }
+      // Fill all valid values of this scan
+      std::memcpy(&output[num_valid_values], &src[idx_src], run_length * sizeof(T));
+      num_valid_values += run_length;
+      idx_src += run_length;
+      valid_bits_offset += run_length;
+      // The current_block already computed, advance to next loop
+      continue;
+    } else if (!current_block.NoneSet()) {  // Some values are null
+      arrow::internal::BitmapReader valid_bits_reader(valid_bits, valid_bits_offset,
+                                                      current_block.length);
+      for (int64_t i = 0; i < current_block.length; i++) {
+        if (valid_bits_reader.IsSet()) {
+          output[num_valid_values] = src[idx_src];
+          num_valid_values++;
+        }
+        idx_src++;
+        valid_bits_reader.Next();
+      }
+      valid_bits_offset += current_block.length;
+    } else {  // All null values
+      idx_src += current_block.length;
+      valid_bits_offset += current_block.length;
+    }
+    current_block = data_counter.NextWord();
+  }
+
+  return num_valid_values;
+}
+
+/// \brief Expand the buffer but leave spaces for null entries.
+///
+/// \param[in, out] buffer the in-place buffer
+/// \param[in] num_values total size of buffer including null slots
+/// \param[in] null_count number of null slots
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \return The number of values expanded, including nulls.
+template <typename T>
+inline int SpacedExpand(T* buffer, int num_values, int null_count,
+                        const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr uint64_t kBatchSize = 64;
+  static_assert(kBatchSize == 64, "Invalid batch size, BitBlockCounter stick to 64");
+  constexpr uint64_t kBatchByteSize = kBatchSize / 8;
+
+  // Point to end as we add the spacing from the back.
+  int idx_decode = num_values - null_count - 1;
+  int idx_buffer = num_values - 1;
+  int64_t idx_valid_bits = valid_bits_offset + idx_buffer;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  
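The compress/expand pair quoted above can be modeled in a few lines of Python (a reference sketch of the semantics only; the C++ code additionally batches over 64-bit bitmap words and expands in place from the back):

```python
def spaced_compress(src, valid):
    # Pack only the valid (non-null) entries to the front, like SpacedCompress.
    return [v for v, ok in zip(src, valid) if ok]

def spaced_expand(values, valid, fill=None):
    # Re-insert spacing for the null slots, like SpacedExpand (front-to-back
    # here for clarity; the C++ version walks backwards to work in place).
    it = iter(values)
    return [next(it) if ok else fill for ok in valid]

valid = [True, False, True, True, False, True]
src = [10, 0, 20, 30, 0, 40]
packed = spaced_compress(src, valid)
```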

[GitHub] [arrow] wesm closed pull request #7538: ARROW-7925: [C++][Docs] Better document use of IWYU, including new 'match' option

2020-06-24 Thread GitBox


wesm closed pull request #7538:
URL: https://github.com/apache/arrow/pull/7538


   







[GitHub] [arrow] wesm commented on pull request #7538: ARROW-7925: [C++][Docs] Better document use of IWYU, including new 'match' option

2020-06-24 Thread GitBox


wesm commented on pull request #7538:
URL: https://github.com/apache/arrow/pull/7538#issuecomment-649186060


   +1







[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-24 Thread GitBox


bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649154064


   Actually, on reflection: I'm not sure it's worthwhile to check the count of 
unique values at all. In any given batch a virtual column would be materialized 
with a single-item dictionary so `int8` should always suffice. (Unless we want 
to always support concatenation of a materialized table's chunks, though even 
in that case we could promote the index type on concatenation...). 
   
   Currently it seems preferable to remove `max_partition_dictionary_size` in 
favor of a boolean flag and always infer `dictionary`.
   
   @jorisvandenbossche ?
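   For reference, picking the narrowest signed index type for a dictionary is a one-liner; `smallest_index_type` below is a hypothetical helper sketching the trade-off under discussion (Arrow's actual inference lives in C++):

   ```python
   def smallest_index_type(num_unique):
       # An intN index can address 2**(N-1) dictionary entries (indices >= 0).
       for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
           if num_unique <= 2 ** (bits - 1):
               return name
       raise ValueError("too many unique values for a dictionary index")
   ```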







[GitHub] [arrow] jianxind commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445249689



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@

[GitHub] [arrow] jianxind commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445249248



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+/// \brief Expand the buffer but leave spaces for null entries.

Review comment:
   done, 3x









[GitHub] [arrow] jianxind commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445249183



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+  // The aligned parts scaned with BitBlockCounter

Review comment:
   done
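   For context, this thread concerns the `BitBlockCounter` scan in `spaced.h`: the validity bitmap is consumed 64 bits at a time, and each word is classified so the hot loop can take an all-valid, an all-null, or a mixed path. A rough Python model of that classification (a sketch, not Arrow's implementation):

   ```python
   def classify_blocks(valid, block_size=64):
       # Classify each 64-bit word of a validity bitmap by its popcount,
       # mirroring BitBlockCounter's AllSet()/NoneSet()/mixed distinction.
       kinds = []
       for i in range(0, len(valid), block_size):
           chunk = valid[i:i + block_size]
           popcount = sum(chunk)
           if popcount == len(chunk):
               kinds.append("all_set")    # whole run can be memcpy'd
           elif popcount == 0:
               kinds.append("none_set")   # whole run can be skipped
           else:
               kinds.append("mixed")      # fall back to bit-by-bit
       return kinds

   bitmap = [True] * 64 + [False] * 64 + [True, False] * 16
   ```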









[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-24 Thread GitBox


bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649148829


   I think there's value in finding the smallest index type possible; we expect 
partition fields to have few unique values in most cases.







[GitHub] [arrow] jianxind commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649147883


   @ursabot benchmark --suite-filter=parquet-encoding-benchmark







[GitHub] [arrow] jianxind commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445245990



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.

Review comment:
   I put it in a new header because it makes further optimization easier in the 
future, something like the SIMD approach I proposed before at 
https://github.com/apache/arrow/pull/7029.









[GitHub] [arrow] wesm commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-24 Thread GitBox


wesm commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649145717


   We could use just int32() dictionary indices and call it a day? 







[GitHub] [arrow] github-actions[bot] commented on pull request #7538: ARROW-7925: [C++][Docs] Better document use of IWYU, including new 'match' option

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7538:
URL: https://github.com/apache/arrow/pull/7538#issuecomment-649144542


   https://issues.apache.org/jira/browse/ARROW-7925







[GitHub] [arrow] wesm opened a new pull request #7538: ARROW-7925: [C++][Docs] Better document use of IWYU, including new 'match' option

2020-06-24 Thread GitBox


wesm opened a new pull request #7538:
URL: https://github.com/apache/arrow/pull/7538


   







[GitHub] [arrow] wesm commented on pull request #4140: ARROW-5123: [Rust] Parquet derive for simple structs

2020-06-24 Thread GitBox


wesm commented on pull request #4140:
URL: https://github.com/apache/arrow/pull/4140#issuecomment-649124737


   You should be able to just rebase and the problem will go away







[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445221159



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static OwnedRef pandas_NaTType;

Review comment:
   @pitrou this seemed worrisome to me, so I could use your advice on how 
to import this symbol once and then have it persist. Another option is that I 
could immediately decref the object returned by `PyObject_Type` and store a 
`PyObject*` here, with near certainty that this object will remain alive 
until the end of the Python session
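The once-only initialization pattern under discussion can be sketched like this (all names are illustrative stand-ins, not Arrow's; the counter simulates the expensive one-time Python import):

```cpp
#include <cassert>
#include <mutex>

namespace {
std::once_flag pandas_initialized;
int import_count = 0;  // stands in for "import pandas; cache type(pd.NaT)"
}  // namespace

// The initializer runs exactly once, no matter how many threads or
// callers hit this accessor; subsequent calls just read the cache.
int GetCachedSymbol() {
  std::call_once(pandas_initialized, [] { ++import_count; });
  return import_count;
}
```

The open question in the thread is not the once-only semantics but the teardown order of the cached owned reference relative to Python interpreter shutdown.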









[GitHub] [arrow] wesm opened a new pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-24 Thread GitBox


wesm opened a new pull request #7537:
URL: https://github.com/apache/arrow/pull/7537


   This has been the root cause of a number of bugs. I'm unclear if there's a 
race condition with tearing down a `static OwnedRef`, so we might need some 
other approach to managing symbols imported from various Python modules.







[GitHub] [arrow] github-actions[bot] commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649112307


   https://issues.apache.org/jira/browse/ARROW-8647







[GitHub] [arrow] github-actions[bot] commented on pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#issuecomment-649112294


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] bkietz opened a new pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-24 Thread GitBox


bkietz opened a new pull request #7536:
URL: https://github.com/apache/arrow/pull/7536


   







[GitHub] [arrow] wesm opened a new pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-24 Thread GitBox


wesm opened a new pull request #7535:
URL: https://github.com/apache/arrow/pull/7535


   See mailing list discussion. 







[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-06-24 Thread GitBox


BryanCutler commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-649103303


   I mean the current process for integration tests with the master branch is 
to build Spark with Arrow Java master, then run Java and Python tests. That 
process is good for Spark master and branch-3.0, but with Spark branch-2.4, it 
doesn't make too much sense to rebuild Spark with Arrow Java master. This is 
because the dependency on Arrow Java is pretty much frozen for branch-2.4 
(unless there is a serious enough bug), but different versions of pyarrow can 
be used. It would be best to take the latest Spark 2.4.6 release out of the box 
and run the pyspark tests from the script with pyarrow from master. We are not 
really concerned about Java compatibility since it can't be upgraded anyway, 
but that still allows us to flush out any Python problems.







[GitHub] [arrow] wesm commented on pull request #7396: ARROW-9092: [C++][TRIAGE] Do not enable TestRoundFunctions when using LLVM 9 until gandiva-decimal-test is fixed

2020-06-24 Thread GitBox


wesm commented on pull request #7396:
URL: https://github.com/apache/arrow/pull/7396#issuecomment-649102716


   I used perf to record some data about the hung function
   
   ```
   +   83.37% 0.00%  gandiva-decimal  [unknown]  [.] 
0x
   +   65.62% 5.27%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAG::Combine
   +   62.50% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAGISel::CodeGenAndEmitDAG
   +   57.88% 0.04%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAGISel::SelectAllBasicBlocks
   +   55.57% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAGISel::runOnMachineFunction
   +   54.60% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::MachineFunctionPass::runOnFunction
   +   53.48% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction
   +   46.62% 0.07%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::FPPassManager::runOnFunction
   +   45.75% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::legacy::PassManagerImpl::run
   +   41.41% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::FPPassManager::runOnModule
   +   39.79% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::MCJIT::emitObject
   +   39.75% 2.28%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::combine
   +   30.20% 3.51%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::visit
   +   16.76% 6.80%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::TargetLowering::SimplifyDemandedVectorElts
   +   16.22% 0.54%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::SimplifyDemandedVectorElts
   +   15.57% 0.11%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::SimplifyDemandedVectorElts
   +   10.84% 1.95%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::SelectionDAGLegalize::LegalizeOp
   +   10.23% 4.51%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode
   +8.64% 1.48%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::recursivelyDeleteUnusedNodes
   +8.23% 0.25%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::X86TargetLowering::LowerOperation
   +8.17% 1.55%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::X86TargetLowering::PerformDAGCombine
   +7.59% 6.43%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
(anonymous namespace)::DAGCombiner::AddToWorklist
   +7.10% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
gandiva::Projector::Make
   +6.93% 1.23%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
LowerVSETCC
   +6.90% 0.14%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::X86TargetLowering::LowerSETCC
   +6.68% 1.30%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAG::getNode
   +6.27% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
gandiva::Engine::FinalizeModule
   +6.13% 0.58%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::TargetLowering::SimplifySetCC
   +6.02% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
gandiva::LLVMGenerator::Build
   +5.81% 0.00%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
gandiva::Projector::Make
   +5.45% 0.90%  gandiva-decimal  libgandiva.so.100.0.0  [.] 
llvm::SelectionDAG::LegalizeOp
   ```
   
   It looks like an LLVM bug based on Googling for "llvm::SelectionDAG::Combine 
hang"







[GitHub] [arrow] wesm commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-24 Thread GitBox


wesm commented on pull request #7452:
URL: https://github.com/apache/arrow/pull/7452#issuecomment-649098754


   Once the utf8_lower/utf8_upper patch lands I am going to make utf8proc not 
mandatory. See ARROW-9220.







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings

2020-06-24 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445194648



##
File path: python/pyarrow/tests/test_misc.py
##
@@ -120,7 +120,6 @@ def test_cpu_count():
 pa.LargeListValue,
 pa.MapValue,
 pa.FixedSizeListValue,
-pa.UnionValue,

Review comment:
   Restore it









[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


kou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-649090497


   Rebased.







[GitHub] [arrow] kou closed pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-24 Thread GitBox


kou closed pull request #7452:
URL: https://github.com/apache/arrow/pull/7452


   







[GitHub] [arrow] kou commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-24 Thread GitBox


kou commented on pull request #7452:
URL: https://github.com/apache/arrow/pull/7452#issuecomment-649085709


   +1







[GitHub] [arrow] kou commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-24 Thread GitBox


kou commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r445184425



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -427,6 +469,14 @@ class TestIpcRoundTrip : public 
::testing::TestWithParam,
   void TearDown() { IpcTestFixture::TearDown(); }
 };
 
+class TestIpcDifferentEndian
+: public ::testing::TestWithParam>,
+  public IpcTestFixture {
+ public:
+  void SetUp() { IpcTestFixture::SetUp(); }
+  void TearDown() { IpcTestFixture::TearDown(); }
+};

Review comment:
   I'm not sure how difficult it would be to implement this idea...
   How about enhancing the existing `IpcTestFixture` instead of adding a new 
fixture?
   
   ```diff
   diff --git a/cpp/src/arrow/ipc/read_write_test.cc 
b/cpp/src/arrow/ipc/read_write_test.cc
   index 923670834..1e03472b8 100644
   --- a/cpp/src/arrow/ipc/read_write_test.cc
   +++ b/cpp/src/arrow/ipc/read_write_test.cc
   @@ -397,6 +397,9 @@ class IpcTestFixture : public io::MemoryMapFixture, 
public ExtensionTypesMixin {

ASSERT_OK_AND_ASSIGN(result, DoLargeRoundTrip(batch, 
/*zero_data=*/true));
CheckReadResult(*result, batch);
   +
   +ASSERT_OK_AND_ASSIGN(result, DoEndiannessRoundTrip(batch));
   +CheckReadResult(*result, batch);
  }
  void CheckRoundtrip(const std::shared_ptr& array,
  IpcWriteOptions options = IpcWriteOptions::Defaults(),
   ```
   
   `DoEndiannessRoundTrip()` will:
   
   * create a nonnative endianness record batch by swapping the given record 
batch's endianness
  * We need to implement this endianness conversion feature, but how 
difficult would it be...?
   * run `DoStandardRoundTrip()` against the nonnative endianness record batch
   
   BTW, should we always convert endianness in IPC? If users just relay data (A 
(little, generate data) -> B (big, not touch data) -> C (little, process 
data)), it'll be inefficient.
   Or should we convert endianness explicitly out of IPC by providing 
`Result> 
RecordBatch::EnsureEndianness(Endianness)` or something?









[GitHub] [arrow] xrl commented on pull request #4140: ARROW-5123: [Rust] Parquet derive for simple structs

2020-06-24 Thread GitBox


xrl commented on pull request #4140:
URL: https://github.com/apache/arrow/pull/4140#issuecomment-649081934


   @wesm I'm the original author and I'd love to wrap this up. I can probably 
figure out how to debug some ruby for that release script bug.







[GitHub] [arrow] wesm commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-24 Thread GitBox


wesm commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-649079161


   There's a small rebase conflict here. Need help from Java folks here, 
@rymurr can you help?







[GitHub] [arrow] wesm closed pull request #7014: ARROW-8563: [Go] Minor change to make newBuilder public

2020-06-24 Thread GitBox


wesm closed pull request #7014:
URL: https://github.com/apache/arrow/pull/7014


   







[GitHub] [arrow] wesm commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-24 Thread GitBox


wesm commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-649077394


   @sonthonaxrk I would recommend opening a new PR







[GitHub] [arrow] wesm commented on pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder

2020-06-24 Thread GitBox


wesm commented on pull request #6725:
URL: https://github.com/apache/arrow/pull/6725#issuecomment-649076794


   @sbinet could you assist with this?







[GitHub] [arrow] wesm commented on pull request #6725: ARROW-8226: [Go] Implement 64 bit offsets binary builder

2020-06-24 Thread GitBox


wesm commented on pull request #6725:
URL: https://github.com/apache/arrow/pull/6725#issuecomment-649076694


   Hm, at a high level it seems like it might be better to have a separate set 
of LargeBinary types rather than try to pack both 32-bit and 64-bit into the 
same types. This means some code duplication (given Go's current 
generics/templates situation), but perhaps it's for the best.







[GitHub] [arrow] wesm commented on pull request #4140: ARROW-5123: [Rust] Parquet derive for simple structs

2020-06-24 Thread GitBox


wesm commented on pull request #4140:
URL: https://github.com/apache/arrow/pull/4140#issuecomment-649075783


   Any hope of rehabilitating this for 1.0.0?







[GitHub] [arrow] wesm commented on pull request #7374: ARROW-9064: [APT] Use --no-install-recommends

2020-06-24 Thread GitBox


wesm commented on pull request #7374:
URL: https://github.com/apache/arrow/pull/7374#issuecomment-649075242


   I'm closing this







[GitHub] [arrow] wesm closed pull request #7374: ARROW-9064: [APT] Use --no-install-recommends

2020-06-24 Thread GitBox


wesm closed pull request #7374:
URL: https://github.com/apache/arrow/pull/7374


   







[GitHub] [arrow] wesm commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-24 Thread GitBox


wesm commented on pull request #7376:
URL: https://github.com/apache/arrow/pull/7376#issuecomment-649074875


   @kszucs can you take this over?







[GitHub] [arrow] wesm closed pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-24 Thread GitBox


wesm closed pull request #7520:
URL: https://github.com/apache/arrow/pull/7520


   







[GitHub] [arrow] wesm closed pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


wesm closed pull request #7533:
URL: https://github.com/apache/arrow/pull/7533


   







[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


wesm commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649067717


   Benchmark results
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 jianxind/BitBlockSpaced 
master --suite-filter=parquet-encoding
 benchmarkbaseline  
 contender  change %counters
   13  BM_PlainDecodingSpacedFloat/32768/1   6.454 GiB/sec  
21.480 GiB/sec   232.838 {'iterations': 36805, 'null_percent': 0.01}
   54BM_PlainDecodingSpacedBoolean/32768/1   1.017 GiB/sec  
 2.031 GiB/sec99.662 {'iterations': 23083, 'null_percent': 0.01}
   53BM_PlainDecodingSpacedFloat/32768/100   4.087 GiB/sec  
 6.490 GiB/sec58.788  {'iterations': 23365, 'null_percent': 1.0}
   118BM_PlainDecodingSpacedDouble/32768/1  11.326 GiB/sec  
17.696 GiB/sec56.237 {'iterations': 32216, 'null_percent': 0.01}
   134 BM_PlainDecodingSpacedBoolean/32768/100 755.399 MiB/sec  
 1.046 GiB/sec41.803  {'iterations': 16687, 'null_percent': 1.0}
   86   BM_PlainDecodingSpacedDouble/32768/100   7.654 GiB/sec  
 9.557 GiB/sec24.864  {'iterations': 22049, 'null_percent': 1.0}
   43   BM_PlainDecodingInt64/4096  44.856 GiB/sec  
51.732 GiB/sec15.329 {'iterations': 1199214}
   56  BM_PlainEncodingFloat/32768  32.768 GiB/sec  
36.879 GiB/sec12.548  {'iterations': 163156}
   45BM_ByteStreamSplitEncode_Double_Sse2/4096  14.178 GiB/sec  
15.697 GiB/sec10.715  {'iterations': 323465}
   140BM_PlainDecodingSpacedBoolean/32768/5000 244.132 MiB/sec  
   267.167 MiB/sec 9.435  {'iterations': 5521, 'null_percent': 50.0}
   98   BM_PlainDecodingFloat/4096  69.756 GiB/sec  
75.939 GiB/sec 8.863 {'iterations': 3748990}
   41  BM_PlainDecodingFloat/65536  53.684 GiB/sec  
57.973 GiB/sec 7.988  {'iterations': 169375}
   68   BM_PlainDecodingFloat/1024  37.061 GiB/sec  
39.638 GiB/sec 6.955 {'iterations': 6855467}
   131 BM_PlainEncodingInt64/32768  33.101 GiB/sec  
34.982 GiB/sec 5.683   {'iterations': 87975}
   42 BM_ArrowBinaryDict/EncodeLowLevel/262144  79.220 MiB/sec  
83.540 MiB/sec 5.453  {'iterations': 36}
   62 BM_PlainDecodingSpacedBoolean/32768/1000 520.545 MiB/sec  
   548.129 MiB/sec 5.299 {'iterations': 11926, 'null_percent': 10.0}
   106  BM_ArrowBinaryPlain/EncodeLowLevel/1048576 779.964 MiB/sec  
   810.645 MiB/sec 3.934  {'iterations': 87}
   20  BM_PlainDecodingDouble/1024  65.892 GiB/sec  
68.154 GiB/sec 3.434 {'iterations': 6008229}
   70 BM_PlainEncodingDouble/32768  31.926 GiB/sec  
32.940 GiB/sec 3.176   {'iterations': 86995}
   39  BM_ByteStreamSplitEncode_Double_Scalar/4096   2.652 GiB/sec  
 2.735 GiB/sec 3.151   {'iterations': 59682}
   36 BM_PlainDecodingBoolean/1024   1.688 GiB/sec  
 1.737 GiB/sec 2.907 {'iterations': 1257476}
   89 BM_ArrowBinaryDict/EncodeDictDirectInt32/1048576  322.964m items/sec  
331.200m items/sec 2.550 {'iterations': 219}
   30  BM_ArrowBinaryDict/DecodeArrowNonNull_Dict/4096 204.031 MiB/sec  
   208.917 MiB/sec 2.395{'iterations': 5959}
   146  BM_ByteStreamSplitEncode_Double_Sse2/32768  15.496 GiB/sec  
15.845 GiB/sec 2.256   {'iterations': 44572}
   60   BM_PlainDecodingInt64/1024  66.471 GiB/sec  
67.795 GiB/sec 1.992 {'iterations': 6157194}
   82   BM_ByteStreamSplitEncode_Double_Sse2/65536  14.740 GiB/sec  
15.025 GiB/sec 1.933   {'iterations': 21204}
   38  BM_ByteStreamSplitEncode_Float_Scalar/65536  20.023 GiB/sec  
20.406 GiB/sec 1.913   {'iterations': 57636}
   59BM_DictDecodingInt64_repeats/1024   9.453 GiB/sec  
 9.629 GiB/sec 1.859  {'iterations': 871513}
   44BM_ByteStreamSplitEncode_Double_Sse2/1024  16.284 GiB/sec  
16.576 GiB/sec 1.790 {'iterations': 1520871}

[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


kou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-649065240


   Oh, sorry.
   It seems that I saw wrong CI jobs.
   The link problem has been fixed by the workaround.
   
   I'll cherry pick the workaround to https://github.com/apache/arrow/pull/7452 
and merge it.
   Then I'll rebase this branch.







[GitHub] [arrow] kszucs commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-06-24 Thread GitBox


kszucs commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-649043272


   > @kszucs I submitted a patch to fix Java compilation with Spark master and 
branch-3.0, and tested locally with the latest pyarrow so Spark integration 
tests should pass for these as of this morning.
   
   Thanks Bryan!
   
   > 
   > For Spark branch-2.4, it will require applying a different patch with the 
testing script, but maybe we should rethink this strategy a little. Spark 2.4.x 
will not upgrade Arrow Java anymore, so it would be more correct to take the 
latest Spark 2.4.6 release out of the box and test that with pyarrow from 
master. Is that possible to setup, wdyt?
   
   I'm not sure that I understand you correctly :)
   - keep arrow master <=> spark branch-3.0
   - keep arrow master <=> spark master
   - and keep arrow master <=> spark 2.4.6 but execute only the pyarrow tests?
   
   







[GitHub] [arrow] wesm commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r445127367



##
File path: r/src/array_from_vector.cpp
##
@@ -915,6 +924,39 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+R_xlen_t n = XLENGTH(obj);
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (Rf_isNull(obj_i)) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+Status::RError("Expecting a raw vector"));
+RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));

Review comment:
   You could add a call to `typed_builder_->Reserve` before the loop for 
slightly better perf









[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445122603



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/align_util.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+/// \brief Compress the buffer to spaced, excluding the null entries.
+///
+/// \param[in] src the source buffer
+/// \param[in] num_values the size of source buffer
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \param[out] output the output buffer spaced
+/// \return The size of spaced buffer.
+template 
+inline int SpacedCompress(const T* src, int num_values, const uint8_t* 
valid_bits,
+  int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  int idx_src = 0;
+
+  const auto p =
+  arrow::internal::BitmapWordAlign<1>(valid_bits, valid_bits_offset, 
num_values);
+  // First handle the leading bits
+  const int leading_bits = static_cast(p.leading_bits);
+  while (idx_src < leading_bits) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset)) {
+  output[num_valid_values] = src[idx_src];
+  num_valid_values++;
+}
+idx_src++;
+valid_bits_offset++;
+  }
+
+  // The aligned parts scaned with BitBlockCounter

Review comment:
   typo: scanned

##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/align_util.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+/// \brief Compress the buffer to spaced, excluding the null entries.
+///
+/// \param[in] src the source buffer
+/// \param[in] num_values the size of source buffer
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \param[out] output the output buffer spaced
+/// \return The size of spaced buffer.
+template 
+inline int SpacedCompress(const T* src, int num_values, const uint8_t* 
valid_bits,
+  int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  int idx_src = 0;
+
+  const auto p =
+  arrow::internal::BitmapWordAlign<1>(valid_bits, valid_bits_offset, 
num_values);
+  // First handle the leading bits
+  const int leading_bits = static_cast(p.leading_bits);
+  while (idx_src < leading_bits) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset)) {
+  output[num_valid_values] = src[idx_src];
+  num_valid_values++;
+}
+idx_src++;
+valid_bits_offset++;
+  }
+
+  // The aligned parts scaned with BitBlockCounter
+  arrow::internal::BitBlockCounter data_counter(valid_bits, valid_bits_offset,
+num_values - leading_bits);
+  auto current_block = data_counter.NextWord();
+  while (idx_src < num_values) {
+if (current_block.AllSet()) {  // All true values
+  int run_length = 0;
+  // Scan forward until a block that has some false values (or the end)
+  while (current_block.length > 0 && current_block.AllSet()) {
+run_length += 

[GitHub] [arrow] bkietz closed pull request #7493: ARROW-9183: [C++] Fix build with clang & old libstdc++.

2020-06-24 Thread GitBox


bkietz closed pull request #7493:
URL: https://github.com/apache/arrow/pull/7493


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649022873


   https://issues.apache.org/jira/browse/ARROW-8729







[GitHub] [arrow] bkietz opened a new pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-24 Thread GitBox


bkietz opened a new pull request #7534:
URL: https://github.com/apache/arrow/pull/7534


   This bug is inherited from `parquet::arrow::RowGroupRecordBatchReader`, 
which yielded empty record batches when no columns were projected because no 
field readers were available from which to derive batch size. I've added logic 
to get usable batch sizes from file metadata in the empty-columns case.







[GitHub] [arrow] wesm closed pull request #7498: ARROW-9091: [C++][Compute] Add default FunctionOptions

2020-06-24 Thread GitBox


wesm closed pull request #7498:
URL: https://github.com/apache/arrow/pull/7498


   







[GitHub] [arrow] nealrichardson commented on pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-24 Thread GitBox


nealrichardson commented on pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#issuecomment-649000526


   @romainfrancois this is ready for (and seriously needs) your review. Tests 
should be passing now.







[GitHub] [arrow] nealrichardson commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-24 Thread GitBox


nealrichardson commented on a change in pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#discussion_r445098056



##
File path: r/R/schema.R
##
@@ -83,16 +83,21 @@ Schema <- R6Class("Schema",
 }
   ),
   active = list(
-names = function() Schema__field_names(self),
+names = function() {
+  out <- Schema__field_names(self)
+  # Hack: Rcpp should set the encoding

Review comment:
   This is a more general problem, would affect the `names()` method of any 
objects where they return a `std::vector`. Those must (as I 
understand it) always be UTF-8 in Arrow, but if you don't declare them as UTF-8 
in R, then they get displayed all mangled on Windows (default/unknown encoding 
treated as `latin1`). 
   
   Rather than relying on the default `Rcpp::wrap` method for this, we should 
probably wrap ourselves. I could naively write this (create CharacterVector, 
iterate over the `std::vector` and insert Rcpp::String with 
CE_UTF8) but maybe that's not great?









[GitHub] [arrow] alexbaden commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-24 Thread GitBox


alexbaden commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-648988459


   Maybe with this PR that is possible, I'll have to explore a bit once this is 
merged. The concern is more around getting the order of the dictionary, etc. 
right in the message stream so that when you go to unpack everything you don't 
end up with dangling dictionary references to old memory locations, which is 
what I had to work around with the previous API. 







[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-06-24 Thread GitBox


BryanCutler commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-648982266


   @kszucs I submitted a patch to fix Java compilation with Spark master and 
branch-3.0, and tested locally with the latest pyarrow so Spark integration 
tests should pass for these as of this morning.
   
   For Spark branch-2.4, it will require applying a different patch with the 
testing script, but maybe we should rethink this strategy a little. Spark 2.4.x 
will not upgrade Arrow Java anymore, so it would be more correct to take the 
latest Spark 2.4.6 release out of the box and test that with pyarrow from 
master. Is that possible to setup, wdyt?







[GitHub] [arrow] wesm commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r445042115



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -397,24 +452,26 @@ struct MinMaxImpl : public ScalarAggregator {
 
 ArrayType arr(batch[0].array());
 
-local.has_nulls = arr.null_count() > 0;
+const auto null_count = arr.null_count();
+local.has_nulls = null_count > 0;
+local.has_values = (arr.length() - null_count) > 0;
+
 if (local.has_nulls && options.null_handling == 
MinMaxOptions::OUTPUT_NULL) {
   this->state = local;
   return;
 }
 
-const auto values = arr.raw_values();
-if (arr.null_count() > 0) {
+if (local.has_nulls) {
   BitmapReader reader(arr.null_bitmap_data(), arr.offset(), arr.length());
   for (int64_t i = 0; i < arr.length(); i++) {
 if (reader.IsSet()) {
-  local.MergeOne(values[i]);
+  local.MergeOne(arr.Value(i));
 }
 reader.Next();
   }
 } else {
   for (int64_t i = 0; i < arr.length(); i++) {
-local.MergeOne(values[i]);
+local.MergeOne(arr.Value(i));
   }

Review comment:
   you can infer the false or true count from the other given also the null 
count, so no need to call both `true_count()` AND `false_count()`









[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


kszucs commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r445007543



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -397,24 +452,26 @@ struct MinMaxImpl : public ScalarAggregator {
 
 ArrayType arr(batch[0].array());
 
-local.has_nulls = arr.null_count() > 0;
+const auto null_count = arr.null_count();
+local.has_nulls = null_count > 0;
+local.has_values = (arr.length() - null_count) > 0;
+
 if (local.has_nulls && options.null_handling == 
MinMaxOptions::OUTPUT_NULL) {
   this->state = local;
   return;
 }
 
-const auto values = arr.raw_values();
-if (arr.null_count() > 0) {
+if (local.has_nulls) {
   BitmapReader reader(arr.null_bitmap_data(), arr.offset(), arr.length());
   for (int64_t i = 0; i < arr.length(); i++) {
 if (reader.IsSet()) {
-  local.MergeOne(values[i]);
+  local.MergeOne(arr.Value(i));
 }
 reader.Next();
   }
 } else {
   for (int64_t i = 0; i < arr.length(); i++) {
-local.MergeOne(values[i]);
+local.MergeOne(arr.Value(i));
   }

Review comment:
   I was thinking of that, but the current false_count implementation 
triggers true_count, so we would iterate over the values at least twice to 
get both true and false counts - it might be more efficient though.
   
   I'm going to write a benchmark for this.









[GitHub] [arrow] kszucs commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


kszucs commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r445006111



##
File path: cpp/src/arrow/testing/gtest_util.h
##
@@ -137,6 +137,8 @@ namespace arrow {
 // --
 // Useful testing::Types declarations
 
+typedef ::testing::Types BooleanArrowType;

Review comment:
   Updating.









[GitHub] [arrow] nealrichardson closed pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-24 Thread GitBox


nealrichardson closed pull request #7297:
URL: https://github.com/apache/arrow/pull/7297


   







[GitHub] [arrow] wesm commented on a change in pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#discussion_r444996273



##
File path: cpp/src/arrow/compute/kernels/aggregate_test.cc
##
@@ -399,15 +434,59 @@ class TestNumericMinMaxKernel : public ::testing::Test {
 };
 
 template 
-class TestFloatingMinMaxKernel : public TestNumericMinMaxKernel {};
+class TestBooleanMinMaxKernel : public TestPrimitiveMinMaxKernel {};
+
+template 
+class TestIntegerMinMaxKernel : public TestPrimitiveMinMaxKernel {};
+
+template 
+class TestFloatingMinMaxKernel : public TestPrimitiveMinMaxKernel 
{};
+
+TYPED_TEST_SUITE(TestBooleanMinMaxKernel, BooleanArrowType);

Review comment:
   This seems overwrought since there is only one type; can we do it 
without TYPED_TEST?

##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -397,24 +452,26 @@ struct MinMaxImpl : public ScalarAggregator {
 
 ArrayType arr(batch[0].array());
 
-local.has_nulls = arr.null_count() > 0;
+const auto null_count = arr.null_count();
+local.has_nulls = null_count > 0;
+local.has_values = (arr.length() - null_count) > 0;
+
 if (local.has_nulls && options.null_handling == 
MinMaxOptions::OUTPUT_NULL) {
   this->state = local;
   return;
 }
 
-const auto values = arr.raw_values();
-if (arr.null_count() > 0) {
+if (local.has_nulls) {
   BitmapReader reader(arr.null_bitmap_data(), arr.offset(), arr.length());
   for (int64_t i = 0; i < arr.length(); i++) {
 if (reader.IsSet()) {
-  local.MergeOne(values[i]);
+  local.MergeOne(arr.Value(i));
 }
 reader.Next();
   }
 } else {
   for (int64_t i = 0; i < arr.length(); i++) {
-local.MergeOne(values[i]);
+local.MergeOne(arr.Value(i));
   }

Review comment:
   To be honest this seems wasteful. Shouldn't we compute the true, false, 
and null count and then set the min/max based on those? 

##
File path: cpp/src/arrow/testing/gtest_util.h
##
@@ -137,6 +137,8 @@ namespace arrow {
 // --
 // Useful testing::Types declarations
 
+typedef ::testing::Types BooleanArrowType;

Review comment:
   Would prefer not to have this per above









[GitHub] [arrow] nevi-me commented on pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-24 Thread GitBox


nevi-me commented on pull request #7297:
URL: https://github.com/apache/arrow/pull/7297#issuecomment-648901870


   > Ok there was one other test needing to be skipped, which I've done, and 
now the tests "pass". Should we merge this and progressively unskip tests as 
you can?
   
   Yes please







[GitHub] [arrow] andygrove commented on pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-24 Thread GitBox


andygrove commented on pull request #7297:
URL: https://github.com/apache/arrow/pull/7297#issuecomment-648901377


   Yes, that would be great. Thanks!
   
   On Wed, Jun 24, 2020 at 9:43 AM Neal Richardson 
   wrote:
   
   > Ok there was one other test needing to be skipped, which I've done, and
   > now the tests "pass". Should we merge this and progressively unskip tests
   > as you can?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   >
   







[GitHub] [arrow] nealrichardson commented on pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-24 Thread GitBox


nealrichardson commented on pull request #7297:
URL: https://github.com/apache/arrow/pull/7297#issuecomment-648900443


   Ok there was one other test needing to be skipped, which I've done, and now 
the tests "pass". Should we merge this and progressively unskip tests as you 
can?







[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-648889866


   @xhochy if you're trying to add R Windows dependencies, see the discussion 
on https://issues.apache.org/jira/browse/ARROW-6960 for pointers







[GitHub] [arrow] nealrichardson commented on pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-06-24 Thread GitBox


nealrichardson commented on pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#issuecomment-648886550


   Is this good to merge now? @BryanCutler are you still planning to review 
this? Would like to get this in 1.0.







[GitHub] [arrow] wesm commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


wesm commented on pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#issuecomment-648879878


   Looking







[GitHub] [arrow] wesm commented on pull request #7532: ARROW-9217: [C++] Cover 0.01% null for the plain spaced benchmark

2020-06-24 Thread GitBox


wesm commented on pull request #7532:
URL: https://github.com/apache/arrow/pull/7532#issuecomment-648877226


   Here's 0.17.1 with the benchmark changes backported 
https://github.com/wesm/arrow/tree/BitBlockSpacedBM-0.17.1







[GitHub] [arrow] wesm closed pull request #7532: ARROW-9217: [C++] Cover 0.01% null for the plain spaced benchmark

2020-06-24 Thread GitBox


wesm closed pull request #7532:
URL: https://github.com/apache/arrow/pull/7532


   







[GitHub] [arrow] wesm commented on pull request #7532: ARROW-9217: [C++] Cover 0.01% null for the plain spaced benchmark

2020-06-24 Thread GitBox


wesm commented on pull request #7532:
URL: https://github.com/apache/arrow/pull/7532#issuecomment-648875010


   For interest, I benchmarked 0.17.1 versus master with these new benchmarks 
(gcc-8 on i9-9960X, with SSE4.2):
   
   ```
 benchmarkbaseline  
 contender  change %counters
   75   BM_ByteStreamSplitDecode_Float_Scalar/1024   3.290 GiB/sec  
33.036 GiB/sec   904.213  {'iterations': 609088}
   17  BM_ByteStreamSplitDecode_Float_Scalar/32768   3.326 GiB/sec  
32.946 GiB/sec   890.483   {'iterations': 18993}
   94  BM_ByteStreamSplitDecode_Float_Scalar/65536   3.331 GiB/sec  
32.940 GiB/sec   889.020{'iterations': 9562}
   40   BM_ByteStreamSplitDecode_Float_Scalar/4096   3.325 GiB/sec  
32.770 GiB/sec   885.487  {'iterations': 150968}
   85 BM_ByteStreamSplitDecode_Double_Scalar/32768   2.372 GiB/sec  
22.162 GiB/sec   834.465{'iterations': 6821}
   41  BM_ByteStreamSplitDecode_Double_Scalar/4096   2.373 GiB/sec  
21.957 GiB/sec   825.240   {'iterations': 54606}
   152 BM_ByteStreamSplitDecode_Double_Scalar/1024   2.382 GiB/sec  
21.598 GiB/sec   806.793  {'iterations': 216005}
   56 BM_ByteStreamSplitDecode_Double_Scalar/65536   2.360 GiB/sec  
20.721 GiB/sec   778.165{'iterations': 3404}
   81  BM_PlainDecodingFloat/32768  35.302 GiB/sec  
57.093 GiB/sec61.726  {'iterations': 283126}
   42  BM_PlainEncodingSpacedFloat/32768/1   4.001 GiB/sec  
 5.803 GiB/sec45.019{'iterations': 22934, 'null_percent': 100.0}
   45   BM_PlainDecodingFloat/4096  63.384 GiB/sec  
88.650 GiB/sec39.862 {'iterations': 2777408}
   48   BM_PlainEncodingInt64/1024  27.515 GiB/sec  
36.216 GiB/sec31.622 {'iterations': 2591133}
   117 BM_PlainDecodingSpacedBoolean/32768/100 595.658 MiB/sec  
   752.488 MiB/sec26.329  {'iterations': 13123, 'null_percent': 1.0}
   26 BM_PlainDecodingSpacedBoolean/32768/1000 434.416 MiB/sec  
   544.202 MiB/sec25.272  {'iterations': 9539, 'null_percent': 10.0}
   86BM_PlainDecodingSpacedBoolean/32768/1 843.205 MiB/sec  
 1.013 GiB/sec22.976 {'iterations': 18949, 'null_percent': 0.01}
   120 BM_PlainDecodingInt64/32768  47.857 GiB/sec  
57.500 GiB/sec20.149  {'iterations': 10}
   109 BM_PlainEncodingInt64/32768  29.582 GiB/sec  
35.480 GiB/sec19.938   {'iterations': 79459}
   104 BM_PlainDecodingFloat/65536  49.863 GiB/sec  
59.278 GiB/sec18.883  {'iterations': 142638}
   129BM_PlainEncodingDouble/32768  28.485 GiB/sec  
33.853 GiB/sec18.845   {'iterations': 76522}
   128BM_PlainDecodingDouble/32768  49.289 GiB/sec  
58.411 GiB/sec18.508  {'iterations': 143174}
   53BM_PlainEncodingSpacedFloat/32768/100   3.748 GiB/sec  
 4.364 GiB/sec16.428  {'iterations': 21388, 'null_percent': 1.0}
   115 BM_PlainDecodingDouble/4096  44.133 GiB/sec  
51.155 GiB/sec15.910 {'iterations': 1035922}
   74   BM_PlainEncodingFloatNaN/65536  32.516 GiB/sec  
37.217 GiB/sec14.456   {'iterations': 86167}
   63  BM_PlainEncodingSpacedFloat/32768/1   4.060 GiB/sec  
 4.625 GiB/sec13.919 {'iterations': 22322, 'null_percent': 0.01}
   12 BM_PlainDecodingSpacedBoolean/32768/5000 213.706 MiB/sec  
   242.841 MiB/sec13.634  {'iterations': 4744, 'null_percent': 50.0}
   65 BM_PlainDecodingDouble/65536  41.841 GiB/sec  
47.024 GiB/sec12.387   {'iterations': 61985}
   136  BM_PlainEncodingSpacedFloat/32768/1000   2.393 GiB/sec  
 2.672 GiB/sec11.664 {'iterations': 13868, 'null_percent': 10.0}
   49   BM_PlainEncodingFloatNaN/32768  33.901 GiB/sec  
37.708 GiB/sec11.230  {'iterations': 161071}
   88  BM_PlainEncodingDoubleNaN/32768  29.286 GiB/sec  
32.545 GiB/sec11.126   {'iterations': 76529}
   5   

[GitHub] [arrow] wesm commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-24 Thread GitBox


wesm commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648873558


   I sent an e-mail to dev@ -- let's discuss there







[GitHub] [arrow] alippai edited a comment on pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


alippai edited a comment on pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#issuecomment-648752119


   Can this be extended to support any scalar value? Creating a column with a 
single value is a common step for me (before concatenating tables, so the 
fragments are differentiated in the result).







[GitHub] [arrow] kszucs commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-24 Thread GitBox


kszucs commented on pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#issuecomment-648856803


   ping @wesm 







[GitHub] [arrow] kszucs commented on pull request #6592: ARROW-8089: [C++] Port the toolchain build from Appveyor to Github Actions

2020-06-24 Thread GitBox


kszucs commented on pull request #6592:
URL: https://github.com/apache/arrow/pull/6592#issuecomment-648856487


   This PR was outdated; I will keep working on the Windows Docker containers 
instead.







[GitHub] [arrow] romainfrancois commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-24 Thread GitBox


romainfrancois commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-648848097


   Added support for record batches. 
   
   Toying with the idea of a print method for the metadata, to make it less 
opaque: 
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #> timestamp
   
   Table$create(a = structure(1:2, foo = "bar"))$metadata
   #> $r
   #> [1] 
"A\n3\n262144\n197888\n5\nUTF-8\n531\n2\n254\n531\n1\n531\n1\n16\n1\n262153\n3\nbar\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n3\nfoo\n254\n1026\n511\n16\n1\n262153\n1\na\n254\n1026\n511\n16\n2\n262153\n4\ndata\n262153\n7\ncolumns\n254\n"
   #> List of 2
   #>  $ data   : NULL
   #>  $ columns:List of 1
   #>   ..$ a:List of 1
   #>   .. ..$ foo: chr "bar"
   #> 
   #> attr(,"class")
   #> [1] "arrow:::metadata"
   record_batch(a = structure(1:2, foo = "bar"))$metadata
   #> $r
   #> [1] 
"A\n3\n262144\n197888\n5\nUTF-8\n531\n2\n254\n531\n1\n531\n1\n16\n1\n262153\n3\nbar\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n3\nfoo\n254\n1026\n511\n16\n1\n262153\n1\na\n254\n1026\n511\n16\n2\n262153\n4\ndata\n262153\n7\ncolumns\n254\n"
   #> List of 2
   #>  $ data   : NULL
   #>  $ columns:List of 1
   #>   ..$ a:List of 1
   #>   .. ..$ foo: chr "bar"
   #> 
   #> attr(,"class")
   #> [1] "arrow:::metadata"
   ```
   
   Created on 2020-06-24 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)
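   The round-trip idea above — serialize the R attributes into the table's key/value metadata on write, and restore them when converting back to a data frame — can be sketched independently of R. Below is a minimal Python sketch of the pattern; the helper names (`attach_metadata`, `restore_metadata`) are hypothetical, and the real patch uses R's own serialization format rather than JSON:

   ```python
   import json

   def attach_metadata(schema_metadata, column_attrs):
       """Serialize per-column attributes into a single 'r' metadata entry
       (JSON here; the R implementation uses R serialization)."""
       schema_metadata = dict(schema_metadata)
       schema_metadata["r"] = json.dumps({"columns": column_attrs})
       return schema_metadata

   def restore_metadata(schema_metadata):
       """Recover the per-column attributes on the way back, if present."""
       if "r" not in schema_metadata:
           return {}
       return json.loads(schema_metadata["r"])["columns"]

   meta = attach_metadata({}, {"a": {"foo": "bar"}})
   assert restore_metadata(meta) == {"a": {"foo": "bar"}}
   ```

   The key design point is that the attributes ride along inside the schema's generic key/value metadata, so no Arrow format change is needed — the same mechanism pandas uses for its `pandas` metadata key.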







[GitHub] [arrow] kszucs commented on a change in pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


kszucs commented on a change in pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#discussion_r444924523



##
File path: python/pyarrow/__init__.py
##
@@ -90,7 +90,7 @@ def parse_git(root, **kwargs):
       schema,
       unify_schemas,
       Array, Tensor,
-      array, chunked_array, record_batch, table,
+      array, chunked_array, record_batch, table, nulls,

Review comment:
   Sure.









[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-24 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444914459



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Fragments created by ParquetDatasetFactory will have their physical schema populated on construction: 
https://github.com/apache/arrow/pull/7526/files#diff-b7aed3b5c0adacff778c6930cc6a03e4R646-R650
   
   After that, calling `ReadPhysicalSchema` on those fragments is an IO-free operation: 
https://github.com/apache/arrow/pull/7526/files#diff-ce896c4eb96f3a050aec23151adf58e8R42-R48
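   The caching behavior described above — read the physical schema from the file at most once, then serve it without touching storage — is a plain memoization pattern. A minimal Python sketch of the idea (a toy stand-in, not the actual C++ `FileFragment` API):

   ```python
   class Fragment:
       """Toy fragment: caches the physical schema after the first
       (simulated) file read, so later calls are IO-free."""
       def __init__(self, physical_schema=None):
           # A factory may pre-populate the schema, avoiding IO entirely.
           self._physical_schema = physical_schema
           self.io_reads = 0

       def _read_schema_from_file(self):
           self.io_reads += 1  # stands in for opening the Parquet footer
           return ("a: int32", "b: float64")

       def read_physical_schema(self):
           if self._physical_schema is None:
               self._physical_schema = self._read_schema_from_file()
           return self._physical_schema

   frag = Fragment()
   frag.read_physical_schema()
   frag.read_physical_schema()
   assert frag.io_reads == 1  # only the first call touches the file
   ```

   A fragment constructed with its schema already known (as ParquetDatasetFactory does) never performs the read at all.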









[GitHub] [arrow] wesm commented on a change in pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


wesm commented on a change in pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#discussion_r444890495



##
File path: python/pyarrow/__init__.py
##
@@ -90,7 +90,7 @@ def parse_git(root, **kwargs):
       schema,
       unify_schemas,
       Array, Tensor,
-      array, chunked_array, record_batch, table,
+      array, chunked_array, record_batch, table, nulls,

Review comment:
   Can you add it to the API listing .rst page?









[GitHub] [arrow] wesm commented on pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


wesm commented on pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#issuecomment-648817609


   @alippai that is doable but would need to get done in a separate PR







[GitHub] [arrow] rymurr commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-24 Thread GitBox


rymurr commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648810719


   Thanks @wesm and @jacques-n for the review. I will leave this up until consensus is reached on the format change. Please let me know if I can help with the C++ patch; I would be happy to contribute!







[GitHub] [arrow] alippai commented on pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


alippai commented on pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#issuecomment-648752119


   Can this be extended to support any scalar value? Creating a column with a single repeated value is a common step for us (before concatenating tables, so the fragments are differentiated in the result).
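   For reference, the Python binding this PR adds is `pyarrow.nulls`, which wraps C++ `MakeArrayOfNull`. A short sketch, assuming a pyarrow build that includes this PR (the constant-column workaround at the end is an illustration of the use case above, not part of the new API):

   ```python
   import pyarrow as pa

   # With no type, an all-null NullArray of the given length is returned.
   arr = pa.nulls(3)
   assert arr.type == pa.null()
   assert arr.null_count == 3

   # A typed all-null column, e.g. for padding a missing column before
   # concatenating tables with differing schemas.
   typed = pa.nulls(3, pa.int32())
   assert typed.type == pa.int32()

   # The constant-column case can currently be emulated by repeating a
   # Python scalar ("fragment-1" is a hypothetical label):
   const = pa.array(["fragment-1"] * 3)
   assert const.to_pylist() == ["fragment-1"] * 3
   ```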







[GitHub] [arrow] github-actions[bot] commented on pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7533:
URL: https://github.com/apache/arrow/pull/7533#issuecomment-648694810


   https://issues.apache.org/jira/browse/ARROW-7375







[GitHub] [arrow] kszucs opened a new pull request #7533: ARROW-7375: [Python] Expose C++ MakeArrayOfNull

2020-06-24 Thread GitBox


kszucs opened a new pull request #7533:
URL: https://github.com/apache/arrow/pull/7533


   







[GitHub] [arrow] jianxind commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-648656865


   This PR https://github.com/apache/arrow/pull/7532 adds the 0.01% benchmark case; I can trigger a benchmark action once 7532 gets merged.
   
   Below are the results for 0.01% null on my device.
   Before
   ```
   BM_PlainEncodingSpacedBoolean/32768/1    74766 ns    74685 ns     9348   bytes_per_second=418.427M/s null_percent=0.01
   BM_PlainEncodingSpacedFloat/32768/1      29197 ns    29165 ns    23713   bytes_per_second=4.18549G/s null_percent=0.01
   BM_PlainEncodingSpacedDouble/32768/1     31678 ns    31644 ns    22103   bytes_per_second=7.71528G/s null_percent=0.01
   
   BM_PlainDecodingSpacedBoolean/32768/1    34833 ns    34796 ns    20109   bytes_per_second=898.088M/s null_percent=0.01
   BM_PlainDecodingSpacedFloat/32768/1      21615 ns    21591 ns    32197   bytes_per_second=5.65367G/s null_percent=0.01
   BM_PlainDecodingSpacedDouble/32768/1     24647 ns    24621 ns    28162   bytes_per_second=9.91599G/s null_percent=0.01
   ```
   
   After
   ```
   BM_PlainEncodingSpacedBoolean/32768/1    51807 ns    51745 ns    13489   bytes_per_second=603.926M/s null_percent=0.01
   BM_PlainEncodingSpacedFloat/32768/1       7811 ns     7803 ns    84179   bytes_per_second=15.6441G/s null_percent=0.01
   BM_PlainEncodingSpacedDouble/32768/1     13149 ns    13135 ns    49635   bytes_per_second=18.5867G/s null_percent=0.01
   
   BM_PlainDecodingSpacedBoolean/32768/1    18332 ns    18313 ns    38127   bytes_per_second=1.66644G/s null_percent=0.01
   BM_PlainDecodingSpacedFloat/32768/1       6634 ns     6627 ns   102921   bytes_per_second=18.4212G/s null_percent=0.01
   BM_PlainDecodingSpacedDouble/32768/1     16183 ns    16165 ns    42532   bytes_per_second=15.1027G/s null_percent=0.01
   ```
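   The speedup comes from scanning the validity bitmap in 64-bit blocks: a fully-set block lets the codec bulk-copy 64 values and skip per-bit branching, which is the overwhelmingly common case at 0.01% nulls. A pure-Python sketch of the idea follows; `decode_spaced` is a hypothetical name, not the actual `BitBlockCounter` API:

   ```python
   def decode_spaced(values, valid_bits, num_values, block=64):
       """Expand a dense value list into num_values slots, placing None
       where the validity bitmap (LSB-first) has a 0 bit. Fully-valid
       blocks take a fast bulk-copy path."""
       out, vi = [], 0  # vi indexes the dense (non-null) values
       for start in range(0, num_values, block):
           n = min(block, num_values - start)
           bits = [(valid_bits[(start + i) // 8] >> ((start + i) % 8)) & 1
                   for i in range(n)]
           if sum(bits) == n:               # all valid: bulk copy
               out.extend(values[vi:vi + n])
               vi += n
           else:                            # mixed block: per-bit slow path
               for b in bits:
                   if b:
                       out.append(values[vi])
                       vi += 1
                   else:
                       out.append(None)
       return out

   valid = bytes([0b11111101])              # slot 1 is null
   assert decode_spaced([10, 20, 30, 40, 50, 60, 70], valid, 8) == \
          [10, None, 20, 30, 40, 50, 60, 70]
   ```

   The C++ version does the block classification with a popcount over 64 bits at a time, so the per-bit loop is only paid for blocks that actually contain nulls.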







[GitHub] [arrow] github-actions[bot] commented on pull request #7532: ARROW-9217: [C++] Cover 0.01% null for the plain spaced benchmark

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7532:
URL: https://github.com/apache/arrow/pull/7532#issuecomment-648655465


   https://issues.apache.org/jira/browse/ARROW-9217







[GitHub] [arrow] xhochy commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


xhochy commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-648653322


   > The R ones probably?
   
   For these, we need to add `utf8proc` to rtools40 and rtools35 and add it to the linker line of the R build.







[GitHub] [arrow] jianxind opened a new pull request #7532: ARROW-9217: [C++] Cover 0.01% null for the plain spaced benchmark

2020-06-24 Thread GitBox


jianxind opened a new pull request #7532:
URL: https://github.com/apache/arrow/pull/7532


   Add a 0.01% null probability case, which represents data where most values are valid.
   
   Signed-off-by: Frank Du 
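   A small sketch of how a benchmark input with a given null probability can be generated (`make_validity` is a hypothetical helper; the actual benchmark uses Arrow's C++ random-data generators):

   ```python
   import random

   def make_validity(num_values, null_probability, seed=0):
       """Return a validity bitmap (LSB-first, 1 = valid) where each slot
       is null independently with the given probability."""
       rng = random.Random(seed)
       bitmap = bytearray((num_values + 7) // 8)
       nulls = 0
       for i in range(num_values):
           if rng.random() < null_probability:
               nulls += 1                      # leave the bit at 0 -> null
           else:
               bitmap[i // 8] |= 1 << (i % 8)  # set the bit -> valid
       return bytes(bitmap), nulls

   bitmap, nulls = make_validity(32768, 0.0001)
   # At 0.01% nearly every slot is valid, so the all-set fast path dominates.
   assert nulls < 32768 * 0.01
   ```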







[GitHub] [arrow] xhochy commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


xhochy commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-648649038


   The R ones probably?







[GitHub] [arrow] xhochy commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-24 Thread GitBox


xhochy commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-648648745


   @kou Which CI job shows the problem you're seeing? The MinGW ones seem fine.







[GitHub] [arrow] github-actions[bot] commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


github-actions[bot] commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-648648342


   https://issues.apache.org/jira/browse/ARROW-9216







[GitHub] [arrow] jianxind opened a new pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-24 Thread GitBox


jianxind opened a new pull request #7531:
URL: https://github.com/apache/arrow/pull/7531


   Speed up the typical use case where most data are true values; also add a null-probability test case.
   
   Signed-off-by: Frank Du 







[GitHub] [arrow] romainfrancois commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-24 Thread GitBox


romainfrancois commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r444682932



##
File path: r/R/table.R
##
@@ -202,7 +210,27 @@ Table$create <- function(..., schema = NULL) {
 
 #' @export
 as.data.frame.Table <- function(x, row.names = NULL, optional = FALSE, ...) {
-  Table__to_dataframe(x, use_threads = option_use_threads())
+  df <- Table__to_dataframe(x, use_threads = option_use_threads())
+
+  if (!is.null(r_metadata <- x$metadata$r)) {
+r_metadata <- .arrow_unserialize(r_metadata)
+
+df_metadata <- r_metadata[[1L]]

Review comment:
   Thanks









[GitHub] [arrow] romainfrancois commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-24 Thread GitBox


romainfrancois commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r444675449



##
File path: r/tests/testthat/test-Table.R
##
@@ -334,5 +334,5 @@ test_that("Table metadata", {
 
 test_that("Table handles null type (ARROW-7064)", {
   tab <- Table$create(a = 1:10, n = vctrs::unspecified(10))
-  expect_equal(tab$schema,  schema(a = int32(), n = null()))
+  expect_true(tab$schema$Equals(schema(a = int32(), n = null()), FALSE))

Review comment:
   Thanks. 









[GitHub] [arrow] praveenbingo closed pull request #7402: ARROW-9099: [C++][Gandiva] Implement trim function for string

2020-06-24 Thread GitBox


praveenbingo closed pull request #7402:
URL: https://github.com/apache/arrow/pull/7402


   


