[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415233940



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment:
   might be worth a quick discussion on whether we should vendor this (or 
maybe just vendor the lookup tables).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#issuecomment-619493366


   Just curious if you see and impact on parquet-arrow-reader-writer 
benchmarks?  That is the ultimate goal of the speedup.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415231929



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {
+if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+  output[num_valid_values] = values[idx_values];
+  num_valid_values++;
+}
+idx_values++;
+idx_valid_bits++;
+  }
+
+  // The parts can fill into batches
+  uint8_t valid_count;
+  int64_t idx_valid_bytes = BitUtil::BytesForBits(idx_valid_bits + 1) - 1;
+  static const __m512i zero = _mm512_set_epi64(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0);
+  while (num_values - idx_values >= kBatchSize) {
+// count the valid numbers of one batch.
+valid_count = BitUtil::kBytePopcount[valid_bits[idx_valid_bytes]];
+if (kBatchValidBytes > 1) {
+  valid_count += BitUtil::kBytePopcount[valid_bits[idx_valid_bytes + 1]];
+}
+
+// pack the data
+if (valid_count > 0) {
+  __m512i src = _mm512_loadu_si512(values + idx_values);
+  __m512i result;
+  if (sizeof(T) == 4) {
+// 16 float for one m512i block, two bytes in valid_bits
+__mmask16 k = *(reinterpret_cast(valid_bits + 
idx_valid_bytes));
+result = _mm512_mask_compress_epi32(zero, k, src);
+  } else {
+// 8 double for one m512i block, one byte in valid_bits
+__mmask8 k = *(valid_bits + idx_valid_bytes);
+result = _mm512_mask_compress_epi64(zero, k, src);
+  }
+
+

[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415231200



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {
+if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+  output[num_valid_values] = values[idx_values];
+  num_valid_values++;
+}
+idx_values++;
+idx_valid_bits++;
+  }
+
+  // The parts can fill into batches
+  uint8_t valid_count;
+  int64_t idx_valid_bytes = BitUtil::BytesForBits(idx_valid_bits + 1) - 1;
+  static const __m512i zero = _mm512_set_epi64(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0);
+  while (num_values - idx_values >= kBatchSize) {
+// count the valid numbers of one batch.
+valid_count = BitUtil::kBytePopcount[valid_bits[idx_valid_bytes]];
+if (kBatchValidBytes > 1) {
+  valid_count += BitUtil::kBytePopcount[valid_bits[idx_valid_bytes + 1]];
+}
+
+// pack the data
+if (valid_count > 0) {
+  __m512i src = _mm512_loadu_si512(values + idx_values);
+  __m512i result;
+  if (sizeof(T) == 4) {
+// 16 float for one m512i block, two bytes in valid_bits
+__mmask16 k = *(reinterpret_cast(valid_bits + 
idx_valid_bytes));
+result = _mm512_mask_compress_epi32(zero, k, src);
+  } else {
+// 8 double for one m512i block, one byte in valid_bits
+__mmask8 k = *(valid_bits + idx_valid_bytes);
+result = _mm512_mask_compress_epi64(zero, k, src);
+  }
+
+

[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415231122



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {
+if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+  output[num_valid_values] = values[idx_values];
+  num_valid_values++;
+}
+idx_values++;
+idx_valid_bits++;
+  }
+
+  // The parts can fill into batches
+  uint8_t valid_count;
+  int64_t idx_valid_bytes = BitUtil::BytesForBits(idx_valid_bits + 1) - 1;
+  static const __m512i zero = _mm512_set_epi64(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0);
+  while (num_values - idx_values >= kBatchSize) {
+// count the valid numbers of one batch.
+valid_count = BitUtil::kBytePopcount[valid_bits[idx_valid_bytes]];
+if (kBatchValidBytes > 1) {
+  valid_count += BitUtil::kBytePopcount[valid_bits[idx_valid_bytes + 1]];
+}
+
+// pack the data
+if (valid_count > 0) {
+  __m512i src = _mm512_loadu_si512(values + idx_values);
+  __m512i result;
+  if (sizeof(T) == 4) {
+// 16 float for one m512i block, two bytes in valid_bits
+__mmask16 k = *(reinterpret_cast(valid_bits + 
idx_valid_bytes));
+result = _mm512_mask_compress_epi32(zero, k, src);
+  } else {
+// 8 double for one m512i block, one byte in valid_bits
+__mmask8 k = *(valid_bits + idx_valid_bytes);
+result = _mm512_mask_compress_epi64(zero, k, src);
+  }
+
+

[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415230959



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {

Review comment:
   i < min(offset_suffix_front, num_values)?  if you take the min outside 
of the loop, you also  don't need to increment idx_values_* each time inside 
the loop (they can be calculated with a single addition).  Also by doing the 
calculation outside the loop I think you can call the scalar function instead 
of duplicating code?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415230291



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {
+if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+  output[num_valid_values] = values[idx_values];
+  num_valid_values++;
+}
+idx_values++;

Review comment:
   nit: can't these be moved out of the loop?
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415230291



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,
+  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+if (valid_bits_reader.IsSet()) {
+  output[num_valid_values++] = values[i];
+}
+valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template 
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+   const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast(buffer + values_read), 0,
+ (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing.  Note we infer this
+  // backwards, when 'i' is equal to 'values_to_move' it indicates
+  //all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); 
i--) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+  buffer[i] = buffer[values_to_move];
+  values_to_move--;
+}
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template 
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); 
i++) {
+if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+  output[num_valid_values] = values[idx_values];
+  num_valid_values++;
+}
+idx_values++;

Review comment:
   nit: can't these be moved out of the loop?
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415229405



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment:
   please add docs to describing what each method does.  PutSpaced and 
DecodSpaced aren't the most intuitive names and someone just looking at the 
header would have to read the code to understand each one.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415229405



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment:
   please add docs to each method





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#discussion_r415229227



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include 
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template 
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment:
   FWIW, Daniel Lemire has a SIMD library for this 
https://lemire.me/blog/2017/04/25/quickly-pruning-elements-in-simd-vectors-using-the-simdprune-library/





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tianchen92 commented on a change in pull request #6912: ARROW-8020: [Java] Implement vector validate functionality

2020-04-25 Thread GitBox


tianchen92 commented on a change in pull request #6912:
URL: https://github.com/apache/arrow/pull/6912#discussion_r415229312



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
##
@@ -320,6 +322,20 @@ public int hashCode(int index, ArrowBufHasher hasher) {
 return visitor.visit(this, value);
   }
 
+  @Override
+  public void validate() {
+if (getValueCount() < 0) {
+  throw new RuntimeException("vector valueCount is negative");
+}
+
+if (getNullCount() > getValueCount()) {

Review comment:
   Thanks, fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tianchen92 commented on a change in pull request #6912: ARROW-8020: [Java] Implement vector validate functionality

2020-04-25 Thread GitBox


tianchen92 commented on a change in pull request #6912:
URL: https://github.com/apache/arrow/pull/6912#discussion_r415229163



##
File path: java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
##
@@ -283,4 +283,10 @@
* @return the name of the vector.
*/
   String getName();
+
+  /**
+   * Validate the vector, will throw exception if validate fail.
+   */
+  void validate();

Review comment:
   Thanks for all your comments, we already have validate method in 
ValueVectorUtility, and I removed the api from ValueVector.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r415226340



##
File path: cpp/src/arrow/util/hash_util.h
##
@@ -27,39 +27,27 @@
 
 #include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
-#include "arrow/util/neon_util.h"
-#include "arrow/util/sse_util.h"
+#include "arrow/util/simd.h"
 
-static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) {
+#ifdef ARROW_HAVE_SSE4_2

Review comment:
   I'd be concerned about putting too much in SIMD directly.  If we expect 
these to be used in other places that don't depend on this header we should 
create a specific header targetted at hardware. digests/hashing.  I think the 
way bit_util.h wraps special instructions is a reasonable path to follow.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r415228325



##
File path: cpp/src/arrow/util/simd.h
##
@@ -17,6 +17,24 @@
 
 #pragma once
 
+#ifdef _MSC_VER
+// MSVC x86_64/arm64
+
+#if defined(_M_AMD64) || defined(_M_X64)
+#include 
+#elif defined(_M_ARM64)
+#include 
+#endif
+
+#else
+// gcc/clang (possibly others)
+

Review comment:
   Just an FYI in https://github.com/apache/arrow/pull/6985 relies on BMI2 
do you think adding that here would be appropriate?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r415226495



##
File path: cpp/src/arrow/util/hash_util.h
##
@@ -27,39 +27,27 @@
 
 #include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
-#include "arrow/util/neon_util.h"
-#include "arrow/util/sse_util.h"
+#include "arrow/util/simd.h"
 
-static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) {
+#ifdef ARROW_HAVE_SSE4_2
+constexpr auto HW_crc32_u8 = _mm_crc32_u8;

Review comment:
   style nits: If possible les spell out the type instead of using auto 
(its not all clear what the types are without looking up these exact functions.
   
   Since these are functions, it would be nice to use standard function casing 
for them. e.g.  Crc32U16, even better might be to define something like:
   HardwareCrc(uint16) ..
   HardwareCrc(uint16) ..





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r415226495



##
File path: cpp/src/arrow/util/hash_util.h
##
@@ -27,39 +27,27 @@
 
 #include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
-#include "arrow/util/neon_util.h"
-#include "arrow/util/sse_util.h"
+#include "arrow/util/simd.h"
 
-static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) {
+#ifdef ARROW_HAVE_SSE4_2
+constexpr auto HW_crc32_u8 = _mm_crc32_u8;

Review comment:
   If possible les spell out the type instead of using auto (its not all 
clear what the types are without looking up these exact functions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r415226340



##
File path: cpp/src/arrow/util/hash_util.h
##
@@ -27,39 +27,27 @@
 
 #include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
-#include "arrow/util/neon_util.h"
-#include "arrow/util/sse_util.h"
+#include "arrow/util/simd.h"
 
-static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) {
-  DCHECK(false) << "Hardware CRC support is not enabled";
-  return 0;
-}
-
-static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) {
+#ifdef ARROW_HAVE_SSE4_2

Review comment:
   I'd be concerned about putting too much in SIMD directly.  If we expect 
these to be used in other places that don't depend on this header we should 
create a specific header targetted at hardware. digests/hashing.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-04-25 Thread GitBox


emkornfield commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-619488093


   @pitrou I think I addressed your comments.  One of them that went stale was 
the complexity for "AppendWord", I tried to remove parts that did not seem to 
affect performance on benchmarks for parquet column reading but I spent some 
time trying to maximize word level parallelism for the unaligned case, because 
I think at least for repeated fields I expect this case to be fairly common. 
   
   As a point of comparison using the AppendWord implementation I put in place 
for BigEndian shows much smaller improvements.  Really this should use 
CopyBitmap but I didn't want to start moving a move code then I needed to in 
bit_util.h for this PR. I opened 
a [JIRA (ARROW-8595)](https://issues.apache.org/jira/browse/ARROW-8595) to 
track some improvements in that regard.
   
   If more comments are needed or you think there is a cleaner way of writing 
the code, I'm happy for input.
   
   @wesm do you want to look at the parquet specific logic/comments to make 
sure I captured them correctly?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#discussion_r415221052



##
File path: cpp/cmake_modules/SetupCxxFlags.cmake
##
@@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
 set(CXX_SUPPORTS_SSE4_2 TRUE)
   else()
 set(ARROW_SSE4_2_FLAG "-msse4.2")
-set(ARROW_AVX2_FLAG "-mavx2")
+set(ARROW_AVX2_FLAG "-march=core-avx2")

Review comment:
   done, seems to work.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#discussion_r415221009



##
File path: cpp/src/parquet/level_conversion_test.cc
##
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/level_conversion.h"
+
+#include 
+#include 
+
+#include 
+
+#include "arrow/util/bit_util.h"
+
+namespace parquet {
+namespace internal {
+
+using ::testing::ElementsAreArray;
+
+std::string ToString(const uint8_t* bitmap, int64_t bit_count) {
+  return arrow::internal::Bitmap(bitmap, /*offset*/ 0, 
/*length=*/bit_count).ToString();
+}
+
+std::string ToString(const std::vector& bitmap, int64_t bit_count) {
+  return ToString(bitmap.data(), bit_count);
+}
+
+TEST(TestGreaterThanBitmap, GeneratesExpectedBitmasks) {

Review comment:
   There was a substantial refactoring, I tried to provide guards but 
without a big-endian machine in CI it will be very hard to to catch all of 
these issues.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-04-25 Thread GitBox


emkornfield commented on a change in pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#discussion_r414295562



##
File path: cpp/cmake_modules/SetupCxxFlags.cmake
##
@@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
 set(CXX_SUPPORTS_SSE4_2 TRUE)
   else()
 set(ARROW_SSE4_2_FLAG "-msse4.2")
-set(ARROW_AVX2_FLAG "-mavx2")
+set(ARROW_AVX2_FLAG "-march=core-avx2")
 # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
 set(ARROW_AVX512_FLAG "-march=skylake-avx512")
 check_cxx_compiler_flag(${ARROW_SSE4_2_FLAG} CXX_SUPPORTS_SSE4_2)
   endif()
   check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)
+  check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)

Review comment:
   should be removed now.

##
File path: cpp/src/parquet/column_reader.cc
##
@@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 
 namespace parquet {
 
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,

Review comment:
   I've moved all everything in the anonymous namespace to 
level_conversion.cc

##
File path: cpp/src/parquet/column_reader.cc
##
@@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 
 namespace parquet {
 
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits::max();
+  int16_t max_level = std::numeric_limits::min();
+  for (int x = 0; x < num_levels; x++) {
+min_level = std::min(levels[x], min_level);
+max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > 
max_level))) {
+throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+const int16_t* def_levels, int64_t num_def_levels, const int16_t 
max_definition_level,
+const int16_t max_repetition_level, int64_t* values_read, int64_t* 
null_count,
+uint8_t* valid_bits, int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, 
valid_bits_offset,
+num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future
+  for (int i = 0; i < num_def_levels; ++i) {
+if (def_levels[i] == max_definition_level) {
+  valid_bits_writer.Set();
+} else if (max_repetition_level > 0) {
+  // repetition+flat case
+  if (def_levels[i] == (max_definition_level - 1)) {
+valid_bits_writer.Clear();
+*null_count += 1;
+  } else {
+continue;
+  }
+} else {
+  // non-repeated+nested case
+  if (def_levels[i] < max_definition_level) {
+valid_bits_writer.Clear();
+*null_count += 1;
+  } else {
+throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template 
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t 
num_def_levels,
+  const int16_t required_definition_level,
+  int64_t* values_read, int64_t* null_count,
+  uint8_t* valid_bits, int64_t 
valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+CheckLevelRange(def_levels, batch_size, required_definition_level);
+uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, 
batch_size,
+  
required_definition_level - 1);
+if (has_repeated_parent) {
+  // This is currently a specialized code path assuming only (nested) lists

Review comment:
   done.

##
File path: cpp/src/parquet/column_reader.cc
##
@@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 
 namespace parquet {
 
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits::max();
+  int16_t max_level = std::numeric_limits::min();
+  for (int x = 0; x < num_levels; x++) {
+min_level = std::min(levels[x], min_level);
+max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > 
max_level))) {
+ 

[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415203879



##
File path: 
java/memory/src/test/java/org/apache/arrow/memory/TestNettyAllocationManager.java
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.memory;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertTrue;
+
+import org.junit.Test;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Test cases for {@link NettyAllocationManager}.
+ */
+public class TestNettyAllocationManager {
+
+  private void readWriteArrowBuf(ArrowBuf buffer) {
+// write buffer
+for (long i = 0; i < buffer.capacity() / 8; i++) {
+  buffer.setLong(i * 8, i);
+}
+
+// read buffer
+for (long i = 0; i < buffer.capacity() / 8; i++) {
+  long val = buffer.getLong(i * 8);
+  assertEquals(i, val);
+}
+  }
+
+  /**
+   * Test the allocation strategy for small buffers..
+   */
+  @Test
+  public void testSmallBufferAllocation() {
+final long bufSize = 512L;
+try (RootAllocator allocator = new RootAllocator(bufSize);
+ ArrowBuf buffer = allocator.buffer(bufSize)) {
+  // make sure the buffer is small enough, so we wil use the allocation 
strategy for small buffers
+  assertTrue(bufSize < 
NettyAllocationManager.DEFAULT_ALLOCATION_CUTOFF_VALUE);
+
+  assertTrue(buffer.getReferenceManager() instanceof BufferLedger);
+  BufferLedger bufferLedger = (BufferLedger) buffer.getReferenceManager();
+
+  // make sure we are using netty allocation manager
+  AllocationManager allocMgr = bufferLedger.getAllocationManager();
+  assertTrue(allocMgr instanceof NettyAllocationManager);
+  NettyAllocationManager nettyMgr = (NettyAllocationManager) allocMgr;
+
+  // for the small buffer allocation strategy, the chunk is not null
+  assertNotNull(nettyMgr.getMemoryChunk());
+
+  readWriteArrowBuf(buffer);
+}
+  }
+
+  /**
+   * Test the allocation strategy for large buffers..
+   */
+  @Test
+  public void testLargeBufferAllocation() {
+final long bufSize = 2048L;

Review comment:
   Sounds good. Revised accordingly. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415203851



##
File path: 
java/memory/src/test/java/org/apache/arrow/memory/TestNettyAllocationManager.java
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.memory;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertTrue;
+
+import org.junit.Test;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Test cases for {@link NettyAllocationManager}.
+ */
+public class TestNettyAllocationManager {
+
+  private void readWriteArrowBuf(ArrowBuf buffer) {
+// write buffer
+for (long i = 0; i < buffer.capacity() / 8; i++) {
+  buffer.setLong(i * 8, i);
+}
+
+// read buffer
+for (long i = 0; i < buffer.capacity() / 8; i++) {
+  long val = buffer.getLong(i * 8);
+  assertEquals(i, val);
+}
+  }
+
+  /**
+   * Test the allocation strategy for small buffers..
+   */
+  @Test
+  public void testSmallBufferAllocation() {
+final long bufSize = 512L;
+try (RootAllocator allocator = new RootAllocator(bufSize);
+ ArrowBuf buffer = allocator.buffer(bufSize)) {
+  // make sure the buffer is small enough, so we wil use the allocation 
strategy for small buffers
+  assertTrue(bufSize < 
NettyAllocationManager.DEFAULT_ALLOCATION_CUTOFF_VALUE);
+
+  assertTrue(buffer.getReferenceManager() instanceof BufferLedger);
+  BufferLedger bufferLedger = (BufferLedger) buffer.getReferenceManager();
+
+  // make sure we are using netty allocation manager
+  AllocationManager allocMgr = bufferLedger.getAllocationManager();
+  assertTrue(allocMgr instanceof NettyAllocationManager);
+  NettyAllocationManager nettyMgr = (NettyAllocationManager) allocMgr;
+
+  // for the small buffer allocation strategy, the chunk is not null
+  assertNotNull(nettyMgr.getMemoryChunk());
+
+  readWriteArrowBuf(buffer);
+}
+  }
+
+  /**
+   * Test the allocation strategy for large buffers..
+   */
+  @Test
+  public void testLargeBufferAllocation() {
+final long bufSize = 2048L;
+try (RootAllocator allocator = new RootAllocator(bufSize);
+ ArrowBuf buffer = allocator.buffer(bufSize)) {
+  // make sure the buffer is large enough, so we wil use the allocation 
strategy for large buffers

Review comment:
   Sorry about the typo. This comment is removed. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415203808



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
##
@@ -17,48 +17,97 @@
 
 package org.apache.arrow.memory;
 
-import org.apache.arrow.memory.util.LargeMemoryUtil;
-
 import io.netty.buffer.PooledByteBufAllocatorL;
 import io.netty.buffer.UnsafeDirectLittleEndian;
+import io.netty.util.internal.PlatformDependent;
 
 /**
- * The default implementation of AllocationManagerBase. The implementation is 
responsible for managing when memory
+ * The default implementation of {@link AllocationManager}. The implementation 
is responsible for managing when memory
  * is allocated and returned to the Netty-based PooledByteBufAllocatorL.
  */
 public class NettyAllocationManager extends AllocationManager {
 
   public static final Factory FACTORY = new Factory();
 
+  /**
+   * The default cut-off value for switching allocation strategies.
+   * If the request size is not greater than the cut-off value, we will 
allocate memory by
+   * {@link PooledByteBufAllocatorL} APIs,
+   * otherwise, we will use {@link PlatformDependent} APIs.
+   */
+  public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE;
+
+  public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = 
"default.allocation.cutoff.name";
+
+  static {
+long cutOffValue;
+try {
+  cutOffValue = 
Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME));
+} catch (Exception e) {
+  cutOffValue = Integer.MAX_VALUE;
+}
+DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue;
+  }
+
   private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new 
PooledByteBufAllocatorL();
   static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty;
   static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize();
 
-  private final int allocatedSize;
+  private final long allocatedSize;
   private final UnsafeDirectLittleEndian memoryChunk;
+  private final long allocatedAddress;
+
+  /**
+   * The cut-off value for switching allocation strategies.
+   */
+  private final long allocationCutOffValue;
 
-  NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) 
{
+  NettyAllocationManager(BaseAllocator accountingAllocator, long 
requestedSize, long allocationCutOffValue) {
 super(accountingAllocator);
-this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);
-this.allocatedSize = memoryChunk.capacity();
+if (allocationCutOffValue > Integer.MAX_VALUE) {
+  throw new IllegalArgumentException("The cut-off value cannot be larger 
than Integer.MAX_VALUE");
+}
+this.allocationCutOffValue = allocationCutOffValue;
+
+if (requestedSize > allocationCutOffValue) {
+  this.memoryChunk = null;
+  this.allocatedAddress = PlatformDependent.allocateMemory(requestedSize);
+  this.allocatedSize = requestedSize;
+} else {
+  this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);
+  this.allocatedAddress = memoryChunk.memoryAddress();
+  this.allocatedSize = memoryChunk.capacity();
+}
+  }
+
+  NettyAllocationManager(BaseAllocator accountingAllocator, long 
requestedSize) {
+this(accountingAllocator, requestedSize, DEFAULT_ALLOCATION_CUTOFF_VALUE);
   }
 
   /**
* Get the underlying memory chunk managed by this AllocationManager.
-   * @return buffer
+   * @return the underlying memory chunk if the request size is not greater 
than the
+   *   {@link NettyAllocationManager#allocationCutOffValue}, or null otherwise.
+   *
+   * @deprecated this method will be removed in a future release.
*/
+  @Deprecated
   UnsafeDirectLittleEndian getMemoryChunk() {
-return memoryChunk;
+return allocatedSize > allocationCutOffValue ? null : memoryChunk;

Review comment:
   Sorry. This is not needed. Reverted.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415203665



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
##
@@ -17,48 +17,97 @@
 
 package org.apache.arrow.memory;
 
-import org.apache.arrow.memory.util.LargeMemoryUtil;
-
 import io.netty.buffer.PooledByteBufAllocatorL;
 import io.netty.buffer.UnsafeDirectLittleEndian;
+import io.netty.util.internal.PlatformDependent;
 
 /**
- * The default implementation of AllocationManagerBase. The implementation is 
responsible for managing when memory
+ * The default implementation of {@link AllocationManager}. The implementation 
is responsible for managing when memory
  * is allocated and returned to the Netty-based PooledByteBufAllocatorL.
  */
 public class NettyAllocationManager extends AllocationManager {
 
   public static final Factory FACTORY = new Factory();
 
+  /**
+   * The default cut-off value for switching allocation strategies.
+   * If the request size is not greater than the cut-off value, we will 
allocate memory by
+   * {@link PooledByteBufAllocatorL} APIs,
+   * otherwise, we will use {@link PlatformDependent} APIs.
+   */
+  public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE;
+
+  public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = 
"default.allocation.cutoff.name";
+
+  static {
+long cutOffValue;
+try {
+  cutOffValue = 
Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME));
+} catch (Exception e) {
+  cutOffValue = Integer.MAX_VALUE;
+}
+DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue;
+  }
+

Review comment:
   Removed. Thank you for the good suggestion. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415203349



##
File path: 
java/memory/src/test/java/org/apache/arrow/memory/TestLargeArrowBuf.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.memory;
+
+import static org.junit.Assert.assertEquals;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Integration test for large (more than 2GB) {@link io.netty.buffer.ArrowBuf}.
+ * To run this test, please
+ *Make sure there are 4GB memory available in the system.
+ * 
+ *   Make sure the default allocation manager type is unsafe.
+ *   This can be achieved by the environmental variable or system property.
+ *   The details can be found in {@link DefaultAllocationManagerOption}.
+ * 
+ */
+public class TestLargeArrowBuf {
+
+  private static void testLargeArrowBuf() {
+final long bufSize = 4 * 1024 * 1024 * 1024L;
+try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+ ArrowBuf largeBuf = allocator.buffer(bufSize)) {
+  assertEquals(bufSize, largeBuf.capacity());
+  System.out.println("Successfully allocated a buffer with capacity " + 
largeBuf.capacity());
+
+  for (long i = 0; i < bufSize / 8; i++) {
+largeBuf.setLong(i * 8, i);
+
+if ((i + 1) % 1 == 0) {
+  System.out.println("Successfully written " + (i + 1) + " long 
words");
+}
+  }
+  System.out.println("Successfully written " + (bufSize / 8) + " long 
words");
+
+  for (long i = 0; i < bufSize / 8; i++) {
+long val = largeBuf.getLong(i * 8);
+assertEquals(i, val);
+
+if ((i + 1) % 1 == 0) {
+  System.out.println("Successfully read " + (i + 1) + " long words");
+}
+  }
+  System.out.println("Successfully read " + (bufSize / 8) + " long words");
+}
+System.out.println("Successfully released the large buffer.");
+  }
+
+  public static void main(String[] args) {

Review comment:
   Yes, it needs to run manually. 
   I have updated the javadoc accordingly.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415202903



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
##
@@ -17,48 +17,97 @@
 
 package org.apache.arrow.memory;
 
-import org.apache.arrow.memory.util.LargeMemoryUtil;
-
 import io.netty.buffer.PooledByteBufAllocatorL;
 import io.netty.buffer.UnsafeDirectLittleEndian;
+import io.netty.util.internal.PlatformDependent;
 
 /**
- * The default implementation of AllocationManagerBase. The implementation is 
responsible for managing when memory
+ * The default implementation of {@link AllocationManager}. The implementation 
is responsible for managing when memory
  * is allocated and returned to the Netty-based PooledByteBufAllocatorL.
  */
 public class NettyAllocationManager extends AllocationManager {
 
   public static final Factory FACTORY = new Factory();
 
+  /**
+   * The default cut-off value for switching allocation strategies.
+   * If the request size is not greater than the cut-off value, we will 
allocate memory by
+   * {@link PooledByteBufAllocatorL} APIs,
+   * otherwise, we will use {@link PlatformDependent} APIs.
+   */
+  public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE;
+
+  public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = 
"default.allocation.cutoff.name";
+
+  static {
+long cutOffValue;
+try {
+  cutOffValue = 
Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME));
+} catch (Exception e) {
+  cutOffValue = Integer.MAX_VALUE;
+}
+DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue;
+  }
+
   private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new 
PooledByteBufAllocatorL();
   static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty;
   static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize();
 
-  private final int allocatedSize;
+  private final long allocatedSize;
   private final UnsafeDirectLittleEndian memoryChunk;
+  private final long allocatedAddress;
+
+  /**
+   * The cut-off value for switching allocation strategies.
+   */
+  private final long allocationCutOffValue;

Review comment:
   Changed to int. Thanks for the good suggestion. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-25 Thread GitBox


liyafan82 commented on a change in pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#discussion_r415202953



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
##
@@ -17,48 +17,97 @@
 
 package org.apache.arrow.memory;
 
-import org.apache.arrow.memory.util.LargeMemoryUtil;
-
 import io.netty.buffer.PooledByteBufAllocatorL;
 import io.netty.buffer.UnsafeDirectLittleEndian;
+import io.netty.util.internal.PlatformDependent;
 
 /**
- * The default implementation of AllocationManagerBase. The implementation is 
responsible for managing when memory
+ * The default implementation of {@link AllocationManager}. The implementation 
is responsible for managing when memory
  * is allocated and returned to the Netty-based PooledByteBufAllocatorL.
  */
 public class NettyAllocationManager extends AllocationManager {
 
   public static final Factory FACTORY = new Factory();
 
+  /**
+   * The default cut-off value for switching allocation strategies.
+   * If the request size is not greater than the cut-off value, we will 
allocate memory by
+   * {@link PooledByteBufAllocatorL} APIs,
+   * otherwise, we will use {@link PlatformDependent} APIs.
+   */
+  public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE;
+
+  public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = 
"default.allocation.cutoff.name";
+
+  static {
+long cutOffValue;
+try {
+  cutOffValue = 
Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME));
+} catch (Exception e) {
+  cutOffValue = Integer.MAX_VALUE;
+}
+DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue;
+  }
+
   private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new 
PooledByteBufAllocatorL();
   static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty;
   static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize();
 
-  private final int allocatedSize;
+  private final long allocatedSize;
   private final UnsafeDirectLittleEndian memoryChunk;
+  private final long allocatedAddress;
+
+  /**
+   * The cut-off value for switching allocation strategies.
+   */
+  private final long allocationCutOffValue;
 
-  NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) 
{
+  NettyAllocationManager(BaseAllocator accountingAllocator, long 
requestedSize, long allocationCutOffValue) {
 super(accountingAllocator);
-this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);
-this.allocatedSize = memoryChunk.capacity();
+if (allocationCutOffValue > Integer.MAX_VALUE) {
+  throw new IllegalArgumentException("The cut-off value cannot be larger 
than Integer.MAX_VALUE");
+}
+this.allocationCutOffValue = allocationCutOffValue;
+
+if (requestedSize > allocationCutOffValue) {
+  this.memoryChunk = null;
+  this.allocatedAddress = PlatformDependent.allocateMemory(requestedSize);
+  this.allocatedSize = requestedSize;
+} else {
+  this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize);
+  this.allocatedAddress = memoryChunk.memoryAddress();
+  this.allocatedSize = memoryChunk.capacity();
+}
+  }
+
+  NettyAllocationManager(BaseAllocator accountingAllocator, long 
requestedSize) {
+this(accountingAllocator, requestedSize, DEFAULT_ALLOCATION_CUTOFF_VALUE);
   }
 
   /**
* Get the underlying memory chunk managed by this AllocationManager.
-   * @return buffer
+   * @return the underlying memory chunk if the request size is not greater 
than the
+   *   {@link NettyAllocationManager#allocationCutOffValue}, or null otherwise.
+   *
+   * @deprecated this method will be removed in a future release.
*/
+  @Deprecated
   UnsafeDirectLittleEndian getMemoryChunk() {
-return memoryChunk;
+return allocatedSize > allocationCutOffValue ? null : memoryChunk;
   }
 
   @Override
   protected long memoryAddress() {
-return memoryChunk.memoryAddress();
+return allocatedAddress;
   }
 
   @Override
   protected void release0() {
-memoryChunk.release();
+if (allocatedSize > allocationCutOffValue) {

Review comment:
   Revised accordingly. Thank you.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-619463693


   I found the cause of the test failure: If the `batch_size` isn't aligned 
with the `chunk_size`, categorical columns will fail with the error:
   ```
   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked 
arrays
   ```
   
   I think this means categorical columns/DictionaryArray columns aren't 
supported by this method for now, except if you are able to align the 
`batch_size` with `chunk_size`. 
   
   Is it possible or even common that `chunk_size` might be variable within a 
file?
   
   (The reason we were seeing the error in Python 3.5 and not in later Python 
versions is I was selecting a subset of columns using indices, and the ordering 
of columns changed between Python versions. I think because of the change in 
dictionary ordering in 3.6+. I've instead moved to have the offending test run 
on all columns.)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order

2020-04-25 Thread GitBox


github-actions[bot] commented on pull request #7041:
URL: https://github.com/apache/arrow/pull/7041#issuecomment-619457325


   https://issues.apache.org/jira/browse/ARROW-8584



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order

2020-04-25 Thread GitBox


github-actions[bot] commented on pull request #7041:
URL: https://github.com/apache/arrow/pull/7041#issuecomment-619456125


   Revision: 3e57660bbcb5002a8c53146754146fc7c92b1ead
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-166](https://github.com/ursa-labs/crossbow/branches/all?query=actions-166)
   
   |Task|Status|
   ||--|
   |centos-6-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-6-amd64)|
   |centos-7-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-7-amd64)|
   |centos-8-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-8-amd64)|
   |debian-buster-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-debian-buster-amd64)|
   |debian-stretch-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-debian-stretch-amd64)|
   |ubuntu-bionic-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-bionic-amd64)|
   |ubuntu-eoan-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-eoan-amd64)|
   |ubuntu-focal-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-focal-amd64)|
   |ubuntu-xenial-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-xenial-amd64)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order

2020-04-25 Thread GitBox


kou commented on pull request #7041:
URL: https://github.com/apache/arrow/pull/7041#issuecomment-619456003


   @github-actions crossbow submit -g linux



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mayuropensource commented on pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics

2020-04-25 Thread GitBox


mayuropensource commented on pull request #7022:
URL: https://github.com/apache/arrow/pull/7022#issuecomment-619456016


   A better calculation for bandwidth (by removing TTFB from total time) is 
done using following script:
   
   curl --negotiate -u: -o /dev/null -w "total_time_sec=%{time_total} 
data_size_bytes=%{size_download} TTFB_sec=%{time_starttransfer}\n" $S3_DATA_URI 
| awk -F' ' '{ split($1, tt, "="); split($2, ds, "="); split($3, ttfb, "="); 
bw=ds[2] / (1024 * 1024. * (tt[2] - ttfb[2])); print("TTFB_millis =", ttfb[2] * 
1000.); print("Bandwidth_MiB_per_sec =", bw); }'
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mayuropensource edited a comment on pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics

2020-04-25 Thread GitBox


mayuropensource edited a comment on pull request #7022:
URL: https://github.com/apache/arrow/pull/7022#issuecomment-619456016


   A better calculation for bandwidth (by removing TTFB from total time) is 
done using following script:
   
   `curl --negotiate -u: -o /dev/null -w "total_time_sec=%{time_total} 
data_size_bytes=%{size_download} TTFB_sec=%{time_starttransfer}\n" $S3_DATA_URI 
| awk -F' ' '{ split($1, tt, "="); split($2, ds, "="); split($3, ttfb, "="); 
bw=ds[2] / (1024 * 1024. * (tt[2] - ttfb[2])); print("TTFB_millis =", ttfb[2] * 
1000.); print("Bandwidth_MiB_per_sec =", bw); }'`
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou opened a new pull request #7041: ARROW-8584: [C++] Fix ORC link order

2020-04-25 Thread GitBox


kou opened a new pull request #7041:
URL: https://github.com/apache/arrow/pull/7041


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-619437378


   Two failing checks right now.
   
   For the linting one, it seems to be alarmed by some Rust code that I didn't 
touch. Am I missing something in that output?
   
   For the Python tests, getting [this 
error](https://github.com/apache/arrow/pull/6979/checks?check_run_id=618464479#step:7:2128):
   ```
   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked 
arrays
   ```
   When I run the same test locally, it succeeds. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


wjones1 commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r415130068



##
File path: python/pyarrow/tests/test_parquet.py
##
@@ -179,6 +179,99 @@ def alltypes_sample(size=1, seed=0, categorical=False):
 
 
 @pytest.mark.pandas
+def test_iter_batches_columns_reader(tempdir):
+df = alltypes_sample(size=1, categorical=True)
+
+filename = tempdir / 'pandas_roundtrip.parquet'
+arrow_table = pa.Table.from_pandas(df)
+_write_table(arrow_table, filename, version="2.0",
+ coerce_timestamps='ms', chunk_size=1000)
+
+columns = df.columns[4:15]
+
+file_ = pq.ParquetFile(filename)
+
+batches = file_.iter_batches(
+batch_size=500,
+columns=columns
+)
+
+tm.assert_frame_equal(
+next(batches).to_pandas(),
+df.iloc[:500, :].loc[:, columns]
+)
+
+
+@pytest.mark.pandas
+@pytest.mark.parametrize('chunk_size', [1000])
+def test_iter_batches_reader(tempdir, chunk_size):

Review comment:
   Strangely, after I merged the latest changes from master, I am no longer 
seeing this issue with dictionary arrays. I definitely saw it in the original 
fork, so I think it may have actually been fixed (though not sure where).
   
   I've removed the dictionary array correction in the test and hopefully the 
CI should confirm what I am seeing.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7040: ARROW-8505: [Release][C#] "sourcelink test" is failed by Apache.ArrowAssemblyInfo.cs

2020-04-25 Thread GitBox


github-actions[bot] commented on pull request #7040:
URL: https://github.com/apache/arrow/pull/7040#issuecomment-619408251


   https://issues.apache.org/jira/browse/ARROW-8505



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt opened a new pull request #7040: ARROW-8505: [Release][C#] "sourcelink test" is failed by Apache.ArrowAssemblyInfo.cs

2020-04-25 Thread GitBox


eerhardt opened a new pull request #7040:
URL: https://github.com/apache/arrow/pull/7040


   Workaround https://github.com/dotnet/sourcelink/issues/572 by explicitly 
embedding the AssemblyAttributes file into the pdb.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on pull request #6121: ARROW-6603: [C#] - Nullable Array Support

2020-04-25 Thread GitBox


eerhardt commented on pull request #6121:
URL: https://github.com/apache/arrow/pull/6121#issuecomment-619396952


   Thank you for this contribution, @abbotware.
   
   However, my opinion is that #7032 is more inline with how null support 
should be designed for the builder APIs. It also more closely models how the 
other languages do it as well.
   
   I think we should move forward with #7032's design and implementation to 
resolve ARROW-6603.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-25 Thread GitBox


eerhardt commented on pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#issuecomment-619396229


   > ARROW-5634 by properly setting the readonly value for NullCount, which 
previously was hardcoded to -1.
   
   I don't believe this change addresses the issue correctly. Can we remove 
that one from this PR? We can submit a separate PR for ARROW-5634, since it 
would involve more work and I think this PR should be targeted to just the 
ArrayBuilder API to support appending `null`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-25 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r415072524



##
File path: csharp/src/Apache.Arrow/Apache.Arrow.csproj
##
@@ -4,7 +4,7 @@
 netstandard1.3;netcoreapp2.1
 true
 
$(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T
-
+

Review comment:
   (nit) Can you revert this change? That way it doesn't pollute version 
history.

##
File path: csharp/Directory.Build.props
##
@@ -21,6 +21,12 @@
 
$(CSharpDir)ApacheArrow.snk
   
 
+  
+

Review comment:
   I don't see where this is necessary. I've removed it and it still builds 
and tests pass on my machine.
   
   Can this be removed?
   
   In general, in the http://github.com/dotnet/runtime tests, we try to avoid 
using `InternalsVisibleTo` for testing, and instead write the tests in the same 
way a consumer of the library would use it.

##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length)
 length = Math.Min(Length - offset, length);
 offset += Offset;
 
-return new ArrayData(DataType, length, -1, offset, Buffers, 
Children);
+return new ArrayData(DataType, length, NullCount, offset, Buffers, 
Children);

Review comment:
   This isn't correct because we are slicing a subset of the `ArrayData`.
   
   If the total `ArrayData` has values:
   
   `0`, `null`, `2`, `3`, `4`
   
   `NullCount` for the outer `ArrayData` will be `1`.
   
   But if we are slicing from `offset = 2` and `length = 3` and creating a new 
`ArrayData` from the above, the new `ArrayData` won't have any nulls in it. But 
with this change, it will have `NullCount = 1` from its parent.
   
   `NullCount = -1` in the original code means "we don't know yet, it needs to 
be computed". That was done so we don't have to count the nulls in the sliced 
values until it is necessary. Since there are cases where it may not be 
required.

##
File path: csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
##
@@ -4,6 +4,8 @@
   
 netcoreapp2.1
 true
+true

Review comment:
   If we can remove the `InternalsVisibleTo` attribute, we can revert this 
change as well.

##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -99,43 +105,63 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu
 {
 protected TBuilder Instance => this as TBuilder;
 protected ArrowBuffer.Builder ValueBuffer { get; }
+protected BooleanArray.Builder ValidityBuffer { get; }
 
 public int Length => ValueBuffer.Length;
 
+protected int NullCount { get; set; }
+
 // TODO: Implement support for null values (null bitmaps)
 
 internal PrimitiveArrayBuilder()
 {
 ValueBuffer = new ArrowBuffer.Builder();
+ValidityBuffer = new BooleanArray.Builder();
 }
 
 public TBuilder Resize(int length)
 {
 ValueBuffer.Resize(length);
+ValidityBuffer.Resize(length + 1);
 return Instance;
 }
 
 public TBuilder Reserve(int capacity)
 {
 ValueBuffer.Reserve(capacity);
+ValidityBuffer.Reserve(capacity + 1);
 return Instance;
 }
 
 public TBuilder Append(T value)
 {
 ValueBuffer.Append(value);
+ValidityBuffer.Append(true);
 return Instance;
 }
 
 public TBuilder Append(ReadOnlySpan span)
 {
 ValueBuffer.Append(span);
+ValidityBuffer.Append(span != Span.Empty);

Review comment:
   This needs to be similar to `AppendRange` below and append a `true` for 
each element in the `span`.

##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
 public TArray Build(MemoryAllocator allocator = default)
 {
 return Build(
-ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-ValueBuffer.Length, 0, 0);
+ValueBuffer.Build(allocator), 
ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:
   How do you feel about making the `ValidityBuffer` lazy? That is, if no 
`null` values were ever appended to the builder, then the `ValidityBuffer` 
stays empty. And only upon the first `null` value being append do we allocate a 
`ValidityBuffer` and append all `true`s for all the values already appended 
before the first `null`.
   
   The advantage is that when a buffer doesn't contain any nulls, we don't need 
to allocate memory here, and it will be smaller memory size when transferring 
the Arrow bytes across a network.

##
File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs
##
@@ -81,7 +110,9 @@ L

[GitHub] [arrow] github-actions[bot] commented on pull request #7039: ARROW-8513: [Python] Expose Take with Table input in Python

2020-04-25 Thread GitBox


github-actions[bot] commented on pull request #7039:
URL: https://github.com/apache/arrow/pull/7039#issuecomment-619367004


   https://issues.apache.org/jira/browse/ARROW-8513



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] gramirezespinoza opened a new pull request #7039: ARROW-8513: [Python] Expose Take with Table input in Python

2020-04-25 Thread GitBox


gramirezespinoza opened a new pull request #7039:
URL: https://github.com/apache/arrow/pull/7039


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rollokb commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


rollokb commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r415027375



##
File path: python/pyarrow/tests/test_parquet.py
##
@@ -179,6 +179,99 @@ def alltypes_sample(size=1, seed=0, categorical=False):
 
 
 @pytest.mark.pandas
+def test_iter_batches_columns_reader(tempdir):
+df = alltypes_sample(size=1, categorical=True)
+
+filename = tempdir / 'pandas_roundtrip.parquet'
+arrow_table = pa.Table.from_pandas(df)
+_write_table(arrow_table, filename, version="2.0",
+ coerce_timestamps='ms', chunk_size=1000)
+
+columns = df.columns[4:15]
+
+file_ = pq.ParquetFile(filename)
+
+batches = file_.iter_batches(
+batch_size=500,
+columns=columns
+)
+
+tm.assert_frame_equal(
+next(batches).to_pandas(),
+df.iloc[:500, :].loc[:, columns]
+)
+
+
+@pytest.mark.pandas
+@pytest.mark.parametrize('chunk_size', [1000])
+def test_iter_batches_reader(tempdir, chunk_size):

Review comment:
   @wjones1  thank you very much for taking over my PR. Unfortunately one 
of the reasons I abandoned this (and just built a wheel for my own company's 
project), was not knowing what to do about the problem this test highlights.
   
   The problem with dictionary arrays is far beyond my skill. I was wondering 
what someone with more architectural vision thinks of this issue.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7038: ARROW-8593: [C++][Parquet] Fix build with musl libc

2020-04-25 Thread GitBox


github-actions[bot] commented on pull request #7038:
URL: https://github.com/apache/arrow/pull/7038#issuecomment-619346808


   https://issues.apache.org/jira/browse/ARROW-8593



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tobim opened a new pull request #7038: ARROW-8593: [C++][Parquet] Fix build with musl libc

2020-04-25 Thread GitBox


tobim opened a new pull request #7038:
URL: https://github.com/apache/arrow/pull/7038


   Converts local constants in `file_serialize_test.cc` to snake_case.
   
   Fixes a confilct with the `PAGE_SIZE` macro declared in the `limits.h` 
header that is shipped with musl libc.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org