[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413575153

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")

Review comment: Hopefully this is an acceptable change. I'm open to other suggestions on how to enable BMI2 (for some reason "-mbmi2 -mavx2" didn't seem to work for me, but I can try again).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576031

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level

Review comment: Fix comment: it should be "- 2".
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576741

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level

Review comment: I think "- 1" might be confusing here because we subtract below.
[GitHub] [arrow] emkornfield commented on issue #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now.
[GitHub] [arrow] emkornfield edited a comment on issue #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield edited a comment on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now. Will take a closer look at CI failures tomorrow.
[GitHub] [arrow] rdettai commented on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai commented on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread safe?
[GitHub] [arrow] rdettai edited a comment on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai edited a comment on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread safe?
[GitHub] [arrow] nevi-me commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
nevi-me commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618293330

> I also think it is fine for the moment to ignore the use case where the schema varies between record batches and file a separate issue for that.

Just on this point, there shouldn't be a way of creating a record batch that has a mismatching schema.
[GitHub] [arrow] rok commented on issue #6667: ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors to Python
rok commented on issue #6667: URL: https://github.com/apache/arrow/pull/6667#issuecomment-618295149 Thanks @mrkn! :)
[GitHub] [arrow] kszucs commented on issue #6998: ARROW-8541: [Release] Don't remove previous source releases automatically
kszucs commented on issue #6998: URL: https://github.com/apache/arrow/pull/6998#issuecomment-618328429 @kou updated
[GitHub] [arrow] pitrou commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
pitrou commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413696312

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")

Review comment: It would be nice if you could try again.

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")
   # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
   set(ARROW_AVX512_FLAG "-march=skylake-avx512")
   check_cxx_compiler_flag(${ARROW_SSE4_2_FLAG} CXX_SUPPORTS_SSE4_2)
 endif()
 check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)
+check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)

Review comment: This doesn't seem useful :-)

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level of the repeated parent so we can
+      // check for def_level > "def level of repeated parent".
+      uint64_t present_bitmap = internal::GreaterThanBitmap(
+          def_levels, batch_size,
[GitHub] [arrow] pitrou commented on issue #6954: ARROW-8440: [C++] Refine SIMD header files
pitrou commented on issue #6954: URL: https://github.com/apache/arrow/pull/6954#issuecomment-618329962 cc @emkornfield
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413715323

## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc

## @@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) {
 }
 BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime();

-BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) {
-  ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k");
-}
-BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime();
+// Helpers to generate various multiple benchmarks for a given Parquet file.
+// NAME: the base name of the benchmark.
+// ROWS: the number of rows in the Parquet file.
+// COLS: the number of columns in the Parquet file.
+// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader).
+#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY)                                 \
+  BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State& st) {  \
+    std::vector<int> column_indices(COLS);                                            \
+    std::iota(column_indices.begin(), column_indices.end(), 0);                       \
+    std::stringstream ss;                                                             \
+    ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k";                          \
+    ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY);           \

Review comment: From a code maintainability standpoint, can we avoid putting so much logic in C macros? Ideally, the macro can call into a templated function.

## File path: cpp/src/parquet/file_reader.cc

## @@ -78,14 +79,46 @@ std::unique_ptr<PageReader> RowGroupReader::GetColumnPageReader(int i) {
 // Returns the rowgroup metadata
 const RowGroupMetaData* RowGroupReader::metadata() const { return contents_->metadata(); }

+/// Compute the section of the file that should be read for the given
+/// row group and column chunk.
+arrow::io::ReadRange ComputeColumnChunkRange(FileMetaData* file_metadata,
+                                             int64_t source_size, int row_group_index,
+                                             int column_index) {
+  auto row_group_metadata = file_metadata->RowGroup(row_group_index);
+  auto column_metadata = row_group_metadata->ColumnChunk(column_index);
+
+  int64_t col_start = column_metadata->data_page_offset();
+  if (column_metadata->has_dictionary_page() &&
+      column_metadata->dictionary_page_offset() > 0 &&
+      col_start > column_metadata->dictionary_page_offset()) {
+    col_start = column_metadata->dictionary_page_offset();
+  }
+
+  int64_t col_length = column_metadata->total_compressed_size();
+  // PARQUET-816 workaround for old files created by older parquet-mr
+  const ApplicationVersion& version = file_metadata->writer_version();
+  if (version.VersionLt(ApplicationVersion::PARQUET_816_FIXED_VERSION())) {
+    // The Parquet MR writer had a bug in 1.2.8 and below where it didn't include the
+    // dictionary page header size in total_compressed_size and total_uncompressed_size
+    // (see IMPALA-694). We add padding to compensate.
+    int64_t bytes_remaining = source_size - (col_start + col_length);
+    int64_t padding = std::min(kMaxDictHeaderSize, bytes_remaining);
+    col_length += padding;
+  }
+
+  return {col_start, col_length};
+}
+
 // RowGroupReader::Contents implementation for the Parquet file specification
 class SerializedRowGroup : public RowGroupReader::Contents {
  public:
-  SerializedRowGroup(std::shared_ptr<ArrowInputFile> source, int64_t source_size,
-                     FileMetaData* file_metadata, int row_group_number,
-                     const ReaderProperties& props,
+  SerializedRowGroup(std::shared_ptr<ArrowInputFile> source,
+                     std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
+                     int64_t source_size, FileMetaData* file_metadata,
+                     int row_group_number, const ReaderProperties& props,
                      std::shared_ptr<InternalFileDecryptor> file_decryptor = nullptr)
       : source_(std::move(source)),
+        cached_source_(cached_source ? std::move(cached_source) : nullptr),

Review comment: I don't think the ternary expression is needed. You should be able to move a null `shared_ptr` alright.

## File path: cpp/src/parquet/file_reader.cc

## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment: Be careful: the cache is unbounded. Does it mean you're willing to let all these row groups survive in memory until the reader gets destroyed?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
jorisvandenbossche commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r413622598

## File path: cpp/src/arrow/dataset/file_parquet.cc

## @@ -402,23 +401,21 @@
 }

 Result<std::shared_ptr<FileFragment>> ParquetFileFormat::MakeFragment(
-    FileSource source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<Expression> partition_expression, std::vector<int> row_groups) {
+    FileSource source, std::shared_ptr<Expression> partition_expression,
+    std::vector<int> row_groups) {
   return std::shared_ptr<FileFragment>(
-      new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options),
+      new ParquetFileFragment(std::move(source), shared_from_this(),
                               std::move(partition_expression), std::move(row_groups)));
 }

 Result<std::shared_ptr<FileFragment>> ParquetFileFormat::MakeFragment(
-    FileSource source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<Expression> partition_expression) {
-  return std::shared_ptr<FileFragment>(
-      new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options),
-                              std::move(partition_expression), {}));
+    FileSource source, std::shared_ptr<Expression> partition_expression) {
+  return std::shared_ptr<FileFragment>(new ParquetFileFragment(
+      std::move(source), shared_from_this(), std::move(partition_expression), {}));
 }

 Result<FragmentIterator> ParquetFileFormat::GetRowGroupFragments(
-    const ParquetFileFragment& fragment, std::shared_ptr<Expression> extra_filter) {
+    const ParquetFileFragment& fragment, std::shared_ptr<Expression> filter) {

Review comment: Small comment: in the header file it is still called "extra_filter".
[GitHub] [arrow] nevi-me opened a new pull request #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
nevi-me opened a new pull request #7018: URL: https://github.com/apache/arrow/pull/7018 When a user compiles the `flight` crate, a `build.rs` script is invoked. This script recursively looks for the `format/Flight.proto` path. A user might not have that path, as they would not have cloned the arrow repository, and as such, the build fails. This change: a) checks whether the `../../format/Flight.proto` path exists, and only builds if the file exists and has been changed; b) checks the proto file into the `src/` directory, changing the default location from an opaque directory in the `target/{debug|release}/build-xxx` folder. The rationale for checking in the proto file is that if the file is not rebuilt, `flight` users are still able to access it. It's also beneficial for users to easily view the file and better understand what the generated code looks like.
[GitHub] [arrow] github-actions[bot] commented on issue #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
github-actions[bot] commented on issue #7018: URL: https://github.com/apache/arrow/pull/7018#issuecomment-618349230 https://issues.apache.org/jira/browse/ARROW-8536
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
jorisvandenbossche commented on a change in pull request #6992: URL: https://github.com/apache/arrow/pull/6992#discussion_r413740270

## File path: python/pyarrow/pandas-shim.pxi

## @@ -55,6 +55,16 @@ cdef class _PandasAPIShim(object):
         from distutils.version import LooseVersion
         self._loose_version = LooseVersion(pd.__version__)
+        if self._loose_version < LooseVersion('0.23.0'):
+            self._have_pandas = False
+            if raise_:
+                raise ImportError(
+                    "pyarrow requires pandas 0.23.0 or above, pandas {} is "
+                    "installed".format(self._version)
+                )
+            else:
+                return

Review comment: Yes, that's a good idea.
[GitHub] [arrow] wesm commented on issue #7017: suggestion: why not serialize complex numbers in a Python list/dict/set
wesm commented on issue #7017: URL: https://github.com/apache/arrow/issues/7017#issuecomment-618352283 Can you send an email to one of the mailing lists or open a JIRA if you want to propose a development project?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413744080

## File path: cpp/src/parquet/file_reader.cc

## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment: We could add a method to clear/remove the cache, or the user might recreate the ParquetFileReader (as the cache is shared among all readers created from the file anyway).
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413747770 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I guess my question is: if I'm reading one record batch at a time (in streaming fashion), shouldn't the cache be per-record batch?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
jorisvandenbossche commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413750283 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == { -b'ARROW:extension:metadata': b'freq=D', -b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: I don't have a good sense of what people do with this metadata, but I suppose when the type was recognized, it should not be a problem if the metadata is not present anymore (since you can always retrieve it again from the type instance). I know Micah mentioned they were using that metadata in BigQuery, but without a registered extension type, so such a use case should not be impacted. This test is for the Parquet roundtrip, but was a similar change done for the plain IPC roundtrip as well? (that currently also preserves the field metadata, even when the type was recognized)
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413757274 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: It depends: reading multiple row groups at once might cache chunks that span row groups (and in general you want more opportunities to coalesce reads if you are aiming for throughput). For more control over that, a consumer could make repeated calls to `FileReader->RowGroup` to get individual RowGroupReaders that would cache only a single RowGroup.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413759177 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: The other option would be to more fully separate the I/O and allow callers to pass a cache for a given read, and have `PreBuffer` instead pre-warm a supplied cache for the given row groups/column indices. Then you would have full control over the cache's lifetime, at the expense of a lot more complicated code (both for the caller and internally).
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760028 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I don't understand: is a separate `ParquetFileReader::Contents` created for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760214 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: `checked_cast` here. Is it possible for `Contents` to be something other than `SerializedFile`?
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413762177 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == { -b'ARROW:extension:metadata': b'freq=D', -b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: The change was actually done on the IPC side and Parquet inherited it, yes.
[GitHub] [arrow] jorisvandenbossche commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
jorisvandenbossche commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618366876 cc @wesm @xhochy @BryanCutler are you fine with 1) a hard minimum required pandas version? (meaning: we don't use the pandas integration if an older version is installed, instead of trying to use it and potentially erroring/failing in certain cases) 2) if ok with 1, pandas 0.23 as the minimal version? (released 2 years ago, but our tests still pass with e.g. 0.21 as well)
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: No, on the contrary, there's only one instance of `Contents`, and hence a single shared cache right now (between all reads of the same file). However, lots of instances of `RowGroupReader::Contents` get created (one per row group per column), so it's not easy to cache each row group separately. Perhaps I'm missing the point: what you'd like is a way to stream record batches out of a file, and have it internally clean up memory for each row group once the data has been fully read, right? (And, not pre-buffer more than some fixed number of row groups ahead of the current one.)
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413783259 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I don't want anything specifically, but I wonder what users of the API may want. If you're reading a 10GB file, in a streaming fashion, from S3, you probably don't want to keep the 10GB in memory, right?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413792469 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: That's fair. In our case, even if a dataset is large, individual files are smaller, so it's fine to buffer an entire file and then discard it. But I agree that for other use cases, this is not ideal. A caller who is very concerned about memory might instead choose to explicitly read only one row group at a time to limit memory usage. This is rather annoying, though. We could create a separate cache per row group, but this means we lose some performance as we can't coalesce reads across row groups anymore. However that might be a worthwhile tradeoff for large files. Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
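The coalescing behavior under discussion can be illustrated with a small stand-alone sketch (a hypothetical helper, not Arrow's actual read cache): nearby byte ranges are merged into a single read whenever the gap between them is small enough that reading and discarding the hole is cheaper than issuing a separate request.

```python
def coalesce_ranges(ranges, hole_size_limit):
    """Merge (offset, length) byte ranges separated by small gaps.

    Ranges are sorted by offset; a range is folded into its predecessor
    when the gap between them is at most hole_size_limit bytes.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, prev_length = merged[-1]
            # Extend the previous range to cover this one (and the hole).
            new_end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, new_end - prev_offset)
        else:
            merged.append((offset, length))
    return merged
```

With `hole_size_limit=64`, the ranges `(0, 100)` and `(150, 100)` coalesce into one read `(0, 250)` (the 50-byte hole is read and discarded), while a range starting at offset 10000 stays a separate request. This also shows why a per-row-group cache loses some performance: reading multiple row groups at once gives the coalescer more ranges to merge.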
[GitHub] [arrow] tustvold commented on issue #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on issue #6980: URL: https://github.com/apache/arrow/pull/6980#issuecomment-618389453 Rebased on current master and the CI now builds :tada:
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413797661 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: It's not possible, but I made it a checked cast regardless.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413798078 ## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc ## @@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) { } BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime(); -BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) { - ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k"); -} -BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime(); +// Helpers to generate various multiple benchmarks for a given Parquet file. + +// NAME: the base name of the benchmark. +// ROWS: the number of rows in the Parquet file. +// COLS: the number of columns in the Parquet file. +// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader) +#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY) \ + BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State & st) { \ +std::vector column_indices(COLS); \ +std::iota(column_indices.begin(), column_indices.end(), 0); \ +std::stringstream ss; \ +ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k"; \ +ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY); \ Review comment: Good idea, I've created a set of helper functions that set up things and then call the benchmark.
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r41381 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413800208 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns. That was my original point, which you said was irrelevant...
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413810237 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: Interesting. Do we use the polymorphism only for tests? ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: To be clear, I don't think this concern should block the PR, but the docstring should warn about it.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413814578 ## File path: python/pyarrow/_parquet.pyx ## @@ -1083,6 +1084,50 @@ cdef class ParquetReader: def set_use_threads(self, bint use_threads): self.reader.get().set_use_threads(use_threads) +def iter_batches(self, int64_t batch_size, row_groups, column_indices=None, Review comment: Ditto, I'd add a method `def set_batch_size(self, int64_t batch_size)`. Why do you need the `batch_size` and `use_threads` if they're already property of the class and the caller can change them via setters of the instance?
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618404946 I rebased and improved the tests slightly. Also opened some issues for some oddities.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413811549 ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -260,12 +260,28 @@ class FileReaderImpl : public FileReader { Status GetRecordBatchReader(const std::vector& row_group_indices, const std::vector& column_indices, - std::unique_ptr* out) override; + std::unique_ptr* out, Review comment: The FileReader object already has an `ArrowReaderProperties` property with this configuration. There is no need to add this extra parameter. Instead, you should add a setter method `set_batch_size()` analogous to `set_use_threads()` (in reader.h). This comment applies to every GetRecordBatchReader method to which you added the parameter.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413812220 ## File path: python/pyarrow/_parquet.pxd ## @@ -334,7 +334,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: ArrowReaderProperties() void set_read_dictionary(int column_index, c_bool read_dict) c_bool read_dictionary() -void set_batch_size() +void set_batch_size(int batch_size) Review comment: It should be `int64_t` instead of `int`.
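The `iter_batches()` method under review essentially walks the file's rows in `batch_size`-sized steps. A minimal stand-alone sketch of that chunking (a hypothetical helper for illustration, not pyarrow's actual `ParquetReader`):

```python
def iter_batch_bounds(num_rows, batch_size):
    """Yield (start_row, length) pairs covering num_rows in batch_size steps.

    The final batch may be shorter when num_rows is not a multiple of
    batch_size, mirroring how a record batch reader drains a file.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = 0
    while start < num_rows:
        length = min(batch_size, num_rows - start)
        yield (start, length)
        start += length
```

For 10 rows and a batch size of 4 this yields `(0, 4)`, `(4, 4)`, `(8, 2)` — three batches, the last one partial.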
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: GitHub acting up, sorry for the delay... > > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns. > > That was my original point, which you said was irrelevant... Apologies, I remember you had made this point on JIRA or the mailing list and was trying to find where (I still can't find where :/) I admit my understanding of the internals is poor, and your comment helped me understand what to look for better. > I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group? Sorry, I was thinking about creating RowGroupReaders from a FileReader, one row group at a time: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi Each time this is called, it internally recreates the cache and buffers a new set of ranges. (Hence the comment above about "unsafe".) So this would let you read a row group's worth of data at a time, at the cost of more complexity for the user. > To be clear, I don't think this concern should block the PR, but the docstring should warn about it. I'll update it.
[GitHub] [arrow] andygrove commented on a change in pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
andygrove commented on a change in pull request #7009: URL: https://github.com/apache/arrow/pull/7009#discussion_r413857080 ## File path: rust/parquet/src/record/api.rs ## @@ -50,6 +50,33 @@ impl Row { pub fn len(&self) -> usize { self.fields.len() } + +pub fn get_column_iter(&self) -> RowColumnIter { +RowColumnIter { +fields: &self.fields, +curr: 0, +count: self.fields.len(), +} +} +} + +pub struct RowColumnIter<'a> { +fields: &'a Vec<(String, Field)>, +curr: usize, +count: usize, +} + +impl<'a> Iterator for RowColumnIter<'a> { +type Item = (usize, &'a String); Review comment: Why do we need the index included in the iterator results? The user could use `enumerate` to get those? Also, would it make more sense for the iterator to be over the `Field` rather than just the field name?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413864236 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: I actually don't think it's even used in tests. SerializedFile is the only implementation.
[GitHub] [arrow] bkietz commented on a change in pull request #6879: ARROW-8377: [CI][C++][R] Build and run C++ tests on Rtools build
bkietz commented on a change in pull request #6879: URL: https://github.com/apache/arrow/pull/6879#discussion_r413877530 ## File path: ci/scripts/PKGBUILD ## @@ -50,6 +52,12 @@ source_dir="$ARROW_HOME" cpp_build_dir=build-${CARCH}-cpp +# This should be "release" for real R package +cpp_build_type="debug" Review comment: This seems like it should be derived from an environment variable near the top of this script: ```suggestion cpp_build_type="${CPP_BUILD_TYPE:-debug}" ``` Then the value of `depends` can include boost iff we're building debug and this script can be used to build release by setting `CPP_BUILD_TYPE=release` or so
[GitHub] [arrow] nealrichardson opened a new pull request #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson opened a new pull request #7019: URL: https://github.com/apache/arrow/pull/7019 See https://github.com/apache/arrow/pull/6996#issuecomment-618053499
[GitHub] [arrow] nealrichardson commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618454886 @github-actions crossbow submit homebrew-cpp
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618456824 https://issues.apache.org/jira/browse/ARROW-8569
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618459883 I'm OK with this. The maintenance burden of supporting several years' worth of pandas releases seems like a lot to bear. If there are parties which are affected by this they should contribute to the maintenance (either monetarily or with their time)
[GitHub] [arrow] mayuropensource opened a new pull request #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource opened a new pull request #7020: URL: https://github.com/apache/arrow/pull/7020

JIRA: https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until https://github.com/apache/arrow/pull/6744 (@lidavidm) is merged; however, it does not need to wait for that pull request.

Description: The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: maximum I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek time, or Time-To-First-Byte (TTFB), in seconds: the call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data, in bytes/sec

1. Computing max_io_gap or hole_size_limit:

max_io_gap = TTFB * BW

This is also called the Bandwidth-Delay Product (BDP). Two byte ranges that have a gap between them can still be mapped to the same read if the gap is smaller than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to cost more than simply reading and discarding the extra bytes on an existing HTTP request.

2. Computing ideal_request_size or range_size_limit:

We want high bandwidth utilization per S3 connection, i.e. to transfer large amounts of data to amortize the seek overhead. But we also want to exploit parallelism by slicing very large I/O chunks. We define two more config parameters, with suggested default values, to control the slice size and balance the two effects, with the goal of maximizing net data-load performance.

BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90%, or 0.9.

MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object request to achieve the effective bandwidth eff_BW = BW_util * BW is given by:

eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying MAX_IDEAL_REQUEST_SIZE, we get:

ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
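The derivation above can be checked with a few lines of arithmetic. The sketch below is illustrative only — the function names and unit choices (millisecond TTFB, MiB/s bandwidth) are assumptions for the example, not the actual Arrow API (the PR's real entry point is `CacheOptions::MakeFromS3Metrics`).

```rust
// Illustrative sketch of the coalescing-parameter math described above.
// Function and parameter names are hypothetical, not Arrow's API.

/// max_io_gap = TTFB * BW (the bandwidth-delay product), in bytes.
fn hole_size_limit(ttfb_millis: u64, bw_mib_per_sec: u64) -> u64 {
    ttfb_millis * bw_mib_per_sec * 1024 * 1024 / 1000
}

/// ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE,
///                          max_io_gap * BW_util / (1 - BW_util)), in bytes.
fn range_size_limit(hole_size_limit: u64, bw_util: f64, max_request_mib: u64) -> u64 {
    let ideal = (hole_size_limit as f64 * bw_util / (1.0 - bw_util)) as u64;
    ideal.min(max_request_mib * 1024 * 1024)
}
```

With the suggested defaults, a 100 ms TTFB on a 100 MiB/s link gives a 10 MiB hole_size_limit, and the resulting ~90 MiB ideal request size is capped at the 64 MiB maximum.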
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618464806 The CI failure looks unrelated, will merge.
[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
paddyhoran commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413903768 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: You can use `to_byte_slice` on a `Vec` or slice also. To be clear, I'm just guessing here. @vertexclique do you want to expand on your comment or can I merge this?
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413905959 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_expected_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed). + // Because only nested lists exist it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there were structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, &valid_bits_offset, + &set_count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, &valid_bits_offset, &set_count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN + +#if defined(ARROW_HAVE_BMI2) +// BMI2 is required for efficient bit extraction. +DefinitionLevelsToB
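For intuition, the scalar fallback quoted in this hunk can be modeled outside of Arrow. This is a hedged, simplified Rust rendition of the non-repeated case only — the C++ original packs the booleans into a bitmap via `BitmapWriter` and also handles the repeated/flat case; the names here are illustrative.

```rust
// Simplified model of DefinitionLevelsToBitmapScalar for a non-repeated
// leaf: a slot is valid iff its definition level equals the maximum
// definition level; any lower level is a null. Levels above the maximum
// are assumed to have been rejected earlier (CheckLevelRange).
fn def_levels_to_validity(def_levels: &[i16], max_def_level: i16) -> (Vec<bool>, i64) {
    let mut null_count: i64 = 0;
    let valid = def_levels
        .iter()
        .map(|&level| {
            debug_assert!(level <= max_def_level, "definition level exceeds maximum");
            let is_valid = level == max_def_level;
            if !is_valid {
                null_count += 1; // level < max_def_level means null
            }
            is_valid
        })
        .collect();
    (valid, null_count)
}
```

For example, levels [2, 1, 2, 0] with a maximum definition level of 2 yield validity [true, false, true, false] and a null count of 2.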
[GitHub] [arrow] github-actions[bot] commented on issue #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
github-actions[bot] commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618472360 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413908372 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_expected_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed). + // Because only nested lists exist it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there were structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, &valid_bits_offset, + &set_count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, &valid_bits_offset, &set_count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN + +#if defined(ARROW_HAVE_BMI2) +// BMI2 is required for efficient bit extraction. +DefinitionLevelsToB
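The SIMD path's building block, `GreaterThanBitmap`, packs up to 64 level comparisons into a single 64-bit mask. Here is a hedged scalar sketch of that idea — the real Arrow implementation is C++ and may use actual SIMD instructions; the function name mirrors the hunk but the body is illustrative.

```rust
// Scalar model of the GreaterThanBitmap idea: bit i of the result is set
// iff levels[i] > rhs. One batch covers at most 64 levels (kBitMaskSize).
fn greater_than_bitmap(levels: &[i16], rhs: i16) -> u64 {
    assert!(levels.len() <= 64, "one batch covers at most 64 levels");
    levels
        .iter()
        .enumerate()
        .fold(0u64, |mask, (i, &level)| mask | (((level > rhs) as u64) << i))
}
```

The SIMD path then combines two such masks — `defined_bitmap` (level > required_definition_level - 1) and `present_bitmap` (level > required_definition_level - 2) — appending only the selected bits to the validity bitmap. For levels [2, 1, 2, 0] and rhs = 1, the mask is 0b0101: bits 0 and 2 are set.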
[GitHub] [arrow] kszucs opened a new pull request #7021: Wrap docker-compose commands with archery [WIP]
kszucs opened a new pull request #7021: URL: https://github.com/apache/arrow/pull/7021
[GitHub] [arrow] mayuropensource commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618486378 I messed up some commits. Will create a new one.
[GitHub] [arrow] lidavidm commented on a change in pull request #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
lidavidm commented on a change in pull request #7020: URL: https://github.com/apache/arrow/pull/7020#discussion_r413923773 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// /brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// /brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: This might be better named as network metrics; it'd be useful for the GCS filesystem that's in progress, for instance. 
## File path: cpp/src/arrow/io/caching.cc ## @@ -27,6 +28,91 @@ namespace arrow { namespace io { + +CacheOptions CacheOptions::MakeFromS3Metrics( +const int64_t time_to_first_byte_millis, const int64_t transfer_bandwidth_mib_per_sec, +const double ideal_bandwidth_utilization_frac, +const int64_t max_ideal_request_size_mib) { + // + // The I/O coalescing algorithm uses two parameters: + // 1. hole_size_limit (a.k.a max_io_gap): Max I/O gap/hole size in bytes + // 2. range_size_limit (a.k.a ideal_request_size): Ideal I/O Request size in bytes + // + // These parameters can be derived from S3 metrics as described below: + // + // In an S3 compatible storage, there are two main metrics: + // 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a new + // S3 request + // 2. Transfer Bandwidth (BW) for data in bytes/sec + // + // 1. Computing hole_size_limit: + // + // hole_size_limit = TTFB * BW + // + // This is also called Bandwidth-Delay-Product (BDP). + // Two byte ranges that have a gap can still be mapped to the same read + // if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], + // i.e. if the Time-To-First-Byte (or call setup latency of a new S3 request) is + // expected to be greater than just reading and discarding the extra bytes on an + // existing HTTP request. + // + // 2. Computing range_size_limit: + // + // We want to have high bandwidth utilization per S3 connections, + // i.e. transfer large amounts of data to amortize the seek overhead. + // But, we also want to leverage parallelism by slicing very large IO chunks. + // We define two more config parameters with suggested default values to control + // the slice size and seek to balance the two effects with the goal of maximizing + // net data load performance. + // + // BW_util_frac (ideal bandwidth utilization): Transfer bandwidth utilization fraction + // (per connection) to maximize the net data load. 
90% is a good default number for + // an effective transfer bandwidth. + // + // MAX_IDEAL_REQUEST_SIZE: The maximum single data request size (in MiB) to maximize + // the net data load. 64 MiB is a good default number for the ideal request size. + // + // The amount of data that needs to be transferred in a single S3 get_object + // request to achieve effective bandwidth eff_BW = BW_util_frac * BW is as follows: + // eff_BW = range_size_limit / (TTFB + range_size_limit / BW) + // + // Substituting TTFB = hole_size_limit / BW and eff_BW = BW_util_frac * BW, we get the + // following result: + // range_size_limit = hole_size_limit * BW_util_frac / (1 - BW_util_frac) + // + // Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: + // range_size_limit = min(MAX_IDEAL_REQUEST_SIZE, + //hole_size_limit * BW_util_fra
[GitHub] [arrow] github-actions[bot] commented on issue #7021: Wrap docker-compose commands with archery [WIP]
github-actions[bot] commented on issue #7021: URL: https://github.com/apache/arrow/pull/7021#issuecomment-618494095 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] mayuropensource opened a new pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource opened a new pull request #7022: URL: https://github.com/apache/arrow/pull/7022

_(Recreating the PR from a clean repo; sorry about the earlier PR, which was not cleanly merged.)_

**JIRA:** https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until #6744 (@lidavidm) is merged; however, it does not need to wait for that pull request.

**Description:** The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: maximum I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek time, or Time-To-First-Byte (TTFB), in seconds: the call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data, in bytes/sec

Computing max_io_gap or hole_size_limit:

max_io_gap = TTFB * BW

This is also called the Bandwidth-Delay Product (BDP). Two byte ranges that have a gap between them can still be mapped to the same read if the gap is smaller than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to cost more than simply reading and discarding the extra bytes on an existing HTTP request.

Computing ideal_request_size or range_size_limit:

We want high bandwidth utilization per S3 connection, i.e. to transfer large amounts of data to amortize the seek overhead. But we also want to exploit parallelism by slicing very large I/O chunks. We define two more config parameters, with suggested default values, to control the slice size and balance the two effects, with the goal of maximizing net data-load performance.

BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90%, or 0.9.

MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object request to achieve the effective bandwidth eff_BW = BW_util * BW is given by:

eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying MAX_IDEAL_REQUEST_SIZE, we get:

ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806

@andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency.

@nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example

```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```

As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for it for later if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:

```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806

@andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency.

@nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example

```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```

As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:

```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: `for _ in 0..n` style code checks the range bounds even if you don't use the loop variable. This is one of the infamous code pieces that Rust has and discourages. In [here]() you can see one of the examples. The range check exists here:
```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```
This call performs a checked_add, which does the range checks. That slows down large iterations, as you can see in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you can. As you can also see, in this very code at least 5 cycles are spent on every range check, so it is better to use iterator methods. Bringing the example here :)
```
(0..n)
    .for_each(|_| {
        self.write_bytes(v.to_byte_slice(), 1)?;
    })
```
This will most probably be pruned completely at `-C opt-level=3` :)
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619 ## File path: rust/parquet/src/encodings/rle.rs ## @@ -830,7 +826,7 @@ mod tests { values.clear(); let mut rng = thread_rng(); let seed_vec: Vec = -Standard.sample_iter(&mut rng).take(seed_len).collect(); +rng.sample_iter::(&Standard).take(seed_len).collect(); Review comment: This is an example of the type inference failures that would become compiler errors with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was:
```
error[E0282]: type annotations needed
--> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter(&mut rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
```
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: `for _ in 0..n` style code checks the range bounds even if you don't use the loop variable. This is one of the infamous code pieces that Rust has. Personally, I avoid it whenever possible :). In [here](https://godbolt.org/z/55sggc) you can see one of the examples. The range check exists here:
```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```
This call performs a checked_add, which does the range checks. That slows down large iterations, as you can see in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you can. As you can also see, in this very code at least 5 cycles are spent on every range check, so it is better to use iterator methods. Bringing the example here :)
```
(0..n)
    .for_each(|_| {
        self.write_bytes(v.to_byte_slice(), 1)?;
    })
```
This will most probably be pruned completely at `-C opt-level=3` :)
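One wrinkle in the suggestion above: the `?` operator can't be used inside a `for_each` closure (which returns `()`), so a fallible body needs `try_for_each` to compile. The sketch below is illustrative only — it uses a plain `Vec<u8>` as a stand-in for the builder's `write_bytes`, not the actual arrow crate API.

```rust
// Iterator-based append_n sketch: repeat a value n times into a byte
// buffer, propagating errors with try_for_each instead of `?` in for_each.
fn append_n(buf: &mut Vec<u8>, n: usize, v: u32) -> Result<(), String> {
    buf.reserve(n * std::mem::size_of::<u32>());
    (0..n).try_for_each(|_| {
        buf.extend_from_slice(&v.to_le_bytes()); // stand-in for write_bytes
        Ok(())
    })
}
```

Calling `append_n(&mut buf, 3, 7)` on an empty `Vec` leaves 12 bytes in the buffer.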
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618506903 From a purely practical standpoint, this PR is ready for further review and merging. If approved, I would probably add some minor issues for the following:
* Trying to avoid the type inference issue.
* Creating a type that can be used to iterate over multiple batches with statically guaranteed same schemas.

Additionally, I'll create a new issue and PR to cover moving DataFusion to using this API.
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806 @andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also tweaked the parquet test to work around the new type inference changes. @nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example
```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```
As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:
```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413953833 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads.
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists Review comment: Since this path is executed only when #if defined(ARROW_HAVE_BMI2) is true, it would be good to add code such as
```
#if defined(ARROW_HAVE_BMI2)
...
#else
assert(false && "must not execute this without BMI2");
#endif
```
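Both the scalar and SIMD paths above reduce to the same mapping from definition levels to validity bits. A minimal sketch of the non-repeated case (in Rust rather than the C++ under review, and covering only a single batch):

```rust
// Scalar sketch of the non-repeated mapping: a definition level equal to
// the maximum marks a present value; anything lower marks a null. Returns
// the validity bits in least-significant-bit order plus the null count,
// assuming at most 64 levels per call, as in the SIMD path's batches.
fn def_levels_to_bitmap(def_levels: &[i16], max_def_level: i16) -> (u64, i64) {
    assert!(def_levels.len() <= 64);
    let mut valid_bits = 0u64;
    let mut null_count = 0i64;
    for (i, &level) in def_levels.iter().enumerate() {
        if level == max_def_level {
            valid_bits |= 1u64 << i;
        } else {
            null_count += 1;
        }
    }
    (valid_bits, null_count)
}
```

The SIMD path in the diff produces the same word per 64-level batch via a vectorized greater-than comparison instead of this per-element branch.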
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619 ## File path: rust/parquet/src/encodings/rle.rs ## @@ -830,7 +826,7 @@ mod tests { values.clear(); let mut rng = thread_rng(); let seed_vec: Vec = -Standard.sample_iter(&mut rng).take(seed_len).collect(); +rng.sample_iter::(&Standard).take(seed_len).collect(); Review comment: This is an example of the type inference that would become a compiler error with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was:
```
error[E0282]: type annotations needed
 --> parquet/src/encodings/rle.rs:833:26
  |
833 | Standard.sample_iter(&mut rng).take(seed_len).collect();
  | ^^^ cannot infer type for `T`
```
Note this would affect all downstream crates, as it did the `parquet` crate.
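The same failure mode can be reproduced without any external crates: `collect` (like `sample_iter` above) is generic over its output, so inference needs a unique solution from context. A crate-free sketch:

```rust
// The annotation on the return type is what lets inference succeed here.
// Deleting `-> Vec<u32>` (with no other annotation) yields the same class
// of error as above: E0282 "type annotations needed" at `collect`.
fn parse_all(input: &str) -> Vec<u32> {
    input
        .split_whitespace()
        .map(|s| s.parse().unwrap()) // parse target inferred from the return type
        .collect()
}
```

Adding a crate like `prettytable-rs` can introduce extra trait impls that remove the previously unique solution, which is why downstream code suddenly needs explicit annotations.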
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413954827 ## File path: cpp/src/parquet/level_conversion_test.cc ## @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "parquet/level_conversion.h" + +#include +#include + +#include + +#include "arrow/util/bit_util.h" + +namespace parquet { +namespace internal { + +using ::testing::ElementsAreArray; + +std::string ToString(const uint8_t* bitmap, int64_t bit_count) { + return arrow::internal::Bitmap(bitmap, /*offset*/ 0, /*length=*/bit_count).ToString(); +} + +std::string ToString(const std::vector& bitmap, int64_t bit_count) { + return ToString(bitmap.data(), bit_count); +} + +TEST(TestGreaterThanBitmap, GeneratesExpectedBitmasks) { Review comment: Most of these tests fail on big-endian platform ``` [ FAILED ] TestAppendBitmap.TestOffsetOverwritesCorrectBitsOnExistingByte [ FAILED ] TestAppendBitmap.TestOffsetShiftBitsCorrectly [ FAILED ] TestAppendBitmap.AllBytesAreWrittenWithEnoughSpace [ FAILED ] TestAppendBitmap.OnlyApproriateBytesWrittenWhenLessThen8BytesAvailable [ FAILED ] TestAppendToValidityBitmap.BasicOperation ``` This is an automated message from the Apache Git Service. 
[GitHub] [arrow] wesm commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
wesm commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618508772 @mayuropensource it's not necessary to open a new PR when you want to redo your commits, you can just force push your branch
[GitHub] [arrow] github-actions[bot] commented on issue #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
github-actions[bot] commented on issue #7022: URL: https://github.com/apache/arrow/pull/7022#issuecomment-618510578 https://issues.apache.org/jira/browse/ARROW-8562
[GitHub] [arrow] markhildreth commented on a change in pull request #7006: ARROW-8508: [Rust] FixedSizeListArray improper offset for value
markhildreth commented on a change in pull request #7006: URL: https://github.com/apache/arrow/pull/7006#discussion_r413958397 ## File path: rust/arrow/src/array/array.rs ## @@ -2592,6 +2619,15 @@ mod tests { assert_eq!(DataType::Int32, list_array.value_type()); assert_eq!(3, list_array.len()); assert_eq!(0, list_array.null_count()); +assert_eq!( Review comment: The rest of the tests were passing previously. This was the failing test. The actual value returned was 0.
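The arithmetic behind this class of offset bug can be sketched with a hypothetical helper (a guess at the fix's shape, not the crate's actual code):

```rust
// For a FixedSizeList whose array data is sliced to start at `array_offset`,
// slot i's child values begin at (array_offset + i) * list_len. Dropping
// `array_offset` from this product silently reads slot 0's values for a
// sliced array, which matches a test observing 0 where it expected real data.
fn child_value_start(array_offset: usize, i: usize, list_len: usize) -> usize {
    (array_offset + i) * list_len
}
```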
[GitHub] [arrow] bryantbiggs commented on issue #4140: ARROW-5123: [Rust] Parquet derive for simple structs
bryantbiggs commented on issue #4140: URL: https://github.com/apache/arrow/pull/4140#issuecomment-618512883 thanks @andygrove !
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413960907 ## File path: cpp/src/parquet/level_conversion.h ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include "arrow/util/bit_util.h" + +#if defined(ARROW_HAVE_BMI2) +#include "x86intrin.h" +#endif + +namespace parquet { +namespace internal { +// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code. +// They currently represent minimal functionality for vectorized computation of definition +// levels. + +/// Builds a bitmap by applying predicate to the level vector provided. +/// +/// \param[in] levels Rep or def level array. +/// \param[in] num_levels The number of levels to process (must be [0, 64]) +/// \param[in] predicate The predicate to apply (must have the signature `bool +/// predicate(int16_t)`. +/// \returns The bitmap using least significant "bit" ordering. +/// +/// N.B. Correct byte ordering is dependent on little-endian architectures. +/// +template +uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) { + // Both clang and GCC can vectorize this automatically with AVX2. 
+ uint64_t mask = 0; + for (int x = 0; x < num_levels; x++) { +mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x; + } + return mask; +} + +/// Builds a bitmap where each set bit indicates the correspond level is greater +/// than rhs. +static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels, +int16_t rhs) { + return LevelsToBitmap(levels, num_levels, [&](int16_t value) { return value > rhs; }); +} + +/// Append bits number_of_bits from new_bits to valid_bits and valid_bits_offset. +/// +/// \param[in] new_bits The zero-padded bitmap to append. +/// \param[in] number_of_bits The number of bits to append from new_bits. +/// \param[in] valid_bits_length The number of bytes allocated in valid_bits. +/// \param[in] valid_bits_offset The bit-offset at which to start appending new bits. +/// \param[in,out] valid_bits The validity bitmap to append to. +/// \returns The new bit offset inside of valid_bits. +static inline int64_t AppendBitmap(uint64_t new_bits, int64_t number_of_bits, + int64_t valid_bits_length, int64_t valid_bits_offset, + uint8_t* valid_bits) { + // Selection masks to retrieve all low order bits for each bytes. + constexpr uint64_t kLsbSelectionMasks[] = { + 0, // unused. + 0x0101010101010101, + 0x0303030303030303, + 0x0707070707070707, + 0x0F0F0F0F0F0F0F0F, + 0x1F1F1F1F1F1F1F1F, + 0x3F3F3F3F3F3F3F3F, + 0x7F7F7F7F7F7F7F7F, + }; + int64_t valid_byte_offset = valid_bits_offset / 8; + int64_t bit_offset = valid_bits_offset % 8; + + int64_t new_offset = valid_bits_offset + number_of_bits; + union ByteAddressableBitmap { Review comment: Why do we need to use a union? `bytes[]` is used only at line 106. It looks simpler to explicitly extract the lowest 8 bits instead of using a union.
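On the union question: the low bytes of the mask word can indeed be reached with plain shifts and masks. A bit-at-a-time sketch of AppendBitmap's contract (semantics only — the C++ above works a byte at a time using the selection masks; this Rust version is deliberately naive):

```rust
// Append the `n` low-order bits of `new_bits` into `bitmap` starting at
// `bit_offset`, LSB-first, and return the new bit offset. Individual bits
// and bytes are reached with shifts and masks alone, so no union is needed.
fn append_bitmap(bitmap: &mut [u8], bit_offset: usize, new_bits: u64, n: usize) -> usize {
    assert!(n <= 64);
    for i in 0..n {
        let pos = bit_offset + i;
        let bit = ((new_bits >> i) & 1) as u8;
        // Clear the target bit, then set it from the source word, so stale
        // bits in a partially written byte are overwritten.
        bitmap[pos / 8] = (bitmap[pos / 8] & !(1u8 << (pos % 8))) | (bit << (pos % 8));
    }
    bit_offset + n
}
```

A vectorized version would write whole bytes (masked at the boundaries), which is where the `kLsbSelectionMasks` table in the C++ comes in.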
[GitHub] [arrow] BryanCutler commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
BryanCutler commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618517973 Sounds good to me. FWIW, Spark also has a minimum Pandas version set at 0.23.2.
[GitHub] [arrow] tustvold commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413969298 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: Thanks, I was not aware of this particular compiler bug. In this particular case the codegen at opt-level 2 or higher is identical; I think the problem as described [here](https://github.com/rust-lang/rust/issues/35981) only manifests on nested for loops. I can update if necessary, but error propagation out of the closure is pretty clunky...
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413970301 ## File path: cpp/src/parquet/level_conversion.h ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include "arrow/util/bit_util.h" + +#if defined(ARROW_HAVE_BMI2) +#include "x86intrin.h" +#endif + +namespace parquet { +namespace internal { +// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code. +// They currently represent minimal functionality for vectorized computation of definition +// levels. + +/// Builds a bitmap by applying predicate to the level vector provided. +/// +/// \param[in] levels Rep or def level array. +/// \param[in] num_levels The number of levels to process (must be [0, 64]) +/// \param[in] predicate The predicate to apply (must have the signature `bool +/// predicate(int16_t)`. +/// \returns The bitmap using least significant "bit" ordering. +/// +/// N.B. Correct byte ordering is dependent on little-endian architectures. +/// +template +uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) { + // Both clang and GCC can vectorize this automatically with AVX2. 
+ uint64_t mask = 0; + for (int x = 0; x < num_levels; x++) { +mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x; + } + return mask; +} + +/// Builds a bitmap where each set bit indicates the correspond level is greater +/// than rhs. +static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels, Review comment: Is this function used only for a little-endian platform?
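On the endianness question, one useful observation: the mask itself is endian-neutral while held in a register — bit i corresponds to levels[i] in LSB order — and byte order only matters once the word is stored into the validity buffer, which is where the header's little-endian note applies. A Rust sketch of the same computation:

```rust
// Bit i of the result is set iff levels[i] > rhs, in LSB-first order;
// at most 64 levels per call, matching the C++ batch size.
fn greater_than_bitmap(levels: &[i16], rhs: i16) -> u64 {
    assert!(levels.len() <= 64);
    levels
        .iter()
        .enumerate()
        .fold(0u64, |mask, (i, &v)| mask | (((v > rhs) as u64) << i))
}
```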
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413972358 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: The place where you are doing byte masking could take a similar approach, since it also accesses by index: https://github.com/apache/arrow/pull/6980/files#diff-76c0ed0285da0826a55cf62676556c87R102 If that's too extreme, just forget about it. :)
[GitHub] [arrow] markhildreth commented on a change in pull request #7006: ARROW-8508: [Rust] FixedSizeListArray improper offset for value
markhildreth commented on a change in pull request #7006: URL: https://github.com/apache/arrow/pull/7006#discussion_r413958397 ## File path: rust/arrow/src/array/array.rs ## @@ -2592,6 +2619,15 @@ mod tests { assert_eq!(DataType::Int32, list_array.value_type()); assert_eq!(3, list_array.len()); assert_eq!(0, list_array.null_count()); +assert_eq!( +3, +list_array +.value(0) +.as_any() +.downcast_ref::() +.unwrap() +.value(0) +); Review comment: The rest of the added tests would have been passing previously. This would be the failing test. The actual value that was being returned was 0.
[GitHub] [arrow] nealrichardson commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618527651 @github-actions crossbow submit homebrew-cpp
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618528458 Revision: 5bcfeab4c9bacc0b3a262a7522bfaf985025d3ec Submitted crossbow builds: [ursa-labs/crossbow @ actions-165](https://github.com/ursa-labs/crossbow/branches/all?query=actions-165)

|Task|Status|
|----|------|
|homebrew-cpp|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-165-travis-homebrew-cpp.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
[GitHub] [arrow] xhochy opened a new pull request #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
xhochy opened a new pull request #7023: URL: https://github.com/apache/arrow/pull/7023
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618506903 From a purely practical standpoint, this PR is ready for further review and merging. If approved, I would probably add some minor JIRA issues for the following:
* Trying to avoid the type inference issue.
* Creating a type that can be used to iterate over multiple batches with statically guaranteed same schemas (looks like `RecordBatchReader` is close to what we would want).

Additionally, I'll create a new issue and PR to cover moving DataFusion to using this API.
[GitHub] [arrow] tustvold commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413987657 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: When unoptimised, both forms contain a checked_add; it is just hard to see because the foreach+closure isn't inlined, and the codegen for the iterator form is significantly worse in general. When optimised, the codegen is identical for both forms. Given this, I'm inclined to optimise for readability and use the simpler form, but ultimately it is up to you as the maintainers :)
[GitHub] [arrow] pitrou commented on a change in pull request #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
pitrou commented on a change in pull request #7023: URL: https://github.com/apache/arrow/pull/7023#discussion_r413987573 ## File path: appveyor.yml ## @@ -61,7 +61,7 @@ environment: - JOB: "Build" GENERATOR: Ninja CONFIGURATION: "Release" - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017 + APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2019 Review comment: Can we switch this one to MSVC 2015 instead? So that we ensure we're still compatible with it.
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618536764 The AppVeyor failure is unrelated
[GitHub] [arrow] github-actions[bot] commented on issue #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
github-actions[bot] commented on issue #7023: URL: https://github.com/apache/arrow/pull/7023#issuecomment-618536730 https://issues.apache.org/jira/browse/ARROW-8571
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618537844 Actually I'll hold off on merging this to confirm that @jorisvandenbossche has done everything that he planned
[GitHub] [arrow] lidavidm commented on a change in pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
lidavidm commented on a change in pull request #7022: URL: https://github.com/apache/arrow/pull/7022#discussion_r413990293 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// \brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// \brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: This might be better named 'NetworkMetrics' or something along those lines; it's not specific to S3 and would likely be useful for the in-progress Google Cloud Storage filesystem, and perhaps others. 
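The two fields documented in the diff, `hole_size_limit` and `range_size_limit`, drive a simple coalescing rule: consecutive read ranges are merged while the gap between them stays within the hole limit and the merged range stays within the size limit. A minimal sketch of that rule follows; this is an illustration of the documented semantics, not Arrow's actual C++ implementation:

```rust
// Illustrative range coalescing driven by the two CacheOptions knobs
// documented above. Types and function names are hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct ReadRange {
    pub offset: u64,
    pub length: u64,
}

pub fn coalesce(
    mut ranges: Vec<ReadRange>,
    hole_size_limit: u64,
    range_size_limit: u64,
) -> Vec<ReadRange> {
    ranges.sort_by_key(|r| r.offset);
    let mut out: Vec<ReadRange> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            let last_end = last.offset + last.length;
            // Gap between the previous range's end and this range's start
            // (zero when the ranges touch or overlap, since input is sorted).
            let gap = r.offset.saturating_sub(last_end);
            let merged_len = (r.offset + r.length) - last.offset;
            // Merge only if the hole is small enough and the combined range
            // does not exceed the size limit.
            if gap <= hole_size_limit && merged_len <= range_size_limit {
                last.length = last.length.max(merged_len);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

With a 50-byte hole between two 100-byte ranges, a `hole_size_limit` of 64 merges them into one request, while a limit of 10 leaves them separate; tuning these values from TTFB and bandwidth is exactly what `MakeFromS3Metrics` is for.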
[GitHub] [arrow] xhochy commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
xhochy commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618538122 👍 A `pandas` version released 2 years ago still sounds very generous. People who cannot upgrade from it to a newer version will probably have the same problems with `pyarrow` updates.
[GitHub] [arrow] houqp commented on a change in pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
houqp commented on a change in pull request #7009: URL: https://github.com/apache/arrow/pull/7009#discussion_r414007919 ## File path: rust/parquet/src/record/api.rs ## @@ -50,6 +50,33 @@ impl Row { pub fn len(&self) -> usize { self.fields.len() } + +pub fn get_column_iter(&self) -> RowColumnIter { +RowColumnIter { +fields: &self.fields, +curr: 0, +count: self.fields.len(), +} +} +} + +pub struct RowColumnIter<'a> { +fields: &'a Vec<(String, Field)>, +curr: usize, +count: usize, +} + +impl<'a> Iterator for RowColumnIter<'a> { +type Item = (usize, &'a String); Review comment: Thanks for the suggestions, I have changed it to return column name and field as a tuple.
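The iterator shape houqp settled on (yielding the column name and its field as a tuple) can be sketched independently of the parquet crate. `Field` below is a hypothetical stand-in for the crate's `Field` enum, kept minimal for illustration:

```rust
// Sketch of a column iterator over a Row's (name, field) pairs, yielding
// references to both, as agreed in the review. `Field` is a stand-in type.
#[derive(Debug, PartialEq)]
pub enum Field {
    Int(i64),
    Str(String),
}

pub struct Row {
    pub fields: Vec<(String, Field)>,
}

impl Row {
    pub fn get_column_iter(&self) -> RowColumnIter<'_> {
        RowColumnIter { fields: &self.fields, curr: 0 }
    }
}

pub struct RowColumnIter<'a> {
    fields: &'a [(String, Field)],
    curr: usize,
}

impl<'a> Iterator for RowColumnIter<'a> {
    // Column name together with its field, rather than an index.
    type Item = (&'a String, &'a Field);

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.fields.get(self.curr)?;
        self.curr += 1;
        Some((&item.0, &item.1))
    }
}
```

Borrowing a slice instead of `&Vec<...>` also makes the `count` field from the original diff unnecessary, since `fields.get(curr)` already terminates the iteration.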
[GitHub] [arrow] mayuropensource commented on a change in pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource commented on a change in pull request #7022: URL: https://github.com/apache/arrow/pull/7022#discussion_r414031141 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// \brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// \brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: good point, I will make the change.
[GitHub] [arrow] mayuropensource commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618577963 @wesm sure thing, I'll keep that in mind in the future.
[GitHub] [arrow] andygrove commented on pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
andygrove commented on pull request #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618597619 @markhildreth This looks great, but is now duplicating the code between arrow and datafusion. Can we remove the datafusion utils copy and have datafusion use the arrow utils instead?