[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413575153

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")

Review comment: Hopefully this is an acceptable change. I'm open to other suggestions on how to enable BMI2 (for some reason "-mbmi2 -mavx2" didn't seem to work for me, but I can try again).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576031

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level

Review comment: Fix comment: it should be "- 2".
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576741

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level

Review comment: I think "- 1" might be confusing here because we subtract below.
[GitHub] [arrow] emkornfield commented on issue #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now.
[GitHub] [arrow] emkornfield edited a comment on issue #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield edited a comment on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now. Will take a closer look at CI failures tomorrow.
[GitHub] [arrow] rdettai commented on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai commented on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread safe?
[GitHub] [arrow] rdettai edited a comment on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai edited a comment on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread safe?
[GitHub] [arrow] nevi-me commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
nevi-me commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618293330

> I also think it is fine for the moment to ignore the use case where the schema varies between record batches and file a separate issue for that.

Just on this point, there shouldn't be a way of creating a record batch that has a mismatching schema.
[GitHub] [arrow] rok commented on issue #6667: ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors to Python
rok commented on issue #6667: URL: https://github.com/apache/arrow/pull/6667#issuecomment-618295149 Thanks @mrkn! :)
[GitHub] [arrow] kszucs commented on issue #6998: ARROW-8541: [Release] Don't remove previous source releases automatically
kszucs commented on issue #6998: URL: https://github.com/apache/arrow/pull/6998#issuecomment-618328429 @kou updated
[GitHub] [arrow] pitrou commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
pitrou commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413696312

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")

Review comment: It would be nice if you could try again.

## File path: cpp/cmake_modules/SetupCxxFlags.cmake

## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
   set(CXX_SUPPORTS_SSE4_2 TRUE)
 else()
   set(ARROW_SSE4_2_FLAG "-msse4.2")
-  set(ARROW_AVX2_FLAG "-mavx2")
+  set(ARROW_AVX2_FLAG "-march=core-avx2")
   # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
   set(ARROW_AVX512_FLAG "-march=skylake-avx512")
   check_cxx_compiler_flag(${ARROW_SSE4_2_FLAG} CXX_SUPPORTS_SSE4_2)
 endif()
 check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)
+check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)

Review comment: This doesn't seem useful :-)

## File path: cpp/src/parquet/column_reader.cc

## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast;
 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written.
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future.
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(
+        def_levels, batch_size, required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // are present through the leaf (i.e. no structs).
+      // Upper-level code only calls this method when the leaf values are
+      // nullable (otherwise no spacing is needed).
+      // Because only nested lists exist, it is sufficient to know that the
+      // field was either null or included (i.e. >= previous definition level
+      // -> > previous definition - 1). If there were structs mixed in, we
+      // would need to know the def_level of the repeated parent so we can
+      // check for def_level > "def level of repeated parent".
+      uint64_t present_bitmap = internal::GreaterThanBitmap(
+          def_levels, batch_size,
[GitHub] [arrow] pitrou commented on issue #6954: ARROW-8440: [C++] Refine SIMD header files
pitrou commented on issue #6954: URL: https://github.com/apache/arrow/pull/6954#issuecomment-618329962 cc @emkornfield
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413715323

## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc

## @@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) {
 }
 BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime();

-BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) {
-  ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k");
-}
-BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime();
+// Helpers to generate various multiple benchmarks for a given Parquet file.
+// NAME: the base name of the benchmark.
+// ROWS: the number of rows in the Parquet file.
+// COLS: the number of columns in the Parquet file.
+// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader).
+#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY)                                 \
+  BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State& st) {  \
+    std::vector<int> column_indices(COLS);                                            \
+    std::iota(column_indices.begin(), column_indices.end(), 0);                       \
+    std::stringstream ss;                                                             \
+    ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k";                          \
+    ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY);           \

Review comment: From a code maintainability standpoint, can we avoid putting so much logic in C macros? Ideally, the macro can call into a templated function.

## File path: cpp/src/parquet/file_reader.cc

## @@ -78,14 +79,46 @@ std::unique_ptr<PageReader> RowGroupReader::GetColumnPageReader(int i) {
 // Returns the rowgroup metadata
 const RowGroupMetaData* RowGroupReader::metadata() const { return contents_->metadata(); }

+/// Compute the section of the file that should be read for the given
+/// row group and column chunk.
+arrow::io::ReadRange ComputeColumnChunkRange(FileMetaData* file_metadata,
+                                             int64_t source_size, int row_group_index,
+                                             int column_index) {
+  auto row_group_metadata = file_metadata->RowGroup(row_group_index);
+  auto column_metadata = row_group_metadata->ColumnChunk(column_index);
+
+  int64_t col_start = column_metadata->data_page_offset();
+  if (column_metadata->has_dictionary_page() &&
+      column_metadata->dictionary_page_offset() > 0 &&
+      col_start > column_metadata->dictionary_page_offset()) {
+    col_start = column_metadata->dictionary_page_offset();
+  }
+
+  int64_t col_length = column_metadata->total_compressed_size();
+  // PARQUET-816 workaround for old files created by older parquet-mr
+  const ApplicationVersion& version = file_metadata->writer_version();
+  if (version.VersionLt(ApplicationVersion::PARQUET_816_FIXED_VERSION())) {
+    // The Parquet MR writer had a bug in 1.2.8 and below where it didn't include the
+    // dictionary page header size in total_compressed_size and total_uncompressed_size
+    // (see IMPALA-694). We add padding to compensate.
+    int64_t bytes_remaining = source_size - (col_start + col_length);
+    int64_t padding = std::min(kMaxDictHeaderSize, bytes_remaining);
+    col_length += padding;
+  }
+
+  return {col_start, col_length};
+}
+
 // RowGroupReader::Contents implementation for the Parquet file specification
 class SerializedRowGroup : public RowGroupReader::Contents {
  public:
-  SerializedRowGroup(std::shared_ptr<ArrowInputFile> source, int64_t source_size,
-                     FileMetaData* file_metadata, int row_group_number,
-                     const ReaderProperties& props,
+  SerializedRowGroup(std::shared_ptr<ArrowInputFile> source,
+                     std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
+                     int64_t source_size, FileMetaData* file_metadata,
+                     int row_group_number, const ReaderProperties& props,
                      std::shared_ptr<InternalFileDecryptor> file_decryptor = nullptr)
       : source_(std::move(source)),
+        cached_source_(cached_source ? std::move(cached_source) : nullptr),

Review comment: I don't think the ternary expression is needed. You should be able to move a null `shared_ptr` alright.

## File path: cpp/src/parquet/file_reader.cc

## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment: Be careful: the cache is unbounded. Does it mean you're willing to let all these row groups survive in memory until the reader gets destroyed?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
jorisvandenbossche commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r413622598

## File path: cpp/src/arrow/dataset/file_parquet.cc

## @@ -402,23 +401,21 @@
 }

 Result<std::shared_ptr<FileFragment>> ParquetFileFormat::MakeFragment(
-    FileSource source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<Expression> partition_expression, std::vector<int> row_groups) {
+    FileSource source, std::shared_ptr<Expression> partition_expression,
+    std::vector<int> row_groups) {
   return std::shared_ptr<FileFragment>(
-      new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options),
+      new ParquetFileFragment(std::move(source), shared_from_this(),
                               std::move(partition_expression), std::move(row_groups)));
 }

 Result<std::shared_ptr<FileFragment>> ParquetFileFormat::MakeFragment(
-    FileSource source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<Expression> partition_expression) {
-  return std::shared_ptr<FileFragment>(
-      new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options),
-                              std::move(partition_expression), {}));
+    FileSource source, std::shared_ptr<Expression> partition_expression) {
+  return std::shared_ptr<FileFragment>(new ParquetFileFragment(
+      std::move(source), shared_from_this(), std::move(partition_expression), {}));
 }

 Result<FragmentIterator> ParquetFileFormat::GetRowGroupFragments(
-    const ParquetFileFragment& fragment, std::shared_ptr<Expression> extra_filter) {
+    const ParquetFileFragment& fragment, std::shared_ptr<Expression> filter) {

Review comment: Small comment: in the header file it is still called "extra_filter".
[GitHub] [arrow] nevi-me opened a new pull request #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
nevi-me opened a new pull request #7018: URL: https://github.com/apache/arrow/pull/7018 When a user compiles the `flight` crate, a `build.rs` script is invoked. This script recursively looks for the `format/Flight.proto` path. A user might not have that path, as they would not have cloned the arrow repository, and as such, the build fails. This change: a) checks whether the `../../format/Flight.proto` path exists, and only builds if the file exists and has been changed; b) checks the proto file into the `src/` directory, changing the default location from an opaque directory in the `target/{debug|release}/build-xxx` folder. The rationale for checking in the proto file is that if the file is not rebuilt, `flight` users are still able to access it. It's also beneficial for users to easily view the file and better understand what the generated code looks like.
[GitHub] [arrow] github-actions[bot] commented on issue #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
github-actions[bot] commented on issue #7018: URL: https://github.com/apache/arrow/pull/7018#issuecomment-618349230 https://issues.apache.org/jira/browse/ARROW-8536
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
jorisvandenbossche commented on a change in pull request #6992: URL: https://github.com/apache/arrow/pull/6992#discussion_r413740270

## File path: python/pyarrow/pandas-shim.pxi

## @@ -55,6 +55,16 @@ cdef class _PandasAPIShim(object):
         from distutils.version import LooseVersion
         self._loose_version = LooseVersion(pd.__version__)
+        if self._loose_version < LooseVersion('0.23.0'):
+            self._have_pandas = False
+            if raise_:
+                raise ImportError(
+                    "pyarrow requires pandas 0.23.0 or above, pandas {} is "
+                    "installed".format(self._version)
+                )
+            else:
+                return

Review comment: Yes, that's a good idea.
[GitHub] [arrow] wesm commented on issue #7017: suggestion: why not serialize complex numbers in a Python list/dict/set
wesm commented on issue #7017: URL: https://github.com/apache/arrow/issues/7017#issuecomment-618352283 Can you send an email to one of the mailing lists or open a JIRA if you want to propose a development project?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413744080

## File path: cpp/src/parquet/file_reader.cc

## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment: We could add a method to clear/remove the cache, or the user might recreate the ParquetFileReader (as the cache is shared among all readers created from the file anyway).
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413747770 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I guess my question is: if I'm reading one record batch at a time (in streaming fashion), shouldn't the cache be per-record batch?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
jorisvandenbossche commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413750283 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == { -b'ARROW:extension:metadata': b'freq=D', -b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: I don't have a good sense of what people do with this metadata, but I suppose when the type was recognized, it should not be a problem if the metadata is not present anymore (since you can always retrieve it again from the type instance). I know Micah mentioned they were using that metadata in BigQuery, but without a registered extension type, so such a use case should not be impacted. This test is for the Parquet roundtrip, but was a similar change done for the plain IPC roundtrip as well? (that currently also preserves the field metadata, even when the type was recognized)
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413757274 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: It depends: reading multiple row groups at once might cache chunks that span row groups (and in general you want more opportunities to coalesce reads if you are aiming for throughput). For more control over that, a consumer could make repeated calls to `FileReader->RowGroup` to get individual RowGroupReaders that would cache only a single RowGroup.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413759177 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: The other option would be to more fully separate the I/O and allow callers to pass a cache for a given read, and have `PreBuffer` instead pre-warm a supplied cache for the given row groups/column indices. Then you would have full control over the cache's lifetime, at the expense of a lot more complicated code (both for the caller and internally).
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760028 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I don't understand: is a separate `ParquetFileReader::Contents` created for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760214 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: `checked_cast` here. Is it possible for `Contents` to be something other than `SerializedFile`?
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413762177 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == { -b'ARROW:extension:metadata': b'freq=D', -b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: The change was actually done on the IPC side and Parquet inherited it, yes.
[GitHub] [arrow] jorisvandenbossche commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
jorisvandenbossche commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618366876 cc @wesm @xhochy @BryanCutler are you fine with 1) a hard minimum required pandas version? (meaning: we don't use the pandas integration if an older version is installed, instead of trying to use it and potentially erroring/failing in certain cases) 2) if ok with 1, pandas 0.23 as the minimal version? (released 2 years ago, but our tests still pass with e.g. 0.21 as well)
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: No, on the contrary, there's only one instance of `Contents`, and hence a single shared cache right now (between all reads of the same file). However, lots of instances of `RowGroupReader::Contents` get created (one per row group per column), so it's not easy to cache each row group separately. Perhaps I'm missing the point: what you'd like is a way to stream record batches out of a file, and have it internally clean up memory for each row group once the data has been fully read, right? (And, not pre-buffer more than some fixed number of row groups ahead of the current one.)
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413783259 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I don't want anything specifically, but I wonder what users of the API may want. If you're reading a 10GB file, in a streaming fashion, from S3, you probably don't want to keep the 10GB in memory, right?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413792469 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: That's fair. In our case, even if a dataset is large, individual files are smaller, so it's fine to buffer an entire file and then discard it. But I agree that for other use cases, this is not ideal. A caller who is very concerned about memory might instead choose to explicitly read only one row group at a time to limit memory usage. This is rather annoying, though. We could create a separate cache per row group, but this means we lose some performance as we can't coalesce reads across row groups anymore. However that might be a worthwhile tradeoff for large files. Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
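The coalescing behavior under discussion can be illustrated with a small stand-alone sketch (a hypothetical helper, not Arrow's actual read cache): nearby byte ranges are merged into a single read whenever the gap between them is small enough that reading and discarding the hole is cheaper than issuing a separate request.

```python
def coalesce_ranges(ranges, hole_size_limit):
    """Merge (offset, length) byte ranges separated by small gaps.

    Ranges are sorted by offset; a range is folded into its predecessor
    when the gap between them is at most hole_size_limit bytes.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= hole_size_limit:
            prev_offset, prev_length = merged[-1]
            # Extend the previous range to cover this one (and the hole).
            new_end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, new_end - prev_offset)
        else:
            merged.append((offset, length))
    return merged
```

With `hole_size_limit=64`, the ranges `(0, 100)` and `(150, 100)` coalesce into one read `(0, 250)` (the 50-byte hole is read and discarded), while a range starting at offset 10000 stays a separate request. This also shows why a per-row-group cache loses some performance: reading multiple row groups at once gives the coalescer more ranges to merge.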
[GitHub] [arrow] tustvold commented on issue #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on issue #6980: URL: https://github.com/apache/arrow/pull/6980#issuecomment-618389453 Rebased on current master and the CI now builds :tada:
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413797661 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: It's not possible, but I made it a checked cast regardless.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413798078 ## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc ## @@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) { } BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime(); -BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) { - ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k"); -} -BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime(); +// Helpers to generate various multiple benchmarks for a given Parquet file. + +// NAME: the base name of the benchmark. +// ROWS: the number of rows in the Parquet file. +// COLS: the number of columns in the Parquet file. +// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader) +#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY) \ + BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State & st) { \ +std::vector column_indices(COLS); \ +std::iota(column_indices.begin(), column_indices.end(), 0); \ +std::stringstream ss; \ +ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k"; \ +ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY); \ Review comment: Good idea, I've created a set of helper functions that set up things and then call the benchmark.
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r41381 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413800208 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns. That was my original point, which you said was irrelevant...
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413810237 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: Interesting. Do we use the polymorphism only for tests? ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: To be clear, I don't think this concern should block the PR, but the docstring should warn about it.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413814578 ## File path: python/pyarrow/_parquet.pyx ## @@ -1083,6 +1084,50 @@ cdef class ParquetReader: def set_use_threads(self, bint use_threads): self.reader.get().set_use_threads(use_threads) +def iter_batches(self, int64_t batch_size, row_groups, column_indices=None, Review comment: Ditto, I'd add a method `def set_batch_size(self, int64_t batch_size)`. Why do you need the `batch_size` and `use_threads` if they're already property of the class and the caller can change them via setters of the instance?
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618404946 I rebased and improved the tests slightly. Also opened some issues for some oddities.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413811549 ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -260,12 +260,28 @@ class FileReaderImpl : public FileReader { Status GetRecordBatchReader(const std::vector& row_group_indices, const std::vector& column_indices, - std::unique_ptr* out) override; + std::unique_ptr* out, Review comment: The FileReader object already has an `ArrowReaderProperties` property with this configuration. There is no need to add this extra parameter. Instead, you should add a setter method `set_batch_size()` analogous to `set_use_threads()` (in reader.h). This comment applies to every GetRecordBatchReader method to which you added the parameter.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413812220 ## File path: python/pyarrow/_parquet.pxd ## @@ -334,7 +334,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: ArrowReaderProperties() void set_read_dictionary(int column_index, c_bool read_dict) c_bool read_dictionary() -void set_batch_size() +void set_batch_size(int batch_size) Review comment: It should be `int64_t` instead of `int`.
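The `iter_batches()` method under review essentially walks the file's rows in `batch_size`-sized steps. A minimal stand-alone sketch of that chunking (a hypothetical helper for illustration, not pyarrow's actual `ParquetReader`):

```python
def iter_batch_bounds(num_rows, batch_size):
    """Yield (start_row, length) pairs covering num_rows in batch_size steps.

    The final batch may be shorter when num_rows is not a multiple of
    batch_size, mirroring how a record batch reader drains a file.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = 0
    while start < num_rows:
        length = min(batch_size, num_rows - start)
        yield (start, length)
        start += length
```

For 10 rows and a batch size of 4 this yields `(0, 4)`, `(4, 4)`, `(8, 2)` — three batches, the last one partial.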
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: GitHub acting up, sorry for the delay... > > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns. > > That was my original point, which you said was irrelevant... Apologies, I remember you had made this point on JIRA or the mailing list and was trying to find where (I still can't find where :/) I admit my understanding of the internals is poor, and your comment helped me understand what to look for better. > I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group? Sorry, I was thinking about creating RowGroupReaders from a FileReader, one row group at a time: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi Each time this is called, it internally recreates the cache and buffers a new set of ranges. (Hence the comment above about "unsafe".) So this would let you read a row group's worth of data at a time, at the cost of more complexity for the user. > To be clear, I don't think this concern should block the PR, but the docstring should warn about it. I'll update it.
[GitHub] [arrow] andygrove commented on a change in pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
andygrove commented on a change in pull request #7009: URL: https://github.com/apache/arrow/pull/7009#discussion_r413857080 ## File path: rust/parquet/src/record/api.rs ## @@ -50,6 +50,33 @@ impl Row { pub fn len(&self) -> usize { self.fields.len() } + +pub fn get_column_iter(&self) -> RowColumnIter { +RowColumnIter { +fields: &self.fields, +curr: 0, +count: self.fields.len(), +} +} +} + +pub struct RowColumnIter<'a> { +fields: &'a Vec<(String, Field)>, +curr: usize, +count: usize, +} + +impl<'a> Iterator for RowColumnIter<'a> { +type Item = (usize, &'a String); Review comment: Why do we need the index included in the iterator results? The user could use `enumerate` to get those? Also, would it make more sense for the iterator to be over the `Field` rather than just the field name?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413864236 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: I actually don't think it's even used in tests. SerializedFile is the only implementation.
[GitHub] [arrow] bkietz commented on a change in pull request #6879: ARROW-8377: [CI][C++][R] Build and run C++ tests on Rtools build
bkietz commented on a change in pull request #6879: URL: https://github.com/apache/arrow/pull/6879#discussion_r413877530 ## File path: ci/scripts/PKGBUILD ## @@ -50,6 +52,12 @@ source_dir="$ARROW_HOME" cpp_build_dir=build-${CARCH}-cpp +# This should be "release" for real R package +cpp_build_type="debug" Review comment: This seems like it should be derived from an environment variable near the top of this script: ```suggestion cpp_build_type="${CPP_BUILD_TYPE:-debug}" ``` Then the value of `depends` can include boost iff we're building debug and this script can be used to build release by setting `CPP_BUILD_TYPE=release` or so
[GitHub] [arrow] nealrichardson opened a new pull request #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson opened a new pull request #7019: URL: https://github.com/apache/arrow/pull/7019 See https://github.com/apache/arrow/pull/6996#issuecomment-618053499
[GitHub] [arrow] nealrichardson commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618454886 @github-actions crossbow submit homebrew-cpp
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618456824 https://issues.apache.org/jira/browse/ARROW-8569
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618459883 I'm OK with this. The maintenance burden of supporting several years' worth of pandas releases seems like a lot to bear. If there are parties which are affected by this they should contribute to the maintenance (either monetarily or with their time)
[GitHub] [arrow] mayuropensource opened a new pull request #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource opened a new pull request #7020: URL: https://github.com/apache/arrow/pull/7020

JIRA: https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until https://github.com/apache/arrow/pull/6744 (@lidavidm) is merged; however, it does not need to wait for that pull request.

Description: The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: maximum I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek time, or Time-To-First-Byte (TTFB), in seconds: the call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data, in bytes/sec

1. Computing max_io_gap or hole_size_limit:

max_io_gap = TTFB * BW

This is also called the Bandwidth-Delay Product (BDP). Two byte ranges that have a gap between them can still be mapped to the same read if the gap is smaller than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to cost more than simply reading and discarding the extra bytes on an existing HTTP request.

2. Computing ideal_request_size or range_size_limit:

We want high bandwidth utilization per S3 connection, i.e. to transfer large amounts of data to amortize the seek overhead. But we also want to exploit parallelism by slicing very large I/O chunks. We define two more config parameters, with suggested default values, to control the slice size and balance the two effects, with the goal of maximizing net data-load performance.

BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90%, or 0.9.

MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object request to achieve the effective bandwidth eff_BW = BW_util * BW is given by:

eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying MAX_IDEAL_REQUEST_SIZE, we get:

ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
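The derivation above can be checked with a few lines of arithmetic. The sketch below is illustrative only — the function names and unit choices (millisecond TTFB, MiB/s bandwidth) are assumptions for the example, not the actual Arrow API (the PR's real entry point is `CacheOptions::MakeFromS3Metrics`).

```rust
// Illustrative sketch of the coalescing-parameter math described above.
// Function and parameter names are hypothetical, not Arrow's API.

/// max_io_gap = TTFB * BW (the bandwidth-delay product), in bytes.
fn hole_size_limit(ttfb_millis: u64, bw_mib_per_sec: u64) -> u64 {
    ttfb_millis * bw_mib_per_sec * 1024 * 1024 / 1000
}

/// ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE,
///                          max_io_gap * BW_util / (1 - BW_util)), in bytes.
fn range_size_limit(hole_size_limit: u64, bw_util: f64, max_request_mib: u64) -> u64 {
    let ideal = (hole_size_limit as f64 * bw_util / (1.0 - bw_util)) as u64;
    ideal.min(max_request_mib * 1024 * 1024)
}
```

With the suggested defaults, a 100 ms TTFB on a 100 MiB/s link gives a 10 MiB hole_size_limit, and the resulting ~90 MiB ideal request size is capped at the 64 MiB maximum.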
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618464806 The CI failure looks unrelated, will merge.
[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
paddyhoran commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413903768 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: You can use `to_byte_slice` on a `Vec` or slice also. To be clear, I'm just guessing here. @vertexclique do you want to expand on your comment or can I merge this?
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413905959 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_expected_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed). + // Because only nested lists exist it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there were structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, &valid_bits_offset, + &set_count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, &valid_bits_offset, &set_count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN + +#if defined(ARROW_HAVE_BMI2) +// BMI2 is required for efficient bit extraction. +DefinitionLevelsToB
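For intuition, the scalar fallback quoted in this hunk can be modeled outside of Arrow. This is a hedged, simplified Rust rendition of the non-repeated case only — the C++ original packs the booleans into a bitmap via `BitmapWriter` and also handles the repeated/flat case; the names here are illustrative.

```rust
// Simplified model of DefinitionLevelsToBitmapScalar for a non-repeated
// leaf: a slot is valid iff its definition level equals the maximum
// definition level; any lower level is a null. Levels above the maximum
// are assumed to have been rejected earlier (CheckLevelRange).
fn def_levels_to_validity(def_levels: &[i16], max_def_level: i16) -> (Vec<bool>, i64) {
    let mut null_count: i64 = 0;
    let valid = def_levels
        .iter()
        .map(|&level| {
            debug_assert!(level <= max_def_level, "definition level exceeds maximum");
            let is_valid = level == max_def_level;
            if !is_valid {
                null_count += 1; // level < max_def_level means null
            }
            is_valid
        })
        .collect();
    (valid, null_count)
}
```

For example, levels [2, 1, 2, 0] with a maximum definition level of 2 yield validity [true, false, true, false] and a null count of 2.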
[GitHub] [arrow] github-actions[bot] commented on issue #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
github-actions[bot] commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618472360 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413908372 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_expected_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed). + // Because only nested lists exist it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there were structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, &valid_bits_offset, + &set_count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, &valid_bits_offset, &set_count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN + +#if defined(ARROW_HAVE_BMI2) +// BMI2 is required for efficient bit extraction. +DefinitionLevelsToB
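The SIMD path's building block, `GreaterThanBitmap`, packs up to 64 level comparisons into a single 64-bit mask. Here is a hedged scalar sketch of that idea — the real Arrow implementation is C++ and may use actual SIMD instructions; the function name mirrors the hunk but the body is illustrative.

```rust
// Scalar model of the GreaterThanBitmap idea: bit i of the result is set
// iff levels[i] > rhs. One batch covers at most 64 levels (kBitMaskSize).
fn greater_than_bitmap(levels: &[i16], rhs: i16) -> u64 {
    assert!(levels.len() <= 64, "one batch covers at most 64 levels");
    levels
        .iter()
        .enumerate()
        .fold(0u64, |mask, (i, &level)| mask | (((level > rhs) as u64) << i))
}
```

The SIMD path then combines two such masks — `defined_bitmap` (level > required_definition_level - 1) and `present_bitmap` (level > required_definition_level - 2) — appending only the selected bits to the validity bitmap. For levels [2, 1, 2, 0] and rhs = 1, the mask is 0b0101: bits 0 and 2 are set.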
[GitHub] [arrow] kszucs opened a new pull request #7021: Wrap docker-compose commands with archery [WIP]
kszucs opened a new pull request #7021: URL: https://github.com/apache/arrow/pull/7021
[GitHub] [arrow] mayuropensource commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618486378 I messed up some commits. Will create a new one.
[GitHub] [arrow] lidavidm commented on a change in pull request #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
lidavidm commented on a change in pull request #7020: URL: https://github.com/apache/arrow/pull/7020#discussion_r413923773 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// /brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// /brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: This might be better named as network metrics; it'd be useful for the GCS filesystem that's in progress, for instance. 
## File path: cpp/src/arrow/io/caching.cc ## @@ -27,6 +28,91 @@ namespace arrow { namespace io { + +CacheOptions CacheOptions::MakeFromS3Metrics( +const int64_t time_to_first_byte_millis, const int64_t transfer_bandwidth_mib_per_sec, +const double ideal_bandwidth_utilization_frac, +const int64_t max_ideal_request_size_mib) { + // + // The I/O coalescing algorithm uses two parameters: + // 1. hole_size_limit (a.k.a max_io_gap): Max I/O gap/hole size in bytes + // 2. range_size_limit (a.k.a ideal_request_size): Ideal I/O Request size in bytes + // + // These parameters can be derived from S3 metrics as described below: + // + // In an S3 compatible storage, there are two main metrics: + // 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a new + // S3 request + // 2. Transfer Bandwidth (BW) for data in bytes/sec + // + // 1. Computing hole_size_limit: + // + // hole_size_limit = TTFB * BW + // + // This is also called Bandwidth-Delay-Product (BDP). + // Two byte ranges that have a gap can still be mapped to the same read + // if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], + // i.e. if the Time-To-First-Byte (or call setup latency of a new S3 request) is + // expected to be greater than just reading and discarding the extra bytes on an + // existing HTTP request. + // + // 2. Computing range_size_limit: + // + // We want to have high bandwidth utilization per S3 connections, + // i.e. transfer large amounts of data to amortize the seek overhead. + // But, we also want to leverage parallelism by slicing very large IO chunks. + // We define two more config parameters with suggested default values to control + // the slice size and seek to balance the two effects with the goal of maximizing + // net data load performance. + // + // BW_util_frac (ideal bandwidth utilization): Transfer bandwidth utilization fraction + // (per connection) to maximize the net data load. 
90% is a good default number for + // an effective transfer bandwidth. + // + // MAX_IDEAL_REQUEST_SIZE: The maximum single data request size (in MiB) to maximize + // the net data load. 64 MiB is a good default number for the ideal request size. + // + // The amount of data that needs to be transferred in a single S3 get_object + // request to achieve effective bandwidth eff_BW = BW_util_frac * BW is as follows: + // eff_BW = range_size_limit / (TTFB + range_size_limit / BW) + // + // Substituting TTFB = hole_size_limit / BW and eff_BW = BW_util_frac * BW, we get the + // following result: + // range_size_limit = hole_size_limit * BW_util_frac / (1 - BW_util_frac) + // + // Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: + // range_size_limit = min(MAX_IDEAL_REQUEST_SIZE, + //hole_size_limit * BW_util_fra
[GitHub] [arrow] github-actions[bot] commented on issue #7021: Wrap docker-compose commands with archery [WIP]
github-actions[bot] commented on issue #7021: URL: https://github.com/apache/arrow/pull/7021#issuecomment-618494095 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] mayuropensource opened a new pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource opened a new pull request #7022: URL: https://github.com/apache/arrow/pull/7022

_(Recreating the PR from a clean repo; sorry about the earlier PR, which was not cleanly merged.)_

**JIRA:** https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until #6744 (@lidavidm) is merged; however, it does not need to wait for that pull request.

**Description:** The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: maximum I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek time, or Time-To-First-Byte (TTFB), in seconds: the call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data, in bytes/sec

Computing max_io_gap or hole_size_limit:

max_io_gap = TTFB * BW

This is also called the Bandwidth-Delay Product (BDP). Two byte ranges that have a gap between them can still be mapped to the same read if the gap is smaller than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to cost more than simply reading and discarding the extra bytes on an existing HTTP request.

Computing ideal_request_size or range_size_limit:

We want high bandwidth utilization per S3 connection, i.e. to transfer large amounts of data to amortize the seek overhead. But we also want to exploit parallelism by slicing very large I/O chunks. We define two more config parameters, with suggested default values, to control the slice size and balance the two effects, with the goal of maximizing net data-load performance.

BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90%, or 0.9.

MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object request to achieve the effective bandwidth eff_BW = BW_util * BW is given by:

eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying MAX_IDEAL_REQUEST_SIZE, we get:

ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806

@andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency.

@nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example

```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```

As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for it for later if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:

```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806

@andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency.

@nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example

```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```

As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:

```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: `for _ in 0..n` style code checks the range bounds even if you don't use the loop variable. This is one of the infamous code pieces that Rust has and discourages. In [here]() you can see one of the examples. The range check exists here:
```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```
This call performs a checked_add, which does the range checks. That slows down large iterations, as you can see in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you can. As you can also see, in this very code at least 5 cycles are spent on every range check, so it is better to use iterator methods. Bringing the example here :)
```
(0..n)
    .for_each(|_| {
        self.write_bytes(v.to_byte_slice(), 1)?;
    })
```
This will most probably be pruned completely at `-C opt-level=3` :)
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619 ## File path: rust/parquet/src/encodings/rle.rs ## @@ -830,7 +826,7 @@ mod tests { values.clear(); let mut rng = thread_rng(); let seed_vec: Vec = -Standard.sample_iter(&mut rng).take(seed_len).collect(); +rng.sample_iter::(&Standard).take(seed_len).collect(); Review comment: This is an example of the type inference failures that would become compiler errors with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was:
```
error[E0282]: type annotations needed
--> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter(&mut rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
```
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: `for _ in 0..n` style code checks the range bounds even if you don't use the loop variable. This is one of the infamous code pieces that Rust has. Personally, I avoid it whenever possible :). In [here](https://godbolt.org/z/55sggc) you can see one of the examples. The range check exists here:
```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```
This call performs a checked_add, which does the range checks. That slows down large iterations, as you can see in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you can. As you can also see, in this very code at least 5 cycles are spent on every range check, so it is better to use iterator methods. Bringing the example here :)
```
(0..n)
    .for_each(|_| {
        self.write_bytes(v.to_byte_slice(), 1)?;
    })
```
This will most probably be pruned completely at `-C opt-level=3` :)
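One wrinkle in the suggestion above: the `?` operator can't be used inside a `for_each` closure (which returns `()`), so a fallible body needs `try_for_each` to compile. The sketch below is illustrative only — it uses a plain `Vec<u8>` as a stand-in for the builder's `write_bytes`, not the actual arrow crate API.

```rust
// Iterator-based append_n sketch: repeat a value n times into a byte
// buffer, propagating errors with try_for_each instead of `?` in for_each.
fn append_n(buf: &mut Vec<u8>, n: usize, v: u32) -> Result<(), String> {
    buf.reserve(n * std::mem::size_of::<u32>());
    (0..n).try_for_each(|_| {
        buf.extend_from_slice(&v.to_le_bytes()); // stand-in for write_bytes
        Ok(())
    })
}
```

Calling `append_n(&mut buf, 3, 7)` on an empty `Vec` leaves 12 bytes in the buffer.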
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618506903 From a purely practical standpoint, this PR is ready for further review and merging. If approved, I would probably add some minor issues for the following:
* Trying to avoid the type inference issue.
* Creating a type that can be used to iterate over multiple batches with statically guaranteed same schemas.

Additionally, I'll create a new issue and PR to cover moving DataFusion to using this API.
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806 @andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also tweaked the parquet test to work around the new type inference changes. @nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example
```
let schema1 = ...
let csv1 = ...
let batch1 = ...

let schema2 = ... // different schema
let csv2 = ...
let batch2 = ...

print_batches(&[batch1, batch2]);
```
As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table:
```
+---+---+---+---+
| city | lat | lng | |
+---+---+---+---+
| Elgin, Scotland, the UK | 57.653484 | -3.335724 | |
| Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | |
| Solihull, Birmingham, UK | 52.412811 | -1.778197 | |
| Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | |
| 1 | 1.1 | 1.11 | true |
| 2 | 2.2 | 2.22 | true |
| 3 | 0 | 3.33 | true |
| 4 | 4.4 | | false |
| 5 | 6.6 | | false |
+---+---+---+---+
```
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413953833 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits<int16_t>::max(); + int16_t max_level = std::numeric_limits<int16_t>::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads.
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template <bool has_repeated_parent> +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists Review comment: Since this path is executed only when #if defined(ARROW_HAVE_BMI2) is true, it would be good to add code such as
```
#if defined(ARROW_HAVE_BMI2)
...
#else
assert(false && "must not execute this without BMI2");
#endif
```
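Both the scalar and SIMD paths above reduce to the same mapping from definition levels to validity bits. A minimal sketch of the non-repeated case (in Rust rather than the C++ under review, and covering only a single batch):

```rust
// Scalar sketch of the non-repeated mapping: a definition level equal to
// the maximum marks a present value; anything lower marks a null. Returns
// the validity bits in least-significant-bit order plus the null count,
// assuming at most 64 levels per call, as in the SIMD path's batches.
fn def_levels_to_bitmap(def_levels: &[i16], max_def_level: i16) -> (u64, i64) {
    assert!(def_levels.len() <= 64);
    let mut valid_bits = 0u64;
    let mut null_count = 0i64;
    for (i, &level) in def_levels.iter().enumerate() {
        if level == max_def_level {
            valid_bits |= 1u64 << i;
        } else {
            null_count += 1;
        }
    }
    (valid_bits, null_count)
}
```

The SIMD path in the diff produces the same word per 64-level batch via a vectorized greater-than comparison instead of this per-element branch.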
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619 ## File path: rust/parquet/src/encodings/rle.rs ## @@ -830,7 +826,7 @@ mod tests { values.clear(); let mut rng = thread_rng(); let seed_vec: Vec = -Standard.sample_iter(&mut rng).take(seed_len).collect(); +rng.sample_iter::(&Standard).take(seed_len).collect(); Review comment: This is an example of the type inference that would become a compiler error with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was:
```
error[E0282]: type annotations needed
 --> parquet/src/encodings/rle.rs:833:26
  |
833 | Standard.sample_iter(&mut rng).take(seed_len).collect();
  | ^^^ cannot infer type for `T`
```
Note this would affect all downstream crates, as it did the `parquet` crate.
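The same failure mode can be reproduced without any external crates: `collect` (like `sample_iter` above) is generic over its output, so inference needs a unique solution from context. A crate-free sketch:

```rust
// The annotation on the return type is what lets inference succeed here.
// Deleting `-> Vec<u32>` (with no other annotation) yields the same class
// of error as above: E0282 "type annotations needed" at `collect`.
fn parse_all(input: &str) -> Vec<u32> {
    input
        .split_whitespace()
        .map(|s| s.parse().unwrap()) // parse target inferred from the return type
        .collect()
}
```

Adding a crate like `prettytable-rs` can introduce extra trait impls that remove the previously unique solution, which is why downstream code suddenly needs explicit annotations.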
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413954827 ## File path: cpp/src/parquet/level_conversion_test.cc ## @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "parquet/level_conversion.h" + +#include +#include + +#include + +#include "arrow/util/bit_util.h" + +namespace parquet { +namespace internal { + +using ::testing::ElementsAreArray; + +std::string ToString(const uint8_t* bitmap, int64_t bit_count) { + return arrow::internal::Bitmap(bitmap, /*offset*/ 0, /*length=*/bit_count).ToString(); +} + +std::string ToString(const std::vector& bitmap, int64_t bit_count) { + return ToString(bitmap.data(), bit_count); +} + +TEST(TestGreaterThanBitmap, GeneratesExpectedBitmasks) { Review comment: Most of these tests fail on big-endian platform ``` [ FAILED ] TestAppendBitmap.TestOffsetOverwritesCorrectBitsOnExistingByte [ FAILED ] TestAppendBitmap.TestOffsetShiftBitsCorrectly [ FAILED ] TestAppendBitmap.AllBytesAreWrittenWithEnoughSpace [ FAILED ] TestAppendBitmap.OnlyApproriateBytesWrittenWhenLessThen8BytesAvailable [ FAILED ] TestAppendToValidityBitmap.BasicOperation ``` This is an automated message from the Apache Git Service. 
[GitHub] [arrow] wesm commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
wesm commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618508772 @mayuropensource it's not necessary to open a new PR when you want to redo your commits, you can just force push your branch
[GitHub] [arrow] github-actions[bot] commented on issue #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
github-actions[bot] commented on issue #7022: URL: https://github.com/apache/arrow/pull/7022#issuecomment-618510578 https://issues.apache.org/jira/browse/ARROW-8562
[GitHub] [arrow] markhildreth commented on a change in pull request #7006: ARROW-8508: [Rust] FixedSizeListArray improper offset for value
markhildreth commented on a change in pull request #7006: URL: https://github.com/apache/arrow/pull/7006#discussion_r413958397 ## File path: rust/arrow/src/array/array.rs ## @@ -2592,6 +2619,15 @@ mod tests { assert_eq!(DataType::Int32, list_array.value_type()); assert_eq!(3, list_array.len()); assert_eq!(0, list_array.null_count()); +assert_eq!( Review comment: The rest of the tests were passing previously. This was the failing test. The actual value returned was 0.
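The arithmetic behind this class of offset bug can be sketched with a hypothetical helper (a guess at the fix's shape, not the crate's actual code):

```rust
// For a FixedSizeList whose array data is sliced to start at `array_offset`,
// slot i's child values begin at (array_offset + i) * list_len. Dropping
// `array_offset` from this product silently reads slot 0's values for a
// sliced array, which matches a test observing 0 where it expected real data.
fn child_value_start(array_offset: usize, i: usize, list_len: usize) -> usize {
    (array_offset + i) * list_len
}
```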
[GitHub] [arrow] bryantbiggs commented on issue #4140: ARROW-5123: [Rust] Parquet derive for simple structs
bryantbiggs commented on issue #4140: URL: https://github.com/apache/arrow/pull/4140#issuecomment-618512883 thanks @andygrove !
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413960907 ## File path: cpp/src/parquet/level_conversion.h ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include "arrow/util/bit_util.h" + +#if defined(ARROW_HAVE_BMI2) +#include "x86intrin.h" +#endif + +namespace parquet { +namespace internal { +// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code. +// They currently represent minimal functionality for vectorized computation of definition +// levels. + +/// Builds a bitmap by applying predicate to the level vector provided. +/// +/// \param[in] levels Rep or def level array. +/// \param[in] num_levels The number of levels to process (must be [0, 64]) +/// \param[in] predicate The predicate to apply (must have the signature `bool +/// predicate(int16_t)`. +/// \returns The bitmap using least significant "bit" ordering. +/// +/// N.B. Correct byte ordering is dependent on little-endian architectures. +/// +template +uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) { + // Both clang and GCC can vectorize this automatically with AVX2. 
+ uint64_t mask = 0; + for (int x = 0; x < num_levels; x++) { +mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x; + } + return mask; +} + +/// Builds a bitmap where each set bit indicates the correspond level is greater +/// than rhs. +static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels, +int16_t rhs) { + return LevelsToBitmap(levels, num_levels, [&](int16_t value) { return value > rhs; }); +} + +/// Append bits number_of_bits from new_bits to valid_bits and valid_bits_offset. +/// +/// \param[in] new_bits The zero-padded bitmap to append. +/// \param[in] number_of_bits The number of bits to append from new_bits. +/// \param[in] valid_bits_length The number of bytes allocated in valid_bits. +/// \param[in] valid_bits_offset The bit-offset at which to start appending new bits. +/// \param[in,out] valid_bits The validity bitmap to append to. +/// \returns The new bit offset inside of valid_bits. +static inline int64_t AppendBitmap(uint64_t new_bits, int64_t number_of_bits, + int64_t valid_bits_length, int64_t valid_bits_offset, + uint8_t* valid_bits) { + // Selection masks to retrieve all low order bits for each bytes. + constexpr uint64_t kLsbSelectionMasks[] = { + 0, // unused. + 0x0101010101010101, + 0x0303030303030303, + 0x0707070707070707, + 0x0F0F0F0F0F0F0F0F, + 0x1F1F1F1F1F1F1F1F, + 0x3F3F3F3F3F3F3F3F, + 0x7F7F7F7F7F7F7F7F, + }; + int64_t valid_byte_offset = valid_bits_offset / 8; + int64_t bit_offset = valid_bits_offset % 8; + + int64_t new_offset = valid_bits_offset + number_of_bits; + union ByteAddressableBitmap { Review comment: Why do we need to use a union? `bytes[]` is used only at line 106. It looks simpler to explicitly extract the lowest 8 bits instead of using a union.
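On the union question: the low bytes of the mask word can indeed be reached with plain shifts and masks. A bit-at-a-time sketch of AppendBitmap's contract (semantics only — the C++ above works a byte at a time using the selection masks; this Rust version is deliberately naive):

```rust
// Append the `n` low-order bits of `new_bits` into `bitmap` starting at
// `bit_offset`, LSB-first, and return the new bit offset. Individual bits
// and bytes are reached with shifts and masks alone, so no union is needed.
fn append_bitmap(bitmap: &mut [u8], bit_offset: usize, new_bits: u64, n: usize) -> usize {
    assert!(n <= 64);
    for i in 0..n {
        let pos = bit_offset + i;
        let bit = ((new_bits >> i) & 1) as u8;
        // Clear the target bit, then set it from the source word, so stale
        // bits in a partially written byte are overwritten.
        bitmap[pos / 8] = (bitmap[pos / 8] & !(1u8 << (pos % 8))) | (bit << (pos % 8));
    }
    bit_offset + n
}
```

A vectorized version would write whole bytes (masked at the boundaries), which is where the `kLsbSelectionMasks` table in the C++ comes in.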
[GitHub] [arrow] BryanCutler commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
BryanCutler commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618517973 Sounds good to me. FWIW, Spark also has a minimum Pandas version set at 0.23.2.
[GitHub] [arrow] tustvold commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413969298 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: Thanks, I was not aware of this particular compiler bug. In this particular case the codegen at opt-level 2 or higher is identical; I think the problem as described [here](https://github.com/rust-lang/rust/issues/35981) only manifests on nested for loops. I can update if necessary, but error propagation out of the closure is pretty clunky...
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413970301 ## File path: cpp/src/parquet/level_conversion.h ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include "arrow/util/bit_util.h" + +#if defined(ARROW_HAVE_BMI2) +#include "x86intrin.h" +#endif + +namespace parquet { +namespace internal { +// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code. +// They currently represent minimal functionality for vectorized computation of definition +// levels. + +/// Builds a bitmap by applying predicate to the level vector provided. +/// +/// \param[in] levels Rep or def level array. +/// \param[in] num_levels The number of levels to process (must be [0, 64]) +/// \param[in] predicate The predicate to apply (must have the signature `bool +/// predicate(int16_t)`. +/// \returns The bitmap using least significant "bit" ordering. +/// +/// N.B. Correct byte ordering is dependent on little-endian architectures. +/// +template +uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) { + // Both clang and GCC can vectorize this automatically with AVX2. 
+ uint64_t mask = 0; + for (int x = 0; x < num_levels; x++) { +mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x; + } + return mask; +} + +/// Builds a bitmap where each set bit indicates the correspond level is greater +/// than rhs. +static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels, Review comment: Is this function used only for a little-endian platform?
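On the endianness question, one useful observation: the mask itself is endian-neutral while held in a register — bit i corresponds to levels[i] in LSB order — and byte order only matters once the word is stored into the validity buffer, which is where the header's little-endian note applies. A Rust sketch of the same computation:

```rust
// Bit i of the result is set iff levels[i] > rhs, in LSB-first order;
// at most 64 levels per call, matching the C++ batch size.
fn greater_than_bitmap(levels: &[i16], rhs: i16) -> u64 {
    assert!(levels.len() <= 64);
    levels
        .iter()
        .enumerate()
        .fold(0u64, |mask, (i, &v)| mask | (((v > rhs) as u64) << i))
}
```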
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413972358 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: The place where you are doing byte masking could take a similar approach, since it also accesses by index: https://github.com/apache/arrow/pull/6980/files#diff-76c0ed0285da0826a55cf62676556c87R102 If that's too extreme, just forget about it. :)
[GitHub] [arrow] markhildreth commented on a change in pull request #7006: ARROW-8508: [Rust] FixedSizeListArray improper offset for value
markhildreth commented on a change in pull request #7006: URL: https://github.com/apache/arrow/pull/7006#discussion_r413958397 ## File path: rust/arrow/src/array/array.rs ## @@ -2592,6 +2619,15 @@ mod tests { assert_eq!(DataType::Int32, list_array.value_type()); assert_eq!(3, list_array.len()); assert_eq!(0, list_array.null_count()); +assert_eq!( +3, +list_array +.value(0) +.as_any() +.downcast_ref::() +.unwrap() +.value(0) +); Review comment: The rest of the added tests would have been passing previously. This would be the failing test. The actual value that was being returned was 0.
[GitHub] [arrow] nealrichardson commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618527651 @github-actions crossbow submit homebrew-cpp
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618528458 Revision: 5bcfeab4c9bacc0b3a262a7522bfaf985025d3ec Submitted crossbow builds: [ursa-labs/crossbow @ actions-165](https://github.com/ursa-labs/crossbow/branches/all?query=actions-165)

|Task|Status|
|----|------|
|homebrew-cpp|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-165-travis-homebrew-cpp.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
[GitHub] [arrow] xhochy opened a new pull request #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
xhochy opened a new pull request #7023: URL: https://github.com/apache/arrow/pull/7023
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618506903 From a purely practical standpoint, this PR is ready for further review and merging. If approved, I would probably add some minor JIRA issues for the following:
* Trying to avoid the type inference issue.
* Creating a type that can be used to iterate over multiple batches with statically guaranteed same schemas (looks like `RecordBatchReader` is close to what we would want).

Additionally, I'll create a new issue and PR to cover moving DataFusion to using this API.
[GitHub] [arrow] tustvold commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413987657 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: When unoptimised, both forms contain a checked_add; it is just hard to see because the foreach+closure isn't inlined, and the codegen for the iterator form is significantly worse in general. When optimised, the codegen is identical for both forms. Given this, I'm inclined to optimise for readability and use the simpler form, but ultimately it is up to you as the maintainers :)
[GitHub] [arrow] pitrou commented on a change in pull request #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
pitrou commented on a change in pull request #7023: URL: https://github.com/apache/arrow/pull/7023#discussion_r413987573 ## File path: appveyor.yml ## @@ -61,7 +61,7 @@ environment: - JOB: "Build" GENERATOR: Ninja CONFIGURATION: "Release" - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017 + APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2019 Review comment: Can we switch this one to MSVC 2015 instead? So that we ensure we're still compatible with it.
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618536764 The AppVeyor failure is unrelated
[GitHub] [arrow] github-actions[bot] commented on issue #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
github-actions[bot] commented on issue #7023: URL: https://github.com/apache/arrow/pull/7023#issuecomment-618536730 https://issues.apache.org/jira/browse/ARROW-8571
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618537844 Actually I'll hold off on merging this to confirm that @jorisvandenbossche has done everything that he planned
[GitHub] [arrow] lidavidm commented on a change in pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
lidavidm commented on a change in pull request #7022: URL: https://github.com/apache/arrow/pull/7022#discussion_r413990293 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// \brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// \brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: This might be better named 'NetworkMetrics' or something along those lines; it's not specific to S3 and would likely be useful for the in-progress Google Cloud Storage filesystem, and perhaps others. 
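The two fields documented in the diff, `hole_size_limit` and `range_size_limit`, drive a simple coalescing rule: consecutive read ranges are merged while the gap between them stays within the hole limit and the merged range stays within the size limit. A minimal sketch of that rule follows; this is an illustration of the documented semantics, not Arrow's actual C++ implementation:

```rust
// Illustrative range coalescing driven by the two CacheOptions knobs
// documented above. Types and function names are hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct ReadRange {
    pub offset: u64,
    pub length: u64,
}

pub fn coalesce(
    mut ranges: Vec<ReadRange>,
    hole_size_limit: u64,
    range_size_limit: u64,
) -> Vec<ReadRange> {
    ranges.sort_by_key(|r| r.offset);
    let mut out: Vec<ReadRange> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            let last_end = last.offset + last.length;
            // Gap between the previous range's end and this range's start
            // (zero when the ranges touch or overlap, since input is sorted).
            let gap = r.offset.saturating_sub(last_end);
            let merged_len = (r.offset + r.length) - last.offset;
            // Merge only if the hole is small enough and the combined range
            // does not exceed the size limit.
            if gap <= hole_size_limit && merged_len <= range_size_limit {
                last.length = last.length.max(merged_len);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

With a 50-byte hole between two 100-byte ranges, a `hole_size_limit` of 64 merges them into one request, while a limit of 10 leaves them separate; tuning these values from TTFB and bandwidth is exactly what `MakeFromS3Metrics` is for.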
[GitHub] [arrow] xhochy commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
xhochy commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618538122 👍 A `pandas` version released 2 years ago still sounds very generous. People who cannot upgrade from it to a newer version will probably have the same problems with `pyarrow` updates.
[GitHub] [arrow] houqp commented on a change in pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
houqp commented on a change in pull request #7009: URL: https://github.com/apache/arrow/pull/7009#discussion_r414007919 ## File path: rust/parquet/src/record/api.rs ## @@ -50,6 +50,33 @@ impl Row { pub fn len(&self) -> usize { self.fields.len() } + +pub fn get_column_iter(&self) -> RowColumnIter { +RowColumnIter { +fields: &self.fields, +curr: 0, +count: self.fields.len(), +} +} +} + +pub struct RowColumnIter<'a> { +fields: &'a Vec<(String, Field)>, +curr: usize, +count: usize, +} + +impl<'a> Iterator for RowColumnIter<'a> { +type Item = (usize, &'a String); Review comment: Thanks for the suggestions, I have changed it to return column name and field as a tuple.
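The iterator shape houqp settled on (yielding the column name and its field as a tuple) can be sketched independently of the parquet crate. `Field` below is a hypothetical stand-in for the crate's `Field` enum, kept minimal for illustration:

```rust
// Sketch of a column iterator over a Row's (name, field) pairs, yielding
// references to both, as agreed in the review. `Field` is a stand-in type.
#[derive(Debug, PartialEq)]
pub enum Field {
    Int(i64),
    Str(String),
}

pub struct Row {
    pub fields: Vec<(String, Field)>,
}

impl Row {
    pub fn get_column_iter(&self) -> RowColumnIter<'_> {
        RowColumnIter { fields: &self.fields, curr: 0 }
    }
}

pub struct RowColumnIter<'a> {
    fields: &'a [(String, Field)],
    curr: usize,
}

impl<'a> Iterator for RowColumnIter<'a> {
    // Column name together with its field, rather than an index.
    type Item = (&'a String, &'a Field);

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.fields.get(self.curr)?;
        self.curr += 1;
        Some((&item.0, &item.1))
    }
}
```

Borrowing a slice instead of `&Vec<...>` also makes the `count` field from the original diff unnecessary, since `fields.get(curr)` already terminates the iteration.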
[GitHub] [arrow] mayuropensource commented on a change in pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource commented on a change in pull request #7022: URL: https://github.com/apache/arrow/pull/7022#discussion_r414031141 ## File path: cpp/src/arrow/io/caching.h ## @@ -27,6 +27,44 @@ namespace arrow { namespace io { + +struct ARROW_EXPORT CacheOptions { + static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9; + static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64; + + /// \brief The maximum distance in bytes between two consecutive + /// ranges; beyond this value, ranges are not combined + int64_t hole_size_limit; + /// \brief The maximum size in bytes of a combined range; if + /// combining two consecutive ranges would produce a range of a + /// size greater than this, they are not combined + int64_t range_size_limit; + + bool operator==(const CacheOptions& other) const { +return hole_size_limit == other.hole_size_limit && + range_size_limit == other.range_size_limit; + } + + /// \brief Construct CacheOptions from S3 storage metrics. + /// + /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in + /// milliseconds, also called call setup latency of a new S3 request. + /// The value is a positive integer. + /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec. + /// The value is a positive integer. + /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction + /// (per connection) to maximize the net data load. + /// The value is a positive double precision number less than or equal to 1. + /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB) + /// to maximize the net data load. + /// The value is a positive integer. + /// \return A new instance of CacheOptions. + static CacheOptions MakeFromS3Metrics( Review comment: good point, I will make the change.
[GitHub] [arrow] mayuropensource commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618577963 @wesm sure thing, I'll keep that in mind in the future.
[GitHub] [arrow] andygrove commented on pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
andygrove commented on pull request #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618597619 @markhildreth This looks great, but is now duplicating the code between arrow and datafusion. Can we remove the datafusion utils copy and have datafusion use the arrow utils instead?