[GitHub] [arrow] zgramana edited a comment on pull request #6121: ARROW-6603: [C#] - Nullable Array Support
zgramana edited a comment on pull request #6121: URL: https://github.com/apache/arrow/pull/6121#issuecomment-618756547

@eerhardt I'd like to chime in to say that I have come across this conversation late--and *after* implementing an alternative approach which is much more in line with other Arrow implementations. In particular, I really just finished out the previously stubbed `NullBitmapBuffer` already in the codebase. The most significant change comes with the following addition to `IArrowArrayBuilder`:

```csharp
public interface IArrowArrayBuilder<T, TArray, TBuilder> : IArrowArrayBuilder<TArray, TBuilder>
    where TArray : IArrowArray
    where TBuilder : IArrowArrayBuilder<TArray>
{
    TBuilder Append(T value);
    TBuilder Append(ReadOnlySpan<T> span);
    TBuilder AppendRange(IEnumerable<T> values);
    TBuilder AppendNull(); // <- Works with non-nullable types too
    TBuilder Swap(int i, int j);
    TBuilder Set(int index, T value);
}
```

The work includes implementing `AppendNull` in `PrimitiveArrayBuilder` and `Binary.BuilderBase`, and then removing the hardcoded `0`s passed to `nullCount` in the `ArrayData` constructors. I also added two tests to `ArrayBuilderTests`:

* The first covers a number of `null` scenarios using `TestArrayBuilder(...)`.
* The second covers a `StringArray.Builder` scenario with mixed `null` and `string.Empty` values. I implemented it such that `string.Empty` is considered a valid value (so it increments `Offset` without adding any bytes to `ValueBuffer`). When a null string is passed to `Append`, it invokes `AppendNull` internally instead.
* All 160 (existing + new) tests pass. :-)

I have requested internal approval to submit a PR and get a CCLA signed. If you can hold off on merging this PR, I would like the opportunity to have my alternate approach considered as well.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
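For readers less familiar with Arrow's validity-bitmap convention that `AppendNull` builds on, here is a minimal stand-alone sketch (in Rust for brevity; `TinyBuilder` and its layout are invented for illustration and are not Arrow's actual builder): one bit per slot, 1 = valid, 0 = null, with `append_null` clearing the bit, writing a placeholder value, and tracking the real null count instead of a hardcoded zero.

```rust
// Invented miniature builder illustrating Arrow's validity-bitmap layout:
// one bit per slot (LSB-first within each byte), 1 = valid, 0 = null.
struct TinyBuilder {
    values: Vec<i32>,   // value buffer; null slots hold a placeholder
    validity: Vec<u8>,  // null bitmap buffer
    len: usize,
    null_count: usize,
}

impl TinyBuilder {
    fn new() -> Self {
        TinyBuilder { values: Vec::new(), validity: Vec::new(), len: 0, null_count: 0 }
    }

    fn push_bit(&mut self, valid: bool) {
        if self.len % 8 == 0 {
            self.validity.push(0); // start a new bitmap byte
        }
        if valid {
            self.validity[self.len / 8] |= 1 << (self.len % 8);
        }
        self.len += 1;
    }

    // Appending a value sets the slot's validity bit.
    fn append(&mut self, v: i32) {
        self.values.push(v);
        self.push_bit(true);
    }

    // Appending a null leaves the bit clear and bumps the null count,
    // which is what replaces a hardcoded 0 in the array metadata.
    fn append_null(&mut self) {
        self.values.push(0); // placeholder; readers consult the bitmap
        self.push_bit(false);
        self.null_count += 1;
    }
}

fn main() {
    let mut b = TinyBuilder::new();
    b.append(7);
    b.append_null();
    b.append(9);
    assert_eq!((b.len, b.null_count), (3, 1));
    assert_eq!(b.validity, vec![0b101]);
    println!("null_count = {}", b.null_count);
}
```

The same bookkeeping generalizes to variable-length types such as strings, where an empty value advances the offsets buffer while a null clears the validity bit.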
[GitHub] [arrow] sunchao edited a comment on pull request #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
sunchao edited a comment on pull request #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618740530

Yes, I think it is beneficial to avoid dropping the buffer with `seek`, although it would be nice if `seek_relative` were stabilized soon so we could just use that.

> Also, I'm not sure why a Mutex was used here. Do we agree that RefCell is better, as the whole reader is not thread safe, right?

[Originally](https://github.com/sunchao/parquet-rs/pull/49) we designed it this way so that we could concurrently read multiple column chunks after obtaining file handles from a single row group. Since the file handle is shared between these readers, we wanted to provide thread safety on top of it.
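For context, the behavior under discussion is observable with the standard library alone: `BufReader::seek` always discards the internal buffer, while `seek_relative` (unstable at the time of this thread, stabilized in later Rust releases) keeps it when the target stays within the buffered bytes. A small sketch:

```rust
// Demonstrates the buffer-dropping behavior discussed above using only std.
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

fn main() {
    let mut reader = BufReader::new(Cursor::new(vec![0u8; 1024]));

    // Prime the internal buffer with a small read.
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte).unwrap();
    assert!(!reader.buffer().is_empty()); // readahead bytes are buffered

    // seek() always discards the internal buffer, even for tiny moves...
    reader.seek(SeekFrom::Current(1)).unwrap();
    assert!(reader.buffer().is_empty());

    // ...while seek_relative() keeps it when the target is already buffered.
    reader.read_exact(&mut byte).unwrap();
    reader.seek_relative(1).unwrap();
    assert!(!reader.buffer().is_empty());
    println!("ok");
}
```

For a Parquet reader that hops between nearby column-chunk offsets, avoiding the discard saves refilling the readahead buffer on every small seek.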
[GitHub] [arrow] cyb70289 commented on pull request #6954: ARROW-8440: [C++] Refine SIMD header files
cyb70289 commented on pull request #6954: URL: https://github.com/apache/arrow/pull/6954#issuecomment-618782973

> Is the function `Armv8CrcHashParallel` used somewhere? Sorry if I overlook it.

It's not used. Actually, the whole file `hash_util.h` is unused, per [this comment](https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/hash_util.h#L20). From the git history, I see the CRC-based hash was replaced with xxHash3. I'm glad to remove this file if the community thinks it's OK.
[GitHub] [arrow] github-actions[bot] commented on pull request #7028: ARROW-8575: [Developer] Add issue_comment workflow to rebase a PR
github-actions[bot] commented on pull request #7028: URL: https://github.com/apache/arrow/pull/7028#issuecomment-618731413 https://issues.apache.org/jira/browse/ARROW-8575
[GitHub] [arrow] paddyhoran commented on a change in pull request #6306: ARROW-7705: [Rust] Initial sort implementation
paddyhoran commented on a change in pull request #6306: URL: https://github.com/apache/arrow/pull/6306#discussion_r414234200

File path: rust/arrow/src/compute/kernels/sort.rs

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! Defines sort kernel for `ArrayRef`

use std::cmp::Reverse;

use crate::array::*;
use crate::compute::take;
use crate::datatypes::*;
use crate::error::{ArrowError, Result};

use TimeUnit::*;

/// Sort the `ArrayRef` using `SortOptions`.
///
/// Performs a stable sort on values and indices, returning nulls after sorted valid values,
/// while preserving the order of the nulls.
///
/// Returns and `ArrowError::ComputeError(String)` if the array type is either unsupported by `sort_to_indices` or `take`.
```

Review comment: nit: and -> an
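The contract stated in the doc comment above (stable sort of valid values, with nulls appended afterwards in their original order) can be sketched independently of the arrow crate; `sort_to_indices` here is a simplified stand-in, not the actual kernel:

```rust
// Simplified sketch of the sort-to-indices contract described above:
// stable sort of the valid values' indices, with null slots appended
// afterwards in their original order.
fn sort_to_indices(values: &[Option<i32>]) -> Vec<usize> {
    let mut valid: Vec<usize> =
        (0..values.len()).filter(|&i| values[i].is_some()).collect();
    // sort_by_key is a stable sort, matching the kernel's stability guarantee
    valid.sort_by_key(|&i| values[i]);
    // Nulls keep their relative order by iterating in index order.
    let nulls = (0..values.len()).filter(|&i| values[i].is_none());
    valid.extend(nulls);
    valid
}

fn main() {
    let v = [Some(3), None, Some(1), None, Some(2)];
    // Valid values 1, 2, 3 come first; nulls at slots 1 and 3 follow in order.
    assert_eq!(sort_to_indices(&v), vec![2, 4, 0, 1, 3]);
    println!("ok");
}
```

The real kernel then feeds these indices to `take` to materialize the sorted array.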
[GitHub] [arrow] nealrichardson opened a new pull request #7028: ARROW-8575: [Developer] Add issue_comment workflow to rebase a PR
nealrichardson opened a new pull request #7028: URL: https://github.com/apache/arrow/pull/7028

Instead of adding a PR comment of "This needs rebase" and waiting for the author to get around to it, with this workflow you can just type "rebase" and GHA will do it for you. If it rebases cleanly, the workflow force-pushes; see [this demonstration](https://github.com/nealrichardson/arrow/runs/613698512?check_suite_focus=true#step:5:6) on my fork. If there are merge conflicts, the PR will probably tell you, so you know not to try to auto-rebase; but if you do run this workflow and the rebase is not clean, it exits with an error and does not push anything (see [this example](https://github.com/nealrichardson/arrow/runs/613691328?check_suite_focus=true)).

Relatedly, see https://github.com/r-lib/actions/pull/90, where I add the ability to pass args to `git push` in the action. This workflow is currently set to run using my fork, which is fine. If that PR is merged before this one is, we can switch back to using upstream; otherwise, I'll switch in ARROW-8489.

Recall that you can't test issue_comment workflow changes on PRs themselves, because issue_comment workflows always run off of master; so if you make a "rebase" comment on this PR, it won't do anything.
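The rebase the workflow automates boils down to a short git sequence; the sketch below replays it in a throwaway repository (branch and file names are invented, and the final force-push is shown only as a comment, since pushing requires the real remote):

```shell
# Replay of the workflow's rebase in a throwaway repo (names invented).
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email "ci@example.com"
git config user.name "ci"
base=$(git symbolic-ref --short HEAD)   # default branch name varies by git version

echo base > a.txt && git add a.txt && git commit -qm "base"
git checkout -qb feature
echo feature > b.txt && git add b.txt && git commit -qm "feature change"

# Meanwhile the default branch moves ahead, as upstream master would.
git checkout -q "$base"
echo upstream > c.txt && git add c.txt && git commit -qm "upstream change"

# The step the workflow automates. On success it would then force-push
# (git push -f origin feature); on conflicts, rebase exits nonzero and
# nothing is pushed, matching the behavior described above.
git checkout -q feature
git rebase -q "$base"
git log --oneline
```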
[GitHub] [arrow] mcassels commented on a change in pull request #6770: ARROW-7842: [Rust] [Parquet] implement array_reader for list type columns
mcassels commented on a change in pull request #6770: URL: https://github.com/apache/arrow/pull/6770#discussion_r414276380

File path: rust/datafusion/src/utils.rs

```diff
@@ -120,6 +143,7 @@ pub fn array_value_to_string(column: array::ArrayRef, row: usize) -> Result<String>
             make_string!(array::Time64NanosecondArray, column, row)
         }
+        DataType::List(_) => make_string_from_list!(column, row),
```

Review comment: This recursively calls `array_value_to_string` on the array items, so an unsupported type should give the appropriate error on the recursive call.
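The recursion argument can be illustrated with a self-contained sketch (the `Val` enum and `Unsupported` variant are invented; this is not the DataFusion code): the list branch calls back into the scalar formatter, so an unsupported inner type surfaces as the same error from the recursive call.

```rust
// Stand-alone illustration of the recursion discussed above.
#[derive(Clone)]
enum Val {
    Int(i64),
    List(Vec<Val>),
    Unsupported, // stands in for a type the formatter cannot handle
}

fn value_to_string(v: &Val) -> Result<String, String> {
    match v {
        Val::Int(i) => Ok(i.to_string()),
        // The list arm recurses into each item; the first Err bubbles up
        // unchanged thanks to collect() on Result.
        Val::List(items) => {
            let parts: Result<Vec<String>, String> =
                items.iter().map(value_to_string).collect();
            Ok(format!("[{}]", parts?.join(", ")))
        }
        Val::Unsupported => Err("unsupported type".to_string()),
    }
}

fn main() {
    let ok = Val::List(vec![Val::Int(1), Val::Int(2)]);
    assert_eq!(value_to_string(&ok).unwrap(), "[1, 2]");
    // The error from the inner item surfaces through the recursive call.
    let bad = Val::List(vec![Val::Int(1), Val::Unsupported]);
    assert!(value_to_string(&bad).is_err());
    println!("ok");
}
```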
[GitHub] [arrow] mcassels commented on a change in pull request #6770: ARROW-7842: [Rust] [Parquet] implement array_reader for list type columns
mcassels commented on a change in pull request #6770: URL: https://github.com/apache/arrow/pull/6770#discussion_r414276861

File path: rust/datafusion/src/logicalplan.rs

```diff
@@ -828,8 +828,8 @@ mod tests {
             .build()?;
         let expected = "Projection: #id\
-\n  Selection: #state Eq Utf8(\"CO\")\
-\nTableScan: employee.csv projection=Some([0, 3])";
+\n  Selection: #state Eq Utf8(\"CO\")\
+\nTableScan: employee.csv projection=Some([0, 3])";
```

Review comment: Yes.
[GitHub] [arrow] paddyhoran commented on a change in pull request #7004: ARROW-3827: [Rust] Implement UnionArray Updated
paddyhoran commented on a change in pull request #7004: URL: https://github.com/apache/arrow/pull/7004#discussion_r414240892

File path: rust/arrow/src/array/equal.rs

```rust
impl JsonEqual for UnionArray {
    fn equals_json(&self, _json: &[&Value]) -> bool {
        unimplemented!()
    }
}

impl PartialEq<Value> for UnionArray {
    fn eq(&self, json: &Value) -> bool {
        match json {
            Value::Array(json_array) => self.equals_json_values(json_array),
            _ => false,
        }
    }
}

impl PartialEq<UnionArray> for Value {
    fn eq(&self, arrow: &UnionArray) -> bool {
        match self {
```

Review comment: Actually, I was just following the pattern here; I thought that they must both be needed for IPC. I removed them. I'll add them back when I get to implementing IPC, if needed.
[GitHub] [arrow] paddyhoran commented on pull request #7004: ARROW-3827: [Rust] Implement UnionArray Updated
paddyhoran commented on pull request #7004: URL: https://github.com/apache/arrow/pull/7004#issuecomment-618761556

@andygrove I'm just going to leave a general comment, as it's all related. Overall, I felt this PR was getting big, and I was trying to avoid getting into the IPC work here; this is the reason I left some things `unimplemented`. I created follow-up JIRAs to address all these issues. Both `JsonEquals` (ARROW-8547) and `ArrayEquals` (ARROW-8576) were introduced to compare arrays as part of the IPC implementation, but they are required in order to implement the `Array` trait. Adding `UnionArray` to `get_fb_field_type` is obviously IPC-related as well (ARROW-8546). I waited until the recent release was cut before submitting this, so as to make sure that the `unimplemented` calls didn't get released. I'll follow up on the items above once I get this much merged.
[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
paddyhoran commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r414230710

File path: rust/arrow/src/array/builder.rs

```rust
        self.write_bytes(v.to_byte_slice(), 1)
    }

    default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
        self.reserve(n)?;
        for _ in 0..n {
            self.write_bytes(v.to_byte_slice(), 1)?;
        }
```

Review comment: I'm not sure that there is consensus on a single "best" implementation, so I'm going to move forward with the current implementation as it is more readable. @vertexclique, feel free to open another JIRA and PR (preferably with a benchmark) if you feel strongly. P.S. Sorry for causing more confusion than anything else above...
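For comparison, the reserve-then-write pattern under discussion looks like this on a plain `Vec` (a neutral sketch, not the arrow builder), alongside the iterator-based alternative that was debated:

```rust
// Neutral sketch of the append_n pattern debated above: reserve once up
// front, then append n copies in a loop, versus the iterator-based form.
fn append_n_loop(buf: &mut Vec<u8>, n: usize, v: u8) {
    buf.reserve(n); // single allocation up front, as in the PR
    for _ in 0..n {
        buf.push(v);
    }
}

fn append_n_extend(buf: &mut Vec<u8>, n: usize, v: u8) {
    // extend() with a sized iterator can also pre-reserve internally.
    buf.extend(std::iter::repeat(v).take(n));
}

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    append_n_loop(&mut a, 4, 7);
    append_n_extend(&mut b, 4, 7);
    assert_eq!(a, b);
    assert_eq!(a, vec![7, 7, 7, 7]);
    println!("ok");
}
```

Either form is correct; as the comment says, which is faster is best settled with a benchmark rather than by inspection.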
[GitHub] [arrow] paddyhoran commented on pull request #7024: ARROW-8573: [Rust] Upgrade Rust to 1.44 nightly
paddyhoran commented on pull request #7024: URL: https://github.com/apache/arrow/pull/7024#issuecomment-618749586

CI is failing again; I thought this was fixed by #8558.
[GitHub] [arrow] paddyhoran commented on a change in pull request #7004: ARROW-3827: [Rust] Implement UnionArray Updated
paddyhoran commented on a change in pull request #7004: URL: https://github.com/apache/arrow/pull/7004#discussion_r414241189

File path: rust/arrow/src/array/mod.rs

```diff
@@ -85,6 +85,7 @@
 mod array;
 mod builder;
 mod data;
 mod equal;
+mod union;
```

Review comment: Yeah, even if just to categorize the tests better. I'll open a JIRA once this is merged.
[GitHub] [arrow] pitrou commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
pitrou commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413696312

File path: cpp/cmake_modules/SetupCxxFlags.cmake

```diff
@@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
     set(CXX_SUPPORTS_SSE4_2 TRUE)
   else()
     set(ARROW_SSE4_2_FLAG "-msse4.2")
-    set(ARROW_AVX2_FLAG "-mavx2")
+    set(ARROW_AVX2_FLAG "-march=core-avx2")
```

Review comment: It would be nice if you could try again.

File path: cpp/cmake_modules/SetupCxxFlags.cmake

```diff
     # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
     set(ARROW_AVX512_FLAG "-march=skylake-avx512")
     check_cxx_compiler_flag(${ARROW_SSE4_2_FLAG} CXX_SUPPORTS_SSE4_2)
   endif()
   check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)
+  check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2)
```

Review comment: This doesn't seem useful :-)

File path: cpp/src/parquet/column_reader.cc

```cpp
@@ -50,6 +51,140 @@ using arrow::internal::checked_cast;

 namespace parquet {

+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level,
+    const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count,
+    uint8_t* valid_bits, int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size,
+                                                          required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // present through the leaf (i.e. no structs).
+      // Upper level code only calls this method
+      // when the leaf-values are nullable (otherwise no spacing is needed),
+      // Because only nested lists exists it is sufficient to know that the field
+      // was either null or included it (i.e. >= previous definition level -> > previous
+      // definition - 1). If there where structs mixed in, we need to know the def_level
+      // of the repeated parent so we can check for def_level > "def level of repeated
+      // parent".
+      uint64_t present_bitmap = internal::GreaterThanBitmap(
+          def_levels,
```
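The `GreaterThanBitmap` step at the heart of the SIMD path above can be re-expressed as scalar Rust (a hedged sketch of the logic, not the arrow implementation): bit *i* of the result is set when `def_levels[i] > level`, and the batch's null count is the number of unset bits among the batch.

```rust
// Scalar sketch of the greater-than bitmap used above: bit i is set when
// def_levels[i] > level, for a batch of at most 64 levels.
fn greater_than_bitmap(def_levels: &[i16], level: i16) -> u64 {
    assert!(def_levels.len() <= 64);
    let mut bitmap = 0u64;
    for (i, &d) in def_levels.iter().enumerate() {
        if d > level {
            bitmap |= 1 << i;
        }
    }
    bitmap
}

fn main() {
    // With max_definition_level = 1, valid slots are those with level > 0,
    // i.e. level > max_definition_level - 1 as in the C++ call site.
    let defs = [1i16, 0, 1, 1];
    let bits = greater_than_bitmap(&defs, 0);
    assert_eq!(bits, 0b1101);
    let null_count = defs.len() as u32 - bits.count_ones();
    assert_eq!(null_count, 1);
    println!("ok");
}
```

The BMI2 path mentioned in the thread then uses instructions like `pext`/`pdep` to expand such 64-bit masks efficiently, but the produced bitmap is the same.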
[GitHub] [arrow] nevi-me opened a new pull request #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
nevi-me opened a new pull request #7018: URL: https://github.com/apache/arrow/pull/7018

When a user compiles the `flight` crate, a `build.rs` script is invoked. This script looks for the `format/Flight.proto` path. A user might not have that path, as they would not have cloned the arrow repository, and as such, the build fails. This change:

a) checks whether the `../../format/Flight.proto` path exists, and only builds if the file exists and has been changed.
b) checks the proto file into the `src/` directory, changing the default output location from an opaque directory in the `target/{debug|release}/build-xxx` folder.

The rationale for checking in the proto file is that even if the file is not rebuilt, `flight` users are still able to access it. It's also beneficial for users to easily view the file and better understand what the generated code looks like.
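The conditional-build guard described in (a) can be sketched as a `build.rs` along these lines (a sketch under stated assumptions; the actual codegen call is elided, since it depends on the crate's protobuf tooling):

```rust
// Sketch of the build.rs guard described above: regenerate protobuf
// bindings only when the .proto source is present; otherwise rely on the
// generated file checked into src/.
use std::path::Path;

fn proto_available(path: &str) -> bool {
    Path::new(path).exists()
}

fn main() {
    let proto = "../../format/Flight.proto";
    if proto_available(proto) {
        // Rebuild only when the proto file actually changes.
        println!("cargo:rerun-if-changed={}", proto);
        // ...invoke the protobuf code generator here...
    }
    // If the file is absent (e.g. the crate was fetched from crates.io
    // rather than built from the arrow repository), the checked-in
    // generated source under src/ is used unchanged.
}
```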
[GitHub] [arrow] pitrou commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
pitrou commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413762177

File path: python/pyarrow/tests/test_extension_type.py

```diff
@@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type):
     import base64
     decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
     schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
-    assert schema.field("ext").metadata == {
-        b'ARROW:extension:metadata': b'freq=D',
-        b'ARROW:extension:name': b'pandas.period'}
+    # Since the type could be reconstructed, the extension type metadata is
+    # absent.
+    assert schema.field("ext").metadata == {}
```

Review comment: The change was actually done on the IPC side and Parquet inherited it, yes.
[GitHub] [arrow] pitrou commented on issue #6954: ARROW-8440: [C++] Refine SIMD header files
pitrou commented on issue #6954: URL: https://github.com/apache/arrow/pull/6954#issuecomment-618329962 cc @emkornfield
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413715323 ## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc ## @@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) { } BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime(); -BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) { - ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k"); -} -BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime(); +// Helpers to generate various multiple benchmarks for a given Parquet file. + +// NAME: the base name of the benchmark. +// ROWS: the number of rows in the Parquet file. +// COLS: the number of columns in the Parquet file. +// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader) +#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY) \ + BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State & st) { \ +std::vector column_indices(COLS); \ +std::iota(column_indices.begin(), column_indices.end(), 0); \ +std::stringstream ss; \ +ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k"; \ +ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY); \ Review comment: From a code maintainability standpoint, can we avoid putting so much logic in C macros? Ideally, the macro can call into a templated function. ## File path: cpp/src/parquet/file_reader.cc ## @@ -78,14 +79,46 @@ std::unique_ptr RowGroupReader::GetColumnPageReader(int i) { // Returns the rowgroup metadata const RowGroupMetaData* RowGroupReader::metadata() const { return contents_->metadata(); } +/// Compute the section of the file that should be read for the given +/// row group and column chunk. 
+arrow::io::ReadRange ComputeColumnChunkRange(FileMetaData* file_metadata, + int64_t source_size, int row_group_index, + int column_index) { + auto row_group_metadata = file_metadata->RowGroup(row_group_index); + auto column_metadata = row_group_metadata->ColumnChunk(column_index); + + int64_t col_start = column_metadata->data_page_offset(); + if (column_metadata->has_dictionary_page() && + column_metadata->dictionary_page_offset() > 0 && + col_start > column_metadata->dictionary_page_offset()) { +col_start = column_metadata->dictionary_page_offset(); + } + + int64_t col_length = column_metadata->total_compressed_size(); + // PARQUET-816 workaround for old files created by older parquet-mr + const ApplicationVersion& version = file_metadata->writer_version(); + if (version.VersionLt(ApplicationVersion::PARQUET_816_FIXED_VERSION())) { +// The Parquet MR writer had a bug in 1.2.8 and below where it didn't include the +// dictionary page header size in total_compressed_size and total_uncompressed_size +// (see IMPALA-694). We add padding to compensate. +int64_t bytes_remaining = source_size - (col_start + col_length); +int64_t padding = std::min(kMaxDictHeaderSize, bytes_remaining); +col_length += padding; + } + + return {col_start, col_length}; +} + // RowGroupReader::Contents implementation for the Parquet file specification class SerializedRowGroup : public RowGroupReader::Contents { public: - SerializedRowGroup(std::shared_ptr source, int64_t source_size, - FileMetaData* file_metadata, int row_group_number, - const ReaderProperties& props, + SerializedRowGroup(std::shared_ptr source, + std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source, + int64_t source_size, FileMetaData* file_metadata, + int row_group_number, const ReaderProperties& props, std::shared_ptr file_decryptor = nullptr) : source_(std::move(source)), +cached_source_(cached_source ? std::move(cached_source) : nullptr), Review comment: I don't think the ternary expression is needed. 
You should be able to move a null `shared_ptr` alright. ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: Be careful: the cache is unbounded. Does it mean you're willing to let all these row groups survive in memory until the reader gets destroyed?
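pitrou's point about the ternary can be demonstrated directly: moving from a null `shared_ptr` is well defined and simply leaves the member null, so the guard adds nothing. A minimal standalone sketch (the `Cache` and `Contents` names below are stand-ins, not the real Arrow classes):

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Cache {};  // stand-in for ReadRangeCache

struct Contents {
  explicit Contents(std::shared_ptr<Cache> cached_source)
      // No ternary needed: moving a null shared_ptr yields a null member.
      : cached_source_(std::move(cached_source)) {}
  std::shared_ptr<Cache> cached_source_;
};
```

Moving a null `shared_ptr` is a no-op on the pointee: the moved-to pointer is null and no reference count is touched.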
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413744080 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: We could add a method to clear/remove the cache, or the user might recreate the ParquetFileReader (as the cache is shared among all readers created from the file anyways). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413757274 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: It depends: reading multiple row groups at once might cache chunks that span row groups (and in general you want more opportunities to coalesce reads if you are aiming for throughput). For more control over that, a consumer could make repeated calls to `FileReader->RowGroup` to get individual RowGroupReaders that would cache only a single RowGroup.
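The lifetime trade-off discussed in this thread is easier to see with a toy model of a read-range cache: buffered ranges accumulate until the cache is cleared or destroyed. This is a simplified illustrative sketch, not the actual `arrow::io::internal::ReadRangeCache` API; the `Clear()` method corresponds to the clear/remove idea floated above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Toy cache: maps offset -> buffered bytes. Like the real cache it is
// unbounded: nothing is evicted until Clear() is called or the cache dies.
class ToyRangeCache {
 public:
  void Cache(const std::vector<ReadRange>& ranges) {
    for (const auto& r : ranges) {
      buffers_[r.offset] = std::string(static_cast<size_t>(r.length), 'x');
    }
  }
  size_t CachedBytes() const {
    size_t total = 0;
    for (const auto& kv : buffers_) total += kv.second.size();
    return total;
  }
  // The "release valve": without this, memory survives until destruction.
  void Clear() { buffers_.clear(); }

 private:
  std::map<int64_t, std::string> buffers_;
};
```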
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6959: ARROW-5649: [Integration][C++] Create integration test for extension types
jorisvandenbossche commented on a change in pull request #6959: URL: https://github.com/apache/arrow/pull/6959#discussion_r413750283 ## File path: python/pyarrow/tests/test_extension_type.py ## @@ -445,22 +445,28 @@ def test_parquet(tmpdir, registered_period_type): import base64 decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"]) schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema)) -assert schema.field("ext").metadata == { -b'ARROW:extension:metadata': b'freq=D', -b'ARROW:extension:name': b'pandas.period'} +# Since the type could be reconstructed, the extension type metadata is +# absent. +assert schema.field("ext").metadata == {} Review comment: I don't have a fully good sense of what people do with this metadata, but I suppose when the type was recognized, it should not be a problem if the metadata is not present anymore (since you can always retrieve it again from the type instance). I know Micah mentioned they were using those metadata in BigQuery, but without a registered extension type, so such use case should not be impacted. This test is for parquet roundtrip, but a similar change was done for plain IPC roundtrip as well? (that currently also preserves the field metadata, even when the type was recognized)
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760028 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I don't understand: is a separate `ParquetFileReader::Contents` created for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413760214 ## File path: cpp/src/parquet/file_reader.cc ## @@ -536,6 +577,14 @@ std::shared_ptr ParquetFileReader::RowGroup(int i) { return contents_->GetRowGroup(i); } +void ParquetFileReader::PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { + // Access private methods here + SerializedFile* file = static_cast(contents_.get()); Review comment: `checked_cast` here. Is it possible for `Contents` to be something else than `SerializedFile`?
[GitHub] [arrow] github-actions[bot] commented on issue #7018: ARROW-8536: [Rust] [Flight] Check in proto file, conditional build if file exists
github-actions[bot] commented on issue #7018: URL: https://github.com/apache/arrow/pull/7018#issuecomment-618349230 https://issues.apache.org/jira/browse/ARROW-8536
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: No, on the contrary, there's only one instance of `Contents`, and hence a single shared cache right now (between all reads of the same file). However, lots of instances of `RowGroupReader::Contents` get created (one per row group per column), so it's not easy to cache each row group separately. Perhaps I'm missing the point: what you'd like is a way to stream record batches out of a file, and have it internally clean up memory for each row group once the data has been fully read, right? (And, not pre-buffer more than some fixed number of row groups ahead of the current one.)
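The consumption pattern described here (stream row groups out, release each buffer once fully read, never pre-buffer more than a fixed number ahead) can be sketched independently of Parquet. All names below are hypothetical; the point is only the bounded look-ahead plus eager release:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Stand-in for a row group's buffered data.
using RowGroupBuffer = std::vector<int>;

class BoundedPrefetcher {
 public:
  BoundedPrefetcher(std::vector<RowGroupBuffer> row_groups, size_t lookahead)
      : pending_(std::move(row_groups)), lookahead_(lookahead) {}

  // Returns the next row group, keeping at most `lookahead_` buffered and
  // releasing each consumed buffer immediately.
  RowGroupBuffer Next() {
    Fill();
    RowGroupBuffer out = std::move(buffered_.front());
    buffered_.pop_front();  // consumed row group is released here
    return out;
  }

  size_t buffered_count() const { return buffered_.size(); }
  bool done() const { return next_ == pending_.size() && buffered_.empty(); }

 private:
  void Fill() {
    // "Pre-buffer" up to the look-ahead limit.
    while (buffered_.size() < lookahead_ && next_ < pending_.size()) {
      buffered_.push_back(std::move(pending_[next_++]));
    }
  }

  std::vector<RowGroupBuffer> pending_;
  std::deque<RowGroupBuffer> buffered_;
  size_t lookahead_;
  size_t next_ = 0;
};
```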
[GitHub] [arrow] emkornfield commented on issue #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now.
[GitHub] [arrow] kszucs commented on issue #6998: ARROW-8541: [Release] Don't remove previous source releases automatically
kszucs commented on issue #6998: URL: https://github.com/apache/arrow/pull/6998#issuecomment-618328429 @kou updated
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7000: ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation
jorisvandenbossche commented on a change in pull request #7000: URL: https://github.com/apache/arrow/pull/7000#discussion_r413622598 ## File path: cpp/src/arrow/dataset/file_parquet.cc ## @@ -402,23 +401,21 @@ Result ParquetFileFormat::ScanFile( } Result> ParquetFileFormat::MakeFragment( -FileSource source, std::shared_ptr options, -std::shared_ptr partition_expression, std::vector row_groups) { +FileSource source, std::shared_ptr partition_expression, +std::vector row_groups) { return std::shared_ptr( - new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options), + new ParquetFileFragment(std::move(source), shared_from_this(), std::move(partition_expression), std::move(row_groups))); } Result> ParquetFileFormat::MakeFragment( -FileSource source, std::shared_ptr options, -std::shared_ptr partition_expression) { - return std::shared_ptr( - new ParquetFileFragment(std::move(source), shared_from_this(), std::move(options), - std::move(partition_expression), {})); +FileSource source, std::shared_ptr partition_expression) { + return std::shared_ptr(new ParquetFileFragment( + std::move(source), shared_from_this(), std::move(partition_expression), {})); } Result ParquetFileFormat::GetRowGroupFragments( -const ParquetFileFragment& fragment, std::shared_ptr extra_filter) { +const ParquetFileFragment& fragment, std::shared_ptr filter) { Review comment: Small comment: in the header file it is still called "extra_filter"
[GitHub] [arrow] wesm commented on issue #7017: suggestion: why not serialize complex numbers in a Python list/dict/set
wesm commented on issue #7017: URL: https://github.com/apache/arrow/issues/7017#issuecomment-618352283 Can you send an email to one of the mailing lists or open a JIRA if you want to propose a development project?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413747770 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: I guess my question is: if I'm reading one record batch at a time (in streaming fashion), shouldn't the cache be per-record batch?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413759177 ## File path: cpp/src/parquet/file_reader.cc ## @@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents { file_metadata_ = std::move(metadata); } + void PreBuffer(const std::vector& row_groups, + const std::vector& column_indices, + const ::arrow::io::CacheOptions& options) { +cached_source_ = +std::make_shared(source_, options); Review comment: The other option would be to more fully separate the I/O and allow callers to pass a cache for a given read, and have `PreBuffer` instead pre-warm a supplied cache for the given row groups/column indices. Then you would have full control over the cache's lifetime, at the expense of a lot more complicated code (both for the caller and internally).
[GitHub] [arrow] jorisvandenbossche commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
jorisvandenbossche commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618366876 cc @wesm @xhochy @BryanCutler are you fine with 1) a hard required minimal pandas version? (meaning: we don't use the pandas integration if an older version is installed, instead of trying to use it and potentially erroring/failing in certain cases) 2) if ok with 1, pandas 0.23 as minimal version? (released 2 years ago, but our tests still pass with eg 0.21 as well)
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413533870 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,141 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed), + // Because only nested lists exists it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there where structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, _bits_offset, + _count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, _bits_offset, _count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN +// we use AVx2 as a proxy for BMI2 instruction set, since there doesn't seem to be a clean +// way o detect that latter
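For the simplest case handled by the scalar path above (a flat, non-repeated nullable column), the transformation reduces to: a definition level equal to the maximum means a present value, anything smaller means null, anything larger is an error. A stripped-down sketch of that case only (it deliberately omits the repeated and nested branches, and uses `std::vector<bool>` in place of a packed bitmap writer):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch of the non-repeated case: one validity bit per definition level,
// set when the value is present, cleared (and counted) when it is null.
void DefLevelsToBitmapFlat(const std::vector<int16_t>& def_levels,
                           int16_t max_definition_level,
                           std::vector<bool>* valid_bits, int64_t* null_count) {
  for (int16_t level : def_levels) {
    if (level > max_definition_level) {
      throw std::runtime_error("definition level exceeds maximum");
    }
    const bool valid = (level == max_definition_level);
    valid_bits->push_back(valid);
    if (!valid) ++*null_count;
  }
}
```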
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576031 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed), + // Because only nested lists exists it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there where structs mixed in, we need to know the def_level Review comment: fix comment should "- 2" This is an automated message from the Apache Git Service. 
[GitHub] [arrow] rok commented on issue #6667: ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors to Python
rok commented on issue #6667: URL: https://github.com/apache/arrow/pull/6667#issuecomment-618295149 Thanks @mrkn! :)
[GitHub] [arrow] emkornfield edited a comment on issue #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield edited a comment on issue #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-618231752 CC @wesm @pitrou I think this is ready for review now. Will take a closer look at CI failures tomorrow.
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413575153 ## File path: cpp/cmake_modules/SetupCxxFlags.cmake ## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86") set(CXX_SUPPORTS_SSE4_2 TRUE) else() set(ARROW_SSE4_2_FLAG "-msse4.2") -set(ARROW_AVX2_FLAG "-mavx2") +set(ARROW_AVX2_FLAG "-march=core-avx2") Review comment: hopefully this is an acceptable change. I'm open to other suggestions on how to enable BMI2 (note: for some reason using "-mbmi2 -mavx2" didn't seem to work for me, but I can try again).
[GitHub] [arrow] nevi-me commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
nevi-me commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618293330 > I also think it is fine for the moment to ignore the use case where the schema varies between record batches and file a separate issue for that. Just on this point, there shouldn't be a way of creating a record batch that has a mismatching schema
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet][WIP] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413576741 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed), + // Because only nested lists exists it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there where structs mixed in, we need to know the def_level Review comment: I think "- 1" might be confusing here because we subtract below. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
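For readers following this review, the scalar branch logic discussed above can be summarized with a short sketch. This is a hypothetical Rust translation for illustration only; the function name and return shape are assumptions, not the C++ API under review.

```rust
/// Hypothetical sketch of the scalar definition-levels-to-bitmap walk:
/// one validity bit is emitted per leaf value; levels below max - 1 in the
/// repeated+flat case belong to empty/absent lists and yield no value.
/// Returns (validity_bits, values_read, null_count).
fn def_levels_to_bitmap(
    def_levels: &[i16],
    max_definition_level: i16,
    max_repetition_level: i16,
) -> (Vec<bool>, usize, usize) {
    let mut valid_bits = Vec::new();
    let mut null_count = 0;
    for &level in def_levels {
        if level == max_definition_level {
            // Fully defined: a non-null value.
            valid_bits.push(true);
        } else if max_repetition_level > 0 {
            // Repeated + flat case: exactly max - 1 means a null entry;
            // anything lower produces no leaf value at all.
            if level == max_definition_level - 1 {
                valid_bits.push(false);
                null_count += 1;
            } else {
                continue;
            }
        } else {
            // Non-repeated case: any level below the maximum is a null.
            assert!(level < max_definition_level, "definition level exceeds maximum");
            valid_bits.push(false);
            null_count += 1;
        }
    }
    let values_read = valid_bits.len();
    (valid_bits, values_read, null_count)
}
```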
[GitHub] [arrow] rdettai commented on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai commented on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread-safe? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rdettai edited a comment on issue #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai edited a comment on issue #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-618235600 I'll work on your comments today. What about the problem we are trying to fix here? Do you agree with the benefits of this fix? Also, I'm not sure why a `Mutex` was used here. Do we agree that `RefCell` is better, as the whole reader is not thread-safe? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
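To illustrate the `RefCell` vs. `Mutex` point raised above: `RefCell` provides interior mutability without any locking, and because it is not `Sync`, the compiler itself rejects cross-thread sharing. The reader type below is a minimal hypothetical sketch, not the actual reader in #6949.

```rust
use std::cell::RefCell;

/// Hypothetical single-threaded reader using RefCell for interior
/// mutability. RefCell is !Sync, so sharing this struct across threads
/// is a compile error -- the single-threaded invariant is enforced
/// statically instead of being paid for with a Mutex on every call.
struct BufferedReader {
    pos: RefCell<usize>,
    data: Vec<u8>,
}

impl BufferedReader {
    fn new(data: Vec<u8>) -> Self {
        BufferedReader { pos: RefCell::new(0), data }
    }

    /// Reads one byte through a shared reference; borrow rules are
    /// checked at runtime (panicking on re-entrant mutable borrows)
    /// rather than via an atomic lock.
    fn read_byte(&self) -> Option<u8> {
        let mut pos = self.pos.borrow_mut();
        let b = self.data.get(*pos).copied();
        if b.is_some() {
            *pos += 1;
        }
        b
    }
}
```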
[GitHub] [arrow] wesm commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
wesm commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618459883 I'm OK with this. The maintenance burden of supporting several years' worth of pandas releases seems like a lot to bear. If there are parties that are affected by this, they should contribute to the maintenance (either monetarily or with their time). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618464806 The CI failure looks unrelated, will merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mayuropensource opened a new pull request #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource opened a new pull request #7020: URL: https://github.com/apache/arrow/pull/7020

JIRA: https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until https://github.com/apache/arrow/pull/6744 (@lidavidm) is merged; however, it doesn't need to wait for that pull request.

Description: The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: max I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data in bytes/sec

1. Computing max_io_gap or hole_size_limit:

   max_io_gap = TTFB * BW

   This is also called the Bandwidth-Delay-Product (BDP). Two byte ranges that have a gap can still be mapped to the same read if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (or call setup latency of a new S3 request) is expected to be greater than just reading and discarding the extra bytes on an existing HTTP request.

2. Computing ideal_request_size or range_size_limit:

   We want high bandwidth utilization per S3 connection, i.e. transfer large amounts of data to amortize the seek overhead. But we also want to leverage parallelism by slicing very large I/O chunks. We define two more config parameters with suggested default values to control the slice size, balancing the two effects with the goal of maximizing net data load performance.

   BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90% or 0.9.

   MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

   The amount of data that needs to be transferred in a single S3 get_object request to achieve effective bandwidth eff_BW = BW_util * BW is as follows:

   eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

   Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

   ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

   Applying MAX_IDEAL_REQUEST_SIZE, we get:

   ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
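The derivation above can be checked numerically with a short sketch. This is a hedged illustration; the function name and unit choices are assumptions, not the PR's actual API.

```rust
/// Hypothetical sketch of the two derived coalescing parameters.
/// TTFB is given in milliseconds, bandwidth in bytes/sec; both results
/// are in bytes. Returns (hole_size_limit, ideal_request_size).
fn coalesce_params(
    ttfb_millis: u64,
    bandwidth_bytes_per_sec: u64,
    bw_util_frac: f64,           // e.g. 0.9
    max_ideal_request_size: u64, // e.g. 64 MiB
) -> (u64, u64) {
    // hole_size_limit = TTFB * BW: the bandwidth-delay product.
    let hole_size_limit = ttfb_millis * bandwidth_bytes_per_sec / 1000;
    // ideal_request_size = min(MAX, hole_size_limit * util / (1 - util))
    let ideal =
        (hole_size_limit as f64 * bw_util_frac / (1.0 - bw_util_frac)).round() as u64;
    (hole_size_limit, ideal.min(max_ideal_request_size))
}
```

With TTFB = 100 ms and BW = 50 MiB/s, the hole limit is the 5 MiB bandwidth-delay product, and at 90% utilization the ideal request size is nine times that, capped at 64 MiB.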
[GitHub] [arrow] kszucs opened a new pull request #7021: Wrap docker-compose commands with archery [WIP]
kszucs opened a new pull request #7021: URL: https://github.com/apache/arrow/pull/7021 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534

## File path: rust/arrow/src/array/builder.rs

@@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder {
     self.write_bytes(v.to_byte_slice(), 1)
 }
+default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
+    self.reserve(n)?;
+    for _ in 0..n {
+        self.write_bytes(v.to_byte_slice(), 1)?;
+    }

Review comment: Code like `for _ in 0..n` performs a range check on every iteration, even when the loop variable is unused. This is one of Rust's infamous code patterns; personally, I avoid it whenever possible :). In [here](https://godbolt.org/z/55sggc) you can see one of the examples. The range check lives here:

```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```

This call performs a `checked_add`, which is what introduces the range checks, so it slows down large iteration counts, as you saw in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you see this pattern. As you can also see, at least 5 cycles are spent on every range check, so it is always better to use iterator methods. Bringing the example here :)

```
(0..n).try_for_each(|_| self.write_bytes(v.to_byte_slice(), 1))
```

This will most probably be pruned completely at `-C opt-level=3` :)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance

vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534

## File path: rust/arrow/src/array/builder.rs

@@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder {
     self.write_bytes(v.to_byte_slice(), 1)
 }
+default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
+    self.reserve(n)?;
+    for _ in 0..n {
+        self.write_bytes(v.to_byte_slice(), 1)?;
+    }

Review comment: Code like `for _ in 0..n` performs a range check on every iteration, even when the loop variable is unused. This is one of Rust's infamous code patterns; personally, I avoid it whenever possible :). In [here]() you can see one of the examples. The range check lives here:

```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```

This call performs a `checked_add`, which is what introduces the range checks, so it slows down large iteration counts, as you saw in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you see this pattern. As you can also see, at least 5 cycles are spent on every range check, so it is always better to use iterator methods. Bringing the example here :)

```
(0..n).try_for_each(|_| self.write_bytes(v.to_byte_slice(), 1))
```

This will most probably be pruned completely at `-C opt-level=3` :)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance

vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413951534

## File path: rust/arrow/src/array/builder.rs

@@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder {
     self.write_bytes(v.to_byte_slice(), 1)
 }
+default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
+    self.reserve(n)?;
+    for _ in 0..n {
+        self.write_bytes(v.to_byte_slice(), 1)?;
+    }

Review comment: Code like `for _ in 0..n` performs a range check on every iteration, even when the loop variable is unused. This is one of Rust's infamous and discouraged code patterns. In [here]() you can see one of the examples. The range check lives here:

```
mov rax, qword ptr [rip + core::iter::range::>::next@GOTPCREL]
```

This call performs a `checked_add`, which is what introduces the range checks, so it slows down large iteration counts, as you saw in the link. Since this is an open issue in the compiler, please consider using iterator methods instead of for loops wherever you see this pattern. As you can also see, at least 5 cycles are spent on every range check, so it is always better to use iterator methods. Bringing the example here :)

```
(0..n).try_for_each(|_| self.write_bytes(v.to_byte_slice(), 1))
```

This will most probably be pruned completely at `-C opt-level=3` :)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
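As a standalone illustration of the bounds-check discussion above, here is a toy builder (hypothetical, not the arrow crate's `BufferBuilder`) contrasting a per-element loop with a bulk fill:

```rust
/// Toy byte builder sketch. The loop version pays one iterator step
/// (and its range check) per element; the bulk version lets the
/// standard library fill the region in one call, which the optimizer
/// can lower to a memset-like operation.
struct ByteBuilder {
    buffer: Vec<u8>,
}

impl ByteBuilder {
    fn new() -> Self {
        ByteBuilder { buffer: Vec::new() }
    }

    /// Loop version: one push per element.
    fn append_n_loop(&mut self, n: usize, v: u8) {
        self.buffer.reserve(n);
        for _ in 0..n {
            self.buffer.push(v);
        }
    }

    /// Bulk version: a single resize to the new length.
    fn append_n_bulk(&mut self, n: usize, v: u8) {
        let new_len = self.buffer.len() + n;
        self.buffer.resize(new_len, v);
    }
}
```

Both produce identical buffers; the difference is purely in the generated code, which is why reviewers in this thread lean toward iterator or bulk APIs.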
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619

## File path: rust/parquet/src/encodings/rle.rs

@@ -830,7 +826,7 @@ mod tests {
     values.clear();
     let mut rng = thread_rng();
     let seed_vec: Vec<T> =
-        Standard.sample_iter(&mut rng).take(seed_len).collect();
+        rng.sample_iter::().take(seed_len).collect();

Review comment: This is an example of a type inference that would become a compiler error with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was:

```
error[E0282]: type annotations needed
  --> parquet/src/encodings/rle.rs:833:26
    |
833 |     Standard.sample_iter(&mut rng).take(seed_len).collect();
    |              ^^^ cannot infer type for `T`
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413970301

## File path: cpp/src/parquet/level_conversion.h

@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include "arrow/util/bit_util.h"
+
+#if defined(ARROW_HAVE_BMI2)
+#include "x86intrin.h"
+#endif
+
+namespace parquet {
+namespace internal {
+// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code.
+// They currently represent minimal functionality for vectorized computation of
+// definition levels.
+
+/// Builds a bitmap by applying predicate to the level vector provided.
+///
+/// \param[in] levels Rep or def level array.
+/// \param[in] num_levels The number of levels to process (must be [0, 64])
+/// \param[in] predicate The predicate to apply (must have the signature `bool
+/// predicate(int16_t)`.
+/// \returns The bitmap using least significant "bit" ordering.
+///
+/// N.B. Correct byte ordering is dependent on little-endian architectures.
+///
+template <typename Predicate>
+uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) {
+  // Both clang and GCC can vectorize this automatically with AVX2.
+  uint64_t mask = 0;
+  for (int x = 0; x < num_levels; x++) {
+    mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
+  }
+  return mask;
+}
+
+/// Builds a bitmap where each set bit indicates the corresponding level is greater
+/// than rhs.
+static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels,

Review comment: Is this function used only for a little-endian platform?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
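The `LevelsToBitmap`/`GreaterThanBitmap` helpers quoted above are small enough to sketch in a scalar form. The following is a hypothetical Rust translation for illustration, not the C++ implementation under review:

```rust
/// Build a 64-bit mask whose bit i is set when levels[i] > rhs,
/// using least-significant-bit-first ordering, mirroring the
/// GreaterThanBitmap helper quoted above.
fn greater_than_bitmap(levels: &[i16], rhs: i16) -> u64 {
    assert!(levels.len() <= 64, "at most 64 levels per mask");
    let mut mask: u64 = 0;
    for (i, &level) in levels.iter().enumerate() {
        // Branch-free accumulation: the comparison result (0 or 1)
        // is shifted into position i.
        mask |= ((level > rhs) as u64) << i;
    }
    mask
}
```

Note that the bit ordering question raised in the review applies here too: interpreting the mask bytes as a bitmap buffer directly is only correct on little-endian layouts.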
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618528458

Revision: 5bcfeab4c9bacc0b3a262a7522bfaf985025d3ec

Submitted crossbow builds: [ursa-labs/crossbow @ actions-165](https://github.com/ursa-labs/crossbow/branches/all?query=actions-165)

|Task|Status|
|----|------|
|homebrew-cpp|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-165-travis-homebrew-cpp.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] paddyhoran commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
paddyhoran commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413903768

## File path: rust/arrow/src/array/builder.rs

@@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder {
     self.write_bytes(v.to_byte_slice(), 1)
 }
+default fn append_n(&mut self, n: usize, v: T::Native) -> Result<()> {
+    self.reserve(n)?;
+    for _ in 0..n {
+        self.write_bytes(v.to_byte_slice(), 1)?;
+    }

Review comment: You can use `to_byte_slice` on a `Vec` or slice also. To be clear, I'm just guessing here. @vertexclique do you want to expand on your comment or can I merge this?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413908372

## File path: cpp/src/parquet/column_reader.cc

@@ -50,6 +51,140 @@ using arrow::internal::checked_cast;

 namespace parquet {
+namespace {
+
+inline void CheckLevelRange(const int16_t* levels, int64_t num_levels,
+                            const int16_t max_expected_level) {
+  int16_t min_level = std::numeric_limits<int16_t>::max();
+  int16_t max_level = std::numeric_limits<int16_t>::min();
+  for (int x = 0; x < num_levels; x++) {
+    min_level = std::min(levels[x], min_level);
+    max_level = std::max(levels[x], max_level);
+  }
+  if (ARROW_PREDICT_FALSE(num_levels > 0 &&
+                          (min_level < 0 || max_level > max_expected_level))) {
+    throw ParquetException("definition level exceeds maximum");
+  }
+}
+
+#if !defined(ARROW_HAVE_BMI2)
+
+inline void DefinitionLevelsToBitmapScalar(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  // We assume here that valid_bits is large enough to accommodate the
+  // additional definition levels and the ones that have already been written
+  ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset,
+                                                    num_def_levels);
+
+  // TODO(itaiin): As an interim solution we are splitting the code path here
+  // between repeated+flat column reads, and non-repeated+nested reads.
+  // Those paths need to be merged in the future
+  for (int i = 0; i < num_def_levels; ++i) {
+    if (def_levels[i] == max_definition_level) {
+      valid_bits_writer.Set();
+    } else if (max_repetition_level > 0) {
+      // repetition+flat case
+      if (def_levels[i] == (max_definition_level - 1)) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        continue;
+      }
+    } else {
+      // non-repeated+nested case
+      if (def_levels[i] < max_definition_level) {
+        valid_bits_writer.Clear();
+        *null_count += 1;
+      } else {
+        throw ParquetException("definition level exceeds maximum");
+      }
+    }
+
+    valid_bits_writer.Next();
+  }
+  valid_bits_writer.Finish();
+  *values_read = valid_bits_writer.position();
+}
+#endif
+
+template <bool has_repeated_parent>
+void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
+                                  const int16_t required_definition_level,
+                                  int64_t* values_read, int64_t* null_count,
+                                  uint8_t* valid_bits, int64_t valid_bits_offset) {
+  constexpr int64_t kBitMaskSize = 64;
+  int64_t set_count = 0;
+  *values_read = 0;
+  while (num_def_levels > 0) {
+    int64_t batch_size = std::min(num_def_levels, kBitMaskSize);
+    CheckLevelRange(def_levels, batch_size, required_definition_level);
+    uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size,
+                                                          required_definition_level - 1);
+    if (has_repeated_parent) {
+      // This is currently a specialized code path assuming only (nested) lists
+      // present through the leaf (i.e. no structs).
+      // Upper level code only calls this method
+      // when the leaf-values are nullable (otherwise no spacing is needed),
+      // Because only nested lists exists it is sufficient to know that the field
+      // was either null or included it (i.e. >= previous definition level -> > previous
+      // definition - 1). If there were structs mixed in, we need to know the def_level
+      // of the repeated parent so we can check for def_level > "def level of repeated
+      // parent".
+      uint64_t present_bitmap = internal::GreaterThanBitmap(
+          def_levels, batch_size, required_definition_level - 2);
+      *values_read += internal::AppendSelectedBitsToValidityBitmap(
+          /*new_bits=*/defined_bitmap,
+          /*selection_bitmap*/ present_bitmap, valid_bits, &valid_bits_offset,
+          &set_count);
+    } else {
+      internal::AppendToValidityBitmap(
+          /*new_bits=*/defined_bitmap,
+          /*new_bit_count=*/batch_size, valid_bits, &valid_bits_offset, &set_count);
+      *values_read += batch_size;
+    }
+    def_levels += batch_size;
+    num_def_levels -= batch_size;
+  }
+  *null_count += *values_read - set_count;
+}
+
+inline void DefinitionLevelsToBitmapDispatch(
+    const int16_t* def_levels, int64_t num_def_levels,
+    const int16_t max_definition_level, const int16_t max_repetition_level,
+    int64_t* values_read, int64_t* null_count, uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (max_repetition_level > 0) {
+#if ARROW_LITTLE_ENDIAN
+
+#if defined(ARROW_HAVE_BMI2)
+    // BMI2 is required for efficient bit extraction.
+    DefinitionLevelsToBitmapSimd<true>(
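The batch bookkeeping in the SIMD path quoted above (append up to 64 validity bits at a time, counting set bits with a popcount instead of a per-element branch) can be sketched as follows. This is a hypothetical Rust translation; the function name and signature are assumptions, not the C++ API.

```rust
/// Append `bit_count` bits from `new_bits` (LSB-first) onto a
/// byte-oriented validity bitmap, updating the running bit offset and
/// the set-bit count. The set-bit count is obtained in one popcount
/// per 64-bit batch; nulls fall out as values_read - set_count.
fn append_to_validity_bitmap(
    new_bits: u64,
    bit_count: usize,
    valid_bits: &mut Vec<u8>,
    bit_offset: &mut usize,
    set_count: &mut u64,
) {
    assert!(bit_count <= 64);
    // Mask off bits beyond this batch, then count the valid ones at once.
    let masked = if bit_count == 64 {
        new_bits
    } else {
        new_bits & ((1u64 << bit_count) - 1)
    };
    *set_count += masked.count_ones() as u64;
    for i in 0..bit_count {
        let byte = *bit_offset / 8;
        if byte >= valid_bits.len() {
            valid_bits.push(0);
        }
        if (masked >> i) & 1 == 1 {
            valid_bits[byte] |= 1 << (*bit_offset % 8);
        }
        *bit_offset += 1;
    }
}
```

The real code additionally uses BMI2 bit extraction to space the defined bits by a selection bitmap; this sketch only shows the unselected append path.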
[GitHub] [arrow] lidavidm commented on a change in pull request #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
lidavidm commented on a change in pull request #7020: URL: https://github.com/apache/arrow/pull/7020#discussion_r413923773

## File path: cpp/src/arrow/io/caching.h

@@ -27,6 +27,44 @@ namespace arrow {
 namespace io {
+
+struct ARROW_EXPORT CacheOptions {
+  static constexpr double kDefaultIdealBandwidthUtilizationFrac = 0.9;
+  static constexpr int64_t kDefaultMaxIdealRequestSizeMib = 64;
+
+  /// \brief The maximum distance in bytes between two consecutive
+  /// ranges; beyond this value, ranges are not combined
+  int64_t hole_size_limit;
+  /// \brief The maximum size in bytes of a combined range; if
+  /// combining two consecutive ranges would produce a range of a
+  /// size greater than this, they are not combined
+  int64_t range_size_limit;
+
+  bool operator==(const CacheOptions& other) const {
+    return hole_size_limit == other.hole_size_limit &&
+           range_size_limit == other.range_size_limit;
+  }
+
+  /// \brief Construct CacheOptions from S3 storage metrics.
+  ///
+  /// \param[in] time_to_first_byte_millis Seek-time or Time-To-First-Byte (TTFB) in
+  /// milliseconds, also called call setup latency of a new S3 request.
+  /// The value is a positive integer.
+  /// \param[in] transfer_bandwidth_mib_per_sec Data transfer Bandwidth (BW) in MiB/sec.
+  /// The value is a positive integer.
+  /// \param[in] ideal_bandwidth_utilization_frac Transfer bandwidth utilization fraction
+  /// (per connection) to maximize the net data load.
+  /// The value is a positive double precision number less than or equal to 1.
+  /// \param[in] max_ideal_request_size_mib The maximum single data request size (in MiB)
+  /// to maximize the net data load.
+  /// The value is a positive integer.
+  /// \return A new instance of CacheOptions.
+  static CacheOptions MakeFromS3Metrics(

Review comment: This might be better named as network metrics; it'd be useful for the GCS filesystem that's in progress, for instance.

## File path: cpp/src/arrow/io/caching.cc

@@ -27,6 +28,91 @@ namespace arrow {
 namespace io {
+
+CacheOptions CacheOptions::MakeFromS3Metrics(
+    const int64_t time_to_first_byte_millis,
+    const int64_t transfer_bandwidth_mib_per_sec,
+    const double ideal_bandwidth_utilization_frac,
+    const int64_t max_ideal_request_size_mib) {
+  //
+  // The I/O coalescing algorithm uses two parameters:
+  //   1. hole_size_limit (a.k.a max_io_gap): Max I/O gap/hole size in bytes
+  //   2. range_size_limit (a.k.a ideal_request_size): Ideal I/O Request size in bytes
+  //
+  // These parameters can be derived from S3 metrics as described below:
+  //
+  // In an S3 compatible storage, there are two main metrics:
+  //   1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a
+  //      new S3 request
+  //   2. Transfer Bandwidth (BW) for data in bytes/sec
+  //
+  // 1. Computing hole_size_limit:
+  //
+  //    hole_size_limit = TTFB * BW
+  //
+  //    This is also called Bandwidth-Delay-Product (BDP).
+  //    Two byte ranges that have a gap can still be mapped to the same read
+  //    if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth],
+  //    i.e. if the Time-To-First-Byte (or call setup latency of a new S3 request) is
+  //    expected to be greater than just reading and discarding the extra bytes on an
+  //    existing HTTP request.
+  //
+  // 2. Computing range_size_limit:
+  //
+  //    We want to have high bandwidth utilization per S3 connections,
+  //    i.e. transfer large amounts of data to amortize the seek overhead.
+  //    But, we also want to leverage parallelism by slicing very large IO chunks.
+  //    We define two more config parameters with suggested default values to control
+  //    the slice size and seek to balance the two effects with the goal of maximizing
+  //    net data load performance.
+  //
+  //    BW_util_frac (ideal bandwidth utilization): Transfer bandwidth utilization
+  //    fraction (per connection) to maximize the net data load. 90% is a good default
+  //    number for an effective transfer bandwidth.
+  //
+  //    MAX_IDEAL_REQUEST_SIZE: The maximum single data request size (in MiB) to
+  //    maximize the net data load. 64 MiB is a good default number for the ideal
+  //    request size.
+  //
+  //    The amount of data that needs to be transferred in a single S3 get_object
+  //    request to achieve effective bandwidth eff_BW = BW_util_frac * BW is as follows:
+  //      eff_BW = range_size_limit / (TTFB + range_size_limit / BW)
+  //
+  //    Substituting TTFB = hole_size_limit / BW and eff_BW = BW_util_frac * BW, we get
+  //    the following result:
+  //      range_size_limit = hole_size_limit * BW_util_frac / (1 - BW_util_frac)
+  //
+  //    Applying the MAX_IDEAL_REQUEST_SIZE, we get the following:
+  //      range_size_limit = min(MAX_IDEAL_REQUEST_SIZE,
+  //                             hole_size_limit *
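To illustrate how the two `CacheOptions` limits are eventually consumed by the coalescing algorithm, here is a minimal sketch of range merging over sorted, non-overlapping byte ranges. This is a hypothetical helper for illustration, not the actual Arrow implementation:

```rust
/// Merge adjacent (offset, length) ranges when the gap between them is
/// at most hole_size_limit, as long as the merged range stays within
/// range_size_limit. Ranges are assumed sorted and non-overlapping.
fn coalesce_ranges(
    ranges: &[(u64, u64)],
    hole_size_limit: u64,
    range_size_limit: u64,
) -> Vec<(u64, u64)> {
    let mut out: Vec<(u64, u64)> = Vec::new();
    for &(off, len) in ranges {
        if let Some(last) = out.last_mut() {
            let last_end = last.0 + last.1;
            let gap = off.saturating_sub(last_end);
            let merged_len = off + len - last.0;
            // Reading the hole is cheaper than a new request, and the
            // combined read does not exceed the ideal request size.
            if gap <= hole_size_limit && merged_len <= range_size_limit {
                last.1 = merged_len;
                continue;
            }
        }
        out.push((off, len));
    }
    out
}
```

With the defaults derived above, small holes get absorbed into one request while distant ranges stay separate, which is exactly the trade-off the two parameters encode.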
[GitHub] [arrow] mayuropensource opened a new pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource opened a new pull request #7022: URL: https://github.com/apache/arrow/pull/7022

_(Recreating the PR from a clean repo, sorry about the earlier PR which was not cleanly merged)._

**JIRA:** https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until #6744 (@lidavidm) is merged; however, it doesn't need to wait for that pull request.

**Description:** The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap or hole_size_limit: max I/O gap/hole size in bytes
2. ideal_request_size or range_size_limit: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below. In an S3-compatible storage, there are two main metrics:
1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data in bytes/sec

Computing max_io_gap or hole_size_limit:

    max_io_gap = TTFB * BW

This is also called the Bandwidth-Delay-Product (BDP). Two byte ranges that have a gap can still be mapped to the same read if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (or call setup latency of a new S3 request) is expected to be greater than just reading and discarding the extra bytes on an existing HTTP request.

Computing ideal_request_size or range_size_limit:

We want high bandwidth utilization per S3 connection, i.e. transfer large amounts of data to amortize the seek overhead. But we also want to leverage parallelism by slicing very large I/O chunks. We define two more config parameters with suggested default values to control the slice size, balancing the two effects with the goal of maximizing net data load performance.

BW_util (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90% or 0.9.

MAX_IDEAL_REQUEST_SIZE: the maximum single request size (in bytes) to maximize net data load. A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object request to achieve effective bandwidth eff_BW = BW_util * BW is as follows:

    eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get:

    ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying MAX_IDEAL_REQUEST_SIZE, we get:

    ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806 @andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency. @nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example ``` let schema1 = ... let csv1 = ... let batch1 = ... let schema2 = ... // different schema let csv2 = ... let batch2 = ... print_batches(&[batch1, batch2]); ``` As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for it for later if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table: ``` +---+---+---+---+ | city | lat | lng | | +---+---+---+---+ | Elgin, Scotland, the UK | 57.653484 | -3.335724 | | | Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | | | Solihull, Birmingham, UK | 52.412811 | -1.778197 | | | Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | | | 1 | 1.1 | 1.11 | true | | 2 | 2.2 | 2.22 | true | | 3 | 0 | 3.33 | true | | 4 | 4.4 | | false | | 5 | 6.6 | | false | +---+---+---+---+ ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806 @andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also fixed the type inference problem that was caused by the new dependency. @nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example ``` let schema1 = ... let csv1 = ... let batch1 = ... let schema2 = ... // different schema let csv2 = ... let batch2 = ... print_batches(&[batch1, batch2]); ``` As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table: ``` +---+---+---+---+ | city | lat | lng | | +---+---+---+---+ | Elgin, Scotland, the UK | 57.653484 | -3.335724 | | | Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | | | Solihull, Birmingham, UK | 52.412811 | -1.778197 | | | Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | | | 1 | 1.1 | 1.11 | true | | 2 | 2.2 | 2.22 | true | | 3 | 0 | 3.33 | true | | 4 | 4.4 | | false | | 5 | 6.6 | | false | +---+---+---+---+ ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] markhildreth edited a comment on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth edited a comment on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618501806 @andygrove Thanks for the feedback. I have updated the PR with a less leaky API. I also tweaked the parquet test to workaround the new type inference changes. @nevi-me True for a singular batch, but it's possible to create multiple batches from different sources. For example ``` let schema1 = ... let csv1 = ... let batch1 = ... let schema2 = ... // different schema let csv2 = ... let batch2 = ... print_batches(&[batch1, batch2]); ``` As I said, this is probably not something to worry about too much right now, but I'll probably add an issue for later to revisit if that's alright. Interestingly, this code wouldn't even necessarily crash; you would just get an odd-looking table: ``` +---+---+---+---+ | city | lat | lng | | +---+---+---+---+ | Elgin, Scotland, the UK | 57.653484 | -3.335724 | | | Stoke-on-Trent, Staffordshire, the UK | 53.002666 | -2.179404 | | | Solihull, Birmingham, UK | 52.412811 | -1.778197 | | | Cardiff, Cardiff county, UK | 51.481583 | -3.17909 | | | 1 | 1.1 | 1.11 | true | | 2 | 2.2 | 2.22 | true | | 3 | 0 | 3.33 | true | | 4 | 4.4 | | false | | 5 | 6.6 | | false | +---+---+---+---+ ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
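The odd-looking table falls out of how such a printer naturally works. A dependency-free sketch (illustrative only, not the actual arrow `pretty` util): the header row comes from the first batch's schema, and every later batch is appended row by row under it, even when its schema differs.

```rust
// Illustrative sketch only -- not the arrow "pretty" util. A batch is
// modeled as (column names, rows); print_batches takes its header from
// the FIRST batch, so a second batch with a different schema simply gets
// appended under the wrong header, producing the odd table above.
fn print_batches(batches: &[(Vec<&str>, Vec<Vec<&str>>)]) -> String {
    let mut out = String::new();
    let header = &batches[0].0;
    out.push_str(&format!("| {} |\n", header.join(" | ")));
    for (_schema, rows) in batches {
        for row in rows {
            out.push_str(&format!("| {} |\n", row.join(" | ")));
        }
    }
    out
}

fn main() {
    let cities = (vec!["city", "lat"], vec![vec!["Elgin", "57.6"]]);
    let numbers = (vec!["a", "b"], vec![vec!["1", "1.1"]]);
    // Rows from `numbers` appear under the `city`/`lat` header.
    print!("{}", print_batches(&[cities, numbers]));
}
```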
[GitHub] [arrow] markhildreth commented on a change in pull request #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on a change in pull request #6972: URL: https://github.com/apache/arrow/pull/6972#discussion_r413951619 ## File path: rust/parquet/src/encodings/rle.rs ## @@ -830,7 +826,7 @@ mod tests { values.clear(); let mut rng = thread_rng(); let seed_vec: Vec = -Standard.sample_iter( rng).take(seed_len).collect(); +rng.sample_iter::().take(seed_len).collect(); Review comment: This is an example of type inference that would become a compiler error with the inclusion of the `prettytable-rs` crate. I have fixed this code, but the original error being generated was: ``` error[E0282]: type annotations needed --> parquet/src/encodings/rle.rs:833:26 | 833 | Standard.sample_iter( rng).take(seed_len).collect(); | ^^^ cannot infer type for `T` ``` Note this would affect all downstream crates, as it did the `parquet` crate.
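The underlying issue is generic, nothing rand-specific: adding a crate can introduce new trait impls that make a previously unique inference ambiguous, and the usual fix is to pin the type at the call site. A std-only illustration of the E0282 situation and the turbofish fix:

```rust
fn main() {
    // Ambiguous: `collect()` alone cannot pick a container once several
    // `FromIterator` targets are plausible -- the E0282 situation.
    // let v = (1..4).collect();  // error[E0282]: type annotations needed

    // Pin the type either on the binding or with a turbofish:
    let v: Vec<i32> = (1..4).collect();
    let w = (1..4).collect::<Vec<i32>>();
    assert_eq!(v, w);
}
```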
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413954827 ## File path: cpp/src/parquet/level_conversion_test.cc ## @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "parquet/level_conversion.h" + +#include +#include + +#include + +#include "arrow/util/bit_util.h" + +namespace parquet { +namespace internal { + +using ::testing::ElementsAreArray; + +std::string ToString(const uint8_t* bitmap, int64_t bit_count) { + return arrow::internal::Bitmap(bitmap, /*offset*/ 0, /*length=*/bit_count).ToString(); +} + +std::string ToString(const std::vector& bitmap, int64_t bit_count) { + return ToString(bitmap.data(), bit_count); +} + +TEST(TestGreaterThanBitmap, GeneratesExpectedBitmasks) { Review comment: Most of these tests fail on big-endian platform ``` [ FAILED ] TestAppendBitmap.TestOffsetOverwritesCorrectBitsOnExistingByte [ FAILED ] TestAppendBitmap.TestOffsetShiftBitsCorrectly [ FAILED ] TestAppendBitmap.AllBytesAreWrittenWithEnoughSpace [ FAILED ] TestAppendBitmap.OnlyApproriateBytesWrittenWhenLessThen8BytesAvailable [ FAILED ] TestAppendToValidityBitmap.BasicOperation ``` This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
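The reported failures come down to byte order: tests that compare a bitmap's raw bytes bake in the little-endian layout. A minimal Rust illustration of fixed versus native byte order:

```rust
fn main() {
    let x: u16 = 0x0102;
    assert_eq!(x.to_le_bytes(), [0x02, 0x01]); // little-endian layout
    assert_eq!(x.to_be_bytes(), [0x01, 0x02]); // big-endian layout
    // to_ne_bytes() equals one of the two depending on the platform;
    // asserting on raw native bytes is what breaks on big-endian machines.
    let ne = x.to_ne_bytes();
    assert!(ne == x.to_le_bytes() || ne == x.to_be_bytes());
}
```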
[GitHub] [arrow] wesm commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
wesm commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618508772 @mayuropensource it's not necessary to open a new PR when you want to redo your commits, you can just force push your branch
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413953833 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists Review comment: Since this path is executed only when #if defined(ARROW_HAVE_BMI2) is true, it would be good to add such as code ``` #if defined(ARROW_HAVE_BMI2) ... #else assert(false && "must not execute this without BMI2"); #endif ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413960907 ## File path: cpp/src/parquet/level_conversion.h ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include "arrow/util/bit_util.h" + +#if defined(ARROW_HAVE_BMI2) +#include "x86intrin.h" +#endif + +namespace parquet { +namespace internal { +// These APIs are likely to be revised as part of ARROW-8494 to reduce duplicate code. +// They currently represent minimal functionality for vectorized computation of definition +// levels. + +/// Builds a bitmap by applying predicate to the level vector provided. +/// +/// \param[in] levels Rep or def level array. +/// \param[in] num_levels The number of levels to process (must be [0, 64]) +/// \param[in] predicate The predicate to apply (must have the signature `bool +/// predicate(int16_t)`. +/// \returns The bitmap using least significant "bit" ordering. +/// +/// N.B. Correct byte ordering is dependent on little-endian architectures. +/// +template +uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels, Predicate predicate) { + // Both clang and GCC can vectorize this automatically with AVX2. 
+ uint64_t mask = 0; + for (int x = 0; x < num_levels; x++) { +mask |= static_cast(predicate(levels[x]) ? 1 : 0) << x; + } + return mask; +} + +/// Builds a bitmap where each set bit indicates the correspond level is greater +/// than rhs. +static inline int64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels, +int16_t rhs) { + return LevelsToBitmap(levels, num_levels, [&](int16_t value) { return value > rhs; }); +} + +/// Append bits number_of_bits from new_bits to valid_bits and valid_bits_offset. +/// +/// \param[in] new_bits The zero-padded bitmap to append. +/// \param[in] number_of_bits The number of bits to append from new_bits. +/// \param[in] valid_bits_length The number of bytes allocated in valid_bits. +/// \param[in] valid_bits_offset The bit-offset at which to start appending new bits. +/// \param[in,out] valid_bits The validity bitmap to append to. +/// \returns The new bit offset inside of valid_bits. +static inline int64_t AppendBitmap(uint64_t new_bits, int64_t number_of_bits, + int64_t valid_bits_length, int64_t valid_bits_offset, + uint8_t* valid_bits) { + // Selection masks to retrieve all low order bits for each bytes. + constexpr uint64_t kLsbSelectionMasks[] = { + 0, // unused. + 0x0101010101010101, + 0x0303030303030303, + 0x0707070707070707, + 0x0F0F0F0F0F0F0F0F, + 0x1F1F1F1F1F1F1F1F, + 0x3F3F3F3F3F3F3F3F, + 0x7F7F7F7F7F7F7F7F, + }; + int64_t valid_byte_offset = valid_bits_offset / 8; + int64_t bit_offset = valid_bits_offset % 8; + + int64_t new_offset = valid_bits_offset + number_of_bits; + union ByteAddressableBitmap { Review comment: Why do we need to use union? `bytes[]` is used only at line 106. It looks simple to explicitly extract the lowest 8bit instead of using union. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] markhildreth commented on a change in pull request #7006: ARROW-8508: [Rust] FixedSizeListArray improper offset for value
markhildreth commented on a change in pull request #7006: URL: https://github.com/apache/arrow/pull/7006#discussion_r413958397 ## File path: rust/arrow/src/array/array.rs ## @@ -2592,6 +2619,15 @@ mod tests { assert_eq!(DataType::Int32, list_array.value_type()); assert_eq!(3, list_array.len()); assert_eq!(0, list_array.null_count()); +assert_eq!( +3, +list_array +.value(0) +.as_any() +.downcast_ref::() +.unwrap() +.value(0) +); Review comment: The rest of the added tests would have been passing previously. This would be the failing test. The actual value that was being returned was 0.
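The indexing rule behind the fix can be sketched without the arrow crate (names hypothetical): element `i` of a fixed-size list starts at `(array_offset + i) * list_len` in the flat child values, and the bug class here is dropping the array offset from that product.

```rust
// Hypothetical model of a fixed-size list over a flat child buffer.
struct FixedSizeList {
    values: Vec<i32>, // flat child values
    list_len: usize,  // fixed length of each list
    offset: usize,    // slice offset, counted in lists
}

impl FixedSizeList {
    fn value(&self, i: usize) -> &[i32] {
        // The fix: include `self.offset`; dropping it returns the wrong slice.
        let start = (self.offset + i) * self.list_len;
        &self.values[start..start + self.list_len]
    }
}

fn main() {
    let a = FixedSizeList { values: (0..9).collect(), list_len: 3, offset: 1 };
    // With the offset applied, value(0) is the second list, [3, 4, 5].
    assert_eq!(a.value(0), &[3, 4, 5]);
}
```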
[GitHub] [arrow] nealrichardson commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618527651 @github-actions crossbow submit homebrew-cpp
[GitHub] [arrow] github-actions[bot] commented on issue #7021: Wrap docker-compose commands with archery [WIP]
github-actions[bot] commented on issue #7021: URL: https://github.com/apache/arrow/pull/7021#issuecomment-618494095 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] vertexclique commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
vertexclique commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413972358 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n( self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: The place where you are doing byte masking could be a good alternative, since that place is also accessing by index: https://github.com/apache/arrow/pull/6980/files#diff-76c0ed0285da0826a55cf62676556c87R102 If that's extreme, just forget about it. :)
[GitHub] [arrow] github-actions[bot] commented on issue #7020: Arrow-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
github-actions[bot] commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618472360 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] kiszk commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
kiszk commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r413905959 ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. 
+ // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists + // present through the leaf (i.e. no structs). + // Upper level code only calls this method + // when the leaf-values are nullable (otherwise no spacing is needed), + // Because only nested lists exists it is sufficient to know that the field + // was either null or included it (i.e. >= previous definition level -> > previous + // definition - 1). If there where structs mixed in, we need to know the def_level + // of the repeated parent so we can check for def_level > "def level of repeated + // parent". 
+ uint64_t present_bitmap = internal::GreaterThanBitmap( + def_levels, batch_size, required_definition_level - 2); + *values_read += internal::AppendSelectedBitsToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*selection_bitmap*/ present_bitmap, valid_bits, _bits_offset, + _count); +} else { + internal::AppendToValidityBitmap( + /*new_bits=*/defined_bitmap, + /*new_bit_count=*/batch_size, valid_bits, _bits_offset, _count); + *values_read += batch_size; +} +def_levels += batch_size; +num_def_levels -= batch_size; + } + *null_count += *values_read - set_count; +} + +inline void DefinitionLevelsToBitmapDispatch( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + if (max_repetition_level > 0) { +#if ARROW_LITTLE_ENDIAN + +#if defined(ARROW_HAVE_BMI2) +// BMI2 is required for efficient bit extraction. +DefinitionLevelsToBitmapSimd( +
[GitHub] [arrow] mayuropensource commented on issue #7020: ARROW-8562: [C++] Parameterize I/O coalescing using s3 storage metrics
mayuropensource commented on issue #7020: URL: https://github.com/apache/arrow/pull/7020#issuecomment-618486378 I messed up some commits. Will create a new one.
[GitHub] [arrow] markhildreth commented on issue #6972: ARROW-8287: [Rust] Add "pretty" util to help with printing tabular output of RecordBatches
markhildreth commented on issue #6972: URL: https://github.com/apache/arrow/pull/6972#issuecomment-618506903 From a purely practical standpoint, this PR is ready for further review and merging. If approved, I would probably add some minor issues for the following: * Trying to avoid the type inference issue. * Creating a type that can be used to iterate over multiple batches with statically guaranteed same schemas. Additionally, I’ll create a new issue and PR to cover moving DataFusion to using this API.
[GitHub] [arrow] bryantbiggs commented on issue #4140: ARROW-5123: [Rust] Parquet derive for simple structs
bryantbiggs commented on issue #4140: URL: https://github.com/apache/arrow/pull/4140#issuecomment-618512883 thanks @andygrove !
[GitHub] [arrow] BryanCutler commented on issue #6992: ARROW-7950: [Python] Determine + test minimal pandas version + raise error when pandas is too old
BryanCutler commented on issue #6992: URL: https://github.com/apache/arrow/pull/6992#issuecomment-618517973 Sounds good to me. FWIW, Spark also has a minimum Pandas version set at 0.23.2.
[GitHub] [arrow] tustvold commented on a change in pull request #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on a change in pull request #6980: URL: https://github.com/apache/arrow/pull/6980#discussion_r413969298 ## File path: rust/arrow/src/array/builder.rs ## @@ -236,6 +251,14 @@ impl BufferBuilderTrait for BufferBuilder { self.write_bytes(v.to_byte_slice(), 1) } +default fn append_n( self, n: usize, v: T::Native) -> Result<()> { +self.reserve(n)?; +for _ in 0..n { +self.write_bytes(v.to_byte_slice(), 1)?; +} Review comment: Thanks, I was not aware of this particular compiler bug. In this particular case the codegen at opt-level 2 or higher is identical; I think the problem as described [here](https://github.com/rust-lang/rust/issues/35981) only manifests on nested for loops. I can update if necessary, but error propagation out of the closure is pretty clunky...
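The trade-off under discussion can be sketched over a plain `Vec<u8>` (assumed stand-in for the builder's internal buffer, not the actual `BufferBuilder` API): reserve the full capacity once, then append in a simple loop; at opt-level 2 or higher the per-iteration bookkeeping optimizes away.

```rust
// Sketch of append_n over a plain byte buffer (stand-in for BufferBuilder).
fn append_n(buf: &mut Vec<u8>, n: usize, value: &[u8]) {
    buf.reserve(n * value.len()); // one allocation up front
    for _ in 0..n {
        buf.extend_from_slice(value); // cheap append; no realloc after reserve
    }
}

fn main() {
    let mut buf = Vec::new();
    append_n(&mut buf, 3, &7u16.to_le_bytes());
    assert_eq!(buf, [7, 0, 7, 0, 7, 0]);
}
```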
[GitHub] [arrow] xhochy opened a new pull request #7023: ARROW-8571: [C++] Switch AppVeyor image to VS 2017
xhochy opened a new pull request #7023: URL: https://github.com/apache/arrow/pull/7023
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r41381

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment: I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group?
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413800208

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment:

> Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.

That was my original point, which you said was irrelevant...
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413814578

## File path: python/pyarrow/_parquet.pyx

```cython
@@ -1083,6 +1084,50 @@ cdef class ParquetReader:

     def set_use_threads(self, bint use_threads):
         self.reader.get().set_use_threads(use_threads)

+    def iter_batches(self, int64_t batch_size, row_groups, column_indices=None,
```

Review comment: Ditto, I'd add a method `def set_batch_size(self, int64_t batch_size)`. Why do you need the `batch_size` and `use_threads` parameters if they're already properties of the class and the caller can change them via setters on the instance?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413797661

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -536,6 +577,14 @@ std::shared_ptr<RowGroupReader> ParquetFileReader::RowGroup(int i) {
   return contents_->GetRowGroup(i);
 }

+void ParquetFileReader::PreBuffer(const std::vector<int>& row_groups,
+                                  const std::vector<int>& column_indices,
+                                  const ::arrow::io::CacheOptions& options) {
+  // Access private methods here
+  SerializedFile* file = static_cast<SerializedFile*>(contents_.get());
```

Review comment: It's not possible, but I made it a checked cast regardless.
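The "checked cast" mentioned above can be sketched as follows. This is a self-contained illustration of the pattern only, not Arrow's actual `::arrow::internal::checked_cast` (which behaves differently in debug vs release builds); the `Contents`, `SerializedFile`, and `OtherContents` types here are simplified stand-ins.

```cpp
#include <stdexcept>

// Simplified stand-ins for ParquetFileReader::Contents and its one
// production implementation, SerializedFile.
struct Contents {
  virtual ~Contents() = default;
};
struct SerializedFile : Contents {
  int metadata_version = 1;
};
struct OtherContents : Contents {};

// A "checked cast": dynamic_cast plus a hard failure when the runtime
// type is not the expected one, instead of silently proceeding with a
// dangling downcast as a plain static_cast would.
template <typename Derived, typename Base>
Derived* checked_cast(Base* base) {
  auto* derived = dynamic_cast<Derived*>(base);
  if (derived == nullptr) {
    throw std::runtime_error("checked_cast: unexpected runtime type");
  }
  return derived;
}
```

The cost over `static_cast` is one RTTI lookup, which is negligible on a once-per-call path like `PreBuffer`.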
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413798078

## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc

```cpp
@@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) {
 }
 BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime();

-BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) {
-  ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k");
-}
-BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime();
+// Helpers to generate various multiple benchmarks for a given Parquet file.
+// NAME: the base name of the benchmark.
+// ROWS: the number of rows in the Parquet file.
+// COLS: the number of columns in the Parquet file.
+// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader)
+#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY)                                \
+  BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State & st) { \
+    std::vector<int> column_indices(COLS);                                           \
+    std::iota(column_indices.begin(), column_indices.end(), 0);                      \
+    std::stringstream ss;                                                            \
+    ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k";                         \
+    ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY);          \
```

Review comment: Good idea, I've created a set of helper functions that set things up and then call the benchmark.
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413810237

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -536,6 +577,14 @@ std::shared_ptr<RowGroupReader> ParquetFileReader::RowGroup(int i) {
   return contents_->GetRowGroup(i);
 }

+void ParquetFileReader::PreBuffer(const std::vector<int>& row_groups,
+                                  const std::vector<int>& column_indices,
+                                  const ::arrow::io::CacheOptions& options) {
+  // Access private methods here
+  SerializedFile* file = static_cast<SerializedFile*>(contents_.get());
```

Review comment: Interesting. Do we use the polymorphism only for tests?

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment: To be clear, I don't think this concern should block the PR, but the docstring should warn about it.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413811549

## File path: cpp/src/parquet/arrow/reader.cc

```cpp
@@ -260,12 +260,28 @@ class FileReaderImpl : public FileReader {
   Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                               const std::vector<int>& column_indices,
-                              std::unique_ptr<RecordBatchReader>* out) override;
+                              std::unique_ptr<RecordBatchReader>* out,
```

Review comment: The FileReader object already has an `ArrowReaderProperties` member with this configuration. There is no need to add this extra parameter. Instead, you should add a setter method `set_batch_size()` analogous to `set_use_threads()` (in reader.h). This comment applies to every `GetRecordBatchReader` method to which you added the parameter.
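The setter-based design being suggested can be sketched like this: configuration lives on the reader's properties object, so read methods need no extra parameters. The classes below are illustrative stand-ins, not the real `ArrowReaderProperties`/`FileReader`, and the default batch size is an assumption for the sketch.

```cpp
#include <cstdint>

// Illustrative stand-in for ArrowReaderProperties.
class ReaderProperties {
 public:
  void set_batch_size(int64_t batch_size) { batch_size_ = batch_size; }
  int64_t batch_size() const { return batch_size_; }

 private:
  int64_t batch_size_ = 64 * 1024;  // assumed default, for illustration only
};

// Illustrative stand-in for FileReader: the reader owns its properties,
// so GetRecordBatchReader-style methods need no extra batch_size
// parameter threaded through every overload.
class FileReader {
 public:
  void set_batch_size(int64_t batch_size) {
    properties_.set_batch_size(batch_size);
  }
  int64_t batch_size() const { return properties_.batch_size(); }

 private:
  ReaderProperties properties_;
};
```

The caller configures once (`reader.set_batch_size(n)`) and every subsequent read picks the value up, mirroring how `set_use_threads()` already works.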
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r413812220

## File path: python/pyarrow/_parquet.pxd

```cython
@@ -334,7 +334,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil:
         ArrowReaderProperties()
         void set_read_dictionary(int column_index, c_bool read_dict)
         c_bool read_dictionary()
-        void set_batch_size()
+        void set_batch_size(int batch_size)
```

Review comment: It should be `int64_t` instead of `int`.
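A small sketch of why the 64-bit type matters here: batch sizes and row counts in large Parquet files can exceed the 32-bit range, so declaring the binding with plain `int` would silently truncate them. `FitsInInt32` is a hypothetical helper used only for this illustration.

```cpp
#include <cstdint>

// Returns true when a value survives a round-trip through a 32-bit int.
constexpr bool FitsInInt32(int64_t v) {
  return v >= INT32_MIN && v <= INT32_MAX;
}

// ~3 billion rows: representable as int64_t, but not as a 32-bit int,
// which is why the C++ signature takes int64_t and the .pxd must match.
constexpr int64_t kLargeBatch = 3000000000LL;
static_assert(!FitsInInt32(kLargeBatch), "needs a 64-bit batch size");
static_assert(FitsInInt32(65536), "typical batch sizes still fit");
```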
[GitHub] [arrow] andygrove commented on a change in pull request #7009: ARROW-8552: [Rust] support iterate parquet row columns
andygrove commented on a change in pull request #7009: URL: https://github.com/apache/arrow/pull/7009#discussion_r413857080

## File path: rust/parquet/src/record/api.rs

```rust
@@ -50,6 +50,33 @@ impl Row {
     pub fn len(&self) -> usize {
         self.fields.len()
     }
+
+    pub fn get_column_iter(&self) -> RowColumnIter {
+        RowColumnIter {
+            fields: &self.fields,
+            curr: 0,
+            count: self.fields.len(),
+        }
+    }
+}
+
+pub struct RowColumnIter<'a> {
+    fields: &'a Vec<(String, Field)>,
+    curr: usize,
+    count: usize,
+}
+
+impl<'a> Iterator for RowColumnIter<'a> {
+    type Item = (usize, &'a String);
```

Review comment: Why do we need the index included in the iterator results? The user could use `enumerate` to get it. Also, would it make more sense for the iterator to be over the `Field` rather than just the field name?
[GitHub] [arrow] github-actions[bot] commented on issue #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
github-actions[bot] commented on issue #7019: URL: https://github.com/apache/arrow/pull/7019#issuecomment-618456824 https://issues.apache.org/jira/browse/ARROW-8569
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413792469

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment: That's fair. In our case, even if a dataset is large, individual files are smaller, so it's fine to buffer an entire file and then discard it. But I agree that for other use cases, this is not ideal. A caller who is very concerned about memory might instead choose to explicitly read only one row group at a time to limit memory usage. This is rather annoying, though. We could create a separate cache per row group, but this means we lose some performance as we can't coalesce reads across row groups anymore. However, that might be a worthwhile tradeoff for large files. Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
[GitHub] [arrow] nealrichardson opened a new pull request #7019: ARROW-8569: [CI] Upgrade xcode version for testing homebrew formulae
nealrichardson opened a new pull request #7019: URL: https://github.com/apache/arrow/pull/7019 See https://github.com/apache/arrow/pull/6996#issuecomment-618053499
[GitHub] [arrow] pitrou commented on issue #6846: ARROW-3329: [Python] Python tests for decimal to int and decimal to decimal casts
pitrou commented on issue #6846: URL: https://github.com/apache/arrow/pull/6846#issuecomment-618404946 I rebased and improved the tests slightly. Also opened some issues for some oddities.
[GitHub] [arrow] pitrou commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
pitrou commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413783259

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment: I don't want anything specifically, but I wonder what users of the API may want. If you're reading a 10GB file, in a streaming fashion, from S3, you probably don't want to keep the 10GB in memory, right?
[GitHub] [arrow] tustvold commented on issue #6980: ARROW-8516: [Rust] Improve PrimitiveBuilder::append_slice performance
tustvold commented on issue #6980: URL: https://github.com/apache/arrow/pull/6980#issuecomment-618389453 Rebased on current master and the CI now builds :tada:
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744: URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271

## File path: cpp/src/parquet/file_reader.cc

```cpp
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared(source_, options);
```

Review comment: GitHub acting up, sorry for the delay...

> > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
>
> That was my original point, which you said was irrelevant...

Apologies, I remember you made this point on JIRA or the mailing list and I was trying to find where (I still can't find it :/). I admit my understanding of the internals is poor, and your comment helped me understand what to look for better.

> I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group?

Sorry, I was thinking about creating RowGroupReaders from a FileReader, one row group at a time: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi

Each time this is called, it internally recreates the cache and buffers a new set of ranges. (Hence the comment above about "unsafe".) So this would let you read a row group's worth of data at a time, at the cost of more complexity for the user.

> To be clear, I don't think this concern should block the PR, but the docstring should warn about it.

I'll update it.
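The memory tradeoff this thread keeps circling — one shared file-level cache versus a cache recreated per row group — can be modeled with a toy sketch. Nothing below is Arrow's actual read-cache API; `RangeCache` and the two helper functions are hypothetical constructs that only track buffered byte counts.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Toy cache of pre-buffered byte ranges (offset -> length). It only
// counts bytes; it does not model coalescing or actual I/O.
struct RangeCache {
  std::map<int64_t, int64_t> ranges;
  int64_t buffered_bytes = 0;
  void PreBuffer(int64_t offset, int64_t length) {
    ranges[offset] = length;
    buffered_bytes += length;
  }
};

// Shared, file-level cache: every row group's ranges stay resident for
// the whole read, so peak memory is the sum of all row group sizes.
int64_t PeakBytesSharedCache(const std::vector<int64_t>& row_group_sizes) {
  RangeCache cache;
  int64_t offset = 0;
  for (int64_t size : row_group_sizes) {
    cache.PreBuffer(offset, size);
    offset += size;
  }
  return cache.buffered_bytes;
}

// Cache recreated per row group: earlier buffers are discarded, so peak
// memory is bounded by the largest single row group -- at the cost of
// losing any chance to coalesce reads across row groups.
int64_t PeakBytesPerRowGroup(const std::vector<int64_t>& row_group_sizes) {
  int64_t peak = 0;
  int64_t offset = 0;
  for (int64_t size : row_group_sizes) {
    RangeCache cache;  // fresh cache for this row group only
    cache.PreBuffer(offset, size);
    offset += size;
    peak = std::max(peak, cache.buffered_bytes);
  }
  return peak;
}
```

For a file with row groups of 100, 300, and 200 bytes, the shared cache peaks at 600 bytes while the per-row-group scheme peaks at 300, which is the streaming-friendly behavior pitrou's 10GB-file example asks about.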