[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r526371625 ## File path: rust/arrow/src/buffer.rs ## @@ -888,6 +888,15 @@ impl MutableBuffer { } self.len += bytes.len(); } + +/// Extends the buffer by `len` with all bytes equal to `0u8`, incrementing its capacity if needed. +pub fn extend( self, len: usize) { +let remaining_capacity = self.capacity - self.len; +if len > remaining_capacity { +self.reserve(self.len + len); Review comment: @jorgecarleitao good spot - what I suggested is more obvious when len > remaining_capacity such as if `len = 256` in your example, then requested capacity would be `54 + (256 - (64 - 54)) = 54 + (256 - 10) = 300`; I think it will still work even in your example because the requested of 52 capacity will be lower than existing capacity so no allocation will be performed as it is not necessary because the requested extension of 8 is smaller than remaining capacity of 10. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r526371625 ## File path: rust/arrow/src/buffer.rs ## @@ -888,6 +888,15 @@ impl MutableBuffer { } self.len += bytes.len(); } + +/// Extends the buffer by `len` with all bytes equal to `0u8`, incrementing its capacity if needed. +pub fn extend( self, len: usize) { +let remaining_capacity = self.capacity - self.len; +if len > remaining_capacity { +self.reserve(self.len + len); Review comment: @jorgecarleitao good spot - what I suggested is more obvious when len > remaining_capacity such if `len = 256` in your example, then requested capacity would be `54 + (256 - (64 - 54)) = 54 + (256 - 10) = 300`; I think it will still work even in your example because the requested of 52 capacity will be lower than existing capacity so no allocation will be performed as it is not necessary because the requested extension of 8 is smaller than remaining capacity of 10. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r525541440 ## File path: rust/arrow/src/buffer.rs ## @@ -888,6 +888,15 @@ impl MutableBuffer { } self.len += bytes.len(); } + +/// Extends the buffer by `len` with all bytes equal to `0u8`, incrementing its capacity if needed. +pub fn extend( self, len: usize) { +let remaining_capacity = self.capacity - self.len; +if len > remaining_capacity { +self.reserve(self.len + len); Review comment: should this be `self.reserve(self.len + (len - remaining_capacity));` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r525539070 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,128 +14,141 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use arrow::{compute::Filter, util::test_util::seedable_rng}; +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = seedable_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = seedable_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = ( rng) +.sample_iter() +.take(10) +.collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = seedable_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; +builder.append_value(value).unwrap(); } builder.finish() } -fn bench_filter_u8(data_array: , filter_array: ) { -filter( -criterion::black_box(data_array), -criterion::black_box(filter_array), -) -.unwrap(); +fn bench_filter(data_array: , filter_array: ) { +criterion::black_box(filter(data_array, filter_array).unwrap()); } -// fn bench_filter_f32(data_array: , filter_array: ) { -// filter(criterion::black_box(data_array), criterion::black_box(filter_array)).unwrap(); -// } - -fn bench_filter_context_u8(data_array: , filter_context: ) { -filter_context -.filter(criterion::black_box(data_array)) -.unwrap(); -} - -fn bench_filter_context_f32(data_array: , filter_context: ) { -filter_context -.filter(criterion::black_box(data_array)) -.unwrap(); +fn bench_built_filter<'a>(filter: <'a>, data: Array) { +criterion::black_box(filter(())); } fn add_benchmark(c: Criterion) { let size = 65536; -let filter_array = create_bool_array(size, |i| matches!(i % 2, 0)); -let sparse_filter_array = create_bool_array(size, |i| matches!(i % 8000, 0)); -let dense_filter_array = create_bool_array(size, |i| !matches!(i % 8000, 0)); +let filter_array = create_bool_array(size, 0.5); +let sparse_filter_array = create_bool_array(size, 1.0 - 1.0 / 8000.0); Review comment: the names of the `sparse_filter_array` (intended to contain mostly 0s) and `dense_filter_array` (intended to contain mostly 1s) variables no longer correspond to the values; should they be reversed? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521588710 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter().take(10).collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; Review comment: @jorgecarleitao once the approach to filter benchmarks has been finalized would it be possible to rearrange the commits so that the same benchmark code can be used to benchmark both the old and new implementations so that we can do a direct comparison? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter().take(10).collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; Review comment: @jorgecarleitao the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions such as the worst case scenario with alternating 1s and 0s where no batch can be skipped and all selected values have to be copied individually; how can this scenario be achieved with a randomly generated filter array? the other scenarios which test mostly 0s (best performance because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (which is not as fast, but still faster than worst case because filter batches can be checked quickly and most values are copied in slices) should be easier to achieve with random filter arrays but are they going to be repeatable? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter().take(10).collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; Review comment: the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions such as the worst case scenario with alternating 1s and 0s where no batch can be skipped and all selected values have to be copied individually; how can this scenario be achieved with a randomly generated filter array? the other scenarios which test mostly 0s (best performance because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (which is not as fast, but still faster than worst case because filter batches can be checked quickly and most values are copied in slices) should be easier to achieve with random filter arrays but are they going to be repeatable? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r520902160 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,496 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box () + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box () + 'a>; + +#[derive(Debug)] +struct _MutableArrayData<'a> { +pub data_type: DataType, +pub null_count: usize, + +pub len: usize, +pub null_buffer: MutableBuffer, + +pub buffers: Vec, +pub child_data: Vec>, +} + +impl<'a> _MutableArrayData<'a> { +fn freeze(self, dictionary: Option) -> ArrayData { +let mut buffers = Vec::with_capacity(self.buffers.len()); +for buffer in self.buffers { +buffers.push(buffer.freeze()); +} + +let child_data = match self.data_type { +DataType::Dictionary(_, _) => vec![dictionary.unwrap()], +_ => { +let mut child_data = Vec::with_capacity(self.child_data.len()); +for child in self.child_data { +child_data.push(Arc::new(child.freeze())); +} +child_data +} +}; +ArrayData::new( +self.data_type, +self.len, +Some(self.null_count), +if self.null_count > 0 { +Some(self.null_buffer.freeze()) +} else { +None +}, +0, +buffers, +child_data, +) +} + +/// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed, +/// the slice is not offset. +#[inline] +pub(super) fn buffer(, buffer: usize) -> &[T] { +let values = unsafe { self.buffers[buffer].data().align_to::() }; +if !values.0.is_empty() || !values.2.is_empty() { +// this is unreachable because +unreachable!("The buffer is not byte-aligned with its interpretation") +}; + +} +} + +fn build_set_nulls<'a>(array: &'a ArrayData) -> ExtendNullBits<'a> { +if let Some(bitmap) = array.null_bitmap() { +let bytes = bitmap.bits.data(); +Box::new(move |mutable, start, len| { +utils::resize_for_bits( mutable.null_buffer, mutable.len + len); +mutable.null_count += utils::set_bits( +mutable.null_buffer.data_mut(), +bytes, +mutable.len, +array.offset() + start, +len, +); +}) +} else { +Box::new(|_, _, _| {}) +} +} + +/// Struct to efficiently and interactively create an [ArrayData] from an existing [ArrayData] by +/// copying chunks. +/// The main use case of this struct is to perform unary operations to arrays of arbitrary types, such as `filter` and `take`. +/// # Example: +/// +/// ``` +/// use std::sync::Arc; +/// use arrow::{array::{Int32Array, Array, MutableArrayData}}; +/// +/// let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data(); +/// // Create a new `MutableArrayData` from an array and with a capacity. +/// // Capacity here is equivalent to `Vec::with_capacity` +/// let mut mutable = MutableArrayData::new(, 4); +/// mutable.extend(1, 3); // extend from the slice [1..3], [2,3] +/// mutable.extend(0, 3); // extend from the slice [0..3], [1,2,3] +/// // `.freeze()` to convert `MutableArrayData` into a `ArrayData`. +/// let new_array = Int32Array::from(Arc::new(mutable.freeze())); +/// assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array); +/// ``` +pub struct MutableArrayData<'a> { +data: _MutableArrayData<'a>, + +dictionary: Option, + +push_slice: Extend<'a>, +set_nulls: ExtendNullBits<'a>, +} + +impl<'a>
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r520902160 ## File path: rust/arrow/src/array/transform/mod.rs ## @@ -0,0 +1,496 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::{io::Write, mem::size_of, sync::Arc}; + +use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util}; + +use super::{ArrayData, ArrayDataRef}; + +mod boolean; +mod list; +mod primitive; +mod utils; +mod variable_size; + +type ExtendNullBits<'a> = Box () + 'a>; +// function that extends `[start..start+len]` to the mutable array. +// this is dynamic because different data_types influence how buffers and childs are extended. +type Extend<'a> = Box () + 'a>; + +#[derive(Debug)] +struct _MutableArrayData<'a> { +pub data_type: DataType, +pub null_count: usize, + +pub len: usize, +pub null_buffer: MutableBuffer, + +pub buffers: Vec, +pub child_data: Vec>, +} + +impl<'a> _MutableArrayData<'a> { +fn freeze(self, dictionary: Option) -> ArrayData { +let mut buffers = Vec::with_capacity(self.buffers.len()); +for buffer in self.buffers { +buffers.push(buffer.freeze()); +} + +let child_data = match self.data_type { +DataType::Dictionary(_, _) => vec![dictionary.unwrap()], +_ => { +let mut child_data = Vec::with_capacity(self.child_data.len()); +for child in self.child_data { +child_data.push(Arc::new(child.freeze())); +} +child_data +} +}; +ArrayData::new( +self.data_type, +self.len, +Some(self.null_count), +if self.null_count > 0 { +Some(self.null_buffer.freeze()) +} else { +None +}, +0, +buffers, +child_data, +) +} + +/// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed, +/// the slice is not offset. +#[inline] +pub(super) fn buffer(, buffer: usize) -> &[T] { +let values = unsafe { self.buffers[buffer].data().align_to::() }; +if !values.0.is_empty() || !values.2.is_empty() { +// this is unreachable because +unreachable!("The buffer is not byte-aligned with its interpretation") +}; + +} +} + +fn build_set_nulls<'a>(array: &'a ArrayData) -> ExtendNullBits<'a> { +if let Some(bitmap) = array.null_bitmap() { +let bytes = bitmap.bits.data(); +Box::new(move |mutable, start, len| { +utils::resize_for_bits( mutable.null_buffer, mutable.len + len); +mutable.null_count += utils::set_bits( +mutable.null_buffer.data_mut(), +bytes, +mutable.len, +array.offset() + start, +len, +); +}) +} else { +Box::new(|_, _, _| {}) +} +} + +/// Struct to efficiently and interactively create an [ArrayData] from an existing [ArrayData] by +/// copying chunks. +/// The main use case of this struct is to perform unary operations to arrays of arbitrary types, such as `filter` and `take`. +/// # Example: +/// +/// ``` +/// use std::sync::Arc; +/// use arrow::{array::{Int32Array, Array, MutableArrayData}}; +/// +/// let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data(); +/// // Create a new `MutableArrayData` from an array and with a capacity. +/// // Capacity here is equivalent to `Vec::with_capacity` +/// let mut mutable = MutableArrayData::new(, 4); +/// mutable.extend(1, 3); // extend from the slice [1..3], [2,3] +/// mutable.extend(0, 3); // extend from the slice [0..3], [1,2,3] +/// // `.freeze()` to convert `MutableArrayData` into a `ArrayData`. +/// let new_array = Int32Array::from(Arc::new(mutable.freeze())); +/// assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array); +/// ``` +pub struct MutableArrayData<'a> { +data: _MutableArrayData<'a>, + +dictionary: Option, + +push_slice: Extend<'a>, +set_nulls: ExtendNullBits<'a>, +} + +impl<'a>
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r520898403 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter().take(10).collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; Review comment: I wonder if this is the reason for the inconsistency in the benchmark results - usually a highly selective filter, where most filter bits are 0s and only a small number of values are selected / copied to the output array will always have the best performance because most batches of filter bits can be skipped quickly and only a few values are copied to the output array; this relationship can clearly be seen in the benchmark results listed in this earlier PR https://github.com/apache/arrow/pull/7798; however in the benchmark results listed in the description of this PR in many cases the opposite is true - low selectivity filter benchmarks achieve much better performance than their high selectivity counterparts; I wonder what's the reason for that? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering
yordan-pavlov commented on a change in pull request #8630: URL: https://github.com/apache/arrow/pull/8630#discussion_r520890613 ## File path: rust/arrow/benches/filter_kernels.rs ## @@ -14,137 +14,136 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. +extern crate arrow; + +use rand::{ +distributions::{Alphanumeric, Standard}, +prelude::Distribution, +Rng, +}; use arrow::array::*; -use arrow::compute::{filter, FilterContext}; +use arrow::compute::{build_filter, filter}; use arrow::datatypes::ArrowNumericType; +use arrow::datatypes::{Float32Type, UInt8Type}; + use criterion::{criterion_group, criterion_main, Criterion}; -fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray +fn create_primitive_array(size: usize, null_density: f32) -> PrimitiveArray where T: ArrowNumericType, -F: Fn(usize) -> T::Native, +Standard: Distribution, { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); let mut builder = PrimitiveArraybuilder(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); + +for _ in 0..size { +if rng.gen::() < null_density { +builder.append_null().unwrap(); +} else { +builder.append_value(rng.gen()).unwrap(); +} } builder.finish() } -fn create_u8_array_with_nulls(size: usize) -> UInt8Array { -let mut builder = UInt8Builder::new(size); -for i in 0..size { -if i % 2 == 0 { -builder.append_value(1).unwrap(); -} else { +fn create_string_array(size: usize, null_density: f32) -> StringArray { +// use random numbers to avoid spurious compiler optimizations wrt to branching +let mut rng = rand::thread_rng(); +let mut builder = StringBuilder::new(size); + +for _ in 0..size { +if rng.gen::() < null_density { builder.append_null().unwrap(); +} else { +let value = rng.sample_iter().take(10).collect::(); +builder.append_value().unwrap(); } } builder.finish() } -fn create_bool_array(size: usize, value_fn: F) -> BooleanArray -where -F: Fn(usize) -> bool, -{ +fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray { +let mut rng = rand::thread_rng(); let mut builder = BooleanBuilder::new(size); -for i in 0..size { -builder.append_value(value_fn(i)).unwrap(); +for _ in 0..size { +let value = rng.gen::() < trues_density; Review comment: using random numbers to generate the filter arrays makes it difficult to control filter selectivity; also doesn't that make each benchmark run unique which I think is the opposite of what we want - we want consistent benchmarks with stable, repeatable conditions This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org