[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-18 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r526371625



##
File path: rust/arrow/src/buffer.rs
##
@@ -888,6 +888,15 @@ impl MutableBuffer {
 }
 self.len += bytes.len();
 }
+
+/// Extends the buffer by `len` with all bytes equal to `0u8`, 
incrementing its capacity if needed.
+pub fn extend( self, len: usize) {
+let remaining_capacity = self.capacity - self.len;
+if len > remaining_capacity {
+self.reserve(self.len + len);

Review comment:
   @jorgecarleitao  good spot - what I suggested is more obvious when len > 
remaining_capacity such as if `len = 256` in your example, then requested 
capacity would be `54 + (256 - (64 - 54)) = 54 + (256 - 10) = 300`; I think it 
will still work even in your example because the requested of 52 capacity will 
be lower than existing capacity so no allocation will be performed as it is not 
necessary because the requested extension of 8 is smaller than remaining 
capacity of 10.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-18 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r526371625



##
File path: rust/arrow/src/buffer.rs
##
@@ -888,6 +888,15 @@ impl MutableBuffer {
 }
 self.len += bytes.len();
 }
+
+/// Extends the buffer by `len` with all bytes equal to `0u8`, 
incrementing its capacity if needed.
+pub fn extend( self, len: usize) {
+let remaining_capacity = self.capacity - self.len;
+if len > remaining_capacity {
+self.reserve(self.len + len);

Review comment:
   @jorgecarleitao  good spot - what I suggested is more obvious when len > 
remaining_capacity such if `len = 256` in your example, then requested capacity 
would be `54 + (256 - (64 - 54)) = 54 + (256 - 10) = 300`; I think it will 
still work even in your example because the requested of 52 capacity will be 
lower than existing capacity so no allocation will be performed as it is not 
necessary because the requested extension of 8 is smaller than remaining 
capacity of 10.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-17 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r525541440



##
File path: rust/arrow/src/buffer.rs
##
@@ -888,6 +888,15 @@ impl MutableBuffer {
 }
 self.len += bytes.len();
 }
+
+/// Extends the buffer by `len` with all bytes equal to `0u8`, 
incrementing its capacity if needed.
+pub fn extend( self, len: usize) {
+let remaining_capacity = self.capacity - self.len;
+if len > remaining_capacity {
+self.reserve(self.len + len);

Review comment:
   should this be `self.reserve(self.len + (len - remaining_capacity));` ?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-17 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r525539070



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,128 +14,141 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use arrow::{compute::Filter, util::test_util::seedable_rng};
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = seedable_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = seedable_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = ( rng)
+.sample_iter()
+.take(10)
+.collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = seedable_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;
+builder.append_value(value).unwrap();
 }
 builder.finish()
 }
 
-fn bench_filter_u8(data_array: , filter_array: ) {
-filter(
-criterion::black_box(data_array),
-criterion::black_box(filter_array),
-)
-.unwrap();
+fn bench_filter(data_array: , filter_array: ) {
+criterion::black_box(filter(data_array, filter_array).unwrap());
 }
 
-// fn bench_filter_f32(data_array: , filter_array: ) 
{
-// filter(criterion::black_box(data_array), 
criterion::black_box(filter_array)).unwrap();
-// }
-
-fn bench_filter_context_u8(data_array: , filter_context: 
) {
-filter_context
-.filter(criterion::black_box(data_array))
-.unwrap();
-}
-
-fn bench_filter_context_f32(data_array: , filter_context: 
) {
-filter_context
-.filter(criterion::black_box(data_array))
-.unwrap();
+fn bench_built_filter<'a>(filter: <'a>, data:  Array) {
+criterion::black_box(filter(()));
 }
 
 fn add_benchmark(c:  Criterion) {
 let size = 65536;
-let filter_array = create_bool_array(size, |i| matches!(i % 2, 0));
-let sparse_filter_array = create_bool_array(size, |i| matches!(i % 8000, 
0));
-let dense_filter_array = create_bool_array(size, |i| !matches!(i % 8000, 
0));
+let filter_array = create_bool_array(size, 0.5);
+let sparse_filter_array = create_bool_array(size, 1.0 - 1.0 / 8000.0);

Review comment:
   the names of the `sparse_filter_array` (intended to contain mostly 0s) 
and `dense_filter_array` (intended to contain mostly 1s) variables no longer 
correspond to the values; should they be reversed?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-11 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r521588710



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,137 +14,136 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = 
rng.sample_iter().take(10).collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = rand::thread_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;

Review comment:
   @jorgecarleitao once the approach to filter benchmarks has been 
finalized would it be possible to rearrange the commits so that the same 
benchmark code can be used to benchmark both the old and new implementations so 
that we can do a direct comparison?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-11 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,137 +14,136 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = 
rng.sample_iter().take(10).collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = rand::thread_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;

Review comment:
   @jorgecarleitao  the filter benchmarks may not simulate real-world use 
cases, but they are designed to test the code under specific conditions such as 
the worst case scenario with alternating 1s and 0s where no batch can be 
skipped and all selected values have to be copied individually; how can this 
scenario be achieved with a randomly generated filter array?
   
   the other scenarios which test mostly 0s (best performance because most 
filter batches can be skipped and only a small number of selected values have 
to be copied) and mostly 1s (which is not as fast, but still faster than worst 
case because filter batches can be checked quickly and most values are copied 
in slices) should be easier to achieve with random filter arrays but are they 
going to be repeatable?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-11 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r521574396



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,137 +14,136 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = 
rng.sample_iter().take(10).collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = rand::thread_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;

Review comment:
   the filter benchmarks may not simulate real-world use cases, but they 
are designed to test the code under specific conditions such as the worst case 
scenario with alternating 1s and 0s where no batch can be skipped and all 
selected values have to be copied individually; how can this scenario be 
achieved with a randomly generated filter array?
   
   the other scenarios which test mostly 0s (best performance because most 
filter batches can be skipped and only a small number of selected values have 
to be copied) and mostly 1s (which is not as fast, but still faster than worst 
case because filter batches can be checked quickly and most values are copied 
in slices) should be easier to achieve with random filter arrays but are they 
going to be repeatable?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-10 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r520902160



##
File path: rust/arrow/src/array/transform/mod.rs
##
@@ -0,0 +1,496 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{io::Write, mem::size_of, sync::Arc};
+
+use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util};
+
+use super::{ArrayData, ArrayDataRef};
+
+mod boolean;
+mod list;
+mod primitive;
+mod utils;
+mod variable_size;
+
+type ExtendNullBits<'a> = Box () + 
'a>;
+// function that extends `[start..start+len]` to the mutable array.
+// this is dynamic because different data_types influence how buffers and 
childs are extended.
+type Extend<'a> = Box () + 'a>;
+
+#[derive(Debug)]
+struct _MutableArrayData<'a> {
+pub data_type: DataType,
+pub null_count: usize,
+
+pub len: usize,
+pub null_buffer: MutableBuffer,
+
+pub buffers: Vec,
+pub child_data: Vec>,
+}
+
+impl<'a> _MutableArrayData<'a> {
+fn freeze(self, dictionary: Option) -> ArrayData {
+let mut buffers = Vec::with_capacity(self.buffers.len());
+for buffer in self.buffers {
+buffers.push(buffer.freeze());
+}
+
+let child_data = match self.data_type {
+DataType::Dictionary(_, _) => vec![dictionary.unwrap()],
+_ => {
+let mut child_data = Vec::with_capacity(self.child_data.len());
+for child in self.child_data {
+child_data.push(Arc::new(child.freeze()));
+}
+child_data
+}
+};
+ArrayData::new(
+self.data_type,
+self.len,
+Some(self.null_count),
+if self.null_count > 0 {
+Some(self.null_buffer.freeze())
+} else {
+None
+},
+0,
+buffers,
+child_data,
+)
+}
+
+/// Returns the buffer `buffer` as a slice of type `T`. When the expected 
buffer is bit-packed,
+/// the slice is not offset.
+#[inline]
+pub(super) fn buffer(, buffer: usize) -> &[T] {
+let values = unsafe { self.buffers[buffer].data().align_to::() };
+if !values.0.is_empty() || !values.2.is_empty() {
+// this is unreachable because
+unreachable!("The buffer is not byte-aligned with its 
interpretation")
+};
+
+}
+}
+
+fn build_set_nulls<'a>(array: &'a ArrayData) -> ExtendNullBits<'a> {
+if let Some(bitmap) = array.null_bitmap() {
+let bytes = bitmap.bits.data();
+Box::new(move |mutable, start, len| {
+utils::resize_for_bits( mutable.null_buffer, mutable.len + 
len);
+mutable.null_count += utils::set_bits(
+mutable.null_buffer.data_mut(),
+bytes,
+mutable.len,
+array.offset() + start,
+len,
+);
+})
+} else {
+Box::new(|_, _, _| {})
+}
+}
+
+/// Struct to efficiently and interactively create an [ArrayData] from an 
existing [ArrayData] by
+/// copying chunks.
+/// The main use case of this struct is to perform unary operations to arrays 
of arbitrary types, such as `filter` and `take`.
+/// # Example:
+///
+/// ```
+/// use std::sync::Arc;
+/// use arrow::{array::{Int32Array, Array, MutableArrayData}};
+///
+/// let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data();
+/// // Create a new `MutableArrayData` from an array and with a capacity.
+/// // Capacity here is equivalent to `Vec::with_capacity`
+/// let mut mutable = MutableArrayData::new(, 4);
+/// mutable.extend(1, 3); // extend from the slice [1..3], [2,3]
+/// mutable.extend(0, 3); // extend from the slice [0..3], [1,2,3]
+/// // `.freeze()` to convert `MutableArrayData` into a `ArrayData`.
+/// let new_array = Int32Array::from(Arc::new(mutable.freeze()));
+/// assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array);
+/// ```
+pub struct MutableArrayData<'a> {
+data: _MutableArrayData<'a>,
+
+dictionary: Option,
+
+push_slice: Extend<'a>,
+set_nulls: ExtendNullBits<'a>,
+}
+
+impl<'a> 

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-10 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r520902160



##
File path: rust/arrow/src/array/transform/mod.rs
##
@@ -0,0 +1,496 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{io::Write, mem::size_of, sync::Arc};
+
+use crate::{buffer::MutableBuffer, datatypes::DataType, util::bit_util};
+
+use super::{ArrayData, ArrayDataRef};
+
+mod boolean;
+mod list;
+mod primitive;
+mod utils;
+mod variable_size;
+
+type ExtendNullBits<'a> = Box () + 
'a>;
+// function that extends `[start..start+len]` to the mutable array.
+// this is dynamic because different data_types influence how buffers and 
childs are extended.
+type Extend<'a> = Box () + 'a>;
+
+#[derive(Debug)]
+struct _MutableArrayData<'a> {
+pub data_type: DataType,
+pub null_count: usize,
+
+pub len: usize,
+pub null_buffer: MutableBuffer,
+
+pub buffers: Vec,
+pub child_data: Vec>,
+}
+
+impl<'a> _MutableArrayData<'a> {
+fn freeze(self, dictionary: Option) -> ArrayData {
+let mut buffers = Vec::with_capacity(self.buffers.len());
+for buffer in self.buffers {
+buffers.push(buffer.freeze());
+}
+
+let child_data = match self.data_type {
+DataType::Dictionary(_, _) => vec![dictionary.unwrap()],
+_ => {
+let mut child_data = Vec::with_capacity(self.child_data.len());
+for child in self.child_data {
+child_data.push(Arc::new(child.freeze()));
+}
+child_data
+}
+};
+ArrayData::new(
+self.data_type,
+self.len,
+Some(self.null_count),
+if self.null_count > 0 {
+Some(self.null_buffer.freeze())
+} else {
+None
+},
+0,
+buffers,
+child_data,
+)
+}
+
+/// Returns the buffer `buffer` as a slice of type `T`. When the expected 
buffer is bit-packed,
+/// the slice is not offset.
+#[inline]
+pub(super) fn buffer(, buffer: usize) -> &[T] {
+let values = unsafe { self.buffers[buffer].data().align_to::() };
+if !values.0.is_empty() || !values.2.is_empty() {
+// this is unreachable because
+unreachable!("The buffer is not byte-aligned with its 
interpretation")
+};
+
+}
+}
+
+fn build_set_nulls<'a>(array: &'a ArrayData) -> ExtendNullBits<'a> {
+if let Some(bitmap) = array.null_bitmap() {
+let bytes = bitmap.bits.data();
+Box::new(move |mutable, start, len| {
+utils::resize_for_bits( mutable.null_buffer, mutable.len + 
len);
+mutable.null_count += utils::set_bits(
+mutable.null_buffer.data_mut(),
+bytes,
+mutable.len,
+array.offset() + start,
+len,
+);
+})
+} else {
+Box::new(|_, _, _| {})
+}
+}
+
+/// Struct to efficiently and interactively create an [ArrayData] from an 
existing [ArrayData] by
+/// copying chunks.
+/// The main use case of this struct is to perform unary operations to arrays 
of arbitrary types, such as `filter` and `take`.
+/// # Example:
+///
+/// ```
+/// use std::sync::Arc;
+/// use arrow::{array::{Int32Array, Array, MutableArrayData}};
+///
+/// let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data();
+/// // Create a new `MutableArrayData` from an array and with a capacity.
+/// // Capacity here is equivalent to `Vec::with_capacity`
+/// let mut mutable = MutableArrayData::new(, 4);
+/// mutable.extend(1, 3); // extend from the slice [1..3], [2,3]
+/// mutable.extend(0, 3); // extend from the slice [0..3], [1,2,3]
+/// // `.freeze()` to convert `MutableArrayData` into a `ArrayData`.
+/// let new_array = Int32Array::from(Arc::new(mutable.freeze()));
+/// assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array);
+/// ```
+pub struct MutableArrayData<'a> {
+data: _MutableArrayData<'a>,
+
+dictionary: Option,
+
+push_slice: Extend<'a>,
+set_nulls: ExtendNullBits<'a>,
+}
+
+impl<'a> 

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-10 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r520898403



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,137 +14,136 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = 
rng.sample_iter().take(10).collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = rand::thread_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;

Review comment:
   I wonder if this is the reason for the inconsistency in the benchmark 
results - usually a highly selective filter, where most filter bits are 0s and 
only a small number of values are selected / copied to the output array will 
always have the best performance because most batches of filter bits can be 
skipped quickly and only a few values are copied to the output array;
   this relationship can clearly be seen in the benchmark results listed in 
this earlier PR https://github.com/apache/arrow/pull/7798;
   
   however in the benchmark results listed in the description of this PR in 
many cases the opposite is true - low selectivity filter benchmarks achieve 
much better performance than their high selectivity counterparts; I wonder 
what's the reason for that?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yordan-pavlov commented on a change in pull request #8630: ARROW-10540 [Rust] Improve filtering

2020-11-10 Thread GitBox


yordan-pavlov commented on a change in pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#discussion_r520890613



##
File path: rust/arrow/benches/filter_kernels.rs
##
@@ -14,137 +14,136 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+extern crate arrow;
+
+use rand::{
+distributions::{Alphanumeric, Standard},
+prelude::Distribution,
+Rng,
+};
 
 use arrow::array::*;
-use arrow::compute::{filter, FilterContext};
+use arrow::compute::{build_filter, filter};
 use arrow::datatypes::ArrowNumericType;
+use arrow::datatypes::{Float32Type, UInt8Type};
+
 use criterion::{criterion_group, criterion_main, Criterion};
 
-fn create_primitive_array(size: usize, value_fn: F) -> PrimitiveArray
+fn create_primitive_array(size: usize, null_density: f32) -> 
PrimitiveArray
 where
 T: ArrowNumericType,
-F: Fn(usize) -> T::Native,
+Standard: Distribution,
 {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
 let mut builder = PrimitiveArraybuilder(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+
+for _ in 0..size {
+if rng.gen::() < null_density {
+builder.append_null().unwrap();
+} else {
+builder.append_value(rng.gen()).unwrap();
+}
 }
 builder.finish()
 }
 
-fn create_u8_array_with_nulls(size: usize) -> UInt8Array {
-let mut builder = UInt8Builder::new(size);
-for i in 0..size {
-if i % 2 == 0 {
-builder.append_value(1).unwrap();
-} else {
+fn create_string_array(size: usize, null_density: f32) -> StringArray {
+// use random numbers to avoid spurious compiler optimizations wrt to 
branching
+let mut rng = rand::thread_rng();
+let mut builder = StringBuilder::new(size);
+
+for _ in 0..size {
+if rng.gen::() < null_density {
 builder.append_null().unwrap();
+} else {
+let value = 
rng.sample_iter().take(10).collect::();
+builder.append_value().unwrap();
 }
 }
 builder.finish()
 }
 
-fn create_bool_array(size: usize, value_fn: F) -> BooleanArray
-where
-F: Fn(usize) -> bool,
-{
+fn create_bool_array(size: usize, trues_density: f32) -> BooleanArray {
+let mut rng = rand::thread_rng();
 let mut builder = BooleanBuilder::new(size);
-for i in 0..size {
-builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+let value = rng.gen::() < trues_density;

Review comment:
   using random numbers to generate the filter arrays makes it difficult to 
control filter selectivity; also doesn't that make each benchmark run unique 
which I think is the opposite of what we want - we want consistent benchmarks 
with stable, repeatable conditions





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org