[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups
jorgecarleitao commented on pull request #9271: URL: https://github.com/apache/arrow/pull/9271#issuecomment-765211936

Isn't the data contained in a buffer `Arc`ed? I.e. `Vec::clone()` should be cheap, no?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups
Dandandan commented on pull request #9271: URL: https://github.com/apache/arrow/pull/9271#issuecomment-765208218

@jorgecarleitao I found the "offending" code is this function in `array/data.rs`, which does a `self.clone()`. Any idea how we could write a non-copying version? I think it was copied/adapted based on a function in the array module, but I think it was also cloning in previous versions.

```rust
/// Creates a zero-copy slice of itself. This creates a new [ArrayData]
/// with a different offset, len and a shifted null bitmap.
///
/// # Panics
///
/// Panics if `offset + length > self.len()`.
pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
    assert!((offset + length) <= self.len());
    let mut new_data = self.clone();
    new_data.len = length;
    new_data.offset = offset + self.offset;
    new_data.null_count =
        count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
    new_data
}
```
[GitHub] [arrow] codecov-io commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`
codecov-io commented on pull request #9291: URL: https://github.com/apache/arrow/pull/9291#issuecomment-765203292

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=h1) Report
> Merging [#9291](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=desc) (beacc5d) into [master](https://codecov.io/gh/apache/arrow/commit/c413566b34bd0c13a01a68148bd78df1bdec3c10?el=desc) (c413566) will **increase** coverage by `0.12%`.
> The diff coverage is `97.67%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9291/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #9291      +/-   ##
==========================================
+ Coverage   81.61%   81.74%   +0.12%     
==========================================
  Files         215      215              
  Lines       52508    52486      -22     
==========================================
+ Hits        42854    42903      +49     
+ Misses       9654     9583      -71     
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWxfanNvbi5ycw==) | `92.16% <85.71%> (-0.26%)` | :arrow_down: |
| [rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz) | `94.52% <100.00%> (ø)` | |
| [rust/arrow/src/array/ord.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvb3JkLnJz) | `62.50% <100.00%> (+1.07%)` | :arrow_up: |
| [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `97.20% <100.00%> (+0.21%)` | :arrow_up: |
| [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `78.86% <0.00%> (-18.45%)` | :arrow_down: |
| [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `85.10% <0.00%> (-0.42%)` | :arrow_down: |
| [rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz) | `88.43% <0.00%> (ø)` | |
| [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `74.44% <0.00%> (+0.07%)` | :arrow_up: |
| [rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz) | `84.32% <0.00%> (+0.35%)` | :arrow_up: |
| [rust/arrow/src/ipc/convert.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2NvbnZlcnQucnM=) | `92.56% <0.00%> (+0.45%)` | :arrow_up: |
| ... and [8 more](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree-more) | | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=footer). Last update [c413566...beacc5d](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] jorgecarleitao closed pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default
jorgecarleitao closed pull request #9122: URL: https://github.com/apache/arrow/pull/9122
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`
jorgecarleitao commented on a change in pull request #9291: URL: https://github.com/apache/arrow/pull/9291#discussion_r562432512

## File path: rust/arrow/src/array/ord.rs

```diff
@@ -66,7 +68,9 @@
 where
 {
     let left = left.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
     let right = right.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
-    Box::new(move |i, j| cmp_nans_last(&left.value(i), &right.value(j)))
+    let left = left.values();
+    let right = right.values();
+    Box::new(move |i, j| cmp_nans_last(&left[i], &right[j]))
```

Review comment: this will be slower, but at this point we have no guarantee that `i` and `j` are within bounds.
[GitHub] [arrow] github-actions[bot] commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`
github-actions[bot] commented on pull request #9291: URL: https://github.com/apache/arrow/pull/9291#issuecomment-765189910

https://issues.apache.org/jira/browse/ARROW-11345
[GitHub] [arrow] jorgecarleitao commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`
jorgecarleitao commented on pull request #9291: URL: https://github.com/apache/arrow/pull/9291#issuecomment-765190026

cc @tyrelr .
[GitHub] [arrow] jorgecarleitao opened a new pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`
jorgecarleitao opened a new pull request #9291: URL: https://github.com/apache/arrow/pull/9291

# Problem

Currently, a lot of our code relies on `PrimitiveArray::value(usize)`. However, that method is highly `unsafe`, as any `usize` larger than the length of the array allows reading arbitrary memory regions due to the lack of bounds checks.

# This PR

This PR changes much of our code to not rely on that method for this operation, replacing it with a safe alternative. It is expected to improve the performance of the touched kernels, as we already have efficient ways to create an array out of an iterator.

todo: benchmark
[GitHub] [arrow] codecov-io edited a comment on pull request #9215: ARROW-11270: [Rust] Array slice accessors
codecov-io edited a comment on pull request #9215: URL: https://github.com/apache/arrow/pull/9215#issuecomment-761621256

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=h1) Report
> Merging [#9215](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=desc) (41c29bc) into [master](https://codecov.io/gh/apache/arrow/commit/499b6d0c5fbcf7d3aabf25e286cb8c300e6de52a?el=desc) (499b6d0) will **increase** coverage by `0.03%`.
> The diff coverage is `100.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9215/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #9215      +/-   ##
==========================================
+ Coverage   81.64%   81.68%   +0.03%     
==========================================
  Files         215      215              
  Lines       52419    52501      +82     
==========================================
+ Hits        42798    42885      +87     
+ Misses       9621     9616       -5     
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `86.99% <100.00%> (ø)` | |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `91.75% <100.00%> (+0.76%)` | :arrow_up: |
| [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `93.84% <100.00%> (+0.78%)` | :arrow_up: |
| [rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz) | `94.77% <100.00%> (+0.24%)` | :arrow_up: |
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `91.66% <100.00%> (+1.66%)` | :arrow_up: |
| [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `85.10% <100.00%> (ø)` | |
| [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `96.96% <100.00%> (-0.03%)` | :arrow_down: |
| [rust/arrow/src/compute/kernels/limit.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xpbWl0LnJz) | `100.00% <100.00%> (ø)` | |
| [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <100.00%> (ø)` | |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=footer). Last update [499b6d0...41c29bc](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] jorgecarleitao commented on issue #9178: Support for the binary SQL type in rust/datafusion
jorgecarleitao commented on issue #9178: URL: https://github.com/apache/arrow/issues/9178#issuecomment-765142713

Hi @ilya-biryukov, have you tried? The following works for me (based on the example `simple_udf`):

```rust
use arrow::{
    array::{ArrayRef, BinaryArray, Int64Array},
    datatypes::DataType,
    record_batch::RecordBatch,
    util::pretty,
};
use datafusion::error::Result;
use datafusion::{physical_plan::functions::ScalarFunctionImplementation, prelude::*};
use std::sync::Arc;

// create local execution context with an in-memory table
fn create_context() -> Result<ExecutionContext> {
    use arrow::datatypes::{Field, Schema};
    use datafusion::datasource::MemTable;

    // define a schema.
    let schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Binary, false)]));

    let a: &[u8] = b"";

    // define data.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(BinaryArray::from(vec![Some(a), None, None, None]))],
    )?;

    // declare a new context. In spark API, this corresponds to a new spark SQL session
    let mut ctx = ExecutionContext::new();

    // declare a table in memory. In spark API, this corresponds to createDataFrame(...).
    let provider = MemTable::try_new(schema, vec![vec![batch]])?;
    ctx.register_table("t", Box::new(provider));
    Ok(ctx)
}

/// In this example we will declare a single-type, single return type UDF that computes the byte length of a binary value
#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = create_context()?;

    // First, declare the actual implementation of the calculation
    let len: ScalarFunctionImplementation = Arc::new(|args: &[ArrayRef]| {
        // in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
        // 1. cast the values to the type we want
        // 2. perform the computation for every element in the array (using a loop or SIMD)

        // this is guaranteed by DataFusion based on the function's signature.
        assert_eq!(args.len(), 1);

        // 1. cast the argument to a `BinaryArray`
        let value = args[0]
            .as_any()
            .downcast_ref::<BinaryArray>()
            .expect("cast failed");

        // 2. run the UDF
        let array: Int64Array = value
            .iter()
            .map(|base| {
                // in arrow, any value can be null.
                base.map(|x| x.len() as i64)
            })
            .collect();
        Ok(Arc::new(array))
    });

    // Next:
    // * give it a name so that it shows nicely when the plan is printed
    // * declare what input it expects
    // * declare its return type
    let len = create_udf(
        "len",
        vec![DataType::Binary],
        Arc::new(DataType::Int64),
        len,
    );

    // at this point, we can use it or register it, depending on the use-case:
    // * if the UDF is expected to be used throughout the program in different contexts,
    //   we can register it, and call it later:
    ctx.register_udf(len.clone()); // clone is only required in this example because we show both usages

    // * if the UDF is expected to be used directly in the scope, `.call` it directly:
    let expr = len.call(vec![col("c")]);

    // get a DataFrame from the context
    let df = ctx.table("t")?;

    // equivalent to `SELECT len(c) FROM t`
    let df = df.select(vec![expr])?;

    // execute the query
    let results = df.collect().await?;

    // print the results
    pretty::print_batches(&results)?;

    Ok(())
}
```

```
+--------+
| len(c) |
+--------+
| 4      |
|        |
|        |
|        |
+--------+
```
[GitHub] [arrow] codecov-io commented on pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.
codecov-io commented on pull request #9290: URL: https://github.com/apache/arrow/pull/9290#issuecomment-765149755

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=h1) Report
> Merging [#9290](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=desc) (4e5220e) into [master](https://codecov.io/gh/apache/arrow/commit/c413566b34bd0c13a01a68148bd78df1bdec3c10?el=desc) (c413566) will **not change** coverage.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9290/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=tree)

```diff
@@           Coverage Diff           @@
##           master    #9290   +/-   ##
=======================================
  Coverage   81.61%   81.61%           
=======================================
  Files         215      215           
  Lines       52508    52508           
=======================================
  Hits        42854    42854           
  Misses       9654     9654           
```

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=footer). Last update [c413566...0bdcf4b](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] nevi-me closed pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.
nevi-me closed pull request #9281: URL: https://github.com/apache/arrow/pull/9281
[GitHub] [arrow] jorgecarleitao opened a new pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.
jorgecarleitao opened a new pull request #9290: URL: https://github.com/apache/arrow/pull/9290

I.e. use `collect` instead of the builder, which is simpler and faster
[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors
tyrelr commented on a change in pull request #9215: URL: https://github.com/apache/arrow/pull/9215#discussion_r562406187

## File path: rust/arrow/src/array/array_primitive.rs

```diff
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
     }

     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]
```

Review comment: That makes sense. I'll rebase dropping that commit.

Review comment: safety overhead on a few benchmarks seems to be 15-30%

```
group                          master-                        safe-prim-                      unsafe-prim-
-----                          -------                        ----------                      ------------
add 512                        1.00   307.4±4.79ns  ? B/sec   1.07    327.7±6.48ns  ? B/sec   1.30   400.5±7.09ns  ? B/sec
add_nulls_512                  1.20   447.8±8.09ns  ? B/sec   1.32    490.9±3.41ns  ? B/sec   1.00   372.2±5.96ns  ? B/sec
buffer_bit_ops or              1.17  731.0±20.02ns  ? B/sec   1.19    743.7±5.43ns  ? B/sec   1.00  624.0±10.86ns  ? B/sec
cast date64 to date32 512      1.00     7.4±0.22µs  ? B/sec   1.10      8.2±0.07µs  ? B/sec   1.03     7.7±0.22µs  ? B/sec
cast i64 to string 512         1.00    24.8±0.33µs  ? B/sec   1.09     27.1±0.79µs  ? B/sec   1.10    27.3±0.41µs  ? B/sec
cast time32s to time32ms 512   1.00  917.0±19.20ns  ? B/sec   1.21  1113.0±22.04ns  ? B/sec   1.07   984.5±7.53ns  ? B/sec
equal_512                      1.00    44.1±0.81ns  ? B/sec   1.13     49.9±0.65ns  ? B/sec   1.01    44.4±0.27ns  ? B/sec
min nulls string 512           1.04     6.4±0.12µs  ? B/sec   1.14      7.0±0.15µs  ? B/sec   1.00     6.2±0.09µs  ? B/sec
multiply 512                   1.14  471.5±10.15ns  ? B/sec   1.18    487.0±4.72ns  ? B/sec   1.00   413.4±6.16ns  ? B/sec
not                            1.00 1102.3±14.55ns  ? B/sec   1.12  1235.9±19.91ns  ? B/sec   1.02 1120.3±19.83ns  ? B/sec
```
[GitHub] [arrow] github-actions[bot] commented on pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.
github-actions[bot] commented on pull request #9290: URL: https://github.com/apache/arrow/pull/9290#issuecomment-765135950

https://issues.apache.org/jira/browse/ARROW-11343
[GitHub] [arrow] github-actions[bot] commented on pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.
github-actions[bot] commented on pull request #9281: URL: https://github.com/apache/arrow/pull/9281#issuecomment-764412043

https://issues.apache.org/jira/browse/ARROW-11333
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors
jorgecarleitao commented on a change in pull request #9215: URL: https://github.com/apache/arrow/pull/9215#discussion_r561618199

## File path: rust/arrow/src/array/array_primitive.rs

```diff
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
     }

     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
    pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]
```

Review comment: Thinking about it, I am not sure we want this: it will have severe implications for almost all uses of `value()`. Could you run the benches before and after this PR?

My suggestion is that we mark this method as `unsafe`, so that it is clear that people need to be conscious about it. Since this will impact a large part of our code base (and likely others), I suggest that we make this change in a separate PR.

I think that we would like to instruct users of this API to switch to `values()`. This change will not trigger that migration (the code continues to run), but it will have a large performance implication. Think

```rust
for i in 0..array.len() {
    if array.value(i) > 2 {
        // do x
    };
}
```

This loop will suffer a lot from this change. We would like users to change it to

```rust
array.values().iter().for_each(|&value| {
    if value > 2 {
        // do x
    };
});
```
[GitHub] [arrow] alamb commented on a change in pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
alamb commented on a change in pull request #9278: URL: https://github.com/apache/arrow/pull/9278#discussion_r561366255

## File path: rust/datafusion/src/logical_plan/expr.rs

```diff
@@ -422,6 +422,136 @@ impl Expr {
             nulls_first,
         }
     }
+
+    /// Performs a depth first walk of an expression and
+    /// its children, calling `visitor.pre_visit` and
+    /// `visitor.post_visit`.
+    ///
+    /// Implements the [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) to
+    /// separate expression algorithms from the structure of the
+    /// `Expr` tree and make it easier to add new types of expressions
+    /// and algorithms that walk the tree.
+    ///
+    /// For an expression tree such as
+    /// BinaryExpr (GT)
+    ///    left: Column("foo")
+    ///    right: Column("bar")
+    ///
+    /// The nodes are visited using the following order
+    /// ```text
+    /// pre_visit(BinaryExpr(GT))
+    /// pre_visit(Column("foo"))
+    /// post_visit(Column("foo"))
+    /// pre_visit(Column("bar"))
+    /// post_visit(Column("bar"))
+    /// post_visit(BinaryExpr(GT))
+    /// ```
+    ///
+    /// If an Err result is returned, recursion is stopped immediately
+    ///
+    /// If `Recursion::Stop` is returned on a call to pre_visit, no
+    /// children of that expression are visited, nor is post_visit
+    /// called on that expression
+    ///
+    pub fn accept<V: ExpressionVisitor>(&self, visitor: V) -> Result<V> {
+        let visitor = match visitor.pre_visit(self)? {
+            Recursion::Continue(visitor) => visitor,
+            // If the recursion should stop, do not visit children
+            Recursion::Stop(visitor) => return Ok(visitor),
+        };
+
+        // recurse (and cover all expression types)
+        let visitor = match self {
+            Expr::Alias(expr, _) => expr.accept(visitor),
+            Expr::Column(..) => Ok(visitor),
+            Expr::ScalarVariable(..) => Ok(visitor),
+            Expr::Literal(..) => Ok(visitor),
+            Expr::BinaryExpr { left, right, .. } => {
+                let visitor = left.accept(visitor)?;
+                right.accept(visitor)
+            }
+            Expr::Not(expr) => expr.accept(visitor),
+            Expr::IsNotNull(expr) => expr.accept(visitor),
+            Expr::IsNull(expr) => expr.accept(visitor),
+            Expr::Negative(expr) => expr.accept(visitor),
+            Expr::Between {
+                expr, low, high, ..
+            } => {
+                let visitor = expr.accept(visitor)?;
+                let visitor = low.accept(visitor)?;
+                high.accept(visitor)
+            }
+            Expr::Case {
+                expr,
+                when_then_expr,
+                else_expr,
+            } => {
+                let visitor = if let Some(expr) = expr.as_ref() {
+                    expr.accept(visitor)
+                } else {
+                    Ok(visitor)
+                }?;
+                let visitor = when_then_expr.iter().try_fold(
+                    visitor,
+                    |visitor, (when, then)| {
+                        let visitor = when.accept(visitor)?;
+                        then.accept(visitor)
+                    },
+                )?;
+                if let Some(else_expr) = else_expr.as_ref() {
+                    else_expr.accept(visitor)
+                } else {
+                    Ok(visitor)
+                }
+            }
+            Expr::Cast { expr, .. } => expr.accept(visitor),
+            Expr::Sort { expr, .. } => expr.accept(visitor),
+            Expr::ScalarFunction { args, .. } => args
+                .iter()
+                .try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+            Expr::ScalarUDF { args, .. } => args
+                .iter()
+                .try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+            Expr::AggregateFunction { args, .. } => args
+                .iter()
+                .try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+            Expr::AggregateUDF { args, .. } => args
+                .iter()
+                .try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+            Expr::InList { expr, list, .. } => {
+                let visitor = expr.accept(visitor)?;
+                list.iter()
+                    .try_fold(visitor, |visitor, arg| arg.accept(visitor))
+            }
+            Expr::Wildcard => Ok(visitor),
+        }?;
+
+        visitor.post_visit(self)
+    }
+}
+
+/// Controls how the visitor recursion should proceed.
+pub enum Recursion<V: ExpressionVisitor> {
+    /// Attempt to visit all the children, recursively, of this expression.
+    Continue(V),
+    /// Do not visit the children of this expression, though the walk
+    /// of parents of this expression will not be affected
+    Stop(V),
+}
+
+/// Encode the traversal of an expression tree. When passed to
+/// `visit_expression`, `ExpressionVisitor::visit` is invoked
```
[GitHub] [arrow] Dandandan commented on a change in pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec
Dandandan commented on a change in pull request #9279: URL: https://github.com/apache/arrow/pull/9279#discussion_r561307105

## File path: rust/arrow/src/compute/kernels/take.rs

```diff
@@ -423,7 +423,7 @@
 let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 let offsets = offsets_buffer.typed_data_mut();
-let mut values = Vec::with_capacity(bytes_offset);
+let mut values = MutableBuffer::new(0);
```

Review comment: This used to be `bytes_offset`, but that doesn't make a lot of sense. The values array could be empty as well (e.g. for empty strings), or have a totally different capacity.
[GitHub] [arrow] ElenaHenderson commented on pull request #9272: [WIP] Benchmark placebo
ElenaHenderson commented on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-764916679 please benchmark
[GitHub] [arrow] alamb commented on pull request #9174: ARROW-11220: [Rust] Implement GROUP BY support for Boolean
alamb commented on pull request #9174: URL: https://github.com/apache/arrow/pull/9174#issuecomment-764600978 Thanks again for the contribution @ovr
[GitHub] [arrow] codecov-io edited a comment on pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default
codecov-io edited a comment on pull request #9122: URL: https://github.com/apache/arrow/pull/9122#issuecomment-756042833

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=h1) Report

> Merging [#9122](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=desc) (8ccc5bd) into [master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc) (a0e1244) will **increase** coverage by `0.09%`.
> The diff coverage is `99.06%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9122/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #9122      +/-   ##
==========================================
+ Coverage   81.64%   81.74%   +0.09%
==========================================
  Files         215      215
  Lines       52489    52572      +83
==========================================
+ Hits        42857    42975     +118
+ Misses       9632     9597      -35
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz) | `88.43% <ø> (ø)` | |
| [rust/arrow/src/util/integration\_util.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9pbnRlZ3JhdGlvbl91dGlsLnJz) | `69.95% <75.00%> (+3.08%)` | :arrow_up: |
| [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.56% <100.00%> (+1.41%)` | :arrow_up: |
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `87.82% <100.00%> (+4.60%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
| [rust/arrow/src/ipc/convert.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2NvbnZlcnQucnM=) | `92.56% <0.00%> (+0.45%)` | :arrow_up: |
| [rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==) | `92.59% <0.00%> (+1.85%)` | :arrow_up: |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `86.40% <0.00%> (+1.94%)` | :arrow_up: |
| ... and [1 more](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree-more) | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`

Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=footer). Last update [a0e1244...8ccc5bd](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] alamb closed pull request #9174: ARROW-11220: [Rust] Implement GROUP BY support for Boolean
alamb closed pull request #9174: URL: https://github.com/apache/arrow/pull/9174
[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType
ovr commented on a change in pull request #9232: URL: https://github.com/apache/arrow/pull/9232#discussion_r561946613

## File path: rust/arrow/src/datatypes/mod.rs

```diff
@@ -199,6 +207,81 @@ pub struct Field {
     metadata: Option<BTreeMap<String, String>>,
 }

+// Decimal (precision, scale) = Decimal(1, 2) = 1.00
+pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
+    const MAX_DIGITS: usize;
+
+    // fn into_json_value(self) -> Option<Value>;
+
+    fn get_byte_width_for_precision_scale(precision: usize, scale: usize) -> usize;
+
+    // Rescale scale part
+    fn rescale(&mut self, scale: usize);
+
+    // Try to parse string with precision, scale
+    fn parse(
+        string: &str,
+        precision: usize,
+        scale: usize,
+    ) -> std::result::Result<Self, Self::Err>;
+
+    fn to_byte_slice(&self) -> Vec<u8>;
+
+    fn from_bytes_with_precision_scale(
+        bytes: &[u8],
+        precision: usize,
+        scale: usize,
+    ) -> Self;
+}
+
+#[derive(Debug)]
+pub enum DecimalType {
```

Review comment: @alamb @andygrove @Dandandan @nevi-me @jorgecarleitao I created representations for Decimal as `Int32DecimalType`, `Int64DecimalType`, `Int128DecimalType` or `LargeDecimal`, which uses BigInt. 1. What is the best way to abstract representations of `DataType::Decimal` in Rust? An enum? 2. How should I work with `DecimalArray`? Should I make it generic by passing the representation type to it (but then how would I detect the type — match everywhere? As far as I know, functions cannot return a type), or make it generic on top of `DecimalType` and get the byte size from it? Thanks
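For illustration, one of the representation options under discussion — a fixed `i128`-backed decimal — could look roughly like this; the struct and its methods are hypothetical, not the PR's API:

```rust
// Hypothetical i128-backed decimal illustrating the (precision, scale)
// representation discussed above; not the API from this PR.
#[derive(Debug, PartialEq)]
struct Decimal128 {
    value: i128, // unscaled integer: 1.00 with scale 2 is stored as 100
    precision: usize,
    scale: usize,
}

impl Decimal128 {
    /// Rescale the unscaled value to a larger scale (downscaling would lose digits).
    fn rescale(&mut self, new_scale: usize) {
        assert!(new_scale >= self.scale, "downscaling would lose digits");
        self.value *= 10i128.pow((new_scale - self.scale) as u32);
        self.scale = new_scale;
    }

    /// Parse "123.45"-style input into an unscaled value at `scale`.
    fn parse(s: &str, precision: usize, scale: usize) -> Result<Self, String> {
        let (int_part, frac_part) = match s.split_once('.') {
            Some((i, f)) => (i, f),
            None => (s, ""),
        };
        if frac_part.len() > scale {
            return Err(format!("too many fractional digits in {:?}", s));
        }
        // Pad the fractional digits out to `scale`, then parse as one integer.
        let digits = format!("{}{}{}", int_part, frac_part, "0".repeat(scale - frac_part.len()));
        let value: i128 = digits.parse().map_err(|e| format!("{:?}: {}", s, e))?;
        Ok(Decimal128 { value, precision, scale })
    }
}
```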
[GitHub] [arrow] alamb commented on pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
alamb commented on pull request #9278: URL: https://github.com/apache/arrow/pull/9278#issuecomment-764011355 Thanks @jorgecarleitao and @Dandandan . I am personally quite excited at using this same idea for expression rewriting.
[GitHub] [arrow] nevi-me commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions
nevi-me commented on a change in pull request #9240: URL: https://github.com/apache/arrow/pull/9240#discussion_r562158835 ## File path: rust/parquet/src/arrow/array_reader.rs ## @@ -912,11 +912,36 @@ impl ArrayReader for ListArrayReader { )); } -// Need to remove from the values array the nulls that represent null lists rather than null items -// null lists have def_level = 0 +// List definitions can be encoded as 4 values: +// - n + 0: the list slot is null +// - n + 1: the list slot is not null, but is empty (i.e. []) +// - n + 2: the list slot is not null, but its child is empty (i.e. [ null ]) +// - n + 3: the list slot is not null, and its child is not empty +// Where n is the max definition level of the list's parent. +// If a Parquet schema's only leaf is the list, then n = 0. + +// TODO: ARROW-10391 - add a test case with a non-nullable child, check if max is 3 +let list_field_type = match self.get_data_type() { +ArrowType::List(field) +| ArrowType::FixedSizeList(field, _) +| ArrowType::LargeList(field) => field, +_ => { +// Panic: this is safe as we only write lists from list datatypes +unreachable!() Review comment: They don't make any difference: `unreachable!()` will call `panic!()` with a message about unreachable code being reached, so it's just a more descriptive panic. I thought that marking a condition as `unreachable!()` lets the compiler optimise out that condition, but it seems like only its `unsafe` equivalent does.
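The claim above is easy to check: both macros panic, and `unreachable!()` merely supplies a standard "entered unreachable code" message; only the `unsafe` `std::hint::unreachable_unchecked` lets the optimizer assume the branch is impossible. A quick demonstration (the `message_of` helper is mine, not from the PR):

```rust
use std::panic;

// Run a closure that is expected to panic and return its panic message.
fn message_of(f: impl FnOnce() + panic::UnwindSafe) -> String {
    let err = panic::catch_unwind(f).unwrap_err();
    // Panic payloads are usually &'static str or String.
    err.downcast_ref::<&str>()
        .map(|s| s.to_string())
        .or_else(|| err.downcast_ref::<String>().cloned())
        .unwrap_or_default()
}
```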
[GitHub] [arrow] github-actions[bot] commented on pull request #9289: ARROW-11341: [Python] [Gandiva] Add NULL/None checks to Gandiva builder functions
github-actions[bot] commented on pull request #9289: URL: https://github.com/apache/arrow/pull/9289#issuecomment-765105119 https://issues.apache.org/jira/browse/ARROW-11341
[GitHub] [arrow] pitrou commented on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily
pitrou commented on pull request #8240: URL: https://github.com/apache/arrow/pull/8240#issuecomment-764950406
[GitHub] [arrow] pitrou commented on pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation
pitrou commented on pull request #9265: URL: https://github.com/apache/arrow/pull/9265#issuecomment-764784559 Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/213944945 . Will merge.
[GitHub] [arrow] wesm commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead
wesm commented on a change in pull request #9280: URL: https://github.com/apache/arrow/pull/9280#discussion_r562121639 ## File path: cpp/src/arrow/compute/internals_benchmark.cc ## @@ -0,0 +1,86 @@ +// Licensed to the Apache Software Foundation (ASF) under one Review comment: I can merge these together then.
[GitHub] [arrow] ElenaHenderson removed a comment on pull request #9272: [WIP] Benchmark placebo
ElenaHenderson removed a comment on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-763278345
[GitHub] [arrow] codecov-io edited a comment on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty
codecov-io edited a comment on pull request #9276: URL: https://github.com/apache/arrow/pull/9276#issuecomment-763672876

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=h1) Report

> Merging [#9276](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=desc) (07d8ce8) into [master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc) (a0e1244) will **increase** coverage by `0.01%`.
> The diff coverage is `80.41%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9276/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #9276      +/-   ##
==========================================
+ Coverage   81.64%   81.66%   +0.01%
==========================================
  Files         215      215
  Lines       52489    52512      +23
==========================================
+ Hits        42857    42884      +27
+ Misses       9632     9628       -4
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9276/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==) | `82.03% <80.41%> (+7.74%)` | :arrow_up: |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`

Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=footer). Last update [a0e1244...07d8ce8](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
jorgecarleitao commented on a change in pull request #9278: URL: https://github.com/apache/arrow/pull/9278#discussion_r561229060

## File path: rust/datafusion/src/logical_plan/expr.rs

@@ -422,6 +422,136 @@ impl Expr { nulls_first, } } + +/// Performs a depth first depth first walk of an expression and

Review comment:
```suggestion
/// Performs a depth first walk of an expression and
```

## File path: rust/datafusion/src/logical_plan/expr.rs

@@ -422,6 +422,136 @@ impl Expr { nulls_first, } } + +/// its children, calling `visitor.pre_visit` and

Review comment:
```suggestion
/// its children, calling [`ExpressionVisitor::pre_visit`] and
```
[GitHub] [arrow] pitrou closed pull request #8920: ARROW-10831: [C++][Compute] Implement quantile kernel
pitrou closed pull request #8920: URL: https://github.com/apache/arrow/pull/8920
[GitHub] [arrow] xhochy commented on a change in pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
xhochy commented on a change in pull request #8648: URL: https://github.com/apache/arrow/pull/8648#discussion_r562161018

## File path: cpp/src/arrow/adapters/orc/adapter_util.cc

@@ -316,10 +326,482 @@ Status AppendBatch(const liborc::Type* type, liborc::ColumnVectorBatch* batch, } }

```cpp
template
Status FillNumericBatch(const DataType* type, liborc::ColumnVectorBatch* cbatch,
                        int64_t& arrowOffset, int64_t& orcOffset, int64_t length,
                        Array* parray, std::vector<bool>* incomingMask) {
  auto array = checked_cast(parray);
  auto batch = checked_cast(cbatch);
  int64_t arrowLength = array->length();
  if (!arrowLength) return Status::OK();
  if (array->null_count() || incomingMask) batch->hasNulls = true;
  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, arrowOffset++) {
    if (array->IsNull(arrowOffset) || (incomingMask && !(*incomingMask)[orcOffset])) {
      batch->notNull[orcOffset] = false;
    } else {
      batch->data[orcOffset] = array->Value(arrowOffset);
      batch->notNull[orcOffset] = true;
    }
  }
  batch->numElements = orcOffset;
  return Status::OK();
}

template
Status FillNumericBatchCast(const DataType* type, liborc::ColumnVectorBatch* cbatch,
                            int64_t& arrowOffset, int64_t& orcOffset, int64_t length,
                            Array* parray, std::vector<bool>* incomingMask) {
  auto array = checked_cast(parray);
  auto batch = checked_cast(cbatch);
  int64_t arrowLength = array->length();
  if (!arrowLength) return Status::OK();
  if (array->null_count() || incomingMask) batch->hasNulls = true;
  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, arrowOffset++) {
    if (array->IsNull(arrowOffset) || (incomingMask && !(*incomingMask)[orcOffset])) {
      batch->notNull[orcOffset] = false;
    } else {
      batch->data[orcOffset] = static_cast(array->Value(arrowOffset));
      batch->notNull[orcOffset] = true;
    }
  }
  batch->numElements = orcOffset;
  return Status::OK();
}

Status FillDate64Batch(const DataType* type, liborc::ColumnVectorBatch* cbatch,
                       int64_t& arrowOffset, int64_t& orcOffset, int64_t length,
                       Array* parray, std::vector<bool>* incomingMask) {
  auto array = checked_cast(parray);
  auto batch = checked_cast(cbatch);
  int64_t arrowLength = array->length();
  if (!arrowLength) return Status::OK();
  if (array->null_count() || incomingMask) batch->hasNulls = true;
  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, arrowOffset++) {
    if (array->IsNull(arrowOffset) || (incomingMask && !(*incomingMask)[orcOffset])) {
      batch->notNull[orcOffset] = false;
    } else {
      int64_t miliseconds = array->Value(arrowOffset);
      batch->data[orcOffset] =
          static_cast(std::floor(miliseconds / kOneSecondMillis));
      batch->nanoseconds[orcOffset] =
          (miliseconds - kOneSecondMillis * batch->data[orcOffset]) * kOneMilliNanos;
      batch->notNull[orcOffset] = true;
    }
  }
  batch->numElements = orcOffset;
  return Status::OK();
}

Status FillTimestampBatch(const DataType* type, liborc::ColumnVectorBatch* cbatch,
                          int64_t& arrowOffset, int64_t& orcOffset, int64_t length,
                          Array* parray, std::vector<bool>* incomingMask) {
  auto array = checked_cast(parray);
  auto batch = checked_cast(cbatch);
  int64_t arrowLength = array->length();
  if (!arrowLength) return Status::OK();
  if (array->null_count() || incomingMask) batch->hasNulls = true;
  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, arrowOffset++) {
    if (array->IsNull(arrowOffset) || (incomingMask && !(*incomingMask)[orcOffset])) {
      batch->notNull[orcOffset] = false;
    } else {
      int64_t data = array->Value(arrowOffset);
      batch->notNull[orcOffset] = true;
      switch (std::static_pointer_cast(array->type())->unit()) {
        case TimeUnit::type::SECOND: {
          batch->data[orcOffset] = data;
          batch->nanoseconds[orcOffset] = 0;
          break;
        }
        case TimeUnit::type::MILLI: {
          batch->data[orcOffset] =
              static_cast(std::floor(data / kOneSecondMillis));
          batch->nanoseconds[orcOffset] =
              (data - kOneSecondMillis * batch->data[orcOffset]) * kOneMilliNanos;
          break;
        }
        case TimeUnit::type::MICRO: {
          batch->data[orcOffset] =
              static_cast(std::floor(data / kOneSecondMicros));
          batch->nanoseconds[orcOffset] =
              (data - kOneSecondMicros * batch->data[orcOffset]) * kOneMicroNanos;
          break;
        }
        default: {
          batch->data[orcOffset] =
              static_cast(std::floor(data / kOneSecondNanos));
          batch->nanoseconds[orcOffset] = data -
```
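The MILLI/MICRO/NANO branches in the quoted code all perform the same split: floor-divide down to whole seconds, then convert the remainder to nanoseconds, so that pre-epoch (negative) values still get a nanosecond part in `[0, 1e9)`. A small stand-alone Rust sketch of that arithmetic (the function name is mine, not from the PR):

```rust
// Split an epoch offset expressed in `ticks_per_second` units into
// (seconds, nanoseconds), mirroring the floor division in the quoted C++.
// Assumes ticks_per_second divides 1e9 evenly (1, 1_000, 1_000_000, ...).
fn split_timestamp(value: i64, ticks_per_second: i64) -> (i64, i64) {
    let nanos_per_tick = 1_000_000_000 / ticks_per_second;
    // div_euclid floors toward negative infinity, like std::floor above,
    // so the remainder (and hence the nanosecond part) is never negative.
    let seconds = value.div_euclid(ticks_per_second);
    let nanos = (value - seconds * ticks_per_second) * nanos_per_tick;
    (seconds, nanos)
}
```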
[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer over
WeichenXu123 commented on pull request #9187: URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360 @liyafan82

> The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector.

Why is it not possible? For a variable-width vector, the buffer format is quite simple:
1) validity buffer: nbytes = ceil(N/8)
2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
3) value buffer: nbytes = sum of all non-null value byte counts

For a `variableWidthVector`, the `setValueCount` method only does two additional things:
1) fills holes (fills the offset-buffer gap left by trailing consecutive NULL values)
2) resets the offset reader/writer index (this looks like it has some side effect)

But even without calling `setValueCount`, that doesn't stop us from computing the vector buffer size. Does that make sense?
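The three formulas in the list above can be computed directly from the values; a small sketch assuming 4-byte offsets (the helper name is illustrative, not Arrow's Java API):

```rust
// Buffer sizes for a variable-width (e.g. Utf8/varchar) vector of N slots,
// following the three formulas enumerated above; assumes 4-byte offsets.
fn variable_width_buffer_sizes(values: &[Option<&str>]) -> (usize, usize, usize) {
    let n = values.len();
    let validity = (n + 7) / 8; // ceil(N / 8): one validity bit per slot
    let offsets = 4 * (n + 1);  // OFFSET_WIDTH * (N + 1)
    let data: usize = values
        .iter()
        .map(|v| v.map_or(0, |s| s.len())) // nulls contribute no value bytes
        .sum();
    (validity, offsets, data)
}
```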
[GitHub] [arrow] alamb closed pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
alamb closed pull request #9278: URL: https://github.com/apache/arrow/pull/9278
[GitHub] [arrow] cyb70289 commented on pull request #9274: ARROW-11299: [C++][Python] Fix invalid-offsetof warnings
cyb70289 commented on pull request #9274: URL: https://github.com/apache/arrow/pull/9274#issuecomment-764217852 > @cyb70289 `arrow::dataset::Expression::Call::options` requires that virtual destructor since a `shared_ptr` is used to store subclasses of FunctionOptions. Removing this virtual destructor results in a resource leak since the derived classes' (implicit) destructors are not being called, leaking data members such as `arrow::compute::SetLookupOptions::value_set` which are not destroyed. Thanks @bkietz , I missed this point. Will update according to your comment.
[GitHub] [arrow] codecov-io edited a comment on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)
codecov-io edited a comment on pull request #9235: URL: https://github.com/apache/arrow/pull/9235#issuecomment-761864117
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)
jorgecarleitao edited a comment on pull request #9235: URL: https://github.com/apache/arrow/pull/9235#issuecomment-763898758
[GitHub] [arrow] codecov-io edited a comment on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
codecov-io edited a comment on pull request #9261: URL: https://github.com/apache/arrow/pull/9261#issuecomment-764691368

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=h1) Report

> Merging [#9261](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=desc) (8d10789) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9261/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree)

```diff
@@           Coverage Diff            @@
##           master    #9261    +/-   ##
========================================
  Coverage   81.61%   81.61%
========================================
  Files         215      215
  Lines       51867    52508   +641
========================================
+ Hits        42329    42853   +524
- Misses       9538     9655   +117
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
| [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `39.43% <0.00%> (-14.50%)` | :arrow_down: |
| [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
| [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `49.01% <0.00%> (-9.17%)` | :arrow_down: |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
| [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
| [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `53.12% <0.00%> (-5.21%)` | :arrow_down: |
| [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `93.02% <0.00%> (-5.13%)` | :arrow_down: |
| [rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz) | `95.00% <0.00%> (-5.00%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==) | `74.28% <0.00%> (-4.51%)` | :arrow_down: |
| ... and [91 more](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree-more) | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`

Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=footer). Last update [903b41c...8d10789](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] github-actions[bot] commented on pull request #9288: ARROW-11342: [Python] [Gandiva] Expose ToString and result type information
github-actions[bot] commented on pull request #9288: URL: https://github.com/apache/arrow/pull/9288#issuecomment-765095930 https://issues.apache.org/jira/browse/ARROW-11342
[GitHub] [arrow] rtyler commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions
rtyler commented on a change in pull request #9240: URL: https://github.com/apache/arrow/pull/9240#discussion_r561361029 ## File path: rust/parquet/src/arrow/array_reader.rs ## @@ -912,11 +912,36 @@ impl ArrayReader for ListArrayReader { )); } -// Need to remove from the values array the nulls that represent null lists rather than null items -// null lists have def_level = 0 +// List definitions can be encoded as 4 values: +// - n + 0: the list slot is null +// - n + 1: the list slot is not null, but is empty (i.e. []) +// - n + 2: the list slot is not null, but its child is empty (i.e. [ null ]) +// - n + 3: the list slot is not null, and its child is not empty +// Where n is the max definition level of the list's parent. +// If a Parquet schema's only leaf is the list, then n = 0. + +// TODO: ARROW-10391 - add a test case with a non-nullable child, check if max is 3 +let list_field_type = match self.get_data_type() { +ArrowType::List(field) +| ArrowType::FixedSizeList(field, _) +| ArrowType::LargeList(field) => field, +_ => { +// Panic: this is safe as we only write lists from list datatypes +unreachable!() Review comment: For my own curiosity, is there a specific reason to use `unreachable!` rather than `panic!` in cases like this? I understand the outcome will be the same.
[GitHub] [arrow] TheNeuralBit commented on a change in pull request #8385: ARROW-10220 [Javascript] Javascript toArray() method ignores nulls on some types.
TheNeuralBit commented on a change in pull request #8385: URL: https://github.com/apache/arrow/pull/8385#discussion_r562374331

## File path: js/src/visitor/toarray.ts

@@ -88,18 +88,6 @@ export class ToArrayVisitor extends Visitor {} /** @ignore */ function arrayOfVector(vector: VectorType): T['TArray'] {
-
-const { type, length, stride } = vector;
-
-// Fast case, return subarray if possible
-switch (type.typeId) {

Review comment: Instead of removing the fast case can you just add a check to skip this if the type is nullable? As you pointed out in the jira we could still do this for non-nullable arrays.
[GitHub] [arrow] github-actions[bot] commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset
github-actions[bot] commented on pull request #9284: URL: https://github.com/apache/arrow/pull/9284#issuecomment-764778634 https://issues.apache.org/jira/browse/ARROW-10370
[GitHub] [arrow] alamb closed pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty
alamb closed pull request #9276: URL: https://github.com/apache/arrow/pull/9276
[GitHub] [arrow] pitrou commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian
pitrou commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r562063757

## File path: cpp/src/arrow/array/util.h

@@ -56,6 +56,14 @@ Result> MakeArrayFromScalar( namespace internal {
+/// \brief Swap endian of each element in a generic ArrayData
+/// \param[in] data the array contents
+/// \param[in] type the array type
+/// \return the resulting Array instance whose elements were swapped
+ARROW_EXPORT
+Result> SwapEndianArrayData(
+std::shared_ptr& data, const std::shared_ptr& type);

Review comment: The first parameter should be `const` as well, it seems?

Review comment (on the `type` parameter): The type is already on the data, is it necessary to pass it separately?

## File path: dev/archery/tests/generate_files_for_endian_test.sh

@@ -0,0 +1,28 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+: ${ARROW_DIR:=/arrow}

Review comment: Can you add a comment explaining how to use this script?

## File path: cpp/src/arrow/type.h

@@ -1641,6 +1655,17 @@ class ARROW_EXPORT Schema : public detail::Fingerprintable,
 bool Equals(const Schema& other, bool check_metadata = false) const;
 bool Equals(const std::shared_ptr& other, bool check_metadata = false) const;
+ /// \brief Replace endianness with platform-native endianness in the schema
+ ///
+ /// \return new Schema
+ std::shared_ptr WithNativeEndianness() const;
+
+ /// \brief Return endianness in the schema
+ Endianness endianness() const;
+
+ /// \brief Indicate if endianness is equal to platform-native endianness
+ bool IsNativeEndianness() const;

Review comment: I would call it `is_native_endian()`.

## File path: cpp/src/arrow/array/util.cc

@@ -74,6 +75,220 @@
+class ArrayDataEndianSwapper {
+ public:
+ ArrayDataEndianSwapper(std::shared_ptr& data, int64_t length,
+ std::shared_ptr* out)
+ : data_(data), length_(length), out_(out) {}
+
+ Status SwapType(const DataType& type) {
+RETURN_NOT_OK(VisitTypeInline(type, this));
+RETURN_NOT_OK(SwapChildren(type.fields()));
+return Status::OK();
+ }
+
+ Status SwapChildren(std::vector> child_fields) {
+int i = 0;
+for (const auto& child_field : child_fields) {
+ ARROW_ASSIGN_OR_RAISE(
+ (*out_)->child_data[i],
+ SwapEndianArrayData(data_->child_data[i], child_field.get()->type()));
+ i++;
+}
+return Status::OK();
+ }
+
+ template
+ Result> ByteSwapBuffer(std::shared_ptr& in_buffer,

Review comment (on `SwapType`): And where is the null bitmap handled? Did I miss something?

Review comment (on `ByteSwapBuffer`): Either `const&` or no reference at all.

## File path: ci/scripts/integration_arrow.sh

@@ -24,9 +24,16 @@
 source_dir=${1}/cpp
 build_dir=${2}/cpp
 gold_dir_0_14_1=$arrow_dir/testing/data/arrow-ipc-stream/integration/0.14.1
 gold_dir_0_17_1=$arrow_dir/testing/data/arrow-ipc-stream/integration/0.17.1
+gold_dir_1_0_0_le=$arrow_dir/testing/data/arrow-ipc-stream/integration/1.0.0-littleendian
+gold_dir_1_0_0_be=$arrow_dir/testing/data/arrow-ipc-stream/integration/1.0.0-bigendian

Review comment: Sidenote: I see that "generated_large_batch" was added to these directories. Is it necessary?
[GitHub] [arrow] jorisvandenbossche commented on pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests
jorisvandenbossche commented on pull request #9282: URL: https://github.com/apache/arrow/pull/9282#issuecomment-764589256 @github-actions crossbow submit test-conda-python-3.7-pandas-master
[GitHub] [arrow] kou commented on pull request #9155: ARROW-6103: [Java] Remove mvn release plugin [WIP]
kou commented on pull request #9155: URL: https://github.com/apache/arrow/pull/9155#issuecomment-763933322
[GitHub] [arrow] tyrelr commented on pull request #9215: ARROW-11270: [Rust] Array slice accessors
tyrelr commented on pull request #9215: URL: https://github.com/apache/arrow/pull/9215#issuecomment-764164139 Rebased. But I didn't hit the merge conflicts during... I can see that a number of tests were swapped to use Buffer::from_slice_ref()... so I'll add another commit to switch to that function just to be safe.
[GitHub] [arrow] github-actions[bot] commented on pull request #9286: Fixed test case mistake that was generating compiler warning
github-actions[bot] commented on pull request #9286: URL: https://github.com/apache/arrow/pull/9286#issuecomment-764995036
[GitHub] [arrow] github-actions[bot] commented on pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead
github-actions[bot] commented on pull request #9280: URL: https://github.com/apache/arrow/pull/9280#issuecomment-764142561 https://issues.apache.org/jira/browse/ARROW-8928
[GitHub] [arrow] pitrou commented on a change in pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation
pitrou commented on a change in pull request #9265: URL: https://github.com/apache/arrow/pull/9265#discussion_r561909698

## File path: cpp/src/arrow/util/io_util.cc

@@ -1482,31 +1482,51 @@ std::string MakeRandomName(int num_chars) { } // namespace
 Result> TemporaryDir::Make(const std::string& prefix) {
- std::string suffix = MakeRandomName(8);
+ const int kNumChars = 8;
+ NativePathString base_name;
- ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+ auto MakeBaseName = [&]() {
+std::string suffix = MakeRandomName(kNumChars);
+return StringToNative(prefix + suffix);
+ };
+
+ auto TryCreatingDirectory =
+ [&](const NativePathString& base_dir) -> Result> {
+Status st;
+for (int attempt = 0; attempt < 3; ++attempt) {
+ PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+ auto result = CreateDir(fn);
+ if (!result.ok()) {
+// Probably a permissions error or a non-existing base_dir
+return nullptr;
+ }
+ if (*result) {
+return std::unique_ptr(new TemporaryDir(std::move(fn)));
+ }
+ // The random name already exists in base_dir, try with another name
+ st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+ ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+}
+return st;
+ };
+
+ ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
 auto base_dirs = GetPlatformTemporaryDirs();
 DCHECK_NE(base_dirs.size(), 0);
- auto st = Status::OK();
- for (const auto& p : base_dirs) {
-PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-auto result = CreateDir(fn);
-if (!result.ok()) {
- st = result.status();
- continue;
-}
-if (!*result) {
- // XXX Should we retry with another random name?
- return Status::IOError("Path already exists: '", fn.ToString(), "'");
-} else {
- return std::unique_ptr(new TemporaryDir(std::move(fn)));
+ for (const auto& base_dir : base_dirs) {
+ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));
+if (ptr) {
+ return std::move(ptr);

Review comment (on `TryCreatingDirectory`): `TryCreatingDirectory` mutates the enclosing scope and is therefore coupled to this function. But I can move out `MakeBaseName`.

Review comment (on `return std::move(ptr)`): Because some old gcc fails compiling without it: https://github.com/apache/arrow/runs/1728457990#step:9:552 https://github.com/apache/arrow/runs/1728457638#step:7:992
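The retry strategy reviewed above (the real code is C++ in `io_util.cc`) can be sketched in a std-only Rust analogue: try to create `<base_dir>/<prefix><random>` up to three times, pick a new name whenever the candidate already exists, and give up immediately on any other error such as a permissions failure. The suffix generator below is a stand-in for `MakeRandomName(8)`, not a proper RNG.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::PathBuf;
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of TemporaryDir::Make's retry loop, under the assumptions above.
fn make_temp_dir(base_dir: &str, prefix: &str) -> std::io::Result<PathBuf> {
    let mut last_err = None;
    for attempt in 0..3u32 {
        // Stand-in for MakeRandomName(8): clock nanoseconds plus the attempt
        // counter. A real implementation would use a proper RNG.
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .subsec_nanos();
        let candidate =
            PathBuf::from(base_dir).join(format!("{}{:08x}", prefix, nanos.wrapping_add(attempt)));
        match fs::create_dir(&candidate) {
            Ok(()) => return Ok(candidate),
            // The random name is already taken: retry with another name.
            Err(e) if e.kind() == ErrorKind::AlreadyExists => last_err = Some(e),
            // Probably a permissions error or a non-existing base_dir: give up.
            Err(e) => return Err(e),
        }
    }
    Err(last_err.unwrap())
}

fn main() {
    let base = std::env::temp_dir();
    let dir = make_temp_dir(base.to_str().unwrap(), "arrow_retry_sketch_").unwrap();
    assert!(dir.is_dir());
    fs::remove_dir(&dir).unwrap();
}
```

As in the C++ version, only the "name already exists" error is retried; bailing out on every other error keeps a misconfigured base directory from spinning through all three attempts.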
[GitHub] [arrow] alamb commented on pull request #9275: ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results
alamb commented on pull request #9275: URL: https://github.com/apache/arrow/pull/9275#issuecomment-764606664
[GitHub] [arrow] pitrou closed pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation
pitrou closed pull request #9265: URL: https://github.com/apache/arrow/pull/9265
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset
jorisvandenbossche commented on a change in pull request #9284: URL: https://github.com/apache/arrow/pull/9284#discussion_r562029994

## File path: python/pyarrow/tests/test_dataset.py

@@ -1337,7 +1337,7 @@ def test_construct_from_single_file(tempdir):
 # instantiate from a single file with a filesystem object
 d2 = ds.dataset(path, filesystem=fs.LocalFileSystem())
 # instantiate from a single file with prefixed filesystem URI
-d3 = ds.dataset(relative_path, filesystem=_filesystem_uri(directory))
+d3 = ds.dataset(str(relative_path), filesystem=_filesystem_uri(directory))

Review comment: This change is needed because `relative_path` is a pathlib.Path object, but after this change, we only support that for actual LocalFilesystem, while here we have a SubtreeFilesystem. I suppose that in theory I could check for SubtreeFilesystem that wraps LocalFilesystem as well (but not sure we want to encourage using pathlib.Path for relative paths, and when having an absolute path, you never need SubtreeFileSystem).
[GitHub] [arrow] liyafan82 commented on pull request #9151: ARROW-11173: [Java] Add map type in complex reader / writer
liyafan82 commented on pull request #9151: URL: https://github.com/apache/arrow/pull/9151#issuecomment-764632071 @nbruno Thanks for the PR. For the Java implementation, the Map type is implemented as a List of Struct, so shouldn't the Map type already be supported? What is the motivation for the new implementation?
[GitHub] [arrow] github-actions[bot] commented on pull request #9287: ARROW-11340: [C++] Add vcpkg.json manifest to cpp project root
github-actions[bot] commented on pull request #9287: URL: https://github.com/apache/arrow/pull/9287#issuecomment-765071688
[GitHub] [arrow] Dandandan commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec
Dandandan commented on pull request #9279: URL: https://github.com/apache/arrow/pull/9279#issuecomment-763951458 I think a nice follow-up for `take_string` would be to use the offsets data to calculate the total length of the buffer to allocate (which should be pretty fast), and then allocate based on that.
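The idea above — computing the exact output size from the offsets before allocating — can be sketched with a std-only helper. Note this is a hypothetical illustration, not the arrow-rs implementation: given the `i32` offsets of a string array and the row indices selected by a `take`, the byte length of the output value buffer is just a sum of offset differences.

```rust
// Hypothetical helper illustrating the proposed follow-up: size the output
// value buffer exactly, instead of growing it while copying each string.
fn taken_values_len(offsets: &[i32], indices: &[usize]) -> usize {
    indices
        .iter()
        // String i occupies bytes offsets[i]..offsets[i + 1] of the value buffer.
        .map(|&i| (offsets[i + 1] - offsets[i]) as usize)
        .sum()
}

fn main() {
    // Offsets for the strings ["ab", "", "cdef"] are [0, 2, 2, 6].
    let offsets = [0i32, 2, 2, 6];
    // Taking indices [2, 0] selects "cdef" and "ab": 4 + 2 = 6 bytes.
    assert_eq!(taken_values_len(&offsets, &[2, 0]), 6);
}
```

The pass over the offsets is O(number of taken indices) with no data movement, so it should indeed be cheap relative to copying the string bytes themselves.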
[GitHub] [arrow] westonpace commented on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily
westonpace commented on pull request #8240: URL: https://github.com/apache/arrow/pull/8240#issuecomment-765025427 Sorry, I'm having trouble understanding what you are referencing. There was an RTools 3.5 failure because of a gcc bug requiring you to `std::move` a `unique_ptr`, but I don't think that would apply here. The other compiler warning treated as an error seemed to be unique to gcc 9.3 and tsan, and I didn't know there was any failing job related to it.
[GitHub] [arrow] alamb commented on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty
alamb commented on pull request #9276: URL: https://github.com/apache/arrow/pull/9276#issuecomment-764615979 I agree that #9281 looks much better. Thanks @jorgecarleitao. I might port the tests from this PR subsequently, but for now I don't think it is doing anything valuable.
[GitHub] [arrow] kou closed pull request #9286: ARROW-11337: [C++] Compilation error with ThreadSanitizer
kou closed pull request #9286: URL: https://github.com/apache/arrow/pull/9286
[GitHub] [arrow] nbruno commented on pull request #9151: ARROW-11173: [Java] Add map type in complex reader / writer
nbruno commented on pull request #9151: URL: https://github.com/apache/arrow/pull/9151#issuecomment-764664522 @liyafan82 It might be implicitly supported, but you'd have to be aware of that implementation detail in order to work with it. It's also less fluent to work with a list of structs to imply a map in the readers / writers, rather than a dedicated map API. It was also noted as a gap in the API on the [mailing list](https://lists.apache.org/thread.html/r6e2de4be2cdea11e3f277aff2c23b2bda0782d9460d4dddf5c11f240%40%3Cuser.arrow.apache.org%3E).
[GitHub] [arrow] pitrou commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead
pitrou commented on a change in pull request #9280: URL: https://github.com/apache/arrow/pull/9280#discussion_r562057646

## File path: cpp/src/arrow/compute/internals_benchmark.cc

@@ -0,0 +1,86 @@
+// (Apache License 2.0 header, as above)
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+    args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+    std::unique_ptr it = *ExecBatchIterator::Make(args, state.range(0));
+    ExecBatch batch;
+    while (it->Next()) {
+      for (int i = 0; i < num_fields; ++i) {
+        auto data = batch.values[i].array()->buffers[1]->data();
+        benchmark::DoNotOptimize(data);
+      }
+      continue;
+    }
+    benchmark::DoNotOptimize(batch);
+  }
+
+  state.SetItemsProcessed(state.iterations());
+}
+
+void BM_DatumSlice(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1000;
+
+  int num_datums = 1000;
+  std::vector datums(num_datums);
+  for (int i = 0; i < num_datums; ++i) {
+    datums[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+    for (const Datum& datum : datums) {
+      auto slice = datum.array()->Slice(16, 64);

Review comment (on the new file): We already have `function_benchmark.cc` for such internals benchmarks.

Review comment (on `datum.array()->Slice(16, 64)`): Is it significantly different from slicing an array? Otherwise, is there a point for the specific "slice a datum" operation?
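The measurement pattern used in the benchmark above — a hot loop over a cheap slicing operation, with a `DoNotOptimize`-style sink to keep the compiler honest — can be sketched in std-only Rust. This is not the Arrow benchmark and it measures Rust's borrow-based slicing (which, unlike `ArrayData::Slice`, allocates nothing), so only the loop shape is analogous.

```rust
use std::time::Instant;

// Take `iters` 64-element subslices at offset 16 (mirroring Slice(16, 64))
// and consume the first element of each so the loop cannot be optimized away.
fn sum_first_of_slices(data: &[i64], iters: u32) -> i64 {
    let mut sink = 0i64;
    for _ in 0..iters {
        let slice = &data[16..16 + 64];
        sink = sink.wrapping_add(slice[0]);
    }
    sink
}

fn main() {
    let data: Vec<i64> = (0..1_000).collect();
    let iters = 100_000u32;

    let start = Instant::now();
    let sink = sum_first_of_slices(&data, iters);
    let elapsed = start.elapsed();

    // data[16] == 16, so the sink is fully determined by the input.
    assert_eq!(sink, 16 * i64::from(iters));
    println!(
        "{} slices in {:?} ({:.1} ns/slice)",
        iters,
        elapsed,
        elapsed.as_nanos() as f64 / f64::from(iters)
    );
}
```

A dedicated harness (like Google Benchmark in the C++ code, or criterion in Rust) would additionally handle warm-up and statistical noise; the sketch only shows the loop-plus-sink structure.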
[GitHub] [arrow] github-actions[bot] commented on pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests
github-actions[bot] commented on pull request #9282: URL: https://github.com/apache/arrow/pull/9282#issuecomment-764589077
[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow
liyafan82 commented on pull request #9187: URL: https://github.com/apache/arrow/pull/9187#issuecomment-764197687
[GitHub] [arrow] ursabot commented on pull request #9272: [WIP] Benchmark placebo
ursabot commented on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-764936041 9272 - 30c8491386125d4315b179ec963552333d0494d4 - 50ba53446ecc43a74d0edb35e8c0b23263d0440a
[GitHub] [arrow] nevi-me commented on pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default
nevi-me commented on pull request #9122: URL: https://github.com/apache/arrow/pull/9122#issuecomment-764332590

> @nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too

I've rebased, and changed a few comments: V5 will be the default in 4.0.0, and pointed to the JIRA about duplicate column names.
[GitHub] [arrow] codecov-io commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
codecov-io commented on pull request #9261: URL: https://github.com/apache/arrow/pull/9261#issuecomment-764691368

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=h1) Report

> Merging [#9261](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=desc) (ce18ab7) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9261/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree)

```diff
@@           Coverage Diff           @@
##           master    #9261   +/-   ##
=======================================
  Coverage   81.61%   81.61%
=======================================
  Files         215      215
  Lines       51867    52508   +641
=======================================
+ Hits        42329    42854   +525
- Misses       9538     9654   +116
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree) | Coverage Δ | |
|---|---|---|
| [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
| [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `39.43% <0.00%> (-14.50%)` | :arrow_down: |
| [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
| [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `49.01% <0.00%> (-9.17%)` | :arrow_down: |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
| [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
| [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `53.12% <0.00%> (-5.21%)` | :arrow_down: |
| [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `93.02% <0.00%> (-5.13%)` | :arrow_down: |
| [rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz) | `95.00% <0.00%> (-5.00%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==) | `74.28% <0.00%> (-4.51%)` | :arrow_down: |
| ... and [91 more](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree-more) | | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=footer). Last update [903b41c...ce18ab7](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] codecov-io commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
codecov-io commented on pull request #9243: URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929 # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=h1) Report > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=desc) (64abbc5) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.18%`. > The diff coverage is `93.35%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=tree) ```diff @@Coverage Diff @@ ## master#9243 +/- ## == + Coverage 81.61% 81.79% +0.18% == Files 215 215 Lines 5186753138+1271 == + Hits4232943466+1137 - Misses 9538 9672 +134 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=tree) | Coverage Δ | | |---|---|---| | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `78.94% <ø> (+1.81%)` | :arrow_up: | | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `85.56% <85.39%> (-1.39%)` | :arrow_down: | | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.44% <92.42%> (-4.10%)` | :arrow_down: | | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `84.27% <94.47%> (+11.97%)` | :arrow_up: | | 
[rust/datafusion/src/physical\_plan/aggregates.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMucnM=) | `91.13% <100.00%> (ø)` | | | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.84% <100.00%> (+<0.01%)` | :arrow_up: | | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.67% <100.00%> (+0.17%)` | :arrow_up: | | [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `83.21% <0.00%> (-9.89%)` | :arrow_down: | | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (-6.52%)` | :arrow_down: | | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (-5.22%)` | :arrow_down: | | ... and [94 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree-more) | | -- [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=continue). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=footer). Last update [1401359...64abbc5](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buff
WeichenXu123 edited a comment on pull request #9187: URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715
[GitHub] [arrow] andygrove commented on pull request #9155: ARROW-6103: [Java] Remove mvn release plugin [WIP]
andygrove commented on pull request #9155: URL: https://github.com/apache/arrow/pull/9155#issuecomment-763939243

> > > Sorry for my late response.
> > We have a post-release manual(!) task that publishes Java packages to the Maven Central Repository: https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaMavenartifactsinMavencentral
> Can we do it by `mvn deploy` with a source archive?

Yes, that is my assumption.
[GitHub] [arrow] vertexclique commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
vertexclique commented on pull request #9261: URL: https://github.com/apache/arrow/pull/9261#issuecomment-764560085
[GitHub] [arrow] jorgecarleitao commented on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)
jorgecarleitao commented on pull request #9235: URL: https://github.com/apache/arrow/pull/9235#issuecomment-763898758
[GitHub] [arrow] jorgecarleitao commented on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty
jorgecarleitao commented on pull request #9276: URL: https://github.com/apache/arrow/pull/9276#issuecomment-764612354

FYI: with #9281, this is (IMO) solved more broadly.
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
jorgecarleitao commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540

## File path: rust/datafusion/src/physical_plan/functions.rs

## @@ -60,10 +59,15 @@ pub enum Signature {

    // A function such as `array` is `VariadicEqual`
    // The first argument decides the type used for coercion
    VariadicEqual,
+   /// fixed number of arguments of vectors of valid types
+   // A function of one argument of f64 is `Uniform(vec![vec![vec![DataType::Float64]]])`
+   // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+   // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+   Uniform(Vec<Vec<Vec<DataType>>>),

Review comment: This signature generalizes `UniformEqual`, so wouldn't it be possible to generalize the existing variant instead of creating a new one (i.e. replace the existing one with the more general form)? `Signature` should be such that its variants form a complete set of options without overlaps.

## File path: rust/datafusion/src/physical_plan/string_expressions.rs

## @@ -34,42 +35,553 @@ macro_rules! downcast_vec {

    }};
}

-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
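The shape of the proposed `Uniform(Vec<Vec<Vec<DataType>>>)` variant can be illustrated with a self-contained sketch. The enum and matching logic below are illustrative stand-ins, not DataFusion's actual types: the outer `Vec` holds alternative signatures, the middle `Vec` has one entry per argument, and the inner `Vec` lists the types accepted for that argument.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Float32,
    Float64,
    Utf8,
}

/// Illustrative stand-in for the `Uniform` signature variant under discussion.
struct Uniform(Vec<Vec<Vec<DataType>>>);

impl Uniform {
    /// Check whether a concrete argument list matches any accepted signature.
    fn matches(&self, args: &[DataType]) -> bool {
        self.0.iter().any(|sig| {
            sig.len() == args.len()
                && sig
                    .iter()
                    .zip(args)
                    .all(|(valid, actual)| valid.contains(actual))
        })
    }
}

fn main() {
    // A function of two arguments: first f32 or f64, second utf8.
    let sig = Uniform(vec![vec![
        vec![DataType::Float32, DataType::Float64],
        vec![DataType::Utf8],
    ]]);
    assert!(sig.matches(&[DataType::Float64, DataType::Utf8]));
    assert!(!sig.matches(&[DataType::Utf8, DataType::Utf8]));
    println!("signature matching works");
}
```

This also shows why the reviewer suggests folding the existing variant into the new one: the one-signature, one-type-per-argument case is just the degenerate form of the nested structure.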
[GitHub] [arrow] alamb closed pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
alamb closed pull request #9261: URL: https://github.com/apache/arrow/pull/9261
[GitHub] [arrow] codecov-io edited a comment on pull request #9215: ARROW-11270: [Rust] Array slice accessors
codecov-io edited a comment on pull request #9215: URL: https://github.com/apache/arrow/pull/9215#issuecomment-761621256 # [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=h1) Report > Merging [#9215](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=desc) (5cd8075) into [master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc) (a0e1244) will **increase** coverage by `0.03%`. > The diff coverage is `100.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9215/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree) ```diff @@Coverage Diff @@ ## master#9215 +/- ## == + Coverage 81.64% 81.68% +0.03% == Files 215 215 Lines 5248952572 +83 == + Hits4285742945 +88 + Misses 9632 9627 -5 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree) | Coverage Δ | | |---|---|---| | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `91.75% <100.00%> (+0.76%)` | :arrow_up: | | [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `84.84% <100.00%> (+1.62%)` | :arrow_up: | | [rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz) | `94.76% <100.00%> (+0.23%)` | :arrow_up: | | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `91.66% <100.00%> (+1.66%)` | :arrow_up: | | [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `85.51% <100.00%> (ø)` | | | 
[rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `96.96% <100.00%> (-0.03%)` | :arrow_down: | | [rust/arrow/src/compute/kernels/limit.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xpbWl0LnJz) | `100.00% <100.00%> (ø)` | | | [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <100.00%> (ø)` | | | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: | -- [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=continue). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=footer). Last update [a0e1244...9ca08d7](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace edited a comment on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily
westonpace edited a comment on pull request #8240: URL: https://github.com/apache/arrow/pull/8240#issuecomment-765025427

Sorry, I am having trouble understanding what you are referencing. There was an RTools 3.5 failure caused by a gcc bug that requires you to `std::move` a `unique_ptr`, but I don't think that would apply here. The other compiler warning treated as an error seemed to be unique to gcc 9.3 with tsan, and I didn't know there was any failing job related to it.
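The lazy-spawning idea behind this PR (a C++ thread pool change) can be sketched in miniature. The Rust toy below is illustrative only, not Arrow's implementation: the worker thread is created on the first submitted task rather than when the pool object is constructed.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Minimal illustration of lazy worker spawning: the thread only
/// exists once the first task arrives.
struct LazyWorker {
    sender: Option<Sender<Box<dyn FnOnce() + Send>>>,
}

impl LazyWorker {
    fn new() -> Self {
        // No thread is spawned here.
        LazyWorker { sender: None }
    }

    fn submit(&mut self, task: Box<dyn FnOnce() + Send>) {
        // Spawn the worker on first use, then reuse it.
        let sender = self.sender.get_or_insert_with(|| {
            let (tx, rx) = channel::<Box<dyn FnOnce() + Send>>();
            thread::spawn(move || {
                for task in rx {
                    task();
                }
            });
            tx
        });
        sender.send(task).unwrap();
    }
}

fn main() {
    let mut pool = LazyWorker::new();
    let (done_tx, done_rx) = channel();
    pool.submit(Box::new(move || done_tx.send(21 * 2).unwrap()));
    assert_eq!(done_rx.recv().unwrap(), 42);
    println!("lazy worker ran the task");
}
```

The trade-off being debated in the PR is exactly this: paying thread-creation cost on first use instead of at pool construction, in exchange for cheap idle pools.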
[GitHub] [arrow] cyb70289 commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead
cyb70289 commented on a change in pull request #9280: URL: https://github.com/apache/arrow/pull/9280#discussion_r561589279

## File path: cpp/src/arrow/compute/internals_benchmark.cc

## @@ -0,0 +1,86 @@

+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing std::vector<Datum> into a sequence of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector<Datum> args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+    args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+    std::unique_ptr<ExecBatchIterator> it =
+        *ExecBatchIterator::Make(args, state.range(0));
+    ExecBatch batch;
+    while (it->Next(&batch)) {
+      for (int i = 0; i < num_fields; ++i) {
+        auto data = batch.values[i].array()->buffers[1]->data();
+        benchmark::DoNotOptimize(data);
+      }
+      continue;

Review comment: remove it?

## File path: cpp/src/arrow/compute/internals_benchmark.cc

## @@ -0,0 +1,86 @@

+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing std::vector<Datum> into a sequence of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector<Datum> args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+    args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+    std::unique_ptr<ExecBatchIterator> it =
+        *ExecBatchIterator::Make(args, state.range(0));
+    ExecBatch batch;
+    while (it->Next(&batch)) {
+      for (int i = 0; i < num_fields; ++i) {
+        auto data = batch.values[i].array()->buffers[1]->data();
+        benchmark::DoNotOptimize(data);
+      }
+      continue;
+    }
+    benchmark::DoNotOptimize(batch);
+  }
+
+  state.SetItemsProcessed(state.iterations());
+}
+
+void BM_DatumSlice(benchmark::State& state) {
+  // Measure overhead related to deconstructing std::vector<Datum> into a sequence of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1000;
+
+  int num_datums = 1000;

Review comment: also define as const?
[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou commented on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-764050107

> I didn't see anything rust related in this PR so I removed the Rust label

@alamb Sorry it didn't appear in the right place, but there is nothing parquet-related in this PR either. Can the parquet label be removed? Thanks!
[GitHub] [arrow] jorisvandenbossche closed pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests
jorisvandenbossche closed pull request #9282: URL: https://github.com/apache/arrow/pull/9282
[GitHub] [arrow] github-actions[bot] commented on pull request #9283: ARROW-11235: [Python] Fix test failure inside non-default S3 region
github-actions[bot] commented on pull request #9283: URL: https://github.com/apache/arrow/pull/9283#issuecomment-764594206

https://issues.apache.org/jira/browse/ARROW-11235
[GitHub] [arrow] mathyingzhou commented on a change in pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou commented on a change in pull request #8648: URL: https://github.com/apache/arrow/pull/8648#discussion_r561405645

## File path: cpp/src/arrow/adapters/orc/adapter_test.cc

## @@ -157,4 +225,2478 @@ TEST(TestAdapter, readIntAndStringFileMultipleStripes) {

     EXPECT_TRUE(stripe_reader->ReadNext(&record_batch).ok());
   }
 }
+
+// WriteORC tests
+
+// General
+TEST(TestAdapterWriteGeneral, writeZeroRows) {
+  std::vector<std::shared_ptr<Field>> xFields{field("bool", boolean()),
+                                              field("int8", int8()),
+                                              field("int16", int16()),
+                                              field("int32", int32()),
+                                              field("int64", int64()),
+                                              field("float", float32()),
+                                              field("double", float64()),
+                                              field("decimal128nz", decimal(25, 6)),
+                                              field("decimal128z", decimal(32, 0)),
+                                              field("date32", date32()),
+                                              field("ts3", timestamp(TimeUnit::NANO)),
+                                              field("string", utf8()),
+                                              field("binary", binary())};
+  std::shared_ptr<Schema> sharedPtrSchema = std::make_shared<Schema>(xFields);
+
+  int64_t numRows = 0;
+  int64_t numCols = xFields.size();
+
+  ArrayBuilderVector builders(numCols, NULLPTR);
+  builders[0] =
+      std::static_pointer_cast<ArrayBuilder>(std::make_shared<BooleanBuilder>());
+  builders[1] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<Int8Builder>());
+  builders[2] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<Int16Builder>());
+  builders[3] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<Int32Builder>());
+  builders[4] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<Int64Builder>());
+  builders[5] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<FloatBuilder>());
+  builders[6] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<DoubleBuilder>());
+  builders[7] = std::static_pointer_cast<ArrayBuilder>(
+      std::make_shared<Decimal128Builder>(decimal(25, 6)));
+  builders[8] = std::static_pointer_cast<ArrayBuilder>(
+      std::make_shared<Decimal128Builder>(decimal(32, 0)));
+  builders[9] = std::static_pointer_cast<ArrayBuilder>(std::make_shared<Date32Builder>());
+  builders[10] =
+      std::static_pointer_cast<ArrayBuilder>(std::make_shared<TimestampBuilder>(
+          timestamp(TimeUnit::NANO), default_memory_pool()));
+  builders[11] =
+      std::static_pointer_cast<ArrayBuilder>(std::make_shared<StringBuilder>());
+  builders[12] =
+      std::static_pointer_cast<ArrayBuilder>(std::make_shared<BinaryBuilder>());
+  ArrayVector arrays(numCols, NULLPTR);
+  ChunkedArrayVector cv;
+  cv.reserve(numCols);
+
+  for (int col = 0; col < numCols; col++) {
+    ARROW_EXPECT_OK(builders[col]->Finish(&arrays[col]));
+    cv.push_back(std::make_shared<ChunkedArray>(arrays[col]));
+  }
+
+  std::shared_ptr<Table> table = Table::Make(sharedPtrSchema, cv);
+
+  std::unique_ptr<ORCMemWriter> writer =
+      std::unique_ptr<ORCMemWriter>(new ORCMemWriter());
+  std::unique_ptr<liborc::OutputStream> out_stream =
+      std::unique_ptr<liborc::OutputStream>(static_cast<liborc::OutputStream*>(
+          new MemoryOutputStream(DEFAULT_SMALL_MEM_STREAM_SIZE / 16)));
+  ARROW_EXPECT_OK(writer->Open(sharedPtrSchema, out_stream));
+  ARROW_EXPECT_OK(writer->Write(table));
+  auto output_mem_stream = static_cast<MemoryOutputStream*>(writer->ReleaseOutStream());
+  std::shared_ptr<io::RandomAccessFile> in_stream(
+      new io::BufferReader(std::make_shared<Buffer>(
+          reinterpret_cast<const uint8_t*>(output_mem_stream->getData()),
+          static_cast<int64_t>(output_mem_stream->getLength()))));
+
+  std::unique_ptr<adapters::orc::ORCFileReader> reader;
+  ASSERT_TRUE(
+      adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader).ok());
+  std::shared_ptr<Table> outputTable;
+  ARROW_EXPECT_OK(reader->Read(&outputTable));
+  EXPECT_EQ(outputTable->num_columns(), numCols);
+  EXPECT_EQ(outputTable->num_rows(), numRows);
+  EXPECT_TRUE(outputTable->Equals(*table));
+}
+TEST(TestAdapterWriteGeneral, writeChunkless) {
+  std::vector<std::shared_ptr<Field>> xFieldsSub{std::make_shared<Field>("a", utf8()),
+                                                 std::make_shared<Field>("b", int32())};
+  std::vector<std::shared_ptr<Field>> xFields{
+      field("bool", boolean()),
+      field("int8", int8()),
+      field("int16", int16()),
+      field("int32", int32()),
+      field("int64", int64()),
+      field("float", float32()),
+      field("double", float64()),
+      field("decimal128nz", decimal(25, 6)),
+      field("decimal128z", decimal(32, 0)),
+      field("date32", date32()),
+      field("ts3", timestamp(TimeUnit::NANO)),
+      field("string", utf8()),
+      field("binary", binary()),
+      field("struct", struct_(xFieldsSub)),
+      field("list", list(int32())),
+      field("lsl", list(struct_({field("lsl0", list(int32()))})))};
+  std::shared_ptr<Schema> sharedPtrSchema = std::make_shared<Schema>(xFields);
+
+  int64_t numRows = 0;
+  int64_t numCols = xFields.size();
+
+  ChunkedArrayVector cv;
+  cv.reserve(numCols);
+
+  ArrayMatrix av(numCols, ArrayVector(0, NULLPTR));
+
+  for (int col = 0; col < numCols; col++) {
+    cv.push_back(std::make_shared<ChunkedArray>(av[col], xFields[col]->type()));
+  }
+
+  std::shared_ptr
[GitHub] [arrow] andygrove commented on pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
andygrove commented on pull request #9278: URL: https://github.com/apache/arrow/pull/9278#issuecomment-763902393

I just wanted to add that there is no need to wait for me to review before merging.
[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors
tyrelr commented on a change in pull request #9215: URL: https://github.com/apache/arrow/pull/9215#discussion_r561918229

## File path: rust/arrow/src/array/array_primitive.rs

## @@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {

 }

 /// Returns the primitive value at index `i`.
-///
-/// Note this doesn't do any bound checking, for performance reason.
-/// # Safety
-/// caller must ensure that the passed in offset is less than the array len()
+#[inline]
 pub fn value(&self, i: usize) -> T::Native {
-    let offset = i + self.offset();
-    unsafe { *self.raw_values.as_ptr().add(offset) }
+    self.values()[i]
 }

Review comment: I agree that we want to guide API users to switch to `values()`. My hesitancy with making/leaving it unsafe is that I think it's the only `Array` `value(x)` method, when combined with the other changes in this PR (which is why it got added as a follow-on commit). That leads to awkwardness in macros, such as the comparison kernel, where it effectively calls `$left.value(x)`; it could be a papercut for macro-style invocation. (There may be an argument that a trait could abstract away the need for macros altogether... but my experiments to rework the comparison kernel to that effect are incomplete, and partly sidelined by various other experiments I'm toying with.)

I would expect that in the exact for loop you mention the compiler could recognize the satisfied loop bound, since both `len()` and `value(i)` are inlinable and use the same underlying field, but I expect it is also easy to complicate that loop and defeat the optimization by making the relationship between `i` and `array.len()` less direct. The pretty-printing logic probably would do additional bounds checks now, as it iterates based on `len()` from the record batch, which is semantically equal, but nothing in the code enforces that in a way I would expect the compiler to understand. I'll launch some benchmarks of head, master, and the commit just before the `PrimitiveArray` safety change.

## File path: rust/arrow/src/array/array_primitive.rs

## @@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {

 }

 /// Returns the primitive value at index `i`.
-///
-/// Note this doesn't do any bound checking, for performance reason.
-/// # Safety
-/// caller must ensure that the passed in offset is less than the array len()
+#[inline]
 pub fn value(&self, i: usize) -> T::Native {
-    let offset = i + self.offset();
-    unsafe { *self.raw_values.as_ptr().add(offset) }
+    self.values()[i]
 }

Review comment: I agree that we want to guide API users to switch to `values()`. My hesitancy with making/leaving it unsafe is that I think it's the only `Array` `value(x)` method, when combined with the other changes in this PR (which is why it got added as a follow-on commit). That leads to awkwardness in macros, such as the comparison kernel, where it effectively calls `$left.value(x)`. Making macro-style invocation sometimes-need-unsafe is a bit unfortunate. (There may be an argument that a trait could abstract away the need for macros altogether... but my experiments to rework the comparison kernel to that effect are incomplete, and partly sidelined by various other experiments I'm toying with... most of which somehow relate to removing direct calls to `Array.value(i)`.)

I would expect that in the exact for loop you mention the compiler could recognize the satisfied loop bound, since both `len()` and `value(i)` are inlinable and use the same underlying field, but I expect it is also easy to complicate that loop and defeat the optimization by making the relationship between `i` and `array.len()` less direct. The pretty-printing logic probably would do additional bounds checks now, as it iterates based on `len()` from the record batch, which is semantically equal, but nothing in the code enforces that in a way I would expect the compiler to understand. I'll launch some benchmarks of head, master, and the commit just before the `PrimitiveArray` safety change.
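The bounds-check trade-off being discussed can be seen with a plain slice (toy code, not the arrow crate): an indexed loop whose bound comes directly from the slice's own `len()` can usually have its checks elided, while the iterator form sidesteps indexing entirely.

```rust
/// Safe indexed access, like the new `value(i)`: each access may incur a
/// bounds check unless the optimizer can prove `i < values.len()`.
fn sum_indexed(values: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..values.len() {
        // The loop bound is values.len() itself, so the check here is
        // typically elided; an opaque or indirect bound would defeat that.
        total += values[i];
    }
    total
}

/// Iterator form: no index at all, so there is no bounds check to elide.
fn sum_iter(values: &[i64]) -> i64 {
    values.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3, 4];
    assert_eq!(sum_indexed(&v), 10);
    assert_eq!(sum_iter(&v), 10);
    println!("both sums agree");
}
```

This mirrors the reviewer's point: when the bound and the access share an obvious source, safe indexing can be free; when they only happen to be "semantically equal" (as in the pretty-printing case), the checks likely remain.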
[GitHub] [arrow] jorgecarleitao commented on pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.
jorgecarleitao commented on pull request #9281: URL: https://github.com/apache/arrow/pull/9281#issuecomment-764422417

@ovr, this is what I was trying to express in your PR: we can use `ArrayData` directly and thereby support arbitrary nested types.
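The idea of building empty arrays for arbitrarily nested types by working on the data representation directly can be sketched with toy types. The `DataType` and `ArrayData` below are illustrative stand-ins, not the arrow crate's real definitions: the point is that "make an empty array" reduces to a recursion over child types.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
    Struct(Vec<DataType>),
}

#[derive(Debug, PartialEq)]
struct ArrayData {
    data_type: DataType,
    len: usize,
    child_data: Vec<ArrayData>,
}

/// Build an empty array for any type, including arbitrarily nested ones,
/// by recursing into child types.
fn new_empty(data_type: &DataType) -> ArrayData {
    let child_data = match data_type {
        DataType::List(child) => vec![new_empty(child)],
        DataType::Struct(fields) => fields.iter().map(new_empty).collect(),
        _ => vec![],
    };
    ArrayData {
        data_type: data_type.clone(),
        len: 0,
        child_data,
    }
}

fn main() {
    let nested = DataType::List(Box::new(DataType::Struct(vec![
        DataType::Int32,
        DataType::Utf8,
    ])));
    let empty = new_empty(&nested);
    assert_eq!(empty.len, 0);
    assert_eq!(empty.child_data.len(), 1);
    assert_eq!(empty.child_data[0].child_data.len(), 2);
    println!("built empty nested array metadata");
}
```

A per-array-type constructor would need one branch per concrete array struct; recursing on the data representation handles any nesting for free, which is the "solved more broadly" claim above.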
[GitHub] [arrow] alamb commented on a change in pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
alamb commented on a change in pull request #9261: URL: https://github.com/apache/arrow/pull/9261#discussion_r562040108

## File path: .github/workflows/rust.yml

```yaml
# @@ -226,6 +226,40 @@ jobs:
          cd rust
          cargo clippy --all-targets --workspace -- -D warnings -A clippy::redundant_field_names

  miri-checks:
    name: Miri Checks
    runs-on: ubuntu-latest
    strategy:
      matrix:
        arch: [amd64]
        rust: [nightly-2021-01-19]
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true
      - uses: actions/cache@v2
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-miri-${{ hashFiles('**/Cargo.lock') }}
      - name: Setup Rust toolchain
        run: |
          rustup toolchain install ${{ matrix.rust }}
          rustup default ${{ matrix.rust }}
          rustup component add rustfmt clippy miri
      - name: Run Miri Checks
        env:
          RUST_BACKTRACE: full
          RUST_LOG: 'trace'
        run: |
          export MIRIFLAGS="-Zmiri-disable-isolation"
          cd rust
          cargo miri setup
          cargo clean
          cargo miri test || true
```

Review comment:

```suggestion
          # Ignore MIRI errors until we can get a clean run
          cargo miri test || true
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
seddonm1 commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r561653478 ## File path: rust/datafusion/src/physical_plan/string_expressions.rs ## @@ -34,42 +34,446 @@ macro_rules! downcast_vec { }}; } -/// concatenate string columns together. -pub fn concatenate(args: &[ArrayRef]) -> Result { +/// Returns the numeric code of the first character of the argument. +pub fn ascii(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| { +x.map(|x: | { +let mut chars = x.chars(); +chars.next().map_or(0, |v| v as i32) +}) +}) +.collect()) +} + +/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string. +pub fn btrim(args: &[ArrayRef]) -> Result { +match args.len() { +0 => Err(DataFusionError::Internal( +"btrim was called with 0 arguments. It requires at least one.".to_string(), +)), +1 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.map(|x| x.map(|x: | x.trim())) +.collect()) +} +2 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +let characters_array = args[1] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.enumerate() +.map(|(i, x)| { +if characters_array.is_null(i) { +None +} else { +x.map(|x: | { +let chars: Vec = +characters_array.value(i).chars().collect(); +x.trim_start_matches([..]) +.trim_end_matches([..]) +}) +} +}) +.collect()) +} +other => Err(DataFusionError::Internal(format!( +"btrim was called with {} arguments. It requires at most two.", +other +))), +} +} + +/// Returns the character with the given code. 
+pub fn chr(args: &[ArrayRef]) -> Result { +let array = args[0].as_any().downcast_ref::().unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x: Option| { +x.map(|x| { +if x == 0 { +Err(DataFusionError::Internal( +"null character not permitted.".to_string(), +)) +} else { +match core::char::from_u32(x as u32) { +Some(x) => Ok(x.to_string()), +None => Err(DataFusionError::Internal( +"requested character too large for encoding.".to_string(), +)), +} +} +.unwrap() Review comment: I'm not sure if we should be panicing if these characters appear ## File path: rust/datafusion/src/physical_plan/type_coercion.rs ## @@ -69,13 +69,42 @@ pub fn data_types( signature: , ) -> Result> { let valid_types = match signature { -Signature::Variadic(valid_types) => valid_types +Signature::Any(number) => { +if current_types.len() != *number { +return Err(DataFusionError::Plan(format!( +"The function expected {} arguments but received {}", +number, +current_types.len() +))); +} +vec![(0..*number).map(|i| current_types[i].clone()).collect()] +} +Signature::Exact(valid_types) => vec![valid_types.clone()], +Signature::Uniform(valid_types) => { +let valid_signature = valid_types +.iter() +.filter(|x| x.len() == current_types.len()) +.collect::>(); +if valid_signature.len() != 1 { +return Err(DataFusionError::Plan(format!( +"The function expected {} arguments but received {}", +valid_types +.iter() +.map(|x| x.len().to_string()) +.collect::>() +.join(" or "), +current_types.len() +))); +} +
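The two-argument `btrim` path quoted above boils down to `trim_start_matches`/`trim_end_matches` with a `&[char]` pattern built from the characters argument. A standalone sketch (`btrim_one` is a hypothetical name operating on one element rather than a whole array):

```rust
// Strip any of the given characters from both ends of the string,
// mirroring the per-element logic of the two-argument `btrim`.
fn btrim_one(s: &str, characters: &str) -> String {
    let chars: Vec<char> = characters.chars().collect();
    // A `&[char]` slice is a valid `Pattern`, matching any char in it.
    s.trim_start_matches(&chars[..])
        .trim_end_matches(&chars[..])
        .to_string()
}

fn main() {
    // The canonical Postgres example: btrim('xyxtrimyyx', 'xyz') = 'trim'
    assert_eq!(btrim_one("xyxtrimyyx", "xyz"), "trim");
    assert_eq!(btrim_one("  padded  ", " "), "padded");
    println!("ok");
}
```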
[GitHub] [arrow] alamb commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
alamb commented on pull request #9261: URL: https://github.com/apache/arrow/pull/9261#issuecomment-764604319 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou edited a comment on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-764050107 > I didn't see anything rust related in this PR so I removed the Rust label @alamb Thanks! There is nothing parquet-related in this PR either. Can the parquet label be removed as well? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9285: ARROW-10349: [Python] Build and publish aarch64 wheels [WIP]
github-actions[bot] commented on pull request #9285: URL: https://github.com/apache/arrow/pull/9285#issuecomment-764893560 https://issues.apache.org/jira/browse/ARROW-10349 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mbrubeck commented on a change in pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)
mbrubeck commented on a change in pull request #9235: URL: https://github.com/apache/arrow/pull/9235#discussion_r561239323

## File path: rust/arrow/src/buffer.rs

```rust
// @@ -963,11 +968,131 @@ impl MutableBuffer {
     /// Extends the buffer by `additional` bytes equal to `0u8`, incrementing its capacity if needed.
     #[inline]
-    pub fn extend(&mut self, additional: usize) {
+    pub fn extend_zeros(&mut self, additional: usize) {
         self.resize(self.len + additional, 0);
     }
 }

+/// # Safety
+/// `ptr` must be allocated for `old_capacity`.
+#[inline]
+unsafe fn reallocate(
+    ptr: NonNull<u8>,
+    old_capacity: usize,
+    new_capacity: usize,
+) -> (NonNull<u8>, usize) {
+    let new_capacity = bit_util::round_upto_multiple_of_64(new_capacity);
+    let new_capacity = std::cmp::max(new_capacity, old_capacity * 2);
+    let ptr = memory::reallocate(ptr, old_capacity, new_capacity);
+    (ptr, new_capacity)
+}
+
+impl<T: ArrowNativeType> Extend<T> for MutableBuffer {
+    #[inline]
+    fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
+        let iterator = iter.into_iter();
+        self.extend_from_iter(iterator)
+    }
+}
+
+impl MutableBuffer {
+    #[inline]
+    fn extend_from_iter<T: ArrowNativeType, I: Iterator<Item = T>>(
+        &mut self,
+        mut iterator: I,
+    ) {
+        let size = std::mem::size_of::<T>();
+
+        // this is necessary because of https://github.com/rust-lang/rust/issues/32155
+        let (mut ptr, mut capacity, mut len) = (self.data, self.capacity, self.len);
+        let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };
+
+        while let Some(item) = iterator.next() {
+            if len + size >= capacity {
+                let (lower, _) = iterator.size_hint();
+                let additional = (lower + 1) * size;
+                let (new_ptr, new_capacity) =
+                    unsafe { reallocate(ptr, capacity, len + additional) };
```

Review comment:

> Note that arrow does not support complex structs on its buffers (i.e. we only support `u8`-`u64`, `i8`-`i64`, `f32` and `f64`), which means that we never need to call `drop` on the elements. Under this, do we still need a valid `len`?

No. Dropping before `self.len` is updated should be fine. However, in my benchmarking I still found that using `SetLenOnDrop` provided a small performance benefit compared to just updating `self.len` after the loop. I'm not sure why. Maybe the mutable borrow provides some additional hints to the optimizer.
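The `SetLenOnDrop` idea mentioned above (a pattern borrowed from `std`'s `Vec` internals) can be sketched in isolation. The names here are illustrative, not the arrow implementation: the length is staged in a local and written back in `Drop`, so it stays correct even if the element iterator panics, and the outstanding mutable borrow may give the optimizer extra aliasing information:

```rust
// Stage the length in a local; write it back when the guard drops.
struct SetLenOnDrop<'a> {
    len: &'a mut usize,
    local_len: usize,
}

impl<'a> SetLenOnDrop<'a> {
    fn new(len: &'a mut usize) -> Self {
        let local_len = *len;
        SetLenOnDrop { len, local_len }
    }

    fn increment(&mut self, n: usize) {
        self.local_len += n;
    }
}

impl Drop for SetLenOnDrop<'_> {
    fn drop(&mut self) {
        // Runs on normal exit *and* during unwinding, so the stored
        // length always reflects how many elements were actually written.
        *self.len = self.local_len;
    }
}

fn main() {
    let mut len = 0usize;
    {
        let mut guard = SetLenOnDrop::new(&mut len);
        for _ in 0..5 {
            guard.increment(1);
        }
    } // guard dropped here, writing back the final length
    assert_eq!(len, 5);
    println!("ok");
}
```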
[GitHub] [arrow] alamb closed pull request #9275: ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results
alamb closed pull request #9275: URL: https://github.com/apache/arrow/pull/9275 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
seddonm1 commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562391973

## File path: rust/datafusion/src/physical_plan/functions.rs

```rust
// @@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vec![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),
```

Review comment: Yes, agree. The existing code clearly took some thought, so I wanted to leave it until we can agree on the correct course of action. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
seddonm1 commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562391477 ## File path: rust/datafusion/src/physical_plan/string_expressions.rs ## @@ -34,42 +35,553 @@ macro_rules! downcast_vec { }}; } -/// concatenate string columns together. -pub fn concatenate(args: &[ArrayRef]) -> Result { +/// Returns the numeric code of the first character of the argument. +pub fn ascii(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| { +x.map(|x: | { +let mut chars = x.chars(); +chars.next().map_or(0, |v| v as i32) +}) +}) +.collect()) +} + +/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string. +pub fn btrim(args: &[ArrayRef]) -> Result { +match args.len() { +0 => Err(DataFusionError::Internal( +"btrim was called with 0 arguments. It requires at least 1.".to_string(), +)), +1 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.map(|x| x.map(|x: | x.trim())) +.collect()) +} +2 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +let characters_array = args[1] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.enumerate() +.map(|(i, x)| { +if characters_array.is_null(i) { +None +} else { +x.map(|x: | { +let chars: Vec = +characters_array.value(i).chars().collect(); +x.trim_start_matches([..]) +.trim_end_matches([..]) +}) +} +}) +.collect()) +} +other => Err(DataFusionError::Internal(format!( +"btrim was called with {} arguments. It requires at most 2.", +other +))), +} +} + +/// Returns number of characters in the string. 
+pub fn character_length_i32(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i32)) +.collect()) +} + +/// Returns number of characters in the string. +pub fn character_length_i64(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i64)) +.collect()) +} + +/// Returns the character with the given code. +pub fn chr(args: &[ArrayRef]) -> Result { +let array = args[0].as_any().downcast_ref::().unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array Review comment: Ah of course 臘 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
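The reason `character_length` counts `graphemes(true)` (from the `unicode-segmentation` crate) rather than `chars()` is that one user-perceived character can span several Unicode scalar values. A standalone illustration using only `std` (so `chars()` stands in for the grapheme count here):

```rust
fn main() {
    // "é" as a single precomposed code point vs "e" + combining acute accent
    let precomposed = "é";        // U+00E9
    let combining = "e\u{0301}";  // U+0065 followed by U+0301

    assert_eq!(precomposed.len(), 2);         // length in UTF-8 bytes
    assert_eq!(precomposed.chars().count(), 1); // one scalar value
    assert_eq!(combining.len(), 3);           // three UTF-8 bytes
    assert_eq!(combining.chars().count(), 2); // two scalar values...
    // ...but both strings are ONE grapheme cluster, which is why the PR
    // uses `graphemes(true).count()` instead of `chars().count()`.
    println!("ok");
}
```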
[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
seddonm1 commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562391162

## File path: rust/datafusion/src/physical_plan/type_coercion.rs

```rust
// @@ -69,13 +69,42 @@
 pub fn data_types(
     current_types: &[DataType],
     signature: &Signature,
 ) -> Result<Vec<Vec<DataType>>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())
```

Review comment: Thanks @jorgecarleitao. Yes, I will split this out. A good example is `lpad`, which is either `[string, int]` or `[string, int, string]`. I am away for a couple of days but will split this out so we can work through it methodically. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
jorgecarleitao commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562390994 ## File path: rust/datafusion/src/physical_plan/string_expressions.rs ## @@ -34,42 +35,553 @@ macro_rules! downcast_vec { }}; } -/// concatenate string columns together. -pub fn concatenate(args: &[ArrayRef]) -> Result { +/// Returns the numeric code of the first character of the argument. +pub fn ascii(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| { +x.map(|x: | { +let mut chars = x.chars(); +chars.next().map_or(0, |v| v as i32) +}) +}) +.collect()) +} + +/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string. +pub fn btrim(args: &[ArrayRef]) -> Result { +match args.len() { +0 => Err(DataFusionError::Internal( +"btrim was called with 0 arguments. It requires at least 1.".to_string(), +)), +1 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.map(|x| x.map(|x: | x.trim())) +.collect()) +} +2 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +let characters_array = args[1] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.enumerate() +.map(|(i, x)| { +if characters_array.is_null(i) { +None +} else { +x.map(|x: | { +let chars: Vec = +characters_array.value(i).chars().collect(); +x.trim_start_matches([..]) +.trim_end_matches([..]) +}) +} +}) +.collect()) +} +other => Err(DataFusionError::Internal(format!( +"btrim was called with {} arguments. It requires at most 2.", +other +))), +} +} + +/// Returns number of characters in the string. 
+pub fn character_length_i32(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i32)) +.collect()) +} + +/// Returns number of characters in the string. +pub fn character_length_i64(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i64)) +.collect()) +} + +/// Returns the character with the given code. +pub fn chr(args: &[ArrayRef]) -> Result { +let array = args[0].as_any().downcast_ref::().unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array Review comment: *and remove this `Ok` from here, so that `collect` is implicitly treated as `collect>` instead of `collect<_>` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
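The non-panicking alternative being suggested for `chr` above — keeping per-element failures as `Result` and letting `collect` gather them (presumably `collect::<Result<_, _>>` rather than collecting after an `unwrap()`) — can be sketched without arrow types. Here `chr_sketch` is a hypothetical stand-in, with `Vec<Option<String>>` for the nullable array and `String` for `DataFusionError`:

```rust
// Return the first error instead of panicking on bad input.
fn chr_sketch(codes: &[Option<i64>]) -> Result<Vec<Option<String>>, String> {
    codes
        .iter()
        .map(|x| {
            x.map(|x| {
                if x == 0 {
                    Err("null character not permitted.".to_string())
                } else {
                    char::from_u32(x as u32)
                        .map(|c| c.to_string())
                        .ok_or_else(|| {
                            "requested character too large for encoding.".to_string()
                        })
                }
            })
            .transpose() // Option<Result<T, E>> -> Result<Option<T>, E>
        })
        .collect() // Result<Vec<Option<String>>, String>: stops at first Err
}

fn main() {
    assert_eq!(
        chr_sketch(&[Some(65), None]),
        Ok(vec![Some("A".to_string()), None])
    );
    assert!(chr_sketch(&[Some(0)]).is_err());
    println!("ok");
}
```

`Option::transpose` flips the nesting so that nulls stay `None` while any per-element error aborts the whole `collect` with `Err`.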
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
jorgecarleitao commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562389680

## File path: rust/datafusion/src/physical_plan/type_coercion.rs

```rust
// @@ -69,13 +69,42 @@
 pub fn data_types(
     current_types: &[DataType],
     signature: &Signature,
 ) -> Result<Vec<Vec<DataType>>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())
```

Review comment: I suggest that we PR this separately, with a single function that requires this type of signature: getting this right requires much more care than the other parts of this PR, since it affects all future functions that use it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
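The `cartesian_product` step discussed above expands per-argument candidate types into all concrete signatures. A standalone sketch of that expansion, using `&str` stand-ins for `DataType` (this is illustrative, not the datafusion implementation):

```rust
// Expand per-argument candidate types into all concrete signatures.
// E.g. [[Float32, Float64], [Utf8]] -> [[Float32, Utf8], [Float64, Utf8]].
fn cartesian_product(valid_types: &[Vec<&'static str>]) -> Vec<Vec<&'static str>> {
    valid_types.iter().fold(vec![vec![]], |acc, candidates| {
        acc.iter()
            .flat_map(|prefix| {
                candidates.iter().map(move |t| {
                    // Extend each existing partial signature with each candidate.
                    let mut sig = prefix.clone();
                    sig.push(*t);
                    sig
                })
            })
            .collect()
    })
}

fn main() {
    let product = cartesian_product(&[vec!["Float32", "Float64"], vec!["Utf8"]]);
    assert_eq!(
        product,
        vec![vec!["Float32", "Utf8"], vec!["Float64", "Utf8"]]
    );
    println!("ok");
}
```

The number of produced signatures is the product of the candidate counts per argument, which is part of why the reviewer wants this signature form considered carefully before more functions depend on it.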
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]
jorgecarleitao commented on a change in pull request #9243: URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540 ## File path: rust/datafusion/src/physical_plan/functions.rs ## @@ -60,10 +59,15 @@ pub enum Signature { // A function such as `array` is `VariadicEqual` // The first argument decides the type used for coercion VariadicEqual, +/// fixed number of arguments of vector of vectors of valid types +// A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])` +// A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])` +// A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])` +Uniform(Vec>>), Review comment: This signature generalizes `UniformEqual`, so, wouldn't it be possible generalize the other instead of creating a new one (replace the existing one by the more general form)? `Signature` should be such that its variants form a complete set of options without overlaps. ## File path: rust/datafusion/src/physical_plan/string_expressions.rs ## @@ -34,42 +35,553 @@ macro_rules! downcast_vec { }}; } -/// concatenate string columns together. -pub fn concatenate(args: &[ArrayRef]) -> Result { +/// Returns the numeric code of the first character of the argument. +pub fn ascii(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| { +x.map(|x: | { +let mut chars = x.chars(); +chars.next().map_or(0, |v| v as i32) +}) +}) +.collect()) +} + +/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string. +pub fn btrim(args: &[ArrayRef]) -> Result { +match args.len() { +0 => Err(DataFusionError::Internal( +"btrim was called with 0 arguments. 
It requires at least 1.".to_string(), +)), +1 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.map(|x| x.map(|x: | x.trim())) +.collect()) +} +2 => { +let string_array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); + +let characters_array = args[1] +.as_any() +.downcast_ref::>() +.unwrap(); + +Ok(string_array +.iter() +.enumerate() +.map(|(i, x)| { +if characters_array.is_null(i) { +None +} else { +x.map(|x: | { +let chars: Vec = +characters_array.value(i).chars().collect(); +x.trim_start_matches([..]) +.trim_end_matches([..]) +}) +} +}) +.collect()) +} +other => Err(DataFusionError::Internal(format!( +"btrim was called with {} arguments. It requires at most 2.", +other +))), +} +} + +/// Returns number of characters in the string. +pub fn character_length_i32(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i32)) +.collect()) +} + +/// Returns number of characters in the string. +pub fn character_length_i64(args: &[ArrayRef]) -> Result { +let array = args[0] +.as_any() +.downcast_ref::>() +.unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x| x.map(|x: | x.graphemes(true).count() as i64)) +.collect()) +} + +/// Returns the character with the given code. +pub fn chr(args: &[ArrayRef]) -> Result { +let array = args[0].as_any().downcast_ref::().unwrap(); +// first map is the iterator, second is for the `Option<_>` +Ok(array +.iter() +.map(|x: Option| { +x.map(|x| { +if x == 0 { +Err(DataFusionError::Internal( +"null character not permitted.".to_string(), +)) +} else { +match core::char::from_u32(x as u32) { +