[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

2021-01-21 Thread GitBox


jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765211936


   Isn't the data contained in a buffer `Arc`ed? I.e., `Vec::clone()` should be cheap, no?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

2021-01-21 Thread GitBox


Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765208218


   @jorgecarleitao  I found that the "offending" code is this function in `array/data.rs`, which does a `self.clone()`. Any idea how we could write a non-copying version?
   I think it was copied/adapted from a function in the array module, but I believe it was also cloning in previous versions.
   
   ```rust
   /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
   /// with a different offset, len and a shifted null bitmap.
   ///
   /// # Panics
   ///
   /// Panics if `offset + length > self.len()`.
   pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
       assert!((offset + length) <= self.len());

       let mut new_data = self.clone();

       new_data.len = length;
       new_data.offset = offset + self.offset;

       new_data.null_count =
           count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);

       new_data
   }
   ```
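   For context, a minimal standalone sketch (illustrative types only, not the actual `ArrayData` definition) of why such a `clone()` can still be cheap when the heavy bytes live behind an `Arc`: cloning copies only the reference-counted handles, never the buffer contents.

   ```rust
   use std::sync::Arc;

   fn main() {
       // Stand-in for an ArrayData-like struct whose buffers are Arc'ed.
       let buffer: Arc<Vec<u8>> = Arc::new(vec![0u8; 1_000_000]);
       let buffers: Vec<Arc<Vec<u8>>> = vec![buffer];

       // Cloning the Vec clones only the Arc handles (a pointer copy plus a
       // reference-count bump); the underlying megabyte is not duplicated.
       let sliced = buffers.clone();
       assert!(Arc::ptr_eq(&buffers[0], &sliced[0]));
   }
   ```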



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`

2021-01-21 Thread GitBox


codecov-io commented on pull request #9291:
URL: https://github.com/apache/arrow/pull/9291#issuecomment-765203292


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=h1) Report
   > Merging 
[#9291](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=desc) (beacc5d) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/c413566b34bd0c13a01a68148bd78df1bdec3c10?el=desc)
 (c413566) will **increase** coverage by `0.12%`.
   > The diff coverage is `97.67%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9291/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9291      +/-   ##
   ==========================================
   + Coverage   81.61%   81.74%   +0.12%
   ==========================================
     Files         215      215
     Lines       52508    52486      -22
   ==========================================
   + Hits        42854    42903      +49
   + Misses       9654     9583      -71
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWxfanNvbi5ycw==)
 | `92.16% <85.71%> (-0.26%)` | :arrow_down: |
   | 
[rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz)
 | `94.52% <100.00%> (ø)` | |
   | 
[rust/arrow/src/array/ord.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvb3JkLnJz)
 | `62.50% <100.00%> (+1.07%)` | :arrow_up: |
   | 
[rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=)
 | `97.20% <100.00%> (+0.21%)` | :arrow_up: |
   | 
[rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==)
 | `78.86% <0.00%> (-18.45%)` | :arrow_down: |
   | 
[rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==)
 | `85.10% <0.00%> (-0.42%)` | :arrow_down: |
   | 
[rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz)
 | `88.43% <0.00%> (ø)` | |
   | 
[rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=)
 | `74.44% <0.00%> (+0.07%)` | :arrow_up: |
   | 
[rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz)
 | `84.32% <0.00%> (+0.35%)` | :arrow_up: |
   | 
[rust/arrow/src/ipc/convert.rs](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2NvbnZlcnQucnM=)
 | `92.56% <0.00%> (+0.45%)` | :arrow_up: |
   | ... and [8 
more](https://codecov.io/gh/apache/arrow/pull/9291/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=footer). Last 
update 
[c413566...beacc5d](https://codecov.io/gh/apache/arrow/pull/9291?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao closed pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default

2021-01-21 Thread GitBox


jorgecarleitao closed pull request #9122:
URL: https://github.com/apache/arrow/pull/9122


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9291:
URL: https://github.com/apache/arrow/pull/9291#discussion_r562432512



##
File path: rust/arrow/src/array/ord.rs
##
@@ -66,7 +68,9 @@ where
 {
     let left = left.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
     let right = right.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
-    Box::new(move |i, j| cmp_nans_last(&left.value(i), &right.value(j)))
+    let left = left.values();
+    let right = right.values();
+    Box::new(move |i, j| cmp_nans_last(&left[i], &right[j]))

Review comment:
   this will be slower, but at this point we have no guarantee that `i` and 
`j` are within bounds.
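   For illustration, a small standalone sketch of the trade-off being discussed (hypothetical helpers, not arrow APIs): safe slice indexing performs a bounds check and panics on violation, while the unchecked variant shifts that guarantee onto the caller.

   ```rust
   // Hypothetical helpers illustrating checked vs. unchecked access.
   fn checked_get(values: &[f64], i: usize) -> f64 {
       values[i] // bounds-checked; panics if i >= values.len()
   }

   fn unchecked_get(values: &[f64], i: usize) -> f64 {
       // Safety: the caller must guarantee that i < values.len().
       unsafe { *values.get_unchecked(i) }
   }

   fn main() {
       let values = [1.0, 2.0, 3.0];
       assert_eq!(checked_get(&values, 1), 2.0);
       assert_eq!(unchecked_get(&values, 2), 3.0);
   }
   ```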





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9291:
URL: https://github.com/apache/arrow/pull/9291#issuecomment-765189910


   https://issues.apache.org/jira/browse/ARROW-11345



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`

2021-01-21 Thread GitBox


jorgecarleitao commented on pull request #9291:
URL: https://github.com/apache/arrow/pull/9291#issuecomment-765190026


   cc @tyrelr .



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao opened a new pull request #9291: ARROW-11345: Made most ops not rely on `value(i)`

2021-01-21 Thread GitBox


jorgecarleitao opened a new pull request #9291:
URL: https://github.com/apache/arrow/pull/9291


   # Problem
   
   Currently, a lot of our code relies on `PrimitiveArray::value(usize)`; however, that method is highly `unsafe`, as any `usize` larger than the length of the array allows reading arbitrary memory regions due to the lack of protections.
   
   # This PR:
   
   This PR changes much of our code to not rely on it for this operation, replacing it with a safe alternative. This PR is expected to improve the performance of the touched kernels, as we already have alternatives to efficiently create an array out of an iterator.
   
   todo: benchmark
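   As a rough sketch of the kind of rewrite this describes (illustrative only; the exact kernels touched by the PR differ), compare index-based access with building the output directly from an iterator:

   ```rust
   use arrow::array::{Array, Int64Array};

   // Index-based: every `value(i)` call relies on the caller keeping `i`
   // in range, which is exactly the hazard described above.
   fn double_indexed(input: &Int64Array) -> Int64Array {
       (0..input.len())
           .map(|i| {
               if input.is_null(i) {
                   None
               } else {
                   Some(input.value(i) * 2)
               }
           })
           .collect()
   }

   // Iterator-based: bounds are guaranteed by construction and the output
   // array is built directly from the iterator.
   fn double_iter(input: &Int64Array) -> Int64Array {
       input.iter().map(|v| v.map(|x| x * 2)).collect()
   }

   fn main() {
       let input = Int64Array::from(vec![Some(1), None, Some(3)]);
       let a = double_indexed(&input);
       let b = double_iter(&input);
       assert_eq!(a.iter().collect::<Vec<_>>(), b.iter().collect::<Vec<_>>());
   }
   ```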



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#issuecomment-761621256


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=h1) Report
   > Merging 
[#9215](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=desc) (41c29bc) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/499b6d0c5fbcf7d3aabf25e286cb8c300e6de52a?el=desc)
 (499b6d0) will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9215/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9215      +/-   ##
   ==========================================
   + Coverage   81.64%   81.68%   +0.03%
   ==========================================
     Files         215      215
     Lines       52419    52501      +82
   ==========================================
   + Hits        42798    42885      +87
   + Misses       9621     9616       -5
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=)
 | `86.99% <100.00%> (ø)` | |
   | 
[rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz)
 | `91.75% <100.00%> (+0.76%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==)
 | `93.84% <100.00%> (+0.78%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz)
 | `94.77% <100.00%> (+0.24%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz)
 | `91.66% <100.00%> (+1.66%)` | :arrow_up: |
   | 
[rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==)
 | `85.10% <100.00%> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=)
 | `96.96% <100.00%> (-0.03%)` | :arrow_down: |
   | 
[rust/arrow/src/compute/kernels/limit.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xpbWl0LnJz)
 | `100.00% <100.00%> (ø)` | |
   | 
[rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz)
 | `98.92% <100.00%> (ø)` | |
   | 
[rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=)
 | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=footer). Last 
update 
[499b6d0...41c29bc](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on issue #9178: Support for the binary SQL type in rust/datafusion

2021-01-21 Thread GitBox


jorgecarleitao commented on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-765142713


   Hi @ilya-biryukov, have you tried?
   
   The following works for me (based on the example `simple_udf`):
   
   ```rust
   use arrow::{
       array::{ArrayRef, BinaryArray, Int64Array},
       datatypes::DataType,
       record_batch::RecordBatch,
       util::pretty,
   };

   use datafusion::error::Result;
   use datafusion::{physical_plan::functions::ScalarFunctionImplementation, prelude::*};
   use std::sync::Arc;

   // create local execution context with an in-memory table
   fn create_context() -> Result<ExecutionContext> {
       use arrow::datatypes::{Field, Schema};
       use datafusion::datasource::MemTable;
       // define a schema.
       let schema = Arc::new(Schema::new(vec![
           Field::new("c", DataType::Binary, false),
       ]));

       let a: &[u8] = b"";
       // define data.
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(BinaryArray::from(vec![Some(a), None, None, None])),
           ],
       )?;

       // declare a new context. In the Spark API, this corresponds to a new Spark SQL session
       let mut ctx = ExecutionContext::new();

       // declare a table in memory. In the Spark API, this corresponds to createDataFrame(...).
       let provider = MemTable::try_new(schema, vec![vec![batch]])?;
       ctx.register_table("t", Box::new(provider));
       Ok(ctx)
   }

   /// In this example we declare a UDF that takes a binary value and returns its length as an Int64
   #[tokio::main]
   async fn main() -> Result<()> {
       let mut ctx = create_context()?;

       // First, declare the actual implementation of the calculation
       let len: ScalarFunctionImplementation = Arc::new(|args: &[ArrayRef]| {
           // in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
           // 1. cast the values to the type we want
           // 2. perform the computation for every element in the array (using a loop or SIMD)

           // this is guaranteed by DataFusion based on the function's signature.
           assert_eq!(args.len(), 1);

           let value = &args[0]
               .as_any()
               .downcast_ref::<BinaryArray>()
               .expect("cast failed");

           // 2. run the UDF
           let array: Int64Array = value
               .iter()
               .map(|base| {
                   // in arrow, any value can be null.
                   base.map(|x| x.len() as i64)
               })
               .collect();
           Ok(Arc::new(array))
       });

       // Next:
       // * give it a name so that it shows nicely when the plan is printed
       // * declare what input it expects
       // * declare its return type
       let len = create_udf(
           "len",
           vec![DataType::Binary],
           Arc::new(DataType::Int64),
           len,
       );

       // at this point, we can use it or register it, depending on the use-case:
       // * if the UDF is expected to be used throughout the program in different contexts,
       //   we can register it, and call it later:
       ctx.register_udf(len.clone()); // clone is only required in this example because we show both usages

       // * if the UDF is expected to be used directly in the scope, `.call` it directly:
       let expr = len.call(vec![col("c")]);

       // get a DataFrame from the context
       let df = ctx.table("t")?;

       // equivalent to `'SELECT len(c) FROM t'`
       let df = df.select(vec![expr])?;

       // execute the query
       let results = df.collect().await?;

       // print the results
       pretty::print_batches(&results)?;

       Ok(())
   }
   ```
   
   ```
   +--------+
   | len(c) |
   +--------+
   | 4      |
   |        |
   |        |
   |        |
   +--------+
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io commented on pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.

2021-01-21 Thread GitBox


codecov-io commented on pull request #9290:
URL: https://github.com/apache/arrow/pull/9290#issuecomment-765149755


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=h1) Report
   > Merging 
[#9290](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=desc) (4e5220e) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/c413566b34bd0c13a01a68148bd78df1bdec3c10?el=desc)
 (c413566) will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9290/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9290   +/-   ##
   =======================================
     Coverage   81.61%   81.61%
   =======================================
     Files         215      215
     Lines       52508    52508
   =======================================
     Hits        42854    42854
     Misses       9654     9654
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=footer). Last 
update 
[c413566...0bdcf4b](https://codecov.io/gh/apache/arrow/pull/9290?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me closed pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.

2021-01-21 Thread GitBox


nevi-me closed pull request #9281:
URL: https://github.com/apache/arrow/pull/9281


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao opened a new pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.

2021-01-21 Thread GitBox


jorgecarleitao opened a new pull request #9290:
URL: https://github.com/apache/arrow/pull/9290


   I.e. use `collect` instead of the builder, which is simpler and faster
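   A hedged before/after sketch of what such a change looks like in user code (assuming the builder API of this arrow version; the functions below are illustrative, not the PR's diff):

   ```rust
   use arrow::array::{Int64Array, Int64Builder};

   // Builder-based construction.
   fn via_builder(values: &[Option<i64>]) -> Int64Array {
       let mut builder = Int64Builder::new(values.len());
       for v in values {
           match v {
               Some(x) => builder.append_value(*x).unwrap(),
               None => builder.append_null().unwrap(),
           }
       }
       builder.finish()
   }

   // Iterator-based construction: shorter, and lets arrow build the buffers
   // directly from the iterator.
   fn via_collect(values: &[Option<i64>]) -> Int64Array {
       values.iter().copied().collect()
   }

   fn main() {
       let values = vec![Some(1), None, Some(3)];
       let a = via_builder(&values);
       let b = via_collect(&values);
       assert_eq!(a.iter().collect::<Vec<_>>(), b.iter().collect::<Vec<_>>());
   }
   ```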



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


tyrelr commented on a change in pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#discussion_r562406187



##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   That makes sense.  I'll rebase dropping that commit.

##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   safety overhead on a few benchmarks seems to be 15-30%
   ```
   group                         master                            safe-prim                         unsafe-prim
   -----                         ------                            ---------                         -----------
   add 512                       1.00    307.4±4.79ns    ? B/sec   1.07    327.7±6.48ns    ? B/sec   1.30    400.5±7.09ns    ? B/sec
   add_nulls_512                 1.20    447.8±8.09ns    ? B/sec   1.32    490.9±3.41ns    ? B/sec   1.00    372.2±5.96ns    ? B/sec
   buffer_bit_ops or             1.17    731.0±20.02ns   ? B/sec   1.19    743.7±5.43ns    ? B/sec   1.00    624.0±10.86ns   ? B/sec
   cast date64 to date32 512     1.00    7.4±0.22µs      ? B/sec   1.10    8.2±0.07µs      ? B/sec   1.03    7.7±0.22µs      ? B/sec
   cast i64 to string 512        1.00    24.8±0.33µs     ? B/sec   1.09    27.1±0.79µs     ? B/sec   1.10    27.3±0.41µs     ? B/sec
   cast time32s to time32ms 512  1.00    917.0±19.20ns   ? B/sec   1.21    1113.0±22.04ns  ? B/sec   1.07    984.5±7.53ns    ? B/sec
   equal_512                     1.00    44.1±0.81ns     ? B/sec   1.13    49.9±0.65ns     ? B/sec   1.01    44.4±0.27ns     ? B/sec
   min nulls string 512          1.04    6.4±0.12µs      ? B/sec   1.14    7.0±0.15µs      ? B/sec   1.00    6.2±0.09µs      ? B/sec
   multiply 512                  1.14    471.5±10.15ns   ? B/sec   1.18    487.0±4.72ns    ? B/sec   1.00    413.4±6.16ns    ? B/sec
   not                           1.00    1102.3±14.55ns  ? B/sec   1.12    1235.9±19.91ns  ? B/sec   1.02    1120.3±19.83ns  ? B/sec
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9290: ARROW-11343: [DataFusion] Simplified example with UDF.

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9290:
URL: https://github.com/apache/arrow/pull/9290#issuecomment-765135950


   https://issues.apache.org/jira/browse/ARROW-11343



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9281:
URL: https://github.com/apache/arrow/pull/9281#issuecomment-764412043


   https://issues.apache.org/jira/browse/ARROW-11333



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#discussion_r561618199



##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   Thinking about it, I am not sure we want this: this will have severe 
implications to almost all uses of `value()`. Could you run the benches before 
and after this PR?
   
   My suggestion is that we mark this method as `unsafe`, so that it is clear 
that people need to be conscious about it. Since this will impact a large part 
of our code base (and likely others), I suggest that we do this change on a 
separate PR

##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   Thinking about it, I am not sure we want this: this will have severe 
implications to almost all uses of `value()`. Could you run the benches before 
and after this PR?
   
   My suggestion is that we mark this method as `unsafe`, so that it is clear 
that people need to be conscious about it. Since this will impact a large part 
of our code base (and likely others), I suggest that we do this change on a 
separate PR.
   
   I think that we would like to instruct users of this API to switch to 
`values()`. This change will not trigger that change (as the code continues to 
run), but will have an amazing performance implication.
   
   Think 
   
   ```rust
   for i in 0..array.len() {
       if array.value(i) > 2 {
           // do x
       };
   }
   ```
   this loop will suffer a lot from this change.
   
   we would like users to change it to 
   
   ```
   array.values().iter().for_each(|value| {
       if *value > 2 {
           // do x
       };
   });
   ```
   

##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   Thinking about it, I am not sure we want this: this will have severe 
implications to almost all uses of `value()`. Could you run the benches before 
and after this PR?
   
   My suggestion is that we mark this method as `unsafe`, so that it is clear 
that people need to be conscious about it. Since this will impact a large part 
of our code base (and likely others), I suggest that we do this change on a 
separate PR.
   
   I think that we would like to instruct users of this API to switch to 
`values()`. This change will not trigger that change (as the code continues to 
run), but will have an amazing performance implication.
   
   Think 
   
   ```rust
   for i in 0..array.len() {
       if array.value(i) > 2 {
           // do x
       };
   }
   ```
   this loop will suffer a lot from this change.
   
   we would like users to change it to 
   
   ```rust
   array.values().iter().for_each(|value| {
       if *value > 2 {
           // do x
       };
   });
   ```
   

##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
 }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

Review comment:
   Thinking about it, I am not sure we want this: this will have severe 
implications to almost all uses of `value()`. Could you run the benches before 
and after this PR?
   
   My suggestion is that we mark this 

[GitHub] [arrow] alamb commented on a change in pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking

2021-01-21 Thread GitBox


alamb commented on a change in pull request #9278:
URL: https://github.com/apache/arrow/pull/9278#discussion_r561366255



##
File path: rust/datafusion/src/logical_plan/expr.rs
##
@@ -422,6 +422,136 @@ impl Expr {
 nulls_first,
 }
 }
+
+/// Performs a depth first depth first walk of an expression and
+/// its children, calling `visitor.pre_visit` and
+/// `visitor.post_visit`.
+///
+/// Implements the [visitor 
pattern](https://en.wikipedia.org/wiki/Visitor_pattern) to
+/// separate expression algorithms from the structure of the
+/// `Expr` tree and make it easier to add new types of expressions
+/// and algorithms that walk the tree.
+///
+/// For an expression tree such as
+/// BinaryExpr (GT)
+///left: Column("foo")
+///right: Column("bar")
+///
+/// The nodes are visited using the following order
+/// ```text
+/// pre_visit(BinaryExpr(GT))
+/// pre_visit(Column("foo"))
+/// pre_visit(Column("bar"))
+/// post_visit(Column("bar"))
+/// post_visit(Column("bar"))
+/// post_visit(BinaryExpr(GT))
+/// ```
+///
+/// If an Err result is returned, recursion is stopped immediately
+///
+/// If `Recursion::Stop` is returned on a call to pre_visit, no
+/// children of that expression are visited, nor is post_visit
+/// called on that expression
+///
+    pub fn accept<V: ExpressionVisitor>(&self, visitor: V) -> Result<V> {
+let visitor = match visitor.pre_visit(self)? {
+Recursion::Continue(visitor) => visitor,
+// If the recursion should stop, do not visit children
+Recursion::Stop(visitor) => return Ok(visitor),
+};
+
+// recurse (and cover all expression types)
+let visitor = match self {
+Expr::Alias(expr, _) => expr.accept(visitor),
+Expr::Column(..) => Ok(visitor),
+Expr::ScalarVariable(..) => Ok(visitor),
+Expr::Literal(..) => Ok(visitor),
+Expr::BinaryExpr { left, right, .. } => {
+let visitor = left.accept(visitor)?;
+right.accept(visitor)
+}
+Expr::Not(expr) => expr.accept(visitor),
+Expr::IsNotNull(expr) => expr.accept(visitor),
+Expr::IsNull(expr) => expr.accept(visitor),
+Expr::Negative(expr) => expr.accept(visitor),
+Expr::Between {
+expr, low, high, ..
+} => {
+let visitor = expr.accept(visitor)?;
+let visitor = low.accept(visitor)?;
+high.accept(visitor)
+}
+Expr::Case {
+expr,
+when_then_expr,
+else_expr,
+} => {
+let visitor = if let Some(expr) = expr.as_ref() {
+expr.accept(visitor)
+} else {
+Ok(visitor)
+}?;
+let visitor = when_then_expr.iter().try_fold(
+visitor,
+|visitor, (when, then)| {
+let visitor = when.accept(visitor)?;
+then.accept(visitor)
+},
+)?;
+if let Some(else_expr) = else_expr.as_ref() {
+else_expr.accept(visitor)
+} else {
+Ok(visitor)
+}
+}
+Expr::Cast { expr, .. } => expr.accept(visitor),
+Expr::Sort { expr, .. } => expr.accept(visitor),
+Expr::ScalarFunction { args, .. } => args
+.iter()
+.try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+Expr::ScalarUDF { args, .. } => args
+.iter()
+.try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+Expr::AggregateFunction { args, .. } => args
+.iter()
+.try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+Expr::AggregateUDF { args, .. } => args
+.iter()
+.try_fold(visitor, |visitor, arg| arg.accept(visitor)),
+Expr::InList { expr, list, .. } => {
+let visitor = expr.accept(visitor)?;
+list.iter()
+.try_fold(visitor, |visitor, arg| arg.accept(visitor))
+}
+Expr::Wildcard => Ok(visitor),
+}?;
+
+visitor.post_visit(self)
+}
+}
+
+/// Controls how the visitor recursion should proceed.
+pub enum Recursion<V: ExpressionVisitor> {
+/// Attempt to visit all the children, recursively, of this expression.
+Continue(V),
+/// Do not visit the children of this expression, though the walk
+/// of parents of this expression will not be affected
+Stop(V),
+}
+
+/// Encode the traversal of an expression tree. When passed to
+/// `visit_expression`, `ExpressionVisitor::visit` is invoked


[GitHub] [arrow] Dandandan commented on a change in pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

2021-01-21 Thread GitBox


Dandandan commented on a change in pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#discussion_r561307105



##
File path: rust/arrow/src/compute/kernels/take.rs
##
@@ -423,7 +423,7 @@ where
 let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
 let offsets = offsets_buffer.typed_data_mut();
-let mut values = Vec::with_capacity(bytes_offset);
+let mut values = MutableBuffer::new(0);

Review comment:
   This used to be `bytes_offset`, but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a different capacity altogether.

##
File path: rust/arrow/src/compute/kernels/take.rs
##
@@ -423,7 +423,7 @@ where
 let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
 let offsets = offsets_buffer.typed_data_mut();
-let mut values = Vec::with_capacity(bytes_offset);
+let mut values = MutableBuffer::new(0);

Review comment:
   This used to be `bytes_offset`, but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a totally different capacity.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ElenaHenderson commented on pull request #9272: [WIP] Benchmark placebo

2021-01-21 Thread GitBox


ElenaHenderson commented on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-764916679


   please benchmark



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9174: ARROW-11220: [Rust] Implement GROUP BY support for Boolean

2021-01-21 Thread GitBox


alamb commented on pull request #9174:
URL: https://github.com/apache/arrow/pull/9174#issuecomment-764600978


   Thanks again for the contribution @ovr  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9122:
URL: https://github.com/apache/arrow/pull/9122#issuecomment-756042833


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=h1) Report
   > Merging 
[#9122](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=desc) (8ccc5bd) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc)
 (a0e1244) will **increase** coverage by `0.09%`.
   > The diff coverage is `99.06%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9122/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9122      +/-   ##
   ==========================================
   + Coverage   81.64%   81.74%   +0.09%
   ==========================================
     Files         215      215
     Lines       52489    52572      +83
   ==========================================
   + Hits        42857    42975     +118
   + Misses       9632     9597      -35
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz)
 | `88.43% <ø> (ø)` | |
   | 
[rust/arrow/src/util/integration\_util.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9pbnRlZ3JhdGlvbl91dGlsLnJz)
 | `69.95% <75.00%> (+3.08%)` | :arrow_up: |
   | 
[rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==)
 | `84.56% <100.00%> (+1.41%)` | :arrow_up: |
   | 
[rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==)
 | `87.82% <100.00%> (+4.60%)` | :arrow_up: |
   | 
[rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=)
 | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   | 
[rust/arrow/src/ipc/convert.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2NvbnZlcnQucnM=)
 | `92.56% <0.00%> (+0.45%)` | :arrow_up: |
   | 
[rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==)
 | `92.59% <0.00%> (+1.85%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=)
 | `86.40% <0.00%> (+1.94%)` | :arrow_up: |
   | ... and [1 
more](https://codecov.io/gh/apache/arrow/pull/9122/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=footer). Last 
update 
[a0e1244...8ccc5bd](https://codecov.io/gh/apache/arrow/pull/9122?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb closed pull request #9174: ARROW-11220: [Rust] Implement GROUP BY support for Boolean

2021-01-21 Thread GitBox


alamb closed pull request #9174:
URL: https://github.com/apache/arrow/pull/9174


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType

2021-01-21 Thread GitBox


ovr commented on a change in pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#discussion_r561946613



##
File path: rust/arrow/src/datatypes/mod.rs
##
@@ -199,6 +207,81 @@ pub struct Field {
 metadata: Option>,
 }
 
+// Decimal (precision, scale) = Decimal(1, 2) = 1.00
+pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
+const MAX_DIGITS: usize;
+
+// fn into_json_value(self) -> Option;
+
+fn get_byte_width_for_precision_scale(precision: usize, scale: usize) -> 
usize;
+
+// Rescale scale part
+fn rescale( self, scale: usize);
+
+// Try to parse string with precision, scale
+fn parse(
+string: ,
+precision: usize,
+scale: usize,
+) -> std::result::Result;
+
+fn to_byte_slice() -> Vec;
+
+fn from_bytes_with_precision_scale(
+bytes: &[u8],
+precision: usize,
+scale: usize,
+) -> Self;
+}
+
+#[derive(Debug)]
+pub enum DecimalType {

Review comment:
   @alamb @andygrove @Dandandan @nevi-me @jorgecarleitao
   
   I created representations for Decimal as `Int32DecimalType`, 
`Int64DecimalType`, `Int128DecimalType` or `LargeDecimal`, which uses BigInt.
   
   1. What is the best way in Rust to abstract representations of 
`DataType::Decimal` in Rust? Enum?
   2. How should I work with `DecimalArray`?  Should make it as generic by 
passing representation type to it (but how should I detect type, match in 
everyplace?) or make it generic on top of DecimalType and get byte size from it?
   
   Thanks

##
File path: rust/arrow/src/datatypes/mod.rs
##
@@ -199,6 +207,81 @@ pub struct Field {
 metadata: Option>,
 }
 
+// Decimal (precision, scale) = Decimal(1, 2) = 1.00
+pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
+const MAX_DIGITS: usize;
+
+// fn into_json_value(self) -> Option;
+
+fn get_byte_width_for_precision_scale(precision: usize, scale: usize) -> 
usize;
+
+// Rescale scale part
+fn rescale( self, scale: usize);
+
+// Try to parse string with precision, scale
+fn parse(
+string: ,
+precision: usize,
+scale: usize,
+) -> std::result::Result;
+
+fn to_byte_slice() -> Vec;
+
+fn from_bytes_with_precision_scale(
+bytes: &[u8],
+precision: usize,
+scale: usize,
+) -> Self;
+}
+
+#[derive(Debug)]
+pub enum DecimalType {

Review comment:
   @alamb @andygrove @Dandandan @nevi-me @jorgecarleitao
   
   I created representations for Decimal as `Int32DecimalType`, 
`Int64DecimalType`, `Int128DecimalType` or `LargeDecimal`, which uses BigInt.
   
   1. What is the best way in Rust to abstract representations of 
`DataType::Decimal` in Rust? Enum?
   2. How should I work with `DecimalArray`?  Should I make it as generic by 
passing representation type to it (but how should I detect type, match in 
everyplace? As I have known functions cannot return type) or make it generic on 
top of DecimalType and get byte size from it?
   
   Thanks

##
File path: rust/arrow/src/datatypes/mod.rs
##
@@ -199,6 +207,81 @@ pub struct Field {
 metadata: Option>,
 }
 
+// Decimal (precision, scale) = Decimal(1, 2) = 1.00
+pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
+const MAX_DIGITS: usize;
+
+// fn into_json_value(self) -> Option;
+
+fn get_byte_width_for_precision_scale(precision: usize, scale: usize) -> 
usize;
+
+// Rescale scale part
+fn rescale( self, scale: usize);
+
+// Try to parse string with precision, scale
+fn parse(
+string: ,
+precision: usize,
+scale: usize,
+) -> std::result::Result;
+
+fn to_byte_slice() -> Vec;
+
+fn from_bytes_with_precision_scale(
+bytes: &[u8],
+precision: usize,
+scale: usize,
+) -> Self;
+}
+
+#[derive(Debug)]
+pub enum DecimalType {

Review comment:
   @alamb @andygrove @Dandandan @nevi-me @jorgecarleitao
   
   I created representations for Decimal as `Int32DecimalType`, 
`Int64DecimalType`, `Int128DecimalType` or `LargeDecimal`, which uses BigInt.
   
   1. What is the best way to abstract representations of `DataType::Decimal` in Rust? An enum?
   2. How should I work with `DecimalArray`? Should I make it generic by passing the representation type to it (but then how would I detect the type? match everywhere? As far as I know, functions cannot return a type), or make it generic on top of DecimalType and get the byte size from it?
   
   Thanks
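   Purely as an illustration of the enum option raised in question 1 (hypothetical types, not the design arrow adopted), the physical representation could be hidden behind a single enum:

   ```rust
   // Hypothetical wrapper over the three fixed-width representations.
   #[derive(Debug, Clone, PartialEq)]
   enum DecimalValue {
       Int32(i32),
       Int64(i64),
       Int128(i128),
   }

   impl DecimalValue {
       // The byte width follows directly from the chosen representation.
       fn byte_width(&self) -> usize {
           match self {
               DecimalValue::Int32(_) => 4,
               DecimalValue::Int64(_) => 8,
               DecimalValue::Int128(_) => 16,
           }
       }
   }

   fn main() {
       assert_eq!(DecimalValue::Int128(100).byte_width(), 16);
   }
   ```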





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking

2021-01-21 Thread GitBox


alamb commented on pull request #9278:
URL: https://github.com/apache/arrow/pull/9278#issuecomment-764011355


   Thanks @jorgecarleitao  and @Dandandan . I am personally quite excited at 
using this same idea for expression rewriting. 
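   For readers less familiar with the pattern, a minimal self-contained sketch of the idea, using simplified stand-ins for `Expr`, `Recursion`, and the visitor trait (these are not the actual DataFusion definitions): a visitor that collects the column names referenced by an expression.

   ```rust
   // Simplified stand-ins, for illustration only.
   enum Expr {
       Column(String),
       BinaryExpr { left: Box<Expr>, right: Box<Expr> },
   }

   enum Recursion<V> {
       Continue(V),
       Stop(V),
   }

   trait ExpressionVisitor: Sized {
       fn pre_visit(self, expr: &Expr) -> Recursion<Self>;
       fn post_visit(self, _expr: &Expr) -> Self {
           self
       }
   }

   impl Expr {
       // Depth-first walk: pre_visit, recurse into children, then post_visit.
       fn accept<V: ExpressionVisitor>(&self, visitor: V) -> V {
           let visitor = match visitor.pre_visit(self) {
               Recursion::Continue(v) => v,
               Recursion::Stop(v) => return v,
           };
           let visitor = match self {
               Expr::Column(_) => visitor,
               Expr::BinaryExpr { left, right } => right.accept(left.accept(visitor)),
           };
           visitor.post_visit(self)
       }
   }

   /// Collects the names of all columns referenced in an expression.
   struct ColumnCollector {
       columns: Vec<String>,
   }

   impl ExpressionVisitor for ColumnCollector {
       fn pre_visit(mut self, expr: &Expr) -> Recursion<Self> {
           if let Expr::Column(name) = expr {
               self.columns.push(name.clone());
           }
           Recursion::Continue(self)
       }
   }

   fn main() {
       let expr = Expr::BinaryExpr {
           left: Box::new(Expr::Column("foo".to_string())),
           right: Box::new(Expr::Column("bar".to_string())),
       };
       let visitor = expr.accept(ColumnCollector { columns: vec![] });
       assert_eq!(visitor.columns, vec!["foo", "bar"]);
   }
   ```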



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions

2021-01-21 Thread GitBox


nevi-me commented on a change in pull request #9240:
URL: https://github.com/apache/arrow/pull/9240#discussion_r562158835



##
File path: rust/parquet/src/arrow/array_reader.rs
##
@@ -912,11 +912,36 @@ impl ArrayReader for 
ListArrayReader {
 ));
 }
 
-// Need to remove from the values array the nulls that represent null 
lists rather than null items
-// null lists have def_level = 0
+// List definitions can be encoded as 4 values:
+// - n + 0: the list slot is null
+// - n + 1: the list slot is not null, but is empty (i.e. [])
+// - n + 2: the list slot is not null, but its child is empty (i.e. [ 
null ])
+// - n + 3: the list slot is not null, and its child is not empty
+// Where n is the max definition level of the list's parent.
+// If a Parquet schema's only leaf is the list, then n = 0.
+
+// TODO: ARROW-10391 - add a test case with a non-nullable child, 
check if max is 3
+let list_field_type = match self.get_data_type() {
+ArrowType::List(field)
+| ArrowType::FixedSizeList(field, _)
+| ArrowType::LargeList(field) => field,
+_ => {
+// Panic: this is safe as we only write lists from list 
datatypes
+unreachable!()

Review comment:
   They don't make any difference, `unreachable!()` will call `panic!()` 
with a message about unreachable code being reached. So it's probably a more 
descriptive panic.
   
   I thought that marking a condition as `unreachable!()` lets the compiler optimise out that condition, but it seems like only its `unsafe` equivalent does.
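   A small standard-library-only sketch of the distinction: `unreachable!()` panics with a descriptive message, while the `unsafe` `std::hint::unreachable_unchecked()` is the variant the optimiser is allowed to exploit.

   ```rust
   fn describe_checked(n: u8) -> &'static str {
       match n % 2 {
           0 => "even",
           1 => "odd",
           // Reaching this arm is a bug; panic with a descriptive message.
           _ => unreachable!("n % 2 can only be 0 or 1"),
       }
   }

   fn describe_unchecked(n: u8) -> &'static str {
       match n % 2 {
           0 => "even",
           1 => "odd",
           // Safety: n % 2 is always 0 or 1, so this arm can never be taken.
           // Unlike unreachable!(), this is a promise to the optimiser.
           _ => unsafe { std::hint::unreachable_unchecked() },
       }
   }

   fn main() {
       assert_eq!(describe_checked(4), "even");
       assert_eq!(describe_unchecked(7), "odd");
   }
   ```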





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9289: ARROW-11341: [Python] [Gandiva] Add NULL/None checks to Gandiva builder functions

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9289:
URL: https://github.com/apache/arrow/pull/9289#issuecomment-765105119


   https://issues.apache.org/jira/browse/ARROW-11341



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily

2021-01-21 Thread GitBox


pitrou commented on pull request #8240:
URL: https://github.com/apache/arrow/pull/8240#issuecomment-764950406







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation

2021-01-21 Thread GitBox


pitrou commented on pull request #9265:
URL: https://github.com/apache/arrow/pull/9265#issuecomment-764784559


   Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/213944945 
. Will merge.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead

2021-01-21 Thread GitBox


wesm commented on a change in pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#discussion_r562121639



##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
   I can merge these together then. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ElenaHenderson removed a comment on pull request #9272: [WIP] Benchmark placebo

2021-01-21 Thread GitBox


ElenaHenderson removed a comment on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-763278345







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9276:
URL: https://github.com/apache/arrow/pull/9276#issuecomment-763672876


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=h1) Report
   > Merging 
[#9276](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=desc) (07d8ce8) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc)
 (a0e1244) will **increase** coverage by `0.01%`.
   > The diff coverage is `80.41%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9276/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9276      +/-   ##
   ==========================================
   + Coverage   81.64%   81.66%   +0.01%
   ==========================================
     Files         215      215
     Lines       52489    52512      +23
   ==========================================
   + Hits        42857    42884      +27
   + Misses       9632     9628       -4
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9276/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==)
 | `82.03% <80.41%> (+7.74%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=footer). Last 
update 
[a0e1244...07d8ce8](https://codecov.io/gh/apache/arrow/pull/9276?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9278:
URL: https://github.com/apache/arrow/pull/9278#discussion_r561229060



##
File path: rust/datafusion/src/logical_plan/expr.rs
##
@@ -422,6 +422,136 @@ impl Expr {
 nulls_first,
 }
 }
+
+/// Performs a depth first depth first walk of an expression and

Review comment:
   ```suggestion
   /// Performs a depth first walk of an expression and
   ```

##
File path: rust/datafusion/src/logical_plan/expr.rs
##
@@ -422,6 +422,136 @@ impl Expr {
 nulls_first,
 }
 }
+
+/// Performs a depth first depth first walk of an expression and
+/// its children, calling `visitor.pre_visit` and

Review comment:
   ```suggestion
   /// its children, calling [`ExpressionVisitor::pre_visit`] and
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #8920: ARROW-10831: [C++][Compute] Implement quantile kernel

2021-01-21 Thread GitBox


pitrou closed pull request #8920:
URL: https://github.com/apache/arrow/pull/8920


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] xhochy commented on a change in pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-01-21 Thread GitBox


xhochy commented on a change in pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#discussion_r562161018



##
File path: cpp/src/arrow/adapters/orc/adapter_util.cc
##
@@ -316,10 +326,482 @@ Status AppendBatch(const liborc::Type* type, 
liborc::ColumnVectorBatch* batch,
   }
 }
 
+template 
+Status FillNumericBatch(const DataType* type, liborc::ColumnVectorBatch* 
cbatch,
+int64_t& arrowOffset, int64_t& orcOffset, int64_t 
length,
+Array* parray, std::vector* incomingMask) {
+  auto array = checked_cast(parray);
+  auto batch = checked_cast(cbatch);
+  int64_t arrowLength = array->length();
+  if (!arrowLength) return Status::OK();
+  if (array->null_count() || incomingMask) batch->hasNulls = true;
+  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, 
arrowOffset++) {
+if (array->IsNull(arrowOffset) || (incomingMask && 
!(*incomingMask)[orcOffset])) {
+  batch->notNull[orcOffset] = false;
+} else {
+  batch->data[orcOffset] = array->Value(arrowOffset);
+  batch->notNull[orcOffset] = true;
+}
+  }
+  batch->numElements = orcOffset;
+  return Status::OK();
+}
+
+template 
+Status FillNumericBatchCast(const DataType* type, liborc::ColumnVectorBatch* 
cbatch,
+int64_t& arrowOffset, int64_t& orcOffset, int64_t 
length,
+Array* parray, std::vector* incomingMask) {
+  auto array = checked_cast(parray);
+  auto batch = checked_cast(cbatch);
+  int64_t arrowLength = array->length();
+  if (!arrowLength) return Status::OK();
+  if (array->null_count() || incomingMask) batch->hasNulls = true;
+  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, 
arrowOffset++) {
+if (array->IsNull(arrowOffset) || (incomingMask && 
!(*incomingMask)[orcOffset])) {
+  batch->notNull[orcOffset] = false;
+} else {
+  batch->data[orcOffset] = 
static_cast(array->Value(arrowOffset));
+  batch->notNull[orcOffset] = true;
+}
+  }
+  batch->numElements = orcOffset;
+  return Status::OK();
+}
+
+Status FillDate64Batch(const DataType* type, liborc::ColumnVectorBatch* cbatch,
+   int64_t& arrowOffset, int64_t& orcOffset, int64_t 
length,
+   Array* parray, std::vector* incomingMask) {
+  auto array = checked_cast(parray);
+  auto batch = checked_cast(cbatch);
+  int64_t arrowLength = array->length();
+  if (!arrowLength) return Status::OK();
+  if (array->null_count() || incomingMask) batch->hasNulls = true;
+  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, 
arrowOffset++) {
+if (array->IsNull(arrowOffset) || (incomingMask && 
!(*incomingMask)[orcOffset])) {
+  batch->notNull[orcOffset] = false;
+} else {
+  int64_t miliseconds = array->Value(arrowOffset);
+  batch->data[orcOffset] =
+  static_cast(std::floor(miliseconds / kOneSecondMillis));
+  batch->nanoseconds[orcOffset] =
+  (miliseconds - kOneSecondMillis * batch->data[orcOffset]) * 
kOneMilliNanos;
+  batch->notNull[orcOffset] = true;
+}
+  }
+  batch->numElements = orcOffset;
+  return Status::OK();
+}
+
+Status FillTimestampBatch(const DataType* type, liborc::ColumnVectorBatch* 
cbatch,
+  int64_t& arrowOffset, int64_t& orcOffset, int64_t 
length,
+  Array* parray, std::vector* incomingMask) {
+  auto array = checked_cast(parray);
+  auto batch = checked_cast(cbatch);
+  int64_t arrowLength = array->length();
+  if (!arrowLength) return Status::OK();
+  if (array->null_count() || incomingMask) batch->hasNulls = true;
+  for (; orcOffset < length && arrowOffset < arrowLength; orcOffset++, 
arrowOffset++) {
+if (array->IsNull(arrowOffset) || (incomingMask && 
!(*incomingMask)[orcOffset])) {
+  batch->notNull[orcOffset] = false;
+} else {
+  int64_t data = array->Value(arrowOffset);
+  batch->notNull[orcOffset] = true;
+  switch (std::static_pointer_cast(array->type())->unit()) {
+case TimeUnit::type::SECOND: {
+  batch->data[orcOffset] = data;
+  batch->nanoseconds[orcOffset] = 0;
+  break;
+}
+case TimeUnit::type::MILLI: {
+  batch->data[orcOffset] =
+  static_cast(std::floor(data / kOneSecondMillis));
+  batch->nanoseconds[orcOffset] =
+  (data - kOneSecondMillis * batch->data[orcOffset]) * 
kOneMilliNanos;
+  break;
+}
+case TimeUnit::type::MICRO: {
+  batch->data[orcOffset] =
+  static_cast(std::floor(data / kOneSecondMicros));
+  batch->nanoseconds[orcOffset] =
+  (data - kOneSecondMicros * batch->data[orcOffset]) * 
kOneMicroNanos;
+  break;
+}
+default: {
+  batch->data[orcOffset] =
+  static_cast(std::floor(data / kOneSecondNanos));
+  batch->nanoseconds[orcOffset] = data - 

[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer over

2021-01-21 Thread GitBox


WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to 
estimate the buffer size without actually filling up the vector. 
   
   Why is it not possible? For a variable-width vector, the buffer layout is quite 
simple:
   1) validity buffer: nbytes = ceil(N/8)
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of the byte lengths of all non-null values
   
   For a variable-width vector, the only additional things the `setValueCount` 
method does are:
   1) fill holes (fill the offset buffer gap left by trailing consecutive NULL values)
   2) reset the offset reader/writer index (this looks like it has some side effect)
   But not calling `setValueCount` does not prevent us from computing the 
"vector buffer size".
   
   Does that make sense?
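
   A minimal sketch of that arithmetic (written in Rust purely to illustrate the three 
formulas above; `estimated_buffer_size`, `value_count`, `offset_width` and 
`value_byte_lengths` are hypothetical names for this illustration, not part of the 
Arrow Java API):
   
   ```rust
   /// Estimate the total buffer size of a variable-width vector without
   /// materializing it. `value_byte_lengths` holds the byte length of each
   /// slot; null slots contribute 0 bytes to the value buffer.
   fn estimated_buffer_size(
       value_count: usize,
       offset_width: usize,
       value_byte_lengths: &[usize],
   ) -> usize {
       // 1) validity buffer: one bit per slot, rounded up to whole bytes
       let validity = (value_count + 7) / 8;
       // 2) offset buffer: one offset per slot plus the trailing offset
       let offsets = offset_width * (value_count + 1);
       // 3) value buffer: sum of the byte lengths of all non-null values
       let values: usize = value_byte_lengths.iter().sum();
       validity + offsets + values
   }
   
   fn main() {
       // e.g. 3 slots ("ab", null, "cde") in a vector with 4-byte offsets
       let size = estimated_buffer_size(3, 4, &[2, 0, 3]);
       assert_eq!(size, 1 + 16 + 5);
   }
   ```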
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb closed pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking

2021-01-21 Thread GitBox


alamb closed pull request #9278:
URL: https://github.com/apache/arrow/pull/9278


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #9274: ARROW-11299: [C++][Python] Fix invalid-offsetof warnings

2021-01-21 Thread GitBox


cyb70289 commented on pull request #9274:
URL: https://github.com/apache/arrow/pull/9274#issuecomment-764217852


   > @cyb70289 `arrow::dataset::Expression::Call::options` requires that 
virtual destructor since a `shared_ptr` is used to store 
subclasses of FunctionOptions. Removing this virtual destructor results in a 
resource leak since the derived classes' (implicit) destructors are not being 
called, leaking data members such as 
`arrow::compute::SetLookupOptions::value_set` which are not destroyed.
   
   Thanks @bkietz, I missed this point.
   Will update according to your comment.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9235:
URL: https://github.com/apache/arrow/pull/9235#issuecomment-761864117







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)

2021-01-21 Thread GitBox


jorgecarleitao edited a comment on pull request #9235:
URL: https://github.com/apache/arrow/pull/9235#issuecomment-763898758







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9261:
URL: https://github.com/apache/arrow/pull/9261#issuecomment-764691368


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=h1) Report
   > Merging 
[#9261](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=desc) (8d10789) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc)
 (1393188) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9261/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree)
   
   ```diff
   @@           Coverage Diff            @@
   ##           master     #9261   +/-   ##
   ========================================
     Coverage   81.61%    81.61%           
   ========================================
     Files         215       215           
     Lines       51867     52508   +641    
   ========================================
   + Hits        42329     42853   +524    
   - Misses       9538      9655   +117    
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=)
 | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
   | 
[rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=)
 | `39.43% <0.00%> (-14.50%)` | :arrow_down: |
   | 
[rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==)
 | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
   | 
[rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=)
 | `49.01% <0.00%> (-9.17%)` | :arrow_down: |
   | 
[rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==)
 | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
   | 
[rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=)
 | `53.12% <0.00%> (-5.21%)` | :arrow_down: |
   | 
[rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz)
 | `93.02% <0.00%> (-5.13%)` | :arrow_down: |
   | 
[rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz)
 | `95.00% <0.00%> (-5.00%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==)
 | `74.28% <0.00%> (-4.51%)` | :arrow_down: |
   | ... and [91 
more](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=footer). Last 
update 
[903b41c...8d10789](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9288: ARROW-11342: [Python] [Gandiva] Expose ToString and result type information

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9288:
URL: https://github.com/apache/arrow/pull/9288#issuecomment-765095930


   https://issues.apache.org/jira/browse/ARROW-11342



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rtyler commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions

2021-01-21 Thread GitBox


rtyler commented on a change in pull request #9240:
URL: https://github.com/apache/arrow/pull/9240#discussion_r561361029



##
File path: rust/parquet/src/arrow/array_reader.rs
##
@@ -912,11 +912,36 @@ impl ArrayReader for 
ListArrayReader {
 ));
 }
 
-// Need to remove from the values array the nulls that represent null 
lists rather than null items
-// null lists have def_level = 0
+// List definitions can be encoded as 4 values:
+// - n + 0: the list slot is null
+// - n + 1: the list slot is not null, but is empty (i.e. [])
+// - n + 2: the list slot is not null, but its child is empty (i.e. [ 
null ])
+// - n + 3: the list slot is not null, and its child is not empty
+// Where n is the max definition level of the list's parent.
+// If a Parquet schema's only leaf is the list, then n = 0.
+
+// TODO: ARROW-10391 - add a test case with a non-nullable child, 
check if max is 3
+let list_field_type = match self.get_data_type() {
+ArrowType::List(field)
+| ArrowType::FixedSizeList(field, _)
+| ArrowType::LargeList(field) => field,
+_ => {
+// Panic: this is safe as we only write lists from list 
datatypes
+unreachable!()

Review comment:
   For my own curiosity, is there a specific reason to use `unreachable!` 
rather than `panic!` in cases like this? I understand the outcome will be the 
same.
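
   For reference, a standalone illustration of the definition-level encoding described 
in the quoted comment (a hedged sketch assuming a nullable list with a nullable 
child, so n = 0; this is plain Rust for illustration, not the `array_reader` code itself):
   
   ```rust
   fn main() {
       // Encoding from the quoted comment, with n = 0:
       //   0 -> the list slot itself is null
       //   1 -> the list is present but empty, i.e. []
       //   2 -> the list is present but its item is null, i.e. [null]
       //   3 -> the list and its item are both present, i.e. [1]
       let slots: Vec<Option<Vec<Option<i64>>>> = vec![
           None,                         // null list             -> [0]
           Some(vec![]),                 // empty list            -> [1]
           Some(vec![None]),             // list with a null item -> [2]
           Some(vec![Some(1), Some(2)]), // two present items     -> [3, 3]
       ];
       let def_levels: Vec<i16> = slots
           .iter()
           .flat_map(|slot| match slot {
               None => vec![0],
               Some(items) if items.is_empty() => vec![1],
               Some(items) => items
                   .iter()
                   .map(|item| if item.is_some() { 3 } else { 2 })
                   .collect(),
           })
           .collect();
       assert_eq!(def_levels, vec![0, 1, 2, 3, 3]);
   }
   ```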





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] TheNeuralBit commented on a change in pull request #8385: ARROW-10220 [Javascript] Javascript toArray() method ignores nulls on some types.

2021-01-21 Thread GitBox


TheNeuralBit commented on a change in pull request #8385:
URL: https://github.com/apache/arrow/pull/8385#discussion_r562374331



##
File path: js/src/visitor/toarray.ts
##
@@ -88,18 +88,6 @@ export class ToArrayVisitor extends Visitor {}
 
 /** @ignore */
 function arrayOfVector(vector: VectorType): T['TArray'] 
{
-
-const { type, length, stride } = vector;
-
-// Fast case, return subarray if possible
-switch (type.typeId) {

Review comment:
   Instead of removing the fast case, can you just add a check that skips it 
if the type is nullable? As you pointed out in the Jira, we could still do this 
for non-nullable arrays.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-764778634


   https://issues.apache.org/jira/browse/ARROW-10370



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb closed pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty

2021-01-21 Thread GitBox


alamb closed pull request #9276:
URL: https://github.com/apache/arrow/pull/9276


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian

2021-01-21 Thread GitBox


pitrou commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r562063757



##
File path: cpp/src/arrow/array/util.h
##
@@ -56,6 +56,14 @@ Result> MakeArrayFromScalar(
 
 namespace internal {
 
+/// \brief Swap endian of each element in a generic ArrayData
+/// \param[in] data the array contents
+/// \param[in] type the array type
+/// \return the resulting Array instance whose elements were swapped
+ARROW_EXPORT
+Result> SwapEndianArrayData(
+std::shared_ptr& data, const std::shared_ptr& type);

Review comment:
   The first parameter should be `const` as well, it seems?

##
File path: cpp/src/arrow/array/util.h
##
@@ -56,6 +56,14 @@ Result> MakeArrayFromScalar(
 
 namespace internal {
 
+/// \brief Swap endian of each element in a generic ArrayData
+/// \param[in] data the array contents
+/// \param[in] type the array type

Review comment:
   The type is already on the data, is it necessary to pass it separately?

##
File path: dev/archery/tests/generate_files_for_endian_test.sh
##
@@ -0,0 +1,28 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+: ${ARROW_DIR:=/arrow}

Review comment:
   Can you add a comment explaining how to use this script?

##
File path: cpp/src/arrow/type.h
##
@@ -1641,6 +1655,17 @@ class ARROW_EXPORT Schema : public 
detail::Fingerprintable,
   bool Equals(const Schema& other, bool check_metadata = false) const;
   bool Equals(const std::shared_ptr& other, bool check_metadata = 
false) const;
 
+  /// \brief Replace endianness with platform-native endianness in the schema
+  ///
+  /// \return new Schema
+  std::shared_ptr WithNativeEndianness() const;
+
+  /// \brief Return endianness in the schema
+  Endianness endianness() const;
+
+  /// \brief Indicate if endianness is equal to platform-native endianness
+  bool IsNativeEndianness() const;

Review comment:
   I would call it `is_native_endian()`.

##
File path: cpp/src/arrow/array/util.cc
##
@@ -74,6 +75,220 @@ class ArrayDataWrapper {
   std::shared_ptr* out_;
 };
 
+class ArrayDataEndianSwapper {
+ public:
+  ArrayDataEndianSwapper(std::shared_ptr& data, int64_t length,
+ std::shared_ptr* out)
+  : data_(data), length_(length), out_(out) {}
+
+  Status SwapType(const DataType& type) {

Review comment:
   And where is the null bitmap handled? Did I miss something?

##
File path: cpp/src/arrow/array/util.cc
##
@@ -74,6 +75,220 @@ class ArrayDataWrapper {
   std::shared_ptr* out_;
 };
 
+class ArrayDataEndianSwapper {
+ public:
+  ArrayDataEndianSwapper(std::shared_ptr& data, int64_t length,
+ std::shared_ptr* out)
+  : data_(data), length_(length), out_(out) {}
+
+  Status SwapType(const DataType& type) {
+RETURN_NOT_OK(VisitTypeInline(type, this));
+RETURN_NOT_OK(SwapChildren(type.fields()));
+return Status::OK();
+  }
+
+  Status SwapChildren(std::vector> child_fields) {
+int i = 0;
+for (const auto& child_field : child_fields) {
+  ARROW_ASSIGN_OR_RAISE(
+  (*out_)->child_data[i],
+  SwapEndianArrayData(data_->child_data[i], 
child_field.get()->type()));
+  i++;
+}
+return Status::OK();
+  }
+
+  template 
+  Result> ByteSwapBuffer(std::shared_ptr& 
in_buffer,

Review comment:
   Either `const&` or no reference at all.

##
File path: ci/scripts/integration_arrow.sh
##
@@ -24,9 +24,16 @@ source_dir=${1}/cpp
 build_dir=${2}/cpp
 gold_dir_0_14_1=$arrow_dir/testing/data/arrow-ipc-stream/integration/0.14.1
 gold_dir_0_17_1=$arrow_dir/testing/data/arrow-ipc-stream/integration/0.17.1
+gold_dir_1_0_0_le=$arrow_dir/testing/data/arrow-ipc-stream/integration/1.0.0-littleendian
+gold_dir_1_0_0_be=$arrow_dir/testing/data/arrow-ipc-stream/integration/1.0.0-bigendian

Review comment:
   Sidenote: I see that "generated_large_batch" was added to these 
directories. Is it necessary?

##
File path: cpp/src/arrow/array/util.cc
##
@@ -74,6 +75,220 @@ class ArrayDataWrapper {
   std::shared_ptr* out_;
 };
 
+class ArrayDataEndianSwapper {
+ public:
+  

[GitHub] [arrow] jorisvandenbossche commented on pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests

2021-01-21 Thread GitBox


jorisvandenbossche commented on pull request #9282:
URL: https://github.com/apache/arrow/pull/9282#issuecomment-764589256


   @github-actions crossbow submit test-conda-python-3.7-pandas-master
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #9155: ARROW-6103: [Java] Remove mvn release plugin [WIP]

2021-01-21 Thread GitBox


kou commented on pull request #9155:
URL: https://github.com/apache/arrow/pull/9155#issuecomment-763933322







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tyrelr commented on pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


tyrelr commented on pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#issuecomment-764164139


   Rebased, but I didn't hit the merge conflicts during the rebase. I can see that a 
number of tests were switched to use `Buffer::from_slice_ref()`, so I'll add 
another commit to switch to that function as well, just to be safe.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9286: Fixed test case mistake that was generating compiler warning

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9286:
URL: https://github.com/apache/arrow/pull/9286#issuecomment-764995036







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#issuecomment-764142561


   https://issues.apache.org/jira/browse/ARROW-8928



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation

2021-01-21 Thread GitBox


pitrou commented on a change in pull request #9265:
URL: https://github.com/apache/arrow/pull/9265#discussion_r561909698



##
File path: cpp/src/arrow/util/io_util.cc
##
@@ -1482,31 +1482,51 @@ std::string MakeRandomName(int num_chars) {
 }  // namespace
 
 Result> TemporaryDir::Make(const std::string& 
prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+std::string suffix = MakeRandomName(kNumChars);
+return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =

Review comment:
   `TryCreatingDirectory` mutates the enclosing scope and is therefore 
coupled to this function. But I can move out MakeBaseName.

##
File path: cpp/src/arrow/util/io_util.cc
##
@@ -1482,31 +1482,51 @@ std::string MakeRandomName(int num_chars) {
 }  // namespace
 
 Result> TemporaryDir::Make(const std::string& 
prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+std::string suffix = MakeRandomName(kNumChars);
+return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =

Review comment:
   `TryCreatingDirectory` mutates the enclosing scope and is therefore 
coupled to this function. But I can move out `MakeBaseName`.

##
File path: cpp/src/arrow/util/io_util.cc
##
@@ -1482,31 +1482,51 @@ std::string MakeRandomName(int num_chars) {
 }  // namespace
 
 Result> TemporaryDir::Make(const std::string& 
prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+std::string suffix = MakeRandomName(kNumChars);
+return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+  [&](const NativePathString& base_dir) -> 
Result> {
+Status st;
+for (int attempt = 0; attempt < 3; ++attempt) {
+  PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+  auto result = CreateDir(fn);
+  if (!result.ok()) {
+// Probably a permissions error or a non-existing base_dir
+return nullptr;
+  }
+  if (*result) {
+return std::unique_ptr(new TemporaryDir(std::move(fn)));
+  }
+  // The random name already exists in base_dir, try with another name
+  st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+}
+return st;
+  };
+
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
 
   auto base_dirs = GetPlatformTemporaryDirs();
   DCHECK_NE(base_dirs.size(), 0);
 
-  auto st = Status::OK();
-  for (const auto& p : base_dirs) {
-PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-auto result = CreateDir(fn);
-if (!result.ok()) {
-  st = result.status();
-  continue;
-}
-if (!*result) {
-  // XXX Should we retry with another random name?
-  return Status::IOError("Path already exists: '", fn.ToString(), "'");
-} else {
-  return std::unique_ptr(new TemporaryDir(std::move(fn)));
+  for (const auto& base_dir : base_dirs) {
+ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));
+if (ptr) {
+  return std::move(ptr);

Review comment:
   Because some old gcc fails compiling without it:
   https://github.com/apache/arrow/runs/1728457990#step:9:552
   https://github.com/apache/arrow/runs/1728457638#step:7:992

##
File path: cpp/src/arrow/util/io_util.cc
##
@@ -1482,31 +1482,51 @@ std::string MakeRandomName(int num_chars) {
 }  // namespace
 
 Result> TemporaryDir::Make(const std::string& 
prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+std::string suffix = MakeRandomName(kNumChars);
+return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+  [&](const NativePathString& base_dir) -> 
Result> {
+Status st;
+for (int attempt = 0; attempt < 3; ++attempt) {
+  PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+  auto result = CreateDir(fn);
+  if (!result.ok()) {
+// Probably a permissions error or a non-existing base_dir
+return nullptr;
+  }
+  if (*result) {
+return std::unique_ptr(new TemporaryDir(std::move(fn)));
+  }
+  // The random name already exists in base_dir, try with another name
+  st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+

[GitHub] [arrow] alamb commented on pull request #9275: ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results

2021-01-21 Thread GitBox


alamb commented on pull request #9275:
URL: https://github.com/apache/arrow/pull/9275#issuecomment-764606664







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #9265: ARROW-11320: [C++] Try to strengthen temporary dir creation

2021-01-21 Thread GitBox


pitrou closed pull request #9265:
URL: https://github.com/apache/arrow/pull/9265


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

2021-01-21 Thread GitBox


jorisvandenbossche commented on a change in pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#discussion_r562029994



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1337,7 +1337,7 @@ def test_construct_from_single_file(tempdir):
 # instantiate from a single file with a filesystem object
 d2 = ds.dataset(path, filesystem=fs.LocalFileSystem())
 # instantiate from a single file with prefixed filesystem URI
-d3 = ds.dataset(relative_path, filesystem=_filesystem_uri(directory))
+d3 = ds.dataset(str(relative_path), filesystem=_filesystem_uri(directory))

Review comment:
   This change is needed because `relative_path` is a pathlib.Path object, 
but after this change, we only support that for actual LocalFilesystem, while 
here we have a SubtreeFilesystem. 
   I suppose that in theory I could check for SubtreeFilesystem that wraps 
LocalFilesystem as well (but not sure we want to encourage using pathlib.Path 
for relative paths, and when having an absolute path, you never need 
SubtreeFileSystem).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #9151: ARROW-11173: [Java] Add map type in complex reader / writer

2021-01-21 Thread GitBox


liyafan82 commented on pull request #9151:
URL: https://github.com/apache/arrow/pull/9151#issuecomment-764632071


   @nbruno Thanks for the PR. In the Java implementation, the Map type is 
implemented as a List of Struct, so we should already support the Map type? 
What is the motivation for the new implementation?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9287: ARROW-11340: [C++] Add vcpkg.json manifest to cpp project root

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9287:
URL: https://github.com/apache/arrow/pull/9287#issuecomment-765071688







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] Dandandan commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

2021-01-21 Thread GitBox


Dandandan commented on pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#issuecomment-763951458


   I think a nice follow-up for `take_string` would be to use the offsets data 
to calculate the total length of the value buffer to allocate, which should be 
pretty fast, and then allocate based on that (a rough sketch follows below).
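
   A rough sketch of that idea (a hypothetical free function for illustration, not the 
actual `take_string` kernel): sum the slot lengths from the source offsets for the 
taken indices, and use the result as the exact capacity of the value buffer.
   
   ```rust
   /// Compute the exact value-buffer capacity needed by a take on a string
   /// array, from its i32 offsets and the take indices.
   fn value_buffer_capacity(offsets: &[i32], indices: &[usize]) -> usize {
       indices
           .iter()
           .map(|&i| (offsets[i + 1] - offsets[i]) as usize)
           .sum()
   }
   
   fn main() {
       // offsets for ["ab", "c", "defg"]
       let offsets = [0, 2, 3, 7];
       // take rows 2 and 0 -> "defg" (4 bytes) + "ab" (2 bytes)
       assert_eq!(value_buffer_capacity(&offsets, &[2, 0]), 6);
   }
   ```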



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] westonpace commented on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily

2021-01-21 Thread GitBox


westonpace commented on pull request #8240:
URL: https://github.com/apache/arrow/pull/8240#issuecomment-765025427


   Sorry, I'm having trouble understanding what you are referencing. There was 
an RTools 3.5 failure caused by a gcc bug that requires you to `std::move` a 
`unique_ptr`, but I don't think that would apply here.
   
   The other compiler warning treated as an error seemed to be unique to gcc 9.3 
and TSan, and I didn't know there was any failing job related to it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty

2021-01-21 Thread GitBox


alamb commented on pull request #9276:
URL: https://github.com/apache/arrow/pull/9276#issuecomment-764615979


   I agree that #9281 looks much better. Thanks @jorgecarleitao 
   
   I might port the tests from this PR subsequently, but for now I don't think it 
is doing anything valuable.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou closed pull request #9286: ARROW-11337: [C++] Compilation error with ThreadSanitizer

2021-01-21 Thread GitBox


kou closed pull request #9286:
URL: https://github.com/apache/arrow/pull/9286


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nbruno commented on pull request #9151: ARROW-11173: [Java] Add map type in complex reader / writer

2021-01-21 Thread GitBox


nbruno commented on pull request #9151:
URL: https://github.com/apache/arrow/pull/9151#issuecomment-764664522


   @liyafan82 It might be implicitly supported, but you'd have to be aware of 
that implementation detail in order to work with it. It's also less fluent to 
work with a list of structs to imply a map in the readers / writers, rather 
than a dedicated map API. It was also noted as a gap in the API on the [mailing 
list](https://lists.apache.org/thread.html/r6e2de4be2cdea11e3f277aff2c23b2bda0782d9460d4dddf5c11f240%40%3Cuser.arrow.apache.org%3E).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead

2021-01-21 Thread GitBox


pitrou commented on a change in pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#discussion_r562057646



##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
   We already have `function_benchmark.cc` for such internals benchmarks.

##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+std::unique_ptr it =
+*ExecBatchIterator::Make(args, state.range(0));
+ExecBatch batch;
+while (it->Next()) {
+  for (int i = 0; i < num_fields; ++i) {
+auto data = batch.values[i].array()->buffers[1]->data();
+benchmark::DoNotOptimize(data);
+  }
+  continue;
+}
+benchmark::DoNotOptimize(batch);
+  }
+
+  state.SetItemsProcessed(state.iterations());
+}
+
+void BM_DatumSlice(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1000;
+
+  int num_datums = 1000;
+  std::vector datums(num_datums);
+  for (int i = 0; i < num_datums; ++i) {
+datums[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+for (const Datum& datum : datums) {
+  auto slice = datum.array()->Slice(16, 64);

Review comment:
   Is it significantly different from slicing an array? Otherwise, is there 
a point for the specific "slice a datum" operation?

##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+std::unique_ptr it =
+*ExecBatchIterator::Make(args, state.range(0));
+ExecBatch batch;
+while (it->Next()) {
+  for (int i = 0; i < num_fields; ++i) {
+auto data = batch.values[i].array()->buffers[1]->data();
+benchmark::DoNotOptimize(data);
+  }
+  continue;
+}

[GitHub] [arrow] github-actions[bot] commented on pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9282:
URL: https://github.com/apache/arrow/pull/9282#issuecomment-764589077







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflo

2021-01-21 Thread GitBox


liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-764197687







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #9272: [WIP] Benchmark placebo

2021-01-21 Thread GitBox


ursabot commented on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-764936041


   9272 - 30c8491386125d4315b179ec963552333d0494d4 - 
50ba53446ecc43a74d0edb35e8c0b23263d0440a



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #9122: ARROW-10299: [Rust] Use IPC Metadata V5 as default

2021-01-21 Thread GitBox


nevi-me commented on pull request #9122:
URL: https://github.com/apache/arrow/pull/9122#issuecomment-764332590


   > @nevi-me is this one ready to merge? Should we rebase it against latest 
master and rerun the checks? Or I can merge it in as is too
   
   I've rebased and changed a few comments: V5 will be the default in 4.0.0, and I 
pointed to the JIRA about duplicate column names.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


codecov-io commented on pull request #9261:
URL: https://github.com/apache/arrow/pull/9261#issuecomment-764691368


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=h1) Report
   > Merging 
[#9261](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=desc) (ce18ab7) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc)
 (1393188) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9261/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree)
   
   ```diff
   @@           Coverage Diff            @@
   ##           master     #9261   +/-   ##
   ========================================
     Coverage   81.61%    81.61%           
   ========================================
     Files         215       215           
     Lines       51867     52508   +641    
   ========================================
   + Hits        42329     42854   +525    
   - Misses       9538      9654   +116    
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=)
 | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
   | 
[rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=)
 | `39.43% <0.00%> (-14.50%)` | :arrow_down: |
   | 
[rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==)
 | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
   | 
[rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=)
 | `49.01% <0.00%> (-9.17%)` | :arrow_down: |
   | 
[rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==)
 | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
   | 
[rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=)
 | `53.12% <0.00%> (-5.21%)` | :arrow_down: |
   | 
[rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz)
 | `93.02% <0.00%> (-5.13%)` | :arrow_down: |
   | 
[rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz)
 | `95.00% <0.00%> (-5.00%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvbW1vbi5ycw==)
 | `74.28% <0.00%> (-4.51%)` | :arrow_down: |
   | ... and [91 
more](https://codecov.io/gh/apache/arrow/pull/9261/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=footer). Last 
update 
[903b41c...ce18ab7](https://codecov.io/gh/apache/arrow/pull/9261?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


codecov-io commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=h1) Report
   > Merging 
[#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=desc) (64abbc5) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc)
 (1393188) will **increase** coverage by `0.18%`.
   > The diff coverage is `93.35%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #9243      +/-   ##
   ===========================================
   + Coverage   81.61%    81.79%    +0.18%    
   ===========================================
     Files         215       215              
     Lines       51867     53138     +1271    
   ===========================================
   + Hits        42329     43466     +1137    
   - Misses       9538      9672      +134    
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==)
 | `78.94% <ø> (+1.81%)` | :arrow_up: |
   | 
[...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==)
 | `85.56% <85.39%> (-1.39%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=)
 | `94.44% <92.42%> (-4.10%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==)
 | `84.27% <94.47%> (+11.97%)` | :arrow_up: |
   | 
[rust/datafusion/src/physical\_plan/aggregates.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMucnM=)
 | `91.13% <100.00%> (ø)` | |
   | 
[rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==)
 | `99.84% <100.00%> (+<0.01%)` | :arrow_up: |
   | 
[rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=)
 | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==)
 | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
   | 
[rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==)
 | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
   | ... and [94 
more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=footer). Last 
update 
[1401359...64abbc5](https://codecov.io/gh/apache/arrow/pull/9243?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buff

2021-01-21 Thread GitBox


WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] andygrove commented on pull request #9155: ARROW-6103: [Java] Remove mvn release plugin [WIP]

2021-01-21 Thread GitBox


andygrove commented on pull request #9155:
URL: https://github.com/apache/arrow/pull/9155#issuecomment-763939243


   > Sorry for my late response.
   > 
   > We have a post manual(!) task that publishes Java packages to the Maven 
Central Repository: 
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaMavenartifactsinMavencentral
   > Can we do it by `mvn deploy` with source archive?
   
   Yes, that is my assumption. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] vertexclique commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


vertexclique commented on pull request #9261:
URL: https://github.com/apache/arrow/pull/9261#issuecomment-764560085







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)

2021-01-21 Thread GitBox


jorgecarleitao commented on pull request #9235:
URL: https://github.com/apache/arrow/pull/9235#issuecomment-763898758







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #9276: ARROW-11327: [Rust][DataFusion] Add DictionarySupport to create_batch_empty

2021-01-21 Thread GitBox


jorgecarleitao commented on pull request #9276:
URL: https://github.com/apache/arrow/pull/9276#issuecomment-764612354


   FYI, with #9281, this is (IMO) solved more broadly.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540



##
File path: rust/datafusion/src/physical_plan/functions.rs
##
@@ -60,10 +59,15 @@ pub enum Signature {
 // A function such as `array` is `VariadicEqual`
 // The first argument decides the type used for coercion
 VariadicEqual,
+/// fixed number of arguments of vector of vectors of valid types
+// A function of one argument of f64 is 
`Uniform(vc![vec![vec![DataType::Float64]]])`
+// A function of one argument of f64 or f32 is 
`Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+// A function of two arguments with first argument of f64 or f32 and 
second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, 
DataType::Float64], vec![DataType::Utf8]]])`
+Uniform(Vec>>),

Review comment:
   This signature generalizes `UniformEqual`, so wouldn't it be possible to 
generalize the other one instead of creating a new one (i.e. replace the existing 
one with the more general form)?
   
   `Signature` should be such that its variants form a complete set of options 
without overlaps.

##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
 }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii(args: &[ArrayRef]) -> 
Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| {
+x.map(|x: | {
+let mut chars = x.chars();
+chars.next().map_or(0, |v| v as i32)
+})
+})
+.collect())
+}
+
+/// Removes the longest string containing only characters in characters (a 
space by default) from the start and end of string.
+pub fn btrim(args: &[ArrayRef]) -> 
Result {
+match args.len() {
+0 => Err(DataFusionError::Internal(
+"btrim was called with 0 arguments. It requires at least 
1.".to_string(),
+)),
+1 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.map(|x| x.map(|x: | x.trim()))
+.collect())
+}
+2 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+let characters_array = args[1]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.enumerate()
+.map(|(i, x)| {
+if characters_array.is_null(i) {
+None
+} else {
+x.map(|x: | {
+let chars: Vec =
+characters_array.value(i).chars().collect();
+x.trim_start_matches([..])
+.trim_end_matches([..])
+})
+}
+})
+.collect())
+}
+other => Err(DataFusionError::Internal(format!(
+"btrim was called with {} arguments. It requires at most 2.",
+other
+))),
+}
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i32))
+.collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i64))
+.collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result {
+let array = args[0].as_any().downcast_ref::().unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x: Option| {
+x.map(|x| {
+if x == 0 {
+Err(DataFusionError::Internal(
+"null character not permitted.".to_string(),
+))
+} else {
+match core::char::from_u32(x as u32) {
+ 

[GitHub] [arrow] alamb closed pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


alamb closed pull request #9261:
URL: https://github.com/apache/arrow/pull/9261


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] codecov-io edited a comment on pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#issuecomment-761621256


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=h1) Report
   > Merging 
[#9215](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=desc) (5cd8075) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/a0e12445cc8689befe40c20e2fac6e6df252bef6?el=desc)
 (a0e1244) will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9215/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#9215  +/-   ##
   ==
   + Coverage   81.64%   81.68%   +0.03% 
   ==
 Files 215  215  
 Lines   5248952572  +83 
   ==
   + Hits4285742945  +88 
   + Misses   9632 9627   -5 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz)
 | `91.75% <100.00%> (+0.76%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==)
 | `84.84% <100.00%> (+1.62%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz)
 | `94.76% <100.00%> (+0.23%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz)
 | `91.66% <100.00%> (+1.66%)` | :arrow_up: |
   | 
[rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==)
 | `85.51% <100.00%> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=)
 | `96.96% <100.00%> (-0.03%)` | :arrow_down: |
   | 
[rust/arrow/src/compute/kernels/limit.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xpbWl0LnJz)
 | `100.00% <100.00%> (ø)` | |
   | 
[rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz)
 | `98.92% <100.00%> (ø)` | |
   | 
[rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9215/diff?src=pr=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=)
 | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=footer). Last 
update 
[a0e1244...9ca08d7](https://codecov.io/gh/apache/arrow/pull/9215?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] westonpace edited a comment on pull request #8240: ARROW-10038: [C++] Spawn thread pool threads lazily

2021-01-21 Thread GitBox


westonpace edited a comment on pull request #8240:
URL: https://github.com/apache/arrow/pull/8240#issuecomment-765025427


   Sorry, I am having trouble understanding what you are referencing.  There 
was an RTools 3.5 failure because of a gcc bug requiring you to std::move a 
unique_ptr but I don't think that would apply here.
   
   The other compiler warning treated as an error seemed to be unique to gcc 9.3 
with tsan, and I didn't know there was any failing job related to it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on a change in pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead

2021-01-21 Thread GitBox


cyb70289 commented on a change in pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#discussion_r561589279



##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+std::unique_ptr it =
+*ExecBatchIterator::Make(args, state.range(0));
+ExecBatch batch;
+while (it->Next()) {
+  for (int i = 0; i < num_fields; ++i) {
+auto data = batch.values[i].array()->buffers[1]->data();
+benchmark::DoNotOptimize(data);
+  }
+  continue;

Review comment:
   remove it?

##
File path: cpp/src/arrow/compute/internals_benchmark.cc
##
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/exec_internal.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+namespace detail {
+
+constexpr int32_t kSeed = 0xfede4a7e;
+
+void BM_ExecBatchIterator(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1 << 20;
+  const int num_fields = 10;
+
+  std::vector args(num_fields);
+  for (int i = 0; i < num_fields; ++i) {
+args[i] = rag.Int64(length, 0, 100)->data();
+  }
+
+  for (auto _ : state) {
+std::unique_ptr it =
+*ExecBatchIterator::Make(args, state.range(0));
+ExecBatch batch;
+while (it->Next()) {
+  for (int i = 0; i < num_fields; ++i) {
+auto data = batch.values[i].array()->buffers[1]->data();
+benchmark::DoNotOptimize(data);
+  }
+  continue;
+}
+benchmark::DoNotOptimize(batch);
+  }
+
+  state.SetItemsProcessed(state.iterations());
+}
+
+void BM_DatumSlice(benchmark::State& state) {
+  // Measure overhead related to deconstructing vector into a sequence 
of ExecBatch
+  random::RandomArrayGenerator rag(kSeed);
+
+  const int64_t length = 1000;
+
+  int num_datums = 1000;

Review comment:
   also define as const?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-01-21 Thread GitBox


mathyingzhou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-764050107


   > I didn't see anything rust related in this PR so I removed the Rust label
   
   @alamb Sorry it didn’t appear in the right place but there is nothing 
parquet-related in this PR either. Can the parquet label be removed? Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche closed pull request #9282: ARROW-11334: [Python][CI] Fix failing pandas nightly tests

2021-01-21 Thread GitBox


jorisvandenbossche closed pull request #9282:
URL: https://github.com/apache/arrow/pull/9282


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9283: ARROW-11235: [Python] Fix test failure inside non-default S3 region

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9283:
URL: https://github.com/apache/arrow/pull/9283#issuecomment-764594206


   https://issues.apache.org/jira/browse/ARROW-11235



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mathyingzhou commented on a change in pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-01-21 Thread GitBox


mathyingzhou commented on a change in pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#discussion_r561405645



##
File path: cpp/src/arrow/adapters/orc/adapter_test.cc
##
@@ -157,4 +225,2478 @@ TEST(TestAdapter, readIntAndStringFileMultipleStripes) {
 EXPECT_TRUE(stripe_reader->ReadNext(_batch).ok());
   }
 }
+
+// WriteORC tests
+
+// General
+TEST(TestAdapterWriteGeneral, writeZeroRows) {
+  std::vector> xFields{field("bool", boolean()),
+  field("int8", int8()),
+  field("int16", int16()),
+  field("int32", int32()),
+  field("int64", int64()),
+  field("float", float32()),
+  field("double", float64()),
+  field("decimal128nz", 
decimal(25, 6)),
+  field("decimal128z", decimal(32, 
0)),
+  field("date32", date32()),
+  field("ts3", 
timestamp(TimeUnit::NANO)),
+  field("string", utf8()),
+  field("binary", binary())};
+  std::shared_ptr sharedPtrSchema = std::make_shared(xFields);
+
+  int64_t numRows = 0;
+  int64_t numCols = xFields.size();
+
+  ArrayBuilderVector builders(numCols, NULLPTR);
+  builders[0] =
+  
std::static_pointer_cast(std::make_shared());
+  builders[1] = 
std::static_pointer_cast(std::make_shared());
+  builders[2] = 
std::static_pointer_cast(std::make_shared());
+  builders[3] = 
std::static_pointer_cast(std::make_shared());
+  builders[4] = 
std::static_pointer_cast(std::make_shared());
+  builders[5] = 
std::static_pointer_cast(std::make_shared());
+  builders[6] = 
std::static_pointer_cast(std::make_shared());
+  builders[7] = std::static_pointer_cast(
+  std::make_shared(decimal(25, 6)));
+  builders[8] = std::static_pointer_cast(
+  std::make_shared(decimal(32, 0)));
+  builders[9] = 
std::static_pointer_cast(std::make_shared());
+  builders[10] =
+  
std::static_pointer_cast(std::make_shared(
+  timestamp(TimeUnit::NANO), default_memory_pool()));
+  builders[11] =
+  
std::static_pointer_cast(std::make_shared());
+  builders[12] =
+  
std::static_pointer_cast(std::make_shared());
+  ArrayVector arrays(numCols, NULLPTR);
+  ChunkedArrayVector cv;
+  cv.reserve(numCols);
+
+  for (int col = 0; col < numCols; col++) {
+ARROW_EXPECT_OK(builders[col]->Finish([col]));
+cv.push_back(std::make_shared(arrays[col]));
+  }
+
+  std::shared_ptr table = Table::Make(sharedPtrSchema, cv);
+
+  std::unique_ptr writer =
+  std::unique_ptr(new ORCMemWriter());
+  std::unique_ptr out_stream =
+  std::unique_ptr(static_cast(
+  new MemoryOutputStream(DEFAULT_SMALL_MEM_STREAM_SIZE / 16)));
+  ARROW_EXPECT_OK(writer->Open(sharedPtrSchema, out_stream));
+  ARROW_EXPECT_OK(writer->Write(table));
+  auto output_mem_stream = 
static_cast(writer->ReleaseOutStream());
+  std::shared_ptr in_stream(
+  new io::BufferReader(std::make_shared(
+  reinterpret_cast(output_mem_stream->getData()),
+  static_cast(output_mem_stream->getLength();
+
+  std::unique_ptr reader;
+  ASSERT_TRUE(
+  adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), 
).ok());
+  std::shared_ptr outputTable;
+  ARROW_EXPECT_OK(reader->Read());
+  EXPECT_EQ(outputTable->num_columns(), numCols);
+  EXPECT_EQ(outputTable->num_rows(), numRows);
+  EXPECT_TRUE(outputTable->Equals(*table));
+}
+TEST(TestAdapterWriteGeneral, writeChunkless) {
+  std::vector> xFieldsSub{std::make_shared("a", 
utf8()),
+ std::make_shared("b", 
int32())};
+  std::vector> xFields{
+  field("bool", boolean()),
+  field("int8", int8()),
+  field("int16", int16()),
+  field("int32", int32()),
+  field("int64", int64()),
+  field("float", float32()),
+  field("double", float64()),
+  field("decimal128nz", decimal(25, 6)),
+  field("decimal128z", decimal(32, 0)),
+  field("date32", date32()),
+  field("ts3", timestamp(TimeUnit::NANO)),
+  field("string", utf8()),
+  field("binary", binary()),
+  field("struct", struct_(xFieldsSub)),
+  field("list", list(int32())),
+  field("lsl", list(struct_({field("lsl0", list(int32()))})))};
+  std::shared_ptr sharedPtrSchema = std::make_shared(xFields);
+
+  int64_t numRows = 0;
+  int64_t numCols = xFields.size();
+
+  ChunkedArrayVector cv;
+  cv.reserve(numCols);
+
+  ArrayMatrix av(numCols, ArrayVector(0, NULLPTR));
+
+  for (int col = 0; col < numCols; col++) {
+cv.push_back(std::make_shared(av[col], 
xFields[col]->type()));
+  }
+
+  std::shared_ptr 

[GitHub] [arrow] andygrove commented on pull request #9278: ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking

2021-01-21 Thread GitBox


andygrove commented on pull request #9278:
URL: https://github.com/apache/arrow/pull/9278#issuecomment-763902393


   I just wanted to add that there is no need to wait for me to review before 
merging.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tyrelr commented on a change in pull request #9215: ARROW-11270: [Rust] Array slice accessors

2021-01-21 Thread GitBox


tyrelr commented on a change in pull request #9215:
URL: https://github.com/apache/arrow/pull/9215#discussion_r561918229



##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl PrimitiveArray {
 }
 
 /// Returns the primitive value at index `i`.
-///
-/// Note this doesn't do any bound checking, for performance reason.
-/// # Safety
-/// caller must ensure that the passed in offset is less than the array 
len()
+#[inline]
 pub fn value(, i: usize) -> T::Native {
-let offset = i + self.offset();
-unsafe { *self.raw_values.as_ptr().add(offset) }
+self.values()[i]

Review comment:
   I agree that we want to guide API users to switch to values().  My 
hesitancy with making/leaving it unsafe is that, combined with the other changes 
in this PR, I think it would be the only unsafe Array value(x) method (which is 
why it got added as a follow-on commit).  That leads to awkwardness in macros, 
such as the comparison kernel, which effectively calls "$left.value(x)"; it 
could be a papercut for macro-style invocation.
   
   (There may be an argument that a trait could abstract away the need for 
macros altogether... but my experiments to rework the comparison kernel to that 
effect are incomplete - and partly sidelined with various other experiments I'm 
toying with.)
   
   I would expect that in the exact for loop you mention, the compiler could 
recognize that the loop bound is satisfied, since both len() and .value(i) are 
inlinable and use the same underlying field, but it is also easy to complicate 
that loop enough to defeat this by making the relationship between i and 
array.len() less direct.  The pretty-printing logic would probably do additional 
bounds checks now, as it iterates based on len() from the record batch, which is 
semantically equal but not enforced in the code in a way I would expect the 
compiler to understand.
   
   I'll launch some benchmarks of head, master, and the commit just before the 
PrimitiveArray safety change.

##
File path: rust/arrow/src/array/array_primitive.rs
##
@@ -86,13 +86,9 @@ impl PrimitiveArray {
 }
 
 /// Returns the primitive value at index `i`.
-///
-/// Note this doesn't do any bound checking, for performance reason.
-/// # Safety
-/// caller must ensure that the passed in offset is less than the array 
len()
+#[inline]
 pub fn value(, i: usize) -> T::Native {
-let offset = i + self.offset();
-unsafe { *self.raw_values.as_ptr().add(offset) }
+self.values()[i]

Review comment:
   I agree that we want to guide API users to switch to values().  My 
hesitancy with making/leaving it unsafe is that, combined with the other changes 
in this PR, I think it would be the only unsafe Array value(x) method (which is 
why it got added as a follow-on commit).  That leads to awkwardness in macros, 
such as the comparison kernel, which effectively calls "$left.value(x)".  Making 
macro-style invocation sometimes require unsafe is a bit unfortunate.
   
   (There may be an argument that a trait could abstract away the need for 
macros altogether... but my experiments to rework the comparison kernel to that 
effect are incomplete - and partly sidelined with various other experiments I'm 
toying with... most of which somehow relate to removing direct calls to 
__Array.value(i).)
   
   I would expect that in the exact for loop you mention, the compiler could 
recognize that the loop bound is satisfied, since both len() and .value(i) are 
inlinable and use the same underlying field, but it is also easy to complicate 
that loop enough to defeat this by making the relationship between i and 
array.len() less direct.  The pretty-printing logic would probably do additional 
bounds checks now, as it iterates based on len() from the record batch, which is 
semantically equal but not enforced in the code in a way I would expect the 
compiler to understand.
   
   I'll launch some benchmarks of head, master, and the commit just before the 
PrimitiveArray safety change.
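   
   As a plain-Rust illustration of the bounds-check point above (this assumes 
nothing about the arrow API; it is just the shape of the loop):
   
   ```rust
   // When the loop bound and the indexing use the same length, the optimizer can
   // usually prove the index is in range and elide the per-element bounds check.
   fn sum_indexed(values: &[i32]) -> i64 {
       let mut total = 0i64;
       for i in 0..values.len() {
           total += values[i] as i64; // bounds check typically elided
       }
       total
   }
   
   // When the bound comes from elsewhere (e.g. a row count that is only
   // semantically equal to the array length), each access keeps its bounds check.
   fn sum_with_external_len(values: &[i32], external_len: usize) -> i64 {
       let mut total = 0i64;
       for i in 0..external_len {
           total += values[i] as i64; // bounds check (and possible panic) retained
       }
       total
   }
   ```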





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #9281: ARROW-11333: [Rust] Generalized creation of empty arrays.

2021-01-21 Thread GitBox


jorgecarleitao commented on pull request #9281:
URL: https://github.com/apache/arrow/pull/9281#issuecomment-764422417


   @ovr, this is what I was trying to express in your PR: we can use 
`ArrayData` directly and thereby support arbitrary nested types.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


alamb commented on a change in pull request #9261:
URL: https://github.com/apache/arrow/pull/9261#discussion_r562040108



##
File path: .github/workflows/rust.yml
##
@@ -226,6 +226,40 @@ jobs:
   cd rust
   cargo clippy --all-targets --workspace -- -D warnings -A 
clippy::redundant_field_names
 
+  miri-checks:
+name: Miri Checks
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+arch: [amd64]
+rust: [nightly-2021-01-19]
+steps:
+  - uses: actions/checkout@v2
+with:
+  submodules: true
+  - uses: actions/cache@v2
+with:
+  path: |
+~/.cargo/registry
+~/.cargo/git
+target
+  key: ${{ runner.os }}-cargo-miri-${{ hashFiles('**/Cargo.lock') }}
+  - name: Setup Rust toolchain
+run: |
+  rustup toolchain install ${{ matrix.rust }}
+  rustup default ${{ matrix.rust }}
+  rustup component add rustfmt clippy miri
+  - name: Run Miri Checks
+env:
+  RUST_BACKTRACE: full
+  RUST_LOG: 'trace'
+run: |
+  export MIRIFLAGS="-Zmiri-disable-isolation"
+  cd rust
+  cargo miri setup
+  cargo clean
+  cargo miri test || true

Review comment:
   ```suggestion
 # Ignore MIRI errors until we can get a clean run
 cargo miri test || true
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r561653478



##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -34,42 +34,446 @@ macro_rules! downcast_vec {
 }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii(args: &[ArrayRef]) -> 
Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| {
+x.map(|x: | {
+let mut chars = x.chars();
+chars.next().map_or(0, |v| v as i32)
+})
+})
+.collect())
+}
+
+/// Removes the longest string containing only characters in characters (a 
space by default) from the start and end of string.
+pub fn btrim(args: &[ArrayRef]) -> 
Result {
+match args.len() {
+0 => Err(DataFusionError::Internal(
+"btrim was called with 0 arguments. It requires at least 
one.".to_string(),
+)),
+1 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.map(|x| x.map(|x: | x.trim()))
+.collect())
+}
+2 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+let characters_array = args[1]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.enumerate()
+.map(|(i, x)| {
+if characters_array.is_null(i) {
+None
+} else {
+x.map(|x: | {
+let chars: Vec =
+characters_array.value(i).chars().collect();
+x.trim_start_matches([..])
+.trim_end_matches([..])
+})
+}
+})
+.collect())
+}
+other => Err(DataFusionError::Internal(format!(
+"btrim was called with {} arguments. It requires at most two.",
+other
+))),
+}
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result {
+let array = args[0].as_any().downcast_ref::().unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x: Option| {
+x.map(|x| {
+if x == 0 {
+Err(DataFusionError::Internal(
+"null character not permitted.".to_string(),
+))
+} else {
+match core::char::from_u32(x as u32) {
+Some(x) => Ok(x.to_string()),
+None => Err(DataFusionError::Internal(
+"requested character too large for 
encoding.".to_string(),
+)),
+}
+}
+.unwrap()

Review comment:
   I'm not sure we should be panicking if these characters appear.
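   
   A minimal sketch of one way to propagate the error instead of unwrapping 
(plain Rust with a stand-in error type, not the actual DataFusion code):
   
   ```rust
   #[derive(Debug)]
   struct Error(String);
   
   fn chr_checked(code: i64) -> Result<String, Error> {
       if code == 0 {
           return Err(Error("null character not permitted".to_string()));
       }
       std::char::from_u32(code as u32)
           .map(|c| c.to_string())
           .ok_or_else(|| Error("requested character too large for encoding".to_string()))
   }
   
   fn chr_all(values: &[Option<i64>]) -> Result<Vec<Option<String>>, Error> {
       values
           .iter()
           // Option<Result<_, _>> -> Result<Option<_>, _>, so collect can short-circuit.
           .map(|x| x.map(chr_checked).transpose())
           .collect()
   }
   ```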

##
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##
@@ -69,13 +69,42 @@ pub fn data_types(
 signature: ,
 ) -> Result> {
 let valid_types = match signature {
-Signature::Variadic(valid_types) => valid_types
+Signature::Any(number) => {
+if current_types.len() != *number {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+number,
+current_types.len()
+)));
+}
+vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+}
+Signature::Exact(valid_types) => vec![valid_types.clone()],
+Signature::Uniform(valid_types) => {
+let valid_signature = valid_types
+.iter()
+.filter(|x| x.len() == current_types.len())
+.collect::>();
+if valid_signature.len() != 1 {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+valid_types
+.iter()
+.map(|x| x.len().to_string())
+.collect::>()
+.join(" or "),
+current_types.len()
+)));
+}
+

[GitHub] [arrow] alamb commented on pull request #9261: ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

2021-01-21 Thread GitBox


alamb commented on pull request #9261:
URL: https://github.com/apache/arrow/pull/9261#issuecomment-764604319







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-01-21 Thread GitBox


mathyingzhou edited a comment on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-764050107


   > I didn't see anything rust related in this PR so I removed the Rust label
   
   @alamb Thanks! There is nothing parquet-related in this PR either. Can the 
parquet label be removed as well? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9285: ARROW-10349: [Python] Build and publish aarch64 wheels [WIP]

2021-01-21 Thread GitBox


github-actions[bot] commented on pull request #9285:
URL: https://github.com/apache/arrow/pull/9285#issuecomment-764893560


   https://issues.apache.org/jira/browse/ARROW-10349



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mbrubeck commented on a change in pull request #9235: ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length)

2021-01-21 Thread GitBox


mbrubeck commented on a change in pull request #9235:
URL: https://github.com/apache/arrow/pull/9235#discussion_r561239323



##
File path: rust/arrow/src/buffer.rs
##
@@ -963,11 +968,131 @@ impl MutableBuffer {
 
 /// Extends the buffer by `additional` bytes equal to `0u8`, incrementing 
its capacity if needed.
 #[inline]
-pub fn extend( self, additional: usize) {
+pub fn extend_zeros( self, additional: usize) {
 self.resize(self.len + additional, 0);
 }
 }
 
+/// # Safety
+/// `ptr` must be allocated for `old_capacity`.
+#[inline]
+unsafe fn reallocate(
+ptr: NonNull,
+old_capacity: usize,
+new_capacity: usize,
+) -> (NonNull, usize) {
+let new_capacity = bit_util::round_upto_multiple_of_64(new_capacity);
+let new_capacity = std::cmp::max(new_capacity, old_capacity * 2);
+let ptr = memory::reallocate(ptr, old_capacity, new_capacity);
+(ptr, new_capacity)
+}
+
+impl Extend for MutableBuffer {
+#[inline]
+fn extend>( self, iter: T) {
+let iterator = iter.into_iter();
+self.extend_from_iter(iterator)
+}
+}
+
+impl MutableBuffer {
+#[inline]
+fn extend_from_iter>(
+ self,
+mut iterator: I,
+) {
+let size = std::mem::size_of::();
+
+// this is necessary because of 
https://github.com/rust-lang/rust/issues/32155
+let (mut ptr, mut capacity, mut len) = (self.data, self.capacity, 
self.len);
+let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };
+
+while let Some(item) = iterator.next() {
+if len + size >= capacity {
+let (lower, _) = iterator.size_hint();
+let additional = (lower + 1) * size;
+let (new_ptr, new_capacity) =
+unsafe { reallocate(ptr, capacity, len + additional) };

Review comment:
   > Note that arrow does not support complex structs on its buffers (i.e. 
we only support `u8-u64, i8-i64, f32 and f64`), which means that we never need 
to call `drop` on the elements. Under this, do we still need a valid `len`?
   
   No.  Dropping before `self.len` is updated should be fine.  However, in my 
benchmarking I still found that using `SetLenOnDrop` provided a small 
performance benefit compared to just updating `self.len` after the loop.  I'm 
not sure why.

##
File path: rust/arrow/src/buffer.rs
##
@@ -963,11 +968,131 @@ impl MutableBuffer {
 
 /// Extends the buffer by `additional` bytes equal to `0u8`, incrementing 
its capacity if needed.
 #[inline]
-pub fn extend( self, additional: usize) {
+pub fn extend_zeros( self, additional: usize) {
 self.resize(self.len + additional, 0);
 }
 }
 
+/// # Safety
+/// `ptr` must be allocated for `old_capacity`.
+#[inline]
+unsafe fn reallocate(
+ptr: NonNull,
+old_capacity: usize,
+new_capacity: usize,
+) -> (NonNull, usize) {
+let new_capacity = bit_util::round_upto_multiple_of_64(new_capacity);
+let new_capacity = std::cmp::max(new_capacity, old_capacity * 2);
+let ptr = memory::reallocate(ptr, old_capacity, new_capacity);
+(ptr, new_capacity)
+}
+
+impl Extend for MutableBuffer {
+#[inline]
+fn extend>( self, iter: T) {
+let iterator = iter.into_iter();
+self.extend_from_iter(iterator)
+}
+}
+
+impl MutableBuffer {
+#[inline]
+fn extend_from_iter>(
+ self,
+mut iterator: I,
+) {
+let size = std::mem::size_of::();
+
+// this is necessary because of 
https://github.com/rust-lang/rust/issues/32155
+let (mut ptr, mut capacity, mut len) = (self.data, self.capacity, 
self.len);
+let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };
+
+while let Some(item) = iterator.next() {
+if len + size >= capacity {
+let (lower, _) = iterator.size_hint();
+let additional = (lower + 1) * size;
+let (new_ptr, new_capacity) =
+unsafe { reallocate(ptr, capacity, len + additional) };

Review comment:
   > Note that arrow does not support complex structs on its buffers (i.e. 
we only support `u8-u64, i8-i64, f32 and f64`), which means that we never need 
to call `drop` on the elements. Under this, do we still need a valid `len`?
   
   No.  Dropping before `self.len` is updated should be fine.  However, in my 
benchmarking I still found that using `SetLenOnDrop` provided a small 
performance benefit compared to just updating `self.len` after the loop.  I'm 
not sure why.  Maybe the mutable borrow provides some additional hints to the 
optimizer.
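   
   A minimal sketch of the `SetLenOnDrop` idea being referred to (modelled on the 
std `Vec` internals, simplified; not the arrow `MutableBuffer` code):
   
   ```rust
   struct SetLenOnDrop<'a> {
       len: &'a mut usize,
       local_len: usize,
   }
   
   impl<'a> SetLenOnDrop<'a> {
       fn new(len: &'a mut usize) -> Self {
           let local_len = *len;
           SetLenOnDrop { len, local_len }
       }
   
       fn increment(&mut self, by: usize) {
           self.local_len += by;
       }
   }
   
   impl Drop for SetLenOnDrop<'_> {
       fn drop(&mut self) {
           // The owner's length is written back exactly once, even on early exit.
           *self.len = self.local_len;
       }
   }
   
   fn fill(len_field: &mut usize, items: usize) {
       let mut guard = SetLenOnDrop::new(len_field);
       for _ in 0..items {
           // ... write one element into the buffer, then record it ...
           guard.increment(1);
       }
       // `guard` drops here and updates `*len_field`.
   }
   ```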

##
File path: rust/arrow/src/buffer.rs
##
@@ -963,11 +968,131 @@ impl MutableBuffer {
 
 /// Extends the buffer by `additional` bytes equal to `0u8`, incrementing 
its capacity if needed.
 #[inline]
-pub fn extend( self, additional: usize) {
+pub fn extend_zeros( self, additional: usize) {
 

[GitHub] [arrow] alamb closed pull request #9275: ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results

2021-01-21 Thread GitBox


alamb closed pull request #9275:
URL: https://github.com/apache/arrow/pull/9275


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391973



##
File path: rust/datafusion/src/physical_plan/functions.rs
##
@@ -60,10 +59,15 @@ pub enum Signature {
 // A function such as `array` is `VariadicEqual`
 // The first argument decides the type used for coercion
 VariadicEqual,
+/// fixed number of arguments of vector of vectors of valid types
+// A function of one argument of f64 is `Uniform(vec![vec![vec![DataType::Float64]]])`
+// A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+// A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
   Yes, agreed. The existing code clearly took some thought, so I wanted to 
leave it until we can agree on the correct course of action.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391477



##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
 }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii(args: &[ArrayRef]) -> 
Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| {
+x.map(|x: | {
+let mut chars = x.chars();
+chars.next().map_or(0, |v| v as i32)
+})
+})
+.collect())
+}
+
+/// Removes the longest string containing only characters in characters (a 
space by default) from the start and end of string.
+pub fn btrim(args: &[ArrayRef]) -> 
Result {
+match args.len() {
+0 => Err(DataFusionError::Internal(
+"btrim was called with 0 arguments. It requires at least 
1.".to_string(),
+)),
+1 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.map(|x| x.map(|x: | x.trim()))
+.collect())
+}
+2 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+let characters_array = args[1]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.enumerate()
+.map(|(i, x)| {
+if characters_array.is_null(i) {
+None
+} else {
+x.map(|x: | {
+let chars: Vec =
+characters_array.value(i).chars().collect();
+x.trim_start_matches([..])
+.trim_end_matches([..])
+})
+}
+})
+.collect())
+}
+other => Err(DataFusionError::Internal(format!(
+"btrim was called with {} arguments. It requires at most 2.",
+other
+))),
+}
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i32))
+.collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i64))
+.collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result {
+let array = args[0].as_any().downcast_ref::().unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array

Review comment:
   Ah of course 🤦





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391162



##
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##
@@ -69,13 +69,42 @@ pub fn data_types(
 signature: ,
 ) -> Result> {
 let valid_types = match signature {
-Signature::Variadic(valid_types) => valid_types
+Signature::Any(number) => {
+if current_types.len() != *number {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+number,
+current_types.len()
+)));
+}
+vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+}
+Signature::Exact(valid_types) => vec![valid_types.clone()],
+Signature::Uniform(valid_types) => {
+let valid_signature = valid_types
+.iter()
+.filter(|x| x.len() == current_types.len())
+.collect::>();
+if valid_signature.len() != 1 {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+valid_types
+.iter()
+.map(|x| x.len().to_string())
+.collect::>()
+.join(" or "),
+current_types.len()
+)));
+}
+cartesian_product(valid_signature.first().unwrap())

Review comment:
   Thanks @jorgecarleitao. Yes, I will split this out.
   
   A good example is lpad, which is either
   [string, int] or [string, int, string]. I am away for a couple of days but will 
split this out so we can work through it methodically.
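   
   Concretely (hypothetical helper, not DataFusion's API), the two accepted 
argument lists for lpad would be:
   
   ```rust
   use arrow::datatypes::DataType;
   
   /// lpad accepts either [string, int] or [string, int, string].
   fn lpad_signatures() -> Vec<Vec<DataType>> {
       vec![
           vec![DataType::Utf8, DataType::Int64],
           vec![DataType::Utf8, DataType::Int64, DataType::Utf8],
       ]
   }
   ```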





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562390994



##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
 }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii(args: &[ArrayRef]) -> 
Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| {
+x.map(|x: | {
+let mut chars = x.chars();
+chars.next().map_or(0, |v| v as i32)
+})
+})
+.collect())
+}
+
+/// Removes the longest string containing only characters in characters (a 
space by default) from the start and end of string.
+pub fn btrim(args: &[ArrayRef]) -> 
Result {
+match args.len() {
+0 => Err(DataFusionError::Internal(
+"btrim was called with 0 arguments. It requires at least 
1.".to_string(),
+)),
+1 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.map(|x| x.map(|x: | x.trim()))
+.collect())
+}
+2 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+let characters_array = args[1]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.enumerate()
+.map(|(i, x)| {
+if characters_array.is_null(i) {
+None
+} else {
+x.map(|x: | {
+let chars: Vec =
+characters_array.value(i).chars().collect();
+x.trim_start_matches([..])
+.trim_end_matches([..])
+})
+}
+})
+.collect())
+}
+other => Err(DataFusionError::Internal(format!(
+"btrim was called with {} arguments. It requires at most 2.",
+other
+))),
+}
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i32))
+.collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i64))
+.collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result {
+let array = args[0].as_any().downcast_ref::().unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array

Review comment:
   *and remove this `Ok` from here, so that `collect` is implicitly treated 
as `collect<Result<_>>` instead of `collect<_>`
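   
   A small self-contained example of the pattern (unrelated types, just to show 
the `collect` behaviour):
   
   ```rust
   // Collecting an iterator of Result values directly into Result<Vec<_>, _>
   // short-circuits on the first Err, so no Ok(...) wrapper is needed.
   fn parse_all(values: &[&str]) -> Result<Vec<i32>, std::num::ParseIntError> {
       values
           .iter()
           .map(|v| v.parse::<i32>())
           .collect()
   }
   ```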





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562389680



##
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##
@@ -69,13 +69,42 @@ pub fn data_types(
 signature: ,
 ) -> Result> {
 let valid_types = match signature {
-Signature::Variadic(valid_types) => valid_types
+Signature::Any(number) => {
+if current_types.len() != *number {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+number,
+current_types.len()
+)));
+}
+vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+}
+Signature::Exact(valid_types) => vec![valid_types.clone()],
+Signature::Uniform(valid_types) => {
+let valid_signature = valid_types
+.iter()
+.filter(|x| x.len() == current_types.len())
+.collect::>();
+if valid_signature.len() != 1 {
+return Err(DataFusionError::Plan(format!(
+"The function expected {} arguments but received {}",
+valid_types
+.iter()
+.map(|x| x.len().to_string())
+.collect::>()
+.join(" or "),
+current_types.len()
+)));
+}
+cartesian_product(valid_signature.first().unwrap())

Review comment:
   I suggest that we PR this separately, with a single function that 
requires this type of signature: getting this right requires much more care 
than the other parts of this PR, as it affects all future functions that use it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

2021-01-21 Thread GitBox


jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540



##
File path: rust/datafusion/src/physical_plan/functions.rs
##
@@ -60,10 +59,15 @@ pub enum Signature {
 // A function such as `array` is `VariadicEqual`
 // The first argument decides the type used for coercion
 VariadicEqual,
+/// fixed number of arguments of vector of vectors of valid types
+// A function of one argument of f64 is `Uniform(vec![vec![vec![DataType::Float64]]])`
+// A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+// A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
   This signature generalizes `UniformEqual`, so wouldn't it be possible to 
generalize the other instead of creating a new one (i.e. replace the existing one 
with the more general form)?
   
   `Signature` should be such that its variants form a complete set of options 
without overlaps.

##
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
 }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii(args: &[ArrayRef]) -> 
Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| {
+x.map(|x: | {
+let mut chars = x.chars();
+chars.next().map_or(0, |v| v as i32)
+})
+})
+.collect())
+}
+
+/// Removes the longest string containing only characters in characters (a 
space by default) from the start and end of string.
+pub fn btrim(args: &[ArrayRef]) -> 
Result {
+match args.len() {
+0 => Err(DataFusionError::Internal(
+"btrim was called with 0 arguments. It requires at least 
1.".to_string(),
+)),
+1 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.map(|x| x.map(|x: | x.trim()))
+.collect())
+}
+2 => {
+let string_array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+let characters_array = args[1]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+
+Ok(string_array
+.iter()
+.enumerate()
+.map(|(i, x)| {
+if characters_array.is_null(i) {
+None
+} else {
+x.map(|x: | {
+let chars: Vec =
+characters_array.value(i).chars().collect();
+x.trim_start_matches([..])
+.trim_end_matches([..])
+})
+}
+})
+.collect())
+}
+other => Err(DataFusionError::Internal(format!(
+"btrim was called with {} arguments. It requires at most 2.",
+other
+))),
+}
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i32))
+.collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result {
+let array = args[0]
+.as_any()
+.downcast_ref::>()
+.unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x| x.map(|x: | x.graphemes(true).count() as i64))
+.collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result {
+let array = args[0].as_any().downcast_ref::().unwrap();
+// first map is the iterator, second is for the `Option<_>`
+Ok(array
+.iter()
+.map(|x: Option| {
+x.map(|x| {
+if x == 0 {
+Err(DataFusionError::Internal(
+"null character not permitted.".to_string(),
+))
+} else {
+match core::char::from_u32(x as u32) {
+ 
