[GitHub] [arrow] sunchao commented on a change in pull request #9612: ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion

2021-03-14 Thread GitBox


sunchao commented on a change in pull request #9612:
URL: https://github.com/apache/arrow/pull/9612#discussion_r594048206



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -364,32 +385,51 @@ fn arrow_to_parquet_type(field: &Field) -> Result<Type> {
 DataType::Float64 => Type::primitive_type_builder(name, PhysicalType::DOUBLE)
 .with_repetition(repetition)
 .build(),
-DataType::Timestamp(time_unit, _) => {
+DataType::Timestamp(time_unit, zone) => {
 Type::primitive_type_builder(name, PhysicalType::INT64)
-.with_converted_type(match time_unit {
-TimeUnit::Second => ConvertedType::TIMESTAMP_MILLIS,
-TimeUnit::Millisecond => ConvertedType::TIMESTAMP_MILLIS,
-TimeUnit::Microsecond => ConvertedType::TIMESTAMP_MICROS,
-TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
-})
+.with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
+is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),

Review comment:
   I think we can perhaps follow Arrow C++ and always set it to true 
whenever the timezone info is set, and normalize the timestamp value to UTC 
when converting to Parquet. Please see a previous discussion 
[here](https://github.com/apache/arrow/pull/4421#issuecomment-498413899) and 
the related C++ code 
[here](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L138).
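The rule sunchao describes can be sketched as follows. This is a minimal stand-in sketch, not the parquet crate's actual types: `TimeUnit`, `ParquetTimeUnit`, and `TimestampType` here are simplified assumptions mirroring the Arrow C++ behavior of treating any timezone as "values normalized to UTC".

```rust
// Simplified stand-in types (not the real parquet/arrow-rs definitions):
#[derive(Debug)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

#[derive(Debug, PartialEq)]
enum ParquetTimeUnit { Millis, Micros, Nanos }

#[derive(Debug, PartialEq)]
struct TimestampType { is_adjusted_to_utc: bool, unit: ParquetTimeUnit }

// Any timezone on the Arrow field (not only the literal "UTC") means the
// stored values are normalized to UTC, so the Parquet TIMESTAMP is flagged
// as UTC-adjusted regardless of which zone string was attached.
fn timestamp_logical_type(unit: &TimeUnit, zone: &Option<String>) -> TimestampType {
    TimestampType {
        is_adjusted_to_utc: zone.is_some(),
        unit: match unit {
            // Parquet has no SECONDS unit, so seconds are stored as millis
            TimeUnit::Second | TimeUnit::Millisecond => ParquetTimeUnit::Millis,
            TimeUnit::Microsecond => ParquetTimeUnit::Micros,
            TimeUnit::Nanosecond => ParquetTimeUnit::Nanos,
        },
    }
}
```

Under this scheme a field with timezone `"+02:00"` would still be written as UTC-adjusted, which is the difference from the `z.as_str() == "UTC"` check in the diff.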





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on issue #9636: Is there an API to deserialize ListVector into double[] efficiently ?

2021-03-14 Thread GitBox


liyafan82 commented on issue #9636:
URL: https://github.com/apache/arrow/issues/9636#issuecomment-799069748


   The float8 data in the list vector and the double array are both in a contiguous memory region.
   So in theory, it is possible to populate the `double[]` with a single copy.
   
   The unsafe API provides such functionality. Please note that:
   1. There is some header space between the object address and the data address in the `double[]`.
   2. In the Arrow vector, there may be nulls in the slots; the behavior for such slots should be defined.







[GitHub] [arrow] cyb70289 commented on pull request #9706: ARROW-11941: [Dev] Don't update Jira if run "DEBUG=1 merge_arrow_pr.py"

2021-03-14 Thread GitBox


cyb70289 commented on pull request #9706:
URL: https://github.com/apache/arrow/pull/9706#issuecomment-799050981


   Please note **this patch is not tested**. I only checked manually that the added lines contain no syntax errors.







[GitHub] [arrow] github-actions[bot] commented on pull request #9706: ARROW-11941: [Dev] Don't update Jira if run "DEBUG=1 merge_arrow_pr.py"

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9706:
URL: https://github.com/apache/arrow/pull/9706#issuecomment-799049802


   https://issues.apache.org/jira/browse/ARROW-11941







[GitHub] [arrow] github-actions[bot] commented on pull request #9706: [ARROW-11941]: [Dev] Don't update Jira if run "DEBUG=1 merge_arrow_pr.py"

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9706:
URL: https://github.com/apache/arrow/pull/9706#issuecomment-799049571


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] cyb70289 opened a new pull request #9706: [ARROW-11941]: [Dev] Don't update Jira if run "DEBUG=1 merge_arrow_pr.py"

2021-03-14 Thread GitBox


cyb70289 opened a new pull request #9706:
URL: https://github.com/apache/arrow/pull/9706


   When the environment variable DEBUG=1 is set, merge_arrow_pr.py should only
   try the merge steps without updating anything. Though the PR status is not
   changed, the Jira issue is still updated even if DEBUG=1. This patch fixes that.







[GitHub] [arrow] houqp commented on pull request #9692: ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec rather than &[Expr]

2021-03-14 Thread GitBox


houqp commented on pull request #9692:
URL: https://github.com/apache/arrow/pull/9692#issuecomment-799038552


   @jorgecarleitao yeah, you are right, sorry I got it mixed up with `dyn` :P
   
   I think `IntoIterator` won't require a copy or clone because the 
iterator owns those values. `AsRef` with a slice, on the other hand, would require 
a copy/clone due to the use of the slice, which behaves closer to 
`IntoIterator` over borrowed items.
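The ownership point under discussion can be made observable with a small sketch. `Expr` and `Builder` below are hypothetical stand-ins (not the DataFusion types); clones are counted so the difference between the two signatures shows up directly.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Global clone counter, bumped by our stand-in Expr's Clone impl.
static CLONES: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug)]
struct Expr(String);

impl Clone for Expr {
    fn clone(&self) -> Self {
        CLONES.fetch_add(1, Ordering::SeqCst);
        Expr(self.0.clone())
    }
}

struct Builder { exprs: Vec<Expr> }

impl Builder {
    // Slice version: the builder must clone to own the stored values.
    fn project_slice(exprs: &[Expr]) -> Builder {
        Builder { exprs: exprs.to_vec() }
    }

    // IntoIterator version: the iterator yields owned values, so
    // `collect` moves them into place without cloning.
    fn project_iter(exprs: impl IntoIterator<Item = Expr>) -> Builder {
        Builder { exprs: exprs.into_iter().collect() }
    }
}

fn clone_count() -> usize { CLONES.load(Ordering::SeqCst) }
```

Calling `Builder::project_iter(vec![...])` performs zero `Expr` clones, while `Builder::project_slice(&v)` clones every element; this is also why `collect` on an owning iterator does not imply a clone, addressing the concern raised above.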







[GitHub] [arrow] nevi-me commented on pull request #9705: ARROW-11365: [Rust] [Parquet] Logical type printer and parser

2021-03-14 Thread GitBox


nevi-me commented on pull request #9705:
URL: https://github.com/apache/arrow/pull/9705#issuecomment-799030871


   @sunchao I've created this on top of #9612, PTAL when you can.







[GitHub] [arrow] github-actions[bot] commented on pull request #9705: ARROW-11365: [Rust] [Parquet] Logical type printer and parser

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9705:
URL: https://github.com/apache/arrow/pull/9705#issuecomment-799030723


   https://issues.apache.org/jira/browse/ARROW-11365







[GitHub] [arrow] nevi-me opened a new pull request #9705: ARROW-11365: [Rust] [Parquet] Logical type printer and parser

2021-03-14 Thread GitBox


nevi-me opened a new pull request #9705:
URL: https://github.com/apache/arrow/pull/9705


   This implements the parser and printer for logical types, allowing us to 
read and generate the schema in the form `REQUIRED INT32 field_name 
(INTEGER(16,false))`.
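The shape of such a printer/parser pair can be sketched with stand-in types (this is an illustration of the `INTEGER(16,false)` notation, not the parquet crate's actual API):

```rust
// Hypothetical stand-in for one logical type annotation.
#[derive(Debug, PartialEq)]
enum LogicalType { Integer { bit_width: u8, signed: bool } }

// Printer: render the annotation in the schema text form.
fn print_logical_type(t: &LogicalType) -> String {
    match t {
        LogicalType::Integer { bit_width, signed } =>
            format!("INTEGER({},{})", bit_width, signed),
    }
}

// Parser: recover the annotation from the same text form.
fn parse_logical_type(s: &str) -> Option<LogicalType> {
    let body = s.strip_prefix("INTEGER(")?.strip_suffix(')')?;
    let (width, signed) = body.split_once(',')?;
    Some(LogicalType::Integer {
        bit_width: width.trim().parse().ok()?,
        signed: signed.trim().parse().ok()?,
    })
}
```

The key property is that the two functions round-trip: parsing the printed form yields the original value, which is what lets the schema text be both generated and re-read.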







[GitHub] [arrow] nevi-me commented on a change in pull request #9612: ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion

2021-03-14 Thread GitBox


nevi-me commented on a change in pull request #9612:
URL: https://github.com/apache/arrow/pull/9612#discussion_r593995897



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -583,48 +626,109 @@ impl ParquetTypeConverter<'_> {
 }
 
 fn from_int32(&self) -> Result<DataType> {
-match self.schema.get_basic_info().converted_type() {
-ConvertedType::NONE => Ok(DataType::Int32),
-ConvertedType::UINT_8 => Ok(DataType::UInt8),
-ConvertedType::UINT_16 => Ok(DataType::UInt16),
-ConvertedType::UINT_32 => Ok(DataType::UInt32),
-ConvertedType::INT_8 => Ok(DataType::Int8),
-ConvertedType::INT_16 => Ok(DataType::Int16),
-ConvertedType::INT_32 => Ok(DataType::Int32),
-ConvertedType::DATE => Ok(DataType::Date32),
-ConvertedType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)),
-ConvertedType::DECIMAL => Ok(self.to_decimal()),
-other => Err(ArrowError(format!(
-"Unable to convert parquet INT32 logical type {}",
-other
+match (
+self.schema.get_basic_info().logical_type(),

Review comment:
   Yeah, I can do it as a follow-up when I've completed the overall 2.6.0 
type support
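The conversion shape the diff moves to, matching on the logical type first and falling back to the legacy converted type, can be sketched with simplified stand-ins for the parquet/arrow enums (not the real crate types):

```rust
#[derive(Debug, PartialEq)]
enum DataType { Int8, Int16, Int32, UInt8, UInt16, UInt32 }

#[derive(Debug, Clone, Copy)]
enum LogicalType { Integer { bit_width: u8, signed: bool } }

#[derive(Debug, Clone, Copy)]
enum ConvertedType { None, Int8, Int16, Uint8, Uint16, Uint32 }

fn from_int32(
    logical: Option<LogicalType>,
    converted: ConvertedType,
) -> Result<DataType, String> {
    match (logical, converted) {
        // The 2.6.0 logical annotation, when present, wins.
        (Some(LogicalType::Integer { bit_width, signed }), _) => {
            match (bit_width, signed) {
                (8, true) => Ok(DataType::Int8),
                (16, true) => Ok(DataType::Int16),
                (32, true) => Ok(DataType::Int32),
                (8, false) => Ok(DataType::UInt8),
                (16, false) => Ok(DataType::UInt16),
                (32, false) => Ok(DataType::UInt32),
                other => Err(format!("bad INTEGER annotation {:?}", other)),
            }
        }
        // Otherwise fall back to the legacy converted type.
        (None, ConvertedType::Int8) => Ok(DataType::Int8),
        (None, ConvertedType::Int16) => Ok(DataType::Int16),
        (None, ConvertedType::Uint8) => Ok(DataType::UInt8),
        (None, ConvertedType::Uint16) => Ok(DataType::UInt16),
        (None, ConvertedType::Uint32) => Ok(DataType::UInt32),
        (None, ConvertedType::None) => Ok(DataType::Int32),
    }
}
```

Ordering the tuple match this way keeps old files (converted type only) readable while preferring the richer logical type when a writer supplied it.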









[GitHub] [arrow] nevi-me commented on a change in pull request #9612: ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion

2021-03-14 Thread GitBox


nevi-me commented on a change in pull request #9612:
URL: https://github.com/apache/arrow/pull/9612#discussion_r593995672



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -364,32 +385,51 @@ fn arrow_to_parquet_type(field: &Field) -> Result<Type> {
 DataType::Float64 => Type::primitive_type_builder(name, PhysicalType::DOUBLE)
 .with_repetition(repetition)
 .build(),
-DataType::Timestamp(time_unit, _) => {
+DataType::Timestamp(time_unit, zone) => {
 Type::primitive_type_builder(name, PhysicalType::INT64)
-.with_converted_type(match time_unit {
-TimeUnit::Second => ConvertedType::TIMESTAMP_MILLIS,
-TimeUnit::Millisecond => ConvertedType::TIMESTAMP_MILLIS,
-TimeUnit::Microsecond => ConvertedType::TIMESTAMP_MICROS,
-TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
-})
+.with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
+is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),

Review comment:
   Yeah, I think that my logic is faulty. Reading 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc
 again, I now see that it says `is_adjusted_to_u_t_c = true` applies only if we 
actually adjust the timestamp to UTC.
   
   So I think it's safer to always use `false`, as we don't adjust any 
timezones?









[GitHub] [arrow] Dandandan commented on pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan commented on pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#issuecomment-798985687


   > @Dandandan I have mixed feelings about this change - I get the performance 
improvement side, but in my opinion it makes the explain output less readable 
by replacing the string to date cast with a date: `#l_shipdate GtEq 
CAST(Utf8("1994-01-01") ` vs `#l_shipdate GtEq Date32("8766")`; for me the 
string to date cast is much more readable and I can't really tell what date is 
`Date32("8766")`
   
   The `Date32("8766")` I don't find very elegant either, but I would say that's 
not really part of this PR. The formatting of constant dates could be improved 
to show a formatted yyyy-mm-dd date instead of the integer value of the date. I 
agree the non-folded version is much easier to understand than this integer 
value. The idea of the PR is of course to evaluate the constant expressions, 
which in some cases might enable other optimizations.
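For what it's worth, the opaque `8766` is recoverable by hand: a `Date32` stores the number of days since the Unix epoch (1970-01-01), and 1970 to 1994 spans 24 years of which 6 (1972, 1976, 1980, 1984, 1988, 1992) are leap years, giving 24 × 365 + 6 = 8766 days. A pure-stdlib check of that arithmetic:

```rust
// Gregorian leap-year rule.
fn is_leap(y: i32) -> bool {
    (y % 4 == 0 && y % 100 != 0) || y % 400 == 0
}

// Days from 1970-01-01 to January 1 of `year` (for year >= 1970),
// i.e. the Date32 value of that New Year's Day.
fn days_to_jan1(year: i32) -> i32 {
    (1970..year).map(|y| if is_leap(y) { 366 } else { 365 }).sum()
}
```

So `Date32("8766")` is exactly 1994-01-01, the literal the cast was folded from; a date formatter in the explain output would make that visible directly.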







[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


yordan-pavlov commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r593965350



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -645,8 +654,29 @@ fn build_predicate_expression(
 use crate::logical_plan;
 // predicate expression can only be a binary expression
 let (left, op, right) = match expr {
-Expr::BinaryExpr { left, op, right } => (left, *op, right),
+Expr::BinaryExpr { left, op, right } => (left.clone(), *op, right.clone()),
+Expr::Between {
+expr,
+negated,
+low,
+high,
+} if !negated // NOT BETWEEN not supported
+=> {
+let left = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::GtEq,
+right: low.clone(),
+});
+let right = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::LtEq,
+right: high.clone(),
+});
+
+(left, Operator::And, right)
+}
 _ => {
+debug!("Filter expression not supported in ParquetExec {:?}", expr);

Review comment:
   @Dandandan regarding reading arrow arrays from parquet files, I have been 
able to make some progress and hope to update the Jira issue with my latest 
findings in the next day or two
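The BETWEEN rewrite the diff performs can be sketched in isolation. The `Expr` and `Op` enums below are simplified stand-ins (not the DataFusion types): `x BETWEEN low AND high` becomes `x >= low AND x <= high` so the existing binary-expression pruning logic can handle it, while `NOT BETWEEN` is left unsupported.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary { left: Box<Expr>, op: Op, right: Box<Expr> },
    Between { expr: Box<Expr>, negated: bool, low: Box<Expr>, high: Box<Expr> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { And, GtEq, LtEq }

// Rewrite a non-negated BETWEEN into the conjunction of two comparisons;
// return None for NOT BETWEEN and any other expression shape.
fn rewrite_between(e: &Expr) -> Option<Expr> {
    match e {
        Expr::Between { expr, negated: false, low, high } => Some(Expr::Binary {
            left: Box::new(Expr::Binary {
                left: expr.clone(), op: Op::GtEq, right: low.clone(),
            }),
            op: Op::And,
            right: Box::new(Expr::Binary {
                left: expr.clone(), op: Op::LtEq, right: high.clone(),
            }),
        }),
        _ => None,
    }
}
```

Desugaring at this point means the row-group pruning code only ever has to reason about binary comparisons, which is why the match in `build_predicate_expression` can stay a single `(left, op, right)` triple.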









[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


yordan-pavlov commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r593965090



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -645,8 +654,29 @@ fn build_predicate_expression(
 use crate::logical_plan;
 // predicate expression can only be a binary expression
 let (left, op, right) = match expr {
-Expr::BinaryExpr { left, op, right } => (left, *op, right),
+Expr::BinaryExpr { left, op, right } => (left.clone(), *op, right.clone()),
+Expr::Between {
+expr,
+negated,
+low,
+high,
+} if !negated // NOT BETWEEN not supported
+=> {
+let left = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::GtEq,
+right: low.clone(),
+});
+let right = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::LtEq,
+right: high.clone(),
+});
+
+(left, Operator::And, right)
+}
 _ => {
+debug!("Filter expression not supported in ParquetExec {:?}", expr);

Review comment:
   yes, filter push-down into parquet works best when the filter is over 
sorted data; if the data is randomly distributed across files, it can't do much 
for performance, as you have found









[GitHub] [arrow] yordan-pavlov commented on pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


yordan-pavlov commented on pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#issuecomment-798984352


   @Dandandan I have mixed feelings about this change -  I get the performance 
improvement side, but in my opinion it makes the explain output less readable 
by replacing the string to date cast with a date: `#l_shipdate GtEq 
CAST(Utf8("1994-01-01") ` vs `#l_shipdate GtEq Date32("8766")`; for me the 
string to date cast is much more readable and I can't really tell what date is  
`Date32("8766")`







[GitHub] [arrow] Dandandan commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r593964624



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -645,8 +654,29 @@ fn build_predicate_expression(
 use crate::logical_plan;
 // predicate expression can only be a binary expression
 let (left, op, right) = match expr {
-Expr::BinaryExpr { left, op, right } => (left, *op, right),
+Expr::BinaryExpr { left, op, right } => (left.clone(), *op, right.clone()),
+Expr::Between {
+expr,
+negated,
+low,
+high,
+} if !negated // NOT BETWEEN not supported
+=> {
+let left = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::GtEq,
+right: low.clone(),
+});
+let right = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::LtEq,
+right: high.clone(),
+});
+
+(left, Operator::And, right)
+}
 _ => {
+debug!("Filter expression not supported in ParquetExec {:?}", expr);

Review comment:
   Very interested in your progress with Parquet to Arrow! I was hoping 
this PR would do something for the TPC-H benchmark, but it seems the filter 
pushdown is not very useful in this case :)









[GitHub] [arrow] Dandandan commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r593962933



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -645,8 +654,29 @@ fn build_predicate_expression(
 use crate::logical_plan;
 // predicate expression can only be a binary expression
 let (left, op, right) = match expr {
-Expr::BinaryExpr { left, op, right } => (left, *op, right),
+Expr::BinaryExpr { left, op, right } => (left.clone(), *op, right.clone()),
+Expr::Between {
+expr,
+negated,
+low,
+high,
+} if !negated // NOT BETWEEN not supported
+=> {
+let left = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::GtEq,
+right: low.clone(),
+});
+let right = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::LtEq,
+right: high.clone(),
+});
+
+(left, Operator::And, right)
+}
 _ => {
+debug!("Filter expression not supported in ParquetExec {:?}", expr);

Review comment:
   Yeah, this is very useful! I found that the filter pushdown doesn't 
filter out *anything* on the TPC-H benchmark (even when adding some extra 
support), as the data is distributed so evenly that every row group contains 
almost the complete range of values.
   
   I think @andygrove was thinking about some more general support for keeping 
statistics like this, but some debug logging is something easy to add.









[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


yordan-pavlov commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r593961778



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -645,8 +654,29 @@ fn build_predicate_expression(
 use crate::logical_plan;
 // predicate expression can only be a binary expression
 let (left, op, right) = match expr {
-Expr::BinaryExpr { left, op, right } => (left, *op, right),
+Expr::BinaryExpr { left, op, right } => (left.clone(), *op, right.clone()),
+Expr::Between {
+expr,
+negated,
+low,
+high,
+} if !negated // NOT BETWEEN not supported
+=> {
+let left = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::GtEq,
+right: low.clone(),
+});
+let right = Box::new(Expr::BinaryExpr {
+left: expr.clone(),
+op: Operator::LtEq,
+right: high.clone(),
+});
+
+(left, Operator::And, right)
+}
 _ => {
+debug!("Filter expression not supported in ParquetExec {:?}", expr);

Review comment:
   thank you for adding some `debug!` logs. I have been thinking about how 
predicate push-down into parquet could be displayed in `explain` output for 
easier diagnostics, but lately I have been preoccupied with rethinking the 
reading of arrow arrays from parquet files









[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9692: ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec rather than &[Expr]

2021-03-14 Thread GitBox


jorgecarleitao edited a comment on pull request #9692:
URL: https://github.com/apache/arrow/pull/9692#issuecomment-798977003


   Another option is to use `T: AsRef<[Expr]>`, which works for any type that 
can be represented as a reference to an array (e.g. `Vec` and `&[]`).
   
   @houqp , AFAIK `impl` is just syntactic sugar for a generic; a trait object 
would be `dyn Expr`.
   
   I am not sure the `IntoIterator` or `AsRef` would help, though: we would 
still need to `collect` it into a vector, which AFAIK results in a clone.







[GitHub] [arrow] jorgecarleitao commented on pull request #9692: ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec rather than &[Expr]

2021-03-14 Thread GitBox


jorgecarleitao commented on pull request #9692:
URL: https://github.com/apache/arrow/pull/9692#issuecomment-798977003


   Another option is to use `T: AsRef<[Expr]>`, which works for any type that 
can be represented as a reference to an array (e.g. `Vec` and `&[]`).
   
   @houqp , AFAIK `impl` is just syntactic sugar for a generic; a trait object 
would be `dyn Expr`.
   
   I am not sure the `IntoIterator` would help, though: we would still need to 
`collect` it into a vector, which AFAIK results in a clone.







[GitHub] [arrow] houqp commented on pull request #9692: ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec rather than &[Expr]

2021-03-14 Thread GitBox


houqp commented on pull request #9692:
URL: https://github.com/apache/arrow/pull/9692#issuecomment-798975788


   Yes, I think if the function only requires sequential access, an iterator 
would be even better, although I would prefer to write it as a generic 
function rather than using a trait object.







[GitHub] [arrow] kou commented on pull request #8717: ARROW-10659: [Ruby] Refactor Table#initialize

2021-03-14 Thread GitBox


kou commented on pull request #8717:
URL: https://github.com/apache/arrow/pull/8717#issuecomment-798973718


   I'm closing this because it hasn't been active in 4 months.
   







[GitHub] [arrow] kou closed pull request #8717: ARROW-10659: [Ruby] Refactor Table#initialize

2021-03-14 Thread GitBox


kou closed pull request #8717:
URL: https://github.com/apache/arrow/pull/8717


   







[GitHub] [arrow] kou closed pull request #9687: ARROW-11949: [Ruby] Accept raw Ruby objects as sort key and options

2021-03-14 Thread GitBox


kou closed pull request #9687:
URL: https://github.com/apache/arrow/pull/9687


   







[GitHub] [arrow] kou commented on pull request #9687: ARROW-11949: [Ruby] Accept raw Ruby objects as sort key and options

2021-03-14 Thread GitBox


kou commented on pull request #9687:
URL: https://github.com/apache/arrow/pull/9687#issuecomment-798972719


   +1







[GitHub] [arrow] mathyingzhou commented on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support

2021-03-14 Thread GitBox


mathyingzhou commented on pull request #8648:
URL: https://github.com/apache/arrow/pull/8648#issuecomment-798971411


   @pitrou Could you please check again? My next PR will be ready soon and 
depends on this one being merged to stay clean.







[GitHub] [arrow] houqp commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


houqp commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593941886



##
File path: rust/datafusion/README.md
##
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory 
model. It supports executing SQL queries against CSV and Parquet files as well 
as querying directly against in-memory data.
+
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion 
achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series 
Database
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute 
Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [ROAPI](https://github.com/roapi/roapi)

Review comment:
   delta-rs (https://github.com/delta-io/delta-rs) also uses datafusion for 
querying delta tables :) for example: 
https://github.com/delta-io/delta-rs/blob/main/rust/tests/datafusion_test.rs









[GitHub] [arrow] andygrove commented on a change in pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union

2021-03-14 Thread GitBox


andygrove commented on a change in pull request #9695:
URL: https://github.com/apache/arrow/pull/9695#discussion_r593934133



##
File path: rust/datafusion/src/physical_plan/union.rs
##
@@ -0,0 +1,76 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Some of these functions reference the Postgres documentation
+// or implementation to ensure compatibility and are subject to
+// the Postgres license.
+
+//! The UNION ALL execution plan
+
+use std::{any::Any, sync::Arc};
+
+use arrow::datatypes::SchemaRef;
+
+use super::{ExecutionPlan, Partitioning, SendableRecordBatchStream};
+use crate::error::Result;
+use async_trait::async_trait;
+
+/// UNION ALL execution plan
+#[derive(Debug)]
+pub struct UnionExec {
+/// Input execution plans
+inputs: Vec<Arc<dyn ExecutionPlan>>,
+}
+
+impl UnionExec {
+/// Create a new UnionExec
+pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Self {
+UnionExec { inputs }
+}
+}
+
+#[async_trait]
+impl ExecutionPlan for UnionExec {
+/// Return a reference to Any that can be used for downcasting
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn schema(&self) -> SchemaRef {
+self.inputs[0].schema()
+}
+
+fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
+self.inputs.clone()
+}
+
+/// Get the output partitioning of this plan
+fn output_partitioning(&self) -> Partitioning {
+Partitioning::UnknownPartitioning(self.inputs.len())
+}
+
+fn with_new_children(
+&self,
+children: Vec<Arc<dyn ExecutionPlan>>,
+) -> Result<Arc<dyn ExecutionPlan>> {
+Ok(Arc::new(UnionExec::new(children)))
+}
+
+async fn execute(&self, partition: usize) -> Result<SendableRecordBatchStream> {
+self.inputs[partition].execute(0).await

Review comment:
   Thanks, this is looking good.
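For readers skimming the thread, the partitioning scheme in the diff above can be modeled without any DataFusion dependencies: a union of N child plans exposes N output partitions, and executing partition i simply runs child i (hence `Partitioning::UnknownPartitioning(self.inputs.len())`). `ChildPlan` and `UnionModel` below are illustrative stand-ins, not DataFusion types:

```rust
// A minimal model of the UnionExec partitioning scheme from the diff above.
// `ChildPlan` stands in for `Arc<dyn ExecutionPlan>`.
struct ChildPlan(Vec<i32>);

struct UnionModel {
    inputs: Vec<ChildPlan>,
}

impl UnionModel {
    // Mirrors `output_partitioning`: one output partition per input plan.
    fn output_partition_count(&self) -> usize {
        self.inputs.len()
    }

    // Mirrors `execute`: partition i is just child i's (only) partition.
    fn execute(&self, partition: usize) -> &[i32] {
        &self.inputs[partition].0
    }
}

fn main() {
    let plan = UnionModel {
        inputs: vec![ChildPlan(vec![1, 2]), ChildPlan(vec![3])],
    };
    assert_eq!(plan.output_partition_count(), 2);
    assert_eq!(plan.execute(0), &[1, 2]);
    assert_eq!(plan.execute(1), &[3]);
}
```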









[GitHub] [arrow] Dandandan removed a comment on pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan removed a comment on pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#issuecomment-798943827


   @yordan-pavlov I am trying to use the filtering in parquet - however on any 
query the number of row groups being filtered seems to be 0. Do you know what 
could be happening here?







[GitHub] [arrow] Dandandan commented on pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan commented on pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#issuecomment-798943827


   @yordan-pavlov I am trying to use the filtering in parquet - however on any 
query the number of row groups being filtered seems to be 0. Do you know what 
could be happening here?







[GitHub] [arrow] Dandandan opened a new pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


Dandandan opened a new pull request #9704:
URL: https://github.com/apache/arrow/pull/9704


   







[GitHub] [arrow] github-actions[bot] commented on pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#issuecomment-798937300


   https://issues.apache.org/jira/browse/ARROW-11964







[GitHub] [arrow] returnString commented on pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


returnString commented on pull request #9703:
URL: https://github.com/apache/arrow/pull/9703#issuecomment-798918286


   > I couldn't figure out how to allow the Dataframe API (which is a Trait) to 
take an impl Iterator
   
   Yeah, it's a pain but `impl trait` (in both return and argument position) 
isn't supported in traits right now. Afaik there were some historical 
discussions about opening up the system to support it but I can't find anything 
recent.
   
   Related: is there a reason that `DataFrame` is a trait with a single impl? 
I'm wondering if it could be a concrete struct with the actual execution logic 
handling more exotic setups (custom `TableProvider` impls etc), which would 
free it from all the trait-imposed restrictions. Are we expecting to support 
multiple DF implementations?







[GitHub] [arrow] Dandandan commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593904547



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -686,90 +815,124 @@ pub fn lexsort_to_indices(columns: &[SortColumn]) -> Result<UInt32Array> {
     };
 
     let mut value_indices = (0..row_count).collect::<Vec<usize>>();
-    value_indices.sort_by(lex_comparator);
+    let mut len = value_indices.len();
+
+    if let Some(limit) = limit {
+        len = limit.min(len);
+    }
+    sort_by(&mut value_indices, len, lex_comparator);
 
     Ok(UInt32Array::from(
-        value_indices
-            .into_iter()
-            .map(|i| i as u32)
+        (&value_indices)[0..len]
+            .iter()
+            .map(|i| *i as u32)
             .collect::<Vec<u32>>(),
     ))
 }
 
+pub fn partial_sort<T, F>(v: &mut [T], limit: usize, mut is_less: F)
+where
+    F: FnMut(&T, &T) -> Ordering,
+{
+    let (before, _mid, _after) = v.select_nth_unstable_by(limit, &mut is_less);
+    before.sort_unstable_by(is_less);

Review comment:
   Clear, thanks @sundy-li.
   
   Maybe some docs can be added about this?
   
   I think it probably is not really a problem, but it might be surprising to 
someone using the kernel. And we might want to see whether we could use a 
(faster) unstable version of the normal sort kernel as well (I think that could 
be used by DataFusion too).
   
   What do you think @alamb 
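The stable/unstable distinction being weighed here can be seen with the standard library alone; this is a stand-alone sketch, not DataFusion code:

```rust
// A stable sort preserves the original relative order of equal keys;
// an unstable sort only guarantees the keys end up sorted.
fn main() {
    // Stable: (1, 'b') appears before (1, 'a'), as in the input.
    let mut pairs = vec![(1, 'b'), (0, 'x'), (1, 'a')];
    pairs.sort_by_key(|&(k, _)| k);
    assert_eq!(pairs, vec![(0, 'x'), (1, 'b'), (1, 'a')]);

    // Unstable: the relative order of the two 1-keyed pairs is unspecified,
    // so we can only assert that the keys are non-decreasing.
    let mut pairs2 = vec![(1, 'b'), (0, 'x'), (1, 'a')];
    pairs2.sort_unstable_by_key(|&(k, _)| k);
    assert!(pairs2.windows(2).all(|w| w[0].0 <= w[1].0));
}
```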









[GitHub] [arrow] Dandandan commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593902991



##
File path: rust/datafusion/README.md
##
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory 
model. It supports executing SQL queries against CSV and Parquet files as well 
as querying directly against in-memory data.
+
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion 
achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series 
Database
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute 
Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)

Review comment:
   https://github.com/cube-js/cube.js could be added?









[GitHub] [arrow] alamb commented on a change in pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9703:
URL: https://github.com/apache/arrow/pull/9703#discussion_r593902253



##
File path: rust/datafusion/src/logical_plan/builder.rs
##
@@ -220,20 +260,28 @@ impl LogicalPlanBuilder {
 }))
 }
 
-/// Apply an aggregate
-pub fn aggregate(&self, group_expr: &[Expr], aggr_expr: &[Expr]) -> Result<Self> {
-    let mut all_expr = group_expr.to_vec();

Review comment:
   it was even worse here -- *two* copies of each input expr were made -- 
one copy to pass to `validate_unique_names` and one to put on the 
`LogicalPlanAggregate`.
   
   Admittedly, removing the second copy doesn't require changing the signature 
of this function









[GitHub] [arrow] alamb commented on a change in pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9703:
URL: https://github.com/apache/arrow/pull/9703#discussion_r593901579



##
File path: rust/datafusion/src/logical_plan/builder.rs
##
@@ -39,6 +39,43 @@ use crate::logical_plan::{DFField, DFSchema, DFSchemaRef, 
Partitioning};
 use std::collections::HashSet;
 
 /// Builder for logical plans
+///
+/// ```
+/// # use datafusion::prelude::*;
+/// # use datafusion::logical_plan::LogicalPlanBuilder;
+/// # use datafusion::error::Result;
+/// # use arrow::datatypes::{Schema, DataType, Field};
+/// #
+/// # fn main() -> Result<()> {
+/// #
+/// # fn employee_schema() -> Schema {
+/// #Schema::new(vec![
+/// #   Field::new("id", DataType::Int32, false),
+/// #   Field::new("first_name", DataType::Utf8, false),
+/// #   Field::new("last_name", DataType::Utf8, false),
+/// #   Field::new("state", DataType::Utf8, false),
+/// #   Field::new("salary", DataType::Int32, false),
+/// #   ])
+/// #   }
+/// #
+/// // Create a plan similar to

Review comment:
   As I initially found the `impl IntoIterator

[GitHub] [arrow] alamb commented on a change in pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9703:
URL: https://github.com/apache/arrow/pull/9703#discussion_r593901365



##
File path: rust/benchmarks/src/bin/tpch.rs
##
@@ -1636,7 +1636,7 @@ mod tests {
 .file_extension(".out");
 let df = ctx.read_csv(&format!("{}/answers/q{}.out", path, n), options)?;
 let df = df.select(
-get_answer_schema(n)

Review comment:
   this is a pretty good example -- the `answer_schema` was created and 
immediately dropped after calling `df.select()`. Now the exprs don't need to be 
copied 









[GitHub] [arrow] github-actions[bot] commented on pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9703:
URL: https://github.com/apache/arrow/pull/9703#issuecomment-798907390


   https://issues.apache.org/jira/browse/ARROW-11790







[GitHub] [arrow] alamb opened a new pull request #9703: ARROW-11790: [Rust][DataFusion] Change builder signatures to take impl Interator rather than &[Expr]

2021-03-14 Thread GitBox


alamb opened a new pull request #9703:
URL: https://github.com/apache/arrow/pull/9703


   # NOTE:
   Since this is a fairly major backwards-incompatible change (many callsites need 
to be updated, though mostly mechanically), I gathered some feedback on this 
approach in https://github.com/apache/arrow/pull/9692 and this is the PR I 
propose for merge. 
   
   I'll leave this open for several days and also send a note to the mailing 
lists for additional comment
   
   It is part of my overall plan to make the DataFusion optimizer more 
idiomatic and do much less copying 
[ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689)
   
   # Rationale:
   All callsites currently need an owned `Vec<Expr>` (or equivalent) so they can pass 
in `&[Expr]`, and then DataFusion copies all the `Expr`s. Many times the 
original `Vec<Expr>` is discarded immediately after use (I'll point out where 
this happens in a few places below). Thus I think it would be better (more 
idiomatic and often less copying/faster) to take something that can produce an 
iterator over `Expr`.
   
   # Changes
   1. Change `DataFrame` so it takes `Vec<Expr>` rather than `&[Expr]`
   2. Change `LogicalPlanBuilder` so it takes `impl Iterator<Item = Expr>` rather 
than `&[Expr]`
   
   I couldn't figure out how to allow the `DataFrame` API (which is a trait) to 
take an `impl Iterator`
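The copy-avoidance argument from the description above can be sketched with `String` standing in for `Expr` (illustrative function names, not the actual DataFusion signatures): the `&[T]` version must clone every element, while the `impl IntoIterator` version can take ownership and avoid the copies.

```rust
// Slice version: the caller keeps ownership, so we must clone each element.
fn project_by_slice(exprs: &[String]) -> Vec<String> {
    exprs.to_vec() // clones every String
}

// Iterator version: the caller hands over ownership, no clones needed.
fn project_by_iter(exprs: impl IntoIterator<Item = String>) -> Vec<String> {
    exprs.into_iter().collect() // moves the Strings
}

fn main() {
    let exprs = vec!["a".to_string(), "b".to_string()];
    assert_eq!(project_by_slice(&exprs), vec!["a", "b"]);
    assert_eq!(project_by_iter(exprs), vec!["a", "b"]); // consumes `exprs`
}
```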
   







[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593899532



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -686,90 +815,124 @@ pub fn lexsort_to_indices(columns: &[SortColumn]) -> Result<UInt32Array> {
     };
 
     let mut value_indices = (0..row_count).collect::<Vec<usize>>();
-    value_indices.sort_by(lex_comparator);
+    let mut len = value_indices.len();
+
+    if let Some(limit) = limit {
+        len = limit.min(len);
+    }
+    sort_by(&mut value_indices, len, lex_comparator);
 
     Ok(UInt32Array::from(
-        value_indices
-            .into_iter()
-            .map(|i| i as u32)
+        (&value_indices)[0..len]
+            .iter()
+            .map(|i| *i as u32)
             .collect::<Vec<u32>>(),
     ))
 }
 
+pub fn partial_sort<T, F>(v: &mut [T], limit: usize, mut is_less: F)
+where
+    F: FnMut(&T, &T) -> Ordering,
+{
+    let (before, _mid, _after) = v.select_nth_unstable_by(limit, &mut is_less);
+    before.sort_unstable_by(is_less);

Review comment:
   There are some reasons for using unstable_sort.  
   1. `select_nth_unstable_by` is unstable, so using `before.sort_stable_by` 
wouldn't ensure the result is stable. Currently we do not have a 
`select_nth_stable_by`.
   2. `sort_unstable` is faster than a stable sort; refer to the 
[doc](https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)
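A stand-alone sketch of the `partial_sort` approach discussed above: `select_nth_unstable_by` partitions the slice so the `limit` smallest elements come first (in arbitrary order), and only that prefix is then sorted. The `limit < v.len()` guard is an addition in this sketch, since `select_nth_unstable_by` panics when the index is out of bounds; the diff in this thread handles that case elsewhere.

```rust
use std::cmp::Ordering;

// Sort only the `limit` smallest elements to the front of `v`, leaving the
// rest in arbitrary order — cheaper than a full sort when limit << v.len().
fn partial_sort<T, F>(v: &mut [T], limit: usize, mut is_less: F)
where
    F: FnMut(&T, &T) -> Ordering,
{
    if limit < v.len() {
        // `before` holds the `limit` smallest elements, unordered.
        let (before, _mid, _after) = v.select_nth_unstable_by(limit, &mut is_less);
        before.sort_unstable_by(is_less);
    } else {
        v.sort_unstable_by(is_less);
    }
}

fn main() {
    let mut data = vec![5, 3, 8, 1, 9, 2];
    partial_sort(&mut data, 3, |a, b| a.cmp(b));
    assert_eq!(&data[..3], &[1, 2, 3]); // the 3 smallest values, in order
}
```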













[GitHub] [arrow] github-actions[bot] commented on pull request #9702: ARROW-11297: [C++][Python] Add ORC writer options

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9702:
URL: https://github.com/apache/arrow/pull/9702#issuecomment-798904875


   https://issues.apache.org/jira/browse/ARROW-11297







[GitHub] [arrow] mathyingzhou opened a new pull request #9702: ARROW-11297: [C++][Python] Add ORC writer options

2021-03-14 Thread GitBox


mathyingzhou opened a new pull request #9702:
URL: https://github.com/apache/arrow/pull/9702


   







[GitHub] [arrow] alamb commented on pull request #9692: ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec rather than &[Expr]

2021-03-14 Thread GitBox


alamb commented on pull request #9692:
URL: https://github.com/apache/arrow/pull/9692#issuecomment-798900651


   @Dandandan  and @houqp 
   
   What do you think about something like this (which is even more Idiomatic 
(TM) I think), though harder for beginner Rust programmers to grok? 
   
   ```
   pub fn project(&self, expr: impl IntoIterator<Item = Expr>) -> Result<Self> {
   ```
   
   This would allow `PlanBuilder::project` to be called with a `Vec<Expr>` as 
well as things like `BTreeSet<Expr>`...
   
   
   Sadly, I couldn't give the same treatment to the `DataFrame` trait: when I 
tried, `rustc` started complaining to me about trait objects
   
   ```
   `dataframe::DataFrame` cannot be made into an object
  |
  = help: consider moving `select` to another trait
   note: for a trait to be "object safe" it needs to allow building a vtable to 
allow the call to be resolvable dynamically; for more information visit 

   ```
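The object-safety constraint behind that error can be sketched as follows; `DataFrameLike` is an illustrative trait, not the DataFusion `DataFrame` API. A generic method (which `impl IntoIterator` desugars to) makes a trait unusable as `dyn Trait`, which is why the trait has to take a concrete `Vec` instead:

```rust
trait DataFrameLike {
    // Object safe: a concrete argument type, so a vtable entry can exist.
    fn select(&self, exprs: Vec<String>) -> Vec<String>;
    // fn select<I: IntoIterator<Item = String>>(&self, exprs: I) -> Vec<String>;
    //   ^ this generic version would make `dyn DataFrameLike` illegal
}

struct Df;

impl DataFrameLike for Df {
    fn select(&self, exprs: Vec<String>) -> Vec<String> {
        exprs
    }
}

fn main() {
    // Trait objects work because every method is object safe.
    let df: Box<dyn DataFrameLike> = Box::new(Df);
    assert_eq!(df.select(vec!["a".to_string()]), vec!["a"]);
}
```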







[GitHub] [arrow] alamb commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593891961



##
File path: rust/datafusion/README.md
##
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory 
model. It supports executing SQL queries against CSV and Parquet files as well 
as querying directly against in-memory data.
+
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases

Review comment:
   One other thing this page might benefit from is an example (e.g. look at 
how tokio does it https://crates.io/crates/tokio)
   
   Perhaps we could just lift the nice example from 
https://docs.rs/datafusion/3.0.0/datafusion/ ?









[GitHub] [arrow] alamb commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593891533



##
File path: rust/datafusion/README.md
##
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory 
model. It supports executing SQL queries against CSV and Parquet files as well 
as querying directly against in-memory data.
+
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion 
achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series 
Database
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute 
Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)

Review comment:
   fyi @rdettai 

##
File path: rust/datafusion/README.md
##
@@ -19,11 +19,48 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory 
model. It supports executing SQL queries against CSV and Parquet files as well 
as querying directly against in-memory data.
+
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion 
achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series 
Database
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute 
Platform

Review comment:
   fyi @andygrove 

[GitHub] [arrow] alamb commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593891472



##
File path: rust/datafusion/DEVELOPERS.md
##
@@ -0,0 +1,79 @@
+# Developer's guide
+
+This section describes how you can get started developing DataFusion.
+
+### Bootstrap environment
+
+DataFusion is written in Rust and uses the standard Rust toolkit:
+
+* `cargo build`
+* `cargo fmt` to format the code
+* `cargo test` to test
+* etc.
+
+### Architecture Overview
+
+* (March 2021): The DataFusion architecture is described in *Query Engine 
Design and the Rust-Based DataFusion in Apache Arrow*: 
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content 
starts ~ 15 minutes in) and 
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)

Review comment:
   I added links to the talks from myself and @andygrove -- hopefully that is 
not too aggrandizing, but I think the content is helpful





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r593891392



##
File path: rust/datafusion/DEVELOPERS.md
##
@@ -0,0 +1,79 @@
+# Developer's guide

Review comment:
   I pulled this file into its own separate file so that it didn't appear 
on https://crates.io/crates/datafusion 









[GitHub] [arrow] github-actions[bot] commented on pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#issuecomment-798897258


   https://issues.apache.org/jira/browse/ARROW-11962







[GitHub] [arrow] alamb opened a new pull request #9701: ARROW-11962: [Rust][DataFusion] Improve DataFusion docs

2021-03-14 Thread GitBox


alamb opened a new pull request #9701:
URL: https://github.com/apache/arrow/pull/9701


   # Rationale
   1. It would be nice to market DataFusion a bit more and explain 
what it is good for
   
   # Changes
   1. Describe use cases for DataFusion (add some marketing "spin"??)
   2. Add links to other projects using DataFusion
   3. Add links to tech talks about DataFusion
   4. Split out developer docs into their own file
   5. Use the cool new image that came in 
https://github.com/apache/arrow/pull/9630 







[GitHub] [arrow] github-actions[bot] commented on pull request #9700: ARROW-11960: [Gandiva][C++]Support escape in LIKE

2021-03-14 Thread GitBox


github-actions[bot] commented on pull request #9700:
URL: https://github.com/apache/arrow/pull/9700#issuecomment-798895560


   https://issues.apache.org/jira/browse/ARROW-11960







[GitHub] [arrow] Crystrix opened a new pull request #9700: ARROW-11960: [Gandiva][C++]Support escape in LIKE

2021-03-14 Thread GitBox


Crystrix opened a new pull request #9700:
URL: https://github.com/apache/arrow/pull/9700


   Add the gdv_fn_like_utf8_utf8_int8 function to Gandiva to support an escape 
character in LIKE. The escape character is stored as an int8 type, which is 
compatible with the char type in C++.
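For reference, the semantics being added can be illustrated with a minimal, self-contained LIKE matcher (an illustrative sketch only, not the Gandiva implementation, which compiles the pattern into a regex): `%` matches any run of characters, `_` matches exactly one character, and any character preceded by the escape character is matched literally.

```rust
/// Minimal SQL LIKE matcher with an escape character.
/// `%` = any run of characters, `_` = exactly one character,
/// `<escape>x` = the literal character `x` (so `\%` matches a percent sign).
fn like_match(pattern: &str, input: &str, escape: char) -> bool {
    fn helper(p: &[char], s: &[char], escape: char) -> bool {
        match p.split_first() {
            None => s.is_empty(),
            // Escaped character: the next pattern char is matched literally.
            Some((&c, rest)) if c == escape => match rest.split_first() {
                Some((&lit, rest2)) => s
                    .split_first()
                    .map_or(false, |(&sc, sr)| sc == lit && helper(rest2, sr, escape)),
                None => false, // dangling escape at end of pattern
            },
            // `%`: try every possible split of the remaining input.
            Some((&'%', rest)) => (0..=s.len()).any(|i| helper(rest, &s[i..], escape)),
            // `_`: consume exactly one input character.
            Some((&'_', rest)) => {
                s.split_first().map_or(false, |(_, sr)| helper(rest, sr, escape))
            }
            // Ordinary character: must match exactly.
            Some((&c, rest)) => s
                .split_first()
                .map_or(false, |(&sc, sr)| sc == c && helper(rest, sr, escape)),
        }
    }
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = input.chars().collect();
    helper(&p, &s, escape)
}

fn main() {
    assert!(like_match("a%b", "axxb", '\\'));
    assert!(like_match("a\\%b", "a%b", '\\')); // escaped `%` is a literal percent
    assert!(!like_match("a\\%b", "aXb", '\\'));
    assert!(like_match("_bc", "abc", '\\'));
}
```

The backtracking in the `%` arm can get expensive for patterns with many wildcards; delegating to a compiled regex engine, as Gandiva does, avoids that cost.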







[GitHub] [arrow] alamb commented on a change in pull request #9645: ARROW-11894: [Rust][DataFusion] Change flight server example to use DataFrame API

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9645:
URL: https://github.com/apache/arrow/pull/9645#discussion_r593885249



##
File path: rust/datafusion/examples/flight_server.rs
##
@@ -94,28 +94,22 @@ impl FlightService for FlightServiceImpl {
 "alltypes_plain",
&format!("{}/alltypes_plain.parquet", testdata),
)
-.map_err(|e| to_tonic_err(&e))?;
+.map_err(to_tonic_err)?;
 
-// create the query plan
-let plan = ctx
-.create_logical_plan(&sql)
-.and_then(|plan| ctx.optimize(&plan))
-.and_then(|plan| ctx.create_physical_plan(&plan))
-.map_err(|e| to_tonic_err(&e))?;
+// create the DataFrame
+let df = ctx.sql(sql).map_err(to_tonic_err)?;
 
 // execute the query
-let results =
-collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;
+let results = df.collect().await.map_err(to_tonic_err)?;
 if results.is_empty() {
 return Err(Status::internal("There were no results from 
ticket"));
 }
 
 // add an initial FlightData message that sends schema
 let options = arrow::ipc::writer::IpcWriteOptions::default();
-let schema = plan.schema();
 let schema_flight_data =
 arrow_flight::utils::flight_data_from_arrow_schema(
-schema.as_ref(),
+&df.schema().clone().into(),

Review comment:
   ❤️  -- looks great @Dandandan 









[GitHub] [arrow] Dandandan commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593885246



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
pub fn lexsort_to_indices(columns: &[SortColumn], limit: Option<usize>) -> 
Result<UInt32Array> {
 };
 
let mut value_indices = (0..row_count).collect::<Vec<usize>>();
-value_indices.sort_by(lex_comparator);
+let mut len = value_indices.len();
+
+if let Some(limit) = limit {
+len = limit.min(len);
+}
sort_by(&mut value_indices, len, lex_comparator);
 
 Ok(UInt32Array::from(
-value_indices
-.into_iter()
-.map(|i| i as u32)
+(&value_indices)[0..len]
+.iter()
+.map(|i| *i as u32)
 .collect::<Vec<u32>>(),
 ))
 }
 
+pub fn partial_sort<T, F>(v: &mut [T], limit: usize, mut is_less: F)
+where
+F: FnMut(&T, &T) -> Ordering,
+{
+let (before, _mid, _after) = v.select_nth_unstable_by(limit, &mut is_less);
+before.sort_unstable_by(is_less);

Review comment:
   I am wondering, why can we use unstable sort here, while we use stable 
sort for the normal sort kernel?
   Shouldn't that be an option of the kernel?
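The trade-off being asked about can be seen in a standalone sketch of the partial-sort pattern: `select_nth_unstable_by` only partitions (it is inherently unstable), but the follow-up sort of the prefix is free to be stable. This sketch (an illustration, not the kernel itself) uses a stable `sort_by` on the prefix, which preserves the input order of equal keys within it:

```rust
use std::cmp::Ordering;

/// Order only the smallest `limit` elements to the front of `v`;
/// the tail is left in arbitrary order. Roughly O(n + limit*log(limit))
/// instead of O(n*log(n)) for a full sort.
fn partial_sort<T, F>(v: &mut [T], limit: usize, mut is_less: F)
where
    F: FnMut(&T, &T) -> Ordering,
{
    if limit >= v.len() {
        v.sort_by(is_less); // nothing to truncate: fall back to a full stable sort
        return;
    }
    // Partition: v[..limit] now holds the `limit` smallest elements, unordered.
    let (before, _nth, _after) = v.select_nth_unstable_by(limit, &mut is_less);
    // A stable sort of the prefix preserves input order among equal keys.
    before.sort_by(is_less);
}

fn main() {
    let mut data = vec![9, 7, 8, 1, 5, 2];
    partial_sort(&mut data, 3, |a, b| a.cmp(b));
    assert_eq!(&data[..3], &[1, 2, 5]);
}
```

Even so, the combined operation is not fully stable: the partition step may keep an arbitrary subset of several equal boundary keys. A truly stable limited sort needs a comparator that breaks ties on the original index, which sorting an index array (as `lexsort_to_indices` does) effectively provides.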









[GitHub] [arrow] codecov-io edited a comment on pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #9695:
URL: https://github.com/apache/arrow/pull/9695#issuecomment-798734537


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=h1) Report
   > Merging 
[#9695](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=desc) (ac89542) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/6a1320604a2021711e637c2ca586b374a593d66e?el=desc)
 (6a13206) will **decrease** coverage by `0.00%`.
   > The diff coverage is `83.33%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9695/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=tree)
   
    ```diff
    @@            Coverage Diff             @@
    ##           master    #9695      +/-   ##
    ==========================================
    - Coverage   82.57%   82.56%   -0.01%
    ==========================================
      Files         248      249       +1
      Lines       58279    58443     +164
    ==========================================
    + Hits        48121    48256     +135
    - Misses      10158    10187      +29
    ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=)
 | `87.82% <0.00%> (-2.32%)` | :arrow_down: |
   | 
[rust/datafusion/src/optimizer/constant\_folding.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvY29uc3RhbnRfZm9sZGluZy5ycw==)
 | `92.30% <0.00%> (-0.36%)` | :arrow_down: |
   | 
[...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==)
 | `53.60% <0.00%> (-1.72%)` | :arrow_down: |
   | 
[...t/datafusion/src/optimizer/projection\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvcHJvamVjdGlvbl9wdXNoX2Rvd24ucnM=)
 | `97.72% <ø> (ø)` | |
   | 
[rust/datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21vZC5ycw==)
 | `88.00% <ø> (ø)` | |
   | 
[rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==)
 | `82.98% <74.00%> (-0.24%)` | :arrow_down: |
   | 
[rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==)
 | `87.43% <76.92%> (-0.71%)` | :arrow_down: |
   | 
[rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==)
 | `79.72% <78.57%> (-0.05%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/union.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3VuaW9uLnJz)
 | `87.50% <87.50%> (ø)` | |
   | 
[rust/datafusion/src/optimizer/limit\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvbGltaXRfcHVzaF9kb3duLnJz)
 | `97.82% <100.00%> (+0.68%)` | :arrow_up: |
   | ... and [7 
more](https://codecov.io/gh/apache/arrow/pull/9695/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=footer). Last 
update 
[6a13206...ac89542](https://codecov.io/gh/apache/arrow/pull/9695?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] Dandandan commented on a change in pull request #9645: ARROW-11894: [Rust][DataFusion] Change flight server example to use DataFrame API

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9645:
URL: https://github.com/apache/arrow/pull/9645#discussion_r593884833



##
File path: rust/datafusion/examples/flight_server.rs
##
@@ -94,28 +94,26 @@ impl FlightService for FlightServiceImpl {
 "alltypes_plain",
&format!("{}/alltypes_plain.parquet", testdata),
)
-.map_err(|e| to_tonic_err(&e))?;
+.map_err(to_tonic_err)?;
 
-// create the query plan
-let plan = ctx
-.create_logical_plan(&sql)
-.and_then(|plan| ctx.optimize(&plan))
-.and_then(|plan| ctx.create_physical_plan(&plan))
-.map_err(|e| to_tonic_err(&e))?;
+// create the DataFrame
+let df = ctx.sql(sql).map_err(to_tonic_err)?;
 
 // execute the query
-let results =
-collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;
+let results = df.collect().await.map_err(to_tonic_err)?;
 if results.is_empty() {
 return Err(Status::internal("There were no results from 
ticket"));
 }
 
+let physical_plan = ctx

Review comment:
   I "discovered" you can make a `DFSchema` an arrow `Schema` again with 
`df.schema().clone().into()` which I think is maybe even better?
   
   The approach to use `batches[0]` here doesn't work as the batches have type 
`Result`, so that would probably need some more "extraction" to retrieve the 
schema.
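The `.into()` call works because a `From<DFSchema> for Schema` conversion exists, and `From` automatically provides the corresponding `Into`. A hypothetical mini-version (these `Field`/`Schema`/`DFSchema` structs are simplified stand-ins, not the real Arrow/DataFusion types) shows why `df.schema().clone().into()` compiles:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
}

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

// DataFusion-side schema; extra metadata (field qualifiers etc.) elided.
#[derive(Clone, Debug)]
struct DFSchema {
    fields: Vec<Field>,
}

// One `From` impl is enough: `Into<Schema> for DFSchema` comes for free.
impl From<DFSchema> for Schema {
    fn from(s: DFSchema) -> Self {
        Schema { fields: s.fields }
    }
}

fn main() {
    let df_schema = DFSchema {
        fields: vec![Field { name: "a".to_string() }],
    };
    // Mirrors the `df.schema().clone().into()` pattern from the comment.
    let arrow_schema: Schema = df_schema.clone().into();
    assert_eq!(arrow_schema.fields.len(), 1);
}
```

The clone is needed because the conversion consumes the source value while `schema()` only hands out a reference.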









[GitHub] [arrow] alamb closed pull request #9639: ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan

2021-03-14 Thread GitBox


alamb closed pull request #9639:
URL: https://github.com/apache/arrow/pull/9639


   







[GitHub] [arrow] alamb commented on a change in pull request #9639: ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan

2021-03-14 Thread GitBox


alamb commented on a change in pull request #9639:
URL: https://github.com/apache/arrow/pull/9639#discussion_r593883317



##
File path: rust/datafusion/src/execution/context.rs
##
@@ -1702,6 +1702,23 @@ mod tests {
 }
 Ok(())
 }
+#[test]
+fn ctx_sql_should_optimize_plan() -> Result<()> {
+let mut ctx = ExecutionContext::new();
+let plan1 =
+ctx.create_logical_plan("SELECT * FROM (SELECT 1) WHERE TRUE AND 
TRUE")?;
+
+let opt_plan1 = ctx.optimize(&plan1)?;
+
+let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;

Review comment:
    









[GitHub] [arrow] alamb commented on pull request #9691: ARROW-11952: [Rust] Make ArrayData --> GenericListArray fallable instead of `panic!`

2021-03-14 Thread GitBox


alamb commented on pull request #9691:
URL: https://github.com/apache/arrow/pull/9691#issuecomment-79368


   I am not sure how much of an improvement this is and I would be happy to 
simply close the PR, but I didn't want to leave it outstanding.







[GitHub] [arrow] Dandandan commented on a change in pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union

2021-03-14 Thread GitBox


Dandandan commented on a change in pull request #9695:
URL: https://github.com/apache/arrow/pull/9695#discussion_r593883069



##
File path: rust/datafusion/src/physical_plan/union.rs
##
@@ -0,0 +1,76 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! The UNION ALL execution plan
+
+use std::{any::Any, sync::Arc};
+
+use arrow::datatypes::SchemaRef;
+
+use super::{ExecutionPlan, Partitioning, SendableRecordBatchStream};
+use crate::error::Result;
+use async_trait::async_trait;
+
+/// UNION ALL execution plan
+#[derive(Debug)]
+pub struct UnionExec {
+/// Input execution plan
+inputs: Vec<Arc<dyn ExecutionPlan>>,
+}
+
+impl UnionExec {
+/// Create a new UnionExec
+pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Self {
+UnionExec { inputs }
+}
+}
+
+#[async_trait]
+impl ExecutionPlan for UnionExec {
+/// Return a reference to Any that can be used for downcasting
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn schema(&self) -> SchemaRef {
+self.inputs[0].schema()
+}
+
+fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
+self.inputs.clone()
+}
+
+/// Get the output partitioning of this plan
+fn output_partitioning(&self) -> Partitioning {
+Partitioning::UnknownPartitioning(self.inputs.len())
+}
+
+fn with_new_children(
+&self,
+children: Vec<Arc<dyn ExecutionPlan>>,
+) -> Result<Arc<dyn ExecutionPlan>> {
+Ok(Arc::new(UnionExec::new(children)))
+}
+
+async fn execute(&self, partition: usize) -> 
Result<SendableRecordBatchStream> {
+self.inputs[partition].execute(0).await

Review comment:
   I removed the `MergeExec` and changed this to what I believe Spark also 
does, which is return the sum of the input partitions and added a test for this.
   
   For now, this removes the information about the input partitioning and 
return `UnknownPartitioning`, which is not ideal when the partitioning info 
could be used somewhere else (for example a hash partitioning).
   I added a TODO comment there. I think it would be OK to keep it for now like 
this and maybe create a JIRA issue for it?
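The bookkeeping behind "output partitioning is the sum of the input partitions" can be sketched with a hypothetical helper (not the PR's code): `execute(partition)` has to route a global output partition index to a child plan plus that child's local partition index.

```rust
/// Map a union's global output partition index to (child index, local
/// partition index), given each child's partition count.
/// E.g. children with [2, 3] partitions expose 5 output partitions:
/// 0,1 -> child 0; 2,3,4 -> child 1.
fn route_partition(counts: &[usize], partition: usize) -> Option<(usize, usize)> {
    let mut remaining = partition;
    for (child, &count) in counts.iter().enumerate() {
        if remaining < count {
            return Some((child, remaining));
        }
        remaining -= count;
    }
    None // partition index out of range
}

fn main() {
    let counts = [2, 3];
    assert_eq!(route_partition(&counts, 0), Some((0, 0)));
    assert_eq!(route_partition(&counts, 4), Some((1, 2)));
    assert_eq!(route_partition(&counts, 5), None);
}
```

With this scheme each input partition maps to exactly one output partition, so no data is merged and the union stays embarrassingly parallel; only the finer-grained partitioning metadata (e.g. hash partitioning of the children) is lost, as the comment notes.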










[GitHub] [arrow] alamb closed pull request #9699: ARROW-11959: [Rust][DataFusion] Fix log line

2021-03-14 Thread GitBox


alamb closed pull request #9699:
URL: https://github.com/apache/arrow/pull/9699


   







[GitHub] [arrow] alamb commented on pull request #9600: ARROW-11822: [Rust][Datafusion] Support case sensitive for function

2021-03-14 Thread GitBox


alamb commented on pull request #9600:
URL: https://github.com/apache/arrow/pull/9600#issuecomment-798885671


   > The above is the primary reason why I did not introduce this idea before; 
it will lead to either no one changing that parameter once the system is running
   
   Yes -- this is what I think will happen. I view this configuration parameter 
as something systems that are built on top of DataFusion will configure 
statically based on their needs and not change on a per-query basis (even 
though you could given the way the code is written)
   







[GitHub] [arrow] alamb closed pull request #9690: ARROW-11951: [Rust] Remove OffsetSize::prefix

2021-03-14 Thread GitBox


alamb closed pull request #9690:
URL: https://github.com/apache/arrow/pull/9690


   







[GitHub] [arrow] nevi-me commented on a change in pull request #9232: ARROW-10818: [Rust] Initial implementation for Decimal (Decimal128Type only)

2021-03-14 Thread GitBox


nevi-me commented on a change in pull request #9232:
URL: https://github.com/apache/arrow/pull/9232#discussion_r593875606



##
File path: rust/arrow/src/compute/kernels/cast.rs
##
@@ -1511,6 +1641,65 @@ mod tests {
 assert!(9.0 - c.value(4) < f64::EPSILON);
 }
 
+#[test]
+fn test_cast_i64_to_decimal() {
+let a = Int64Array::from(vec![1, 2]);
+let array = Arc::new(a) as ArrayRef;
+let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();
+let c = b.as_any().downcast_ref::<DecimalArray>().unwrap();
+
+assert_eq!("1.00", c.value(0).to_string());
+assert_eq!("2.00", c.value(1).to_string());
+}
+
+#[test]
+fn test_cast_f64_to_decimal() {
+let a = Float64Array::from(vec![1.0, 2.0]);
+let array = Arc::new(a) as ArrayRef;
+let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();

Review comment:
   My understanding is that scale should be <= precision, such that we 
can't have `Decimal(5, 10)`. Is this correct, or is it me not understanding the 
rules?
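That understanding matches the usual convention: `0 <= scale <= precision`, and for a 128-bit representation precision can be at most 38 decimal digits (since 10^38 - 1 < 2^127). A sketch of the validation a builder could perform (illustrative only, not the PR's code):

```rust
/// Check a Decimal(precision, scale) pair: scale must not exceed precision,
/// and precision must fit in 128 bits (at most 38 decimal digits).
fn validate_decimal(precision: usize, scale: usize) -> Result<(), String> {
    if precision == 0 || precision > 38 {
        return Err(format!("precision {} out of range 1..=38", precision));
    }
    if scale > precision {
        return Err(format!(
            "scale {} exceeds precision {}; Decimal({}, {}) is invalid",
            scale, precision, precision, scale
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_decimal(5, 2).is_ok());
    assert!(validate_decimal(5, 10).is_err()); // scale > precision is rejected
    assert!(validate_decimal(39, 0).is_err());
}
```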

##
File path: rust/arrow/src/compute/kernels/cast.rs
##
@@ -1511,6 +1641,65 @@ mod tests {
 assert!(9.0 - c.value(4) < f64::EPSILON);
 }
 
+#[test]
+fn test_cast_i64_to_decimal() {
+let a = Int64Array::from(vec![1, 2]);
+let array = Arc::new(a) as ArrayRef;
+let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();
+let c = b.as_any().downcast_ref::<DecimalArray>().unwrap();
+
+assert_eq!("1.00", c.value(0).to_string());
+assert_eq!("2.00", c.value(1).to_string());
+}
+
+#[test]
+fn test_cast_f64_to_decimal() {
+let a = Float64Array::from(vec![1.0, 2.0]);
+let array = Arc::new(a) as ArrayRef;
+let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();
+let c = b.as_any().downcast_ref::<DecimalArray>().unwrap();
+
+assert_eq!("1.00", c.value(0).to_string());
+assert_eq!("2.00", c.value(1).to_string());
+}
+
+#[test]
+fn test_cast_decimal_to_i64() {
+let mut builder = DecimalBuilder::new(2, 5, 10);
+builder
+.append_value(Decimal128Type::new(100_i128, 5, 10))
+.unwrap();
+builder
+.append_value(Decimal128Type::new(200_i128, 5, 10))
+.unwrap();
+
+let array = Arc::new(builder.finish()) as ArrayRef;
+let b = cast(&array, &DataType::Int64).unwrap();
+let c = b.as_any().downcast_ref::<Int64Array>().unwrap();
+
+assert_eq!(1_i64, c.value(0));
+assert_eq!(2_i64, c.value(1));
+}
+
+#[test]
+fn test_cast_decimal_to_f64() {
+let mut builder = DecimalBuilder::new(3, 3, 5);
+builder
+.append_value(Decimal128Type::new(123456_i128, 3, 5))
+.unwrap();
+builder
+.append_value(Decimal128Type::new(654321_i128, 3, 5))
+.unwrap();
+
+let array = Arc::new(builder.finish()) as ArrayRef;
+let b = cast(&array, &DataType::Float64).unwrap();
+let c = b.as_any().downcast_ref::<Float64Array>().unwrap();
+
+// string compare to skip clippy::float_cmp

Review comment:
   You can also ignore the clippy lint here if it helps :)

##
File path: rust/datafusion/src/scalar.rs
##
@@ -210,6 +213,14 @@ impl ScalarValue {
 /// Converts a scalar value into an array of `size` rows.
 pub fn to_array_of_size(&self, size: usize) -> ArrayRef {
 match self {
+ScalarValue::Decimal128(e, p, s) => match e {

Review comment:
   Perhaps @ovr hasn't gotten to it? I also can't think of what a 
limitation would be

##
File path: rust/arrow/src/datatypes/i256_type.rs
##
@@ -0,0 +1,297 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::alloc::NativeType;
+use crate::datatypes::{ArrowDecimalNativeType, DataType};
+use num::Zero;
+use std::cmp::Ordering;
+use std::fmt;
+use std::num::ParseIntError;
+use std::ops::{Add, AddAssign, Div, DivAssign, Mul, MulAssign, Sub, SubAssign};
+use std::str::FromStr;
+
+/// Number of bits a signed 256-bit number occupies.
+pub const BITS: usize = 256;
+
+/// Number of bytes a signed 256-bit number occupies.
+pub const BYTES: usize = 32;
+
+/// An