Re: [I] Add example for writing an `AnalyzerRule` [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on issue #10855: URL: https://github.com/apache/datafusion/issues/10855#issuecomment-2171124147 Hi @alamb, I just want to share how I use user-defined AnalyzerRule. As you know, I'm working on reimplementing the semantic layer engine, [wren-engine](https://github.com/Ca

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641668286 ## datafusion/proto/src/physical_plan/mod.rs: ## @@ -496,11 +496,14 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode {

[PR] Remove `prefer_hash_join` env variable for clickbench [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 opened a new pull request, #10933: URL: https://github.com/apache/datafusion/pull/10933 ## Which issue does this PR close? Closes #. ## Rationale for this change clickbench does not expect `--prefer_hash_join`, remove it. It is introduce in #10

Re: [PR] feat: Support CartesianProductExec in comet [datafusion-comet]

2024-06-16 Thread via GitHub
leoluan2009 commented on PR #442: URL: https://github.com/apache/datafusion-comet/pull/442#issuecomment-2171428443 > @leoluan2009 Would you like to update the plan stability results? Thanks. Hi @viirya sorry for delay reply, I think this pr do not change the query plan for tpcds? -

Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]

2024-06-16 Thread via GitHub
alamb commented on issue #10926: URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171460701 > Should we also change the signature and return Result? Yes I think so > Should we also change the signature and return Result? I actually think using

Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]

2024-06-16 Thread via GitHub
alamb merged PR #10927: URL: https://github.com/apache/datafusion/pull/10927 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]

2024-06-16 Thread via GitHub
alamb commented on PR #10927: URL: https://github.com/apache/datafusion/pull/10927#issuecomment-2171460836 Thanks everyone for the reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

Re: [PR] Minor: Remove `prefer_hash_join` env variable for clickbench [datafusion]

2024-06-16 Thread via GitHub
alamb merged PR #10933: URL: https://github.com/apache/datafusion/pull/10933 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

[I] Add a benchmark for extracting data page statistics [datafusion]

2024-06-16 Thread via GitHub
alamb opened a new issue, #10934: URL: https://github.com/apache/datafusion/issues/10934 ### Is your feature request related to a problem or challenge? As we work to make extracting statistics from parquet data pages more correct and performant in https://github.com/apache/datafusion

Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]

2024-06-16 Thread via GitHub
alamb commented on PR #10932: URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171467776 cc @xinlifoobar and @marvinlanhenke -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808518 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under th

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808587 ## datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs: ## @@ -15,91 +15,139 @@ // specific language governing permissions and limitations

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub
MohamedAbdeen21 commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171468852 Hey @peter-toth sorry for the late reply. Looks good, and I like the Alias generator. But I have a couple of comments: 1. Thought we agreed on the `#` prefix, is

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171470071 > > External error: statement is expected to fail with error: > > (regex) DataFusion error: Error during planning: .* No function matches the given name and argument types >

Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]

2024-06-16 Thread via GitHub
marvinlanhenke commented on issue #10926: URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171471803 > I actually think using `new_null_array` is wrong here. It returns the wrong type I passed `&DataType::UInt64` into `new_null_array(...)`. However, your approach

[I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub
ahirner opened a new issue, #10935: URL: https://github.com/apache/datafusion/issues/10935 ### Is your feature request related to a problem or challenge? Currently where clauses in a sub query are not pushed below `Unnest` expressions. In conjunction with some other limitations, it co

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641813923 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under t

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641818096 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under t

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171495707 For this kind of query `select unnest(unnest([[1],[1,2],[1,2,3]]));`, I think we can follow what Duckdb is. Unnest with recursive, and max_depth `select unnest([

Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]

2024-06-16 Thread via GitHub
marvinlanhenke commented on PR #10932: URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171497286 > I think this is a nice code impeovement ❤️ It is indeed. Thanks @Weijun-H really nice refactor. > I am not sure this will improve the performance I'll gue

Re: [I] x NOT IN y works but NOT (x IN y) doesn't [datafusion]

2024-06-16 Thread via GitHub
akoshchiy commented on issue #10830: URL: https://github.com/apache/datafusion/issues/10830#issuecomment-2171512487 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[PR] fix: rewriting NOT IN () to anti join [datafusion]

2024-06-16 Thread via GitHub
akoshchiy opened a new pull request, #10936: URL: https://github.com/apache/datafusion/pull/10936 ## Which issue does this PR close? Closes #10830 . ## Rationale for this change ## What changes are included in this PR? ## Are these changes t

Re: [I] Improve performance for grouping by variable length columns (strings) [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on issue #9403: URL: https://github.com/apache/datafusion/issues/9403#issuecomment-2171579091 I try to an approach that take multiple columns into consideration but found that the time spend on `insert_accounted` largely increase. ```rust self.map.insert_accoun

[PR] [DRAFT] Multiple group by optimization [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 opened a new pull request, #10937: URL: https://github.com/apache/datafusion/pull/10937 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub
vaibhawvipul commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171678704 looks nice, please let me know when it is ready for review. Note, can you also run the relevant scala test? -- This is an automated message from the Apache Git Service

[PR] Add example for writing SQL analysis using DataFusion structures [datafusion]

2024-06-16 Thread via GitHub
LorrensP-2158466 opened a new pull request, #10938: URL: https://github.com/apache/datafusion/pull/10938 ## Which issue does this PR close? Closes #10871 . ## Rationale for this change ## What changes are included in this PR? Added an example of an

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171687704 Thanks @jayzhan211 approved it. : ) If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs. -- This is an automate

Re: [PR] Move `DynamicFileCatalog` back to core [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on PR #10745: URL: https://github.com/apache/datafusion/pull/10745#issuecomment-2171692026 Hi @alamb, is there any update on this topic? Or maybe I can help by splitting out some traits into the new crate `datafusion-catalog` first? -- This is an automated message from

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171694652 > Thanks @jayzhan211 approved it. : ) If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs. Sure! -- This is

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 merged PR #10917: URL: https://github.com/apache/datafusion/pull/10917 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [I] Convert `ApproxPercentileCont` and ``ApproxPercentileContWithWeight` to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 closed issue #10870: Convert `ApproxPercentileCont` and ``ApproxPercentileContWithWeight` to UDAF URL: https://github.com/apache/datafusion/issues/10870 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL a

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171695849 Thanks again @jayzhan211 :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

Re: [PR] use ScalarValue::to_pyarrow to convert to python object [datafusion-python]

2024-06-16 Thread via GitHub
andygrove merged PR #731: URL: https://github.com/apache/datafusion-python/pull/731 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@d

Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]

2024-06-16 Thread via GitHub
andygrove closed issue #729: Simpify `PyExpr::python_value` by using `ScalarValue::into_py` URL: https://github.com/apache/datafusion-python/issues/729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]

2024-06-16 Thread via GitHub
andygrove closed issue #729: Simpify `PyExpr::python_value` by using `ScalarValue::into_py` URL: https://github.com/apache/datafusion-python/issues/729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub
peter-toth commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171754117 > Hey @peter-toth sorry for the late reply. > > Looks good, and I like the Alias generator. But I have a couple of comments: > > 1. Thought we agreed on the `#` prefi

Re: [PR] Use shorter aliases in CSE [datafusion]

2024-06-16 Thread via GitHub
peter-toth commented on PR #10939: URL: https://github.com/apache/datafusion/pull/10939#issuecomment-2171754406 cc @MohamedAbdeen21, @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub
MohamedAbdeen21 closed pull request #10868: CSE shorthand alias URL: https://github.com/apache/datafusion/pull/10868 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscr

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub
MohamedAbdeen21 commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171757651 Became part of #10939. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub
duongcongtoai commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171768257 I think this is related https://github.com/apache/datafusion/issues/10660 -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] fix: rewriting NOT IN () to anti join [datafusion]

2024-06-16 Thread via GitHub
viirya commented on code in PR #10936: URL: https://github.com/apache/datafusion/pull/10936#discussion_r1641921370 ## datafusion/optimizer/src/decorrelate_predicate_subquery.rs: ## @@ -1099,6 +1126,29 @@ mod tests { Ok(()) } +#[test] +fn wrapped_not_in_su

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub
ahirner commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171783792 Thanks for the pointers @duongcongtoai I think the recursive and max_depth options can be a special workaround, but introduce a good amount of complexity. Fixing the push down

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171794774 @jayzhan211 / @alamb can you please help me with a review here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Gi

[PR] refactor: dont include fallback in match on mem_pool_type [datafusion]

2024-06-16 Thread via GitHub
tshauck opened a new pull request, #10940: URL: https://github.com/apache/datafusion/pull/10940 ## Which issue does this PR close? Closes #. ## Rationale for this change Using the wildcard pattern has resulted in some minor bugs in the past in DataFusion. So this

[PR] refactor: remove extra default in max rows [datafusion]

2024-06-16 Thread via GitHub
tshauck opened a new pull request, #10941: URL: https://github.com/apache/datafusion/pull/10941 ## Which issue does this PR close? Closes #. ## Rationale for this change Currently `-h` in the datafusion CLI shows the default for max rows twice. ## What chan

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on code in PR #10930: URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642021123 ## datafusion/functions-aggregate/src/bit_and_or_xor.rs: ## @@ -0,0 +1,400 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contrib

[I] Convert `Avg` to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 opened a new issue, #10942: URL: https://github.com/apache/datafusion/issues/10942 ### Is your feature request related to a problem or challenge? Similar to others issue in #8708 ### Describe the solution you'd like _No response_ ### Describe alternativ

[I] Convert `Min` and `Max` to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 opened a new issue, #10943: URL: https://github.com/apache/datafusion/issues/10943 ### Is your feature request related to a problem or challenge? Similar to other issues in #8708 I think the change of this might be large, could split it in several PR. ### Desc

[I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 opened a new issue, #10944: URL: https://github.com/apache/datafusion/issues/10944 ### Is your feature request related to a problem or challenge? Similar to other issues in #8708 ### Describe the solution you'd like _No response_ ### Describe alternativ

Re: [PR] refactor: remove extra default in max rows [datafusion]

2024-06-16 Thread via GitHub
jonahgao merged PR #10941: URL: https://github.com/apache/datafusion/pull/10941 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] feat: RewriteCycle API for short-circuiting optimizer loops [datafusion]

2024-06-16 Thread via GitHub
jayzhan211 commented on code in PR #10386: URL: https://github.com/apache/datafusion/pull/10386#discussion_r1642086691 ## datafusion/optimizer/src/rewrite_cycle.rs: ## @@ -0,0 +1,262 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licens

Re: [I] Convert `Min` and `Max` to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on issue #10943: URL: https://github.com/apache/datafusion/issues/10943#issuecomment-2172176854 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on issue #10944: URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172177142 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on code in PR #10930: URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642130995 ## datafusion/functions-aggregate/src/bit_and_or_xor.rs: ## @@ -0,0 +1,400 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribut

Re: [I] Modulus can produce incorrect results in some cases [datafusion-comet]

2024-06-16 Thread via GitHub
vaibhawvipul commented on issue #521: URL: https://github.com/apache/datafusion-comet/issues/521#issuecomment-2172247315 working on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specif

Re: [PR] Support dictionary data type in array_to_string [datafusion]

2024-06-16 Thread via GitHub
Weijun-H commented on code in PR #10908: URL: https://github.com/apache/datafusion/pull/10908#discussion_r1642217621 ## datafusion/functions-array/src/string.rs: ## @@ -281,6 +281,31 @@ pub(super) fn array_to_string_inner(args: &[ArrayRef]) -> Result { Ok(arg

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2172374561 cc: @jayzhan211 I have made the changes and also updated the UT. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[PR] Convert string agg to udaf [datafusion]

2024-06-16 Thread via GitHub
lewiszlw opened a new pull request, #10945: URL: https://github.com/apache/datafusion/pull/10945 ## Which issue does this PR close? part of https://github.com/apache/datafusion/issues/8708. ## Rationale for this change ## What changes are included in this

[PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

2024-06-16 Thread via GitHub
marvinlanhenke opened a new pull request, #10946: URL: https://github.com/apache/datafusion/pull/10946 ## Which issue does this PR close? Closes #10926. ## Rationale for this change - row_group_null_counts, data_page_null_counts - data_page_row_counts

Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

2024-06-16 Thread via GitHub
marvinlanhenke commented on PR #10946: URL: https://github.com/apache/datafusion/pull/10946#issuecomment-2172417984 @alamb PTAL. I'm not sure about the assumptions I made in `data_page_row_counts` [here](https://github.com/apache/datafusion/pull/10946/files#diff-7110f4709c105a18ef74a2

Re: [I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub
dharanad commented on issue #10944: URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172431668 Hey @jayzhan211 I noticed the `approx_distinct` UDAF implemented in `datafusion/functions-aggregate/src/approx_distinct.rs`. It appears to be using a `HyperLogLog` (HLL) st