Re: [PR] Move `sql_compound_identifier_to_expr ` to `ExprPlanner` [datafusion]

2024-07-17 Thread via GitHub
dharanad commented on PR #11487: URL: https://github.com/apache/datafusion/pull/11487#issuecomment-2235730350 @jayzhan211 I apologize for any confusion caused by submitting multiple reviews. I was mistaken in believing I had the latest changes from the `main` branch. This revision incorpora

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682248395 ## datafusion/physical-plan/src/spill.rs: ## @@ -85,3 +85,104 @@ fn read_spill(sender: Sender>, path: &Path) -> Result<()> { } Ok(()) } + +/// Spill the

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682246227 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -1544,6 +1646,7 @@ fn get_filtered_join_mask( // we don't need to check any others for the

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682245542 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -1502,13 +1572,45 @@ fn get_buffered_columns( buffered_data: &BufferedData, buffered_batch

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682243474 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -1502,13 +1572,45 @@ fn get_buffered_columns( buffered_data: &BufferedData, buffered_batch

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682241988 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -941,6 +1020,7 @@ impl SMJStream { self.buffered_state = BufferedSt

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682236590 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -858,6 +889,57 @@ impl SMJStream { } } +fn free_reservation(&mut self, buffered_

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682230886 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -858,6 +889,57 @@ impl SMJStream { } } +fn free_reservation(&mut self, buffered_

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682227159 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -858,6 +889,57 @@ impl SMJStream { } } +fn free_reservation(&mut self, buffered_

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r168150 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -858,6 +889,57 @@ impl SMJStream { } } +fn free_reservation(&mut self, buffered_

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682196532 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -858,6 +889,57 @@ impl SMJStream { } } +fn free_reservation(&mut self, buffered_

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682192033 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -250,7 +250,12 @@ impl DisplayAs for SortMergeJoinExec { write!(

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682189556 ## datafusion/physical-plan/src/spill.rs: ## @@ -85,3 +85,104 @@ fn read_spill(sender: Sender>, path: &Path) -> Result<()> { } Ok(()) } + +/// Spill the

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682186911 ## datafusion/physical-plan/src/spill.rs: ## @@ -85,3 +85,104 @@ fn read_spill(sender: Sender>, path: &Path) -> Result<()> { } Ok(()) } + +/// Spill the

Re: [I] Explore Updating VariadicAny Signature to take 0 Args [datafusion]

2024-07-17 Thread via GitHub
2010YOUY01 commented on issue #11522: URL: https://github.com/apache/datafusion/issues/11522#issuecomment-2235538061 Does that work for your use case? I vaguely remember `VariadicAny` is set to >=1 argument for some reason 🤔 https://github.com/apache/datafusion/blob/b0925c801e1c07bd78c7

Re: [PR] Support SortMergeJoin spilling [datafusion]

2024-07-17 Thread via GitHub
viirya commented on code in PR #11218: URL: https://github.com/apache/datafusion/pull/11218#discussion_r1682183243 ## datafusion/core/tests/memory_limit/mod.rs: ## @@ -180,6 +180,26 @@ async fn merge_join() { ]) .with_memory_limit(1_000) .with_config(c

Re: [PR] Move `sql_compound_identifier_to_expr ` to `ExprPlanner` [datafusion]

2024-07-17 Thread via GitHub
dharanad commented on code in PR #11487: URL: https://github.com/apache/datafusion/pull/11487#discussion_r1682131864 ## datafusion/sql/examples/sql.rs: ## @@ -54,7 +55,8 @@ fn main() { let context_provider = MyContextProvider::new() .with_udaf(sum_udaf())

Re: [PR] feat: Enable remaining Spark 3.5.1 tests [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove merged PR #676: URL: https://github.com/apache/datafusion-comet/pull/676 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [I] Enable all ignored Spark SQL tests for Spark 3.5.1 [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove closed issue #617: Enable all ignored Spark SQL tests for Spark 3.5.1 URL: https://github.com/apache/datafusion-comet/issues/617 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] Move `sql_compound_identifier_to_expr ` to `ExprPlanner` [datafusion]

2024-07-17 Thread via GitHub
dharanad commented on code in PR #11487: URL: https://github.com/apache/datafusion/pull/11487#discussion_r1682102315 ## datafusion/optimizer/tests/optimizer_integration.rs: ## @@ -344,7 +345,8 @@ fn test_sql(sql: &str) -> Result { .with_udaf(sum_udaf()) .with_u

Re: [PR] feat: add scalar subquery pushdown to scan [datafusion-comet]

2024-07-17 Thread via GitHub
codecov-commenter commented on PR #678: URL: https://github.com/apache/datafusion-comet/pull/678#issuecomment-2235261640 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/678?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campai

Re: [PR] feat: Enable remaining Spark 3.5.1 tests [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #676: URL: https://github.com/apache/datafusion-comet/pull/676#discussion_r1682062114 ## dev/diffs/3.5.1.diff: ## @@ -2120,29 +2120,17 @@ index 746f289c393..bc01ffd52ea 100644 checkAnswer(aggDF, df1.groupBy("j").agg(max("k"))) }

Re: [PR] feat: Enable remaining Spark 3.5.1 tests [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #676: URL: https://github.com/apache/datafusion-comet/pull/676#discussion_r1682052948 ## dev/diffs/3.5.1.diff: ## @@ -2120,29 +2120,17 @@ index 746f289c393..bc01ffd52ea 100644 checkAnswer(aggDF, df1.groupBy("j").agg(max("k"))) }

Re: [PR] feat: Enable remaining Spark 3.5.1 tests [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #676: URL: https://github.com/apache/datafusion-comet/pull/676#discussion_r1682052523 ## dev/diffs/3.5.1.diff: ## @@ -2120,29 +2120,17 @@ index 746f289c393..bc01ffd52ea 100644 checkAnswer(aggDF, df1.groupBy("j").agg(max("k"))) }

Re: [PR] Followup Support NULL literals in where clause [datafusion]

2024-07-17 Thread via GitHub
xinlifoobar commented on PR #11491: URL: https://github.com/apache/datafusion/pull/11491#issuecomment-2235157160 > Thanks -- sorry for the back and forth @xinlifoobar - I think we could make this simpler, But if you are sick of changing things, that is fine too and I can make a follow on PR

Re: [PR] Followup Support NULL literals in where clause [datafusion]

2024-07-17 Thread via GitHub
xinlifoobar commented on code in PR #11491: URL: https://github.com/apache/datafusion/pull/11491#discussion_r1681999516 ## datafusion/optimizer/src/analyzer/type_coercion.rs: ## @@ -103,6 +104,20 @@ fn analyze_internal( // select t2.c2 from t1 where t1.c1 in (select t2.c1 f

Re: [PR] Add parser option enable_options_value_normalization [datafusion]

2024-07-17 Thread via GitHub
xinlifoobar commented on code in PR #11330: URL: https://github.com/apache/datafusion/pull/11330#discussion_r1681972108 ## datafusion/sql/src/statement.rs: ## @@ -1080,6 +1017,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { ))) } +fn parse_options_map( +

Re: [PR] Move `MAKE_MAP` to ExprPlanner [datafusion]

2024-07-17 Thread via GitHub
goldmedal commented on code in PR #11452: URL: https://github.com/apache/datafusion/pull/11452#discussion_r1681967883 ## datafusion/functions-array/src/make_array.rs: ## @@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray { } } +#[derive(Debug)] +pub struct MakeArraySt

Re: [PR] DRAFT: map-with-vec-exprs [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11526: URL: https://github.com/apache/datafusion/pull/11526#discussion_r1681965049 ## datafusion/functions/src/core/map.rs: ## @@ -75,33 +75,31 @@ fn make_map(args: &[ColumnarValue]) -> Result { make_map_batch_internal(key, value, can_ev

Re: [PR] DRAFT: map-with-vec-exprs [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11526: URL: https://github.com/apache/datafusion/pull/11526#discussion_r1681964481 ## datafusion/sql/src/expr/function.rs: ## @@ -229,8 +229,23 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { // user-defined function (UDF) should

Re: [PR] DRAFT: map-with-vec-exprs [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11526: URL: https://github.com/apache/datafusion/pull/11526#discussion_r1681964078 ## datafusion/functions/src/core/map.rs: ## @@ -75,33 +75,31 @@ fn make_map(args: &[ColumnarValue]) -> Result { make_map_batch_internal(key, value, can_ev

Re: [PR] Move `MAKE_MAP` to ExprPlanner [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11452: URL: https://github.com/apache/datafusion/pull/11452#discussion_r1681963551 ## datafusion/functions-array/src/make_array.rs: ## @@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray { } } +#[derive(Debug)] +pub struct MakeArrayS

[PR] DRAFT: map-with-vec-exprs [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 opened a new pull request, #11526: URL: https://github.com/apache/datafusion/pull/11526 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested

Re: [PR] Move `MAKE_MAP` to ExprPlanner [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11452: URL: https://github.com/apache/datafusion/pull/11452#discussion_r1681932892 ## datafusion/functions-array/src/make_array.rs: ## @@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray { } } +#[derive(Debug)] +pub struct MakeArrayS

Re: [PR] feat: Enable remaining Spark 3.5.1 tests [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #676: URL: https://github.com/apache/datafusion-comet/pull/676#discussion_r1681941216 ## dev/diffs/3.5.1.diff: ## @@ -2120,29 +2120,17 @@ index 746f289c393..bc01ffd52ea 100644 checkAnswer(aggDF, df1.groupBy("j").agg(max("k")))

Re: [PR] Move `MAKE_MAP` to ExprPlanner [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11452: URL: https://github.com/apache/datafusion/pull/11452#discussion_r1681932892 ## datafusion/functions-array/src/make_array.rs: ## @@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray { } } +#[derive(Debug)] +pub struct MakeArrayS

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove merged PR #671: URL: https://github.com/apache/datafusion-comet/pull/671 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [PR] Extract catalog API to separate crate [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on PR #11516: URL: https://github.com/apache/datafusion/pull/11516#issuecomment-2234919555 I think `CatalogSession` is not the minimum dependency, what is the reason to replace all the SessionState with CatalogSession -- This is an automated message from the Apache Gi

Re: [PR] Move `MAKE_MAP` to ExprPlanner [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11452: URL: https://github.com/apache/datafusion/pull/11452#discussion_r1681920330 ## datafusion/functions-array/src/make_array.rs: ## @@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray { } } +#[derive(Debug)] +pub struct MakeArrayS

Re: [PR] Move `sql_compound_identifier_to_expr ` to `ExprPlanner` [datafusion]

2024-07-17 Thread via GitHub
jayzhan211 commented on code in PR #11487: URL: https://github.com/apache/datafusion/pull/11487#discussion_r1681912874 ## datafusion/optimizer/tests/optimizer_integration.rs: ## @@ -344,7 +345,8 @@ fn test_sql(sql: &str) -> Result { .with_udaf(sum_udaf()) .with

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#discussion_r1681869984 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometTPCDSMicroBenchmark.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation

Re: [I] Comparison between negative zero and false produces incorrect result [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on issue #663: URL: https://github.com/apache/datafusion-comet/issues/663#issuecomment-2234546414 We should fix in the upstream https://github.com/apache/datafusion/issues/11108 -- This is an automated message from the Apache Git Service. To respond to the mess

Re: [I] Divide by zero produces Infinity instead of NULL [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on issue #665: URL: https://github.com/apache/datafusion-comet/issues/665#issuecomment-2234527628 Will be addressed with https://github.com/apache/datafusion-comet/pull/585 -- This is an automated message from the Apache Git Service. To respond to the message, p

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#discussion_r1681836876 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometTPCDSMicroBenchmark.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foun

Re: [I] Investigate options for improving performance when reading decimals from Parquet [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on issue #679: URL: https://github.com/apache/datafusion-comet/issues/679#issuecomment-2234431447 It might be that the assignment to `Decimal.longVal` (or `Decimal.intVal`) in `Decimal.createUnsafe` is copying data from native to Jvm memory and that has a performance

Re: [PR] feat: add scalar subquery pushdown to scan [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #678: URL: https://github.com/apache/datafusion-comet/pull/678#discussion_r1681803041 ## dev/diffs/4.0.0-preview1.diff: ## @@ -938,16 +938,42 @@ index 68f14f13bbd..4b8e967102f 100644 } assert(exchanges.size === 1) }

Re: [PR] Followup Support NULL literals in where clause [datafusion]

2024-07-17 Thread via GitHub
alamb commented on code in PR #11491: URL: https://github.com/apache/datafusion/pull/11491#discussion_r1681816119 ## datafusion/optimizer/src/analyzer/type_coercion.rs: ## @@ -103,6 +104,20 @@ fn analyze_internal( // select t2.c2 from t1 where t1.c1 in (select t2.c1 from t2

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#discussion_r1681816431 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometTPCDSMicroBenchmark.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation

Re: [I] Investigate options for improving performance when reading decimals from Parquet [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on issue #679: URL: https://github.com/apache/datafusion-comet/issues/679#issuecomment-2234380204 This is the flamegraph where Comet is performing scan + exec. Because the hash aggregate is operating on Arrow data, we no longer see lots of decimals being accessed from J

Re: [PR] Extract catalog API to separate crate [datafusion]

2024-07-17 Thread via GitHub
findepi commented on PR #11516: URL: https://github.com/apache/datafusion/pull/11516#issuecomment-2234372485 Eventually all green. @jayzhan211 @alamb @comphead you may want to take a look -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [I] Investigate options for improving performance when reading decimals from Parquet [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on issue #679: URL: https://github.com/apache/datafusion-comet/issues/679#issuecomment-2234367885 Here is the full flamegraph. The hash aggregate is running in Spark, so maybe it is not surprising that so much time is in calls to getDecimal ![Screenshot from 2024-

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
hveiga commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234367000 I believe the right one is `data_page_row_count_limit` instead of `data_page_row_limit`. So many similar configs :) -- This is an automated message from the Apache Git Service.

Re: [PR] Minor: Assert `test_enabled_backtrace` requirements to run [datafusion]

2024-07-17 Thread via GitHub
findepi commented on PR #11525: URL: https://github.com/apache/datafusion/pull/11525#issuecomment-2234364090 thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on code in PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#discussion_r1681792237 ## native/core/src/parquet/read/levels.rs: ## @@ -121,8 +121,8 @@ impl LevelDecoder { match self.mode { Mode::RLE => {

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#discussion_r1681790305 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometTPCDSMicroBenchmark.scala: ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foun

Re: [PR] Minor: Assert `test_enabled_backtrace` requirements to run [datafusion]

2024-07-17 Thread via GitHub
comphead merged PR #11525: URL: https://github.com/apache/datafusion/pull/11525 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#discussion_r1681779734 ## native/core/src/parquet/read/levels.rs: ## @@ -121,8 +121,8 @@ impl LevelDecoder { match self.mode { Mode::RLE => {

Re: [PR] Create `datafusion-physical-optimizer` crate [datafusion]

2024-07-17 Thread via GitHub
alamb merged PR #11507: URL: https://github.com/apache/datafusion/pull/11507 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
alamb commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234310741 `data_page_row_limit` is the right one (based on rows) There is another similarly named one called `data_pagesize_limit ` (I would have expected it to be called data_page_si

Re: [PR] Minor: Assert `test_enabled_backtrace` requirements to run [datafusion]

2024-07-17 Thread via GitHub
alamb commented on code in PR #11525: URL: https://github.com/apache/datafusion/pull/11525#discussion_r1681741785 ## datafusion/common/src/error.rs: ## @@ -661,11 +661,16 @@ mod test { assert_eq!(res.strip_backtrace(), "Arrow error: Schema error: bar"); } -//

Re: [PR] Create `datafusion-physical-optimizer` crate [datafusion]

2024-07-17 Thread via GitHub
ozankabak commented on PR #11507: URL: https://github.com/apache/datafusion/pull/11507#issuecomment-2234301261 I like the idea of moving the physical optimizer into its own crate, thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
hveiga commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234289986 Thanks for all the feedback, we are actively testing this and can report our findings later today. Regarding `data_page_row_limit=20k` what is the right key to pass in the

Re: [PR] Add tests that show the different defaults for `ArrowWriter` and `TableParquetOptions` [datafusion]

2024-07-17 Thread via GitHub
alamb merged PR #11524: URL: https://github.com/apache/datafusion/pull/11524 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Add tests that show the different defaults for `ArrowWriter` and `TableParquetOptions` [datafusion]

2024-07-17 Thread via GitHub
alamb commented on PR #11524: URL: https://github.com/apache/datafusion/pull/11524#issuecomment-2234289349 > This PR is defining the current behavior only. > Not defining what it should be (altho yes the test cases use "should" lingo 😆 ). I changed the title to reflect what this PR

Re: [I] Inconsistent value for `data_page_max_rows` setting in DataFusion `ParquetOptions` and in `ArrowWriterOptions` [datafusion]

2024-07-17 Thread via GitHub
alamb commented on issue #11367: URL: https://github.com/apache/datafusion/issues/11367#issuecomment-2234282902 @wiedld 's pr https://github.com/apache/datafusion/pull/11524 has a very nice table of the current state of affairs: Here are the places where the current defaults d

Re: [I] Investigate options for improving performance when reading decimals from Parquet [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on issue #679: URL: https://github.com/apache/datafusion-comet/issues/679#issuecomment-2234272725 ![Screenshot from 2024-07-17 14-59-18](https://github.com/user-attachments/assets/85af2838-98c0-452f-a9c7-e3bda4887160) -- This is an automated message from the Apach

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on code in PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#discussion_r1681728168 ## native/core/src/parquet/read/levels.rs: ## @@ -121,8 +121,8 @@ impl LevelDecoder { match self.mode { Mode::RLE => {

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
wiedld commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234271367 > I also found https://github.com/apache/arrow-rs/issues/5828 which might be related and/or relevant. @hveiga is correct that this is one suspected place with extra memory

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
alamb commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234270402 Also spoiler alert is that @wiedld is in the process of harmonizing the default configuration for the parquet writer and the arrow settings as part of https://github.com/apache/d

Re: [I] Move spill related functions to `spill.rs` [datafusion]

2024-07-17 Thread via GitHub
comphead commented on issue #11496: URL: https://github.com/apache/datafusion/issues/11496#issuecomment-2234269846 nw, folks, I shoulda added the PR dependency manually, nw, I already resolved it. -- This is an automated message from the Apache Git Service. To respond to the message, ple

Re: [PR] fix(11367): parquet writer defaults [datafusion]

2024-07-17 Thread via GitHub
wiedld commented on PR #11524: URL: https://github.com/apache/datafusion/pull/11524#issuecomment-2234260902 Coincidentally, the treatment of these defaults [just came up in discussion](https://github.com/apache/datafusion/issues/11042#issuecomment-2234244223) regarding memory debugging. Wha

Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

2024-07-17 Thread via GitHub
wiedld commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234244223 > I wonder if this could be related to DataFusion overriding the data_page_row_limit setting in https://github.com/apache/datafusion/issues/11367 (that @wiedld is working on)

Re: [PR] feat: add scalar subquery pushdown to scan [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on PR #678: URL: https://github.com/apache/datafusion-comet/pull/678#issuecomment-2234238920 @kazuyukitanimura Ready for your review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#issuecomment-2234226194 > However, this path would be hit only for precision > 18 or if `spark.comet.use.decimal128` was set to `true` (it is `false` by default). The fields have precision 7 and I

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#issuecomment-2234221966 > Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader. >

[PR] Minor: Assert `test_enabled_backtrace` requirements to run [datafusion]

2024-07-17 Thread via GitHub
comphead opened a new pull request, #11525: URL: https://github.com/apache/datafusion/pull/11525 ## Which issue does this PR close? Closes #. ## Rationale for this change If the `test_enabled_backtrace` launched without `RUST_BACKTRACE=1` the test will fail with the

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#issuecomment-2234180553 > TPCDS Micro Benchmarks: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative > ---

Re: [PR] feat: track memory consumers for GreedyMemoryPoolState [datafusion]

2024-07-17 Thread via GitHub
alamb commented on PR #9015: URL: https://github.com/apache/datafusion/pull/9015#issuecomment-2234162615 My concern with this PR was that I could not quite convince myself / find the time to ensure the complexity was necessary... I feel bad we didn't find time to get it over the line. Maybe

Re: [PR] Add parser option enable_options_value_normalization [datafusion]

2024-07-17 Thread via GitHub
tinfoil-knight commented on PR #11330: URL: https://github.com/apache/datafusion/pull/11330#issuecomment-2234156327 @xinlifoobar Could you cleanup the commit history? Avoid using rebase. Just create a merge commit (if really necessary) w/ the `main` branch. -- This is an automated message

Re: [PR] Update parquet page pruning code to use the `StatisticsExtractor` [datafusion]

2024-07-17 Thread via GitHub
alamb commented on code in PR #11483: URL: https://github.com/apache/datafusion/pull/11483#discussion_r1679347111 ## datafusion/core/src/physical_optimizer/pruning.rs: ## @@ -736,12 +738,22 @@ impl RequiredColumns { Self::default() } -/// Returns number of un

Re: [PR] Update parquet page pruning code to use the `StatisticsExtractor` [datafusion]

2024-07-17 Thread via GitHub
alamb commented on code in PR #11483: URL: https://github.com/apache/datafusion/pull/11483#discussion_r1681651905 ## datafusion/core/src/physical_optimizer/pruning.rs: ## @@ -736,12 +738,22 @@ impl RequiredColumns { Self::default() } -/// Returns number of un

Re: [I] Implement nested identifier access ( "Nested identifiers not yet supported" ) [datafusion]

2024-07-17 Thread via GitHub
kavirajk commented on issue #11445: URL: https://github.com/apache/datafusion/issues/11445#issuecomment-2234147415 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
parthchandra commented on PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#issuecomment-2234142009 > Can we also commit the benchmark results (as we do for the other microbenchmarks)? That way we can keep an eye open for any performance regressions. I was under the i

[PR] fix(11367): parquet writer defaults [datafusion]

2024-07-17 Thread via GitHub
wiedld opened a new pull request, #11524: URL: https://github.com/apache/datafusion/pull/11524 ## Which issue does this PR close? Part of #11367 This PR defines the current state. ## Rationale for this change When we switched from using the parquet's ArrowWriter (with

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on code in PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#discussion_r1681640222 ## spark/src/test/resources/tpcds-micro-benchmarks/add_many_decimals.sql: ## @@ -0,0 +1,34 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-

Re: [I] Resources exhuasted errors are confusing return the biggest memory consumers. [datafusion]

2024-07-17 Thread via GitHub
wiedld commented on issue #11523: URL: https://github.com/apache/datafusion/issues/11523#issuecomment-2234117616 > The message is quite confusing and we can probably make it better, mentioning requested, used, allocated and remaining mem. But I'm not sure if there is an easy way to what tak

Re: [PR] feat: track memory consumers for GreedyMemoryPoolState [datafusion]

2024-07-17 Thread via GitHub
wiedld commented on PR #9015: URL: https://github.com/apache/datafusion/pull/9015#issuecomment-2234109663 Note that the need for some tracking of memory consumers has been [raised again](https://github.com/apache/datafusion/issues/11523). Not sure if we plan on reviving/reusing some of the

Re: [I] Resources exhuasted errors are confusing return the biggest memory consumers. [datafusion]

2024-07-17 Thread via GitHub
comphead commented on issue #11523: URL: https://github.com/apache/datafusion/issues/11523#issuecomment-2234109473 The message is quite confusing and we can probably make it better, mentioning requested, used, allocated and remaining mem. But I'm not sure if there is an easy way to what tak

Re: [PR] chore: Add microbenchmarks [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on PR #671: URL: https://github.com/apache/datafusion-comet/pull/671#issuecomment-2234106564 For reference, here are the results from running with sf=100gb data on my Linux workstation. ``` OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux

Re: [I] Investigate performance of decimal math / aggregation [datafusion-comet]

2024-07-17 Thread via GitHub
andygrove commented on issue #670: URL: https://github.com/apache/datafusion-comet/issues/670#issuecomment-2234100899 Benchmark runs @ sf=100 suggest that reading decimal from parquet could potentially be a performance issue. ``` TPCDS Micro Benchmarks: Best Time

Re: [I] Move spill related functions to `spill.rs` [datafusion]

2024-07-17 Thread via GitHub
findepi commented on issue #11496: URL: https://github.com/apache/datafusion/issues/11496#issuecomment-2234098647 i think you did alright (otherwise, how would you keep track of the follow-up issues without creating them). maybe i created the PR too early. apologies @comphead -- This

Re: [I] Move spill related functions to `spill.rs` [datafusion]

2024-07-17 Thread via GitHub
alamb commented on issue #11496: URL: https://github.com/apache/datafusion/issues/11496#issuecomment-2234097093 Sorry! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To un

Re: [I] Improve memory pool reservation `shrink` error handling [datafusion]

2024-07-17 Thread via GitHub
comphead commented on issue #11457: URL: https://github.com/apache/datafusion/issues/11457#issuecomment-2234094573 Depends on #11218 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#issuecomment-2234069067 @parthchandra @andygrove I have added a relevant change to fix one more test after your approvals https://github.com/apache/datafusion-comet/pull/604/commits/3659f0ad0e

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#issuecomment-2234064490 > Thanks @kazuyukitanimura I'm wondering should be have tests to read those types? Thanks @comphead Spark's `ParquetTypeWideningSuite` has a good coverage -- T

Re: [I] Move spill related functions to `spill.rs` [datafusion]

2024-07-17 Thread via GitHub
comphead commented on issue #11496: URL: https://github.com/apache/datafusion/issues/11496#issuecomment-2234063802 I created this issue too early as it will create conflicts in #11218 😁 -- This is an automated message from the Apache Git Service. To respond to the message, please log on t

Re: [I] Resources exhuasted errors are confusing return the biggest memory consumers. [datafusion]

2024-07-17 Thread via GitHub
alamb commented on issue #11523: URL: https://github.com/apache/datafusion/issues/11523#issuecomment-2234060771 I agree this message is quite confusing There is a related idea here I think https://github.com/apache/datafusion/issues/6934 I think @westonpace mentioned something

Re: [PR] feat: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#issuecomment-2234060800 > btw, this is really a `feat` not a `fix` :) updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitH

Re: [PR] fix: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#discussion_r1681603572 ## native/core/src/parquet/read/values.rs: ## @@ -506,6 +610,21 @@ macro_rules! generate_cast_to_signed { generate_cast_to_signed!(copy_i32_to_i8, i32,

Re: [PR] fix: Spark-4.0 widening type support [datafusion-comet]

2024-07-17 Thread via GitHub
kazuyukitanimura commented on code in PR #604: URL: https://github.com/apache/datafusion-comet/pull/604#discussion_r1681602065 ## native/core/src/parquet/read/values.rs: ## @@ -285,6 +261,59 @@ impl PlainDecoding for Int32DateType { } } +impl PlainDecoding for Int32Times

  1   2   3   >