Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
viirya commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006927110 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -236,17 +236,18 @@ object CometConf extends ShimCometConf { val COMET_MEMORY_OVERHEAD: Optio

Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
viirya commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006927319 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -255,8 +256,7 @@ object CometConf extends ShimCometConf { val COMET_MEMORY_OVERHEAD_MIN_MI

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2006926052 ## native/core/src/parquet/mod.rs: ## @@ -641,6 +640,8 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat session_timezo

Re: [PR] SET statements: scope modifier for multiple assignments [datafusion-sqlparser-rs]

2025-03-20 Thread via GitHub
iffyio commented on code in PR #1772: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1772#discussion_r2006914091 ## src/parser/mod.rs: ## @@ -11145,17 +11145,16 @@ impl<'a> Parser<'a> { } /// Parse a `SET ROLE` statement. Expects SET to be consumed alre

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2741938887 > Just FYI, I think Github will close the issue because it has "closes #xxx" automation Good call -- I have updated the description to say "related to" rather than closes

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
2010YOUY01 commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2742297724 > I ran this against Q23, results look promising! Elapsed 3.173 seconds with `datafusion.optimizer.enable_dynamic_filter_pushdown = true` vs. 4.696 with `false`. Both with predica

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
2010YOUY01 commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006853236 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1067,35 +1067,53 @@ impl ExecutionPlan for SortExec { ) -> Result { trace!("Start SortExec::

Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]

2025-03-20 Thread via GitHub
adriangb commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2742171861 Plus one for making it the default :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]

2025-03-20 Thread via GitHub
xudong963 commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2742191169 +1

[I] Dependency conflict with rquest due to async-compression and xz2 linking to lzma [datafusion]

2025-03-20 Thread via GitHub
0xlearner opened a new issue, #15342: URL: https://github.com/apache/datafusion/issues/15342 ### Describe the bug ### Description When using `rquest` with `datafusion`, a dependency conflict occurs because both crates depend on libraries that link to the native `lzma` library.

Re: [I] Add documentation about how to plan custom expressions [datafusion]

2025-03-20 Thread via GitHub
Jiashu-Hu commented on issue #15267: URL: https://github.com/apache/datafusion/issues/15267#issuecomment-2742161658 take

Re: [PR] perf: Reuse row converter during sort [datafusion]

2025-03-20 Thread via GitHub
2010YOUY01 commented on code in PR #15302: URL: https://github.com/apache/datafusion/pull/15302#discussion_r2006777564 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -688,15 +707,29 @@ impl ExternalSorter { let fetch = self.fetch; let expressions: LexOr

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006738045 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometReadBenchmark.scala: ## @@ -63,6 +65,7 @@ object CometReadBenchmark extends CometBenchmarkBase {

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006750847 ## pom.xml: ## @@ -447,6 +448,13 @@ under the License. 5.1.0 + +org.apache.hadoop +hadoop-client-minicluster Review C

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
kosiew commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006732123 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -163,26 +187,32 @@ impl TopK { // TODO make this algorithmically better?: // Idea: filter out r

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006742948 ## pom.xml: ## @@ -447,6 +448,13 @@ under the License. 5.1.0 + +org.apache.hadoop +hadoop-client-minicluster Review C

Re: [I] Allow UDFs to return custom `Diagnostic` [datafusion]

2025-03-20 Thread via GitHub
jsai28 commented on issue #15276: URL: https://github.com/apache/datafusion/issues/15276#issuecomment-2742070297 Regarding your first two points, I do think that `Vec` may be the way to do this. Mainly as it supports handling the case of literal values out of the box. If sqlparser is eventu

Re: [I] Missing crates.io 46.0.1 release for the `datafusion` crate [datafusion]

2025-03-20 Thread via GitHub
linhr commented on issue #15328: URL: https://github.com/apache/datafusion/issues/15328#issuecomment-2742022196 Thanks @alamb! Everything looks good now!

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006243095 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -644,10 +737,72 @@ impl RecordBatchStore { } } +struct TopKDynamicFilterSource { +/// The TopK heap

Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]

2025-03-20 Thread via GitHub
alamb closed issue #14914: [EPIC] Complete `SQL EXPLAIN` Tree Rendering URL: https://github.com/apache/datafusion/issues/14914

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2741967802 Do you see too many threads when writing the spill files or when reading?

Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #14914: URL: https://github.com/apache/datafusion/issues/14914#issuecomment-2741966208 Thank you so much to @irenjj @zebsme @Standing-Man and others I think we are basically done with this epic... It is even documented! https://datafusion.apache.org/use

Re: [I] Enable `tree` explain by default [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2741964039 For example, here is the plan from a recent query from https://github.com/apache/datafusion/issues/15177 (I actually had to trim it to fit in the 65k limit): ```sql

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2741947591 > I ran this against Q23, results look promising! Elapsed 3.173 seconds with `datafusion.optimizer.enable_dynamic_filter_pushdown = true` vs. 4.696 with `false`. Both with predicate pu

Re: [PR] Implement GroupsAccumulator for min/max Duration [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15322: URL: https://github.com/apache/datafusion/pull/15322#discussion_r2006649688 ## datafusion/functions-aggregate/src/min_max.rs: ## @@ -264,6 +265,7 @@ impl AggregateUDFImpl for Max { | Binary | LargeBinary

Re: [PR] feat: simplify regex wildcard pattern [datafusion]

2025-03-20 Thread via GitHub
waynexia commented on code in PR #15299: URL: https://github.com/apache/datafusion/pull/15299#discussion_r2006630060 ## datafusion/sqllogictest/test_files/union.slt: ## @@ -910,8 +910,8 @@ SELECT * FROM (SELECT y FROM u1 UNION ALL SELECT y FROM u2) ORDER BY y; query I SELECT

Re: [I] Create more user friendly aliases from `col` [datafusion-python]

2025-03-20 Thread via GitHub
deanm commented on issue #754: URL: https://github.com/apache/datafusion-python/issues/754#issuecomment-2741575371 Another potential friendly alias method is to use **kwargs in `select` and `aggregate`. Here's a select implementation: ```python def select(self, *exprs: Exp

Re: [I] Dialect-specific parsing and Snowflake JSON support [datafusion-sqlparser-rs]

2025-03-20 Thread via GitHub
tv42 commented on issue #241: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/241#issuecomment-2741880470 Almost 5 years without update. Can this be closed in light of https://github.com/apache/datafusion-sqlparser-rs/blob/main/tests/sqlparser_custom_dialect.rs ?

Re: [PR] include some BinaryOperator from sqlparser [datafusion]

2025-03-20 Thread via GitHub
waynexia commented on PR #15327: URL: https://github.com/apache/datafusion/pull/15327#issuecomment-2741876016 >It might also be a good idea to include some documentation in the operators themselves that DataFusion doesn't have default implementations Added in [5828cba](https://github

Re: [PR] include some BinaryOperator from sqlparser [datafusion]

2025-03-20 Thread via GitHub
waynexia commented on PR #15327: URL: https://github.com/apache/datafusion/pull/15327#issuecomment-2741867405 >I think there should be sql level tests (sqllogictests) that run these operators That's a good idea! I think I know them much better after writing some SQLs (though none of t

Re: [PR] feat: enable iceberg compat tests, more tests for complex types [datafusion-comet]

2025-03-20 Thread via GitHub
comphead commented on PR #1550: URL: https://github.com/apache/datafusion-comet/pull/1550#issuecomment-2741862047 > Great tests @comphead Do you think we need to add some cases with one more level of nesting - > > ``` > array > +- struct > +- array > ``

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-03-20 Thread via GitHub
blaginin commented on code in PR #15313: URL: https://github.com/apache/datafusion/pull/15313#discussion_r2006567915 ## datafusion/physical-plan/src/aggregates/mod.rs: ## @@ -1776,17 +1790,17 @@ mod tests { assert_eq!(batch.num_columns(), 2); assert_eq!(batch.n

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-03-20 Thread via GitHub
kevinjqliu commented on code in PR #60: URL: https://github.com/apache/datafusion-site/pull/60#discussion_r2006136201 ## content/blog/2025-03-20-parquet-pruning.md: ## @@ -0,0 +1,118 @@ +--- +layout: post +title: Parquet Pruning in DataFusion: Read Only What Matters +date: 2025-

Re: [PR] documentation :: quick-start.md sample source code correction [datafusion-ballista]

2025-03-20 Thread via GitHub
milenkovicm commented on PR #1213: URL: https://github.com/apache/datafusion-ballista/pull/1213#issuecomment-2741763693 Good catch, thanks @nj7

Re: [I] Unsupported OS/arch [datafusion-comet]

2025-03-20 Thread via GitHub
jinwenjie123 commented on issue #1552: URL: https://github.com/apache/datafusion-comet/issues/1552#issuecomment-2737302767 > [@jinwenjie123](https://github.com/jinwenjie123) I would recommend starting off by using a pre-built JAR which contains native binaries for multiple architectures.

Re: [PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006506263 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,130 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +aut

Re: [I] [Rust] [datafusion] Allow integration in non libc environments [datafusion]

2025-03-20 Thread via GitHub
alamb closed issue #102: [Rust] [datafusion] Allow integration in non libc environments URL: https://github.com/apache/datafusion/issues/102

[PR] fix type coercion for uint/int's [datafusion]

2025-03-20 Thread via GitHub
Omega359 opened a new pull request, #15341: URL: https://github.com/apache/datafusion/pull/15341 ## Which issue does this PR close? - Closes #15340 ## Rationale for this change Better handle type coercion when unsigned numerics are involved ## What changes are included

Re: [PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006442098 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,131 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +aut

Re: [PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006440492 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,131 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +aut

Re: [PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
kazuyukitanimura commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006435819 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,131 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-

Re: [I] type coercion for arthmetic/binary ops fails for some unsigned/signed mappings [datafusion]

2025-03-20 Thread via GitHub
Omega359 commented on issue #15340: URL: https://github.com/apache/datafusion/issues/15340#issuecomment-2741609413 take

Re: [PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
comphead commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006367866 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,130 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +auth

Re: [I] Snowflake COPY INTO fails to parse with a semicolon [datafusion-sqlparser-rs]

2025-03-20 Thread via GitHub
tv42 commented on issue #1519: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1519#issuecomment-2741566749 This was fixed in sqlparser v0.55.0, likely https://github.com/apache/datafusion-sqlparser-rs/pull/1669

[PR] 1075/enhancement/Make col class with __getattr__ [datafusion-python]

2025-03-20 Thread via GitHub
deanm opened a new pull request, #1076: URL: https://github.com/apache/datafusion-python/pull/1076 # Which issue does this PR close? Closes #1075 # Rationale for this change To improve ergonomics of the API by providing a quicker way of accessing columns using the __ge

[PR] Comet 0.7.0 [datafusion-site]

2025-03-20 Thread via GitHub
andygrove opened a new pull request, #63: URL: https://github.com/apache/datafusion-site/pull/63 (no comment)

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2741469672 I traced this down to an issue in the planner, which uses `PartitionMode::Auto` iff stats are collected (`datafusion.execution.collect_statistics`) We can however still use

[PR] Always use `PartitionMode::Auto` in planner [datafusion]

2025-03-20 Thread via GitHub
Dandandan opened a new pull request, #15339: URL: https://github.com/apache/datafusion/pull/15339 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes teste

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
kazuyukitanimura commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006238491 ## spark/src/test/scala/org/apache/comet/WithHdfsCluster.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under on

Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006287257 ## spark/src/main/scala/org/apache/comet/CometExecIterator.scala: ## @@ -63,9 +64,28 @@ class CometExecIterator( }.toArray private val plan = { va

Re: [PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
westhide commented on PR #15335: URL: https://github.com/apache/datafusion/pull/15335#issuecomment-2740991253 > Thank you @westhide > > Should we remove the `batch_size` from JSON source too? > > https://github.com/apache/datafusion/blob/dd9c3a815d7b4af2ef503ea557332ecc700af318

[I] Add a Col class instead of just col function to use __getattr__ method [datafusion-python]

2025-03-20 Thread via GitHub
deanm opened a new issue, #1075: URL: https://github.com/apache/datafusion-python/issues/1075 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** This would allow columns to be referred to as attr methods of col. For example inste
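The attribute-style column access requested here can be sketched in plain Python. This is only an illustration of the `__getattr__` technique: the `Expr` and `Col` classes below are hypothetical stand-ins and are not the datafusion-python classes or the implementation proposed in the issue.

```python
class Expr:
    """Hypothetical stand-in for a column expression (not datafusion's Expr)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return f"Expr({self.name!r})"


class Col:
    """Builds column expressions via call syntax or attribute access."""

    def __call__(self, name: str) -> Expr:
        # Existing style: col("a")
        return Expr(name)

    def __getattr__(self, name: str) -> Expr:
        # __getattr__ runs only when normal attribute lookup fails,
        # so real methods and attributes are unaffected.
        if name.startswith("__"):
            # Leave dunder lookups (copy, pickle, etc.) alone.
            raise AttributeError(name)
        return Expr(name)


# A single module-level instance keeps the familiar `col` name.
col = Col()
```

With this sketch, `col.a` and `col("a")` both build an expression for column `a`; names that are not valid Python identifiers (for example, ones containing spaces) still need the call form.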

Re: [PR] fix: write hive partitions for any int/uint/float [datafusion]

2025-03-20 Thread via GitHub
Omega359 commented on PR #15337: URL: https://github.com/apache/datafusion/pull/15337#issuecomment-2741362724 LGTM, thanks @christophermcdermott!

Re: [I] Add most functions to the Expr class so that they're chainable. [datafusion-python]

2025-03-20 Thread via GitHub
deanm commented on issue #1064: URL: https://github.com/apache/datafusion-python/issues/1064#issuecomment-2741351917 @timsaucer I put in a [draft PR](https://github.com/apache/datafusion-python/pull/1074) that does all the one input arg functions. Is your reluctance to putting *

Re: [PR] Implement GroupsAccumulator for min/max Duration [datafusion]

2025-03-20 Thread via GitHub
shruti2522 commented on code in PR #15322: URL: https://github.com/apache/datafusion/pull/15322#discussion_r2006224462 ## datafusion/functions-aggregate/src/min_max.rs: ## @@ -264,6 +265,7 @@ impl AggregateUDFImpl for Max { | Binary | LargeBinar

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-20 Thread via GitHub
Omega359 commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2006154315 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,259 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xia

Re: [I] [Rust] [datafusion] Allow integration in non libc environments [datafusion]

2025-03-20 Thread via GitHub
arpity22 commented on issue #102: URL: https://github.com/apache/datafusion/issues/102#issuecomment-2741307866 Since this issue was opened a while ago, has it been resolved but not updated here?

Re: [PR] Fix parquet pruning blog post hyperlink [datafusion-site]

2025-03-20 Thread via GitHub
XiangpengHao commented on PR #62: URL: https://github.com/apache/datafusion-site/pull/62#issuecomment-2741287848 Thank you @kevinjqliu

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-20 Thread via GitHub
comphead commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2006162703 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,259 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xia

[PR] fix: write hive partitions for any int/uint/float [datafusion]

2025-03-20 Thread via GitHub
christophermcdermott opened a new pull request, #15337: URL: https://github.com/apache/datafusion/pull/15337 ## Which issue does this PR close? Closes #15336 ## Rationale for this change Support additional types in hive partitions. ## What changes a

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-03-20 Thread via GitHub
kevinjqliu commented on PR #60: URL: https://github.com/apache/datafusion-site/pull/60#issuecomment-2741250429 #62 should fix it

[PR] 1064/enhancement/add functions to Expr class [datafusion-python]

2025-03-20 Thread via GitHub
deanm opened a new pull request, #1074: URL: https://github.com/apache/datafusion-python/pull/1074 # Which issue does this PR close? Works towards closing #1064 # Rationale for this change To improve ergonomics of the API by adding functions to the Expr class so th

Re: [PR] Fix parquet pruning blog post hyperlink [datafusion-site]

2025-03-20 Thread via GitHub
kevinjqliu commented on PR #62: URL: https://github.com/apache/datafusion-site/pull/62#issuecomment-2741251030 cc @XiangpengHao @alamb

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-20 Thread via GitHub
comphead commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2006168286 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,259 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xia

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-20 Thread via GitHub
comphead commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2741241689 In content/images/parquet-pushdown/baseline-impl.jpg the flow goes from 3 to 5. I assume that is expected; perhaps it needs a separate comment?

[I] Write hive partitions for any int/uint/float [datafusion]

2025-03-20 Thread via GitHub
christophermcdermott opened a new issue, #15336: URL: https://github.com/apache/datafusion/issues/15336 ### Is your feature request related to a problem or challenge? I hit this error: DataFusion error: This feature is not implemented: it is not yet supported to write to hive part

Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]

2025-03-20 Thread via GitHub
milenkovicm commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2741214382 apparently you found another bug: https://github.com/apache/datafusion-ballista/blob/bb10a1bebd52ebb91515efa7a2a977df740c2d7a/ballista/scheduler/src/scheduler_serv

Re: [PR] Add hooks to `SchemaAdapter` to add custom column generators [datafusion]

2025-03-20 Thread via GitHub
adriangb commented on PR #15261: URL: https://github.com/apache/datafusion/pull/15261#issuecomment-2741174091 Marking as ready for review. The main TODO is an API for transmitting statistics information for generated columns before they get generated, but that can even be a followup PR.

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-03-20 Thread via GitHub
kevinjqliu commented on PR #60: URL: https://github.com/apache/datafusion-site/pull/60#issuecomment-2741193664 > The diagram below illustrates the [Parquet reading pipeline](https://docs.rs/datafusion/46.0.0/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html%60%60%6

Re: [I] Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) [datafusion]

2025-03-20 Thread via GitHub
adriangb commented on issue #7955: URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2741188852 I have a PR up for doing something similar for TopK sorts (`ORDER BY col LIMIT 10`) in https://github.com/apache/datafusion/pull/15301. I think we should be able to re-use that w
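As a rough illustration of why TopK state can act as a dynamic filter for `ORDER BY col LIMIT k`: once the heap holds `k` rows, its current worst value is a threshold, and any later row that cannot beat it can be skipped. The sketch below is plain Python with a hypothetical function name; it applies the threshold inside the same loop, whereas the PR discusses pushing such a filter down toward the scan. It is not the DataFusion implementation.

```python
import heapq


def top_k_smallest(rows, k):
    """Return the k smallest values (ascending) and how many rows
    were skipped by the heap-derived threshold."""
    heap = []      # max-heap via negation: -heap[0] is the current worst kept value
    skipped = 0
    for value in rows:
        if len(heap) == k and value >= -heap[0]:
            # Dynamic filter: this row cannot enter the top k.
            skipped += 1
            continue
        if len(heap) < k:
            heapq.heappush(heap, -value)
        else:
            # Replace the current worst value with the better one.
            heapq.heapreplace(heap, -value)
    return sorted(-v for v in heap), skipped
```

The threshold only tightens as more rows are seen, which is why a filter derived from it can safely be re-evaluated and applied to data that has not been read yet.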

Re: [PR] added explaination for Schema and DFSchema to documentation [datafusion]

2025-03-20 Thread via GitHub
comphead commented on code in PR #15329: URL: https://github.com/apache/datafusion/pull/15329#discussion_r2006103754 ## docs/source/library-user-guide/working-with-exprs.md: ## @@ -50,6 +50,29 @@ As another example, the SQL expression `a + b * c` would be represented as an `E

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
logan-keede commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2741075889 I thought it mattered because `datasource` has a dependency on `catalog`, but on a second look it is only `Session`. Any plans on pulling `Session` out? also corresponding `

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-03-20 Thread via GitHub
alamb commented on PR #60: URL: https://github.com/apache/datafusion-site/pull/60#issuecomment-2741065799 And it is live: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/

Re: [PR] [WIP] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2005998677 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -1334,26 +1334,46 @@ object CometSparkSessionExtensions extends Logging {

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2740902341 FYI @blaginin

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
logan-keede commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2740991793 Where does Memtable belong: datasource or catalog? It is a TableProvider implementation, so I thought it was going to be in catalog, but I'm not so sure anymore as it has dependency

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2741010424 > where does Memtable belong datasource or catalog? it is TableProvider implementation so I thought It was going to be in catalog, but I m not so sure anymore as it has dependency on d

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740980237 > Thanks for checking [@alamb](https://github.com/alamb) ! > > I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it r

Re: [PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
westhide commented on code in PR #15335: URL: https://github.com/apache/datafusion/pull/15335#discussion_r2005989730 ## datafusion/proto/proto/datafusion.proto: ## @@ -997,6 +997,7 @@ message FileScanExecConf { reserved 10; datafusion_common.Constraints constraints = 11;

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15316: URL: https://github.com/apache/datafusion/pull/15316#discussion_r2005982815 ## datafusion/physical-expr/src/physical_expr.rs: ## @@ -146,6 +148,38 @@ pub fn create_ordering( Ok(all_sort_orders) } +/// Create a physical sort expressio

Re: [PR] include some BinaryOperator from sqlparser [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15327: URL: https://github.com/apache/datafusion/pull/15327#discussion_r2005971323 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -793,8 +793,10 @@ impl BinaryExpr { BitwiseShiftRight => bitwise_shift_right_dyn(left, righ

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740900315 I am not really sure where the time is going 🤔 output of explain analyze: [explain.txt](https://github.com/user-attachments/files/19370532/explain.txt) -- This

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007 I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) ```sql > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "Ev

Re: [I] Unsupported NdJsonExec plan and extension codec [datafusion-ballista]

2025-03-20 Thread via GitHub
alamb closed issue #1209: Unsupported NdJsonExec plan and extension codec URL: https://github.com/apache/datafusion-ballista/issues/1209 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740936826 Thanks for checking @alamb ! I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it runs as `Partitioned` hash join, instea

Re: [PR] feat: Support serde for JsonSource PhysicalPlan [datafusion]

2025-03-20 Thread via GitHub
alamb merged PR #15311: URL: https://github.com/apache/datafusion/pull/15311 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2740932019 Thanks @adriangb -- I will try and review it asap (hopefully tomorrow afternoon or tomorrow) -- This is an automated message from the Apache Git Service. To respond to the mess

Re: [PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15335: URL: https://github.com/apache/datafusion/pull/15335#discussion_r2005949069 ## datafusion/proto/proto/datafusion.proto: ## @@ -997,6 +997,7 @@ message FileScanExecConf { reserved 10; datafusion_common.Constraints constraints = 11; +

[PR] Enforce JOIN plan to require condition [datafusion]

2025-03-20 Thread via GitHub
goldmedal opened a new pull request, #15334: URL: https://github.com/apache/datafusion/pull/15334 ## Which issue does this PR close? - Closes #13486 ## Rationale for this change When working on unparsing the plan optimized by `ScalarSubqueryToJoin`, I notice

Re: [PR] Fix extended tests by restore datafusion-testing submodule [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15318: URL: https://github.com/apache/datafusion/pull/15318#issuecomment-2740909824 Thanks @adriangb and @ozankabak -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [PR] chore(deps): Update sqlparser to 0.55.0 [datafusion]

2025-03-20 Thread via GitHub
jonahgao commented on code in PR #15183: URL: https://github.com/apache/datafusion/pull/15183#discussion_r2005949005 ## datafusion/sql/src/planner.rs: ## @@ -560,11 +558,11 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { SQLDataType::SmallInt(_) | SQLDataType::Int

Re: [I] Missing 46.0.1 release for the `datafusion` crate [datafusion]

2025-03-20 Thread via GitHub
vadimpiven commented on issue #15328: URL: https://github.com/apache/datafusion/issues/15328#issuecomment-2740896190 Hi! I can report that without `datafusion` crate release the issue https://github.com/apache/datafusion/issues/15122 still reproduces and still requires hotfix ``` [dep

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-03-20 Thread via GitHub
Shreyaskr1409 commented on code in PR #15313: URL: https://github.com/apache/datafusion/pull/15313#discussion_r2005883069 ## datafusion/physical-plan/Cargo.toml: ## @@ -58,6 +58,7 @@ futures = { workspace = true } half = { workspace = true } hashbrown = { workspace = true } i

Re: [PR] [WIP] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove closed pull request #1561: [WIP] chore: Fix some inconsistencies in memory pool configuration URL: https://github.com/apache/datafusion-comet/pull/1561 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

[PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
westhide opened a new pull request, #15335: URL: https://github.com/apache/datafusion/pull/15335 ## Which issue does this PR close? - Closes None. - Reference [Support serde for batch_size](https://github.com/apache/datafusion/pull/15311#discussion_r2004114426) ## Ra

Re: [PR] perf: Reuse row converter during sort [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on code in PR #15302: URL: https://github.com/apache/datafusion/pull/15302#discussion_r2005752259 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -688,15 +707,29 @@ impl ExternalSorter { let fetch = self.fetch; let expressions: LexOrd
