Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191553716 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,545 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Fil

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-07 Thread via GitHub
drexler-sky commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2191616235 ## native/core/src/parquet/objectstore/jni.rs: ## @@ -0,0 +1,332 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-07 Thread via GitHub
drexler-sky commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2191615824 ## common/src/main/java/org/apache/comet/parquet/Native.java: ## @@ -292,4 +299,104 @@ public static native void currentColumnBatch( * @param handle

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-07 Thread via GitHub
drexler-sky commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2191617409 ## native/core/src/parquet/parquet_support.rs: ## @@ -384,6 +392,17 @@ pub(crate) fn prepare_object_store_with_configs( let (object_store, object_st

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-07 Thread via GitHub
brunal commented on PR #16682: URL: https://github.com/apache/datafusion/pull/16682#issuecomment-3047584092 Apologies for that; it's now fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-07 Thread via GitHub
zhuqi-lucas commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3047184673 Thank you @alamb , I am curious about the benchmark result comparing the main branch, because we will include the https://github.com/apache/arrow-rs/pull/7850 for this PR.

Re: [PR] optimize `ScalarValue::to_array_of_size` for structural types [datafusion]

2025-07-07 Thread via GitHub
ding-young commented on PR #16706: URL: https://github.com/apache/datafusion/pull/16706#issuecomment-3047193295 > Thanks @ding-young looks good can this PR also have a bench file or benchmark run results to understand the optimization in numbers? Thank you @comphead When I run

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191564872 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,545 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Fil

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-07 Thread via GitHub
parthchandra commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2191310903 ## native/core/src/parquet/objectstore/jni.rs: ## @@ -0,0 +1,332 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-07 Thread via GitHub
codecov-commenter commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3043815066 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1992?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Enable Projection Pushdown Optimization for Recursive CTEs [datafusion]

2025-07-07 Thread via GitHub
kosiew commented on code in PR #16696: URL: https://github.com/apache/datafusion/pull/16696#discussion_r2189407204 ## datafusion/optimizer/Cargo.toml: ## @@ -64,6 +64,7 @@ datafusion-functions-window-common = { workspace = true } datafusion-sql = { workspace = true } env_logge

Re: [I] Datafusion can't seem to cast evolving structs [datafusion]

2025-07-07 Thread via GitHub
kosiew commented on issue #14757: URL: https://github.com/apache/datafusion/issues/14757#issuecomment-3044059050 You're welcome. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16645: URL: https://github.com/apache/datafusion/pull/16645#issuecomment-3044952327 Seems a few tests need to be updated to get CI to pass. @geetanshjuneja can you take a look? -- This is an automated message from the Apache Git Service. To respond to the message, pleas

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16660: URL: https://github.com/apache/datafusion/pull/16660#issuecomment-3044955756 I will try and review this PR later this week -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] fix: sqllogictest runner label condition mismatch [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16633: URL: https://github.com/apache/datafusion/pull/16633#issuecomment-3044947441 run extended tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comm

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
NGA-TRAN commented on code in PR #16662: URL: https://github.com/apache/datafusion/pull/16662#discussion_r2190019994 ## datafusion/proto/tests/cases/roundtrip_physical_plan.rs: ## @@ -1736,3 +1737,55 @@ async fn roundtrip_physical_plan_node() { let _ = plan.execute(0, ctx

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-07 Thread via GitHub
alamb commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3044353421 > `giscus` works by creating a corresponding discussion thread for each post and linking the comments in the thread. > Here's an example, [kevinjqliu/blog#3](https://github.com/k

Re: [I] feature: adapt predicate pushdown for mismatched nested/struct schemas [datafusion]

2025-07-07 Thread via GitHub
alamb commented on issue #16565: URL: https://github.com/apache/datafusion/issues/16565#issuecomment-3044369805 > I agree with you [@alamb](https://github.com/alamb). > > The issue I see with both approaches is going to be nested types: once you call the arrow cast kernel you can't ta

[PR] fix: port arrow inline fast key fix to datafusion [datafusion]

2025-07-07 Thread via GitHub
zhuqi-lucas opened a new pull request, #16698: URL: https://github.com/apache/datafusion/pull/16698 ## Which issue does this PR close? We fixed the inline key corner case in: https://github.com/apache/arrow-rs/pull/7875 This PR ports the change to datafusion CursorValue co

Re: [PR] fix: port arrow inline fast key fix to datafusion [datafusion]

2025-07-07 Thread via GitHub
zhuqi-lucas commented on PR #16698: URL: https://github.com/apache/datafusion/pull/16698#issuecomment-3044409855 Thank you @alamb! Porting arrow fix to datafusion in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Git

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-07 Thread via GitHub
bert-beyondloops commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3044765606 Hi @alamb, Thanks for the review. There are already some tests verifying the explain plans for the unnest plan. The scheduled run will currently fail. I'll a

Re: [PR] refactor filter pushdown APIs [datafusion]

2025-07-07 Thread via GitHub
adriangb commented on PR #16642: URL: https://github.com/apache/datafusion/pull/16642#issuecomment-3044828528 Yes by me! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-3044902893 @Dandandan I can see if I can run some benchmarks. @alamb this should be good to go, i'll see if I can implement some much needed hash join spilling after this gets merged

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2187217486 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -76,8 +76,40 @@ impl FusedStreams { } } +/// A pair of `Arc` that can be reused +#[derive(Debug)] +str

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3044933393 > There are already some tests verifying the explain plans for the unnest plan. The scheduled run will currently fail. I'll adapt the 2 test failures Indeed the existing tests

Re: [PR] Clickhouse: support empty parenthesized options [datafusion-sqlparser-rs]

2025-07-07 Thread via GitHub
alamb commented on PR #1925: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1925#issuecomment-3044912200 Thanks as always for keeping the code flowing @iffyio -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3044919809 As a random aside, this is the kind of micro optimization that I would have been terrified of making in C/C++ land without extreme care to avoid concurrent access. With Rust I

Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

2025-07-07 Thread via GitHub
alamb commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3044943658 > I observed a significant performance improvement just testing on the master branch without any metadata caching as compared to before. I don't know what change caused this? Any

[I] Python test dependencies in release verification instructions are out of date [datafusion-python]

2025-07-07 Thread via GitHub
paleolimbot opened a new issue, #1182: URL: https://github.com/apache/datafusion-python/issues/1182 **Describe the bug** In running the instructions for verifying a release candidate I noticed that several additional dependencies were required! **To Reproduce** Run verif

[PR] Optional improvements in verification instructions [datafusion-python]

2025-07-07 Thread via GitHub
paleolimbot opened a new pull request, #1183: URL: https://github.com/apache/datafusion-python/pull/1183 # Which issue does this PR close? Closes #1182 # Rationale for this change The release verification instructions linked in the vote email don't mention the helpf

Re: [PR] Fix duplicate field name error in Join::try_new_with_project_input during physical planning [datafusion]

2025-07-07 Thread via GitHub
LiaCastaneda commented on PR #16454: URL: https://github.com/apache/datafusion/pull/16454#issuecomment-3045022842 > I am sorry I didn't quite follow if this fix is specific to substrait or if it also fixes some issue that could be hit with a SQL as well? Specifically, is there any SQL query

Re: [PR] Per file filter evaluation [datafusion]

2025-07-07 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2190043206 ## datafusion-examples/examples/variant_shredding.rs: ## @@ -0,0 +1,398 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3045016104 @korowa @2010YOUY01 Are you able to take a quick look? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub an

[PR] Add support for granting privileges to procedures and functions in Sn… [datafusion-sqlparser-rs]

2025-07-07 Thread via GitHub
yoavcloud opened a new pull request, #1930: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1930 …owflake -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To uns

Re: [PR] fix: sqllogictest runner label condition mismatch [datafusion]

2025-07-07 Thread via GitHub
alamb merged PR #16633: URL: https://github.com/apache/datafusion/pull/16633 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] [substrait] [sqllogictest] table 'datafusion.public.aggregate_test_100_by_sql' not found [datafusion]

2025-07-07 Thread via GitHub
alamb closed issue #16282: [substrait] [sqllogictest] table 'datafusion.public.aggregate_test_100_by_sql' not found URL: https://github.com/apache/datafusion/issues/16282 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

Re: [PR] Per file filter evaluation [datafusion]

2025-07-07 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2190141364 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -233,21 +238,31 @@ impl FileOpener for ParquetOpener { } } +pred

[I] Output schema of the CopyTo logical plan is not correct. [datafusion]

2025-07-07 Thread via GitHub
bert-beyondloops opened a new issue, #16704: URL: https://github.com/apache/datafusion/issues/16704 ### Describe the bug The output schema of the CopyTo logical plan currently directly outputs the underlying schema of its child. The physical plan however always returns a 1 row

[PR] Fix: CopyTo logical plan outputs 1 column [datafusion]

2025-07-07 Thread via GitHub
bert-beyondloops opened a new pull request, #16705: URL: https://github.com/apache/datafusion/pull/16705 ## Which issue does this PR close? `Closes #16704` ## Rationale for this change Schema of logical plan should be the same as the associated physical plan.

[I] Bloom filters are unused for certain where clause patterns [datafusion]

2025-07-07 Thread via GitHub
debajyoti-truefoundry opened a new issue, #16697: URL: https://github.com/apache/datafusion/issues/16697 ### Describe the bug ### Query 1: ```sql SELECT col1, col2, count(*) FROM test_data WHERE (col1 = 'category_1' AND col2 = 'type_1') OR (col1 = 'category_2' AND co

Re: [I] [substrait] [sqllogictest] Cannot convert to Substrait [datafusion]

2025-07-07 Thread via GitHub
ViggoC commented on issue #16281: URL: https://github.com/apache/datafusion/issues/16281#issuecomment-3045413737 Hi @gabotechs, I'm trying to fix `cargo test --test sqllogictests -- --substrait-round-trip subquery.slt:1007`, and handle converting OuterReferenceColumn is necessary to do it.

[PR] optimize `ScalarValue::to_array_of_size` for structural types [datafusion]

2025-07-07 Thread via GitHub
ding-young opened a new pull request, #16706: URL: https://github.com/apache/datafusion/pull/16706 ## Which issue does this PR close? - Closes #13754 . ## Rationale for this change ## What changes are included in this PR? ## Are these change

Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

2025-07-07 Thread via GitHub
swaingotnochill commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3045422144 @alamb You are right, the improvement is due to collect_statistics. My poc is able to reduce the original time, however it still doesn't seem to work very well wi

[PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb opened a new pull request, #16699: URL: https://github.com/apache/datafusion/pull/16699 ## Which issue does this PR close? - Part of #16486 ## Rationale for this change We back ported the default setting for statistics `datafusion.execution.collect_statistics` to

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16699: URL: https://github.com/apache/datafusion/pull/16699#discussion_r2189688564 ## docs/source/index.rst: ## @@ -126,6 +126,7 @@ To get started, see :caption: Library User Guide library-user-guide/index + library-user-guide/upgrad

[PR] chore(deps): bump tokio from 1.46.0 to 1.46.1 [datafusion]

2025-07-07 Thread via GitHub
dependabot[bot] opened a new pull request, #16700: URL: https://github.com/apache/datafusion/pull/16700 [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tokio&package-manager=cargo&previous-version=1.46.0&new-versio

Re: [PR] Blog : Extending Apache Parquet with User Defined Indexes to Accelerate Query Processing with DataFusion [datafusion-site]

2025-07-07 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3045304111 I went over this again and messed around with the wording but not the content. I also made the conclusion a bit stronger and made the wording a bit more concise I'll plan to publ

Re: [PR] Fix duplicate field name error in Join::try_new_with_project_input during physical planning [datafusion]

2025-07-07 Thread via GitHub
alamb merged PR #16454: URL: https://github.com/apache/datafusion/pull/16454 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Fix duplicate field name error in Join::try_new_with_project_input during physical planning [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16454: URL: https://github.com/apache/datafusion/pull/16454#issuecomment-3045323217 Thanks again @LiaCastaneda and @gabotechs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [I] Running tests with `--test-threads` option fails. [datafusion]

2025-07-07 Thread via GitHub
kosiew commented on issue #16693: URL: https://github.com/apache/datafusion/issues/16693#issuecomment-3044144428 @mjgarton Thanks for reporting this. I tested and `cargo test -- --test-threads 1` ran ok with [commit 12c40ca](https://github.com/apache/datafusion/pull/166

[PR] Enable Projection Pushdown Optimization for Recursive CTEs [datafusion]

2025-07-07 Thread via GitHub
kosiew opened a new pull request, #16696: URL: https://github.com/apache/datafusion/pull/16696 ## Which issue does this PR close? - Closes #16684 ## Rationale for this change This PR introduces support for projection pushdown in recursive common table expressions (CTEs),

Re: [PR] Add support for Arrow Dictionary type in Substrait [datafusion]

2025-07-07 Thread via GitHub
alamb merged PR #16608: URL: https://github.com/apache/datafusion/pull/16608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] [substrait] [sqllogictest] Unsupported cast type: Dictionary(Int32, Utf8) [datafusion]

2025-07-07 Thread via GitHub
alamb closed issue #16273: [substrait] [sqllogictest] Unsupported cast type: Dictionary(Int32, Utf8) URL: https://github.com/apache/datafusion/issues/16273 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3044638489 Thanks @bert-beyondloops -- sorry for the delay in review I started the CI tests The changes in this PR make sense to me -- the only thing I think it needs is some test t

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16662: URL: https://github.com/apache/datafusion/pull/16662#discussion_r2189820163 ## datafusion/proto/tests/cases/roundtrip_physical_plan.rs: ## @@ -1736,3 +1737,55 @@ async fn roundtrip_physical_plan_node() { let _ = plan.execute(0, ctx.ta

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16662: URL: https://github.com/apache/datafusion/pull/16662#issuecomment-3044668675 Thanks @NGA-TRAN -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16682: URL: https://github.com/apache/datafusion/pull/16682#issuecomment-3044670474 Looks like there may be a test in https://github.com/brunal/datafusion/commit/aed5316583ecae7901d9596039ca4bfe7cd48811 -- This is an automated message from the Apache Git Ser

Re: [PR] fix: try to lower plain reserved functions to columns as well [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16669: URL: https://github.com/apache/datafusion/pull/16669#discussion_r2189827445 ## datafusion/sql/src/expr/function.rs: ## @@ -93,6 +93,8 @@ struct FunctionArgs { distinct: bool, /// WITHIN GROUP clause, if any within_group: Vec,

Re: [PR] Optional improvements in verification instructions [datafusion-python]

2025-07-07 Thread via GitHub
paleolimbot commented on code in PR #1183: URL: https://github.com/apache/datafusion-python/pull/1183#discussion_r2190333157 ## dev/release/README.md: ## @@ -176,11 +183,11 @@ source .venv/bin/activate # install release candidate pip install --extra-index-url https://test.pyp

Re: [I] Release DataFusion `48.0.1` [datafusion]

2025-07-07 Thread via GitHub
alamb commented on issue #16486: URL: https://github.com/apache/datafusion/issues/16486#issuecomment-3045541520 released to crates.io and all is good -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-07 Thread via GitHub
kevinjqliu commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3045713260 cheers, i'll put up a PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb merged PR #16699: URL: https://github.com/apache/datafusion/pull/16699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-07 Thread via GitHub
kevinjqliu commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3045716736 @alamb could you take a look at #81 and #82 when you get a chance? I also plan on creating a Makefile so running this repo locally can just be a `make build` -- This is

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2190820204 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -808,16 +809,22 @@ pub(crate) fn get_final_indices_from_shared_bitmap( pub(crate) fn get_final_indices_f

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
comphead commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2190672301 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -808,16 +809,22 @@ pub(crate) fn get_final_indices_from_shared_bitmap( pub(crate) fn get_final_indices_from

Re: [PR] chore: extract CreateArray from QueryPlanSerde [datafusion-comet]

2025-07-07 Thread via GitHub
andygrove merged PR #1991: URL: https://github.com/apache/datafusion-comet/pull/1991 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@

Re: [I] chore: extract CreateArray from QueryPlanSerde [datafusion-comet]

2025-07-07 Thread via GitHub
andygrove closed issue #1990: chore: extract CreateArray from QueryPlanSerde URL: https://github.com/apache/datafusion-comet/issues/1990 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-07 Thread via GitHub
kevinjqliu commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3046273219 Giscus has 3 requirements - [x] The repository is public, otherwise visitors will not be able to view the discussion. - [ ] The giscus app is installed, otherwise visitors

Re: [PR] remove FileSource::with_projection [datafusion]

2025-07-07 Thread via GitHub
adriangb commented on PR #16708: URL: https://github.com/apache/datafusion/pull/16708#issuecomment-3046278614 I don't think this makes sense, in fact we probably want the opposite refactor (use `with_projection`) so that each data source can decide what projections it can and can't push dow

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-07 Thread via GitHub
MrPowers commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3046278774 I reran the groupby h2o benchmarks on the 1e8 dataset stored in Parquet to compare DataFusion v47.0.0 and DuckDB v1.3.1, here are the results: https://github.com/user-att

Re: [PR] remove FileSource::with_projection [datafusion]

2025-07-07 Thread via GitHub
adriangb closed pull request #16708: remove FileSource::with_projection URL: https://github.com/apache/datafusion/pull/16708 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [I] Update release scripts to publish Comet jars for Spark 4.0.0 [datafusion-comet]

2025-07-07 Thread via GitHub
parthchandra commented on issue #1989: URL: https://github.com/apache/datafusion-comet/issues/1989#issuecomment-3046900140 I don't think the builder docker images are used to build the java/scala code. IIRC the docker containers are used only for the native libs which are then copied over

Re: [PR] Add Snowflake `COPY/REVOKE CURRENT GRANTS` option [datafusion-sqlparser-rs]

2025-07-07 Thread via GitHub
iffyio merged PR #1926: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1926 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
comphead commented on code in PR #16699: URL: https://github.com/apache/datafusion/pull/16699#discussion_r2190441264 ## docs/source/library-user-guide/upgrading.md: ## @@ -137,6 +120,25 @@ SET datafusion.execution.spill_compression = 'zstd'; For more details about this config

Re: [PR] chore(deps): bump tokio from 1.46.0 to 1.46.1 [datafusion]

2025-07-07 Thread via GitHub
comphead merged PR #16700: URL: https://github.com/apache/datafusion/pull/16700 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

[I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-07 Thread via GitHub
wegamekinglc opened a new issue, #16707: URL: https://github.com/apache/datafusion/issues/16707 ### Describe the bug Hi team, I have encountered a performance issue when I run the same query on a big table with DataFusion compared with DuckDB. I will try to simplify my case and re

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16682: URL: https://github.com/apache/datafusion/pull/16682#issuecomment-3046616782 The CI tests seem to have some problems unfortunately -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL a

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16632: URL: https://github.com/apache/datafusion/pull/16632#discussion_r2191065570 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -39,9 +39,9 @@ physical_plan 01)ProjectionExec: expr=[__unnest_placeholder(v.column2,depth=1)@0 as

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2191068082 ## datafusion/physical-plan/src/joins/piecewise_merge_join.rs: ## @@ -0,0 +1,2114 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3046608364 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16662: URL: https://github.com/apache/datafusion/pull/16662#issuecomment-3046622370 Thanks again @NGA-TRAN and @gabotechs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
alamb merged PR #16662: URL: https://github.com/apache/datafusion/pull/16662

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16645: URL: https://github.com/apache/datafusion/pull/16645#issuecomment-3046635728 > How to update tests in .slt files? Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
adriangb commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2191078549 ## datafusion/physical-plan/src/joins/piecewise_merge_join.rs: ## @@ -0,0 +1,2114 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more co

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-07 Thread via GitHub
XiangpengHao commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3046643906 Taking a look at the test failures..

Re: [PR] Fix test running compatibility [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16694: URL: https://github.com/apache/datafusion/pull/16694#discussion_r2190398395 ## datafusion/sqllogictest/bin/sqllogictests.rs: ## @@ -689,6 +689,12 @@ struct Options { help = "IGNORED (for compatibility with built-in rust test runner)

Re: [PR] Add reproducer for tpch Q16 deserialization bug [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16662: URL: https://github.com/apache/datafusion/pull/16662#issuecomment-3045731762 Actually there was a real clippy bug but I pushed a fix as I had this PR open anyways locally

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub
comphead commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190517681 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,545 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Files

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub
comphead commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3045780524 Thanks @zhuqi-lucas @JigaoLuo @alamb Added some possible minor improvements

Re: [PR] Fix: CopyTo logical plan outputs 1 column [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16705: URL: https://github.com/apache/datafusion/pull/16705#discussion_r2190383456 ## datafusion/expr/src/logical_plan/dml.rs: ## @@ -89,6 +91,26 @@ impl Hash for CopyTo { } } +impl CopyTo { +pub fn new( +input: Arc, +ou

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
comphead commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2190450917 ## datafusion/physical-plan/src/joins/mod.rs: ## @@ -23,12 +23,14 @@ use datafusion_physical_expr::PhysicalExprRef; pub use hash_join::HashJoinExec; pub use nes

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
comphead commented on PR #16660: URL: https://github.com/apache/datafusion/pull/16660#issuecomment-3045686528 Thanks @jonathanc-n let me first get familiar with this kind of join

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16699: URL: https://github.com/apache/datafusion/pull/16699#discussion_r2190454962 ## docs/source/library-user-guide/upgrading.md: ## @@ -137,6 +120,25 @@ SET datafusion.execution.spill_compression = 'zstd'; For more details about this configura

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16699: URL: https://github.com/apache/datafusion/pull/16699#issuecomment-3045689726 I am happy to make any other suggested changes

Re: [PR] Update Upgrade Guide for 48.0.1 [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16699: URL: https://github.com/apache/datafusion/pull/16699#issuecomment-3045688905 Thank you for the review @comphead -- i am going to merge this in ASAP as we have already released 48.0.1 and it would be nice to have this doc in place

[I] Upgrade to DataFusion 49.0.0 [datafusion-comet]

2025-07-07 Thread via GitHub
andygrove opened a new issue, #1993: URL: https://github.com/apache/datafusion-comet/issues/1993 ### What is the problem the feature request solves? We should start the process of updating Comet to use the latest DataFusion code, in preparation for upgrading to the 49.0.0 release.

Re: [I] Update release scripts to publish Comet jars for Spark 4.0.0 [datafusion-comet]

2025-07-07 Thread via GitHub
andygrove commented on issue #1989: URL: https://github.com/apache/datafusion-comet/issues/1989#issuecomment-3046023467 As part of this, we will need to update the builder Docker image to support JDK 17

Re: [I] [iceberg] Error loading in-memory sorter check class path [datafusion-comet]

2025-07-07 Thread via GitHub
hsiang-c commented on issue #1982: URL: https://github.com/apache/datafusion-comet/issues/1982#issuecomment-3046570709 take

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on code in PR #16660: URL: https://github.com/apache/datafusion/pull/16660#discussion_r2191040643 ## datafusion/physical-plan/src/joins/piecewise_merge_join.rs: ## @@ -0,0 +1,2114 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more
