Re: [PR] MSSQL: Add support for functionality `MERGE` output clause [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
dilovancelik commented on PR #1790: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790#issuecomment-2780303913 Hey, I did a rebase and I think something went wrong. I have now done a classic merge, and it looks like the extra commits are gone. Sorry. -- This is an automated

Re: [I] Trivial WHERE filter not eliminated when combined with CTE [datafusion]

2025-04-04 Thread via GitHub
ding-young commented on issue #15387: URL: https://github.com/apache/datafusion/issues/15387#issuecomment-2780253832 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] Add support for MSSQL IF/ELSE statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1791: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1791#discussion_r2029718008 ## src/ast/spans.rs: ## @@ -739,19 +740,12 @@ impl Spanned for CreateIndex { impl Spanned for CaseStatement { fn span(&self) -> Span { le

Re: [PR] Add support for 'IN ' [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#discussion_r2029711949 ## src/parser/mod.rs: ## @@ -3742,24 +3742,23 @@ impl<'a> Parser<'a> { }); } self.expect_token(&Token::LParen)?; -

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2029708307 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -109,47 +108,84 @@ impl FileOpener for ParquetOpener { .schema_adapter_factory .cr

Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
tomershaniii commented on code in PR #1747: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1747#discussion_r2009156970 ## src/ast/dml.rs: ## @@ -138,6 +143,30 @@ pub struct CreateTable { pub engine: Option, pub comment: Option, pub auto_increment_off

Re: [PR] Use `any` instead of `for_each` [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on code in PR #15289: URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007912619 ## datafusion/datasource-parquet/src/file_format.rs: ## @@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc( total_byte_size += row_group_meta

Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15209: URL: https://github.com/apache/datafusion/pull/15209#issuecomment-2734188614 Thanks again everyone!

Re: [PR] Add support for MSSQL IF/ELSE statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1791: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1791#discussion_r2025392312 ## src/ast/mod.rs: ## @@ -2145,116 +2149,189 @@ impl fmt::Display for CaseStatement { } if let Some(else_block) = else_block { -

Re: [PR] Improve collection during repr and repr_html [datafusion-python]

2025-04-04 Thread via GitHub
konjac commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2007612621 ## src/dataframe.rs: ## @@ -771,3 +871,82 @@ fn record_batch_into_schema( RecordBatch::try_new(schema, data_arrays) } + +/// This is a helper function

[PR] Allow single quotes in EXTRACT() for Redshift. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
romanb opened a new pull request, #1795: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1795 Just like Postgres, Redshift supports enclosing the `datepart` parameter in single quotes. See also the examples in https://docs.aws.amazon.com/redshift/latest/dg/r_EXTRACT_function.htm

Re: [I] Collecting parquet without any transformations throws an exception [datafusion-comet]

2025-04-04 Thread via GitHub
comphead closed issue #1588: Collecting parquet without any transformations throws an exception URL: https://github.com/apache/datafusion-comet/issues/1588

Re: [PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
jsai28 commented on code in PR #15586: URL: https://github.com/apache/datafusion/pull/15586#discussion_r2029700750 ## datafusion-cli/src/main.rs: ## @@ -125,6 +127,14 @@ struct Args { #[clap(long, help = "Enables console syntax highlighting")] color: bool, + +#[c

Re: [PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
2010YOUY01 commented on code in PR #15586: URL: https://github.com/apache/datafusion/pull/15586#discussion_r2029691057 ## docs/source/user-guide/cli/usage.md: ## @@ -57,6 +57,9 @@ OPTIONS: --mem-pool-type Specify the memory pool type 'greedy' or 'fair', de

Re: [I] Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on issue #414: URL: https://github.com/apache/datafusion-comet/issues/414#issuecomment-2775835470 The documentation now states that we only support open-source Apache Spark, so I will close this issue.

Re: [PR] Update changelog and version number [datafusion-python]

2025-04-04 Thread via GitHub
timsaucer commented on PR #1089: URL: https://github.com/apache/datafusion-python/pull/1089#issuecomment-2764539321 Included changes were approved via vote on dev mailing list.

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-04 Thread via GitHub
mbutrovich commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2776011568 A release build will by default set target-cpu=native: https://github.com/apache/datafusion-comet/blob/c5e78b6b59778f0429f0fc8157c6a959bfd9d4c3/Makefile#L101 which

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-04-04 Thread via GitHub
kevinjqliu commented on code in PR #60: URL: https://github.com/apache/datafusion-site/pull/60#discussion_r2006136201 ## content/blog/2025-03-20-parquet-pruning.md: ## @@ -0,0 +1,118 @@ +--- +layout: post +title: Parquet Pruning in DataFusion: Read Only What Matters +date: 2025-

[PR] Chore: Call arrow's methods `row_count` and `skipped_row_count` [datafusion]

2025-04-04 Thread via GitHub
jayzhan211 opened a new pull request, #15587: URL: https://github.com/apache/datafusion/pull/15587 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

[I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
alamb opened a new issue, #10336: URL: https://github.com/apache/datafusion/issues/10336 ### Is your feature request related to a problem or challenge? - Part of https://github.com/apache/datafusion/issues/10313 In https://github.com/apache/datafusion/pull/9593, @suremarc added

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2780179649 I'll open a follow-up PR to make it the default.

Re: [I] Cache Parquet Metadata [datafusion]

2025-04-04 Thread via GitHub
matthewmturner commented on issue #15582: URL: https://github.com/apache/datafusion/issues/15582#issuecomment-2780145360 I am working on this for `dft` right now actually and I plan on integrating it into the observability feature that I have been working on (where different observability m

Re: [PR] Docs : Added Sql examples for window Functions : `nth_val` , etc [datafusion]

2025-04-04 Thread via GitHub
Adez017 commented on PR #1: URL: https://github.com/apache/datafusion/pull/1#issuecomment-2780153961 Is it going to be merged now? @alamb

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
zhuqi-lucas commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2029658117 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -109,47 +108,84 @@ impl FileOpener for ParquetOpener { .schema_adapter_factory

[I] `count` fails for FFI Table Providers [datafusion]

2025-04-04 Thread via GitHub
timsaucer opened a new issue, #15569: URL: https://github.com/apache/datafusion/issues/15569 ### Describe the bug When using FFI Table Providers, we generate an error because the input schemas do not match for cases like `count` where the input schema is irrelevant. See the minim

[PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
jsai28 opened a new pull request, #15586: URL: https://github.com/apache/datafusion/pull/15586 ## Which issue does this PR close? Closes #15553. ## Rationale for this change Allows users to specify a disk limit for spill queries. ## What changes are included in this PR?

Re: [PR] feat: add MAP type support for first level [datafusion-comet]

2025-04-04 Thread via GitHub
comphead commented on PR #1603: URL: https://github.com/apache/datafusion-comet/pull/1603#issuecomment-2779913933 Some of the failing tests with MapVector could be fixed as in https://github.com/apache/datafusion-comet/pull/1610#discussion_r2029452928

Re: [PR] chore: Create simple fuzz test as part of test suite [datafusion-comet]

2025-04-04 Thread via GitHub
comphead commented on code in PR #1610: URL: https://github.com/apache/datafusion-comet/pull/1610#discussion_r2029557330 ## common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala: ## @@ -278,7 +277,7 @@ object Utils { case v @ (_: BitVector | _: TinyIntVector |

Re: [I] Running Spark Shell with Comet throws Exception [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on issue #872: URL: https://github.com/apache/datafusion-comet/issues/872#issuecomment-2775874807 This issue has not been updated for a long time, so I will close it. @radhikabajaj123 Please feel free to reopen if you still have the issue.

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2775970366 @alamb @berkaysynnada My thought about unifying the two methods: ```rust /// Specifies what statistics to compute pub enum StatisticsType { /// Only compute global st

[PR] fix: corrected the logic of eliminating CometSparkToColumnarExec [datafusion-comet]

2025-04-04 Thread via GitHub
wForget opened a new pull request, #1597: URL: https://github.com/apache/datafusion-comet/pull/1597 ## Which issue does this PR close? Closes #1314 and #1588. ## Rationale for this change `EliminateRedundantTransitions` eliminates the required `ColumnarToRowExec`

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-04 Thread via GitHub
gabotechs commented on code in PR #14412: URL: https://github.com/apache/datafusion/pull/14412#discussion_r2026461993 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -129,52 +172,326 @@ impl AggregateUDFImpl for StringAgg { #[derive(Debug)] pub(crate) struct Strin

Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]

2025-04-04 Thread via GitHub
alamb closed issue #15343: [DISCUSS] Switch to `tree` explain by default URL: https://github.com/apache/datafusion/issues/15343

Re: [PR] docs: change OSX/OS X to macOS [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1584: URL: https://github.com/apache/datafusion-comet/pull/1584#issuecomment-2767636712 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1584?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-04-04 Thread via GitHub
alamb commented on code in PR #15313: URL: https://github.com/apache/datafusion/pull/15313#discussion_r2005880020 ## datafusion/physical-plan/Cargo.toml: ## @@ -58,6 +58,7 @@ futures = { workspace = true } half = { workspace = true } hashbrown = { workspace = true } indexmap

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-04-04 Thread via GitHub
2010YOUY01 commented on PR #15501: URL: https://github.com/apache/datafusion/pull/15501#issuecomment-2774917623 > In my mind the only thing remaining for this PR is to reduce the time down from 30 seconds somehow (maybe split it into multiple smaller tests that can run in parallel, for exam

Re: [PR] chore: Create simple fuzz test as part of test suite [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1610: URL: https://github.com/apache/datafusion-comet/pull/1610#issuecomment-2779892103 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1610?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778655813 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64

Re: [I] Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove closed issue #414: Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? URL: https://github.com/apache/datafusion-comet/issues/414

Re: [PR] minor: Fix clippy warnings [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1606: URL: https://github.com/apache/datafusion-comet/pull/1606#issuecomment-2775693271 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1606?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-04-04 Thread via GitHub
viirya commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006929718 ## spark/src/main/scala/org/apache/comet/CometExecIterator.scala: ## @@ -63,9 +64,28 @@ class CometExecIterator( }.toArray private val plan = { val c

Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r2003780906 ## spark/src/main/scala/org/apache/spark/Plugins.scala: ## @@ -63,13 +63,10 @@ class CometDriverPlugin extends DriverPlugin with Logging with ShimCometDrive

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
alamb closed issue #10336: Enable `split_file_groups_by_statistics` by default URL: https://github.com/apache/datafusion/issues/10336

[I] Filter cache [datafusion]

2025-04-04 Thread via GitHub
adriangb opened a new issue, #15585: URL: https://github.com/apache/datafusion/issues/15585 ### Is your feature request related to a problem or challenge? I would like to propose APIs / maybe a non-default implementation of a "filter cache". It's an idea I got from the paper titled `P

Re: [PR] feat: implement GroupsAccumulator for `count(DISTINCT)` aggr [datafusion]

2025-04-04 Thread via GitHub
Dandandan commented on code in PR #15324: URL: https://github.com/apache/datafusion/pull/15324#discussion_r2005787338 ## datafusion/functions-aggregate/src/count.rs: ## @@ -752,10 +761,245 @@ impl Accumulator for DistinctCountAccumulator { } } +/// GroupsAccumulator for

Re: [PR] Add support for 'IN ' [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
adamchainz commented on code in PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#discussion_r2029486915 ## src/parser/mod.rs: ## @@ -3742,24 +3742,23 @@ impl<'a> Parser<'a> { }); } self.expect_token(&Token::LParen)?; -

Re: [PR] Add support for 'IN ' [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
adamchainz commented on code in PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#discussion_r2029485879 ## src/parser/mod.rs: ## @@ -3744,10 +3744,18 @@ impl<'a> Parser<'a> { self.expect_token(&Token::LParen)?; let in_op = if self.par

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027464873 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -295,3 +315,84 @@ fn create_initial_plan( // default to scanning all row groups Ok(ParquetAccessPl

Re: [PR] Improve performance of `last_value` by implementing special `GroupsAccumulator` [datafusion]

2025-04-04 Thread via GitHub
comphead commented on code in PR #15542: URL: https://github.com/apache/datafusion/pull/15542#discussion_r2029478517 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -291,7 +202,121 @@ impl AggregateUDFImpl for FirstValue { } } -struct FirstPrimitiveGroupsAccum

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
XiangpengHao commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2779791930 This is a great tool; I've been hoping for it for years! Nice work! I previously relied on DuckDB to generate TPC-H (as [suggested](https://xuanwo.io/links/2025/02/duckdb-i

Re: [I] Update ClickBench queries to avoid ::INT::DATE casting [datafusion]

2025-04-04 Thread via GitHub
comphead closed issue #15509: Update ClickBench queries to avoid ::INT::DATE casting URL: https://github.com/apache/datafusion/issues/15509

Re: [PR] chore: update clickbench [datafusion]

2025-04-04 Thread via GitHub
comphead merged PR #15574: URL: https://github.com/apache/datafusion/pull/15574

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029469488 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data genera

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467839 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data genera

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467529 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data genera

Re: [PR] Run test [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15584: URL: https://github.com/apache/datafusion/pull/15584#issuecomment-2779771555 /benchmark

Re: [PR] Run test [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15584: URL: https://github.com/apache/datafusion/pull/15584#issuecomment-2779770979 /benchmark

Re: [PR] Run test [datafusion]

2025-04-04 Thread via GitHub
adriangb closed pull request #15584: Run test URL: https://github.com/apache/datafusion/pull/15584

Re: [PR] Run test [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15584: URL: https://github.com/apache/datafusion/pull/15584#issuecomment-2779771916 (sorry wrong repo)

[PR] Run test [datafusion]

2025-04-04 Thread via GitHub
adriangb opened a new pull request, #15584: URL: https://github.com/apache/datafusion/pull/15584 (no comment)

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
andygrove commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029454642 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator

Re: [PR] fix: corrected the logic of eliminating CometSparkToColumnarExec [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove merged PR #1597: URL: https://github.com/apache/datafusion-comet/pull/1597

Re: [PR] Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact [datafusion]

2025-04-04 Thread via GitHub
alamb commented on code in PR #15539: URL: https://github.com/apache/datafusion/pull/15539#discussion_r2027622762 ## datafusion/datasource/src/file_groups.rs: ## @@ -263,7 +264,21 @@ impl FileGroupPartitioner { .flatten() .chunk_by(|(partition_idx, _)|

Re: [PR] chore: Create simple fuzz test as part of test suite [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on code in PR #1610: URL: https://github.com/apache/datafusion-comet/pull/1610#discussion_r2029452928 ## common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala: ## @@ -278,7 +277,7 @@ object Utils { case v @ (_: BitVector | _: TinyIntVector

[PR] chore: remove unused executor configuration option [datafusion-ballista]

2025-04-04 Thread via GitHub
milenkovicm opened a new pull request, #1229: URL: https://github.com/apache/datafusion-ballista/pull/1229 # Which issue does this PR close? Closes None. # Rationale for this change Remove unused executor configuration option # What changes are included in thi

Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-04 Thread via GitHub
zhuqi-lucas commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2771988430 Thank you @berkaysynnada, I agree it's a common linearity property; this is a great idea. I will try to address it, and maybe we can start from the SUM function. And add more c

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2779728837 > @alamb I'm curious, could you do a main vs. main run just so we get an idea of what the variability is? I don't know if I should be looking at <10% changes or not. I did that

Re: [I] Nested correlated subquery error with a depth exceeding 1 [datafusion]

2025-04-04 Thread via GitHub
irenjj commented on issue #15558: URL: https://github.com/apache/datafusion/issues/15558#issuecomment-2775368757 take

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub
timsaucer commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029415412 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator

[PR] feat: remove flight-sql from scheduler [datafusion-ballista]

2025-04-04 Thread via GitHub
milenkovicm opened a new pull request, #1228: URL: https://github.com/apache/datafusion-ballista/pull/1228 # Which issue does this PR close? Closes #1227 . # Rationale for this change # What changes are included in this PR? # Are there any user-facing chan

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2779722285 > I will file a ticket to investigate this -Filed https://github.com/apache/datafusion/issues/15582

[I] Cache Parquet Metadata [datafusion]

2025-04-04 Thread via GitHub
alamb opened a new issue, #15582: URL: https://github.com/apache/datafusion/issues/15582 ### Is your feature request related to a problem or challenge? When looking at some Samply profiles of ClickBench queries on my laptop, it appears there are several times where processing stalls

Re: [PR] fix: add an "expr_planners" method to SessionState [datafusion]

2025-04-04 Thread via GitHub
Omega359 commented on code in PR #15119: URL: https://github.com/apache/datafusion/pull/15119#discussion_r2007422753 ## datafusion/core/src/execution/context/mod.rs: ## @@ -1632,7 +1632,7 @@ impl FunctionRegistry for SessionContext { } fn expr_planners(&self) -> Vec>

Re: [PR] Blog post about user defined window functions [datafusion-site]

2025-04-04 Thread via GitHub
Adez017 commented on PR #66: URL: https://github.com/apache/datafusion-site/pull/66#issuecomment-2778089125 Hi @alamb, I think there is some issue while running it locally from the Docker image, so I am just raising a PR. I think it is pretty nice; if you think any modification is needed, please let m

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-04-04 Thread via GitHub
goldmedal commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2764514075 > This PR appears to have caused CI failures for some reason so @goldmedal has a PR to revert it: > > * [Revert #15476 to fix the datafusion-examples CI fail  #15496](https:/

Re: [PR] Minor: add Arc for statistics in FileGroup [datafusion]

2025-04-04 Thread via GitHub
jayzhan211 merged PR #15564: URL: https://github.com/apache/datafusion/pull/15564

Re: [I] A complete solution for stable and safe sort with spill [datafusion]

2025-04-04 Thread via GitHub
qstommyshu commented on issue #14692: URL: https://github.com/apache/datafusion/issues/14692#issuecomment-2771022723 > This is a reproducer for an external sort query failure under a very low memory limit [#15028](https://github.com/apache/datafusion/issues/15028) Discord discussion:[disco

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2779705535 > My current intuition about it is twofold: I agree with this assessment -- so we save CPU work with this change but the total query time doesn't really decrease because we are no

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778656470 Benchmark completed Details ``` Comparing HEAD and alamb_test_upgrade_54 Benchmark tpch_mem_sf1.json ┏━

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778656411 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64

Re: [PR] Draft: Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-04 Thread via GitHub
zhuqi-lucas commented on code in PR #15532: URL: https://github.com/apache/datafusion/pull/15532#discussion_r2024355571 ## datafusion/sqllogictest/test_files/explain.slt: ## @@ -244,6 +244,159 @@ physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/

[I] Remove `flight-sql` from ballista [datafusion-ballista]

2025-04-04 Thread via GitHub
milenkovicm opened a new issue, #1227: URL: https://github.com/apache/datafusion-ballista/issues/1227 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** I would like to propose to remove `flight-sql` as the part of #1068 there is

Re: [PR] Improve spill performance: Disable re-validation of spilled files [datafusion]

2025-04-04 Thread via GitHub
2010YOUY01 commented on code in PR #15454: URL: https://github.com/apache/datafusion/pull/15454#discussion_r2022268406 ## datafusion/physical-plan/benches/spill_io.rs: ## @@ -0,0 +1,123 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2779688162 We have drafted a blog about this project in case anyone wants to review / check it out: - https://github.com/apache/datafusion-site/pull/67

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2779677205 @alamb I'm curious, could you do a main vs. main run just so we get an idea of what the variability is? I don't know if I should be looking at <10% changes or not.

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2779674618 🤖: Benchmark completed Details ``` Comparing HEAD and filter-pushdown-apis Benchmark clickbench_extended.json

Re: [I] Support integration with Parquet modular encryption [datafusion]

2025-04-04 Thread via GitHub
corwinjoy commented on issue #15216: URL: https://github.com/apache/datafusion/issues/15216#issuecomment-2742213463 So, to play the devil's advocate, here are some arguments for having encryption configurations encoded as plain strings: 1. Users may want to run datafusion using the CLI. I

Re: [PR] bench: Document how to use cross platform Samply profiler [datafusion]

2025-04-04 Thread via GitHub
comphead merged PR #15481: URL: https://github.com/apache/datafusion/pull/15481

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
leoyvens commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2779659064 Should this issue have been closed? Did #15473 change default behaviour?

Re: [PR] Allow single quotes in EXTRACT() for Redshift. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio merged PR #1795: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1795

Re: [PR] MSSQL: Add support for functionality `MERGE` output clause [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on PR #1790: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790#issuecomment-2779657352 @dilovancelik Oh, could you take a look at the added commits? I'm not sure how the merge from main was done, but some extraneous diffs have been brought into the PR

[PR] Add Table Functions to FFI Crate [datafusion]

2025-04-04 Thread via GitHub
timsaucer opened a new pull request, #15581: URL: https://github.com/apache/datafusion/pull/15581 ## Which issue does this PR close? This addresses part of #14562 ## Rationale for this change We currently have support for user defined scalar functions. Aggregates and Win

Re: [I] `batches_to_sort_string` differing from similar implementation in `assert_batches_sorted_eq` [datafusion]

2025-04-04 Thread via GitHub
blaginin commented on issue #15312: URL: https://github.com/apache/datafusion/issues/15312#issuecomment-2737028735 Hey, I think this is happening because `assert_batches_sorted_eq` sorts both lhs and rhs, while `batches_to_sort_string` sorts its _only_ input - and because the snapshot is on

Re: [PR] Migrate optimizer tests to insta [datafusion]

2025-04-04 Thread via GitHub
alamb merged PR #15446: URL: https://github.com/apache/datafusion/pull/15446

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-04 Thread via GitHub
Dandandan commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2779629454 My current intuition about it is twofold: * It doesn't show up as clearly in a normal flamegraph as it doesn't show you the profile per thread but only the "overall" samples,

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-04-04 Thread via GitHub
kazuyukitanimura commented on PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#issuecomment-2768105136 Merged, thanks @mbutrovich @andygrove @parthchandra

[PR] Site/tpch data generator [datafusion-site]

2025-04-04 Thread via GitHub
alamb opened a new pull request, #67: URL: https://github.com/apache/datafusion-site/pull/67 - closes https://github.com/clflushopt/tpchgen-rs/issues/45 - Related to https://github.com/apache/datafusion/issues/14608 Draft post See the staged version here: XXX My only p

[PR] chore(deps): bump quote from 1.0.38 to 1.0.40 [datafusion]

2025-04-04 Thread via GitHub
dependabot[bot] opened a new pull request, #15332: URL: https://github.com/apache/datafusion/pull/15332 Bumps [quote](https://github.com/dtolnay/quote) from 1.0.38 to 1.0.40. Release notes Sourced from [quote's releases](https://github.com/dtolnay/quote/releases). 1.0.40

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2779619085 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_
