Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-02-29 Thread Prem Sahoo
Hello Dongjoon, Thanks for emailing me. Could you please share a list of fixes as the link provided by you is not working. On Thu, Feb 29, 2024 at 11:27 AM Dongjoon Hyun wrote: > Hi, > > If you are observing correctness issues, you may hit some old (and fixed) > correctness issues. > > For

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations! Thanks, Xinrong On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-02-29 Thread Dongjoon Hyun
Hi, If you are observing correctness issues, you may hit some old (and fixed) correctness issues. For example, from Apache Spark 3.2.1 to 3.2.4, we fixed 31 correctness issues.

When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-02-29 Thread Prem Sahoo
When a Spark job throws FetchFailedException, it creates some duplicate data and some data also goes missing; please explain why. We have a scenario where a Spark job reports FetchFailedException because one of the data nodes was rebooted in the middle of the job run. Due to this we now have some duplicate data and

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations! Bests, Dongjoon. On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > Congratulations! > > > > At 2024-02-28 17:43:25, "Jungtaek Lim" > wrote: > > Hi everyone, > > We are happy to announce the availability of Spark 3.5.1! > > Spark 3.5.1 is a maintenance release containing

unsubscribe

2024-02-28 Thread Sssxxx
unsubscribe Sssxxx sixliu_s...@foxmail.com

Re:[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations! At 2024-02-28 17:43:25, "Jungtaek Lim" wrote: Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly recommend all 3.5 users to upgrade to this stable release. To download Spark

Re: Please unlock Jira ticket for SPARK-24815, Dynamic resource allocation for structured streaming

2024-02-26 Thread Pavan Kotikalapudi
Thanks Yuming. On Mon, Feb 26, 2024 at 9:55 PM Yuming Wang wrote: > Unlocked. > > On Tue, Feb 27, 2024 at 11:47 AM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> >> Hi, >> >> Can a committer please unlock this SPIP? It is for Dynamic resource >> allocation for structured streaming

Re: Please unlock Jira ticket for SPARK-24815, Dynamic resource allocation for structured streaming

2024-02-26 Thread Yuming Wang
Unlocked. On Tue, Feb 27, 2024 at 11:47 AM Mich Talebzadeh wrote: > > Hi, > > Can a committer please unlock this SPIP? It is for Dynamic resource > allocation for structured streaming that has got 6 votes. it was locked > because of inactivity by GitHub actions > > [SPARK-24815] Structured

Please unlock Jira ticket for SPARK-24815, Dynamic resource allocation for structured streaming

2024-02-26 Thread Mich Talebzadeh
Hi, Can a committer please unlock this SPIP? It is for Dynamic resource allocation for structured streaming that has got 6 votes. it was locked because of inactivity by GitHub actions [SPARK-24815] Structured Streaming should support dynamic allocation - ASF JIRA (apache.org)

unsubscribe

2024-02-24 Thread Ameet Kini
unsubscribe

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-23 Thread Jungtaek Lim
Thanks for figuring this out. That is my bad. My understanding is that 3.5.1 RC2 doc should be correctly generated in VOTE but it happened during the finalization step. I lost the build artifact for docs (I followed steps and removed docs from dev dist before realizing I shouldn't remove them)

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-23 Thread Dongjoon Hyun
Hi, All. Unfortunately, the Apache Spark `3.5.1 RC2` document artifact seems to be generated from unknown source code instead of the correct source code of the tag, `3.5.1`. https://spark.apache.org/docs/3.5.1/ [image: Screenshot 2024-02-23 at 14.13.07.png] Dongjoon. On Wed, Feb 21, 2024 at

Proposal about moving on from the Shepherd terminology in SPIPs

2024-02-23 Thread Mich Talebzadeh
We had a discussion about getting a Shepherd to assist with the Structured Streaming SPIP a few hours ago. As an active member I am proposing a move to replace the current terminology "SPIP Shepherd" with the more respectful and inclusive term "SPIP Mentor." We have over the past few years tried

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Mich Talebzadeh
Hi Pavan and those who kindly voted for this SPIP Great to have 6+ votes and no -1 and 0. The so-called mass volume is there. The rest is admin matter and how to drive the project forward and yes there is more than one way of skinning the cat. I think we need some flexibility in the rules given

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Pavan Kotikalapudi
Thanks for the pointers Mich, will wait for Jungtaek Lee or any other PMC members to respond. aggregating upvotes to this email thread +6 Mich Talebzadeh Adam Hobbs Pavan Kotikalapudi Krystal Mitchell Sona Torosyan Aaron Kern Thank you, Pavan On Thu, Feb 22, 2024 at 3:07 PM Mich Talebzadeh

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Mich Talebzadeh
+1 for me Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the

Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Aaron Kern
+1

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
Hi, please check this doc Spark Project Improvement Proposals (SPIP) | Apache Spark and specifically the below extract Discussing an SPIP All discussion of an SPIP should take place in a public forum, preferably the discussion attached to

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Pavan Kotikalapudi
Hi Mich, We have five +1s till now. Mich Talebzadeh Adam Hobbs Pavan Kotikalapudi Krystal Mitchell Sona Torosyan (few more in github pr) +0: None -1: None Does it pass the required condition as approved? Not sure of that though, nothing about minimum required is mentioned in the past

Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Sona Torosyan
+1

Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
Thank you, Holden! Yes, having everything live in the ConfigEntry is attractive. The main reason I proposed an alternative where the groups are defined in YAML is that if the config groups are defined in ConfigEntry, then altering the groupings – which is relevant only to the display of config

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
Hi Pavan, Do you have a list of votes for this feature by any chance? Does it pass the required condition as approved? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Pavan Kotikalapudi
Yes. The PR was closed due to inactivity by github actions.. The msg also says > If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! On Thu, Feb 22, 2024 at 1:09 AM Mich Talebzadeh

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
I can see it was closed. Was it because of inactivity? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-21 Thread Pavan Kotikalapudi
Hi Spark PMC members, I think we have a few upvotes for this effort here and more people are showing interest (see PR comments). Is anyone interested in mentoring and reviewing this effort? Also can the repository admin/owner

Re: Generating config docs automatically

2024-02-21 Thread Holden Karau
I think this is a good idea. I like having everything in one source of truth rather than two (so option 1 sounds like a good idea); but that’s just my opinion. I'd be happy to help with reviews though. On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas wrote: > I know config documentation is not

Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is anything I can do to make this as easy as possible for a committer to shepherd, I’m all ears! > On Feb 14, 2024, at 8:53 PM, Nicholas Chammas > wrote: > > I’m interested in automating our config documentation and need

[VOTE][RESULT] Release Apache Spark 3.5.1 (RC2)

2024-02-21 Thread Jungtaek Lim
The vote passes with 6 +1s (4 binding +1s). Thanks to all who helped with the release! (* = binding) +1: Jungtaek Lim Wenchen Fan (*) Cheng Pan Xiao Li (*) Hyukjin Kwon (*) Maxim Gekk (*) +0: None -1: None

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-21 Thread Jungtaek Lim
Thanks everyone for participating in the vote! The vote passed. I'll send out the vote result and proceed to the next steps. On Wed, Feb 21, 2024 at 4:36 PM Maxim Gekk wrote: > +1 > > On Wed, Feb 21, 2024 at 9:50 AM Hyukjin Kwon wrote: > >> +1 >> >> On Tue, 20 Feb 2024 at 22:00, Cheng Pan wrote:

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Maxim Gekk
+1 On Wed, Feb 21, 2024 at 9:50 AM Hyukjin Kwon wrote: > +1 > > On Tue, 20 Feb 2024 at 22:00, Cheng Pan wrote: > >> +1 (non-binding) >> >> - Build successfully from source code. >> - Pass integration tests with Spark ClickHouse Connector[1] >> >> [1]

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Hyukjin Kwon
+1 On Tue, 20 Feb 2024 at 22:00, Cheng Pan wrote: > +1 (non-binding) > > - Build successfully from source code. > - Pass integration tests with Spark ClickHouse Connector[1] > > [1] https://github.com/housepower/spark-clickhouse-connector/pull/299 > > Thanks, > Cheng Pan > > > > On Feb 20,

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Xiao Li
+1 Xiao On Tue, Feb 20, 2024 at 04:59 Cheng Pan wrote: > +1 (non-binding) > > - Build successfully from source code. > - Pass integration tests with Spark ClickHouse Connector[1] > > [1] https://github.com/housepower/spark-clickhouse-connector/pull/299 > > Thanks, > Cheng Pan > > > > On Feb 20, 2024, at

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-20 Thread Krystal Mitchell
+1 On 2024/01/17 17:49:32 Pavan Kotikalapudi wrote: > Thanks for proposing and voting for the feature Mich. > > adding some references to the thread. > >- Jira ticket - SPARK-24815 > >- Design Doc > >

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Cheng Pan
+1 (non-binding) - Build successfully from source code. - Pass integration tests with Spark ClickHouse Connector[1] [1] https://github.com/housepower/spark-clickhouse-connector/pull/299 Thanks, Cheng Pan > On Feb 20, 2024, at 10:56, Jungtaek Lim wrote: > > Thanks Sean, let's continue the

Community Over Code Asia 2024 Travel Assistance Applications now open!

2024-02-20 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code Asia 2024 are now open! We will be supporting Community over Code Asia, Hangzhou, China July 26th - 28th, 2024. TAC exists

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Wenchen Fan
+1, thanks for making the release! On Sat, Feb 17, 2024 at 3:54 AM Sean Owen wrote: > Yeah let's get that fix in, but it seems to be a minor test only issue so > should not block release. > > On Fri, Feb 16, 2024, 9:30 AM yangjie01 wrote: > >> Very sorry. When I was fixing `SPARK-45242 ( >>

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Jungtaek Lim
Thanks Sean, let's continue the process for this RC. +1 (non-binding) - downloaded all files from URL - checked signature - extracted all archives - ran all tests from source files in source archive file, via running "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9. Also bump to

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich, > Also have you got some benchmark results from your tests that you can possibly share? We only have some partial benchmark results internally so far. Once shuffle and better memory management have been introduced, we plan to publish the benchmark results (at least TPC-H) in the repo.

Re: ASF board report draft for February

2024-02-18 Thread Mich Talebzadeh
Np, thanks for addressing the point promptly Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The

Re: ASF board report draft for February

2024-02-18 Thread Matei Zaharia
Thanks for the clarification. I updated it to say Comet is in the process of being open sourced. > On Feb 18, 2024, at 1:55 AM, Mich Talebzadeh > wrote: > > Hi Matei, > > With regard to your last point > > "- Project Comet, a plugin designed to accelerate Spark query execution by >

Re: ASF board report draft for February

2024-02-18 Thread Mich Talebzadeh
Hi Matei, With regard to your last point "- Project Comet, a plugin designed to accelerate Spark query execution by leveraging DataFusion and Arrow, has been open-sourced under the Apache Arrow project. For more information, visit https://github.com/apache/arrow-datafusion-comet." If my

Re: ASF board report draft for February

2024-02-18 Thread Dongjoon Hyun
+1, it looks good to me. Thank you, Matei. Dongjoon On Sat, Feb 17, 2024 at 11:21 AM Matei Zaharia wrote: > Hi all, > > I missed some reminder emails about our board report this month, but here > is my draft. I’ll submit it tomorrow if that’s ok. > > == > > Issues for the board: >

ASF board report draft for February

2024-02-17 Thread Matei Zaharia
Hi all, I missed some reminder emails about our board report this month, but here is my draft. I’ll submit it tomorrow if that’s ok. == Issues for the board: - None Project status: - We made two patch releases: Spark 3.3.4 (EOL release) on December 16, 2023, and Spark 3.4.2 on

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
Yeah let's get that fix in, but it seems to be a minor test only issue so should not block release. On Fri, Feb 16, 2024, 9:30 AM yangjie01 wrote: > Very sorry. When I was fixing `SPARK-45242 ( > https://github.com/apache/spark/pull/43594)` > , I

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread yangjie01
Very sorry. When I was fixing `SPARK-45242 (https://github.com/apache/spark/pull/43594)`, I noticed that its `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I didn't realize that it had also been merged into branch-3.5, so I didn't advocate for SPARK-45357 to be

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Jungtaek Lim
I traced back relevant changes and got a sense of what happened. Yangjie figured out the issue via link . It's a tricky issue according to the comments from Yangjie - the test is dependent on ordering of execution for test suites.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen
Is anyone seeing this Spark Connect test failure? then again, I have some weird issue with this env that always fails 1 or 2 tests that nobody else can replicate. - Test observe *** FAILED *** == FAIL: Plans do not match === !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-15 Thread Jungtaek Lim
UPDATE: The vote thread is up now. https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j On Tue, Feb 6, 2024 at 11:30 AM Jungtaek Lim wrote: > Thanks all for the positive feedback! Will figure out time to go through > the RC process. Stay tuned! > > On Mon, Feb 5, 2024 at 7:46 AM

Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-15 Thread Jungtaek Lim
UPDATE: Now the vote thread is up for RC2. https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j On Wed, Feb 14, 2024 at 2:59 AM Dongjoon Hyun wrote: > Thank you for the update, Jungtaek. > > Dongjoon. > > On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim > wrote: > >> Hi, >> >> Just a

[VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Jungtaek Lim
DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured out doc generation issue after tagging RC1. Please vote on releasing the following candidate as Apache Spark version 3.5.1. The vote is open until February 18th 9AM (PST) and passes if a majority +1 PMC votes are cast,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in the form expected although I am aware of the shell script. Also have you got some benchmark results from your tests that you can possibly share? Thanks, Mich Talebzadeh, Dad | Technologist | Solutions Architect |

Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a committer who is interested in shepherding this work. We have around 60 tables of configs across our documentation. Here’s a typical example.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen, We will add a "Getting Started" section in the README soon, but basically comet-spark-shell in the repo should provide a basic tool to build Comet and launch a Spark shell with it. Note that we haven't

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Liu(Laswift) Cao
This is very cool! Congrats on the amazing work Chao and the team! It's exciting to see this native engine trend within the community. Other than gluten, I ran into https://github.com/blaze-init/blaze as well (but haven't evaluated it in detail) On Wed, Feb 14, 2024 at 09:20 Chao Sun wrote: > >

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and > Gluten? Overall they are similar, although Gluten supports multiple backends including Velox and ClickHouse. One major difference is (obviously) Comet is based on DataFusion and Arrow, and written in Rust, while

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work! On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote: > Absolutely thrilled to see the project going open-source! Huge congrats to > Chao and the entire team on this milestone! > > Yufei > > > On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > >> Hi all, >> >> We are

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to Chao and the entire team on this milestone! Yufei On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Sure thanks for clarification. I gather what you are alluding to is -- in a distributed environment, when one does operations that involve shuffling or repartitioning of data, the order in which this data is processed across partitions is not guaranteed. So when repartitioning a dataframe, the

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Jack Goodson
Apologies if it wasn't clear, I was meaning the difficulty of debugging, not floating point precision :) On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh wrote: > Hi Jack, > > " most SQL engines suffer from the same issue... "" > > Sure. This behavior is not a bug, but rather a consequence

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all, We are very happy to announce that Project Comet, a plugin to accelerate Spark query execution via leveraging DataFusion and Arrow, has now been open sourced under the Apache Arrow umbrella. Please check the project repo https://github.com/apache/arrow-datafusion-comet for more details if

Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Dongjoon Hyun
Thank you for the update, Jungtaek. Dongjoon. On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim wrote: > Hi, > > Just a head-up since I didn't give an update for a week after the last > update from the discussion thread. > > I've been following the automated release process and encountered several

Re: Extracting Input and Output Partitions in Spark

2024-02-13 Thread Daniel Saha
This would be helpful for a few use cases. For context my team works in security space, and customers access data through a wrapper around spark sql connected to hive metastore. 1. When snapshot (non-partitioned) tables are queried, it’s not clear when the underlying snapshot was last updated.

Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Jungtaek Lim
Hi, Just a head-up since I didn't give an update for a week after the last update from the discussion thread. I've been following the automated release process and encountered several issues. Maybe I will file JIRA tickets and follow PRs. Issues I figured out so far are 1) python library

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Hi Jack, "most SQL engines suffer from the same issue..." Sure. This behavior is not a bug, but rather a consequence of the limitations of floating-point precision. The numbers involved in the example (see SPIP [SPARK-47024] Sum of floats/doubles may be incorrect depending on
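
As a rough illustration of the point made in this thread (a sketch, not taken from SPARK-47024 itself): floating-point addition is not associative, so a sum built from per-partition partial sums can change when the same rows land in different partitions. The pure-Python example below simulates this without Spark. The values, the two partition layouts, and the helper names naive_sum and partitioned_sum are all made up for illustration.

```python
import math


def naive_sum(xs):
    """Left-to-right floating-point sum using plain '+' (no compensation)."""
    total = 0.0
    for x in xs:
        total += x
    return total


def partitioned_sum(partitions):
    """Sum each partition separately, then sum the partial results."""
    return naive_sum(naive_sum(p) for p in partitions)


# Hypothetical data: the same four values, split across two partitions
# in two different ways.
values = [1e16, 1.0, 1.0, -1e16]
layout_a = partitioned_sum([[1e16, 1.0], [1.0, -1e16]])
layout_b = partitioned_sum([[1e16, -1e16], [1.0, 1.0]])

print(layout_a)           # 0.0: each 1.0 is absorbed by the large value next to it
print(layout_b)           # 2.0: the large values cancel first, so the 1.0s survive
print(math.fsum(values))  # 2.0: correctly rounded reference sum
```

Here the spacing between adjacent doubles near 1e16 is 2, so adding 1.0 to 1e16 rounds back to 1e16. Whether the small values are lost depends only on which partition they share with the large ones, which is the kind of partitioning-dependent result the thread describes.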

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Jack Goodson
I may be ignorant of other debugging methods in Spark but the best success I've had is using smaller datasets (if runs take a long time) and adding intermediate output steps. This is quite different from application development in non-distributed systems where a debugger is trivial to attach but I

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
OK, I figured it out. The details are in SPARK-47024 for anyone who’s interested. It turned out to be a floating point arithmetic “bug”. The main reason I was able to figure it out was because I’ve been investigating another, unrelated bug (a

Re: Extracting Input and Output Partitions in Spark

2024-02-12 Thread Aditya Sohoni
Sharing an example since a few people asked me off-list: We have stored the partition details in the read/write nodes of the physical plan. So this can be accessed via the plan like plan.getInputPartitions or plan.getOutputPartitions, which internally loops through the nodes in the plan and

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Herman van Hovell
There is no really easy way of getting the state of the aggregation buffer, unless you are willing to modify the code generation and sprinkle in some logging. What I would start with is dumping the generated code by calling explain('codegen') on the DataFrame. That helped me to find similar

How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example: >>> from pyspark.sql.functions import sum >>> spark.range(4).repartition(2).select(sum("id")).show() +---+ |sum(id)| +---+ | 6| +---+ I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate. I see that

Re: Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-10 Thread Varun Shah
Hi Mich, Thanks for the suggestions. I checked the documentation regarding the data type issue and found that the different timezone settings used in Spark and Snowflake were the cause. Specifying the timezone in the Spark options while writing the data to Snowflake worked. Documentation

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Enhanced Console Sink for Structured Streaming

2024-02-09 Thread Neil Ramaswamy
Thanks for the comments, Anish and Jerry. To summarize so far, we are in agreement that: 1. Enhanced console sink is a good tool for new users to understand Structured Streaming semantics 2. It should be opt-in via an option (unlike my original proposal) 3. Out of the 2 modes of verbosity I

Re: Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-09 Thread Mich Talebzadeh
Hi Varun, I am no expert on Snowflake, however, the issue you are facing, particularly if it involves data trimming in a COPY statement and potential data mismatch, is likely related to how Snowflake handles data ingestion rather than being directly tied to PySpark. The COPY command in Snowflake

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
Appreciate your thoughts on this. Personally, I think Spark Structured Streaming can be used effectively in an Event Driven Architecture (as well as continuous streaming). From the link here

Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-09 Thread Varun Shah
Hi Team, We have currently implemented a PySpark Spark Streaming application on Databricks, where we read data from S3 and write to a Snowflake table using the Snowflake connector jars (net.snowflake:snowflake-jdbc v3.14.5 and net.snowflake:spark-snowflake v2.12:2.14.0-spark_3.3). Currently facing

Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Jerry Peng
I am generally a +1 on this as we can use this information in our docs to demonstrate certain concepts to potential users. I am in agreement with other reviewers that we should keep the existing default behavior of the console sink. This new style of output should be enabled behind a flag. As

Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Anish Shrigondekar
Hi Neil, Thanks for putting this together. +1 to the proposal of enhancing the console sink further. I think it will help new users understand some of the streaming/micro-batch semantics a bit better in Spark. Agree with not having verbose mode enabled by default. I think step 1 described above

Re: Shuffle write and read phase optimizations for parquet+zstd write

2024-02-08 Thread Mich Talebzadeh
Hi, ... Most of our jobs end up with a shuffle stage based on a partition column value before writing into a parquet, and most of the time we have data skewness in partitions Have you considered the causes of these recurring issues and some potential alternative strategies? 1. -

Shuffle write and read phase optimizations for parquet+zstd write

2024-02-07 Thread satyajit vegesna
Hi Community, Can someone please help validate the idea below and suggest pros/cons? Most of our jobs end up with a shuffle stage based on a partition column value before writing into parquet, and most of the time we have data skewness in partitions. Currently most of the problems happen at

Re: Enhanced Console Sink for Structured Streaming

2024-02-06 Thread Neil Ramaswamy
Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode being off by default. I think it's reasonable to have 1 or 2 levels of verbosity: 1. The first verbose mode could target new users, and take a highly opinionated view on what's important to understand streaming

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Mich Talebzadeh
I don't think adding this to the streaming flow (at micro level) will be that useful However, this can be added to Spark UI as an enhancement to the Streaming Query Statistics page. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Raghu Angadi
Agree, the default behavior does not need to change. Neil, how about separating it into two sections: - Actual rows in the sink (same as current output) - Followed by metadata data

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Jungtaek Lim
Maybe we could keep the default as it is, and explicitly turn on verboseMode to enable auxiliary information. I'm not a believer that anyone will parse the output of console sink (which means this could be a breaking change), but changing the default behavior should be taken conservatively. We can

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-05 Thread Jungtaek Lim
Thanks all for the positive feedback! Will figure out time to go through the RC process. Stay tuned! On Mon, Feb 5, 2024 at 7:46 AM Gengliang Wang wrote: > +1 > > On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala wrote: > >> +1 >> >> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: >> >>> +1 >>>

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-05 Thread kalyan
Hey, Insufficient disk space is also a reliability concern, but it might need a different strategy to handle it. As suggested by Mridul, I am working on making things more configurable in another (new) module… with that, we can plug in new rules for each type of error. Regards Kalyan. On Mon, 5 Feb 2024 at

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-04 Thread Jay Han
Hi, what about also supporting the disk space problem of "device space isn't enough"? I think it's the same as the OOM exception. On Sat, Jan 27, 2024 at 13:00 kalyan wrote: > Hi all, > > Sorry for the delay in getting the first draft of (my first) SPIP out. > >

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Gengliang Wang
+1 On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala wrote: > +1 > > On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: > >> +1 >> >> John Zhuge >> >> >> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale >> wrote: >> >>> +1 >>> >>> On Sun, Feb 4, 2024, 8:18 PM Xiao Li >>> wrote: >>> +1

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Hussein Awala
+1 On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: > +1 > > John Zhuge > > > On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale > wrote: > >> +1 >> >> On Sun, Feb 4, 2024, 8:18 PM Xiao Li >> wrote: >> >>> +1 >>> >>> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: >>> +1

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread John Zhuge
+1 John Zhuge On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale wrote: > +1 > > On Sun, Feb 4, 2024, 8:18 PM Xiao Li > wrote: > >> +1 >> >> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: >> >>> +1 >>> >>> >>> >>> At 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: >>> >>> +1 >>> >>> On Sat, Feb 3,

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Santosh Pingale
+1 On Sun, Feb 4, 2024, 8:18 PM Xiao Li wrote: > +1 > > On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: > >> +1 >> >> >> >> At 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: >> >> +1 >> >> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 >> wrote: >> >>> +1 >>> >>> On 2024/2/4 13:13, “Kent

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Xiao Li
+1 On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: > +1 > > > > At 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: > > +1 > > On Sat, Feb 3, 2024 at 9:18 PM yangjie01 > wrote: > >> +1 >> >> On 2024/2/4 13:13, “Kent Yao”mailto:y...@apache.org>> wrote: >> >> >> +1 >> >> >> Jungtaek Lim >

Re:Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread beliefer
+1 At 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: +1 On Sat, Feb 3, 2024 at 9:18 PM yangjie01 wrote: +1 On 2024/2/4 13:13, “Kent Yao”mailto:y...@apache.org>> wrote: +1 Jungtaek Lim mailto:kabhwan.opensou...@gmail.com>> wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Dongjoon Hyun
+1 On Sat, Feb 3, 2024 at 9:18 PM yangjie01 wrote: > +1 > > On 2024/2/4 13:13, “Kent Yao”mailto:y...@apache.org>> wrote: > > > +1 > > > Jungtaek Lim kabhwan.opensou...@gmail.com>> wrote on Sat, Feb 3, 2024 at 21:14: > > > > Hi dev, > > > > looks like there are a huge number of commits being pushed to branch-3.5

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread yangjie01
+1 On 2024/2/4 13:13, “Kent Yao”mailto:y...@apache.org>> wrote: +1 Jungtaek Lim mailto:kabhwan.opensou...@gmail.com>> wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge number of commits being pushed to branch-3.5 > after 3.5.0 was released, 200+ commits. > > $ git log --oneline

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Kent Yao
+1 Jungtaek Lim wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge number of commits being pushed to branch-3.5 > after 3.5.0 was released, 200+ commits. > > $ git log --oneline v3.5.0..HEAD | wc -l > 202 > > Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and
