Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-05 Thread Allison Wang
Hi Mich, Thank you for your comments! I've left some comments on the SPIP, but let's continue the discussion here. You've highlighted the potential advantages of Python stored procedures, and I'd like to emphasize two important aspects: 1. *Versatility*: Integrating Python into SQL provides

Re: Feature to restart Spark job from previous failure point

2023-09-05 Thread Mich Talebzadeh
Hi Dipayan, You ought to maintain data source consistency, minimising changes upstream. Spark is not a Swiss Army knife :) Anyhow, we already do this in Spark Structured Streaming with the concept of checkpointing. You can do so by implementing - Checkpointing - Stateful processing in

Feature to restart Spark job from previous failure point

2023-09-04 Thread Dipayan Dev
Hi Team, One of the biggest pain points we're facing is when Spark reads upstream partition data and during Action, the upstream also gets refreshed and the application fails with 'File not exists' error. It could happen that the job has already spent a reasonable amount of time, and re-running

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-09-03 Thread Mich Talebzadeh
On this subject of launching both the driver and the executors using lazy executor IDs, this can introduce complexity but could potentially be a viable strategy in certain scenarios. Basically, your mileage varies. Pros: 1. Faster Startup: launching the driver and initial executors

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Sure, no problem. On Sat, Sep 2, 2023 at 22:10, Holden Karau wrote: > Can we delay the next RC cut until after Labor Day? > > On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li wrote: > >> Thank you for all the reports! >> The vote has failed. I plan to cut RC4 in two days. >> >> @Dipayan Dev I quickly skimmed

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Holden Karau
Can we delay the next RC cut until after Labor Day? On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li wrote: > Thank you for all the reports! > The vote has failed. I plan to cut RC4 in two days. > > @Dipayan Dev I quickly skimmed through the > corresponding ticket, and it doesn't seem to be a

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Thank you for all the reports! The vote has failed. I plan to cut RC4 in two days. @Dipayan Dev I quickly skimmed through the corresponding ticket, and it doesn't seem to be a regression introduced in 3.5. Additionally, someone is asking if this is the same issue as SPARK-35279. @Yuming Wang I

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-02 Thread Mich Talebzadeh
I have noticed a worthy discussion in the SPIP comments regarding the definition of "stored procedure" in the context of Spark, and I believe it is an important point to address. To provide some historical context, Sybase, a

Re: [DISCUSS] Incremental statistics collection

2023-09-01 Thread Rakesh Raushan
Thanks all for your insights. @Mich I am not trying to introduce any sampling model here. This idea is about collecting the task write metrics while writing the data and aggregating them with the existing values present in the catalog (creating a new entry if it's a CTAS command). This approach is
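
A minimal sketch of the aggregation idea described in this message, written in plain Python rather than against Spark internals. The `TableStats` class, the `merge_write_metrics` function and the dict standing in for the catalog are hypothetical names used only for illustration; nothing here is taken from the proposal's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical catalog entry: row count and size in bytes."""
    row_count: int
    size_in_bytes: int


def merge_write_metrics(
    catalog: Dict[str, TableStats],
    table: str,
    rows_written: int,
    bytes_written: int,
) -> TableStats:
    """Fold the metrics of one write into the catalog entry for `table`.

    If the table has no entry yet (for example after a CTAS), a new one is created.
    """
    existing: Optional[TableStats] = catalog.get(table)
    if existing is None:
        updated = TableStats(rows_written, bytes_written)
    else:
        updated = TableStats(
            existing.row_count + rows_written,
            existing.size_in_bytes + bytes_written,
        )
    catalog[table] = updated
    return updated


# The dict is an assumed stand-in for the real catalog.
catalog: Dict[str, TableStats] = {}
merge_write_metrics(catalog, "sales", rows_written=100, bytes_written=4096)  # first write creates the entry
merge_write_metrics(catalog, "sales", rows_written=50, bytes_written=2048)   # later write is added on top
print(catalog["sales"])
```

The sketch only covers appends; overwrites, deletes and deduplicating sinks would need different handling, as later messages in the thread point out.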

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-01 Thread Jungtaek Lim
My apologies, I have to add another ticket for a blocker, SPARK-45045. That said, I'm -1 (non-binding). SPARK-43183 made a behavioral change regarding the StreamingQueryListener as well as

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Wenchen Fan
Sorry for the last-minute bug report, but we found a regression in 3.5: the SQL INSERT command without a column list fills missing columns with NULL while Spark 3.4 does not allow it. According to the SQL standard, this shouldn't be allowed and thus a regression in 3.5. The fix has been merged

Re: [DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-31 Thread Matei Zaharia
It would be great to do this IMO, because there are often usability and formatting fixes needed to docs over time, and people naturally search for docs from their *deployed* version of the project — not the latest version, hoping that it also applies to their release. For example, right now

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
I concur with the viewpoint raised by @Sean Owen. While this might introduce some challenges related to compatibility and environment issues, it is not fundamentally different from how the users currently import and use common code in Python. The main difference is that now this shared code would

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Ian Manning
+1 (non-binding) Using Spark Core, Spark SQL, Structured Streaming. On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC3) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Aug 31st and passes if a > majority +1

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen
I think you're talking past Hyukjin here. I think the response is: none of that is managed by PySpark now, and this proposal does not change that. Your current interpreter and environment are used to execute the stored procedure, which is just Python code. It's on you to bring an environment that

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
These are my initial thoughts: As usual your mileage varies. Depending on the use case, introducing support for stored procedures (SP) in Spark SQL with Python as the procedural language *Pros* - Can potentially provide more flexibility and capabilities in the respective SQL workflows. We

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
Thanks Allison!

[DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-30 Thread Hyukjin Kwon
Hi all, I would like to raise a discussion about updating documentation hosted for EOL and maintenance versions. To provide some context, we currently host the documentation for EOL versions of Apache Spark, which can be found at links like

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Alexander Shorin
> Which Python version will run that stored procedure? > > All Python versions supported in PySpark > Where does the stored procedure define the exact Python version that will run the code? That was the question. > How to manage external dependencies? > > Existing way we have >

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
Which Python version will run that stored procedure? All Python versions supported in PySpark. How to manage external dependencies? The existing way we have: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html. In fact, this will use the external dependencies within your

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Alexander Shorin
-1 Great idea to ignore the experience of others and copy bad practices back for nothing. If you are familiar with the Python ecosystem then you should answer the questions: 1. Which Python version will run that stored procedure? 2. How to manage external dependencies? 3. How to test it via a common

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Yuming Wang
It seems the signature cannot be checked: yumwang@G9L07H60PK Downloads % gpg --keyserver hkps://keys.openpgp.org --recv-key FC3AE3A7EAA1BAC98770840E7E1ABCC53AAA2216 gpg: key 7E1ABCC53AAA2216: no user ID gpg: Total number processed: 1 yumwang@G9L07H60PK Downloads % gpg --batch --verify

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen
It worked fine after I ran it again; I included "package test" instead of "test" (I had previously run "install"). +1 On Wed, Aug 30, 2023 at 6:06 AM yangjie01 wrote: > Hi, Sean > > > > I have performed testing with Java 17 and Scala 2.13 using maven (`mvn > clean install` and `mvn package

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Hyukjin Kwon
+1 we should have this .. a lot of other projects and DBMSes have this too, and we currently don't have a way to handle them within Apache Spark. Disclaimer: I am the shepherd of this SPIP. On Thu, 31 Aug 2023 at 09:31, Allison Wang wrote: > Hi Mich, > > I've updated the permissions on the

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Mridul Muralidharan
+1 Signatures, digests, etc check out fine. Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes Regards, Mridul On Wed, Aug 30, 2023 at 6:10 AM yangjie01 wrote: > Hi, Sean > > > > I have performed testing with Java 17 and Scala 2.13 using maven (`mvn > clean install` and

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang
Hi Mich, I've updated the permissions on the document. Please feel free to leave comments. Thanks, Allison On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh wrote: > Hi, > > Great. Please allow edit access on SPIP or ability to comment. > > Thanks > > Mich Talebzadeh, > Distinguished

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Mich Talebzadeh
Hi, Great. Please allow edit access on SPIP or ability to comment. Thanks

[DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang
Hi all, I would like to start a discussion on "Python Stored Procedures". This proposal aims to extend Spark SQL by introducing support for stored procedures, starting with Python as the procedural language. This will enable users to run complex logic using Python within their SQL workflows and

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Sorry, I missed this one. In the context of what has been changed, we ought to have an additional column, timestamp. In short, we can have datachange(object_name, partition_name, colname, timestamp), where timestamp is the point in time you want to compare against for changes. Example: SELECT * FROM WHERE

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread yangjie01
Hi, Sean I have performed testing with Java 17 and Scala 2.13 using Maven (`mvn clean install` and `mvn package test`), and have not encountered the issue you mentioned. The test for the connect module depends on the `spark-protobuf` module to complete the `package`; was it successful? Or

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Another idea that came to my mind from the old days is the concept of having a function called *datachange*. This datachange function should measure the amount of change in the data distribution since ANALYZE STATISTICS last ran. Specifically, it should measure the number of inserts, updates and
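
A rough sketch of what such a measure could look like, assuming that "amount of change" is expressed as the number of DML operations since the last statistics run, as a percentage of the row count at that time. Both that formula and the function signature below are assumptions made for illustration; the thread does not define them.

```python
def datachange(inserts: int, updates: int, deletes: int, rows_at_last_analyze: int) -> float:
    """Return DML activity since the last ANALYZE as a percentage of the old row count.

    Hypothetical formula: 100 * (inserts + updates + deletes) / rows_at_last_analyze.
    """
    if min(inserts, updates, deletes) < 0:
        raise ValueError("DML counters must be non-negative")
    if rows_at_last_analyze <= 0:
        # Nothing to compare against: treat any DML as a full change.
        return 100.0 if (inserts + updates + deletes) > 0 else 0.0
    return 100.0 * (inserts + updates + deletes) / rows_at_last_analyze


# 100 DML operations against a table that had 1000 rows when it was last analyzed.
print(datachange(inserts=50, updates=30, deletes=20, rows_at_last_analyze=1000))
```

A caller could compare the returned percentage with a threshold of its choosing to decide whether statistics are stale enough to recompute.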

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Dipayan Dev
Can we fix this bug in Spark 3.5.0? https://issues.apache.org/jira/browse/SPARK-44884 On Wed, Aug 30, 2023 at 11:51 AM Sean Owen wrote: > It looks good except that I'm getting errors running the Spark Connect > tests at the end (Java 17, Scala 2.13) It looks like I missed something > necessary

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen
It looks good except that I'm getting errors running the Spark Connect tests at the end (Java 17, Scala 2.13). It looks like I missed something necessary to build; is anyone getting this? [ERROR] [Error]

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Thanks for the detailed explanation. Regards, Chetan On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh wrote: > OK, let us take a deeper look here > > ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* > > In above, we are *explicitly grouping columns c1 and c2 together for >

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Martin Grund
+1 (non-binding) Tested Spark Connect fully isolated and with PySpark build. Also tested some of the new PySpark ML Connect features. On Tue 29. Aug 2023 at 18:25 Yuanjian Li wrote: > Please vote on releasing the following candidate(RC3) as Apache Spark > version 3.5.0. > > The vote is open

[VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Yuanjian Li
Please vote on releasing the following candidate(RC3) as Apache Spark version 3.5.0. The vote is open until 11:59pm Pacific time Aug 31st and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.5.0 [ ] -1 Do not release this

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
OK, let us take a deeper look here. ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* In the above, we are *explicitly grouping columns c1 and c2 together for which we want to compute statistics*. Additionally, we are also *computing statistics for column c3 independently*. This

Re: Spark Connect: API mismatch in SparkSesession#execute

2023-08-29 Thread Stefan Hagedorn
Thank you, Martin! I got it working now using the same shading rules in my project as in Spark. From: Martin Grund Date: Monday, 28. August 2023 at 17:58 To: Stefan Hagedorn Cc: dev@spark.apache.org Subject: Re: Spark Connect: API mismatch in SparkSesession#execute Hi Stefan, There are some

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Hi, If we are taking this up, then I would ask: can we support multi-column stats such as: ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3 This should help in estimating better for conditions involving c1 and c2. Thanks. On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh wrote: >

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
Short answer off the top of my head: my point was with regard to the Cost Based Optimizer (CBO) in traditional databases. The concept of a rowkey in HBase is somewhat similar to that of a primary key in an RDBMS. Now in databases with automatic deduplication features (i.e. ignore duplication of rowkey),

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Jia Fan
For those databases with automatic deduplication capabilities, such as HBase: suppose we have inserted 100 rows with the same rowkey, but in fact there is only one in HBase. Is the new statistical value we add 100 or 1? Or, if HBase already contains this rowkey, would the value be 0? How should we handle
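
The three candidate answers in this question (100, 1 or 0) can be reproduced with a small plain-Python sketch. The function names and the set standing in for the keys already stored in the sink are hypothetical; this only illustrates why counting written rows and counting stored rows can diverge on a deduplicating store.

```python
from typing import Iterable, Set, Tuple

Row = Tuple[str, int]  # (rowkey, payload), an assumed toy row shape


def naive_row_delta(rows_written: Iterable[Row]) -> int:
    """Count every written row, which is what task write metrics would report."""
    return sum(1 for _ in rows_written)


def dedup_row_delta(existing_keys: Set[str], rows_written: Iterable[Row]) -> int:
    """Count only rowkeys that were not already stored in the sink."""
    new_keys = {key for key, _ in rows_written} - existing_keys
    return len(new_keys)


# 100 writes that all share the same rowkey.
writes = [("rk1", i) for i in range(100)]

print(naive_row_delta(writes))            # counts every write
print(dedup_row_delta(set(), writes))     # sink was empty before the write
print(dedup_row_delta({"rk1"}, writes))   # sink already held this rowkey
```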

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Mich Talebzadeh
I have never been fond of the notion that measuring inserts, updates, and deletes (referred to as DML) is the sole criterion for signaling a necessity to update statistics for Spark's CBO. Nevertheless, in the absence of an alternative mechanism, it seems this is the only approach at our disposal

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-28 Thread Mich Talebzadeh
Thanks Qian for your feedback. I will have a look. Regards, Mich

Re: Spark Connect: API mismatch in SparkSesession#execute

2023-08-28 Thread Martin Grund
Hi Stefan, There are some current limitations around how protobuf is embedded in Spark Connect. One of the challenges there is that for compatibility reasons we currently shade protobuf that then shades the `protobuf.GeneratedMessage` class. The way to work around this is to shade the protobuf

Spark Connect: API mismatch in SparkSesession#execute

2023-08-28 Thread Stefan Hagedorn
Hi everyone, Trying my luck here, after no success in the user mailing list :) I’m trying to use the "extension" feature of the Spark Connect CommandPlugin (Spark 3.4.1) [1]. I created a simple protobuf message `MyMessage` that I want to send from the connect client-side to the connect server

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-27 Thread Qian Sun
Hi Mich, ImageCache is an Alibaba Cloud ECI feature[1]. An image cache is a cluster-level resource that you can use to accelerate the creation of pods in different namespaces. If the Spark image needs to be updated, an ImageCache is created in the cluster, and a pod annotation is specified to use image

Re: Beginner - Looking for starter issues

2023-08-27 Thread Harry
Thanks, I'll check it out. On Thu, Jun 29, 2023 at 2:42 AM Jia Fan wrote: > Hi Harry, > Maybe you can start with https://issues.apache.org/jira/browse/SPARK-37935 > > > > Jia Fan > > > On Jun 28, 2023, at 08:09, Harry wrote: > > Hi, > > I am looking to pick up some tasks on ASF

Re: [DISCUSS] Incremental statistics collection

2023-08-26 Thread Mich Talebzadeh
Hi, Impressive, yet in the realm of classic DBMSs, it could be seen as a case of old wine in a new bottle. The objective, I assume, is to employ dynamic sampling to enhance the optimizer's capacity to create effective execution plans without the burden of complete I/O and in less time. For

Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi, @holden Karau recently created two Jiras that deal with two items of interest namely: 1. Improve Spark Driver Launch Time SPARK-44950 2. Improve Spark Dynamic Allocation SPARK-44951

[DISCUSS] Incremental statistics collection

2023-08-26 Thread RAKSON RAKESH
Hi all, I would like to propose the incremental collection of statistics in Spark. SPARK-44817 has been raised for the same. Currently, Spark invalidates the stats after data-changing commands, which makes the CBO non-functional. To update

Re: Clarification on ExecutorRoll Plugin & Ignore Decommission Fetch Failure

2023-08-26 Thread Arun Ravi
Hi Team, Thank you for clarifying the decommission ignore fetch failure behavior. Previously I was using Executor Rolling and Decommission and Ignore Decommission Fetch Failure as a solution for all the problems. I understand that Executor rolling must be carefully tuned to minimize fetch

Re: Clarification on ExecutorRoll Plugin & Ignore Decommission Fetch Failure

2023-08-25 Thread Dongjoon Hyun
Hi, Arun. Here are some answers to your questions. First, the fetch failure is irrelevant to the Executor Rolling feature because the plugin itself only asked the Spark scheduler to decommission it, not terminate it. More specifically, it's independent from the underlying Decommissioning

Re: Clarification on ExecutorRoll Plugin & Ignore Decommission Fetch Failure

2023-08-25 Thread Mich Talebzadeh
Hi, The crux of the matter here, as I understand it, is "how should I be using Executor Rolling, without triggering stage failures?" The object of executor rolling is to replace decommissioning executors with new ones while minimizing the impact on running tasks and stages in k8s. As mentioned

unsubscribe

2023-08-25 Thread Nizam Shaik
unsubscribe

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-25 Thread Mich Talebzadeh
Hi Qian, How in practice have you implemented image caching for the driver and executor pods respectively? Thanks On Thu, 24 Aug 2023 at 02:44, Qian Sun wrote: > Hi Mich > > I agree with your opinion that the startup time of the Spark on Kubernetes > cluster needs to be improved. > >

Clarification on ExecutorRoll Plugin & Ignore Decommission Fetch Failure

2023-08-25 Thread Arun Ravi
Hi Team, I am running an Apache Spark 3.4.1 application on K8s with the below configuration related to executor rolling and Ignore Decommission Fetch Failure. spark.plugins: "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin" spark.kubernetes.executor.rollInterval: "1800s"

Apache Spark 4.0.0-SNAPSHOT is ready for Java 21

2023-08-25 Thread Dongjoon Hyun
Hi, All. Java 21 will be released in a month and the Apache Spark master branch (4.0.0-SNAPSHOT) achieved the first milestone (SPARK-43831: Build and Run Spark on Java 21) today. 1. JDK 21: https://openjdk.org/projects/jdk/21/ - 2023/08/24 Final Release Candidate - 2023/09/19 General

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-24 Thread Yuanjian Li
-1, do not release this package because the correctness issue https://issues.apache.org/jira/browse/SPARK-44871 / https://github.com/apache/spark/pull/42559 was not addressed in RC2. The vote has failed. I plan to cut RC3 in two days. Best, Yuanjian On Sun, Aug 20, 2023 at 20:24, yangjie01 wrote: > -1, due

Re: Some questions about Spark github action

2023-08-24 Thread Jia Fan
Thanks Xinrong and Jack. I will take a look. I also found that https://github.com/apache/spark/pull/32092 is what I want. Thanks a lot. On Fri, Aug 25, 2023 at 04:30, Xinrong Meng wrote: > Hi Jia, > > Consider reviewing GitHub Action variables like > $GITHUB_REPOSITORY. Detailed information can be found >

Fwd:  Wednesday: Join 6 Members at "Ofir Press | Complementing Scale: Novel Guidance Methods for Improving LMs"

2023-08-24 Thread Mich Talebzadeh
They recently combined Apache Spark and AI meeting in London. An online session worth attending for some? HTH Mich

Re: Some questions about Spark github action

2023-08-24 Thread Xinrong Meng
Hi Jia, Consider reviewing GitHub Action variables like $GITHUB_REPOSITORY. Detailed information can be found at https://docs.github.com/en/actions/learn-github-actions/variables. Additionally, you might find the code segment

Re: Some questions about Spark github action

2023-08-24 Thread Jack Wells
Hi Jia, GitHub Actions workflows are stored in the .github/workflows directory off the base of the git repo. Here’s a link: https://github.com/apache/spark/tree/master/.github/workflows. Does this help? Jack On Aug 24, 2023 at 04:54:31, Jia Fan wrote: > Hi, folks > I'm a PMC member of

Re: What else could be removed in Spark 4?

2023-08-24 Thread Steve Loughran
I would recommend cutting them. + historically they've fixed the version of aws-sdk jar used in spark releases, meaning s3a connector through spark rarely used the same sdk release as that qualified through the hadoop sdk update process, so if there were incompatibilities, it'd be up to the spark

Some questions about Spark github action

2023-08-24 Thread Jia Fan
Hi, folks I'm a PMC member of Apache SeaTunnel. Recently, I’ve been optimizing the GitHub Actions process on SeaTunnel. The main purpose is to be like Spark: when a developer submits a PR, it can automatically run GitHub Actions on the fork repository instead of the main repository. In this way, all

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-23 Thread Holden Karau
One option could be to initially launch both drivers and initial executors (using the lazy executor ID allocation), but it would introduce a lot of complexity. On Wed, Aug 23, 2023 at 6:44 PM Qian Sun wrote: > Hi Mich > > I agree with your opinion that the startup time of the Spark on

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-23 Thread Qian Sun
Hi Mich I agree with your opinion that the startup time of the Spark on Kubernetes cluster needs to be improved. Regarding the fetching image directly, I have utilized ImageCache to store the images on the node, eliminating the time required to pull images from a remote repository, which does

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-23 Thread Pavan Kotikalapudi
Thanks for the review, Mich. I have updated Q4 to be as concise as possible and left the detailed explanation to the Appendix. Here is the updated answer to Q4. Thank you,

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-23 Thread Mich Talebzadeh
Hi all, In this conversation, one of the issues I brought up was the driver start-up time. This is especially true in k8s. As Spark on k8s is modeled on the Spark standalone scheduler, Spark on k8s consists of a single driver pod (like the “master” in standalone) and a number of executors (“workers”). When

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-23 Thread Mich Talebzadeh
Hi Pavan, I started reading your SPIP but have difficulty understanding it in detail. Specifically under Q4, " What is new in your approach and why do you think it will be successful?", I believe it would be better to remove the plots and focus on "what this proposed solution is going to add to

Re: Volcano in spark distro

2023-08-23 Thread Santosh Pingale
> In any way, I'd like to say that the root cause of the difference is those scheduler designs instead of Apache Spark itself. For example, Apache YuniKorn doesn't force us to add a new dependency at all while Volcano did. This makes sense! > In these day, I prefer and invest more Apache

Re: Volcano in spark distro

2023-08-22 Thread Dongjoon Hyun
Of course, we can make Apache Spark distribution bigger and bigger, but I'm a little neutral about Volcano. In any way, I'd like to say that the root cause of the difference is those scheduler designs instead of Apache Spark itself. For example, Apache YuniKorn doesn't force us to add a new

Fwd: Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2023-08-22 Thread Mich Talebzadeh
I found some of the notes on Volcano and my tests back in Feb 2022. I did my volcano tests on Spark 3.1.1. The results were not very great then. Hence I asked in thread from @santosh, if any updated comparisons are available. I will try the test with Spark 3.4.1 at some point. Maybe some users

Re: Volcano in spark distro

2023-08-22 Thread Yikun Jiang
@Santosh We tried to add this in v3.3.0. [1] The main reason for not adding it at that time was: 1. Volcano multi-arch not supported before v1.7.0. (already upgraded to 1.7.0 since Spark 3.4.0) 2. Spark on K8s + Volcano is experimental. (We have removed the experimental [2]) Consider spark

Re: Volcano in spark distro

2023-08-22 Thread Mich Talebzadeh
Hi Santosh, We had a Google team discussion about k8s back in February and it was mentioned then. My personal experience with Volcano was not that impressive. Do you have some stats to prove that it is worth adding? Anyone else is welcome to comment. HTH Mich Talebzadeh,

Volcano in spark distro

2023-08-22 Thread Santosh Pingale
Hey all, It would be useful to support Volcano in the Spark distro itself, just like YuniKorn. So I am wondering what the reason is behind the decision not to package it already.

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3! Spark 3.3.3 is a maintenance release containing stability fixes. This release is based on the branch-3.3 maintenance branch of Spark. We strongly recommend all 3.3 users to upgrade to this stable release. To download Spark 3.3.3,

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-20 Thread yangjie01
-1, due to SPARK-43646 and SPARK-44784 not yet being fixed. Jie Yang From: Sean Owen Date: Sunday, August 20, 2023 04:43 To: Yuanjian Li Cc: Spark dev list Subject: Re: [VOTE] Release Apache Spark 3.5.0

Probable Bug in Spark 3.3.0

2023-08-20 Thread Dipayan Dev
Hi Dev Team, https://issues.apache.org/jira/browse/SPARK-44884 We have recently upgraded to Spark 3.3.0 in our Production Dataproc. We have a lot of downstream applications that rely on the SUCCESS file. Please let me know if this is a bug or if I need any additional configuration to fix this

Unsubscribe

2023-08-20 Thread Dennis Suhari
Unsubscribe Sent from my iPhone - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-20 Thread Pavan Kotikalapudi
IMO ML might be good for cluster scheduler but for the core DRA algorithm of SSS I believe we should start with some primitives of Structured streaming. I would love to get some reviews on the doc and opinions on the feasibility of the solution. We have seen quite some savings using this solution

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Sean Owen
+1 this looks better to me. Works with Scala 2.13 / Java 17 for me. On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC2) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Aug 23rd and passes if a > majority +1

[VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Yuanjian Li
Please vote on releasing the following candidate(RC2) as Apache Spark version 3.5.0. The vote is open until 11:59pm Pacific time Aug 23rd and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.5.0 [ ] -1 Do not release this

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-19 Thread Yuanjian Li
Thank you for all the reports! I just cut RC2 a few hours before Peter's report. I will continue to monitor the details of the correctness issue and the voting status for RC2. On Fri, Aug 18, 2023 at 07:57, Peter Toth wrote: > Hi Yuanjian, > > This is a correctness issue that we should probably fix in 3.5: >

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-18 Thread Peter Toth
Hi Yuanjian, This is a correctness issue that we should probably fix in 3.5: https://issues.apache.org/jira/browse/SPARK-44871 / https://github.com/apache/spark/pull/42559 Cheers, Peter On Sat, Aug 12, 2023 at 15:38, yangjie01 wrote: > Hi, Yuanjian, > > > > Maybe there is another

Mailing list threading improvements

2023-08-17 Thread Christofer Dutz
TL;DR: We’re updating how auto-generated email from GitHub will be threaded on your mailing lists. If you want to keep the old defaults, details are below. We’re pleased to let you know that we’re tweaking the way that auto-generated email from GitHub will appear on your mailing lists. This will

Re: Spark writing API

2023-08-17 Thread Wenchen Fan
I'm not quite sure if this hint is useful. People usually keep a buffer and flush the buffer when it's full, so that they can control the batch size of writing, no matter how many inputs they will get. e.g. if spark hints to you that there will be 1 GB data, are you going to allocate a 1 GB buffer
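
The buffer-and-flush pattern mentioned here can be sketched in a few lines of plain Python. The `BufferedWriter` class and its `flush_fn` callback are hypothetical names chosen for illustration and are not part of any Spark writing API; the point is only that the batch size is fixed by the writer, regardless of how many rows eventually arrive.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class BufferedWriter(Generic[T]):
    """Collect rows and hand them to `flush_fn` in batches of at most `max_buffered`."""

    def __init__(self, flush_fn: Callable[[List[T]], None], max_buffered: int) -> None:
        if max_buffered <= 0:
            raise ValueError("max_buffered must be positive")
        self._flush_fn = flush_fn
        self._max_buffered = max_buffered
        self._buffer: List[T] = []

    def write(self, row: T) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._max_buffered:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            # Swap in a fresh list so the flushed batch is not mutated later.
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)

    def close(self) -> None:
        """Flush whatever is left; the total row count was never needed."""
        self.flush()


batches: List[List[int]] = []
writer = BufferedWriter(batches.append, max_buffered=3)
for row in range(7):
    writer.write(row)
writer.close()
print(batches)
```

Memory use is bounded by `max_buffered` rather than by the size of the input, which is why a row-count hint adds little for this style of writer.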

Re: What else could be removed in Spark 4?

2023-08-16 Thread Yang Jie
I would like to know how we should handle the two Kinesis-related modules in Spark 4.0. They have a very low frequency of code updates, and because the corresponding tests are not continuously executed in any GitHub Actions pipeline, I think they significantly lack quality assurance. On top

Re: Spark writing API

2023-08-16 Thread Andrew Melo
Hello Wenchen, On Wed, Aug 16, 2023 at 23:33 Wenchen Fan wrote: > > is there a way to hint to the downstream users on the number of rows > expected to write? > > It will be very hard to do. Spark pipelines the execution (within shuffle > boundaries) and we can't predict the number of final

Re: Spark writing API

2023-08-16 Thread Wenchen Fan
> is there a way to hint to the downstream users on the number of rows expected to write? It will be very hard to do. Spark pipelines the execution (within shuffle boundaries) and we can't predict the number of final output rows. On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran wrote: > > > On

Unsubscribe

2023-08-16 Thread 赵军
赵军

Unsubscribe

2023-08-15 Thread Raffael Bottoli Schemmer
Unsubscribe

[VOTE][RESULT] Release Spark 3.3.3 (RC1)

2023-08-14 Thread Yuming Wang
The vote passes with 7 +1s (4 binding +1s). Thanks to all who helped with the release! (* = binding) +1: - Yuming Wang * - Jie Yang - Dongjoon Hyun * - Liang-Chi Hsieh * - Cheng Pan - Mridul Muralidharan * - Jia Fan +0: None -1: None

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Mich Talebzadeh
Thank you for your comments. My vision of integrating machine learning (ML) into Spark Structured Streaming (SSS) for capacity planning and performance optimization seems to be promising. By leveraging ML techniques, I believe that we can potentially create predictive models that enhance the

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Martin Andersson
IMO, using any kind of machine learning or AI for DRA is overkill. The effort involved would be considerable and likely counterproductive, compared to a more conventional approach of comparing the rate of incoming stream data with the effort of handling previous data rates.
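
One possible reading of the conventional approach described here, as a plain-Python sketch: estimate per-executor throughput from previously processed batches and divide the current incoming rate by it. The function name, the shape of the history tuples and the clamping bounds are all assumptions made for illustration; this is not the algorithm proposed in SPARK-24815.

```python
import math
from typing import Sequence, Tuple

# (rows_processed, seconds_taken, executors_used) for one past micro-batch
BatchRecord = Tuple[int, float, int]


def executors_needed(
    incoming_rows_per_sec: float,
    past_batches: Sequence[BatchRecord],
    min_executors: int = 1,
    max_executors: int = 100,
) -> int:
    """Estimate how many executors keep up with the incoming rate.

    Throughput per executor is derived from past batches as
    total rows / total executor-seconds.
    """
    total_rows = sum(rows for rows, _, _ in past_batches)
    executor_seconds = sum(secs * execs for _, secs, execs in past_batches)
    if total_rows == 0 or executor_seconds == 0:
        return min_executors  # no usable history yet
    rows_per_executor_sec = total_rows / executor_seconds
    wanted = math.ceil(incoming_rows_per_sec / rows_per_executor_sec)
    return max(min_executors, min(max_executors, wanted))


# 30,000 rows in 60 executor-seconds gives 500 rows per executor-second.
history = [(10_000, 10.0, 2), (20_000, 20.0, 2)]
print(executors_needed(2_000, history))
```

A real policy would also need smoothing and cool-down periods to avoid scaling up and down on every batch.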

Unsubscribe

2023-08-14 Thread xu han

Fwd: Question about ARRAY_INSERT between Spark and Databricks

2023-08-14 Thread Ran Tao
> Forward to dev Yes, Databricks Runtime 13.0, 13.1 and 13.2 are all OK and have the same behavior as open source Apache Spark 3.4.x. But I think the Databricks docs need to be updated[1]. It's confusing. [1]

Re: [VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-13 Thread Jia Fan
+1 On Fri, Aug 11, 2023 at 15:57, Mridul Muralidharan wrote: > > +1 > > Signatures, digests, etc check out fine. > Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes > > Regards, > Mridul > > > On Fri, Aug 11, 2023 at 2:00 AM Cheng Pan wrote: > >> +1 (non-binding) >> >> Passed

Re: Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Sean Owen
There shouldn't be any difference here. In fact, I get the results you list for 'spark' from Databricks. It's possible the difference is a bug fix along the way that is in the Spark version you are using locally but not in the DBR you are using. But yeah, it seems to work as you say. If you're

Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Ran Tao
Hi, devs. I found that the ARRAY_INSERT[1] function (from Spark 3.4.0) has different semantics from Databricks[2]. e.g. // spark SELECT array_insert(array('a', 'b', 'c'), -1, 'z'); ["a","b","z","c"] // databricks SELECT array_insert(array('a', 'b', 'c'), -1, 'z'); ["a","b","c","z"] //
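
The two results quoted above differ only in how a negative position is interpreted. The plain-Python sketch below models both readings on an ordinary list; the function names are hypothetical and this is not Spark's implementation. It assumes 1-based positions and deliberately rejects positions outside the array instead of modelling any padding behavior.

```python
from typing import List, TypeVar

T = TypeVar("T")


def _insert(arr: List[T], pos: int, value: T, negative_shift: int) -> List[T]:
    """Insert at 1-based `pos`; negative `pos` counts from the end of `arr`."""
    if pos == 0 or abs(pos) > len(arr):
        raise ValueError("pos must be non-zero and within the array in this sketch")
    index = pos - 1 if pos > 0 else len(arr) + pos + negative_shift
    return arr[:index] + [value] + arr[index:]


def insert_before_from_end(arr: List[T], pos: int, value: T) -> List[T]:
    """Negative pos: new element lands before the element currently at that position."""
    return _insert(arr, pos, value, negative_shift=0)


def insert_after_from_end(arr: List[T], pos: int, value: T) -> List[T]:
    """Negative pos: new element ends up at that position in the resulting array."""
    return _insert(arr, pos, value, negative_shift=1)


letters = ["a", "b", "c"]
print(insert_before_from_end(letters, -1, "z"))  # matches the result labelled "spark"
print(insert_after_from_end(letters, -1, "z"))   # matches the result labelled "databricks"
```

For positive positions the two functions agree, so the discrepancy in the message only shows up with negative indices.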
