[VOTE][RESULT] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
The vote passes with 13 +1s (8 binding +1s) and 1 +0. (* = binding)
+1: Chao Sun (*), Liang-Chi Hsieh (*), Huaxin Gao (*), Bo Yang, Dongjoon Hyun (*), Kent Yao, Wenchen Fan (*), Ryan Blue, Anton Okolnychyi, Zhou Jiang, Gengliang Wang (*), Xiao Li (*), Hyukjin Kwon (*)
+0: Mich Talebzadeh
-1: None
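The totals in the result message can be checked with a few lines of Python. This is only an illustrative sketch: the names and binding markers are transcribed from the message above, and the `tally` helper is a hypothetical name, not part of any Apache tooling.

```python
# (name, is_binding) pairs transcribed from the +1 list in the result message.
PLUS_ONE_VOTES = [
    ("Chao Sun", True),
    ("Liang-Chi Hsieh", True),
    ("Huaxin Gao", True),
    ("Bo Yang", False),
    ("Dongjoon Hyun", True),
    ("Kent Yao", False),
    ("Wenchen Fan", True),
    ("Ryan Blue", False),
    ("Anton Okolnychyi", False),
    ("Zhou Jiang", False),
    ("Gengliang Wang", True),
    ("Xiao Li", True),
    ("Hyukjin Kwon", True),
]


def tally(votes):
    """Return (total, binding) counts for a list of (name, is_binding) pairs."""
    total = len(votes)
    binding = sum(1 for _, is_binding in votes if is_binding)
    return total, binding


total, binding = tally(PLUS_ONE_VOTES)
print(f"{total} +1s ({binding} binding)")
```

The counts agree with the summary line of the message: 13 +1 votes, 8 of them binding.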

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
Hi all, Thanks all for participating and for your support! The vote has passed. I'll send out the result in a separate thread. On Wed, May 15, 2024 at 4:44 PM Hyukjin Kwon wrote: > > +1 > > On Tue, 14 May 2024 at 16:39, Wenchen Fan wrote: >> >> +1 >> >> On Tue, May 14, 2024 at 8:19 AM Zhou

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1 On Tue, 14 May 2024 at 16:39, Wenchen Fan wrote: > +1 > > On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > >> +1 (non-binding) >> >> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >>> Hi all, >>> >>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >>> >>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which clarifies the definition of "users": Behavior changes mean user-visible functional changes in a new release via public APIs. The "user" here is not only the user who writes queries and/or develops Spark plugins, but also the

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to 9.x. On Sat, May 11, 2024 at 3:37 PM Cheng Pan wrote: > -1 (non-binding) > > A small question, the tag is orphan but I suppose it should belong to the > master branch. > > Seems YARN integration is broken due to javax =>

Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one or more project dev@ mailing lists at the Apache Software Foundation.] We are very close to Community Over Code EU -- check out the amazing program and the special discounts that we have for you. Special discounts You still

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Xiao Li
+1 On Mon, May 13, 2024 at 16:24, Gengliang Wang wrote: > +1 > > On Mon, May 13, 2024 at 12:30 PM Zhou Jiang > wrote: > >> +1 (non-binding) >> >> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >>> Hi all, >>> >>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >>> >>> Please

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1 On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > +1 (non-binding) > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Gengliang Wang
+1 On Mon, May 13, 2024 at 12:30 PM Zhou Jiang wrote: > +1 (non-binding) > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Zhou Jiang
+1 (non-binding) On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > Hi all, > > I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > > Please also refer to: > >- Discussion thread: > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo >- JIRA ticket:

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Anton Okolnychyi
+1 On 2024/05/13 15:33:33 Ryan Blue wrote: > +1 > > On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh > wrote: > > > +0 > > > > For reasons I outlined in the discussion thread > > > > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo > > > > Mich Talebzadeh, > > Technologist |

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Ryan Blue
+1 On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh wrote: > +0 > > For reasons I outlined in the discussion thread > > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative AI | FinCrime > London > United

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas, Thanks for your help! I'm definitely interested in participating in this unification work. Let me know how I can help. Wenchen On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas wrote: > Re: unification > > We also have a long-standing problem with how we manage Python >

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Mich Talebzadeh
+0 For reasons I outlined in the discussion thread https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification We also have a long-standing problem with how we manage Python dependencies, something I’ve tried (unsuccessfully) to fix in the past. Consider, for example, how many separate places this numpy dependency is installed: 1.

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1 On Mon, May 13, 2024 at 10:30 AM Kent Yao wrote: > +1 > > On Mon, May 13, 2024 at 08:39, Dongjoon Hyun wrote: > > > > +1 > > > > On Sun, May 12, 2024 at 3:50 PM huaxin gao > wrote: > >> > >> +1 > >> > >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >>> > >>> +1 > >>> > >>> On Sat, May 11, 2024

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now. In fact, the main job of the release process, building packages and documents, is tested in GitHub Actions jobs. However, the way we test them is different from what we do in the release scripts. 1. the execution

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Kent Yao
+1 On Mon, May 13, 2024 at 08:39, Dongjoon Hyun wrote: > > +1 > > On Sun, May 12, 2024 at 3:50 PM huaxin gao wrote: >> >> +1 >> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: >>> >>> +1 >>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >>> > >>> > +1 >>> > >>> > On Sat, May 11, 2024 at 2:10 PM

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Dongjoon Hyun
+1 On Sun, May 12, 2024 at 3:50 PM huaxin gao wrote: > +1 > > On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >> > >> > +1 >> > >> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> >> >> Hi all, >> >> >> >> I’d like to

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread bo yang
+1 On Sat, May 11, 2024 at 4:43 PM huaxin gao wrote: > +1 > > On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >> > >> > +1 >> > >> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> >> >> Hi all, >> >> >> >> I’d like to

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread huaxin gao
+1 On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > +1 > > On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: > > > > +1 > > > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> > >> Hi all, > >> > >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > >> > >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
+1 On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: > > +1 > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Chao Sun
+1 On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > Hi all, > > I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > > Please also refer to: > >- Discussion thread: > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo >- JIRA ticket:

[VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
Hi all, I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. Please also refer to: - Discussion thread: https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167 - SPIP doc:

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Mich Talebzadeh
Thanks In the context of stored procedures API for Catalogs, this approach deviates from the traditional definition of stored procedures in RDBMS for two key reasons: - Compilation vs. Interpretation: Traditional stored procedures are typically pre-compiled into machine code for faster

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Anton Okolnychyi
Mich, I don't think the invalidation will be necessary in our case as there is no plan to preprocess or compile the procedures into executable objects. They will be loaded and executed on demand via the Catalog API. On Fri, May 10, 2024 at 10:37, Mich Talebzadeh wrote: > Hi, > > If the underlying

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-11 Thread Cheng Pan
-1 (non-binding) A small question: the tag is orphaned, but I suppose it should belong to the master branch. It seems YARN integration is broken due to the javax => jakarta namespace migration. I filed SPARK-48238 and left some comments on https://github.com/apache/spark/pull/45154 Caused by:

[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 4.0.0-preview1. The vote is open until May 16 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 4.0.0-preview1 [ ] -1 Do not release this package
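The passing rule quoted in the vote message can be written down as a small check. This is a minimal sketch under stated assumptions: it reads "majority" as more binding +1 votes than binding -1 votes and treats only PMC votes as binding; the function name `release_vote_passes` is hypothetical and is not part of any Spark or ASF tooling.

```python
MIN_BINDING_PLUS_ONE = 3  # "with a minimum of 3 +1 votes"


def release_vote_passes(binding_plus_one: int, binding_minus_one: int) -> bool:
    """Apply the quoted rule to binding (PMC) votes only.

    Assumption: "majority" means strictly more binding +1 than binding -1.
    Non-binding votes are ignored by this check.
    """
    has_minimum = binding_plus_one >= MIN_BINDING_PLUS_ONE
    has_majority = binding_plus_one > binding_minus_one
    return has_minimum and has_majority


# Example: five binding +1 votes against one binding -1 vote.
print(release_vote_passes(5, 1))
```

A vote can satisfy this arithmetic and still be withdrawn by the release manager when a blocking issue is found, so the check covers only the counting rule.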

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-10 Thread Mich Talebzadeh
Hi, If the underlying table changes (DDL), if I recall from RDBMSs like Oracle, the stored procedure will be invalidated as it is a compiled object. How is this going to be handled? Does it follow the same mechanism? Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread huaxin gao
Thanks Anton for the updated proposal -- it looks great! I appreciate the hard work put into refining it. I am looking forward to the upcoming vote and moving forward with this initiative. Thanks, Huaxin On Thu, May 9, 2024 at 7:30 PM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen,

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward. On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and > others if I miss those who are participating in the discussion. > > I suppose we have reached a consensus or close to

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and others, if I missed anyone who is participating in the discussion. I suppose we have reached consensus on the design, or are close to it. If you have some more comments, please let us know. If not, I will go to start a vote soon

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Anton Okolnychyi
Thanks to everyone who commented on the design doc. I updated the proposal and it is ready for another look. I hope we can converge and move forward with this effort! - Anton On Fri, Apr 19, 2024 at 15:54, Anton Okolnychyi wrote: > Hi folks, > > I'd like to start a discussion on SPARK-44167 that

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: I've successfully uploaded the release packages: https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/ (I skipped SparkR as I was not able to fix the errors, I'll get back to it later) However, there is a new issue with doc building:

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please re-try to upload, Wenchen. ASF Infra team bumped up our upload limit based on our request. > Your upload limit has been increased to 650MB Dongjoon. On Thu, May 9, 2024 at 8:12 AM Wenchen Fan wrote: > I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 > > On

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun wrote: > In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 > (2024-04-15 Vote) > > According to my work log, I uploaded the following binaries to SVN

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 (2024-04-15 Vote) According to my work log, I uploaded the following binaries to SVN from EC2 (us-west-2) without any issues. -rw-r--r--. 1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz -rw-r--r--. 1 centos

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Could you file an INFRA JIRA issue with the error message and context first, Wenchen? As you know, if we see something, we had better file a JIRA issue because it could be not only an Apache Spark project issue but also all ASF project issues. Dongjoon. On Thu, May 9, 2024 at 12:28 AM Wenchen

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at least add a test job to make sure the release script can produce the packages correctly. Today it's kind of being manually tested by the release manager each time, which slows down the release process. It's better if we can

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello, I can answer some of your questions, which are common to other Apache projects. > Who currently has permissions for GitHub Actions? Is there a specific owner for that today or a different volunteer each time? The Apache organization owns GitHub Actions, and committers (contributors with write

[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started in the Spark 4.0.0 release thread, this is a thread to discuss improvements to our release processes. I'll start by raising some questions that probably should have answers to start the discussion: 1. What is currently running in GitHub Actions? 2. Who currently

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: After resolving a few issues in the release scripts, I can finally build the release packages. However, I can't upload them to the staging SVN repo due to a transmitting error, and it seems like a limitation from the server side. I tried it on both my local laptop and remote AWS instance,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful! On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh wrote: > *Potential reasons* > > >- Data Serialization: Spark needs to serialize the DataFrame into an >in-memory format suitable for storage. This process can be time-consuming, >especially for large datasets like 3.2 GB

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our release processes? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (public preview) a new feature called Artifact Attestations which may be relevant/useful here: Introducing Artifact Attestations–now in public beta - The GitHub Blog On Wed,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons* - Data Serialization: Spark needs to serialize the DataFrame into an in-memory format suitable for storage. This process can be time-consuming, especially for large datasets like 3.2 GB with complex schemas. - Shuffle Operations: If your transformations involve

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here? > On May 7, 2024, at 4:30 PM, Prem Sahoo wrote: > > Hello Folks, > in Spark I have read a file and done some transformation and finally writing > to hdfs. > > Now I am interested in writing the same dataframe to MapRFS but for this > Spark

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions, so I can't do it, but I'm happy to help (although I am more familiar with GitLab CI/CD than GitHub Actions). Is there some point of contact that can provide me with the needed context and permissions? I'd also love to see why the costs are high and see how we can reduce them... Thanks,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good idea. I know we’ve been asked to reduce our GitHub action usage but perhaps someone interested could volunteer to set that up. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Thanks for the reply. From my experience, a build on a build server would be much more predictable and less error-prone than building on some laptop, and of course much faster for producing builds, snapshots, early preview releases, release candidates or final releases. It

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final verification / signing should be done locally to keep the keys safe (there was some concern from earlier release processes). Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen! Dongjoon. On Tue, May 7, 2024 at 10:49 AM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready for the release process (docker desktop doesn't work anymore, my pgp > key is lost, etc.). I'll

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks, in Spark I have read a file, done some transformations, and am finally writing to HDFS. Now I am interested in writing the same dataframe to MapRFS, but for this Spark will execute the full DAG again (recomputing all the previous steps: the read plus all transformations). I don't want

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Sorry for the novice question, Wenchen - is the release done manually from a laptop? Not using a CI/CD process on a build server? Thanks, Nimrod On Tue, May 7, 2024 at 8:50 PM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE: Unfortunately, it took me quite some time to set up my laptop and get it ready for the release process (Docker Desktop doesn't work anymore, my PGP key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for your patience! Wenchen On Fri, May 3, 2024 at 7:47 AM yangjie01

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create staging dir while doing an insertInto on partitioned tables. I'm running below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =

Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
I’ll mention that we’re working toward a preview release, even if the details are not finalized by the time we send the report. > On May 6, 2024, at 10:52 AM, Holden Karau wrote: > > I trust Wenchen to manage the preview release effectively but if there are > concerns around how to manage a

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are concerns around how to manage a developer preview release, let's split that off from the board report discussion. On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh wrote: > I did some historical digging on this. > > Whilst

Re: Why spark-submit works with package not with jar

2024-05-06 Thread Mich Talebzadeh
Thanks David. I wanted to explain the difference between Package and Jar with comments from the community on previous discussions back a few years ago. cheers Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
I did some historical digging on this. Whilst both preview release and RCs are pre-release versions, the main difference lies in their maturity and readiness for production use. Preview releases are early versions aimed at gathering feedback, while release candidates (RCs) are nearly finished

Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi, It seems this library is several years old. Have you considered using the Google provided connector? You can find it in https://github.com/GoogleCloudDataproc/spark-bigquery-connector Regards, David Rabinowitz On Sun, May 5, 2024 at 6:07 PM Jeff Zhang wrote: > Are you sure

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
@Wenchen Fan Thanks for the update! To clarify, is the vote for approving a specific preview build, or is it for moving towards an RC stage? I gather there is a distinction between these two? Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term soon we could say “in-progress” Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Mon, May 6, 2024 at

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
Hi, We should reconsider using the term "soon" for ASF board as it is subjective with no date (assuming this is an official communication on Wednesday). We ought to say "Spark 4, the next major release after Spark 3.x, is currently under development. We plan to make a preview version available

Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on Monday, but the actual release may take some time. Hopefully, we can get it out this week but if the vote fails, it will take longer as we need more RCs. On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun wrote: > +1 for

Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
Are you sure com.google.api.client.http.HttpRequestInitialize is in the spark-bigquery-latest.jar or it may be in the transitive dependency of spark-bigquery_2.11? On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh wrote: > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative

Re: ASF board report draft for May

2024-05-05 Thread Dongjoon Hyun
+1 for Holden's comment. Yes, it would be great to mention `it` as "soon". (If Wenchen releases it on Monday, we can simply mention the release) In addition, Apache Spark PMC received an official notice from ASF Infra team. https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg >

Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of Spark 4 so folks can see the APIs “soon”? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams:

ASF board report draft for May

2024-05-05 Thread Matei Zaharia
It’s time for our quarterly ASF board report on Apache Spark this Wednesday. Here’s a draft, feel free to suggest changes. Description: Apache Spark is a fast and general purpose engine for large-scale data processing. It offers high-level APIs in Java, Scala, Python, R

Re: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Mich Talebzadeh
In answer to this part of your question, "*Understanding the Issue:* Are there known reasons within Spark that could explain this difference in behavior when loading dependencies via `--packages` versus placing JARs directly?": --jars adds only that JAR; --packages adds the JAR and a its

Fwd: Why spark-submit works with package not with jar

2024-05-04 Thread Mich Talebzadeh
Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct

Fwd: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to

Re: Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Jungtaek Lim
(removing user@, as the topic is not aimed at the user group) I would like to clarify what an SPIP is, as there have been multiple improper proposals and the ticket also mentions SPIP without fulfilling the effective requirements. SPIP is only effective when there is a dedicated individual or

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket -* Improved Query Performance

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
I encountered an issue while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread yangjie01
+1 From: Jungtaek Lim Date: Thursday, May 2, 2024 10:21 To: Holden Karau Cc: Chao Sun, Xiao Li, Tathagata Das, Wenchen Fan, Cheng Pan, Nicholas Chammas, Dongjoon Hyun, Cheng Pan, Spark dev list, Anish Shrigondekar Subject: Re: [DISCUSS] Spark 4.0.0 release +1 love to see it! On Thu, May 2,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Mich Talebzadeh
- Integration with additional external data sources or systems, say Hive - Enhancements to the Spark UI for improved monitoring and debugging - Enhancements to machine learning (MLlib) algorithms and capabilities, like TensorFlow or PyTorch,( if any in the pipeline) HTH Mich

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in. On Thu, 2 May 2024 at 03:20, Jungtaek Lim wrote: > +1 love to see it! > > On Thu, May 2, 2024 at 10:08 AM Holden Karau > wrote: > >> +1 :) yay previews >> >> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: >> >>> +1 >>> >>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir: We didn't mind "loud" changes that threw exceptions. We have some infra to try running jobs with Spark 3 and fall back to Spark 2 if there's an

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen, I think that usually a good practice with public APIs, and with internal APIs that have a big impact and a lot of usage, is to ease in changes by providing defaults for new parameters that keep the former behaviour in a method with the previous signature, with a deprecation notice, and

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Jungtaek Lim
+1 love to see it! On Thu, May 2, 2024 at 10:08 AM Holden Karau wrote: > +1 :) yay previews > > On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > >> +1 >> >> On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: >> >>> +1 for next Monday. >>> >>> We can do more previews when the other features are

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > +1 > > On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > >> +1 for next Monday. >> >> We can do more previews when the other features are ready for preview. >> >> On Wed, May 1, 2024 at 08:46, Tathagata Das wrote: >> >>> Next week sounds

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik, Thanks for sharing your thoughts! Note: developer APIs are also public APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking changes should be avoided as much as we can and new APIs should be mentioned in the release notes. Breaking binary compatibility is also a

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1 On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > +1 for next Monday. > > We can do more previews when the other features are ready for preview. > > On Wed, May 1, 2024 at 08:46, Tathagata Das wrote: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM On Thu, 2 May 2024 at 02:06, Dongjoon Hyun wrote: > +1 for next Monday. > > Dongjoon. > > On Wed, May 1, 2024 at 8:46 AM Tathagata Das > wrote: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >> >>> Yea I think a preview release

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday. We can do more previews when the other features are ready for preview. On Wed, May 1, 2024 at 08:46, Tathagata Das wrote: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday. Dongjoon. On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch cut). We don't >> need to wait for all the

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion Wenchen! Two points I would like to raise, though I'm fully supportive of any improvements in this regard, my points below notwithstanding -- I am not intending to let perfect be the enemy of good here. On a similar note as Santosh's comment, we should

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh! I was originally targeting end users who write queries with Spark, as this is probably the largest user base. But we should definitely consider other users who deploy and manage Spark clusters. Those users are usually more tolerant of behavior changes and I think it should be

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen! On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > Yea I think a preview release won't hurt (without a branch cut). We don't > need to wait for all the ongoing projects to be ready. How about we do a > 4.0 preview release based on the current master

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this! How do we define "the user" for Spark? 1. End users: There are some users that use Spark as a service from a provider 2. Providers/Operators: There are some users that provide Spark as a service for their internal (on-prem setup with yarn/k8s)/external (Something

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't need to wait for all the ongoing projects to be ready. How about we do a 4.0 preview release based on the current master branch next Monday? On Wed, May 1, 2024 at 11:06 PM Tathagata Das wrote: > Hey all, > > Reviving

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Hey all, Reviving this thread, but Spark master has already accumulated a huge amount of changes. As a downstream project maintainer, I want to really start testing the new features and other breaking changes, and it's hard to do that without a Preview release. So the sooner we make a Preview

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-05-01 Thread Mich Talebzadeh
It is important to consider potential impacts on Spark tables stored in the Hive metastore during an "upgrade". Depending on the upgrade path, the Hive metastore schema or SerDes behavior might change, requiring adjustments in the Spark code or configurations. I mentioned the need to test the

[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all, It's exciting to see innovations keep happening in the Spark community and Spark keeps evolving itself. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on it: the PR template requires the author

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an issue as many users create native Spark data source tables already today, by explicitly putting the `USING` clause in the CREATE TABLE statement. On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh wrote: > @Wenchen Fan

Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Mich Talebzadeh
@Wenchen Fan Got your explanation, thanks! My understanding is that even if we create Spark tables using Spark's native data sources, by default, the metadata about these tables will be stored in the Hive metastore. As a consequence, a Hive upgrade can potentially affect Spark tables. For

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Kent Yao
+1 Kent Yao On 2024/04/30 09:07:21 Yuming Wang wrote: > +1 > > On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin wrote: > > > +1 > > Sent from my iPhone > > > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > > > > +1 > > > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > > > > To add
