Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
I did some historical digging on this. Whilst both preview releases and RCs are pre-release versions, the main difference lies in their maturity and readiness for production use. Preview releases are early versions aimed at gathering feedback, while release candidates (RCs) are nearly finished

Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi, It seems this library is several years old. Have you considered using the Google provided connector? You can find it in https://github.com/GoogleCloudDataproc/spark-bigquery-connector Regards, David Rabinowitz On Sun, May 5, 2024 at 6:07 PM Jeff Zhang wrote: > Are you sure

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
@Wenchen Fan Thanks for the update! To clarify, is the vote for approving a specific preview build, or is it for moving towards an RC stage? I gather there is a distinction between these two? Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term soon we could say “in-progress” Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Mon, May 6, 2024 at

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
Hi, We should reconsider using the term "soon" for ASF board as it is subjective with no date (assuming this is an official communication on Wednesday). We ought to say "Spark 4, the next major release after Spark 3.x, is currently under development. We plan to make a preview version available

Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on Monday, but the actual release may take some time. Hopefully, we can get it out this week but if the vote fails, it will take longer as we need more RCs. On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun wrote: > +1 for

Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
Are you sure com.google.api.client.http.HttpRequestInitialize is in the spark-bigquery-latest.jar or it may be in the transitive dependency of spark-bigquery_2.11? On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh wrote: > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative

Re: ASF board report draft for May

2024-05-05 Thread Dongjoon Hyun
+1 for Holden's comment. Yes, it would be great to mention `it` as "soon". (If Wenchen releases it on Monday, we can simply mention the release.) In addition, the Apache Spark PMC received an official notice from the ASF Infra team. https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg >

Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of Spark 4 so folks can see the APIs “soon”? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams:

ASF board report draft for May

2024-05-05 Thread Matei Zaharia
It’s time for our quarterly ASF board report on Apache Spark this Wednesday. Here’s a draft, feel free to suggest changes. Description: Apache Spark is a fast and general purpose engine for large-scale data processing. It offers high-level APIs in Java, Scala, Python, R

Re: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Mich Talebzadeh
In answer to this part of your question "...Understanding the Issue: Are there known reasons within Spark that could explain this difference in behavior when loading dependencies via `--packages` versus placing JARs directly?" 2. --jars adds only that jar; --packages adds the jar and its
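The distinction Mich describes can be sketched in plain Python. This is a toy resolver, not Spark's actual Ivy machinery, and the dependency edge from spark-bigquery to google-http-client is assumed purely for illustration:

```python
# Toy model of dependency resolution: --jars ships only the listed
# artifacts, while --packages also walks transitive dependencies
# (as Spark does via Ivy). The graph below is hypothetical.
DEPS = {
    "spark-bigquery_2.11": ["google-http-client"],  # assumed edge
    "google-http-client": [],
}
CLASSES = {
    "spark-bigquery_2.11": {"BigQueryRelation"},
    "google-http-client": {"com.google.api.client.http.HttpRequestInitializer"},
}

def classpath_with_jars(artifacts):
    """--jars: only the named artifacts reach the classpath."""
    classes = set()
    for a in artifacts:
        classes |= CLASSES[a]
    return classes

def classpath_with_packages(artifacts):
    """--packages: named artifacts plus everything reachable transitively."""
    classes, stack = set(), list(artifacts)
    while stack:
        a = stack.pop()
        classes |= CLASSES[a]
        stack.extend(DEPS[a])
    return classes

needed = "com.google.api.client.http.HttpRequestInitializer"
print(needed in classpath_with_jars(["spark-bigquery_2.11"]))      # False
print(needed in classpath_with_packages(["spark-bigquery_2.11"]))  # True
```

Under this model, a NoClassDefFoundError for the transitive class with `--jars` but not with `--packages` is exactly the behaviour reported in the thread.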

Fwd: Why spark-submit works with package not with jar

2024-05-04 Thread Mich Talebzadeh
Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct

Fwd: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to

Re: Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Jungtaek Lim
(removing user@ as the topic is not aimed at the user group) I would like to clarify what an SPIP is, as there have been multiple improper proposals, and the ticket also mentions SPIP without fulfilling the effective requirements. An SPIP is only effective when there is a dedicated individual or

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket: - Improved Query Performance

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MV), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a
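As background for the Spark/Hive behaviour difference discussed in this thread, the semantics of a materialized view versus a plain view can be sketched in plain Python (an illustrative model, not a Spark API):

```python
# Minimal sketch of view vs. materialized-view semantics: a view
# re-runs its query on every read; a materialized view stores the
# result and serves it until explicitly refreshed.
class View:
    def __init__(self, query):
        self.query = query      # a zero-arg callable standing in for SQL

    def read(self):
        return self.query()     # recomputed on every read

class MaterializedView:
    def __init__(self, query):
        self.query = query
        self.data = query()     # computed once at creation, then stored

    def read(self):
        return self.data        # served from storage, possibly stale

    def refresh(self):
        self.data = self.query()  # explicit refresh picks up base changes

rows = [1, 2, 3]
v = View(lambda: sum(rows))
mv = MaterializedView(lambda: sum(rows))
rows.append(4)                  # base table changes
print(v.read())                 # 10 - the view sees the new row
print(mv.read())                # 6  - stale until refreshed
mv.refresh()
print(mv.read())                # 10
```

This is the storage-plus-refresh behaviour that systems like Hive manage for MVs and that Spark SQL, as noted in the thread, does not itself implement.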

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread yangjie01
+1 From: Jungtaek Lim Date: Thursday, May 2, 2024, 10:21 To: Holden Karau Cc: Chao Sun , Xiao Li , Tathagata Das , Wenchen Fan , Cheng Pan , Nicholas Chammas , Dongjoon Hyun , Cheng Pan , Spark dev list , Anish Shrigondekar Subject: Re: [DISCUSS] Spark 4.0.0 release +1 love to see it! On Thu, May 2,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Mich Talebzadeh
- Integration with additional external data sources or systems, say Hive - Enhancements to the Spark UI for improved monitoring and debugging - Enhancements to machine learning (MLlib) algorithms and capabilities, like TensorFlow or PyTorch (if any are in the pipeline) HTH Mich

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in. On Thu, 2 May 2024 at 03:20, Jungtaek Lim wrote: > +1 love to see it! > > On Thu, May 2, 2024 at 10:08 AM Holden Karau > wrote: > >> +1 :) yay previews >> >> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: >> >>> +1 >>> >>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir: We didn't mind "loud" changes that threw exceptions. We have some infra to try running jobs with Spark 3 and fall back to Spark 2 if there's an

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen, I think a good practice, for public APIs and for internal APIs with big impact and wide usage, is to ease in changes by providing defaults for new parameters that keep the former behaviour, retaining a method with the previous signature under a deprecation notice, and

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Jungtaek Lim
+1 love to see it! On Thu, May 2, 2024 at 10:08 AM Holden Karau wrote: > +1 :) yay previews > > On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > >> +1 >> >> On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: >> >>> +1 for next Monday. >>> >>> We can do more previews when the other features are

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > +1 > > On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > >> +1 for next Monday. >> >> We can do more previews when the other features are ready for preview. >> >> Tathagata Das 于2024年5月1日周三 08:46写道: >> >>> Next week sounds

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik, Thanks for sharing your thoughts! Note: developer APIs are also public APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking changes should be avoided as much as we can and new APIs should be mentioned in the release notes. Breaking binary compatibility is also a

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1 On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > +1 for next Monday. > > We can do more previews when the other features are ready for preview. > > Tathagata Das 于2024年5月1日周三 08:46写道: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM On Thu, 2 May 2024 at 02:06, Dongjoon Hyun wrote: > +1 for next Monday. > > Dongjoon. > > On Wed, May 1, 2024 at 8:46 AM Tathagata Das > wrote: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >> >>> Yea I think a preview release

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday. We can do more previews when the other features are ready for preview. Tathagata Das wrote on Wed, May 1, 2024 at 08:46: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday. Dongjoon. On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch cut). We don't >> need to wait for all the

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion Wenchen! Two points I would like to raise, though I'm fully supportive of any improvements in this regard, my points below notwithstanding -- I am not intending to let perfect be the enemy of good here. On a similar note as Santosh's comment, we should

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh! I was originally targeting end users who write queries with Spark, as this is probably the largest user base. But we should definitely consider other users who deploy and manage Spark clusters. Those users are usually more tolerant of behavior changes and I think it should be

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen! On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > Yea I think a preview release won't hurt (without a branch cut). We don't > need to wait for all the ongoing projects to be ready. How about we do a > 4.0 preview release based on the current master

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this! How do we define "the user" for spark? 1. End users: There are some users that use spark as a service from a provider 2. Providers/Operators: There are some users that provide spark as a service for their internal(on-prem setup with yarn/k8s)/external(Something

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't need to wait for all the ongoing projects to be ready. How about we do a 4.0 preview release based on the current master branch next Monday? On Wed, May 1, 2024 at 11:06 PM Tathagata Das wrote: > Hey all, > > Reviving

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Hey all, Reviving this thread, but Spark master has already accumulated a huge amount of changes. As a downstream project maintainer, I want to really start testing the new features and other breaking changes, and it's hard to do that without a Preview release. So the sooner we make a Preview

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-05-01 Thread Mich Talebzadeh
It is important to consider potential impacts on Spark tables stored in the Hive metastore during an "upgrade". Depending on the upgrade path, the Hive metastore schema or SerDes behavior might change, requiring adjustments in the Spark code or configurations. I mentioned the need to test the

[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all, It's exciting to see innovations keep happening in the Spark community and Spark keeps evolving itself. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on it: the PR template requires the author

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an issue as many users create native Spark data source tables already today, by explicitly putting the `USING` clause in the CREATE TABLE statement. On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh wrote: > @Wenchen Fan

Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Mich Talebzadeh
@Wenchen Fan Got your explanation, thanks! My understanding is that even if we create Spark tables using Spark's native data sources, by default, the metadata about these tables will be stored in the Hive metastore. As a consequence, a Hive upgrade can potentially affect Spark tables. For

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Kent Yao
+1 Kent Yao On 2024/04/30 09:07:21 Yuming Wang wrote: > +1 > > On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin wrote: > > > +1 > > Sent from my iPhone > > > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > > > >  > > +1 > > > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > > > >  > > To add

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Yuming Wang
+1 On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin wrote: > +1 > Sent from my iPhone > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > >  > +1 > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > >  > To add more color: > > Spark data source table and Hive Serde table are both stored in the

[VOTE][RESULT] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Dongjoon Hyun
The vote passes with 11 +1s (6 binding +1s) and one -1. Thanks to all who helped with the vote! (* = binding) +1: - Dongjoon Hyun * - Gengliang Wang * - Liang-Chi Hsieh * - Holden Karau * - Zhou Jiang - Cheng Pan - Hyukjin Kwon * - DB Tsai * - Ye Xianjin - XiDuo You - Nimrod Ofek +0: None -1: -

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Nimrod Ofek
+1 (non-binding) p.s How do I become binding? Thanks, Nimrod On Tue, Apr 30, 2024 at 10:53 AM Ye Xianjin wrote: > +1 > Sent from my iPhone > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > >  > +1 > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > >  > To add more color: > > Spark data

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread XiDuo You
+1 Dongjoon Hyun wrote on Sat, Apr 27, 2024 at 03:50: > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd > - JIRA:

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Ye Xianjin
+1 Sent from my iPhone. On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: +1 On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread DB Tsai
+1 On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they have different "table provider", which means Spark will use different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they have different "table provider", which means Spark will use different reader/writer. Ideally the Spark native data
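The provider-selection semantics under vote can be sketched as follows. This is a hand-written illustration, not Spark source; in real Spark the non-legacy default comes from spark.sql.sources.default, which is parquet out of the box:

```python
# Sketch of how the default table provider is chosen, per the thread:
# an explicit USING or STORED AS clause always wins; otherwise the
# spark.sql.legacy.createHiveTableByDefault flag decides. Parsing here
# is deliberately naive, just enough for the illustration.
def default_provider(create_stmt, legacy_create_hive_by_default):
    stmt = create_stmt.upper()
    if " USING " in stmt:
        # Explicit provider, e.g. CREATE TABLE t ... USING parquet
        return stmt.split(" USING ")[1].split()[0].lower()
    if " STORED AS " in stmt:
        # Explicit Hive Serde syntax
        return "hive"
    # No explicit provider: the flag under vote picks the default.
    return "hive" if legacy_create_hive_by_default else "parquet"

print(default_provider("CREATE TABLE t (c INT) USING delta", True))   # delta
print(default_provider("CREATE TABLE t (c INT)", True))               # hive
print(default_provider("CREATE TABLE t (c INT)", False))              # parquet
```

Either way, as Wenchen notes, the resulting table's metadata lives in the Hive metastore; only the reader/writer Spark picks differs.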

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore, it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it is today. On Tue, Apr 30, 2024 at 5:23 AM Hyukjin

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, it is a legacy config we should get rid of in the end, and it has been tested in production for a very long time. Spark should create a Spark table by default. On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh wrote: > Your point > > ".. t's a surprise to me to see that someone has different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
+1 It's a legacy conf that we should eventually remove. Spark should create a Spark table by default, not a Hive table. Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
? I'm not sure why you think in that direction. What I wrote was the following. - You voted +1 for SPARK-4 on April 14th (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) - You voted -1 for SPARK-46122 on April 26th.

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Mich Talebzadeh
Your point ".. it's a surprise to me to see that someone has different positions in a very short period of time in the community" Well, I have been with Spark since 2015, and this is an article on Medium dated February 7, 2016 with regard to both Hive and Spark, also presented in

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
It's a surprise to me to see that someone has different positions in a very short period of time in the community. Mich cast +1 for SPARK-4 and -1 for SPARK-46122. - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc -

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Mich Talebzadeh
Hi @Wenchen Fan Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is 3.1.1 /opt/spark/conf/hive-site.xml ->

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Wenchen Fan
@Mich Talebzadeh thanks for sharing your concern! Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create Spark native table in this case,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-27 Thread Hussein Awala
+1 On Saturday, April 27, 2024, John Zhuge wrote: > +1 > > On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: >> +1 >> >> yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: >> > >> > +1 >> > >> > From: Ruifeng Zheng >> > Date: Friday, Apr 26, 2024, 15:05 >> > To: Xinrong Meng >> > Cc: Dongjoon Hyun ,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread John Zhuge
+1 On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: > +1 > > yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: > > > > +1 > > > > From: Ruifeng Zheng > > Date: Friday, Apr 26, 2024, 15:05 > > To: Xinrong Meng > > Cc: Dongjoon Hyun , "dev@spark.apache.org" < > dev@spark.apache.org> > > Subject: Re: [FYI]

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Cheng Pan
+1 (non-binding) Thanks, Cheng Pan On Sat, Apr 27, 2024 at 9:29 AM Holden Karau wrote: > > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Zhou Jiang
+1 (non-binding) On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
-1 for me Do not change spark.sql.legacy.createHiveTableByDefault because: 1. We have not had enough time to "DISCUSS" this matter. The discussion thread was opened almost 24 hours ago. 2. Compatibility: Changing the default behavior could potentially break existing workflows or

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh wrote: > +1 > > On Fri, Apr 26, 2024

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread L. C. Hsieh
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: Which version of spark version supports parquet version 2 ?

2024-04-26 Thread Prem Sahoo
Confirmed, closing this . Thanks everyone for valuable information. Sent from my iPhone > On Apr 25, 2024, at 9:55 AM, Prem Sahoo wrote: > >  > Hello Spark , > After discussing with the Parquet and Pyarrow community . We can use the > below config so that Spark can write Parquet V2 files. >

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Gengliang Wang
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
I'll start with my +1. Dongjoon. On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: >

[VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault to `false` by default. The technical scope is defined in the following PR. - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd - JIRA: https://issues.apache.org/jira/browse/SPARK-46122 - PR:

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, LiangChi. I'll start a vote. To Mich, for your question, Apache Spark has a long history of converting Hive-provider tables into Spark's datasource tables to handle better in a Spark way. > Can you please elaborate on the above specifically with

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread Kent Yao
+1 yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: > > +1 > > > > From: Ruifeng Zheng > Date: Friday, Apr 26, 2024, 15:05 > To: Xinrong Meng > Cc: Dongjoon Hyun , "dev@spark.apache.org" > > Subject: Re: [FYI] SPARK-47993: Drop Python 3.8 > > > > +1 > > > > On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng wrote: > > +1 > >

Survey: To Understand the requirements regarding TRAINING & TRAINING CONTENT in your ASF project

2024-04-26 Thread Mirko Kämpf
Hello ASF people, As a member of the ASF Training (Incubating) project, and in preparation for our presentation at the CoC conference in June in Bratislava, we are conducting a survey. The purpose is this: *We want to understand the requirements regarding training materials and procedures in various ASF

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
Hi, I would like to add a side note regarding the discussion process and the current title of the proposal. The title '[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific configuration parameter, which might lead some participants to overlook its

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread yangjie01
+1 From: Ruifeng Zheng Date: Friday, Apr 26, 2024, 15:05 To: Xinrong Meng Cc: Dongjoon Hyun , "dev@spark.apache.org" Subject: Re: [FYI] SPARK-47993: Drop Python 3.8 +1 On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng <xinr...@apache.org> wrote: +1 On Thu, Apr 25, 2024 at 2:08 PM Holden Karau

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread Ruifeng Zheng
+1 On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng wrote: > +1 > > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau > wrote: > >> +1 >> >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 >>

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang wrote: > +1 > > On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > >> Of course, I can't think of a scenario of thousands of tables with a single >> in-memory Spark cluster with an in-memory catalog. >> Thanks for the help! >> >> On Thu, Apr 25

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Yuming Wang
+1 On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > Of course, I can't think of a scenario of thousands of tables with a single > in-memory Spark cluster with an in-memory catalog. > Thanks for the help! > > On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Denny Lee
+1 (non-binding) On Thu, Apr 25, 2024 at 19:26 Xinrong Meng wrote: > +1 > > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau > wrote: > >> +1 >> >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Xinrong Meng
+1 On Thu, Apr 25, 2024 at 2:08 PM Holden Karau wrote: > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Thu,

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, I can't think of a scenario of thousands of tables with a single in-memory Spark cluster with an in-memory catalog. Thanks for the help! On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > > > Agreed. In scenarios where most of the interactions with

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
ok thanks got it Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 11:19 AM Maciej wrote: > > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/25/24 6:21 PM, Reynold Xin wrote: > > +1 > > On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale > wrote: >> >> +1 >> >> On Thu, Apr 25,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Thu, Apr 25, 2024 at 11:18 AM Maciej wrote: > +1 > > Best regards, > Maciej

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 4/25/24 6:21 PM, Reynold Xin wrote: +1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: +1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: FYI, there is a proposal to drop

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: > +1 > > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun > wrote: > >> FYI, there is a proposal to drop Python 3.8 because its EOL is October >> 2024. >> >> https://github.com/apache/spark/pull/46228 >> [SPARK-47993][PYTHON] Drop Python 3.8

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Santosh Pingale
+1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: > FYI, there is a proposal to drop Python 3.8 because its EOL is October > 2024. > > https://github.com/apache/spark/pull/46228 > [SPARK-47993][PYTHON] Drop Python 3.8 > > Since it's still alive and there will be an overlap between the

[FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Dongjoon Hyun
FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024. https://github.com/apache/spark/pull/46228 [SPARK-47993][PYTHON] Drop Python 3.8 Since it's still alive and there will be an overlap between the lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your
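Since Python 3.8 reaches EOL in October 2024, dropping it usually means adding a version floor so users get a clear error instead of obscure failures. A minimal sketch of such a runtime guard (the 3.9 floor and the function name here are illustrative assumptions, not Spark's actual code):

```python
import sys

MIN_PYTHON = (3, 9)  # assumed floor once Python 3.8 support is dropped


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); micro releases don't affect support.
    return tuple(version_info[:2]) >= minimum


if not meets_minimum():
    raise RuntimeError(
        "Python %d.%d+ is required; found %d.%d"
        % (MIN_PYTHON + tuple(sys.version_info[:2]))
    )
```

A guard like this would typically live in the package's top-level `__init__` or in `setup.py`/`pyproject.toml` via `python_requires`.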

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, but it's in memory and not persisted, which is much faster. As I said, I believe most of the interaction with it happens during planning and save rather than during actual query runs, and those operations are short and minimal compared to data fetching and manipulation, so I don't believe it

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Well, I will be surprised, because the Derby database is single-threaded and won't be of much use here. Most Hive metastores in the commercial world use Postgres or Oracle as the metastore database; these are battle-proven, replicated, and backed up. Mich Talebzadeh, Technologist | Architect | Data Engineer |

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Yes, an in-memory Hive catalog backed by a local Derby DB. And again, I presume that most metadata-related parts happen during planning and not the actual run, so I don't see why it should strongly affect query performance. Thanks, On Thu, Apr 25, 2024, 17:29 Mich Talebzadeh <

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
With regard to your point below "The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
It's for the data source. For example, Spark's built-in Parquet reader/writer is faster than the Hive serde Parquet reader/writer. On Thu, Apr 25, 2024 at 9:55 PM Mich Talebzadeh wrote: > I see a statement made as below and I quote > > "The proposal of SPARK-46122 is to switch the default

Re: Which version of spark version supports parquet version 2 ?

2024-04-25 Thread Prem Sahoo
Hello Spark, After discussing with the Parquet and PyArrow communities, we can use the below config so that Spark can write Parquet V2 files: *"hadoopConfiguration.set("parquet.writer.version", "v2")"; Parquet files created with this setting are V2.* *Could you please confirm?* >
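One documented way to get a key like `parquet.writer.version` into the Hadoop Configuration (which Parquet's writer reads) is Spark's `spark.hadoop.*` passthrough: any Spark conf key with that prefix is stripped of the prefix and forwarded to the Hadoop Configuration. As a minimal, Spark-free sketch of that prefix-stripping mechanism (the function name and conf dict are illustrative, not Spark's actual code):

```python
def hadoop_conf_from_spark_conf(spark_conf):
    """Extract Hadoop settings from a Spark conf mapping.

    Spark forwards any key prefixed with 'spark.hadoop.' (prefix stripped)
    into the Hadoop Configuration, which is where Parquet's writer looks
    for 'parquet.writer.version'.
    """
    prefix = "spark.hadoop."
    return {
        key[len(prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith(prefix)
    }


conf = {
    "spark.app.name": "parquet-v2-sketch",
    # Requests Parquet format-v2 data pages from the writer.
    "spark.hadoop.parquet.writer.version": "v2",
}
print(hadoop_conf_from_spark_conf(conf))  # {'parquet.writer.version': 'v2'}
```

In a real job the same effect comes from `spark-submit --conf spark.hadoop.parquet.writer.version=v2` or `SparkSession.builder.config("spark.hadoop.parquet.writer.version", "v2")`, with no code changes.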

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Thanks for the detailed answer. The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their tables

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
My take regarding your question is that your mileage varies, so to speak. 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using Hive

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I will also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used... Thanks Nimrod On Thu, Apr 25, 2024, 14:27 Mich Talebzadeh < mich.talebza...@gmail.com>: > I see a statement made as below and I quote > >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
I see a statement made as below and I quote "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better." Can you please elaborate on the above specifically with regard to the phrase ".. because we

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1 On Thu, Apr 25, 2024 at 2:46 PM Kent Yao wrote: > +1 > > Nit: the umbrella ticket is SPARK-44111, not SPARK-4. > > Thanks, > Kent Yao > > Dongjoon Hyun 于2024年4月25日周四 14:39写道: > > > > Hi, All. > > > > It's great to see community activities to polish 4.0.0 more and more. > > Thank you

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Kent Yao
+1 Nit: the umbrella ticket is SPARK-44111, not SPARK-4. Thanks, Kent Yao On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun wrote: > > Hi, All. > > It's great to see community activities to polish 4.0.0 more and more. > Thank you

[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-24 Thread Dongjoon Hyun
Hi, All. It's great to see community activities to polish 4.0.0 more and more. Thank you all. I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-4 (Prepare Apache Spark 4.0.0), - https://issues.apache.org/jira/browse/SPARK-46122 Set
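The flag under discussion only affects `CREATE TABLE` statements that omit a `USING` clause: with the legacy value `true` they create Hive text-serde tables, while `false` makes them create Spark native tables in the default data source format. A pure-Python sketch of that resolution rule (illustrative names, not Spark's actual code):

```python
def table_provider(explicit_using=None,
                   legacy_create_hive_table_by_default=True,
                   default_source="parquet"):
    """Sketch of the provider-resolution rule SPARK-46122 proposes to change.

    A CREATE TABLE with an explicit USING clause is unaffected. Without one,
    the legacy flag selects the Hive serde; flipping it to false selects the
    Spark native default source (spark.sql.sources.default) instead.
    """
    if explicit_using is not None:
        return explicit_using
    return "hive" if legacy_create_hive_table_by_default else default_source


# Legacy default: CREATE TABLE t(id INT) -> Hive serde table
print(table_provider())  # hive
# Proposed default: the same statement creates a native parquet table
print(table_provider(legacy_create_hive_table_by_default=False))  # parquet
# An explicit USING clause always wins, under either setting
print(table_provider(explicit_using="orc"))  # orc
```

Either way, users who want a specific format can keep writing `CREATE TABLE ... USING <format>` and see no behaviour change from the flag flip.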

[FYI] SPARK-47046: Apache Spark 4.0.0 Dependency Audit and Cleanup

2024-04-21 Thread Dongjoon Hyun
Hi, All. As a part of Apache Spark 4.0.0 (SPARK-44111), we have been doing dependency audits. Today, we want to share the current readiness of Apache Spark 4.0.0 and get your feedback for further completeness. https://issues.apache.org/jira/browse/SPARK-44111 Prepare Apache Spark 4.0.0
