Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, It is a legacy config we should get rid of in the end, and it has been tested in production for a very long time. Spark should create a Spark table by default. On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh wrote: > Your point > > ".. t's a surprise to me to see that someone has different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
+1 It's a legacy conf that we should eventually remove. Spark should create a Spark table by default, not a Hive table. Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (that you agreed on). It's a bit awkward to stop in the middle

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
? I'm not sure why you think in that direction. What I wrote was the following. - You voted +1 for SPARK-44444 on April 14th (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) - You voted -1 for SPARK-46122 on April 26th.

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Mich Talebzadeh
Your point ".. t's a surprise to me to see that someone has different positions in a very short period of time in the community" Well, I have been with Spark since 2015, and this is the article on Medium dated February 7, 2016 with regard to both Hive and Spark, also presented in

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
It's a surprise to me to see that someone has different positions in a very short period of time in the community. Mich cast +1 for SPARK-44444 and -1 for SPARK-46122. - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc -

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Mich Talebzadeh
Hi @Wenchen Fan Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1 /opt/spark/conf/hive-site.xml ->
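Mich's soft-link setup can be sketched as below. This is a minimal demo using temporary stand-in directories, since the real paths (`/opt/spark`, Hive's conf directory) vary per installation:

```shell
# Demo of the hive-site.xml soft-link approach with stand-in paths.
SPARK_HOME=$(mktemp -d)            # stand-in for /opt/spark
HIVE_CONF_DIR=$(mktemp -d)         # stand-in for Hive's real conf directory
mkdir -p "$SPARK_HOME/conf"
echo '<configuration/>' > "$HIVE_CONF_DIR/hive-site.xml"

# Point Spark at Hive's config without copying it:
ln -sfn "$HIVE_CONF_DIR/hive-site.xml" "$SPARK_HOME/conf/hive-site.xml"
readlink "$SPARK_HOME/conf/hive-site.xml"
```

On a real cluster the link target would be the Hive installation's hive-site.xml, so Spark picks up the metastore connection settings at startup.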

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Wenchen Fan
@Mich Talebzadeh thanks for sharing your concern! Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create Spark native table in this case,
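Wenchen's point can be made concrete with a short sketch. With the proposed default (`spark.sql.legacy.createHiveTableByDefault=false`), a `CREATE TABLE` without a `USING` clause produces a Spark native data source table (using `spark.sql.sources.default`, normally Parquet), while setting the flag back to `true` restores the legacy Hive-serde behavior. Table and column names here are illustrative only:

```sql
-- With the new default, this creates a Spark native (data source) table:
CREATE TABLE events (id BIGINT, ts TIMESTAMP);

-- Equivalent explicit form, independent of the flag:
CREATE TABLE events_native (id BIGINT, ts TIMESTAMP) USING parquet;

-- Opting back into the legacy behavior for a session:
SET spark.sql.legacy.createHiveTableByDefault=true;
CREATE TABLE events_legacy (id BIGINT, ts TIMESTAMP);  -- Hive serde table
```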

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-27 Thread Hussein Awala
+1 On Saturday, April 27, 2024, John Zhuge wrote: > +1 > > On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: > >> +1 >> >> yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: >> > >> > +1 >> > >> > >> > >> > From: Ruifeng Zheng >> > Date: Friday, Apr 26, 2024, 15:05 >> > To: Xinrong Meng >> > Cc: Dongjoon Hyun ,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread John Zhuge
+1 On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: > +1 > > yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: > > > > +1 > > > > > > > > From: Ruifeng Zheng > > Date: Friday, Apr 26, 2024, 15:05 > > To: Xinrong Meng > > Cc: Dongjoon Hyun , "dev@spark.apache.org" < > dev@spark.apache.org> > > Subject: Re: [FYI]

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Cheng Pan
+1 (non-binding) Thanks, Cheng Pan On Sat, Apr 27, 2024 at 9:29 AM Holden Karau wrote: > > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Zhou Jiang
+1 (non-binding) On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
-1 for me Do not change spark.sql.legacy.createHiveTableByDefault because: 1. We have not had enough time to "DISCUSS" this matter. The discussion thread was opened almost 24 hours ago. 2. Compatibility: Changing the default behavior could potentially break existing workflows or

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh wrote: > +1 > > On Fri, Apr 26, 2024

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread L. C. Hsieh
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: Which version of spark version supports parquet version 2 ?

2024-04-26 Thread Prem Sahoo
Confirmed, closing this. Thanks everyone for the valuable information. Sent from my iPhone > On Apr 25, 2024, at 9:55 AM, Prem Sahoo wrote: > > Hello Spark, > After discussing with the Parquet and PyArrow community, we can use the > below config so that Spark can write Parquet V2 files. >

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Gengliang Wang
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
I'll start with my +1. Dongjoon. On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: >

[VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault to `false` by default. The technical scope is defined in the following PR. - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd - JIRA: https://issues.apache.org/jira/browse/SPARK-46122 - PR:

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, LiangChi. I'll start a vote. To Mich, for your question: Apache Spark has a long history of converting Hive-provider tables into Spark's datasource tables to handle them better in a Spark way. > Can you please elaborate on the above specifically with

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread Kent Yao
+1 yangjie01 wrote on Fri, Apr 26, 2024 at 17:16: > > +1 > > > > From: Ruifeng Zheng > Date: Friday, Apr 26, 2024, 15:05 > To: Xinrong Meng > Cc: Dongjoon Hyun , "dev@spark.apache.org" > > Subject: Re: [FYI] SPARK-47993: Drop Python 3.8 > > > > +1 > > > > On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng wrote: > > +1 > >

Survey: To Understand the requirements regarding TRAINING & TRAINING CONTENT in your ASF project

2024-04-26 Thread Mirko Kämpf
Hello ASF people, As a member of the ASF Training (Incubating) project, and in preparation for our presentation at the CoC conference in June in Bratislava, we are conducting a survey. The purpose is this: We want to understand the requirements regarding training materials and procedures in various ASF

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
Hi, I would like to add a side note regarding the discussion process and the current title of the proposal. The title '[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific configuration parameter, which might lead some participants to overlook its

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread yangjie01
+1 From: Ruifeng Zheng Date: Friday, Apr 26, 2024, 15:05 To: Xinrong Meng Cc: Dongjoon Hyun , "dev@spark.apache.org" Subject: Re: [FYI] SPARK-47993: Drop Python 3.8 +1 On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng mailto:xinr...@apache.org>> wrote: +1 On Thu, Apr 25, 2024 at 2:08 PM Holden Karau

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread Ruifeng Zheng
+1 On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng wrote: > +1 > > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau > wrote: > >> +1 >> >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 >>

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang wrote: > +1 > > On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > >> Of course, I can't think of a scenario of thousands of tables with a single >> in-memory Spark cluster with an in-memory catalog. >> Thanks for the help! >> >> On Thu, Apr 25

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Yuming Wang
+1 On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > Of course, I can't think of a scenario of thousands of tables with a single > in-memory Spark cluster with an in-memory catalog. > Thanks for the help! > > On Thu, Apr 25, 2024, 23:56, Mich Talebzadeh <

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Denny Lee
+1 (non-binding) On Thu, Apr 25, 2024 at 19:26 Xinrong Meng wrote: > +1 > > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau > wrote: > >> +1 >> >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Xinrong Meng
+1 On Thu, Apr 25, 2024 at 2:08 PM Holden Karau wrote: > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Thu,

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, I can't think of a scenario of thousands of tables with a single in-memory Spark cluster with an in-memory catalog. Thanks for the help! On Thu, Apr 25, 2024, 23:56, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > > > Agreed. In scenarios where most of the interactions with

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
ok thanks got it Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 11:19 AM Maciej wrote: > > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/25/24 6:21 PM, Reynold Xin wrote: > > +1 > > On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale > wrote: >> >> +1 >> >> On Thu, Apr 25,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Thu, Apr 25, 2024 at 11:18 AM Maciej wrote: > +1 > > Best regards, > Maciej

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 4/25/24 6:21 PM, Reynold Xin wrote: +1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: +1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: FYI, there is a proposal to drop

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: > +1 > > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun > wrote: > >> FYI, there is a proposal to drop Python 3.8 because its EOL is October >> 2024. >> >> https://github.com/apache/spark/pull/46228 >> [SPARK-47993][PYTHON] Drop Python 3.8

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Santosh Pingale
+1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: > FYI, there is a proposal to drop Python 3.8 because its EOL is October > 2024. > > https://github.com/apache/spark/pull/46228 > [SPARK-47993][PYTHON] Drop Python 3.8 > > Since it's still alive and there will be an overlap between the

[FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Dongjoon Hyun
FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024. https://github.com/apache/spark/pull/46228 [SPARK-47993][PYTHON] Drop Python 3.8 Since it's still alive and there will be an overlap between the lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, but it's in memory and not persisted, which is much faster, and as I said, I believe that most of the interaction with it is during planning and save, not actual query run operations, and those are short and minimal compared to data fetching and manipulation, so I don't believe it

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Well, I would be surprised, because the Derby database is single threaded and won't be of much use here. Most Hive metastores in the commercial world use Postgres or Oracle, which are battle proven, replicated and backed up. Mich Talebzadeh, Technologist | Architect | Data Engineer |

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Yes, an in-memory Hive catalog backed by a local Derby DB. And again, I presume that most metadata-related parts happen during planning and not the actual run, so I don't see why it should strongly affect query performance. Thanks, On Thu, Apr 25, 2024, 17:29, Mich Talebzadeh <

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
With regard to your point below "The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
It's for the data source. For example, Spark's built-in Parquet reader/writer is faster than the Hive serde Parquet reader/writer. On Thu, Apr 25, 2024 at 9:55 PM Mich Talebzadeh wrote: > I see a statement made as below and I quote > > "The proposal of SPARK-46122 is to switch the default
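The distinction Wenchen describes shows up directly in DDL. A minimal sketch (table names are illustrative only):

```sql
-- Spark native data source table: read and written through Spark's
-- built-in (vectorized) Parquet support:
CREATE TABLE t_native (id BIGINT) USING parquet;

-- Hive-style table: declared via Hive syntax and, absent conversion,
-- handled through the Hive serde read/write path:
CREATE TABLE t_hive (id BIGINT) STORED AS PARQUET;
```

Note that `spark.sql.hive.convertMetastoreParquet` (true by default) lets Spark substitute its own Parquet reader even for Hive serde Parquet tables, which is one reason the serde path is rarely what users actually want.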

Re: Which version of spark version supports parquet version 2 ?

2024-04-25 Thread Prem Sahoo
Hello Spark, After discussing with the Parquet and PyArrow community, we can use the below config so that Spark can write Parquet V2 files: "hadoopConfiguration.set("parquet.writer.version", "v2")". Parquet files created with this set are V2 Parquet. Could you please confirm? >
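For reference, the `parquet.writer.version` key Prem mentions is a Hadoop configuration property, so one way to pass it to a Spark job is Spark's documented `spark.hadoop.*` prefix, e.g. in `spark-defaults.conf` or via `spark-submit --conf`. This is a sketch only; note Ryan Blue's caveat elsewhere in this thread that v2 encodings are not a finalized spec and are not recommended:

```properties
# Propagated by Spark into the Hadoop Configuration seen by Parquet writers
spark.hadoop.parquet.writer.version  v2
```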

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Thanks for the detailed answer. The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their tables

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
My take regarding your question is that your mileage varies, so to speak. 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop centric (say on-premise), using Hive

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I would also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used... Thanks Nimrod On Thu, Apr 25, 2024, 14:27, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > I see a statement made as below and I quote > >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
I see a statement made as below and I quote "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better." Can you please elaborate on the above specifically with regard to the phrase ".. because we

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1 On Thu, Apr 25, 2024 at 2:46 PM Kent Yao wrote: > +1 > > Nit: the umbrella ticket is SPARK-44111, not SPARK-4. > > Thanks, > Kent Yao > > Dongjoon Hyun wrote on Thu, Apr 25, 2024 at 14:39: > > > > Hi, All. > > > > It's great to see community activities to polish 4.0.0 more and more. > > Thank you

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Kent Yao
+1 Nit: the umbrella ticket is SPARK-44111, not SPARK-4. Thanks, Kent Yao Dongjoon Hyun wrote on Thu, Apr 25, 2024 at 14:39: > > Hi, All. > > It's great to see community activities to polish 4.0.0 more and more. > Thank you all. > > I'd like to bring SPARK-46122 (another SQL topic) to you from the

[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-24 Thread Dongjoon Hyun
Hi, All. It's great to see community activities to polish 4.0.0 more and more. Thank you all. I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44111 (Prepare Apache Spark 4.0.0), - https://issues.apache.org/jira/browse/SPARK-46122 Set

[FYI] SPARK-47046: Apache Spark 4.0.0 Dependency Audit and Cleanup

2024-04-21 Thread Dongjoon Hyun
Hi, All. As a part of Apache Spark 4.0.0 (SPARK-44111), we have been doing dependency audits. Today, we want to share the current readiness of Apache Spark 4.0.0 and get your feedback for further completeness. https://issues.apache.org/jira/browse/SPARK-44111 Prepare Apache Spark 4.0.0

Re: [DISCUSS] Un-deprecate Trigger.Once

2024-04-21 Thread Jungtaek Lim
While I understand your concern about confusion from reverting the decision on deprecation, we had a revert of deprecation against an API which had been deprecated for multiple years before the decision was reverted. See SPARK-32686. Maybe we had more cases,

Re: [DISCUSS] Un-deprecate Trigger.Once

2024-04-19 Thread Dongjoon Hyun
For that case, I believe it's enough for us to revise the deprecation message only, making sure that Apache Spark will keep it without removal for backward-compatibility purposes only. That's what the users asked for, isn't it? > deprecation of Trigger.Once confuses users that the trigger won't

[DISCUSS] Un-deprecate Trigger.Once

2024-04-19 Thread Jungtaek Lim
Hi dev, I'd like to raise a discussion to un-deprecate Trigger.Once in future releases. I've proposed deprecation of Trigger.Once because it's semantically broken and we made a change, but we've realized that there are really users who strictly require the behavior of Trigger.Once (only run a

[DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-04-19 Thread Anton Okolnychyi
Hi folks, I'd like to start a discussion on SPARK-44167 that aims to enable catalogs to expose custom routines as stored procedures. I believe this functionality will enhance Spark’s ability to interact with external connectors and allow users to perform more operations in plain SQL. SPIP [1]

Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements, but committing to storing all your data in an unstable format is, well, "bold". For temporary data as part of a workflow, though, it could be appealing. Now, assuming you are going to be working with S3, you might want to start with merging PARQUET-2117 into

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Prem Sahoo
Thanks for the below information. Sent from my iPhone On Apr 18, 2024, at 3:31 AM, Bjørn Jørgensen wrote: "Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3! Spark 3.4.3 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this

[VOTE][RESULT] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
The vote passes with 10 +1s (8 binding +1s). Thanks to all who helped with the release! (* = binding) +1: - Dongjoon Hyun * - Mridul Muralidharan * - Wenchen Fan * - Liang-Chi Hsieh * - Gengliang Wang * - Hyukjin Kwon * - Bo Yang - DB Tsai * - Kent Yao - Huaxin Gao * +0: None -1: None

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
This vote passed. I'll conclude this vote. Dongjoon On 2024/04/17 03:11:36 huaxin gao wrote: > +1 > > On Tue, Apr 16, 2024 at 6:55 PM Kent Yao wrote: > > > +1 (non-binding) > > > > Thanks, > > Kent Yao > > > > bo yang wrote on Wed, Apr 17, 2024 at 09:49: > > > > > > +1 > > > > > > On Tue, Apr 16, 2024 at

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Bjørn Jørgensen
"Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written using Parquet V2 instead of Parquet V1. The aforementioned query performance tests utilized the C3

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Mich Talebzadeh
Hi Prem, Your question about writing Parquet v2 with Spark 3.2.0. Spark 3.2.0 Limitations: Spark 3.2.0 doesn't have a built-in way to explicitly force Parquet v2 encoding. As we saw previously, even Spark 3.4 created a file with parquet-mr version, indicating v1 encoding. Dremio v2 Support: As

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Ryan, May I know how you can write Parquet V2 encoding from spark 3.2.0 ? As per my knowledge Dremio is creating and reading Parquet V2. "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. However, these

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Ryan Blue
Prem, as I said earlier, v2 is not a finalized spec so you should not use it. That's why it is not the default. You can get Spark to write v2 files, but it isn't recommended by the Parquet community. On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo wrote: > Hello Community, > Could anyone shed more

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Community, Could anyone shed more light on this (Spark Supporting Parquet V2)? On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh wrote: > Hi Prem, > > Regrettably this is not my area of speciality. I trust another colleague > will have a more informed idea. Alternatively you may raise an

Re: [DISCUSS] Spark 4.0.0 release

2024-04-17 Thread Wenchen Fan
Thank you all for the replies! To @Nicholas Chammas : Thanks for cleaning up the error terminology and documentation! I've merged the first PR and let's finish others before the 4.0 release. To @Dongjoon Hyun : Thanks for driving the ANSI on by default effort! Now the vote has passed, let's

[VOTE][RESULT] SPARK-44444: Use ANSI SQL mode by default

2024-04-17 Thread Dongjoon Hyun
The vote passes with 24 +1s (13 binding +1s). Thanks to all who helped with the vote! (* = binding) +1: - Dongjoon Hyun * - Gengliang Wang * - Chao Sun * - Hyukjin Kwon * - Liang-Chi Hsieh * - Holden Karau * - Huaxin Gao * - Denny Lee - Xiao Li * - Mich Talebzadeh - Christiano Anderson - Yang Jie

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-17 Thread Dongjoon Hyun
Thank you all. The vote passed. I'll conclude this vote. Dongjoon. On 2024/04/16 04:58:39 Arun Dakua wrote: > +1 > > On Tue, Apr 16, 2024 at 12:50 AM Josh Rosen wrote: > > > +1 > > > > On Mon, Apr 15, 2024 at 11:26 AM Maciej wrote: > > > >> +1 > >> > >> Best regards, > >> Maciej

Re: [DISCUSS] Spark 4.0.0 release

2024-04-16 Thread Cheng Pan
Will we have a preview release for 4.0.0 like we did for 2.0.0 and 3.0.0? Thanks, Cheng Pan > On Apr 15, 2024, at 09:58, Jungtaek Lim wrote: > > W.r.t. state data source - reader (SPARK-45511), there are several follow-up > tickets, but we don't plan to address them soon. The current

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread huaxin gao
+1 On Tue, Apr 16, 2024 at 6:55 PM Kent Yao wrote: > +1 (non-binding) > > Thanks, > Kent Yao > > bo yang wrote on Wed, Apr 17, 2024 at 09:49: > > > > +1 > > > > On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon > wrote: > >> > >> +1 > >> > >> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: > >>> > >>> +1 >

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Kent Yao
+1 (non-binding) Thanks, Kent Yao bo yang wrote on Wed, Apr 17, 2024 at 09:49: > > +1 > > On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon wrote: >> >> +1 >> >> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: >>> >>> +1 >>> >>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: >>> > >>> > +1 >>> > >>> > On

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread DB Tsai
+1 Sent from my iPhone On Apr 16, 2024, at 3:11 PM, bo yang wrote: +1 On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon wrote: +1 On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: +1 On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: > > +1 >

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread bo yang
+1 On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon wrote: > +1 > > On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: > >> +1 >> >> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: >> > >> > +1 >> > >> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun >> wrote: >> >> >> >> I'll start with my

Configuration to disable file exists in DataSource

2024-04-16 Thread Romain Ardiet
Hi community, When using DataFrameReader to read Parquet files located on S3, there is no way to disable the file existence checks done by the driver. My use case is that I have a Spark job reading a list of S3 files generated by an upstream job. This list can contain thousands of files. The process

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Hyukjin Kwon
+1 On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: > +1 > > On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: > > > > +1 > > > > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun > wrote: > >> > >> I'll start with my +1. > >> > >> - Checked checksum and signature > >> - Checked

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Hi Prem, Regrettably this is not my area of speciality. I trust another colleague will have a more informed idea. Alternatively you may raise an SPIP for it. Spark Project Improvement Proposals (SPIP) | Apache Spark HTH Mich Talebzadeh,

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Gengliang Wang
+1 On Tue, Apr 16, 2024 at 11:57 AM L. C. Hsieh wrote: > +1 > > On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: > > > > +1 > > > > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun > wrote: > >> > >> I'll start with my +1. > >> > >> - Checked checksum and signature > >> - Checked

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread L. C. Hsieh
+1 On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan wrote: > > +1 > > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun wrote: >> >> I'll start with my +1. >> >> - Checked checksum and signature >> - Checked Scala/Java/R/Python/SQL Document's Spark version >> - Checked published Maven artifacts >> -

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Mich, Thanks for the example. I have the same parquet-mr version, which creates Parquet version 1. We need to create V2 as it is more optimized. We have Dremio, where if we use Parquet V2 it is 75% better than Parquet V1 for reads and 25% better for writes, so we are inclined towards

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Well let us do a test in PySpark. Take this code and create a default parquet file. My spark is 3.4 cat parquet_checxk.py from pyspark.sql import SparkSession spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate() data = [("London", 8974432), ("New York City", 8804348),

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Community, Could any of you shed some light on the below questions please? Sent from my iPhone On Apr 15, 2024, at 9:02 PM, Prem Sahoo wrote: Any specific reason Spark does not support, or the community doesn't want to go to, Parquet V2, which is more optimized and whose reads and writes are much faster

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Wenchen Fan
+1 On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun wrote: > I'll start with my +1. > > - Checked checksum and signature > - Checked Scala/Java/R/Python/SQL Document's Spark version > - Checked published Maven artifacts > - All CIs passed. > > Thanks, > Dongjoon. > > On 2024/04/15 04:22:26

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Arun Dakua
+1 On Tue, Apr 16, 2024 at 12:50 AM Josh Rosen wrote: > +1 > > On Mon, Apr 15, 2024 at 11:26 AM Maciej wrote: > >> +1 >> >> Best regards, >> Maciej Szymkiewicz >> >> Web: https://zero323.net >> PGP: A30CEF0C31A501EC >> >> On 4/15/24 8:16 PM, Rui Wang wrote: >> >> +1, non-binding. >> >> Thanks

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-15 Thread Mridul Muralidharan
+1 Signatures, digests, etc check out fine. Checked out tag and build/tested with -Phive -Pyarn -Pkubernetes Regards, Mridul On Sun, Apr 14, 2024 at 11:31 PM Dongjoon Hyun wrote: > I'll start with my +1. > > - Checked checksum and signature > - Checked Scala/Java/R/Python/SQL Document's

[VOTE][RESULT] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-15 Thread L. C. Hsieh
Hi all, The vote passes with 7 +1s (5 binding +1s). (* = binding) +1: Dongjoon Hyun(*) Liang-Chi Hsieh(*) Huaxin Gao(*) Bo Yang Xiao Li(*) Chao Sun(*) Hussein Awala +0: None -1: None Thanks. - To unsubscribe e-mail:

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Any specific reason Spark does not support, or the community doesn't want to move to, Parquet V2, which is more optimized and whose reads and writes are much faster (from another component which I am using)? On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue wrote: > Spark will read data written with v2 encodings just

Request Review for [SPARK-46992]Fix cache consistency

2024-04-15 Thread Jay Han
Hi community, I fixed the cache-consistency issue SPARK-46992 a long time ago. I'd appreciate it if someone could help review this PR! -- Best, Jay

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Spark will read data written with v2 encodings just fine. You just don't need to worry about making Spark produce v2. And you should probably also not produce v2 encodings from other systems. On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo wrote: > oops but so spark does not support parquet V2 atm
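Ryan's point that readers handle v2 fine while writers should stick to v1 follows from where the two formats actually differ: v1 and v2 differ in page-level encodings recorded in the file metadata, not in the file container itself, which begins and ends with the same 4-byte magic in both cases. A minimal sketch (the helper name and the fabricated byte strings are illustrative, not from this thread) of checking only that shared container magic:

```python
# Illustrative helper (not from the thread): Parquet "v1" and "v2" files
# share the same 4-byte container magic, PAR1, at the start and end of the
# file. The writer version therefore cannot be told apart from the magic
# alone; v2 shows up as different page encodings in the file metadata.
def looks_like_parquet(raw: bytes) -> bool:
    """Return True if the byte string carries the Parquet container magic."""
    return len(raw) >= 8 and raw[:4] == b"PAR1" and raw[-4:] == b"PAR1"

# Fabricated byte strings for demonstration only:
print(looks_like_parquet(b"PAR1" + b"\x00" * 16 + b"PAR1"))  # True
print(looks_like_parquet(b"not parquet"))                    # False
```

This is why tooling that prints the page metadata (rather than the header) is needed to tell which encodings a given file actually uses.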

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Oops, so Spark does not support Parquet V2 at the moment? We have a use case where we need Parquet V2, as one of our components uses Parquet V2. On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue wrote: > Hi Prem, > > Parquet v1 is the default because v2 has not been finalized and adopted by > the

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Hi Prem, Parquet v1 is the default because v2 has not been finalized and adopted by the community. I highly recommend not using v2 encodings at this time. Ryan On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo wrote: > I am using spark 3.2.0 . but my spark package comes with parquet-mr 1.2.1 > which

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1, which writes Parquet version 1, not version 2 :(. So I was looking at how to write in Parquet version 2. On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh wrote: > Sorry you have a point there. It was released in
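For what Prem is asking, Spark has no first-class SQL option for the writer version, but parquet-mr reads a Hadoop configuration key for it, and Spark forwards any spark.hadoop.* key into the Hadoop configuration. A hedged config sketch only (the job script name is a placeholder, and per the advice in this thread the v2 setting should be treated as experimental):

```
# Config fragment, not an endorsement: "parquet.writer.version" is
# parquet-mr's writer switch (values PARQUET_1_0 / PARQUET_2_0), passed
# through via Spark's spark.hadoop.* prefix.
spark-submit \
  --conf spark.hadoop.parquet.writer.version=PARQUET_2_0 \
  your_job.py
```

Whether the resulting files are readable by the "other component" mentioned upthread would still need to be verified end to end.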

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Sorry, you have a point there. It was released in version 3.0.0. What version of Spark are you using?

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you so much for the info! But do we have any release notes that say Spark 2.4.0 onwards supports Parquet version 2? I was under the impression it started being supported from Spark 3.0 onwards. On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh wrote: > Well if I am correct, Parquet version 2

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Well if I am correct, Parquet version 2 support was introduced in Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports Parquet version 2. Assuming that you are using Spark version 2.4.0 or later, you should be able to take advantage of Parquet version 2 features. HTH

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you for the information! I can use any version of parquet-mr to produce a Parquet file. Regarding the 2nd question: which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned? On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh wrote:

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Josh Rosen
+1 On Mon, Apr 15, 2024 at 11:26 AM Maciej wrote: > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/15/24 8:16 PM, Rui Wang wrote: > > +1, non-binding. > > Thanks Dongjoon to drive this! > > > -Rui > > On Mon, Apr 15, 2024 at 10:10 AM

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Parquet-mr is a Java library that provides functionality for working with Parquet files in Hadoop. It is therefore geared towards working with Parquet files within the Hadoop ecosystem, particularly via MapReduce jobs. There is no definitive way to check exact compatible versions within
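Mich notes there is no definitive compatibility table. One pragmatic, purely illustrative check (the jars/ directory layout and jar-name pattern are assumptions about a standard Spark distribution, not something stated in this thread) is to read the bundled parquet-mr version off the jar file name:

```python
# Illustrative sketch: a stock Spark distribution ships parquet-mr as jars
# named like jars/parquet-hadoop-1.12.3.jar, so the bundled version can be
# read off the file name. The layout and name pattern are assumptions, not
# facts from the thread.
import pathlib
import re

def bundled_parquet_version(jars_dir: str):
    """Return the parquet-mr version string found in jars_dir, else None."""
    for jar in pathlib.Path(jars_dir).glob("parquet-hadoop-*.jar"):
        match = re.match(r"parquet-hadoop-(\d+(?:\.\d+)*)\.jar", jar.name)
        if match:
            return match.group(1)
    return None

# Example against a scratch directory holding a fake jar name:
import tempfile
scratch = tempfile.mkdtemp()
pathlib.Path(scratch, "parquet-hadoop-1.12.3.jar").touch()
print(bundled_parquet_version(scratch))  # 1.12.3
```

This only reveals what Spark bundles; whether a given parquet-mr release can read files produced elsewhere still has to be tested.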

Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Takuya UESHIN
+1 On Mon, Apr 15, 2024 at 11:17 AM Rui Wang wrote: > +1, non-binding. > > Thanks Dongjoon to drive this! > > > -Rui > > On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: > >> +1 >> >> Thank you @Dongjoon Hyun ! >> >> On Mon, Apr 15, 2024 at 6:33 AM beliefer wrote: >> >>> +1 >>> >>> >>> On

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 4/15/24 8:16 PM, Rui Wang wrote: +1, non-binding. Thanks Dongjoon to drive this! -Rui On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: +1 Thank you @Dongjoon Hyun

Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Rui Wang
+1, non-binding. Thanks Dongjoon to drive this! -Rui On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: > +1 > > Thank you @Dongjoon Hyun ! > > On Mon, Apr 15, 2024 at 6:33 AM beliefer wrote: >> +1 >> >> >> On 2024-04-15 15:54:07, "Peter Toth" wrote: >> >> +1 >> >> Wenchen Fan wrote
