Re: Apache Spark 3.2 Expectation

2021-07-01 Thread Gengliang Wang
Hi all,

I just cut branch-3.2 on GitHub and created version 3.3.0 on JIRA.
When merging PRs to the master branch before the 3.2.0 RC, please help
cherry-pick bug fixes and the ongoing major features mentioned in this
thread to branch-3.2. Thanks!
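For anyone new to the backport flow, the mechanics look roughly like this. The snippet builds a throwaway toy repo rather than touching a real Spark checkout; the branch layout mimics Spark's and the SPARK-00000 ticket number is a placeholder:

```shell
# Toy illustration of backporting a fix to a release branch.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "base"   # shared history
git branch branch-3.2                   # cut the release branch
echo "fix" > fix.txt                    # a bug fix lands on the dev branch
git add fix.txt
git commit -q -m "[SPARK-00000] Fix a bug"
fix_sha=$(git rev-parse HEAD)
git checkout -q branch-3.2
git cherry-pick -x "$fix_sha"           # -x records the original commit SHA
git log --oneline -1
```

The `-x` flag appends "(cherry picked from commit ...)" to the backported commit message, which is how provenance is usually tracked on release branches.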

On Fri, Jul 2, 2021 at 2:31 AM Dongjoon Hyun 
wrote:

> Thank you, Gengliang!
>
> On Wed, Jun 30, 2021 at 10:56 PM Gengliang Wang  wrote:
>
>> Hi all,
>>
>> Just as a gentle reminder, I will do the branch cut tomorrow. Please
>> focus on finalizing the work to land in Spark 3.2.0.
>> After the branch cut, we can still merge the ongoing major features
>> mentioned in this thread. There should be no other new features in branch
>> 3.2.
>> Thanks!
>>
>> On Thu, Jun 17, 2021 at 2:57 PM Hyukjin Kwon  wrote:
>>
>>> *GA -> QA
>>>
>>> On Thu, 17 Jun 2021, 15:16 Hyukjin Kwon,  wrote:
>>>
 I think we should make sure to treat these items in the list as
 exceptions from the code freeze, and discourage pushing new APIs and
 features otherwise.

 During the GA period, ideally we should focus on bug fixes and polishing.

 It would be great if we can speed up on these items in the list too.


 On Thu, 17 Jun 2021, 15:08 Gengliang Wang,  wrote:

> Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
> Now we have made it clear that it is a soft cut and we can still merge
> important code changes to branch-3.2 before the RC. Let's keep the branch
> cut date as July 1st.
>
> On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
> wrote:
>
>> > First, I think you are saying "branch-3.2";
>>
>> To Xiao: yes, it was a typo of "branch-3.2".
>>
>> > We do strongly prefer to cut the release for Spark 3.2.0 including
>> all the patches under SPARK-30602.
>> > This way, we can backport the other performance/operability
>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>> future Spark 3.2.x patch releases.
>>
>> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+
>> as Xiao wrote.
>>
>>
>>
>> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>>
>>> To Liang-Chi, I'm -1 for postponing the branch cut because this is a
 soft cut and the committers still are able to commit to `branch-3.3`
 according to their decisions.
>>>
>>>
>>> First, I think you are saying "branch-3.2";
>>>
>>> Second, the "soft cut" means there is no "code freeze", although we cut the
>>> branch. To avoid releasing half-baked and unready features, the release
>>> manager needs to be very careful when cutting the RC. Based on what is
>>> proposed here, the RC date is the actual code freeze date.
>>>
>>> This way, we can backport the other performance/operability
 enhancements tickets under SPARK-33235 into branch-3.2 to be released 
 in
 future Spark 3.2.x patch releases.
>>>
>>>
>>> This is not allowed based on the policy. Only bug fixes can be
>>> merged to the patch releases. Thus, if we know it will introduce major
>>> performance regression, we have to turn the feature off by default.
>>>
>>> Xiao
>>>
>>>
>>>
>>> Min Shen wrote on Wednesday, June 16, 2021 at 3:22 PM:
>>>
 Hi Gengliang,

 Thanks for volunteering as the release manager for Spark 3.2.0.
 Regarding the ongoing work of push-based shuffle in SPARK-30602, we
 are close to having all the patches merged to master to enable 
 push-based
 shuffle.
 Currently, there are 2 PRs under SPARK-30602 that are under active
 review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
 We should be able to post the PRs for the other 2 remaining tickets
 (SPARK-32923 and SPARK-35546) early next week.

 The tickets under SPARK-30602 are the minimum set of patches to
 enable push-based shuffle.
 We do have other performance/operability enhancements tickets under
 SPARK-33235 that are needed to fully contribute what we have 
 internally for
 push-based shuffle.
 However, these are optional for enabling push-based shuffle.
 We do strongly prefer to cut the release for Spark 3.2.0 including
 all the patches under SPARK-30602.
 This way, we can backport the other performance/operability
 enhancements tickets under SPARK-33235 into branch-3.2 to be released 
 in
 future Spark 3.2.x patch releases.
 I understand the preference of not postponing the branch cut date.
 We will check with Dongjoon regarding the soft cut date and the
 flexibility for including the remaining tickets under SPARK-30602 into
 branch-3.2.

 Best,
 Min
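For reference, once the SPARK-30602 patches landed, push-based shuffle in the released Spark 3.2 stayed off by default. A sketch of the configuration that enables it follows; the property names are as I recall them from the Spark 3.2 docs, so verify them against your release:

```properties
# Client side (spark-defaults.conf); push-based shuffle requires the
# external shuffle service on YARN.
spark.shuffle.service.enabled    true
spark.shuffle.push.enabled       true

# On the YARN external shuffle service, the merged-shuffle file manager
# (class name assumed from SPARK-30602; verify before use):
spark.shuffle.push.server.mergedShuffleFileManagerImpl  org.apache.spark.network.shuffle.RemoteBlockPushResolver
```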

 On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
 wrote:

>
> Thanks Dongjoon. I've 

Re: Spark on Kubernetes scheduler variety

2021-07-01 Thread Mich Talebzadeh
Hi,

A rather simple question.

As Kubernetes requires some effort to set up properly, do we have a
dev/test bed for conducting development work?

What I am trying to get at is whether, with official support for Volcano,
a vendor could provide free cluster usage in exchange for R & D. For
example, Google themselves?

Thanks,

Mich




   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 05:00, Mich Talebzadeh 
wrote:

> Hi Klaus,
>
> Thanks
>
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1289
>
>
>
>
>
>
>
>
> On Thu, 1 Jul 2021 at 03:16, Klaus Ma  wrote:
>
>> Hi Mich,
>>
>> Would you help to open an issue at spark-on-k8s-operator repo? We're
>> going to submit a PR to update the install steps :)
>>
>> -- Klaus
>>
>> On Wed, Jun 30, 2021 at 12:24 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Yikun
>>>
>>> In reference
>>>
>>>
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md
>>>
>>> Trying to install Volcano I am getting this error
>>>
>>> helm repo add incubator
>>> http://storage.googleapis.com/kubernetes-charts-incubator
>>> Error: looks like
>>> "http://storage.googleapis.com/kubernetes-charts-incubator" is not a
>>> valid chart repository or cannot be reached: failed to fetch
>>> http://storage.googleapis.com/kubernetes-charts-incubator/index.yaml:
>>> 404 Not Found
>>>
>>> Any ideas will be appreciated.
>>>
>>> Thanks,
>>>
>>> Mich
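For anyone hitting the 404 above: the old Helm "incubator" chart repository was decommissioned. A possible workaround is Volcano's own Helm charts; the repo URL and chart name below are taken from the volcano-sh project and should be verified against the current Volcano docs. The snippet is guarded so it does nothing where helm is absent:

```shell
# Volcano's own Helm chart repo (URL assumed from the volcano-sh project;
# check the official Volcano install docs for the current instructions).
if command -v helm >/dev/null 2>&1; then
  helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
  helm repo update
  helm install volcano volcano-sh/volcano \
    --namespace volcano-system --create-namespace
else
  echo "helm not found; skipping"
fi
```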
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 29 Jun 2021 at 09:14, Mich Talebzadeh 
>>> wrote:
>>>
 Cool, thanks!







 On Tue, 29 Jun 2021 at 07:33, Yikun Jiang  wrote:

> > Is this the correct link for integrating Volcano with Spark?
>
> Yes, that is the Kubernetes operator style of integrating Volcano. If you
> want to just use the spark-submit style to submit a job with native
> support, you can see [1] as a reference.
>
> [1]
> https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4
>
> Regards,
> Yikun
>
>
> Mich Talebzadeh wrote on Monday, June 28, 2021 at 6:03 PM:
>
>> Hi Yikun,
>>
>> Is this the correct link for integrating Volcano with Spark?
>>
>> spark-on-k8s-operator/volcano-integration.md at master ·
>> GoogleCloudPlatform/spark-on-k8s-operator · GitHub
>> 
>>
>> Thanks
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Fri, 25 Jun 2021 at 09:45, Yikun Jiang 
>> wrote:
>>
>>> Oops, sorry for the error link, it should be:
>>>
>>> We will also 

Re: Apache Spark 3.2 Expectation

2021-07-01 Thread Dongjoon Hyun
Thank you, Gengliang!

On Wed, Jun 30, 2021 at 10:56 PM Gengliang Wang  wrote:

> Hi all,
>
> Just as a gentle reminder, I will do the branch cut tomorrow. Please
> focus on finalizing the work to land in Spark 3.2.0.
> After the branch cut, we can still merge the ongoing major features
> mentioned in this thread. There should be no other new features in branch
> 3.2.
> Thanks!
>
> On Thu, Jun 17, 2021 at 2:57 PM Hyukjin Kwon  wrote:
>
>> *GA -> QA
>>
>> On Thu, 17 Jun 2021, 15:16 Hyukjin Kwon,  wrote:
>>
>>> I think we should make sure to treat these items in the list as
>>> exceptions from the code freeze, and discourage pushing new APIs and
>>> features otherwise.
>>>
>>> During the GA period, ideally we should focus on bug fixes and polishing.
>>>
>>> It would be great if we can speed up on these items in the list too.
>>>
>>>
>>> On Thu, 17 Jun 2021, 15:08 Gengliang Wang,  wrote:
>>>
 Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
 Now we have made it clear that it is a soft cut and we can still merge
 important code changes to branch-3.2 before the RC. Let's keep the branch
 cut date as July 1st.

 On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
 wrote:

> > First, I think you are saying "branch-3.2";
>
> To Xiao: yes, it was a typo of "branch-3.2".
>
> > We do strongly prefer to cut the release for Spark 3.2.0 including
> all the patches under SPARK-30602.
> > This way, we can backport the other performance/operability
> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
> future Spark 3.2.x patch releases.
>
> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+
> as Xiao wrote.
>
>
>
> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>
>> To Liang-Chi, I'm -1 for postponing the branch cut because this is a
>>> soft cut and the committers still are able to commit to `branch-3.3`
>>> according to their decisions.
>>
>>
>> First, I think you are saying "branch-3.2";
>>
>> Second, the "soft cut" means there is no "code freeze", although we cut the
>> branch. To avoid releasing half-baked and unready features, the release
>> manager needs to be very careful when cutting the RC. Based on what is
>> proposed here, the RC date is the actual code freeze date.
>>
>> This way, we can backport the other performance/operability
>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>> future Spark 3.2.x patch releases.
>>
>>
>> This is not allowed based on the policy. Only bug fixes can be merged
>> to the patch releases. Thus, if we know it will introduce major 
>> performance
>> regression, we have to turn the feature off by default.
>>
>> Xiao
>>
>>
>>
>> Min Shen wrote on Wednesday, June 16, 2021 at 3:22 PM:
>>
>>> Hi Gengliang,
>>>
>>> Thanks for volunteering as the release manager for Spark 3.2.0.
>>> Regarding the ongoing work of push-based shuffle in SPARK-30602, we
>>> are close to having all the patches merged to master to enable 
>>> push-based
>>> shuffle.
>>> Currently, there are 2 PRs under SPARK-30602 that are under active
>>> review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
>>> We should be able to post the PRs for the other 2 remaining tickets
>>> (SPARK-32923 and SPARK-35546) early next week.
>>>
>>> The tickets under SPARK-30602 are the minimum set of patches to
>>> enable push-based shuffle.
>>> We do have other performance/operability enhancements tickets under
>>> SPARK-33235 that are needed to fully contribute what we have internally 
>>> for
>>> push-based shuffle.
>>> However, these are optional for enabling push-based shuffle.
>>> We do strongly prefer to cut the release for Spark 3.2.0 including
>>> all the patches under SPARK-30602.
>>> This way, we can backport the other performance/operability
>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>> future Spark 3.2.x patch releases.
>>> I understand the preference of not postponing the branch cut date.
>>> We will check with Dongjoon regarding the soft cut date and the
>>> flexibility for including the remaining tickets under SPARK-30602 into
>>> branch-3.2.
>>>
>>> Best,
>>> Min
>>>
>>> On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
>>> wrote:
>>>

 Thanks Dongjoon. I've talked with Dongjoon offline to learn more about
 this.
 As it is a soft cut date, there is no reason to postpone it.

 It sounds good then to keep original branch cut date.

 Thank you.



 Dongjoon Hyun-2 wrote
 > Thank you for volunteering, Gengliang.
 >
 > Apache Spark 3.2.0 is the first version enabling AQE by default.
 

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh,

You need to check the latest compatibility matrix, i.e. which Spark
versions can successfully work as the Hive execution engine.

This is my old config file, alluding to spark-1.3.1 as the execution engine:

set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6;
--set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn-client;
set hive.execution.engine=spark;


Hive is great as a data warehouse, but the default MapReduce engine is
Jurassic Park.

On the other hand, Spark has a performant built-in API for Hive. Otherwise
you can connect to Hive on a remote cluster through JDBC.
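On the JDBC route, here is a minimal sketch. The host, the default port 10000, and the Hive driver class name are assumptions (the Hive JDBC jar must be on Spark's classpath), and Spark's JDBC dialect support for HiveServer2 is patchy, so treat this as a starting point rather than a drop-in solution:

```python
def read_hive_over_jdbc(spark, host, table):
    """Read a Hive table from a remote HiveServer2 over JDBC.

    Illustrative only: port 10000 and the driver class are the usual
    defaults, not guaranteed for your cluster.
    """
    return (spark.read.format("jdbc")
            .option("url", f"jdbc:hive2://{host}:10000/default")
            .option("driver", "org.apache.hive.jdbc.HiveDriver")
            .option("dbtable", table)
            .load())

# Usage (requires a live SparkSession and a reachable HiveServer2):
#   df = read_hive_over_jdbc(spark, "hiveserver.example.com", "test.randomDataPy")
```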

In Python you can do:

from pyspark.sql import SparkSession

# enableHiveSupport() is what gives spark.sql() access to the Hive metastore
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())


And use it like below:


fullyQualifiedTableName = "test.randomDataPy"
if spark.sql("SHOW TABLES IN test LIKE 'randomDataPy'").count() == 1:
    rows = spark.sql(f"SELECT COUNT(1) FROM {fullyQualifiedTableName}").collect()[0][0]
    print("number of rows is", rows)
else:
    print("\nTable test.randomDataPy does not exist, creating table")
    sqltext = """
    CREATE TABLE test.randomDataPy(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING VARCHAR(4000)
    )
    STORED AS PARQUET
    """
    spark.sql(sqltext)

HTH






On Thu, 1 Jul 2021 at 11:50, Pralabh Kumar  wrote:

> Hi Mich,
>
> Thanks for replying. Your answer really helps. The comparison was done in
> 2016; I would like to know the latest comparison with Spark 3.0.
>
> Also, what you are suggesting is to migrate the queries to Spark itself
> (i.e. HiveContext) rather than Hive on Spark, which is what Facebook also
> did. Is that understanding correct?
>
> Regards
> Pralabh
>
> On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
> wrote:
>
>> Hi Pralabh,
>>
>> This question has been asked before :)
>>
>> A few years ago (late 2016), I made a presentation on running Hive queries
>> on the Spark execution engine for Hortonworks.
>>
>>
>> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>>
>> The issue you will face will be compatibility problems with versions of
>> Hive and Spark.
>>
>> My suggestion would be to use Spark as a massively parallel processing
>> engine and Hive as a storage layer. However, you need to test what can be
>> migrated and what cannot.
>>
>> HTH
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar 
>> wrote:
>>
>>> Hi Dev
>>>
>>> I have thousands of legacy Hive queries. As part of a plan to move to
>>> Spark, we are planning to migrate the Hive queries to Spark. There are
>>> two approaches:
>>>
>>>
>>>    1. One is Hive on Spark, which is similar to changing the
>>>    execution engine in Hive queries, like Tez.
>>>    2. Another is migrating the Hive queries to HiveContext/Spark SQL,
>>>    an approach used by Facebook and presented at a Spark conference:
>>>
>>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>>
>>>
>>> Can you please guide me on which option to go for? I am personally
>>> inclined to go for option 2; it also allows the use of the latest Spark.
>>>
>>> Please help me with this, as there are not many comparisons available
>>> online with Spark 3.0 in perspective.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>>
>>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Mich,

Thanks for replying. Your answer really helps. The comparison was done in
2016; I would like to know the latest comparison with Spark 3.0.

Also, what you are suggesting is to migrate the queries to Spark itself
(i.e. HiveContext) rather than Hive on Spark, which is what Facebook also
did. Is that understanding correct?

Regards
Pralabh

On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
wrote:

> Hi Pralabh,
>
> This question has been asked before :)
>
> A few years ago (late 2016), I made a presentation on running Hive queries
> on the Spark execution engine for Hortonworks.
>
>
> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>
> The issue you will face will be compatibility problems with versions of
> Hive and Spark.
>
> My suggestion would be to use Spark as a massively parallel processing
> engine and Hive as a storage layer. However, you need to test what can be
> migrated and what cannot.
>
> HTH
>
>
> Mich
>
>
>
>
>
>
> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:
>
>> Hi Dev
>>
>> I have thousands of legacy Hive queries. As part of a plan to move to
>> Spark, we are planning to migrate the Hive queries to Spark. There are
>> two approaches:
>>
>>
>>    1. One is Hive on Spark, which is similar to changing the execution
>>    engine in Hive queries, like Tez.
>>    2. Another is migrating the Hive queries to HiveContext/Spark SQL, an
>>    approach used by Facebook and presented at a Spark conference:
>>
>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>
>>
>> Can you please guide me on which option to go for? I am personally inclined
>> to go for option 2; it also allows the use of the latest Spark.
>>
>> Please help me with this, as there are not many comparisons available
>> online with Spark 3.0 in perspective.
>>
>> Regards
>> Pralabh Kumar
>>
>>
>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh,

This question has been asked before :)

A few years ago (late 2016), I made a presentation on running Hive queries
on the Spark execution engine for Hortonworks.

https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations

The issue you will face will be compatibility problems with versions of
Hive and Spark.

My suggestion would be to use Spark as a massively parallel processing
engine and Hive as a storage layer. However, you need to test what can be
migrated and what cannot.

HTH


Mich






On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:

> Hi Dev
>
> I have thousands of legacy Hive queries. As part of a plan to move to
> Spark, we are planning to migrate the Hive queries to Spark. There are
> two approaches:
>
>
>    1. One is Hive on Spark, which is similar to changing the execution
>    engine in Hive queries, like Tez.
>    2. Another is migrating the Hive queries to HiveContext/Spark SQL, an
>    approach used by Facebook and presented at a Spark conference:
>
> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>
>
> Can you please guide me on which option to go for? I am personally inclined
> to go for option 2; it also allows the use of the latest Spark.
>
> Please help me with this, as there are not many comparisons available
> online with Spark 3.0 in perspective.
>
> Regards
> Pralabh Kumar
>
>
>


Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Dev

I have thousands of legacy Hive queries. As part of a plan to move to
Spark, we are planning to migrate the Hive queries to Spark. There are two
approaches:


   1. One is Hive on Spark, which is similar to changing the execution
   engine in Hive queries, like Tez.
   2. Another is migrating the Hive queries to HiveContext/Spark SQL, an
   approach used by Facebook and presented at a Spark conference:
   https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention


Can you please guide me on which option to go for? I am personally
inclined to go for option 2; it also allows the use of the latest Spark.

Please help me with this, as there are not many comparisons available
online with Spark 3.0 in perspective.
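For option 2, the mechanical core can be as small as replaying each legacy .hql file through spark.sql. A hedged sketch follows; the function name and file layout are made up, and the naive split on ";" breaks on semicolons inside string literals, so a real migration wants a proper HiveQL parser:

```python
def run_hive_file(spark, path):
    """Run each statement in a legacy Hive script via the Spark SQL engine.

    Naive: splits on ";", which is wrong for semicolons embedded in
    string literals. Illustrative only.
    """
    with open(path) as f:
        text = f.read()
    statements = [s.strip() for s in text.split(";") if s.strip()]
    return [spark.sql(s) for s in statements]

# Usage (requires a Spark 3.x session with Hive support enabled):
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   run_hive_file(spark, "legacy_queries.hql")
```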

Regards
Pralabh Kumar