Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly recommend that all 3.5 users upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us release the official Docker image for Spark 3.5.1
>> (thanks, Yikun!). It may take some time to become generally available.
>>
>>

-- 
John Zhuge


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work!

On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:

> Absolutely thrilled to see the project going open-source! Huge congrats to
> Chao and the entire team on this milestone!
>
> Yufei
>
>
> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>> has now been open-sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -----
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
John Zhuge


Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
tch Scheduling.
>>>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>>>
>>>>
>>>>
>>>> What is not very clear is the degree of progress of these projects. Would
>>>> you be kind enough to elaborate on the KPIs for each of these projects and
>>>> where you think your contributions are going to be?
>>>>
>>>>
>>>> HTH,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau 
>>>> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> I'm continuing my adventures to make Spark on containers party, and I
>>>>> was wondering which of the different batch scheduler options folks have
>>>>> experience with and prefer. I was thinking that, to better support
>>>>> dynamic allocation, it might make sense for us to support different
>>>>> schedulers, and I wanted to see if there are any that the community is
>>>>> more interested in.
>>>>>
>>>>> I know that one of the Spark on Kube operators supports
>>>>> volcano/kube-batch, so I was thinking that might be a place to start
>>>>> exploring, but I also want to be open to other schedulers that folks
>>>>> might be interested in.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
John Zhuge


Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use the
"datediff" function:

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
+------------------------------------------------------------------------------------------------------------------------+
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE))    |
+------------------------------------------------------------------------------------------------------------------------+
| 31                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------+

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
+----------------------------------------------------------------------------------+
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE))    |
+----------------------------------------------------------------------------------+
| 31                                                                                |
+----------------------------------------------------------------------------------+

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]:
+----------------------------------------------------------------------------------------------------------------+
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE))    |
+----------------------------------------------------------------------------------+
| 31                                                                                                               |
+----------------------------------------------------------------------------------------------------------------+
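
For reference, the same thing in PySpark -- a minimal sketch, assuming a
running SparkSession (the column names end_ts/start_ts are just illustrative).
datediff gives the day-level difference; subtracting unix_timestamp values
gives a second-level difference:

```
# Minimal PySpark sketch (assumptions: a SparkSession can be created;
# end_ts/start_ts are illustrative column names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("timestamp-diff-sketch").getOrCreate()

df = spark.createDataFrame(
    [("2000-02-01 12:34:34", "2000-01-01 00:00:00")],
    ["end_ts", "start_ts"],
).select(
    F.col("end_ts").cast("timestamp").alias("end_ts"),
    F.col("start_ts").cast("timestamp").alias("start_ts"),
)

# Day-level difference, same as the SQL datediff above.
df.select(F.datediff("end_ts", "start_ts").alias("days")).show()  # 31

# Second-level difference: subtract the Unix timestamps (the session time
# zone applies when casting the strings).
df.select(
    (F.unix_timestamp("end_ts") - F.unix_timestamp("start_ts")).alias("seconds")
).show()  # 2723674
```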

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal 
wrote:

> Hello Spark Community,
>
> Currently in Hive we can do operations on timestamps, like:
> CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS
> TIMESTAMP)
>
> It seems this is not supported in Spark.
> Is there any way to do it?
>
> Kindly provide some insight on this.
>
>
> Paras
> 9130006036
>


-- 
John


Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates that the HDFS file is corrupted. This
might be an HDFS issue; the Hadoop mailing list is a better bet:
u...@hadoop.apache.org.

Capture the full stack trace in the executor log.
If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
to determine whether it is corrupted.
If it is not corrupted, could there be excessive (thousands of) concurrent
reads on the block?
Which Hadoop version? Which Spark version?
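
If the block checks out healthy and the failures look transient, about the
best you can do on the pyspark side is retry the action. A minimal sketch
(assumptions: `spark` is an active SparkSession and `path` is the HDFS
location being read; retrying will not help with a truly corrupted block):

```
# Sketch only (assumptions: `spark` is an active SparkSession, `path` is the
# HDFS location being read). Retries the action when a BlockMissingException
# surfaces through Py4J.
import time
from py4j.protocol import Py4JJavaError

def count_with_retry(spark, path, attempts=3, backoff_secs=30):
    for attempt in range(1, attempts + 1):
        try:
            return spark.read.text(path).count()
        except Py4JJavaError as e:
            transient = "BlockMissingException" in str(e.java_exception)
            if not transient or attempt == attempts:
                raise
            # Back off and retry; transient block unavailability sometimes clears.
            time.sleep(backoff_secs)
```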



On Mon, Aug 6, 2018 at 2:21 AM Divay Jindal 
wrote:

> Hi,
>
> I am running pyspark in a dockerized Jupyter environment, and I am
> constantly getting this error:
>
> ```
>
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 
> in stage 25.0 failed 1 times, most recent failure: Lost task 33.0 in stage 
> 25.0 (TID 35067, localhost, executor driver)
> : org.apache.hadoop.hdfs.BlockMissingException
> : Could not obtain block: 
> BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693
>
> ```
>
> Please, can anyone help me with how to handle such an exception in pyspark?
>
> --
> Best Regards
> *Divay Jindal*
>
>
>

-- 
John


Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
Sounds good.

Should we add another paragraph after this paragraph in configuration.md to
explain executor env as well? I will be happy to upload a simple patch.

Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
>  property in your conf/spark-defaults.conf file. Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode. See the YARN-related Spark
> Properties
> <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties>
>  for
> more information.


Something like:

Note: When running Spark on YARN, environment variables for the executors
need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
property in your conf/spark-defaults.conf file or on the command line.
Environment variables that are set in spark-env.sh will not be reflected in
the executor process.
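
For concreteness, a minimal PySpark sketch of both properties (MY_ENV_VAR is
just an illustrative name; the same keys can equally be set in
conf/spark-defaults.conf or via --conf on spark-submit):

```
# Minimal sketch (assumption: the application is submitted to YARN;
# MY_ENV_VAR is an illustrative variable name).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-env-sketch")
    # Environment variable for the YARN Application Master (cluster mode).
    .config("spark.yarn.appMasterEnv.MY_ENV_VAR", "some-value")
    # Environment variable for every executor container.
    .config("spark.yarn.executorEnv.MY_ENV_VAR", "some-value")
    .getOrCreate()
)
```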



On Wed, Jan 3, 2018 at 7:53 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Because spark-env.sh is something that makes sense only on the gateway
> machine (where the app is being submitted from).
>
> On Wed, Jan 3, 2018 at 6:46 PM, John Zhuge <john.zh...@gmail.com> wrote:
> > Thanks Jacek and Marcelo!
> >
> > Any reason it is not sourced? Any security consideration?
> >
> >
> > On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge <jzh...@apache.org> wrote:
> >> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster.
> Is
> >> > spark-env.sh sourced when starting the Spark AM container or the
> >> > executor
> >> > container?
> >>
> >> No, it's not.
> >>
> >> --
> >> Marcelo
> >
> >
> >
> >
> > --
> > John
>
>
>
> --
> Marcelo
>



-- 
John


Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
Thanks Jacek and Marcelo!

Any reason it is not sourced? Any security consideration?


On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge <jzh...@apache.org> wrote:
> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
> > spark-env.sh sourced when starting the Spark AM container or the executor
> > container?
>
> No, it's not.
>
> --
> Marcelo
>



-- 
John


Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-02 Thread John Zhuge
Hi,

I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
spark-env.sh sourced when starting the Spark AM container or the executor
container?

Saw this paragraph on
https://github.com/apache/spark/blob/master/docs/configuration.md:

Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] 
> property
> in your conf/spark-defaults.conf file. Environment variables that are set
> in spark-env.sh will not be reflected in the YARN Application Master
> process in cluster mode. See the YARN-related Spark Properties
> <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties>
>  for
> more information.


Does it mean spark-env.sh will not be sourced when starting the AM in cluster
mode?
Does this paragraph apply to executors as well?
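
For reference, a quick probe I could run to check empirically -- just a
sketch, with MY_ENV_VAR as an illustrative variable set only in spark-env.sh
on the gateway machine:

```
# Probe sketch (assumption: submitted with spark-submit on YARN). Prints
# whether MY_ENV_VAR is visible on the driver and inside the executors.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-env-probe").getOrCreate()
sc = spark.sparkContext

print("driver sees:", os.environ.get("MY_ENV_VAR"))
executor_view = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
    .map(lambda _: os.environ.get("MY_ENV_VAR"))
    .distinct()
    .collect()
)
print("executors see:", executor_view)
```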

Thanks,
-- 
John Zhuge