Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a
Remote Shuffle Service for Spark on K8s. (A minimal configuration sketch follows the
links below.)

There are some useful resources you might be interested in:

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140
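
To make the suggestion concrete, here is a minimal sketch of pointing a Spark job at a
running Celeborn cluster. The master endpoint is a placeholder, the shuffle manager
class is the one used by recent Celeborn releases, and the Celeborn Spark client jar
must be on the classpath -- see [1] for the authoritative setup:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("celeborn-rss-demo")
    // Route shuffle data through Celeborn instead of executors' local disks.
    .config("spark.shuffle.manager",
      "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    // Placeholder address: point this at your Celeborn master(s).
    .config("spark.celeborn.master.endpoints", "celeborn-master:9097")
    .getOrCreate()

In practice these would be set via spark-submit or the pod template rather than in
code, since the shuffle manager must be fixed before executors start.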

Thanks,
Cheng Pan


> On Apr 6, 2024, at 21:41, Mich Talebzadeh  wrote:
> 
> I have seen some older references to a shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
> 
> Anyhow, with the advent of GenAI and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically, larger and scalable file systems like HDFS,
> GCS, S3, etc. offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> much larger datasets to be handled more efficiently. The degree of
> parallelism and fault tolerance with these file systems also comes into
> it. I will be interested in hearing more about any progress on this.
> 
> Thanks
> 
> Mich Talebzadeh,
> 
> Technologist | Solutions Architect | Data Engineer | Generative AI
> 
> London
> United Kingdom
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
> 





[DISCUSS] MySQL version support policy

2024-03-24 Thread Cheng Pan
Hi, Spark community,

I noticed that the Spark JDBC connector MySQL dialect is testing against
8.3.0[1] now, a non-LTS version.

MySQL changed its version policy recently[2], and it is now very similar to
the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions;
8.1, 8.2, and 8.3 are non-LTS; and the next LTS version will be 8.4.

I would say that MySQL is one of the most important pieces of infrastructure
today. I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
support policies, and both only support 5.7 and 8.0.

Also, Spark officially supports only LTS Java versions, like JDK 17 and 21,
but not 22. I would recommend testing against MySQL 8.0 until the next MySQL
LTS version (8.4) is available.
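
For readers who have not touched the connector under discussion, this is roughly how a
Spark job reads MySQL through the built-in JDBC source. Host, database, table, and
credentials below are placeholders, and the MySQL JDBC driver (mysql-connector-j) must
be on the classpath:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("mysql-jdbc-demo").getOrCreate()

  // Placeholder connection details; the MySQL dialect is selected
  // automatically from the jdbc:mysql:// URL prefix.
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/testdb")
    .option("dbtable", "orders")
    .option("user", "spark")
    .option("password", "secret")
    .load()

  df.printSchema()

This is the code path the version-support discussion is about: the dialect is exercised
against whichever MySQL server version the tests run on.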

Additional discussion can be found at [3].

[1] https://issues.apache.org/jira/browse/SPARK-47453
[2] 
https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
[3] https://github.com/apache/spark/pull/45581
[4] https://aws.amazon.com/rds/mysql/
[5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy

Thanks,
Cheng Pan






[ANNOUNCE] Apache Kyuubi 1.8.1 is available

2024-02-20 Thread Cheng Pan
Hi all,

The Apache Kyuubi community is pleased to announce that
Apache Kyuubi 1.8.1 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway to provide
serverless SQL on data warehouses and lakehouses.

Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC and
RESTful interfaces for end-users to manipulate large-scale data with
pre-programmed and extensible Spark/Flink/Trino/Hive engines.

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and lakehouses.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark/Flink/Trino/Hive engines on the client side.

On the server side, the multi-tenant architecture of the Kyuubi server and
engines provides administrators with a way to achieve computing resource
isolation, data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.8.1.html

To learn more about Apache Kyuubi, please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community
who made this release possible!

Thanks,
Cheng Pan, on behalf of Apache Kyuubi community




Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported a window-based executor failure-tracking mechanism on
YARN for a long time; SPARK-41210[1][2] (included in 3.5.0) extended this
feature to K8s. A configuration sketch follows the links below.

[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732
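
A minimal sketch of what enabling it looks like -- the two config names below are the
ones introduced by the patch as far as I can tell, so please verify them against [2]
before relying on this:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("executor-failure-tracker-demo")
    // Fail the whole application once this many executor failures occur...
    .config("spark.executor.maxNumFailures", "5")
    // ...within this sliding window; older failures age out of the count.
    .config("spark.executor.failuresValidityInterval", "10m")
    .getOrCreate()

With the Spark Operator, the same two properties would go into the sparkConf section of
the SparkApplication spec, since they must be set before the driver starts.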

Thanks,
Cheng Pan


> On Feb 19, 2024, at 23:59, Sri Potluri  wrote:
> 
> Hello Spark Community,
> 
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, 
> for running various Spark applications. While the system generally works 
> well, I've encountered a challenge related to how Spark applications handle 
> executor failures, specifically in scenarios where executors enter an error 
> state due to persistent issues.
> 
> Problem Description
> 
> When an executor of a Spark application fails, the system attempts to 
> maintain the desired level of parallelism by automatically recreating a new 
> executor to replace the failed one. While this behavior is beneficial for 
> transient errors, ensuring that the application continues to run, it becomes 
> problematic in cases where the failure is due to a persistent issue (such as 
> misconfiguration, inaccessible external resources, or incompatible 
> environment settings). In such scenarios, the application enters a loop, 
> continuously trying to recreate executors, which leads to resource wastage 
> and complicates application management.
> 
> Desired Behavior
> 
> Ideally, I would like to have a mechanism to limit the number of retries for 
> executor recreation. If the system fails to successfully create an executor 
> more than a specified number of times (e.g., 5 attempts), the entire Spark 
> application should fail and stop trying to recreate the executor. This 
> behavior would help in efficiently managing resources and avoiding prolonged 
> failure states.
> 
> Questions for the Community
> 
> 1. Is there an existing configuration or method within Spark or the Spark 
> Operator to limit executor recreation attempts and fail the job after 
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or 
> solutions that could be applied in this context?
> 
> 
> Additional Context
> 
> I have explored Spark's task and stage retry configurations 
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these 
> do not directly address the issue of limiting executor creation retries. 
> Implementing a custom monitoring solution to track executor failures and 
> manually stop the application is a potential workaround, but it would be 
> preferable to have a more integrated solution.
> 
> I appreciate any guidance, insights, or feedback you can provide on this 
> matter.
> 
> Thank you for your time and support.
> 
> Best regards,
> Sri P





[ANNOUNCE] Apache Kyuubi released 1.8.0

2023-11-06 Thread Cheng Pan
Hi all,

The Apache Kyuubi community is pleased to announce that
Apache Kyuubi 1.8.0 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway to provide
serverless SQL on data warehouses and lakehouses.

Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and lakehouses.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark, Flink, and other computing engines at the client side.

At the server-side, Kyuubi server and engine's multi-tenant architecture
provides the administrators a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.8.0.html

To learn more about Apache Kyuubi, please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community
who made this release possible!

Thanks,
On behalf of Apache Kyuubi community




[ANNOUNCE] Apache Celeborn(incubating) 0.3.1 available

2023-10-13 Thread Cheng Pan
Hi all,

Apache Celeborn(Incubating) community is glad to announce the
new release of Apache Celeborn(Incubating) 0.3.1.

Celeborn is dedicated to improving the efficiency and elasticity of
different map-reduce engines and provides an elastic, highly efficient
service for intermediate data including shuffle data, spilled data,
result data, etc.

Download Link: https://celeborn.apache.org/download/

GitHub Release Tag:
- https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating

Release Notes:
- https://celeborn.apache.org/community/release_notes/release_note_0.3.1

Home Page: https://celeborn.apache.org/

Celeborn Resources:
- Issue Management: https://issues.apache.org/jira/projects/CELEBORN
- Mailing List: d...@celeborn.apache.org

Thanks,
Cheng Pan
On behalf of the Apache Celeborn(incubating) community







Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in 
https://github.com/apache/spark/pull/42493

Thanks,
Cheng Pan


> On Aug 14, 2023, at 16:50, Sankavi Nagalingam wrote:
> 
> Hi Team,
>  We could see there are many dependent vulnerabilities present in the latest 
> spark-core:3.4.1.jar. PFA
> Could you please let us know when the fix version will be available for
> users.
>  Thanks,
> Sankavi
>  

Spark-3.4.1-Vulnerablities.xlsx
Description: MS-Excel 2007 spreadsheet





Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports
connecting to multiple HMS instances in a single Spark application. (A
configuration sketch follows the link below.)

Some limitations:

- it currently only supports Spark 3.3
- it has a known issue when used with `spark-sql`, but works fine with
spark-shell and normal jar-based Spark applications.

[1]
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive
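
Registering two metastores side by side looks roughly like this. A sketch only: the
catalog class name is the one I remember from the Kyuubi repo (please double-check it
in [1]), and the catalog names and thrift URIs are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("multi-hms-demo")
    // Each catalog is backed by its own Hive Metastore.
    .config("spark.sql.catalog.hms_prod",
      "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
    .config("spark.sql.catalog.hms_prod.hive.metastore.uris", "thrift://hms-prod:9083")
    .config("spark.sql.catalog.hms_test",
      "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
    .config("spark.sql.catalog.hms_test.hive.metastore.uris", "thrift://hms-test:9083")
    .getOrCreate()

  // Tables are then addressed with the catalog name as a prefix.
  spark.sql("SELECT * FROM hms_prod.db1.t1").show()
  spark.sql("SELECT * FROM hms_test.db1.t1").show()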

Thanks,
Cheng Pan


On Apr 18, 2023 at 00:38:23, Elliot West  wrote:

> Hi Ankit,
>
> While not a part of Spark, there is a project called 'WaggleDance' that
> can federate multiple Hive metastores so that they are accessible via a
> single URI: https://github.com/ExpediaGroup/waggle-dance
>
> This may be useful or perhaps serve as inspiration.
>
> Thanks,
>
> Elliot.
>
> On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:
>
>> ++
>> User Mailing List
>>
>> Just a reminder, anyone who can help on this.
>>
>> Thanks a lot !
>>
>> Ankit Prakash Gupta
>>
>> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta 
>> wrote:
>>
>>> Hi All
>>>
>>> The question is regarding support for multiple Remote Hive Metastore
>>> catalogs in Spark. Starting with Spark 3, multiple catalog support was
>>> added, but is there any CatalogPlugin implementation that can help us
>>> configure multiple Remote Hive Metastore catalogs? If yes, can anyone
>>> help me with the fully qualified class name that I can try using for
>>> configuring a Hive Metastore catalog? If not, I would like to work on
>>> an implementation of the CatalogPlugin that we can use to configure
>>> multiple Hive Metastore servers.
>>>
>>> Thanks and Regards.
>>>
>>> Ankit Prakash Gupta
>>> +91 8750101321
>>> info.ank...@gmail.com
>>>
>>>


Re: spark on k8s daemonset collect log

2023-03-14 Thread Cheng Pan
Filebeat supports multiline matching; here is an example[1].

BTW, I'm working on External Log Service integration[2]; it may be useful
in your case. Feel free to review/leave comments.

[1]
https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html#multiline
[2] https://github.com/apache/spark/pull/38357

Thanks,
Cheng Pan


On Mar 14, 2023 at 16:36:45, 404  wrote:

> hi, all
>
> Spark runs on k8s, and we use a filebeat DaemonSet to collect logs and write
> them to Elasticsearch. The docker logs are in JSON format, and each line is
> a JSON string. How can we merge multi-line exceptions?
>
>


[ANNOUNCE] Apache Kyuubi released 1.7.0

2023-03-07 Thread Cheng Pan
Hi all,

The Apache Kyuubi community is pleased to announce that Apache Kyuubi
1.7.0 has been released!

Apache Kyuubi is a distributed multi-tenant Lakehouse gateway for
large-scale data processing and analytics, built on top of Apache Spark,
Apache Flink, and Trino, with support for other computing engines as well.

Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and data lakes.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark at the client side.

At the server-side, Kyuubi server and engine's multi-tenant architecture
provides the administrators a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.7.0.html

To learn more about Apache Kyuubi, please see: https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community who
made this release possible!

Thanks,
Cheng Pan, on behalf of Apache Kyuubi community


Re: The Dataset unit test is much slower than the RDD unit test (in Scala)

2022-11-01 Thread Cheng Pan
Which Spark version are you using?

SPARK-36444[1] and SPARK-38138[2] may be related, please test w/ the
patched version or disable DPP by setting
spark.sql.optimizer.dynamicPartitionPruning.enabled=false to see if it
helps.

[1] https://issues.apache.org/jira/browse/SPARK-36444
[2] https://issues.apache.org/jira/browse/SPARK-38138
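
Combining that with the planChangeLog suggestion quoted below, a minimal sketch for a
test SparkSession would look like this (both configs exist in Spark 3.x; treat the
exact values as starting points):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("plan-debug")
    .master("local[*]")
    // Log the rules the optimizer/planner applies during analysis/planning
    // (available since Spark 3.1; pick a level at or above your log level).
    .config("spark.sql.planChangeLog.level", "warn")
    // Rule out dynamic partition pruning as the source of slow planning.
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
    .getOrCreate()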


Thanks,
Cheng Pan


On Nov 2, 2022 at 00:14:34, Enrico Minack  wrote:

> Hi Tanin,
>
> running your test with option "spark.sql.planChangeLog.level" set to
> "info" or "warn" (depending on your Spark log level) will show you
> insights into the planning (which rules are applied, how long rules
> take, how many iterations are done).
>
> Hoping this helps,
> Enrico
>
>
> On 25.10.22 at 21:54, Tanin Na Nakorn wrote:
> 
> Hi All,
> 
> Our data job is very complex (e.g. 100+ joins), and we have switched
> from RDD to Dataset recently.
> 
> We've found that the unit test takes much longer. We profiled it and
> have found that it's the planning phase that is slow, not execution.
> 
> I wonder if anyone has encountered this issue before and if there's a
> way to make the planning phase faster (e.g. maybe disabling certain
> optimizers).
> 
> Any thoughts or input would be appreciated.
> 
> Thank you,
> Tanin


Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread Cheng Pan
There are some projects based on Spark DataSource V2 that I hope will help you
(a minimal skeleton of the API follows the links below):

https://github.com/datastax/spark-cassandra-connector
https://github.com/housepower/spark-clickhouse-connector
https://github.com/oracle/spark-oracle
https://github.com/pingcap/tispark
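
For orientation, here is a minimal read-only DataSource V2 skeleton. This is a sketch
against the Spark 3.x connector API: all class names are made up, and a real connector
would add options handling, schema inference, and write support.

  import java.util
  import scala.collection.JavaConverters._
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
  import org.apache.spark.sql.connector.expressions.Transform
  import org.apache.spark.sql.connector.read._
  import org.apache.spark.sql.types._
  import org.apache.spark.sql.util.CaseInsensitiveStringMap
  import org.apache.spark.unsafe.types.UTF8String

  // Entry point: spark.read.format(classOf[DemoSource].getName).load()
  class DemoSource extends TableProvider {
    override def inferSchema(options: CaseInsensitiveStringMap): StructType = DemoTable.schema
    override def getTable(
        schema: StructType,
        partitioning: Array[Transform],
        properties: util.Map[String, String]): Table = new DemoTable
  }

  object DemoTable {
    val schema: StructType = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
  }

  class DemoTable extends Table with SupportsRead {
    override def name(): String = "demo"
    override def schema(): StructType = DemoTable.schema
    override def capabilities(): util.Set[TableCapability] =
      Set(TableCapability.BATCH_READ).asJava
    override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
      new ScanBuilder { override def build(): Scan = new DemoScan }
  }

  class DemoScan extends Scan with Batch {
    override def readSchema(): StructType = DemoTable.schema
    override def toBatch: Batch = this
    // A single partition; a real source would split its input here.
    override def planInputPartitions(): Array[InputPartition] = Array(new DemoPartition)
    override def createReaderFactory(): PartitionReaderFactory = new DemoReaderFactory
  }

  class DemoPartition extends InputPartition

  class DemoReaderFactory extends PartitionReaderFactory {
    override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
      new DemoReader
  }

  // Emits two hard-coded rows; strings must be UTF8String in InternalRow.
  class DemoReader extends PartitionReader[InternalRow] {
    private val rows = Iterator(
      InternalRow(1, UTF8String.fromString("a")),
      InternalRow(2, UTF8String.fromString("b")))
    private var current: InternalRow = _
    override def next(): Boolean = { val has = rows.hasNext; if (has) current = rows.next(); has }
    override def get(): InternalRow = current
    override def close(): Unit = ()
  }

Streaming sources implement the analogous interfaces under
org.apache.spark.sql.connector.read.streaming, and writes go through
SupportsWrite / WriteBuilder.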

Thanks,
Cheng Pan

On Wed, Apr 6, 2022 at 5:52 PM daniel queiroz  wrote:
>
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html
>
> https://developer.here.com/documentation/java-scala-dev/dev_guide/spark-connector/index.html
>
> Thanks,
>
> Daniel Queiroz
> 81 996289671
>
>
> On Wed, Apr 6, 2022 at 03:57, Dyanesh Varun wrote:
>>
>> Hey team,
>>
>> Can you please share some documentation/blogs where we can get to know how 
>> we can write custom sources and sinks for both streaming and static datasets.
>>
>> Thanks in advance
>> Dyanesh Varun
>>




Re: spark as data warehouse?

2022-03-26 Thread Cheng Pan
Sorry I missed the original channel, added it back.

---

I don't know much about dbt, but if it supports Hive, it should support Kyuubi.
Basically, Kyuubi is a gateway between your client (e.g. beeline, the Hive
JDBC client) and the compute engine (e.g. Spark, Flink, Trino). I think the
most valuable things are:
1) Kyuubi reuses the Hive Thrift protocol, which means you can treat Kyuubi
as a HiveServer2 and keep using beeline or the Hive JDBC driver to
connect to Kyuubi and run SQL (in your compute engine's dialect). Ideally, if
a tool claims it supports Hive, then it supports Kyuubi. (See the connection
sketch after the links below.)
2) Kyuubi manages the compute engine lifecycle and share level, making
a good trade-off between isolation and resource consumption.[1]

PS: Kyuubi's support for Spark is very mature; you can find lots of
production use cases here[2]. The support for Flink & Trino is in beta.

[1] https://kyuubi.apache.org/docs/latest/deployment/engine_share_level.html
[2] https://github.com/apache/incubator-kyuubi/discussions/925
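
To make point 1 concrete, here is a minimal sketch of connecting with the plain Hive
JDBC driver. Host and credentials are placeholders, 10009 is Kyuubi's default Thrift
frontend port, and org.apache.hive:hive-jdbc must be on the classpath:

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  // Placeholder endpoint: any HiveServer2-compatible client can connect.
  val conn = DriverManager.getConnection(
    "jdbc:hive2://kyuubi-host:10009/default", "user", "")
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("SELECT 1")
  while (rs.next()) println(rs.getInt(1))
  rs.close(); stmt.close(); conn.close()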

Thanks,
Cheng Pan

---

Thanks, I'll check it out.
I have a use case where we want to use dbt as a data modelling tool.
Will it take dbt queries and create the resulting model?
I see it supports Trino, so I am guessing yes.

I would love to contribute to it as well.

Thanks
Deepak

---

Spark SQL can indeed take over your Hive workloads, and if you're
looking for an open source solution, Apache Kyuubi(Incubating)[1]
might help.

[1] https://kyuubi.apache.org/

Thanks,
Cheng Pan

On Sat, Mar 26, 2022 at 4:51 PM Cheng Pan  wrote:
>
> I don't know much about dbt, but if it supports Hive, it should support
> Kyuubi.
> Basically, Kyuubi is a gateway between your client (e.g. beeline, the Hive
> JDBC client) and the compute engine (e.g. Spark, Flink, Trino). I think the
> most valuable things are:
> 1) Kyuubi reuses the Hive Thrift protocol, which means you can treat Kyuubi
> as a HiveServer2 and keep using beeline or the Hive JDBC driver to
> connect to Kyuubi and run SQL (in your compute engine's dialect). Ideally, if
> a tool claims it supports Hive, then it supports Kyuubi.
> 2) Kyuubi manages the compute engine lifecycle and share level, making
> a good trade-off between isolation and resource consumption.[1]
>
> PS: Kyuubi's support for Spark is very mature, you can find lots of
> production use cases here[2]. The support for Flink & Trino is in beta
> phase.
>
> [1] https://kyuubi.apache.org/docs/latest/deployment/engine_share_level.html
> [2] https://github.com/apache/incubator-kyuubi/discussions/925
>
> Thanks,
> Cheng Pan
>
> On Sat, Mar 26, 2022 at 4:16 PM Deepak Sharma  wrote:
> >
> > Thanks, I'll check it out.
> > I have a use case where we want to use dbt as a data modelling tool.
> > Will it take dbt queries and create the resulting model?
> > I see it supports Trino, so I am guessing yes.
> >
> > I would love to contribute to it as well.
> >
> >
> > Thanks
> > Deepak
> >
> > On Sat, 26 Mar 2022 at 1:24 PM, Cheng Pan  wrote:
> >>
> >> Spark SQL can indeed take over your Hive workloads, and if you're
> >> looking for an open source solution, Apache Kyuubi(Incubating)[1]
> >> might help.
> >>
> >> [1] https://kyuubi.apache.org/
> >>
> >> Thanks,
> >> Cheng Pan
> >>
> >> On Sat, Mar 26, 2022 at 11:45 AM Deepak Sharma wrote:
> >> >
> >> > It can be used as a warehouse, but then you have to keep long-running
> >> > Spark jobs.
> >> > This can be made possible using cached DataFrames or Datasets.
> >> >
> >> > Thanks
> >> > Deepak
> >> >
> >> > On Sat, 26 Mar 2022 at 5:56 AM,  wrote:
> >> >>
> >> >> In the past we have been using Hive for building the data
> >> >> warehouse.
> >> >> Do you think Spark can be used for this purpose? It's even more
> >> >> realtime than Hive.
> >> >>
> >> >> Thanks.
> >> >>
> >> >>
> >> > --
> >> > Thanks
> >> > Deepak
> >> > www.bigdatabig.com
> >> > www.keosha.net
> >
> > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net




[ANNOUNCE] Release Apache Kyuubi(Incubating) 1.3.0-incubating

2021-09-26 Thread Cheng Pan
Hello Spark Community,

The Apache Kyuubi(Incubating) community is pleased to announce that
Apache Kyuubi(Incubating) 1.3.0-incubating has been released!

Apache Kyuubi(Incubating) is a distributed multi-tenant JDBC server for
large-scale data processing and analytics, built on top of Apache Spark
and designed to support more engines (e.g. Flink).

Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and data lakes.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark at the client side.

At the server-side, Kyuubi server and engine's multi-tenant architecture
provides the administrators a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.3.0-incubating.html

To learn more about Apache Kyuubi (Incubating), please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/incubator-kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community and the
Incubator community who made this release possible!

Thanks,
On behalf of Apache Kyuubi(Incubating) community