Re: External Spark shuffle service for k8s
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in:

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140

Thanks,
Cheng Pan

> On Apr 6, 2024, at 21:41, Mich Talebzadeh wrote:
>
> I have seen some older references to a shuffle service for k8s, although it is not clear they are talking about a generic shuffle service for k8s.
>
> Anyhow, with the advent of genai and the need to allow for a larger volume of data, I was wondering if there has been any more work on this matter. Specifically, larger and scalable file systems like HDFS, GCS, S3, etc. offer significantly larger storage capacity than local disks on individual worker nodes in a k8s cluster, thus allowing much larger datasets to be handled more efficiently. The degree of parallelism and fault tolerance of these file systems also comes into it. I will be interested in hearing more about any progress on this.
>
> Thanks
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London, United Kingdom
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun).
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
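For readers evaluating Celeborn, the Spark side of the switch is mostly configuration. A minimal sketch of a spark-defaults.conf fragment (property and class names follow the Celeborn quick-start docs; the master endpoints are placeholders for your deployment — verify against the Celeborn version you install):

```properties
# Requires the celeborn-client-spark jar matching your Spark/Scala version
# on the driver and executor classpaths.
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
# Placeholder endpoints -- replace with your Celeborn master(s); 9097 is the default port.
spark.celeborn.master.endpoints=celeborn-master-0:9097,celeborn-master-1:9097
```

With shuffle data held by the Celeborn cluster rather than executor local disks, executors become effectively stateless for shuffle, which is what makes this attractive on K8s.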
[DISCUSS] MySQL version support policy
Hi, Spark community,

I noticed that the Spark JDBC connector MySQL dialect is now tested against 8.3.0[1], a non-LTS version. MySQL recently changed its version policy[2], which is now very similar to the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions; 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.

MySQL is one of the most important pieces of infrastructure today. I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version support policies, and both only support 5.7 and 8.0. Also, Spark officially only supports LTS Java versions, like JDK 17 and 21, but not 22.

I would recommend testing against MySQL 8.0 until the next MySQL LTS version (8.4) is available. Additional discussion can be found at [3].

[1] https://issues.apache.org/jira/browse/SPARK-47453
[2] https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
[3] https://github.com/apache/spark/pull/45581
[4] https://aws.amazon.com/rds/mysql/
[5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy

Thanks,
Cheng Pan

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
[ANNOUNCE] Apache Kyuubi 1.8.1 is available
Hi all,

The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.1 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC and RESTful interfaces for end-users to manipulate large-scale data with pre-programmed and extensible Spark/Flink/Trino/Hive engines.

We aim to make Kyuubi an "out-of-the-box" tool for data warehouses and lakehouses. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark/Flink/Trino/Hive engines on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.8.1.html

To learn more about Apache Kyuubi, please see https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community who made this release possible!

Thanks,
Cheng Pan, on behalf of Apache Kyuubi community

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures
Spark has supported the window-based executor failure-tracking mechanism on YARN for a long time; SPARK-41210[1][2] (included in 3.5.0) extended this feature to K8s.

[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732

Thanks,
Cheng Pan

> On Feb 19, 2024, at 23:59, Sri Potluri wrote:
>
> Hello Spark Community,
>
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, for running various Spark applications. While the system generally works well, I've encountered a challenge related to how Spark applications handle executor failures, specifically in scenarios where executors enter an error state due to persistent issues.
>
> Problem Description
>
> When an executor of a Spark application fails, the system attempts to maintain the desired level of parallelism by automatically recreating a new executor to replace the failed one. While this behavior is beneficial for transient errors, ensuring that the application continues to run, it becomes problematic in cases where the failure is due to a persistent issue (such as misconfiguration, inaccessible external resources, or incompatible environment settings). In such scenarios, the application enters a loop, continuously trying to recreate executors, which leads to resource wastage and complicates application management.
>
> Desired Behavior
>
> Ideally, I would like to have a mechanism to limit the number of retries for executor recreation. If the system fails to successfully create an executor more than a specified number of times (e.g., 5 attempts), the entire Spark application should fail and stop trying to recreate the executor. This behavior would help in efficiently managing resources and avoiding prolonged failure states.
>
> Questions for the Community
>
> 1. Is there an existing configuration or method within Spark or the Spark Operator to limit executor recreation attempts and fail the job after reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or solutions that could be applied in this context?
>
> Additional Context
>
> I have explored Spark's task and stage retry configurations (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these do not directly address the issue of limiting executor creation retries. Implementing a custom monitoring solution to track executor failures and manually stop the application is a potential workaround, but it would be preferable to have a more integrated solution.
>
> I appreciate any guidance, insights, or feedback you can provide on this matter.
>
> Thank you for your time and support.
>
> Best regards,
> Sri P

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
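For reference, a hedged sketch of the spark-defaults.conf fragment that enables the failure tracker on Spark 3.5.0+ on K8s (config names follow SPARK-41210; the values are illustrative, and exact names, defaults, and semantics should be verified against the Spark 3.5 configuration docs):

```properties
# Fail the whole application after 5 executor failures ...
spark.executor.maxNumFailures=5
# ... counting only failures within a sliding window; older failures expire,
# so a long-running app is not killed by occasional transient errors.
spark.executor.failuresValidityInterval=10m
```

This matches the desired behavior in the question: transient failures are tolerated, while a persistent misconfiguration that fails every executor trips the threshold and stops the application.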
[ANNOUNCE] Apache Kyuubi released 1.8.0
Hi all,

The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.0 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through the Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines.

We aim to make Kyuubi an "out-of-the-box" tool for data warehouses and lakehouses. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark, Flink, and other computing engines on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.8.0.html

To learn more about Apache Kyuubi, please see https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community who made this release possible!

Thanks,
On behalf of Apache Kyuubi community

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
[ANNOUNCE] Apache Celeborn(incubating) 0.3.1 available
Hi all,

The Apache Celeborn (Incubating) community is glad to announce the new release of Apache Celeborn (Incubating) 0.3.1.

Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines, and provides an elastic, highly efficient service for intermediate data, including shuffle data, spilled data, result data, etc.

Download Link: https://celeborn.apache.org/download/

GitHub Release Tag:
- https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating

Release Notes:
- https://celeborn.apache.org/community/release_notes/release_note_0.3.1

Home Page: https://celeborn.apache.org/

Celeborn Resources:
- Issue Management: https://issues.apache.org/jira/projects/CELEBORN
- Mailing List: d...@celeborn.apache.org

Thanks,
Cheng Pan
On behalf of the Apache Celeborn (Incubating) community

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark Vulnerabilities
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493

Thanks,
Cheng Pan

> On Aug 14, 2023, at 16:50, Sankavi Nagalingam wrote:
>
> Hi Team,
> We could see there are many dependent vulnerabilities present in the latest spark-core:3.4.1.jar. PFA.
> Could you please let us know when the fix version will be available for the users.
>
> Thanks,
> Sankavi
>
> The information in this e-mail and any attachments is confidential and may be legally privileged. It is intended solely for the addressee or addressees. Any use or disclosure of the contents of this e-mail/attachments by a not intended recipient is unauthorized and may be unlawful. If you have received this e-mail in error please notify the sender. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of TEMENOS. We recommend that you check this e-mail and any attachments against viruses. TEMENOS accepts no liability for any damage caused by any malicious code or virus transmitted by this e-mail.
>
> Attachment: Spark-3.4.1-Vulnerablities.xlsx (MS-Excel 2007 spreadsheet)
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
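Until a Spark release ships the dependency change in that PR, applications that embed Spark as a library sometimes pin patched transitive versions on their own classpath. A hypothetical Maven fragment (the Guava version shown is a placeholder, the override only affects your application's classpath — not the Spark distribution itself — and it must be compatibility-tested against your Spark version):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force a newer Guava for this application's own classpath only -->
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>32.1.2-jre</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```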
Re: Spark Multiple Hive Metastore Catalog Support
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting to multiple HMS instances in a single Spark application.

Some limitations:
- currently only supports Spark 3.3
- has a known issue when used with `spark-sql`, but works with spark-shell and normal jar-based Spark applications

[1] https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive

Thanks,
Cheng Pan

On Apr 18, 2023 at 00:38:23, Elliot West wrote:
> Hi Ankit,
>
> While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance
>
> This may be useful or perhaps serve as inspiration.
>
> Thanks,
>
> Elliot.
>
> On Mon, 17 Apr 2023 at 16:38, Ankit Gupta wrote:
>>
>> ++ User Mailing List
>>
>> Just a reminder, anyone who can help on this.
>>
>> Thanks a lot !
>>
>> Ankit Prakash Gupta
>>
>> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote:
>>>
>>> Hi All
>>>
>>> The question is regarding the support of multiple Remote Hive Metastore catalogs with Spark. Starting Spark 3, multiple catalog support is added in Spark, but have we implemented any CatalogPlugin that can help us configure multiple Remote Hive Metastore catalogs? If yes, can anyone help me with the fully qualified class name that I can try using for configuring a Hive Metastore catalog. If not, I would like to work on the implementation of the CatalogPlugin that we can use to configure multiple Hive Metastore servers.
>>>
>>> Thanks and Regards.
>>>
>>> Ankit Prakash Gupta
>>> +91 8750101321
>>> info.ank...@gmail.com
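As a sketch of how a DSv2 catalog per metastore is typically registered (catalog names and thrift URIs below are placeholders; the class name follows the connector's README and should be verified against the connector version you use):

```properties
spark.sql.catalog.hms_prod=org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hms_prod.hive.metastore.uris=thrift://metastore-prod:9083
spark.sql.catalog.hms_test=org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hms_test.hive.metastore.uris=thrift://metastore-test:9083
```

Tables can then be addressed with three-part names, e.g. `SELECT * FROM hms_prod.db.tbl`, alongside tables from the other catalog in the same query.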
Re: spark on k8s daemonset collect log
Filebeat supports multiline matching; here is an example[1].

BTW, I'm working on External Log Service integration[2]. It may be useful in your case; feel free to review/leave comments.

[1] https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html#multiline
[2] https://github.com/apache/spark/pull/38357

Thanks,
Cheng Pan

On Mar 14, 2023 at 16:36:45, 404 wrote:
> hi, all
>
> Spark runs on k8s and uses a daemonset filebeat to collect logs, writing them to elasticsearch. The docker logs are in JSON format, one JSON string per line. How to merge multi-line exceptions?
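A minimal sketch of such a Filebeat input, adapted from the Java-stack-trace case in the multiline examples[1] (the paths glob is a placeholder for your cluster, and the pattern is an assumption to tune for your log format): continuation lines starting with whitespace, `at`/`...`, or `Caused by:` are appended to the preceding line, so one exception becomes one event in Elasticsearch.

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*spark*.log   # placeholder glob for Spark pods
    multiline:
      type: pattern
      # continuation lines of a Java/Scala stack trace
      pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:'
      negate: false
      match: after
```

The `container` input also decodes the per-line Docker JSON wrapper before the multiline grouping is applied.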
[ANNOUNCE] Apache Kyuubi released 1.7.0
Hi all,

The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.0 has been released!

Apache Kyuubi is a distributed multi-tenant Lakehouse gateway for large-scale data processing and analytics, built on top of Apache Spark, Apache Flink, and Trino, and it also supports other computing engines. Kyuubi provides a pure SQL gateway through the Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines.

We aim to make Kyuubi an "out-of-the-box" tool for data warehouses and data lakes. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.7.0.html

To learn more about Apache Kyuubi, please see: https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community who made this release possible!

Thanks,
Cheng Pan, on behalf of Apache Kyuubi community
Re: The Dataset unit test is much slower than the RDD unit test (in Scala)
Which Spark version are you using? SPARK-36444[1] and SPARK-38138[2] may be related; please test with a patched version, or disable DPP by setting spark.sql.optimizer.dynamicPartitionPruning.enabled=false, to see if it helps.

[1] https://issues.apache.org/jira/browse/SPARK-36444
[2] https://issues.apache.org/jira/browse/SPARK-38138

Thanks,
Cheng Pan

On Nov 2, 2022 at 00:14:34, Enrico Minack wrote:
> Hi Tanin,
>
> running your test with option "spark.sql.planChangeLog.level" set to "info" or "warn" (depending on your Spark log level) will show you insights into the planning (which rules are applied, how long rules take, how many iterations are done).
>
> Hoping this helps,
> Enrico
>
> On 25.10.22 at 21:54, Tanin Na Nakorn wrote:
> > Hi All,
> >
> > Our data job is very complex (e.g. 100+ joins), and we have switched from RDD to Dataset recently.
> >
> > We've found that the unit test takes much longer. We profiled it and have found that it's the planning phase that is slow, not execution.
> >
> > I wonder if anyone has encountered this issue before and if there's a way to make the planning phase faster (e.g. maybe disabling certain optimizers).
> >
> > Any thoughts or input would be appreciated.
> >
> > Thank you,
> > Tanin
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
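Putting the two suggestions in this thread together, a diagnostic test run can be configured like this (config names verbatim from the thread; `spark.sql.planChangeLog.level` was introduced around Spark 3.1, so check your version's docs):

```properties
# Log every optimizer/planner rule application with timing,
# to see which rules dominate the slow planning phase
spark.sql.planChangeLog.level=WARN
# Rule out dynamic partition pruning as the cause (see SPARK-36444/SPARK-38138)
spark.sql.optimizer.dynamicPartitionPruning.enabled=false
```

If planning time drops with DPP disabled, the linked tickets are the likely culprit; otherwise the rule-level log points at the expensive rule.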
Re: Writing Custom Spark Readers and Writers
There are some projects based on Spark DataSource V2 that I hope will help you.

https://github.com/datastax/spark-cassandra-connector
https://github.com/housepower/spark-clickhouse-connector
https://github.com/oracle/spark-oracle
https://github.com/pingcap/tispark

Thanks,
Cheng Pan

On Wed, Apr 6, 2022 at 5:52 PM daniel queiroz wrote:
>
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html
>
> https://developer.here.com/documentation/java-scala-dev/dev_guide/spark-connector/index.html
>
> Regards,
>
> Daniel Queiroz
> 81 996289671
>
> On Wed, Apr 6, 2022 at 03:57, Dyanesh Varun wrote:
>>
>> Hey team,
>>
>> Can you please share some documentation/blogs where we can get to know how we can write custom sources and sinks for both streaming and static datasets.
>>
>> Thanks in advance
>> Dyanesh Varun
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
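To give a feel for the DSv2 read path those connectors implement, here is a minimal, hedged sketch of a batch-read-only source in Scala (interface names come from the stable Spark 3.x `org.apache.spark.sql.connector` API; the package name and hard-coded rows are purely illustrative — a real connector reads external storage and handles pushdown):

```scala
package com.example

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types._
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Entry point Spark instantiates for .format("com.example.SimpleSource")
class SimpleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = SimpleTable.schema
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new SimpleTable
}

object SimpleTable {
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))
}

// A table that only advertises batch reads
class SimpleTable extends Table with SupportsRead {
  override def name(): String = "simple"
  override def schema(): StructType = SimpleTable.schema
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    () => new SimpleScan // ScanBuilder is a single-method interface: build(): Scan
}

// Top-level class (not anonymous) so it serializes cleanly to executors
class SimplePartition extends InputPartition

class SimpleScan extends Scan with Batch {
  override def readSchema(): StructType = SimpleTable.schema
  override def toBatch: Batch = this
  // One partition -> one PartitionReader running on an executor
  override def planInputPartitions(): Array[InputPartition] = Array(new SimplePartition)
  override def createReaderFactory(): PartitionReaderFactory =
    (_: InputPartition) => new SimplePartitionReader
}

// Produces two hard-coded rows; a real connector would open a connection here
class SimplePartitionReader extends PartitionReader[InternalRow] {
  private val rows = Iterator(
    InternalRow(1, UTF8String.fromString("a")),
    InternalRow(2, UTF8String.fromString("b")))
  private var current: InternalRow = _
  override def next(): Boolean = rows.hasNext && { current = rows.next(); true }
  override def get(): InternalRow = current
  override def close(): Unit = ()
}
```

With the class on the classpath, `spark.read.format("com.example.SimpleSource").load()` returns a two-row DataFrame. The write path mirrors this shape via `SupportsWrite`/`WriteBuilder`, and streaming sources implement `MicroBatchStream` instead of `Batch`.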
Re: spark as data warehouse?
Sorry I missed the original channel, added it back.

I have less knowledge about dbt. If it supports Hive, it should support Kyuubi.

Basically, Kyuubi is a gateway between your client (e.g. beeline, Hive JDBC client) and compute engine (e.g. Spark, Flink, Trino). I think the most valuable things are:

1) Kyuubi reuses the Hive Thrift protocol, that is to say, you can treat Kyuubi as a HiveServer2 and continue using beeline or the Hive JDBC driver to connect to Kyuubi and run SQL (in your compute engine's dialect). Ideally, if a tool claims it supports Hive, then it supports Kyuubi.
2) Kyuubi manages the compute engine lifecycle and share level, making a good trade-off between isolation and resource consumption.[1]

PS: Kyuubi's support for Spark is very mature; you can find lots of production use cases here[2]. The support for Flink & Trino is in the beta phase.

[1] https://kyuubi.apache.org/docs/latest/deployment/engine_share_level.html
[2] https://github.com/apache/incubator-kyuubi/discussions/925

Thanks,
Cheng Pan

---

Thanks, I'll check it out.
I have a use case where we want to use dbt as a data modelling tool.
Will it take dbt queries and create the resulting model? I see it supports Trino, so I am guessing yes.

I will love to contribute to it as well.

Thanks
Deepak

---

Spark SQL can indeed take over your Hive workloads, and if you're looking for an open source solution, Apache Kyuubi (Incubating)[1] might help.

[1] https://kyuubi.apache.org/

Thanks,
Cheng Pan

On Sat, Mar 26, 2022 at 4:51 PM Cheng Pan wrote:
>
> I have less knowledge about dbt. If it supports Hive, it should support Kyuubi.
> Basically, Kyuubi is gateway between your client(e.g. beeline, hive jdbc client) and compute engine(e.g. Spark, Flink, Trino), I think the most valuable things are:
> 1) Kyuubi reuses the Hive Thrift Protocol, it say you can treat Kyuubi as a HiveServer2, and continue use beeline, hive jdbc driver to connect Kyuubi to run SQL(in your compute engine dialect). Ideally, if a tool claims it supports Hive, then it supports Kyuubi.
> 2) Kyuubi manages the compute engine lifecycle and share level, makes a good trade-off between isolation and resource consumption.[1]
>
> PS: Kyuubi's support for Spark is very mature, you can find lots of production use cases here[2]. The support for Flink & Trino is in beta phase.
>
> [1] https://kyuubi.apache.org/docs/latest/deployment/engine_share_level.html
> [2] https://github.com/apache/incubator-kyuubi/discussions/925
>
> Thanks,
> Cheng Pan
>
> On Sat, Mar 26, 2022 at 4:16 PM Deepak Sharma wrote:
> >
> > Thanks, I'll check it out.
> > I have a use case where we want to use dbt as data middling tool .
> > Will it take dbt queries and create the resulting model ?
> > I see it supports Trino , so I am guessing yes .
> >
> > I will love to contribute to it as well.
> >
> > Thanks
> > Deepak
> >
> > On Sat, 26 Mar 2022 at 1:24 PM, Cheng Pan wrote:
> >>
> >> Spark SQL can indeed take over your Hive workloads, and if you're
> >> looking for an open source solution, Apache Kyuubi(Incubating)[1]
> >> might help.
> >>
> >> [1] https://kyuubi.apache.org/
> >>
> >> Thanks,
> >> Cheng Pan
> >>
> >> On Sat, Mar 26, 2022 at 11:45 AM Deepak Sharma wrote:
> >> >
> >> > It can be used as warehouse but then you have to keep long running spark jobs.
> >> > This can be possible using cached data frames or dataset .
> >> >
> >> > Thanks
> >> > Deepak
> >> >
> >> > On Sat, 26 Mar 2022 at 5:56 AM, wrote:
> >> >>
> >> >> In the past time we have been using hive for building the data warehouse.
> >> >> Do you think if spark can used for this purpose? it's even more realtime than hive.
> >> >>
> >> >> Thanks.
> >> >>
> >> >> -
> >> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >> >>
> >> > --
> >> > Thanks
> >> > Deepak
> >> > www.bigdatabig.com
> >> > www.keosha.net
> >
> > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
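For context on point 1) above: since Kyuubi speaks the HiveServer2 Thrift protocol, connecting with a stock Hive beeline looks like this (the host and user are placeholders; 10009 is Kyuubi's default frontend port):

```shell
beeline -u 'jdbc:hive2://kyuubi-host:10009/' -n your_user
```

Any tool that can target a HiveServer2 JDBC URL — including dbt's Hive-compatible adapters — should be able to point at the same endpoint.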
[ANNOUNCE] Release Apache Kyuubi(Incubating) 1.3.0-incubating
Hello Spark Community,

The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.3.0-incubating has been released!

Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and designed to support more engines (e.g., Flink). Kyuubi provides a pure SQL gateway through the Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines.

We aim to make Kyuubi an "out-of-the-box" tool for data warehouses and data lakes. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.3.0-incubating.html

To learn more about Apache Kyuubi (Incubating), please see https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/incubator-kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community and the Incubator community who made this release possible!

Thanks,
On behalf of Apache Kyuubi (Incubating) community