Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
Hi Eugene, With regard to your points: What are the PYTHONPATH and SPARK_HOME env variables in your script? OK, let us look at a typical Spark project structure of mine - project_root |-- README.md |-- __init__.py |-- conf | |-- (configuration files for Spark) |-- deployment | |--
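
A minimal sketch, assuming a typical local install, of how these two env variables are often set from a Python driver script before spark-submit runs (the paths here are illustrative, not taken from Mich's actual layout; the py4j version matches the one mentioned later in this thread):

```python
import os

# Hypothetical install location; adjust to the actual Spark distribution path.
os.environ["SPARK_HOME"] = "/opt/spark/spark-3.5.0-bin-hadoop3"
# PYTHONPATH must expose both the pyspark package and the bundled py4j zip.
os.environ["PYTHONPATH"] = os.pathsep.join([
    os.path.join(os.environ["SPARK_HOME"], "python"),
    os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.9.7-src.zip"),
])
```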

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't understand a few things 1) The default behaviour is if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip is uploaded from the local SPARK_HOME. If it is set to "local://" the upload is skipped. I would expect the latter to be the

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Thanks Mich, Tried this and still getting INFO Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and __spark_conf__.zip. It is working now because I enabled direct access to HDFS to allow copying

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
It is just a goal… however I would not tune the number of regions or region size yet. Simply specify the GC algorithm and max heap size. Try to tune other options only if there is a need, only one at a time (otherwise it is difficult to determine cause/effect), and have a performance testing framework in

Re: Spark on Java 17

2023-12-09 Thread Faiz Halde
Thanks, I'll check them out. Curious though, the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and region size is limited up to 32 MB. That's strange because our heaps go up to 100 GB and that would require 64 MB

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
If you do tests with newer Java versions you can also try: - UseNUMA: -XX:+UseNUMA. See https://openjdk.org/jeps/345 You can also assess the new Java GC algorithms: - -XX:+UseShenandoahGC - works with terabytes of heap - more memory efficient than ZGC with heaps <32 GB. See also:

RE: Spark on Java 17

2023-12-09 Thread Luca Canali
Hi Faiz, We find G1GC works well for some of our workloads that are Parquet-read intensive and we have been using G1GC with Spark on Java 8 already (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions= “-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and higher)

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the clarification. I will try to do a plain JDBC connection in Scala/Java and will update this thread on how it goes. *Thanks,* *Venkat* On Thu, Dec 7, 2023 at 9:40 AM Nicholas Chammas wrote: > PyMySQL has its own implementation >

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation of the MySQL client-server protocol. It does not use JDBC. > On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan > wrote: > > Thanks for the advice

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the advice Nicholas. As mentioned in the original email, I have tried JDBC + SSH Tunnel using pymysql and sshtunnel and it worked fine. The problem happens only with Spark. *Thanks,* *Venkat* On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas wrote: > This is not a question for the

Re: ordering of rows in dataframe

2023-12-05 Thread Enrico Minack
Looks like what you want is to add a column such that, when you order by that column, the current order of the dataframe is preserved. All you need is the monotonically_increasing_id() function: spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id()).show() +---+---+ | id|
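
A self-contained version of Enrico's snippet, for reference (the show() output in the preview above is truncated):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# Ids are monotonically increasing within the dataframe's current order,
# so ordering by "row" later reproduces that order (ids are not consecutive).
df = spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id())
df.orderBy("row").show()
```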

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into Project ...   +- CollectMetrics 2, [count(1) AS count(1)#200L]   +- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L] The
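
A minimal sketch of the two pieces Enrico mentions — the plan-change log and an Observation whose metrics only materialize once rows are actually consumed (names here are illustrative):

```python
from pyspark.sql import SparkSession, Observation
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()

# Log how the optimizer rewrites each query plan, as suggested above.
spark.conf.set("spark.sql.planChangeLog.level", "WARN")

obs = Observation("metrics")
df = spark.range(10).observe(obs, count(lit(1)).alias("count"))
df.collect()    # metrics are only populated after an action consumes the rows
print(obs.get)  # e.g. {'count': 10}
```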

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.

Re: Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Apologies to everyone. I sent this to the wrong email list. Please discard On Mon, 04 Dec 2023 10:48:11 -0500 Yeikel Santana wrote --- Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Aironman DirtDiver
The issue you're encountering with the iceberg-spark-runtime dependency not being properly passed to the executors in your Spark Connect server deployment could be due to a couple of factors: 1. *Spark Submit Packaging:* When you use the --packages parameter in spark-submit, it only

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
Hi Michail, observations as well as ordinary accumulators only observe / process rows that are iterated / consumed by downstream stages. If the query plan decides to skip one side of the join, that one will be removed from the final plan completely. Then, the Observation will not retrieve any

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > >

Re: [Streaming (DStream) ] : Does Spark Streaming supports pause/resume consumption of message from Kafka?

2023-12-01 Thread Mich Talebzadeh
he condition for pausing is met, the stop() method is called to temporarily halt message processing. Conversely, when the condition for resuming is met, the start() method is invoked to re-enable message consumption. Let us have a go at it is_paused = False def process_stream(message): glob
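
A minimal, illustrative sketch completing the flag-based gating that the snippet above begins; the handler and the pause/resume triggers are hypothetical placeholders, not Mich's actual code:

```python
is_paused = False

def handle(message):
    # Hypothetical downstream processing.
    print("processing:", message)

def process_stream(message):
    global is_paused
    if is_paused:
        return        # consumption is effectively halted while paused
    handle(message)

def set_paused(paused):
    # Called when the pause/resume condition is met.
    global is_paused
    is_paused = paused
```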

Re:[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread beliefer
Congratulations! At 2023-12-01 01:23:55, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance

Re: Tuning Best Practices

2023-11-29 Thread Bryant Wright
Thanks, Jack! Please let me know if you find any other guides specific to tuning shuffles and joins. Currently, the best way I know how to handle joins across large datasets that can't be broadcast is by rewriting the source tables HIVE partitioned by one or two join keys, and then breaking down

RE: Re: Spark Compatibility with Spring Boot 3.x

2023-11-29 Thread Guru Panda
Team, Do we have any updates on when the Spark 4.x version will be released, in order to address the below issue related to > java.lang.NoClassDefFoundError: javax/servlet/Servlet Thanks and Regards, Guru On 2023/10/05 17:19:51 Angshuman Bhattacharya wrote: > Thanks Ahmed. I am trying to bring this up with

Re: Tuning Best Practices

2023-11-28 Thread Jack Goodson
Hi Bryant, the below docs are a good start on performance tuning https://spark.apache.org/docs/latest/sql-performance-tuning.html Hope it helps! On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright wrote: > Hi, I'm looking for a comprehensive list of Tuning Best Practices for > spark. > > I did a

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
I actually think it should be totally possible to use it on an executor side. Maybe it will require a small extension/udf, but generally no issues here. Pf4j is very lightweight, so you'll only have a small overhead for classloaders. There's still a small question of distribution of

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Faiz Halde
Hey Pasha, Is your suggestion directed at the Spark team? I can make use of the plugin system on the driver side of Spark, but considering Spark is distributed, I believe the executor side of Spark needs to adapt to the pf4j framework too. Thanks Faiz On Tue, Nov 28, 2023, 16:57 Pasha Finkelshtein

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
To me it seems like it's the best possible use case for PF4J. Pasha Finkelshteyn Developer Advocate

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Thanks Holden, So you're saying even Spark Connect is not going to provide that guarantee? The code referred to above is taken from the Spark Connect implementation. Could you explain which parts are tricky to get right? Just to be well prepared for the consequences On Tue, Nov 28, 2023, 01:30

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming

Re: How exactly does dropDuplicatesWithinWatermark work?

2023-11-21 Thread Jungtaek Lim
I'll probably reply the same on SO but am posting here first. This is mentioned in the JIRA ticket, design doc, and also the API doc, but to reiterate, the contract/guarantee of the new API is that the API will deduplicate events properly when the max distance of all your duplicate events is less than

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your spark job from your client? Your files can either be on HDFS or HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py', I assume you want your spark-submit --verbose \ --deploy-mode cluster \

RE: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala? Clay Stevens From: Hanyu Huang Sent: Wednesday, 15 November, 2023 1:15 AM To: user@spark.apache.org Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1 Caution, this email may be from a sender outside Wolters Kluwer. Verify the

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS. What you could try (have not tested it though in your scenario): - use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html - host the zip file on an https server and use that url (I

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
-site.xml, for instance, fs.oss.impl, etc. eabour From: Eugene Miretsky Date: 2023-11-16 09:58 To: eab...@163.com CC: Eugene Miretsky; user @spark Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS Hey! Thanks for the response. We are getting the error because there is no network

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey! Thanks for the response. We are getting the error because there is no network connectivity to the data nodes - that's expected. What I am trying to find out is WHY we need access to the data nodes, and if there is a way to submit a job without it. Cheers, Eugene On Wed, Nov 15, 2023 at

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, I think you should check if the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate the reasons why the datanodes are not functioning properly. It seems that the issue might be due

Re: Okio Vulnerability in Spark 3.4.1

2023-11-14 Thread Bjørn Jørgensen
>> *From:* Sean Owen >> *Sent:* Thursday, August 31, 2023 5:10 PM >> *To:* Agrawal, Sanket >> *Cc:* user@spark.apache.org >> *Subject:* [EXT] Re: Okio Vulnerability in Spark 3.4.1 >> Does the vulnerability affect Spark? >>

Re: Unsubscribe

2023-11-08 Thread Xin Zhang
Unsubscribe -- Email:josseph.zh...@gmail.com

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this? On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, wrote: > This issue is related to CharVarcharCodegenUtils readSidePadding method . > > Appending white spaces while reading ENUM data from mysql > > Causing issue in querying , writing the same data to Cassandra. > > On Thu, 12

Re: Spark master shuts down when one of zookeeper dies

2023-11-07 Thread Mich Talebzadeh
Hi, Spark standalone mode does not use or rely on ZooKeeper by default. The Spark master and workers communicate directly with each other without using ZooKeeper. However, it appears that in your case you are relying on ZooKeeper to provide high availability for your standalone cluster. By

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
General The reason os.path.join appends a double backslash on Windows is that this is how Windows paths are represented. However, GCS paths (a Hadoop Compatible File System, HCFS) use forward slashes as in Linux. This can cause problems if you are trying to use a Windows path in a
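
A small illustration of the difference (bucket and file names are hypothetical); posixpath always joins with forward slashes, matching HCFS conventions on every OS:

```python
import os.path
import posixpath

bucket = "gs://my-bucket"  # hypothetical bucket

# On Windows, os.path.join inserts backslashes, which break gs:// URIs:
print(os.path.join(bucket, "data", "file.parquet"))    # gs://my-bucket\data\file.parquet on Windows
# posixpath.join always uses forward slashes, regardless of OS:
print(posixpath.join(bucket, "data", "file.parquet"))  # gs://my-bucket/data/file.parquet
```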

Re: Data analysis issues

2023-11-02 Thread Mich Talebzadeh
Hi, Your mileage varies, so to speak. Whether or not the data you use to analyze in Spark through RStudio will be seen by Spark's back-end depends on how you deploy Spark and RStudio. If you are deploying Spark and RStudio on your own premises or in a private cloud environment, then the data you

Re: Spark / Scala conflict

2023-11-02 Thread Harry Jamison
Thanks Alonso, I think this gives me some ideas. My code is written in Python, and I use spark-submit to submit it. I am not sure what code is written in Scala. Maybe the Phoenix driver, based on the stack trace? How do I tell which version of Scala it was compiled against? Is there a jar

RE: jackson-databind version mismatch

2023-11-02 Thread moshik.vitas
...@163.com Cc: user @spark ; Saar Barhoom ; moshik.vi...@veeva.com Subject: Re: jackson-databind version mismatch [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution <https://github.com/apache/spark/pull/40893> tor. 2. nov. 2023 kl.

Re: Re: jackson-databind version mismatch

2023-11-02 Thread eab...@163.com
.jar 2023/09/09 10:08 513,968 jackson-module-scala_2.12-2.15.2.jar eabour From: Bjørn Jørgensen Date: 2023-11-02 16:40 To: eab...@163.com CC: user @spark; Saar Barhoom; moshik.vitas Subject: Re: jackson-databind version mismatch [SPARK-43225][BUILD][SQL] Remove jackson-core-asl

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution Thu, 2 Nov 2023 at 09:15, Bjørn Jørgensen wrote: > In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those > are with groupid

Re: Spark / Scala conflict

2023-11-02 Thread Aironman DirtDiver
The error message Caused by: java.lang.ClassNotFoundException: scala.Product$class indicates that the Spark job is trying to load a class that is not available in the classpath. This can happen if the Spark job is compiled with a different version of Scala than the version of Scala that is used to
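
To answer the "which Scala version" question from the earlier message, a hedged sketch (the jar path is hypothetical): the artifact name usually carries a _2.11/_2.12 suffix, and *$class.class entries are characteristic of Scala 2.11-or-earlier output — scala.Product$class itself no longer exists in Scala 2.12+:

```python
import re
import zipfile

jar = "phoenix-client.jar"  # hypothetical path to the jar in question

# Heuristic 1: the Scala version is often encoded in the artifact name.
m = re.search(r"_(2\.\d+)", jar)
if m:
    print("Scala version from artifact name:", m.group(1))

# Heuristic 2: trait-implementation classes named Foo$class are a
# hallmark of code compiled with Scala 2.11 or earlier.
with zipfile.ZipFile(jar) as zf:
    if any(n.endswith("$class.class") for n in zf.namelist()):
        print("Contains *$class.class entries -> compiled with Scala <= 2.11")
```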

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those are with groupid org.codehaus.jackson. The other jackson-* jars are with groupid com.fasterxml.jackson.core Thu, 2

Re: jackson-databind version mismatch

2023-11-01 Thread eab...@163.com
Hi, Please check the versions of jar files starting with "jackson-". Make sure all versions are consistent. jackson jar list in spark-3.3.0: 2022/06/10 04:37 75,714 jackson-annotations-2.13.3.jar 2022/06/10 04:37 374,895 jackson-core-2.13.3.jar

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-29 Thread Nagatomi Yasukazu
> eabour > *From:* eab...@163.com > *Date:* 2023-10-19 14:28 > *To:* Nagatomi Yasukazu ; user @spark > *Subject:* Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes > Hi all, > Has the spark connect server running on k8s functionality been imple

Re: Spark join produce duplicate rows in resultset

2023-10-27 Thread Meena Rajani
Thanks all: Patrick, selecting rev.* and i.* cleared the confusion. The item table actually brought 4 rows, hence the final result set had 4 rows. Regards, Meena On Sun, Oct 22, 2023 at 10:13 AM Bjørn Jørgensen wrote: > also remove the space in rev. scode > > Sun, 22 Oct 2023 at 19:08, Sadha

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-27 Thread Andrzej Zera
Hi, thank you very much for an update! Thanks, Andrzej On 2023/10/27 01:50:35 Jungtaek Lim wrote: > Hi, we are aware of your ticket and plan to look into it. We can't say > about ETA but just wanted to let you know that we are going to look into > it. Thanks for reporting! > > Thanks, >

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Jungtaek Lim
Hi, we are aware of your ticket and plan to look into it. We can't say about ETA but just wanted to let you know that we are going to look into it. Thanks for reporting! Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera wrote: > Hey All, > > I'm trying to

[Resolved] Re: spark.stop() cannot stop spark connect session

2023-10-25 Thread eab...@163.com
Hi all. I read source code at spark/python/pyspark/sql/connect/session.py at master · apache/spark (github.com) and the comment for the "stop" method is described as follows: def stop(self) -> None: # Stopping the session will only close the connection to the current session

Re: automatically/dynamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
Hi all, thank you for your reply. > Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole? Yes Jörn, I also believe this is the best approach, but here we're dealing with company policies and all the bureaucracy that comes along. In parallel I'm

Re: Maximum executors in EC2 Machine

2023-10-24 Thread Riccardo Ferrari
Hi, I would refer to their documentation to better understand the concepts behind cluster overview and submitting applications: - https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types - https://spark.apache.org/docs/latest/submitting-applications.html When

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Pol Santamaria
Hi Carlos! Take a look at this project, it's 6 years old but the approach is still valid: https://github.com/zillow/aws-custom-credential-provider The credential provider gets called each time an S3 or Glue Catalog is accessed, and then you can decide whether to use a cached token or renew.

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole ? Assumerole can make sense if you use an AWS IAM user and STS authentication, but this would make no sense within AWS for cross-account access as attaching the permissions to the Glue job role is

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
also remove the space in rev. scode Sun, 22 Oct 2023 at 19:08, Sadha Chilukoori wrote: > Hi Meena, > > I'm asking to clarify, are the *on *& *and* keywords optional in the join > conditions? > > Please try this snippet, and see if it helps > > select rev.* from rev > inner join customer c >

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Sadha Chilukoori
Hi Meena, I'm asking to clarify, are the *on *& *and* keywords optional in the join conditions? Please try this snippet, and see if it helps select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p on rev.sys = p.sys and rev.prin = p.prin and rev.scode= p.bcode

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
Hi Meena, It's not impossible, but it's unlikely that there's a bug in Spark SQL randomly duplicating rows. The most likely explanation is there are more records in the item table that match your sys/custumer_id/scode criteria than you expect. In your original query, try changing select rev.* to
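
A hedged sketch of that diagnostic (table and column names follow the thread, including the custumer_id spelling; assumes the tables are registered as views): count how many item rows share each join key — any count above 1 multiplies the joined output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keys with more than one matching row in item explain duplicated join output.
spark.sql("""
    SELECT sys, custumer_id, scode, COUNT(*) AS matches
    FROM item
    GROUP BY sys, custumer_id, scode
    HAVING COUNT(*) > 1
""").show()
```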

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
-10-19 14:28 To: Nagatomi Yasukazu; user @spark Subject: Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Hi all, Has the spark connect server running on k8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi all, Has the spark connect server running on k8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Dear Spark Community, I've been exploring the capabilities of the Spark

Re: hive: spark as execution engine. class not found problem

2023-10-17 Thread Vijay Shankar
UNSUBSCRIBE On Tue, Oct 17, 2023 at 5:09 PM Amirhossein Kabiri < amirhosseikab...@gmail.com> wrote: > I used Ambari to config and install Hive and Spark. I want to insert into > a hive table using Spark execution Engine but I face to this weird error. > The error is: > > Job failed with

Re: Spark stand-alone mode

2023-10-17 Thread Ilango
Hi all, Thanks a lot for your suggestions and knowledge sharing. I'd like to let you know that I completed setting up the standalone cluster, and a couple of data science users have already been using it for the last two weeks. And the performance is really good. Almost 10X performance improvement

Re: Can not complete the read csv task

2023-10-14 Thread Khalid Mammadov
This command only defines a new DataFrame; in order to see some results you need to do something like merged_spark_data.show() on a new line. Regarding the error, I think it's a typical error that you get when you run Spark on Windows OS. You can suppress it using the Winutils tool (Google it or ChatGPT
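
For reference, a minimal sketch of the point above (the path is hypothetical; the variable name follows the thread): reading a CSV only defines the DataFrame, and an action such as show() triggers the actual read:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Nothing is read yet; this merely defines the DataFrame (lazy evaluation).
merged_spark_data = spark.read.csv("C:/data/*.csv", header=True)

# An action forces execution and displays the first rows.
merged_spark_data.show()
```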

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to CharVarcharCodegenUtils readSidePadding method . Appending white spaces while reading ENUM data from mysql Causing issue in querying , writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from spark

Re: Autoscaling in Spark

2023-10-10 Thread Mich Talebzadeh
This has been brought up a few times. I will focus on Spark Structured Streaming. Autoscaling does not support Spark Structured Streaming (SSS). Why? Because streaming jobs are typically long-running jobs that need to maintain state across micro-batches. Autoscaling is designed to scale up and down

Re: Updating delta file column data

2023-10-10 Thread Mich Talebzadeh
Hi, Since you mentioned that there could be duplicate records with the same unique key in the Delta table, you will need a way to handle these duplicate records. One approach I can suggest is to use a timestamp to determine the latest or most relevant record among duplicates, the so-called

Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
Hi Sanket, Driver and executor logs are written to stdout by default; this can be configured using the SPARK_HOME/conf/log4j.properties file. The file, along with the entire SPARK_HOME/conf, is auto-propagated to all driver and executor containers and mounted as a volume. Thanks On Mon, 9 Oct, 2023, 5:37

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa
Unsubscribe > Em 9 de out. de 2023, à(s) 07:03, Mich Talebzadeh > escreveu: > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No. a commit does not refer to data being

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Your mileage varies. Often there is a flavour of cloud data warehouse already there. CDWs like BigQuery, Redshift, Snowflake and so forth. They can all do a good job to various degrees - Use efficient data types. Choose data types that are efficient for Spark to process. For example, use

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
Thank you for your feedback Mich. In general, how can one optimise the cloud data warehouses (the sink part) to handle streaming Spark data efficiently, avoiding the bottlenecks discussed. AK On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote: Hi, Please see my

Re: Updating delta file column data

2023-10-09 Thread Mich Talebzadeh
In a nutshell, is this what you are trying to do? 1. Read the Delta table into a Spark DataFrame. 2. Explode the string column into a struct column. 3. Convert the hexadecimal field to an integer. 4. Write the DataFrame back to the Delta table in merge mode with a unique key. Is
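
A hedged sketch of steps 1-3 of that outline (table path, JSON schema, and field names are hypothetical; conv(x, 16, 10) converts a hex string to decimal, returned as a string, hence the cast). The merge-mode write in step 4 would use Delta Lake's MERGE API and is omitted here:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, conv, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # assumes the Delta Lake package is on the classpath

# Hypothetical schema of the struct serialized inside the string column.
schema = StructType([StructField("key", StringType()),
                     StructField("hex_val", StringType())])

df = (spark.read.format("delta").load("/delta/events")            # 1. read the Delta table
      .withColumn("parsed", from_json(col("json_col"), schema))   # 2. string -> struct
      .withColumn("int_val",
                  conv(col("parsed.hex_val"), 16, 10).cast("long")))  # 3. hex -> integer
# 4. write back in merge mode keyed on the unique id (Delta MERGE, not shown).
```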

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Hi, Please see my responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No, a commit does not refer to data being delivered to a sink like Snowflake or BigQuery. The term commit refers to Spark Structured Streaming (SSS)

Re: Updating delta file column data

2023-10-09 Thread Karthick Nk
Hi All, I have mentioned the sample data below and the operation I need to perform on it. I have delta tables with columns; in those columns I have data in the string data type (containing struct data). So, I need to update one key's value in the struct field data in the string column

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like? > > What

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Angshuman Bhattacharya
Thanks Ahmed. I am trying to bring this up with Spark DE community On Thu, Oct 5, 2023 at 12:32 PM Ahmed Albalawi < ahmed.albal...@capitalone.com> wrote: > Hello team, > > We are in the process of upgrading one of our apps to Spring Boot 3.x > while using Spark, and we have encountered an issue

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However for now you would have to also include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, > > We

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in Kafka is not directly correlated to Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node,

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of Delta Lake. That will help you for sure. It merges small files. Also, it depends on the file format. If you are working with Parquet, then small files should still not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Prashant Sharma
Hi Sanket, more details might help here. How does your spark configuration look like? What exactly was done when this happened? On Thu, 5 Oct, 2023, 2:29 pm Agrawal, Sanket, wrote: > Hello Everyone, > > > > We are trying to stream the changes in our Iceberg tables stored in AWS > S3. We are

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]:

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: >
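
A short illustration of the trade-off: repartition(N) performs a full shuffle and can increase or decrease the partition count with even sizing, while coalesce(N) only merges existing partitions, avoiding a shuffle when reducing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000, 1, 16)  # explicitly start with 16 partitions

fewer = df.coalesce(4)        # no shuffle: merges existing partitions
balanced = df.repartition(4)  # full shuffle: evenly sized partitions

print(fewer.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())  # 4 4
```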

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate a deeper understanding. We're interfacing with S3 Compatible storages, but our operational context is somewhat

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more a trend to move away from static credentials/certificates that are stored in a secret vault. The issue is that the rotation of them is complex, once they are leaked they can be abused, making minimal permissions feasible is cumbersome etc. That is why keyless approaches are

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible: https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html On 01.10.2023 at 11:13, Mich Talebzadeh wrote: It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment on exposed storage such as s3 and gcs. The sensitive metadata protected by metadata concealment is also

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon, Using IAM as suggested by Jorn is the best approach. We recently moved our spark workload from HDP to Spark on K8 and utilizing IAM. It will save you from secret management headaches and also allows a lot more flexibility on access control and option to allow access to multiple S3 buckets

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static iam (s3) credentials. It is an outdated insecure method - even AWS recommend against using this for anything (cf eg https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html). It is almost a guarantee to get your data stolen and your account manipulated. If

Re: Inquiry about Processing Speed

2023-09-28 Thread Jack Goodson
Hi Haseeb, I think the user mailing list is what you're looking for, people are usually pretty active on here if you present a direct question about apache spark. I've linked below the community guidelines which says which mailing lists are for what etc https://spark.apache.org/community.html

Re: Inquiry about Processing Speed

2023-09-27 Thread Deepak Goel
Hi "Processing Speed" can be at a software level (Code Optimization) and at a hardware level (Capacity Planning) Deepak "The greatness of a nation can be judged by the way its animals are treated - Mahatma Gandhi" +91 73500 12833 deic...@gmail.com Facebook: https://www.facebook.com/deicool

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-22 Thread Karthick
Hi All, It would be helpful if anyone could give pointers on the problem described. Thanks Karthick. On Wed, Sep 20, 2023 at 3:03 PM Gowtham S wrote: > Hi Spark Community, > > Thank you for bringing up this issue. We've also encountered the same > challenge and are actively working on finding a

Re: Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Found this issue reported earlier but was bulk closed: https://issues.apache.org/jira/browse/SPARK-27030 Regards, Shrikant On Fri, 22 Sep 2023 at 12:03 AM, Shrikant Prasad wrote: > Hi all, > > We have multiple spark jobs running in parallel trying to write into same > hive table but each job

Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh
In general you can probably do all this in Spark SQL by reading the Hive table into a DF in PySpark, then creating a TempView on that DF, selecting PM data through the CAST() function, and then using a windowing function to select the top 5 with DENSE_RANK() #Read Hive table as a DataFrame df =
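
A hedged sketch of that recipe (the table, timestamp column, and volume column names are hypothetical): keep PM rows via the time column, then rank by volume and take the top 5:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("default.readings")   # read the Hive table as a DataFrame
df.createOrReplaceTempView("readings")

# PM rows only, then DENSE_RANK over descending volume and keep the top 5.
spark.sql("""
    SELECT * FROM (
        SELECT t.*, DENSE_RANK() OVER (ORDER BY volume DESC) AS rnk
        FROM readings t
        WHERE hour(CAST(event_time AS TIMESTAMP)) >= 12
    ) WHERE rnk <= 5
""").show()
```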

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks! On Wed, Sep 20, 2023 at 1:04 PM Sean Owen wrote: > [ External sender. Exercise caution. ] > > I think the announcement mentioned there were some issues with pypi and > the upload size this time. I am sure it's intended to be there when > possible. > > On Wed, Sep 20,

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( >

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered 3 times over now. Neither is better, they just calculate different things. That the 'default' is sample stddev is just convention. stddev_pop is the simple standard deviation of a set of numbers stddev_samp is used when
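
A quick way to see the difference on a tiny illustrative dataset: stddev is an alias for stddev_samp, which divides by n-1, while stddev_pop divides by n:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev, stddev_pop, stddev_samp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(v,) for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]], ["v"])

# stddev == stddev_samp (sample, n-1 denominator); stddev_pop uses n.
df.select(stddev_samp("v"), stddev_pop("v"), stddev("v")).show()
```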

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-20 Thread Gowtham S
Hi Spark Community, Thank you for bringing up this issue. We've also encountered the same challenge and are actively working on finding a solution. It's reassuring to know that we're not alone in this. If you have any insights or suggestions regarding how to address this problem, please feel

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Mich Talebzadeh
Spark uses the sample standard deviation stddev_samp by default, whereas *Hive* uses population standard deviation stddev_pop as default. My understanding is that spark uses sample standard deviation by default because - It is more commonly used. - It is more efficient to calculate. -

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In below replace sales with your table name and AMOUNT_SOLD with the column you want to do the calculation SELECT
