Re: Spark join produces duplicate rows in result set

2023-10-22 Thread Bjørn Jørgensen
Also remove the space in rev. scode. On Sun, 22 Oct 2023 at 19:08, Sadha Chilukoori wrote: > Hi Meena, > > I'm asking to clarify: are the *on* & *and* keywords optional in the join > conditions? > > Please try this snippet, and see if it helps > > select rev.* from rev > inner join customer c >

Re: Spark join produces duplicate rows in result set

2023-10-22 Thread Sadha Chilukoori
Hi Meena, I'm asking to clarify: are the *on* & *and* keywords optional in the join conditions? Please try this snippet, and see if it helps: select rev.* from rev inner join customer c on rev.custumer_id = c.id inner join product p on rev.sys = p.sys and rev.prin = p.prin and rev.scode = p.bcode
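A runnable sketch of the suggested query with the explicit ON/AND keywords; the trailing LEFT JOIN to item is a reconstruction of the truncated original post, and its join key is an assumption:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit ON/AND keywords; table and column names follow the thread
# (custumer_id is spelled as in the original schema). The left join to
# item is a hypothetical completion of the truncated snippet.
spark.sql("""
    SELECT rev.*
    FROM rev
    INNER JOIN customer c
        ON rev.custumer_id = c.id
    INNER JOIN product p
        ON rev.sys = p.sys
       AND rev.prin = p.prin
       AND rev.scode = p.bcode
    LEFT JOIN item i
        ON rev.sys = i.sys
""").show()
```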

Re: Spark join produces duplicate rows in result set

2023-10-22 Thread Patrick Tucci
Hi Meena, It's not impossible, but it's unlikely that there's a bug in Spark SQL randomly duplicating rows. The most likely explanation is that there are more records in the item table matching your sys/custumer_id/scode criteria than you expect. In your original query, try changing select rev.* to
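The preview cuts off before the concrete suggestion; a hedged sketch of one way to check for join fan-out (the item join key is an assumption taken from the thread):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Count how many item rows match each rev key; any count > 1 means the
# join legitimately multiplies rows and no Spark bug is involved.
spark.sql("""
    SELECT rev.sys, rev.custumer_id, rev.scode, COUNT(*) AS item_matches
    FROM rev
    LEFT JOIN item i ON rev.sys = i.sys
    GROUP BY rev.sys, rev.custumer_id, rev.scode
    HAVING COUNT(*) > 1
""").show()
```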

Automatically/dynamically renew AWS temporary token

2023-10-22 Thread Carlos Aguni
Hi all, I have a scenario where I need to assume a cross-account role to get S3 bucket access. The problem is that this role only allows for a 1h time span (no negotiation). That said, does anyone know a way to tell Spark to automatically renew the token, or to dynamically renew the token on each

Spark join produces duplicate rows in result set

2023-10-21 Thread Meena Rajani
Hello all: I am using Spark SQL to join two tables. To my surprise I am getting redundant rows. What could be the cause? select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p rev.sys = p.sys rev.prin = p.prin rev.scode= p.bcode left join item I on rev.sys =

Error when trying to get the data from Hive Materialized View

2023-10-21 Thread Siva Sankar Reddy
Hi Team, We are not getting any error when retrieving the data from a Hive table in PySpark, but are getting the error (scala.MatchError: MATERIALIZED_VIEW (of class org.apache.hadoop.hive.metastore.TableType)). Please let me know the resolution for this? Thanks

spark.stop() cannot stop spark connect session

2023-10-20 Thread eab...@163.com
Hi, my code: from pyspark.sql import SparkSession spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate() import pandas as pd # create a pandas DataFrame pdf = pd.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "gender": ["F", "M", "M"] }) #
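A cleaned-up sketch of the reported flow (the remote endpoint is the poster's; the comment about stop() reflects the thread's observation, not documented behaviour):

```
from pyspark.sql import SparkSession
import pandas as pd

# Connect to a Spark Connect endpoint (replace with your own address).
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()

# Create a pandas DataFrame and convert it to a Spark DataFrame.
pdf = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "gender": ["F", "M", "M"],
})
df = spark.createDataFrame(pdf)
df.show()

# Per the thread: stop() closes the client-side connection, but the
# server-side session may keep running until it times out.
spark.stop()
```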

"Premature end of Content-Length" Error

2023-10-19 Thread Sandhya Bala
Hi all, I am running into the following error with Spark 2.4.8: Job aborted due to stage failure: Task 9 in stage 2.0 failed 4 times, most recent failure: Lost task 9.3 in stage 2.0 (TID 100, 10.221.8.73, executor 2): org.apache.http.ConnectionClosedException: Premature end of

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi, I have found three important classes: org.apache.spark.sql.connect.service.SparkConnectServer: the ./sbin/start-connect-server.sh script uses the SparkConnectServer class as its main class. In the main function, it uses SparkSession.builder.getOrCreate() to create a local session, and starts

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi all, Has the Spark Connect server running-on-K8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Dear Spark Community, I've been exploring the capabilities of the Spark

Re: hive: spark as execution engine. class not found problem

2023-10-17 Thread Vijay Shankar
UNSUBSCRIBE On Tue, Oct 17, 2023 at 5:09 PM Amirhossein Kabiri < amirhosseikab...@gmail.com> wrote: > I used Ambari to config and install Hive and Spark. I want to insert into > a hive table using Spark execution Engine but I face to this weird error. > The error is: > > Job failed with

hive: spark as execution engine. class not found problem

2023-10-17 Thread Amirhossein Kabiri
I used Ambari to configure and install Hive and Spark. I want to insert into a Hive table using the Spark execution engine, but I face this weird error. The error is: Job failed with java.lang.ClassNotFoundException: ive_20231017100559_301568f9-bdfa-4f7c-89a6-f69a65b30aaf:1 2023-10-17 10:07:42,972

Re: Spark stand-alone mode

2023-10-17 Thread Ilango
Hi all, Thanks a lot for your suggestions and knowledge sharing. I'd like to let you know that I completed setting up the stand-alone cluster, and a couple of data science users have already been using it for the last two weeks. The performance is really good: almost 10X performance improvement

Re: Can not complete the read csv task

2023-10-14 Thread Khalid Mammadov
This command only defines a new DataFrame; in order to see some results you need to do something like merged_spark_data.show() on a new line. Regarding the error, I think it's a typical error that you get when you run Spark on Windows OS. You can suppress it using the Winutils tool (Google it or ChatGPT
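A minimal illustration of the point (the variable name comes from the thread; the path is hypothetical):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read.csv only defines the DataFrame; no data is fetched yet.
merged_spark_data = spark.read.csv("data/", header=True, inferSchema=True)

# An action such as show() triggers the actual read and prints rows.
merged_spark_data.show()
```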

[ANNOUNCE] Apache Celeborn(incubating) 0.3.1 available

2023-10-13 Thread Cheng Pan
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.3.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, highly efficient service for intermediate data including

Fwd: Fw: Can not complete the read csv task

2023-10-13 Thread KP Youtuber
Dear group members, I'm trying to get a fresh start with Spark, but came across the following issue: I tried to read a few CSV files from a folder, but the task got stuck and didn't complete (content copied from the terminal). Can someone help me understand what is going wrong? Versions: java

Fw: Can not complete the read csv task

2023-10-13 Thread Kelum Perera
From: Kelum Perera Sent: Thursday, October 12, 2023 11:40 AM To: user@spark.apache.org; Kelum Perera; Kelum Gmail Subject: Can not complete the read csv task Dear friends, I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for MySQL ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to the CharVarcharCodegenUtils readSidePadding method. Appending white spaces while reading ENUM data from MySQL is causing issues in querying and in writing the same data to Cassandra. On Thu, 12 Oct 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from spark

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for MySQL ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `UPPER(col) = UPPER(value)` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where condition: `UPPER(vn) = 'ERICSSON' AND (upper(st)

Can not complete the read csv task

2023-10-12 Thread Kelum Perera
Dear friends, I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a folder, but the task got stuck and did not complete, as shown in the content copied from the terminal. Can someone help me understand what is going wrong? Versions: java version "11.0.16" 2022-07-19 LTS

Re: Autoscaling in Spark

2023-10-10 Thread Mich Talebzadeh
This has been brought up a few times; I will focus on Spark Structured Streaming. Autoscaling does not support Spark Structured Streaming (SSS). Why? Because streaming jobs are typically long-running jobs that need to maintain state across micro-batches, while autoscaling is designed to scale up and down

Autoscaling in Spark

2023-10-10 Thread Kiran Biswal
Hello Experts Is there any true auto scaling option for spark? The dynamic auto scaling works only for batch. Any guidelines on spark streaming autoscaling and how that will be tied to any cluster level autoscaling solutions? Thanks

Re: Updating delta file column data

2023-10-10 Thread Mich Talebzadeh
Hi, Since you mentioned that there could be duplicate records with the same unique key in the Delta table, you will need a way to handle these duplicate records. One approach I can suggest is to use a timestamp to determine the latest or most relevant record among duplicates, the so-called

Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
Hi Sanket, Driver and executor logs are written to stdout by default; this can be configured using the SPARK_HOME/conf/log4j.properties file. The file, including the entire SPARK_HOME/conf, is auto-propagated to all driver and executor containers and mounted as a volume. Thanks On Mon, 9 Oct 2023, 5:37

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa
Unsubscribe > On 9 Oct 2023, at 07:03, Mich Talebzadeh wrote: > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No, a commit does not refer to data being

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Your mileage varies. Often there is a flavour of cloud data warehouse already there; CDWs like BigQuery, Redshift, Snowflake and so forth. They can all do a good job to various degrees. - Use efficient data types. Choose data types that are efficient for Spark to process. For example, use

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
Thank you for your feedback Mich. In general, how can one optimise the cloud data warehouses (the sink part) to handle streaming Spark data efficiently, avoiding the bottlenecks discussed? AK On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote: Hi, Please see my

Re: Updating delta file column data

2023-10-09 Thread Mich Talebzadeh
In a nutshell, is this what you are trying to do? 1. Read the Delta table into a Spark DataFrame. 2. Explode the string column into a struct column. 3. Convert the hexadecimal field to an integer. 4. Write the DataFrame back to the Delta table in merge mode with a unique key. Is
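A hedged sketch of those four steps (the path, the unique-key column `id`, the string column `json_col`, and the JSON schema are all assumptions; requires the delta-spark package):

```
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1. Read the Delta table; assume columns `id` (unique key) and
#    `json_col` (a JSON string holding struct data) -- both hypothetical.
df = spark.read.format("delta").load("/delta/events")

# 2. Parse the JSON string into a struct (assumed schema).
parsed = df.withColumn("p", F.from_json("json_col", "name STRING, hex_field STRING"))

# 3. Convert the hexadecimal field to an integer and rebuild the JSON string.
updated = parsed.select(
    "id",
    F.to_json(F.struct(
        F.col("p.name").alias("name"),
        F.conv(F.col("p.hex_field"), 16, 10).cast("long").alias("int_field"),
    )).alias("new_json"),
)

# 4. Merge back on the unique key.
target = DeltaTable.forPath(spark, "/delta/events")
(target.alias("t")
 .merge(updated.alias("s"), "t.id = s.id")
 .whenMatchedUpdate(set={"json_col": "s.new_json"})
 .execute())
```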

Log file location in Spark on K8s

2023-10-09 Thread Agrawal, Sanket
Hi All, We are trying to send the Spark logs using fluent-bit. We validated that fluent-bit is able to move logs of all other pods except the driver/executor pods. It would be great if someone could guide us on where to look for Spark logs in Spark on Kubernetes with client/cluster mode

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Hi, Please see my responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No, a commit does not refer to data being delivered to a sink like Snowflake or BigQuery. The term commit refers to Spark Structured Streaming (SSS)

Re: Updating delta file column data

2023-10-09 Thread Karthick Nk
Hi All, I have given the sample data below along with the operation I need to perform on it. I have Delta tables with a column holding data in the string data type (it contains struct data). I need to update one key's value in the struct field data in that string column

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID
Hello team 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? 2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself? Thanks AK

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like? > > What

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Angshuman Bhattacharya
Thanks Ahmed. I am trying to bring this up with Spark DE community On Thu, Oct 5, 2023 at 12:32 PM Ahmed Albalawi < ahmed.albal...@capitalone.com> wrote: > Hello team, > > We are in the process of upgrading one of our apps to Spring Boot 3.x > while using Spark, and we have encountered an issue

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However for now you would have to also include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, > > We

Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Ahmed Albalawi
Hello team, We are in the process of upgrading one of our apps to Spring Boot 3.x while using Spark, and we have encountered an issue with Spark compatibility, specifically with Jakarta Servlet. Spring Boot 3.x uses Jakarta Servlet while Spark uses Javax Servlet. Can we get some guidance on how

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in Kafka is not directly correlated to the number of Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 shuffle partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node,
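A small sketch of checking and overriding the shuffle-partition default mentioned above (the value 60 merely mirrors the Kafka partition count from the thread; the right setting depends on your sink and cluster):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The "200 partitions" above is this setting's default.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200

# Align shuffle parallelism with the workload rather than blindly
# matching the number of Kafka partitions.
spark.conf.set("spark.sql.shuffle.partitions", "60")
```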

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of Delta Lake; that will help you for sure. It merges small files. Also, it depends on the file format; if you are working with Parquet then small files should still not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi
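For reference, the compaction command is plain SQL (a sketch; the table name is hypothetical and the delta-spark package must be configured):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta Lake's OPTIMIZE rewrites many small files into fewer large ones.
spark.sql("OPTIMIZE my_delta_table")
```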

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Prashant Sharma
Hi Sanket, more details might help here. How does your spark configuration look like? What exactly was done when this happened? On Thu, 5 Oct, 2023, 2:29 pm Agrawal, Sanket, wrote: > Hello Everyone, > > > > We are trying to stream the changes in our Iceberg tables stored in AWS > S3. We are

Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Agrawal, Sanket
Hello Everyone, We are trying to stream the changes in our Iceberg tables stored in AWS S3. We are achieving this through the Spark-Iceberg connector, using JAR files for Spark-AWS. Suddenly we have started receiving the error "Connection pool shut down". Spark Version: 3.4.1 Iceberg: 1.3.1 Any

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]:

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi all on user@spark: We are looking for advice and suggestions on how to tune the .repartition() parameter. We are using Spark Streaming on our data pipeline to consume messages and persist them to a Delta Lake (https://delta.io/learn/getting-started/). We read messages from a Kafka topic,

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: >
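A quick illustration of the difference:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# repartition(N) always shuffles and can raise or lower the count;
# coalesce(N) only merges existing partitions, avoiding a shuffle,
# so it is the cheaper way to reduce output file counts.
print(df.repartition(8).rdd.getNumPartitions())  # 8, via full shuffle
print(df.coalesce(2).rdd.getNumPartitions())     # 2, no shuffle
```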

[Spark Core]: Recomputation cost of a job due to executor failures

2023-10-04 Thread Faiz Halde
Hello, Due to the way Spark implements shuffle, a loss of an executor sometimes results in the recomputation of partitions that were lost The definition of a *partition* is the tuple ( RDD-ids, partition id ) RDD-ids is a sequence of RDD ids In our system, we define the unit of work performed

Updating delta file column data

2023-10-02 Thread Karthick Nk
Hi community members, In Databricks ADLS2 Delta tables I need to perform the below operation; could you help me with your thoughts. I have Delta tables with one column of data type string, which contains JSON data as a string. I need to do the following: 1. I have to update one

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate a deeper understanding. We're interfacing with S3 Compatible storages, but our operational context is somewhat

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more a trend to move away from static credentials/certificates that are stored in a secret vault. The issue is that the rotation of them is complex, once they are leaked they can be abused, making minimal permissions feasible is cumbersome etc. That is why keyless approaches are

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible: https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html On 01.10.2023 at 11:13, Mich Talebzadeh wrote: It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment on exposed storage such as s3 and gcs. The sensitive metadata protected by metadata concealment is also

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon, Using IAM as suggested by Jörn is the best approach. We recently moved our Spark workload from HDP to Spark on K8s, utilizing IAM. It will save you from secret-management headaches and also allows a lot more flexibility on access control, with the option to allow access to multiple S3 buckets

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static IAM (S3) credentials. It is an outdated, insecure method - even AWS recommends against using this for anything (cf. e.g. https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html). It is almost a guarantee to get your data stolen and your account manipulated. If

Using Facebook Prophet + PySpark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Has anyone used Prophet + PySpark for forecasting? I'm trying to backfill forecasts, and running into issues (error: Dataframe has less than 2 non-NaN rows). I'm removing all records with NaN values, yet still getting this error. Details are in the Stack Overflow link ->

Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members, I trust this message finds you all in good health and spirits. I'm reaching out to the collective expertise of this esteemed community with a query regarding Spark on Kubernetes. As a newcomer, I have always admired the depth and breadth of knowledge shared within

Re: Inquiry about Processing Speed

2023-09-28 Thread Jack Goodson
Hi Haseeb, I think the user mailing list is what you're looking for, people are usually pretty active on here if you present a direct question about apache spark. I've linked below the community guidelines which says which mailing lists are for what etc https://spark.apache.org/community.html

Thread dump only shows 10 shuffle clients

2023-09-28 Thread Nebi Aydin
Hi all, I set spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to *800*, but when I click Thread dump from the Spark UI for the executor I only see 10 shuffle client threads for the executor. Is that normal, or am I missing something?

Re: Inquiry about Processing Speed

2023-09-27 Thread Deepak Goel
Hi "Processing Speed" can be at a software level (Code Optimization) and at a hardware level (Capacity Planning) Deepak "The greatness of a nation can be judged by the way its animals are treated - Mahatma Gandhi" +91 73500 12833 deic...@gmail.com Facebook: https://www.facebook.com/deicool

Files io threads vs shuffle io threads

2023-09-27 Thread Nebi Aydin
Hi all, Can someone explain the difference between files IO threads and shuffle IO threads? I couldn't find any explanation. I'm specifically asking about these: spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads spark.files.io.serverThreads spark.files.io.clientThreads

Inquiry about Processing Speed

2023-09-27 Thread Haseeb Khalid
Dear Support Team, I hope this message finds you well. My name is Haseeb Khalid, and I am reaching out to discuss a scenario related to processing speed in Apache Spark. I have been utilizing these technologies in our projects, and we have encountered a specific use case where we are seeking to

Reading Glue Catalog Views through Spark.

2023-09-25 Thread Agrawal, Sanket
Hello Everyone, We have set up Spark and the Iceberg-Glue connectors as described at https://iceberg.apache.org/docs/latest/aws/ to integrate Spark, Iceberg, and AWS Glue Catalog. We are able to read tables through this, but we are unable to read data through views. PFB the error:

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs? So that, for example, if Spark (driver/worker) logs some info/warning/error message, the variable will be output there (in order to help filter logs for the sake of monitoring and troubleshooting).

[ANNOUNCE] Apache Kyuubi released 1.7.3

2023-09-25 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.3 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Spark Connect Multi-tenant Support

2023-09-22 Thread Kezhi Xiong
Hi, From Spark Connect's official site's image, it mentions the "Multi-tenant Application Gateway" on the driver. Are there any more documents about it? Can I know how users can utilize such a feature? Thanks, Kezhi

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-22 Thread Karthick
Hi All, It would be helpful if anyone could give pointers on the problem described. Thanks, Karthick. On Wed, Sep 20, 2023 at 3:03 PM Gowtham S wrote: > Hi Spark Community, > > Thank you for bringing up this issue. We've also encountered the same > challenge and are actively working on finding a

Re: Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Found this issue reported earlier but was bulk closed: https://issues.apache.org/jira/browse/SPARK-27030 Regards, Shrikant On Fri, 22 Sep 2023 at 12:03 AM, Shrikant Prasad wrote: > Hi all, > > We have multiple spark jobs running in parallel trying to write into same > hive table but each job

Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Hi all, We have multiple spark jobs running in parallel trying to write into same hive table but each job writing into different partition. This was working fine with Spark 2.3 and Hadoop 2.7. But after upgrading to Spark 3.2 and Hadoop 3.2.2, these parallel jobs are failing with FileNotFound

Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh
In general you can probably do all this in Spark SQL by reading the Hive table into a DataFrame in PySpark, creating a TempView on that DF, selecting the PM data through the CAST() function, and then using a windowing function to select the top 5 with DENSE_RANK(). # Read Hive table as a DataFrame df =
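A sketch of that approach (the table and columns come from the question below; treating "PM" as hours >= 12 is an assumption):

```
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Read the Hive table and keep only PM rows.
df = spark.table("hive.sample_data")
pm = df.filter(F.hour("time_in") >= 12)

# Total volume per IP, then rank and keep the top 5.
totals = pm.groupBy("incoming_ip").agg(F.sum("volume").alias("total_volume"))
w = Window.orderBy(F.desc("total_volume"))
totals.withColumn("rnk", F.dense_rank().over(w)).filter("rnk <= 5").show()
```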

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus, I have a Hive table created as below (there are more columns): CREATE TABLE hive.sample_data ( incoming_ip STRING, time_in TIMESTAMP, volume INT ); Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks! On Wed, Sep 20, 2023 at 1:04 PM Sean Owen wrote: > [ External sender. Exercise caution. ] > > I think the announcement mentioned there were some issues with pypi and > the upload size this time. I am sure it's intended to be there when > possible. > > On Wed, Sep 20,

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( >

PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi, Are there any plans to upload PySpark 3.5.0 to PyPI ( https://pypi.org/project/pyspark/)? It's still 3.4.1. Thanks, Kezhi

[Spark 3.5.0] Is the protobuf-java JAR no longer shipped with Spark?

2023-09-20 Thread Gijs Hendriksen
Hi all, This week, I tried upgrading to Spark 3.5.0, as it contained some fixes for spark-protobuf that I need for my project. However, my code is no longer running under Spark 3.5.0. My build.sbt file is configured as follows: val sparkV  = "3.5.0" val hadoopV = "3.3.6"

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered 3 times over now. Neither is better, they just calculate different things. That the 'default' is sample stddev is just convention. stddev_pop is the simple standard deviation of a set of numbers stddev_samp is used when
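A plain-Python check of the two definitions (standard library only):

```
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population stddev divides by n; sample stddev divides by n - 1
# (Bessel's correction), which is what Spark's stddev/stddev_samp
# and Excel's STDEV.S compute.
print(statistics.pstdev(data))  # population: 2.0
print(statistics.stdev(data))   # sample: ~2.138
```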

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-20 Thread Gowtham S
Hi Spark Community, Thank you for bringing up this issue. We've also encountered the same challenge and are actively working on finding a solution. It's reassuring to know that we're not alone in this. If you have any insights or suggestions regarding how to address this problem, please feel

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-20 Thread Mich Talebzadeh
Spark uses the sample standard deviation stddev_samp by default, whereas *Hive* uses the population standard deviation stddev_pop as default. My understanding is that Spark uses sample standard deviation by default because: - It is more commonly used. - It is more efficient to calculate. -

unsubscribe

2023-09-19 Thread Danilo Sousa
unsubscribe

unsubscribe

2023-09-19 Thread Ghousia
unsubscribe

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In the query below, replace sales with your table name and AMOUNT_SOLD with the column you want to do the calculation over. SELECT
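The preview truncates the query; a hedged reconstruction comparing the estimators:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Replace sales / AMOUNT_SOLD with your table and column, as above.
spark.sql("""
    SELECT
        STDDEV(AMOUNT_SOLD)      AS stddev_default,   -- alias of STDDEV_SAMP
        STDDEV_SAMP(AMOUNT_SOLD) AS stddev_samp,
        STDDEV_POP(AMOUNT_SOLD)  AS stddev_pop
    FROM sales
""").show()
```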

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Bjørn Jørgensen
from pyspark.sql import SparkSession from pyspark.sql.functions import stddev_samp, stddev_pop spark = SparkSession.builder.getOrCreate() data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,), (44.6,), (58.0,), (56.5,), (47.9,), (50.3,)] df = spark.createDataFrame(data, ["value"])
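The preview cuts the snippet short; a likely completion (the final select is an assumption):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev_samp, stddev_pop

spark = SparkSession.builder.getOrCreate()

data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,),
        (44.6,), (58.0,), (56.5,), (47.9,), (50.3,)]
df = spark.createDataFrame(data, ["value"])

# Excel's STDEV/STDEV.S corresponds to stddev_samp;
# STDEVP/STDEV.P corresponds to stddev_pop.
df.select(stddev_samp("value"), stddev_pop("value")).show()
```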

Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud
Hi, I usually create an external Delta table with the command below, using DataFrameWriter API: df.write    .format("delta")    .option("path", "")    .saveAsTable("") Now I would like to use the DataFrameWriterV2 API. I have tried the following command: df.writeTo("")    .using("delta")    
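A sketch of the V2 call the post is reaching for; whether option("path") yields an external table depends on the catalog and Delta version, so treat this as an assumption to verify:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# DataFrameWriterV2: the table name and path are hypothetical placeholders.
(df.writeTo("my_catalog.my_schema.my_table")
   .using("delta")
   .option("path", "s3://bucket/path/to/table")
   .create())
```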

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > > > > I

Spark streaming sourceArchiveDir does not move file to archive directory

2023-09-19 Thread Yunus Emre Gürses
Hello everyone, I'm using Scala and Spark version 3.4.1 on Windows 10. While streaming using Spark, I give the `cleanSource` option as "archive" and the `sourceArchiveDir` option as "archived", as in the code below. ``` spark.readStream .option("cleanSource", "archive")
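A PySpark rendering of the snippet for context (paths are hypothetical). One common gotcha, offered as a hedge rather than a diagnosis: archiving happens lazily after a micro-batch commits, so files may move with a delay, and sourceArchiveDir must not overlap the source glob:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
    .option("cleanSource", "archive")
    .option("sourceArchiveDir", "archived")
    .schema("value STRING")
    .csv("incoming/"))

query = (stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "chk/")
    .option("path", "out/")
    .start())
```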

Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Helene Bøe
Hi! I am applying the stddev function (so actually stddev_samp); however, when comparing with the sample standard deviation in Excel the results do not match. I cannot find in your documentation any more specifics on how the sample standard deviation is calculated, so I cannot compare the

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details:
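A minimal sketch of capping a single application's share in stand-alone mode (the values are arbitrary and the master URL is a placeholder):

```
from pyspark.sql import SparkSession

# In stand-alone mode an application takes all available cores by default;
# spark.cores.max caps it so several applications can run at once.
spark = (SparkSession.builder
    .master("spark://master-host:7077")
    .config("spark.cores.max", "8")
    .config("spark.executor.memory", "8g")
    .getOrCreate())
```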

Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-19 Thread Karthick
Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem Dear Spark Community, I recently reached out to the Apache Flink community for assistance with a critical issue we are facing in our IoT platform, which relies on Apache Kafka and real-time data processing. We received some

unsubscribe

2023-09-18 Thread Ghazi Naceur
unsubscribe

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig, Thank you for sending us more information. Can you answer my previous question, which I don't think the document addresses: how did you determine duplicates in the output? How was the output data read? The FileStreamSink provides exactly-once writes ONLY if you read the output with the

Re: getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
OK thanks Sean. Not a big issue for me. It normally happens in the AM, GMT/London time. I see the email trail but not the thread owner's email first; normally responses come first. Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or more likely, because emails are moderated and we inadvertently moderate them out of order On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh wrote: > Hi, > > I use gmail to receive spark user group emails. > > On

Re: Spark stand-alone mode

2023-09-18 Thread Ilango
Thanks all for your suggestions. Noted with thanks. Just wanted to share a few more details about the environment: 1. We use NFS for data storage and data is in parquet format. 2. All HPC nodes are connected and already work as a cluster for Studio workbench. I can set up passwordless SSH if it does not exist

getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
Hi, I use gmail to receive spark user group emails. On occasions, I get the latest emails first and later in the day I receive the original email. Has anyone else seen this behaviour recently? Thanks Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United

[ANNOUNCE] Apache Kyuubi released 1.7.2

2023-09-18 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.2 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-) I would recommend you to check https://issues.apache.org/jira/browse/SPARK-37935 out as a starter task. Refer to https://github.com/apache/spark/pull/41504, https://github.com/apache/spark/pull/41455 as an example PR. Or you can also add a new sub-task if you find any error

Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny On Sun, Sep 17, 2023 at 17:18 ram manickam wrote: > > > > Hello All, > Recently, joined this community and would like to contribute. Is there a > guideline or recommendation on tasks that can be

About Peak Jvm Memory Onheap

2023-09-17 Thread Nebi Aydin
Hi all, I couldn't find any useful doc that explains the `Peak JVM Memory Onheap` field on the Spark UI. Most of the time my applications have very low *On heap storage memory* and *Peak execution memory on heap*, but a very big `Peak JVM Memory Onheap` on the Spark UI. Can someone please explain the

Fwd: First Time contribution.

2023-09-17 Thread ram manickam
Hello All, I recently joined this community and would like to contribute. Is there a guideline or recommendation on tasks that can be picked up by a first-timer, or a starter task? I tried looking at the Stack Overflow tag apache-spark, couldn't

Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
Thank you Bjørn and Mich. Appreciated. Best On Saturday, 16 September 2023 at 16:50:04 BST, Mich Talebzadeh wrote: Hi Bjørn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think that the question asker wants only the top 25 percent returned. On Sat, 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote: > percentile_approx returns the approximate percentile(s) > The memory consumption is > bounded. The

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
percentile_approx returns the approximate percentile(s). The memory consumption is bounded: the larger the accuracy parameter we choose, the smaller the error we get. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have modified your code to use percentile_approx rather than manually computing it. It would be interesting to hear ideas on this.
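A sketch of the percentile_approx approach applied to the thread's original "filter out the top 20% of rows" question (the column and threshold handling are assumptions):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

# Keep rows at or below the approximate 80th percentile,
# i.e. drop roughly the top 20%.
p80 = df.select(F.percentile_approx("value", 0.8).alias("p")).first()["p"]
df.filter(F.col("value") <= p80).show()
```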
