[PySpark] Intermittent Spark session initialization error on M1 Mac

2023-06-27 Thread BeoumSuk Kim
Hi, When I launch the pyspark CLI on my M1 MacBook (standalone mode), I intermittently get the following error and the Spark session doesn't get initialized. Seven or eight times out of ten it works fine, but it fails intermittently, and this occurs only when I specify the `spark.jars.packages` option.

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
Hi Team, I have had no luck trying to expose port 5005 (for remote debugging purposes) on my executor container using the following pod template and Spark configuration: s3a://mybucket/pod-template-executor-debug.yaml

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We recently upgraded Spark from 2.4.x to 3.3.1, and managed table creation while writing a dataframe with saveAsTable fails with the error below. Can not create the managed table(``) The associated location('hdfs:') already exists. At a high level, our code does the following before writing the dataframe as

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :) bzip2 works (it is splittable) because it is block-oriented, whereas gzip is stream-oriented. I also noticed that you are creating a managed ORC (Optimized Row Columnar) file. You can bucket and partition an ORC table. An example below:
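Mich's example was cut off in the preview above; a minimal sketch of what a partitioned, bucketed ORC table created with CTAS could look like from PySpark (the table and column names here are hypothetical, not from the thread):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-ctas-sketch").enableHiveSupport().getOrCreate()

# CTAS into a partitioned, bucketed ORC table; sales_orc, staging_sales and the
# columns are made-up names for illustration only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orc
    USING ORC
    PARTITIONED BY (event_date)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    AS SELECT customer_id, amount, event_date FROM staging_sales
""")
```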

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich, Thanks for the reply. I started running ANALYZE TABLE on the external table, but the progress was very slow. The stage had only read about 275MB in 10 minutes. That equates to about 5.5 hours just to analyze the table. This might just be the reality of trying to process a 240m record

Unable to populate spark metrics using custom metrics API

2023-06-26 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, for now, have you analyzed statistics on the Hive external table? spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization. HTH Mich Talebzadeh, Solutions Architect/Engineering
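For reference, a sketch of the same statistics commands run from PySpark rather than the spark-sql shell (test.stg_t2 is the table name from the thread):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-sketch").enableHiveSupport().getOrCreate()

# Compute column-level statistics so the optimizer can use them, then inspect the table.
spark.sql("ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("DESC EXTENDED test.stg_t2").show(truncate=False)
```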

Unsubscribe

2023-06-26 Thread Ghazi Naceur
Unsubscribe

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
In SSS writeStream. \ outputMode('append'). \ option("truncate", "false"). \ foreachBatch(SendToBigQuery). \ option('checkpointLocation', checkpoint_path). \ so this writeStream will call
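This also bears on the micro-batch id question in the thread below: foreachBatch passes each micro-batch's id to the callback, so it can be logged there. A minimal PySpark sketch of the pattern (the rate source, sink logic and checkpoint path are placeholders, not from the thread):
```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("foreachbatch-sketch").getOrCreate()

def send_to_bigquery(df: DataFrame, batch_id: int) -> None:
    # batch_id identifies the micro-batch and can be included in any log line here.
    print(f"micro-batch {batch_id}: {df.count()} rows")
    # the actual sink write (e.g. df.write...) would go here

stream = spark.readStream.format("rate").load()   # placeholder streaming source

query = (stream.writeStream
         .outputMode("append")
         .foreachBatch(send_to_bigquery)
         .option("checkpointLocation", "/tmp/checkpoints/demo")
         .start())
```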

[Spark streaming]: Microbatch id in logs

2023-06-25 Thread Anil Dasari
Hi, I am using spark 3.3.1 distribution and spark stream in my application. Is there a way to add a microbatch id to all logs generated by spark and spark applications ? Thanks.

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread yangjie01
Thanks Dongjoon ~ On 2023/6/24 10:29, "L. C. Hsieh" wrote: Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan > wrote: >> >> >>

Re:[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread beliefer
Thanks! Dongjoon Hyun. Congratulations too! At 2023-06-24 07:57:05, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark.

Apache Spark with watermark - processing data different LogTypes in same kafka topic

2023-06-24 Thread karan alang
Hello All - I'm using Apache Spark Structured Streaming to read data from a Kafka topic and do some processing. I'm using a watermark to account for late-coming records and the code works fine. Here is the working (sample) code: ``` from pyspark.sql import SparkSession from pyspark.sql.functions
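The sample code is truncated in the preview above; a minimal sketch of the general pattern (one Kafka topic, records filtered by log type, each branch aggregated with its own watermark), with assumed broker, topic and schema names:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

schema = StructType([
    StructField("logType", StringType()),
    StructField("eventTime", TimestampType()),
    StructField("message", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
       .option("subscribe", "logs")                           # placeholder topic
       .load()
       .select(from_json(col("value").cast("string"), schema).alias("j"))
       .select("j.*"))

# One branch per log type, each with its own watermark and windowed aggregation.
type_a_counts = (raw.filter(col("logType") == "A")
                 .withWatermark("eventTime", "10 minutes")
                 .groupBy(window(col("eventTime"), "5 minutes"))
                 .count())
```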

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: >> >> >> Thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: >>> >>> We are happy to announce the

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Mridul Muralidharan
Thanks Dongjoon ! Regards, Mridul On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > We are happy to announce the availability of Apache Spark 3.4.1! > > Spark 3.4.1 is a maintenance release containing stability fixes. This > release is based on the branch-3.4 maintenance branch of Spark.

[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this stable release. To download Spark 3.4.1,

Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
data = { "Employee ID": [12345, 12346, 12347, 12348, 12349], "Name": ["Dummy x", "Dummy y", "Dummy z", "Dummy a", "Dummy b"], "Client": ["Dummy a", "Dummy b", "Dummy c", "Dummy d", "Dummy e"], "Project": ["abc", "def", "ghi", "jkl", "mno"], "Team": ["team a", "team b", "team

Re: Rename columns without manually setting them all

2023-06-21 Thread Farshid Ashouri
You can use selectExpr and stack to achieve the same effect in PySpark: df = spark.read.csv("your_file.csv", header=True, inferSchema=True) date_columns = [col for col in df.columns if '/' in col] df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`"] +
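A runnable sketch of that approach; the input file name follows the thread, and the melted output column names (date, status) are assumptions:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stack-sketch").getOrCreate()

df = spark.read.csv("your_file.csv", header=True, inferSchema=True)

# Treat every column containing '/' as a date column and unpivot them with stack().
date_columns = [c for c in df.columns if "/" in c]
stack_expr = "stack({n}, {pairs}) as (date, status)".format(
    n=len(date_columns),
    pairs=", ".join(f"'{c}', `{c}`" for c in date_columns),
)

melted = df.selectExpr("`Employee ID`", "Name", "Client", "Project", "Team", stack_expr)
melted.show()
```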

Rename columns without manually setting them all

2023-06-21 Thread John Paul Jayme
Hi, This is currently my column definition: Employee ID | Name | Client | Project | Team | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022, with a sample row: 12345 | Dummy x | Dummy a | abc | team a | OFF | WO | WH | WH | WH. As you can see, the outer columns are just

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks for the info. Regards Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:*

Unsubscribe

2023-06-20 Thread Bhargava Sukkala
-- Thanks, Bhargava Sukkala. Cell no:216-278-1066 MS in Business Analytics, Arizona State University.

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
yes, p_df = DF.toPandas() that is THE pandas the one you know. change p_df = DF.toPandas() to p_df = DF.pandas_on_spark() or p_df = DF.to_pandas_on_spark() or p_df = DF.pandas_api() or p_df = DF.to_koalas() https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html
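A small sketch of the difference being discussed: toPandas() collects everything to the driver, while the pandas API on Spark keeps the data distributed.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-api-sketch").getOrCreate()
DF = spark.range(1_000_000)

p_df = DF.toPandas()      # plain pandas: all rows are pulled to the driver
ps_df = DF.pandas_api()   # pandas API on Spark: computation stays on the executors

print(type(p_df))         # <class 'pandas.core.frame.DataFrame'>
print(type(ps_df))        # <class 'pyspark.pandas.frame.DataFrame'>
```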

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks. So the issue seems to be creating a Pandas DF from a Spark DF (I do it for plotting with something like import matplotlib.pyplot as plt; p_df = DF.toPandas(); p_df.plot()). I guess that stays in the driver. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: Shuffle data on pods which get decomissioned

2023-06-20 Thread Mich Talebzadeh
If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source. Does this mean that only those tasks that the dead executor was executing at the time need to be

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed.

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Thanks but if you create a Spark DF from Pandas DF that Spark DF is not distributed and remains on the driver. I recall a while back we had this conversation. I don't think anything has changed. Happy to be corrected Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Pandas API on Spark is an API so that users can use Spark as they use pandas. This was known as Koalas. Is this limitation still valid for Pandas? For pandas, yes. But what I showed was the pandas API on Spark, so it's Spark. Additionally, when we convert from a Pandas DF to a Spark DF, what process is

Shuffle data on pods which get decomissioned

2023-06-20 Thread Nikhil Goyal
Hi folks, When running Spark on K8s, what would happen to shuffle data if an executor is terminated or lost? Since there is no shuffle service, does all the work done by that executor get recomputed? Thanks Nikhil

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Whenever someone mentions Pandas I automatically think of it as an Excel sheet for Python. OK, my point below needs some qualification. Why Spark here? Generally, parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; hence, the

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is the pandas API on Spark: from pyspark import pandas as ps df = ps.read_excel("testexcel.xlsx") [image: image.png] this will convert it to pyspark [image: image.png] On Tue, 20 June 2023 at 13:42, John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I
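A minimal sketch of that route (reading .xlsx through the pandas API on Spark typically requires the openpyxl package on the cluster; the file path is a placeholder):
```python
from pyspark import pandas as ps

psdf = ps.read_excel("testexcel.xlsx")   # pandas-on-Spark DataFrame
sdf = psdf.to_spark()                    # convert to a regular PySpark DataFrame
sdf.show(5)
```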

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents -

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
Good day, I have a task to read excel files in databricks but I cannot seem to proceed. I am referencing the API documents - read_excel , but there is an error sparksession object has

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
OK, the number of partitions n, or more to the point the "optimum" number of partitions, depends on the size of your batch data DF among other things, and on the degree of parallelism at the end point where you will be writing to the sink. If you require high parallelism because your tasks are fine grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement? I have a state data source which enables the state in SS (Structured Streaming) to be rewritten, which enables repartitioning, schema evolution, etc. via a batch query. The writer requires hash partitioning against the group key, with the "desired number of

implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Pengfei Li
Hi All, I'm developing a DataSource on Spark 3.2 to write data to our system, and using DataSource V2 API. I want to implement the interface RequiresDistributionAndOrdering

TAC Applications for Community Over Code North America and Asia now open

2023-06-16 Thread Gavin McDonald
Hi All, (This email goes out to all our user and dev project mailing lists, so you may receive this email more than once.) The Travel Assistance Committee has opened up applications to help get people to the following events: *Community Over Code Asia 2023 - * *August 18th to August 20th in

Fwd: iceberg queries

2023-06-15 Thread Gaurav Agarwal
Hi Team, Sample Merge query: df.createOrReplaceTempView("source") MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target USING (SELECT * FROM source) ON target.col1 = source.col1 // this is my bucket column WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * The source dataset

Re: Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
> HI > > I am using spark with iceberg, updating the table with 1700 columns , > We are loading 0.6 Million rows from parquet files ,in future it will be > 16 Million rows and trying to update the data in the table which has 16 > buckets . > Using the default partitioner of spark .Also we don't do

Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
HI, I am using Spark with Iceberg, updating a table with 1700 columns. We are loading 0.6 million rows from parquet files; in future it will be 16 million rows. We are trying to update the data in the table, which has 16 buckets, using the default partitioner of Spark. Also we don't do any

Unsubscribe

2023-06-11 Thread Yu voidy

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to d...@spark.apache.org (and join that group) > >

Announcing the Community Over Code 2023 Streaming Track

2023-06-09 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Halifax, Nova Scotia October 7-10, 2023. The call for presentations is open now through July 13, 2023. I am one of the co-chairs for the

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone: either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone`, or the default DataFrameWriter zone `timeZone` (when writing to file). You say you are in PST,
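A small sketch of the effect Enrico describes: the same instant renders differently depending on spark.sql.session.timeZone, while the underlying epoch value is unchanged.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timezone-sketch").getOrCreate()

# 1686225600 is 2023-06-08 12:00:00 UTC expressed as epoch seconds.
df = spark.sql("SELECT timestamp_seconds(1686225600) AS ts")

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.show()   # rendered in Pacific time

spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show()   # the same instant rendered in UTC

df.selectExpr("cast(ts as long)").show()   # epoch seconds, independent of the display zone
```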

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : >

Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread karan alang
ref : https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly Hello All, I've data stored in MongoDB collection and the timestamp column is not being read by Apache Spark correctly. I'm running Apache Spark on GCP Dataproc. Here is sample data :

Getting SparkRuntimeException: Unexpected value for length in function slice: length must be greater than or equal to 0

2023-06-06 Thread Bariudin, Daniel
I'm using PySpark (version 3.2) and I've encountered the following exception while trying to perform a slice on an array in a DataFrame: "org.apache.spark.SparkRuntimeException: Unexpected value for length in function slice: length must be greater than or equal to 0" but the length is greater than

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group). You need to raise a JIRA for this request plus the related doc. Example JIRA: https://issues.apache.org/jira/browse/SPARK-42485 and the related Spark project improvement proposal (SPIP) to be filled in

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list? Is there another/a better way to make feature requests? I tried in the past to write a mail to the dev mailing list but it did not show up at all. Cheers. keen wrote on Thu, 1 June 2023 at 07:11: > Hi all, > currently only *temporary* Spark Views can be

[Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-01 Thread keen
Hi all, currently only *temporary* Spark Views can be created from a DataFrame (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView). When I want a *permanent* Spark View I need to specify it via Spark SQL (CREATE VIEW AS SELECT ...). Sometimes it is easier to specify the desired

Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT cheers Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Comparison of Trino, Spark, and Hive-MR3

2023-05-31 Thread Sungwoo Park
Hi everyone, We published an article on the performance and correctness of Trino, Spark, and Hive-MR3, and thought that it could be of interest to Spark users. https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/ Omitted in the article is the performance of Spark 2.3.1 vs

Re: Viewing UI for spark jobs running on K8s

2023-05-31 Thread Qian Sun
Hi Nikhil Spark operator supports ingress for exposing all UIs of running spark applications. reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote: > Hi

Re: ChatGPT and prediction of Spark future

2023-05-31 Thread Winston Lai
Hi Mich, I have been using ChatGPT free version, Bing AI, Google Bard and other AI chatbots. My use cases so far include writing, debugging code, generating documentation and explanation on Spark key terminologies for beginners to quickly pick up new concepts, summarizing pros and cons or

Viewing UI for spark jobs running on K8s

2023-05-31 Thread Nikhil Goyal
Hi folks, Is there an equivalent of the YARN RM page for Spark on Kubernetes? We can port-forward the UI from the driver pod for each job, but this process is tedious given we have multiple jobs running. Is there a clever solution to exposing all Driver UIs in a centralized place? Thanks Nikhil

ChatGPT and prediction of Spark future

2023-05-31 Thread Mich Talebzadeh
I have started looking into ChatGPT as a consumer. The one I have tried is the free not plus version. I asked a question entitled "what is the future for spark" and asked for a concise response This was the answer "Spark has a promising future due to its capabilities in data processing,

Structured streaming append mode picture question

2023-05-31 Thread Hill Liu
Hi, I have a question related to this picture https://spark.apache.org/docs/latest/img/structured-streaming-watermark-append-mode.png in the structured streaming programming guide web page. At 12:20, the watermark is 12:11, so the window 12:00 ~ 12:10 will be flushed out. But in this pic, it seems the window

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker
Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
2.13.8. You must change 2.13.6 to 2.13.8. On Mon, 29 May 2023 at 18:02, Mich Talebzadeh wrote: > Thanks everyone. Still not much progress :(. It is becoming a bit > confusing as I am getting

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks everyone. Still not much progress :(. It is becoming a bit confusing as I am getting this error Compiling ReduceByKey [INFO] Scanning for projects... [INFO] [INFO] -< spark:ReduceByKey >-- [INFO] Building ReduceByKey 3.0 [INFO] from pom.xml

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
Change org.scala-lang scala-library 2.13.11-M2 to org.scala-lang scala-library ${scala.version} On Mon, 29 May 2023 at 13:20, Lingzhe Sun wrote: > Hi Mich, > > Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8 >

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x, but it is not supported. But then again 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: JDK version support information

2023-05-29 Thread Poorna Murali
We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. So, we wanted to check the maximum JDK version supported for 2.4.5.1. On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, wrote: > Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the > official Spark

Re: JDK version support information

2023-05-29 Thread Aironman DirtDiver
Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the official Spark documentation for version 2.4.5, the maximum supported JDK (Java Development Kit) version is JDK 8 (Java 8). Spark 2.4.5 is not compatible with JDK versions higher than Java 8. Therefore, you should use JDK 8 to

JDK version support information

2023-05-29 Thread Poorna Murali
Hi, We are using spark version 2.4.5.1. We would like to know the maximum JDK version supported for the same. Thanks, Poorna

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
Hi Mich, Spark 3.4.0 prebuilt with Scala 2.13 is built with version 2.13.8. Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to the major (13) and the minor version (8). Not using either of these may cause unexpected behaviour (though Scala claims compatibility among minor

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks for your helpful comments Bjorn. I managed to compile the code with maven, but when it runs it fails with: Application is ReduceByKey Exception in thread "main" java.lang.NoSuchMethodError: scala.package$.Seq()Lscala/collection/immutable/Seq$; at

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen
From ChatGPT-4: The problem appears to be that there is a mismatch between the version of Scala used by the Scala Maven plugin and the version of the Scala library defined as a dependency in your POM. You've defined your Scala version in your properties as `2.12.17` but you're pulling in

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi, Autoscaling is not compatible with Spark Structured Streaming since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support dynamic

[Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Aishwarya Panicker
Hi Team, I have been working on Spark Structured Streaming and trying to autoscale our application through dynamic allocation. But I couldn't find any documentation or configurations that supports dynamic scaling in Spark Structured Streaming, due to which I had been using Spark Batch mode

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an Implantation from Joint Probability Trees > (JPT) using the Spark framework. For this

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
Hi, given your dataset: val df=Seq( (1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"), (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"), (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02"), (10, 20230523, "M02"), (11, 20230523, "M02"), (12,

Dynamic value as the offset of lag() function

2023-05-23 Thread Nipuna Shantha
Hi all, This is the sample set of data that I used for this task: [image: image.png] My need is to pass count as the offset of the lag() function: [ lag(col(), lag(count)).over(windowspec) ] But as the lag function expects lag(Column, Int), the above code does not work. So can you guys suggest a

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given, you are already stating the above can be imagined as a partition, I can think of mapPartitions iterator. val inputSchema = inputDf.schema val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows)) val outputDf = sparkSession.createDataFrame(outputRdd,

Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Nipuna Shantha
Hi all, This is the sample set of data that I used for this task [image: image.png] My expected output is as below [image: image.png] My scenario is if Type is M01 the count should be 0 and if Type is M02 it should be incremented from 1 or 0 until the sequence of M02 is finished. Imagine this

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
Thanks, great Rauf. Regards. On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote: Hi, PartitionBy() is analogous to group by; all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards, Raouf. On Fri, May 12,

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread Rauf Khan
Hi, PartitionBy() is analogous to group by: all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards Raouf On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: > Hello, > > In Spark windowing does call
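A small PySpark sketch of that point; the physical plan shows an Exchange (shuffle) on the partitioning key feeding the Window operator.
```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("window-shuffle-sketch").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

w = Window.partitionBy("key").orderBy(col("value").desc())
ranked = df.withColumn("rn", row_number().over(w))

ranked.explain()   # look for Exchange hashpartitioning(key, ...) ahead of the Window node
```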

cannot load model using pyspark

2023-05-23 Thread second_co...@yahoo.com.INVALID
spark.sparkContext.textFile("s3a://a_bucket/models/random_forest_zepp/bestModel/metadata", 1).getNumPartitions() When I run the above code, I get the error below. Can you advise how to troubleshoot? I'm using Spark 3.3.0, and the above file path exists.

Data Stream Processing applications testing

2023-05-22 Thread Alexandre Strapacao Guedes Vianna
Hey everyone, I wanted to share my latest paper, "A Grey Literature Review on Data Stream Processing Applications Testing," in the Journal of Systems and Software (JSS), Elsevier. This paper provides unique industry insights, addresses the challenges faced in Data Stream Processing (DSP)

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Just to correct the last sentence, if we end up starting a new instance of Spark, I don't think it will be able to read the shuffle data from storage from another instance, I stand corrected. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Hi Maksym. Let us understand the basics here first. My thoughts: Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to the other executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
Hey vaquar, The link doesn't explain the crucial detail we're interested in - does the executor reuse the data that exists on a node from a previous executor, and if not, how can we configure it to do so? We are not running on Kubernetes, so EKS/Kubernetes-specific advice isn't very relevant. We are

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
Ok, I did a bit of a test that shows that the shuffle does spill to memory and then to disk, if my assertion is valid. The sample code I wrote is as follows: import sys from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
In the following link you will get all the required details: https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ Let me know if you require further information. Regards, Vaquar khan On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh wrote: > Couple of points > > Why

RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi, For clarification, are those 12 / 14 minutes cumulative CPU time or wall clock time? How many executors executed those 1 / 375 tasks? Cheers, Enrico. Original message From: Shashank Rao Date: 16.05.23 19:48 (GMT+01:00) To: user@spark.apache.org Subject:

Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Hi, I'm trying to set up a Spark pipeline which reads data from S3 and writes it into Google Big Query. Environment Details: --- Java 8 AWS EMR-6.10.0 Spark v3.3.1 2 m5.xlarge executor nodes S3 Directory structure: --- bucket-name: |---folder1: |---folder2:

Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi, On the issue of Spark shuffle it is accepted that shuffle often involves the following, if not all of the below: - Disk I/O - Data serialization and deserialization - Network I/O Excluding external shuffle service and without relying on the configuration options provided by spark for

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
Couple of points. Why use spot or pre-emptible instances when your application, as you stated, shuffles heavily? Have you looked at why you are having these shuffles? What is the cause of these large transformations ending up in shuffle? Also on your point: "..then ideally we should expect that when an

[spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Faiz Halde
Hello, We've been in touch with a few spark specialists who suggested us a potential solution to improve the reliability of our jobs that are shuffle heavy Here is what our setup looks like - Spark version: 3.3.1 - Java version: 1.8 - We do not use external shuffle service - We use

Pyspark cluster mode on standalone deployment

2023-05-14 Thread خالد القحطاني
Hi, Can I deploy my PySpark application on a standalone cluster in cluster mode? I believe it was not possible to do that, but I searched all the documentation and did not find it. My Spark standalone cluster version is 3.3.1

Re: Error while merge in delta table

2023-05-12 Thread Farhan Misarwala
Hi Karthick, If you have confirmed that the incompatibility between Delta and spark versions is not the case, then I would say the same what Jacek said earlier, there’s not enough “data” here. To further comment on it, we would need to know more on how you are structuring your multi threaded

Shuffle with Window().partitionBy()

2023-05-12 Thread ashok34...@yahoo.com.INVALID
Hello, In Spark windowing, can a call to Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, if the result data set is large? Thanks

Re: Error while merge in delta table

2023-05-12 Thread Karthick Nk
Hi Farhan, Thank you for your response. I am using Databricks with 11.3x-scala2.12. Here I am overwriting all the tables in the same database in concurrent threads, but when I do it in an iterative manner it works fine. For example, I have 200 tables in the same database, and I am overwriting the

Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-11 Thread Vijay B
In my view Spark is behaving as expected. TL;DR: Every time a dataframe is reused, branched or forked, the sequence of operations is evaluated again. Use cache or persist to avoid this behavior, and unpersist when no longer required; Spark does not unpersist automatically. Couple of things
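A sketch of the cache/unpersist advice (the source path and column names are placeholders):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

df = spark.read.parquet("/data/events").cache()     # marks the DataFrame for caching; still lazy

ok_rows = df.filter("status = 'ok'").count()        # first action reads the files and fills the cache
by_status = df.groupBy("status").count().collect()  # the second branch reuses the cache, no re-read

df.unpersist()   # Spark does not unpersist automatically; release the cache explicitly
```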

Re: Error while merge in delta table

2023-05-11 Thread Farhan Misarwala
Hi Karthick, I think I have seen this before and this probably could be because of an incompatibility between your spark and delta versions. Or an incompatibility between the delta version you are using now vs the one you used earlier on the existing table you are merging with. Let me know if

Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books Follow me on

Error while merge in delta table

2023-05-10 Thread Karthick Nk
Hi, I am trying to merge a dataframe with a delta table in Databricks, but I am getting an error. I have attached the code snippet and error message for reference below. code: [image: image.png] error: [image: image.png] Thanks

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some reason, I can
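Run from PySpark, the expression folds over the five elements starting from map('cnt', 0) and increments once per element, so it should produce cnt = 5 (a sketch for illustration):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregate-map-sketch").getOrCreate()

spark.sql("""
    SELECT aggregate(array(1, 2, 3, 4, 5),
                     map('cnt', 0),
                     (acc, x) -> map('cnt', acc.cnt + 1)) AS array_count
""").show(truncate=False)
# expected output: array_count = {cnt -> 5}
```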

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode spark-submit --master local[4] with spark = SparkSession.builder \ .appName("tests") \ .enableHiveSupport() \ .getOrCreate() spark.conf.set("spark.sql.adaptive.enabled", "true") df3.explain(extended=True) and no caching I see this
