Re: convert pandas image column to Spark DataFrame

2023-07-31 Thread second_co...@yahoo.com.INVALID
I changed to ArrayType(ArrayType(ArrayType(IntegerType()))), still get the same error. Thank you for responding. On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello, when you said your pandas DataFrame has 10 rows, does that mean it contains 10 images? Becaus

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
OK, so as expected the underlying database is Hive. Hive uses HDFS storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by the Hive metastore, so to speak. Since this is all happening through Spark, by default implementation of the Hive metastore <

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol, Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster restarted so I lost the stack trace in the application UI. In the snippets I saved, it looks like the exception being thrown was from Hive. Given the feedback you've provided, I suspect the issue is with how

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick, You can have multiple writers simultaneously writing to the same table in HDFS by utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast Format, offer this capability. All of them provide advanced features t
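For illustration, a minimal sketch of what Pol describes, using Delta Lake as one example format. The session settings and package classes below are the standard ones from the Delta documentation; the table name is made up, and Hudi, Iceberg and Qbeast wire in analogously through their own extension/catalog classes.

    # Hedged sketch: an open table format with optimistic concurrency control,
    # so two Spark jobs can insert into the same table without Hive-style lock
    # errors. Assumes the Delta Lake jars are already on the classpath.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, ts STRING) USING delta")
    # Concurrent appends are reconciled at commit time rather than being
    # serialized by a metastore lock.
    spark.sql("INSERT INTO events VALUES (1, '2023-07-30')")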

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH Mich Talebzadeh, Solutions Architect/Engineeri

Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN resource manager. YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively. In Cloud it is also used widely in Infrastructure as a Service such as Goog

Re: convert pandas image column to Spark DataFrame

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas DataFrame has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: > i h
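A minimal sketch of Adrian's suggestion, with shapes, sizes and the column name as assumptions: one ArrayType level each for height, width and channels, so the image column needs exactly three nested ArrayTypes.

    # Ten tiny 2x2 RGB "images" as nested lists, converted with a 3-level
    # ArrayType schema (sketch; real images would simply have larger dims).
    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    pdf = pd.DataFrame({
        "image": [np.random.randint(0, 256, (2, 2, 3)).tolist() for _ in range(10)]
    })
    schema = StructType([
        StructField("image", ArrayType(ArrayType(ArrayType(IntegerType()))))
    ])
    df = spark.createDataFrame(pdf, schema=schema)
    df.printSchema()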

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > > An error was encountered: > Attri
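Since list_packages() is an EMR notebook extension rather than a Spark API, one plain-PySpark alternative is to list the Python packages visible to the driver with the standard library. A sketch, not an equivalent of the EMR feature (which can also install packages):

    # Enumerate installed Python distributions on the driver (Python 3.8+).
    import importlib.metadata

    for dist in sorted(importlib.metadata.distributions(),
                       key=lambda d: (d.metadata["Name"] or "").lower()):
        print(dist.metadata["Name"], dist.version)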

Re: Interested in contributing to SPARK-24815

2023-07-26 Thread Pavan Kotikalapudi
Thanks for the response with all the information, Sean and Kent. Is there a way to figure out if my employer (Twilio) is part of the CCLA? cc'ing @Rinat Shangeeta, our Open Source Counsel at Twilio. Thank you, Pavan On Tue, Jul 25, 2023 at 10:48 PM Kent Yao wrote: > Hi Pavan, > > Refer to the ASF So

Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
Hi Pavan, Refer to the ASF Source Header and Copyright Notice Policy[1]: code directly submitted to ASF should include the Apache license header without any additional copyright notice. Kent Yao [1] https://www.apache.org/legal/src-headers.html#headers On Tue, Jul 25, 2023 at 07:22, Sean Owen wrote: > > Wh

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: https://www.apache.org/licenses/cla-corporate.pdf I don't believe ASF projects ever retain an original author copyright statement, but rather source files have a state

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
Personally I have not done it myself. CCed to the Spark user group in case any user has tried it. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Unable to launch Spark connect on Docker image

2023-07-22 Thread Mich Talebzadeh
Is this the downloaded Docker image? Try this with the added configuration options as below: /opt/spark/sbin/start-connect-server.sh --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" --packages org.apache.spark:spark-connect_2.12:3.4.1 And you will get starting org.apache.
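Once the connect server is up, a client session can attach to it. A minimal sketch, assuming the server's default port 15002 and a pyspark 3.4+ client with the connect extras installed:

    from pyspark.sql import SparkSession

    # "sc://" is the Spark Connect URI scheme; host and port are assumptions.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    spark.range(5).show()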

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Mich Talebzadeh
this link might help https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
I used the following config and the performance has improved a lot. .config("spark.sql.orc.splits.include.file.footer", true) I am not able to find the default value of this config anywhere. Can someone please share what the default value of this config is? Is it false? Also just curious what this actual
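One way to hunt for a SQL config's default from a live session is SET -v, which lists every registered config with its value and description. A sketch, assuming an active spark session; whether this particular ORC key appears depends on the Spark/Hive build:

    # Filter the SET -v listing down to ORC-related configs.
    spark.sql("SET -v").where("key LIKE '%orc%'").show(truncate=False)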

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
Thank you. Will try out these options. With Best Regards, On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh wrote: > Sounds like if the mv command is inherently slow, there is little that can > be done. > > The only suggestion I can make is to create the staging table as > compressed to reduc

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
Sounds like if the mv command is inherently slow, there is little that can be done. The only suggestion I can make is to create the staging table as compressed to reduce its size and hence mv? Is that feasible? Also the managed table can be created with SNAPPY compression STORED AS ORC TBLPROPERT
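For illustration, the compressed staging table suggested above might look like this. A sketch: table and column names are made up.

    # Hive-format managed table, ORC with SNAPPY compression to shrink the
    # data that the mv step has to move.
    spark.sql("""
        CREATE TABLE staging_t (
          id      BIGINT,
          payload STRING
        )
        STORED AS ORC
        TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    """)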

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
Hi Mich, Ok, my use-case is a bit different. I have a Hive table partitioned by dates and need to do dynamic partition updates (insert overwrite) daily for the last 30 days (partitions). The ETL inside the staging directories is completed in barely 5 minutes, but then renaming takes a lot of time as
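The usual Spark idiom for this kind of daily job is dynamic partition overwrite, which rewrites only the partitions present in the incoming data. A hedged sketch: the DataFrame and table names are assumptions, and Hive-serde tables may additionally need hive.exec.dynamic.partition.mode=nonstrict.

    # Overwrite only the partitions contained in daily_df rather than the
    # whole table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # daily_df: hypothetical DataFrame holding the last 30 days of data,
    # including the partition column.
    daily_df.write.mode("overwrite").insertInto("warehouse.events_by_date")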

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that Hive staging directory. That directory belongs to Hive and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external Hive table with explicit location on

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly. I am just wondering, once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to original directory? Anyone tried this approach or foresee any concern? On Mon, 17 Jul 2023 at

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a package as mentioned

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for a guidance here. Really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, is there any suggestion how much I can increase those parameters? On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote: > Fileoutputcommitter v2 is supported in GCS but the rename is a metadata > copy and delete operation in GCS, and therefore if there are a large number of > files it will take a l

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
Hi, Holden Karau has some fantastic videos in her channel which will be quite helpful. Thanks Gourav On Sun, 16 Jul 2023, 19:15 Brian Huynh, wrote: > Good morning Dipayan, > > Happy to see another contributor! > > Please go through this document for contributors. Please note the > MLlib-specif

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
Fileoutputcommitter v2 is supported in GCS but the rename is a metadata copy and delete operation in GCS, and therefore if there are a large number of files it will take a long time to perform this step. One workaround will be to create a smaller number of larger files if that is possible from Spark and
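Jay's "fewer, larger files" workaround in sketch form; the target partition count and output path are assumptions to tune against your file sizes:

    # Fewer output files means fewer GCS rename (copy + delete) operations
    # during the commit phase.
    (df.coalesce(32)
       .write
       .mode("overwrite")
       .parquet("gs://my-bucket/output/"))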

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
You said this Hive table was a managed table partitioned by date -->${TODAY} How do you define your Hive managed table? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does support it - it doesn't error out for me at least. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output in the staging directory and the rest of the time it took to rename the objects. That's the concern. Looks like a known issue as spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote: > Thanks Jay, > > I will try that option. > > Any insight on the file committer algorithms? > > I tried

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not improving the runtime. What's the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increasing fs.gs.batch.threads

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark 2.4
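A hedged sketch of setting the two connector flags Jay names; the values are placeholders to tune per the linked CONFIGURATION.md, and the spark.hadoop. prefix is the usual way to pass Hadoop connector options through Spark:

    from pyspark.sql import SparkSession

    # Raise GCS connector batching limits for metadata-heavy commit phases.
    spark = (SparkSession.builder
             .config("spark.hadoop.fs.gs.batch.threads", "64")
             .config("spark.hadoop.fs.gs.max.requests.per.batch", "64")
             .getOrCreate())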

Re: Unsubscribe

2023-07-17 Thread srini subramanian
Unsubscribe  On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel wrote: Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. [image: image.png] When I do a dynamic partition update of Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on d

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile <

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
Good morning Dipayan, Happy to see another contributor! Please go through this document for contributors. Please note the MLlib-specific contribution guidelines section in particular. https://spark.apache.org/contributing.html Since you are looking for something to start with, take a look at th

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > > https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and you can access the Spark UI using your browser. If you don't own the Spark cluster, contact your Spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI On Wed, Jul 12

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache Spark community, I am a Data Engineer. I have been using Apache Spark for some time now. I recently tried to use it bu

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under directory $SPARK_HOME get a list of jar files for hive and hadoop. This one is for version 3.4.0 /opt/spark/jars> ltr *hive* *hadoop* -rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar -rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43 spark-

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally I was able to solve this issue by setting this conf: "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_folder" Thanks all! On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh wrote: > Hi Khalid, > > Elango mentioned the file is working fine in another environment wi
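The same fix expressed as session configuration rather than a raw conf string. A sketch: the path is elango's, and note that driver JVM options generally have to be in place before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf:

    from pyspark.sql import SparkSession

    # Point snappy-java's native-library temp dir at a writable location.
    spark = (SparkSession.builder
             .config("spark.driver.extraJavaOptions",
                     "-Dorg.xerial.snappy.tempdir=/my_user/temp_folder")
             .getOrCreate())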

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
Hi Khalid, Elango mentioned the file is working fine in another environment with the same driver and executor memory. Brian On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote: Perhaps that parquet file is corrupted, or is it something else in that folder? To check, try to read that file with pandas or other

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps that parquet file is corrupted, or is it something else in that folder? To check, try to read that file with pandas or other tools to see if you can read without Spark. On Wed, 5 Jul 2023, 07:25 elango vaidyanathan, wrote: > > Hi team, > > Any updates on this below issue > > On Mon, 3 Jul 2023 at

Re: Unsubscribe

2023-07-07 Thread Atheeth SH
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks On Fri, 7 Jul 2023 at 12:05, Mihai Musat wrote: > Unsubscribe >

Re: PySpark error java.lang.IllegalArgumentException

2023-07-04 Thread elango vaidyanathan
Hi team, Any updates on this below issue? On Mon, 3 Jul 2023 at 6:18 PM, elango vaidyanathan wrote: > > > Hi all, > > I am reading a parquet file like this and it gives > java.lang.IllegalArgumentException. > However I can work with other parquet files (such as NYC taxi parquet > files) without

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Z is just an example. It could be anything. Basically, anything that's not in schema should be filtered out. On Tue, 4 Jul 2023, 13:27 Hill Liu, wrote: > I think you can define schema with column z and filter out records with z > is null. > > On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao > wrote:

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Hill Liu
I think you can define schema with column z and filter out records with z is null. On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao wrote: > Yes, drop malformed does filter out record4. However, record 5 is not. > > On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > >> Have you tried dropmalformed op

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Yes, drop malformed does filter out record4. However, record 5 is not. On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > Have you tried dropmalformed option ? > > On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > >> Update: Got it working by using the *_corrupt_record *field for the >> first c

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
Have you tried dropmalformed option ? On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > Update: Got it working by using the *_corrupt_record *field for the first > case (record 4) > > schema = schema.add("_corrupt_record", DataTypes.StringType); > Dataset ds = spark.read().schema(schema).option

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing! On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news! On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Update: Got it working by using the *_corrupt_record *field for the first case (record 4) schema = schema.add("_corrupt_record", DataTypes.StringType); Dataset ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path").collect(); ds = ds.filter(functions.col("_corrupt_record").isNu
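A PySpark rendering of Shashank's working approach, as a sketch: field names a and b are assumptions standing in for the real schema. Caching before filtering matters here because Spark rejects queries that reference only the internal _corrupt_record column on an unparsed relation:

    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Declared schema plus the corrupt-record catch-all column.
    schema = StructType([
        StructField("a", LongType()),
        StructField("b", LongType()),
        StructField("_corrupt_record", StringType()),
    ])

    ds = spark.read.schema(schema).option("mode", "PERMISSIVE").json("path")
    ds = ds.cache()  # required before filtering on _corrupt_record alone
    clean = ds.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")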

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben, I'm not sure if this answers your question, but if you're interested in exploring the underlying tables, you could always try something like the below in a Databricks notebook: display(spark.read.table('samples.nyctaxi.trips')) (For vanilla Spark users, it would be spark.read.table('s

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
t just be the reality of trying to process a 240m record file > with 80+ columns, unless there's an obvious issue with my setup that > someone sees. The solution is likely going to involve increasing > parallelization. > > To that end, I extracted and re-zipped this file in bzip. Since

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
file with 80+ columns, unless there's an obvious issue with my setup that someone sees. The solution is likely going to involve increasing parallelization. To that end, I extracted and re-zipped this file in bzip. Since bzip is splittable and gzip is not, Spark can process the bzip file in par

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization HTH Mich Talebzadeh, Solutions Architect/Engineering

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
In SSS writeStream. \ outputMode('append'). \ option("truncate", "false"). \ foreachBatch(SendToBigQuery). \ option('checkpointLocation', checkpoint_path). \ so this writeStream will call foreachBatc
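foreachBatch hands each micro-batch DataFrame its epoch id, which is the natural hook for getting a micro-batch id into logs. A sketch; SendToBigQuery, streaming_df and checkpoint_path stand in for the real names:

    def SendToBigQuery(batch_df, batch_id):
        # batch_id is the micro-batch (epoch) id; log it next to the sink call.
        print(f"micro-batch {batch_id}: {batch_df.count()} rows")
        # ... write batch_df to the sink here ...

    (streaming_df.writeStream
        .outputMode("append")
        .foreachBatch(SendToBigQuery)
        .option("checkpointLocation", checkpoint_path)
        .start())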

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread yangjie01
Thanks Dongjoon ~ On 2023/6/24 10:29, "L. C. Hsieh" <vii...@gmail.com> wrote: Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon <gurwls...@apache.org> wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan > wrote: >> >> >> Th

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread beliefer
Thanks, Dongjoon Hyun! Congratulations too! At 2023-06-24 07:57:05, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark.

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: >> >> >> Thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: >>> >>> We are happy to announce the av

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing st

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Mridul Muralidharan
Thanks Dongjoon ! Regards, Mridul On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > We are happy to announce the availability of Apache Spark 3.4.1! > > Spark 3.4.1 is a maintenance release containing stability fixes. This > release is based on the branch-3.4 maintenance branch of Spark. W

Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
data = { "Employee ID": [12345, 12346, 12347, 12348, 12349], "Name": ["Dummy x", "Dummy y", "Dummy z", "Dummy a", "Dummy b"], "Client": ["Dummy a", "Dummy b", "Dummy c", "Dummy d", "Dummy e"], "Project": ["abc", "def", "ghi", "jkl", "mno"], "Team": ["team a", "team b", "team c",

Re: Rename columns without manually setting them all

2023-06-21 Thread Farshid Ashouri
You can use selectExpr and stack to achieve the same effect in PySpark: df = spark.read.csv("your_file.csv", header=True, inferSchema=True) date_columns = [col for col in df.columns if '/' in col] df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`"] + [f"
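When every column needs renaming, the names can also be rebuilt programmatically and applied in one shot with toDF. A sketch; the replace rule is just an example:

    # Derive new names from the old ones, then rename all columns at once.
    new_names = [c.replace("/", "_").strip() for c in df.columns]
    df = df.toDF(*new_names)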

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks for the info. Regards Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:*

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
yes, p_df = DF.toPandas() that is THE pandas, the one you know. Change p_df = DF.toPandas() to p_df = DF.pandas_on_spark() or p_df = DF.to_pandas_on_spark() or p_df = DF.pandas_api() or p_df = DF.to_koalas() https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html T

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks. So the issue seems to be creating a pandas DF from a Spark DF (I do it for plotting with something like import matplotlib.pyplot as plt p_df = DF.toPandas() p_df.plot() I guess that stays in the driver. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies

Re: Shuffle data on pods which get decomissioned

2023-06-20 Thread Mich Talebzadeh
If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source. Does this mean that only those tasks that the dead executor was executing at the time need to be
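Spark 3.1+ can migrate shuffle and cached RDD blocks off an executor before its pod goes away, which limits exactly the recomputation being discussed. A hedged sketch of the relevant settings; exact behaviour depends on the Spark version and the Kubernetes setup:

    from pyspark.sql import SparkSession

    # Enable graceful decommissioning and block migration on shutdown.
    spark = (SparkSession.builder
             .config("spark.decommission.enabled", "true")
             .config("spark.storage.decommission.enabled", "true")
             .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
             .config("spark.storage.decommission.rddBlocks.enabled", "true")
             .getOrCreate())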

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed. >

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Thanks but if you create a Spark DF from Pandas DF that Spark DF is not distributed and remains on the driver. I recall a while back we had this conversation. I don't think anything has changed. Happy to be corrected Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Pandas API on Spark is an API so that users can use Spark as they use pandas. This was known as Koalas. Is this limitation still valid for pandas? For pandas, yes. But what I did show was the pandas API on Spark, so it's Spark. Additionally when we convert from a pandas DF to a Spark DF, what process is in

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Whenever someone mentions pandas I automatically think of it as an Excel sheet for Python. OK, my point below needs some qualification. Why Spark here? Generally, parallel architecture comes into play when the data size is significantly large which cannot be handled on a single machine, hence, the

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is the pandas API on Spark: from pyspark import pandas as ps df = ps.read_excel("testexcel.xlsx") [image: image.png] this will convert it to pyspark [image: image.png] On Tue, Jun 20, 2023 at 13:42, John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I c

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents - read_e

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
OK, the number of partitions n, or more to the point the "optimum" number of partitions, depends on the size of your batch data DF among other things and the degree of parallelism at the end point where you will be writing to the sink. If you require high parallelism because your tasks are fine grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement? I have state data source which enables the state in SS --> Structured Streaming to be rewritten, which enables repartitioning, schema evolution, etc via batch query. The writer requires hash partitioning against group key, with the "desired number of

Re: Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
> Hi > > I am using Spark with Iceberg, updating the table with 1700 columns. > We are loading 0.6 million rows from parquet files; in future it will be > 16 million rows, and we are trying to update the data in the table which has 16 > buckets. > Using the default partitioner of Spark. Also we don't do

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to d...@spark.apache.org (and join that group) > >
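The gap under discussion, in sketch form: a DataFrame can only become a temporary view today, while a permanent view has to be expressed as SQL text. Names below are made up:

    # Session-scoped: the DataFrame's logical plan lives only as long as the
    # session.
    df.createOrReplaceTempView("my_temp_view")

    # Metastore-persisted: stored as SQL text, so it survives the session.
    spark.sql("""
        CREATE OR REPLACE VIEW my_db.my_perm_view AS
        SELECT * FROM my_db.some_table WHERE ds = '2023-06-01'
    """)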

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone, either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone` or the default DataFrameWriter zone `timeZone` (when writing to file). You say you are in PST, whic
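Two quick checks for this kind of display-timezone artifact, as a sketch; the column name ts is an assumption:

    from pyspark.sql import functions as F

    # Epoch seconds are timezone-free: if these match, only rendering differs.
    df.select(F.col("ts").cast("long")).show()

    # Render timestamps in UTC instead of the local zone.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df.select("ts").show()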

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : > https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-co

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group). You need to raise a JIRA for this request, plus a related doc. Example JIRA: https://issues.apache.org/jira/browse/SPARK-42485 and the related *Spark project improvement proposals (SPIP)* to be filled in https://spark.apache.org

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list? Is there another/a better way to make feature requests? I tried in the past to write a mail to the dev mailing list but it did not show at all. Cheers keen schrieb am Do., 1. Juni 2023, 07:11: > Hi all, > currently only *temporary* Spark Views can be cr

Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT cheers Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Viewing UI for spark jobs running on K8s

2023-05-31 Thread Qian Sun
Hi Nikhil Spark operator supports ingress for exposing all UIs of running spark applications. reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote: > Hi folks

Re: ChatGPT and prediction of Spark future

2023-05-31 Thread Winston Lai
Hi Mich, I have been using ChatGPT free version, Bing AI, Google Bard and other AI chatbots. My use cases so far include writing, debugging code, generating documentation and explanation on Spark key terminologies for beginners to quickly pick up new concepts, summarizing pros and cons or uses

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker
Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic allo

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
ilt with version 2.13.8 >>> <https://github.com/apache/spark/blob/88f69d6f92860823b1a90bc162ebca2b7c8132fc/pom.xml#L170>. >>> Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to >>> the major(13) and the minor version(8). Not using any of the

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
That may be due to bug fixes and >> upgrades of scala itself.). >> And although I did not encounter such a problem, this >> <https://stackoverflow.com/a/26411339/19476830> can be a pitfall for >> you. >> >> -- >> Best Regards!

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
com/a/26411339/19476830> can be a pitfall for you. > > -- > Best Regards! > ... > Lingzhe Sun > Hirain Technology > > > *From:* Mich Talebzadeh > *Date:* 2023-05-29 17:55 > *To:* Bjørn Jør

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not supported. But then again 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: JDK version support information

2023-05-29 Thread Poorna Murali
We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. So, we wanted to check the maximum JDK version supported for 2.4.5.1. On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, wrote: > Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the > official Spark docu

Re: JDK version support information

2023-05-29 Thread Aironman DirtDiver
Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the official Spark documentation for version 2.4.5, the maximum supported JDK (Java Development Kit) version is JDK 8 (Java 8). Spark 2.4.5 is not compatible with JDK versions higher than Java 8. Therefore, you should use JDK 8 to

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
gards! ... Lingzhe Sun Hirain Technology From: Mich Talebzadeh Date: 2023-05-29 17:55 To: Bjørn Jørgensen CC: user @spark Subject: Re: maven with Spark 3.4.0 fails compilation Thanks for your helpful comments Bjorn. I managed to compile the

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks for your helpful comments Bjorn. I managed to compile the code with maven but when it run it fails with Application is ReduceByKey Exception in thread "main" java.lang.NoSuchMethodError: scala.package$.Seq()Lscala/collection/immutable/Seq$; at ReduceByKey$.main(ReduceByKey.scala

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen
From ChatGPT-4: The problem appears to be that there is a mismatch between the version of Scala used by the Scala Maven plugin and the version of the Scala library defined as a dependency in your POM. You've defined your Scala version in your properties as `2.12.17` but you're pulling in `scala-li

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi, Autoscaling is not compatible with Spark Structured Streaming since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support dynamic allocatio

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an Implantation from Joint Probability Trees > (JPT) using the Spark framework. For this to

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
Hi, given your dataset: val df=Seq( (1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"), (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"), (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02"), (10, 20230523, "M02"), (11, 20230523, "M02"), (12, 20230523
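One common shape for such a change-dependent counter, sketched in PySpark against the dataset above (the column names id and code are my labels for the tuples; the single-partition window is fine for an example but would bottleneck real data): flag each row where the code differs from the previous row, then take a running sum of the flags.

    from pyspark.sql import Window, functions as F

    w = Window.orderBy("id")  # global ordering => single-partition window
    df2 = (df
           .withColumn("changed",
                       (F.col("code") != F.lag("code", 1, "").over(w)).cast("int"))
           .withColumn("counter", F.sum("changed").over(w)))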

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given, you are already stating the above can be imagined as a partition, I can think of mapPartitions iterator. val inputSchema = inputDf.schema val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows)) val outputDf = sparkSession.createDataFrame(outputRdd, inputSchema.add("coun

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
Thanks, great Rauf. Regards On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote: Hi, PartitionBy() is analogous to group by: all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards, Raouf On Fri, May 12,

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread Rauf Khan
Hi, PartitionBy() is analogous to group by: all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards, Raouf On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: > Hello, > > In Spark windowing does call wi
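Rauf's point in sketch form (column names are made up): every row with the same department lands in the same window partition, which is one shuffle, and the aggregate is attached to each row instead of collapsing the rows as GROUP BY would.

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("department")
    df = df.withColumn("dept_avg", F.avg("salary").over(w))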
