Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
rency and Thread Issues: If there are too many concurrent > connections or thread limitations, > it could result in failed connections. *Adjust > spark.shuffle.io.clientThreads* > - It might be prudent to do the same to *spark.shuffle.io.serverThreads* > - Check how stable your envi

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Mich Talebzadeh
, it could result in failed connections. *Adjust spark.shuffle.io.clientThreads* - It might be prudent to do the same to *spark.shuffle.io.serverThreads* - Check how stable your environment is. Observe any issues reported in Spark UI HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
trust that you are familiar with the concept of shuffle in Spark. > Spark Shuffle is an expensive operation since it involves the following > >- > >Disk I/O >- > >Involves data serialization and deserialization >- > >Network I/O > > Bas

Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi, These two threads that you sent seem to be duplicates of each other? Anyhow I trust that you are familiar with the concept of shuffle in Spark. Spark Shuffle is an expensive operation since it involves the following - Disk I/O - Involves data serialization and deserialization

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.
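For readers who want to experiment with these properties, here is a minimal sketch of one way to pass them to spark-submit. It only shows where such settings can go; it does not explain how they differ. The property names are the ones listed in the question, while the numeric values, the helper name to_submit_args and the script name my_job.py are hypothetical.

```python
# Property names are taken from the question above.
# The values are hypothetical and chosen only for illustration.
thread_settings = {
    "spark.shuffle.io.serverThreads": 8,
    "spark.shuffle.io.clientThreads": 8,
    "spark.rpc.io.serverThreads": 4,
    "spark.rpc.io.clientThreads": 4,
}


def to_submit_args(settings):
    """Turn a {property: value} mapping into a flat list of --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(thread_settings)
# my_job.py is a placeholder for the application being submitted.
print(" ".join(["spark-submit", *submit_args, "my_job.py"]))
```

The same key/value pairs could equally be placed in spark-defaults.conf, one property per line.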

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
s back. Thanks, Sankavi From: Bjørn Jørgensen Sent: Monday, August 14, 2023 6:11 PM To: Sankavi Nagalingam Cc: user@spark.apache.org; Vijaya Kumar Mathupaiyan Subject: [EXT MSG] Re: Spark Vulnerabilities I have added links to the github

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and 4GB. So I suggest that you leave it as it is. I have not used the standalone mode since spark-2.4.3 so I may be missing a fair bit of context here. I am sure there are others like you that are still using it! HTH Mich

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
On Thu, 17 Aug 2023 at 21:01, Patrick Tucci > wrote: > >> Hi Mich, >> >> Here are my config values from spark-defaults.conf: >> >> spark.eventLog.enabled true >> spark.eventLog.dir hdfs:/

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
On Thu, 17 Aug 2023 at 21:01, Patrick Tucci wrote: > Hi Mich, > > Here are my config values from spark-defaults.conf: > > spark.eventLog.enabled true > spark.eventLog.dir hdfs://10.0.50.1:8020/spark-

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Here are my config values from spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick, As a matter of interest, what parameters and their respective values do you use in spark-submit? I assume it is running in YARN mode. HTH

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Yes, that's the sequence of events. I think the big breakthrough is that (for now at least) Spark is throwing errors instead of the queries hanging. Which is a big step forward. I can at least troubleshoot issues if I know what they are. When I reflect on the issues I faced an

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still we are in the dark about how this problem is going away and coming back :( As I recall, the chronology of events was as follows: 1. The Issue with hanging Spark job reported 2

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone, I just wanted to follow up on this issue. This issue has continued since our last correspondence. Today I had a query hang and couldn't resolve the issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error me

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstrea

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the github PR. Or comment for those that I have not seen before. Apache Spark has very many dependencies, some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you wanna help. man. 14. aug. 2023 kl. 14:06 skrev Sankavi Nagalingam

Spark Vulnerabilities

2023-08-14 Thread Sankavi Nagalingam
Hi Team, We can see that there are many vulnerable dependencies in the latest spark-core:3.4.1.jar (please find the report attached). Could you please let us know when a version with the fixes will be available to users? Thanks, Sankavi

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
r install an additional Java version, I attempted to use the > latest alpha as well. This appears to have worked, although I couldn't > figure out how to get it to use the metastore_db from Spark. > > After turning my attention back to Spark, I determined the issue. After > much trouble

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
tions suggest it might be a Java incompatibility issue. Since I didn't want to downgrade or install an additional Java version, I attempted to use the latest alpha as well. This appears to have worked, although I couldn't figure out how to get it to use the metastore_db from Spark. After turni

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
to migrate >>> to Delta Lake and see if that solves the issue. >>> >>> Thanks again for your feedback. >>> >>> Patrick >>> >>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
; >> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> Hi Patrick, >>> >>> There is not anything wrong with Hive On-premise it is the best data >>> warehouse there is >>> >>> Hive handles bo

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
..@gmail.com> wrote: > >> Hi Patrick, >> >> There is not anything wrong with Hive On-premise it is the best data >> warehouse there is >> >> Hive handles both ORC and Parquet formats well. They are both columnar >> implementations of relational mod

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
both ORC and Parquet formats well. They are both columnar > implementations of relational model. What you are seeing is the Spark API > to Hive which prefers Parquet. I found out a few years ago. > > From your point of view I suggest you stick to parquet format with Hive > specific t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is nothing wrong with Hive. On-premise it is the best data warehouse there is. Hive handles both ORC and Parquet formats well. They are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host serve

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. This limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark Connect, Master, and Workers

2023-08-10 Thread Brian Huynh
Hi Kezhi, Yes, you no longer need to start a master to make the client work. Please see the quickstart. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html You can think of Spark Connect as an API on top of Master so workers can be added to the cluster same

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark, I created a spark3.4.1 docker file. Details from spark-py-3.4.1-scala_2.12-11-jre-slim-buster <https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated> Pull instructions are given docker pull michtalebzadeh/spark_dockerfile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
On Thu, 10 Aug 2023 at 20:02, Patrick Tucci > wrote: > >> Hi Mich, >> >> Thanks for the reply. Unfortunately I don't have Hive set up on my >> cluster. I can explore this if th

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
my > cluster. I can explore this if there are no other ways to troubleshoot. > > I'm using beeline to run commands against the Thrift server. Here's the > command I use: > > ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f > command.sql > > Thanks

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
On Thu, 10 Aug 2023 at 18:39, Patrick Tucci wrote: > Hello, > > I'm attempting to run a query on Spar

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON

Spark Connect, Master, and Workers

2023-08-09 Thread Kezhi Xiong
Hi, I've recently been learning Spark Connect but have some questions regarding the connect server's relation with master or workers: so when I'm using the connect server, I don't have to start a master alongside to make clients work. Is the connect server simply using "local[

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark, you can build it yourself, no big deal :) REPOSITORY TAG IMAGE ID CREATED SIZE sparkpy/spark-py 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206 1 second ago

dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello, I noticed that the apache/spark-py image for Spark's 3.4.1 release is not available (apache/spark@3.4.1 is available). Would it be possible to get the 3.4.1 release build for the apache/spark-py image published? Thanks, Mark

Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Mich Talebzadeh
I am currently pondering and sharing my thoughts openly. Given our reliance on gathered statistics, it prompts the question of whether we could integrate specific machine learning components into Spark Structured Streaming. Consider a scenario where we aim to adjust configuration values on the fly

Spark 3.41 with Java 11 performance on k8s serverless/autopilot

2023-08-07 Thread Mich Talebzadeh
Hi, I would like to share experience on spark 3.4.1 running on k8s autopilot or some refer to it as serverless. My current experience is on Google GKE autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>. So essentially you specify the name and region a

Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representati
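A minimal sketch of the "pass the bytes plus width/height/channels, reshape on use" idea, written with NumPy only so it runs without a Spark session. The function names encode_image and decode_image and the dictionary keys are hypothetical; the (500, 333, 3) shape is the one mentioned in this thread, and the pixel values are synthetic.

```python
import numpy as np


def encode_image(image):
    """Flatten a (height, width, channels) uint8 array into raw bytes plus its shape."""
    height, width, channels = image.shape
    return {
        "data": image.tobytes(),
        "height": height,
        "width": width,
        "channels": channels,
    }


def decode_image(row):
    """Rebuild the array from the raw bytes and the stored dimensions."""
    flat = np.frombuffer(row["data"], dtype=np.uint8)
    return flat.reshape(row["height"], row["width"], row["channels"])


# Synthetic image with the shape mentioned in the thread.
image = (np.arange(500 * 333 * 3) % 256).astype(np.uint8).reshape(500, 333, 3)
row = encode_image(image)
restored = decode_image(row)
print(restored.shape, np.array_equal(image, restored))
```

Each image then becomes one flat row (a binary column plus three integer columns) instead of a triply nested array, which is the simplification being suggested.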

Re: conver panda image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
Hello Adrian,    here is the snippet import tensorflow_datasets as tfds (ds_train, ds_test), ds_info = tfds.load(     dataset_name, data_dir='',  split=["train", "test"], with_info=True, as_supervised=True ) schema = StructType([     StructField("image", ArrayType(ArrayType(ArrayType(Integer

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
will cover me as I plan to be out of the > office soon) > > Hi Kent and Sean, > > Nice to meet you. I am working on the OSS legal aspects with Pavan who is > planning to make the contribution request to the Spark project. I saw that > Sean mentioned in his email that the contribu

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Rinat Shangeeta
(Adding my manager Eugene Kim who will cover me as I plan to be out of the office soon) Hi Kent and Sean, Nice to meet you. I am working on the OSS legal aspects with Pavan who is planning to make the contribution request to the Spark project. I saw that Sean mentioned in his email that the

Re: conver panda image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas dataframe? I mean, how you added the actual data into the dataframe. It would help us for reproducing the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com> wrote: > i

Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi, I am new to Spark and looking for help regarding the session windowing <https://spark.apache.org/docs/3.4.1/structured-streaming-programming-guide.html#types-of-time-windows> in Spark. I want to create session windows on a user activity stream with a gap duration of `x` minutes and als
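To make the gap-duration idea concrete, here is a small plain-Python sketch of session grouping. It is not Spark's implementation and does not use the Structured Streaming API; it only illustrates the rule that an event joins the current session if it arrives less than the gap after the previous event, and otherwise starts a new session. The function name sessionize, the sample timestamps and the 5-minute gap are assumptions for illustration.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap):
    """Group event times into sessions; a new session starts once the gap is reached."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still within the gap: extend the current session
        else:
            sessions.append([ts])  # gap reached or first event: open a new session
    return sessions


base = datetime(2023, 8, 3, 10, 0)
# Hypothetical activity for one user, in minutes after 10:00.
events = [base + timedelta(minutes=m) for m in (0, 3, 4, 20, 22, 60)]
sessions = sessionize(events, gap=timedelta(minutes=5))

for session in sessions:
    print(session[0].time(), "->", session[-1].time(), f"({len(session)} events)")
```

In a real job the grouping would be done per user key, and the streaming engine would also have to decide when a session is considered closed.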

Re: conver panda image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
i changed to ArrayType(ArrayType(ArrayType(IntegerType( , still get same error Thank you for responding On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello,  when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Becaus

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
ok so as expected the underlying database is Hive. Hive uses hdfs storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by Hive metastore so to speak. Since this is all happening through Spark, by default implementation of the Hive metastore

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
4:28 PM Mich Talebzadeh > wrote: > >> It is not Spark SQL that throws the error. It is the underlying Database >> or layer that throws the error. >> >> Spark acts as an ETL tool. What is the underlying DB where the table >> resides? Is concurrency supported.

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert

Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN resource manager. YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively. In Cloud it is also used widely in Infrastructure as a Service such as

The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Trường Trần Phan An
Hi all, I am learning about the performance difference of Spark when performing a JOIN problem on Serverless (K8S) and Serverful (Traditional server) environments. Through experiment, Spark on K8s tends to run slower than Serverful. Through understanding the architecture, I know that Spark runs

Re: conver panda image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: > i h

conver panda image column to spark dataframe

2023-07-27 Thread second_co...@yahoo.com.INVALID
I have a pandas DataFrame with an 'image' column of numpy.ndarray values; the shape is (500, 333, 3) per image. My pandas DataFrame has 10 rows, so the overall shape is (10, 500, 333, 3). When using spark.createDataFrame(panda_dataframe, schema), I need to specify the schema: schema = StructType([     StructField(

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > >

spark context list_packages()

2023-07-26 Thread second_co...@yahoo.com.INVALID
I ran the following code spark.sparkContext.list_packages() on spark 3.4.1 and i get below error An error was encountered: AttributeError [Traceback (most recent call last): , File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec self._exec

Re: Interested in contributing to SPARK-24815

2023-07-26 Thread Pavan Kotikalapudi
A with Twilio and consider > establishing that to govern contributions. > > > > On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi < > pkotikalap...@twilio.com.invalid> wrote: > >> > >> Hi Spark Dev, > >> > >> My name is Pavan Kotikalapudi,

Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
ributed to the project is assumed > to have been licensed per above already. > > It might be wise to review the CCLA with Twilio and consider establishing > that to govern contributions. > > On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi > wrote: >> >> Hi S

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
e wise to review the CCLA with Twilio and consider establishing that to govern contributions. On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi wrote: > Hi Spark Dev, > > My name is Pavan Kotikalapudi, I work at Twilio. > > I am looking to contribute to this spark issue > h

Fwd: Interested in contributing to SPARK-24815

2023-07-24 Thread Pavan Kotikalapudi
Hi Spark Dev, My name is Pavan Kotikalapudi, I work at Twilio. I am looking to contribute to this spark issue https://issues.apache.org/jira/browse/SPARK-24815. There is a clause from the company's OSS saying - The proposed contribution is about 100 lines of code modification in the

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
Personally, I have not done it myself. I have CCed the Spark user group in case another user has tried it. HTH

Re: Unable to launch Spark connect on Docker image

2023-07-22 Thread Mich Talebzadeh
This is the downloaded docker? Try this with the added configuration options as below /opt/spark/sbin/start-connect-server.sh *--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" *--packages org.apache.spark:spark-connect_2.12:3.4.1 And you will get

Unable to launch Spark connect on Docker image

2023-07-21 Thread Edmondo Porcu
Hello, I am trying to launch Spark connect on Docker Image ❯ docker run -it apache/spark:3.4.1-scala2.12-java11-r-ubuntu /bin/bash spark@aa0a670f7433:/opt/spark/work-dir$ /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.1 starting

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Mich Talebzadeh
This link might help: https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
t; partition updates(insert overwrite) daily for the last 30 days >>> (partitions). >>> The ETL inside the staging directories is completed in hardly 5minutes, >>> but then renaming takes a lot of time as it deletes and copies the >>> partitions. >>> My issue is somethi

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
rtitions). >> The ETL inside the staging directories is completed in hardly 5minutes, >> but then renaming takes a lot of time as it deletes and copies the >> partitions. >> My issue is something related to this - >> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhyt

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
tories is completed in hardly 5minutes, > but then renaming takes a lot of time as it deletes and copies the > partitions. > My issue is something related to this - > https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1 > > > > With Best Regards, > >

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
it deletes and copies the partitions. My issue is something related to this - https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1 With Best Regards, Dipayan Dev On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh wrote: > Spark has no role in creating that hive stag

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that Hive staging directory. That directory belongs to Hive, and Spark simply does ETL there, loading to the Hive managed table, which in your case ends up in the staging directory. I suggest that you review your design and use an external Hive table with explicit location

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly. I am just wondering, once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to original directory? Anyone tried this approach or foresee any concern? On Mon, 17 Jul 2023

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a pa

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for guidance here. I would really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apach

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
ll take a long time to perform this step. One workaround will be > to create smaller number of larger files if that is possible from Spark and > if this is not possible then those configurations allow for configuring the > threadpool which does the metadata copy. > > You can go thr

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
ote the > MLlib-specific contribution guidelines section in particular. > > https://spark.apache.org/contributing.html > > Since you are looking for something to start with, take a look at this > Jira query for starter issues. > > > https://issues.apache.org/jira/browse/S

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
FileOutputCommitter v2 is supported in GCS, but the rename is a metadata copy and delete operation in GCS, and therefore if there is a large number of files it will take a long time to perform this step. One workaround will be to create a smaller number of larger files, if that is possible from Spark, and
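As a rough illustration of the "smaller number of larger files" workaround, the following sketch estimates how many output files a job would need for a given target file size. The helper name target_file_count and both sizes are hypothetical; the resulting number could then be used with DataFrame.repartition or DataFrame.coalesce before the write, on the assumption that one file is written per partition.

```python
import math


def target_file_count(total_bytes, target_file_bytes):
    """Estimate how many output files are needed for a given target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


total_bytes = 50 * 1024**3         # hypothetical: 50 GiB of output data
target_file_bytes = 256 * 1024**2  # hypothetical: aim for files of about 256 MiB

num_files = target_file_count(total_bytes, target_file_bytes)
print(num_files)  # 200
```

Fewer files means fewer per-object copy and delete operations during the rename step, which is where the time is reported to go.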

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
restingly, it took only 10 minutes to write the output in the staging > directory and rest of the time it took to rename the objects. Thats the > concern. > > Looks like a known issue as spark behaves with GCS but not getting any > workaround for this. > > > On Mon, 17 Jul

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It is supported; at least it doesn't error out for me. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output in the staging directory, and the rest of the time went into renaming the objects. That's the concern. Looks like a known issue as spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
er algorithms? > > I tried v2 algorithm but its not enhancing the runtime. What’s the best > practice in Dataproc for dynamic updates in Spark. > > > On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > >> You can try increasing fs.gs.batch.threads and >> fs.gs.max.requests

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not improving the runtime. What's the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increas

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Sp
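A minimal sketch of how these two connector flags could be handed to a Spark job. It relies on Spark's general behaviour of copying properties prefixed with spark.hadoop. into the Hadoop configuration; the flag names are the ones mentioned above, while the values and the helper name as_spark_defaults are hypothetical. Whether a given connector version honours these flags should be checked against the linked CONFIGURATION.md.

```python
# Flag names come from the message above; the values are hypothetical.
gcs_settings = {
    "fs.gs.batch.threads": 30,
    "fs.gs.max.requests.per.batch": 30,
}


def as_spark_defaults(hadoop_settings):
    """Render Hadoop properties as spark-defaults.conf lines with the spark.hadoop. prefix."""
    return [f"spark.hadoop.{key} {value}" for key, value in sorted(hadoop_settings.items())]


lines = as_spark_defaults(gcs_settings)
print("\n".join(lines))
```

The printed lines can be appended to spark-defaults.conf, or each pair can be passed as a --conf key=value argument to spark-submit.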

Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Resending this message with a proper Subject line Hi Spark Community, I am trying to set up my forked apache/spark project locally for my 1st Open Source Contribution, by building and creating a package as mentioned here under Running Individual Tests <https://spark.apache.org/develo

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. When I do a dynamic partition update of Spark, it creates the new file in a Staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH

Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All, Of late, I have encountered the issue where I have to overwrite a lot of partitions of the Hive table through Spark. It looks like writing to hive_staging_directory takes 25% of the total time, whereas 75% or more time goes in moving the ORC files from staging directory to the final

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
this Jira query for starter issues. https://issues.apache.org/jira/browse/SPARK-38719?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20%22starter%22%20AND%20status%20%3D%20Open Cheers, Brian On Sun, Jul 16, 2023 at 8:49 AM Dipayan Dev wrote: > Hi Spark Community, > > A very good morni

Contributing to Spark MLLib

2023-07-16 Thread Dipayan Dev
Hi Spark Community, A very good morning to you. I have been using Spark for the last few years and am new to the community. I am very interested in becoming a contributor, and I am looking to contribute to Spark MLLib. Can anyone please suggest how to start contributing to a new MLLib feature? Is

[Spark RPC]: Yarn - Application Master / executors to Driver communication issue

2023-07-14 Thread Sunayan Saikia
Hey Spark Community, Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of HAProxy and the Yarn cluster runs remotely. We want to use deploy mode 'client' so that we can capture the output of any spark sql query in jupyterlab. I'm aware of other technologies like

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > > https://spark.apache.org/docs/3.2.

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and that you can access the Spark UI from your browser. If you don't own the Spark cluster, contact your Spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI O

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache Spark community, I am a Data Engineer. I have been using Apache Spark for some time now. I recently tried to use it

Spark Not Connecting

2023-07-12 Thread timi ayoade
Hi Apache Spark community, I am a Data Engineer. I have been using Apache Spark for some time now. I recently tried to use it but I have been getting some errors. I have tried debugging the error, but to no avail. The screenshot is attached below. I would be glad to get a response. Thanks

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under directory $SPARK_HOME get a list of jar files for hive and hadoop. This one is for version 3.4.0 /opt/spark/jars> ltr *hive* *hadoop* -rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar -rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43 sp

Loading in custom Hive jars for spark

2023-07-11 Thread Yeachan Park
Hi all, We made some changes to hive which require changes to the hive jars that Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9 jars, we built our changes in Hive 2.3.9 and put the necessary jars under $SPARK_HOME/jars (replacing the original jars that were there

Unable to populate spark metrics using custom metrics API

2023-07-08 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Spark UI - Bug Executors tab when using proxy port

2023-07-06 Thread Bruno Pistone
Hello everyone, I’m really sorry to use this mailing list, but it seems impossible to report a strange behaviour that is happening with the Spark UI. I’m also sending the link to the Stack Overflow question here https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty I’m
