I reduced the state timeout from 10 minutes to 2 minutes so that memory would be released more quickly. The new numbers for Storage Memory are 54.7 GB out of 598.5 GB, but I still don't trust these numbers. As Amit pointed out, it seems there's a bug in the Spark 2.4 UI.
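For what it's worth, the second number shown under 'Storage Memory' is not the raw requested executor memory: under Spark's unified memory model it is roughly (heap - ~300 MB reserved) * spark.memory.fraction, summed over executors. A rough back-of-the-envelope sketch in Python (assuming the default spark.memory.fraction of 0.6 and Spark's 300 MB reserved system memory; the executor counts below are illustrative, not taken from my job):

```python
def unified_pool_mb(executor_heap_mb, memory_fraction=0.6, reserved_mb=300):
    """Approximate per-executor size of the unified (execution + storage)
    memory pool, which the Executors UI reports as the second
    'Storage Memory' number."""
    return (executor_heap_mb - reserved_mb) * memory_fraction

# One executor launched with --executor-memory 10g:
per_executor = unified_pool_mb(10 * 1024)   # 5964.0 MB

# Aggregate over, say, 100 executors, as the Executors summary row does:
total_gb = 100 * per_executor / 1024        # ~582 GB, well below 100 x 10 GB
```

So a pool total far below the requested memory is expected to some degree; whether the gap fully explains 598.5 GB depends on the actual executor count and heap size.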
I am requesting 2 TB of memory, but the UI keeps showing 598.5 GB. I am not sure whether it's a bug in the Spark 2.4 UI or whether our cluster is indeed not giving my job enough memory!

On Sun, Jan 10, 2021 at 12:32 AM Amit Sharma <resolve...@gmail.com> wrote:

> I believe it's a Spark UI issue which does not display the correct value. I
> believe it is resolved in Spark 3.0.
>
> Thanks
> Amit
>
> On Fri, Jan 8, 2021 at 4:00 PM Luca Canali <luca.can...@cern.ch> wrote:
>
>> You report 'Storage Memory': 3.3 TB / 598.5 GB -> The first number is the
>> memory used for storage, the second one is the available memory (for
>> storage) in the unified memory pool.
>>
>> The used memory shown in your webui snippet is indeed quite high (higher
>> than the available memory!?), so you can probably profit by drilling down on
>> that to understand better what is happening.
>>
>> For example, look at the details per executor (the numbers you reported
>> are aggregated values), then also look at the "Storage" tab for a list of
>> cached RDDs with details.
>>
>> In addition, Spark 3.0 has improved memory instrumentation and improved
>> instrumentation for streaming, so you can profit from testing there too.
>>
>> *From:* Eric Beabes <mailinglist...@gmail.com>
>> *Sent:* Friday, January 8, 2021 04:23
>> *To:* Luca Canali <luca.can...@cern.ch>
>> *Cc:* spark-user <user@spark.apache.org>
>> *Subject:* Re: Understanding Executors UI
>>
>> So when I see this for 'Storage Memory': *3.3 TB / 598.5 GB*, *is it
>> telling me that Spark is using 3.3 TB of memory, and 598.5 GB is used for
>> caching data?* What I am surprised about is that these numbers
>> don't change at all throughout the day, even though the load on the system
>> is low after 5 pm PST.
>>
>> I would expect the "Memory used" to be lower than 3.3 TB after 5 pm PST.
>>
>> Does Spark 3.0 do a better job of memory management? Wondering if
>> upgrading to Spark 3.0 would improve performance?
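Luca's suggestion to drill down per executor can also be scripted against Spark's monitoring REST API (the `/api/v1/applications/<app-id>/executors` endpoint; `memoryUsed` and `maxMemory` are fields of its ExecutorSummary payload). A minimal sketch, where the driver host/port and the JSON payload below are purely illustrative stand-ins for a live response:

```python
import json

# In a live session you would fetch the JSON from the driver UI, e.g.:
#   urllib.request.urlopen(
#       "http://driver-host:4040/api/v1/applications/<app-id>/executors")
# Here an illustrative payload stands in for the real response.
sample = json.loads("""[
  {"id": "driver", "memoryUsed": 1048576,    "maxMemory": 4294967296},
  {"id": "1",      "memoryUsed": 4294967296, "maxMemory": 2147483648},
  {"id": "2",      "memoryUsed": 104857600,  "maxMemory": 2147483648}
]""")

def storage_hot_spots(executors):
    """Return ids of executors whose used storage memory exceeds their
    pool size -- the per-executor view of the '3.3 TB / 598.5 GB' oddity."""
    return [e["id"] for e in executors if e["memoryUsed"] > e["maxMemory"]]

print(storage_hot_spots(sample))  # -> ['1']
```

Dumping this per executor would show whether the over-the-limit usage is spread evenly or concentrated on a few executors.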
>> On Wed, Jan 6, 2021 at 2:29 PM Luca Canali <luca.can...@cern.ch> wrote:
>>
>> Hi Eric,
>>
>> A few links, in case they are useful for your troubleshooting:
>>
>> The Spark Web UI is documented in the Spark 3.x documentation, although you
>> can use most of it for Spark 2.4 too:
>> https://spark.apache.org/docs/latest/web-ui.html
>>
>> Spark memory management is documented at
>> https://spark.apache.org/docs/latest/tuning.html#memory-management-overview
>>
>> Additional resources: see also this diagram
>> https://canali.web.cern.ch/docs/SparkExecutorMemory.png and
>> https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
>>
>> Best,
>> Luca
>>
>> *From:* Eric Beabes <mailinglist...@gmail.com>
>> *Sent:* Wednesday, January 6, 2021 00:20
>> *To:* spark-user <user@spark.apache.org>
>> *Subject:* Understanding Executors UI
>>
>> [image: image.png]
>>
>> Not sure if this image will go through. (Never sent an email to this
>> mailing list with an image.)
>>
>> I am trying to understand the 'Executors' UI in Spark 2.4. I have a
>> Stateful Structured Streaming job with the state timeout set to 10 minutes.
>> When the load on the system is low, a message gets written to Kafka
>> immediately after the state times out, but under heavy load it takes over 40
>> minutes for a message to appear on the output topic. I'm trying to debug this
>> issue and see if performance can be improved.
>>
>> Questions:
>>
>> 1) I am requesting 3.2 TB of memory, but it seems the job keeps using only
>> 598.5 GB, as per the values in 'Storage Memory' as well as 'On Heap Storage
>> Memory'. I'm wondering whether this is a cluster issue, or am I not setting
>> values correctly?
>>
>> 2) Where can I find documentation explaining the different tabs in the
>> Spark UI? (Sorry, Googling didn't help. I will keep searching.)
>>
>> Any pointers would be appreciated. Thanks.
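On question 1, one sanity check is to compare what was requested against what the resource manager actually granted (YARN container maximums, for instance, can silently cap allocations). A purely illustrative submission shape, with made-up numbers and a hypothetical application jar:

```shell
# Illustrative: 100 executors x 32g heap = ~3.2 TB of requested heap.
# The Executors UI's 'Storage Memory' total will still be roughly
# num-executors x (heap - 300 MB) x spark.memory.fraction, not 3.2 TB.
spark-submit \
  --num-executors 100 \
  --executor-memory 32g \
  --conf spark.memory.fraction=0.6 \
  your-app.jar
```

Comparing the executor count and per-executor heap in the Executors tab against these flags shows quickly whether the cluster granted the full request.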