Hi, it looks like you have answered some of the questions which I generally ask. One more thing: could you please let me know the environment? Is it AWS, GCP, Azure, Databricks, HDP, etc.?
Regards,
Gourav

On Sun, Apr 11, 2021 at 8:39 AM András Kolbert <kolbertand...@gmail.com> wrote:

> Hi,
>
> Sure!
>
> Application:
> - Spark version 2.4
> - Kafka stream (DStream, from Kafka 0.8 brokers)
> - 7 executors, 2 cores, 3700M memory each
>
> Logic:
> - The process initialises a dataframe that contains metrics per
>   account/product (e.g. {"account": "A", "product": "X123", "metric": 51})
> - After initialisation, the dataframe is persisted on HDFS (the dataframe
>   is around 1 GB total in memory)
> - Streaming: each batch processes the incoming data, unions the main
>   dataframe with the new account/product/metric interaction dataframe,
>   aggregates the totals, and then persists the result on HDFS again (each
>   batch we save the full dataframe again)
> - The screenshot I sent earlier was taken after this aggregation, and it
>   shows how all the data seems to end up on the same executor. That could
>   explain why the executor periodically dies with OOM.
>
> Mich, I hope this provides extra information :)
>
> Thanks
> Andras
>
> On Sat, 10 Apr 2021 at 16:42, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Can you provide a bit more info please?
>>
>> How are you running this job and what is the streaming framework (Kafka,
>> files etc.)?
>>
>> HTH
>>
>> Mich
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
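[As an aside on the "all the data ends up on the same executor" observation above: Spark routes each row to a shuffle partition by hashing its grouping key, so if one account/product key dominates a batch, the bulk of the rows land on whichever executor owns that one partition. The sketch below is a plain-Python illustration of that effect, not Spark API code; the key names, row counts, and partition count are hypothetical, with Python's hash() standing in for the JVM hashCode.]

```python
from collections import Counter

def hash_partition(key, num_partitions):
    # Mimics the idea behind Spark's HashPartitioner:
    # partition = nonNegativeMod(key.hashCode, numPartitions).
    # Python's hash() stands in for hashCode here (hypothetical stand-in).
    return hash(key) % num_partitions

# A skewed batch: one hot account/product key dominates,
# plus a long tail of 1000 distinct keys with one row each.
rows = [("acct_A", "X123")] * 9000 + [("acct_%d" % i, "P%d" % i) for i in range(1000)]

num_partitions = 14  # e.g. 7 executors x 2 cores (hypothetical)
counts = Counter(hash_partition(key, num_partitions) for key in rows)

# All 9000 rows for the hot key hash to the same partition, so one
# partition holds ~90% of the batch while the rest share the remainder --
# the concentration pattern visible in the screenshot.
print(counts.most_common(3))
```

Under this model, simply calling repartition() before the groupBy does not fix the imbalance, because the subsequent shuffle re-hashes on the same skewed key.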
>>
>> On Sat, 10 Apr 2021 at 14:28, András Kolbert <kolbertand...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a streaming job, and quite often executors die during processing
>>> (due to memory errors / "unable to find location for shuffle" etc.). I
>>> started digging and found that some of the tasks are concentrated on one
>>> executor, as below:
>>>
>>> [image: image.png]
>>>
>>> Can this be the reason?
>>> Should I repartition the underlying data before I execute a groupBy on
>>> top of it?
>>>
>>> Any advice is welcome.
>>>
>>> Thanks
>>> Andras
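[On the repartition-before-groupBy question: when the skew comes from the grouping key itself, a common remedy on Spark 2.x is key salting — append a random salt to the key, aggregate partially per salted key (spreading the hot key over several partitions), then aggregate again on the original key. Below is a minimal pure-Python sketch of the two-stage idea only, not Spark code; the function name, NUM_SALTS value, and sample rows are all hypothetical.]

```python
import random
from collections import defaultdict

NUM_SALTS = 8  # hypothetical; in Spark this bounds the fan-out of the hot key

def salted_two_stage_sum(rows):
    """rows: iterable of ((account, product), metric) pairs."""
    # Stage 1: partial sums keyed by (key, salt). The hot key is spread
    # across up to NUM_SALTS shuffle partitions instead of just one.
    partial = defaultdict(int)
    for key, metric in rows:
        partial[(key, random.randrange(NUM_SALTS))] += metric
    # Stage 2: strip the salt and combine the (few) partial sums per
    # original key. This shuffle moves only NUM_SALTS rows per key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)

# Skewed input: 9000 rows for one hot key, 100 for another.
rows = [(("A", "X123"), 1)] * 9000 + [(("B", "Y9"), 2)] * 100
totals = salted_two_stage_sum(rows)
# The salted two-stage aggregate matches a direct groupBy-sum.
print(totals)
```

In Spark terms this roughly corresponds to grouping first by the key concatenated with a random salt column, then grouping the partial results by the key alone. (Spark 3's adaptive query execution handles some skew automatically, but on 2.4 manual salting, or increasing parallelism via spark.sql.shuffle.partitions, are the usual levers.)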