The "executor memory used" figure in the UI shows storage memory (cached data), not JVM heap usage. You're running out of memory somewhere, likely in your UDF, which probably parses each massive XML doc into a DOM first or something similar. Use more memory, run fewer tasks per executor, or consider spark-xml if you really only need to parse pieces of the documents. It'll be more efficient.
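For the spark-xml route, a minimal sketch (the "record" row tag, input path, and field name below are placeholders, not from your job; you'd also add the package, e.g. --packages com.databricks:spark-xml_2.12:0.14.0):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-parse").getOrCreate()

// Let spark-xml extract the repeated element you care about instead of
// building a DOM for a whole 2 GB document inside a UDF.
val parsed = spark.read
  .format("xml")
  .option("rowTag", "record")   // placeholder: the element you actually need
  .load("/path/to/xml/files")   // placeholder input path

// De-duplication then happens in Spark, not in a per-task in-memory array.
val deduped = parsed.select("someField").distinct()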
On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <abhimanyu.kr.sing...@gmail.com> wrote:

> I'm doing some complex operations inside a Spark UDF (parsing huge XML).
>
> Dataframe:
> | value                 |
> | Content of XML File 1 |
> | Content of XML File 2 |
> | Content of XML File N |
>
> val df = Dataframe.select(UDF_to_parse_xml(value))
>
> The UDF looks something like:
>
> val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
>
> Parsing requires the creation and de-duplication of arrays from the XML,
> which contains around 0.1 million elements (consisting of
> MyClass(Strings, Maps, Integers, ...)).
>
> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
> processing still fails with an ExecutorLostFailure error for XMLs of
> around 2 GB. When I increase the executor size (say, from 15 GB to 25 GB)
> it works fine. One partition can contain only one XML file (max size
> 2 GB), and 1 task per executor runs in parallel.
>
> My question is: which memory is used by the UDF for storing arrays, maps,
> or sets while parsing? And how can I configure it?
>
> Should I increase spark.memory.offHeap.size,
> spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?
>
> Thanks a lot,
> Abhimanyu
>
> PS: I know I shouldn't use a UDF this way, but I don't have any other
> alternative here.
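To the last question above: collections a Scala UDF allocates (arrays, maps, sets) are ordinary JVM objects on the executor heap, which is governed by spark.executor.memory; memoryOverhead and spark.memory.offHeap.size only cover native/off-heap allocations. A rough sketch of the relevant knobs (values are illustrative, not recommendations, and must be set before the application starts):

import org.apache.spark.sql.SparkSession

// Illustrative values only; tune for your workload.
val spark = SparkSession.builder()
  .config("spark.executor.memory", "20g")        // JVM heap: where the UDF's arrays actually live
  .config("spark.executor.cores", "1")           // fewer concurrent tasks => more heap per task
  .config("spark.executor.memoryOverhead", "2g") // native/off-heap headroom only
  .getOrCreate()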