It really depends on what your UDF is doing. A 2GB XML file can easily expand to several times that size once it's held as a DOM representation in memory. Remember that the 15GB of executor memory is shared across all the tasks running on that executor. First get a handle on how much memory your code actually uses; only then can you reason about whether that executor size is enough.
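To make that concrete, here is a rough, illustrative sketch (not from the thread) of one way to measure it: wrap the parsing work in a UDF and log approximate JVM heap usage before and after each parse. The body below just counts '<' characters as a stand-in for the real parsing logic, and heap deltas are only approximate because GC can run at any time, but it gives a first-order sense of what a single document costs on the heap.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    object UdfHeapProbe {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-heap-probe").getOrCreate()
        import spark.implicits._

        val probedParse = udf { (xml: String) =>
          def heapUsedMb(): Long = {
            val rt = Runtime.getRuntime
            (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
          }
          val before = heapUsedMb()
          val result = xml.count(_ == '<')  // placeholder for the real XML parsing
          val after  = heapUsedMb()
          // Shows up in the executor's stdout log, once per input row.
          println(s"approx heap delta for this document: ${after - before} MB (heap now $after MB)")
          result
        }

        val df = Seq("<a><b/></a>", "<root><x>1</x></root>").toDF("value")
        df.select(probedParse(col("value")).as("parsed")).show()
        spark.stop()
      }
    }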
On Wed, Jan 26, 2022 at 10:03 AM Abhimanyu Kumar Singh <abhimanyu.kr.sing...@gmail.com> wrote:

> Thanks for your quick response.
>
> For some reasons I can't use spark-xml (a schema-related issue).
>
> I've tried reducing the number of tasks per executor by increasing the
> number of executors, but it still throws the same error.
>
> I can't understand why even 15GB of executor memory is not sufficient to
> parse just a 2GB XML file.
> How can I check the maximum amount of JVM memory utilised by each task?
>
> Do I need to tweak some other configuration for increasing JVM memory,
> rather than spark.executor.memory?
>
> On Wed, Jan 26, 2022, 9:23 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Executor memory used shows data that is cached, not the VM usage. You're
>> running out of memory somewhere, likely in your UDF, which probably parses
>> massive XML docs as a DOM first or something. Use more memory, fewer tasks
>> per executor, or consider using spark-xml if you are really just parsing
>> pieces of it. It'll be more efficient.
>>
>> On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <abhimanyu.kr.sing...@gmail.com> wrote:
>>
>>> I'm doing some complex operations inside a Spark UDF (parsing huge XML).
>>>
>>> Dataframe:
>>> | value                 |
>>> | Content of XML File 1 |
>>> | Content of XML File 2 |
>>> | Content of XML File N |
>>>
>>> val df = Dataframe.select(UDF_to_parse_xml(value))
>>>
>>> The UDF looks something like:
>>>
>>> val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
>>> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
>>>
>>> Parsing requires the creation and de-duplication of arrays from the XML,
>>> containing around 0.1 million elements (consisting of MyClass(Strings,
>>> Maps, Integers, ....)).
>>>
>>> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
>>> processing still fails with an ExecutorLostFailure error for XMLs of
>>> around 2GB. When I increase the executor size (say from 15GB to 25GB) it
>>> works fine. One partition can contain only one XML file (with a max size
>>> of 2GB), and 1 task per executor runs in parallel.
>>>
>>> My question is: which memory is used by the UDF for storing arrays, maps
>>> or sets while parsing? And how can I configure it?
>>>
>>> Should I increase spark.memory.offHeap.size,
>>> spark.yarn.executor.memoryOverhead or spark.executor.memoryOverhead?
>>>
>>> Thanks a lot,
>>> Abhimanyu
>>>
>>> PS: I know I shouldn't use a UDF this way, but I don't have any other
>>> alternative here.
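For anyone who finds this thread and can use spark-xml after all, below is a minimal, hypothetical sketch of Sean's two suggestions: fewer concurrent tasks per executor plus spark-xml for record-level parsing. The package coordinates, the "record" rowTag, the input path and the memory sizes are all placeholders to adapt. Note also that arrays, maps and case classes built inside a Scala UDF live on the executor's JVM heap, so spark.executor.memory (rather than spark.memory.offHeap.size or the memoryOverhead settings) is the knob that most directly governs that space.

    // Hypothetical sketch, not the original poster's code. Assumes the spark-xml
    // package (e.g. com.databricks:spark-xml_2.12) is on the classpath; "record",
    // the input path and the memory sizes below are placeholders.
    import org.apache.spark.sql.SparkSession

    object SparkXmlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("spark-xml-sketch")
          // Fewer concurrent tasks per executor leaves more heap for each task.
          .config("spark.executor.cores", "1")
          // Heap allocations made inside a Scala UDF come out of executor memory.
          .config("spark.executor.memory", "25g")
          .getOrCreate()

        // spark-xml parses record by record (per rowTag) rather than holding a
        // whole 2GB document as one string plus a DOM, keeping per-task memory lower.
        val records = spark.read
          .format("xml")                       // short name registered by spark-xml
          .option("rowTag", "record")          // placeholder element name
          .load("/path/to/xml/files")          // placeholder path

        records.printSchema()
        spark.stop()
      }
    }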