Really depends on what your UDF is doing. You could read 2GB of XML and end up
with much more than that as a DOM representation in memory.
Remember that 15GB of executor memory is shared across tasks.
You need to get a handle on what memory your code is actually using before you
can start to reason about whether that amount is enough.
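
A rough way to do that is to log JVM heap usage around the parse from inside
the UDF (a minimal sketch; parseXml stands in for your own parsing logic, and
GC can skew the numbers, so treat it as an order-of-magnitude check only):

// Approximate heap currently in use on this executor JVM, in MB
def usedHeapMB(): Long = {
  val rt = Runtime.getRuntime
  (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

val before = usedHeapMB()
val parsed = parseXml(xmlContent)  // placeholder for your existing parsing code
println(s"Approx. heap used by this parse: ${usedHeapMB() - before} MB")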

On Wed, Jan 26, 2022 at 10:03 AM Abhimanyu Kumar Singh <
abhimanyu.kr.sing...@gmail.com> wrote:

> Thanks for your quick response.
>
> For some reasons I can't use spark-xml (schema related issue).
>
> I've tried reducing the number of tasks per executor by increasing the number
> of executors, but it still throws the same error.
>
> I can't understand why even 15GB of executor memory is not sufficient
> to parse just a 2GB XML file.
> How can I check the max amount of JVM memory utilised for each task?
>
> Do I need to tweak some other configuration to increase the JVM memory,
> rather than spark.executor.memory?
>
> On Wed, Jan 26, 2022, 9:23 PM Sean Owen <sro...@gmail.com> wrote:
>
>> "Executor memory used" shows data that is cached, not the VM usage. You're
>> running out of memory somewhere, likely in your UDF, which probably parses
>> massive XML docs as a DOM first or something. Use more memory, fewer tasks
>> per executor, or consider using spark-xml if you are really just parsing
>> pieces of it. It'll be more efficient.
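>>
>> For example, "fewer tasks per executor" usually just means fewer cores per
>> executor while keeping or raising the memory, along these lines when
>> submitting the job (the values below are only illustrative, not a
>> recommendation):
>>
>> spark-submit \
>>   --executor-memory 25g \
>>   --executor-cores 1 \
>>   --conf spark.executor.memoryOverhead=4g \
>>   ...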
>>
>> On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <
>> abhimanyu.kr.sing...@gmail.com> wrote:
>>
>>> I'm doing some complex operations inside a Spark UDF (parsing huge XMLs).
>>>
>>> Dataframe:
>>> | value |
>>> | Content of XML File 1 |
>>> | Content of XML File 2 |
>>> | Content of XML File N |
>>>
>>> val df = Dataframe.select(UDF_to_parse_xml(col("value")))
>>>
>>> UDF looks something like:
>>>
>>> val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
>>> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
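>>>
>>> (Roughly, that gets wired up as a Spark UDF along these lines; this is just
>>> a sketch with the same placeholder names, assuming MyClass2 is a case class
>>> so Spark can encode the result column:)
>>>
>>> import org.apache.spark.sql.functions.{col, udf}
>>>
>>> val UDF_to_parse_xml = udf { xmlContent: String =>
>>>   val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
>>>   XMLelements.map(myfunction).distinct
>>> }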
>>>
>>> Parsing requires creating and de-duplicating arrays from the XML containing
>>> around 100,000 elements (each a MyClass made up of Strings, Maps,
>>> Integers, ...).
>>>
>>> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
>>> processing still fails with an *ExecutorLostFailure* error for XMLs of size
>>> around 2GB. When I increase the executor memory (say from 15GB to 25GB) it
>>> works fine. One partition contains only one XML file (max size 2GB), and one
>>> task runs per executor in parallel.
>>>
>>> *My question is: which memory does the UDF use for storing arrays, maps,
>>> or sets while parsing? And how can I configure it?*
>>>
>>> Should I increase spark.memory.offHeap.size,
>>> spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?
>>>
>>> Thanks a lot,
>>> Abhimanyu
>>>
>>> PS: I know I shouldn't use a UDF this way, but I don't have any other
>>> alternative here.
>>>
