Re: [Spark UDF]: Where does UDF store temporary Arrays/Sets
Hi,

Can you please try increasing the number of cores per task (spark.task.cpus), so that fewer tasks run concurrently and each task gets a larger share of the executor's memory? Also, I don't understand what the XML is, what data it contains, or what problem you are trying to solve by writing UDFs to parse it. Perhaps we are not solving the real problem, only treating the symptom.

Regards,
Gourav Sengupta

On Wed, Jan 26, 2022 at 4:07 PM Sean Owen wrote:
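Gourav's suggestion can be expressed as submit-time configuration: raising spark.task.cpus lowers the number of concurrently running tasks per executor, so each task gets a larger slice of the same executor heap. The flag values and jar name below are illustrative, not recommendations for this workload.

```shell
# Illustrative only: with 16 cores and spark.task.cpus=4, an executor runs
# 4 concurrent tasks instead of 16, so each task's share of the 15g heap
# roughly quadruples. "your-job.jar" is a placeholder.
spark-submit \
  --conf spark.executor.memory=15g \
  --conf spark.executor.cores=16 \
  --conf spark.task.cpus=4 \
  your-job.jar
```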
Re: [Spark UDF]: Where does UDF store temporary Arrays/Sets
It really depends on what your UDF is doing. 2GB of XML can expand to much more than that once it is read into memory as a DOM representation. Remember also that the 15GB of executor memory is shared across all tasks running on that executor. First get a handle on how much memory your code actually uses; only then can you reason about whether that allocation is enough.

On Wed, Jan 26, 2022 at 10:03 AM Abhimanyu Kumar Singh <
abhimanyu.kr.sing...@gmail.com> wrote:
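Sean's point about shared executor memory can be made concrete with a rough model. This is a sketch only: it ignores the ~300MB of reserved memory and assumes the "user memory" region (where UDF temporaries such as parsed XML arrays live) is split evenly among concurrent tasks. The parameter names mirror the Spark configuration keys.

```scala
// Back-of-envelope estimate of the heap share each task gets.
// All figures are illustrative, not real cluster settings.
object TaskMemoryEstimate {
  def perTaskMemoryGB(executorMemoryGB: Double,
                      memoryFraction: Double, // spark.memory.fraction (default 0.6)
                      executorCores: Int,     // spark.executor.cores
                      taskCpus: Int           // spark.task.cpus
                     ): Double = {
    val concurrentTasks = executorCores / taskCpus
    // The (1 - memoryFraction) remainder is "user memory", which is where
    // objects created inside a UDF actually live.
    val userMemoryGB = executorMemoryGB * (1 - memoryFraction)
    userMemoryGB / concurrentTasks
  }

  def main(args: Array[String]): Unit = {
    // 15 GB executor, default fraction, 4 cores, 1 cpu per task -> 4 tasks
    println(f"${perTaskMemoryGB(15.0, 0.6, 4, 1)}%.2f GB per task")
  }
}
```

Under this crude model a 15GB executor running four concurrent tasks leaves each task only about 1.5GB of user memory, which a 2GB XML document parsed as a DOM easily exceeds.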
Re: [Spark UDF]: Where does UDF store temporary Arrays/Sets
Thanks for your quick response.

For schema-related reasons I can't use spark-xml.

I've tried reducing the number of tasks per executor by increasing the number of executors, but it still throws the same error.

I can't understand why even 15GB of executor memory is not sufficient to parse just a 2GB XML file. How can I check the maximum amount of JVM memory utilised by each task?

Do I need to tweak configurations other than spark.executor.memory to increase the JVM memory?

On Wed, Jan 26, 2022, 9:23 PM Sean Owen wrote:
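On checking the JVM memory used by a task: the JVM has no per-task heap counter, but a crude sketch is to sample Runtime memory around the expensive parse inside the UDF body. The figures are per-executor JVM, so they attribute cleanly to one task only when a single task runs per executor. Names and the stand-in workload below are illustrative.

```scala
// Crude heap probe: sample used/max heap before and after an expensive step.
object HeapProbe {
  def usedHeapMB(): Long = {
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
  }

  def maxHeapMB(): Long = Runtime.getRuntime.maxMemory() / (1024 * 1024)

  def main(args: Array[String]): Unit = {
    val before = usedHeapMB()
    val big = Array.fill(1000000)("x") // stand-in for the real XML parse
    val after = usedHeapMB()
    println(s"allocated ${big.length} refs; used $after MB of max ${maxHeapMB()} MB " +
      s"(delta ~${after - before} MB)")
  }
}
```

Logging these numbers from inside the UDF (they end up in the executor logs) shows how close the parse gets to -Xmx, which spark.executor.memory controls.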
Re: [Spark UDF]: Where does UDF store temporary Arrays/Sets
"Executor memory used" in the UI shows cached data, not JVM heap usage. You're running out of memory somewhere, most likely in your UDF, which probably parses each massive XML document into a DOM first. Use more memory, run fewer tasks per executor, or consider spark-xml if you are really just parsing pieces of each document; it will be more efficient.

On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <
abhimanyu.kr.sing...@gmail.com> wrote:
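For reference, the spark-xml route might look like the sketch below. It assumes the com.databricks:spark-xml package is on the classpath and that the documents contain a repeating element usable as rowTag; the tag name and input path are hypothetical. Shown as a sketch only, since it needs a Spark runtime.

```scala
// Sketch: letting spark-xml parse per rowTag instead of building one
// 2GB DOM per file inside a UDF. "record" and the path are placeholders.
import org.apache.spark.sql.SparkSession

object SparkXmlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xml-sketch").getOrCreate()
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")          // hypothetical repeating element
      .load("hdfs:///path/to/xml/files")   // hypothetical input path
    df.printSchema()
  }
}
```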
[Spark UDF]: Where does UDF store temporary Arrays/Sets
I'm doing some complex operations inside a Spark UDF (parsing huge XML files).

Dataframe:

| value |
| Content of XML File 1 |
| Content of XML File 2 |
| Content of XML File N |

val df = Dataframe.select(UDF_to_parse_xml(value))

The UDF looks something like:

val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct

Parsing requires creating and de-duplicating arrays from the XML, which contains around 0.1 million elements (consisting of MyClass(Strings, Maps, Integers, ...)).

In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark processing still fails with an ExecutorLostFailure error for XMLs around 2GB in size. When I increase the executor memory (say, from 15GB to 25GB), it works fine. One partition can contain only one XML file (max size 2GB), and one task per executor runs in parallel.

My question is: which memory does the UDF use for storing arrays, maps, or sets while parsing? And how can I configure it?

Should I increase spark.memory.offHeap.size, spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?

Thanks a lot,
Abhimanyu

PS: I know I shouldn't use a UDF this way, but I don't have any other alternative here.
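If a DOM-style parse inside the UDF is what exhausts the heap, one alternative is to stream the document and de-duplicate on the fly, so that only the current element plus the set of distinct results is ever held in memory. The sketch below uses the JDK's javax.xml.stream API under stated assumptions: elements are text-only, the tag name is a hypothetical stand-in for the real schema, and de-duplication is by element text.

```scala
// Streaming XML parse with incremental de-duplication (no DOM built).
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable

object StreamingXmlParse {
  // Collect the distinct text contents of every <tag> element.
  def distinctElementTexts(xml: String, tag: String): Set[String] = {
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new StringReader(xml))
    val seen = mutable.HashSet.empty[String]
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName == tag)
          seen += reader.getElementText // consumes up to the end tag
      }
    } finally reader.close()
    seen.toSet
  }

  def main(args: Array[String]): Unit = {
    val xml = "<root><item>a</item><item>b</item><item>a</item></root>"
    println(distinctElementTexts(xml, "item").toList.sorted) // List(a, b)
  }
}
```

Either way, the arrays, maps, and sets a UDF builds live on the executor's JVM heap (the "user memory" portion of spark.executor.memory), not in off-heap or overhead memory, which is why raising spark.executor.memory helped where the overhead settings would not.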