Spark 3.1.1
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


On Fri, 9 Apr 2021 at 17:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I ran this one on RHES 7.6 with 64 GB of memory and it hit OOM:
>
> >>> data=list(range(rows))
> >>> rdd=sc.parallelize(data,rows)
> >>> assert rdd.getNumPartitions()==rows
> >>> rdd0=rdd.filter(lambda x:False)
> >>> assert rdd0.getNumPartitions()==rows
> >>> rdd00=rdd0.coalesce(1)
> >>> data=rdd00.collect()
> 2021-04-09 17:19:01,452 WARN scheduler.TaskSetManager: Stage 1 contains a
> task of very large size (4729 KiB). The maximum recommended task size is
> 1000 KiB.
> 2021-04-09 17:25:14,249 ERROR executor.Executor: Exception in task 0.0 in
> stage 1.0 (TID 1)
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>
> On Fri, 9 Apr 2021 at 17:33, Sean Owen <sro...@gmail.com> wrote:
>
>> OK, so it's a "70000 threads overwhelming off-heap memory in the JVM"
>> kind of thing. Or running afoul of ulimits in the OS.
>>
>> On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros
>> <piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Sean!
>>>
>>> So the "coalesce" without shuffle will create a CoalescedRDD which,
>>> during its computation, delegates to the parent RDD's partitions.
>>> As the CoalescedRDD contains only 1 partition, we are talking about
>>> 1 task and 1 task context.
>>>
>>> The next stop is PythonRunner.
>>>
>>> Here the Python workers at least are reused (when
>>> "spark.python.worker.reuse" is true, which is the default), but the
>>> MonitorThreads are not reused and, what is worse, all the MonitorThreads
>>> are created for the same worker and the same TaskContext.
>>> This means the CoalescedRDD's single task must complete before even the
>>> first monitor thread is stopped; relevant code:
>>>
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L570
>>>
>>> So this leads to creating 70000 extra threads when 1 would be enough.
>>>
>>> The jira is: https://issues.apache.org/jira/browse/SPARK-35009
>>> The PR will follow next week maybe (I am a bit uncertain as I have many
>>> other things to do right now).
>>>
>>> Best Regards,
>>> Attila
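
For readers hitting the same wall, a possible workaround (not the fix
tracked in SPARK-35009) is to force a shuffle when reducing to a single
partition, so the 70000 parent partitions are computed by their own tasks
rather than all being delegated to the one CoalescedRDD task. A minimal
sketch, assuming rows = 70000 as suggested by Sean's "70000 threads"
remark:

    # Possible workaround sketch, not the SPARK-35009 fix itself.
    rows = 70000                       # assumed value, per the thread
    data = list(range(rows))
    rdd = sc.parallelize(data, rows)
    rdd0 = rdd.filter(lambda x: False)

    # coalesce(..., shuffle=True) (equivalent to repartition(1) here)
    # inserts a shuffle boundary: the 70000 upstream partitions run as
    # 70000 separate tasks, each cleaning up its own monitor thread,
    # instead of all being computed inside the single downstream task.
    rdd00 = rdd0.coalesce(1, shuffle=True)
    result = rdd00.collect()

The trade-off is a full shuffle of the (here empty) data, which is usually
cheap compared to an executor dying with an OOM.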
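Sean's ulimit point can also be checked directly on the executor hosts:
"java.lang.OutOfMemoryError: unable to create new native thread" usually
means the per-user process/thread limit (ulimit -u) or native memory was
exhausted, not the Java heap. A small sketch for inspecting that limit
from Python, assuming a Linux host where resource.RLIMIT_NPROC is
available (the 70000 monitor threads live in the executor JVM, but they
count against the same per-user limit):

    import resource

    # RLIMIT_NPROC corresponds to "ulimit -u"; on Linux, threads count
    # against this per-user limit as well.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("max user processes (soft/hard):", soft, hard)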