Actually, I stumbled on this SO page <https://stackoverflow.com/questions/31245083/how-can-pyspark-be-called-in-debug-mode>. While it is not completely straightforward, the solution turned out to be fairly simple.
In short:

- I made sure there is only one executing task at a time by calling repartition(1) - this made it easy to locate the one and only Spark daemon.
- I set a breakpoint wherever I needed to.
- In order to "catch" the breakpoint, I added a print-out and a time.sleep(15) right before it. The print-out notifies me that the daemon is up and running, and the sleep gives me time to press a few buttons so I can attach to the process (see the sketch at the end of this message).

It worked fairly well, and I was able to debug the executor. I did notice two strange things:

- Sometimes I got a strange error and the debugger didn't actually attach. It was not deterministic.
- Other times I noticed a big gap between the point where I got the notification and attached to the process and the point where execution resumed and I could actually step through (by big gap I mean one considerably bigger than the sleep period, usually about 1 minute).

Not perfect, but it worked most of the time.

On Wed, Mar 14, 2018 at 12:07 AM, Michael Mansour <michael_mans...@symantec.com> wrote:

> Vitaliy,
>
> From what I understand, this is not possible to do. However, let me share
> my workaround with you.
>
> Assuming you have your debugger up and running in PyCharm, set a
> breakpoint at the relevant line, then take/collect/sample your data (you
> could also consider doing a glom() if it's critical that the data remain
> partitioned, followed by the take/collect), and pass it into the function
> directly (plain Python, no Spark). Use the debugger to step through there
> on that small sample.
>
> Alternatively, you can open up the PyCharm execution module. In the
> execution module, do the same as above with the RDD, and pass it into the
> function. This alleviates the need to write debugging code, etc. I find
> this model useful and a bit faster, but it does not offer the
> step-through capability.
>
> Best of luck!
>
> M
>
> --
> Michael Mansour
> Data Scientist
> Symantec CASB
>
> From: Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> Date: Sunday, March 11, 2018 at 8:46 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: [EXT] Debugging a local spark executor in pycharm
>
> I want to step through the work of a Spark executor running locally on my
> machine, from PyCharm.
>
> I am running explicit functionality, in the form of
> dataset.foreachPartition(f), and I want to see what is going on inside f.
>
> Is there a straightforward way to do it, or do I need to resort to remote
> debugging?
>
> p.s.
>
> I posted this on SO
> <https://stackoverflow.com/questions/49221733/debugging-a-local-spark-executor-in-pycharm>
> as well.
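To make the sleep-and-attach trick above concrete, here is a minimal sketch. The function name f, the toy dataset, and the per-row placeholder are my own assumptions; the repartition(1), print-out, and 15-second sleep mirror the steps described at the top of this thread:

    import os
    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    def f(partition):
        # Print the daemon's PID so we know which process to attach to, then
        # sleep long enough to use PyCharm's "Attach to Process" action.
        print("executor daemon pid: %d -- attach the debugger now" % os.getpid())
        time.sleep(15)
        # A breakpoint set anywhere below is hit once the debugger is attached.
        for row in partition:
            pass  # placeholder for the actual per-row work

    df = spark.range(100)
    # repartition(1) ensures a single task, i.e. one and only one daemon to attach to.
    df.repartition(1).foreachPartition(f)

And a sketch of Michael's collect-and-run-locally workaround, under the same assumptions; glom() turns each partition into a list, so f still receives a full partition's worth of rows:

    # Pull one partition's rows back to the driver and run f as plain Python,
    # so PyCharm's ordinary debugger can step through it -- no Spark involved.
    one_partition = df.rdd.glom().take(1)[0]
    f(iter(one_partition))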