Actually, I stumbled on this SO page:
<https://stackoverflow.com/questions/31245083/how-can-pyspark-be-called-in-debug-mode>.
While it is not an entirely straightforward setup, the solution is fairly simple.

In short:


   - I made sure there is only one executing task at a time by calling
   repartition(1) - this made it easy to locate the one and only Spark daemon
   (worker) process.
   - I set a breakpoint wherever I needed to.
   - In order to "catch" the breakpoint, I put a print and a time.sleep(15)
   right before it. The print gives me a notice that the daemon is up and
   running, and the sleep gives me time to push a few buttons so I can attach
   to the process (a rough sketch follows below).
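
For reference, here is roughly what my setup looked like (the dataset and
function names are only illustrative, and the sleep length is just what
worked for me):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    def f(partition):
        # the print tells me the worker daemon is up; the sleep gives me
        # time to attach PyCharm's debugger to the pyspark worker process
        print("worker is up - attach the debugger now")
        time.sleep(15)
        for row in partition:   # breakpoint goes somewhere in here
            pass                # real per-row logic

    df = spark.range(1000)                   # stand-in for the real dataset
    df.repartition(1).foreachPartition(f)    # one partition -> one worker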

It worked fairly well, and I was able to debug the executor. I did notice
two odd things, though: sometimes I got a strange error and the debugger
didn't actually attach; this was not deterministic.

Other times there was a big gap between the moment I got the notification
and attached to the process and the moment execution resumed and I could
actually step through (by big gap I mean one considerably longer than the
sleep period, usually about a minute).

Not perfect, but it worked most of the time.



On Wed, Mar 14, 2018 at 12:07 AM, Michael Mansour <
michael_mans...@symantec.com> wrote:

> Vitaliy,
>
>
>
> From what I understand, this is not possible to do.  However, let me share
> my workaround with you.
>
>
>
> Assuming you have your debugger up and running in PyCharm, set a
> breakpoint at that line, take/collect/sample your data (you could also do a
> glom first if it's critical that the data remain partitioned, then the
> take/collect), and pass it into the function directly (plain Python, no
> Spark).  Use the debugger to step through on that small sample.
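>
> A rough sketch of the idea (the dataset and function names are only
> illustrative):
>
>     # pull a small sample back to the driver and call f as plain Python
>     rows = dataset.take(20)            # or: dataset.rdd.glom().take(1)[0]
>                                        #     to keep one whole partition
>     f(iter(rows))                      # step through f with the debugger here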
>
>
>
> Alternatively, you can open up the PyCharm execution module.  In the
> execution module, do the same as above with the RDD, and pass it into the
> function.  This alleviates the need to write debugging code, etc.  I find
> this model useful and a bit faster, but it does not offer the
> step-through capability.
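>
> For example, in the console (names are illustrative):
>
>     rows = dataset.take(20)   # small sample pulled to the driver
>     f(iter(rows))             # inspect results interactively, no extra code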
>
>
>
> Best of luck!
>
> M
>
> --
>
> Michael Mansour
>
> Data Scientist
>
> Symantec CASB
>
> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date: *Sunday, March 11, 2018 at 8:46 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[EXT] Debugging a local spark executor in pycharm
>
>
>
> I want to step through the work of a spark executor running locally on my
> machine, from Pycharm.
>
> I am running explicit functionality, in the form of
> dataset.foreachPartition(f), and I want to see what is going on inside f.
>
> Is there a straightforward way to do it or do I need to resort to remote
> debugging?
>
> P.S.
>
>
>
> Posted this on SO
> <https://stackoverflow.com/questions/49221733/debugging-a-local-spark-executor-in-pycharm>
> as well.
>
