Hey everyone,
I am currently trying to run a Python Spark job (in YARN client mode) that
uses multiple libraries, on a Spark cluster on Amazon EMR. To do that, I
create a dependencies.zip file that contains all of the
dependencies/libraries (installed through pip) that the job needs to run
successfully.
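As a rough sketch of that packaging step (the `deps/` directory name and the dummy `mylib` module are placeholders for whatever `pip install -t deps/` actually drops there):

```python
import os
import zipfile

# In practice deps/ would be populated with:
#   pip install -r requirements.txt -t deps/
# A dummy module stands in for the pip-installed packages here.
os.makedirs("deps/mylib", exist_ok=True)
with open("deps/mylib/__init__.py", "w") as f:
    f.write("VERSION = '0.1'\n")

# Zip the *contents* of deps/ so the packages sit at the archive root,
# which is the layout spark-submit's --py-files option expects.
with zipfile.ZipFile("dependencies.zip", "w") as zf:
    for root, _, files in os.walk("deps"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, "deps"))

print(sorted(zipfile.ZipFile("dependencies.zip").namelist()))
```

The archive is then shipped with something like `spark-submit --master yarn --deploy-mode client --py-files dependencies.zip my_job.py`.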

I have a Spark Streaming process whose batches last 2 seconds. When I check
where the time is spent, I see about 0.8-1 s of processing time, although
the total batch time is 2 s. That extra second is spent in the driver.
I reviewed the code that is executed by the driver and commented out some
of it.