Great work documenting repeatable steps for this hard-to-nail-down problem. I 
see similar problems running the Spark (Scala) interpreter but haven’t been as 
systematic about hunting down the issue as you have. 

I do wonder if this is related somehow to 
https://share.polymail.io/v1/z/b/NTkwZGRlMzNiZmFi/Go00wlomvjABQNciq78PfdeRmR4K6c4M5l8KsTYGlks2sD4oe9jS7NYIkVZ2KKlntmyN0z2ZbiIFSP59SQpYL0hq_V6k3ZjCvIj_hSJLDj9DoLv9d08g_CcyOzm8nDm0hYZeZOp12dO42cm970BBLMdQE4GNuXkJXxBA8x9FHzXuJqALbU6-4HZjnzxjiNBKO7esfqjghuuz-eV-QrJnyI5hTNPgwp0O
which just seems to have addressed killing off zombie processes but I’m not 
sure it covered where zombie processes are coming from. Perhaps we need to open 
a ticket for this?

In the meantime, if you don’t have the ability to restart Zeppelin every time 
you run into this, you can probably just kill the interpreter process. I find 
myself doing that multiple times in a normal work day.
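In case it helps, this is roughly what that looks like on my machines. Zeppelin launches each interpreter group as a separate JVM whose main class is RemoteInterpreterServer, so it is easy to spot; the exact command line may differ on your setup.

```shell
# List any interpreter JVMs spawned by Zeppelin. The [R] bracket trick
# keeps grep from matching its own process; "|| true" keeps the exit
# status clean when nothing is running.
ps aux | grep '[R]emoteInterpreterServer' || true

# Then terminate the stuck one by PID (SIGTERM first; escalate to -9
# only as a last resort):
#   kill <pid>
```

Zeppelin should respawn the interpreter on the next paragraph run, so this is usually less disruptive than restarting the whole service.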

Paul Brenner
DATA SCIENTIST
(217) 390-3033
http://www.placeiq.com/

On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pietro.pu...@gmail.com> wrote:


Hi all,

I am facing a strange issue on two different machines that act as servers. 
Each of them runs an instance of Zeppelin installed as a systemd service.

The configuration is:

 - Ubuntu Server 16.04.2 LTS

 - Spark 2.1.0

 - Microsoft Open R 3.3.2

 - Zeppelin 0.7.1 (0.7.0 gave the same problems)

zeppelin-env.sh has the following settings:

export SPARK_HOME="/spark/home/directory"

spark-env.sh has the following settings:

export LANG="en_US"

export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"

export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"

spark-defaults.conf is set as:

spark.executor.memory                   21g
spark.driver.memory                     21g
spark.python.worker.memory              4g
spark.sql.autoBroadcastJoinThreshold    0

I use Spark in stand-alone mode and it works perfectly. It also works correctly 
with Zeppelin, but this is what happens:

1) Start zeppelin on the server using the command

service zeppelin start

2) Connect to port 8080 using Mozilla Firefox from a client 

3) Insert username and password (I enabled Shiro authentication)

4) Open a notebook

5) Execute the following code:

%spark.r

2+2

6) The code runs correctly and I can see that R is currently running as a 
process.

7) Repeat steps 2-5 after some time (say, 2 or 3 hours): Zeppelin remains 
stuck on “Running” forever or, if more time has elapsed since the last run 
(for example 1 day), it returns “Error”. The time until it becomes 
unresponsive seems random and unpredictable. At that point R is no longer in 
the list of running processes, but the Spark session remains active: I can 
access the Spark UI on port 4040 and the application name is “Zeppelin”, so 
it’s the Spark instance created by Zeppelin.
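When this happens, it may be worth checking whether the R worker died or merely turned into a zombie (a child whose parent never reaped it). A quick way to look, assuming the R worker simply shows up as an “R” process (that is how it appears on my machines; your process name may differ):

```shell
# Zombie (defunct) processes carry a STAT code starting with "Z":
# these are children that exited but were never wait()ed on by their parent.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print}'

# Check whether an R worker process is still alive at all.
ps -eo pid,comm | awk '$2 == "R" {print}'
```

If the first command lists zombies whose PPID is the Zeppelin interpreter JVM, that points at the interpreter failing to reap its R child rather than R crashing on its own.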

I observed that sometimes I can simply restart the interpreter from the 
Zeppelin UI, but many other times that doesn’t work and I have to restart 
Zeppelin itself (service zeppelin restart).
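As an alternative to the UI button, Zeppelin exposes a REST endpoint for restarting a single interpreter setting, which can be scripted. A rough sketch; the host, port, and setting id below are placeholders (list your real ids with GET /api/interpreter/setting), and with Shiro enabled you would also need to pass your session cookie:

```shell
# Placeholders: adjust the URL and look up the real setting id first with
#   curl http://localhost:8080/api/interpreter/setting
ZEPPELIN_URL="http://localhost:8080"
SETTING_ID="spark"   # hypothetical id; yours will be an opaque string

# Restart just that interpreter group instead of the whole Zeppelin service.
curl -s -X PUT "$ZEPPELIN_URL/api/interpreter/setting/restart/$SETTING_ID" \
  || echo "request failed (is Zeppelin reachable at $ZEPPELIN_URL?)"
```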

This issue affects both 0.7.0 and 0.7.1, but I haven’t tried earlier 
versions. It also happens when Zeppelin isn’t installed as a service.

I can’t provide more detail because I can’t see any error or warning in the 
logs, which is really strange. 

Thank you all.

Kind regards

 Pietro Pugni
