Zeppelin best practices/ efficiencies

Shanmukha Sreenivas Potti Wed, 03 May 2017 15:55:10 -0700

Hello Zeppelin users,



I’m reaching out to you for some guidance on best practices. We currently
use Zeppelin 0.7.0 on EMR and I have a few questions about making this
setup more efficient. I would really appreciate it if any of you could
help me with these issues or point me to the right person/team.



*1. Interpreter Settings*



I understand that the newer versions (we are currently on Zeppelin 0.7)
offer different interpreter modes: Shared, Scoped and Isolated.

Multiple users in our team use the Zeppelin application by creating
separate notebooks. Sometimes, jobs run endlessly, fail to execute, or
time out because memory is exhausted. We then restart the interpreter, or
are sometimes forced to restart the Zeppelin application on the EMR master
node, to resume operations. Is this the best way to deal with such issues?

We currently use the ‘Scoped’ interpreter setting, i.e. it sets up an
interpreter instance per note.

Would you recommend that we continue to use this interpreter setting, or do
you think we would be better served by one of the other available
settings? I did take a look at the Zeppelin documentation for information
on these settings, but anything additional would be very helpful.



Also, is there a way to accurately determine how much of the available
memory is being used by the various jobs on Zeppelin? The ‘Job’ tab shows
us which jobs in the various notebooks are running, but we have no insight
into the memory/compute power being used.



Ideally, I would like to figure out the root cause of why my queries are
not running. Is it because memory is maxing out in Zeppelin, HDFS or Spark,
or because there are too few compute nodes?



I would really appreciate it if you could share any documentation that can
guide me on these aspects.



*2. Installation Ports*

By default, Zeppelin on EMR is installed on port 8890. However, to be
compliant with security policies we needed to use other ports. This change
was made by editing the Zeppelin configuration file over SSH. I’m concerned
that this approach may have cloned the application onto the other ports and
may also be restricting my usage of Zeppelin. Is this the right way to run
Zeppelin on another port?
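
As I understand it, the listening port is controlled by the
zeppelin.server.port property in zeppelin-site.xml. Below is a minimal
Python sketch for checking which port a given configuration file sets,
assuming the usual property/name/value layout. The sample XML, the port
8899 and the helper name configured_port are illustrative only and are not
our actual values.

```python
import xml.etree.ElementTree as ET

# Illustrative zeppelin-site.xml content; the port 8899 is a made-up example.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>zeppelin.server.port</name>
    <value>8899</value>
  </property>
</configuration>
"""


def configured_port(site_xml, default=8890):
    """Return the zeppelin.server.port value found in the given XML text.

    Falls back to `default` (8890, the EMR default mentioned above) when
    the property is absent or has no value.
    """
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "zeppelin.server.port":
            value = prop.findtext("value")
            if value and value.strip():
                return int(value.strip())
    return default


print(configured_port(SAMPLE_SITE_XML))
```

To check a real installation, the file contents would be read from disk and
passed to configured_port; the location of zeppelin-site.xml depends on the
installation, so I have left the path out.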



Appreciate any pointers you may have. Please see below for more information
on the cluster and the applications on the cluster.



*Thanks,*

*Shan*



*Cluster Details:*

Release label: emr-5.4.0

Applications: Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, HBase 1.3.0,
Zeppelin 0.7.0, Oozie 4.3.0, Mahout 0.12.2
