Thanks, Jeff!

I'll look into this solution.
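
For anyone who lands on this thread later, here is roughly what I understand the Livy route to involve (property names are taken from the Livy interpreter docs Jeff linked; the values are illustrative, not something we have tested on our cluster):

```properties
# livy.conf on the Livy server: run Spark drivers inside YARN (yarn-cluster)
livy.spark.master = yarn
livy.spark.deploy-mode = cluster

# Zeppelin %livy interpreter properties: point Zeppelin at the Livy server
zeppelin.livy.url = http://<livy-server-host>:8998
# per-user driver/executor sizing (forwarded to Spark through Livy)
livy.spark.driver.memory = 2g
livy.spark.executor.memory = 4g
```

With this, each user's notebook gets its own Spark application on the cluster instead of sharing a driver on the Zeppelin host.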

On Wed, May 3, 2017 at 5:32 PM, Jeff Zhang <zjf...@gmail.com> wrote:

>
> Regarding the interpreter memory issue: this is because Zeppelin's Spark
> interpreter only supports yarn-client mode, which means the driver runs on
> the same host as the Zeppelin server. So it is pretty easy to run out of
> memory if many users share the same driver (the scoped mode you use). You
> can try the Livy interpreter, which supports yarn-cluster mode, so that the
> driver runs on a remote host and each user gets an isolated Spark app.
> https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/interpreter/livy.html
>
>
> Shanmukha Sreenivas Potti <shanmu...@utexas.edu> wrote on Thu, May 4, 2017 at 6:54 AM:
>
>> Hello Zeppelin users,
>>
>>
>>
>> I’m reaching out for some guidance on best practices. We currently use
>> Zeppelin 0.7.0 on EMR, and I have a few questions about making this setup
>> more efficient. I would really appreciate it if any of you could help with
>> these issues or point me to the right person/team.
>>
>>
>>
>> *1. Interpreter Settings*
>>
>>
>>
>> I understand that newer versions (we are currently on Zeppelin 0.7) offer
>> different interpreter binding modes, such as Scoped, Isolated, and Shared.
>>
>> Multiple users on our team use the Zeppelin application by creating
>> separate notebooks. Sometimes jobs execute endlessly, fail to execute, or
>> time out after maxing out memory. We tend to restart the interpreter, or
>> are sometimes forced to restart the Zeppelin application on the EMR master
>> node to resume operations. Is this the best way to deal with such issues?
>>
>> We currently use the ‘Scoped’ interpreter setting, i.e. one interpreter
>> instance per note.
>>
>> Would you recommend that we continue to use this setting, or do you think
>> we would be better served by one of the other available settings? I did
>> look at the Zeppelin documentation for information on these settings, but
>> anything additional would be very helpful.
>>
>>
>>
>> Also, is there a way to accurately determine how much of the available
>> memory is being used by the various jobs on Zeppelin? The ‘Job’ tab shows
>> us which jobs are running in the various notebooks, but we have no insight
>> into the memory/compute power being used.
>>
>>
>>
>> Ideally, I would like to find the root cause of why my queries are not
>> running: is it because memory is maxing out on Zeppelin, HDFS, or Spark,
>> or because of an insufficient number of compute nodes?
>>
>>
>>
>> I would really appreciate it if you could share any documentation that can
>> guide me on these aspects.
>>
>>
>>
>> *2. Installation Ports*
>>
>> By default, Zeppelin on EMR is installed on port 8890. However, to be
>> compliant with security policies, we needed to use other ports. This
>> change was made by editing the Zeppelin configuration file over SSH. I’m
>> concerned that this approach may have cloned the application on the other
>> ports and may be restricting my usage of Zeppelin. Is this the right way
>> to run Zeppelin on another port?
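>>
>> For context, the change we made over SSH was roughly the following (the
>> port value is illustrative, not our actual port):
>>
>> ```xml
>> <!-- conf/zeppelin-site.xml -->
>> <property>
>>   <name>zeppelin.server.port</name>
>>   <value>8443</value>
>> </property>
>> ```
>>
>> followed by a restart of the Zeppelin daemon.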
>>
>>
>>
>> Appreciate any pointers you may have. Please see below for more
>> information on the cluster and the applications on the cluster.
>>
>>
>>
>> *Thanks,*
>>
>> *Shan*
>>
>>
>>
>> *Cluster Details:*
>>
>> Release label: emr-5.4.0
>>
>> Applications: Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, HBase
>> 1.3.0, Zeppelin 0.7.0, Oozie 4.3.0, Mahout 0.12.2
>>
>


-- 
Shan S. Potti,
737-333-1952
https://www.linkedin.com/in/shanmukhasreenivas
