Some time ago I took approach (2): I installed Anaconda on every node. But to avoid screwing up the system Python on Red Hat (it was CentOS in my case, which is essentially the same), I installed Anaconda on every node as the user "yarn" and made it the default Python only for that user.
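Once everything is installed that way, launching the PySpark shell under IPython, with the executors pointed at the Anaconda interpreter, looks roughly like this. This is a hedged sketch, not exactly what I ran: IPYTHON and PYSPARK_PYTHON are the Spark 1.x environment variables, and the path assumes a default Anaconda install in the yarn user's home, so check the docs for your Spark version:

# As the yarn user, on the node where you launch the shell:
export PYSPARK_PYTHON=/home/yarn/anaconda/bin/python   # interpreter the executors should use
MASTER=yarn-client IPYTHON=1 pyspark                   # run the driver shell under IPython
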
After you install it, Anaconda asks whether it should add its installation path to the PATH variable in .bashrc for your user (that's how it overrides the default Python). If you choose "yes", it will override it only for the current user. And if that user is "yarn", you can run Spark in cluster mode, on all the nodes in your cluster, using IPython (a lot better than the default Python console).

Just in case, check that you have a directory in HDFS for yarn (/user/yarn); it may not be created by default, and that would complicate everything, not letting Spark run.

In summary, something like this (correct the syntax if it's wrong, I'm not testing it):

# Create yarn directory in HDFS
su hdfs
hadoop fs -mkdir /user/yarn
hadoop fs -chown yarn:yarn /user/yarn
exit

# Install Anaconda for user yarn
# On every node:
su yarn
cd
wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
# Or the current link at the moment you are doing it: https://store.continuum.io/cshop/anaconda/
bash Anaconda*.sh
# When asked whether to set it as the default Python, or to add Anaconda to the PATH
# (I don't remember the exact wording), choose "yes"

I hope that helps,

Sebastián Ramírez
Algorithm Designer
http://www.senseta.com

Tel: (+571) 795 7950 ext: 1012
Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
Linkedin: co.linkedin.com/in/tiangolo/
Twitter: @tiangolo <https://twitter.com/tiangolo>
Email: sebastian.rami...@senseta.com
www.senseta.com

On Sun, Dec 28, 2014 at 1:57 PM, Bin Wang <binwang...@gmail.com> wrote:

> Hi there,
>
> I have a cluster with CDH5.1 running on top of Redhat6.5, where the
> default Python version is 2.6. I am trying to set up a proper IPython
> notebook environment to develop Spark applications using pyspark.
>
> Here
> <http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/>
> is a tutorial that I have been following. However, it turned out that the
> author was using IPython 1, while we have the latest Anaconda Python 2.7
> installed on our name node. When I finished following the tutorial, I could
> connect to the Spark cluster, but whenever I tried to distribute the work,
> it errored out, and Google tells me the cause is the difference between the
> versions of Python across the cluster.
>
> Here are a few thoughts that I am planning to try.
> (1) remove the Anaconda Python from the namenode and install the IPython
> version that is compatible with Python 2.6.
> (2) or install Anaconda Python on every node and make it the
> default Python version across the whole cluster (however, I am not sure
> whether this plan will totally screw up the existing environment, since
> some running services are built on Python 2.6...)
>
> Let me know which is the proper way to set up an IPython notebook
> environment.
>
> Best regards,
>
> Bin
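To confirm that the Python version mismatch really is the problem (and, after installing Anaconda everywhere, that the PATH change took effect on every node), a quick check like this from the edge node might help. An untested sketch: the host names are placeholders for your actual worker nodes, and it assumes you can ssh as the yarn user:

# Placeholder host names -- replace with your actual worker nodes.
# "bash -lc" forces a login shell so the Anaconda PATH from
# .bashrc/.bash_profile is applied before resolving "python".
for h in node1 node2 node3; do
  echo "== $h =="
  ssh yarn@"$h" 'bash -lc "which python; python --version"'
done

If any node reports the system /usr/bin/python (2.6) instead of the Anaconda one, the PATH change on that node didn't take effect.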