Some time ago I took approach (2): I installed Anaconda on every node. But to avoid screwing up the system Python on Red Hat (it was CentOS in my case, which is essentially the same), I installed Anaconda on every node as the user "yarn" and made it the default Python only for that user.
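Once everything is installed that way, launching the PySpark shell under IPython, with the executors pointed at the Anaconda interpreter, looks roughly like this. This is a hedged sketch, not exactly what I ran: IPYTHON and PYSPARK_PYTHON are the Spark 1.x environment variables, and the path assumes a default Anaconda install in the yarn user's home, so check the docs for your Spark version:

# As the yarn user, on the node where you launch the shell:
export PYSPARK_PYTHON=/home/yarn/anaconda/bin/python   # interpreter the executors should use
MASTER=yarn-client IPYTHON=1 pyspark                   # run the driver shell under IPython
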
After you install it, Anaconda asks whether it should add its installation path to the PATH variable in .bashrc for your user (that's how it overrides the default Python). If you choose "yes", it will override it only for the current user. And if that user is "yarn", you can run Spark in cluster mode, on all the nodes in your cluster, using IPython (a lot better than the default Python console).

Just in case, check that you have a directory in HDFS for yarn (/user/yarn); it may not be created by default, and that would complicate everything, not letting Spark run.

In summary, something like this (correct the syntax if it's wrong, I'm not testing it):

# Create yarn directory in HDFS
su hdfs
hadoop fs -mkdir /user/yarn
hadoop fs -chown yarn:yarn /user/yarn
exit

# Install Anaconda for user yarn
# On every node:
su yarn
cd
wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
# Or the current link at the moment you are doing it: https://store.continuum.io/cshop/anaconda/
bash Anaconda*.sh
# When asked whether to set it as the default Python, or to add Anaconda to the PATH
# (I don't remember the exact wording), choose "yes"

I hope that helps,

Sebastián Ramírez
Algorithm Designer
http://www.senseta.com

Tel: (+571) 795 7950 ext: 1012
Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
Linkedin: co.linkedin.com/in/tiangolo/
Twitter: @tiangolo <https://twitter.com/tiangolo>
Email: sebastian.rami...@senseta.com
www.senseta.com

On Sun, Dec 28, 2014 at 1:57 PM, Bin Wang <binwang...@gmail.com> wrote:

> Hi there,
>
> I have a cluster with CDH5.1 running on top of Redhat6.5, where the
> default Python version is 2.6. I am trying to set up a proper IPython
> notebook environment to develop Spark applications using pyspark.
>
> Here
> <http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/>
> is a tutorial that I have been following. However, it turned out that the
> author was using IPython 1, while we have the latest Anaconda Python 2.7
> installed on our name node. When I finished following the tutorial, I could
> connect to the Spark cluster, but whenever I tried to distribute the work,
> it errored out, and Google tells me the cause is the difference between the
> versions of Python across the cluster.
>
> Here are a few thoughts that I am planning to try.
> (1) remove the Anaconda Python from the namenode and install the IPython
> version that is compatible with Python 2.6.
> (2) or install Anaconda Python on every node and make it the
> default Python version across the whole cluster (however, I am not sure
> whether this plan will totally screw up the existing environment, since
> some running services are built on Python 2.6...)
>
> Let me know which is the proper way to set up an IPython notebook
> environment.
>
> Best regards,
>
> Bin
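To confirm that the Python version mismatch really is the problem (and, after installing Anaconda everywhere, that the PATH change took effect on every node), a quick check like this from the edge node might help. An untested sketch: the host names are placeholders for your actual worker nodes, and it assumes you can ssh as the yarn user:

# Placeholder host names -- replace with your actual worker nodes.
# "bash -lc" forces a login shell so the Anaconda PATH from
# .bashrc/.bash_profile is applied before resolving "python".
for h in node1 node2 node3; do
  echo "== $h =="
  ssh yarn@"$h" 'bash -lc "which python; python --version"'
done

If any node reports the system /usr/bin/python (2.6) instead of the Anaconda one, the PATH change on that node didn't take effect.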