Re: Anaconda iPython notebook working with CDH Spark

2014-12-30 Thread Sebastián Ramírez
Some time ago I took approach (2): I installed Anaconda on every node.

But to avoid breaking RedHat (it was CentOS in my case, which is essentially
the same), I installed Anaconda on every node as the yarn user and made it
the default Python only for that user.

After you install it, Anaconda asks whether it should add its installation
path to the PATH variable in .bashrc for your user (that's how it overrides
the default Python). If you choose yes, it overrides it only for the current
user. And if that user is yarn, you can run Spark in cluster mode, on all
the nodes in your cluster, using IPython (a lot better than the default
Python console).

Just in case, check that you have a home directory in HDFS for yarn
(/user/yarn). It may not be created by default, and without it Spark won't
be able to run.

In summary, something like this (correct the syntax if it's wrong, I'm not
testing it):

# Create yarn directory in HDFS
su hdfs
hadoop fs -mkdir /user/yarn
hadoop fs -chown yarn:yarn /user/yarn
exit

# Install Anaconda for user yarn
# In every node:
su yarn
cd
wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
# (or use the current download link from https://store.continuum.io/cshop/anaconda/)
bash Anaconda*.sh
# When asked whether to set it as the default Python / add Anaconda to the
# PATH (I don't remember exactly how they phrase it), choose yes
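
Then, to use it from Spark: PySpark picks up whichever python is first on
the PATH of the user that launches it, and in Spark 1.x you can also point
to it explicitly with PYSPARK_PYTHON and enable the IPython shell with
IPYTHON=1. A rough sketch (again untested, and the Anaconda path is just an
example; adjust it to wherever you installed it):

# As the yarn user, run PySpark on YARN using Anaconda's Python
su yarn
export PYSPARK_PYTHON=/home/yarn/anaconda/bin/python
IPYTHON=1 pyspark --master yarn-client
# Or launch the notebook instead of the console:
IPYTHON_OPTS="notebook" pyspark --master yarn-client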


I hope that helps,


*Sebastián Ramírez*
Algorithm Designer

 http://www.senseta.com

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo https://twitter.com/tiangolo
 Email: sebastian.rami...@senseta.com
 www.senseta.com

On Sun, Dec 28, 2014 at 1:57 PM, Bin Wang binwang...@gmail.com wrote:

 Hi there,

 I have a cluster with CDH 5.1 running on top of RedHat 6.5, where the
 default Python version is 2.6. I am trying to set up a proper IPython
 notebook environment to develop Spark applications using PySpark.

 Here is a tutorial that I have been following:
 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
 However, it turned out that the author was using IPython 1, while we have
 the latest Anaconda Python 2.7 installed on our name node. When I finished
 following the tutorial, I could connect to the Spark cluster, but whenever
 I tried to distribute the work it errored out, and Google tells me the
 cause is the difference in Python versions across the cluster.

 Here are a few thoughts that I am planning to try:
 (1) remove the Anaconda Python from the name node and install an IPython
 version that is compatible with Python 2.6, or
 (2) install Anaconda Python on every node and make it the default Python
 version across the whole cluster (however, I am not sure whether this plan
 would break the existing environment, since some running services are
 built on Python 2.6...)

 Let me know which would be the proper way to set up an IPython notebook
 environment.

 Best regards,

 Bin



