I am fairly new to Python and am starting a new project that will make use of Spark and the Python machine learning libraries (matplotlib, pandas, etc.). I noticed that the spark-ec2 script set up my AWS cluster with Python 2.6 and 2.7.
http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark

"Spark 1.5.1 works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter, so C libraries like NumPy can be used. It also works with PyPy 2.3+. PySpark works with IPython 1.0.0 and later."

I realize there are a lot of legacy Python packages that are probably vectorized and not easy to port. What would you recommend?

I assume that if I wanted to use Python 3 I would need to install it on all the workers and the master, and then follow the directions in linking-with-spark to make Spark use the correct version of Python. (Of course, I realize I would also need to install any third-party packages on all the workers.)

Kind regards,

Andy
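From reading the docs, I think selecting the interpreter comes down to setting an environment variable before launching PySpark. This is just a sketch of what I'd try, based on my understanding; it assumes python3 is already installed (and on the PATH) on the master and every worker:

```shell
# Tell PySpark which interpreter to use on the executors and the driver.
# (python3 must already be installed on every node in the cluster.)
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Then launch as usual, e.g.:
# ./bin/pyspark
```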