PySpark doesn't include a Python interpreter; by default, it will use your system `python`. The pyspark script (https://github.com/apache/incubator-spark/blob/master/pyspark) just sets up some environment variables, adds the PySpark Python dependencies to PYTHONPATH, and runs some code to initialize a SparkContext in the Python REPL.
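As an illustrative sketch only (the placeholder path and variable names are my assumptions, not the script's actual contents), the PYTHONPATH setup the script performs amounts to something like:

```python
import os
import sys

# Locate the Spark installation; /opt/spark is only a placeholder default.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

# The ./pyspark script achieves the same effect by exporting PYTHONPATH
# before launching the interpreter, so PySpark's modules become importable.
pyspark_python = os.path.join(spark_home, "python")
if pyspark_python not in sys.path:
    sys.path.insert(0, pyspark_python)
```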
I suppose we could split the Python classes into a proper Python package that could be installed with easy_install / pip and that assumes that SPARK_HOME contains the right JARs. To avoid weird bugs from using incompatible versions of the Python pyspark package and the Java classes, we'd probably need to add some mechanism to detect version mismatches when connecting to the cluster. We'd still support the ./pyspark script that uses the bundled dependencies, too.

This is probably as simple as creating a setup.py file in $SPARK_HOME/python. Python packaging experts: please feel free to submit pull requests for this!

On Tue, Nov 19, 2013 at 11:08 AM, Michal Romaniuk <[email protected]> wrote:

> Hi,
>
> I would like to use Spark to distribute some computations that rely on
> my existing Python installation. I know that Spark includes its own
> Python but it would be much easier to just install a package and perhaps
> do a bit of configuration.
>
> Thanks,
> Michal
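For the version-mismatch detection mentioned above, one hypothetical approach (none of these names exist in PySpark) would be to compare the client and cluster versions on their major.minor components before allowing a connection:

```python
def versions_compatible(client_version, cluster_version):
    """Hypothetical check: treat the Python pyspark package and the
    cluster's Java classes as compatible when their major.minor
    version components match, ignoring the patch level."""
    client = client_version.split(".")[:2]
    cluster = cluster_version.split(".")[:2]
    return client == cluster
```

A check like this could run when the SparkContext first connects, raising a clear error instead of letting incompatible versions produce obscure failures later.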
