Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster
http://continuum.io/anaconda-cluster.

It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of doing this. I haven't
used it myself, though.

On Thu, May 7, 2015 at 5:20 PM alemagnani ale.magn...@gmail.com wrote:

 I am currently using pyspark with a virtualenv.
 Unfortunately I don't have access to the nodes' file system and therefore I
 cannot manually copy the virtualenv over there.

 I have been using this technique:

 I first add a tarball containing the venv:
 sc.addFile(virtual_env_tarball_file)

 Then, in the code that runs on the node to do the computation, I activate
 the venv like this:
 venv_location = SparkFiles.get(venv_name)
 activate_env = "%s/bin/activate_this.py" % venv_location
 execfile(activate_env, dict(__file__=activate_env))

 Is there a better way to do this?
 One of the problems with this approach is that numpy is imported in
 spark/python/pyspark/statcounter.py before the venv is activated, which can
 conflict with the venv's numpy.

 Moreover, this requires the venv to be shipped around the cluster every
 time.
 Any suggestions?







Virtualenv pyspark

2015-05-07 Thread alemagnani
I am currently using pyspark with a virtualenv.
Unfortunately I don't have access to the nodes' file system and therefore I
cannot manually copy the virtualenv over there.

I have been using this technique:

I first add a tarball containing the venv:
sc.addFile(virtual_env_tarball_file)

Then, in the code that runs on the node to do the computation, I activate
the venv like this:
venv_location = SparkFiles.get(venv_name)
activate_env = "%s/bin/activate_this.py" % venv_location
execfile(activate_env, dict(__file__=activate_env))
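
For concreteness, here is the whole pattern in one place as a minimal sketch.
It assumes the tarball is named venv.tar.gz and unpacks to a top-level
directory called venv (both names are illustrative, not the real ones), and
it extracts the archive on the executor before activating:

import os
import tarfile

from pyspark import SparkContext, SparkFiles

sc = SparkContext()

# Driver side: ship the venv tarball to every executor's working directory.
sc.addFile("/path/to/venv.tar.gz")

def run_with_venv(iterator):
    # Executor side: locate the shipped tarball, extract it once, and
    # activate the venv before importing anything that should come from it.
    tarball = SparkFiles.get("venv.tar.gz")
    work_dir = os.path.dirname(tarball)
    venv_root = os.path.join(work_dir, "venv")
    if not os.path.isdir(venv_root):
        with tarfile.open(tarball) as tf:
            tf.extractall(work_dir)

    activate_env = os.path.join(venv_root, "bin", "activate_this.py")
    execfile(activate_env, dict(__file__=activate_env))  # Python 2, as above

    import numpy  # now resolved against the venv's site-packages
    for x in iterator:
        yield numpy.sqrt(x)

result = sc.parallelize([1.0, 4.0, 9.0], 2).mapPartitions(run_with_venv).collect()

This is not hardened against two tasks extracting the archive concurrently;
it is only meant to show where the activation has to happen relative to the
imports.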

Is there a better way to do this?
One of the problems with this approach is that numpy is imported in
spark/python/pyspark/statcounter.py before the venv is activated, which can
conflict with the venv's numpy.

Moreover, this requires the venv to be shipped around the cluster every
time.
Any suggestions?



