Re: Unable to ship external Python libraries in PYSPARK

2014-10-07 Thread yh18190
Hi David,

Thanks for the reply and the effort you put into explaining the concepts,
and thanks for the example. It worked.






Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship a text file, just like shipping Python libraries?

Thanks in advance
Daijia






Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want:

 |  addFile(self, path)
 |      Add a file to be downloaded with this Spark job on every node.
 |      The C{path} passed can be either a local file, a file in HDFS
 |      (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
 |      FTP URI.
 |
 |      To access the file in Spark jobs, use
 |      L{SparkFiles.get(fileName)<pyspark.files.SparkFiles.get>} with the
 |      filename to find its download location.
 |
 |      >>> from pyspark import SparkFiles
 |      >>> path = os.path.join(tempdir, "test.txt")
 |      >>> with open(path, "w") as testFile:
 |      ...    testFile.write("100")
 |      >>> sc.addFile(path)
 |      >>> def func(iterator):
 |      ...    with open(SparkFiles.get("test.txt")) as testFile:
 |      ...        fileVal = int(testFile.readline())
 |      ...        return [x * fileVal for x in iterator]
 |      >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 |      [100, 200, 300, 400]
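
Applied to the text-file question above, a minimal sketch (the file name
lookup.txt, its driver-side path, and the per-partition scaling logic are
illustrative assumptions, not from the thread):

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="ship-textfile-sketch")

# Ship a driver-local file to every worker node.
sc.addFile("/path/on/driver/lookup.txt")  # hypothetical path

def scale_by_file_value(iterator):
    # On the worker, SparkFiles.get() resolves the downloaded copy.
    with open(SparkFiles.get("lookup.txt")) as f:
        factor = int(f.readline())
    return [x * factor for x in iterator]

print(sc.parallelize([1, 2, 3, 4]).mapPartitions(scale_by_file_value).collect())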

On Tue, Sep 16, 2014 at 7:02 PM, daijia jia_...@intsig.com wrote:
 Is there some way to ship a text file, just like shipping Python libraries?

 Thanks in advance
 Daijia






Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread yh18190
Hi all,

I am currently working with PySpark for NLP processing, using the TextBlob
Python library. In standalone mode it is easy to install external Python
libraries, but in cluster mode I am having trouble installing them on the
worker nodes remotely: I cannot access each worker machine to put these
libs on the Python path. I tried the SparkContext pyFiles option to ship
.zip files, but the problem is that these Python packages need to be
installed on the worker machines. Could anyone let me know the different
ways of doing this, so that the TextBlob library is available on the
Python path?






Re: Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread Davies Liu
With SparkContext.addPyFile("xx.zip"), xx.zip will be copied to all the
workers and stored in a temporary directory, and the path to xx.zip will
be added to sys.path on the worker machines, so you can import xx in your
jobs; it does not need to be installed on the worker machines.

PS: the package or module should be at the top level of xx.zip, or it
cannot be imported, such as:

daviesliu@dm:~/work/tmp$ zipinfo textblob.zip
Archive:  textblob.zip   3245946 bytes   517 files
drwxr-xr-x  3.0 unx        0 bx stor 12-Sep-14 10:10 textblob/
-rw-r--r--  3.0 unx      203 tx defN 12-Sep-14 10:10 textblob/__init__.py
-rw-r--r--  3.0 unx      563 bx defN 12-Sep-14 10:10 textblob/__init__.pyc
-rw-r--r--  3.0 unx    61510 tx defN 12-Sep-14 10:10 textblob/_text.py
-rw-r--r--  3.0 unx    68316 bx defN 12-Sep-14 10:10 textblob/_text.pyc
-rw-r--r--  3.0 unx     2962 tx defN 12-Sep-14 10:10 textblob/base.py
-rw-r--r--  3.0 unx     5501 bx defN 12-Sep-14 10:10 textblob/base.pyc
-rw-r--r--  3.0 unx    27621 tx defN 12-Sep-14 10:10 textblob/blob.py

You can create this textblob.zip like so:

pip install textblob
cd /xxx/xx/site-packages/
zip -r path_to_store/textblob.zip textblob
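
And a minimal driver-side sketch of using that zip (the paths and the
sentiment call are illustrative assumptions; note that TextBlob features
which depend on NLTK corpora also need those corpora present on the
workers, which addPyFile does not ship):

from pyspark import SparkContext

sc = SparkContext(appName="ship-textblob-sketch")

# Ship the zipped package; each worker gets it on sys.path automatically.
sc.addPyFile("path_to_store/textblob.zip")

def polarity(text):
    # Import inside the task so the import resolves on the worker,
    # where textblob.zip has been added to sys.path.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity

print(sc.parallelize(["Spark is great", "this keeps crashing"]).map(polarity).collect())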

Davies


On Fri, Sep 12, 2014 at 1:39 AM, yh18190 yh18...@gmail.com wrote:
 Hi all,

 I am currently working with PySpark for NLP processing, using the TextBlob
 Python library. In standalone mode it is easy to install external Python
 libraries, but in cluster mode I am having trouble installing them on the
 worker nodes remotely: I cannot access each worker machine to put these
 libs on the Python path. I tried the SparkContext pyFiles option to ship
 .zip files, but the problem is that these Python packages need to be
 installed on the worker machines. Could anyone let me know the different
 ways of doing this, so that the TextBlob library is available on the
 Python path?


