Re: Unable to ship external Python libraries in PYSPARK
Hi David,

Thanks for the reply and for the effort you put into explaining the concepts, and thanks for the example. It worked.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p15844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to ship external Python libraries in PYSPARK
Is there some way to ship a text file, just like shipping Python libraries?

Thanks in advance,
Daijia

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
Re: Unable to ship external Python libraries in PYSPARK
Yes, sc.addFile() is what you want:

| addFile(self, path)
|     Add a file to be downloaded with this Spark job on every node.
|     The C{path} passed can be either a local file, a file in HDFS
|     (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
|     FTP URI.
|
|     To access the file in Spark jobs, use
|     L{SparkFiles.get(fileName)<pyspark.files.SparkFiles.get>} with the
|     filename to find its download location.
|
|     >>> from pyspark import SparkFiles
|     >>> path = os.path.join(tempdir, "test.txt")
|     >>> with open(path, "w") as testFile:
|     ...    testFile.write("100")
|     >>> sc.addFile(path)
|     >>> def func(iterator):
|     ...    with open(SparkFiles.get("test.txt")) as testFile:
|     ...        fileVal = int(testFile.readline())
|     ...        return [x * fileVal for x in iterator]
|     >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
|     [100, 200, 300, 400]

On Tue, Sep 16, 2014 at 7:02 PM, daijia <jia_...@intsig.com> wrote:
> Is there some way to ship a text file, just like shipping Python libraries?
>
> Thanks in advance,
> Daijia
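For a quick sanity check outside a cluster, the same pattern in the doctest above can be run as plain Python. This is only a sketch: SparkFiles.get() is replaced by the local path (an assumption, since there is no Spark context here), and mapPartitions is emulated by calling func on each partition's iterator.

```python
import os
import tempfile

# Write the file that would normally be shipped with sc.addFile().
tempdir = tempfile.mkdtemp()
path = os.path.join(tempdir, "test.txt")
with open(path, "w") as testFile:
    testFile.write("100")

def func(iterator):
    # On a real worker, SparkFiles.get("test.txt") resolves to the copy
    # Spark downloaded; in this Spark-free sketch we open the path directly.
    with open(path) as testFile:
        fileVal = int(testFile.readline())
        return [x * fileVal for x in iterator]

# mapPartitions hands each partition to func as an iterator;
# emulate [1, 2, 3, 4] split into two partitions.
partitions = [[1, 2], [3, 4]]
result = [x for part in partitions for x in func(iter(part))]
print(result)  # [100, 200, 300, 400]
```

On a cluster, the only differences are that sc.addFile(path) distributes the file and SparkFiles.get("test.txt") locates it on each worker.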
Unable to ship external Python libraries in PYSPARK
Hi all,

I am currently working on PySpark for NLP processing, using the TextBlob Python library. In standalone mode it is easy to install external Python libraries, but in cluster mode I am facing a problem installing these libraries on the worker nodes remotely: I cannot access each and every worker machine to install them into the Python path. I tried using the SparkContext pyFiles option to ship .zip files, but the problem is that these Python packages need to be installed on the worker machines. Could anyone let me know the different ways of doing this, so that this library (TextBlob) is available in the Python path?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074.html
Re: Unable to ship external Python libraries in PYSPARK
With SparkContext.addPyFile(xx.zip), xx.zip will be copied to all the workers and stored in a temporary directory; the path to xx.zip will be added to sys.path on the worker machines, so you can import xx in your jobs. It does not need to be installed on the worker machines.

PS: the package or module should be at the top level in xx.zip, or it cannot be imported. For example:

daviesliu@dm:~/work/tmp$ zipinfo textblob.zip
Archive:  textblob.zip   3245946 bytes   517 files
drwxr-xr-x  3.0 unx        0 bx stor 12-Sep-14 10:10 textblob/
-rw-r--r--  3.0 unx      203 tx defN 12-Sep-14 10:10 textblob/__init__.py
-rw-r--r--  3.0 unx      563 bx defN 12-Sep-14 10:10 textblob/__init__.pyc
-rw-r--r--  3.0 unx    61510 tx defN 12-Sep-14 10:10 textblob/_text.py
-rw-r--r--  3.0 unx    68316 bx defN 12-Sep-14 10:10 textblob/_text.pyc
-rw-r--r--  3.0 unx     2962 tx defN 12-Sep-14 10:10 textblob/base.py
-rw-r--r--  3.0 unx     5501 bx defN 12-Sep-14 10:10 textblob/base.pyc
-rw-r--r--  3.0 unx    27621 tx defN 12-Sep-14 10:10 textblob/blob.py

You can build this textblob.zip by:

pip install textblob
cd /xxx/xx/site-package/
zip -r path_to_store/textblob.zip textblob

Davies

On Fri, Sep 12, 2014 at 1:39 AM, yh18190 <yh18...@gmail.com> wrote:
> Hi all,
>
> I am currently working on PySpark for NLP processing, using the TextBlob
> Python library. In standalone mode it is easy to install external Python
> libraries, but in cluster mode I am facing a problem installing these
> libraries on the worker nodes remotely: I cannot access each and every
> worker machine to install them into the Python path. I tried using the
> SparkContext pyFiles option to ship .zip files, but the problem is that
> these Python packages need to be installed on the worker machines. Could
> anyone let me know the different ways of doing this, so that this library
> (TextBlob) is available in the Python path?
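As a Spark-free illustration of why the package must sit at the top level of the zip: Python's own zipimport imports packages from a zip once that zip is on sys.path, which is essentially what addPyFile() arranges on each worker. Here mypkg is a hypothetical stand-in for textblob; only the zip layout matters.

```python
import os
import sys
import tempfile
import zipfile

# Build a zip whose top level contains the package directory, mirroring
# `zip -r textblob.zip textblob` run from inside site-packages.
tmp = tempfile.mkdtemp()
pkgdir = os.path.join(tmp, "mypkg")
os.mkdir(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("VERSION = '0.1'\n")

zip_path = os.path.join(tmp, "mypkg.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # Archive name is "mypkg/__init__.py", i.e. the package is top-level.
    zf.write(os.path.join(pkgdir, "__init__.py"), "mypkg/__init__.py")

# This is what happens on a worker after addPyFile(): the zip itself is
# put on sys.path, and zipimport finds top-level packages inside it.
sys.path.insert(0, zip_path)
import mypkg
print(mypkg.VERSION)  # 0.1
```

Had the archive name been, say, "site-packages/mypkg/__init__.py", the import would fail, which is exactly the top-level requirement described above.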