I have n zips in a directory, and I want to extract each one, pull some data out of a file or two inside each zip, and add it to a graph DB. All of my zips are in an HDFS directory.
I am thinking my code should be along these lines:

    # Names of all my zips
    zip_names = ["a.zip", "b.zip", "c.zip"]

    # extract_and_populate_graphDB() returns 1 after doing all the work.
    # This was done so that a closure can be applied to start the Spark job.
    sc.parallelize(zip_names).map(extract_and_populate_graphDB).reduce(lambda a, b: a + b)

What I can't figure out is how to extract the zips and read the files lying within. I am able to read all the zips, but I can't save those to HDFS. Here is the code:

    import io
    import zipfile

    def ze(x):
        in_memory_data = io.BytesIO(x[1])
        file_obj = zipfile.ZipFile(in_memory_data, "r")
        return file_obj

    a = sc.binaryFiles("hdfs:/Testing/*.zip")
    a.map(ze).collect()

The above code returns a list of zipfile.ZipFile objects.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Working-with-zips-in-pyspark-tp26701.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
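For reference, the missing step above is reading each member file out of the archive instead of returning the ZipFile object itself. Below is a minimal local sketch (no Spark involved; `read_zip_members` is an illustrative name, not an existing API) of how the raw bytes that `binaryFiles` hands to the mapper could be turned into a dict of member contents:

    import io
    import zipfile

    def read_zip_members(raw_bytes):
        # Given the raw bytes of a zip archive, return {member_name: content_bytes}.
        with zipfile.ZipFile(io.BytesIO(raw_bytes), "r") as zf:
            return {name: zf.read(name) for name in zf.namelist()}

    # In Spark this could be applied per archive, e.g. (hypothetical usage,
    # path taken from the question above):
    #   sc.binaryFiles("hdfs:/Testing/*.zip").mapValues(read_zip_members)

    # Local demonstration with an in-memory zip:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.txt", "hello")

    members = read_zip_members(buf.getvalue())
    print(members["data.txt"])  # b'hello'

Since the dict values are plain bytes (not ZipFile handles), they can be parsed or written out afterwards without keeping the archive open.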