I have n zips in a directory, and I want to extract each one, pull some data
out of a file or two inside each zip, and add it to a graph DB. All of my
zips are in an HDFS directory.

I am thinking my code should be along these lines.

    # Names of all my zips
    zip_names = ["a.zip", "b.zip", "c.zip"]

    # extract_and_populate_graphDB() returns 1 after doing all the work.
    # This was done so that a closure can be applied to start the Spark job.
    sc.parallelize(zip_names).map(extract_and_populate_graphDB).reduce(lambda a, b: a + b)
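
To make the intent concrete, here is a rough sketch of what
extract_and_populate_graphDB might do. insert_into_graph() is a hypothetical
placeholder for whatever graph-DB client call gets used, and the sketch
assumes the function is handed the raw bytes of one zip rather than just its
name:

    import io
    import zipfile

    def extract_and_populate_graphDB(zip_bytes):
        # Open one zip held in memory, pull out the member files of
        # interest, and push their data to the graph DB. Returns 1 so
        # the reduce() above can count the zips processed.
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            for name in zf.namelist():
                data = zf.read(name)           # bytes of one member file
                insert_into_graph(name, data)  # hypothetical graph-DB call
        return 1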

What I can't figure out is how to extract the zips and read the files lying
within. I am able to read all the zips, but I can't save the extracted files
back to HDFS. Here is the code:

    import io
    import zipfile

    def ze(x):
        # x is a (path, bytes) pair from binaryFiles; wrap the raw bytes
        # so zipfile can read them as an in-memory file.
        in_memory_data = io.BytesIO(x[1])
        file_obj = zipfile.ZipFile(in_memory_data, "r")
        return file_obj

    a = sc.binaryFiles("hdfs:/Testing/*.zip")
    a.map(ze).collect()

The above code returns a list of zipfile.ZipFile objects.
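
One idea I'm toying with is to read the member files inside ze itself and
return their contents, so Spark ships plain (name, bytes) pairs, which
serialize cleanly, instead of ZipFile handles. This is only a sketch of what
I mean; saveAsPickleFile is one way to persist the result, and the output
path is just an example:

    import io
    import zipfile

    def ze(x):
        # x is a (path, bytes) pair from binaryFiles. Read every member
        # file on the executor and return serializable (name, bytes) pairs.
        with zipfile.ZipFile(io.BytesIO(x[1]), "r") as zf:
            return [(name, zf.read(name)) for name in zf.namelist()]

    a = sc.binaryFiles("hdfs:/Testing/*.zip")
    contents = a.flatMap(ze)  # RDD of (member_name, bytes) pairs
    contents.saveAsPickleFile("hdfs:/Testing/extracted")  # example output path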


