I have n zips in a directory, and I want to extract each one, pull some data
out of a file or two inside each zip, and add it to a graph DB. All of my
zips are in an HDFS directory.
I am thinking my code should be along these lines:
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]

# extract_and_populate_graphDB() returns 1 after doing all the work.
# It returns a value so that reduce() (an action) can be applied to
# start the Spark job.
sc.parallelize(zip_names) \
    .map(extract_and_populate_graphDB) \
    .reduce(lambda a, b: a + b)
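To make that idea concrete, here is a minimal runnable sketch of the
driver-side pattern I have in mind. extract_and_populate_graphDB is only a
placeholder stub here; the real version would open the zip and write to the
graph DB.

def extract_and_populate_graphDB(zip_path):
    # Placeholder: the real function would read the zip, parse the
    # interesting files, insert the results into the graph DB, and
    # return 1 to signal success.
    return 1

# reduce() is an action, so it forces the lazy map() to actually run
# on the executors; the final sum counts the zips processed.
processed = sc.parallelize(zip_names) \
    .map(extract_and_populate_graphDB) \
    .reduce(lambda a, b: a + b)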
What I can't figure out is how to extract the zips and read the files lying
within. I am able to read all the zips, but I can't save the extracted
contents back to HDFS. Here is the code:
import io
import zipfile

def ze(x):
    # x is a (path, bytes) pair as produced by sc.binaryFiles
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    return file_obj

a = sc.binaryFiles("hdfs:/Testing/*.zip")
a.map(ze).collect()
The above code returns me a list of zipfile.ZipFile objects.
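For reference, here is a sketch of the direction I think this needs to go:
rather than returning the ZipFile object itself (which is not picklable, so
it cannot usefully leave the executor), read the members' bytes while still
inside the executor and return plain tuples. The zip_members name and the
"yield everything" behavior are my own placeholders, not working code I
have yet.

import io
import zipfile

def zip_members(x):
    # x is a (path, bytes) pair from sc.binaryFiles
    path, payload = x
    with zipfile.ZipFile(io.BytesIO(payload), "r") as zf:
        for name in zf.namelist():
            # Yield plain (zip path, member name, member bytes) tuples;
            # unlike ZipFile objects, these serialize cleanly.
            yield (path, name, zf.read(name))

members = sc.binaryFiles("hdfs:/Testing/*.zip").flatMap(zip_members)

From there the member bytes could presumably be parsed and either fed to the
graph DB inside a foreachPartition(), or persisted back to HDFS with an
action such as saveAsPickleFile(), instead of collecting ZipFile objects on
the driver.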