How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

Bhaarat Sharma Sat, 30 Jul 2016 15:59:02 -0700

I am reading a bunch of files in PySpark using binaryFiles. I then want to
get the number of bytes in each file and write that number to an HDFS file
with a corresponding name.


Example:

If the directory /myimages has one.jpg, two.jpg, and three.jpg, then I want
three files in HDFS, one-success.txt, two-success.txt, and three-success.txt,
each containing a number. The number is the source file's length in bytes.

Here is what I've done thus far:

from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "test")

def bytes_length(rawdata):
    # Number of bytes in the raw file contents
    return len(np.asarray(bytearray(rawdata), dtype=np.uint8))

# binaryFiles yields (filename, contents) pairs, one per file
images = sc.binaryFiles("/root/sift_images_test/*.jpg")
images.map(lambda kv: bytes_length(kv[1])) \
      .saveAsTextFile("hdfs://localhost:9000/tmp/somfile")


However, doing this creates a single file in HDFS:

$ hadoop fs -cat /tmp/somfile/part-00000

113212
144926
178923

Instead, I want /tmp/somfile in HDFS to contain three files:

one-success.txt with value 113212
two-success.txt with value 144926
three-success.txt with value 178923
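
To make the naming I have in mind concrete, here is a minimal sketch in plain
Python (no Spark involved). The helper name success_name and the sample paths
are hypothetical, made up for illustration; the byte counts are the ones shown
above. It only shows how each input filename would map to an output filename,
not how to get Spark to write the files.

```python
import posixpath

def success_name(path, suffix="-success.txt"):
    """Map an input path like 'file:/myimages/one.jpg' to 'one-success.txt'."""
    base = posixpath.basename(path)          # e.g. 'one.jpg'
    stem, _ext = posixpath.splitext(base)    # e.g. ('one', '.jpg')
    return stem + suffix

# Hypothetical (filename, byte count) pairs, shaped like the output of
# binaryFiles(...) after mapping each file's contents to its length.
sample = [
    ("file:/myimages/one.jpg", 113212),
    ("file:/myimages/two.jpg", 144926),
    ("file:/myimages/three.jpg", 178923),
]

# Desired (output filename, value) pairs
named = [(success_name(path), length) for path, length in sample]
```

Each pair in named is one file I would like to end up with under /tmp/somfile.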

Is it possible to achieve what I'm after? I don't want to write the files
to the local file system and then put them in HDFS. Instead, I want to use
the saveAsTextFile method on the RDD directly.
