Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

Bhaarat Sharma Sat, 30 Jul 2016 20:02:02 -0700

I am just trying to do this as a proof of concept. The actual content of
the files will be quite a bit.


I'm having a problem using foreach or something similar on an RDD.

sc.binaryFiles("/root/sift_images_test/*.jpg")

returns

("filename1", bytes)

("filename2", bytes)

I'm wondering if there is a way to do processing on each of these (the
process in this case is just getting the byte length, but it will be
something else in the real world) and then write the contents to separate
HDFS files.
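A minimal sketch of one way this could work, not a confirmed solution: call foreach on the RDD and have each task stream its result to HDFS through the hadoop command-line client, which reads from stdin when the source is given as "-". The helper names success_name, target_path, write_length and run are made up for illustration, and the sketch assumes the hadoop CLI is on the PATH of every executor.

```python
import posixpath
import subprocess

# Output directory taken from the original post.
OUT_DIR = "hdfs://localhost:9000/tmp/somfile"


def success_name(source_path):
    """Map '.../one.jpg' to 'one-success.txt' (hypothetical naming helper)."""
    stem, _ext = posixpath.splitext(posixpath.basename(source_path))
    return stem + "-success.txt"


def target_path(source_path, out_dir=OUT_DIR):
    """Build the HDFS path that will hold the result for one input file."""
    return posixpath.join(out_dir, success_name(source_path))


def write_length(record, out_dir=OUT_DIR):
    """Write the byte length of one (filename, contents) pair to its own file.

    Assumption: the `hadoop` CLI is installed on every executor.
    `hadoop fs -put -f - <dst>` reads the payload from stdin.
    """
    filename, contents = record
    payload = (str(len(contents)) + "\n").encode("utf-8")
    proc = subprocess.Popen(
        ["hadoop", "fs", "-put", "-f", "-", target_path(filename, out_dir)],
        stdin=subprocess.PIPE,
    )
    proc.communicate(payload)
    if proc.returncode != 0:
        raise RuntimeError("hadoop fs -put failed for " + filename)


def run():
    """Driver entry point; pyspark is imported here so the helpers stay testable."""
    from pyspark import SparkContext

    sc = SparkContext("local", "test")
    images = sc.binaryFiles("/root/sift_images_test/*.jpg")
    # foreach runs write_length once per element, on the executors.
    images.foreach(write_length)
```

Calling run() would launch the job. Starting one hadoop process per element is slow, so for many files it may be worth switching to foreachPartition and reusing a single client per partition.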

If this doesn't make sense, would it make more sense to have all
contents in a single HDFS file?
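For comparison, the single-output alternative might look like the sketch below: keep the filename next to each result and let saveAsTextFile write one line per input file. The helper names to_line and run and the output path /tmp/lengths are hypothetical, chosen only for this example.

```python
import posixpath


def to_line(record):
    """Format one (filename, contents) pair as 'name<TAB>length'."""
    filename, contents = record
    return "{}\t{}".format(posixpath.basename(filename), len(contents))


def run(out_dir="hdfs://localhost:9000/tmp/lengths"):  # hypothetical path
    """Driver entry point; pyspark is imported here so to_line stays testable."""
    from pyspark import SparkContext

    sc = SparkContext("local", "test")
    images = sc.binaryFiles("/root/sift_images_test/*.jpg")
    # One text line per input file; Spark writes one part file per partition.
    images.map(to_line).saveAsTextFile(out_dir)
```

Because saveAsTextFile produces one part file per partition, adding coalesce(1) before the save would be one way to end up with a single part file, at the cost of funnelling all output through one task.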


On Sat, Jul 30, 2016 at 10:19 PM, ayan guha <guha.a...@gmail.com> wrote:

> This sounds like a bad idea, given that HDFS does not work well with small files.
>
> On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaara...@gmail.com>
> wrote:
>
>> I am reading a bunch of files in PySpark using binaryFiles. Then I want to
>> get the number of bytes for each file and write this number to an HDFS file
>> with the corresponding name.
>>
>> Example:
>>
>> if directory /myimages has one.jpg, two.jpg, and three.jpg then I want
>> three files one-success.txt, two-success.txt, and three-success.txt in HDFS
>> with a number in each. The number will specify the length in bytes.
>>
>> Here is what I've done thus far:
>>
>> from pyspark import SparkContext
>> import numpy as np
>>
>> sc = SparkContext("local", "test")
>>
>> def bytes_length(rawdata):
>>         length = len(np.asarray(bytearray(rawdata),dtype=np.uint8))
>>         return length
>>
>> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
>> # kv is a (filename, contents) pair; kv[1] holds the raw bytes
>> lengths = images.map(lambda kv: bytes_length(kv[1]))
>> lengths.saveAsTextFile("hdfs://localhost:9000/tmp/somfile")
>>
>>
>> However, doing this creates a single file in HDFS:
>>
>> $ hadoop fs -cat /tmp/somfile/part-00000
>>
>> 113212
>> 144926
>> 178923
>>
>> Instead I want /tmp/somfile in HDFS to have three files:
>>
>> one-success.txt with value 113212
>> two-success.txt with value 144926
>> three-success.txt with value 178923
>>
>> Is it possible to achieve what I'm after? I don't want to write files to 
>> local file system and then put them in HDFS. Instead, I want to use the
>> saveAsTextFile method on the RDD directly.
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
