Hi Abhishek,

Thanks for your suggestion. I did consider it, but I'm not sure whether
achieving that would require me to collect() the data first; I don't think
it would fit into the driver's memory.

Since I'm trying all of this inside the pyspark shell I'm using a small
dataset; however, the main dataset is about 1.5 GB, and my cluster has only
two nodes with 2 GB of RAM each.

Do you think that your suggestion could work without having to collect()
the results?
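
In other words, would something along these lines run entirely on the
executors, with nothing pulled back to the driver? This is just a rough
sketch of how I understood your idea (write_files is my own name, and each
file would land on the local disk of whichever executor handles that
partition):

import json

def write_files(pairs):
    # pairs is an iterator over the (key, [values]) tuples of a partition
    for key, values in pairs:
        with open('cat-%s.txt' % key, 'w') as out:
            for ele in values:
                out.write(json.dumps(ele) + '\n')

# foreachPartition is an action, so this runs without any collect()
catGroupArr.foreachPartition(write_files)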

Thanks in advance!

On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar <
abhisheksgum...@gmail.com> wrote:

> Hello Daniel,
>
>   I was wondering whether you could write something like this
> (untested; note that each file is created on whichever executor
> processes the pair, not on the driver):
>
> import json
>
> def create_and_write_file(line):
>     # 1. Look at the key of the pair: line[0]
>     key = line[0]
>     # 2. Open a file with the required name based on the key.
>     with open('cat-%s.txt' % key, 'w') as out:
>         # 3. Iterate through the values of this (key, values) pair,
>         for ele in line[1]:
>             # 4. writing every ele into the file, serialized back to JSON.
>             out.write(json.dumps(ele) + '\n')
>     # 5. The with-block closes the file on exit.
>
> catGroupArr.foreach(create_and_write_file)  # foreach is an action; map alone is lazy
>
> Do you think this works?
>
> Thanks
> Abhishek S
>
> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com>
> wrote:
>
>> Hello everyone,
>>
>> I have a PairRDD whose values are lists; each element of a list is a JSON
>> object that I loaded at the beginning of my Spark app. How can I iterate
>> over each value of the list in my pair RDD, transform it to a string, and
>> then save the whole content of each key to its own file, one file per key?
>>
>> My input files look like this (cat-0-500.txt):
>>
>> {"cat": "red", "value": "asd"}
>> {"cat": "green", "value": "zxc"}
>> {"cat": "red", "value": "jkl"}
>>
>> The PairRDD looks like:
>>
>> ('red', [{"cat": "red", "value": "asd"}, {"cat": "red", "value": "jkl"}])
>> ('green', [{"cat": "green", "value": "zxc"}])
>>
>> So, as you can see, I'd like to serialize each JSON object in the value
>> list back to a string so I can easily saveAsTextFile(); of course, I'm
>> trying to save a separate file for each key.
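>>
>> (For the serialization step itself, I assume something like this would
>> do, since json.dumps just undoes the json.loads below; catStrings is a
>> scratch name:
>>
>> # one (key, json-string) pair per record, ready for saving
>> catStrings = catGroupArr.flatMapValues(lambda vals: [json.dumps(v) for v in vals])
>>
>> What I can't figure out is the one-file-per-key part.)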
>>
>> The way I got here:
>>
>> import json
>>
>> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
>> # parse each line into a dict
>> categories = rawcatRdd.map(lambda x: json.loads(x))
>> # key each record by its category
>> catByDate = categories.map(lambda x: (x['cat'], x))
>> catGroup = catByDate.groupByKey()
>> catGroupArr = catGroup.mapValues(lambda x: list(x))
>>
>> Ideally I want to create a cat-red.txt that looks like:
>>
>> {"cat": "red", "value": "asd"}
>> {"cat": "red", "value": "jkl"}
>>
>> and the same for the rest of the keys.
>>
>> I already looked at this answer
>> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
>> but I'm slightly lost as to how to process each value in the list (turn
>> it into a string) before I save the contents to a file, and I can't
>> figure out how to import MultipleTextOutputFormat in Python either.
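>>
>> (The only pure-Python workaround I can think of is to filter and save
>> once per key, assuming the list of distinct keys is small enough to
>> collect on its own; the out/ path is just a placeholder, each
>> saveAsTextFile writes a directory of part files rather than a single
>> cat-red.txt, and it launches one job per key, which feels wasteful:
>>
>> keys = catGroupArr.keys().collect()  # just the key strings, tiny
>> for k in keys:
>>     (catGroupArr
>>         .filter(lambda kv, k=k: kv[0] == k)  # k=k pins the loop variable
>>         .flatMap(lambda kv: [json.dumps(v) for v in kv[1]])
>>         .saveAsTextFile("hdfs://x.x.x.../out/cat-%s" % k))
>>
>> but maybe there is something better.)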
>>
>> I'm trying all this wacky stuff in the pyspark shell
>>
>> Any advice would be greatly appreciated
>>
>> Thanks in advance!
>>
>
>
