Thanks for the inputs, Amandeep and Jean. Writing the author -> [list of
documents] mapping to an HDFS file works best for me, because that file
(read with NLineInputFormat) will act as the input to another MapReduce
job in which the map phase is processor-intensive. Also, I won't be
using the file for random access. I am not inclined towards emitting
<author, document> in map and consuming <author, list of docs> in
reduce, because then I would have to do the processor-intensive work in
the reduce phase, and that limits the number of parallel heavy processes
that I can spawn.
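
For concreteness, here is a minimal sketch of what that second,
processor-intensive job could look like. It assumes the new
org.apache.hadoop.mapreduce API and that the first job wrote one line
per author in the form author<TAB>docId1,docId2,...; the class name and
the processDocuments() method are only placeholders for the real heavy
work.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PerAuthorProcessingJob {

  public static class AuthorMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // One input line per author: author<TAB>docId1,docId2,...
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return; // skip malformed lines
      }
      String author = parts[0];
      String[] docIds = parts[1].split(",");
      // The processor-intensive work happens here, in the map phase.
      String result = processDocuments(author, docIds);
      context.write(new Text(author), new Text(result));
    }

    // Placeholder for the heavy per-author processing.
    private String processDocuments(String author, String[] docIds) {
      return String.valueOf(docIds.length);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "per-author processing");
    job.setJarByClass(PerAuthorProcessingJob.class);
    job.setMapperClass(AuthorMapper.class);
    job.setNumReduceTasks(0); // map-only job
    job.setInputFormatClass(NLineInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // One author line per split -> one map task per author.
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With NLineInputFormat configured for one line per split, each author's
line becomes its own map task, so the heavy work runs in as many
parallel mappers as the cluster allows.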
Thanks again.
- Rohit Kelkar

On Mon, Nov 7, 2011 at 8:36 PM, Amandeep Khurana <[email protected]> wrote:
> Rohit,
>
> It'll depend on what processing you want to do on all documents for a
> given author. You could either write author -> {list of documents} to
> an HDFS file and scan through that file with an MR job to do the
> processing, or you could simply emit <author, document> from the
> map stage and do the processing on <author, {list of documents}> in
> the reduce stage of the same job.
>
> -ak
>
> On Mon, Nov 7, 2011 at 3:02 AM, Rohit Kelkar <[email protected]> wrote:
>> I need some feedback about the best way of implementing the following:
>> In my document table I have documentid as the row id, with content:author
>> and content:text stored in each row. I want to process all documents
>> pertaining to each author in a MapReduce job, i.e. my map will take
>> key=author and values="all documentids sent by that author". But for
>> this I would first have to find all distinct authors and store them in
>> another table, and then run a MapReduce job on that second table. Am I
>> thinking in the right direction, or is there a better way to achieve
>> this?
>> - Rohit Kelkar
>>
>
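
Coming back to the original question quoted above, the separate table of
distinct authors should not be necessary: a single HBase TableMapper job
can scan the document table, emit <author, documentid> from the map, and
have the reducer write one "author<TAB>doc1,doc2,..." line per author to
HDFS. The sketch below is only illustrative and assumes the table is
named "documents", with documentid as the row key and the author stored
in content:author; the table name and output path are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AuthorDocListJob {

  static class AuthorMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
        throws IOException, InterruptedException {
      byte[] author = columns.getValue(Bytes.toBytes("content"), Bytes.toBytes("author"));
      if (author != null) {
        // key = author, value = documentid (the row key)
        String docId = Bytes.toString(row.get(), row.getOffset(), row.getLength());
        context.write(new Text(author), new Text(docId));
      }
    }
  }

  static class DocListReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text author, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      // Concatenate all document ids for this author into one line;
      // assumes the list for a single author fits comfortably in memory.
      StringBuilder sb = new StringBuilder();
      for (Text docId : docIds) {
        if (sb.length() > 0) {
          sb.append(',');
        }
        sb.append(docId.toString());
      }
      context.write(author, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "author -> document list");
    job.setJarByClass(AuthorDocListJob.class);

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("content"), Bytes.toBytes("author"));
    scan.setCaching(500);       // fewer RPCs while scanning
    scan.setCacheBlocks(false); // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob(
        "documents", scan, AuthorMapper.class, Text.class, Text.class, job);
    job.setReducerClass(DocListReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The default TextOutputFormat writes key<TAB>value, which matches the
input format expected by the NLineInputFormat sketch above. Amandeep's
single-job alternative would simply do the processor-intensive work
inside a reducer like DocListReducer instead of writing the list out to
HDFS.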
