Hi Pradeep,

Yes. Basically, I'm only writing the key part as the map output. The V of <K,V> is not of much use to me, but I'm open to changing that if it leads to faster execution. I'm kind of a newbie, so I'm looking for ways to make the map/reduce job run a lot faster.
Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But writing a map output for each and every row of a 19-million-row HBase table takes nearly a day to complete (21 mappers and 21 reducers). I have looked at both Pig and Hive for this, but I'm supposed to do it as an MR job, so I cannot use either of them. Is there anything you would recommend trying, given that the data is in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:

> I'm sorry but I don't understand your question. Is the output of the
> mapper you're describing the key portion? If it is the key, then your data
> should already be sorted by HouseHoldId since it occurs first in your key.
>
> The SortComparator will tell Hadoop how to sort your data. So you use this
> if you have a need for a non-lexical sort order. The GroupingComparator
> will tell Hadoop how to group your data for the reducer. All KV-pairs from
> the same group will be given to the same Reducer.
>
> If your reduce computation needs all the KV-pairs for the same
> HouseHoldId, then you will need to write a GroupingComparator.
>
> Also, have you considered using a higher level abstraction on Hadoop such
> as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
> easier to write in these languages.
>
> Hope this helps!
> - Pradeep
>
>
> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>
>> I need to improve my MR job, which uses HBase as source as well as sink.
>>
>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>> them out as one huge string for the reducer to do some computation and
>> dump into an HBase table.
>>
>> Table1 ~ 19 million rows.
>> Table2 ~ 2 million rows.
>> Table3 ~ 900,000 rows.
>>
>> The output of the mapper is something like this:
>>
>> HouseHoldId contentID name duration genre type channelId personId
>> televisionID timestamp
>>
>> I'm interested in sorting it on the basis of the HouseHoldID value, so I'm
>> using this technique. I'm not interested in the V part of the pair, so I'm
>> kind of ignoring it. My mapper class is defined as follows:
>>
>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>
>> My MR job takes 22 hours to complete, which is not desirable at all. I'm
>> supposed to optimize it somehow to run a lot faster.
>>
>> scan.setCaching(750);
>> scan.setCacheBlocks(false);
>> TableMapReduceUtil.initTableMapperJob(
>>     Table1,               // input HBase table name
>>     scan,
>>     AnalyzeMapper.class,  // mapper
>>     Text.class,           // mapper output key
>>     IntWritable.class,    // mapper output value
>>     job);
>>
>> TableMapReduceUtil.initTableReducerJob(
>>     OutputTable,                // output table
>>     AnalyzeReducerTable.class,  // reducer class
>>     job);
>> job.setNumReduceTasks(RegionCount);
>>
>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
>> an 8-node Cloudera cluster.
>>
>> Should I use a custom SortComparator or a GroupingComparator?
>>
>>
>> --
>> Regards-
>> Pavan
>>
>
>

--
Regards-
Pavan
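For reference, here is a minimal sketch of the sort-versus-group distinction Pradeep describes in the quoted reply. It is plain Java with no Hadoop dependency, so it only illustrates the comparison logic and is not the code from the job above. The class name `HouseHoldGroupingSketch`, the helper `houseHoldId`, and the sample keys are all hypothetical, and it assumes the map output key is a whitespace-separated string whose first field is the HouseHoldId.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HouseHoldGroupingSketch {

    // Assumption: the first whitespace-separated field of the key is the HouseHoldId.
    static String houseHoldId(String key) {
        String trimmed = key.trim();
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }

    // Grouping logic: two keys are "equal" when their HouseHoldId matches.
    static final Comparator<String> GROUP =
            Comparator.comparing(HouseHoldGroupingSketch::houseHoldId);

    // Sort logic for this sketch: by HouseHoldId first, then by the full key.
    static final Comparator<String> SORT =
            GROUP.thenComparing(Comparator.naturalOrder());

    // Sorts the keys, then starts a new group whenever GROUP reports a change,
    // which mimics how sorted keys are handed to reduce() one group at a time.
    static Map<String, List<String>> group(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String previous = null;
        List<String> current = null;
        for (String key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(houseHoldId(key), current);
            }
            current.add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys in the "HouseHoldId contentID genre" style.
        List<String> keys = Arrays.asList(
                "H2 c9 news", "H1 c3 drama", "H10 c1 sport", "H1 c1 comedy");
        group(keys).forEach((id, members) -> System.out.println(id + " -> " + members));
    }
}
```

In a real job the same two pieces of logic would live in comparator classes registered with `job.setSortComparatorClass(...)` and `job.setGroupingComparatorClass(...)`. Keys sharing a HouseHoldId would also need to land in the same partition, which is the partitioner's job rather than the comparator's.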
