No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.

Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, so I'm constantly checking whether I can optimize it somehow. Okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.
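On the map-output compression question: it is a configuration change rather than a code change. A minimal sketch, assuming "job" is the org.apache.hadoop.mapreduce.Job instance configured in the original post quoted below, and that the Snappy native libraries are installed on the cluster (the property names are the Hadoop 2.x ones; Hadoop 1.x uses mapred.compress.map.output and mapred.map.output.compression.codec instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    // Compress the intermediate map output to cut shuffle I/O.
    // This does not change the job's final output.
    Configuration conf = job.getConfiguration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

As for the combiner: it is optional, and it only helps when the reduce-side computation is associative and commutative, so that partial aggregation inside each map task is safe.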
On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:

> One thing that comes to mind is that your keys are Strings, which are
> highly inefficient. You might get much better performance if you write a
> custom Writable for your key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, your
> comparisons for sorting will also be faster.
>
> Ensure that your map outputs are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
>
> Have you been able to profile your code to see where the bottlenecks are?
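To make the custom-key suggestion above concrete, here is a minimal sketch of a composite WritableComparable, assuming HouseHoldId fits in an int and the timestamp in a long; the names HouseholdKey, houseHoldId, and timestamp are illustrative, not taken from the actual job:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite map-output key: binary fields (de)serialize much faster
    // than a concatenated Text key, and they compare numerically.
    public class HouseholdKey implements WritableComparable<HouseholdKey> {
        private int houseHoldId;
        private long timestamp;

        public HouseholdKey() {}  // no-arg constructor required by Hadoop

        public HouseholdKey(int houseHoldId, long timestamp) {
            this.houseHoldId = houseHoldId;
            this.timestamp = timestamp;
        }

        public int getHouseHoldId() { return houseHoldId; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(houseHoldId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            houseHoldId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseholdKey o) {
            // Sort by household first, then by time.
            int c = Integer.compare(houseHoldId, o.houseHoldId);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }

        @Override
        public int hashCode() {
            // Hash on the household alone so the default HashPartitioner
            // routes all records for one household to the same reducer.
            return houseHoldId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof HouseholdKey
                && compareTo((HouseholdKey) o) == 0;
        }
    }

If a key like this were adopted, the Text.class map-output key in the initTableMapperJob() call quoted below would become HouseholdKey.class, and the mapper would extend TableMapper<HouseholdKey, IntWritable>.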
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:
>
>> Hi Pradeep,
>> Yes, basically I'm only writing the key part as the map output; the V of
>> <K,V> is not of much use to me, but I'm hoping to change that if it leads
>> to faster execution. I'm kind of a newbie, so I'm looking to make the
>> map/reduce job run a lot faster.
>>
>> Also, yes, it gets sorted by the HouseHoldId, which is what I needed. But
>> it seems that if I write a map output for each and every row of a
>> 19-million-row HBase table, it takes nearly a day to complete (21 mappers
>> and 21 reducers).
>>
>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>> this via an MR job, so I cannot use either of them. Do you recommend I
>> try something, given that I have the data in that format?
>>
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:
>>
>>> I'm sorry, but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your
>>> data should already be sorted by HouseHoldId, since it occurs first in
>>> your key.
>>>
>>> The SortComparator tells Hadoop how to sort your data, so you use it
>>> when you need a non-lexical sort order. The GroupingComparator tells
>>> Hadoop how to group your data for the reducer: all KV pairs from the
>>> same group will be given to the same reduce call.
>>>
>>> If your reduce computation needs all the KV pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>>
>>> Also, have you considered using a higher-level abstraction on Hadoop,
>>> such as Pig, Hive, Cascading, etc.? Sorting/grouping tasks are a LOT
>>> easier to write in those languages.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>>> I need to improve my MR jobs, which use HBase as both source and sink.
>>>>
>>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>>> it out as one huge string for the reducer to do some computation on
>>>> and dump into an HBase table.
>>>>
>>>> Table1 ~ 19 million rows
>>>> Table2 ~ 2 million rows
>>>> Table3 ~ 900,000 rows
>>>>
>>>> The output of the mapper is something like this:
>>>>
>>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>>
>>>> I'm interested in sorting it on the basis of the HouseHoldId value, so
>>>> I'm using this technique. I'm not interested in the V part of the
>>>> pair, so I'm effectively ignoring it. My mapper class is defined as
>>>> follows:
>>>>
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>>>> }
>>>>
>>>> The MR job takes 22 hours to complete, which is not desirable at all.
>>>> I'm supposed to optimize it somehow to run a lot faster. The job is
>>>> set up like this:
>>>>
>>>> scan.setCaching(750);
>>>> scan.setCacheBlocks(false);
>>>> TableMapReduceUtil.initTableMapperJob(
>>>>     Table1,               // input HBase table name
>>>>     scan,
>>>>     AnalyzeMapper.class,  // mapper class
>>>>     Text.class,           // mapper output key
>>>>     IntWritable.class,    // mapper output value
>>>>     job);
>>>>
>>>> TableMapReduceUtil.initTableReducerJob(
>>>>     OutputTable,                // output table
>>>>     AnalyzeReducerTable.class,  // reducer class
>>>>     job);
>>>> job.setNumReduceTasks(RegionCount);
>>>>
>>>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>>>> running an 8-node Cloudera cluster.
>>>>
>>>> Should I use a custom SortComparator or a GroupingComparator?
>>>>
>>>> --
>>>> Regards,
>>>> Pavan

--
Regards,
Pavan
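Following up on the SortComparator vs. GroupingComparator question that closes the thread: with a composite key like the illustrative HouseholdKey sketched earlier, compareTo() already orders by HouseHoldId first, so no custom SortComparator is needed for that layout; a GroupingComparator is what guarantees that all KV pairs for one household arrive in a single reduce() call. A minimal sketch under that same assumption:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups reduce input by HouseHoldId alone, even though the full
    // sort order also takes the timestamp into account.
    public class HouseholdGroupingComparator extends WritableComparator {
        protected HouseholdGroupingComparator() {
            super(HouseholdKey.class, true);  // true => instantiate keys
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(((HouseholdKey) a).getHouseHoldId(),
                                   ((HouseholdKey) b).getHouseHoldId());
        }
    }

    // Wiring it into the job:
    // job.setGroupingComparatorClass(HouseholdGroupingComparator.class);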
