@Rahul, yes, you are right: 21 mappers are spawned, and all 21 are running at the same time. @Pradeep, I should add compression as you suggest; I'll give it a shot. As far as I can see, I'll also need to implement a Writable and write out the mapper key using the specific data types instead of writing it out as a string, which might be slowing the operation down.
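
Something along these lines is what I have in mind for the key -- just a sketch, assuming HouseHoldId fits in an int and the timestamp is an epoch-millis long (both assumptions on my part):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: sorts by householdId first, then timestamp.
    public class HouseholdKey implements WritableComparable<HouseholdKey> {

        private int householdId;   // assumed to be an integer id
        private long timestamp;    // assumed to be epoch millis

        public HouseholdKey() {}   // no-arg constructor required by Hadoop

        public HouseholdKey(int householdId, long timestamp) {
            this.householdId = householdId;
            this.timestamp = timestamp;
        }

        public int getHouseholdId() { return householdId; }

        public long getTimestamp() { return timestamp; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(householdId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            householdId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseholdKey other) {
            int cmp = Integer.compare(householdId, other.householdId);
            return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public int hashCode() {
            // Partition on householdId only, so all rows for a household
            // go to the same reducer under the default HashPartitioner.
            return householdId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof HouseholdKey)) return false;
            HouseholdKey k = (HouseholdKey) o;
            return householdId == k.householdId && timestamp == k.timestamp;
        }
    }

I'd then pass HouseholdKey.class to initTableMapperJob as the mapper output key class instead of Text.class.
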
On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[email protected]> wrote:

> Pavan,
>
> It's hard to tell whether there's anything wrong with your design or not
> since you haven't given us specific enough details. The best thing you can
> do is instrument your code and see what is taking a long time. Rahul
> mentioned a problem that I myself have seen before, with only one region
> (or a couple) having any data. So even if you have 21 regions, only one
> mapper might be doing the heavy lifting.
>
> A combiner is hugely helpful in terms of reducing the data output of
> mappers. Writing a combiner is a best practice and you should almost
> always have one. Compression can be turned on by setting the following
> properties in your job config:
>
> <property>
>   <name>mapreduce.map.output.compress</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapreduce.map.output.compress.codec</name>
>   <value>org.apache.hadoop.io.compress.GzipCodec</value>
> </property>
>
> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
> depending on your use case. Gzip is really slow but gets the best
> compression ratios. Snappy/Lzo are a lot faster but don't have as good a
> compression ratio. If your computations are CPU bound, then you'd probably
> want to use Snappy/Lzo. If your computations are I/O bound and your CPUs
> are idle, you can use Gzip. You'll have to experiment and find the best
> settings for you. There are a lot of other tweaks you can try to get the
> best performance out of your cluster.
>
> One of the best things you can do is to install Ganglia (or some other
> similar tool) on your cluster and monitor resource usage while your job
> is running. This will tell you whether your job is I/O bound or CPU bound.
>
> Take a look at this paper by Intel about optimizing your Hadoop cluster
> and see if that fits your deployment:
> http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
>
> If your cluster is already optimized and your job is not I/O bound, then
> there might be a problem with your algorithm, and it might warrant a
> redesign.
>
> Hope this helps!
> - Pradeep
>
>
> On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee
> <[email protected]> wrote:
>
>> One mapper is spawned per HBase table region. You can use the HBase admin
>> UI to find the number of regions per table. It might happen that all the
>> data is sitting in a single region, so a single mapper is spawned and you
>> are not getting enough parallel work done.
>>
>> If that is the case, then you can recreate the tables with predefined
>> splits to create more regions.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[email protected]> wrote:
>>
>>> Pavan,
>>>
>>> How large are the rows in HBase? 22 million rows is not very much, but
>>> you mentioned “huge strings”. Can you tell which part of the processing
>>> is the limiting factor (read from HBase, mapper output, reducers)?
>>>
>>> John
>>>
>>> From: Pavan Sudheendra [mailto:[email protected]]
>>> Sent: Saturday, September 21, 2013 2:17 AM
>>> To: [email protected]
>>> Subject: Re: How to best decide mapper output/reducer input for a huge string?
>>>
>>> No, I don't have a combiner in place. Is it necessary? How do I make my
>>> map output compressed?
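
For reference, a minimal sketch of how the properties Pradeep lists above could be set in the driver code instead of the XML, together with a combiner. The Snappy choice, the job name, and the AnalyzeCombiner class are illustrative assumptions, not from the thread, and the combiner is only valid if the reduce-side aggregation is a commutative, associative sum of the IntWritable values:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AnalyzeDriver {

        // Hypothetical combiner: pre-sums the IntWritable counts per key on
        // the map side so less data crosses the network. Input and output
        // types must match the mapper output types (Text / IntWritable).
        public static class AnalyzeCombiner
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same properties as in the XML above, set programmatically
            // (assuming the Hadoop 2.x / MR2 property names).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "analyze-households");
            job.setJarByClass(AnalyzeDriver.class);
            job.setCombinerClass(AnalyzeCombiner.class);
            // ... then TableMapReduceUtil.initTableMapperJob /
            // initTableReducerJob as in the original driver, and
            // job.waitForCompletion(true).
        }
    }
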
>>> Yes, the tables in HBase are compressed.
>>>
>>> Although there's no real bottleneck, the time it takes to process the
>>> entire table is huge, so I have to keep checking whether I can optimize
>>> it somehow.
>>>
>>> Oh okay, I'll implement a custom Writable. Apart from that, do you see
>>> anything wrong with my design? Does it require any kind of rework?
>>> Thank you so much for helping.
>>>
>>> On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:
>>>
>>> One thing that comes to mind is that your keys are Strings, which are
>>> highly inefficient. You might get a lot better performance if you write
>>> a custom writable for your key object using the appropriate data types.
>>> For example, use a long (LongWritable) for timestamps. This should make
>>> (de)serialization a lot faster. If HouseHoldId is an integer, your speed
>>> of comparisons for sorting will also go up.
>>>
>>> Ensure that your map outputs are being compressed. Are your tables in
>>> HBase compressed? Do you have a combiner?
>>>
>>> Have you been able to profile your code to see where the bottlenecks are?
>>>
>>> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>> Hi Pradeep,
>>>
>>> Yes, basically I'm only writing the key part as the map output; the V of
>>> <K,V> is not of much use to me, but I'm hoping to change that if it
>>> leads to faster execution. I'm kind of a newbie, so I'm looking to make
>>> the map/reduce job run a lot faster.
>>>
>>> Also, yes, it gets sorted by the HouseHoldId, which is what I needed.
>>> But it seems that if I write a map output for each and every row of a
>>> 19 million row HBase table, it takes nearly a day to complete
>>> (21 mappers and 21 reducers).
>>>
>>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>>> this via an MR job, so I cannot use either of those. Do you recommend I
>>> try something else, given that I have the data in that format?
>>>
>>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:
>>>
>>> I'm sorry, but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your
>>> data should already be sorted by HouseHoldId since it occurs first in
>>> your key.
>>>
>>> The SortComparator will tell Hadoop how to sort your data, so you use
>>> this if you need a non-lexical sort order. The GroupingComparator will
>>> tell Hadoop how to group your data for the reducer. All KV pairs from
>>> the same group will be given to the same reducer.
>>>
>>> If your reduce computation needs all the KV pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>>
>>> Also, have you considered using a higher-level abstraction on Hadoop
>>> such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks
>>> are a LOT easier to write in these languages.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>> I need to improve my MR job, which uses HBase as both source and sink.
>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>> it out as one huge string for the reducer to do some computation and
>>> dump into an HBase table.
>>>
>>> Table1 ~ 19 million rows.
>>> Table2 ~ 2 million rows.
>>> Table3 ~ 900,000 rows.
>>>
>>> The output of the mapper is something like this:
>>>
>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>
>>> I'm interested in sorting on the basis of the HouseHoldId value, so I'm
>>> using this technique. I'm not interested in the V part of the pair, so
>>> I'm kind of ignoring it. My mapper class is defined as follows:
>>>
>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>>> }
>>>
>>> The MR job takes 22 hours to complete, which is not desirable at all.
>>> I'm supposed to optimize this somehow so that it runs a lot faster.
>>>
>>> scan.setCaching(750);
>>> scan.setCacheBlocks(false);
>>>
>>> TableMapReduceUtil.initTableMapperJob(
>>>     Table1,               // input HBase table name
>>>     scan,
>>>     AnalyzeMapper.class,  // mapper
>>>     Text.class,           // mapper output key
>>>     IntWritable.class,    // mapper output value
>>>     job);
>>>
>>> TableMapReduceUtil.initTableReducerJob(
>>>     OutputTable,                // output table
>>>     AnalyzeReducerTable.class,  // reducer class
>>>     job);
>>>
>>> job.setNumReduceTasks(RegionCount);
>>>
>>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>>> running an 8-node Cloudera cluster.
>>>
>>> Should I use a custom SortComparator or a GroupingComparator?
>>>
>>> --
>>> Regards-
>>> Pavan

--
Regards-
Pavan
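
For completeness, here is a sketch of the GroupingComparator Pradeep mentioned, paired with the hypothetical HouseholdKey above. It groups reducer input by HouseHoldId only, while the full (HouseHoldId, timestamp) key still drives the sort order; it's only worth adding if the reduce step genuinely needs every row for a household in a single reduce() call:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical grouping comparator for the HouseholdKey sketched earlier:
    // two keys are treated as "equal" for grouping when their householdId
    // matches, so one reduce() call sees all records for a household while
    // the secondary sort on timestamp within the group is preserved.
    public class HouseholdGroupingComparator extends WritableComparator {

        public HouseholdGroupingComparator() {
            super(HouseholdKey.class, true);  // true => create key instances for comparison
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            HouseholdKey left = (HouseholdKey) a;
            HouseholdKey right = (HouseholdKey) b;
            return Integer.compare(left.getHouseholdId(), right.getHouseholdId());
        }
    }

It would be registered in the driver with job.setGroupingComparatorClass(HouseholdGroupingComparator.class).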
