@John, to be really frank, I don't know what the limiting factor is. It might be all of them or a subset of them; I cannot tell.
On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra <[email protected]> wrote:

@Rahul, yes, you are right: 21 mappers are spawned, and all 21 run at the same time. @Pradeep, I should add compression as you suggest; I'll give it a shot. As far as I can see, I'll also need to implement a Writable and write out the mapper's key using the specific data types, instead of writing it out as a string, which might be slowing the operation down.

On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[email protected]> wrote:

Pavan,

It's hard to tell whether there's anything wrong with your design, since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking so long. Rahul mentioned a problem I have seen myself, where only one region (or a couple) has any data; even if you have 21 regions, only one mapper might be doing the heavy lifting.

A combiner is hugely helpful in reducing the data output of mappers. Writing a combiner is a best practice, and you should almost always have one. Compression can be turned on by setting the following properties in your job config:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

You can also try other compression codecs such as LZO, Snappy, or Bzip2, depending on your use case. Gzip is really slow but gets the best compression ratios; Snappy and LZO are a lot faster but don't achieve as good a compression ratio. If your computations are CPU bound, you'd probably want Snappy or LZO; if they are I/O bound and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks you can try to get the best performance out of your cluster.

One of the best things you can do is install Ganglia (or a similar tool) on your cluster and monitor resource usage while your job is running. This will tell you whether your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment:
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm, and it might warrant a redesign.

Hope this helps!
- Pradeep
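A minimal sketch of the combiner Pradeep recommends might look like the following. It assumes the Text/IntWritable map output types shown in Pavan's job setup later in the thread, and it treats the IntWritable value as a count to be summed; the merge logic would need adapting to whatever the value really represents.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: pre-aggregates duplicate keys on the map side so less data is
// shuffled to the reducers. Assumes the IntWritable value is a count.
public class AnalyzeCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        context.write(key, sum);
    }
}

It would be registered with job.setCombinerClass(AnalyzeCombiner.class). The XML properties above can equivalently be set in code via job.getConfiguration().setBoolean("mapreduce.map.output.compress", true).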
On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <[email protected]> wrote:

One mapper is spawned per HBase table region. You can use the HBase admin UI to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done.

If that is the case, you can recreate the tables with predefined splits to create more regions.

Thanks,
Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[email protected]> wrote:

Pavan,

How large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?

John

From: Pavan Sudheendra [mailto:[email protected]]
Sent: Saturday, September 21, 2013 2:17 AM
To: [email protected]
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.

Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, so I constantly have to check whether I can optimize it somehow.

Oh, okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:

One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your key object using the appropriate data types; for example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster, and if HouseHoldId is an integer, your comparisons for sorting will also be faster.

Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?
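For illustration, a sketch of the kind of key Pradeep describes, trimmed to two of the fields from this thread (HouseHoldId as an int, timestamp as a long); the class and accessor names are made up, and the remaining fields would be added the same way:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch: a binary composite key replacing the concatenated string key.
// Comparing an int and a long is much cheaper than comparing strings, and
// (de)serialization is fixed-size instead of text parsing.
public class HouseholdKey implements WritableComparable<HouseholdKey> {
    private int householdId;
    private long timestamp;

    public HouseholdKey() {}  // no-arg constructor required by Hadoop

    public HouseholdKey(int householdId, long timestamp) {
        this.householdId = householdId;
        this.timestamp = timestamp;
    }

    public int getHouseholdId() {
        return householdId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(householdId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        householdId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseholdKey other) {
        int cmp = Integer.compare(householdId, other.householdId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {
        return householdId;  // HashPartitioner then partitions by household only
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HouseholdKey
                && ((HouseholdKey) o).householdId == householdId
                && ((HouseholdKey) o).timestamp == timestamp;
    }
}

The job would then declare it with job.setMapOutputKeyClass(HouseholdKey.class) in place of Text.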
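Separately, Rahul's earlier suggestion of recreating the tables with predefined splits could look roughly like this using the pre-1.0 HBaseAdmin API; the table name, column family, and split points below are placeholders and would have to match the real row-key layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: creates a table with predefined split points so data (and thus
// mappers) spread across several regions from the start.
public class PresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("Table1_presplit");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Illustrative split keys assuming numeric HouseHoldId row prefixes.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("1000000"),
            Bytes.toBytes("2000000"),
            Bytes.toBytes("3000000"),
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}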
On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:

Hi Pradeep,

Yes, basically I'm only writing the key part as the map output; the V of <K,V> is not of much use to me, but I'm hoping to change that if it leads to faster execution. I'm kind of a newbie, so I'm looking to make the map/reduce job run a lot faster.

Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But it seems that if I write a map output for each and every row of a 19 million row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).

I have looked at both Pig and Hive to do the job, but I'm supposed to do this via an MR job, so I cannot use either of them. Do you recommend I try something else, given that I have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:

I'm sorry, but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId, since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it if you need a non-lexical sort order. The GroupingComparator tells Hadoop how to group your data for the reducer: all KV pairs from the same group will be given to the same Reducer.

If your reduce computation needs all the KV pairs for the same HouseHoldId, then you will need to write a GroupingComparator.

Also, have you considered using a higher-level abstraction on Hadoop such as Pig, Hive, or Cascading? Sorting/grouping tasks of this kind are a LOT easier to write in those languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:

I need to improve my MR jobs, which use HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting on the HouseHoldID value, so I'm using this technique. I'm not interested in the V part of the pair, so I'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
}

The MR job takes 22 hours to complete, which is not desirable at all. I'm supposed to optimize it to run a lot faster somehow.

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
    Table1,               // input HBase table name
    scan,
    AnalyzeMapper.class,  // mapper
    Text.class,           // mapper output key
    IntWritable.class,    // mapper output value
    job);

TableMapReduceUtil.initTableReducerJob(
    OutputTable,                // output table
    AnalyzeReducerTable.class,  // reducer class
    job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupingComparator?

--
Regards,
Pavan
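As a rough sketch of the GroupingComparator Pradeep mentions, built on the hypothetical HouseholdKey shown earlier in the thread: it groups records by HouseHoldId alone, so one reduce() call sees every record for a household even though the full key also carries a timestamp.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch: groups map output by HouseHoldId only, while the full key can
// still sort on its remaining fields (the secondary-sort pattern).
public class HouseholdGroupingComparator extends WritableComparator {
    public HouseholdGroupingComparator() {
        super(HouseholdKey.class, true);  // true: instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return Integer.compare(((HouseholdKey) a).getHouseholdId(),
                               ((HouseholdKey) b).getHouseholdId());
    }
}

It would be wired in with job.setGroupingComparatorClass(HouseholdGroupingComparator.class), alongside a partitioner that likewise partitions on HouseHoldId only.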
