It appears that my issue was caused by the missing settings I mentioned in my second post. I ran a job with these settings, and it finished in under 6 hours. Thanks for your suggestions; they have given me further ideas for dealing with issues moving forward.
    scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs

(A fuller sketch of how these settings fit into the job setup is included after the quoted thread below.)

On Wed, Apr 13, 2016 at 7:32 AM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> Hi Chien,
>
> 4. From 50-150k per *second* to 100-150k per *minute*, as stated
> above, so reads went *DOWN* significantly. I think you must have
> misread.
>
> I will take into account some of your other suggestions.
>
> Thanks,
>
> Colin
>
> On Tue, Apr 12, 2016 at 8:19 PM, Chien Le <chie...@gmail.com> wrote:
>> Some things I would look at:
>> 1. Node statistics, for both the mapper and regionserver nodes. Make sure
>> they're on fully healthy nodes (no disk issues, no half duplex, etc.) and
>> that they're not already saturated by other jobs.
>> 2. Is there a common regionserver behind the remaining mappers/regions? If
>> so, try moving some regions off it to spread the load.
>> 3. Verify the locality of the region blocks to the regionserver. If you
>> don't automate major compactions or have moved regions recently, mapper
>> locality might not help. Major compact if needed, or move regions if you
>> can determine the source.
>> 4. You mentioned that the requests per second have gone from 50-150k to
>> 100-150k. Was that a typo? Did the read rate really increase?
>> 5. You've listed the region sizes, but was that done with a cursory
>> hadoop fs -du? Have you tried using the HFile analyzer to verify that the
>> number of rows and the sizes are roughly the same?
>> 6. Profile the mappers. If you can share the task counters for a completed
>> task and a still-running task to compare, it might help find the issue.
>> 7. I don't think you should underestimate the perf gains of node-local
>> tasks vs. merely rack-local ones, especially if short-circuit reads are
>> enabled. Unfortunately this is a big gamble given how long your tasks have
>> already been running, so I'd look at it as a last resort.
>>
>> HTH,
>> Chien
>>
>> On Tue, Apr 12, 2016 at 3:59 PM, Colin Kincaid Williams <disc...@uw.edu>
>> wrote:
>>
>>> I've noticed that I've omitted
>>>
>>>     scan.setCaching(500);        // 1 is the default in Scan, which will
>>>                                  // be bad for MapReduce jobs
>>>     scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>
>>> which appear to be suggestions from the examples. Still, I am not sure
>>> whether this explains the significant request slowdown on the final 25%
>>> of the jobs.
>>>
>>> On Tue, Apr 12, 2016 at 10:36 PM, Colin Kincaid Williams <disc...@uw.edu>
>>> wrote:
>>> > Excuse my double post. I thought I had deleted my draft, and then
>>> > constructed a cleaner, more detailed, more readable mail.
>>> >
>>> > On Tue, Apr 12, 2016 at 10:26 PM, Colin Kincaid Williams <disc...@uw.edu>
>>> > wrote:
>>> >> After trying to get help with distcp on the hadoop-user and cdh-user
>>> >> mailing lists, I've given up on trying to use distcp and ExportTable
>>> >> to migrate my HBase from 0.92.1 on CDH 4.1.3 to 0.98 on CDH 5.3.0.
>>> >>
>>> >> I've been working on an HBase MapReduce job to serialize my entries
>>> >> and insert them into Kafka. Then I plan to re-import them into
>>> >> CDH 5.3.0.
>>> >>
>>> >> Currently I'm having trouble with my MapReduce job. I have 43 maps,
>>> >> 33 of which have finished successfully and 10 of which are still
>>> >> running. I had previously seen requests of 50-150k per second. Now,
>>> >> for the final 10 maps, I'm seeing 100-150k per minute.
>>> >>
>>> >> I might also mention that there were 6 failures near the application
>>> >> start. Unfortunately, I cannot read the logs for these 6 failures.
>>> >> There is an exception related to the YARN logging for these maps,
>>> >> maybe because they failed to start.
>>> >>
>>> >> I had a look around HDFS. It appears that the regions are all between
>>> >> 5-10 GB. The longest completed map so far took 7 hours, with the
>>> >> majority appearing to take around 3.5 hours.
>>> >>
>>> >> The remaining 10 maps have each been running for between 23 and 27
>>> >> hours.
>>> >>
>>> >> Considering data locality issues: 6 of the remaining jobs are running
>>> >> on the same rack, and the other 4 are split between my other two
>>> >> racks. There should currently be a replica on each rack, since it
>>> >> appears the replication factor is set to 3, so I'm not sure this is
>>> >> really the cause of the slowdown.
>>> >>
>>> >> So I'm looking for advice on how to troubleshoot my job.
>>> >> I'm setting up my map job like this:
>>> >>
>>> >>     main(String[] args) {
>>> >>       ...
>>> >>       Scan fromScan = new Scan();
>>> >>       System.out.println(fromScan);
>>> >>       TableMapReduceUtil.initTableMapperJob(fromTableName, fromScan, Map.class,
>>> >>           null, null, job, true, TableInputFormat.class);
>>> >>
>>> >>       // My guess is that this controls the output type for the reduce function,
>>> >>       // based on setOutputKeyClass and setOutputValueClass from p.27. Since
>>> >>       // there is no reduce step, this is currently null.
>>> >>       job.setOutputFormatClass(NullOutputFormat.class);
>>> >>       job.setNumReduceTasks(0);
>>> >>       job.submit();
>>> >>       ...
>>> >>     }
>>> >>
>>> >> I'm not performing a reduce step, and I'm traversing row keys like this:
>>> >>
>>> >>     map(final ImmutableBytesWritable fromRowKey,
>>> >>         Result fromResult, Context context) throws IOException {
>>> >>       ...
>>> >>       // should I assume that each KeyValue is a version of the stored row?
>>> >>       for (KeyValue kv : fromResult.raw()) {
>>> >>         ADTreeMap.get(kv.getQualifier()).fakeLambda(messageBuilder, kv.getValue());
>>> >>         // TODO: add a counter for each qualifier
>>> >>       }
>>> >>
>>> >> I also have a list of simple questions.
>>> >>
>>> >> Has anybody experienced a significant slowdown in map jobs related to
>>> >> a portion of their HBase regions? If so, what issues did you come
>>> >> across?
>>> >>
>>> >> Can I get a suggestion on how to tell which map corresponds to which
>>> >> region, so I can troubleshoot from there? Is this already logged
>>> >> somewhere by default, or is there a way to set this up with
>>> >> TableMapReduceUtil.initTableMapperJob?
>>> >>
>>> >> Any other suggestions would be appreciated.
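
For reference, here is a minimal sketch of how the setCaching/setCacheBlocks settings quoted at the top of this thread fit into the TableMapReduceUtil driver described above. The job name, table name, and the ExportDriver/ExportMapper class names are placeholders for illustration, not the actual code from the job; the mapper itself is sketched after this block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ExportDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-export-to-kafka"); // job name is illustrative
            job.setJarByClass(ExportDriver.class);

            Scan fromScan = new Scan();
            fromScan.setCaching(500);        // 1 is the default, far too small for a full-table MR scan
            fromScan.setCacheBlocks(false);  // avoid polluting the regionserver block cache

            // "fromTableName" and ExportMapper are placeholders for the real table and mapper.
            TableMapReduceUtil.initTableMapperJob(
                    "fromTableName", fromScan, ExportMapper.class,
                    null, null, job, true, TableInputFormat.class);

            // Map-only job: no reducers, and nothing written through the output format.
            job.setOutputFormatClass(NullOutputFormat.class);
            job.setNumReduceTasks(0);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Since TableInputFormat produces one input split per region of the source table, each map task corresponds to one region; the split shown on a task's detail page (or in its logs) includes the region's start row and location, which is one way to tie a slow map back to a specific region.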
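
And a sketch of the map side, under the same caveats. One clarification on the question in the quoted code: each KeyValue returned by Result.raw() is a single cell (row, column family, qualifier, timestamp, value), not a version of the whole row; multiple versions of the same cell show up only if the Scan asks for them via setMaxVersions(). The handleCell method below is a stand-in for the ADTreeMap/Kafka serialization in the original job.

    import java.io.IOException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;

    // Map-only export: nothing is emitted to the MR framework, so both output types are NullWritable.
    public class ExportMapper extends TableMapper<NullWritable, NullWritable> {

        // Counter group name is illustrative; counters show up under the job's task counters.
        private static final String COUNTER_GROUP = "Export";

        @Override
        protected void map(ImmutableBytesWritable fromRowKey, Result fromResult, Context context)
                throws IOException, InterruptedException {
            // Each KeyValue is one cell: row + family + qualifier + timestamp + value.
            for (KeyValue kv : fromResult.raw()) {
                String qualifier = Bytes.toString(kv.getQualifier());

                // Placeholder for the real per-qualifier serialization / Kafka produce call.
                handleCell(qualifier, kv.getValue());

                // Per-qualifier counter, as the TODO in the original post suggests.
                context.getCounter(COUNTER_GROUP, qualifier).increment(1);
            }
        }

        private void handleCell(String qualifier, byte[] value) {
            // Stand-in for ADTreeMap.get(...).fakeLambda(messageBuilder, value) from the thread.
        }
    }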