How do I judge which counter would work?
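As Bejoy describes below, the specific counter does not appear to matter: incrementing any user-defined counter (or calling progress()) sends a status update from the task to the TaskTracker, which is what prevents the timeout. A minimal sketch of that idea, assuming the newer org.apache.hadoop.mapreduce API and Text keys/values; the class name GpsConcatReducer, the counter group/name, and the reporting interval are illustrative, not taken from the original job:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: class name, types and the counter group/name are illustrative.
public class GpsConcatReducer extends Reducer<Text, Text, Text, Text> {

    // How many values to process between liveness reports (illustrative).
    private static final int REPORT_INTERVAL = 10000;

    @Override
    protected void reduce(Text uid, Iterable<Text> gpsPoints, Context context)
            throws IOException, InterruptedException {
        StringBuilder line = new StringBuilder();
        long count = 0;

        for (Text gps : gpsPoints) {
            if (line.length() > 0) {
                line.append(',');
            }
            // Caution: buffering every point for a very hot uid is exactly the
            // kind of in-memory accumulation that can exceed the task JVM size
            // and cause the OOM errors Bejoy warns about below.
            line.append(gps.toString());

            if (++count % REPORT_INTERVAL == 0) {
                // Either call below sends a status update to the TaskTracker,
                // resetting the mapred.task.timeout (600s by default) clock.
                context.getCounter("GpsJob", "GPS_POINTS_PROCESSED")
                       .increment(REPORT_INTERVAL);
                context.progress();
            }
        }
        context.write(uid, new Text(line.toString()));
    }
}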
2013/1/11 <[email protected]>

> Hi
>
> To add on to Harsh's comments.
>
> You do not need to change the task timeout.
>
> In your map/reduce code you can increment a counter or report status at
> intervals, so that there is communication from the task and hence no task
> timeout.
>
> Every map and reduce task runs in its own JVM, limited by the JVM size. If
> you try to hold too much data in memory it can exceed the JVM size and
> cause OOM errors.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> From: yaotian <[email protected]>
> Date: Fri, 11 Jan 2013 14:35:07 +0800
> To: <[email protected]>
> Reply-To: [email protected]
> Subject: Re: I am running MapReduce on 30G of data on 1 master / 2 slaves,
> but failed.
>
> See inline.
>
>
> 2013/1/11 Harsh J <[email protected]>
>
>> If the per-record processing time is very high, you will need to
>> periodically report a status. Without a status change report from the
>> task to the tracker, it will be killed away as a dead task after a
>> default timeout of 10 minutes (600s).
>>
> ===================== Do you mean to increase the report time,
> "mapred.task.timeout"?
>
>> Also, beware of holding too much memory in a reduce JVM - you're still
>> limited there. Best to let the framework do the sort or secondary sort.
>>
> ===================== You mean use the default value? This is my value:
> mapred.job.reduce.memory.mb = -1
>
>>
>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <[email protected]> wrote:
>>
>>> Yes, you are right. The data is GPS traces keyed by the corresponding
>>> uid. The reduce sorts per user to get results of the form: uid, gps1,
>>> gps2, gps3, ...
>>> Yes, the GPS data is big because this is 30G of data.
>>>
>>> How to solve this?
>>>
>>>
>>> 2013/1/11 Mahesh Balija <[email protected]>
>>>
>>>> Hi,
>>>>
>>>> 2 reducers completed successfully and 1498 were killed. I assume
>>>> that you have data issues (either the data is huge or there is some
>>>> issue with the data you are trying to process).
>>>> One possibility is that you have many values associated with a
>>>> single key, which can cause this kind of issue depending on what you
>>>> do in your reducer.
>>>> Can you put some logging in your reducer and try to trace what
>>>> is happening?
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <[email protected]> wrote:
>>>>
>>>>> I have 1 Hadoop master, where the namenode runs, and 2 slaves, where
>>>>> the datanodes run.
>>>>>
>>>>> If I choose a small dataset like 200M, the job completes.
>>>>>
>>>>> But if I run the 30G dataset, the map phase finishes and the reduce
>>>>> phase reports errors. Any suggestion?
>>>>>
>>>>>
>>>>> This is the information.
>>>>>
>>>>> Black-listed TaskTrackers: 1
>>>>>
>>>>> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
>>>>> map     100.00%     450        0        0        450       0       0 / 1
>>>>> reduce  100.00%     1500       0        0        2         1498    12 / 3
>>>>>
>>>>> task_201301090834_0041_r_000001  0.00%  10-Jan-2013 04:18:54 - 06:46:38 (2hrs, 27mins, 44sec)
>>>>>   Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>>
>>>>> task_201301090834_0041_r_000002  0.00%  10-Jan-2013 04:18:54 - 06:46:38 (2hrs, 27mins, 43sec)
>>>>>   Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>>
>>>>> task_201301090834_0041_r_000003  0.00%  10-Jan-2013 04:18:57 - 06:46:38 (2hrs, 27mins, 41sec)
>>>>>   Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>>   Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>>
>>>>> task_201301090834_0041_r_000005  0.00%  10-Jan-2013 06:11:07 - 06:46:38 (35mins, 31sec)
>>>>>   Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>
>> --
>> Harsh J
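If the timeout genuinely has to be raised rather than worked around (Bejoy notes above that it normally should not be), mapred.task.timeout is set in milliseconds on the job configuration; the default of 600000 (10 minutes) is what produces the "failed to report status for 600 seconds" kills shown above. A minimal sketch under Hadoop 1.x-era assumptions; the class name, the 30-minute value, and the job name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // mapred.task.timeout is in milliseconds (default 600000 = 10 minutes).
        conf.setLong("mapred.task.timeout", 30 * 60 * 1000L); // illustrative: 30 minutes
        Job job = new Job(conf, "gps-by-uid"); // hypothetical job name
        // ... set mapper, reducer, input and output paths as usual ...
    }
}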
