Yup. The counters at the end of the job are the most accurate.

On Oct 22, 2012, at 3:00 AM, Lin Ma <[email protected]> wrote:
> Thanks so much for the help, Mike. I learned a lot from this discussion.
>
> So, the conclusion I should draw from the discussion is: since how/when the
> JT merges counters in the middle of a running job is undefined, internal
> behavior, it is more reliable to read counters after the whole job
> completes? Agree?
>
> regards,
> Lin
>
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <[email protected]>
> wrote:
>
> On Oct 21, 2012, at 1:45 AM, Lin Ma <[email protected]> wrote:
>
>> Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved.
>> The last two questions (or comments) are to confirm that my understanding
>> is correct:
>>
>> - Is it a normal use case, or best practice, for a job to consume/read the
>> counters from a previously completed job automatically? I ask because I am
>> not sure whether counters are mostly meant for human reading and manual
>> analysis, rather than for automatic consumption by another job.
>
> Lin,
> Every job has a set of counters to maintain job statistics.
> This is specifically for human analysis and to help understand what happened
> with your job.
> It allows you to see how much data is read in by the job and how many
> records were processed, measured against how long the job took to complete.
> It also shows you how much data is written back out.
>
> In addition to this, a set of use cases for counters in Hadoop centers on
> quality control. It's normal to chain jobs together to form a job flow.
> A typical use case for Hadoop is to pull data from various sources, combine
> them, and do some processing on them, resulting in a data set that gets
> sent to another system for visualization.
>
> In this use case, there are usually data cleansing and validation jobs. As
> they run, it's possible to track the number of defective records. At the
> end of that specific job, from the ToolRunner, or whichever job class you
> used to launch your job, you can then get these aggregated counters for the
> job and determine if the process passed or failed. Based on this, you can
> exit your program with either a success or a failure flag. Job-flow control
> tools like Oozie can capture this and then decide to continue, or to stop
> and alert an operator of an error.
>
>> - I want to confirm that my understanding is correct: when each task
>> completes, the JT will aggregate/update the global counter values from the
>> specific counter values reported by the completed task, but never expose
>> the global counter values until the job completes? If so, I am wondering
>> why the JT does the aggregation each time a task completes, rather than
>> doing a one-time aggregation when the job completes? Is there a design
>> reason? thanks.
>
> That's a good question. I haven't looked at the code, so I can't say
> definitively when the JT performs its aggregation. However, while the job
> is running, we can look at the job tracker web page(s) and see the counter
> summary. This would imply that there has to be some aggregation occurring
> mid-flight. (It would be trivial to sum the list of counters periodically
> to update the job statistics.) Note too that if the JT web pages can show a
> counter, it's possible to write a monitoring tool that watches the job
> while it runs and kills it mid-flight if a certain counter threshold is
> met.
>
> That is to say, you could in theory write a monitoring process and watch
> the counters. If, let's say, an error counter hits a predetermined
> threshold, you could then issue a 'hadoop job -kill <job-id>' command.
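To make that concrete, here is a minimal sketch of such a watchdog against
the M/R 1.x 'mapred' client API. The "QC"/"BAD_RECORDS" counter names and
the error budget are hypothetical stand-ins for whatever your tasks
actually increment:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    // Polls a running job's counters and kills the job if the error
    // count crosses a threshold -- the same effect as issuing
    // 'hadoop job -kill <job-id>' from the shell.
    public class CounterWatchdog {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is the job id to watch, e.g. job_201210220300_0042
        RunningJob job = client.getJob(JobID.forName(args[0]));
        final long threshold = 10000L; // hypothetical error budget

        while (job != null && !job.isComplete()) {
          // A mid-flight snapshot; it may lag the tasks, since counter
          // updates reach the JT asynchronously via task heartbeats.
          Counters counters = job.getCounters();
          long bad = counters.findCounter("QC", "BAD_RECORDS").getCounter();
          if (bad > threshold) {
            job.killJob();
            break;
          }
          Thread.sleep(30 * 1000L); // poll every 30 seconds
        }
      }
    }

Because the snapshot is only approximate mid-flight, a watchdog like this is
fine for "kill at a threshold" but not for exact bookkeeping.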
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <[email protected]>
>> wrote:
>>
>> On Oct 19, 2012, at 10:27 PM, Lin Ma <[email protected]> wrote:
>>
>>> Thanks for the detailed reply, Mike. I learned a lot from the discussion.
>>>
>>> - I just want to confirm with you that, supposing in the same job, when a
>>> specific task completes (and its counters are aggregated in the JT after
>>> the task completes, from our discussion?), the other running tasks in the
>>> same job cannot get the updated counter value from the previously
>>> completed task? I am asking because I am wondering whether I can use a
>>> counter to share a global value between tasks.
>>
>> Yes, that is correct.
>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way
>> for a task to query the job tracker. This might have changed in YARN.
>>
>>> - If so, is the traditional use case for counters only to read their
>>> values after the whole job completes?
>>>
>> Yes, the counters are used to provide data at the end of the job...
>>
>>> BTW: I'd appreciate it if you could share a few use cases from your
>>> experience of how counters are used.
>>>
>> Well, you have your typical job data like the number of records processed,
>> total number of bytes read, bytes written...
>>
>> But suppose you wanted to do some quality control on your input.
>> Then you need to keep track of the count of bad records. If this job is
>> part of a process, you may want to include business logic in your job to
>> halt the job flow if X% of the records contain bad data.
>>
>> Or your process takes input records and, in processing them, sorts them
>> based on some characteristic, and you want to count those sorted records
>> as you process them.
>>
>> For a more concrete example, the Illinois Tollway has these 'fast pass'
>> lanes where cars equipped with RFID tags can have the tolls automatically
>> deducted from their accounts rather than paying the toll manually each
>> time.
>>
>> Suppose we wanted to determine how many cars in the 'fast pass' lanes are
>> cheaters, where they drive through the sensor and the sensor doesn't
>> capture an RFID tag. (Note it's possible to have a false positive, where
>> the car has an RFID chip but doesn't trip the sensor.) Pushing the data
>> through a map/reduce job would require the use of counters.
>>
>> Does that help?
>>
>> -Mike
>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <[email protected]>
>>> wrote:
>>> Yeah, sorry...
>>>
>>> I meant that if you were dynamically creating a counter foo in the Mapper
>>> task, then each mapper would be creating its own counter foo.
>>> As the job runs, these counters will eventually be sent up to the JT. The
>>> job tracker will keep a separate counter for each task.
>>>
>>> At the end, the final count is aggregated from the list of counters for
>>> foo.
>>>
>>> I don't know how you can get a task to ask the Job Tracker for
>>> information on how things are going in other tasks. That is what I meant
>>> when I said you couldn't get information about the other counters, or
>>> even the status of the other tasks running in the same job.
>>>
>>> I didn't see anything in the APIs that allows for that type of flow... Of
>>> course, having said that... someone will pop up with a way to do just
>>> that. ;-)
>>>
>>> Does that clarify things?
>>>
>>> -Mike
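As an illustration of the dynamically created counter "foo" Mike describes
(and of the bad-record tracking above), here is a sketch of a cleansing
mapper using the newer 'mapreduce' API; the "QC" group, the counter names,
and the three-field validity rule are all hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) { // hypothetical validity rule
          // Dynamically named counter: each task accumulates its own
          // copy, and the JT folds them into one job-wide total.
          context.getCounter("QC", "BAD_RECORDS").increment(1);
          return; // drop the bad record
        }
        context.getCounter("QC", "GOOD_RECORDS").increment(1);
        context.write(new Text(fields[0]), value);
      }
    }

Each mapper increments only its own copy of the counter; no task ever sees
another task's increments, which is the limitation discussed above.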
>>>
>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Sorry, I am a bit lost... as you are thinking faster than me. :-P
>>>>
>>>> From your statement "It would make sense that the JT maintains a unique
>>>> counter for each task until the tasks complete." -- it seems each task
>>>> cannot see counters from the others, since the JT maintains a unique
>>>> counter for each task;
>>>>
>>>> From your comment "I meant that if a Task created and updated a counter,
>>>> a different Task has access to that counter." -- it seems different
>>>> tasks could share/access the same counter.
>>>>
>>>> I'd appreciate it if you could help to clarify a bit.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel
>>>> <[email protected]> wrote:
>>>>
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <[email protected]> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for the detailed reply. Two quick questions/comments:
>>>>>
>>>>> 1. By "task", do you mean a specific mapper instance, or a specific
>>>>> reducer instance?
>>>>
>>>> Either.
>>>>
>>>>> 2. "However, I do not believe that a separate Task could connect with
>>>>> the JT and see if the counter exists or if it could get a value or even
>>>>> an accurate value since the updates are asynchronous." -- do you mean
>>>>> that if one mapper is updating custom counter ABC, and another mapper
>>>>> is updating the same custom counter ABC, their counter values are
>>>>> updated independently by the different mappers, and will not be
>>>>> published (aggregated) externally until the job completes successfully?
>>>>>
>>>> I meant that if a Task created and updated a counter, a different Task
>>>> has access to that counter.
>>>>
>>>> To give you an example, if I want to count the number of quality errors
>>>> and then fail after X number of errors, I can't use global counters to
>>>> do this.
>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>> As I understand it... each Task has its own counters, and they are
>>>>> independently updated. As the tasks report back to the JT, they update
>>>>> the counters' status.
>>>>> The JT will then aggregate them.
>>>>>
>>>>> In terms of performance, counters take up some memory in the JT, so
>>>>> while it's OK to use them, if you abuse them, you can run into issues.
>>>>> As to limits... I guess that will depend on the amount of memory on the
>>>>> JT machine, the size of the cluster (number of TTs), and the number of
>>>>> counters.
>>>>>
>>>>> In terms of global accessibility... maybe.
>>>>>
>>>>> The reason I say maybe is that I'm not sure what you mean by globally
>>>>> accessible.
>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>> will eventually be reflected in the JT. However, I do not believe that
>>>>> a separate Task could connect with the JT and see if the counter
>>>>> exists, or get a value, or even an accurate value, since the updates
>>>>> are asynchronous. Not to mention that I don't believe the counters are
>>>>> aggregated until the job ends. It would make sense that the JT
>>>>> maintains a unique counter for each task until the tasks complete. (If
>>>>> a task fails, it would have to delete its counters so that when the
>>>>> task is restarted the correct count is maintained.) Note, I haven't
>>>>> looked at the source code, so I may be wrong.
>>>>>
>>>>> HTH
>>>>> Mike
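Tying the pieces above together, here is a sketch of a ToolRunner driver
that reads the aggregated counters only after waitForCompletion() returns,
and turns them into a pass/fail exit code a tool like Oozie can act on. The
5% threshold is hypothetical, and ValidatingMapper refers to the sketch
earlier in the thread:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ValidateDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "validate"); // Job.getInstance() on newer releases
        job.setJarByClass(ValidateDriver.class);
        job.setMapperClass(ValidatingMapper.class);
        job.setNumReduceTasks(0); // map-only cleansing pass
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        if (!job.waitForCompletion(true)) {
          return 1;
        }
        // Counters are final and fully aggregated here, after completion.
        Counters counters = job.getCounters();
        long bad  = counters.findCounter("QC", "BAD_RECORDS").getValue();
        long good = counters.findCounter("QC", "GOOD_RECORDS").getValue();
        long total = bad + good;
        // Hypothetical rule: fail the flow if more than 5% of records are bad.
        return (total > 0 && bad * 100 > total * 5) ? 1 : 0;
      }

      public static void main(String[] args) throws Exception {
        // A non-zero exit status lets Oozie (or any scheduler) stop the flow.
        System.exit(ToolRunner.run(new ValidateDriver(), args));
      }
    }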
>>>>>
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have some quick questions regarding Hadoop counters:
>>>>>>
>>>>>> Is a Hadoop counter (custom defined) globally accessible (for both
>>>>>> read and write) by all mappers and reducers in a job?
>>>>>> What are the performance implications and best practices of using
>>>>>> Hadoop counters? I am not sure whether, if counters are used too
>>>>>> heavily, there will be a performance hit to the whole job?
>>>>>>
>>>>>> regards,
>>>>>> Lin
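On Lin's performance question: since every distinct counter name is held in
Job Tracker memory, the usual practice is to declare a small, fixed enum of
counters rather than generating counter names dynamically per key. A sketch
along the lines of the tollway example from this thread, with a
hypothetical record layout:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TollMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      // A fixed enum keeps the counter set small and well defined; every
      // counter name lives in JT memory, so avoid minting names per key.
      public enum TollGate { TAGGED, UNTAGGED }

      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Hypothetical record layout: laneId,timestamp,rfidTag (tag may be empty)
        String[] fields = value.toString().split(",", -1);
        boolean hasTag = fields.length > 2 && !fields[2].isEmpty();
        context.getCounter(hasTag ? TollGate.TAGGED : TollGate.UNTAGGED)
               .increment(1);
        context.write(new Text(fields[0]), ONE);
      }
    }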
