Yup. The counters at the end of the job are the most accurate.

On Oct 22, 2012, at 3:00 AM, Lin Ma <[email protected]> wrote:
> Thanks so much for the help, Mike. I learned a lot from this discussion.
>
> So, the conclusion I should draw from the discussion is: since how/when the
> JT merges counters in the middle of a running job is undefined, internal
> behavior, it is more reliable to read counters after the whole job
> completes? Agree?
>
> regards,
> Lin
>
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <[email protected]>
> wrote:
>
> On Oct 21, 2012, at 1:45 AM, Lin Ma <[email protected]> wrote:
>
>> Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved.
>> The last two questions (or comments) are to confirm that my understanding
>> is correct:
>>
>> - Is it a normal use case, or best practice, for a job to consume/read the
>> counters from a previously completed job automatically? I ask because I am
>> not sure whether counters are mostly meant for human reading and manual
>> analysis, rather than for automatic consumption by another job.
>
> Lin,
> Every job has a set of counters to maintain job statistics.
> This is specifically for human analysis and to help understand what happened
> with your job.
> It allows you to see how much data is read in by the job and how many
> records were processed, measured against how long the job took to complete.
> It also shows you how much data is written back out.
>
> In addition to this, a set of use cases for counters in Hadoop centers on
> quality control. It's normal to chain jobs together to form a job flow.
> A typical use case for Hadoop is to pull data from various sources, combine
> them, and do some processing on them, resulting in a data set that gets
> sent to another system for visualization.
>
> In this use case, there are usually data cleansing and validation jobs. As
> they run, it's possible to track the number of defective records. At the
> end of that specific job, from the ToolRunner, or whichever job class you
> used to launch your job, you can then get these aggregated counters for the
> job and determine if the process passed or failed. Based on this, you can
> exit your program with either a success or a failure flag. Job-flow control
> tools like Oozie can capture this and then decide to continue, or to stop
> and alert an operator of an error.
>
>> - I want to confirm that my understanding is correct: when each task
>> completes, the JT will aggregate/update the global counter values from the
>> specific counter values reported by the completed task, but never expose
>> the global counter values until the job completes? If so, I am wondering
>> why the JT does the aggregation each time a task completes, rather than
>> doing a one-time aggregation when the job completes? Is there a design
>> reason? thanks.
>
> That's a good question. I haven't looked at the code, so I can't say
> definitively when the JT performs its aggregation. However, while the job
> is running, we can look at the job tracker web page(s) and see the counter
> summary. This would imply that there has to be some aggregation occurring
> mid-flight. (It would be trivial to sum the list of counters periodically
> to update the job statistics.) Note too that if the JT web pages can show a
> counter, it's possible to write a monitoring tool that watches the job
> while it runs and kills it mid-flight if a certain counter threshold is
> met.
>
> That is to say, you could in theory write a monitoring process and watch
> the counters. If, let's say, an error counter hits a predetermined
> threshold, you could then issue a 'hadoop job -kill <job-id>' command.
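To make that concrete, here is a minimal sketch of such a watchdog against
the M/R 1.x 'mapred' client API. The "QC"/"BAD_RECORDS" counter names and
the error budget are hypothetical stand-ins for whatever your tasks
actually increment:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    // Polls a running job's counters and kills the job if the error
    // count crosses a threshold -- the same effect as issuing
    // 'hadoop job -kill <job-id>' from the shell.
    public class CounterWatchdog {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is the job id to watch, e.g. job_201210220300_0042
        RunningJob job = client.getJob(JobID.forName(args[0]));
        final long threshold = 10000L; // hypothetical error budget

        while (job != null && !job.isComplete()) {
          // A mid-flight snapshot; it may lag the tasks, since counter
          // updates reach the JT asynchronously via task heartbeats.
          Counters counters = job.getCounters();
          long bad = counters.findCounter("QC", "BAD_RECORDS").getCounter();
          if (bad > threshold) {
            job.killJob();
            break;
          }
          Thread.sleep(30 * 1000L); // poll every 30 seconds
        }
      }
    }

Because the snapshot is only approximate mid-flight, a watchdog like this is
fine for "kill at a threshold" but not for exact bookkeeping.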
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <[email protected]>
>> wrote:
>>
>> On Oct 19, 2012, at 10:27 PM, Lin Ma <[email protected]> wrote:
>>
>>> Thanks for the detailed reply, Mike. I learned a lot from the discussion.
>>>
>>> - I just want to confirm with you that, supposing in the same job, when a
>>> specific task completes (and its counters are aggregated in the JT after
>>> the task completes, from our discussion?), the other running tasks in the
>>> same job cannot get the updated counter value from the previously
>>> completed task? I am asking because I am wondering whether I can use a
>>> counter to share a global value between tasks.
>>
>> Yes, that is correct.
>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way
>> for a task to query the job tracker. This might have changed in YARN.
>>
>>> - If so, is the traditional use case for counters only to read their
>>> values after the whole job completes?
>>>
>> Yes, the counters are used to provide data at the end of the job...
>>
>>> BTW: I'd appreciate it if you could share a few use cases from your
>>> experience of how counters are used.
>>>
>> Well, you have your typical job data like the number of records processed,
>> total number of bytes read, bytes written...
>>
>> But suppose you wanted to do some quality control on your input.
>> Then you need to keep track of the count of bad records. If this job is
>> part of a process, you may want to include business logic in your job to
>> halt the job flow if X% of the records contain bad data.
>>
>> Or your process takes input records and, in processing them, sorts them
>> based on some characteristic, and you want to count those sorted records
>> as you process them.
>>
>> For a more concrete example, the Illinois Tollway has these 'fast pass'
>> lanes where cars equipped with RFID tags can have the tolls automatically
>> deducted from their accounts rather than paying the toll manually each
>> time.
>>
>> Suppose we wanted to determine how many cars in the 'fast pass' lanes are
>> cheaters, where they drive through the sensor and the sensor doesn't
>> capture an RFID tag. (Note it's possible to have a false positive, where
>> the car has an RFID chip but doesn't trip the sensor.) Pushing the data
>> through a map/reduce job would require the use of counters.
>>
>> Does that help?
>>
>> -Mike
>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <[email protected]>
>>> wrote:
>>> Yeah, sorry...
>>>
>>> I meant that if you were dynamically creating a counter foo in the Mapper
>>> task, then each mapper would be creating its own counter foo.
>>> As the job runs, these counters will eventually be sent up to the JT. The
>>> job tracker will keep a separate counter for each task.
>>>
>>> At the end, the final count is aggregated from the list of counters for
>>> foo.
>>>
>>> I don't know how you can get a task to ask the Job Tracker for
>>> information on how things are going in other tasks. That is what I meant
>>> when I said you couldn't get information about the other counters, or
>>> even the status of the other tasks running in the same job.
>>>
>>> I didn't see anything in the APIs that allows for that type of flow... Of
>>> course, having said that... someone will pop up with a way to do just
>>> that. ;-)
>>>
>>> Does that clarify things?
>>>
>>> -Mike
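As an illustration of the dynamically created counter "foo" Mike describes
(and of the bad-record tracking above), here is a sketch of a cleansing
mapper using the newer 'mapreduce' API; the "QC" group, the counter names,
and the three-field validity rule are all hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) { // hypothetical validity rule
          // Dynamically named counter: each task accumulates its own
          // copy, and the JT folds them into one job-wide total.
          context.getCounter("QC", "BAD_RECORDS").increment(1);
          return; // drop the bad record
        }
        context.getCounter("QC", "GOOD_RECORDS").increment(1);
        context.write(new Text(fields[0]), value);
      }
    }

Each mapper increments only its own copy of the counter; no task ever sees
another task's increments, which is the limitation discussed above.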
>>>
>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Sorry, I am a bit lost... as you are thinking faster than me. :-P
>>>>
>>>> From your statement "It would make sense that the JT maintains a unique
>>>> counter for each task until the tasks complete." -- it seems each task
>>>> cannot see counters from the others, since the JT maintains a unique
>>>> counter for each task;
>>>>
>>>> From your comment "I meant that if a Task created and updated a counter,
>>>> a different Task has access to that counter." -- it seems different
>>>> tasks could share/access the same counter.
>>>>
>>>> I'd appreciate it if you could help to clarify a bit.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel
>>>> <[email protected]> wrote:
>>>>
>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <[email protected]> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for the detailed reply. Two quick questions/comments:
>>>>>
>>>>> 1. By "task", do you mean a specific mapper instance, or a specific
>>>>> reducer instance?
>>>>
>>>> Either.
>>>>
>>>>> 2. "However, I do not believe that a separate Task could connect with
>>>>> the JT and see if the counter exists or if it could get a value or even
>>>>> an accurate value since the updates are asynchronous." -- do you mean
>>>>> that if one mapper is updating custom counter ABC, and another mapper
>>>>> is updating the same custom counter ABC, their counter values are
>>>>> updated independently by the different mappers, and will not be
>>>>> published (aggregated) externally until the job completes successfully?
>>>>>
>>>> I meant that if a Task created and updated a counter, a different Task
>>>> has access to that counter.
>>>>
>>>> To give you an example, if I want to count the number of quality errors
>>>> and then fail after X number of errors, I can't use global counters to
>>>> do this.
>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>> As I understand it... each Task has its own counters, and they are
>>>>> independently updated. As the tasks report back to the JT, they update
>>>>> the counters' status.
>>>>> The JT will then aggregate them.
>>>>>
>>>>> In terms of performance, counters take up some memory in the JT, so
>>>>> while it's OK to use them, if you abuse them, you can run into issues.
>>>>> As to limits... I guess that will depend on the amount of memory on the
>>>>> JT machine, the size of the cluster (number of TTs), and the number of
>>>>> counters.
>>>>>
>>>>> In terms of global accessibility... maybe.
>>>>>
>>>>> The reason I say maybe is that I'm not sure what you mean by globally
>>>>> accessible.
>>>>> If a task creates and implements a dynamic counter... I know that it
>>>>> will eventually be reflected in the JT. However, I do not believe that
>>>>> a separate Task could connect with the JT and see if the counter
>>>>> exists, or get a value, or even an accurate value, since the updates
>>>>> are asynchronous. Not to mention that I don't believe the counters are
>>>>> aggregated until the job ends. It would make sense that the JT
>>>>> maintains a unique counter for each task until the tasks complete. (If
>>>>> a task fails, it would have to delete its counters so that when the
>>>>> task is restarted the correct count is maintained.) Note, I haven't
>>>>> looked at the source code, so I may be wrong.
>>>>>
>>>>> HTH
>>>>> Mike
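Tying the pieces above together, here is a sketch of a ToolRunner driver
that reads the aggregated counters only after waitForCompletion() returns,
and turns them into a pass/fail exit code a tool like Oozie can act on. The
5% threshold is hypothetical, and ValidatingMapper refers to the sketch
earlier in the thread:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ValidateDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "validate"); // Job.getInstance() on newer releases
        job.setJarByClass(ValidateDriver.class);
        job.setMapperClass(ValidatingMapper.class);
        job.setNumReduceTasks(0); // map-only cleansing pass
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        if (!job.waitForCompletion(true)) {
          return 1;
        }
        // Counters are final and fully aggregated here, after completion.
        Counters counters = job.getCounters();
        long bad  = counters.findCounter("QC", "BAD_RECORDS").getValue();
        long good = counters.findCounter("QC", "GOOD_RECORDS").getValue();
        long total = bad + good;
        // Hypothetical rule: fail the flow if more than 5% of records are bad.
        return (total > 0 && bad * 100 > total * 5) ? 1 : 0;
      }

      public static void main(String[] args) throws Exception {
        // A non-zero exit status lets Oozie (or any scheduler) stop the flow.
        System.exit(ToolRunner.run(new ValidateDriver(), args));
      }
    }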
>>>>>
>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have some quick questions regarding Hadoop counters:
>>>>>>
>>>>>> Is a Hadoop counter (custom defined) globally accessible (for both
>>>>>> read and write) by all mappers and reducers in a job?
>>>>>> What are the performance implications and best practices of using
>>>>>> Hadoop counters? I am not sure whether, if counters are used too
>>>>>> heavily, there will be a performance hit to the whole job?
>>>>>>
>>>>>> regards,
>>>>>> Lin
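On Lin's performance question: since every distinct counter name is held in
Job Tracker memory, the usual practice is to declare a small, fixed enum of
counters rather than generating counter names dynamically per key. A sketch
along the lines of the tollway example from this thread, with a
hypothetical record layout:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TollMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      // A fixed enum keeps the counter set small and well defined; every
      // counter name lives in JT memory, so avoid minting names per key.
      public enum TollGate { TAGGED, UNTAGGED }

      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Hypothetical record layout: laneId,timestamp,rfidTag (tag may be empty)
        String[] fields = value.toString().split(",", -1);
        boolean hasTag = fields.length > 2 && !fields[2].isEmpty();
        context.getCounter(hasTag ? TollGate.TAGGED : TollGate.UNTAGGED)
               .increment(1);
        context.write(new Text(fields[0]), ONE);
      }
    }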
