Re: Hadoop counter

Michael Segel Sun, 21 Oct 2012 05:16:35 -0700

On Oct 21, 2012, at 1:45 AM, Lin Ma <[email protected]> wrote:

> Thanks for the detailed reply, Mike. Yes, my most confusion is resolved by 
> you. The last two questions (or comments) are used to confirm my 
> understanding is correct,
> 
> - is it normal use case or best practices for a job to consume/read the 
> counters from previous completed job in an automatic way? I ask this because 
> I am not sure whether the most use case of counter is human read and manual 
> analysis, other then using another job to automatic consume the counters?


Lin, 
Every job has a set of counters to maintain job statistics. 
This is specifically for human analysis and to help understand what happened 
with your job. 
It allows you to see how much data is read in by the job, how many records 
processed to be measured against how long the job took to complete.  It also 
showed you how much data is written back out.  

In addition to this,  a set of use cases for counters in Hadoop center on 
quality control. Its normal to chain jobs together to form a job flow. 
A typical use case for Hadoop is to pull data from various sources, combine 
them and do some process on them, resulting in a data set that gets sent to 
another system for visualization. 

In this use case, there are usually data cleansing and validation jobs. As they 
run, its possible to track a number of defective records. At the end of that 
specific job, from the ToolRunner, or whichever job class you used to launch 
your job, you can then get these aggregated counters for the job and determine 
if the process passed or failed.  Based on this, you can exit your program with 
either a success or failed flag.  Job Flow control tools like Oozie can capture 
this and then decide to continue or to stop and alert an operator of an error. 

> - I want to confirm my understanding is correct, when each task completes, JT 
> will aggregate/update the global counter values from the specific counter 
> values updated by the complete task, but never expose global counters values 
> until job completes? If it is correct, I am wondering why JT doing 
> aggregation each time when a task completes, other than doing a one time 
> aggregation when the job completes? Is there any design choice reasons? 
> thanks.

That's a good question. I haven't looked at the code, so I can't say 
definitively when the JT performs its aggregation. However, as the job runs and 
in process, we can look at the job tracker web page(s) and see the counter 
summary. This would imply that there has to be some aggregation occurring 
mid-flight. (It would be trivial to sum the list of counters periodically to 
update the job statistics.)  Note too that if the JT web pages can show a 
counter, its possible to then write a monitoring tool that can monitor the job 
while running and then kill the job mid flight if a certain threshold of a 
counter is met. 

That is to say you could in theory write a monitoring process and watch the 
counters. If lets say an error counter hits a predetermined threshold, you 
could then issue a 'hadoop job -kill <job-id>' command. 

> 
> regards,
> Lin
> 
> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <[email protected]> 
> wrote:
> 
> On Oct 19, 2012, at 10:27 PM, Lin Ma <[email protected]> wrote:
> 
>> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>> 
>> - I just want to confirm with you that, supposing in the same job, when a 
>> specific task completed (and counter is aggregated in JT after the task 
>> completed from our discussion?), the other running task in the same job 
>> cannot get the updated counter value from the previous completed task? I am 
>> asking this because I am thinking whether I can use counter to share a 
>> global value between tasks.
> 
> Yes that is correct. 
> While I haven't looked at YARN (M/R 2.0) , M/R 1.x doesn't have an easy way 
> for a task to query the job tracker. This might have changed in YARN
> 
>> - If so, what is the traditional use case of counter, only use counter 
>> values after the whole job completes?
>> 
> Yes the counters are used to provide data at the end of the job... 
> 
>> BTW: appreciate if you could share me a few use cases from your experience 
>> about how counters are used.
>> 
> Well you have your typical job data like the number of records processed, 
> total number of bytes read,  bytes written... 
> 
> But suppose you wanted to do some quality control on your input. 
> So you need to keep a track on the count of bad records.  If this job is part 
> of a process, you may want to include business logic in your job to halt the 
> job flow if X% of the records contain bad data. 
> 
> Or your process takes input records and in processing them, they sort the 
> records based on some characteristic and you want to count those sorted 
> records as you processed them. 
> 
> For a more concrete example, the Illinois Tollway has these 'fast pass' lanes 
> where cars equipped with RFID tags can have the tolls automatically deducted 
> from their accounts rather than pay the toll manually each time. 
> 
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are 
> cheaters where they drive through the sensor and the sensor doesn't capture 
> the RFID tag. (Note its possible that you have a false positive where the car 
> has an RFID chip but doesn't trip the sensor.) Pushing the data in a 
> map/reduce job would require the use of counters.
> 
> Does that help? 
> 
> -Mike
> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <[email protected]> 
>> wrote:
>> Yeah, sorry... 
>> 
>> I meant that if you were dynamically creating a counter foo in the Mapper 
>> task, then each mapper would be creating their own counter foo. 
>> As the job runs, these counters will eventually be sent up to the JT. The 
>> job tracker would keep a separate counter for each task. 
>> 
>> At the end, the final count is aggregated from the list of counters for foo. 
>> 
>> 
>> I don't know how you can get a task to ask information from the Job Tracker 
>> on how things are going in other tasks.  That is what I meant that you 
>> couldn't get information about the other counters or even the status of the 
>> other tasks running in the same job. 
>> 
>> I didn't see anything in the APIs that allowed for that type of flow... Of 
>> course having said that... someone pops up with a way to do just that. ;-) 
>> 
>> 
>> Does that clarify things? 
>> 
>> -Mike
>> 
>> 
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <[email protected]> wrote:
>> 
>>> Hi Mike,
>>> 
>>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>>> 
>>> From your this statement "It would make sense that the JT maintains a 
>>> unique counter for each task until the tasks complete." -- it seems each 
>>> task cannot see counters from each other, since JT maintains a unique 
>>> counter for each tasks;
>>> 
>>> From your this comment "I meant that if a Task created and updated a 
>>> counter, a different Task has access to that counter. " -- it seems 
>>> different tasks could share/access the same counter.
>>> 
>>> Appreciate if you could help to clarify a bit.
>>> 
>>> regards,
>>> Lin
>>> 
>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <[email protected]> 
>>> wrote:
>>> 
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <[email protected]> wrote:
>>> 
>>>> Hi Mike,
>>>> 
>>>> Thanks for the detailed reply. Two quick questions/comments,
>>>> 
>>>> 1. For "task", you mean a specific mapper instance, or a specific reducer 
>>>> instance?
>>> 
>>> Either. 
>>> 
>>>> 2. "However, I do not believe that a separate Task could connect with the 
>>>> JT and see if the counter exists or if it could get a value or even an 
>>>> accurate value since the updates are asynchronous." -- do you mean if a 
>>>> mapper is updating custom counter ABC, and another mapper is updating the 
>>>> same customer counter ABC, their counter values are updated independently 
>>>> by different mappers, and will not published (aggregated) externally until 
>>>> job completed successfully?
>>>> 
>>> I meant that if a Task created and updated a counter, a different Task has 
>>> access to that counter. 
>>> 
>>> To give you an example, if I want to count the number of quality errors and 
>>> then fail after X number of errors, I can't use Global counters to do this.
>>> 
>>>> regards,
>>>> Lin
>>>> 
>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel 
>>>> <[email protected]> wrote:
>>>> As I understand it... each Task has its own counters and are independently 
>>>> updated. As they report back to the JT, they update the counter(s)' status.
>>>> The JT then will aggregate them. 
>>>> 
>>>> In terms of performance, Counters take up some memory in the JT so while 
>>>> its OK to use them, if you abuse them, you can run in to issues. 
>>>> As to limits... I guess that will depend on the amount of memory on the JT 
>>>> machine, the size of the cluster (Number of TT) and the number of 
>>>> counters. 
>>>> 
>>>> In terms of global accessibility... Maybe.
>>>> 
>>>> The reason I say maybe is that I'm not sure by what you mean by globally 
>>>> accessible. 
>>>> If a task creates and implements a dynamic counter... I know that it will 
>>>> eventually be reflected in the JT. However, I do not believe that a 
>>>> separate Task could connect with the JT and see if the counter exists or 
>>>> if it could get a value or even an accurate value since the updates are 
>>>> asynchronous.  Not to mention that I don't believe that the counters are 
>>>> aggregated until the job ends. It would make sense that the JT maintains a 
>>>> unique counter for each task until the tasks complete. (If a task fails, 
>>>> it would have to delete the counters so that when the task is restarted 
>>>> the correct count is maintained. )  Note, I haven't looked at the source 
>>>> code so I am probably wrong. 
>>>> 
>>>> HTH
>>>> Mike
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <[email protected]> wrote:
>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> I have some quick questions regarding to Hadoop counter,
>>>>> 
>>>>> Hadoop counter (customer defined) is global accessible (for both read and 
>>>>> write) for all Mappers and Reducers in a job?
>>>>> What is the performance and best practices of using Hadoop counters? I am 
>>>>> not sure if using Hadoop counters too heavy, there will be performance 
>>>>> downgrade to the whole job?
>>>>> regards,
>>>>> Lin
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Hadoop counter

Reply via email to