Thanks so much for the help, Mike. I learned a lot from this discussion. So the conclusion I should draw is: since how and when the JT merges counters in the middle of a running job is undefined, internal behavior, it is more reliable to read counters only after the whole job completes. Agree?
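If I have that right, the reliable pattern is roughly the sketch below (driver side, org.apache.hadoop.mapreduce API; the "QC" / "BAD_RECORDS" counter names and the exit policy are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QcDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "data-cleansing");
        // ... set jar, mapper, reducer, input/output paths as usual ...

        boolean ok = job.waitForCompletion(true);

        // Only here, after the job has completed, are the counter
        // values guaranteed to be fully aggregated.
        long bad = job.getCounters()
                      .findCounter("QC", "BAD_RECORDS").getValue();

        // The exit status tells a job-flow tool (e.g. Oozie) pass/fail.
        System.exit(ok && bad == 0 ? 0 : 1);
      }
    }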
regards,
Lin

On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel <[email protected]> wrote:

> On Oct 21, 2012, at 1:45 AM, Lin Ma <[email protected]> wrote:

> Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved. The last two questions (or comments) are to confirm that my understanding is correct:

> - Is it a normal use case or best practice for one job to consume/read the counters of a previously completed job automatically? I ask because I am not sure whether the main use of counters is human reading and manual analysis, rather than having another job consume the counters automatically.

> Lin,

> Every job has a set of counters to maintain job statistics. This is specifically for human analysis, to help understand what happened with your job. It allows you to see how much data is read in by the job and how many records were processed, measured against how long the job took to complete. It also shows you how much data is written back out.

> In addition to this, a set of use cases for counters in Hadoop centers on quality control. It's normal to chain jobs together to form a job flow. A typical use case for Hadoop is to pull data from various sources, combine them, and do some processing, resulting in a data set that gets sent to another system for visualization.

> In this use case, there are usually data cleansing and validation jobs. As they run, it's possible to track the number of defective records. At the end of that specific job, from the ToolRunner, or whichever job class you used to launch your job, you can get the aggregated counters for the job and determine whether the process passed or failed. Based on this, you can exit your program with either a success or a failure flag. Job-flow control tools like Oozie can capture this and then decide to continue, or to stop and alert an operator of an error.

> - I want to confirm my understanding is correct: when each task completes, the JT aggregates/updates the global counter values from the values reported by the completed task, but never exposes the global counter values until the job completes? If that is correct, I am wondering why the JT does the aggregation each time a task completes, rather than doing a one-time aggregation when the job completes. Are there design reasons for this choice? Thanks.

> That's a good question. I haven't looked at the code, so I can't say definitively when the JT performs its aggregation. However, while the job runs, we can look at the job tracker web page(s) and see the counter summary. This implies that there has to be some aggregation occurring mid-flight. (It would be trivial to sum the list of counters periodically to update the job statistics.) Note too that if the JT web pages can show a counter, it's possible to write a monitoring tool that watches the job while it is running and kills it mid-flight if a certain threshold on a counter is met.

> That is to say, you could in theory write a monitoring process and watch the counters. If, let's say, an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill <job-id>' command.
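> Something along the lines of this rough sketch (untested; it uses the old org.apache.hadoop.mapred client API, and the counter group/name and the threshold are invented):

>   import org.apache.hadoop.mapred.Counters;
>   import org.apache.hadoop.mapred.JobClient;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.JobID;
>   import org.apache.hadoop.mapred.RunningJob;
>
>   public class CounterWatchdog {
>     public static void main(String[] args) throws Exception {
>       JobClient client = new JobClient(new JobConf());
>       RunningJob job = client.getJob(JobID.forName(args[0]));
>
>       while (!job.isComplete()) {
>         Counters counters = job.getCounters();
>         long errors = counters.findCounter("QC", "BAD_RECORDS").getCounter();
>         if (errors > 10000) {      // predetermined threshold
>           job.killJob();           // same effect as 'hadoop job -kill'
>           break;
>         }
>         Thread.sleep(30 * 1000);   // poll every 30 seconds
>       }
>     }
>   }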
> regards,
> Lin

> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <[email protected]> wrote:

>> On Oct 19, 2012, at 10:27 PM, Lin Ma <[email protected]> wrote:

>> Thanks for the detailed reply, Mike. I learned a lot from the discussion.

>> - I just want to confirm with you: supposing that, in the same job, a specific task has completed (and its counters were aggregated in the JT after it completed, per our discussion?), the other tasks still running in that job cannot get the updated counter value from the completed task? I am asking because I am wondering whether I can use a counter to share a global value between tasks.

>> Yes, that is correct. While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN.

>> - If so, what is the traditional use case for counters: only using the counter values after the whole job completes?

>> Yes, the counters are used to provide data at the end of the job...

>> BTW: I'd appreciate it if you could share a few use cases from your experience of how counters are used.

>> Well, you have your typical job data, like the number of records processed, the total number of bytes read, the bytes written...

>> But suppose you wanted to do some quality control on your input. You would need to keep track of the count of bad records. If this job is part of a process, you may want to include business logic that halts the job flow if X% of the records contain bad data.

>> Or your process takes input records and, in processing them, sorts the records based on some characteristic, and you want to count those sorted records as you process them.

>> For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than paying the toll manually each time.

>> Suppose we wanted to determine how many cars in the 'fast pass' lanes are cheaters, where a car drives through the sensor and the sensor doesn't capture an RFID tag. (Note it's possible to have a false positive, where the car has an RFID chip but doesn't trip the sensor.) Pushing that data through a map/reduce job would require the use of counters.

>> Does that help?

>> -Mike

>> regards,
>> Lin

>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <[email protected]> wrote:

>>> Yeah, sorry...

>>> I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating its own counter foo. As the job runs, these counters will eventually be sent up to the JT. The job tracker will keep a separate counter for each task.

>>> At the end, the final count is aggregated from the list of counters for foo.
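>>> (In code terms, 'dynamically creating' foo just means naming the counter with strings at run time inside map(); the group and counter names here are arbitrary:)

>>>   // Each mapper task bumps its own copy of foo; the JT later rolls
>>>   // the per-task copies up into the job-wide total.
>>>   context.getCounter("MyCounters", "foo").increment(1);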
>>> I don't know how you can get a task to ask the Job Tracker for information about how things are going in the other tasks. That is what I meant when I said you couldn't get information about the other counters, or even the status of the other tasks running in the same job.

>>> I didn't see anything in the APIs that allows for that type of flow... Of course, having said that... someone will pop up with a way to do just that. ;-)

>>> Does that clarify things?

>>> -Mike

>>> On Oct 19, 2012, at 11:56 AM, Lin Ma <[email protected]> wrote:

>>> Hi Mike,

>>> Sorry, I am a bit lost... as you are thinking faster than me. :-P

>>> From your statement "It would make sense that the JT maintains a unique counter for each task until the tasks complete" -- it seems each task cannot see the counters of the others, since the JT maintains a unique counter for each task;

>>> From your comment "I meant that if a Task created and updated a counter, a different Task has access to that counter" -- it seems different tasks could share/access the same counter.

>>> I'd appreciate it if you could help clarify a bit.

>>> regards,
>>> Lin

>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <[email protected]> wrote:

>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <[email protected]> wrote:

>>>> Hi Mike,

>>>> Thanks for the detailed reply. Two quick questions/comments:

>>>> 1. By "task", do you mean a specific mapper instance, or a specific reducer instance?

>>>> Either.

>>>> 2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean that if one mapper is updating custom counter ABC, and another mapper is updating the same custom counter ABC, their counter values are updated independently by the different mappers, and will not be published (aggregated) externally until the job completes successfully?

>>>> I meant that if a Task created and updated a counter, a different Task has access to that counter.

>>>> To give you an example: if I want to count the number of quality errors and then fail after X number of errors, I can't use global counters to do this.

>>>> regards,
>>>> Lin

>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <[email protected]> wrote:

>>>>> As I understand it... each Task has its own counters, and they are independently updated. As the tasks report back to the JT, they update the counters' status. The JT then aggregates them.

>>>>> In terms of performance, counters take up some memory in the JT, so while it's OK to use them, if you abuse them you can run into issues. As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (number of TTs), and the number of counters.

>>>>> In terms of global accessibility... maybe.

>>>>> The reason I say maybe is that I'm not sure what you mean by globally accessible. If a task creates and updates a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists, or get a value, or even an accurate value, since the updates are asynchronous. Not to mention that I don't believe the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete its counters so that when the task is restarted the correct count is maintained.) Note, I haven't looked at the source code, so I am probably wrong.
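>>>>> (To make the 'don't abuse them' point concrete: typical light usage is a small, fixed set of declared counters, i.e. one enum, as in this untested sketch with invented names:)

>>>>>   import java.io.IOException;
>>>>>   import org.apache.hadoop.io.LongWritable;
>>>>>   import org.apache.hadoop.io.NullWritable;
>>>>>   import org.apache.hadoop.io.Text;
>>>>>   import org.apache.hadoop.mapreduce.Mapper;
>>>>>
>>>>>   public class QcMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
>>>>>     // One enum = a fixed, known set of counters for the JT to track.
>>>>>     public enum Quality { GOOD_RECORDS, BAD_RECORDS }
>>>>>
>>>>>     @Override
>>>>>     protected void map(LongWritable key, Text value, Context context)
>>>>>         throws IOException, InterruptedException {
>>>>>       if (value.toString().split(",").length < 3) {
>>>>>         context.getCounter(Quality.BAD_RECORDS).increment(1);
>>>>>         return;                                    // drop defective record
>>>>>       }
>>>>>       context.getCounter(Quality.GOOD_RECORDS).increment(1);
>>>>>       context.write(value, NullWritable.get());    // pass record through
>>>>>     }
>>>>>   }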
>>>>> HTH
>>>>> Mike

>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <[email protected]> wrote:

>>>>> Hi guys,

>>>>> I have some quick questions regarding Hadoop counters:

>>>>> - Is a Hadoop counter (custom defined) globally accessible (for both read and write) to all Mappers and Reducers in a job?
>>>>> - What are the performance implications and best practices of using Hadoop counters? I am not sure whether heavy use of Hadoop counters will degrade the performance of the whole job.

>>>>> regards,
>>>>> Lin
