Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved. The last two questions (or comments) are just to confirm that my understanding is correct:
- Is it a normal use case, or a best practice, for a job (or the driver of a
job chain) to consume/read the counters of a previously completed job in an
automatic way? I ask because I am not sure whether the most common use of
counters is human reading and manual analysis, as opposed to having another
job consume the counters automatically. (To make the question concrete, I
have pasted a rough sketch of what I mean as a P.S. below the quoted
thread.)
- I want to confirm my understanding: when each task completes, the JT
aggregates/updates the global counter values from the counter values
reported by the completed task, but never exposes the global counter values
until the job completes? If that is correct, I am wondering why the JT does
the aggregation each time a task completes, rather than doing a one-time
aggregation when the job completes. Is there a design reason for this
choice?

Thanks.

regards,
Lin

On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <[email protected]> wrote:

>
> On Oct 19, 2012, at 10:27 PM, Lin Ma <[email protected]> wrote:
>
> Thanks for the detailed reply Mike, I learned a lot from the discussion.
>
> - I just want to confirm with you that, supposing in the same job, when a
> specific task completed (and counter is aggregated in JT after the task
> completed from our discussion?), the other running tasks in the same job
> cannot get the updated counter value from the previous completed task? I am
> asking this because I am thinking whether I can use a counter to share a
> global value between tasks.
>
>
> Yes that is correct.
> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy
> way for a task to query the job tracker. This might have changed in YARN.
>
> - If so, what is the traditional use case of counters, only to use counter
> values after the whole job completes?
>
> Yes the counters are used to provide data at the end of the job...
>
> BTW: I would appreciate it if you could share a few use cases from your
> experience about how counters are used.
>
> Well you have your typical job data like the number of records processed,
> total number of bytes read, bytes written...
>
> But suppose you wanted to do some quality control on your input.
> So you need to keep track of the count of bad records. If this job is
> part of a process, you may want to include business logic in your job to
> halt the job flow if X% of the records contain bad data.
>
> Or your process takes input records and in processing them, it sorts the
> records based on some characteristic and you want to count those sorted
> records as you process them.
>
> For a more concrete example, the Illinois Tollway has these 'fast pass'
> lanes where cars equipped with RFID tags can have the tolls automatically
> deducted from their accounts rather than pay the toll manually each time.
>
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
> cheaters, where they drive through the sensor and the sensor doesn't
> capture the RFID tag. (Note it's possible that you have a false positive
> where the car has an RFID chip but doesn't trip the sensor.) Pushing the
> data through a map/reduce job would require the use of counters.
>
> Does that help?
>
> -Mike
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <[email protected]> wrote:
>
>> Yeah, sorry...
>>
>> I meant that if you were dynamically creating a counter foo in the Mapper
>> task, then each mapper would be creating its own counter foo.
>> As the job runs, these counters will eventually be sent up to the JT. The
>> job tracker would keep a separate counter for each task.
>>
>> At the end, the final count is aggregated from the list of counters for
>> foo.
>>
>>
>> I don't know how you can get a task to ask the Job Tracker for
>> information on how things are going in other tasks. That is what I meant
>> when I said you couldn't get information about the other counters, or even
>> the status of the other tasks running in the same job.
>>
>> I didn't see anything in the APIs that allowed for that type of flow...
>> Of course, having said that... someone pops up with a way to do just that.
>> ;-)
>>
>>
>> Does that clarify things?
>>
>> -Mike
>>
>>
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <[email protected]> wrote:
>>
>> Hi Mike,
>>
>> Sorry, I am a bit lost... as you are thinking faster than me. :-P
>>
>> From your statement "It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete." -- it seems each
>> task cannot see counters from the others, since the JT maintains a unique
>> counter for each task;
>>
>> From your comment "I meant that if a Task created and updated a
>> counter, a different Task has access to that counter." -- it seems
>> different tasks could share/access the same counter.
>>
>> I would appreciate it if you could help to clarify a bit.
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <[email protected]> wrote:
>>
>>>
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <[email protected]> wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the detailed reply. Two quick questions/comments,
>>>
>>> 1. By "task", do you mean a specific mapper instance, or a specific
>>> reducer instance?
>>>
>>>
>>> Either.
>>>
>>> 2. "However, I do not believe that a separate Task could connect with
>>> the JT and see if the counter exists or if it could get a value or even an
>>> accurate value since the updates are asynchronous." -- do you mean that if
>>> a mapper is updating custom counter ABC, and another mapper is updating the
>>> same custom counter ABC, their counter values are updated independently
>>> by the different mappers, and will not be published (aggregated) externally
>>> until the job completes successfully?
>>>
>>> I meant that if a Task created and updated a counter, a different Task
>>> has access to that counter.
>>>
>>> To give you an example, if I want to count the number of quality errors
>>> and then fail after X number of errors, I can't use global counters to do
>>> this.
>>>
>>> regards,
>>> Lin
>>>
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> As I understand it... each Task has its own counters and they are
>>>> independently updated. As the tasks report back to the JT, they update
>>>> the counter(s)' status.
>>>> The JT then aggregates them.
>>>>
>>>> In terms of performance, Counters take up some memory in the JT, so
>>>> while it's OK to use them, if you abuse them, you can run into issues.
>>>> As to limits... I guess that will depend on the amount of memory on the
>>>> JT machine, the size of the cluster (number of TTs) and the number of
>>>> counters.
>>>>
>>>> In terms of global accessibility... Maybe.
>>>>
>>>> The reason I say maybe is that I'm not sure what you mean by
>>>> globally accessible.
>>>> If a task creates and implements a dynamic counter... I know that it
>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>> separate Task could connect with the JT and see if the counter exists or if
>>>> it could get a value or even an accurate value since the updates are
>>>> asynchronous. Not to mention that I don't believe that the counters are
>>>> aggregated until the job ends.
>>>> It would make sense that the JT maintains a
>>>> unique counter for each task until the tasks complete. (If a task fails, it
>>>> would have to delete the counters so that when the task is restarted the
>>>> correct count is maintained.) Note, I haven't looked at the source code,
>>>> so I am probably wrong.
>>>>
>>>> HTH
>>>> Mike
>>>>
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <[email protected]> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I have some quick questions regarding Hadoop counters,
>>>>
>>>>
>>>> - Is a Hadoop counter (custom defined) globally accessible (for both
>>>> read and write) by all Mappers and Reducers in a job?
>>>> - What are the performance implications and best practices of using
>>>> Hadoop counters? I am not sure whether using Hadoop counters too heavily
>>>> will cause a performance downgrade for the whole job?
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>>
>>>
>>>
>>
>>
>
>
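P.S. To make my first question concrete, below is a rough sketch (using the
new org.apache.hadoop.mapreduce API) of the kind of "automatic" counter
consumption I had in mind, combined with your quality-control example: the
mapper counts bad records in a custom counter, and the driver reads the
aggregated counters after waitForCompletion() and decides whether the flow
should continue. The class name QualityCheckJob, the Quality enum, the
isBad() check and the 5% threshold are all made-up placeholders for
illustration; please treat it as a sketch rather than tested code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QualityCheckJob {

    // Custom counter group for this sketch (made-up names).
    public enum Quality { TOTAL_RECORDS, BAD_RECORDS }

    public static class QualityCheckMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each task bumps its own copy of the counters; the JT aggregates them.
            context.getCounter(Quality.TOTAL_RECORDS).increment(1);
            if (isBad(value)) {
                context.getCounter(Quality.BAD_RECORDS).increment(1);
                return; // drop the bad record
            }
            context.write(value, NullWritable.get());
        }

        // Placeholder validation: treat blank lines as "bad" records.
        private boolean isBad(Text record) {
            return record.toString().trim().isEmpty();
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "quality-check");
        job.setJarByClass(QualityCheckJob.class);
        job.setMapperClass(QualityCheckMapper.class);
        job.setNumReduceTasks(0);                 // map-only quality filter
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean succeeded = job.waitForCompletion(true);

        // After the job completes, read the aggregated counters in the driver
        // and apply the "halt the flow if X% of records are bad" business rule.
        Counters counters = job.getCounters();
        long total = counters.findCounter(Quality.TOTAL_RECORDS).getValue();
        long bad = counters.findCounter(Quality.BAD_RECORDS).getValue();

        if (!succeeded || (total > 0 && (double) bad / total > 0.05)) {
            System.err.println("Halting flow: " + bad + " bad records out of " + total);
            System.exit(1);
        }
        // ...otherwise configure and submit the next job in the chain here,
        // passing these counts along via its Configuration if it needs them.
    }
}

If I understand our discussion correctly, this driver-side check is the only
place the X% rule can be applied, since a running task cannot see the
aggregated values. Is this the kind of automatic consumption you had in mind
in your "halt the job flow if X% of the records contain bad data" example?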
