Hello, I am working with a large dataset of logs (approximately 1.5 TB per month). Each log record contains a list of fields, and a common daily query from users is to filter records where a particular field matches a given value. Right now the log data is not organized in any particular way, which makes these queries very slow. I am restructuring the data by splitting it into multiple smaller buckets so that this common field-based query is faster (a rough sketch of the bucketing idea is below).
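To make the restructuring concrete: the idea is to route each record into a bucket derived from the field that users typically filter on, so a query only has to scan one bucket instead of the whole dataset. The sketch below is only illustrative; the Text key/value types and the assumption that the map output key is the filter-field value are placeholders, not my actual schema.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: send each record to a bucket chosen by the hash of the
// field that the common user query filters on. Key = filter-field value,
// value = the full log record; both kept as Text for simplicity.
public class FieldBucketPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text fieldValue, Text record, int numBuckets) {
        // Non-negative hash of the filter field, spread across numBuckets buckets
        return (fieldValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
}
```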
I have finished restructuring the data, and now I want to measure the performance improvement from this effort. Can someone suggest a way to compare the performance of the MapReduce jobs before and after? I can think of the following metrics:

1. CPU time spent
2. Wall-clock time taken
3. Number of bytes shuffled

Also, is there a way to get these metric values from the command line rather than the web interface? I appreciate any help. Thanks!

-- Prabu D
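P.S. In case it helps clarify what I am after: my understanding is that these numbers can also be read programmatically from the job counters through the MapReduce client API, roughly as in the sketch below. This is a programmatic alternative rather than the CLI itself, and the specific counters I picked (TaskCounter.CPU_MILLISECONDS and TaskCounter.REDUCE_SHUFFLE_BYTES) plus the job-ID argument are my assumptions, so please correct me if this is the wrong approach.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMetricsDump {
    public static void main(String[] args) throws Exception {
        // args[0]: a job ID string such as "job_..." (placeholder, supplied at run time)
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0]));

        Counters counters = job.getCounters();
        // Total CPU time across all tasks, in milliseconds
        long cpuMs = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
        // Bytes pulled by reducers during the shuffle phase
        long shuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();

        // Wall-clock time: job finish time minus start time, from the job status
        JobStatus status = job.getStatus();
        long wallMs = status.getFinishTime() - status.getStartTime();

        System.out.printf("cpu_ms=%d shuffle_bytes=%d wall_clock_ms=%d%n",
                cpuMs, shuffleBytes, wallMs);
    }
}
```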
