[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904453#action_12904453 ]

Richard Ding commented on PIG-1483:
-----------------------------------

Patch committed to trunk.

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> -------------------------------------------------------
>
>          Key: PIG-1483
>          URL: https://issues.apache.org/jira/browse/PIG-1483
>      Project: Pig
>   Issue Type: New Feature
>     Reporter: Richard Ding
>     Assignee: Richard Ding
>      Fix For: 0.8.0
>
>  Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file, so it is now possible to use Pig to query Hadoop job history/xml files for script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects.
>
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage:
>
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
> b = foreach a generate (chararray) j#'PIG_SCRIPT_ID' as id, (chararray) j#'USER' as user, (chararray) j#'JOBID' as job;
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
>
> A couple more examples:
>
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
>
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (long) j#'SUBMIT_TIME' as start, (long) j#'FINISH_TIME' as end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)) / 1000;
> dump d;
> {code}

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
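The last example hides an operator-precedence pitfall: without the outer parentheses, `MIN(b.start)/1000` would be divided before the subtraction. As a sanity check, here is a minimal Python sketch of the same group-and-aggregate computation over made-up records (field names mirror the j map above; times are epoch milliseconds, as in Hadoop job history files):

```python
from collections import defaultdict

# Hypothetical records standing in for the j map the loader would emit;
# SUBMIT_TIME/FINISH_TIME are epoch milliseconds.
jobs = [
    {"PIG_SCRIPT_ID": "s1", "USER": "alice", "JOBNAME": "daily-etl",
     "SUBMIT_TIME": 1_000, "FINISH_TIME": 61_000},
    {"PIG_SCRIPT_ID": "s1", "USER": "alice", "JOBNAME": "daily-etl",
     "SUBMIT_TIME": 61_000, "FINISH_TIME": 121_000},
]

# Equivalent of: group b by (id, user, script_name);
groups = defaultdict(list)
for job in jobs:
    groups[(job["PIG_SCRIPT_ID"], job["USER"], job["JOBNAME"])].append(job)

# Equivalent of: (MAX(b.end) - MIN(b.start)) / 1000 -- the script's wall-clock
# time spans its first submit to its last finish, converted ms -> s.
running_time_s = {
    key: (max(j["FINISH_TIME"] for j in js)
          - min(j["SUBMIT_TIME"] for j in js)) // 1000
    for key, js in groups.items()
}
print(running_time_s)  # {('s1', 'alice', 'daily-etl'): 120}
```

The two sample jobs run back to back over 120 seconds total, so the script-level running time is 120, not the sum of the per-job durations.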
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903491#action_12903491 ]

Olga Natkovich commented on PIG-1483:
-------------------------------------

+1, please, commit
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886537#action_12886537 ]

Richard Ding commented on PIG-1483:
-----------------------------------

Add these additional entries to the first map:
{code}
PIG_JOB_FEATURE, PIG_JOB_ALIAS, PIG_JOB_PARENTS
{code}
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886502#action_12886502 ]

Richard Ding commented on PIG-1483:
-----------------------------------

Usage:
{code}
register piggybank.jar
A = load '' using org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
{code}
where j is a map with the following entries:
{code}
JOBID, JOBNAME, CLUSTER, QUEUE_NAME, STATUS, PIG_VERSION, HADOOP_VERSION, USER, USER_GROUP, HOST_DIR, JOBCONF, PIG_SCRIPT_ID, PIG_SCRIPT, TOTAL_LAUNCHED_MAPS, TOTAL_MAPS, FINISHED_MAPS, FAILED_MAPS, RACK_LOCAL_MAPS, DATA_LOCAL_MAPS, TOTAL_LAUNCHED_REDUCES, TOTAL_REDUCES, FINISHED_REDUCES, FAILED_REDUCES, SUBMIT_TIME, LAUNCH_TIME, FINISH_TIME, MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS, MAP_OUTPUT_BYTES, COMBINE_INPUT_RECORDS, COMBINE_OUTPUT_RECORDS, SPILLED_RECORDS, REDUCE_SHUFFLE_BYTES, REDUCE_INPUT_GROUPS, REDUCE_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS, HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ, FILE_BYTES_WRITTEN
{code}
m is a map with the following entries:
{code}
MAX_MAP_INPUT_ROWS, MIN_MAP_INPUT_ROWS, MAX_MAP_TIME, MIN_MAP_TIME, AVG_MAP_TIME, NUMBER_MAPS
{code}
r is a map with the following entries:
{code}
AVG_REDUCE_TIME, MAX_REDUCE_TIME, NUMBER_REDUCES, MIN_REDUCE_TIME, MIN_REDUCE_INPUT_ROWS, MAX_REDUCE_INPUT_ROWS
{code}
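To make the three-map (j, m, r) layout concrete, here is a small Python sketch that reproduces the "default parallelism" query from the issue description over such records. The data is made up and the field names are taken from the lists above; counter values are kept as strings to mimic how they appear in job history files, hence the casts:

```python
from collections import defaultdict

# Hypothetical (j, m, r) triples in the three-map layout described above.
records = [
    ({"PIG_SCRIPT_ID": "s1", "USER": "bob", "JOBNAME": "wordcount"},
     {"NUMBER_MAPS": "4"}, {"NUMBER_REDUCES": "1"}),
    ({"PIG_SCRIPT_ID": "s1", "USER": "bob", "JOBNAME": "wordcount"},
     {"NUMBER_MAPS": "2"}, {"NUMBER_REDUCES": "1"}),
    ({"PIG_SCRIPT_ID": "s2", "USER": "bob", "JOBNAME": "join"},
     {"NUMBER_MAPS": "8"}, {"NUMBER_REDUCES": "20"}),
]

# Mirror the Pig query: MAX(reduces) per (id, user, script_name), then keep
# only scripts whose jobs never ran more than one reducer.
max_reduces = defaultdict(int)
for j, m, r in records:
    key = (j["PIG_SCRIPT_ID"], j["USER"], j["JOBNAME"])
    max_reduces[key] = max(max_reduces[key], int(r["NUMBER_REDUCES"]))

default_parallel = [key for key, n in max_reduces.items() if n == 1]
print(default_parallel)  # [('s1', 'bob', 'wordcount')]
```

Script s1 only ever uses one reducer across its two jobs, so it is flagged as running at the default parallelism, while s2 (20 reducers) is not.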