Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------

===== 4.1.1 Logging =====

- Users will have control over handling of `stderr` of their streaming application. By default, in case of errors, the full error information would be brought to the client and stored in the client-side log.
+ Users will have control over handling of `stderr` of their streaming application by requesting that the `stderr` be stored in DFS for both successful and failed jobs. This is done by adding a `stderr` spec to the streaming command declaration:
- In addition, a user can request that the `stderr` is stored in DFS both for successful and failed jobs. This is done by adding a `stderr` spec to the streaming command declaration:
  {{{
- define CMD `stream.pl` stderr('stream.stderr')
+ define CMD `stream.pl` stderr('<dir>' limit 100)
  }}}
- In this case, the streaming `stderr` will be stored in the _logs directory in the job's output directory. Note that the same Pig job can have multiple streaming applications associated with it. It would be up to the user to make sure that different names are used to avoid conflicts.
+ In this case, the streaming `stderr` will be stored in the _logs/<dir> directory in the job's output directory. Note that the same Pig job can have multiple streaming applications associated with it. It would be up to the user to avoid conflicts by passing a distinct directory to each `stderr` spec.
- Pig would store up to '''500''' logs per streaming job in this location. The limit is imposed to make sure that we don't create a large number of small files in DFS and waste space and name node resources. The user can specify a smaller number via the `limit` keyword in the `stderr` spec.
+ Pig would store logs of up to '''100''' tasks per streaming job in this location (so it's 100*4 = 400 logs, assuming 4 retries per task). The limit is imposed to make sure that we don't create a large number of small files in HDFS and waste space and name node resources. The user can specify a smaller number via the `limit` keyword in the `stderr` spec.
  {{{
- define CMD `stream.pl` stderr('stream.stderr' limit 100)
+ define CMD `stream.pl` stderr('CMD_logs' limit 100)
  }}}
- The logs would only contain stderr information from the streaming application. The content will include a header and a footer. The header will include task name, start time, input size, input file, and input range if available. The footer will contain result code, end time, and primary output size.
+ The logs would only contain stderr information from the streaming application. The content will include a header and a footer. The header will include task name, start time, input size, input file, and input range if available. The footer will contain result code, end time, and the outputs' sizes.

===== 4.1.2 Error Handling =====
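To put the `stderr` spec in context, a complete Pig Latin script using it might look like the sketch below. The script name `stream.pl`, the log directory `CMD_logs`, and the input/output paths are illustrative, not taken from the spec:

{{{
-- Declare the streaming command; stderr from up to 100 tasks is kept
-- under _logs/CMD_logs in the job's output directory in DFS.
define CMD `stream.pl` stderr('CMD_logs' limit 100);

A = load 'input_data';
B = stream A through CMD;
store B into 'output_data';
-- Per-task stderr logs would then land under output_data/_logs/CMD_logs,
-- each with the header/footer information described above.
}}}

Since each `define` carries its own `stderr` spec, a script with several streaming commands would give each one a different directory name to keep their logs apart.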