[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Attachment: partial_failure.patch

This fix associates stores with MR jobs. At the end of the execution it prints which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz
Successfully stored result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar
Successfully stored result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo
Some jobs have failed!
{noformat}

Error reporting for failed MR jobs
----------------------------------

    Key: PIG-781
    URL: https://issues.apache.org/jira/browse/PIG-781
    Project: Pig
    Issue Type: Improvement
    Reporter: Gunther Hagleitner
    Attachments: partial_failure.patch

If we have multiple MR jobs to run and some of them fail, the system does not stop on the first failure but keeps going. That way, jobs that do not depend on the failed job might still succeed.

The question is how best to report this scenario to a user. How do we tell which jobs failed and which didn't?

One way could be to tie jobs to stores and report which store locations won't have data and which ones do.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
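The idea of tying stores to jobs can be sketched roughly as follows. This is only an illustration in Python, not the code in partial_failure.patch: the `JobResult` type and the `store_report` function are hypothetical names, and the message wording is copied from the example output in the comment.

```python
from dataclasses import dataclass, field


@dataclass
class JobResult:
    """Outcome of one map-reduce job plus the store locations it writes (hypothetical type)."""
    succeeded: bool
    stores: list = field(default_factory=list)


def store_report(jobs):
    """Build the end-of-run summary lines from per-job outcomes."""
    failed_jobs = sum(1 for job in jobs if not job.succeeded)
    lines = []
    if failed_jobs:
        lines.append(f"{failed_jobs} map reduce job(s) failed!")
    # Failed stores are listed first, then the successful ones.
    for job in jobs:
        if not job.succeeded:
            lines.extend(f"Failed to produce result in: {s}" for s in job.stores)
    for job in jobs:
        if job.succeeded:
            lines.extend(f"Successfully stored result in: {s}" for s in job.stores)
    if failed_jobs:
        lines.append("Some jobs have failed!")
    return lines


if __name__ == "__main__":
    demo = [
        JobResult(False, ["/user/hagleitn/baz"]),
        JobResult(True, ["/user/hagleitn/bar", "/user/hagleitn/foo"]),
    ]
    print("\n".join(store_report(demo)))
```

Because every store belongs to exactly one job in this sketch, a failed job marks all of its stores as failed, which matches the "which store locations won't have data" framing of the issue.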
[jira] Issue Comment Edited: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705951#action_12705951 ]

Gunther Hagleitner edited comment on PIG-781 at 5/5/09 1:19 AM:
----------------------------------------------------------------

This fix associates stores with MR jobs. At the end of the execution it will print out which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: /user/hagleitn/baz
Successfully stored result in: /user/hagleitn/bar
Successfully stored result in: /user/hagleitn/foo
Some jobs have failed!
{noformat}

was (Author: hagleitn):
This fix associates stores with MR jobs. At the end of the execution it will print out which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz
Successfully stored result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar
Successfully stored result in: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo
Some jobs have failed!
{noformat}
[jira] Updated: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
[ https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-741:
---------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Patch checked in.

Add LIMIT as a statement that works in nested FOREACH
-----------------------------------------------------

    Key: PIG-741
    URL: https://issues.apache.org/jira/browse/PIG-741
    Project: Pig
    Issue Type: New Feature
    Reporter: David Ciemiewicz
    Assignee: Alan Gates
    Fix For: 0.3.0
    Attachments: PIG-741.patch

I'd like to compute the top 10 results in each group. The natural way to express this in Pig would be:

{code}
A = load '...' using PigStorage() as ( date: int, count: int, url: chararray );
B = group A by ( date );
C = foreach B {
    D = order A by count desc;
    E = limit D 10;
    generate FLATTEN(E);
};
dump C;
{code}

Yeah, I could write a UDF / PiggyBank function to take the top n results. But since LIMIT already exists as a statement, it seems like it should also work in the nested foreach context.

Example workaround code:

{code}
C = foreach B {
    D = order A by count desc;
    E = util.TOP(D, 10);
    generate FLATTEN(E);
};
dump C;
{code}
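For readers less familiar with nested FOREACH, the result the script above is after can be sketched in plain Python. This is only a model of the intended semantics, not of how Pig executes it; `top_n_per_group` is a hypothetical name, and groups are emitted in sorted key order purely to make the output deterministic.

```python
from collections import defaultdict


def top_n_per_group(rows, n):
    """rows are (date, count, url) tuples; return the n highest-count rows for each date."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # B = group A by date
    result = []
    for date in sorted(groups):  # sorted only for a deterministic result
        ordered = sorted(groups[date], key=lambda r: r[1], reverse=True)  # D = order A by count desc
        result.extend(ordered[:n])  # E = limit D n; generate FLATTEN(E)
    return result


if __name__ == "__main__":
    sample = [(1, 5, "a"), (1, 9, "b"), (1, 7, "c"), (2, 3, "d")]
    for row in top_n_per_group(sample, 2):
        print(row)
```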
[jira] Commented: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706083#action_12706083 ]

Olga Natkovich commented on PIG-781:
------------------------------------

Hi Gunther,

The output looks good - this is exactly what we want. This would solve the issue for ad hoc queries; however, we also need to make sure that users can detect this programmatically. This has two parts to it:

(1) The return code they see when a program is partially successful. We need to add a new return code to http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification for this.
(2) A per-output done file, either on DFS or on the local file system, to indicate success.

I think, for now, we should at least do (1). (2) requires more thought to make sure we don't leave done files behind forever.
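Point (1) could look something like the sketch below. The numeric values are placeholders chosen for illustration; the comment only says that a new return code is needed, so the actual codes would be whatever the error handling specification ends up assigning. The function name `exit_code` is likewise hypothetical.

```python
# Hypothetical exit codes, for illustration only.
EXIT_SUCCESS = 0
EXIT_FAILURE = 2
EXIT_PARTIAL_FAILURE = 3


def exit_code(job_outcomes):
    """Map per-job outcomes (True means the job succeeded) to a process exit code."""
    outcomes = list(job_outcomes)
    if all(outcomes):
        return EXIT_SUCCESS
    if any(outcomes):
        # Some jobs succeeded and some failed: report a distinct code so that
        # scripts can tell partial success apart from total failure.
        return EXIT_PARTIAL_FAILURE
    return EXIT_FAILURE
```

A caller such as a shell script or a scheduler could then branch on the partial-failure code and decide whether the outputs that were produced are still usable.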
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706144#action_12706144 ]

Rakesh Setty commented on PIG-794:
----------------------------------

While trying to address the comment about eliminating the AvroValueReader, I noticed that the way pos (the current position in the stream) is being handled is wrong. The position in the stream can only be tracked by the ValueReader (Avro codebase), because Avro does not use DataOutput's methods to store data and its encoding is variable-length. For example, an int can take anywhere from 1 to 5 bytes, while a long can take anywhere from 1 to 10 bytes.

I think we have to ask the Avro team to support this (current position in the stream) for us to proceed with this.

Use Avro serialization in Pig
-----------------------------

    Key: PIG-794
    URL: https://issues.apache.org/jira/browse/PIG-794
    Project: Pig
    Issue Type: Improvement
    Components: impl
    Affects Versions: 0.2.0
    Reporter: Rakesh Setty
    Attachments: AvroBinStorage.patch

We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.
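The 1 to 5 and 1 to 10 byte ranges follow from Avro's variable-length zig-zag encoding of int and long values. The sketch below reimplements that encoding in Python purely to show where the byte counts come from; `zigzag_varint` is a hypothetical helper, not part of any Avro API.

```python
def zigzag_varint(value, bits):
    """Encode a signed integer of the given width (32 or 64) as zig-zag varint bytes."""
    mask = (1 << bits) - 1
    # Zig-zag maps small magnitudes, positive or negative, to small unsigned values.
    n = ((value << 1) ^ (value >> (bits - 1))) & mask
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


if __name__ == "__main__":
    for v in (0, -1, 64, 2**31 - 1):
        print(v, len(zigzag_varint(v, 32)))
    print(-2**63, len(zigzag_varint(-2**63, 64)))
```

Each byte carries 7 payload bits, so a 32-bit value needs at most 5 bytes and a 64-bit value at most 10. The number of bytes consumed therefore depends on the value itself, which is why the reader's position cannot be derived from the field types alone.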
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706197#action_12706197 ]

Doug Cutting commented on PIG-794:
----------------------------------

> I think we have to ask the Avro team to support this (current position in the stream) for us to proceed with this.

ValueReader performs no buffering, so its position is always the same as that of the InputStream it wraps. See DataFileReader#SeekableBufferedInput for an example of an input stream that tracks its position.

Note that AVRO-25 proposes to add buffering to ValueWriter, so that the position of the underlying stream might differ from that of the ValueWriter, but I do not foresee a need to add this to ValueReader, which is the concern here.
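The suggestion amounts to tracking the position in the wrapped stream rather than in the reader. A rough Python analogy is sketched below; `PositionTrackingInput` and `read_varint` are hypothetical names and are not modelled on the actual SeekableBufferedInput code. The approach only works because the decoder reads exactly the bytes it needs and never reads ahead.

```python
import io


class PositionTrackingInput:
    """Wrap a binary stream and count the bytes consumed so far (hypothetical helper)."""

    def __init__(self, stream):
        self._stream = stream
        self.position = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self.position += len(data)
        return data


def read_varint(src):
    """Decode one unsigned base-128 varint from src, one byte at a time."""
    shift = result = 0
    while True:
        byte = src.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # continuation bit clear: last byte of this value
            return result
        shift += 7


if __name__ == "__main__":
    src = PositionTrackingInput(io.BytesIO(b"\x80\x01\x02"))
    print(read_varint(src), src.position)
    print(read_varint(src), src.position)
```

If the decoder were to buffer its input, the wrapper's count would run ahead of the logical position, which is the situation the remark about AVRO-25 and ValueWriter describes for the writing side.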
[jira] Commented: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706198#action_12706198 ]

Olga Natkovich commented on PIG-781:
------------------------------------

We have also been asked to provide an option to fail the entire run as soon as the first job fails. More details to follow.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706220#action_12706220 ]

Rakesh Setty commented on PIG-794:
----------------------------------

This works. Will update the patch.
[jira] Updated: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh Setty updated PIG-794:
----------------------------

    Attachment: AvroStorage.patch

Modified patch
[jira] Updated: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh Setty updated PIG-794:
----------------------------

    Comment: was deleted

(was: Updated patch)
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706226#action_12706226 ]

Hadoop QA commented on PIG-794:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12407285/AvroStorage.patch
against trunk revision 771844.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.

    -1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/30/console

This message is automatically generated.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706244#action_12706244 ]

Olga Natkovich commented on PIG-794:
------------------------------------

Doug, if there is no buffering, then the position in the input stream can be used for now. However, if you are planning to add buffering in the future, it might be good to have an API that just returns the position, so that we don't need to change the code later.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706278#action_12706278 ]

Olga Natkovich commented on PIG-794:
------------------------------------

Hi Rakesh,

Thanks for the update. A few comments:

(1) Thanks for adding comments. They need to be in javadoc style so that we get free documentation from them. You can see examples in other files.
(2) It looks like at least one System.out.println statement got in, I assume by mistake.
(3) It looks like you have some traces logged with log.error instead of log.debug.
(4) You need to attach the Avro library separately. Patches don't work well with binary data.

Also, I am curious whether removing the wrapper class made a performance difference.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706284#action_12706284 ]

Olga Natkovich commented on PIG-794:
------------------------------------

One more thing: since we are adding the Avro library, let's add some unit tests as well.