[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

This fix associates stores with MR jobs. At the end of the execution it will 
print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz
Successfully stored result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar
Successfully stored result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo
Some jobs have failed!
{noformat}


 Error reporting for failed MR jobs
 --

 Key: PIG-781
 URL: https://issues.apache.org/jira/browse/PIG-781
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
 Attachments: partial_failure.patch


 If we have multiple MR jobs to run and some of them fail, the system does not 
 stop on the first failure but keeps going. That way jobs that do not depend 
 on the failed job might still succeed.
 The question is how best to report this scenario to a user. How do we tell 
 which jobs failed and which didn't?
 One way could be to tie jobs to stores and report which store locations won't 
 have data and which ones will.
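
 As a rough illustration of tying jobs to stores, the sketch below records, 
 per store location, whether the MR job writing it succeeded, and then builds 
 the report lines. The class and method names (StoreReport, record, lines) are 
 hypothetical and are not taken from partial_failure.patch; the message 
 wording follows the example output in this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from partial_failure.patch.
class StoreReport {
    // Maps each store location to whether the MR job that writes it succeeded.
    // LinkedHashMap keeps the order in which outcomes were recorded.
    private final Map<String, Boolean> outcomes = new LinkedHashMap<>();

    void record(String storeLocation, boolean jobSucceeded) {
        outcomes.put(storeLocation, jobSucceeded);
    }

    boolean anyFailed() {
        return outcomes.containsValue(Boolean.FALSE);
    }

    // One line per store, followed by a summary line if anything failed.
    List<String> lines() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : outcomes.entrySet()) {
            String prefix = e.getValue()
                    ? "Successfully stored result in: "
                    : "Failed to produce result in: ";
            out.add(prefix + e.getKey());
        }
        if (anyFailed()) {
            out.add("Some jobs have failed!");
        }
        return out;
    }

    public static void main(String[] args) {
        StoreReport report = new StoreReport();
        report.record("/user/hagleitn/baz", false);
        report.record("/user/hagleitn/bar", true);
        report.record("/user/hagleitn/foo", true);
        report.lines().forEach(System.out::println);
    }
}
```

 Keying the report on store locations rather than job ids means the user sees 
 the outcome in terms of the outputs they asked for.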

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705951#action_12705951
 ] 

Gunther Hagleitner edited comment on PIG-781 at 5/5/09 1:19 AM:


This fix associates stores with MR jobs. At the end of the execution it will 
print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: /user/hagleitn/baz
Successfully stored result in: /user/hagleitn/bar
Successfully stored result in: /user/hagleitn/foo
Some jobs have failed!
{noformat}


  was (Author: hagleitn):
This fix associates stores with MR jobs. At the end of the execution it 
will print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz
Successfully stored result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar
Successfully stored result in: 
hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo
Some jobs have failed!
{noformat}

  



[jira] Updated: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-05-05 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-741:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.

 Add LIMIT as a statement that works in nested FOREACH
 -

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: PIG-741.patch


 I'd like to compute the top 10 results in each group.
 The natural way to express this in Pig would be:
 {code}
 A = load '...' using PigStorage() as (
     date: int,
     count: int,
     url: chararray
 );
 B = group A by ( date );
 C = foreach B {
     D = order A by count desc;
     E = limit D 10;
     generate FLATTEN(E);
 };
 dump C;
 {code}
 Yeah, I could write a UDF / PiggyBank function to take the top n results. But 
 since LIMIT already exists as a statement, it seems like it should also work 
 in the nested foreach context.
 Example workaround code:
 {code}
 C = foreach B {
     D = order A by count desc;
     E = util.TOP(D, 10);
     generate FLATTEN(E);
 };
 dump C;
 {code}




[jira] Commented: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706083#action_12706083
 ] 

Olga Natkovich commented on PIG-781:


Hi Gunther,

The output looks good - this is exactly what we want.

This would solve the issue for ad hoc queries; however, we also need to make 
sure that users can detect this programmatically. This has two parts to it.

(1) The return code they see when a program is partially successful. We need to 
add a new return code to 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification for this.
(2) A per-output done file, either on DFS or on the local file system, to 
indicate success.

I think, for now, we should at least do (1). (2) requires more thought to make 
sure we don't leave done files behind forever.
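
To make point (1) concrete, here is a minimal sketch of how a distinct return 
code for partial success could be derived from job outcomes. The class name 
and the numeric values are placeholders chosen for illustration; they are not 
taken from the error handling specification linked above.

```java
// Hypothetical sketch: the numeric codes below are placeholders, not values
// taken from the Pig error handling specification.
class ReturnCodes {
    static final int SUCCESS = 0;
    static final int FAILURE = 2;
    static final int PARTIAL_FAILURE = 3;

    // Picks a return code from the number of MR jobs that succeeded and failed.
    static int fromJobCounts(int succeeded, int failed) {
        if (failed == 0) {
            return SUCCESS;
        }
        return succeeded == 0 ? FAILURE : PARTIAL_FAILURE;
    }

    public static void main(String[] args) {
        // Two jobs succeeded and one failed: report partial failure.
        System.out.println(fromJobCounts(2, 1));
    }
}
```

A calling script could then branch on the process exit status to tell a 
complete failure apart from a run in which some stores were still produced.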




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706144#action_12706144
 ] 

Rakesh Setty commented on PIG-794:
--

While trying to address the comment about eliminating the AvroValueReader, I 
noticed that the way pos (the current position in the stream) is being handled 
is wrong. Only the ValueReader (Avro codebase) can track the position in the 
stream, because Avro stores data in a non-standard way: it does not use 
DataOutput's methods. For example, an integer can take anywhere from 1 to 5 
bytes, while a long can take anywhere from 1 to 10 bytes.
I think we have to ask the Avro team to expose the current position in the 
stream for us to proceed with this.
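
The byte ranges above follow from the variable-length zig-zag coding that the 
Avro specification describes for int and long values. The sketch below is not 
Avro code; the class and method names are made up for illustration, and it 
only shows why the encoded size depends on the value.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; this is not the Avro implementation.
class VarLong {
    // Zig-zag maps values of small magnitude (positive or negative)
    // to small unsigned values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Writes 7 bits per byte, low-order group first. A set high bit
    // means that another byte follows.
    static byte[] encode(long n) {
        long v = zigZag(n);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(1).length);                 // 1
        System.out.println(encode(Integer.MAX_VALUE).length); // 5
        System.out.println(encode(Long.MIN_VALUE).length);    // 10
    }
}
```

Because the width of each value is only known after decoding it, a reader 
cannot compute its position by counting fields; it has to count the bytes it 
actually consumed.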

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706197#action_12706197
 ] 

Doug Cutting commented on PIG-794:
--

> I think we have to ask the Avro team to support this (current position in the 
> stream) for us to proceed with this. 

ValueReader performs no buffering, so its position is always the same as the 
InputStream that it wraps.  See DataFileReader#SeekableBufferedInput for an 
example of an input stream that tracks its position.

Note that AVRO-25 proposes to add buffering to ValueWriter, so that the 
position of the underlying stream might differ from that of the 
ValueWriter, but I do not foresee a need to add this to ValueReader, which is 
the concern here.
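
For readers who have not seen the pattern, here is a minimal sketch of an 
input stream that tracks its own position. It is not the 
DataFileReader#SeekableBufferedInput code mentioned above; the class name 
PositionInputStream and its getPosition method are hypothetical. It relies on 
the wrapped reader doing no buffering of its own, as described for ValueReader.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: counts the bytes consumed from the wrapped stream.
class PositionInputStream extends FilterInputStream {
    private long pos;

    PositionInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            pos += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        pos += skipped;
        return skipped;
    }

    // mark/reset would make the count wrong, so this sketch disables them.
    @Override
    public boolean markSupported() {
        return false;
    }

    long getPosition() {
        return pos;
    }

    public static void main(String[] args) throws IOException {
        PositionInputStream in =
                new PositionInputStream(new ByteArrayInputStream(new byte[16]));
        in.read();
        in.read(new byte[4]);
        in.skip(3);
        System.out.println(in.getPosition());
    }
}
```

A reader layered on such a stream can ask it for the position after each 
value, which only stays accurate as long as the reader does not read ahead.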




[jira] Commented: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706198#action_12706198
 ] 

Olga Natkovich commented on PIG-781:


We have also been asked to provide an option to fail the entire job as soon as 
the first job fails. More details to follow.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706220#action_12706220
 ] 

Rakesh Setty commented on PIG-794:
--

This works. Will update the patch.




[jira] Updated: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Setty updated PIG-794:
-

Attachment: AvroStorage.patch

Modified patch

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch, AvroStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Updated: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Setty updated PIG-794:
-

Comment: was deleted

(was: Updated patch)




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706226#action_12706226
 ] 

Hadoop QA commented on PIG-794:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407285/AvroStorage.patch
  against trunk revision 771844.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/30/console

This message is automatically generated.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706244#action_12706244
 ] 

Olga Natkovich commented on PIG-794:


Doug, if there is no buffering then the position in the input stream can be 
used for now. However, if you are planning to do buffering in the future, it 
might be good to have an API that just gives the position so that later we 
don't need to change the code.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706278#action_12706278
 ] 

Olga Natkovich commented on PIG-794:


Hi Rakesh,

Thanks for the update. A few comments:

(1) Thanks for adding comments. They need to be in javadoc style so that we get 
free documentation from them. You can see examples in other files.
(2) Looks like there is at least one System.out.println statement that got in, 
I assume by mistake.
(3) Looks like you have some traces logged with log.error instead of log.debug.
(4) You need to attach the Avro library separately. Patches don't work well 
with binary data.

Also, I am curious whether removing the wrapper class made a performance 
difference.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706284#action_12706284
 ] 

Olga Natkovich commented on PIG-794:


One more thing: since we are adding the Avro library, let's add some unit tests 
as well.
