[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path

2010-06-02 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874629#action_12874629
 ] 

Yan Zhou commented on PIG-1432:
---

The patch is based on the 0.7 branch. No test is necessary as this is a 
trivial fix.

 [zebra] There are some debuging info output to STDOUT in PIG's TableStorer 
 call path
 

 Key: PIG-1432
 URL: https://issues.apache.org/jira/browse/PIG-1432
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Trivial
 Fix For: 0.7.0

 Attachments: PIG-1432.patch


 Users redirecting STDOUT to disk file got disk full errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: does EvalFunc generate the entire bag always ?

2010-06-02 Thread Alan Gates

I don't think it pushes limit yet in this case.

Alan.

On Jun 1, 2010, at 1:44 PM, hc busy wrote:


well, see that's the thing, the 'sort A by $0' is already n lg(n)

ahh, I see, my own example suffers from this problem.

I guess I'm wondering how 'limit' works in conjunction with UDFs... A
practical application escapes me right now, but if I do

C = foreach B{
  C1 = MyUdf(B.bag_on_b);
  C2 = limit C1 5;
}

does it know to push limit in this case?


On Thu, May 27, 2010 at 2:32 PM, Alan Gates ga...@yahoo-inc.com wrote:


The default case is that UDFs that take bags (such as COUNT, etc.) are
handed the entire bag at once.  In the case where all UDFs in a foreach
implement the algebraic interface and the expression itself is algebraic,
then the combiner will be used, thus significantly limiting the size of the
bag handed to the UDF.  The accumulator does hand records to the UDF a few
thousand at a time.  Currently it has no way to turn off the flow of
records.

What you want might be accomplished by the LIMIT operator, which can be
used inside a nested foreach.  Something like:

C = foreach B {
  C1 = order A by $0;
  C2 = limit C1 5;
  generate myUDF(C2);
}

Alan.
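
A minimal sketch of the difference discussed in this thread, written in plain Java rather than against Pig's classes. All names here (EarlyExitSketch, FifthItemAccumulator, fifthByPull, sortedBag) are hypothetical and only illustrate the idea: a push-style consumer that is fed batches cannot tell the producer to stop, while a pull-style loop over an iterator can break early. It says nothing about whether the bag is already materialized before the loop starts.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only; these are not Pig's EvalFunc or Accumulator types. */
final class EarlyExitSketch {

    /** Push style: the caller decides how many batches are delivered. */
    static final class FifthItemAccumulator {
        private int seen = 0;
        private Integer fifth = null;

        void accumulate(List<Integer> batch) {
            for (Integer value : batch) {
                seen++;
                if (seen == 5) {
                    fifth = value;
                }
                // There is no return value or flag to say "enough",
                // so later batches keep arriving and are simply ignored.
            }
        }

        Integer getValue() {
            return fifth;
        }

        int recordsSeen() {
            return seen;
        }
    }

    /** Pull style: the consumer owns the loop and stops after five records. */
    static Integer fifthByPull(Iterator<Integer> records, int[] pulled) {
        while (records.hasNext()) {
            Integer value = records.next();
            pulled[0]++;
            if (pulled[0] == 5) {
                return value;
            }
        }
        return null; // fewer than five records
    }

    /** Builds an already sorted stand-in for a bag: 0, 1, 2, ... */
    static List<Integer> sortedBag(int size) {
        List<Integer> bag = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            bag.add(i);
        }
        return bag;
    }
}
```

Feeding a 10,000-element list to the accumulator in batches of 1,000 touches every record, while the pull loop touches five. In both cases the work of building and sorting the input has already been paid.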


On May 26, 2010, at 11:59 AM, hc busy wrote:

Hey, guys, how are Bags passed to EvalFunc stored?

I was looking at the Accumulator interface and it says that the reason why
this is needed for COUNT and SUM is because EvalFunc always gives you the
entire bag when the EvalFunc is run on a bag.

I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the code
inside that does

for(Tuple entry:inputDataBag){
 stuff
}

was an actual iterator that iterated on the bag sequentially without
necessarily having the entire bag in memory all at once. ?? Because it's an
iterator, so there's no way to do anything other than to stream through it.

I'm looking at this because Accumulator has no way of telling Pig I've seen
enough. It streams through the entire bag no matter what happens. (Like,
hypothetically speaking, if I was writing a "5th item of a sorted bag" udf,
after I see the 5th of a 5 million entry bag, I want to stop executing if
possible.)

Is there an easy way to make this happen?








[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path

2010-06-02 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874726#action_12874726
 ] 

Yan Zhou commented on PIG-1432:
---

Internal Hudson results:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new or modified tests.
 [exec] Please justify why no tests are needed for this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.


 [zebra] There are some debuging info output to STDOUT in PIG's TableStorer 
 call path
 

 Key: PIG-1432
 URL: https://issues.apache.org/jira/browse/PIG-1432
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Trivial
 Fix For: 0.7.0

 Attachments: PIG-1432.patch


 Users redirecting STDOUT to disk file got disk full errors.




[jira] Updated: (PIG-282) Custom Partitioner

2010-06-02 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-282:
---

Attachment: CustomPartitionerFinale.patch

Added code review comments and some minor changes with test cases.

 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Amir Youssefi
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
 CustomPartitionerTest.patch


 By adding custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to language e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
 of output partitions.
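
A minimal sketch of the contract described above, in plain Java: a function that maps a key to a partition number between 0 and n-1. The class and method names (PartitionSketch, partitionFor) are hypothetical, and hashing is only one possible policy; the proposed PARTITION BY syntax and the attached patches may work differently.

```java
/** Illustrative only; not the API proposed in PIG-282. */
final class PartitionSketch {

    /**
     * Returns a partition index in the range [0, numPartitions).
     * Masking with Integer.MAX_VALUE clears the sign bit so that keys
     * with a negative hash code still map to a non-negative index.
     */
    static int partitionFor(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The same key always lands in the same partition for a given n, which is the property a reducer-side grouping relies on.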




[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Status: Open  (was: Patch Available)

The latest patch doesn't apply because of a merge conflict.  I'll attach a 
patch that addresses this.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 
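
One possible shape for such a safeguard, sketched in plain Java: derive a reducer count from the input size when the script gives no PARALLEL clause. The names (ReducerEstimateSketch, estimateReducers) and the numbers used in the usage note are assumptions for illustration, not a description of the attached patches.

```java
/** Illustrative heuristic only; not taken from the PIG-1249 patches. */
final class ReducerEstimateSketch {

    /**
     * Picks ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers].
     */
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        // Ceiling division written to avoid overflow for very large inputs.
        long needed = inputBytes / bytesPerReducer;
        if (inputBytes % bytesPerReducer != 0) {
            needed++;
        }
        return (int) Math.max(1L, Math.min((long) maxReducers, needed));
    }
}
```

With an assumed budget of 1,000,000,000 bytes per reducer and an assumed cap of 999, a 10 TB input would be clamped to the cap rather than run on a single reducer.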




[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Attachment: PIG-1249-4.patch

Patch with merge conflict resolution.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 




[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Status: Patch Available  (was: Open)

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 




[jira] Updated: (PIG-282) Custom Partitioner

2010-06-02 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-282:
---

Status: Patch Available  (was: Open)

 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Amir Youssefi
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
 CustomPartitionerTest.patch


 By adding custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to language e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
 of output partitions.




[jira] Updated: (PIG-282) Custom Partitioner

2010-06-02 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-282:
---

Status: Open  (was: Patch Available)

 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Amir Youssefi
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
 CustomPartitionerTest.patch


 By adding custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to language e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
 of output partitions.




algebraic optimization not invoked for filter following group?

2010-06-02 Thread Dmitriy Ryaboy

It looks like right now, the combiner optimization does not kick in for a
script like this:

data = load 'foo' using PigStorage() as (a, b, c);
grouped = group data by a;
filtered = filter grouped by COUNT(data) > 1000;

Looking at the code in CombinerOptimizer, it seems like the Filter bit is just
pseudo-coded in comments. Are there complications there other than what is
already noted, or is it just a matter of coding up the pseudo-code?

On that note -- assuming the optimization was implemented for Filter
following group, would it automagically start working for Splits, as well?

-D
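
To make the question concrete, here is a small plain-Java sketch of why a count-based filter is a natural fit for the combiner: partial counts can be summed in any grouping, and the predicate only needs the final sum. The names (AlgebraicCountSketch, initial, intermediate, passesFilter) are hypothetical and do not mirror CombinerOptimizer or Pig's Algebraic interface, and the "greater than a threshold" predicate is an assumption made for the example.

```java
import java.util.List;

/** Illustrative only; not Pig's Algebraic interface or CombinerOptimizer. */
final class AlgebraicCountSketch {

    /** Map side: count the records of one chunk of a group. */
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    /** Combiner side: merge partial counts into one partial count. */
    static long intermediate(List<Long> partialCounts) {
        long sum = 0;
        for (long partial : partialCounts) {
            sum += partial;
        }
        return sum;
    }

    /**
     * Reduce side: merge whatever partial counts remain, then apply the
     * filter predicate to the total. No bag of records is needed here.
     */
    static boolean passesFilter(List<Long> partialCounts, long threshold) {
        return intermediate(partialCounts) > threshold;
    }
}
```

Because summation is associative, combining the partial counts in one step or in several gives the same total, so the filter sees the same value either way.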


[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874865#action_12874865
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


I notice that the issue has been discussed before in PIG-889, and Santosh 
argued (convincingly) that adding this method to PigLogger might not make 
sense. Santosh, would you like to suggest a different place to put this 
functionality? I am not married to using this method, it's just the path of 
least resistance.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 




[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path

2010-06-02 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874871#action_12874871
 ] 

Gaurav Jain commented on PIG-1432:
--


+1

 [zebra] There are some debuging info output to STDOUT in PIG's TableStorer 
 call path
 

 Key: PIG-1432
 URL: https://issues.apache.org/jira/browse/PIG-1432
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Trivial
 Fix For: 0.7.0

 Attachments: PIG-1432.patch


 Users redirecting STDOUT to disk file got disk full errors.




[jira] Updated: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path

2010-06-02 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1432:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

Committed to both the 0.7 branch and trunk. On trunk, TableStorer itself does 
not write to STDOUT, but the two other occurrences in the key generator called 
by TableStorer were still present.

 [zebra] There are some debuging info output to STDOUT in PIG's TableStorer 
 call path
 

 Key: PIG-1432
 URL: https://issues.apache.org/jira/browse/PIG-1432
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Trivial
 Fix For: 0.8.0, 0.7.0

 Attachments: PIG-1432.patch


 Users redirecting STDOUT to disk file got disk full errors.




[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874903#action_12874903
 ] 

Jeff Zhang commented on PIG-1249:
-

Alan, thanks for your help.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 




[jira] Created: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-02 Thread Pradeep Kamath (JIRA)

pig should create success file if 
mapreduce.fileoutputcommitter.marksuccessfuljobs is true
--

 Key: PIG-1433
 URL: https://issues.apache.org/jira/browse/PIG-1433
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0


pig should create success file if 
mapreduce.fileoutputcommitter.marksuccessfuljobs is true




[jira] Updated: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1433:


Status: Patch Available  (was: Open)

 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true
 --

 Key: PIG-1433
 URL: https://issues.apache.org/jira/browse/PIG-1433
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1433.patch


 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true




[jira] Updated: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1433:


Attachment: PIG-1433.patch

Attached patch addresses the issue in MapReduceLauncher by creating a _SUCCESS 
file for stores which are part of successful jobs if the property is set in the 
job.
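
The behaviour described above can be sketched on a local filesystem in plain Java. This is only an analogy under stated assumptions: the class and method names (SuccessMarkerSketch, markIfRequested) are hypothetical, java.util.Properties stands in for the job configuration, and java.nio stands in for the Hadoop filesystem. It is not the code in PIG-1433.patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Local-filesystem analogy; not the MapReduceLauncher change itself. */
final class SuccessMarkerSketch {

    static final String MARK_PROPERTY =
            "mapreduce.fileoutputcommitter.marksuccessfuljobs";
    static final String MARKER_NAME = "_SUCCESS";

    /**
     * Creates an empty _SUCCESS file in outputDir when the job succeeded
     * and the property is set to true. Returns whether a marker is in place.
     */
    static boolean markIfRequested(Properties jobConf, Path outputDir, boolean jobSucceeded)
            throws IOException {
        boolean requested =
                Boolean.parseBoolean(jobConf.getProperty(MARK_PROPERTY, "false"));
        if (!requested || !jobSucceeded) {
            return false;
        }
        Path marker = outputDir.resolve(MARKER_NAME);
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
        return true;
    }
}
```

A downstream consumer can then poll for the marker instead of guessing whether the store location is complete.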

 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true
 --

 Key: PIG-1433
 URL: https://issues.apache.org/jira/browse/PIG-1433
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1433.patch


 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true
