[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path
[ https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874629#action_12874629 ] Yan Zhou commented on PIG-1432: --- The patch is based on the 0.7 branch. No test is necessary as this is a trivial fix. [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path Key: PIG-1432 URL: https://issues.apache.org/jira/browse/PIG-1432 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Trivial Fix For: 0.7.0 Attachments: PIG-1432.patch Users redirecting STDOUT to disk file got disk full errors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: does EvalFunc generate the entire bag always ?
I don't think it pushes limit yet in this case.

Alan.

On Jun 1, 2010, at 1:44 PM, hc busy wrote:

Well, see, that's the thing: the 'order A by $0' is already n log(n). Ahh, I see, my own example suffers from this problem. I guess I'm wondering how 'limit' works in conjunction with UDFs... A practical application escapes me right now, but if I do

    C = foreach B { C1 = MyUdf(B.bag_on_b); C2 = limit C1 5; }

does it know to push limit in this case?

On Thu, May 27, 2010 at 2:32 PM, Alan Gates ga...@yahoo-inc.com wrote:

The default case is that UDFs that take bags (such as COUNT, etc.) are handed the entire bag at once. In the case where all UDFs in a foreach implement the algebraic interface and the expression itself is algebraic, then the combiner will be used, thus significantly limiting the size of the bag handed to the UDF. The accumulator does hand records to the UDF a few thousand at a time. Currently it has no way to turn off the flow of records. What you want might be accomplished by the LIMIT operator, which can be used inside a nested foreach. Something like:

    C = foreach B { C1 = order A by $0; C2 = limit C1 5; generate myUDF(C2); }

Alan.

On May 26, 2010, at 11:59 AM, hc busy wrote:

Hey, guys, how are Bags passed to EvalFunc stored? I was looking at the Accumulator interface and it says that the reason why this is needed for COUNT and SUM is that EvalFunc always gives you the entire bag when the EvalFunc is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the code inside that does for(Tuple entry:inputDataBag){ stuff }, it was an actual iterator that iterated on the bag sequentially without necessarily having the entire bag in memory all at once. Because it's an iterator, there's no way to do anything other than stream through it.

I'm looking at this because Accumulator has no way of telling Pig "I've seen enough". It streams through the entire bag no matter what happens. Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF, then after I see the 5th entry of a 5 million entry bag, I want to stop executing if possible. Is there an easy way to make this happen?
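The accumulator behaviour discussed in this thread can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions, not Pig's actual Accumulator interface: `BatchAccumulator`, `NthItem`, `feed`, and the batch size are hypothetical names and values chosen for illustration. It shows that a UDF-like object can stop doing work once it has its answer, while the caller still delivers every remaining batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an accumulator-style UDF; not Pig's real interface. */
interface BatchAccumulator<T, R> {
    void accumulate(List<T> batch); // called once per batch of records
    R getValue();                   // called after the last batch
}

/** Returns the n-th record seen (1-based) and ignores everything after it. */
final class NthItem<T> implements BatchAccumulator<T, T> {
    private final int n;
    private int seen = 0;
    private T result = null;
    int batchesReceived = 0; // exposed only to show that the flow never stops

    NthItem(int n) { this.n = n; }

    @Override
    public void accumulate(List<T> batch) {
        batchesReceived++;
        if (seen >= n) {
            return; // answer already found, but there is no way to tell the caller to stop
        }
        for (T item : batch) {
            if (++seen == n) {
                result = item;
                return;
            }
        }
    }

    @Override
    public T getValue() { return result; }
}

final class AccumulatorSketch {
    /** Hypothetical driver: feeds the whole input in fixed-size batches. */
    static <T, R> R feed(List<T> input, int batchSize, BatchAccumulator<T, R> udf) {
        for (int i = 0; i < input.size(); i += batchSize) {
            udf.accumulate(input.subList(i, Math.min(i + batchSize, input.size())));
        }
        return udf.getValue();
    }

    public static void main(String[] args) {
        List<Integer> sorted = new ArrayList<>();
        for (int i = 1; i <= 10_000; i++) sorted.add(i);
        NthItem<Integer> fifth = new NthItem<>(5);
        System.out.println(feed(sorted, 1000, fifth)); // prints 5
        System.out.println(fifth.batchesReceived);     // prints 10: every batch was delivered
    }
}
```

The early `return` saves the per-record work, but the cost of producing and handing over the later batches remains, which is the limitation raised in the thread.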
[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path
[ https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874726#action_12874726 ] Yan Zhou commented on PIG-1432: ---
Internal Hudson results:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path Key: PIG-1432 URL: https://issues.apache.org/jira/browse/PIG-1432 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Trivial Fix For: 0.7.0 Attachments: PIG-1432.patch Users redirecting STDOUT to disk file got disk full errors.
[jira] Updated: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-282: --- Attachment: CustomPartitionerFinale.patch Added code review comments and some minor changes with test cases. Custom Partitioner -- Key: PIG-282 URL: https://issues.apache.org/jira/browse/PIG-282 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Amir Youssefi Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, CustomPartitionerTest.patch By adding custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to language e.g. PARTITION BY UDF(...) or a similar syntax. UDF returns a number between 0 and n-1 where n is number of output partitions.
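The contract described in the issue (a function that maps a key to a number between 0 and n-1) can be sketched as follows. This is a minimal illustration assuming hash-based assignment; `partitionFor` is a hypothetical name and is not the API added by the attached patches.

```java
final class PartitionSketch {
    /**
     * Maps a key to a partition index in [0, numPartitions).
     * Masking with Integer.MAX_VALUE clears the sign bit so that a negative
     * hashCode() cannot produce a negative index.
     */
    static int partitionFor(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

A custom partitioner would replace the hash with whatever rule the user needs, as long as the result stays inside the same range and is stable for equal keys.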
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1249: Status: Open (was: Patch Available) The latest patch doesn't apply because of a merge conflict. I'll attach a patch that addresses this. Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce.
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1249: Attachment: PIG-1249-4.patch Patch with merge conflict resolution. Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce.
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1249: Status: Patch Available (was: Open) Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce.
[jira] Updated: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-282: --- Status: Patch Available (was: Open) Custom Partitioner -- Key: PIG-282 URL: https://issues.apache.org/jira/browse/PIG-282 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Amir Youssefi Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, CustomPartitionerTest.patch By adding custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to language e.g. PARTITION BY UDF(...) or a similar syntax. UDF returns a number between 0 and n-1 where n is number of output partitions.
[jira] Updated: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-282: --- Status: Open (was: Patch Available) Custom Partitioner -- Key: PIG-282 URL: https://issues.apache.org/jira/browse/PIG-282 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Amir Youssefi Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, CustomPartitionerTest.patch By adding custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to language e.g. PARTITION BY UDF(...) or a similar syntax. UDF returns a number between 0 and n-1 where n is number of output partitions.
algebraic optimization not invoked for filter following group?
It looks like right now, the combiner optimization does not kick in for a script like this:

    data = load 'foo' using PigStorage() as (a, b, c);
    grouped = group data by a;
    filtered = filter grouped by COUNT(data) 1000;

Looking at the code in CombinerOptimizer, it seems the Filter bit is just pseudo-coded in comments. Are there complications there other than what is already noted, or is it just a matter of coding up the pseudo-code? On that note: assuming the optimization was implemented for Filter following group, would it automagically start working for Splits as well? -D
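For context, the reason a COUNT-based filter is a candidate for the combiner at all is that COUNT is algebraic: partial counts can be computed on chunks of a group and merged later. The plain Java sketch below illustrates that property only. It is not the CombinerOptimizer code, the method names are hypothetical, and the greater-than comparison is an assumption, since the comparison operator did not survive in the script above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlgebraicCountSketch {
    /** Combiner-side step: count the records in one chunk of a group. */
    static long partialCount(List<?> chunk) {
        return chunk.size();
    }

    /** Reduce-side step: merge the partial counts produced for the same group. */
    static long mergeCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    /** Filter predicate applied to the merged count; the comparison is an assumption. */
    static boolean keepGroup(long count, long threshold) {
        return count > threshold;
    }

    public static void main(String[] args) {
        // Three chunks of the same group, as if seen by three different map tasks.
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("r1", "r2"),
                Arrays.asList("r3"),
                Arrays.asList("r4", "r5", "r6"));
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(partialCount(chunk));
        }
        long total = mergeCounts(partials);
        System.out.println(total + " kept=" + keepGroup(total, 5)); // prints "6 kept=true"
    }
}
```

Only the small partial counts need to cross the shuffle; the filter can then be evaluated on the merged value without ever materializing the full bag.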
[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874865#action_12874865 ] Dmitriy V. Ryaboy commented on PIG-1428: I notice that the issue has been discussed before in PIG-889, and Santosh argued (convincingly) that adding this method to PigLogger might not make sense. Santosh, would you like to suggest a different place to put this functionality? I am not married to using this method; it's just the path of least resistance. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress etc. from UDFs.
[jira] Commented: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path
[ https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874871#action_12874871 ] Gaurav Jain commented on PIG-1432: -- +1 [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path Key: PIG-1432 URL: https://issues.apache.org/jira/browse/PIG-1432 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Trivial Fix For: 0.7.0 Attachments: PIG-1432.patch Users redirecting STDOUT to disk file got disk full errors.
[jira] Updated: (PIG-1432) [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path
[ https://issues.apache.org/jira/browse/PIG-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1432: -- Status: Resolved (was: Patch Available) Fix Version/s: 0.8.0 Resolution: Fixed Committed to both the 0.7 branch and trunk. On trunk, TableStorer itself does not write to STDOUT, but the other two occurrences, in the key generator called by TableStorer, are still present. [zebra] There are some debuging info output to STDOUT in PIG's TableStorer call path Key: PIG-1432 URL: https://issues.apache.org/jira/browse/PIG-1432 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Trivial Fix For: 0.8.0, 0.7.0 Attachments: PIG-1432.patch Users redirecting STDOUT to disk file got disk full errors.
[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874903#action_12874903 ] Jeff Zhang commented on PIG-1249: - Alan, thanks for your help. Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce.
[jira] Created: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- Key: PIG-1433 URL: https://issues.apache.org/jira/browse/PIG-1433 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.8.0 pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[jira] Updated: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[ https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1433: Status: Patch Available (was: Open) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- Key: PIG-1433 URL: https://issues.apache.org/jira/browse/PIG-1433 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.8.0 Attachments: PIG-1433.patch pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[jira] Updated: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[ https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1433: Attachment: PIG-1433.patch The attached patch addresses the issue in MapReduceLauncher by creating a _SUCCESS file for stores which are part of successful jobs, if the property is set in the job. pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- Key: PIG-1433 URL: https://issues.apache.org/jira/browse/PIG-1433 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.8.0 Attachments: PIG-1433.patch pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
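The behaviour described in this comment can be sketched in plain Java as below. This is an illustration only, not the code in PIG-1433.patch: it uses the local filesystem and `java.util.Properties` as stand-ins for HDFS and the job configuration, `markIfRequested` is a hypothetical name, and treating an unset property as false is an assumption of the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class SuccessMarkerSketch {
    static final String MARK_PROPERTY = "mapreduce.fileoutputcommitter.marksuccessfuljobs";
    static final String MARKER_NAME = "_SUCCESS";

    /**
     * Creates an empty _SUCCESS file in outputDir when the job succeeded and the
     * property is set to true. Returns whether a marker was requested and is now present.
     */
    static boolean markIfRequested(Properties jobConf, Path outputDir, boolean jobSucceeded)
            throws IOException {
        // Assumption of this sketch: an unset property means "do not mark".
        boolean requested = Boolean.parseBoolean(jobConf.getProperty(MARK_PROPERTY, "false"));
        if (!jobSucceeded || !requested) {
            return false;
        }
        Path marker = outputDir.resolve(MARKER_NAME);
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("store-output");
        Properties conf = new Properties();
        conf.setProperty(MARK_PROPERTY, "true");
        System.out.println(markIfRequested(conf, out, true));       // prints true
        System.out.println(Files.exists(out.resolve(MARKER_NAME))); // prints true
    }
}
```

Downstream tools can then check for the marker file instead of inspecting job status, which is the usual reason for wanting it written per store location.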