[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1531:
----------------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Committed to both trunk and 0.8. Thanks, Niraj!

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>         Project: Pig
>      Issue Type: Bug
> Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>        Assignee: niraj rai
>         Fix For: 0.8.0
>     Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG-1531_5.patch, PIG_1531.patch, PIG_1531_2.patch
>
> Consider the following. I have my own Storer implementing StoreFunc, and I throw FrontendException (and other exceptions derived from PigException) in its various methods. I expect those error messages to be shown in error scenarios. Instead, Pig gobbles up my error messages and shows its own generic error message, like:
> {code}
> 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2116: Unexpected error. Could not validate the output specification for: default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead, I expect it to display my error messages, which it stores away in that log file.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
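The failure mode described above (a user exception replaced by a generic ERROR 2116) comes down to how the caught exception is rewrapped. Below is a minimal, Pig-free Java sketch contrasting the two patterns; the class and method names are illustrative stand-ins, not Pig source:

```java
// Self-contained sketch: swallowing a user exception vs. rethrowing it
// with its message (and cause) preserved. Names are hypothetical.
public class ErrorPropagation {

    static class UserStoreException extends Exception {
        UserStoreException(String msg) { super(msg); }
    }

    // The buggy pattern: the user's message is discarded entirely.
    static String validateSwallowing() {
        try {
            throw new UserStoreException("partition column 'dt' missing");
        } catch (UserStoreException e) {
            return "ERROR 2116: Unexpected error. Could not validate the output specification";
        }
    }

    // The fixed pattern: wrap, but keep the original message and cause chain.
    static String validateRethrowing() {
        try {
            throw new UserStoreException("partition column 'dt' missing");
        } catch (UserStoreException e) {
            RuntimeException wrapped =
                new RuntimeException("Output specification failed: " + e.getMessage(), e);
            return wrapped.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(validateSwallowing());
        System.out.println(validateRethrowing());
    }
}
```

The second pattern is what "rethrow up the hierarchy" amounts to: each layer that wraps the exception must carry the underlying message forward instead of substituting its own.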
[jira] Commented: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915408#action_12915408 ]

Ashutosh Chauhan commented on PIG-1641:
---------------------------------------

Tested manually in local mode. Messages were the same as proposed above. +1 for the commit.

One minor suggestion: put a line at the start saying something like "Detected local mode. Stats reported below may be incomplete." This will reinforce to users that stats reporting is not uniform across modes (local vs. map-reduce).

> Incorrect counters in local mode
> --------------------------------
>
>             Key: PIG-1641
>             URL: https://issues.apache.org/jira/browse/PIG-1641
>         Project: Pig
>      Issue Type: Bug
> Affects Versions: 0.8.0
>        Reporter: Ashutosh Chauhan
>        Assignee: Richard Ding
>         Fix For: 0.8.0
>     Attachments: PIG-1641.patch
>
> User report, not verified:
> {quote}
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
>
> Success!
>
> Job Stats (time in seconds):
> JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
> job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
> job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
> job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
>
> Input(s):
> Successfully read 0 records from: Data/Raw/UserVisits.dat
> Output(s):
> Successfully stored 0 records in: Processed/user_visits_table
>
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
> {quote}
> Is it that counters are not available in local mode? If so, instead of printing zeros we should print "Information Unavailable" or some such.
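The two suggestions above (flag local mode up front, and avoid misleading zeros when a counter is simply unavailable) can be sketched as follows; this is a hypothetical illustration, not Pig's actual stats-reporting code:

```java
// Sketch: render a counter as "n/a" when its value could not be
// retrieved (as in local mode), instead of a misleading 0, and emit a
// warning header in local mode. Names are illustrative, not Pig's.
public class StatsLine {

    // null models "counter unavailable"; a real value prints as-is.
    static String formatCounter(Long value) {
        return value == null ? "n/a" : value.toString();
    }

    static String header(boolean localMode) {
        return localMode
            ? "Detected local mode. Stats reported below may be incomplete."
            : "";
    }

    public static void main(String[] args) {
        System.out.println(header(true));
        System.out.println("Records read: " + formatCounter(null));
        System.out.println("Records read: " + formatCounter(42L));
    }
}
```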
[jira] Created: (PIG-1641) Incorrect counters in local mode
Incorrect counters in local mode
--------------------------------

            Key: PIG-1641
            URL: https://issues.apache.org/jira/browse/PIG-1641
        Project: Pig
     Issue Type: Bug
Affects Versions: 0.8.0
       Reporter: Ashutosh Chauhan

User report, not verified; the full report (job stats showing all-zero counters despite a 20G input and a 7.3G output file) is quoted in the comment above. Is it that counters are not available in local mode? If so, instead of printing zeros we should print "Information Unavailable" or some such.
[jira] Commented: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913048#action_12913048 ]

Ashutosh Chauhan commented on PIG-1531:
---------------------------------------

Oh Hudson, oh well... Ran the full 400-minute suite of unit tests; all passed. The patch is ready for review.

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Reopened: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reopened PIG-1531:
-----------------------------------

Peril of not writing a unit test: resurrection of the bug. Argh.

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1531:
----------------------------------
    Status: Patch Available  (was: Reopened)

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1531:
----------------------------------
    Attachment: pig-1531_4.patch

Added a test case which fails on trunk: Pig still gobbles up the error messages. The fix is to rethrow the message up the hierarchy. The attached patch contains the test case and the fix.

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'
[ https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905995#action_12905995 ]

Ashutosh Chauhan commented on PIG-1590:
---------------------------------------

Also, an inner merge join on more than two tables translates into POMergeCogroup + ForEach + Flatten. It too could be translated to use POMergeJoin and enjoy the benefits that come with it. Though I suspect that would introduce much more complexity in POMergeJoin than the left-outer-merge-join case, so it may not be worth doing.

> Use POMergeJoin for Left Outer Join when join using 'merge'
> -----------------------------------------------------------
>
>             Key: PIG-1590
>             URL: https://issues.apache.org/jira/browse/PIG-1590
>         Project: Pig
>      Issue Type: Improvement
>      Components: impl
> Affects Versions: 0.8.0
>        Reporter: Ashutosh Chauhan
>        Priority: Minor
>
> C = join A by $0 left, B by $0 using 'merge'; will result in a map-side sort-merge join. Internally, it translates to POMergeCogroup + ForEach-Flatten. POMergeCogroup places quite a few restrictions on its loaders (A and B in this case), which is cumbersome; currently, only Zebra is known to satisfy all those requirements. It would be better to use POMergeJoin in this case, since it has far fewer requirements on its loader. Importantly, it works with PigStorage. Plus, POMergeJoin will be faster than POMergeCogroup + ForEach-Flatten.
[jira] Created: (PIG-1598) Pig gobbles up error messages - Part 2
Pig gobbles up error messages - Part 2
--------------------------------------

        Key: PIG-1598
        URL: https://issues.apache.org/jira/browse/PIG-1598
    Project: Pig
 Issue Type: Improvement
   Reporter: Ashutosh Chauhan

Another case of PIG-1531.
[jira] Created: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'
Use POMergeJoin for Left Outer Join when join using 'merge'
-----------------------------------------------------------

            Key: PIG-1590
            URL: https://issues.apache.org/jira/browse/PIG-1590
        Project: Pig
     Issue Type: Improvement
     Components: impl
Affects Versions: 0.8.0
       Reporter: Ashutosh Chauhan
       Priority: Minor

C = join A by $0 left, B by $0 using 'merge'; currently results in a map-side sort-merge join implemented as POMergeCogroup + ForEach-Flatten. POMergeCogroup places cumbersome restrictions on its loaders (only Zebra is known to satisfy them all). It would be better to use POMergeJoin here, since it has far fewer loader requirements, works with PigStorage, and will be faster than POMergeCogroup + ForEach-Flatten.
[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'
[ https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905207#action_12905207 ]

Ashutosh Chauhan commented on PIG-1590:
---------------------------------------

It will entail changes in POMergeJoin and LogToPhyTranslationVisitor.

> Use POMergeJoin for Left Outer Join when join using 'merge'
> -----------------------------------------------------------
>
>             Key: PIG-1590
>             URL: https://issues.apache.org/jira/browse/PIG-1590
[jira] Commented: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904497#action_12904497 ]

Ashutosh Chauhan commented on PIG-1531:
---------------------------------------

Niraj ran all the unit tests; all passed. No complaints from test-patch either. Committed to trunk. Thanks, Niraj!

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1531:
----------------------------------
    Attachment: pig-1531_3.patch

I took a look at the latest patch. There are two minor problems. First, pigExec was always null and never assigned a value, which resulted in an NPE in a certain code path. Second, the boolean logic in PigInputFormat needs && instead of ||. I thought of correcting these and committing, but then realized Hudson hasn't come back with results yet, so I am uploading a new patch with those corrections and submitting to Hudson again. In this patch I also refactored the code a bit, so it is easier to read. Have a look, and if it looks fine to you, run test-patch and the unit tests and paste the results here so I can commit it.

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Updated: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1531:
----------------------------------
    Status: Patch Available  (was: Open)

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
[jira] Commented: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902576#action_12902576 ]

Ashutosh Chauhan commented on PIG-1531:
---------------------------------------

* In addition to the error message, you also need to set an error code on the exception you are throwing.
* Since you are catching exceptions thrown by user code (the StoreFunc interface), it is not safe to assume that e.getMessage() will be a non-null, non-empty string; that assumption will result in an NPE. You need to check for it and provide a generic error message in those cases.
* The generic error message should also contain the output location string, since if the user didn't provide a message, the location won't get printed. So you could reword the message as: "Output location validation failed for: <location>. More information to follow:"
* Since PigException extends IOException, the IOException you are catching can also be a PigException; you need to test whether it is, and then set the message and error code accordingly.
* In the case of a non-existent input location, I am still seeing the generic message: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: file:///Users/chauhana/workspace/pig-1531/a. The full stack trace printed at the end does contain the underlying error string, but that is more confusing, because now there are three different error messages amid a Java stack trace.
* This warrants a test case for regression purposes. (In fact, error-reporting behavior has already changed since the time I opened this bug.)

> Pig gobbles up error messages
> -----------------------------
>
>             Key: PIG-1531
>             URL: https://issues.apache.org/jira/browse/PIG-1531
>     Attachments: PIG_1531.patch
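The two defensive points above (never trust e.getMessage() from user code, and check whether a caught IOException is really a PigException carrying its own code) can be sketched in self-contained Java. PigLikeException is a hypothetical stand-in for Pig's PigException, and the error codes are taken from the messages quoted above:

```java
import java.io.IOException;

// Sketch: null-safe message extraction plus instanceof dispatch on a
// PigException-like subclass of IOException. Illustrative names only.
public class SafeWrap {

    static class PigLikeException extends IOException {
        final int errorCode;
        PigLikeException(String msg, int code) { super(msg); errorCode = code; }
    }

    static String describe(IOException e, String location) {
        // User code may throw with a null or empty message: fall back.
        String msg = e.getMessage();
        if (msg == null || msg.isEmpty()) {
            msg = "Output location validation failed for: " + location;
        }
        // A caught IOException may already carry a Pig error code: keep it.
        int code = (e instanceof PigLikeException)
                ? ((PigLikeException) e).errorCode
                : 2116;
        return "ERROR " + code + ": " + msg;
    }

    public static void main(String[] args) {
        System.out.println(describe(new IOException(), "hdfs://out"));
        System.out.println(describe(new PigLikeException("bad schema", 2118), "hdfs://out"));
    }
}
```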
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:
----------------------------------
    Release Note:
With this patch, it is now possible to perform a map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys, for all tables, in ascending order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.
Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side outer joins using 'merge' (PIG-1353) as well.
Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
C = COGROUP A by id, B by id using 'merge';

  was:
With this patch, it is now possible to perform a map-side cogroup if the data is sorted and one of the loaders implements the CollectableLoader interface. The primary algorithm is based on sort-merge join. Additional implementation details:
1) No other operations can be done between the load and join statements.
2) Data must be sorted in ascending order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.
Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side cogroups (PIG-1309) as well.
Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = COGROUP A by id, B by id using 'merge';

> Map-side Cogroup
> ----------------
>
>             Key: PIG-1309
>             URL: https://issues.apache.org/jira/browse/PIG-1309
>         Project: Pig
>      Issue Type: Bug
>      Components: impl
>        Reporter: Ashutosh Chauhan
>        Assignee: Ashutosh Chauhan
>         Fix For: 0.7.0, 0.8.0
>     Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It is already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow.
[jira] Updated: (PIG-1353) Map-side outer joins
[ https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1353:
----------------------------------
    Release Note:
With this patch, it is now possible to perform [left|right|full] outer joins on two tables, as well as inner joins on more than two tables, map-side in Pig, if the data is sorted and the loaders implement the required interfaces. The primary algorithm is based on sort-merge join. The following preconditions should be met in order to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys in ascending order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.
Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side cogroups (PIG-1309) as well.
Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
C = join A by id left, B by id using 'merge';

  was:
With this patch, it is now possible to perform [left|right|full] outer joins on two tables, as well as inner joins on more than two tables, map-side in Pig, if the data is sorted and one of the loaders implements the CollectableLoader interface. The primary algorithm is based on sort-merge join. Additional implementation details:
1) No other operations can be done between the load and join statements.
2) Data must be sorted in ascending order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.
Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side cogroups (PIG-1309) as well.
Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by id left, B by id using 'merge';

> Map-side outer joins
> --------------------
>
>             Key: PIG-1353
>             URL: https://issues.apache.org/jira/browse/PIG-1353
>         Project: Pig
>      Issue Type: Improvement
>      Components: impl
>        Reporter: Ashutosh Chauhan
>        Assignee: Ashutosh Chauhan
>         Fix For: 0.8.0
>     Attachments: pig-1353.patch, pig-1353.patch
>
> Pig already has a couple of map-side join implementations: merge join and fragment-replicate join. But both of them are pretty restrictive. Merge join can join only two tables, and can only do inner join. FR join can join multiple relations, but it too can only do inner and left outer joins; further, it restricts the sizes of the side relations. It would be nice if we could do map-side joins on multiple tables, as well as inner, left outer, right outer and full outer joins. A lot of the groundwork for this has already been done in PIG-1309; the remaining work will be tracked in this jira.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:
----------------------------------
    Release Note:
With this patch, it is now possible to perform a map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and cogroup statements.
2) Data must be sorted on the join keys, for all tables, in ascending order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.
Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box. Similar conditions apply to map-side outer joins using 'merge' (PIG-1353) as well.
Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
C = COGROUP A by id, B by id using 'merge';

  was:
(the same text, except with "No other operations can be done between the load and join statements" in precondition 1)

> Map-side Cogroup
> ----------------
>
>             Key: PIG-1309
>             URL: https://issues.apache.org/jira/browse/PIG-1309
>     Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900566#action_12900566 ]

Ashutosh Chauhan commented on PIG-1420:
---------------------------------------

I could not figure out how to reopen this issue: issues marked as closed cannot be reopened. Once the patch is committed, the committer should mark the issue as resolved, since resolved issues can be reopened before the release is rolled out. When the release is rolled out, resolved issues should be marked as closed, since there is no point in reopening an issue which has already been released; if more work needs to be done on that issue, a new jira should be created for it for future releases.

> Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
> -----------------------------------------------------------------------------------------
>
>             Key: PIG-1420
>             URL: https://issues.apache.org/jira/browse/PIG-1420
>         Project: Pig
>      Issue Type: Improvement
>      Components: impl
> Affects Versions: 0.8.0
>        Reporter: Russell Jurney
>        Assignee: Russell Jurney
>         Fix For: 0.8.0
>     Attachments: addconcat2.patch, PIG-1420.2.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally) both act on only the first two fields of a tuple. This results in ugly nested CONCAT calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. That seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement.
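The proposed semantics can be sketched in plain Java, with a tuple modeled as varargs. Pig's EvalFunc/Tuple machinery is omitted, and the null-propagation choice here is an assumption for illustration, not necessarily what the committed patch does:

```java
// Sketch: concatenate ALL fields of a "tuple" (modeled as varargs),
// rather than only the first two. Null propagation is an assumed policy.
public class VarargsConcat {

    static String concat(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            if (f == null) return null; // assumed: any null input yields null
            sb.append(f);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Old style forces nesting: concat(concat("A", " "), "B")
        // New style takes all fields at once:
        System.out.println(concat("A", " ", "B")); // prints "A B"
    }
}
```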
[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path
[ https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900568#action_12900568 ] Ashutosh Chauhan commented on PIG-1486: --- I did svn co https://svn.apache.org/repos/asf/hadoop/pig/trunk/ pig-1486 ant eclipse-files and then imported pig-1486 as an existing project in eclipse. I presume that's all I need to do. The patch needs more updates after PIG-1520. Essentially it needs to remove owl from eclipse's build path. Further, eclipse also reported * Unbound classpath variable: 'ANT_HOME/lib/ant.jar' in project 'pig-1486' * Project 'pig-1486' is missing required library: 'lib/hadoop20.jar' update ant eclipse-files target to include new jar and remove contrib dirs from build path -- Key: PIG-1486 URL: https://issues.apache.org/jira/browse/PIG-1486 Project: Pig Issue Type: Bug Components: tools Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Priority: Minor Fix For: 0.8.0 Attachments: PIG-1486.1.patch, PIG-1486.2.patch, PIG-1486.patch .eclipse.templates/.classpath needs to be updated to address the following - 1. There is a new jar that is used by the code - guava-r03.jar 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse. 3. Removing the contrib projects from the class path as discussed in PIG-1390, until all libs necessary for the contribs are included in the classpath. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-533) DBloader UDF (initial prototype)
[ https://issues.apache.org/jira/browse/PIG-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved PIG-533. -- Fix Version/s: 0.8.0 Resolution: Fixed PIG-1229 makes this redundant. DBloader UDF (initial prototype) Key: PIG-533 URL: https://issues.apache.org/jira/browse/PIG-533 Project: Pig Issue Type: New Feature Reporter: Ian Holsman Priority: Minor Fix For: 0.8.0 Attachments: DbStorage.java This is an initial prototype of a UDF that can insert data into a database directly from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898648#action_12898648 ] Ashutosh Chauhan commented on PIG-1518: --- This feature of combining multiple splits should honor the OrderedLoadFunc interface: if a loadfunc implements that interface, then splits generated by it should not be combined. However, it's not clear why FileInputLoadFunc implements this interface. AFAIK, the split[] returned by getSplits() on FileInputFormat makes no guarantee that the underlying splits will be returned in an ordered fashion. Though it is the default behavior right now, and thus making it implement OrderedLoadFunc doesn't cause any problem in the current implementation, there seems to be no real benefit in FileInputLoadFunc implementing it (there is one exception, to which I will come later on). So, I will argue that FileInputLoadFunc stop implementing OrderedLoadFunc. This will have the immediate benefit of making this change useful to all the fundamental storage mechanisms of Pig, like PigStorage, BinStorage, InterStorage etc. Dropping an interface from an implementing class can be seen as a backward-incompatible change, but I really doubt anyone cares whether PigStorage is reading splits in an ordered fashion. The only real victim of this change will be MergeJoin, which will stop working with PigStorage by default. But we have not seen MergeJoin being used with PigStorage in many places. Second, it is anyway based on an assumption about FileInputFormat, which may choose to change behavior in the future. Third, the solution to this problem is straightforward: have another loader which extends PigStorage and implements OrderedLoadFunc, which can be used to load data for merge join. In essence, I am arguing to drop the OrderedLoadFunc interface from FileInputLoadFunc so that this feature is useful for a large number of use cases.
Yan, you also need to watch out for ReadToEndLoader, which is also making assumptions that may break in the presence of this feature. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
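The umbrella input format requested in PIG-1518 is essentially bin-packing many small files into fewer splits. A minimal greedy sketch of the idea in Python (not Pig's actual implementation; the function name and size limit are invented for illustration):

```python
def combine_splits(file_sizes, max_split_bytes):
    """Greedy sketch: pack small files into combined splits no larger
    than max_split_bytes; a file bigger than the limit gets its own split."""
    splits, current, current_size = [], [], 0
    # Largest files first, so small files fill the remaining gaps.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > max_split_bytes:
            splits.append(current)          # close the full split
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Four small files become two splits instead of four map tasks:
splits = combine_splits({"a": 700, "b": 400, "c": 300, "d": 100}, 1000)
assert splits == [["a"], ["b", "c", "d"]]
```

Note that any such combining discards the ordering of the underlying splits, which is exactly why splits from a loader implementing OrderedLoadFunc must not be combined, as argued above.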
[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895318#action_12895318 ] Ashutosh Chauhan commented on PIG-1404: --- bq. 3. (This one is for other pig developers) Is Piggybank the right place for this or should we put it under test? I think this will be really useful for Pig users in setting up automated tests of their Pig Latin scripts. Should we support it outright rather than put it in piggybank and risk having it go unmaintained? I think it deserves to be put in under test. Having written a few end-to-end test cases of pig in junit, I can see it is really useful for Pig itself. Usefulness for pig users is pretty obvious. PigUnit - Pig script testing simplified. - Key: PIG-1404 URL: https://issues.apache.org/jira/browse/PIG-1404 Project: Pig Issue Type: New Feature Reporter: Romain Rigaux Assignee: Romain Rigaux Fix For: 0.8.0 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, PIG-1404-4.patch, PIG-1404.patch The goal is to provide a simple xUnit framework that enables our Pig scripts to be easily: - unit tested - regression tested - quickly prototyped No cluster set up is required. For example: TestCase
{code}
@Test
public void testTop3Queries() {
  String[] args = { "n=3" };
  test = new PigTest("top_queries.pig", args);
  String[] input = { "yahoo\t10", "twitter\t7", "facebook\t10", "yahoo\t15", "facebook\t5" };
  String[] output = { "(yahoo,25L)", "(facebook,15L)", "(twitter,7L)" };
  test.assertOutput("data", input, "queries_limit", output);
}
{code}
top_queries.pig
{code}
data = LOAD '$input' AS (query:CHARARRAY, count:INT);
...
queries_sum = FOREACH queries_group GENERATE group AS query, SUM(queries.count) AS count;
...
queries_limit = LIMIT queries_ordered $n;
STORE queries_limit INTO '$output';
{code}
There are 3 modes: * LOCAL (if the pigunit.exectype.local property is present) * MAPREDUCE (uses the cluster specified in the classpath, same as HADOOP_CONF_DIR) ** automatic mini cluster (the default; the HADOOP_CONF_DIR to have in the class path will be: ~/pigtest/conf) ** pointing to an existing cluster (if the pigunit.exectype.cluster property is present) For now, it would be nice to see how this idea could be integrated in Piggybank and if PigParser/PigServer could improve their interfaces in order to make PigUnit simple. Other components based on PigUnit could be built later: - standalone MiniCluster - notion of workspaces for each test - standalone utility that reads test configuration and generates a test report... It is a first prototype, open to suggestions, and can definitely take advantage of feedback. How to test, in pig_trunk: {code} Apply patch $pig_trunk ant compile-test $pig_trunk ant $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99 {code} (it takes 15 min in the MAPREDUCE minicluster; tests will need to be split in the future between 'unit' and 'integration') Many examples are in: {code} contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java {code} When used standalone, do not forget to add commons-lang-2.4.jar and the HADOOP_CONF_DIR of your cluster to your CLASSPATH. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1531) Pig gobbles up error messages
[ https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894935#action_12894935 ] Ashutosh Chauhan commented on PIG-1531: --- Another instance where this happens is when the input location doesn't exist; the error message shown is {code} org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for tmp_emtpy_1280539088 {code} whereas the underlying exception had a more useful message, which gets lost in the log file {code} org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://machine.server.edu/tmp/pig/tmp_tables/tmp_empty_1280539088 {code} Pig gobbles up error messages - Key: PIG-1531 URL: https://issues.apache.org/jira/browse/PIG-1531 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: niraj rai Fix For: 0.8.0 Consider the following. I have my own Storer implementing StoreFunc and I am throwing FrontEndException (and other Exceptions derived from PigException) in its various methods. I expect those error messages to be shown in error scenarios. Instead Pig gobbles up my error messages and shows its own generic error message like: {code} 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2116: Unexpected error. Could not validate the output specification for: default.partitoned Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log {code} Instead I expect it to display my error messages which it stores away in that log file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
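The fix being asked for in PIG-1531 amounts to walking the exception cause chain and surfacing the innermost message instead of only the generic wrapper. A Python analogy (Pig's Java code would walk Throwable.getCause(); the messages below are abbreviated from the report above):

```python
def root_messages(exc):
    """Walk the cause chain and collect every message, innermost last,
    so the user-facing error can include the underlying detail."""
    messages = []
    while exc is not None:
        messages.append(str(exc))
        exc = exc.__cause__
    return messages

# Simulate Pig wrapping the loader's InvalidInputException in ERROR 2118:
try:
    try:
        raise FileNotFoundError(
            "Input path does not exist: /tmp/pig/tmp_tables/tmp_empty")
    except FileNotFoundError as inner:
        raise RuntimeError("ERROR 2118: Unable to create input splits") from inner
except RuntimeError as outer:
    msgs = root_messages(outer)

# The generic wrapper comes first, but the useful detail is preserved:
assert msgs[0].startswith("ERROR 2118")
assert "Input path does not exist" in msgs[-1]
```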
[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce
[ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894945#action_12894945 ] Ashutosh Chauhan commented on PIG-1516: --- +1. Changes look good. finalize in bag implementations causes pig to run out of memory in reduce -- Key: PIG-1516 URL: https://issues.apache.org/jira/browse/PIG-1516 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1516.2.patch, PIG-1516.patch *Problem:* Pig bag implementations that are subclasses of DefaultAbstractBag have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after finalization happens on them. If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if a large number of small bags are being created. *Solution:* The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use for the finalize function. A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>. *Possible workaround for earlier releases:* Since the fix is going into 0.8, here is a workaround - disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem, as it is likely to slow down the query. To disable the combiner, set the property: -Dpig.exec.nocombiner=true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
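The FileList idea above can be illustrated with a Python analogy. Java's finalize() has no exact Python counterpart, so weakref.finalize stands in for it, and all class and method names below are invented; the point is only that the cleanup hook moves from every bag to the rare object that actually owns spill files:

```python
import os
import tempfile
import weakref

class SpillFileList:
    """Sketch of the FileList idea: only the object that actually owns
    spill files registers a cleanup callback; bags themselves need none."""
    def __init__(self):
        self.paths = []
        # weakref.finalize plays the role of Java's finalize(), but only
        # this rare object carries it, not every small bag.
        self._cleanup = weakref.finalize(self, SpillFileList._delete, self.paths)

    def add_spill_file(self):
        fd, path = tempfile.mkstemp(prefix="pig_spill_")
        os.close(fd)
        self.paths.append(path)
        return path

    @staticmethod
    def _delete(paths):
        for p in paths:
            if os.path.exists(p):
                os.remove(p)

class DataBag:
    """A small bag carries no cleanup hook at all; it creates a
    SpillFileList lazily, only when it actually spills."""
    def __init__(self):
        self.tuples = []
        self.spill_files = None   # created only on spill

    def spill(self):
        if self.spill_files is None:
            self.spill_files = SpillFileList()
        return self.spill_files.add_spill_file()

bag = DataBag()
path = bag.spill()
assert os.path.exists(path)
bag.spill_files._cleanup()        # trigger cleanup explicitly for the demo
assert not os.path.exists(path)
```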
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894963#action_12894963 ] Ashutosh Chauhan commented on PIG-1229: --- I am still getting the same exception
{code}
java.io.IOException: JDBC Error
 at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:291)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.<init>(PigOutputFormat.java:124)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:85)
 at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:488)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.sql.SQLException: Table not found in statement [insert into ttt (id, name, ratio) values (?,?,?)]
 at org.hsqldb.jdbc.Util.throwError(Unknown Source)
 at org.hsqldb.jdbc.jdbcPreparedStatement.<init>(Unknown Source)
 at org.hsqldb.jdbc.jdbcConnection.prepareStatement(Unknown Source)
 at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:288)
 ... 6 more
{code}
Reading through a few internet forums, it seems that there are subtle differences between standalone mode and server mode of hsqldb. Maybe starting the hsqldb instance in server mode would alleviate the problem. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1531) Pig gobbles up error messages
Pig gobbles up error messages - Key: PIG-1531 URL: https://issues.apache.org/jira/browse/PIG-1531 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 Consider the following. I have my own Storer implementing StoreFunc and I am throwing FrontEndException (and other Exceptions derived from PigException) in its various methods. I expect those error messages to be shown in error scenarios. Instead Pig gobbles up my error messages and shows its own generic error message like: {code} 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2116: Unexpected error. Could not validate the output specification for: default.partitoned Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log {code} Instead I expect it to display my error messages which it stores away in that log file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892378#action_12892378 ] Ashutosh Chauhan commented on PIG-1229: --- Since the fix to PIG-1424 doesn't look straightforward and I don't think anyone is working on it, I suggest unblocking this useful piggybank functionality from Pig's issues. We can take the original approach suggested in the first patch of passing the jdbc url string as a constructor argument instead of the store location. Ankur, do you have cycles to generate the patch? We will commit it now so it makes it into 0.8. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890845#action_12890845 ] Ashutosh Chauhan commented on PIG-928: -- Addendum: * Also, what will happen if a user returns a nil python object (the null equivalent of Java) from a UDF? It looks to me that it will result in an NPE. Can you add a test for that, and a similar test case for pigToPython()? UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, RegisterPythonUDFLatest.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path
[ https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887885#action_12887885 ] Ashutosh Chauhan commented on PIG-1486: --- Took a look at the patch. Changes look good. But, because of PIG-1452, some additional changes are required: lib/hadoop20.jar needs to be removed from eclipse's build path, and hadoop-core.jar, hadoop-test.jar, apache-commons-* and a few other jars need to be added in, which are now pulled from maven repos and put in build/ivy/lib/Pig update ant eclipse-files target to include new jar and remove contrib dirs from build path -- Key: PIG-1486 URL: https://issues.apache.org/jira/browse/PIG-1486 Project: Pig Issue Type: Bug Components: tools Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Priority: Minor Fix For: 0.8.0 Attachments: PIG-1486.patch .eclipse.templates/.classpath needs to be updated to address the following - 1. There is a new jar that is used by the code - guava-r03.jar 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse. 3. Removing the contrib projects from the class path as discussed in PIG-1390, until all libs necessary for the contribs are included in the classpath. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888100#action_12888100 ] Ashutosh Chauhan commented on PIG-928: -- * Do you want to allow: {{register myJavaUDFs.jar using 'java' as 'javaNameSpace'}} ? The use-case could be that if we are allowing namespaces for non-java, why not allow them for Java udfs as well. But then {{define}} exists exactly for this purpose. So, it may make sense to throw an exception in such a case. * In ScriptEngine.getJarPath(), shouldn't you throw a FileNotFoundException instead of returning null? * Don't gobble up checked exceptions and then rethrow RuntimeExceptions. Throw checked exceptions if you need to. * ScriptEngine.getInstance() should be a singleton, no? * In JythonScriptEngine.getFunction(), I think you should check if interpreter.get(functionName) != null and then return it, and call Interpreter.init(path) only if it is null. * In JythonUtils, for doing type conversion you should make use of both input and output schemas (whenever they are available) and avoid doing reflection for every element. You can get hold of the input schema through outputSchema() of EvalFunc and then do UDFContext magic to use it. If schema == null || schema == bytearray, you need to resort to reflection. Similarly, if the output schema is available via decorators, use it to do type conversions. * In JythonUtils.pythonToPig(), in the case of Tuple, you first create Object[] and then do Arrays.asList(); you can directly create a List<Object> and avoid the unnecessary casting. In the same method, you are only checking for long; don't you need to check for int, String etc. and then do the casting appropriately? Also, in the default case, I think we can't let an object pass through as-is using Object.class; it could be an object of any type and may cause cryptic errors in the pipeline if let through. We should throw an exception if we don't know what type of object it is.
A similar argument applies to the default case of pigToPython(). * I didn't get why the changes are required in POUserFunc. Can you explain, and also add it as comments in the code? Testing: * This is a big enough feature to warrant its own test file. So, consider adding a new test file (maybe TestNonJavaUDF). Additionally, we see frequent timeouts on TestEvalPipeline; we don't want it to run any longer. * Instead of adding the query through the pigServer.registerCode() api, add it through pigServer.registerQuery("register myscript.py using jython"). This will make sure we are testing the changes in QueryParser.jjt as well. * Add more tests. Specifically, for complex types passed to the udfs (like bag) and for returning a bag. You can get bags after doing a group-by. You can also take a look at Julien's original patch, which contained a python script. Those, I guess, were at the right level of complexity to be added as test cases in our junit tests. Nit-picks: * Unnecessary import in JythonFunction.java * In PigContext.java, you are using Vector and LinkedList instead of the usual ArrayList. Any particular reason for it? Just curious.
* More documentation (in QueryParser.jjt, ScriptEngine, JythonScriptEngine (specifically for outputSchema, outputSchemaFunction, schemafunction)) * Also keep an eye on the recent mavenization efforts of Pig; depending on when it gets checked in, you may (or may not) need to make changes to ivy UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, RegisterPythonUDFLatest.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
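The type-conversion review points above (explicit per-type handling, and failing loudly on unknown types rather than letting anything pass through as Object.class) can be sketched like this in Python. This illustrates the review comment, not the actual JythonUtils code:

```python
def python_to_pig(value):
    """Sketch: map each Python type to a Pig-friendly value explicitly,
    and raise on anything unknown instead of passing it through."""
    if value is None:
        return None                      # a Pig null, not an NPE downstream
    if isinstance(value, bool):          # check bool before int (bool is an int)
        return value
    if isinstance(value, (int, float, str)):
        return value                     # int/long, float/double, chararray
    if isinstance(value, (list, tuple)):
        return [python_to_pig(v) for v in value]   # tuple/bag contents
    if isinstance(value, dict):
        return {k: python_to_pig(v) for k, v in value.items()}  # map
    # The default case the review objects to: don't silently let an
    # arbitrary object into the pipeline; fail loudly instead.
    raise TypeError("cannot convert %r to a Pig type" % type(value))

assert python_to_pig((1, "a", None)) == [1, "a", None]
```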
[jira] Commented: (PIG-1487) Replace bz with .bz in all the LoadFunc
[ https://issues.apache.org/jira/browse/PIG-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888182#action_12888182 ] Ashutosh Chauhan commented on PIG-1487: --- +1 Replace bz with .bz in all the LoadFunc Key: PIG-1487 URL: https://issues.apache.org/jira/browse/PIG-1487 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1487.patch This issue relates with PIG-1463. Thank Ashutosh find another place in PigStorage should be corrected. I check all the LoadFunc and found that TextLoader also has same problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887283#action_12887283 ] Ashutosh Chauhan commented on PIG-1249: --- The map-reduce framework has a jira related to this issue: https://issues.apache.org/jira/browse/MAPREDUCE-1521 It has two implications for Pig: 1) We need to reconsider whether we still want Pig to set the number of reducers on the user's behalf. We can choose not to guess the # of reducers intelligently and let the framework fail any job which doesn't correctly specify it. Then Pig is out of this guessing game, and users are forced by the framework to correctly specify the # of reducers. 2) Now that the MR framework will fail jobs based on configured limits, operators where Pig does compute and set the number of reducers (like skewed join etc.) should be aware of those limits, so that the # of reducers computed by them falls within those limits. Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of the PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces, e.g. 1 reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
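Implication 2) above — computing a reducer count that respects a framework-configured cap — can be sketched as follows. The heuristic shape and both default values are invented for illustration; they are not Pig's actual settings:

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer=10**9,
                      cluster_max_reducers=999):
    """Sketch: derive a reducer count from input size, then clamp it to
    the framework-configured limit so the job is not rejected for
    exceeding it. Defaults here are illustrative, not Pig's."""
    wanted = max(1, -(-total_input_bytes // bytes_per_reducer))  # ceil division
    return min(wanted, cluster_max_reducers)

assert estimate_reducers(0) == 1                      # never zero reducers
assert estimate_reducers(25 * 10**9) == 25            # ~1 reducer per GB
assert estimate_reducers(5000 * 10**9) == 999         # clamped to the limit
```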
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Status: Resolved (was: Patch Available) Resolution: Fixed Patch checked-in to 0.7 branch as well. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It is already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886904#action_12886904 ] Ashutosh Chauhan commented on PIG-1389: --- +1 Discussed 3) with Richard offline. Though theoretically it would be better to find out the features on the fully compiled and optimized MR plan, it will be hard and may not be worth the complexity. So, in this first pass it is fine to mark those features while the MR plan's compilation is in progress. As a result, in a few corner cases the features marked for an MR Oper may not be correct. We will fix up those cases as and when they come up. Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch, PIG-1389_2.patch A MR job generated by Pig not only can have multiple outputs (in the case of multiquery) but also can have multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number of records in the given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1491) Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to POLocalRearrange
[ https://issues.apache.org/jira/browse/PIG-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886906#action_12886906 ] Ashutosh Chauhan commented on PIG-1491: --- Scott, It will be useful if you can also paste the Pig script which produced this exception. Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to POLocalRearrange Key: PIG-1491 URL: https://issues.apache.org/jira/browse/PIG-1491 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Scott Carey I have a failure that occurs during planning while using DISTINCT in a nested FOREACH. Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SecondaryKeyOptimizer.visitMROp(SecondaryKeyOptimizer.java:352) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:218) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:40) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884551#action_12884551 ] Ashutosh Chauhan commented on PIG-1449: --- Reran the contrib tests. All passed. Patch committed. Thanks, Christian and Justin, for working on this! RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch In the 0.7.0 changes to RegExLoader a bug was introduced where the code stays in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is that the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test2) Actual result: Job fails to complete after the 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck in the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
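The corrected control flow is simply to consume a line on every loop iteration, skipping non-matches, rather than re-testing the same line forever. A Python sketch of that logic (the real fix is in the piggybank Java code; this only mirrors it):

```python
import re

def next_matching_tuple(lines, pattern):
    """Fixed-loop sketch: advance to the NEXT line on a non-match instead
    of spinning on the same line forever (the hang described above)."""
    regex = re.compile(pattern)
    for line in lines:          # each iteration consumes a line
        m = regex.match(line)
        if m is None:
            continue            # skip non-matching lines, don't spin
        yield m.groups()

# The reproduction case from the issue: the middle line doesn't match
# and must be skipped, not retried forever.
rows = list(next_matching_tuple(["test1", "testA", "test2"], r"(test\d)"))
assert rows == [("test1",), ("test2",)]
```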
[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1449: -- Status: Resolved (was: Patch Available) Fix Version/s: 0.8.0 Resolution: Fixed RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Fix For: 0.8.0 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch In the 0.7.0 changes to RegExLoader there was a bug introduced where the code will stay in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test3) Actual result: Job fails to complete after 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck into the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884552#action_12884552 ] Ashutosh Chauhan commented on PIG-1449: --- @Christian, It would definitely be useful to get the execution time for running the tests down. It takes a while currently to run all Pig tests. RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Fix For: 0.8.0 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch In the 0.7.0 changes to RegExLoader there was a bug introduced where the code will stay in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test3) Actual result: Job fails to complete after 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck into the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: PIG_1309_7.patch Backport of merge cogroup for the 0.7 branch. Since Hudson can test only trunk, I manually ran all the tests; all passed. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch In a never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location
[ https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884365#action_12884365 ] Ashutosh Chauhan commented on PIG-1424: --- This turns out to be much more involved than I initially thought. The assumption that an output/input location is a file-based path exists in more than one place in Pig. In particular, Streaming makes this assumption explicit and has it in its semantics. We need to be careful about streaming semantics before we fix this. More at: http://wiki.apache.org/pig/PigStreamingFunctionalSpec Error logs of streaming should not be placed in output location --- Key: PIG-1424 URL: https://issues.apache.org/jira/browse/PIG-1424 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This becomes a problem when the output location is anything other than a filesystem. Output will be written to the DB, but where should the logs generated by streaming go? Clearly, they can't be written into the DB. This blocks PIG-1229, which introduces writing to a DB from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1449: -- Status: Open (was: Patch Available) RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch In the 0.7.0 changes to RegExLoader there was a bug introduced where the code will stay in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test3) Actual result: Job fails to complete after 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck into the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1449: -- Status: Patch Available (was: Open) Running through Hudson. RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch In the 0.7.0 changes to RegExLoader there was a bug introduced where the code will stay in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test3) Actual result: Job fails to complete after 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck into the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884116#action_12884116 ] Ashutosh Chauhan commented on PIG-1389: --- 1.
{code}
+/**
+ * Returns the counter name for the given input file name
+ *
+ * @param fname the input file name
+ * @return the counter name
+ */
+public static String getMultiInputsCounterName(String fname) {
+    return MULTI_INPUTS_RECORD_COUNTER + new Path(fname).getName();
+}
{code}
It's dangerous to assume that the input is a file name. It may not be; it can be a JDBC location string. In particular, new Path(fname) parses fname and throws an exception if the String is not in the form it expects. So, at various places in the patch, don't assume the path refers to a file location; in particular, avoid using Path() and deal in Strings. 2. In PigRecordReader, initialization of counters should be done in initialize() instead of getCurrentValue(); that avoids branching on every call of getCurrentValue(). 3. Marking features in MRCompiler while compilation is still in progress may lead to incorrect results. We do a bunch of optimizations *after* the MR plan is constructed, during which the plan may get readjusted and features that were in one particular MROper may get pushed into a different MROper. A better way to do this marking is post-construction of the MR plan: have a visitor that walks the final MR plan and marks the features in those operators. 4. As an extension of 1, I think having a test for a non-file-based input/output location would really be useful. PIG-1229 would have made that super-easy.
Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch A MR job generated by Pig not only can have multiple outputs (in the case of multiquery) but also can have multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number of records in the given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
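Review point 1 in the comment above (don't run arbitrary location strings through new Path()) can be illustrated with plain string handling. The constant and method names below are assumptions for illustration, not the patch's actual code:

```java
// Derives the per-input counter name from the raw location string instead of
// new Path(fname), which can throw on non-file locations such as JDBC URLs.
public class InputCounterNames {
    // Assumed counter-name prefix, mirroring MULTI_INPUTS_RECORD_COUNTER.
    static final String MULTI_INPUTS_RECORD_COUNTER = "MultiInputCounters_";

    /** Uses plain string manipulation so a JDBC location survives intact. */
    public static String counterNameFor(String location) {
        int slash = location.lastIndexOf('/');
        String tail = (slash >= 0) ? location.substring(slash + 1) : location;
        return MULTI_INPUTS_RECORD_COUNTER + tail;
    }
}
```

A file path and a JDBC string both pass through unharmed, which is the point of the review comment.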
[jira] Created: (PIG-1466) Improve log messages for memory usage
Improve log messages for memory usage - Key: PIG-1466 URL: https://issues.apache.org/jira/browse/PIG-1466 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Minor For anything more than a moderately sized dataset Pig usually spits out the following messages: {code} 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K) 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K) {code} This seems to confuse users a lot. Once these messages are printed, users tend to believe that Pig is having a hard time with memory, is spilling to disk, etc., but in fact Pig might be cruising along at ease. We should be a little more careful about what we print in logs. Currently these are printed when a notification is sent by the JVM and some other conditions are met, which may not necessarily indicate a low memory condition. Furthermore, with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these messages have lost their usefulness. At the very least, we should lower the log level at which these are printed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
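For context, the mechanism behind these log lines is a usage threshold set on a heap memory pool; the JVM fires a notification when the threshold is crossed, which, as the issue notes, does not by itself mean Pig is in trouble. The sketch below illustrates the java.lang.management API involved; it is not SpillableMemoryManager's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class ThresholdDemo {
    /** Formats a MemoryUsage the way the Pig log line above does. */
    public static String format(MemoryUsage u) {
        return "init = " + u.getInit() + "(" + (u.getInit() >> 10) + "K) used = "
             + u.getUsed() + "(" + (u.getUsed() >> 10) + "K) committed = "
             + u.getCommitted() + "(" + (u.getCommitted() >> 10) + "K) max = "
             + u.getMax() + "(" + (u.getMax() >> 10) + "K)";
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Thresholds can only be set on heap pools that support them.
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {
                    // The 70% fill fraction here is an illustrative choice.
                    pool.setUsageThreshold((long) (max * 0.7));
                }
                System.out.println(pool.getName() + ": " + format(pool.getUsage()));
            }
        }
    }
}
```

Crossing such a threshold only means the pool was that full at one sample point; memory may be reclaimed immediately afterwards, which is why the current log wording overstates the situation.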
[jira] Commented: (PIG-1463) Replace bz with .bz in setStoreLocation in PigStorage
[ https://issues.apache.org/jira/browse/PIG-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882017#action_12882017 ] Ashutosh Chauhan commented on PIG-1463: --- +1 Replace bz with .bz in setStoreLocation in PigStorage -- Key: PIG-1463 URL: https://issues.apache.org/jira/browse/PIG-1463 Project: Pig Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: PIG_1463.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1462) No informative error message on parse problem
[ https://issues.apache.org/jira/browse/PIG-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881550#action_12881550 ] Ashutosh Chauhan commented on PIG-1462: --- This has come up before. As noted on PIG-798 correct way to achieve this is {code} grunt in = load 'data' using PigStorage() as (m:map[]); grunt tags = foreach in generate (tuple(chararray)) m#'k1' as tagtuple; grunt dump tags; {code} We probably need to add a note about casting in cookbook. Also, need to generate better error message. No informative error message on parse problem - Key: PIG-1462 URL: https://issues.apache.org/jira/browse/PIG-1462 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ankur Consider the following script in = load 'data' using PigStorage() as (m:map[]); tags = foreach in generate m#'k1' as (tagtuple: tuple(chararray)); dump tags; This throws the following error message that does not really say that this is a bad declaration org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Encountered at line 2, column 38. Was expecting one of: at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880881#action_12880881 ] Ashutosh Chauhan commented on PIG-1427: --- It seems you missed out ivy.xml bits in the latest patch. +1 otherwise, please commit if tests pass. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
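The monitoring idea — bound each UDF evaluation and substitute a default value on timeout — can be sketched with plain java.util.concurrent. This is an assumed illustration of the approach, not the MonitoredUDF patch itself:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.ExecutionException;

// Runs an evaluation under a timeout so that half a dozen obnoxious inputs
// (e.g. pathological regex subjects) return a default instead of killing the job.
public class TimeoutGuard {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // don't keep the JVM alive for runaway evaluations
        return t;
    });

    /** Evaluates the callable; on timeout or failure returns defaultValue. */
    public static <T> T callWithTimeout(Callable<T> udf, long millis, T defaultValue) {
        Future<T> f = POOL.submit(udf);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true); // interrupt the runaway evaluation
            return defaultValue;
        }
    }
}
```

A call such as `callWithTimeout(() -> expensiveRegexMatch(s), 500, null)` (where `expensiveRegexMatch` is a hypothetical UDF body) returns null for the rare pathological strings while leaving the millions of normal records untouched.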
[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878663#action_12878663 ] Ashutosh Chauhan commented on PIG-1449: --- Justin, Good catch. Can you assimilate your test case in junit in one of piggybank/test/storage/TestRegExLoader or TestMyRegExLoader. That way we'll have a regression test for the issue. RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Attachments: RegExLoader.patch In the 0.7.0 changes to RegExLoader there was a bug introduced where the code will stay in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is the mapper will not respond and will time out with Task attempt_X failed to report status for 600 seconds. Killing!. Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test3) Actual result: Job fails to complete after 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck into the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1448) Detach tuple from inner plans of physical operator
Detach tuple from inner plans of physical operator --- Key: PIG-1448 URL: https://issues.apache.org/jira/browse/PIG-1448 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This is a follow-up on PIG-1446 which only addresses this general problem for a specific instance of For Each. In general, all the physical operators which can have inner plans are vulnerable to this. Few of them include POLocalRearrange, POFilter, POCollectedGroup etc. Need to fix all of these. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878294#action_12878294 ] Ashutosh Chauhan commented on PIG-1448: --- The problem here is not as bad as it may sound. All physical operators already detach the input tuple after they are done with it: in getNext(), a physical operator first calls processInput(), which attaches the input tuple and then detaches it at the end. So physical operators contained within inner plans will also do that. The problem is when there is a BinCond: Pig short-circuits one of the branches of the inner plan, in which case getNext() of the operator is never called and thus the tuple is never detached. Note that in these cases the tuple was already attached, by the operator owning the inner plan, to all the roots of the plan. So, in this particular use case the tuple got attached but was never detached, leaving a stray reference which cannot be GC'ed. This still will not be a problem if there is only a single pipeline in the mapper or reducer, since the next time a new key/value pair is read and run through the pipeline the reference will be overwritten, and the tuple which was not detached in the previous run can then be GC'ed. Only in a multi-query-optimized script may the same pipeline not be run when the next key/value pair is read in map() or reduce(), in which case the stray reference is not overwritten. If all of these conditions are met and the tuple itself is large or contains large bags, we may end up with an OOME. Detach tuple from inner plans of physical operator --- Key: PIG-1448 URL: https://issues.apache.org/jira/browse/PIG-1448 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This is a follow-up on PIG-1446 which only addresses this general problem for a specific instance of For Each.
In general, all the physical operators which can have inner plans are vulnerable to this. Few of them include POLocalRearrange, POFilter, POCollectedGroup etc. Need to fix all of these. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
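The stray-reference scenario above can be modeled in a few lines: the outer operator attaches its input to every root of the inner plan, a short-circuited branch never consumes it, and the fix is to detach from all roots when done. The classes below are simplified stand-ins, not Pig's physical operators:

```java
import java.util.Arrays;
import java.util.List;

public class DetachDemo {
    // Stand-in for a root of an inner plan; the stray reference lives in
    // 'input' until it is overwritten or explicitly detached.
    public static class Root {
        public Object input;
        public void attachInput(Object t) { input = t; }
        public void detachInput() { input = null; }
    }

    /** Processes a tuple through one branch only (bincond short-circuit). */
    public static void process(List<Root> roots, Object tuple, int takenBranch) {
        for (Root r : roots) r.attachInput(tuple);   // attach to every root
        roots.get(takenBranch).input.hashCode();     // only one branch is pulled
        for (Root r : roots) r.detachInput();        // the fix: detach everywhere
    }
}
```

Without the final loop, the untaken branch's root would keep referencing the (possibly huge) tuple; with it, nothing pins the tuple and it becomes collectible as soon as the outer operator moves on.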
[jira] Commented: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
[ https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878336#action_12878336 ] Ashutosh Chauhan commented on PIG-1442: --- This looks like a variant of PIG-1446 and PIG-1448. PigCombiner attaches the tuple to the roots of the combine plan but never detaches it. PODemux also attaches the tuple to its inner plan but never detaches it. Note that PigCombiner may also contain multiple pipelines, depending on the number of operations done inside For Each, resulting in problems similar to those described in PIG-1448. java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766) --- Key: PIG-1442 URL: https://issues.apache.org/jira/browse/PIG-1442 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.7.0 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev (18/may) Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0 Reporter: Dirk Schmid As mentioned by Ashutosh, this is a reopen of https://issues.apache.org/jira/browse/PIG-766 because there is still a problem which causes Pig to scale only with memory. For convenience, here comes the last entry of the PIG-766 Jira ticket: {quote}1.
Are you getting the exact same stack trace as mentioned in the jira?{quote} Yes the same and some similar traces: {noformat} java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2786) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77) at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179) at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880) at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at 
org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501) java.lang.OutOfMemoryError: Java heap space at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:58) at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263) at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at
[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.
[ https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1446: -- Status: Resolved (was: Patch Available) Fix Version/s: 0.8.0 0.7.0 Resolution: Fixed As usual, hudson is not responding. I manually ran all the unit tests, all of them passed. Committed to both trunk and 0.7 OOME in a query having a bincond in the inner plan of a Foreach. Key: PIG-1446 URL: https://issues.apache.org/jira/browse/PIG-1446 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0, 0.7.0 Attachments: pig-1446.patch This is seen when For Each is following a group-by and there is a bin cond as an inner plan of For Each. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.
[ https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1446: - Assignee: Ashutosh Chauhan OOME in a query having a bincond in the inner plan of a Foreach. Key: PIG-1446 URL: https://issues.apache.org/jira/browse/PIG-1446 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: pig-1446.patch This is seen when For Each is following a group-by and there is a bin cond as an inner plan of For Each. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.
[ https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1446: -- Attachment: pig-1446.patch The sequence of events is as follows: 1) The MultiQuery optimizer combined 30 group-bys into one reducer, so there are 30 pipelines in the reducer. 2) Each of these group-bys has a ForEach after it. 3) The ForEach has a bincond in its inner plan. 4) The group-by resulted in large bags (tens of millions of records). 5) The tuple containing the group and the bag is attached to the roots of the inner plan of the FE. 6) The FE pulled the tuples through its leaves. 7) Due to short-circuiting in the bincond, one branch of the plan is never pulled, resulting in a stray reference to a bag which actually was not needed. 8) Because 30 group-bys were MQ-optimized together, we had many such bags hanging in there, eating up all the memory. Fix: detach tuples from the roots once you are done in the FE. OOME in a query having a bincond in the inner plan of a Foreach. Key: PIG-1446 URL: https://issues.apache.org/jira/browse/PIG-1446 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Attachments: pig-1446.patch This is seen when For Each is following a group-by and there is a bin cond as an inner plan of For Each. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.
[ https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1446: -- Status: Patch Available (was: Open) OOME in a query having a bincond in the inner plan of a Foreach. Key: PIG-1446 URL: https://issues.apache.org/jira/browse/PIG-1446 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: pig-1446.patch This is seen when For Each is following a group-by and there is a bin cond as an inner plan of For Each. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877616#action_12877616 ] Ashutosh Chauhan commented on PIG-1428: --- I propose a slightly different approach here. Instead of adding getPigStatusReporter() to the PigLogger interface and the corresponding implementation in PigHadoopLogger, we can add a static singleton method in PigStatusReporter and also add a setContext(TaskInputOutputContext context). We can then set the context in the map() and reduce() functions, and users will have full access to the reporter object through the static method. This will allow us to keep error logging separate from status reporting. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
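The proposal above might look roughly like the sketch below, with a stand-in interface in place of Hadoop's TaskInputOutputContext; all names here are illustrative, not the committed API:

```java
// A static singleton reporter whose task context is injected once per task
// from map()/reduce() setup, so UDFs can reach counters without going
// through PigHadoopLogger's warning machinery.
public class StatusReporterSketch {
    /** Stand-in for the counter surface of TaskInputOutputContext. */
    public interface Context {
        void incrCounter(String group, String name, long delta);
    }

    private static final StatusReporterSketch INSTANCE = new StatusReporterSketch();
    private volatile Context context;

    private StatusReporterSketch() {}

    public static StatusReporterSketch getInstance() { return INSTANCE; }

    /** Called once per task from map()/reduce() setup. */
    public void setContext(Context ctx) { this.context = ctx; }

    /** Safe no-op when no task context is available (e.g. on the front end). */
    public void incrCounter(String group, String name, long delta) {
        Context c = context;
        if (c != null) c.incrCounter(group, name, delta);
    }
}
```

The null-check keeps UDFs usable in contexts where no task is running, and keeping the reporter separate from PigLogger preserves the error-logging/status-reporting split the comment argues for.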
[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877591#action_12877591 ] Ashutosh Chauhan commented on PIG-1428: --- So, I read through PIG-889. It seems that there never was a documented way to use counters, reporters, etc. from UDFs and Load/Store Funcs. Actually, there is a hacky way to do it, which exists in DefaultAbstractBag.java:
{code}
protected void incSpillCount(Enum counter) {
    // Increment the spill count.
    // warn is a misnomer: the function updates the counter. If the update
    // fails, it dumps a warning.
    PigHadoopLogger.getInstance().warn(this, "Spill counter incremented", counter);
}
{code}
But in PIG-889 Santhosh argued against this (mis)use of PigLogger. I think we need to provide a formal way for Pig users to access counters and reporters from our interfaces (UDFs, L/S), as PigHadoopLogger is designed for error handling (warning aggregation in particular) and not for this purpose. And we should mark this class as internal-only before someone starts using it. By the same argument, the above method, where Pig internally makes use of its own counters, is flawed and needs to be corrected. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1438) [Performance] MultiQueryOptimizer should also merge DISTINCT jobs
[ https://issues.apache.org/jira/browse/PIG-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877150#action_12877150 ] Ashutosh Chauhan commented on PIG-1438: --- +1 please commit. [Performance] MultiQueryOptimizer should also merge DISTINCT jobs - Key: PIG-1438 URL: https://issues.apache.org/jira/browse/PIG-1438 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1438.patch, PIG-1438_1.patch Current implementation doesn't merge jobs derived from DISTINCT statements. The reason is that DISTINCT jobs are implemented using a special combiner (DistinctCombiner). But we should be able to merge jobs that have the same type of combiner (e.g. merge multiple DISTINCT jobs into one). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876763#action_12876763 ] Ashutosh Chauhan commented on PIG-1427: --- @Dmitriy, Occupied with some work. Will get back to it sometime later this week. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-283: - Status: Resolved (was: Patch Available) Release Note: For documentation: After this patch, it becomes possible to set key-value pairs in the script as follows: {code} set mapred.map.tasks.speculative.execution false set pig.logfile mylogfile.log set my.arbitrary.key my.arbitrary.value {code} These key-value pairs are put into the job conf by Pig. This is a script-wide setting: if a value is defined multiple times for a key in the script, the last one takes effect, and it is this value that is set for all the jobs generated by the script. Resolution: Fixed Re-ran all the tests reported by Hudson as failures; all of them passed. Patch committed. Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Affects Versions: 0.7.0 Reporter: Christian Kunz Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: pig-282.patch It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[ https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875319#action_12875319 ] Ashutosh Chauhan commented on PIG-1433: --- +1 for the commit. A couple of notes for the future: * Since this is tied to a Hadoop property, we should consider removing it from the Pig codebase when MAPREDUCE-1447 and MAPREDUCE-947 are fixed. * We have a lot of constant strings in our codebase. For the sake of clean code, we should put all of those public static final strings in one top-level interface called Constants. This should be part of a separate code clean-up jira. pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- Key: PIG-1433 URL: https://issues.apache.org/jira/browse/PIG-1433 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.8.0 Attachments: PIG-1433.patch pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true
[ https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875326#action_12875326 ] Ashutosh Chauhan commented on PIG-1433: --- My point was to have all constant strings in one place instead of each class holding some of them. It could be either an interface or a class; if the constant interface is considered an anti-pattern, doing it in a class is fine too. pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- Key: PIG-1433 URL: https://issues.apache.org/jira/browse/PIG-1433 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.8.0 Attachments: PIG-1433.patch pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
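The suggestion above, done as a class rather than a constant interface, could look like the following. This is purely illustrative: the class name and the choice of keys are examples, not Pig's actual code.

```java
// Illustrative sketch of a single home for constant strings.
// The name and keys are examples, not Pig's actual Constants class.
final class PigConstants {
    private PigConstants() {} // non-instantiable holder, avoids the constant-interface anti-pattern

    static final String STREAMING_LOG_DIR = "pig.streaming.log.dir";
    static final String PIG_LOGFILE = "pig.logfile";
    static final String MARK_SUCCESSFUL_JOBS = "mapreduce.fileoutputcommitter.marksuccessfuljobs";
}
```

A private constructor on a final class gives the same "one place for all keys" benefit without any class having to implement an interface just to inherit constants.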
[jira] Created: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct - Key: PIG-1437 URL: https://issues.apache.org/jira/browse/PIG-1437 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Minor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1437: -- Release Note: (was: It's possible to rewrite queries like this {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate group.name, group.age; dump C; {code} or {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate flatten(group); dump C; {code} to {code} A = load 'data' as (name,age); B = distinct A; dump B; {code} This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT is executed more efficiently than group-by, this will be a huge win. ) Description: It's possible to rewrite queries like this {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate group.name, group.age; dump C; {code} or {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate flatten(group); dump C; {code} to {code} A = load 'data' as (name,age); B = distinct A; dump B; {code} This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT is executed more efficiently than group-by, this will be a huge win.
[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct - Key: PIG-1437 URL: https://issues.apache.org/jira/browse/PIG-1437 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Priority: Minor It's possible to rewrite queries like this {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate group.name, group.age; dump C; {code} or {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate flatten(group); dump C; {code} to {code} A = load 'data' as (name,age); B = distinct A; dump B; {code} This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT is executed more efficiently than group-by, this will be a huge win. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873095#action_12873095 ] Ashutosh Chauhan commented on PIG-283: -- Seems Hudson didn't fully recover from its long hospital trip. All the failures are unrelated and due to port conflicts. Patch is ready for review. Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Affects Versions: 0.7.0 Reporter: Christian Kunz Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: pig-282.patch It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872303#action_12872303 ] Ashutosh Chauhan commented on PIG-1427: --- 1. You didn't pay heed to my request for incrementing a counter when the UDF times out or throws an exception :) I think that will be pretty useful for users to know how many faulty records there are in the dataset that can't be processed by the UDF. 2. In getDefaultValue() there seems to be an inconsistency among the different if statements. I guess you need to make a distinction between the Integer[] and Integer return types and then return the appropriate default value. 3. Doing svn co; patch -p0 < monitoredUDF.patch; ant jar results in a build failure. It seems Ivy is not pulling the Guava lib. 4. Since it's a new user-facing interface, having stability/visibility tags would really be useful. 5. Since it spawns a new thread for every exec() call, I assume it will have some overhead. If you have done some comparison or have numbers for that, it would be great if you could share them. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: monitoredUdf.patch, monitoredUdf.patch As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
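The per-call thread mentioned in point 5 follows a standard pattern: run the UDF body in another thread and fall back to a default value on timeout. The following is a hedged, self-contained sketch of that pattern (the patch itself uses Guava's time limiter; class and method names here are invented for illustration).

```java
import java.util.concurrent.*;

// Hedged sketch of the monitoring pattern under discussion: evaluate the
// UDF body in a pool thread and return a fallback value if it overruns.
class TimeoutGuard {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // do not keep the task JVM alive on shutdown
        return t;
    });

    static <T> T callWithTimeout(Callable<T> body, long millis, T fallback) {
        Future<T> f = POOL.submit(body);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true); // interrupt the runaway evaluation
            return fallback;
        }
    }
}
```

A cached pool reuses idle worker threads, so the steady-state overhead is closer to one `Future` allocation plus a hand-off per exec() call than to a full thread spawn, which is why the measured numbers matter.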
[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872315#action_12872315 ] Ashutosh Chauhan commented on PIG-283: -- The proposal here is as suggested in the description: expand the set command so that it can take arbitrary key-value pairs and pass them on to the job conf. Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Reporter: Christian Kunz It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-283: - Attachment: pig-282.patch Patch as suggested in the previous comment. This lets users add/override key-value pairs in the job conf through Grunt or through a script, like: {code} grunt> set mapred.map.tasks.speculative.execution false grunt> set pig.logfile mylogfile.log grunt> set my.arbitrary.key my.arbitrary.value {code} Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Reporter: Christian Kunz Attachments: pig-282.patch It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-283: Assignee: Ashutosh Chauhan Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Reporter: Christian Kunz Assignee: Ashutosh Chauhan Attachments: pig-282.patch It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 Without this getter method, it's not possible to get counters, report progress, etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program
[ https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-283: - Status: Patch Available (was: Open) Affects Version/s: 0.7.0 Fix Version/s: 0.8.0 Allow to set arbitrary jobconf key-value pairs inside pig program - Key: PIG-283 URL: https://issues.apache.org/jira/browse/PIG-283 Project: Pig Issue Type: New Feature Components: grunt Affects Versions: 0.7.0 Reporter: Christian Kunz Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: pig-282.patch It would be useful to be able to set arbitrary JobConf key-value pairs inside a pig program (e.g. in front of a COGROUP statement). I wonder whether the simplest way to add this feature is by expanding the 'set' command functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872862#action_12872862 ] Ashutosh Chauhan commented on PIG-1427: --- * Filed PIG-1428 for it. * Neat workaround. * I guess checking in lib/ is fine. They are using APL. * Performance numbers look good. Initially, let's not turn monitoring on by default. Later, as we gain more experience with this feature, we should turn monitoring on by default so as not to waste cluster resources because of programming errors. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: monitoredUdf.patch, monitoredUdf.patch As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871862#action_12871862 ] Ashutosh Chauhan commented on PIG-1347: --- The patch is pretty straightforward and harmless, as it only removes code and does not add anything new. The only concern I have is that FileLocalizer.registerDeleteOnFail() is a public method, so it's possible that someone using Pig's Java API was previously using this method to do the cleanup himself; this can be considered a backward-incompatible change. But Daniel explained to me that this method was meant for Pig's internal usage, and that cleanup was in any case taken care of by Pig before the recent store func changes, so users did not need to worry about it. So it's extremely unlikely that someone is using it. +1 on committing. Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location
[ https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871902#action_12871902 ] Ashutosh Chauhan commented on PIG-1424: --- Till we figure out a proper solution for this, one possibility is to wrap the code in my previous comment in a try-catch block. That will unblock PIG-1229 for commit. We can leave this ticket open if we feel there is a need for a better solution. Error logs of streaming should not be placed in output location --- Key: PIG-1424 URL: https://issues.apache.org/jira/browse/PIG-1424 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This becomes a problem when the output location is anything other than a filesystem. Output will be written to the DB, but where should the logs generated by streaming go? Clearly, they can't be written into the DB. This blocks PIG-1229, which introduces writing to a DB from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872031#action_12872031 ] Ashutosh Chauhan commented on PIG-1427: --- A useful feature. A couple of comments: 1. Currently, in case of timeouts and errors you always return null. It would be useful if the user could specify, in the annotation definition, a default return value which is returned in those cases. For example, if my regex fails on an input String, I want to return an empty String back. Something like: {code} @MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 500, defaultReturnValue = "") {code} 2. It seems that the PigHadoopLogger.getReporter() method accidentally got removed in 0.7 and trunk. This needs to be restored. It would be really cool to see on the UI how many of my input records are faulty. Since it is a small change, I think you can add that getter method in there and then update the appropriate counters. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: monitoredUdf.patch As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
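The annotation extension suggested in point 1 can be sketched with a plain Java annotation. This is a hypothetical shape only: the annotation and class names are invented, and everything beyond the three elements shown in the comment above is an assumption for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.concurrent.TimeUnit;

// Hypothetical shape of the annotation with the suggested defaultReturnValue
// element added; names here are illustrative, not Pig's MonitoredUDF itself.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MonitoredUdfSketch {
    TimeUnit timeUnit() default TimeUnit.SECONDS;
    long duration() default 10;
    String defaultReturnValue() default "";
}

// A UDF-like class carrying the annotation, as in the comment's example.
@MonitoredUdfSketch(timeUnit = TimeUnit.MILLISECONDS, duration = 500, defaultReturnValue = "")
class MyRegexUdfSketch { /* exec(...) would run under the monitor */ }
```

The monitor would read the annotation reflectively at class-load time and hand back `defaultReturnValue()` instead of null whenever the guarded call times out or throws.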
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871253#action_12871253 ] Ashutosh Chauhan commented on PIG-766: -- Dirk, 1. Are you getting the exact same stack trace as mentioned in the jira? 2. Which operations are you doing in your query - join, group-by, any other? 3. What load/store func are you using to read and write data? PigStorage or your own? 4. What is your data size and memory available to your tasks? 5. Do you have very large records in your dataset, like hundreds of MB for one record? It would be great if you could paste here the script from which you get this exception. ava.lang.OutOfMemoryError: Java heap space -- Key: PIG-766 URL: https://issues.apache.org/jira/browse/PIG-766 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.7.0 Environment: Hadoop-0.18.3 (cloudera RPMs). mapred.child.java.opts=-Xmx1024m Reporter: Vadim Zaliva My pig script always fails with the following error: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2786) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291) at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291) at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77) at 
org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871448#action_12871448 ] Ashutosh Chauhan commented on PIG-928: -- Arnab, Thanks for putting together a patch for this. One question I have is about register vs. define. Currently you are auto-registering all the functions in the script file, and then they are available for later use in the script. But I am not sure how we will handle the case of inlined functions. For inline functions, {{define}} seems to be the natural choice, as noted in previous comments on the jira. And if so, then we need to modify define to support that use case. I am wondering whether, to remain consistent, we should always use {{define}} to define non-native functions instead of auto-registering them. I also didn't get why there would be a need for separate interpreter instances in that case. UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Fix For: 0.8.0 Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location
[ https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869688#action_12869688 ] Ashutosh Chauhan commented on PIG-1424: --- Since all the logs generated by Pig in the backend end up in the log directory of the task tracker, logs generated by the streaming binary should also go there and not into the output location. The place where this location gets set is JobControlCompiler.java, line 460: {code} conf.set("pig.streaming.log.dir", new Path(outputPath, LOG_DIR).toString()); {code} Error logs of streaming should not be placed in output location --- Key: PIG-1424 URL: https://issues.apache.org/jira/browse/PIG-1424 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This becomes a problem when the output location is anything other than a filesystem. Output will be written to the DB, but where should the logs generated by streaming go? Clearly, they can't be written into the DB. This blocks PIG-1229, which introduces writing to a DB from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869692#action_12869692 ] Ashutosh Chauhan commented on PIG-1229: --- Cool. I created PIG-1424 to track the Pig issue. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867220#action_12867220 ] Ashutosh Chauhan commented on PIG-1381: --- +1 on the changes. For completeness, we can also check in an empty pig.properties in the conf dir and then add comments to both pig.properties and pig-default.properties noting that if users want to pass some properties, doing it through pig-default.properties will have no effect; instead, they should put any properties they want to add/override in pig.properties. Need a way for Pig to take an alternative property file --- Key: PIG-1381 URL: https://issues.apache.org/jira/browse/PIG-1381 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: V.V.Chaitanya Krishna Fix For: 0.7.0, 0.8.0 Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, PIG-1381-4.patch Currently, Pig reads the first pig.properties found on the CLASSPATH. Pig has a default pig.properties, and if users have a different pig.properties, there will be a conflict since we can only read one. There are a couple of ways to solve it: 1. Give a command-line option for users to pass an additional property file. 2. Change the name of the default pig.properties to pig-default.properties, and users can give a pig.properties to override it. 3. Further, can we consider using pig-default.xml/pig-site.xml, which seems more natural for the hadoop community? If so, we shall provide backward compatibility to also read pig.properties and pig-cluster-hadoop-site.xml. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
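The comment suggested for the checked-in empty pig.properties could read something like the following (wording illustrative, not the actual file):

{code}
# Pig reads pig-default.properties first and then this file; values set here
# override the defaults. Adding properties to pig-default.properties has no
# effect -- put any properties you want to add or override below, e.g.:
# pig.logfile=/tmp/pig.log
{code}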
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1229: -- Attachment: pig-1229.patch Ankur, Sorry for getting back late on this. I fiddled with your latest patch and was able to make some progress on it. I am able to get rid of those Path problems (it looks like Pig itself is not dealing with this correctly in one place). I think the patch that I attached should work, but I am not able to get the test case to pass because of an hsqldb problem which I am not able to resolve. I keep getting this error from it: {noformat} Caused by: java.sql.SQLException: The database is already in use by another process: org.hsqldb.persist.niolockf...@4abea04e[file =/private/tmp/batchtest.lck, exists=true, locked=false, valid=false, fl =null]: java.lang.Exception: checkHeartbeat(): lock file [/private/tmp/batchtest.lck] is presumably locked by another process. at org.hsqldb.jdbc.Util.sqlException(Unknown Source) at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source) at org.hsqldb.jdbcDriver.getConnection(Unknown Source) at org.hsqldb.jdbcDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:582) at java.sql.DriverManager.getConnection(DriverManager.java:185) at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:274) {noformat} Anyways, here are the changes I made: 1. {code} Index: src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java === -conf.set("pig.streaming.log.dir", -new Path(outputPath, LOG_DIR).toString()); +//conf.set("pig.streaming.log.dir", +//new Path(outputPath, LOG_DIR).toString()); conf.set("pig.streaming.task.output.dir", outputPath); } {code} This looks like a problem in Pig itself: here Pig incorrectly assumes that it can put the logs generated during a stream command in the output location, which is wrong if the output location is something like a DB.
Since this needs changes in the main Pig code, I suggest opening a new jira for it and tracking it there. 2. Then in DBStorage.java {code} @Override public void setStoreLocation(String location, Job job) throws IOException { job.getConfiguration().set("pig.db.conn.string", location); } @Override public RecordWriter<NullWritable, NullWritable> getRecordWriter( TaskAttemptContext context) throws IOException, InterruptedException { jdbcURL = context.getConfiguration().get("pig.db.conn.string"); return null; } {code} We need to save the db connection string in the job in setStoreLocation() and then retrieve it on the backend in getRecordWriter(). 3. In DBStorage.java {code} @Override public void cleanupOnFailure(String location, Job job) throws IOException { log.error("Job has failed."); } {code} You necessarily need to override this method of StoreFunc, as the default implementation assumes a FileSystem output location. Currently I left it as a no-op, but it can be improved to do rollbacks, release db connections, etc. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
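The front-end/back-end handoff described in point 2 above can be sketched in isolation. This is a hypothetical illustration only, not DBStorage's actual code: a plain Map stands in for Hadoop's Configuration, and the method names mirror the StoreFunc contract being discussed.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the handoff pattern: setStoreLocation() runs on the front end
// and stashes the JDBC connection string under a key in the job
// configuration; getRecordWriter() runs later, on the back end, where the
// serialized configuration is the only shared state, and reads it back.
public class ConnStringHandoff {
    static final String KEY = "pig.db.conn.string";

    // front end: called while the script is being compiled
    static void setStoreLocation(String location, Map<String, String> jobConf) {
        jobConf.put(KEY, location);
    }

    // back end: called inside each task
    static String getConnStringForWriter(Map<String, String> taskConf) {
        return taskConf.get(KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setStoreLocation("jdbc:hsqldb:file:/tmp/batchtest", conf);
        // the task sees the same conf contents after job submission
        String jdbcURL = getConnStringForWriter(conf);
        System.out.println(jdbcURL);
    }
}
```

The point of the pattern is that front-end object state does not survive to the back end, so anything the writer needs must travel through the job configuration.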
[jira] Commented: (PIG-1390) Provide a target to generate eclipse-related classpath and files
[ https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12862951#action_12862951 ] Ashutosh Chauhan commented on PIG-1390: --- I gave it a go and did as mentioned in the previous comment {noformat} These are the steps that could be followed and imported to eclipse in a faster way : 1. checkout the trunk code. 2. run ant eclipse-files. 3. open eclipse and import the existing project. {noformat} Though pig itself compiled fine and is ready to go, the contrib projects (owl, zebra, piggybank/hiverc) didn't compile, I think because it either didn't download the dependencies of those projects or didn't include them in the build path. So an unfriendly red cross appears next to the project. If I remove them from the build path, things are good. Did I do something wrong or is this expected? Provide a target to generate eclipse-related classpath and files Key: PIG-1390 URL: https://issues.apache.org/jira/browse/PIG-1390 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.7.0, 0.8.0 Reporter: V.V.Chaitanya Krishna Assignee: V.V.Chaitanya Krishna Fix For: 0.8.0 Attachments: PIG-1390-2.patch, PIG-1390-3.patch, PIG-eclipse_support.patch Currently, after checking out from the svn repository, there is no provision to auto-generate the eclipse-related classpath and files, which would help in importing into eclipse directly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory
[ https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1395: -- Status: Resolved (was: Patch Available) Resolution: Fixed Patch checked-in with updated comment. Mapside cogroup runs out of memory -- Key: PIG-1395 URL: https://issues.apache.org/jira/browse/PIG-1395 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: cogrp_mem.patch In a particular scenario when there aren't lot of tuples with a same key in a relation (i.e. there aren't many repeating keys) map tasks doing cogroup fails with GC overhead exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861122#action_12861122 ] Ashutosh Chauhan commented on PIG-798: -- 1. {noformat} b = foreach a generate (chararray) $0 as name; {noformat} 2. {noformat} B = foreach A generate $0 as name:chararray; {noformat} @Viraj, I discussed this with Alan and Daniel. The language semantics for achieving this functionality, with whatever loader, is option 1. The fact that 2 works for BinStorage is unfortunate and is a bug. It is something which is currently there for backward compatibility and will eventually be removed. Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab separated text file, which I load using PigStorage() and store using BinStorage() {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load file 'binstoragecreateop' in the following way. {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. 
Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage() where is ok not to specify a schema??? Should it not be consistent across both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1395) Mapside cogroup runs out of memory
Mapside cogroup runs out of memory -- Key: PIG-1395 URL: https://issues.apache.org/jira/browse/PIG-1395 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0 In a particular scenario when there aren't lot of tuples with a same key in a relation (i.e. there aren't many repeating keys) map tasks doing cogroup fails with GC overhead exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory
[ https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1395: -- Status: Patch Available (was: Open) Mapside cogroup runs out of memory -- Key: PIG-1395 URL: https://issues.apache.org/jira/browse/PIG-1395 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.8.0 Attachments: cogrp_mem.patch In a particular scenario when there aren't lot of tuples with a same key in a relation (i.e. there aren't many repeating keys) map tasks doing cogroup fails with GC overhead exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861177#action_12861177 ] Ashutosh Chauhan commented on PIG-1229: --- Ankur, the stack trace above is out of sync with trunk. Can you upload a patch with the alternative approach that you are trying? I think it might be possible to get this working. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861186#action_12861186 ] Ashutosh Chauhan commented on PIG-1381: --- Do we need to have two different property files? One possibility is to not package pig.properties in the pig.jar, and instead include it in the classpath while invoking Pig. (We can modify the pig shell script to include it in the path by default.) Then users can add/delete/modify pig.properties as they wish, as well as override default properties. The disadvantage of two property files is that it is sometimes confusing which property is getting picked up (the one in the default file or the one in the user-specified file). If there is only one property file, there is only one way to specify properties to Pig, which I think is the better way of doing it. Need a way for Pig to take an alternative property file --- Key: PIG-1381 URL: https://issues.apache.org/jira/browse/PIG-1381 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Fix For: 0.8.0 Currently, Pig reads the first pig.properties found in the CLASSPATH. Pig has a default pig.properties, and if the user has a different pig.properties, there will be a conflict since we can only read one. There are a couple of ways to solve it: 1. Give a command line option for the user to pass an additional property file 2. Rename the default pig.properties to pig-default.properties, so the user can give a pig.properties to override it 3. Further, we could consider using pig-default.xml/pig-site.xml, which seems more natural for the hadoop community. If so, we shall provide backward compatibility to also read pig.properties and pig-cluster-hadoop-site.xml. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
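The single-property-file idea above (built-in defaults, one user-editable pig.properties that overrides them) can be sketched with java.util.Properties, whose chained-defaults constructor gives exactly this lookup order. The property names here are illustrative, not Pig's actual keys.

```java
import java.util.Properties;

// Sketch: defaults live in code (nothing packaged in the jar); the lone
// user pig.properties overrides them. Properties(defaults) makes every
// getProperty() fall through to the defaults when the user file has no
// entry for that key.
public class SinglePropFile {
    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("pig.exec.type", "mapreduce"); // hypothetical key
        defaults.setProperty("pig.verbose", "false");       // hypothetical key

        // userProps models the single pig.properties the user edits
        Properties userProps = new Properties(defaults);
        userProps.setProperty("pig.verbose", "true"); // user override

        System.out.println(userProps.getProperty("pig.exec.type")); // falls through to default
        System.out.println(userProps.getProperty("pig.verbose"));   // user value wins
    }
}
```

With one file there is exactly one override point, so the "which copy won?" confusion described above cannot arise.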
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860598#action_12860598 ] Ashutosh Chauhan commented on PIG-798: -- You can specify a schema in FOREACH GENERATE with the PigStorage loader as follows: {code} grunt> a = load 'data' using PigStorage(); grunt> b = foreach a generate (chararray) $0 as name; grunt> describe b; b: {name: chararray} grunt> dump b; {code} I get the expected result. Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab separated text file, which I load using PigStorage() and store using BinStorage() {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load file 'binstoragecreateop' in the following way. {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema??? 
Should it not be consistent across both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860606#action_12860606 ] Ashutosh Chauhan commented on PIG-1339: --- This works fine in grunt. {code} grunt> a = load '1-3.txt' using PigStorage() as (あいうえお); grunt> dump a; {code} gives the expected result. The problem is when it is fed as a script to Pig: {code} bin/pig myscript.pig {code} gives the exception you showed above. This looks like a bug in PigScriptParser.jj, which should read the stream from the script file as UTF-8. International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat There is a particular use-case in which someone specifies a column name to be in International characters. {code} inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); describe inputdata; dump inputdata; {code} == Pig Stack Trace --- ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. 
Encountered: \u3042 (12354), after : at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) == Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
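The fix suggested in the comment above (decode the script file as UTF-8 rather than the platform default charset) can be sketched in isolation. This is an illustrative sketch, not PigScriptParser's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch: a script containing non-ASCII identifiers such as あいうえお
// survives a read only when the InputStreamReader is constructed with an
// explicit UTF-8 charset. Omitting the charset uses the platform default,
// which garbles multi-byte characters on non-UTF-8 locales.
public class Utf8ScriptRead {
    public static void main(String[] args) throws Exception {
        byte[] scriptBytes = "a = load '1-3.txt' as (あいうえお);"
                .getBytes(StandardCharsets.UTF_8);

        // correct: decode with an explicit charset, independent of locale
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(scriptBytes), StandardCharsets.UTF_8);
        StringBuilder script = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) script.append((char) c);

        System.out.println(script); // identifiers survive the round trip
    }
}
```

The same one-argument-vs-two-argument distinction applies to FileReader vs InputStreamReader over a FileInputStream, which is presumably what the parser's file handling comes down to.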
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860614#action_12860614 ] Ashutosh Chauhan commented on PIG-1211: --- Oh, I got confused. From your earlier comment, it occurred to me that you were saying we should add a -checkscript command line option. From your previous comment, are you suggesting that we should add a syntax checker which always runs (i.e., without needing any command line directive) before the query starts to execute, thereby catching as many user errors as possible? I think this is a reasonable ask and will be useful to users. This might be the first step towards making the distinction between pig compile time and run time explicit to the user. If we go full length here, we might as well do what Milind suggested earlier (and in a recent mail thread). We can add a compilation phase which first runs a syntax checker, then generates object code (essentially the job jar) from the pig script. This compiled object can then be handed over to the run-time (hadoop cluster). 
Wow, pig-latin is evolving towards a true language :) Pig script runs half way after which it reports syntax error Key: PIG-1211 URL: https://issues.apache.org/jira/browse/PIG-1211 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script which is structured in the following way {code} register cp.jar dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5); filtered_dataset = filter dataset by (col1 == 1); proj_filtered_dataset = foreach filtered_dataset generate col2, col3; rmf $output1; store proj_filtered_dataset into '$output1' using PigStorage(); second_stream = foreach filtered_dataset generate col2, col4, col5; group_second_stream = group second_stream by col4; output2 = foreach group_second_stream { a = second_stream.col2 b = distinct second_stream.col5; c = order b by $0; generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; } rmf $output2; --syntax error here store output2 to '$output2' using PigStorage(); {code} I run this script using the Multi-query option, it runs successfully till the first store but later fails with a syntax error. The usage of HDFS option, rmf causes the first store to execute. The only option the I have is to run an explain before running his script grunt explain -script myscript.pig -out explain.out or moving the rmf statements to the top of the script Here are some questions: a) Can we have an option to do something like checkscript instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error b) Can pig not figure out a way to re-order the rmf statements since all the store directories are variables Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1390) Provide a target to generate eclipse-related classpath and files
[ https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reassigned PIG-1390: - Assignee: V.V.Chaitanya Krishna Provide a target to generate eclipse-related classpath and files Key: PIG-1390 URL: https://issues.apache.org/jira/browse/PIG-1390 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.7.0, 0.8.0 Reporter: V.V.Chaitanya Krishna Assignee: V.V.Chaitanya Krishna Fix For: 0.8.0 Attachments: PIG-eclipse_support.patch Currently, after checking out from svn repository, there is no provision to auto-generate eclipse-related classpath and files , which could help in import into eclipse directly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859462#action_12859462 ] Ashutosh Chauhan commented on PIG-1211: --- bq. Can we have an option to do something like checkscript instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error Though it is possible to add something like checkscript, it would be syntactic sugar, since it would do exactly what explain does (just without printing the plan at the end). So I am thinking, shall we tell users to run explain to catch syntax errors, instead of adding this new command line option? What do others think? Pig script runs half way after which it reports syntax error Key: PIG-1211 URL: https://issues.apache.org/jira/browse/PIG-1211 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script which is structured in the following way {code} register cp.jar dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5); filtered_dataset = filter dataset by (col1 == 1); proj_filtered_dataset = foreach filtered_dataset generate col2, col3; rmf $output1; store proj_filtered_dataset into '$output1' using PigStorage(); second_stream = foreach filtered_dataset generate col2, col4, col5; group_second_stream = group second_stream by col4; output2 = foreach group_second_stream { a = second_stream.col2 b = distinct second_stream.col5; c = order b by $0; generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; } rmf $output2; --syntax error here store output2 to '$output2' using PigStorage(); {code} I run this script using the Multi-query option, it runs successfully till the first store but later fails with a syntax error. The usage of HDFS option, rmf causes the first store to execute. 
The only option the I have is to run an explain before running his script grunt explain -script myscript.pig -out explain.out or moving the rmf statements to the top of the script Here are some questions: a) Can we have an option to do something like checkscript instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error b) Can pig not figure out a way to re-order the rmf statements since all the store directories are variables Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859471#action_12859471 ] Ashutosh Chauhan commented on PIG-1345: --- This will involve recording line numbers (and possibly more metadata) from parser to logical layer, then to physical layer and then to backend and then back in case of exceptions. This has been discussed before in some detail in PIG-908. Linking it against that. Link casting errors in POCast to actual lines numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat For the purpose of easy debugging, I would be nice to find out where my warnings are coming from is in the pig script. The only known process is to comment out lines in the Pig script and see if these warnings go away. 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 I think this may need us to keep track of the line numbers of the Pig script (via out javacc parser) and maintain it in the logical and physical plan. It would help users in debugging simple errors/warning related to casting. Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? Do we need to change the parser to something other than javacc to make this task simpler? Standardize on Parser and Scanner Technology Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1345: -- Parent: PIG-908 Issue Type: Sub-task (was: Improvement) Link casting errors in POCast to actual lines numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat For the purpose of easy debugging, I would be nice to find out where my warnings are coming from is in the pig script. The only known process is to comment out lines in the Pig script and see if these warnings go away. 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 I think this may need us to keep track of the line numbers of the Pig script (via out javacc parser) and maintain it in the logical and physical plan. It would help users in debugging simple errors/warning related to casting. Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? Do we need to change the parser to something other than javacc to make this task simpler? Standardize on Parser and Scanner Technology Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859152#action_12859152 ] Ashutosh Chauhan commented on PIG-1339: --- This is not reproducible on trunk. I get the expected output. Viraj, can you please verify if it works for you in trunk ? International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat There is a particular use-case in which someone specifies a column name to be in International characters. {code} inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); describe inputdata; dump inputdata; {code} == Pig Stack Trace --- ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) == Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859157#action_12859157 ] Ashutosh Chauhan commented on PIG-1341: --- I think BinStorage is an internal way of moving data around in Pig and it should be treated that way. I think we should discourage its use by users. Otherwise, we need to add capabilities such as the one requested here. An important impact of making such a change is that we then can't swap out BinStorage for other storage mechanisms. If Avro (or protobuf or whatever) proved to be a better replacement for BinStorage, we couldn't just swap it in place of BinStorage unless we added to it all the capabilities that BinStorage has. Therefore, I suggest keeping the capabilities of BinStorage minimal. BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED -- Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Attachments: PIG-1341.patch Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray. {code} raw = load 'sampledata' using BinStorage() as (col1,col2, col3); --filter out null columns A = filter raw by col1#'bcookie' is not null; B = foreach A generate col1#'bcookie' as reqcolumn; describe B; --B: {regcolumn: bytearray} X = limit B 5; dump X; B = foreach A generate (chararray)col1#'bcookie' as convertedcol; describe B; --B: {convertedcol: chararray} X = limit B 5; dump X; {code} The first dump produces: (36co9b55onr8s) (36co9b55onr8s) (36hilul5oo1q1) (36hilul5oo1q1) (36l4cj15ooa8a) The second dump produces: () () () () () It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s). Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859159#action_12859159 ] Ashutosh Chauhan commented on PIG-798: -- Viraj, I am confused by this description. It seems to me that you are first storing some data using BinStorage and then loading it using PigStorage. If that is so, it obviously will not work. PigStorage and BinStorage aren't interoperable in this way. Specifically, data stored using BinStorage can only be loaded using BinStorage. Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab separated text file, which I load using PigStorage() and store using BinStorage() {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load file 'binstoragecreateop' in the following way. {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. 
Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage() where is ok not to specify a schema??? Should it not be consistent across both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
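(Editorial note, not from the issue itself: a hedged sketch of two ways the ERROR 1022 in the PigStorage() case is commonly avoided, by declaring the schema in the load statement, or by casting explicitly before attaching a name, so no schema-prefix merge on a bare bytearray is attempted.)
{code}
-- Option 1: declare the schema at load time.
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = foreach A generate name;
dump B;

-- Option 2: cast explicitly instead of merging an untyped field into a typed alias.
A2 = load '/user/viraj/visits.txt' using PigStorage();
B2 = foreach A2 generate (chararray)$0 as name;
dump B2;
{code}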
[jira] Commented: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12858709#action_12858709 ] Ashutosh Chauhan commented on PIG-1378: ---
{noformat}
grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}
This is incorrect. You need to do the following:
{noformat}
grunt> a = load 'har://hdfs-namenode.foo.com:8020/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}
Note that after the har scheme comes the underlying filesystem scheme (hdfs), then a dash, then the namenode host, then a colon, then the port number (8020), and then the location of your har archive. har url not usable in Pig scripts - Key: PIG-1378 URL: https://issues.apache.org/jira/browse/PIG-1378 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.8.0 I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell:
{noformat}
$ hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw--- 5 viraj users 1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1
{noformat}
Using similar URLs in grunt yields:
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}
{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs
2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
	at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
	at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
	at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
	at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
	at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
	at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
	at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs
	at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
	at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
	...
13 more
{noformat}
According to Jira http://issues.apache.org/jira/browse/PIG-1234 I tried the following, as stated in the original description:
{noformat}
grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}
{noformat}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data';
... 8 more
Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
	at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
	at