[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910441#action_12910441 ]

Ankur commented on PIG-1229:

In the putNext() method, count is reset to 0 every time the number of tuples added to the batch exceeds 'batchSize'. The batch is then executed and its parameters cleared. There is currently an ExecException in the putNext() method that is being ignored. Can you try adding some debugging System.outs and check the stdout/stderr of your reducers to see if that is the problem?

allow pig to write output into a JDBC db
--

Key: PIG-1229
URL: https://issues.apache.org/jira/browse/PIG-1229
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
Fix For: 0.8.0
Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch

UDF to store data into a DB

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
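The putNext() control flow described in the comment can be sketched as follows. This is an illustrative stand-in, not the actual DBStorage code: a plain list and a flush callback play the role of PreparedStatement.addBatch()/executeBatch(), and the class, field, and parameter names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the counter/flush logic described above. A real implementation
// would call PreparedStatement.addBatch()/executeBatch(); here a list and a
// callback stand in for JDBC so the control flow is visible.
class BatchWriter {
    private final List<Object[]> batch = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<Object[]>> flusher; // stands in for executeBatch()

    private int count = 0;

    BatchWriter(int batchSize, Consumer<List<Object[]>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void putNext(Object[] tuple) {
        batch.add(tuple);              // stands in for addBatch()
        if (++count >= batchSize) {
            flusher.accept(batch);     // execute the accumulated batch
            batch.clear();             // stands in for clearParameters()
            count = 0;                 // reset, as described in the comment
        }
    }

    int pending() {
        return count;
    }
}
```

Any exception thrown by the flush (an ExecException in the real code) would surface here, which is why silently ignoring it hides exactly the failure the comment asks the reporter to look for in the reducer logs.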
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Here is my understanding of what happens:

1. The main thread in the JVM executing the test initializes MiniDFSCluster, MiniMRCluster and the HSQLDB server, all in different threads.
2. The test setUp() method is then executed to create table 'ttt', to which data will be written by DBStorage() in the test.
3. Pig statements are then executed that spawn an M/R job as a separate process, which tries to get a connection to the database and create a PreparedStatement for table 'ttt'. This sometimes fails because the DB thread does NOT get a chance to fully persist the table information, and the exception is thrown from the map tasks, as noted by Ashutosh.

The fix is to add a 5 sec sleep in the setUp() method to give the DB a chance to persist the table information. This alleviates the problem and the test passes over repeated runs. Note that the ideal fix would have been to busy-wait for table creation to complete, but I don't see a method in HSqlDB to do that.
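The busy-wait the comment calls ideal can be sketched generically (an illustration, not part of the patch). In the test, the condition could poll for the table's existence, e.g. via JDBC's DatabaseMetaData.getTables() (a hypothetical usage here), instead of sleeping a fixed 5 seconds:

```java
import java.util.function.BooleanSupplier;

// Generic busy-wait helper: an alternative sketch to a fixed Thread.sleep(5000).
// In the test, 'condition' would check that table 'ttt' is visible, e.g. by
// asking connection.getMetaData().getTables(...) whether a row comes back.
class TableWait {
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;                       // gave up: condition never held
            }
            try {
                Thread.sleep(pollMs);               // back off briefly between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;                                // condition held within the timeout
    }
}
```

The advantage over a fixed sleep is that the test proceeds as soon as the table is visible and only fails after an explicit timeout, instead of racing a hard-coded delay.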
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229-final.test-fix.patch)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Aaron, Autocommit() was not the issue. The problem was the use of a jdbc:hsqldb:file: URL in the STORE function. Replacing it with jdbc:hsqldb:hsql://localhost/dbname solved the issue. Attaching the updated patch with the test case modification. Really appreciate your help here. Thanks a lot :-)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Attaching the patch with fixes to the test case:

1. Starting the HsqlDB server manually - dbServer.start().
2. Supplying user name and password when initializing DBStorage.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229-final.test-fix.patch)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.patch

Hope this one finally goes in.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Status: Patch Available (was: In Progress)

Regenerated the patch as per Ashutosh's suggestion.
[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce
[ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892176#action_12892176 ]

Ankur commented on PIG-1516:

Having the finalize method AT ALL for the purpose of deleting files when the object is garbage collected is NOT a good solution. Generally speaking, using finalizers to release non-memory resources like file handles should be avoided, as it can introduce an insidious bug. From the article on object finalization and cleanup - http://www.javaworld.com/jw-06-1998/jw-06-techniques.html:

"Don't rely on finalizers to release non-memory resources. An example of an object that breaks this rule is one that opens a file in its constructor and closes the file in its finalize() method. Although this design seems neat, tidy, and symmetrical, it potentially creates an insidious bug. A Java program generally will have only a finite number of file handles at its disposal. When all those handles are in use, the program won't be able to open any more files."

finalize in bag implementations causes pig to run out of memory in reduce
--

Key: PIG-1516
URL: https://issues.apache.org/jira/browse/PIG-1516
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

*Problem:* Pig bag implementations that are subclasses of DefaultAbstractBag have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory they use is freed only after finalization happens. If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen when a large number of small bags are being created.

*Solution:* The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use for the finalize function. A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.

*Possible workaround for earlier releases:* Since the fix is going into 0.8, here is a workaround - disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem, as it is likely to slow down the query. To disable the combiner, set the property: -Dpig.exec.nocombiner=true
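The FileList idea described in the solution can be sketched minimally (this is an illustration of the design, not the actual class committed to Pig): only this small helper carries a finalizer, so bags that never spill never enter the finalization queue at all.

```java
import java.io.File;
import java.util.ArrayList;

// Sketch of the FileList design described above: the bag holds a FileList
// instead of an ArrayList<File>, and only this small object has a finalizer.
class FileList extends ArrayList<File> {

    // Deletes every tracked spill file; finalize() simply delegates here,
    // so cleanup can also be invoked deterministically.
    void deleteAll() {
        for (File f : this) {
            f.delete();
        }
    }

    @Override
    protected void finalize() {
        deleteAll(); // spill files are removed when this object is GC'd
    }
}
```

The design point is that the finalization cost is paid only by bags that actually created spill files; a small in-memory bag carries no finalizer and is reclaimed on the normal GC path.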
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885840#action_12885840 ]

Ankur commented on PIG-1482:

Forgot to add - include this change as well for the above script to work:

G = FOREACH F GENERATE group.v1, group.a;
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885839#action_12885839 ]

Ankur commented on PIG-1482:

Casting early alleviates the problem. So this makes the above script work:

C = FOREACH B GENERATE (chararray) v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 : 0) as v3:int;
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885838#action_12885838 ]

Ankur commented on PIG-1482:

ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias K
    at org.apache.pig.PigServer.openIterator(PigServer.java:521)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:391)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias K
    at org.apache.pig.PigServer.store(PigServer.java:577)
    at org.apache.pig.PigServer.openIterator(PigServer.java:504)
    ... 6 more
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
    at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:89)
    at org.apache.pig.PigServer.validate(PigServer.java:930)
    at org.apache.pig.PigServer.compileLp(PigServer.java:884)
    at org.apache.pig.PigServer.store(PigServer.java:568)
    ... 7 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1053: Cannot resolve load function to use for casting from bytearray to chararray.
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1775)
    at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:67)
    at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:32)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2819)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2723)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
    ... 13 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3161)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
[jira] Created: (PIG-1482) Pig gets confused when more than one loader is involved
Pig gets confused when more than one loader is involved
--

Key: PIG-1482
URL: https://issues.apache.org/jira/browse/PIG-1482
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur

When two relations loaded using different loaders are joined, grouped and projected, pig gets confused trying to find the appropriate loader for the requested cast. Consider the following script:

A = LOAD 'data1' USING PigStorage() AS (s, m, l);
B = FOREACH A GENERATE s#'k1' as v1, m#'k2' as v2, l#'k3' as v3;
C = FOREACH B GENERATE v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 : 0) as v3:int;
D = LOAD 'data2' USING TextLoader() AS (a);
E = JOIN C BY v1, D BY a USING 'replicated';
F = GROUP E BY (v1, a);
G = FOREACH F GENERATE (chararray)group.v1, group.a;
dump G;

This throws an error, the stack trace of which is in the next comment.
[jira] Created: (PIG-1462) No informative error message on parse problem
No informative error message on parse problem
-

Key: PIG-1462
URL: https://issues.apache.org/jira/browse/PIG-1462
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur

Consider the following script:

in = load 'data' using PigStorage() as (m:map[]);
tags = foreach in generate m#'k1' as (tagtuple: tuple(chararray));
dump tags;

This throws the following error message, which does not really say that this is a bad declaration:

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Encountered at line 2, column 38. Was expecting one of:
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:391)
[jira] Commented: (PIG-1462) No informative error message on parse problem
[ https://issues.apache.org/jira/browse/PIG-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881551#action_12881551 ]

Ankur commented on PIG-1462:

Right, the JIRA is for adding a better error message that doesn't leave a user guessing.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869552#action_12869552 ]

Ankur commented on PIG-1229:

Hi Ashutosh, thanks for helping out here. The error that you see - "...The database is already in use by another process" - is due to locking issues in hsqldb 1.8.0.7. Upgrading to 1.8.0.10 alleviates the problem and the test passes successfully. A few changes that I made:

1. Added a placeholder record-writer, as PigOutputFormat calls close() on it and throws a null pointer exception if we return null from our output format.
2. Looks like you missed the ivy.xml and build.xml changes to pull the correct hsqldb jar.
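Point 1 can be illustrated with a small sketch. The interfaces here are hypothetical stand-ins, not Hadoop's actual OutputFormat/RecordWriter API: the point is only that a caller which unconditionally invokes close() will hit a NullPointerException on a null writer, so a do-nothing placeholder is returned instead.

```java
// Sketch of the placeholder-writer pattern described in point 1 above,
// using hypothetical interfaces (not Hadoop's real classes).
class NoOpWriter {
    interface Writer {
        void write(Object key, Object value);
        void close();
    }

    // A do-nothing writer: the actual DB work happens elsewhere (in the
    // store function), so this exists only to keep the caller's close() safe.
    static Writer placeholder() {
        return new Writer() {
            public void write(Object key, Object value) { /* no-op */ }
            public void close() { /* no-op */ }
        };
    }

    // Mimics a PigOutputFormat-style caller that closes unconditionally.
    static boolean closeSafely(Writer w) {
        try {
            w.close(); // NPEs if w is null
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }
}
```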
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: pig-1229.2.patch
[jira] Created: (PIG-1393) Bug in Nested FOREACH
Bug in Nested FOREACH
-

Key: PIG-1393
URL: https://issues.apache.org/jira/browse/PIG-1393
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur
Fix For: 0.8.0

The following script makes the parser throw an error:

A = load 'data' as (a: int, b: map[]);
B = foreach A generate ((chararray) b#'url') as url;
C = foreach B {
    urlQueryFields = url#'queryFields';
    result = (urlQueryFields is not null) ? urlQueryFields : 1;
    generate result;
};
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-v3.patch

Here you go ...
[jira] Created: (PIG-1392) Parser fails to recognize valid field
Parser fails to recognize valid field
-

Key: PIG-1392
URL: https://issues.apache.org/jira/browse/PIG-1392
Project: Pig
Issue Type: Bug
Reporter: Ankur

Using the script below, the parser fails to recognize a valid field in the relation and throws an error:

A = LOAD '/tmp' as (a:int, b:chararray, c:int);
B = GROUP A BY (a, b);
C = FOREACH B {
    bg = A.(b,c);
    GENERATE group, bg;
};

The error thrown is:

2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}}
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1392:
---

Fix Version/s: 0.7.0
[jira] Created: (PIG-1379) Jars registered from command line should override the ones present in the script
Jars registered from command line should override the ones present in the script
-

Key: PIG-1379
URL: https://issues.apache.org/jira/browse/PIG-1379
Project: Pig
Issue Type: Improvement
Reporter: Ankur
Fix For: 0.7.0

Jars that are registered from the command line when executing the pig script should override the ones that are specified via 'register' in the pig script itself.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856761#action_12856761 ]

Ankur commented on PIG-1229:

Any updates?
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855835#action_12855835 ]

Ankur commented on PIG-1229:

*Sigh* The problem is with hadoop's Path implementation, which has problems understanding JDBC URLs correctly. So turning relToAbsPathForStoreFunction() does NOT help. The URISyntaxException is now propagated to the point of setting the output path for the job. Here is the new trace from the test execution failure with the suggested workaround:

org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:332)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
    at org.apache.pig.PigServer.execute(PigServer.java:828)
    at org.apache.pig.PigServer.access$100(PigServer.java:105)
    at org.apache.pig.PigServer$Graph.execute(PigServer.java:1080)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:288)
    at org.apache.pig.piggybank.test.storage.TestDBStorage.testWriteToDB(Unknown Source)
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:624)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
    at org.apache.hadoop.fs.Path.initialize(Path.java:140)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:45)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:459)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
    at java.net.URI.checkPath(URI.java:1787)
    at java.net.URI.<init>(URI.java:735)
    at org.apache.hadoop.fs.Path.initialize(Path.java:137)
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853843#action_12853843 ] Ankur commented on PIG-1229: So accepting the JDBC URL in setStoreLocation() exposes a flaw in Hadoop's Path class, and it causes the test case to fail with the following exception:

java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
at org.apache.hadoop.fs.Path.initialize(Path.java:140)
at org.apache.hadoop.fs.Path.<init>(Path.java:126)
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:238)
at org.apache.pig.StoreFunc.relToAbsPathForStoreLocation(StoreFunc.java:60)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3587)
...
...
Caused by: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
at java.net.URI.checkPath(URI.java:1787)
at java.net.URI.<init>(URI.java:735)
at org.apache.hadoop.fs.Path.initialize(Path.java:137)

Looking at the code of Path.java, it seems to extract the scheme based on the first occurrence of ':'. This causes the authority and path to be extracted incorrectly, resulting in the above exception being thrown by java.net.URI. However, if I try to initialize a URI directly with the URL string, no exception is thrown. As for the DB reachability check, I think it is OK to check availability at runtime and fail if the DB is unavailable; we do this in prepareToWrite(). As for the performance enhancement, I think we can track that via a separate issue. This patch has taken quite a while now and I wouldn't want to delay it further by depending on a Hadoop fix.
So if a reviewer does not find any blocking issues, then my suggestion is to go ahead with the commit. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852243#action_12852243 ] Ankur commented on PIG-1229: Ashutosh, thanks for the review comments. Accepting the store location via setStoreLocation() definitely makes sense. However, I am not sure about checking database reachability in checkOutputSpecs(), since that may be called on the client side as well, and the DB machine may not be reachable from the client machine. Isn't OutputFormat's setupTask() a better place to do a DB availability check? This sounds like a reasonable ask before a commit; I will incorporate it and submit a new patch. Regarding doing DataType.find(), I assume this is what you have in mind:
1. Get the DB schema information for the table we are writing to.
2. Use the checkSchema() API to validate this against the Pig-supplied schema and cache it.
3. Use the cached information in the putNext() method.
This is more of a performance enhancement and looks like more work, so I would prefer that we track it as a separate JIRA for DBStorage. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
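A fail-fast reachability check of the kind discussed here (attempt a connection where the tasks actually run, rather than on the client) might look like the following. This is an illustrative sketch, not the patch's code; checkReachable and the call site are hypothetical names.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DbReachabilityCheck {
    /**
     * Try to open (and immediately close) a connection so that an
     * unreachable database fails the task early with a clear error,
     * instead of surfacing later during batched writes.
     */
    static void checkReachable(String jdbcUrl, String user, String pass) throws IOException {
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, pass)) {
            // Connection opened successfully; nothing more to verify.
        } catch (SQLException e) {
            throw new IOException("Database not reachable: " + jdbcUrl, e);
        }
    }

    public static void main(String[] args) {
        try {
            // No driver is registered for this made-up URL, so this fails fast.
            checkReachable("jdbc:nosuchdb://example.invalid/db", "user", "pass");
        } catch (IOException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

Run from setupTask() (or prepareToWrite(), as the comment above suggests), this turns a mid-job failure into an immediate, attributable one.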
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: jira-1229-v2.patch Here is the updated patch that compiles against the Pig 0.7 branch and implements the new load/store APIs. Note: I haven't used Hadoop's DBOutputFormat, as that code has not yet been moved to o.a.h.mapreduce.lib and hence there are compatibility issues. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
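The batched-write pattern described for this patch's putNext() (add rows to a JDBC batch, then flush and reset the counter once batchSize rows accumulate) can be sketched without a database by recording calls through a dynamic proxy. BATCH_SIZE, count, and putNext here are illustrative names, not the DBStorage code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class BatchWriteDemo {
    // Records the JDBC calls the batching logic makes, so the flush
    // pattern can be observed without a real database connection.
    static final List<String> calls = new ArrayList<>();
    static int count = 0;
    static final int BATCH_SIZE = 3;

    static PreparedStatement fakeStatement() {
        InvocationHandler handler = (proxy, method, args) -> {
            calls.add(method.getName());
            // executeBatch() is the only method used here with a non-void return.
            return method.getName().equals("executeBatch") ? new int[0] : null;
        };
        return (PreparedStatement) Proxy.newProxyInstance(
                BatchWriteDemo.class.getClassLoader(),
                new Class<?>[] { PreparedStatement.class }, handler);
    }

    // Illustrative putNext(): bind each field, buffer the row, and flush
    // the batch (resetting the counter) once BATCH_SIZE rows accumulate.
    static void putNext(PreparedStatement ps, Object[] row) throws Exception {
        for (int i = 0; i < row.length; i++) {
            ps.setObject(i + 1, row[i]);
        }
        ps.addBatch();
        if (++count >= BATCH_SIZE) {
            ps.executeBatch();
            ps.clearBatch();
            count = 0;
        }
    }

    public static void main(String[] args) throws Exception {
        PreparedStatement ps = fakeStatement();
        for (int i = 0; i < 7; i++) {
            putNext(ps, new Object[] { i, "row" + i });
        }
        long flushes = calls.stream().filter("executeBatch"::equals).count();
        System.out.println("rows=7 flushes=" + flushes); // 2 flushes, 1 row pending
    }
}
```

Note the one design wrinkle this makes visible: rows still pending when the last tuple arrives must be flushed separately (in the real StoreFunc, at task commit/close time), or the tail of the data is silently dropped.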
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: (was: hsqldb.jar) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: (was: jira-1229.patch) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Status: In Progress (was: Patch Available) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Status: Patch Available (was: In Progress) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1327) Incorrect column pruning after multiple JOIN operations
Incorrect column pruning after multiple JOIN operations --- Key: PIG-1327 URL: https://issues.apache.org/jira/browse/PIG-1327 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur In a script with multiple JOIN and GROUP operations, the column pruner incorrectly removes some of the fields that it shouldn't. Here is a script that demonstrates the issue:

A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, f:chararray);
join1 = JOIN B by x, A by a;
filtered1 = FILTER join1 BY y == b;
InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
join2 = JOIN InterimData BY b LEFT OUTER, C BY d PARALLEL 2;
proj = FOREACH join2 GENERATE a,b,y,z,e,f;
TopNPrj = FOREACH proj GENERATE a, ((e is not null and e != '') ? e : 'None'), z;
TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), SUM(TopNPrj.z) as views;
TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
TopNDataCount = FOREACH TopNDataRegrp {
    OrderedData = ORDER TopNDataSum BY views desc;
    LimitedData = LIMIT OrderedData 50;
    GENERATE LimitedData;
}
TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
store TopNData into 'tmpTopN';
TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) PARALLEL 2;
describe joinTopNData;
STORE joinTopNData INTO 'output';

The column 'f' from relation 'C' participating in the 2nd JOIN is missing from the final join output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1327) Incorrect column pruning after multiple JOIN operations
[ https://issues.apache.org/jira/browse/PIG-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849995#action_12849995 ] Ankur commented on PIG-1327: Yes, I verified that. Incorrect column pruning after multiple JOIN operations --- Key: PIG-1327 URL: https://issues.apache.org/jira/browse/PIG-1327 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur In a script with multiple JOIN and GROUP operations, the column pruner incorrectly removes some of the fields that it shouldn't. Here is a script that demonstrates the issue:

A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, f:chararray);
join1 = JOIN B by x, A by a;
filtered1 = FILTER join1 BY y == b;
InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
join2 = JOIN InterimData BY b LEFT OUTER, C BY d PARALLEL 2;
proj = FOREACH join2 GENERATE a,b,y,z,e,f;
TopNPrj = FOREACH proj GENERATE a, ((e is not null and e != '') ? e : 'None'), z;
TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), SUM(TopNPrj.z) as views;
TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
TopNDataCount = FOREACH TopNDataRegrp {
    OrderedData = ORDER TopNDataSum BY views desc;
    LimitedData = LIMIT OrderedData 50;
    GENERATE LimitedData;
}
TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
store TopNData into 'tmpTopN';
TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) PARALLEL 2;
describe joinTopNData;
STORE joinTopNData INTO 'output';

The column 'f' from relation 'C' participating in the 2nd JOIN is missing from the final join output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847909#action_12847909 ] Ankur commented on PIG-1229: @Ashtosh Chauhan I read the HSQLDB license and it looked ok to me but I am not a lawyer :-) . Besides that apache cocoon uses it. I think we should be ok pulling it through ivy. I'll make the ivy and load-store related changes and submit a new patch on Monday. Sorry for the delay. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.7.0 Attachments: hsqldb.jar, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1273) Skewed join throws error
Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840482#action_12840482 ] Ankur commented on PIG-1273: Here is a simple script to reproduce it:

a = load 'test.dat' using PigStorage() as (nums:chararray);
b = load 'join.dat' using PigStorage('\u0001') as (number:chararray,text:chararray);
c = filter a by nums == '7';
d = join c by nums LEFT OUTER, b by number USING 'skewed';
dump d;

test.dat:
1
2
3
4
5

join.dat:
1^Aone
2^Atwo
3^Athree

where ^A is the Control-A character used as a separator. Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840483#action_12840483 ] Ankur commented on PIG-1273: Complete stack trace of the error thrown by the 3rd M/R job in the pipeline:

java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:448)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 6 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:128)
... 11 more
Caused by: java.lang.RuntimeException: Empty samples file
at org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil.loadPartitionFile(MapRedUtil.java:128)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:125)
... 11 more

Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1274) Column pruning throws Null pointer exception
Column pruning throws Null pointer exception Key: PIG-1274 URL: https://issues.apache.org/jira/browse/PIG-1274 Project: Pig Issue Type: Bug Reporter: Ankur In case data has missing values for certain columns in a relation participating in a join, column pruning throws null pointer exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835136#action_12835136 ] Ankur commented on PIG-1233: In the current code path we cannot have a situation where intermediateCount is NOT null but intermediateSum is null, so just checking the former is sufficient. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
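The guard being described, where checking intermediateCount alone suffices because intermediateSum can never lag behind it, can be sketched as below. Field and method names mirror the discussion (intermediateCount, intermediateSum, accumulate(), getValue()); the real AVG implementation differs.

```java
public class AvgGuardDemo {
    // Boxed running state, mirroring the description in the issue:
    // both fields stay null until accumulate() is called at least once.
    private Long intermediateCount;
    private Double intermediateSum;

    public void accumulate(double value) {
        intermediateCount = (intermediateCount == null) ? 1L : intermediateCount + 1;
        intermediateSum = (intermediateSum == null) ? value : intermediateSum + value;
    }

    public Double getValue() {
        // Guard before unboxing: without the null check, the numeric
        // comparison below would throw a NullPointerException whenever
        // accumulate() was never called. Checking intermediateCount alone
        // is enough, since intermediateSum is non-null whenever it is.
        if (intermediateCount != null && intermediateCount > 0) {
            return intermediateSum / intermediateCount;
        }
        return null;
    }

    public static void main(String[] args) {
        AvgGuardDemo avg = new AvgGuardDemo();
        System.out.println(avg.getValue()); // null, instead of an NPE
        avg.accumulate(2.0);
        avg.accumulate(4.0);
        System.out.println(avg.getValue()); // 3.0
    }
}
```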
[jira] Created: (PIG-1238) Dump does not respect the schema
Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834151#action_12834151 ] Ankur commented on PIG-1238: Here is a script to reproduce the issue:

A = LOAD 'two.txt' USING PigStorage();
B = FOREACH A GENERATE ['a'#'12'] as b:map[], ['b'#['c'#'12']] as mapFields;
C = FOREACH B GENERATE (CHARARRAY) mapFields#'b'#'c' AS f1, RANDOM() AS f2;
D = ORDER C BY f2 PARALLEL 10;
E = LIMIT D 20;
F = FOREACH E GENERATE f1;
describe F;
dump F;

With the above script, here is a snippet of the logs that might be useful:
...
...
2010-02-16 10:42:44,814 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2010-02-16 10:42:55,966 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2010-02-16 10:42:55,981 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: hdfs://mithrilblue-nn1.blue.ygrid.yahoo.com/tmp/temp-1870551954/tmp-470213889
2010-02-16 10:42:55,991 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 1 time(s).
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 1
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 14
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

(12,)

Note: If we remove PARALLEL 10 from the ORDER BY, correct results are produced and NO warning is thrown.
Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834642#action_12834642 ] Ankur commented on PIG-1238: Daniel the correct syntax is - ['b'#['c'#'12']] as mapFields. Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834643#action_12834643 ] Ankur commented on PIG-1238: Seems like inner [] are making parts of it appear underlined. Correct syntax is ['b'# ['c'#'12'] ] as mapFields Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834644#action_12834644 ] Ankur commented on PIG-1238: *Sigh* Enclose 'c'#'12' in square brackets, and then enclose 'b'# ... in another pair of square brackets. Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834645#action_12834645 ] Ankur commented on PIG-1233: Olga, all queries that use AVG(), have null values for certain keys, and have the accumulator turned on are affected by this. Please see the test case for a sample query. The current workaround is to filter the nulls before averaging. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: In Progress (was: Patch Available) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: In Progress) Retrying as suggested by Olga NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: In Progress (was: Patch Available) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Attachment: jira-1233.patch Added test case NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: In Progress) Retrying hudson after adding the suggested test case NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: jira-1229.patch Updated code with added test case using HSQLDB (binary part of the patch). allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Fix Version/s: 0.6.0 Status: Patch Available (was: Open) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.6.0 Attachments: jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: hsqldb.jar Attaching hsqldb.jar separately as including it in the patch does not work allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.6.0 Attachments: hsqldb.jar, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834075#action_12834075 ] Ankur commented on PIG-1233: The test report URLs don't work. Is this the correct one ? http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/205/testReport/ Looks alright to me. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1233) NullPointerException in AVG
NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Attachment: jira-1233.patch Attached is a simple patch that adds the required null checks. The code change is small, so I don't think any new test cases are needed. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: Open) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
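To make the PIG-1233 failure mode concrete, here is a minimal, hypothetical Java sketch (illustrative names only, not Pig's actual AVG source): if accumulate() is never called, the boxed intermediate fields stay null, and auto-unboxing them in getValue() throws the NullPointerException; the fix in the attached patch is a null check before any unboxing.

```java
// Hypothetical, simplified model of the bug described in PIG-1233.
// Not Pig's actual AVG code; field and method names are illustrative.
public class AvgSketch {
    private Double intermediateSum = null;
    private Long intermediateCount = null;

    public void accumulate(double v) {
        intermediateSum = (intermediateSum == null ? 0.0 : intermediateSum) + v;
        intermediateCount = (intermediateCount == null ? 0L : intermediateCount) + 1;
    }

    // Buggy version: "intermediateCount > 0" auto-unboxes a null Long -> NPE
    // when accumulate() was never called.
    public Double getValueBuggy() {
        return intermediateCount > 0 ? intermediateSum / intermediateCount : null;
    }

    // Fixed version: null-check before any unboxing, as the patch does.
    public Double getValue() {
        if (intermediateCount == null || intermediateCount == 0) {
            return null;
        }
        return intermediateSum / intermediateCount;
    }

    public static void main(String[] args) {
        AvgSketch avg = new AvgSketch();
        System.out.println(avg.getValue());  // prints null instead of throwing
        avg.accumulate(2.0);
        avg.accumulate(4.0);
        System.out.println(avg.getValue());  // prints 3.0
    }
}
```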
[jira] Assigned: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned PIG-1229: -- Assignee: Ankur allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: DbStorage.java UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831337#action_12831337 ] Ankur commented on PIG-1229: Aaron, Thanks for the suggestions. I'll have an updated patch coming soon. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: DbStorage.java UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
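The DBStorage UDF under discussion writes tuples through a JDBC PreparedStatement in batches: rows are buffered, and once 'batchSize' rows have accumulated the batch is executed, its parameters cleared, and the count reset to 0. A minimal sketch of that batching pattern follows; the Flusher interface and all names here are hypothetical stand-ins for the real addBatch()/executeBatch() calls, so no database is needed to run it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the putNext() batching pattern discussed in this
// thread. All names are hypothetical; in DBStorage the Flusher role is
// played by a JDBC PreparedStatement.
public class BatchWriter {
    public interface Flusher { void execute(List<String> batch); }

    private final int batchSize;
    private final Flusher flusher;
    private final List<String> batch = new ArrayList<>();
    private int count = 0;

    public BatchWriter(int batchSize, Flusher flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Analogue of putNext(Tuple): buffer the row, flush when the batch fills.
    public void putNext(String row) {
        batch.add(row);
        count++;
        if (count >= batchSize) {
            flusher.execute(batch);  // executeBatch() in the JDBC version
            batch.clear();           // clear the batched parameters
            count = 0;               // reset, as described in the comment
        }
    }

    // Analogue of the commit/close path: flush any trailing partial batch.
    public void finish() {
        if (!batch.isEmpty()) {
            flusher.execute(batch);
            batch.clear();
            count = 0;
        }
    }

    public static void main(String[] args) {
        BatchWriter w = new BatchWriter(3, b -> System.out.println("flush " + b));
        for (int i = 0; i < 7; i++) w.putNext("row" + i);
        w.finish();  // flushes the trailing partial batch of 1 row
    }
}
```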
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800610#action_12800610 ] Ankur commented on PIG-1191: I'll check and update the ticket POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800636#action_12800636 ] Ankur commented on PIG-1191: Case 1, 2: Succeeds Case 3 : Fails Case 4,5: Empty results. Both of them are using consecutive projection of complex fields. I'll add 1 more test case POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800655#action_12800655 ] Ankur commented on PIG-1191: CASE 6: In CASE 1 replace LIMIT with a GROUP BY followed by FOREACH Succeeds with the given patch. POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned PIG-1191: -- Assignee: Pradeep Kamath POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Pradeep Kamath Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800660#action_12800660 ] Ankur commented on PIG-1191: Small correction in the comment dated - 15/Jan/10 09:39 AM: Case 5: Still FAILS POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Pradeep Kamath Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800609#action_12800609 ] Ankur commented on PIG-1191: Listed below are the identified cases.

CASE 1: LOAD - FILTER - FOREACH - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries = FILTER sds BY mapFields#'page_params'#'query' is NOT NULL;
queries_rand = FOREACH queries GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries_limit = LIMIT queries_rand 100;
STORE queries_limit INTO 'out';
RESULT
---
FAILS in the reduce stage with the following exception:
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)

CASE 2: LOAD - FOREACH - FILTER - LIMIT - STORE
===
Note that the FILTER and FOREACH order is reversed.
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries_rand = FOREACH sds GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries = FILTER queries_rand BY query_string IS NOT null;
queries_limit = LIMIT queries 100;
STORE queries_limit INTO 'out';
RESULT
---
SUCCESS - Results are correctly stored. So if a projection is done before FILTER, it receives the LoadFunc in the POCast operator and everything is cool.

CASE 3: LOAD - FOREACH - FOREACH - FILTER - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_filtered = FILTER queries BY query_string IS NOT null;
queries_limit = LIMIT queries_filtered 100;
STORE queries_limit INTO 'out';
RESULT
---
FAILS in the Map stage. Looks like the 2nd FOREACH did not get the loadFunc and bailed out with the following stack trace:
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at

CASE 4: LOAD - FOREACH - FOREACH - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
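The common thread in the failing cases above is that POCast receives a bytearray but no longer knows which load function produced it. A minimal, hypothetical sketch of that mechanism (illustrative names, not Pig's actual POCast code): when the converter link is lost during plan rewriting, the cast has nothing to convert with and must bail out with ERROR 1075.

```java
// Simplified model of the POCast behaviour behind ERROR 1075. Hypothetical
// names; in Pig the ByteConverter role is played by the script's LoadFunc.
public class CastSketch {
    public interface ByteConverter { String bytesToCharArray(byte[] b); }

    // A null caster models the "lost LoadFunc" situation in CASE 1 / CASE 3.
    private final ByteConverter caster;

    public CastSketch(ByteConverter caster) { this.caster = caster; }

    public String castToString(byte[] raw) {
        if (caster == null) {
            // No known producer for the bytes: bail out, as POCast does.
            throw new RuntimeException(
                "ERROR 1075: Received a bytearray from the UDF. "
                + "Cannot determine how to convert the bytearray to string.");
        }
        return caster.bytesToCharArray(raw);
    }

    public static void main(String[] args) {
        CastSketch withCaster = new CastSketch(b -> new String(b));
        System.out.println(withCaster.castToString("query".getBytes()));  // prints query
    }
}
```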
[jira] Commented: (PIG-761) ERROR 2086 on simple JOIN
[ https://issues.apache.org/jira/browse/PIG-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794005#action_12794005 ] Ankur commented on PIG-761: --- Here is a very simple script to reproduce the issue:
- Start -
data1 = LOAD 'data1' as (a:int, b:int, c:chararray);
proj1 = LIMIT data1 5;
data2 = LOAD 'data2' as (x:int, y:chararray, z:chararray);
proj2 = FOREACH data2 GENERATE x, y;
cogrouped = COGROUP proj1 BY a, proj2 BY x INNER PARALLEL 2;
joined = FOREACH cogrouped GENERATE FLATTEN(proj1), FLATTEN(proj2);
store joined into 'results';
- End -
The problem seems to be with the LIMIT operator for one of the relations participating in the join. Seems like this causes the mismatch between the expected and found local re-arrange operators. ERROR 2086 on simple JOIN - Key: PIG-761 URL: https://issues.apache.org/jira/browse/PIG-761 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Environment: mapreduce mode Reporter: Vadim Zaliva ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109 Doing a pretty straightforward join in one of my pig scripts. I am able to 'dump' both relationships involved in this join. When I try to join them I am getting this error. Here is a full log: ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109 at org.apache.pig.PigServer.registerQuery(PigServer.java:296) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:319) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:700) at org.apache.pig.PigServer.execute(PigServer.java:691) at org.apache.pig.PigServer.registerQuery(PigServer.java:292) ... 5 more Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators. 
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.handlePackage(POPackageAnnotator.java:116) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.visitMROp(POPackageAnnotator.java:88) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:194) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:43) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. MapReduceLauncher.compile(MapReduceLauncher.java:198) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:80) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:261) ... 8 more ERROR 1002: Unable to store alias 398 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 398 at org.apache.pig.PigServer.registerQuery(PigServer.java:296) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:319) Caused by: java.lang.NullPointerException at
[jira] Created: (PIG-1168) Dump produces wrong results
Dump produces wrong results --- Key: PIG-1168 URL: https://issues.apache.org/jira/browse/PIG-1168 Project: Pig Issue Type: Bug Reporter: Ankur For a map-only job, dump just re-executes every pig-latin statement from the beginning, assuming that they would produce the same result. The assumption is not valid if there are UDFs that are invoked. Consider the following script:
raw = LOAD '$input' USING PigStorage() AS (text_string:chararray);
DUMP raw;
ccm = FOREACH raw GENERATE MyUDF(text_string);
DUMP ccm;
bug = FOREACH ccm GENERATE ccmObj;
DUMP bug;
The UDF MyUDF generates a tuple with one of the fields being a randomly generated UUID. So even though one would expect relations 'ccm' and 'bug' to contain identical data, they are different because of re-execution from the beginning. This breaks the application logic. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
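A tiny illustration of why re-execution breaks here, assuming a hypothetical MyUDF like the one described: any evaluation that embeds a random UUID yields different output on each run, so two executions of the same statements cannot be expected to match.

```java
import java.util.UUID;

// Hypothetical stand-in for the MyUDF described in PIG-1168: it appends a
// fresh random UUID on every evaluation, so re-executing the script for
// each DUMP produces different data for 'ccm' and 'bug'.
public class MyUDF {
    public static String exec(String text) {
        return text + "|" + UUID.randomUUID();  // fresh UUID per evaluation
    }

    public static void main(String[] args) {
        System.out.println(MyUDF.exec("hello"));
        System.out.println(MyUDF.exec("hello"));  // different UUID suffix
    }
}
```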
[jira] Created: (PIG-1152) bincond operator throws parser error
bincond operator throws parser error Key: PIG-1152 URL: https://issues.apache.org/jira/browse/PIG-1152 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur The bincond operator throws a parser error when the true branch contains a constant bag with 1 tuple containing a single field of int type with a negative value. Here is the script to reproduce the issue:
A = load 'A' as (s: chararray, x: int, y: int);
B = group A by s;
C = foreach B generate group, flatten(((COUNT(A) 1L) ? {(-1)} : A.x));
dump C;
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1114: --- Attachment: Pig_1114_Client.log MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: Richard Ding Priority: Critical Fix For: 0.6.0 Attachments: Pig_1114_Client.log Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784070#action_12784070 ] Ankur commented on PIG-1114: Richard, I ran the above script again with -M option to confirm that Multiquery was not disabled, instead it worked on 2 separated parts of the script. I am attaching the pig client logs from the run for your reference. MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: Richard Ding Priority: Critical Fix For: 0.6.0 Attachments: Pig_1114_Client.log Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783553#action_12783553 ] Ankur commented on PIG-1114: The error thrown is java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableTuple, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Multi-query optimization throws an error when merging 2 level splits. 
Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783554#action_12783554 ] Ankur commented on PIG-1114: The same script works with the -M (multi-query disabled) option, BUT surprisingly the run indicates that multi-query optimization is now being applied separately to the first STORE and the second STORE. This is just a workaround, but it also indicates that in cases like this, disabling multi-query actually DOES NOT disable it completely; instead it just makes it run on parts of the script. MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated
by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1114: --- Fix Version/s: 0.6.0 MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Fix For: 0.6.0 Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1112) FLATTEN eliminates the alias
FLATTEN eliminates the alias Key: PIG-1112 URL: https://issues.apache.org/jira/browse/PIG-1112 Project: Pig Issue Type: Bug Reporter: Ankur Fix For: 0.6.0 If schema for a field of type 'bag' is partially defined then FLATTEN() incorrectly eliminates the field and throws an error. Consider the following example:
A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, ladder:bag{});
B = FOREACH A GENERATE first, FLATTEN(ladder) as third, second;
C = GROUP B by (first,third);
This throws the error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: third in {first: chararray,second: chararray}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1113) Diamond query optimization throws error in JOIN
Diamond query optimization throws error in JOIN --- Key: PIG-1113 URL: https://issues.apache.org/jira/browse/PIG-1113 Project: Pig Issue Type: Bug Reporter: Ankur The following script results in 1 M/R job as a result of diamond query optimization but the script fails. set1 = LOAD 'set1' USING PigStorage as (a:chararray, b:chararray, c:chararray); set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{}); set2_1 = FOREACH set2 GENERATE a as f1, b as f2, (chararray) 0 as f3; set2_2 = FOREACH set2 GENERATE a as f1, FLATTEN((IsEmpty(c) ? null : c)) as f2, (chararray) 1 as f3; all_set2 = UNION set2_1, set2_2; joined_sets = JOIN set1 BY (a,b), all_set2 BY (f2,f3); dump joined_sets; And here is the error org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a bag to a String at org.apache.pig.data.DataType.toString(DataType.java:739) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:625) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1113) Diamond query optimization throws error in JOIN
[ https://issues.apache.org/jira/browse/PIG-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782877#action_12782877 ] Ankur commented on PIG-1113: The script fails even if correct schema is specified for the c:bag{}. So the following change does not alleviate the problem set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{T:tuple(l:chararray)}); Diamond query optimization throws error in JOIN --- Key: PIG-1113 URL: https://issues.apache.org/jira/browse/PIG-1113 Project: Pig Issue Type: Bug Reporter: Ankur The following script results in 1 M/R job as a result of diamond query optimization but the script fails. set1 = LOAD 'set1' USING PigStorage as (a:chararray, b:chararray, c:chararray); set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{}); set2_1 = FOREACH set2 GENERATE a as f1, b as f2, (chararray) 0 as f3; set2_2 = FOREACH set2 GENERATE a as f1, FLATTEN((IsEmpty(c) ? null : c)) as f2, (chararray) 1 as f3; all_set2 = UNION set2_1, set2_2; joined_sets = JOIN set1 BY (a,b), all_set2 BY (f2,f3); dump joined_sets; And here is the error org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a bag to a String at org.apache.pig.data.DataType.toString(DataType.java:739) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:625) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1108) Incorrect map output key type in MultiQuery optimization
[ https://issues.apache.org/jira/browse/PIG-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782787#action_12782787 ] Ankur commented on PIG-1108: In my test run on the 0.6.0 branch, disabling MQ did not work. Pig client logs showed that MQ was still kicking in, and the mappers failed with the same error message as in the description. It would be good if we could add a few points about SecondaryKey here - http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification Incorrect map output key type in MultiQuery optimization Key: PIG-1108 URL: https://issues.apache.org/jira/browse/PIG-1108 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Richard Ding When trying to merge 2 split plans, one of which never progresses along an M/R boundary, Pig sets the map-output key type incorrectly, resulting in the following error:- java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Here is a small script to be used as a reproducible test case rmf plan1 rmf plan2 A = LOAD 'data' USING PigStorage() as (a: int, b: chararray); 
SPLIT A into plan1 IF (a > 5), plan2 IF (a <= 5); B = GROUP plan1 BY b; C = FOREACH B { tmp = ORDER plan1 BY a desc; GENERATE FLATTEN(group) as b, tmp; }; D = FILTER C BY b is not null; STORE D into 'plan1'; STORE plan2 into 'plan2'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1075) Error in Cogroup when key fields types don't match
Error in Cogroup when key fields types don't match -- Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:- A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int); B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int); C = CoGROUP A BY (a,b,c), B BY (a,b,c); D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B); describe D; dump D; The complete stack trace of the error thrown is Pig Stack Trace --- ERROR 1051: Cannot cast to Unknown org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D at org.apache.pig.PigServer.dumpSchema(PigServer.java:436) at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83) at org.apache.pig.PigServer.compileLp(PigServer.java:821) at org.apache.pig.PigServer.dumpSchema(PigServer.java:428) ... 
6 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 11 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451) ... 16 more The error message does not help the user in identifying the issue clearly especially if the pig script is large and complex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
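Until the error message is improved, the underlying mismatch can be avoided by casting the keys to a common type before the COGROUP. A minimal sketch based on the script above (casting b in A to chararray so both key tuples have matching types):

```pig
A = LOAD 'data' USING PigStorage() AS (a:chararray, b:int, c:int);
A1 = FOREACH A GENERATE a, (chararray) b AS b, c;
B = LOAD 'data' USING PigStorage() AS (a:chararray, b:chararray, c:int);
C = COGROUP A1 BY (a, b, c), B BY (a, b, c);
D = FOREACH C GENERATE FLATTEN(A1), FLATTEN(B);
```

With matching key-field types the type checker can resolve the COGROUP output schema, so ERROR 1051 / ERROR 1060 no longer surface.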
[jira] Commented: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774222#action_12774222 ] Ankur commented on PIG-1075: Pig should throw an error message that better identifies the cause of the problem. Error in Cogroup when key fields types don't match -- Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:- A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int); B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int); C = CoGROUP A BY (a,b,c), B BY (a,b,c); D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B); describe D; dump D; The complete stack trace of the error thrown is Pig Stack Trace --- ERROR 1051: Cannot cast to Unknown org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D at org.apache.pig.PigServer.dumpSchema(PigServer.java:436) at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at 
org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83) at org.apache.pig.PigServer.compileLp(PigServer.java:821) at org.apache.pig.PigServer.dumpSchema(PigServer.java:428) ... 6 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 11 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451) ... 16 more The error message does not help the user in identifying the issue clearly especially if the pig script is large and complex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12773389#action_12773389 ] Ankur commented on PIG-958: --- Can you explain this a little bit more - .. In the earlier patch (958.v3.patch), after moving the results from the task's current working directory, I was manually deleting the directory. This is to ensure that empty part files don't get moved to the final output directory. But doing so causes Hadoop to complain that it can no longer write to the task's output dir, and the task fails. I saw compile errors while trying to run unit test: ... Did you compile pig.jar and run the core tests before? This creates the necessary classes and jar files on the local machine required by the contrib tests. On my local machine gan...@grainflydivide-dr:pig_trunk$ ant ... buildJar: [echo] svnString 830456 [jar] Building jar: /home/gankur/eclipse/workspace/pig_trunk/build/pig-0.6.0-dev-core.jar [jar] Building jar: /home/gankur/eclipse/workspace/pig_trunk/build/pig-0.6.0-dev.jar [copy] Copying 1 file to /home/gankur/eclipse/workspace/pig_trunk gan...@grainflydivide-dr:pig_trunk$ ant test ... test-core: [delete] Deleting directory /home/gankur/eclipse/workspace/pig_trunk/build/test/logs [mkdir] Created dir: /home/gankur/eclipse/workspace/pig_trunk/build/test/logs [junit] Running org.apache.pig.test.TestAdd [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.024 sec [junit] Running org.apache.pig.test.TestAlgebraicEval ... gan...@grainflydivide-dr:pig_trunk$ cd contrib/piggybank/java/ gan...@grainflydivide-dr:java$ ant test ... 
test: [echo] *** Running UDF tests *** [delete] Deleting directory /home/gankur/eclipse/workspace/pig_trunk/contrib/piggybank/java/build/test/logs [mkdir] Created dir: /home/gankur/eclipse/workspace/pig_trunk/contrib/piggybank/java/build/test/logs [junit] Running org.apache.pig.piggybank.test.evaluation.TestEvalString [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.15 sec [junit] Running org.apache.pig.piggybank.test.evaluation.TestMathUDF [junit] Tests run: 35, Failures: 0, Errors: 0, Time elapsed: 0.123 sec [junit] Running org.apache.pig.piggybank.test.evaluation.TestStat [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.114 sec [junit] Running org.apache.pig.piggybank.test.evaluation.datetime.TestDiffDate [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.105 sec [junit] Running org.apache.pig.piggybank.test.evaluation.decode.TestDecode [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.089 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestHashFNV [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.094 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 17.163 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestRegex [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.092 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.TestSearchQuery [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.093 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.TestTop [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.099 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestDateExtractor [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.087 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestHostExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, 
Time elapsed: 0.083 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestSearchEngineExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.091 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestSearchTermExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.1 sec [junit] Running org.apache.pig.piggybank.test.storage.TestCombinedLogLoader [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.535 sec [junit] Running org.apache.pig.piggybank.test.storage.TestCommonLogLoader [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.54 sec [junit] Running org.apache.pig.piggybank.test.storage.TestHelper [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.014 sec [junit] Running org.apache.pig.piggybank.test.storage.TestMultiStorage [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 16.964 sec [junit] Running org.apache.pig.piggybank.test.storage.TestMyRegExLoader [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.452 sec [junit] Running
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772925#action_12772925 ] Ankur commented on PIG-958: --- Can we have an update on this please ? Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
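The core idea of the proposed store function can be sketched independently of the Pig StoreFunc API: at write time, each record is routed to a bucket derived dynamically from the value of a key field. The class and method names below are illustrative, not taken from the patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Routes each record into a bucket named after the value of its key field,
// the way a key-splitting store function routes tuples to per-key output
// files/directories not known until the data is seen.
public class KeySplitter {
    private final Map<String, List<String>> buckets = new HashMap<>();

    // keyIndex selects which tab-separated field determines the bucket.
    public void put(String record, int keyIndex) {
        String key = record.split("\t")[keyIndex];
        buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
    }

    public Map<String, List<String>> getBuckets() {
        return buckets;
    }
}
```

In the real UDF the buckets would be per-key writers under a user-specified parent output directory rather than in-memory lists, but the routing logic is the same.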
[jira] Created: (PIG-1060) MultiQuery optimization throws error for multi-level splits
MultiQuery optimization throws error for multi-level splits --- Key: PIG-1060 URL: https://issues.apache.org/jira/browse/PIG-1060 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Consider the following scenario :- 1. Multi-level splits in the map plan. 2. Each split branch further progressing across a local-global rearrange. 3. Output of each of these finally merged via a UNION. MultiQuery optimizer throws the following error in such a case: ERROR 2146: Internal Error. Inconsistency in key index found during optimization. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits
[ https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771390#action_12771390 ] Ankur commented on PIG-1060: Here's a sample script to illustrate the issue. Note that the sample data isn't very important here, since it is the optimization and execution that fail. === test.pig data = LOAD 'dummy' as (name:chararray, freq:int); filter1 = FILTER data BY freq > 5; group1 = GROUP filter1 BY name; proj1 = FOREACH group1 GENERATE FLATTEN(group), 'string1', SUM(filter1.freq); filter2 = FILTER data by freq > 5; group2 = GROUP filter2 BY name; proj2 = FOREACH group2 GENERATE FLATTEN(group), 'string2', SUM(filter2.freq); filter3 = FILTER filter2 by freq < 10; group3 = GROUP filter3 BY name; proj3 = FOREACH group3 GENERATE FLATTEN(group), 'string3', SUM(filter3.freq); filter4 = FILTER filter3 by freq > 7; group4 = GROUP filter4 BY name; proj4 = FOREACH group4 GENERATE FLATTEN(group), 'string4', SUM(filter4.freq); M1 = LIMIT proj1 10; M2 = LIMIT proj2 10; M3 = LIMIT proj3 10; M4 = LIMIT proj4 10; U = UNION M1, M2, M3, M4; STORE U INTO 'res' USING PigStorage(); The dot output can be dumped via the command 'explain -dot -script test.pig;' to visualize the scenario. A surprising observation is that despite turning MultiQuery off using -M, the MultiQuery optimizer still runs and fails the script. MultiQuery optimization throws error for multi-level splits --- Key: PIG-1060 URL: https://issues.apache.org/jira/browse/PIG-1060 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Consider the following scenario :- 1. Multi-level splits in the map plan. 2. Each split branch further progressing across a local-global rearrange. 3. Output of each of these finally merged via a UNION. MultiQuery optimizer throws the following error in such a case: ERROR 2146: Internal Error. Inconsistency in key index found during optimization. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Attachment: 958.v4.patch 1. When run in cluster mode, the static variable PigMapReduce.sJobConf is null when checked in the UDF constructor but NOT null when the UDF is actually invoked. This caused incorrect initialization of the FileSystem object 'fs' to the local filesystem, causing the test to fail. Moved the 'fs' initialization to the initJobSpecificParams() method. 2. Deleting the temporary directory manually in finish() causes the job to fail. Removed the manual deletion. As a side effect, the user-specified PARENT output directory in the UDF will have empty part-* files. These should be deleted manually by the user. Verified that the UDF works correctly and that the unit tests pass. Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
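Fix 1 above is an instance of the general lazy-initialization pattern: defer an environment-dependent handle out of the constructor (where the job configuration is not yet available) to the first actual invocation. A minimal, self-contained sketch of that pattern; the names here are illustrative stand-ins for the UDF's 'fs' field, not code from the patch:

```java
import java.util.function.Supplier;

// Lazy initialization: the environment-dependent handle is not created in
// the constructor (where the configuration may still be null) but on first
// use, mirroring the move of the FileSystem setup out of the constructor
// and into a per-job initialization method.
public class LazyHandle {
    private Object handle;                       // stands in for the FileSystem field
    private final Supplier<Object> factory;      // stands in for FileSystem lookup

    public LazyHandle(Supplier<Object> factory) {
        this.factory = factory;                  // nothing resolved yet
    }

    public Object get() {
        if (handle == null) {                    // resolved only when actually invoked
            handle = factory.get();
        }
        return handle;
    }
}
```

Because the factory runs at invocation time, it observes the fully populated job state instead of the constructor-time null, which is exactly why moving the 'fs' setup fixed the cluster-mode failure.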
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770042#action_12770042 ] Ankur commented on PIG-958: --- Just back from vacation. Have updated the code with required changes. It should be good to go now. Pradeep can you or any other committer review it ? Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Attachment: 958.v3.patch Pradeep, Thanks for your review comments. I have incorporated the suggestions provided in the code review. The code is vastly simplified, cleaner and more readable :-). Unit tests now pass in local mode but fail in cluster mode after taking an update of the Pig code base. The error I see is :- hdfs://localhost.localdomain:40352/user/gankur/output/_temporary/_attempt_20091009030519686_0001_m_00_0/output, expected: file:/// Looks like a config issue with org.apache.pig.test.MiniCluster in the latest Pig code. I didn't get time to debug this as I am going on vacation. Regardless, I have attached the new patch for your review. Please suggest what needs to be done to pass the unit tests in cluster mode. -Ankur Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-976) Multi-query optimization throws ClassCastException
Multi-query optimization throws ClassCastException -- Key: PIG-976 URL: https://issues.apache.org/jira/browse/PIG-976 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Ankur Multi-query optimization fails to merge 2 branches when 1 is a result of Group By ALL and another is a result of Group By field1 where field 1 is of type long. Here is the script that fails with multi-query on. data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double); A = GROUP data ALL; B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2; C = FOREACH B GENERATE (sum1/sum2) AS rate; STORE C INTO 'result1'; D = GROUP data BY a; E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c); STORE E into 'result2'; Here is the exception from the logs java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Open (was: Patch Available) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch, 958.v2.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Patch Available (was: Open) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v2.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Patch Available (was: Open) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755928#action_12755928 ] Ankur commented on PIG-958: --- Hudson seems to be failing during compilation as my test case defined in package org.apache.pig.piggybank.test.storage is reusing certain classes from org.apache.pig.test, namely 'Util' and MiniCluster. Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-894) order-by fails when input is empty
[ https://issues.apache.org/jira/browse/PIG-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754883#action_12754883 ] Ankur commented on PIG-894: --- Is 'empty input' referring to relation l ('students.txt') or f (filter l by 1 == 2)? I am seeing a similar issue where the sampler produces an empty file when the number of records in the relation being sorted is too low ( 4 ). order-by fails when input is empty -- Key: PIG-894 URL: https://issues.apache.org/jira/browse/PIG-894 Project: Pig Issue Type: Bug Reporter: Thejas M Nair grunt l = load 'students.txt' ; grunt f = filter l by 1 == 2; grunt o = order f by $0 ; grunt dump o; This results in 3 MR jobs. The 2nd (sampling) MR creates an empty sample file, and the 3rd MR (order-by) fails with the following error in the Map job - java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:104) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.(MapTask.java:348) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Caused by: java.lang.RuntimeException: Empty samples file at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:89) ... 5 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
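The failure above comes from building range boundaries out of an empty sample file. A minimal sketch of the degenerate-case handling, where pickPartition is a hypothetical helper and not Pig's actual WeightedRangePartitioner code, would fall back to a single partition instead of throwing when no samples exist:

```java
import java.util.List;

// Sketch of range partitioning from sampled quantile boundaries.
// RangePartitionSketch is illustrative; the real partitioner reads the
// sampled boundaries from the sample file produced by the sampling job.
public class RangePartitionSketch {
    // quantiles: sorted sample boundaries; a key goes to the first range it fits.
    public static int pickPartition(int key, List<Integer> quantiles) {
        if (quantiles.isEmpty()) {
            return 0; // degenerate case: no samples, use a single partition
        }
        int p = 0;
        for (int q : quantiles) {
            if (key <= q) {
                return p;
            }
            p++;
        }
        return quantiles.size(); // beyond the last boundary
    }
}
```

With boundaries [3, 7] this yields three partitions (keys up to 3, 4 to 7, and above 7), and with no boundaries every key lands in partition 0 rather than failing.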
[jira] Created: (PIG-958) Splitting output data on key field
Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
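The store function described in PIG-958 boils down to demultiplexing records across outputs named by a key field's value, creating each output lazily the first time its key is seen. A minimal in-memory sketch, where KeyedDemux and its StringWriter outputs are illustrative stand-ins for the real per-directory file writers a store function would manage:

```java
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Sketch of splitting output records by a key field: one logical output per
// distinct key, created on first use. StringWriter stands in for a real
// HDFS file writer opened under a key-named directory.
public class KeyedDemux {
    private final Map<String, StringWriter> writers = new HashMap<>();

    // Append one record line to the output identified by its key field.
    public void write(String key, String record) {
        writers.computeIfAbsent(key, k -> new StringWriter())
               .write(record + "\n");
    }

    public String contentsFor(String key) {
        StringWriter w = writers.get(key);
        return w == null ? "" : w.toString();
    }

    public int outputCount() {
        return writers.size();
    }
}
```

Writing records under keys "web" and "img" produces two separate outputs, each preserving the arrival order of its own records.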
[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742748#action_12742748 ] Ankur commented on PIG-919: --- I have seen this issue in other places when the value coming out of a map[] is used in a group/cogroup/join. Pig throws the same error. And Viraj is right, explicit casting to chararray alleviates the issue. But this is confusing for users. Pig should be converting NullableText to NullableBytesWritable automatically. Here is another sample script that throws the error; explicit casting to chararray resolves the issue: data = LOAD 'mydata' USING CustomLoader() AS (f1:double, f2: map[]); dataProjected = FOREACH data GENERATE f2#'Url' as url, f1 as rank; data2 = LOAD 'urlList' AS (url:bytearray); grouped = COGROUP dataProjected BY url, data2 BY url PARALLEL 10; STORE grouped INTO 'results'; Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0, which corresponds to the first name of the student. 
{code} register mymapudf.jar; data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float); genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks; getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks; filternonnullfirstnames = filter getfirstnames by firstname is not null; groupgenmap = group filternonnullfirstnames by firstname; dump groupgenmap; {code} When I execute this code, I get an error in the Map Phase: === java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-871) Improve distribution of keys in reduce phase
Improve distribution of keys in reduce phase Key: PIG-871 URL: https://issues.apache.org/jira/browse/PIG-871 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Ankur The default hashing scheme used to distribute keys in the reduce phase sometimes results in an uneven distribution of keys, with 5-10% of reducers being overloaded with data. This bottleneck makes Pig jobs really slow and gives users a bad impression. While there is no bulletproof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give a high avalanche effect and a more even distribution. We can start by evaluating MurmurHash, which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
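The improvement proposed in PIG-871 amounts to inserting an avalanche mixing step between the key's hash code and the modulo over the reducer count, so that small differences between keys flip many output bits. A rough sketch using the well-known MurmurHash3 32-bit finalizer as the mixer; AvalanchePartition is illustrative, not Pig's actual partitioner:

```java
// Sketch of hashing with better avalanche behavior: mix the raw hashCode
// through the MurmurHash3 32-bit finalizer before reducing modulo the
// number of reducers. partitionFor is a hypothetical helper for illustration.
public class AvalanchePartition {
    // MurmurHash3 fmix32: each output bit depends on every input bit.
    static int mix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static int partitionFor(Object key, int numReducers) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (mix32(key.hashCode()) & Integer.MAX_VALUE) % numReducers;
    }
}
```

The mixer is deterministic, so the same key always lands on the same reducer, which is the property a partitioner must preserve while improving spread.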
[jira] Commented: (PIG-754) Bugs with load and store and filenames passed with -param containing periods
[ https://issues.apache.org/jira/browse/PIG-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724065#action_12724065 ] Ankur commented on PIG-754: --- Verified in the latest code that fixing PIG-564 does resolve this issue. This should be marked as duplicate of PIG-564 and closed Bugs with load and store and filenames passed with -param containing periods Key: PIG-754 URL: https://issues.apache.org/jira/browse/PIG-754 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz This one drove me batty. I have two files file and file.right. file: {code} WRONG This is file, not file.right. {code} file.right: {code} RIGHT This is file.right.. {code} infile.pig: {code} A = load '$infile' using PigStorage(); dump A; {code} When I pass in file.right as the infile parameter value, the wrong file is read: {code} -bash-3.00$ pig -exectype local -param infile=file.right infile.pig USING: /grid/0/gs/pig/current 2009-04-05 23:18:36,291 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:18:36,292 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (WRONG ) (This is file, not file.right.) {code} However, if I pass in infile as ./file.right, the script magically works. {code} -bash-3.00$ pig -exectype local -param infile=./file.right infile.pig USING: /grid/0/gs/pig/current 2009-04-05 23:20:46,735 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:20:46,736 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (RIGHT) (This is file.right.) 
{code} I do not have this problem if I use the file name with a period in the script itself: infile2.pig {code} A = load 'file.right' using PigStorage(); dump A; {code} {code} -bash-3.00$ pig -exectype local infile2.pig USING: /grid/0/gs/pig/current 2009-04-05 23:22:47,022 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:22:47,023 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (RIGHT) (This is file.right.) {code} I also experience similar problems when I try to pass in param outfile in a store statement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig
[ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717635#action_12717635 ] Ankur commented on PIG-821: --- OK, so I tried writing an NTILE UDF that accepts 1. Number of tiles 2. A bag of sorted tuples The problem with that is that it is essentially a serial process instead of a parallel one, as one would expect. So I am not sure an NTILE operation can be done efficiently via a UDF. An efficient NTILE operation over a sorted dataset should 1. Partition the sorted data into the number of tiles requested 2. Preserve the ordering in each tile. 3. Have each tile contain exactly the number of elements required by NTILE logic. There is a total-ordering partitioner in Hadoop - http://issues.apache.org/jira/browse/HADOOP-3019 - that effects a total ordering of the output data. However, it cannot strictly enforce the number of elements contained in each part output, which is a necessary condition to comply with NTILE logic. Any thoughts? simulate NTILE(n) , rank() functionality in pig --- Key: PIG-821 URL: https://issues.apache.org/jira/browse/PIG-821 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.2.0 Environment: mithril gold -gateway 4000 Reporter: Rekha Fix For: 0.2.0 Hi, I came across a job with some processing that I can't seem to get easily over-the-counter from Pig. These are the NTILE()/rank() operations available in Oracle. While I am trying to write a UDF, that is not working out too well for me yet.. :( I have an ntile(n) over (partition by x, y, z order by a desc, b desc) operation to be done in Pig scripts. Is there a default function in Pig scripting which can do this? For example, let's consider a simple example at http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm So here, what would we ideally substitute NTILE() with? Any Pig counterpart function/udf? 
SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) AS quartile FROM employees WHERE department_id = 100;

LAST_NAME   SALARY  QUARTILE
---------   ------  --------
Greenberg    12000         1
Faviet        9000         1
Chen          8200         2
Urman         7800         2
Sciarra       7700         3
Popp          6900         4

In the real case, I have ntile over multiple columns, so an ideal way to find histograms/boundaries/spit out the bucket number is needed. Similarly, a Pig function is required for rank() over (partition by a,b,c order by d desc) as e. Please let me know soon. Thanks Regards, /Rekha -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
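The NTILE semantics discussed in this issue reduce to a small piece of bucket arithmetic: with total rows and n tiles, the first total % n buckets each hold one extra row, and every row's bucket follows from its position in the sorted order. A sketch of that logic as a pure function; Ntile.bucket is a hypothetical helper, not an existing Pig UDF:

```java
// NTILE bucket assignment: given a row's 0-based position in the sorted
// order, the total row count, and the tile count n, return its 1-based
// bucket. The first (total % n) buckets each hold one extra row, matching
// SQL NTILE semantics.
public class Ntile {
    public static int bucket(int position, int total, int n) {
        int base = total / n;   // minimum rows per bucket
        int extra = total % n;  // the first 'extra' buckets get base+1 rows
        int bigRows = extra * (base + 1); // rows covered by the larger buckets
        if (position < bigRows) {
            return position / (base + 1) + 1;
        }
        return extra + (position - bigRows) / base + 1;
    }
}
```

For the 6-row Oracle example above with NTILE(4), positions 0 through 5 map to quartiles 1, 1, 2, 2, 3, 4, matching the query output.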
[jira] Updated: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-732: -- Attachment: udf.v5.patch Minor issue in a test case was causing a test failure. Fixed in the latest upload - udf.v5.patch. Also changed TopN to Top. Should be good to go now. Utility UDFs - Key: PIG-732 URL: https://issues.apache.org/jira/browse/PIG-732 Project: Pig Issue Type: New Feature Reporter: Ankur Priority: Minor Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, udf.v5.patch Two utility UDFs and their respective test cases. 1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (type long) to use for comparison, and a sorted/unsorted bag of tuples. It outputs a bag containing the top N tuples. 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
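The Top UDF described above can keep memory bounded by maintaining a min-heap of size N while scanning the bag, so only the N largest values by the comparison field survive. A simplified sketch over plain longs standing in for the tuples' comparison field; TopSketch is illustrative, not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Bounded top-N selection: push each value onto a min-heap and evict the
// current minimum whenever the heap exceeds N, leaving the N largest.
public class TopSketch {
    public static List<Long> top(int n, Iterable<Long> values) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // natural-order min-heap
        for (long v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest, keeping the N largest
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(null); // heap iteration order is unspecified, so sort for output
        return out;
    }
}
```

With n = 3 over the values 5, 1, 9, 3, 7, the heap retains [5, 7, 9]; the bag never needs to be materialized or sorted in full, which is what makes the UDF usable on large bags.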
[jira] Commented: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704014#action_12704014 ] Ankur commented on PIG-732: --- If there aren't any other issues, can we go ahead and commit these? Utility UDFs - Key: PIG-732 URL: https://issues.apache.org/jira/browse/PIG-732 Project: Pig Issue Type: New Feature Reporter: Ankur Priority: Minor Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch Two utility UDFs and their respective test cases. 1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (type long) to use for comparison, and a sorted/unsorted bag of tuples. It outputs a bag containing the top N tuples. 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.