[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712397#action_12712397 ]

Hong Tang commented on PIG-794:

- It appears that the code added a three-byte sync-mark \1\2\3 before every tuple.
- There is no escaping of sync-mark collisions in user code.
- The introduction of the sync mark also defeats the purpose of using Avro in the first place (sharing a common serialization format).

Use Avro serialization in Pig

Key: PIG-794
URL: https://issues.apache.org/jira/browse/PIG-794
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Fix For: 0.2.0
Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch

We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
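
To make the collision concern in the comment above concrete, here is a minimal sketch of one way a sync mark could be escaped. It is not taken from the attached patch. The class name SyncMarkEscaper, the escape byte 0x1B and the substitution code 0x00 are all assumptions chosen for illustration. The idea is plain byte stuffing: if the first byte of the sync mark never appears unescaped in a payload, the three-byte sequence cannot appear there either.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a hypothetical byte-stuffing scheme, not code from PIG-794.
final class SyncMarkEscaper {
    // The three-byte sync mark described in the comment (\1\2\3).
    static final byte[] SYNC = {1, 2, 3};
    // Assumed escape byte and substitution code; any values other than 0x01 would do.
    private static final byte ESC = 0x1B;
    private static final byte ESCAPED_ONE = 0x00;

    /** Rewrites the payload so that the byte 0x01 (and thus the sync mark) never occurs in it. */
    static byte[] escape(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : payload) {
            if (b == ESC) {
                out.write(ESC);
                out.write(ESC);
            } else if (b == SYNC[0]) {
                out.write(ESC);
                out.write(ESCAPED_ONE);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    /** Reverses escape(); rejects input that ends in a dangling escape byte. */
    static byte[] unescape(byte[] escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length; i++) {
            byte b = escaped[i];
            if (b != ESC) {
                out.write(b);
                continue;
            }
            if (++i >= escaped.length) {
                throw new IllegalArgumentException("dangling escape byte");
            }
            byte next = escaped[i];
            if (next == ESC) {
                out.write(ESC);
            } else if (next == ESCAPED_ONE) {
                out.write(SYNC[0]);
            } else {
                throw new IllegalArgumentException("unknown escape code: " + next);
            }
        }
        return out.toByteArray();
    }

    /** True if the sync mark occurs anywhere in the given bytes. */
    static boolean containsSync(byte[] data) {
        for (int i = 0; i + SYNC.length <= data.length; i++) {
            if (data[i] == SYNC[0] && data[i + 1] == SYNC[1] && data[i + 2] == SYNC[2]) {
                return true;
            }
        }
        return false;
    }
}
```

A payload such as {9, 1, 2, 3, 7} contains the sync mark before escaping and no longer contains it afterwards, while unescape(escape(p)) returns the original bytes. The cost is that the escaped data is no longer plain Avro, which is the third point raised in the comment.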
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712409#action_12712409 ]

Hudson commented on PIG-697:

Integrated in Pig-trunk #451 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/451/])
: Proposed improvements to pig's optimizer

Proposed improvements to pig's optimizer

Key: PIG-697
URL: https://issues.apache.org/jira/browse/PIG-697
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch

I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern.

This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern.  Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            if (matcher.match(rule)) {
                List<List<O>> matches = matcher.getAllMatches();
                for (List<O> match : matches) {
                    // It matches the pattern.  Now check if the transformer
                    // approves as well.
                    if (rule.transformer.check(match)) {
                        // The transformer approves.
                        sawMatch = true;
                        rule.transformer.transform(match);
                    }
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass.

For example, we might have a plan that looks like: Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join and swap foreach and filter), applied iteratively, we can get from the initial to final plan, without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan.  Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException {
    ...
}

/**
 * Push one operator in front of another.  This function is for use when
 * the first operator has multiple inputs.  The caller can specify
 * which input of the first operator the second operator should be
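
The iterate-until-no-change loop proposed in item 2 can be tried out on a toy model. The sketch below is not Pig code: it represents a plan as a list of operator names, and the names SwapRuleOptimizer and SwapRule are hypothetical. Because this toy model only swaps strictly adjacent operators, it needs a third rule (swap join and foreach) in addition to the two named in the proposal to move from Load-Join-Filter-Foreach to Load-Foreach-Filter-Join.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: a plan is a linear list of operator names, not a Pig OperatorPlan.
final class SwapRuleOptimizer {

    /** A rule that swaps the first adjacent occurrence of (first, second). */
    static final class SwapRule {
        final String first;
        final String second;

        SwapRule(String first, String second) {
            this.first = first;
            this.second = second;
        }

        /** Applies the rule once; returns true if the plan was changed. */
        boolean apply(List<String> plan) {
            for (int i = 0; i + 1 < plan.size(); i++) {
                if (plan.get(i).equals(first) && plan.get(i + 1).equals(second)) {
                    Collections.swap(plan, i, i + 1);
                    return true;
                }
            }
            return false;
        }
    }

    /**
     * Runs every rule repeatedly until a full pass changes nothing or the
     * iteration cap is reached. The cap guards against rule sets that undo
     * each other's work.
     */
    static List<String> optimize(List<String> plan, List<SwapRule> rules, int maxIterations) {
        List<String> result = new ArrayList<>(plan);
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (SwapRule rule : rules) {
                if (rule.apply(result)) {
                    sawMatch = true;
                }
            }
        } while (sawMatch && ++numIterations < maxIterations);
        return result;
    }
}
```

With the rules (Join, Filter), (Filter, Foreach) and (Join, Foreach) and the input Load, Join, Filter, Foreach, the loop reaches Load, Foreach, Filter, Join after two passes that change the plan, and a third pass that changes nothing ends the loop. No single rule has to know the shape of the whole plan.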
[jira] Commented: (PIG-814) Make Binstorage more robust when data contains record markers
[ https://issues.apache.org/jira/browse/PIG-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712410#action_12712410 ]

Hudson commented on PIG-814:

Integrated in Pig-trunk #451 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/451/])
: Make Binstorage more robust when data contains record markers (pradeepkth)

Make Binstorage more robust when data contains record markers

Key: PIG-814
URL: https://issues.apache.org/jira/browse/PIG-814
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: 0.3.0
Attachments: PIG-814.patch

When the input stream for BinStorage is at a position where the data contains the record marker sequence, the code incorrectly assumes that it is at the beginning of a record (tuple) and calls DataReaderWriter.readDatum() to try to read the tuple. The problem is more likely when RandomSampleLoader (used in the ORDER BY implementation) skips ahead in the input stream for sampling and then calls BinStorage.getNext(). The code should be more robust in such cases.
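
The failure mode described above can be reproduced with a small sketch. It is not the PIG-814 patch and does not use the BinStorage format. It assumes a hypothetical framing of marker bytes 1 2 3, a one-byte payload length, then the payload, and the class name RecordResync is made up. A naive scan accepts the first marker it sees after a seek, while a more defensive scan accepts a candidate only if the record that would start there ends exactly at another marker or at the end of the data.

```java
// Illustrative only: hypothetical framing (marker, 1-byte length, payload), not BinStorage's.
final class RecordResync {
    static final byte[] MARKER = {1, 2, 3};

    /** True if the marker bytes start at position pos. */
    static boolean markerAt(byte[] data, int pos) {
        if (pos < 0 || pos + MARKER.length > data.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (data[pos + i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Naive scan: first marker at or after 'from', or -1. May land inside a payload. */
    static int naiveNext(byte[] data, int from) {
        for (int p = Math.max(from, 0); p + MARKER.length <= data.length; p++) {
            if (markerAt(data, p)) {
                return p;
            }
        }
        return -1;
    }

    /**
     * Defensive scan: a candidate marker is accepted only if the length byte after it
     * describes a record that ends at the end of the data or right before another marker.
     * This lowers the chance of a false match; it does not rule one out entirely.
     */
    static int validatedNext(byte[] data, int from) {
        for (int p = naiveNext(data, from); p >= 0; p = naiveNext(data, p + 1)) {
            int lengthPos = p + MARKER.length;
            if (lengthPos >= data.length) {
                continue;
            }
            int end = lengthPos + 1 + (data[lengthPos] & 0xFF);
            if (end == data.length || markerAt(data, end)) {
                return p;
            }
        }
        return -1;
    }
}
```

For a stream holding one record whose six-byte payload is 9 1 2 3 9 9 followed by a record with payload 7 7, a scan started inside the first payload finds a marker at offset 5 with the naive method, which is in the middle of user data. The validated method skips that candidate and returns offset 10, the true start of the second record.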
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Milind Bhandarkar updated PIG-656:

Status: Open (was: Patch Available)

Modifying patch to include test case.

Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

Key: PIG-656
URL: https://issues.apache.org/jira/browse/PIG-656
Project: Pig
Issue Type: Bug
Components: documentation, grunt
Affects Versions: 0.2.1
Reporter: Viraj Bhat
Assignee: Milind Bhandarkar
Fix For: 0.3.0
Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval

{code}
register TOKENIZE.jar
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

===
2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
===

I looked at the source code (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems that EVAL is a keyword in Pig. Here are some clarifications:

1) Is there documentation on what the EVAL keyword actually is?
2) Is EVAL keyword actually implemented?

Viraj
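
The parse failure reported above is what one would expect when a reserved word is recognized before the parser gets a chance to treat it as part of a dotted class name. The sketch below is a toy illustration of that effect and of one possible remedy. It is not derived from QueryParser.jjt or from reserved.patch, and the class name QualifiedNameCheck and its keyword list are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy illustration only; the real grammar lives in QueryParser.jjt.
final class QualifiedNameCheck {
    // Hypothetical, incomplete reserved-word list used for this example.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("eval", "load", "group", "foreach"));

    static boolean isKeyword(String word) {
        return KEYWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Strict: a keyword in any segment of the dotted name makes the name invalid. */
    static boolean acceptsStrict(String name) {
        for (String segment : name.split("\\.", -1)) {
            if (!isIdentifier(segment) || isKeyword(segment)) {
                return false;
            }
        }
        return true;
    }

    /** Lenient: keywords are allowed as segments of a dotted name, but not as a bare name. */
    static boolean acceptsLenient(String name) {
        String[] segments = name.split("\\.", -1);
        for (String segment : segments) {
            if (!isIdentifier(segment)) {
                return false;
            }
            if (segments.length == 1 && isKeyword(segment)) {
                return false;
            }
        }
        return true;
    }
}
```

Under the strict check, mypackage.eval.TOKENIZE is rejected because of the eval segment, which mirrors the reported behavior. Under the lenient check it is accepted, while a bare eval is still treated as a keyword.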
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Milind Bhandarkar updated PIG-656:

Attachment: (was: reserved.patch)
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Milind Bhandarkar updated PIG-656:

Attachment: reserved.patch

Uploading a modified patch that now includes a test case. The findbugs warning is not new to this patch.
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Milind Bhandarkar updated PIG-656:

Status: Patch Available (was: Open)
[jira] Commented: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712470#action_12712470 ]

Hadoop QA commented on PIG-656:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408885/reserved.patch
against trunk revision 08.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/56/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/56/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/56/console

This message is automatically generated.
[jira] Created: (PIG-817) Pig Docs for 0.3.0 Release
Pig Docs for 0.3.0 Release

Key: PIG-817
URL: https://issues.apache.org/jira/browse/PIG-817
Project: Pig
Issue Type: Task
Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel

Update Pig docs for 0.3.0 release:
- Getting Started
- Pig Latin
[jira] Updated: (PIG-817) Pig Docs for 0.3.0 Release
[ https://issues.apache.org/jira/browse/PIG-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corinne Chandel updated PIG-817:

Attachment: Doc-XML-Files.zip
            Doc-Build.zip
            PIG-817.patch

(1) PIG-817.patch - patch file
(2) Doc-Build.zip - local doc build (for review)
(3) Doc-XML-Files.zip - copies of the updated XML files (in case you need them)
[jira] Commented: (PIG-813) Semantics of * and count
[ https://issues.apache.org/jira/browse/PIG-813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712500#action_12712500 ]

Corinne Chandel commented on PIG-813:

Pig Latin Manual updated (PIG-817):
1) The semantics of * as explained by Olga.
2) An example of GROUP ALL included.

George, thanks for pointing out this documentation issue.

Semantics of * and count

Key: PIG-813
URL: https://issues.apache.org/jira/browse/PIG-813
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.2.0
Reporter: George Mavromatis
Fix For: 0.2.0

Continuation of PIG-812. See PIG-812 for more details.

In order for this to be resolved in the right manner, the following must be added to http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html
1) The semantics of * as explained by Olga.
2) An example of GROUP ALL

Otherwise people will waste their time making the same (documentation-caused) mistakes again.
Build failed in Hudson: Pig-Patch-minerva.apache.org #57
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/57/

--
started
Building remotely on minerva.apache.org (Ubuntu)
Updating http://svn.apache.org/repos/asf/hadoop/pig/trunk
Fetching 'http://svn.apache.org/repos/asf/hadoop/core/nightly/test-patch' at -1 into 'http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/trunk/test/bin'
At revision 778081
At revision 778081
no change for http://svn.apache.org/repos/asf/hadoop/pig/trunk since the previous build
no change for http://svn.apache.org/repos/asf/hadoop/core/nightly/test-patch since the previous build
[Pig-Patch-minerva.apache.org] $ /bin/bash /tmp/hudson4426818384798735035.sh
/home/hudson/tools/java/latest1.6/bin/java
Buildfile: build.xml

check-for-findbugs:

findbugs.check:

java5.check:

forrest.check:

hudson-test-patch:
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Testing patch for PIG-817.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] Reverted 'test/org/apache/pig/test/TestLogicalPlanBuilder.java'
     [exec] Reverted 'src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt'
     [exec]
     [exec] Fetching external item into 'test/bin'
     [exec] A    test/bin/test-patch.sh
     [exec] Updated external to revision 778081.
     [exec]
     [exec] Updated to revision 778081.
     [exec] PIG-817 patch is being downloaded at Sat May 23 21:53:42 PDT 2009 from
     [exec] http://issues.apache.org/jira/secure/attachment/12408897/Doc-XML-Files.zip
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Pre-building trunk to determine trunk number
     [exec] of release audit, javac, and Findbugs warnings.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] /home/hudson/tools/ant/latest/bin/ant -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= releaseaudit > http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkReleaseAuditWarnings.txt 2>&1
     [exec] /home/hudson/tools/ant/latest/bin/ant -Djavac.args=-Xlint -Xmaxwarns 1000 -Declipse.home=/home/nigel/tools/eclipse/latest -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= clean tar > http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkJavacWarnings.txt 2>&1
     [exec] /home/hudson/tools/ant/latest/bin/ant -Dfindbugs.home=/home/nigel/tools/findbugs/latest -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= findbugs > /dev/null 2>&1
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Checking there are no @author tags in the patch.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] There appear to be 0 @author tags in the patch.
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Checking there are new or changed tests in the patch.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] There appear to be 0 test files referenced in the patch.
     [exec] The patch appears to be a documentation patch that doesn't require tests.
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Applying patch.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] patch unexpectedly ends in middle of line
     [exec] /usr/bin/patch: Only garbage was found in the patch input.
     [exec] PATCH APPLICATION FAILED