[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710628#action_12710628 ]

Daniel Dai commented on PIG-774:
--------------------------------

You get the point, Viraj. Actually we can have two different configurations:

1. LANG=UTF8: all data files, script files, and parameter files are UTF-8
2. LANG=GB2312: data files are UTF-8; script files and parameter files are GB2312

However, the RHEL default setting is LANG=POSIX, which does not work well with Chinese characters. So for simplicity, we can have everything UTF-8 (case 1). This is the default setting for Ubuntu.

Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

Key: PIG-774
URL: https://issues.apache.org/jira/browse/PIG-774
Project: Pig
Issue Type: Bug
Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
Fix For: 0.3.0
Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch

I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parameterized query string and a hard-coded Chinese character string.
Pig script: chinese_data.pig

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile

queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'

Input file: /user/viraj/chinese.txt

shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會

I ran the above set of inputs in the following ways:

Run 1:

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script and instead used the following statement:

{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}

java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig

2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

In both cases:

{code}
shell $ hadoop
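The locale dependence discussed in this issue can be side-stepped in code by never consulting the platform default charset. Below is a minimal sketch of reading a -param_file with an explicit UTF-8 decoder; `Utf8ParamReader` is a hypothetical name for illustration, not part of Pig's source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: read a parameter file line by line with an explicit
// UTF-8 charset, so the result does not depend on LANG or the platform
// default encoding - the failure mode described in this issue.
public class Utf8ParamReader {
    public static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With decoding pinned to UTF-8, the Chinese query string survives a LANG=POSIX environment unchanged, since the charset no longer comes from the locale.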
[jira] Updated: (PIG-765) to implement jdiff
[ https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giridharan Kesavan updated PIG-765:
-----------------------------------
Hadoop Flags: [Reviewed]
Status: Patch Available (was: In Progress)

to implement jdiff
------------------

Key: PIG-765
URL: https://issues.apache.org/jira/browse/PIG-765
Project: Pig
Issue Type: Improvement
Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
Attachments: pig-765.patch, pig-765.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-765) to implement jdiff
[ https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giridharan Kesavan updated PIG-765:
-----------------------------------
Status: In Progress (was: Patch Available)

to implement jdiff
------------------

Key: PIG-765
URL: https://issues.apache.org/jira/browse/PIG-765
Project: Pig
Issue Type: Improvement
Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
Attachments: pig-765.patch, pig-765.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-765) to implement jdiff
[ https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710724#action_12710724 ]

Hadoop QA commented on PIG-765:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12407876/pig-765.patch
against trunk revision 776106.

-1 @author. The patch appears to contain 5 @author tags which the Pig community has agreed to not allow in code contributions.

-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/console

This message is automatically generated.

to implement jdiff
------------------

Key: PIG-765
URL: https://issues.apache.org/jira/browse/PIG-765
Project: Pig
Issue Type: Improvement
Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
Attachments: pig-765.patch, pig-765.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-Patch-minerva.apache.org #48
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/changes

Changes:

[sms] PIG-697: Proposed improvements to pig's optimizer

--
[...truncated 91447 lines...]
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: Received block blk_-6409077214663205636_1011 of size 6 from /127.0.0.1
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: Received block blk_-6409077214663205636_1011 of size 6 from /127.0.0.1
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: PacketResponder 1 for block blk_-6409077214663205636_1011 terminating
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: PacketResponder 2 for block blk_-6409077214663205636_1011 terminating
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:44808 is added to blk_-6409077214663205636_1011 size 6
    [exec] [junit] 09/05/19 06:29:42 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50164 is added to blk_-6409077214663205636_1011 size 6
    [exec] [junit] 09/05/19 06:29:42 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://localhost:44060
    [exec] [junit] 09/05/19 06:29:42 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: localhost:48467
    [exec] [junit] 09/05/19 06:29:42 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
    [exec] [junit] 09/05/19 06:29:42 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:56490 to delete blk_-8848395961846024087_1006 blk_2864616809689055598_1004
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:46139 to delete blk_-8848395961846024087_1006 blk_986403042756521851_1005
    [exec] [junit] 09/05/19 06:29:43 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
    [exec] [junit] 09/05/19 06:29:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200905190628_0002/job.jar. blk_-1187686921115561227_1012
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block blk_-1187686921115561227_1012 src: /127.0.0.1:34222 dest: /127.0.0.1:46139
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block blk_-1187686921115561227_1012 src: /127.0.0.1:50480 dest: /127.0.0.1:56490
    [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block blk_-1187686921115561227_1012 src: /127.0.0.1:33721 dest: /127.0.0.1:50164
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 0 for block blk_-1187686921115561227_1012 terminating
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50164 is added to blk_-1187686921115561227_1012 size 1393210
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Deleting block blk_-8848395961846024087_1006 file dfs/data/data5/current/blk_-8848395961846024087
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
    [exec] [junit] 09/05/19 06:29:44 WARN dfs.DataNode: Unexpected error trying to delete block blk_2864616809689055598_1004. BlockInfo not found in volumeMap.
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 1 for block blk_-1187686921115561227_1012 terminating
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
    [exec] [junit] 09/05/19 06:29:44 WARN dfs.DataNode: java.io.IOException: Error in deleting blocks.
    [exec] [junit] 	at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:1146)
    [exec] [junit] 	at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:793)
    [exec] [junit] 	at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:663)
    [exec] [junit] 	at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
    [exec] [junit] 	at java.lang.Thread.run(Thread.java:619)
    [exec] [junit]
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 2 for block blk_-1187686921115561227_1012 terminating
    [exec] [junit] 09/05/19 06:29:44 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:46139 is added to
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710818#action_12710818 ]

Yiping Han commented on PIG-807:
--------------------------------

David, the syntax

B = foreach A generate SUM(m);

is confusing for both developers and the parser. I like the idea of removing the explicit GROUP ALL, but would rather use a different keyword for that, i.e.:

B = FOR A GENERATE SUM(m);

Adding a new keyword for this purpose would also work as a hint for the parser to treat this as direct hadoop iterator access.

PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

Key: PIG-807
URL: https://issues.apache.org/jira/browse/PIG-807
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Fix For: 0.3.0

Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed in a bag which may run out of memory and hence spill, causing a slowdown in performance and sometimes memory exceptions. In many cases, the UDFs which use these bags coming out of a group or cogroup only need to iterate over the bag in a unidirectional, read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too. The other part of this issue is to have some way for the UDFs to communicate to Pig that any input bags they need are read-once bags. This can be achieved by having an interface - say UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig can then rewire its execution plan to use ReadOnceBags if feasible.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
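The read-once bag proposed in this issue can be sketched as a thin wrapper that hands out an underlying iterator exactly once. This is an illustrative sketch under invented names, not Pig's actual DataBag API:

```java
import java.util.Iterator;

// Sketch of the read-once idea: a "bag" that exposes the underlying
// (e.g. Hadoop reduce-side) iterator directly instead of materializing
// its contents in memory. Names are illustrative, not Pig's API.
public class ReadOnceBag<T> implements Iterable<T> {
    private Iterator<T> source;   // consumed exactly once

    public ReadOnceBag(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (source == null) {
            throw new IllegalStateException("read-once bag already consumed");
        }
        Iterator<T> it = source;
        source = null;            // unidirectional, single pass
        return it;
    }
}
```

A UDF iterating such a bag streams values straight from the reducer; a second call to iterator() fails, which makes the single-pass contract explicit to callers.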
[jira] Commented: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710862#action_12710862 ]

Viraj Bhat commented on PIG-656:
--------------------------------

Another Pig parse issue occurs when a UDF is defined within a package which has the matches keyword in its path. So something like:

define DISTANCE_SCORE mypackage.pig.udf.matches.LevensteinMatchUDF();

gives a parse error:

ERROR 1000: Error during parsing. Encountered matches matches at line 11, column 42. Was expecting: <IDENTIFIER> ...

It is possible to have keywords from pig within package names or even UDF names - shouldn't Pig be robust enough to disambiguate simple grammar cases of this sort?

Use of eval word in the package hierarchy of a UDF causes parse exception
-------------------------------------------------------------------------

Key: PIG-656
URL: https://issues.apache.org/jira/browse/PIG-656
Project: Pig
Issue Type: Bug
Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Fix For: 0.2.0
Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar;
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

===
2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
===

I looked at the source code (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems that EVAL is a keyword in Pig. Here are some clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
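The disambiguation Viraj asks about - treating a reserved word as a plain identifier when it appears inside a dotted package path - can be sketched with toy lexer logic. All names here are invented for illustration; Pig's actual grammar lives in QueryParser.jjt:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration: a reserved word such as "eval" or "matches" is only
// treated as a keyword when it is NOT part of a dotted name. This is
// illustrative lexer logic, not Pig's actual parser.
public class KeywordDisambiguator {
    private static final Set<String> KEYWORDS = new HashSet<String>(
            Arrays.asList("eval", "matches", "load", "filter", "group"));

    // tokenStart/tokenEnd delimit a token inside script; the token is a
    // keyword use only when it is not preceded or followed by a '.'.
    public static boolean isKeywordUse(String script, int tokenStart, int tokenEnd) {
        String tok = script.substring(tokenStart, tokenEnd).toLowerCase();
        if (!KEYWORDS.contains(tok)) {
            return false;
        }
        boolean dotBefore = tokenStart > 0 && script.charAt(tokenStart - 1) == '.';
        boolean dotAfter = tokenEnd < script.length() && script.charAt(tokenEnd) == '.';
        return !dotBefore && !dotAfter;
    }
}
```

Under this rule, the "eval" in mypackage.eval.TOKENIZE would lex as an identifier, while a bare "filter" at statement position would still lex as a keyword.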
[jira] Reopened: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat reopened PIG-656:
----------------------------

Documentation should be updated on the eval keyword and what it actually does; otherwise the user can be lost trying to figure out the error.

Use of eval word in the package hierarchy of a UDF causes parse exception
-------------------------------------------------------------------------

Key: PIG-656
URL: https://issues.apache.org/jira/browse/PIG-656
Project: Pig
Issue Type: Bug
Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Fix For: 0.2.0
Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar;
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

===
2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
===

I looked at the source code (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems that EVAL is a keyword in Pig. Here are some clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710901#action_12710901 ]

Mridul Muralidharan commented on PIG-807:
-----------------------------------------

I think I am missing something here. If I did not get it wrong, two (different?) use cases seem to be mentioned here:

1) Avoid materializing bags for a record when they can be streamed from the underlying data. Bags currently created through (co)group output seem to fall inside this. As in:

B = GROUP A by id;
C = FOREACH B generate SUM($1.field);

This does not require the $1.field bag to be created explicitly - instead, through an iterator interface, just stream the values from the underlying reducer output.

2) The GROUP ALL based construct seems to be a way to directly stream an entire relation through UDFs, as a shorthand for:

A_tmp = GROUP A all;
B = FOREACH A_tmp GENERATE algUdf($1);

If I am right in splitting this, then: the first use case has tremendous potential for improving performance - particularly to remove the annoying OOMs or spills which happen - but I am not sure how it interacts with pig's current pipeline design (if any). Since there are alternatives (though more cryptic) to do it, I don't have any particular opinion about 2.

Regards,
Mridul

PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

Key: PIG-807
URL: https://issues.apache.org/jira/browse/PIG-807
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Fix For: 0.3.0

Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed in a bag which may run out of memory and hence spill, causing a slowdown in performance and sometimes memory exceptions. In many cases, the UDFs which use these bags coming out of a group or cogroup only need to iterate over the bag in a unidirectional, read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too. The other part of this issue is to have some way for the UDFs to communicate to Pig that any input bags they need are read-once bags. This can be achieved by having an interface - say UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig can then rewire its execution plan to use ReadOnceBags if feasible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
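Mridul's first use case - computing SUM over a key's values without ever materializing the bag - amounts to folding over the reducer's iterator. A minimal sketch with hypothetical names (this is not Pig's built-in SUM):

```java
import java.util.Iterator;

// Illustrates the streaming use case: SUM over the values for a key can be
// computed directly from the reduce-side iterator, holding only one value
// and a running total in memory at any time. Hypothetical sketch.
public class StreamingSum {
    public static long sum(Iterator<Long> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();   // one value resident at a time
        }
        return total;
    }
}
```

Because nothing is buffered, the key's cardinality no longer matters: a billion values stream through in constant memory, which is exactly the OOM/spill relief described above.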
Re: A proposal for changing pig's memory management
The claims in the paper I was interested in were not issues like non-blocking I/O etc. The claim that is of interest to pig is that a memory allocation and garbage collection scheme that is beyond the control of the programmer is a bad fit for a large data processing system. This is a fundamental design choice in Java, and it fits the vast majority of Java's uses well. But for systems like Pig there seems to be no choice but to work around Java's memory management. I'll clarify this point in the document.

I took a closer look at NIO. My concern is that it does not give the level of control I want. NIO allows you to force a buffer to disk and request that a buffer be loaded, but you cannot force a page out of memory. It doesn't even guarantee that after you load a page it will really be loaded. One of the biggest issues in pig right now is that we run out of memory or get the garbage collector in a situation where it can't make sufficient progress. Perhaps switching to large buffers instead of having many individual objects will address this. But I'm concerned that if we cannot explicitly force data out of memory onto disk then we'll be back in the same boat of trusting the Java memory manager.

Alan.

On May 14, 2009, at 7:43 PM, Ted Dunning wrote:

That Telegraph dataflow paper is pretty long in the tooth. Certainly several of their claims have little force any more (lack of non-blocking I/O, poor thread performance, no unmap, very expensive synchronization for uncontested locks). It is worth noting that they did all of their tests on the 1.3 JVM and things have come an enormous way since then. Certainly, it is worth having opaque containers based on byte arrays, but isn't that pretty much what the NIO byte buffers are there to provide? Wouldn't a virtual tuple type that was nothing more than a byte buffer, a type, and an offset do almost all of what is proposed here?

On Thu, May 14, 2009 at 5:33 PM, Alan Gates ga...@yahoo-inc.com wrote:
http://wiki.apache.org/pig/PigMemory
Alan.
Re: A proposal for changing pig's memory management
If you have a small number of long-lived large objects and a large number of small ephemeral objects then the java collector should be in pig-heaven (as it were). The long-lived objects will take no time to collect and the ephemeral objects won't be around to collect by the time the full GC happens. On Tue, May 19, 2009 at 3:44 PM, Alan Gates ga...@yahoo-inc.com wrote: Perhaps switching to large buffers instead of having many individual objects will address this. But I'm concerned that if we cannot explicitly force data out of memory onto disk then we'll be back in the same boat of trusting the Java memory manager. -- Ted Dunning, CTO DeepDyve
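Ted's earlier suggestion of a virtual tuple that is "nothing more than a byte buffer, type and an offset" could look roughly like the sketch below. This is a hypothetical illustration of the idea, not the design in the PigMemory proposal:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Rough sketch of a "virtual tuple": a view (buffer + offset + length)
// into one large shared ByteBuffer, so many tuples share a single
// allocation instead of being individual heap objects. Illustrative only.
public class BufferTuple {
    private final ByteBuffer buf;
    private final int offset;
    private final int length;

    public BufferTuple(ByteBuffer buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    // Decode the slice on demand; nothing is copied until a field is read.
    public String asString() {
        byte[] bytes = new byte[length];
        ByteBuffer view = buf.duplicate();  // independent position/limit
        view.position(offset);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The GC-friendliness argument above applies directly: the one backing buffer is a long-lived large object, while the tuple views are small, ephemeral, and cheap to collect.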
[jira] Updated: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh Setty updated PIG-802:
-----------------------------
Attachment: (was: OrderByOptimization.patch)

PERFORMANCE: not creating bags for ORDER BY
-------------------------------------------

Key: PIG-802
URL: https://issues.apache.org/jira/browse/PIG-802
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Attachments: OrderByOptimization.patch

Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-656:
---------------------------
Summary: Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception (was: Use of eval word in the package hierarchy of a UDF causes parse exception)

Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
-----------------------------------------------------------------------------------------

Key: PIG-656
URL: https://issues.apache.org/jira/browse/PIG-656
Project: Pig
Issue Type: Bug
Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Fix For: 0.2.0
Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar;
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

===
2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
===

I looked at the source code (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems that EVAL is a keyword in Pig. Here are some clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-697:
------------------------------------
Attachment: OptimizerPhase3_parrt1.patch

Part 1 of the Phase 3 patch. It implements the requiredFields feature in all the relational operators. New unit tests have been added.

Proposed improvements to pig's optimizer
----------------------------------------

Key: PIG-697
URL: https://issues.apache.org/jira/browse/PIG-697
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1.patch

I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern.  Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern.  Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying about getting the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan.  Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException {
    ...
}

/**
 * Push one operator in front of another.  This function is for use when
 * the first operator has multiple inputs.  The caller can specify
 * which input of the first operator the second operator should be pushed to.
 * @param first
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: In Progress (was: Patch Available) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1.patch

I propose the following changes to Pig's optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of the required array was to reduce the number of rules that needed to be written, but the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly which operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing it will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern.  Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern.  Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations; maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying about getting the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial plan to the final plan without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan.  Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException { ... }

/**
 * Push one operator in front of another.  This function is for use when
 * the first operator has multiple inputs.  The caller can specify
 * which input of the first operator the second operator should be pushed to.
 * @param first operator, assumed to have multiple inputs.
 * @param second operator, will be pushed in front of first
 * @param inputNum,
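The iterate-until-fixpoint idea in point 2 can be sketched in isolation. The following is a minimal, hypothetical model (the plan representation, the rank table, and the single neighbor-swap rule are illustrative stand-ins, not Pig's actual Rule/RuleMatcher classes) showing how repeatedly applying a simple adjacent-swap rule, capped by an iteration limit, converges from Load-Join-Filter-Foreach to Load-Foreach-Filter-Join:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for PlanOptimizer.optimize(): apply a simple
// "swap neighboring operators" rule until no rule fires (a fixed point),
// with an iteration cap to guard against non-terminating rule sets.
public class FixpointSketch {
    // Lower rank means the operator should run earlier in the plan.
    // These ranks are illustrative only, not Pig's real cost model.
    static final Map<String, Integer> RANK =
        Map.of("Load", 0, "Foreach", 1, "Filter", 2, "Join", 3);

    public static List<String> optimize(List<String> plan) {
        List<String> p = new ArrayList<>(plan);
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (int i = 1; i + 1 < p.size(); i++) {  // never move Load at index 0
                if (RANK.get(p.get(i)) > RANK.get(p.get(i + 1))) {
                    // The "rule" matched and the "transformer" approves: swap.
                    String tmp = p.get(i);
                    p.set(i, p.get(i + 1));
                    p.set(i + 1, tmp);
                    sawMatch = true;
                }
            }
        } while (sawMatch && numIterations++ < 1000);  // cap avoids infinite loops
        return p;
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("Load", "Join", "Filter", "Foreach");
        System.out.println(optimize(plan));  // [Load, Foreach, Filter, Join]
    }
}
```

Each pass only performs local swaps, yet iteration moves Foreach past both Filter and Join without any rule having to understand the whole plan, which is exactly the argument made above for simple rules plus iteration.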
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress)
[jira] Created: (PIG-812) COUNT(*) does not work
COUNT(*) does not work --- Key: PIG-812 URL: https://issues.apache.org/jira/browse/PIG-812 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.2.0

Pig script to count the number of rows in a studenttab10k file which contains 10k records:

{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab ALL;
describe X2;
Y2 = FOREACH X2 GENERATE COUNT(*);
explain Y2;
DUMP Y2;
{code}

returns the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Y2 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log

If you look at the log file:

{code}
Caused by: java.lang.ClassCastException
    at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
    at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
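A common way to sidestep this class of failure in Pig 0.2-era scripts is to pass the grouped bag to COUNT by name rather than using the star projection; this is offered as an assumed workaround for comparison, not a confirmed fix for this issue:

{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab ALL;
-- Passing the bag explicitly avoids relying on COUNT(*)'s handling of
-- the star projection, which is what triggers the ClassCastException above.
Y2 = FOREACH X2 GENERATE COUNT(studenttab);
DUMP Y2;
{code}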
[jira] Updated: (PIG-812) COUNT(*) does not work
[ https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-812: --- Attachment: studenttab10k Input file
Re: A proposal for changing pig's memory management
I am still not very convinced about the value of this implementation, particularly considering the advances made since 1.3 in memory allocators and garbage collection. The side effects of this proposal are many, and sometimes non-obvious: implicitly moving young-generation data into the older generation, causing much more memory pressure on the GC; fragmentation of memory blocks, causing quite a bit of memory pressure; replicating quite a bit of garbage-collection functionality; the possibility of bugs with ref counting; etc. If the assumption is that the current working set of bags/tuples does not need to be spilled, while anything else can be, then this will pretty much deteriorate to the current implementation in the worst case. A much simpler way to gain benefits would be to handle primitives as ... primitives, and not through the Java wrapper classes for them. It should be possible to write schema-aware tuples which use the primitives specified to take a fraction of the memory currently required (4 bytes + a null-check boolean for an int, plus offset mapping, instead of the 24/32 bytes it currently is, etc). Regards, Mridul Alan Gates wrote: http://wiki.apache.org/pig/PigMemory Alan.
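The schema-aware primitive tuple suggested above can be sketched roughly as follows; this is a hypothetical illustration (the class name and the per-value sizes are assumptions, not Pig's actual Tuple implementation) of storing an int field as a 4-byte primitive plus a null flag instead of a reference to a boxed Integer:

```java
// Hypothetical schema-aware tuple for an all-int schema: each field costs
// a 4-byte primitive plus a 1-byte null flag, instead of an Object[] slot
// pointing at a boxed Integer (roughly 24/32 bytes per value on typical JVMs).
public class IntSchemaTuple {
    private final int[] values;
    private final boolean[] isNull;

    public IntSchemaTuple(int arity) {
        values = new int[arity];
        isNull = new boolean[arity];
        java.util.Arrays.fill(isNull, true);  // fields start out null
    }

    public void set(int field, int value) {
        values[field] = value;
        isNull[field] = false;
    }

    // Returns null for unset fields, mirroring a Tuple.get(int)-style
    // Object-returning accessor while keeping the backing storage primitive.
    public Integer get(int field) {
        return isNull[field] ? null : values[field];
    }
}
```

The null flag preserves the nullability that bare primitives cannot express, which is the "4 bytes + null_check boolean" accounting in the message above; boxing only happens at the accessor boundary, not in the stored representation.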