RE: [jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
Hi David, This is exactly the problem that the multi-query optimization project is addressing. Please see the following link for details: http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

Thanks,
-Richard

-----Original Message-----
From: David Ciemiewicz (JIRA) [mailto:j...@apache.org]
Sent: Tuesday, April 28, 2009 7:43 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703659#action_12703659 ]

David Ciemiewicz commented on PIG-777:
--------------------------------------

This seems like it could be useful, but I don't understand the full issue as a user. I often want to compute intermediate summaries, store them, and then continue computation.

{code}
A = load ...
...
store D into ...
E = group D by ...
...
store H into ...
{code}

The problem I encountered in earlier versions of Pig was that to PREVENT two executions of steps A thru D, I had to introduce a load step before E:

{code}
A = load ...
...
store D into ...
D = load ...
E = group D by ...
...
store H into ...
{code}

It's great that you will be introducing code that possibly eliminates D = load in the execution. However, is anything being done so that I don't need to introduce D = load in the first place?

Code refactoring: Create optimization out of store/load post processing code
Key: PIG-777
URL: https://issues.apache.org/jira/browse/PIG-777
Project: Pig
Issue Type: Improvement
Reporter: Gunther Hagleitner

The postProcessing method in the pig server checks whether a logical graph contains stores to and loads from the same location. If so, it will either connect the store and load, or optimize by throwing out the load and connecting the store predecessor with the successor of the load.
Ideally the introduction of the store and load connection should happen in the query compiler, while the optimization should then happen in a separate optimizer step as part of the optimizer framework.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
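The graph rewrite described in the issue (drop a load whose location matches an earlier store, and wire the store's predecessor to the load's successors) can be sketched on a toy DAG. This is only an illustration under assumed data structures; the node names and the `merge_store_load` helper are hypothetical, not Pig's actual `postProcessing` code.

```python
# Toy sketch of the store/load merge described above -- hypothetical graph
# representation, NOT Pig's logical-plan classes.

def merge_store_load(edges, stores, loads):
    """edges: dict node -> list of successor nodes.
    stores/loads: dict node -> storage location.
    If a store and a load share a location, drop the load and wire the
    store's predecessor directly to the load's successors."""
    for load_node, location in list(loads.items()):
        for store_node, store_loc in stores.items():
            if store_loc != location:
                continue
            # predecessor of the store (assume exactly one, for the sketch)
            pred = next(n for n, succs in edges.items() if store_node in succs)
            # splice out the load: its successors now hang off the predecessor
            succs = edges.pop(load_node, [])
            edges[pred] = [s for s in edges.get(pred, []) if s != load_node] + succs
    return edges

# D is stored to 'out' and reloaded as D2; the rewrite removes D2 and
# connects D directly to E, so A..D run only once.
edges = {"A": ["D"], "D": ["store"], "D2": ["E"], "E": []}
print(merge_store_load(edges, {"store": "out"}, {"D2": "out"}))
# {'A': ['D'], 'D': ['store', 'E'], 'E': []}
```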
[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703764#action_12703764 ]

David Ciemiewicz commented on PIG-777:
--------------------------------------

Another thing ... If you eliminate the D = load statement, could you provide some information to the user that this optimization is taking place? It would help me immensely with code maintenance if I could eliminate the D = load steps, which often require recoding the AS clause schema.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703792#action_12703792 ]

Alan Gates commented on PIG-627:
--------------------------------

Checked in multiquery-phase3_0423.patch to multiquery branch.

PERFORMANCE: multi-query optimization
-------------------------------------
Key: PIG-627
URL: https://issues.apache.org/jira/browse/PIG-627
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch

Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:

{code}
A = load 'data' as (a, b, c);
B = filter A by a > 5;
store B into 'output1';
C = group B by b;
store C into 'output2';
{code}

This script will result in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly.
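The saving the issue describes comes from the two store pipelines sharing a common prefix (load + filter). A minimal sketch of that idea, using hypothetical pipeline lists rather than Pig's operator classes:

```python
# Illustrative sketch of why multi-query helps: both stores share the
# load+filter prefix, and executing them as independent queries repeats
# that work. The pipeline strings below are assumptions for illustration.

def shared_prefix(plan1, plan2):
    """Return the longest common prefix of two operator pipelines."""
    shared = []
    for op1, op2 in zip(plan1, plan2):
        if op1 != op2:
            break
        shared.append(op1)
    return shared

plan_output1 = ["load 'data'", "filter a > 5", "store 'output1'"]
plan_output2 = ["load 'data'", "filter a > 5", "group by b", "store 'output2'"]

# Multi-query execution runs the shared part once and splits the stream
# to both stores, instead of running each plan from scratch.
print(shared_prefix(plan_output1, plan_output2))
# ["load 'data'", "filter a > 5"]
```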
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703937#action_12703937 ]

Viraj Bhat commented on PIG-774:
--------------------------------

Daniel, Thanks again for your patch. I worked with Pradeep and changed the parser code to invoke the behavior you suggested, and then filed Jira PIG-774. Here is one problem that I faced. Suppose I have a script like this, known as chinese_data.pig:

{code}
rmf chineseoutput;
%default querystring 'myquery';
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
--dump I;
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
--store J into 'chineseoutput';
dump J;
{code}

I have a parameter file known as nextgen_paramfile which contains the $querystring variable:

{code}
querystring= 歌手香港情牽女人心演唱會
{code}

I run the above script and parameter file as:

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

I get the following error:

2009-04-29 01:05:14,979 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-04-29 01:05:16,328 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2009-04-29 01:05:16,907 [main] INFO org.apache.pig.PigServer - Create a new graph.
2009-04-29 01:05:17,794 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 7, column 33. Encountered: \u6b4c (27468), after :

I realized that it was something to do with the commented line in the pig script:

{code}
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}

Why is that so? I am attaching the pig_*.log on this Jira.
Additionally I found that the parameter substitution is happening correctly when I run the script as:

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}

The substituted file, chinese_data.pig.substituted, is correct.

Viraj

Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
Key: PIG-774
URL: https://issues.apache.org/jira/browse/PIG-774
Project: Pig
Issue Type: Bug
Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
Fix For: 0.0.0
Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch

I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and a hard-coded Chinese character.

Pig script: chinese_data.pig
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile
{code}
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'
{code}

Input file: /user/viraj/chinese.txt
{code}
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會
{code}

I ran the above set of inputs in the following ways:

Run 1:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO
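The "\u6b4c (27468)" in the lexical error is U+6B4C, the first character of the query string, which points at a charset mismatch: a lexer reading the UTF-8 script bytes through the wrong charset sees each Chinese character as several junk characters. A minimal sketch of that mismatch (illustration only; it does not use Pig's parser):

```python
# Sketch of the encoding mismatch behind "Lexical error ... \u6b4c (27468)":
# U+6B4C is the first character of the query string. If the script file is
# read with a single-byte charset instead of UTF-8, each 3-byte Chinese
# character turns into three garbage characters.

text = "歌手"                      # first two characters of the query string
assert ord(text[0]) == 0x6B4C      # 27468, the code point in the error message

raw = text.encode("utf-8")         # what is actually on disk: 6 bytes
wrong = raw.decode("latin-1")      # a non-UTF-8 reader sees 6 junk chars
right = raw.decode("utf-8")        # a UTF-8-aware reader recovers 2 chars

print(len(wrong), len(right))      # 6 2
```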
[jira] Commented: (PIG-619) Dumping empty results produces Unable to get results for /tmp/temp-1964806069/tmp256878619 org.apache.pig.builtin.BinStorage message
[ https://issues.apache.org/jira/browse/PIG-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703940#action_12703940 ]

Alan Gates commented on PIG-619:
--------------------------------

Does fixing this still make sense? IIRC the main reason for doing the store/load thing in the middle was to deal with the fact that Pig couldn't do multiple stores in one script without re-running the entire script. But since that is in the process of being changed (see PIG-627), this should no longer be necessary.

Dumping empty results produces Unable to get results for /tmp/temp-1964806069/tmp256878619 org.apache.pig.builtin.BinStorage message
Key: PIG-619
URL: https://issues.apache.org/jira/browse/PIG-619
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Environment: Hadoop 18, Multi-node hadoop installation
Reporter: Viraj Bhat
Assignee: Alan Gates
Attachments: mydata.txt, tmpfileload.pig

The following pig script stores empty filter results into the 'emptyfilteredlogs' HDFS dir. It later reloads this data from an empty HDFS dir for additional grouping and counting. It has been observed that this script succeeds on a single-node hadoop installation with the following message, as the alias COUNT_EMPTYFILTERED_LOGS contains empty data:

2009-01-13 21:47:08,988 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

But on a multi-node Hadoop installation, the script fails with the following error:

2009-01-13 13:48:34,602 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
java.io.IOException: Unable to open iterator for alias: COUNT_EMPTYFILTERED_LOGS [Unable to get results for /tmp/temp-1964806069/tmp256878619:org.apache.pig.builtin.BinStorage]
        at org.apache.pig.backend.hadoop.executionengine.HJob.getResults(HJob.java:74)
        at org.apache.pig.PigServer.openIterator(PigServer.java:408)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
        at org.apache.pig.Main.main(Main.java:306)
Caused by: org.apache.pig.backend.executionengine.ExecException: Unable to get results for /tmp/temp-1964806069/tmp256878619:org.apache.pig.builtin.BinStorage
        ... 7 more
Caused by: java.io.IOException: /tmp/temp-1964806069/tmp256878619 does not exist
        at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:188)
        at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:291)
        at org.apache.pig.backend.hadoop.executionengine.HJob.getResults(HJob.java:69)
        ... 6 more

{code}
RAW_LOGS = load 'mydata.txt' as (url:chararray, numvisits:int);
RAW_LOGS = limit RAW_LOGS 2;
FILTERED_LOGS = filter RAW_LOGS by numvisits < 0;
store FILTERED_LOGS into 'emptyfilteredlogs' using PigStorage();
EMPTY_FILTERED_LOGS = load 'emptyfilteredlogs' as (url:chararray, numvisits:int);
GROUP_EMPTYFILTERED_LOGS = group EMPTY_FILTERED_LOGS by numvisits;
COUNT_EMPTYFILTERED_LOGS = foreach GROUP_EMPTYFILTERED_LOGS generate group, COUNT(EMPTY_FILTERED_LOGS);
explain COUNT_EMPTYFILTERED_LOGS;
dump COUNT_EMPTYFILTERED_LOGS;
{code}
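The innermost cause above is that dumping an empty intermediate result tries to open a temp file that was never written. A hedged sketch of the behavioral fix this suggests, using a hypothetical `open_results` helper rather than Pig's `FileLocalizer`/`HJob` code: treat a missing results path as an empty iterator, matching the single-node "Success!" behavior.

```python
# Sketch only: a dump over an empty intermediate result should yield zero
# rows, not an IOException, when the temp path was never created.
import os
import tempfile

def open_results(path):
    """Yield result lines; a missing path means an empty result set."""
    if not os.path.exists(path):       # multi-node case: nothing was written
        return iter([])
    return iter(open(path))

# a path assumed not to exist, standing in for /tmp/temp-1964806069/tmp256878619
missing = os.path.join(tempfile.gettempdir(), "temp-1964806069-does-not-exist")
print(list(open_results(missing)))     # [] rather than an exception
```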
[jira] Created: (PIG-789) coupling load and store in script no longer works
coupling load and store in script no longer works
-------------------------------------------------
Key: PIG-789
URL: https://issues.apache.org/jira/browse/PIG-789
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Alan Gates

Many users' pig scripts do something like this:

{code}
a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
c = filter a by age < 500;
e = group c by (name, age);
f = foreach e generate group, COUNT($1);
store f into 'bla';
f1 = load 'bla';
g = order f1 by $1;
dump g;
{code}

With the inclusion of the multi-query phase2 patch this appears to no longer work. You get an error:

2009-04-28 18:24:50,776 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/gates/bla does not exist.

We shouldn't be checking for bla's existence here, because it will be created eventually by the script.
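The fix the report implies is that validation should skip existence checks for load paths that the same script will produce. A minimal sketch under that assumption (the helper and variable names are hypothetical, not Pig's validation code):

```python
# Sketch: when a script both stores to and loads from the same path ('bla'),
# the load's existence check must be skipped -- the path will be created by
# an earlier part of the same script.

def paths_needing_existence_check(load_paths, store_paths):
    """Only loads whose path is NOT produced by this script need to exist."""
    produced = set(store_paths)
    return [p for p in load_paths if p not in produced]

script_stores = ["bla"]
script_loads = ["/user/pig/tests/data/singlefile/studenttab10k", "bla"]

print(paths_needing_existence_check(script_loads, script_stores))
# ['/user/pig/tests/data/singlefile/studenttab10k']
```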
[jira] Commented: (PIG-619) Dumping empty results produces Unable to get results for /tmp/temp-1964806069/tmp256878619 org.apache.pig.builtin.BinStorage message
[ https://issues.apache.org/jira/browse/PIG-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703941#action_12703941 ]

Viraj Bhat commented on PIG-619:
--------------------------------

So when does the Multi-Store query optimization get committed/merged into the main branch (where this is the default way the multi-store happens)?

Viraj
[jira] Created: (PIG-790) Error message should indicate in which line number in the Pig script the error occurred (debugging BinCond)
Error message should indicate in which line number in the Pig script the error occurred (debugging BinCond)
-----------------------------------------------------------------------------------------------------------
Key: PIG-790
URL: https://issues.apache.org/jira/browse/PIG-790
Project: Pig
Issue Type: Bug
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Minor

I have a simple Pig script which loads integer data and does a BinCond, where it compares col1 eq ''. There is an error message that is generated in this case, but it does not specify the line number in the script.

{code}
MYDATA = load '/user/viraj/myerrordata.txt' using PigStorage() as (col1:int, col2:int);
MYDATA_PROJECT = FOREACH MYDATA GENERATE ((col1 eq '') ? 1 : 0) as newcol1, ((col1 neq '') ? col1 - col2 : 16) as time_diff;
dump MYDATA_PROJECT;
{code}

2009-04-29 02:33:07,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-04-29 02:33:08,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2009-04-29 02:33:08,836 [main] INFO org.apache.pig.PigServer - Create a new graph.
2009-04-29 02:33:10,040 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: Incompatible types in EqualTo Operator left hand side:int right hand side:chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1240972386081.log

It would be good if the error message had a line number and a copy of the line in the script which is causing the problem. Attaching data, script and log file.
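The request above amounts to carrying source positions through type checking. A toy sketch of the idea, under assumed data structures (a hypothetical `check_types` over (line number, text, types) tuples, not Pig's type checker):

```python
# Sketch: record the source line with each statement so a type error can
# report where it happened and echo the offending line, as requested above.

def check_types(statements):
    """statements: list of (line_no, text, lhs_type, rhs_type) tuples."""
    errors = []
    for line_no, text, lhs, rhs in statements:
        if lhs != rhs:
            errors.append(
                f"ERROR 1039 at line {line_no}: Incompatible types in EqualTo "
                f"Operator left hand side:{lhs} right hand side:{rhs}\n"
                f"  > {text}"
            )
    return errors

stmts = [(2, "(col1 eq '') ? 1 : 0", "int", "chararray")]
print(check_types(stmts)[0])
```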
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703977#action_12703977 ]

Viraj Bhat commented on PIG-774:
--------------------------------

I modified the file PigScriptParser.jj, and it works.

Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
Key: PIG-774
URL: https://issues.apache.org/jira/browse/PIG-774
Project: Pig
Issue Type: Bug
Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
Fix For: 0.0.0
Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8_parser-1.patch, utf8_parser-2.patch

I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and a hard-coded Chinese character.

Pig script: chinese_data.pig
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile
{code}
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'
{code}

Input file: /user/viraj/chinese.txt
{code}
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會
{code}

I ran the above set of inputs in the following ways:

Run 1:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script; instead used the following statement:
{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
{code}

2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

In both cases:
{code}
shell$ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
-rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0
{code}

Additionally tried the dry-run option to figure out if the parameter substitution was occurring