[jira] Created: (PIG-780) Code refactoring: PlanPrinters
Code refactoring: PlanPrinters

Key: PIG-780
URL: https://issues.apache.org/jira/browse/PIG-780
Project: Pig
Issue Type: Improvement
Reporter: Gunther Hagleitner
Priority: Minor

There seems to be quite a bit of duplicated code and functionality across all the PlanPrinters in the system. It would make things easier if that were consolidated.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-775) PORelationToExprProject should create a NonSpillableDataBag to create empty bags
[ https://issues.apache.org/jira/browse/PIG-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath resolved PIG-775.
Resolution: Fixed

Patch committed.

PORelationToExprProject should create a NonSpillableDataBag to create empty bags

Key: PIG-775
URL: https://issues.apache.org/jira/browse/PIG-775
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Priority: Minor
Fix For: 0.3.0
Attachments: PIG-775.patch

PORelationToExprProject currently uses BagFactory.newDefaultBag() to create an empty bag when it has to send an empty bag on EOP. Each such empty bag is registered with the SpillableMemoryManager as a spillable bag. Since the bag is empty, it should not be registered as spillable. NonSpillableDataBag can be used for this.
[jira] Commented: (PIG-712) Need utilities to create schemas for bags and tuples
[ https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702446#action_12702446 ]

Santhosh Srinivasan commented on PIG-712:

I made changes to the javadoc to ensure that there are no javadoc warnings. PIG-782 was created to track this.

Need utilities to create schemas for bags and tuples

Key: PIG-712
URL: https://issues.apache.org/jira/browse/PIG-712
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Priority: Minor
Fix For: 0.3.0
Attachments: Pig_712_Patch.txt

Pig should provide utilities to create bag and tuple schemas. Currently, users return schemas in the outputSchema method and end up with very verbose boilerplate code. It would be very nice if Pig encapsulated the boilerplate code in utility methods.
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702445#action_12702445 ]

Alan Gates commented on PIG-774:

Two lines of change are needed to fix this:
1. In QueryParser.jjt, introduce a new option for handling unicode
2. In the LogicalPlanBuilder, use the getBytes method with the UTF-8 charset

These changes also need to be propagated to the remaining JavaCC parsers. Then testing will need to be done. Estimate 3-5 days of work.

Reference: http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf

Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

Key: PIG-774
URL: https://issues.apache.org/jira/browse/PIG-774
Project: Pig
Issue Type: Bug
Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
Fix For: 0.0.0
Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile

I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character.

Pig script: chinese_data.pig
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'

Input file: /user/viraj/chinese.txt
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會

I ran the above set of inputs in the following ways:

Run 1:
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script instead used the following statement.
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';

java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig

2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

In both cases:
ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37
[jira] Resolved: (PIG-782) javadoc throws warnings - this would break hudson patch test process.
[ https://issues.apache.org/jira/browse/PIG-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan resolved PIG-782.
Resolution: Fixed

I changed the javadoc to remove the warnings.

javadoc throws warnings - this would break hudson patch test process.

Key: PIG-782
URL: https://issues.apache.org/jira/browse/PIG-782
Project: Pig
Issue Type: Bug
Environment: javadoc throws warnings - this would break the hudson patch test process.
Reporter: Giridharan Kesavan

[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:233: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:205: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:185: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:220: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:158: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:134: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:105: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:120: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:48: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:77: warning - @return tag has no arguments.
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @param argument names is not a parameter name.
[jira] Issue Comment Edited: (PIG-712) Need utilities to create schemas for bags and tuples
[ https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702446#action_12702446 ]

Santhosh Srinivasan edited comment on PIG-712 at 4/24/09 10:07 AM:

I made changes to the javadoc to ensure that there are no javadoc warnings. PIG-782 was created to track this. Note that I fixed it to get rid of the javadoc warnings. The contents of the javadoc are not consistent and this has to be addressed.
[jira] Created: (PIG-783) PigStorage does not handle unicode characters above \u007f as a separator in the data.
PigStorage does not handle unicode characters above \u007f as a separator in the data.

Key: PIG-783
URL: https://issues.apache.org/jira/browse/PIG-783
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: Alan Gates
Priority: Minor

PigStorage reads one byte at a time to find the separator character in the data. So any multi-byte UTF-8 character will not work as a separator.
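To make the byte-level point concrete, here is a minimal, self-contained Java sketch. It is not PigStorage's code; the class and method names are hypothetical. It only shows that a character above \u007f occupies more than one byte in UTF-8, so a scan that compares one byte at a time can never match it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from the Pig source tree.
final class SeparatorBytes {
    // Number of bytes the given separator occupies when encoded as UTF-8.
    static int utf8Length(String separator) {
        return separator.getBytes(StandardCharsets.UTF_8).length;
    }

    // A single-byte scan can only find separators that encode to exactly one byte.
    static boolean worksWithByteScan(String separator) {
        return utf8Length(separator) == 1;
    }

    public static void main(String[] args) {
        // Control character U+0001: one byte in UTF-8.
        System.out.println(worksWithByteScan("\u0001"));
        // Section sign U+00A7: two bytes in UTF-8.
        System.out.println(worksWithByteScan("\u00a7"));
    }
}
```

A separator-aware reader would presumably have to compare the full encoded byte sequence of the separator rather than a single byte; that design is an assumption here, not something stated in the issue.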
[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings
[ https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702524#action_12702524 ]

Santhosh Srinivasan commented on PIG-784:

By default, these warnings are aggregated and should not appear in the logs. If required, the warning aggregation can be turned off.

PigStorage() - need ability to turn off Attempt to access field warnings

Key: PIG-784
URL: https://issues.apache.org/jira/browse/PIG-784
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

I want an option to PigStorage() for LOAD which will allow me to turn off the Attempt to access field warnings. Something like:

{code}
define PigStorage PigStorage(warn_load_nonexistent_field=off);
A = load 'mydata.txt' using PigStorage() as (col1: chararray, col2_optional: int, col3_optional: float);
{code}

or

{code}
A = load 'mydata.txt' using PigStorage(warn_load_nonexistent_field=0) as (col1: chararray, col2_optional: int, col3_optional: float);
{code}

If I have a very large data set with optional columns that are not populated (and have no tab separator), I'd like to just read the file as is and not generate the warnings. The warnings are problematic because they fill up the logging output, and every System.out.println slows down the overall processing, especially if the data file being processed is missing one or more columns on every single row.
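For readers unfamiliar with the term, the following is a rough sketch of what aggregating warnings can mean: count each distinct message and report one summary line per message, instead of logging every occurrence. This is a hypothetical, self-contained illustration; the class name and output format are assumptions and it is not Pig's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of warning aggregation; not Pig's implementation.
final class WarningAggregator {
    private final Map<String, Long> counts = new LinkedHashMap<>();

    // Record one occurrence of a warning instead of logging it immediately.
    void warn(String message) {
        counts.merge(message, 1L, Long::sum);
    }

    // How many times the given message has been recorded so far.
    long count(String message) {
        return counts.getOrDefault(message, 0L);
    }

    // One summary line per distinct message, in first-seen order.
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sb.append(e.getValue()).append(" time(s): ").append(e.getKey()).append('\n');
        }
        return sb.toString();
    }
}
```

With this shape, a row-by-row warning costs one map update per row and the log receives a single line at the end of the job.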
[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings
[ https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702528#action_12702528 ]

David Ciemiewicz commented on PIG-784:

@Santhosh Hmmm. I'm running Pig in local mode with the latest published build and I get lots of warnings and they are not aggregated:

-bash-3.00$ pig -exectype local -latest cat.pig
USING: /grid/0/gs/pig/current
2009-04-24 20:02:55,666 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,667 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
2009-04-24 20:02:55,672 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-24 20:02:55,672 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,1,42.0F)
(,,)
(,,)
(,,)
(,,)
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702533#action_12702533 ]

David Ciemiewicz commented on PIG-774:

A somewhat related bug is JIRA PIG-755 - the difficulty of debugging issues related to passed parameters. If Pig produced an output file of the code with parameter substitutions made, we could have more rapidly isolated the problem.
[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function
[ https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702535#action_12702535 ]

David Ciemiewicz commented on PIG-759:

If hbase has named columns in its schema, why wouldn't it be appropriate to say something like:

table = load '$tablename/$subsection' using HBaseStorage() as (a, b);

Since HBaseStorage() is specified:
1) Isn't hbase:// implicit?
2) Shouldn't I be able to just specify the names in the AS clause?

HBaseStorage scheme for Load/Slice function

Key: PIG-759
URL: https://issues.apache.org/jira/browse/PIG-759
Project: Pig
Issue Type: Bug
Reporter: Gunther Hagleitner

We would like to change the HBaseStorage function to use a scheme when loading a table in pig. The scheme we are thinking of is: hbase. So in order to load an hbase table in a pig script the statement should read:

{noformat}
table = load 'hbase://tablename' using HBaseStorage();
{noformat}

If the scheme is omitted, pig would assume the tablename to be an hdfs path, and the storage function would use the last component of the path as a table name and output a warning.

For details on why, see jira issue: PIG-758
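The lookup rule described in the proposal (use the name after hbase://, otherwise fall back to the last path component) could be sketched roughly as below. This is a hypothetical helper written only for illustration with java.net.URI; the class and method names are made up, and HBaseStorage's real parsing code is not shown in this thread.

```java
import java.net.URI;

// Hypothetical helper; not HBaseStorage's actual code.
final class TableLocation {
    // Returns the table name for a location such as hbase://tablename.
    // Without the hbase scheme, falls back to the last path component,
    // which is the case where the proposal says a warning should be output.
    static String tableName(String location) {
        URI uri = URI.create(location);
        if ("hbase".equals(uri.getScheme())) {
            // For hbase://tablename the table name is the authority part.
            return uri.getAuthority();
        }
        String path = uri.getPath() == null ? location : uri.getPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
```

For example, tableName("hbase://tablename") and tableName("/user/viraj/tablename") would both yield "tablename", which matches the fallback behaviour the issue describes.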
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702538#action_12702538 ] Santhosh Srinivasan commented on PIG-774: - In order to obtain the substituted pig script use the -dryrun switch. By default, the substituted pig script is not stored on disk. Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput Found 2 items drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs -rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0 = Additionally tried the dry-run option to figure out if the parameter substitution was
[jira] Commented: (PIG-598) Parameter substitution ($PARAMETER) should not be performed in comments
[ https://issues.apache.org/jira/browse/PIG-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702563#action_12702563 ]

George Mavromatis commented on PIG-598:

This has been flagged as minor, but I think it is a rather serious and elementary one. It takes away the ability to comment and uncomment parameters, and it produces very bizarre error messages for people who are not aware of it. Can we increase the priority?

Parameter substitution ($PARAMETER) should not be performed in comments

Key: PIG-598
URL: https://issues.apache.org/jira/browse/PIG-598
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: David Ciemiewicz
Priority: Minor

Compiling the following code example will generate an error that $NOT_A_PARAMETER is an Undefined Parameter. This is problematic because sometimes you want to comment out parts of your code, including parameters, so that you don't have to define them. Thus I think it would be really good if parameter substitution was not performed in comments.

{code}
-- $NOT_A_PARAMETER
{code}

{code}
-bash-3.00$ pig -exectype local -latest comment.pig
USING: /grid/0/gs/pig/current
java.lang.RuntimeException: Undefined parameter : NOT_A_PARAMETER
    at org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:221)
    at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:106)
    at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:86)
    at org.apache.pig.Main.runParamPreprocessor(Main.java:394)
    at org.apache.pig.Main.main(Main.java:296)
{code}
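As an illustration of the requested behaviour, here is a small, self-contained Java sketch that substitutes $NAME parameters in a line but leaves everything after a "--" comment marker untouched. It is hypothetical and is not Pig's ParameterSubstitutionPreprocessor; the class name, the regular expression, and the simplification that any "--" starts a comment (even inside a quoted string) are all assumptions made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not Pig's ParameterSubstitutionPreprocessor.
final class CommentAwareSubstitution {
    // Matches $NAME where NAME starts with a letter or underscore,
    // so positional references such as $0 are left alone.
    private static final Pattern PARAM = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    // Substitutes parameters in one line, skipping the "--" comment tail.
    static String substituteLine(String line, Map<String, String> params) {
        int commentStart = line.indexOf("--");
        String code = commentStart < 0 ? line : line.substring(0, commentStart);
        String comment = commentStart < 0 ? "" : line.substring(commentStart);

        Matcher m = PARAM.matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new RuntimeException("Undefined parameter : " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.append(comment).toString();
    }
}
```

With this approach the line "-- $NOT_A_PARAMETER" passes through unchanged, while an undefined parameter in the code part of a line still raises the same kind of error as in the stack trace above.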
[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function
[ https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702602#action_12702602 ]

Alex Newman commented on PIG-759:

I actually like the protocol specification as it allows us flexibility to hit hbase with another protocol like thrift but I may be overthinking it.
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702620#action_12702620 ]

Viraj Bhat commented on PIG-774:

One workaround for this issue is to use a FilterFunc which reads its filter list from a file written on HDFS. Care has to be taken in the FilterFunc UDF to construct the BufferedReader so that it reads UTF-8 data.

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Properties;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.FileLocalizer;

public class FILTERFROMFILE extends FilterFunc {
    private String FilterListFileName = "";

    private void init() throws IOException {
        Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
        InputStream is = FileLocalizer.openDFSFile(FilterListFileName, props);
        // Decode the filter list explicitly as UTF-8 rather than with the platform default charset
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF8"));
    }

    public Boolean exec(Tuple input) throws IOException {
        init();
        //do the matching here
    }
}
{code}

Pig code to instantiate the filter function UDF:

{code}
register pigudf/myfilterfunc.jar;
define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');
rmf chineseoutput;
I = load '/user/viraj/testchinese' using PigStorage('\u0001') as (teststring:chararray);
J = filter I by MATCHQUERY(teststring);
store J into 'chineseoutput';
{code}
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702621#action_12702621 ] Viraj Bhat commented on PIG-774: Ciemo, as stated in the original problem description there is a '-r' switch for achieving the same. {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig {code} Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. 
Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput Found 2 items drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs -rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0
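A zero-length part-0 in both runs means the filter matched no rows. One plausible explanation, which this message does not confirm, is that the query string passes through a charset that cannot represent Chinese characters before it is compared. The sketch below is independent of Pig and only illustrates that general effect in Java; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not Pig code: shows how a non-UTF-8 charset loses CJK text.
public class LossyEncodingDemo {

    // Encodes the text as US-ASCII and decodes it again. Characters that
    // US-ASCII cannot represent are replaced with '?' during encoding.
    public static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Encodes the text as UTF-8 and decodes it again; nothing is lost.
    public static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First two Chinese characters of the query string, written as escapes
        // so the source file itself does not depend on the compiler's encoding.
        String query = "\u6B4C\u624B";
        System.out.println(roundTripAscii(query).equals(query)); // false
        System.out.println(roundTripUtf8(query).equals(query));  // true
    }
}
```

A string damaged this way can never equal the UTF-8 data loaded from HDFS, so an equality filter on it returns nothing.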
[jira] Issue Comment Edited: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702620#action_12702620 ]

Viraj Bhat edited comment on PIG-774 at 4/24/09 4:35 PM:

One workaround for this issue is using the FilterFunc, which reads its filter list from a file written on HDFS. Care has to be taken in the FilterFunc UDF, to invoke the BufferedReader to read UTF8 data.

{code}
public class FILTERFROMFILE extends FilterFunc {
    private String FilterListFileName = "";

    private void init() throws IOException {
        Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
        InputStream is = FileLocalizer.openDFSFile(FilterListFileName, props);
        // Name the charset explicitly so the filter list is decoded as UTF-8.
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    public Boolean exec(Tuple input) throws IOException {
        init();
        // do the matching here
    }
}
{code}

Pig code to instantiate the filter function UDF:

{code}
register pigudf/myfilterfunc.jar;
define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');
rmf chineseoutput;
I = load '/user/viraj/testchinese' using PigStorage('\u0001') as (teststring:chararray);
J = filter I by MATCHQUERY(teststring);
store J into 'chineseoutput';
{code}

was (Author: viraj):

One workaround for this issue is using the FilterFunc, which reads its filter list from a file written on HDFS. Care has to be taken in the FilterFunc UDF, to invoke the BufferedReader which read UTF8 data.

{code}
public class FILTERFROMFILE extends FilterFunc {
    private String FilterListFileName = "";

    private void init() throws IOException {
        Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
        InputStream is = FileLocalizer.openDFSFile(FilterListFileName, props);
        // Name the charset explicitly so the filter list is decoded as UTF-8.
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    public Boolean exec(Tuple input) throws IOException {
        init();
        // do the matching here
    }
}
{code}

Pig code to instantiate the filter function UDF:

{code}
register pigudf/myfilterfunc.jar;
define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');
rmf chineseoutput;
I = load '/user/viraj/testchinese' using PigStorage('\u0001') as (teststring:chararray);
J = filter I by MATCHQUERY(teststring);
store J into 'chineseoutput';
{code}

Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile

I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character.
Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO
[jira] Commented: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode
[ https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702645#action_12702645 ]

Viraj Bhat commented on PIG-755:

Ciemo, presently there is an option in Pig known as dry run, or '-r', which produces a pigscript.substituted file (ASCII text) in which all the parameters are replaced with their actual values.

Viraj

Difficult to debug parameter substitution problems based on the error messages when running in local mode
Key: PIG-755 URL: https://issues.apache.org/jira/browse/PIG-755 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: inputfile.txt, localparamsub.pig

I have a script in which I do a parameter substitution for the input file. I have a use case where I find it difficult to debug based on the error messages in local mode.

{code}
A = load '$infile' using PigStorage() as ( date: chararray, count : long, gmean : double );
dump A;
{code}

1) I run it in local mode with the input file in the current working directory

{code}
prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -exectype local -param infile='inputfile.txt' localparamsub.pig
{code}

2009-04-07 00:03:51,967 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore - Received error from storer function: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
2009-04-07 00:03:51,970 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!!
2009-04-07 00:03:51,971 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 failed!
2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log ERROR 1066: Unable to open iterator for alias A org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:352) Caused by: java.io.IOException: Job terminated with anomalous status FAILED at org.apache.pig.PigServer.openIterator(PigServer.java:433) ... 5 more 2) I run it in map reduce mode {code} prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param infile='inputfile.txt' localparamsub.pig {code} 2009-04-07 00:07:31,660 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 2009-04-07 00:07:32,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 2009-04-07 00:07:34,543 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-07 00:07:39,540 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-07 00:07:39,540 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: inputfile does not exist. 
Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log ERROR 2100: inputfile does not exist. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at
[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-774:

Description:

I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character.

Pig script: chinese_data.pig

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile

queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'

Input file: /user/viraj/chinese.txt

shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會

I ran the above set of inputs in the following ways:

Run 1:

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script instead used the following statement.

{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}

java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig

2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

In both cases:

{code}
shell $ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
-rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0
{code}

Additionally tried the dry-run option to figure out if the parameter substitution was occurring properly.

{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}

{code}
shell$ file chinese_data.pig.substituted
chinese_data.pig.substituted: ASCII text
shell$ cat chinese_data.pig.substituted
{code}

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == ' ?? ??';
store J into 'chineseoutput';
{code}

This issue has to do with the parser not handling UTF-8 data.

was: I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese