[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-774:
---------------------------
    Attachment: utf8_parser-1.patch

As Alan said, adding the option to QueryParser.jjt and ParamLoader.jj will do the trick. We probably do not need to hardcode UTF-8 into getBytes. If the OS encoding is UTF-8 (LANG=UTF-8), getBytes generates a byte array in the OS encoding, which is UTF-8. If the OS uses a native encoding (LANG=GB2312), getBytes generates a byte array in that native encoding, and SimpleCharStream then interprets the input stream as that native encoding as well, so everything works.

One thing I want to point out: on a UTF-8 OS, everything is fine. However, on a legacy system with a native encoding, PigStorage treats all input/output files as UTF-8. That is reasonable, because all data files come from or go to the Hadoop backend, where UTF-8 is highly desirable. However, those input/output files then cannot be read with vi on an OS with a native encoding; most applications (e.g. vi, cat) interpret input files using the OS encoding. In addition, if we do a Pig dump on such an OS, we see a UTF-8 output stream, which is garbled. Script files and parameter files are local, and most users will edit them with vi, so we should interpret script files and parameter files in the OS encoding.

utf8_parser-1.patch is a preliminary patch. Viraj, can you give it a try? We also need to fix jline; it does not currently handle multibyte characters well.

Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
-------------------------------------------------------------------------------------------------------------------------------------

Key: PIG-774
URL: https://issues.apache.org/jira/browse/PIG-774
Project: Pig
Issue Type: Bug
Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
Fix For: 0.0.0
Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch

I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parameterized query string, plus a variant with the Chinese characters hard-coded.

Pig script: chinese_data.pig
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'

Input file: /user/viraj/chinese.txt
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會

I ran the above set of inputs in the following ways:

Run 1:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script and instead used the following statement:
{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}

java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig

2009-04-22
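The encoding argument in Daniel Dai's comment, that the same Chinese text becomes different byte sequences under UTF-8 versus a native encoding such as GB2312, so the parser must agree with whatever encoding produced the file, can be illustrated outside Pig with a few lines of Python. This is purely an illustration, not Pig's actual code path (which goes through Java's getBytes and SimpleCharStream):

```python
# Illustration only: the same Chinese text yields different byte
# sequences depending on the encoding, which is why a parser that
# assumes one encoding misreads files written in another.
text = "歌手"  # first two characters of the query string from PIG-774

utf8_bytes = text.encode("utf-8")     # 3 bytes per CJK character
gb2312_bytes = text.encode("gb2312")  # 2 bytes per CJK character

print(len(utf8_bytes))    # 6
print(len(gb2312_bytes))  # 4

# Decoding with the wrong charset either fails or produces mojibake,
# which is the "messy" dump output described in the comment:
garbled = utf8_bytes.decode("gb2312", errors="replace")
print(garbled != text)  # True
```

The same mismatch occurs in either direction: a GB2312-encoded script read as UTF-8 is equally unreadable, which is why interpreting local script and parameter files in the OS encoding matters.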
[jira] Commented: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode
[ https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702703#action_12702703 ]

David Ciemiewicz commented on PIG-755:
--------------------------------------

Thanks. I didn't know about the dry-run option. Hopefully it will someday produce UTF-8 text, given that some of the parameters will be in Chinese or Japanese characters. :^)

Difficult to debug parameter substitution problems based on the error messages when running in local mode
---------------------------------------------------------------------------------------------------------

Key: PIG-755
URL: https://issues.apache.org/jira/browse/PIG-755
Project: Pig
Issue Type: Bug
Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Fix For: 0.3.0
Attachments: inputfile.txt, localparamsub.pig

I have a script in which I do a parameter substitution for the input file. I have a use case where I find it difficult to debug based on the error messages in local mode.
{code}
A = load '$infile' using PigStorage() as ( date: chararray, count : long, gmean : double );
dump A;
{code}

1) I run it in local mode with the input file in the current working directory:
{code}
prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -exectype local -param infile='inputfile.txt' localparamsub.pig
{code}

2009-04-07 00:03:51,967 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore - Received error from storer function: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
2009-04-07 00:03:51,970 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!!
2009-04-07 00:03:51,971 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 failed!
2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A
Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log

ERROR 1066: Unable to open iterator for alias A
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A
        at org.apache.pig.PigServer.openIterator(PigServer.java:439)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
        at org.apache.pig.Main.main(Main.java:352)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
        at org.apache.pig.PigServer.openIterator(PigServer.java:433)
        ... 5 more

2) I run it in map-reduce mode:
{code}
prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param infile='inputfile.txt' localparamsub.pig
{code}

2009-04-07 00:07:31,660 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-04-07 00:07:32,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2009-04-07 00:07:34,543 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-07 00:07:39,540 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-07 00:07:39,540 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: inputfile does not exist.
Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log

ERROR 2100: inputfile does not exist.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A
        at org.apache.pig.PigServer.openIterator(PigServer.java:439)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
        at
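The dry-run option mentioned in the comment on PIG-755 writes out the script after parameter substitution, which is often the quickest way to see what '$infile' actually expanded to before anything runs. The substitution step itself is essentially a textual replacement of $name tokens; the sketch below is a deliberately simplified model in Python, not Pig's real preprocessor (which also handles %declare, %default, and quoting rules):

```python
import re

def substitute(script: str, params: dict) -> str:
    """Replace $name tokens with parameter values: a crude model of
    Pig's -param / -param_file preprocessing, for illustration only."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            # Pig reports undefined parameters; we model that with an error.
            raise KeyError("Undefined parameter: " + name)
        return params[name]
    return re.sub(r"\$(\w+)", repl, script)

script = "A = load '$infile' using PigStorage();"
print(substitute(script, {"infile": "inputfile.txt"}))
# A = load 'inputfile.txt' using PigStorage();
```

Note that after substitution the script contains only the literal 'inputfile.txt', which in map-reduce mode is resolved against HDFS, not the local working directory. That is exactly why the local-mode and map-reduce-mode runs above fail differently for the same missing file.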
[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702704#action_12702704 ]

David Ciemiewicz commented on PIG-506:
--------------------------------------

Alan,

This seems a much cleaner way to set up native Hadoop map-reduce jobs than the command-line interfaces people use today. It might be worth it just for that alone. I think you'd need to gather some examples from non-Pig users and prototype them as Pig/NATIVE scripts to demonstrate what the advantages would be. For me, as a primarily Pig user, there is some appeal because I could benefit from borrowing others' code.

Does pig need a NATIVE keyword?
-------------------------------

Key: PIG-506
URL: https://issues.apache.org/jira/browse/PIG-506
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor

Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in Pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that Pig could not easily express, a legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of Pig and map-reduce jobs.

What if instead Pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map-reduce case, we would assume that this indicated a collection of one or more fully contained map-reduce jobs, so that Pig would store the data, invoke the map-reduce jobs, and then read the resulting data to continue. It might look something like this:

{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code}

This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the Pig pipeline. Here the pipeline would be broken, the data written to disk, the native block invoked, and the data then read back from disk.

Another alternative is to say this is unnecessary, because the user can do the coordination from Java, using the PigServer interface to run Pig and calling the map-reduce job explicitly. The advantages of the NATIVE keyword are that the user need not worry about coordination between the jobs, since Pig will take care of it, and that the user can make use of existing Java applications without being a Java programmer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
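The store / invoke / re-read cycle that the proposal describes for NATIVE can be sketched as plain orchestration code. The sketch below is a hypothetical illustration of the control flow only; it uses no Pig or Hadoop APIs, and the file names and the stand-in "native job" are invented:

```python
import os
import tempfile

def run_native_block(records, native_job):
    """Model of the proposed NATIVE semantics: break the pipeline,
    persist the current relation, hand it to an external job, and
    read the job's output back into the pipeline."""
    workdir = tempfile.mkdtemp()
    frompig = os.path.join(workdir, "frompig")  # names echo the example above
    topig = os.path.join(workdir, "topig")
    # 1) Pig stores the current relation to disk.
    with open(frompig, "w") as f:
        f.writelines(r + "\n" for r in records)
    # 2) The fully contained native (e.g. map-reduce) job runs on that file.
    native_job(frompig, topig)
    # 3) Pig loads the job's output and the pipeline continues.
    with open(topig) as f:
        return [line.rstrip("\n") for line in f]

def upper_job(infile, outfile):
    # Stand-in for a user's mymr.jar: any program that reads infile
    # and writes outfile.
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            fout.write(line.upper())

print(run_native_block(["a", "b"], upper_job))  # ['A', 'B']
```

The point of the sketch is the coordination burden: steps 1-3 are exactly what a user must hand-write today with PigServer, and what a NATIVE keyword would fold into the script.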
[jira] Commented: (PIG-602) Pass global configurations to UDF
[ https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702707#action_12702707 ]

David Ciemiewicz commented on PIG-602:
--------------------------------------

This sounds a lot like shell script environment variables. As such, maybe it should follow the same rich set of operations and semantics that you get with environment variables.

How is PigConf different from set properties in Pig? Why can't both use the same mechanism? Should they use the same mechanism?

Can / should this same mechanism let my UDFs know when Pig is in local mode versus hdfs mode? [JIRA PIG-756] Or should something different be used?

When in grunt, how can I inspect what the current PigConf values are? (Useful for logging and debugging.)

By what mechanism can I set or override these values from within my Pig script? Can I set the values to one thing at one point in the Pig script and change them to a new value later in the script?

Pass global configurations to UDF
---------------------------------

Key: PIG-602
URL: https://issues.apache.org/jira/browse/PIG-602
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Yiping Han
Assignee: Alan Gates

We are seeking an easy way to pass a large number of global configurations to UDFs. Our application contains many Pig jobs and has a large number of configurations. Passing configurations through the command line is not ideal (modifying a single parameter requires changing multiple command lines), and putting everything into the Hadoop conf is not ideal either. We would like to see Pig provide a facility that lets us pass a configuration file in some format (XML?) and then makes it available throughout all the UDFs.
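A file-based mechanism of the kind requested above could look like the following sketch: a configuration file is parsed once and exposed through a global lookup that every UDF consults. This is purely hypothetical, not Pig's API; the properties-style format (rather than XML) and all names here are invented for illustration:

```python
def parse_properties(text: str) -> dict:
    """Parse a minimal key=value configuration file.
    Blank lines and lines starting with '#' are ignored."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

# In the proposal, this would be loaded once per Pig run and made
# visible to every UDF, instead of being threaded through command lines.
GLOBAL_CONF = parse_properties("""
# shared settings for all UDF invocations (hypothetical)
lookup.table = /data/lookup
max.retries = 3
""")

def my_udf(value):
    # A UDF reading a global configuration value rather than a
    # per-invocation constructor argument.
    retries = int(GLOBAL_CONF["max.retries"])
    return (value, retries)

print(my_udf("x"))  # ('x', 3)
```

Changing a single setting then means editing one file rather than every command line, which is the maintenance problem the issue describes.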