[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

2009-04-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-774:
---

Attachment: utf8_parser-1.patch

As Alan said, adding an option to QueryParser.jjt and ParamLoader.jj will do the 
trick. We probably do not need to hardcode UTF-8 into getBytes(). If the OS 
encoding is UTF-8 (LANG=UTF-8), getBytes() generates a byte array using the OS 
encoding, which is UTF-8. If the OS uses a native encoding (LANG=GB2312), 
getBytes() generates a byte array in that native encoding, and SimpleCharStream 
then interprets the input stream as the native encoding as well, so everything 
works.
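The getBytes() behavior described above can be sketched in plain Java. This is a minimal illustration, not part of the patch; the class name EncodingDemo is made up:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Round-trip a string through bytes in the given charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String query = "歌手香港情牽女人心演唱會";

        // getBytes() with no argument uses the platform default charset,
        // which on Unix is driven by LANG -- so the bytes match whatever
        // the OS tools (vi, cat) expect.
        byte[] platformBytes = query.getBytes();

        // getBytes(StandardCharsets.UTF_8) pins UTF-8 regardless of LANG;
        // that is only safe if the consumer also decodes UTF-8.
        byte[] utf8Bytes = query.getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8Bytes.length);         // 36: each CJK char here is 3 bytes in UTF-8
        System.out.println(roundTrip(query, StandardCharsets.UTF_8).equals(query)); // true
        System.out.println(Charset.defaultCharset()); // platform dependent
    }
}
```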

One thing I want to point out: on a UTF-8 OS, everything is perfect. However, 
on a legacy system with a native encoding, PigStorage treats all input/output 
files as UTF-8, which is reasonable because all data files come from or go to 
the Hadoop backend, for which UTF-8 is highly desirable. However, these 
input/output files then cannot be read with vi on an OS with a native encoding, 
since most applications (e.g. vi, cat) interpret input files using the OS 
encoding. In addition, if we do a Pig dump on such an OS, we will see a UTF-8 
output stream, which is garbled. Script files and parameter files are local, 
and most users will edit them with vi, so we should interpret script files and 
parameter files using the OS encoding.
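The split argued for here — local script and parameter files in the OS encoding, backend data in UTF-8 — can be sketched as follows (hypothetical helper names, not the patch itself):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalFileEncoding {
    // Local script/param files: decode with the platform charset,
    // i.e. the encoding vi and cat on this OS produce and expect.
    static String readLocalFile(Path p) throws IOException {
        return new String(Files.readAllBytes(p), Charset.defaultCharset());
    }

    // Data bound for the Hadoop backend: always encode as UTF-8.
    static byte[] toBackendBytes(String record) {
        return record.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // ASCII content round-trips in any platform charset; CJK content
        // additionally requires the platform charset to support it.
        Path param = Files.createTempFile("nextgen_paramfile", ".txt");
        String line = "queryid=20090311";
        Files.write(param, line.getBytes(Charset.defaultCharset()));

        System.out.println(readLocalFile(param).equals(line)); // true
        System.out.println(toBackendBytes("歌").length);       // 3
        Files.deleteIfExists(param);
    }
}
```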

utf8_parser-1.patch is a preliminary patch. Viraj, can you give it a try?

We also need to fix jline; it currently does not handle multibyte characters well.

 Pig does not handle Chinese characters (in both the parameter substitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 utf8_parser-1.patch



[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-774:
---

Description: 
I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese and wrote it 
to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in 
Step 1.
3) Created a Pig script which takes in the parameterized query string and a 
hard-coded Chinese character.

Pig script: chinese_data.pig

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';

store J into 'chineseoutput';
dump J;
{code}
=

Parameter file: nextgen_paramfile
=
queryid=20090311
querystring='   歌手香港情牽女人心演唱會'
=

Input file: /user/viraj/chinese.txt
=
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會
=

I ran the above set of inputs in the following ways:

Run 1:
=
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}
=
2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
0% complete
2009-04-22 01:31:50,720 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2009-04-22 01:31:50,720 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
Success!
=

Run 2: removed the parameter substitution in the Pig script and instead used 
the following statement.
=
{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}
=
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main chinese_data_withoutparam.pig
=
2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
0% complete
2009-04-22 01:35:32,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2009-04-22 01:35:32,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
Success!
=

In both cases:
=
{code}
shell $ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
/user/viraj/chineseoutput/_logs
-rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
/user/viraj/chineseoutput/part-0
{code}
=

Additionally, I tried the dry-run option to determine whether the parameter 
substitution was occurring properly.
=
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
=
{code}
shell$ file chinese_data.pig.substituted 
chinese_data.pig.substituted: ASCII text
shell$ cat chinese_data.pig.substituted 
{code}

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

J = filter I by $0 == ' ??  ??';

store J into 'chineseoutput';
{code}
=
This issue has to do with the parser not handling UTF-8 data. 
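The "??" characters in the substituted file are the classic symptom of forcing CJK text through a charset that cannot represent it. A minimal reproduction, purely illustrative:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String query = "歌手香港情牽女人心演唱會";

        // Encoding CJK text with US-ASCII substitutes '?' for every
        // unmappable character -- the same corruption seen in
        // chinese_data.pig.substituted, which `file` then reports as ASCII.
        byte[] ascii = query.getBytes(StandardCharsets.US_ASCII);
        String mangled = new String(ascii, StandardCharsets.US_ASCII);

        System.out.println(mangled);               // ????????????
        System.out.println(mangled.equals(query)); // false
    }
}
```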
