[jira] Created: (PIG-780) Code refactoring: PlanPrinters

2009-04-24 Thread Gunther Hagleitner (JIRA)
Code refactoring: PlanPrinters
--

 Key: PIG-780
 URL: https://issues.apache.org/jira/browse/PIG-780
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
Priority: Minor


There seems to be quite a bit of duplicated code and functionality across all 
the PlanPrinters in the system. It would make things easier if that were 
consolidated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-775) PORelationToExprProject should create a NonSpillableDataBag to create empty bags

2009-04-24 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-775.


Resolution: Fixed

Patch committed.

 PORelationToExprProject should create a NonSpillableDataBag to create empty 
 bags
 

 Key: PIG-775
 URL: https://issues.apache.org/jira/browse/PIG-775
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Priority: Minor
 Fix For: 0.3.0

 Attachments: PIG-775.patch


 PORelationToExprProject currently uses BagFactory.newDefaultBag() to create 
 an empty bag in cases where it has to send an empty bag on EOP - each such 
 empty bag created will be registered with the SpillableMemoryManager as a 
 spillable bag. Since it is an empty bag, it really should not be registered 
 as a spillable bag. For this, NonSpillableDataBag can be used.




[jira] Commented: (PIG-712) Need utilities to create schemas for bags and tuples

2009-04-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702446#action_12702446
 ] 

Santhosh Srinivasan commented on PIG-712:
-

I made changes to the javadoc to ensure that there are no javadoc warnings. 
PIG-782 was created to track this.

 Need utilities to create schemas for bags and tuples
 

 Key: PIG-712
 URL: https://issues.apache.org/jira/browse/PIG-712
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.3.0

 Attachments: Pig_712_Patch.txt


 Pig should provide utilities to create bag and tuple schemas. Currently, 
 users return schemas in outputSchema method and end up with very verbose 
 boiler plate code. It will be very nice if Pig encapsulates the boiler plate 
 code in utility methods.




[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702445#action_12702445
 ] 

Alan Gates commented on PIG-774:


Two lines of change are needed to fix this:
1. In QueryParser.jjt, introduce a new option for handling unicode
2. In the LogicalPlanBuilder, use the getBytes method with the UTF-8
charset

These changes also need to be propagated to the remaining
JavaCC parsers.  Then testing will need to be done.  Estimate 3-5 days of work.

Reference:
http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf
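
The effect of the second change can be illustrated in isolation. The sketch below is not taken from LogicalPlanBuilder; the class and method names are hypothetical, and ISO-8859-1 merely stands in for a platform default charset that cannot represent Chinese text. It only shows that naming UTF-8 explicitly in getBytes preserves the characters, while an unsuitable charset silently replaces them. (It uses StandardCharsets, which needs Java 7 or later; on older JVMs getBytes("UTF-8") would be the equivalent call.)

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from the Pig code base.
final class Utf8ConstantDemo {

    // Encode with an explicitly named charset, as the second change proposes.
    static byte[] encodeUtf8(String constant) {
        return constant.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for a platform default charset that cannot represent Chinese text.
    static byte[] encodeLatin1(String constant) {
        return constant.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First four characters of the query string from the bug report, written
        // as escapes so that the encoding of this source file does not matter.
        String query = "\u6b4c\u624b\u9999\u6e2f";
        byte[] utf8 = encodeUtf8(query);
        byte[] latin1 = encodeLatin1(query);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("UTF-8 round trip intact: "
                + new String(utf8, StandardCharsets.UTF_8).equals(query));
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("ISO-8859-1 round trip intact: "
                + new String(latin1, StandardCharsets.ISO_8859_1).equals(query));
    }
}
```

With the explicit UTF-8 charset each of these characters takes three bytes and decodes back unchanged; with the stand-in charset each one collapses to a single replacement byte, so a later comparison against UTF-8 data cannot match.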



[jira] Resolved: (PIG-782) javadoc throws warnings - this would break hudson patch test process.

2009-04-24 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan resolved PIG-782.
-

Resolution: Fixed

I changed the javadoc to remove the warnings.

 javadoc throws warnings - this would break hudson patch test process.
 -

 Key: PIG-782
 URL: https://issues.apache.org/jira/browse/PIG-782
 Project: Pig
  Issue Type: Bug
 Environment: javadoc throws warnings - this would break the hudson 
 patch test process.
Reporter: Giridharan Kesavan

   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:233: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:205: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:185: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:220: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:158: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:134: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:105: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:120: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:48: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:77: warning - @return tag has no arguments.
   [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @param argument names is not a parameter name.




[jira] Issue Comment Edited: (PIG-712) Need utilities to create schemas for bags and tuples

2009-04-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702446#action_12702446
 ] 

Santhosh Srinivasan edited comment on PIG-712 at 4/24/09 10:07 AM:
---

I made changes to the javadoc to ensure that there are no javadoc warnings. 
PIG-782 was created to track this. Note that my fix only gets rid of the 
javadoc warnings. The contents of the javadoc are still not consistent, and 
this has to be addressed.

  was (Author: sms):
I made changes to the javadoc to ensure that there are no javadoc warnings. 
PIG-782 was created to track this.
  



[jira] Created: (PIG-783) PigStorage does not handle unicode characters above \u007f as a separator in the data.

2009-04-24 Thread Alan Gates (JIRA)
PigStorage does not handle unicode characters above \u007f as a separator in 
the data.
--

 Key: PIG-783
 URL: https://issues.apache.org/jira/browse/PIG-783
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Alan Gates
Priority: Minor


PigStorage reads one byte at a time to find the separator character in the 
data.  So any multi-byte UTF-8 character will not work as a separator.
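
A small sketch, independent of the PigStorage source, shows why a byte-at-a-time scan cannot find such a separator. The class and method names are hypothetical, and the single-byte comparison is an assumption about how such a scan behaves rather than a copy of Pig's loop. The only fact relied on is that UTF-8 encodes every code point above \u007f as two or more bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; this is not the PigStorage implementation.
final class SeparatorBytesDemo {

    // Number of bytes a (non-surrogate) separator character occupies in UTF-8.
    static int utf8Length(char separator) {
        return String.valueOf(separator).getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte-at-a-time scan: each input byte is compared with one separator byte.
    static int countByteMatches(byte[] data, char separator) {
        byte target = (byte) separator;
        int hits = 0;
        for (byte b : data) {
            if (b == target) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] ctrlA = "a\u0001b".getBytes(StandardCharsets.UTF_8);
        byte[] accented = "a\u00e9b".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes for 0x01 separator: " + utf8Length('\u0001'));
        System.out.println("bytes for 0xe9 separator: " + utf8Length('\u00e9'));
        System.out.println("matches for 0x01: " + countByteMatches(ctrlA, '\u0001'));
        System.out.println("matches for 0xe9: " + countByteMatches(accented, '\u00e9'));
    }
}
```

A separator in the ASCII range is found because it is exactly one byte in the data. A separator above \u007f is spread over several bytes, none of which equals the single byte being compared, so the scan never sees it.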




[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings

2009-04-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702524#action_12702524
 ] 

Santhosh Srinivasan commented on PIG-784:
-

By default, these warnings are aggregated and should not appear in the logs. If 
required, the warning aggregation can be turned off.
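
For readers unfamiliar with the idea, warning aggregation can be sketched roughly as follows. This is not Pig's implementation; the class name, method names, and summary wording are all hypothetical. The point is only that each occurrence increments a counter instead of writing a log line, and one summary per distinct message is emitted at the end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of warning aggregation; not code from Pig.
final class WarningAggregator {

    // Insertion-ordered so the summary lists messages in first-seen order.
    private final Map<String, Long> counts = new LinkedHashMap<>();

    // Record one occurrence without logging it.
    void warn(String message) {
        counts.merge(message, 1L, Long::sum);
    }

    // How often a message has been recorded so far.
    long count(String message) {
        return counts.getOrDefault(message, 0L);
    }

    // One line per distinct message, with its total count.
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sb.append("Encountered warning \"").append(e.getKey())
              .append("\" ").append(e.getValue()).append(" time(s).\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WarningAggregator agg = new WarningAggregator();
        for (int row = 0; row < 16; row++) {
            agg.warn("Attempt to access field which was not found in the input");
        }
        System.out.print(agg.summary());
    }
}
```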

 PigStorage() - need ability to turn off Attempt to access field  warnings
 ---

 Key: PIG-784
 URL: https://issues.apache.org/jira/browse/PIG-784
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 I want an option to PigStorage() for LOAD which will allow me to turn off the 
 Attempt to access field warnings.
 Something like:
 {code}
 define PigStorage PigStorage(warn_load_nonexistent_field=off);
 A = load 'mydata.txt' using PigStorage()
 as (col1: chararray, col2_optional: int, col3_optional: float);
 {code}
 or
 {code}
 A = load 'mydata.txt' using PigStorage(warn_load_nonexistent_field=0)
 as (col1: chararray, col2_optional: int, col3_optional: float);
 {code}
 If I have a very large data set with optional columns that are not populated 
 (and have no tab separator), I'd like to just read the file as is and not 
 generate the warnings.
 The warnings are problematic because they fill up the logging output, and every 
 System.out.println slows down the overall processing, especially if the data 
 file being processed is missing one or more columns on every single row.




[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702528#action_12702528
 ] 

David Ciemiewicz commented on PIG-784:
--

@Santhosh

Hmmm.  I'm running Pig in local mode with the latest published build and I get 
lots of warnings and they are not aggregated:

-bash-3.00$ pig -exectype local -latest cat.pig
USING: /grid/0/gs/pig/current
2009-04-24 20:02:55,666 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,667 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,672 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-24 20:02:55,672 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,1,42.0F)
(,,)
(,,)
(,,)
(,,)

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702533#action_12702533
 ] 

David Ciemiewicz commented on PIG-774:
--

A somewhat related bug is JIRA PIG-755 - the difficulty of debugging issues 
related to passed parameters.

If Pig produced an output file of the code with parameter substitutions made, 
we could have more rapidly isolated the problem.


[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702535#action_12702535
 ] 

David Ciemiewicz commented on PIG-759:
--

If hbase has named columns in its schema, why wouldn't it be appropriate to 
say something like:

table = load '$tablename/$subsection' using HBaseStorage() as (a, b);

Since HBaseStorage() is specified:

1) Isn't hbase:// implicit?
2) Shouldn't I be able to just specify the names in the AS clause?



 HBaseStorage scheme for Load/Slice function
 ---

 Key: PIG-759
 URL: https://issues.apache.org/jira/browse/PIG-759
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner

 We would like to change the HBaseStorage function to use a scheme when 
 loading a table in pig. The scheme we are thinking of is: hbase. So in 
 order to load an hbase table in a pig script the statement should read:
 {noformat}
 table = load 'hbase://tablename' using HBaseStorage();
 {noformat}
 If the scheme is omitted pig would assume the tablename to be an hdfs path 
 and the storage function would use the last component of the path as a table 
 name and output a warning.
 For details on why see jira issue: PIG-758
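
 The proposed resolution rule can be sketched as below. This is only an illustration of the behaviour described in this issue, not the HBaseStorage code; the class name, method name, and warning text are hypothetical.

```java
// Hypothetical illustration of the proposed location handling; not code from Pig.
final class HBaseLocation {

    static final String SCHEME = "hbase://";

    // Resolve the table name from the location given in the load statement.
    static String tableName(String location) {
        if (location.startsWith(SCHEME)) {
            return location.substring(SCHEME.length());
        }
        // No scheme: assume an HDFS path, fall back to its last component, and warn.
        System.err.println("WARN: no hbase:// scheme in '" + location
                + "', using the last path component as the table name");
        String path = location;
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(tableName("hbase://tablename"));
        System.out.println(tableName("/user/viraj/tablename"));
    }
}
```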




[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702538#action_12702538
 ] 

Santhosh Srinivasan commented on PIG-774:
-

To obtain the substituted Pig script, use the -dryrun switch. By 
default, the substituted Pig script is not stored on disk.

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution from the Pig script and instead used 
 the following statement.
 =
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/_logs
 -rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/part-0
 =
 Additionally tried the dry-run option to figure out if the parameter 
 substitution was 

[jira] Commented: (PIG-598) Parameter substitution ($PARAMETER) should not be performed in comments

2009-04-24 Thread George Mavromatis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702563#action_12702563
 ] 

George Mavromatis commented on PIG-598:
---

This has been flagged as minor, but I think it is a rather serious and elementary 
one. It takes away the ability to comment and uncomment parameters, and it 
produces very bizarre error messages for people who are not aware of it. Can we 
increase the priority?

 Parameter substitution ($PARAMETER) should not be performed in comments
 ---

 Key: PIG-598
 URL: https://issues.apache.org/jira/browse/PIG-598
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: David Ciemiewicz
Priority: Minor

 Compiling the following code example will generate an error that 
 $NOT_A_PARAMETER is an Undefined Parameter.
 This is problematic as sometimes you want to comment out parts of your code, 
 including parameters so that you don't have to define them.
 I therefore think it would be really good if parameter substitution were not 
 performed in comments.
 {code}
 -- $NOT_A_PARAMETER
 {code}
 {code}
 -bash-3.00$ pig -exectype local -latest comment.pig
 USING: /grid/0/gs/pig/current
 java.lang.RuntimeException: Undefined parameter : NOT_A_PARAMETER
 at 
 org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:221)
 at 
 org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:106)
 at 
 org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:86)
 at org.apache.pig.Main.runParamPreprocessor(Main.java:394)
 at org.apache.pig.Main.main(Main.java:296)
 {code}
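
 The requested behaviour could look roughly like the sketch below. It is not the ParameterSubstitutionPreprocessor code; the class and method names are hypothetical, and it makes simplifying assumptions: only single-line "--" comments are recognised, a "--" inside a quoted string is not handled, and parameter names are taken to start with a letter or underscore so that positional references such as $0 are left alone. The error text mirrors the message in the stack trace above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of comment-aware substitution; not code from Pig.
final class CommentAwareSubstitution {

    // $NAME where NAME starts with a letter or underscore, so $0 is not a parameter.
    private static final Pattern PARAM = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    // Substitute parameters in the code part of a line and copy the comment verbatim.
    static String substituteLine(String line, Map<String, String> params) {
        int commentStart = line.indexOf("--");
        String code = commentStart >= 0 ? line.substring(0, commentStart) : line;
        String comment = commentStart >= 0 ? line.substring(commentStart) : "";

        Matcher m = PARAM.matcher(code);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new RuntimeException("Undefined parameter : " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString() + comment;
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        // The comment-only line from this report passes through untouched.
        System.out.println(substituteLine("-- $NOT_A_PARAMETER", none));
    }
}
```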




[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-24 Thread Alex Newman (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702602#action_12702602
 ] 

Alex Newman commented on PIG-759:
-

I actually like the protocol specification, as it gives us the flexibility
to hit HBase with another protocol such as Thrift, but I may be
overthinking it.





[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702620#action_12702620
 ] 

Viraj Bhat commented on PIG-774:


One workaround for this issue is to use a FilterFunc that reads its filter 
list from a file stored on HDFS.

Care has to be taken in the FilterFunc UDF to construct the BufferedReader so 
that it reads UTF-8 data.
{code}

public class FILTERFROMFILE extends FilterFunc {
   private String FilterListFileName = "";

   private void init() throws IOException {
     Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
     InputStream is = FileLocalizer.openDFSFile(FilterListFileName, props);
     // Name the charset explicitly so the filter list is decoded as UTF-8.
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF8"));
   }

   public Boolean exec(Tuple input) throws IOException {
     init();
     //do the matching here

   }
}
{code}

Pig code to instantiate the filter function UDF
{code}
register pigudf/myfilterfunc.jar;

define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');

rmf chineseoutput;

I = load '/user/viraj/testchinese' using PigStorage('\u0001') as 
(teststring:chararray);

J = filter I by MATCHQUERY(teststring);

store J into 'chineseoutput';
{code}


 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702621#action_12702621
 ] 

Viraj Bhat commented on PIG-774:


Ciemo, as stated in the original problem description, there is a '-r' switch 
for achieving the same.
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/_logs
 -rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/part-0
 

[jira] Issue Comment Edited: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702620#action_12702620
 ] 

Viraj Bhat edited comment on PIG-774 at 4/24/09 4:35 PM:
-

One workaround for this issue is to use a FilterFunc that reads its filter 
list from a file stored on HDFS.

Care has to be taken in the FilterFunc UDF to create a BufferedReader that 
reads UTF-8 data.
{code}

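public class FILTERFROMFILE extends FilterFunc {
   private String filterListFileName = "";
   private Set<String> filterList = null;

   // The file name is the argument given in the Pig DEFINE statement.
   public FILTERFROMFILE(String filterListFileName) {
      this.filterListFileName = filterListFileName;
   }

   private void init() throws IOException {
      Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
      InputStream is = FileLocalizer.openDFSFile(filterListFileName, props);
      // Name the charset explicitly so the filter list is decoded as UTF-8.
      BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF8"));
      filterList = new HashSet<String>();
      String line;
      while ((line = reader.readLine()) != null) {
         filterList.add(line);
      }
      reader.close();
   }

   public Boolean exec(Tuple input) throws IOException {
      if (filterList == null) {
         init();   // load the filter list once, on first use
      }
      // Do the matching here; an exact match on the first field is one option.
      return filterList.contains(String.valueOf(input.get(0)));
   }
}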
{code}

Pig code to instantiate the filter function UDF
{code}
register pigudf/myfilterfunc.jar;

define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');

rmf chineseoutput;

I = load '/user/viraj/testchinese' using PigStorage('\u0001') as 
(teststring:chararray);

J = filter I by MATCHQUERY(teststring);

store J into 'chineseoutput';
{code}



  

[jira] Commented: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702645#action_12702645
 ] 

Viraj Bhat commented on PIG-755:


Ciemo, Pig presently has a dry-run option, '-r', which produces a 
pigscript.substituted file (ASCII text) in which all the parameters are 
replaced with their actual values.
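
What the dry run is expected to produce can be sketched roughly as below. 
This is not Pig's preprocessor: the class ParamSubstitution, the helper 
substitute and the error text are hypothetical. It assumes that parameter 
names start with a letter or underscore, so positional references such as $0 
are left alone.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of $name substitution; not Pig's actual implementation.
class ParamSubstitution {
    // A parameter is '$' followed by a letter or underscore, then letters, digits or underscores.
    private static final Pattern PARAM = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    static String substitute(String script, Map<String, String> params) {
        Matcher m = PARAM.matcher(script);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Undefined parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' inside the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Comparing the substituted text with the original script is then a quick way 
to see which value each parameter actually received.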

Viraj

 Difficult to debug parameter substitution problems based on the error 
 messages when running in local mode
 -

 Key: PIG-755
 URL: https://issues.apache.org/jira/browse/PIG-755
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: inputfile.txt, localparamsub.pig


 I have a script in which I do a parameter substitution for the input file. I 
 have a use case where I find it difficult to debug based on the error 
 messages in local mode.
 {code}
 A = load '$infile' using PigStorage() as
  (
date: chararray,
count   : long,
gmean   : double
 );
 dump A;
 {code}
 1) I run it in local mode with the input file in the current working directory
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main 
 -exectype local -param infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:03:51,967 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore
  - Received error from storer function: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
 setup the load function.
 2009-04-07 00:03:51,970 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!!
 2009-04-07 00:03:51,971 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 
 failed!
 2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1066: Unable to open iterator for alias A
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log
 
 ERROR 1066: Unable to open iterator for alias A
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:352)
 Caused by: java.io.IOException: Job terminated with anomalous status FAILED
 at org.apache.pig.PigServer.openIterator(PigServer.java:433)
 ... 5 more
 
 2) I run it in map reduce mode
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param 
 infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:07:31,660 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 2009-04-07 00:07:32,074 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 2009-04-07 00:07:34,543 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-04-07 00:07:39,540 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-04-07 00:07:39,540 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Map reduce job failed
 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: inputfile does not exist.
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log
 
 ERROR 2100: inputfile does not exist.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 

[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-774:
---

Description: 
I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese and wrote it 
to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in 
Step 1.
3) Created a Pig script which takes in the parametrized query string and hard 
coded Chinese character.

Pig script: chinese_data.pig

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';

store J into 'chineseoutput';
dump J;
{code}
=

Parameter file: nextgen_paramfile
=
queryid=20090311
querystring='   歌手香港情牽女人心演唱會'
=

Input file: /user/viraj/chinese.txt
=
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會
=

I ran the above set of inputs in the following ways:

Run 1:
=
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}
=
2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
0% complete
2009-04-22 01:31:50,720 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2009-04-22 01:31:50,720 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
Success!
=

Run 2: removed the parameter substitution in the Pig script instead used the 
following statement.
=
{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}
=
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main chinese_data_withoutparam.pig
=
2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
0% complete
2009-04-22 01:35:32,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2009-04-22 01:35:32,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
Success!
=

In both cases:
=
{code}
shell $ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
/user/viraj/chineseoutput/_logs
-rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
/user/viraj/chineseoutput/part-0
{code}
=

Additionally tried the dry-run option to figure out if the parameter 
substitution was occurring properly.
=
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
=
{code}
shell$ file chinese_data.pig.substituted 
chinese_data.pig.substituted: ASCII text
shell$ cat chinese_data.pig.substituted 
{code}

{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

J = filter I by $0 == ' ??  ??';

store J into 'chineseoutput';
{code}
=
This issue has to do with the parser not handling UTF-8 data. 
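
The '??' in the substituted script is what Java produces when text is encoded 
with a charset that cannot represent it. The sketch below only illustrates 
that general behaviour; whether Pig's parameter substitution loses the 
characters in exactly this way is an assumption. The class CharsetRoundTrip 
is hypothetical, and the sample string is the first two characters of the 
query string, written as Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; not part of Pig.
class CharsetRoundTrip {
    // Encode the text with the given charset, then decode it with the same charset.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String query = "\u6B4C\u624B";
        // UTF-8 can represent the characters, so the text survives unchanged.
        System.out.println(roundTrip(query, StandardCharsets.UTF_8).equals(query));
        // US-ASCII cannot, so each character is replaced by '?'.
        System.out.println(roundTrip(query, StandardCharsets.US_ASCII));
    }
}
```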

  was:
I created a very small test case in which I did the following.

1) Created a UTF-8 file which contained a query string in Chinese