[jira] Created: (PIG-1660) Consider passing result of COUNT/COUNT_STAR to LIMIT
Consider passing result of COUNT/COUNT_STAR to LIMIT - Key: PIG-1660 URL: https://issues.apache.org/jira/browse/PIG-1660 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.9.0 In realistic scenarios we need to split a dataset into segments by using LIMIT, and we would like to achieve that goal within the same Pig script. Here is a case: {code} A = load '$DATA' using PigStorage(',') as (id, pvs); B = group A all; C = foreach B generate COUNT_STAR(A) as row_cnt; -- get the low 20% segment D = order A by pvs; E = limit D (C.row_cnt * 0.2); store E into '$Eoutput'; -- get the high 20% segment F = order A by pvs DESC; G = limit F (C.row_cnt * 0.2); store G into '$Goutput'; {code} Since LIMIT only accepts constants, we have to split the operation into two steps in order to pass in the constants for the LIMIT statements. Please consider bringing this feature in so the processing can be more efficient. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1630) Support param_files to be loaded from HDFS
Support param_files to be loaded from HDFS -- Key: PIG-1630 URL: https://issues.apache.org/jira/browse/PIG-1630 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat I want to place the parameters of a Pig script in a param_file. But instead of this file being in the local file system where I run my java command, I want it to be on HDFS: {code} $ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile myscript.pig {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1631) Support 2-level nested foreach
Support 2-level nested foreach - Key: PIG-1631 URL: https://issues.apache.org/jira/browse/PIG-1631 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat What I would like to do is generate certain metrics for every listing impression in the context of a page, like clicks on the page etc. So, I first group by to get clicks and impressions together. Now, I would want to iterate through the mini-table (one per serve-id) and compute metrics. Since a nested foreach within a foreach is not supported, I ended up writing a UDF that took both the bags and computed the metric. It would have been elegant to keep the logic of iterating over the records outside in the Pig script. Here is some pseudocode of how I would have liked to write it: {code} -- Let us say in our page context there was a click on rank 2, for which there were 3 ads A1 = LOAD '...' AS (page_id, rank); -- clicks A2 = LOAD '...' AS (page_id, rank); -- impressions B = COGROUP A1 BY (page_id), A2 BY (page_id); -- Let us say B contains the following schema -- (group, {(A1...)}, {(A2...)}) -- Each record in B would be: -- page_id_1, {(page_id_1, 2)}, {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3)} C = FOREACH B GENERATE { D = FLATTEN(A1), FLATTEN(A2); -- This won't work in current Pig either. Basically, I would like a mini-table which represents an entire serve. FOREACH D GENERATE page_id_1, A2::rank, SOMEUDF(A1::rank, A2::rank); -- This UDF returns a value (like v1, v2, v3 depending on A1::rank and A2::rank) }; -- output -- page_id, 1, v1 -- page_id, 2, v2 -- page_id, 3, v3 DUMP C; {code} P.S.: I understand that I could alternatively have flattened the fields of B, then done a GROUP on page_id and iterated through the records calling SOMEUDF appropriately, but that would be 2 map-reduce operations AFAIK. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
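The per-serve iteration the reporter asks for can be modeled outside Pig. Below is a minimal Java sketch of the desired semantics; `someUdf` is a hypothetical stand-in for SOMEUDF (here it simply flags whether an impression rank was also clicked), not the reporter's actual UDF:

```java
import java.util.List;
import java.util.Set;

public class NestedForeachDemo {
    // Hypothetical stand-in for SOMEUDF(A1::rank, A2::rank).
    static String someUdf(Set<Integer> clickRanks, int impressionRank) {
        return clickRanks.contains(impressionRank) ? "clicked" : "shown";
    }

    public static void main(String[] args) {
        // One "mini-table" per page id, as the COGROUP would produce:
        // page_1 had a click on rank 2 and impressions on ranks 1..3.
        Set<Integer> clicks = Set.of(2);
        List<Integer> impressions = List.of(1, 2, 3);
        // The inner FOREACH the reporter wants: iterate the mini-table,
        // emitting one output row per impression.
        for (int rank : impressions) {
            System.out.println("page_1," + rank + "," + someUdf(clicks, rank));
        }
        // prints: page_1,1,shown / page_1,2,clicked / page_1,3,shown
    }
}
```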
[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour
Using an alias within Nested Foreach causes indeterminate behaviour Key: PIG-1633 URL: https://issues.apache.org/jira/browse/PIG-1633 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0 Reporter: Viraj Bhat I have created a RANDOMINT function which generates random numbers between 0 and a specified value. For example, RANDOMINT(4) gives random numbers between 0 and 3 (inclusive). {code} $hadoop fs -cat rand.dat f g h i j k l m {code} The pig script is as follows: {code} register math.jar; A = load 'rand.dat' using PigStorage() as (data); B = foreach A { r = math.RANDOMINT(4); generate data, r as random, ((r == 3)?1:0) as quarter; }; dump B; {code} The results are as follows: {code} {color:red} (f,0,0) (g,3,0) (h,0,0) (i,2,0) (j,3,0) (k,2,0) (l,0,1) (m,1,0) {color} {code} Observe rows such as (g,3,0) and (j,3,0), where random is 3 but quarter is 0, and (l,0,1), where random is 0 but quarter is 1: because r is referenced both in the generate clause and in the conditional, the expression is evaluated more than once and the two uses see different random values. Modifying the above script as below solves the issue. The M/R jobs from both scripts are the same; it is just a matter of convenience. {code} A = load 'rand.dat' using PigStorage() as (data); B = foreach A generate data, math.RANDOMINT(4) as r; C = foreach B generate data, r, ((r == 3)?1:0) as quarter; dump C; {code} Is this issue related to PIG-747? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
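The inconsistency above is easy to reproduce outside Pig. The following Java sketch (assuming, as is plausible for a UDF named RANDOMINT, that it is backed by something like java.util.Random) shows why expanding the alias into each use breaks the row: two separate evaluations of a non-deterministic expression need not agree, while evaluating once and reusing the result keeps the row self-consistent.

```java
import java.util.Random;

public class ReevaluationDemo {
    public static void main(String[] args) {
        Random rng = new Random();
        // If the alias r is re-evaluated at each use site, the draw happens twice:
        int emitted = rng.nextInt(4);   // value that lands in the output tuple
        int compared = rng.nextInt(4);  // value the (r == 3) test actually sees
        System.out.println(emitted + " vs " + compared + " -> may differ");

        // Evaluating once and reusing the result keeps the row consistent,
        // which is what the rewritten two-foreach script achieves:
        int r = rng.nextInt(4);
        int quarter = (r == 3) ? 1 : 0;
        System.out.println(r + "," + quarter);
    }
}
```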
[jira] Created: (PIG-1634) Multiple names for the group field
Multiple names for the group field Key: PIG-1634 URL: https://issues.apache.org/jira/browse/PIG-1634 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0 Reporter: Viraj Bhat I am hoping that in Pig, if I type {quote} c = cogroup a by foo, b by bar; {quote} the fields c.group, c.foo and c.bar should all map to c.$0. This would improve the readability of the Pig script. Here's a real use case: {code} pages = LOAD 'pages.dat' AS (url, pagerank); visits = LOAD 'user_log.dat' AS (user_id, url); page_visits = COGROUP pages BY url, visits BY url; frequent_visits = FILTER page_visits BY COUNT(visits) == 2; answer = FOREACH frequent_visits GENERATE url, FLATTEN(pages.pagerank); {code} (The important part is the final GENERATE statement, which references the field url, which was the grouping field in the earlier COGROUP.) To get it to work I have to write it in a less intuitive way. Maybe with the new parser changes in Pig 0.9 it would be easier to specify that. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
Return code from Pig is 0 even if the job fails when using -M flag -- Key: PIG-1615 URL: https://issues.apache.org/jira/browse/PIG-1615 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script of this form, which I use inside a workflow system such as Oozie. {code} A = load '$INPUT' using PigStorage(); store A into '$OUTPUT'; {code} I run this with multi-query optimization turned off: {quote} $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig {quote} The directory /user/viraj/junk1 is not present, and I get the following results: {quote} Input(s): Failed to read data from /user/viraj/junk1 Output(s): Failed to produce result in /user/viraj/junk2 {quote} This is expected, but the return code is still 0: {code} $ echo $? 0 {code} If I run this script with multi-query optimization turned on, it gives a return code of 2, which is correct. {code} $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig ... $ echo $? 2 {code} I believe the wrong return code from Pig is causing Oozie to believe that the Pig script succeeded. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
[ https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910414#action_12910414 ] Viraj Bhat commented on PIG-1615: - I tested this on Pig 0.8, but with a downloaded version that was a little old. I re-downloaded the latest source, and the issue seems to be fixed. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-282: --- Release Note: This feature allows specifying a Hadoop Partitioner for the following operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner controls the partitioning of the keys of the intermediate map-outputs. See http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html for more details. To use this feature you can add a PARTITION BY clause to the appropriate operator: A = load 'input_data'; B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; Here is the code for SimpleCustomPartitioner: public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> { @Override public int getPartition(PigNullableWritable key, Writable value, int numPartitions) { if (key.getValueAsPigType() instanceof Integer) { int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions); return ret; } else { return (key.hashCode()) % numPartitions; } } } was: This feature allows specifying a Hadoop Partitioner for the following operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner controls the partitioning of the keys of the intermediate map-outputs. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Partitioner.html for more details. To use this feature you can add a PARTITION BY clause to the appropriate operator: A = load 'input_data'; B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
Here is the code for SimpleCustomPartitioner: public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> { @Override public int getPartition(PigNullableWritable key, Writable value, int numPartitions) { if (key.getValueAsPigType() instanceof Integer) { int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions); return ret; } else { return (key.hashCode()) % numPartitions; } } } Custom Partitioner -- Key: PIG-282 URL: https://issues.apache.org/jira/browse/PIG-282 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Amir Youssefi Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, CustomPartitionerTest.patch By adding a custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to the language, e.g. PARTITION BY UDF(...) or a similar syntax. The UDF returns a number between 0 and n-1, where n is the number of output partitions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
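One caveat with the example partitioner in the release note: key.hashCode() % numPartitions can be negative when the hash code is negative, while Hadoop requires getPartition to return a value in [0, numPartitions). A self-contained sketch of the safer arithmetic (an illustration of the pitfall, not part of the patch; `partitionFor` is a hypothetical helper):

```java
public class PartitionMath {
    // Hypothetical helper: map any hash code to a valid partition index.
    static int partitionFor(int hash, int numPartitions) {
        // Plain % can return a negative value for a negative hash, which
        // Hadoop rejects; Math.floorMod keeps the result in [0, n).
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(7, 4));   // 3
        System.out.println(partitionFor(-7, 4));  // 1, whereas -7 % 4 == -3
    }
}
```

The same guard applies to the Integer branch, since a negative key value also yields a negative remainder under plain %.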
[jira] Created: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)
Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do parameter substitution using the following shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} The result of the substitution is: {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this a Pig param problem? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)
[ https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1586: Description: I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do parameter substitution using the following shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} The result of the substitution is: {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this a Pig param problem?
Viraj was: I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt to do Parameter substitutions using the following: Using Shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this is Pig param problem? 
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line
Difference in Semantics between Load statement in Pig and HDFS client on Command line - Key: PIG-1576 URL: https://issues.apache.org/jira/browse/PIG-1576 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0, 0.6.0 Reporter: Viraj Bhat Here is my directory structure on HDFS which I want to access using Pig. This is a sample, but in the real use case I have more than 100 of these directories. {code} $ hadoop fs -ls /user/viraj/recursive/ Found 3 items drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080615 drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080616 drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080617 {code} Using the command line I can access them using a variety of options: {code} $ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/ -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt $ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/ -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt -rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt {code} I have written a Pig script; none of the below combinations of load statements work: {code} --A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') as (k:int, v:chararray); A = load '/user/viraj/recursive/{20080615..20080617}/' using PigStorage('\u0001') as (k:int, v:chararray); AL = limit A 10; dump AL; {code} I get the following error in Pig 0.8 {noformat} 2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed! 
2010-08-27 16:34:27,711 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 0.20.2 0.8.0-SNAPSHOT viraj 2010-08-27 16:34:24 2010-08-27 16:34:27 LIMIT Failed! Failed Jobs: JobId Alias Feature Message Outputs N/A A,ALMessage: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: /user/viraj/recursive/{20080615..20080617}/ at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 0 files at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268) ... 7 more hdfs://localhost:9000/tmp/temp241388470/tmp987803889, {noformat} The following works: {code} A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') as (k:int, v:chararray); AL = limit A 10; dump AL; {code} Why is there an inconsistency between HDFS client and Pig? Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
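A plausible explanation for the inconsistency (an inference from the symptoms, not confirmed in this thread): {15..17} is bash brace expansion, so on the command line the shell rewrites the pattern into three separate paths before hadoop fs ever sees them, whereas Pig hands the literal string to the filesystem glob matcher, where {...} only denotes comma-separated alternation. The Java NIO glob matcher (used here as a stand-in for Hadoop's globber, which accepts the same {a,b,c} form) shows the difference:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // Comma-separated alternation is part of glob syntax, so this matches:
        PathMatcher alt = FileSystems.getDefault()
                .getPathMatcher("glob:200806{15,16,17}");
        System.out.println(alt.matches(Paths.get("20080616"))); // true

        // A range like {15..17} is bash brace expansion, not glob syntax;
        // as a glob it is a one-alternative group matching the literal "15..17".
        PathMatcher range = FileSystems.getDefault()
                .getPathMatcher("glob:200806{15..17}");
        System.out.println(range.matches(Paths.get("20080616"))); // false
    }
}
```

This is consistent with the report: the {15,16,17} form works in both places, while {15..17} only "works" on the command line because the shell, not HDFS, expands it.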
[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files Key: PIG-1561 URL: https://issues.apache.org/jira/browse/PIG-1561 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat I have a simple Pig script which uses the XMLLoader after the Piggybank is built. {code} register piggybank.jar; A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray); B = limit A 1; dump B; --store B into '/user/viraj/handlegz' using PigStorage(); {code} This returns an empty tuple: {code} () {code} If you supply the uncompressed XML file, you get: {code} (<property> <name>mapred.capacity-scheduler.queue.my.capacity</name> <value>10</value> <description>Percentage of the number of slots in the cluster that are guaranteed to be available for jobs in this queue. </description> </property>) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
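A common way to add such support (a sketch of the general technique, not the actual XMLLoader code; `maybeDecompress` is a hypothetical helper) is to sniff the two-byte gzip magic number 0x1f 0x8b and wrap the raw stream in a GZIPInputStream when it is present:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniff {
    // Hypothetical helper: wrap the stream in GZIPInputStream when the
    // gzip magic bytes (0x1f 0x8b) are present, else pass it through.
    static InputStream maybeDecompress(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] magic = new byte[2];
        int n = in.read(magic);
        if (n > 0) in.unread(magic, 0, n);  // put the sniffed bytes back
        boolean gz = n == 2 && (magic[0] & 0xff) == 0x1f
                            && (magic[1] & 0xff) == 0x8b;
        return gz ? new GZIPInputStream(in) : in;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a tiny XML fragment through gzip to exercise the sniffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<property/>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = maybeDecompress(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        // prints: <property/>
    }
}
```

(bz2 has no JDK codec; in a Hadoop loader one would normally ask the CompressionCodecFactory to pick a codec from the file extension instead of sniffing by hand.)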
[jira] Created: (PIG-1547) Piggybank MultiStorage does not scale when processing around 7k records per bucket
Piggybank MultiStorage does not scale when processing around 7k records per bucket -- Key: PIG-1547 URL: https://issues.apache.org/jira/browse/PIG-1547 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat I am trying to use the MultiStorage piggybank UDF: {code} register pig-svn/trunk/contrib/piggybank/java/piggybank.jar; A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as (a,b,c); STORE A INTO '/user/viraj/multistore' USING org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 'none', '\u0001'); {code} The file largebucketinput.txt is around 85MB in size; the field b takes 512 values (0-511), and each value of b (i.e., each bucket) contains 7k records. a) On a multi-node hadoop installation: The above Pig script, which spawns a single map-only job, does not succeed and is killed by the TaskTracker for running above the memory limit. == Message == TaskTree [pid=24584,tipID=attempt_201008110143_101976_m_00_0] is running beyond memory-limits. Current usage : 1661034496bytes. Limit : 1610612736bytes. == Message == We tried increasing the Map slots but it did not help. 
b) On a single node hadoop installation: The pig script fails with the following message in the mappers: 2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException 2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7687609983190239805_126509 2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException 2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2734778934507357565_126509 2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException 2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1293917224803067377_126509 2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException 2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2272713260404734116_126509 2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block. at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2781) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232) 2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2272713260404734116_126509 bad datanode[0] nodes == null 2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file /user/viraj/multistore/_temporary/_attempt_201005141440_0178_m_01_0/444/444-1 - Aborting... 
2010-08-17 16:37:48,619 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2837) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2762) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232) 2010-08-17 16:37:48,622 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task Need to investigate more. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895858#action_12895858 ] Viraj Bhat commented on PIG-1537: - Hi Olga, I have given the specific script with UDFs to Daniel to test. Thanks Daniel for your help. The script gives correct results when it does not use the ColumnPruner optimization, or when that optimization is disabled using -t. Viraj Column pruner causes wrong results when using both Custom Store UDF and PigStorage -- Key: PIG-1537 URL: https://issues.apache.org/jira/browse/PIG-1537 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.8.0 I have a script of this pattern which uses 2 StoreFuncs: {code} register loader.jar; register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); {code} I run this script using: a) java -cp pig0.7.jar script.pig b) java -cp pig0.7.jar -t PruneColumns script.pig What I observe is that the alias count produces the same number of records, but ss_sc_all_map has different sizes when run with the above 2 options. Is this due to the fact that 2 StoreFuncs are used? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1537: Description: I have a script of this pattern which uses 2 StoreFuncs: {code} register loader.jar; register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); {code} I run this script using: a) java -cp pig0.7.jar script.pig b) java -cp pig0.7.jar -t PruneColumns script.pig What I observe is that the alias count produces the same number of records, but ss_sc_all_map has different sizes when run with the above 2 options. Is this due to the fact that 2 StoreFuncs are used? 
Viraj
Column pruner causes wrong results when using both Custom Store UDF and PigStorage -- Key: PIG-1537 URL: https://issues.apache.org/jira/browse/PIG-1537 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat
[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
Column pruner causes wrong results when using both Custom Store UDF and PigStorage -- Key: PIG-1537 URL: https://issues.apache.org/jira/browse/PIG-1537 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat I have a script which follows this pattern and uses 2 StoreFuncs: {code} register loader.jar register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0, ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query, testid, timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count, COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); {code} I run this script using: a) java -cp pig0.7.jar org.apache.pig.Main script.pig b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig What I observe is that the alias count produces the same number of records but ss_sc_all_map has different sizes when run with the above 2 options. Is this due to the fact that there are 2 StoreFuncs used? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
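One way to see the discrepancy concretely (a sketch; the paths come from the script's default OUTPUTDIR, and `hadoop fs -dus` is assumed available in this Hadoop vintage) is to compare the stored record count against the size of the data output for each of the two runs:

{code}
# stored record count (output of the second StoreFunc, PigStorage)
hadoop fs -cat /user/viraj/prunecol/count/20100707/part-*
# total bytes of the data output (output of the first StoreFunc, Storage)
hadoop fs -dus /user/viraj/prunecol/data/20100707
{code}

If the column pruner is the culprit, the counts match across the two runs while the data output sizes differ.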
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864963#action_12864963 ] Viraj Bhat commented on PIG-1345: - Richard, thanks for suggesting a workaround. The error message is definitely more verbose than the original one. At least this way the user can know where the cast is an issue, maybe in some addition taking place in the script. This Jira was originally created as a task to correlate exactly on which line an int is implicitly cast to a float, which I believe is hard to do in the current parser as we do not keep track of line numbers. Viraj Link casting errors in POCast to actual lines numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat For the purpose of easy debugging, it would be nice to find out where in the Pig script my warnings are coming from. The only known process is to comment out lines in the Pig script and see if these warnings go away. 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 I think this may need us to keep track of the line numbers of the Pig script (via our javacc parser) and maintain them in the logical and physical plans. It would help users in debugging simple errors/warnings related to casting. Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? Do we need to change the parser to something other than javacc to make this task simpler? Standardize on Parser and Scanner Technology Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
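To make the kind of warning concrete, here is a minimal sketch (the file name is hypothetical, and the exact warning name depends on the expression) of a script that typically logs an implicit-cast warning, which today is tied only to a summary count rather than pinpointing the offending expression:

{code}
-- hypothetical input: one numeric column, loaded without a schema (fields are bytearrays)
A = load 'nums.txt' using PigStorage();
-- $0 is implicitly cast from bytearray to long here, logging IMPLICIT_CAST_TO_LONG
B = foreach A generate $0 + 1L;
dump B;
{code}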
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861097#action_12861097 ] Viraj Bhat commented on PIG-798: Hi Ashutosh, Yes that is possible; I know that we can do that in BinStorage() but why can we not do this in PigStorage()? What do I need to cast as (chararray)? {code} A = load 'somedata' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} But this is possible in BinStorage(); why is this not consistent? Is it that BinStorage() has schemas embedded while PigStorage() does not? Should this not be fixed to make it consistent across storage formats? Viraj Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage(): {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load the file 'binstoragecreateop' in the following way: {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should they not be consistent across both? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
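A commonly suggested workaround (a sketch, not confirmed in this thread) is to cast the positional field explicitly rather than asserting a schema with AS, since fields loaded by PigStorage() without a schema come in as bytearrays:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
-- an explicit cast from bytearray avoids the schema-merge error that 'as name:chararray' triggers
B = foreach A generate (chararray)$0 as name;
dump B;
{code}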
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861106#action_12861106 ] Viraj Bhat commented on PIG-1211: - Ashutosh, yes, as more and more people adopt Pig, they expect some type of guarantees, since Pig is designed to help people with no experience in writing M/R programs. If I am a novice user and I have a small typo, do I wait for 3-4 hours to discover that there is a syntax error? I have not only wasted the CPU cycles but also the user's productivity. The problem here is that dump and hadoop shell commands are treated differently in Pig scripts, and Multi-query optimizations are ignored. I have listed what Milind and Dmitry are suggesting. Maybe this is the way a future Pig Language will compile, to give you a hadoop jar file in sequence or as a DAG. Pigcc -L myScript.pig - parses the pig script, generates the logical plan, and stores it in myScript.pig.l Pigcc -P myScript.pig.l - produces the physical plan from the logical plan, and stores it in myScript.pig.p Pigcc -M myScript.pig.p - produces the map-reduce plan, myScript.pig.m Pig myScript.pig.m - interprets the MR plan. 
This can be split into multiple sequential MR job plans too, myScript.pig.m.{1,2,3..}, so that a way to execute the pig script is to run Hadoop jar pigRT.jar myScript.pig.m.1 Hadoop jar pigRT.jar myScript.pig.m.2 Hadoop jar pigRT.jar myScript.pig.m.3 Hadoop jar pigRT.jar myScript.pig.m.4 Thanks Viraj Pig script runs half way after which it reports syntax error Key: PIG-1211 URL: https://issues.apache.org/jira/browse/PIG-1211 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script which is structured in the following way {code} register cp.jar dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5); filtered_dataset = filter dataset by (col1 == 1); proj_filtered_dataset = foreach filtered_dataset generate col2, col3; rmf $output1; store proj_filtered_dataset into '$output1' using PigStorage(); second_stream = foreach filtered_dataset generate col2, col4, col5; group_second_stream = group second_stream by col4; output2 = foreach group_second_stream { a = second_stream.col2 b = distinct second_stream.col5; c = order b by $0; generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; } rmf $output2; --syntax error here store output2 to '$output2' using PigStorage(); {code} I run this script using the Multi-query option; it runs successfully till the first store but later fails with a syntax error. The usage of the HDFS option rmf causes the first store to execute. The only option I have is to run an explain before running the script: grunt> explain -script myscript.pig -out explain.out or moving the rmf statements to the top of the script. Here are some questions: a) Can we have an option to do something like checkscript instead of explain to get the same syntax error? 
In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error. b) Can Pig not figure out a way to re-order the rmf statements, since all the store directories are variables? Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
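For question (a), a hedged sketch: later Pig releases expose a syntax-check mode on the command line (`-c`/`-check`; its availability in the version at hand is an assumption), which parses the whole script without launching any MR jobs:

{code}
java -cp pig.jar org.apache.pig.Main -c myscript.pig
{code}

Running this before the real submission would surface the store-syntax error up front instead of hours into the job.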
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861134#action_12861134 ] Viraj Bhat commented on PIG-798: Ashutosh, thanks for clarifying; we will wait till that bug is fixed in BinStorage. Viraj Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage(): {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load the file 'binstoragecreateop' in the following way: {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should they not be consistent across both? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860397#action_12860397 ] Viraj Bhat commented on PIG-1345: - In which release will PIG-908 be fixed? Does fixing PIG-908 guarantee that this issue will be solved? Link casting errors in POCast to actual lines numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat For the purpose of easy debugging, it would be nice to find out where in the Pig script my warnings are coming from. The only known process is to comment out lines in the Pig script and see if these warnings go away. 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 I think this may need us to keep track of the line numbers of the Pig script (via our javacc parser) and maintain them in the logical and physical plans. It would help users in debugging simple errors/warnings related to casting. Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? Do we need to change the parser to something other than javacc to make this task simpler? Standardize on Parser and Scanner Technology Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860419#action_12860419 ] Viraj Bhat commented on PIG-1211: - Ashutosh, I feel that the user may not be interested in first running his script using explain to find the syntax error and then running it again to get the results. They expect Pig to tell them all the errors upfront before submitting an M/R job. Explain was not designed for checking syntax errors in scripts. I believe that if you have a dump statement, explain -script will cause the script to run. Is it not possible for Pig to find out that there is an error with the store syntax? Viraj Pig script runs half way after which it reports syntax error Key: PIG-1211 URL: https://issues.apache.org/jira/browse/PIG-1211 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script which is structured in the following way {code} register cp.jar dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5); filtered_dataset = filter dataset by (col1 == 1); proj_filtered_dataset = foreach filtered_dataset generate col2, col3; rmf $output1; store proj_filtered_dataset into '$output1' using PigStorage(); second_stream = foreach filtered_dataset generate col2, col4, col5; group_second_stream = group second_stream by col4; output2 = foreach group_second_stream { a = second_stream.col2 b = distinct second_stream.col5; c = order b by $0; generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; } rmf $output2; --syntax error here store output2 to '$output2' using PigStorage(); {code} I run this script using the Multi-query option; it runs successfully till the first store but later fails with a syntax error. The usage of the HDFS option rmf causes the first store to execute. 
The only option I have is to run an explain before running the script: grunt> explain -script myscript.pig -out explain.out or moving the rmf statements to the top of the script. Here are some questions: a) Can we have an option to do something like checkscript instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error. b) Can Pig not figure out a way to re-order the rmf statements, since all the store directories are variables? Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860445#action_12860445 ] Viraj Bhat commented on PIG-1339: - Hi Ashutosh, this does not work in trunk. I am using the latest build: {code} $java -cp ~/pig-svn/trunk/pig.jar org.apache.pig.Main -version Apache Pig version 0.8.0-dev (r937554) compiled Apr 23 2010, 16:57:32 {code} 2010-04-23 17:31:41,448 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 71. Encountered: \u3042 (12354), after : This is a valid bug. Viraj International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat There is a particular use-case in which someone specifies a column name in international characters. {code} inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); describe inputdata; dump inputdata; {code} == Pig Stack Trace --- ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. 
Encountered: \u3042 (12354), after : at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) == Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
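Until the parser accepts non-ASCII identifiers, one possible workaround (a sketch, assuming the same input file) is to skip the alias entirely and project the column by position:

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage();
-- refer to the first column positionally instead of naming it with non-ASCII characters
projected = foreach inputdata generate $0;
dump projected;
{code}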
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860452#action_12860452 ] Viraj Bhat commented on PIG-798: Hi Ashutosh, The problem here is not about using the data interchangeably between BinStorage() and PigStorage(); it is about the consistency issues in the schema. Sorry if the description was unclear. I can see that it is possible to write statements such as this using BinStorage() {code} A = load 'somedata' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} but not using PigStorage(). Should we not support the following statement? As a user I am interested in projecting the first column and casting it to a chararray; I am not interested in knowing what the schemas of the other columns are!! It fails when I do the following: {code} A = load 'somedata' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Can you tell me why the schema specification in FOREACH GENERATE works with BinStorage and not in PigStorage? Viraj Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage(): {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load the file 'binstoragecreateop' in the following way: {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. 
If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should they not be consistent across both? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798: --- Affects Version/s: 0.6.0 0.5.0 0.4.0 0.3.0 0.7.0 0.8.0 Schema errors when using PigStorage and none when using BinStorage in FOREACH?? --- Key: PIG-798 URL: https://issues.apache.org/jira/browse/PIG-798 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Attachments: binstoragecreateop, schemaerr.pig, visits.txt In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage(): {code} A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray); B = group A by name; store B into '/user/viraj/binstoragecreateop' using BinStorage(); dump B; {code} I later load the file 'binstoragecreateop' in the following way: {code} A = load '/user/viraj/binstoragecreateop' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Result === (Amy) (Fred) === The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error. {code} A = load '/user/viraj/visits.txt' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} === {code} 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log {code} === So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should they not be consistent across both? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859384#action_12859384 ] Viraj Bhat commented on PIG-1378: - har:// currently works in Pig 0.7 when the hdfs location is specified. har url not usable in Pig scripts - Key: PIG-1378 URL: https://issues.apache.org/jira/browse/PIG-1378 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.8.0 I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw------- 5 viraj users 1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URLs in grunt yields {noformat} grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to. 
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:357) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) ... 
13 more {noformat} According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the following as stated in the original description {noformat} grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data'; ... 8 more Caused by: java.io.IOException: No FileSystem for scheme: namenode-location at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) {noformat} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
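Following the comment above that har works in Pig 0.7 when the hdfs location is specified, here is a sketch of the fully qualified form (the hostname and port are placeholders, not values from this report): the har URI carries the underlying filesystem's scheme and authority rather than a bare cluster name.

{noformat}
grunt> a = load 'har://hdfs-namenode.example.com:8020/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}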
[jira] Created: (PIG-1378) har url not usable in Pig scripts
har url not usable in Pig scripts - Key: PIG-1378 URL: https://issues.apache.org/jira/browse/PIG-1378 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.7.0 I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw------- 5 viraj users 1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URLs in grunt yields {noformat} grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to. 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:357) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) ... 13 more {noformat} According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the following as stated in the original description {noformat} grunt a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt dump a; {noformat} {noformat} Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data'; ... 
8 more Caused by: java.io.IOException: No FileSystem for scheme: mithrilgold at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) {noformat} Viraj -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
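The FrontendException above comes from Pig's front end rejecting a load path whose URI scheme differs from the default filesystem's. As a rough illustration of that failure mode (a sketch, not Pig's actual `LoadFunc.getAbsolutePath` implementation; the function name and default-filesystem value below are assumptions), a scheme check like this reproduces the shape of the error:

```python
from urllib.parse import urlparse

def check_load_uri(load_path, default_fs="hdfs://namenode:8020/"):
    # Hypothetical sketch of the check behind "Incompatible file URI scheme":
    # the load path's scheme is compared against the default filesystem's
    # scheme, and anything else (har, s3, ...) is rejected outright.
    scheme = urlparse(load_path).scheme
    default_scheme = urlparse(default_fs).scheme
    if scheme and scheme != default_scheme:
        raise ValueError(
            f"Incompatible file URI scheme: {scheme} : {default_scheme}")
    return load_path

check_load_uri("hdfs:///user/viraj/data")    # passes: schemes match
# check_load_uri("har:///user/viraj/data")   # raises: har != hdfs
```

Under a check like this, no spelling of a har URL can pass while the default filesystem is hdfs, which matches the behavior the reporter sees.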
[jira] Updated: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1378: Description: I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw--- 5 viraj users1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URL's in grunt yields {noformat} grunt a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to. 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:357) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) ... 13 more {noformat} According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the following as stated in the original description {noformat} grunt a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt dump a; {noformat} {noformat} Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data'; ... 8 more Caused by: java.io.IOException: No FileSystem for scheme: namenode-location at .apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at .apache.hadoop.fs.FileSystem.access(200(FileSystem.java:66) at .apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at .apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at .apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at .apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at .apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at .apache.hadoop.fs.Path.getFileSystem(Path.java:175) at .apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at .apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at .apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at .apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) {noformat} Viraj was: 
I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw--- 5 viraj users1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URL's in grunt yields {noformat} grunt a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0:
[jira] Commented: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag
[ https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857157#action_12857157 ] Viraj Bhat commented on PIG-518: The above script generates the following error in Pig 0.7 2010-04-14 17:10:49,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1048: Two inputs of BinCond must have compatible schemas. left hand side: b: bag({colb2: bytearray,colb3: bytearray}) right hand side: bag({(chararray,chararray)}) A type cast to the right type solves the problem. {code} a = load 'sports_views.txt' as (col1:chararray, col2:chararray, col3:chararray); b = load 'queries.txt' as (colb1:chararray,colb2:chararray,colb3:chararray); mycogroup = cogroup a by col1 inner, b by colb1; mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? b.(colb2,colb3) : {('','')})); dump mynewalias; {code} (alice,lakers,3,ipod,3) (alice,warriors,7,ipod,3) (peter,sun,7,sun,4) (peter,nets,7,sun,4) Closing bug as Pig yields the correct error message which the user can use to recode his script LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag --- Key: PIG-518 URL: https://issues.apache.org/jira/browse/PIG-518 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Attachments: queries.txt, sports_views.txt The following piece of Pig script, which provides default values for bags {('','')} when the COUNT returns 0 fails with the following error. (Note: Files used in this script are enclosed on this Jira.) a = load 'sports_views.txt' as (col1, col2, col3); b = load 'queries.txt' as (colb1,colb2,colb3); mycogroup = cogroup a by col1 inner, b by colb1; mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
b.(colb2,colb3) : {('','')})); dump mynewalias; java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to store for alias: mynewalias [Can't overwrite cause]] at java.lang.Throwable.initCause(Throwable.java:320) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494) at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85) at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java: 79) at org.apache.pig.PigServer.compileLp(PigServer.java:684) at org.apache.pig.PigServer.compileLp(PigServer.java:655) at org.apache.pig.PigServer.store(PigServer.java:433) at org.apache.pig.PigServer.store(PigServer.java:421) at org.apache.pig.PigServer.openIterator(PigServer.java:384) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't overwrite cause] ... 26 more Caused by: java.lang.IllegalStateException: Can't overwrite cause ... 26 more -- This message is automatically generated by
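The ERROR 1048 in the comment above is easier to see with the two branch schemas side by side. Below is a minimal sketch of the compatibility rule the type checker enforces; schemas are modeled as plain tuples of type names, which is an illustrative assumption (Pig's real Schema class carries aliases, nesting, and more):

```python
def bincond_schema(left, right):
    # Both branches of (cond ? left : right) must agree on schema; otherwise
    # the validator fails, as ERROR 1048 does in the comment above.
    if left != right:
        raise TypeError("Two inputs of BinCond must have compatible schemas. "
                        f"left hand side: {left} right hand side: {right}")
    return left

# Untyped load fields default to bytearray, so b.(colb2, colb3) is a bag of
# bytearrays, while the literal {('','')} is a bag of chararrays:
lhs_untyped = ("bytearray", "bytearray")
rhs_literal = ("chararray", "chararray")
# bincond_schema(lhs_untyped, rhs_literal) raises TypeError.
# Declaring types on load (colb2:chararray, colb3:chararray) aligns both sides:
lhs_typed = ("chararray", "chararray")
assert bincond_schema(lhs_typed, rhs_literal) == rhs_literal
```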
[jira] Resolved: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag
[ https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-518. Fix Version/s: 0.7.0 Resolution: Fixed LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag --- Key: PIG-518 URL: https://issues.apache.org/jira/browse/PIG-518 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: queries.txt, sports_views.txt The following piece of Pig script, which provides default values for bags {('','')} when the COUNT returns 0 fails with the following error. (Note: Files used in this script are enclosed on this Jira.) a = load 'sports_views.txt' as (col1, col2, col3); b = load 'queries.txt' as (colb1,colb2,colb3); mycogroup = cogroup a by col1 inner, b by colb1; mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? b.(colb2,colb3) : {('','')})); dump mynewalias; java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to store for alias: mynewalias [Can't overwrite cause]] at java.lang.Throwable.initCause(Throwable.java:320) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494) at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85) at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at 
org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java: 79) at org.apache.pig.PigServer.compileLp(PigServer.java:684) at org.apache.pig.PigServer.compileLp(PigServer.java:655) at org.apache.pig.PigServer.store(PigServer.java:433) at org.apache.pig.PigServer.store(PigServer.java:421) at org.apache.pig.PigServer.openIterator(PigServer.java:384) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't overwrite cause] ... 26 more Caused by: java.lang.IllegalStateException: Can't overwrite cause ... 26 more -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (PIG-829) DECLARE statement stop processing after special characters such as dot . , + % etc..
[ https://issues.apache.org/jira/browse/PIG-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-829. Fix Version/s: 0.7.0 Resolution: Fixed Pig 0.7 yields the correct result. {code} x = LOAD 'something' as (a:chararray, b:chararray); y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); STORE y INTO 'foo.bar'; {code} DECLARE statement stop processing after special characters such as dot . , + % etc.. -- Key: PIG-829 URL: https://issues.apache.org/jira/browse/PIG-829 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.7.0 The below Pig script does not work well, when special characters are used in the DECLARE statement. {code} %DECLARE OUT foo.bar x = LOAD 'something' as (a:chararray, b:chararray); y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); STORE y INTO '$OUT'; {code} When the above script is run in the dry run mode; the substituted file does not contain the special character. {code} java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' org.apache.pig.Main -r declaresp.pig {code} Resulting file: declaresp.pig.substituted {code} x = LOAD 'something' as (a:chararray, b:chararray); y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); STORE y INTO 'foo'; {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
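The PIG-829 failure mode, a parameter value truncated at the first special character, is what you get when the preprocessor's value token only admits word characters. A hedged sketch of that contrast (the regexes are illustrative assumptions, not Pig's actual preprocessor grammar):

```python
import re

def parse_declare_buggy(line):
    # Value token limited to word characters: stops at '.', ',', '+', '%', ...
    m = re.match(r"%DECLARE\s+(\w+)\s+(\w+)", line)
    return m.group(1), m.group(2)

def parse_declare_fixed(line):
    # Value runs to the next whitespace (quoting rules omitted in this sketch)
    m = re.match(r"%DECLARE\s+(\w+)\s+(\S+)", line)
    return m.group(1), m.group(2)

assert parse_declare_buggy("%DECLARE OUT foo.bar") == ("OUT", "foo")
assert parse_declare_fixed("%DECLARE OUT foo.bar") == ("OUT", "foo.bar")
```

The first parse reproduces the `STORE y INTO 'foo'` seen in the dry-run output; the second yields the intended `foo.bar`.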
[jira] Created: (PIG-1377) Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold
Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold -- Key: PIG-1377 URL: https://issues.apache.org/jira/browse/PIG-1377 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0 Reporter: Viraj Bhat I have a Zebra script which generates huge amount of mappers around 400K. The mapred.jobtracker.maxtasks.per.job is currently set at 200k. The job fails at the initialization phase. It is very hard to find out the cause. We need a way to report the right error message to users. Unfortunately for Pig to get this error in the backend, Map Reduce Jira: https://issues.apache.org/jira/browse/MAPREDUCE-1049 needs to be fixed. {code} -- Sorted format %set default_parallel 100; raw = load '/user/viraj/generated/raw/zebra-sorted/20100203' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (id, timestamp, code, ip, host, reference, type, flag, params : map[] ); describe raw; user_events = filter raw by id == 'viraj'; describe user_events; dump user_events; sorted_events = order user_events by id, timestamp; dump sorted_events; store sorted_events into 'finalresult'; {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
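Until MAPREDUCE-1049 lets the backend error propagate, the kind of message being asked for could be produced client-side. The guard below is purely hypothetical (the function is not an actual Pig or Hadoop API; only the property name comes from the issue):

```python
def check_task_limit(num_splits, conf):
    # Hypothetical pre-submission guard: surface a clear error before the job
    # is submitted, instead of an opaque failure in the initialization phase.
    limit = int(conf.get("mapred.jobtracker.maxtasks.per.job", -1))
    if limit >= 0 and num_splits > limit:
        raise RuntimeError(
            f"Job requires {num_splits} map tasks but "
            f"mapred.jobtracker.maxtasks.per.job is {limit}")

check_task_limit(150_000, {"mapred.jobtracker.maxtasks.per.job": "200000"})  # ok
```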
[jira] Created: (PIG-1374) Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag
Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag -- Key: PIG-1374 URL: https://issues.apache.org/jira/browse/PIG-1374 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0 Reporter: Viraj Bhat Script loads data from BinStorage(), then flattens columns and then sorts on the second column with order descending. The order by fails with the ClassCastException {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; d = order c by $1 desc; dump d; {code} The sampling job fails with the following error: === java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataBag at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:407) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:188) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:329) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:159) === The schema for b, c and d are as follows: b: {bag_of_tuples: {tuple: (uuid: chararray,velocity: double)}} c: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double} d: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double} If we modify this script to order on the first 
column it seems to work {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; d = order c by $0 desc; dump d; {code} (gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493) (ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138) There is a workaround to do a projection before ORDER {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; newc = foreach c generate $0 as uuid, $1 as velocity; newd = order newc by velocity desc; dump newd; {code} (gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493) (ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138) The schema for the Loader is as follows: {code} public Schema outputSchema(Schema input) { try { List<Schema.FieldSchema> list = new ArrayList<Schema.FieldSchema>(); list.add(new Schema.FieldSchema("uuid", DataType.CHARARRAY)); list.add(new Schema.FieldSchema("velocity", DataType.DOUBLE)); Schema tupleSchema = new Schema(list); Schema.FieldSchema tupleFs = new Schema.FieldSchema("tuple", tupleSchema, DataType.TUPLE); Schema bagSchema = new Schema(tupleFs); bagSchema.setTwoLevelAccessRequired(true); Schema.FieldSchema bagFs = new Schema.FieldSchema("bag_of_tuples", bagSchema, DataType.BAG); return new Schema(bagFs); } catch (Exception e) { return null; } } {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854762#action_12854762 ] Viraj Bhat commented on PIG-756: In Pig 0.7 we have moved local mode of Pig to local mode of Hadoop. https://issues.apache.org/jira/browse/PIG-1053 Closing issue UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz I have a utility function util.INSETFROMFILE() that I pass a file name during initialization. {code} define inQuerySet util.INSETFROMFILE('analysis/queries'); A = load 'logs' using PigStorage() as ( date int, query chararray ); B = filter A by inQuerySet(query); {code} This provides a computationally inexpensive way to effect map-side joins for small sets, plus functions of this style provide the ability to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification on both my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allow them to either: 1) know when they are in local or HDFS mode and let them open and read from files as appropriate 2) just provide a file name and read statements and have pig transparently manage local or HDFS opens and reads for the UDF UDFs need to read configuration information off the filesystem and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-756. Resolution: Fixed Fix Version/s: 0.7.0 https://issues.apache.org/jira/browse/PIG-1053 fixes this issue. UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz Fix For: 0.7.0 I have a utility function util.INSETFROMFILE() that I pass a file name during initialization. {code} define inQuerySet util.INSETFROMFILE(analysis/queries); A = load 'logs' using PigStorage() as ( date int, query chararray ); B = filter A by inQuerySet(query); {code} This provides a computationally inexpensive way to effect map-side joins for small sets plus functions of this style provide the ability to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification on both my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allow them to either: 1) know when they are in local or HDFS mode and let them open and read from files as appropriate 2) just provide a file name and read statements and have pig transparently manage local or HDFS opens and reads for the UDF UDFs need to read configuration information off the filesystem and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
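What the reporter asks for can be pictured as a single path-resolution step keyed off the exec type. The helper below is purely hypothetical (PIG-1053 actually resolved this by moving Pig's local mode onto Hadoop's local mode, so one FileSystem API serves both); it only sketches the decision the UDF wants made for it:

```python
def resolve_scheme(path, exectype):
    # Hypothetical helper: decide which filesystem a relative UDF path maps
    # to, so the same script runs under -exectype local and on HDFS.
    if "://" in path:
        return path.split("://", 1)[0]   # explicit scheme wins
    return "file" if exectype == "local" else "hdfs"

assert resolve_scheme("analysis/queries", "local") == "file"
assert resolve_scheme("analysis/queries", "mapreduce") == "hdfs"
```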
[jira] Created: (PIG-1345) Link casting errors in POCast to actual line numbers in Pig script
Link casting errors in POCast to actual line numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat For the purpose of easy debugging, it would be nice to find out where in the Pig script my warnings are coming from. The only known process is to comment out lines in the Pig script and see if these warnings go away. 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 I think this may need us to keep track of the line numbers of the Pig script (via our javacc parser) and maintain them in the logical and physical plan. It would help users in debugging simple errors/warnings related to casting. Is this enhancement ("Standardize on Parser and Scanner Technology") listed in the http://wiki.apache.org/pig/PigJournal? Do we need to change the parser to something other than javacc to make this task simpler? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
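The requested bookkeeping amounts to carrying a source line on each plan node and aggregating warnings by (kind, line) instead of kind alone. A minimal sketch of that aggregation (the class and message format are assumptions for illustration):

```python
from collections import Counter

class WarningAggregator:
    # Sketch: plan nodes would carry the script line they were parsed from,
    # letting warnings be reported per (kind, source line).
    def __init__(self):
        self.counts = Counter()

    def warn(self, kind, line):
        self.counts[(kind, line)] += 1

    def report(self):
        return [f"Encountered Warning {kind} {n} time(s) at line {line}"
                for (kind, line), n in sorted(self.counts.items())]

agg = WarningAggregator()
agg.warn("IMPLICIT_CAST_TO_MAP", 22)
agg.warn("IMPLICIT_CAST_TO_MAP", 22)
agg.warn("IMPLICIT_CAST_TO_LONG", 23)
```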
[jira] Created: (PIG-1339) International characters in column names not supported
International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat There is a particular use-case in which someone specifies a column name to be in International characters. {code} inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); describe inputdata; dump inputdata; {code} == Pig Stack Trace --- ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at 
org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) == Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
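The TokenMgrError above comes from the lexer's identifier rule admitting only ASCII letters, so \u3042 (あ) can never start a token. The contrast can be sketched with two regexes (both are approximations of the grammar, not the actual javacc production):

```python
import re

# Roughly the lexer's current identifier rule (ASCII letters only) versus a
# Unicode-aware alternative that would accept the reporter's column names.
ascii_ident = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
unicode_ident = re.compile(r"^[^\W\d]\w*$")

assert ascii_ident.match("col1")
assert not ascii_ident.match("あいうえお")   # rejected: the reported lexical error
assert unicode_ident.match("あいうえお")     # accepted under a wider rule
```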
[jira] Created: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20 - Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray. {code} raw = load 'sampledata' using BinStorage() as (col1,col2, col3); --filter out null columns A = filter raw by col1#'bcookie' is not null; B = foreach A generate col1#'bcookie' as reqcolumn; describe B; --B: {regcolumn: bytearray} X = limit B 5; dump X; B = foreach A generate (chararray)col1#'bcookie' as convertedcol; describe B; --B: {convertedcol: chararray} X = limit B 5; dump X; {code} The first dump produces: (36co9b55onr8s) (36co9b55onr8s) (36hilul5oo1q1) (36hilul5oo1q1) (36l4cj15ooa8a) The second dump produces: () () () () () It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s). Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1341: Component/s: impl Summary: Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED (was: Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED -- Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray. {code} raw = load 'sampledata' using BinStorage() as (col1,col2, col3); --filter out null columns A = filter raw by col1#'bcookie' is not null; B = foreach A generate col1#'bcookie' as reqcolumn; describe B; --B: {regcolumn: bytearray} X = limit B 5; dump X; B = foreach A generate (chararray)col1#'bcookie' as convertedcol; describe B; --B: {convertedcol: chararray} X = limit B 5; dump X; {code} The first dump produces: (36co9b55onr8s) (36co9b55onr8s) (36hilul5oo1q1) (36hilul5oo1q1) (36l4cj15ooa8a) The second dump produces: () () () () () It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s). Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
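For context on the symptom: a failed cast in Pig discards the field as null and counts a warning rather than failing the job, which is why the second dump shows empty tuples alongside FIELD_DISCARDED_TYPE_CONVERSION_FAILED. A sketch of those semantics (the function name and return convention are assumptions; the bug is that values which should convert cleanly are hitting this failure path):

```python
def cast_to_chararray(raw):
    # Returns (value, warning_count): on failure the field becomes null (None)
    # and FIELD_DISCARDED_TYPE_CONVERSION_FAILED is incremented once.
    try:
        return raw.decode("utf-8"), 0
    except (AttributeError, UnicodeDecodeError):
        return None, 1

assert cast_to_chararray(b"36co9b55onr8s") == ("36co9b55onr8s", 0)
assert cast_to_chararray(b"\xff\xfe") == (None, 1)   # undecodable bytes -> null
```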
[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat

I hit a particular case while running with the latest trunk of Pig.

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
[main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log

$ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}

The job failed, and the log file did not contain anything; the only way to debug was to look into the JobTracker logs. Here are some reasons that could have caused this behavior: 1) The underlying filer/NFS had some issues; in that case, should we not report an error on stdout? 2) Some errors from the backend are not being captured.

Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
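The defensive behavior this report asks for can be sketched in Python (illustrative only; the function name and messages are assumptions, not Pig's code): never advertise a log path that was not actually created, and complain loudly when it cannot be.

```python
import os
import sys
import tempfile

def open_log(path):
    """Open a client log file; report failure instead of silently advertising a missing file."""
    try:
        f = open(path, "a")
    except OSError as e:
        # This is the step the report says is missing: surface the problem on stderr/stdout.
        print(f"WARNING: could not create log file {path}: {e}", file=sys.stderr)
        return None
    # Only claim the log exists after the open actually succeeded.
    print(f"Logging error messages to: {path}")
    return f

log = open_log(os.path.join(tempfile.mkdtemp(), "pig_test.log"))
```

The point is ordering: the "Logging error messages to" line is printed only after the file handle is known to be good.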
[jira] Created: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2] Key: PIG-1308 URL: https://issues.apache.org/jira/browse/PIG-1308 Project: Pig Issue Type: Bug Reporter: Viraj Bhat Fix For: 0.7.0

A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted, the client repeatedly logs the following INFO line without ever submitting the job:

2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
... (the same INFO line repeats every 30-40 seconds, from 22:31 through at least 22:39, indefinitely) ...

Stack trace revealed:
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:144)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The BinStorage data was generated from 2 datasets using limit and union:

{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'mobilesample' using BinStorage();
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
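The missing guard this report implies can be sketched generically in Python (names are illustrative, not Pig's): any client-side loop that retries a backend call, such as re-reading a BinStorage schema, should be bounded so it fails fast with a real error instead of looping forever.

```python
def retry_bounded(action, max_attempts=3):
    """Run action(), retrying on I/O failure at most max_attempts times."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return action()
        except IOError as e:
            last_err = e  # each attempt logs its "Total input paths to process" line, then fails
    # Give up with context instead of spinning indefinitely.
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_err

calls = []
def always_fails():
    calls.append(1)
    raise IOError("cannot read BinStorage schema")

try:
    retry_bounded(always_fails)
except RuntimeError as e:
    print(e)
```

An action that never succeeds now surfaces an error after three attempts rather than producing an endless stream of identical INFO lines.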
[jira] Updated: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1308: Description: The description was updated so that the load and store paths agree: the script now reads load 'binstoragesample' using BinStorage() and the generating script ends with store U into 'binstoragesample' using BinStorage(); (previously 'binstorage' and 'mobilesample'). The rest of the description - the repeated "Total input paths to process : 2" log lines and the stack trace - is unchanged from the original report. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1278) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText
Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText --- Key: PIG-1278 URL: https://issues.apache.org/jira/browse/PIG-1278 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0

I have a script which uses map data and runs a UDF that creates random numbers, then orders the data by these random numbers.

{code}
REGISTER myloader.jar;
REGISTER math.jar; -- jar built from the Random.java source listed below
DEFINE generator math.Random();
inputdata = LOAD '/user/viraj/mymapdata' USING MyMapLoader() AS (s:map[], m:map[], l:map[]);
queries = FILTER inputdata BY m#'key'#'query' IS NOT null;
queries_rand = FOREACH queries GENERATE generator('') AS rand_num, (CHARARRAY) m#'key'#'query' AS query_string;
queries_sorted = ORDER queries_rand BY rand_num PARALLEL 10;
queries_limit = LIMIT queries_sorted 1000;
rand_queries = FOREACH queries_limit GENERATE query_string;
STORE rand_queries INTO 'finalresult';
{code}

UDF source for Random.java:

{code}
package math;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

/*
 * Implements a random float [0,1) generator.
 */
public class Random extends EvalFunc<Float> {

    // java.util.Random must be fully qualified here, since this class is itself named Random
    private final java.util.Random m_rand = new java.util.Random();

    public Float exec(Tuple input) throws IOException {
        return new Float(m_rand.nextFloat());
    }

    public Schema outputSchema(Schema input) {
        final String name = getSchemaName(getClass().getName(), input);
        return new Schema(new Schema.FieldSchema(name, DataType.FLOAT));
    }
}
{code}

Running this script returns the following error in the mapper:

java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
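Setting the type mismatch aside, the script's sampling intent can be modeled in Python (an illustrative sketch, not Pig's execution): attach a random key to each row, order by it, and keep the first N, yielding a uniform sample.

```python
import random

def sample_queries(queries, n, seed=None):
    """Uniformly sample n items by random-key sort, mirroring the Pig script's plan."""
    rng = random.Random(seed)
    keyed = [(rng.random(), q) for q in queries]   # GENERATE generator('') AS rand_num, query
    keyed.sort(key=lambda kv: kv[0])               # ORDER ... BY rand_num
    return [q for _, q in keyed[:n]]               # LIMIT ... n

picked = sample_queries([f"q{i}" for i in range(100)], 10, seed=42)
```

For this plan to work in Pig, the sort key's declared type (FLOAT from outputSchema) and the type actually emitted by the UDF must agree, which is exactly what the error message above is complaining about.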
[jira] Created: (PIG-1281) Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at Compile Time during creation of logical plan
Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at Compile Time during creation of logical plan --- Key: PIG-1281 URL: https://issues.apache.org/jira/browse/PIG-1281 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0

This is more of an enhancement request, where we can detect simple errors at compile time, during creation of the logical plan, rather than in the backend. I created a script which contains an error that gets detected in the backend as a cast error, when in fact we can detect it in the front end (group is a single element, so the group.$0 projection operation will not work).

{code}
inputdata = LOAD '/user/viraj/mymapdata' AS (col1, col2, col3, col4);
projdata = FILTER inputdata BY (col1 is not null);
groupprojdata = GROUP projdata BY col1;
cleandata = FOREACH groupprojdata {
    bagproj = projdata.col1;
    dist_bags = DISTINCT bagproj;
    GENERATE group.$0 as newcol1, COUNT(dist_bags) as newcol2;
};
cleandata1 = GROUP cleandata by newcol2;
cleandata2 = FOREACH cleandata1 {
    GENERATE group.$0 as finalcol1, COUNT(cleandata.newcol1) as finalcol2;
};
ordereddata = ORDER cleandata2 by finalcol2;
store ordereddata into 'finalresult' using PigStorage();
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
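The kind of front-end check being requested can be sketched in Python (entirely hypothetical; this function and its message are not part of Pig's planner): reject a group.$0 projection at plan-building time when the group key is a single field, instead of letting the backend fail with a cast error.

```python
def check_group_projection(group_key_arity: int, projection: str):
    """Return an error string at plan-building time, or None if the projection is legal.

    Hypothetical validator: after GROUP BY a single expression, `group` is a scalar,
    so tuple-style projections like group.$0 cannot be applied to it.
    """
    if group_key_arity == 1 and projection.startswith("group.$"):
        return ("group is a single field here; the " + projection +
                " projection is invalid - use `group` directly")
    return None

err = check_group_projection(1, "group.$0")
print(err)
```

With two or more group keys, `group` is a tuple and group.$0 is fine, so the check only fires in the single-key case described above.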
[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840339#action_12840339 ] Viraj Bhat commented on PIG-1252: - A modified version of the script works; does this have to do with the nested foreach?

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
dump finalData;
{code}

Diamond splitter does not generate correct results when using Multi-query optimization -- Key: PIG-1252 URL: https://issues.apache.org/jira/browse/PIG-1252 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.7.0

I have a script which uses SPLIT but somehow never uses one of the split branches. The skeleton of the script is as follows:

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
finalData = FOREACH grpData { orderedData = ORDER trueDataTmp BY col1,col2; GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); }
dump finalData;
{code}

You can see that falseDataTmp is untouched. When I run this script with the no-Multiquery (-M) option, I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1272) Column pruner causes wrong results
Column pruner causes wrong results -- Key: PIG-1272 URL: https://issues.apache.org/jira/browse/PIG-1272 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0

For a simple script, the column pruner optimization removes certain columns from the original relation, which produces wrong results. Input file kv contains the following columns (tab separated):

{code}
a 1
a 2
a 3
b 4
c 5
c 6
b 7
d 8
{code}

Now running this script in Pig 0.6 produces:

{code}
kv = load 'kv' as (k,v);
keys = foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)

Running this in the Pig 0.5 version, without the column pruner, results in:

(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization it gives the right results. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
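Modeling the script directly in Python reproduces the Pig 0.5 (pruner-off) answer, confirming that each joined row must keep kv's v column. (Which two keys LIMIT picks is nondeterministic in Pig; 'a' and 'b' are assumed here to match the report.)

```python
# The kv input, as (k, v) pairs.
kv = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5), ("c", 6), ("b", 7), ("d", 8)]

# keys = foreach kv generate k; keys = distinct keys; keys = limit keys 2;
keys = list(dict.fromkeys(k for k, _ in kv))[:2]   # order-preserving distinct, then take 2

# rejoin = join keys by k, kv by k;
# Each surviving key is joined with every full (k, v) row of kv, so v must survive.
rejoin = [(j, k, v) for j in keys for (k, v) in kv if j == k]
print(rejoin)
```

The buggy output (a,a)/(b,b) corresponds to the pruner having dropped v from kv even though the join's output still projects it.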
[jira] Commented: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840389#action_12840389 ] Viraj Bhat commented on PIG-1272: - Now with Pig 0.7 or trunk we have the following error:

2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchFieldError: sJobConf
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj

Column pruner causes wrong results -- Key: PIG-1272 URL: https://issues.apache.org/jira/browse/PIG-1272 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0

For a simple script, the column pruner optimization removes certain columns from the original relation, which produces wrong results. Input file kv contains the following columns (tab separated):

{code}
a 1
a 2
a 3
b 4
c 5
c 6
b 7
d 8
{code}

Now running this script in Pig 0.6 produces:

{code}
kv = load 'kv' as (k,v);
keys = foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)

Running this in the Pig 0.5 version, without the column pruner, results in:

(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization it gives the right results. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types
Script producing varying number of records when COGROUPing value of map data type with and without types Key: PIG-1263 URL: https://issues.apache.org/jira/browse/PIG-1263 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

I have a Pig script which I am experimenting upon. (Albeit this is not optimized and can be done in a variety of ways.) I get different record counts by placing load/store pairs in the script:

Case 1: Returns 424329 records
Case 2: Returns 5859 records
Case 3: Returns 5859 records
Case 4: Returns 5578 records

I am wondering which result is correct. Here are the scripts.

Case 1:

{code}
register udf.jar
A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);
B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2;
C = FOREACH B generate key2;
D = filter C by (key2 IS NOT null);
E = distinct D;
store E into 'unique_key_list' using PigStorage('\u0001');
F = Foreach E generate key2, MapGenerate(key2) as m;
G = FILTER F by (m IS NOT null);
H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;
I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4, group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12;
--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);
L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER;
M = filter L by IsEmpty(K);
store M into 'cogroupNoTypes' using PigStorage();
{code}
Case 2: Storing and loading intermediate results in J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; --store intermediate data to HDFS and re-read store J into 'output/20100203/J' using PigStorage('\u0001'); --load previous days data K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); --read J into K1 K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER; M = filter L by IsEmpty(K); store M into 'cogroupNoTypesIntStore' using PigStorage(); {code} Case 3: Types information specified but no intermediate store of J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using 
PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, (long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, (chararray)m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7,
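One way to see why the declared types in Case 3 can change group counts, sketched in Python (an illustration of collation semantics, not a diagnosis of this specific bug): an uncast map value behaves as a bytearray, and bytearray keys never collate with cast long/chararray keys, so the same logical data can yield different numbers of groups.

```python
# m#'id1' with no cast: the value stays a bytearray (modeled here as bytes).
untyped = [b"1", b"2", b"1"]
# (long)m#'id1': the value becomes a real number.
typed = [1, 2, 1]

groups_untyped = set(untyped)          # grouping the uncast values
groups_typed = set(typed)              # grouping the cast values
mixed = groups_untyped | groups_typed  # cogrouping the two representations together

print(len(groups_untyped), len(groups_typed), len(mixed))
```

Each representation groups to two keys on its own, but cogrouped together they form four distinct keys, because b"1" and 1 never match; this is the kind of silent divergence that makes the four cases above disagree.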
[jira] Created: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
Diamond splitter does not generate correct results when using Multi-query optimization -- Key: PIG-1252 URL: https://issues.apache.org/jira/browse/PIG-1252 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0

I have a script which uses SPLIT but somehow never uses one of the split branches. The skeleton of the script is as follows:

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7, col7');
prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
finalData = FOREACH grpData { orderedData = ORDER trueDataTmp BY col1,col2; GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); }
dump finalData;
{code}

You can see that falseDataTmp is untouched. When I run this script with the no-Multiquery (-M) option, I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1252: Description: The description was updated to remove a duplicated column in the TableLoader argument list: the load now reads TableLoader('col1,col2, col3, col4, col5, col6, col7') where the original description had TableLoader('col1,col2, col3, col4, col5, col6, col7, col7'). The rest of the description - the SPLIT/GROUP/nested-FOREACH skeleton, the observation that falseDataTmp is untouched, that the no-Multiquery (-M) option gives the right result, and the FILTER-instead-of-SPLIT workaround - is unchanged from the original report. Viraj

Diamond splitter does not generate correct results when using Multi-query optimization -- Key: PIG-1252 URL: https://issues.apache.org/jira/browse/PIG-1252 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
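The FILTER-instead-of-SPLIT workaround mentioned in this report is sound because, when only one branch is ever consumed, the two are equivalent. A Python model (field names follow the script; the records themselves are made up):

```python
records = [
    {"validRec": "1", "splitcond": "x"},   # goes to trueDataTmp
    {"validRec": "1", "splitcond": ""},    # goes to falseDataTmp
    {"validRec": "0", "splitcond": "y"},   # matches neither branch
]

def cond_true(r):
    return r["validRec"] == "1" and r["splitcond"] != ""

# SPLIT materializes both branches, even though the script only ever reads the first.
true_branch = [r for r in records if cond_true(r)]
false_branch = [r for r in records if r["validRec"] == "1" and r["splitcond"] == ""]

# FILTER produces just the branch that is actually consumed, sidestepping the
# multi-query plan that mishandles the unused branch.
filtered = [r for r in records if cond_true(r)]
```

Since `true_branch == filtered` by construction, replacing the SPLIT with a FILTER cannot change the result of the consumed branch; it only removes the dead branch from the plan.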
[jira] Created: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error
Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error - Key: PIG-1247 URL: https://issues.apache.org/jira/browse/PIG-1247 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 I have a large script in which there are intermediate stores statements, one of them writes to a directory I do not have permission to write to. The stack trace I get from Pig is this: 2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error Details at logfile: /home/viraj/pig_1266632145355.log Pig Stack Trace --- ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error java.lang.ClassCastException: org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986) at org.apache.pig.PigServer.registerQuery(PigServer.java:386) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:386) The only way to find the error was to look at the javacc-generated QueryParser.java code and add a System.out.println(). Here is a script to reproduce the problem:
{code}
A = load '/user/viraj/three.txt' using PigStorage();
B = foreach A generate ['a'#'12'] as b:map[];
store B into '/user/secure/pigtest' using PigStorage();
{code}
three.txt has 3 lines which contain nothing but the number 1.
{code}
$ hadoop fs -ls /user/secure/
ls: could not get get listing for 'hdfs://mynamenode/user/secure' : org.apache.hadoop.security.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx--
{code}
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
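The ClassCastException comes from an error-handling path that assumes every throwable raised while processing the store clause is a java.lang.Error. A minimal Python sketch of the buggy pattern and the fix; every name here is an illustrative stand-in, not Pig's actual code:

```python
class DataStorageException(Exception):
    """Stand-in for org.apache.pig.backend.datastorage.DataStorageException."""

def store_clause(path, writable):
    # Stand-in for the parser's store handling: fails on unwritable output.
    if not writable:
        raise DataStorageException(f"Permission denied: {path}")

def run_store_bad(path, writable):
    # Mirrors the buggy pattern: assume every failure is one known type,
    # losing the real cause (the Java analogue is the bad cast to Error).
    try:
        store_clause(path, writable)
    except Exception:
        raise TypeError("Unexpected internal error") from None

def run_store_good(path, writable):
    # Catch the specific failure and surface its original message.
    try:
        store_clause(path, writable)
    except DataStorageException as e:
        raise RuntimeError(f"Store failed: {e}") from e
```

With the second form, the user would see the permission problem directly instead of ERROR 2999.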
[jira] Created: (PIG-1243) Passing Complex map types to and from streaming causes a problem
Passing Complex map types to and from streaming causes a problem Key: PIG-1243 URL: https://issues.apache.org/jira/browse/PIG-1243 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 I have a program which generates different types of Map fields and stores them using PigStorage.
{code}
A = load '/user/viraj/three.txt' using PigStorage();
B = foreach A generate ['a'#'12'] as b:map[], ['b'#['c'#'12']] as c, ['c'#{(['d'#'15']),(['e'#'16'])}] as d;
store B into '/user/viraj/pigtest' using PigStorage();
{code}
Now I test the previous output with the script below to make sure I have the right results. I also pass this data to a Perl script, and I observe that the complex Map types I have generated are lost when I get the result back.
{code}
DEFINE CMD `simple.pl` SHIP('simple.pl');
A = load '/user/viraj/pigtest' using PigStorage() as (simpleFields, mapFields, mapListFields);
B = foreach A generate $0, $1, $2;
dump B;
C = foreach A generate (chararray)simpleFields#'a' as value, $0, $1, $2;
D = stream C through CMD as (a0:map[], a1:map[], a2:map[]);
dump D;
{code}
dumping B results in:
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
dumping D results in:
([a#12],,)
([a#12],,)
([a#12],,)
The Perl script used here is:
{code}
#!/usr/local/bin/perl
use warnings;
use strict;
while (<STDIN>) {
    chomp;
    my ($bc, $s, $m, $l) = split /\t/;
    print "$s\t$m\t$l\n";
}
{code}
Is there an issue with the handling of complex Map fields within streaming? How can I fix this to obtain the right result? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
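For reference, the same pass-through can be sketched in Python. It mirrors what the Perl script intends (drop the first field, echo the remaining three), assuming Pig's default tab-delimited streaming serialization; it does not address any Pig-side loss of the map fields:

```python
import sys

def passthrough(line):
    """Echo fields 2..4 of a tab-separated record.

    Field layout assumes relation C above: (value, simpleFields,
    mapFields, mapListFields).
    """
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[1:4])

if __name__ == "__main__":
    for line in sys.stdin:
        print(passthrough(line))
```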
[jira] Reopened: (PIG-1194) ERROR 2055: Received Error while processing the map plan
[ https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat reopened PIG-1194: - Hi Richard, I ran the script attached on the ticket and found that the map task fails with the following error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:281) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am using the latest pig.jar without hadoop. Viraj ERROR 2055: Received Error while processing the map plan Key: PIG-1194 URL: https://issues.apache.org/jira/browse/PIG-1194 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0, 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.7.0 Attachments: inputdata.txt, PIG-1194.patch, PIG-1194.patch I have a simple Pig script which takes 3 columns, one of which is null.
{code}
input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
a = GROUP input BY (((double) col3)/((double) col2) .001 OR col1 11 ? col1 : -1);
b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) as col3;
store b into 'finalresult';
{code}
When I run this script I get the following error: ERROR 2055: Received Error while processing the map plan. org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) A more useful error message for the purpose of debugging would be helpful. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831248#action_12831248 ] Viraj Bhat commented on PIG-1131: - Olga, I marked it as critical since we claim that Pig can eat any type of data, yet the example script shows that even a simple join needs data with a fixed schema. Viraj Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig I have a simple script, which does a JOIN.
{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;
joineddata = JOIN input1 by $0, input2 by $0;
describe joineddata;
store joineddata into 'result';
{code}
The input data contains empty lines.
The join fails in the Map phase with the following error in POLocalRearrange.java: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am surprised that the test cases did not detect this error. Could we add this data, which contains empty lines, to the test cases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
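The IndexOutOfBoundsException is POLocalRearrange asking for a key column that an empty input line never produced. The missing guard can be sketched in Python; names and structure are illustrative, not Pig's implementation:

```python
def extract_join_key(tup, key_index=0):
    """Return the join key, or None for tuples too short to hold it.

    Sketch of the guard the stack trace suggests is missing: an empty
    input line parses to a tuple with no field at the key position.
    """
    if key_index >= len(tup):
        return None
    return tup[key_index]

def join_on_first(rel1, rel2):
    # Inner join on column 0, silently skipping keyless records.
    index = {}
    for t in rel2:
        k = extract_join_key(t)
        if k is not None:
            index.setdefault(k, []).append(t)
    return [l + r for l in rel1
            if (k := extract_join_key(l)) is not None
            for r in index.get(k, [])]
```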
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831251#action_12831251 ] Viraj Bhat commented on PIG-1131: - Ashutosh I was able to recreate a similar problem using the trunk. java -cp pig-withouthadoop.jar org.apache.pig.Main -version Apache Pig version 0.7.0-dev (r907874) compiled Feb 08 2010, 17:35:04 Viraj Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig I have a simple script, which does a JOIN. {code} input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); describe input1; input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); describe input2; joineddata = JOIN input1 by $0, input2 by $0; describe joineddata; store joineddata into 'result'; {code} The input data contains empty lines. 
The join fails in the Map phase with the following error in POLocalRearrange.java: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am surprised that the test cases did not detect this error. Could we add this data, which contains empty lines, to the test cases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1220) Document unknown keywords as missing or to do in future
Document unknown keywords as missing or to do in future --- Key: PIG-1220 URL: https://issues.apache.org/jira/browse/PIG-1220 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 To get help at the grunt shell I do the following:
grunt> touchz
2010-02-04 00:59:28,714 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered IDENTIFIER touchz at line 1, column 1. Was expecting one of: EOF cat ... fs ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... I looked at the code and found that we do nothing at scriptDone. Is there some future value to that command? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
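The parse failure itself is reasonable behavior for an unknown token. The kind of keyword check involved can be sketched as below, with the command list taken from the error message above (illustrative Python, not the grunt parser; `touchz` would have to go through `fs`):

```python
# Keyword set copied from the "Was expecting one of" list in the error.
GRUNT_COMMANDS = {"cat", "fs", "cd", "cp", "copyFromLocal", "copyToLocal",
                  "dump", "describe", "aliases", "explain", "help", "kill",
                  "ls", "mv", "mkdir", "pwd", "quit", "register", "rm",
                  "rmf", "set", "illustrate", "run", "exec"}

def parse_command(line):
    """Return ("ok", token) for a known command, else ("error", message)."""
    tokens = line.split()
    token = tokens[0] if tokens else ""
    if token in GRUNT_COMMANDS:
        return ("ok", token)
    return ("error", f"Encountered IDENTIFIER {token}; expected one of: "
            + ", ".join(sorted(GRUNT_COMMANDS)))
```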
[jira] Updated: (PIG-1174) Creation of output path should be done by storage function
[ https://issues.apache.org/jira/browse/PIG-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1174: Fix Version/s: 0.7.0 Creation of output path should be done by storage function -- Key: PIG-1174 URL: https://issues.apache.org/jira/browse/PIG-1174 Project: Pig Issue Type: Bug Reporter: Bill Graham Fix For: 0.7.0 When executing a STORE command, Pig creates the output location before the storage function gets called. This causes problems with storage functions that have logic to determine the output location. See this thread: http://www.mail-archive.com/pig-user%40hadoop.apache.org/msg01538.html For example, when making a request like this: STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 'none', '\t'); Pig creates a file '/my/home/output' and then an exception is thrown when MultiStorage tries to make a directory under '/my/home/output'. The workaround is to instead specify a dummy location as the first path like so: STORE A INTO '/my/home/output/temp' USING MultiStorage('/my/home/output','0', 'none', '\t'); Two changes should be made: 1. The path specified in the INTO clause should be available to the storage function so it doesn't need to be duplicated. 2. The creation of the output paths should be delegated to the storage function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
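The proposed contract, where the INTO path is handed to the storage function and directory creation is delegated to it, might look like the following sketch. Class and method names here are hypothetical, not Pig's StoreFunc API:

```python
import os.path

class MultiStorageSketch:
    """Hypothetical store function that derives per-record output paths
    under the location given in the INTO clause."""

    def __init__(self, split_field_index):
        self.split_field_index = split_field_index
        self.base_path = None

    def set_store_location(self, into_path):
        # Change 1: the INTO path is passed to the storage function,
        # so it no longer has to be duplicated as a constructor argument.
        self.base_path = into_path

    def output_path_for(self, record):
        # Change 2: the storage function, not Pig, decides what is
        # created under the output location.
        return os.path.join(self.base_path,
                            str(record[self.split_field_index]))
```

Under this contract Pig would never pre-create '/my/home/output' as a file, so the dummy-path workaround becomes unnecessary.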
[jira] Updated: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
[ https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-940: --- Affects Version/s: (was: 0.3.0) 0.5.0 Fix Version/s: 0.7.0 Cross site HDFS access using the default.fs.name not possible in Pig Key: PIG-940 URL: https://issues.apache.org/jira/browse/PIG-940 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Environment: Hadoop 20 Reporter: Viraj Bhat Fix For: 0.7.0 I have a script which does the following: it accesses data from a remote HDFS location (an HDFS instance at hdfs://remotemachine1.company.com/), as I do not want to copy this huge amount of data between HDFS locations. However, I want my Pig script to write data to the HDFS running on localmachine.company.com. Currently Pig does not support that behavior and complains that hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
{code}
A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b);
B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d);
C = JOIN A by a, B by c;
store C into 'output' using PigStorage();
{code}
=== 2009-09-01 00:37:24,032 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localmachine.company.com:8020 2009-09-01 00:37:24,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localmachine.company.com:50300 2009-09-01 00:37:24,567 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage-POForEach to POJoinPackage 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2009-09-01
00:37:26,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2009-09-01 00:37:26,249 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-09-01 00:37:26,747 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed! 2009-09-01 00:37:26,756 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480 2009-09-01 00:37:26,756 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log === The error file in Pig contains: === ERROR 2998: Unhandled internal error. org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. 
at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126) at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) at org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at
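Conceptually, the fix is to validate a load path against the filesystem named in its own URI, and fall back to the default filesystem only for scheme-less paths. A hedged Python sketch of that resolution rule using urllib.parse (illustrative only, not Pig's path handling):

```python
from urllib.parse import urlparse

DEFAULT_FS = "hdfs://localmachine.company.com"

def resolve_fs(path, default_fs=DEFAULT_FS):
    """Pick the filesystem authority for a load/store path.

    A fully qualified hdfs:// URI keeps its own namenode; only relative
    or scheme-less paths are resolved against the default filesystem.
    """
    parsed = urlparse(path)
    if parsed.scheme and parsed.netloc:
        return f"{parsed.scheme}://{parsed.netloc}", parsed.path
    default = urlparse(default_fs)
    return f"{default.scheme}://{default.netloc}", path
```

With this rule, A1.txt above would be validated against remotemachine1.company.com while 'output' still lands on the local cluster.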
[jira] Updated: (PIG-531) Way for explain to show 1 plan at a time
[ https://issues.apache.org/jira/browse/PIG-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-531: --- Fix Version/s: 0.5.0 Hi Olga, I think we have a way to handle it in multi-query optimization. Is it reasonable to close this as fixed? I see the following in the Multi-query document about explain: http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification explain [-out path] [-brief] [-dot] [-param key=value]* [-param_file filename]* [-script scriptname] [handle] Viraj Way for explain to show 1 plan at a time Key: PIG-531 URL: https://issues.apache.org/jira/browse/PIG-531 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: 0.5.0 Several users complained that EXPLAIN output is too verbose and hard to make sense of. One way to improve the situation is to realize that EXPLAIN actually contains several plans: logical, physical, backend-specific. So we can update EXPLAIN to allow showing a particular plan. For instance, EXPLAIN LOGICAL A; would show only the logical plan. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
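The proposed `EXPLAIN LOGICAL A;` selection amounts to filtering one plan out of the set EXPLAIN already renders. A toy sketch, with plans represented as pre-rendered text (hypothetical structure, not Pig's explain machinery):

```python
def explain(plans, which=None):
    """Return explain text for one named plan, or all plans in order.

    `plans` maps plan name ("logical", "physical", "mapreduce") to its
    rendered text; `which` mimics the EXPLAIN LOGICAL-style selector.
    """
    order = ("logical", "physical", "mapreduce")
    if which is None:
        return "\n".join(plans[name] for name in order if name in plans)
    return plans[which.lower()]
```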
[jira] Created: (PIG-1194) ERROR 2055: Received Error while processing the map plan
ERROR 2055: Received Error while processing the map plan Key: PIG-1194 URL: https://issues.apache.org/jira/browse/PIG-1194 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0, 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: inputdata.txt I have a simple Pig script which takes 3 columns, one of which is null.
{code}
input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
a = GROUP input BY (((double) col3)/((double) col2) .001 OR col1 11 ? col1 : -1);
b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) as col3;
store b into 'finalresult';
{code}
When I run this script I get the following error: ERROR 2055: Received Error while processing the map plan. org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) A more useful error message for the purpose of debugging would be helpful. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
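The comparison operators in the GROUP expression appear to have been stripped by the issue tracker's markup handling, so the exact condition is uncertain. Assuming they were greater-than signs, the intended key computes as below; the null column that the map plan presumably trips over is guarded explicitly (illustrative Python, not Pig semantics):

```python
def group_key(col1, col2, col3):
    """Grouping key from the script, assuming the stripped operators
    were '>': col1 if col3/col2 > .001 OR col1 > 11, else -1.

    Null (None) or zero-divisor columns map to -1 instead of failing.
    """
    if col1 is None or col2 in (None, 0) or col3 is None:
        return -1
    return col1 if (float(col3) / float(col2) > 0.001 or col1 > 11) else -1
```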
[jira] Updated: (PIG-1194) ERROR 2055: Received Error while processing the map plan
[ https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1194: Attachment: inputdata.txt Testdata to run with this script ERROR 2055: Received Error while processing the map plan Key: PIG-1194 URL: https://issues.apache.org/jira/browse/PIG-1194 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0, 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: inputdata.txt I have a simple Pig script which takes 3 columns out of which one is null. {code} input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3); a = GROUP input BY (((double) col3)/((double) col2) .001 OR col1 11 ? col1 : -1); b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) as col3; store b into 'finalresult'; {code} When I run this script I get the following error: ERROR 2055: Received Error while processing the map plan. org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) A more useful error message for the purpose of debugging would be helpful. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
[ https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800315#action_12800315 ] Viraj Bhat commented on PIG-1187: - Hi Jeff, This is specific to the data we are using; it looks like the parser fails when trying to interpret certain characters. We have tested this with Chinese characters and it works. Viraj UTF-8 (international code) breaks with loader when load with schema is specified Key: PIG-1187 URL: https://issues.apache.org/jira/browse/PIG-1187 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a set of Pig statements which dump an international dataset.
{code}
INPUT_OBJECT = load 'internationalcode';
describe INPUT_OBJECT;
dump INPUT_OBJECT;
{code}
Sample output: (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)}) It works and dumps results, but when I use a schema for loading, it fails.
{code}
INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag {T: tuple(label:chararray)});
describe INPUT_OBJECT;
{code}
The error message is as follows: 2010-01-14 02:23:27,320 FATAL org.apache.hadoop.mapred.Child: Error running child : org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 21.
at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620) at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569) at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651) at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152) at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100) at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382) at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42) at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68) at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
UTF-8 (international code) breaks with loader when load with schema is specified Key: PIG-1187 URL: https://issues.apache.org/jira/browse/PIG-1187 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a set of Pig statements which dump an international dataset. {code} INPUT_OBJECT = load 'internationalcode'; describe INPUT_OBJECT; dump INPUT_OBJECT; {code} Sample output (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)}) It works and dumps results but when I use a schema for loading it fails. {code} INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag {T: tuple(label:chararray)}); describe INPUT_OBJECT; {code} The error message is as follows:2010-01-14 02:23:27,320 FATAL org.apache.hadoop.mapred.Child: Error running child : org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 21. at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620) at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569) at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651) at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152) at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100) at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382) at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42) at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68) at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
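The TokenMgrError hints that the bytes-to-bag conversion tokenizes raw bytes, where a multi-byte UTF-8 character can produce repeated zero-width matches. Decoding to characters before tokenizing avoids that. A toy parser for the single-field bags in the sample above (illustrative only, not Pig's Utf8StorageConverter):

```python
def parse_bag(raw: bytes):
    """Parse a {(v1),(v2),...} bag of single-field tuples from UTF-8 bytes.

    Decoding to str *before* tokenizing is what keeps multi-byte
    characters (e.g. あいうえお) intact; the grammar here is deliberately
    minimal and handles only the flat sample data.
    """
    text = raw.decode("utf-8")
    inner = text.strip().lstrip("{").rstrip("}")
    return [item.strip("()") for item in inner.split(",") if item]
```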
[jira] Created: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM
Successive replicated joins do not generate a Map Reduce plan and fail due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist in HDFS.
{code}
A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
A1 = FOREACH A GENERATE a;
B = GROUP A1 BY a;
C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
D = JOIN C BY x, B BY group USING replicated;
E = JOIN A BY a, D by x USING replicated;
dump E;
{code}
2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error.
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
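Independent of the thread-exhaustion failure, the semantics of a replicated join are worth restating: the right-hand relation must fit in memory, because each map task loads it fully and probes it for every record of the left-hand relation. A hedged Python sketch of that fragment-replicate strategy (illustrative, not Pig's POFRJoin implementation):

```python
def replicated_join(big, small, big_key, small_key):
    """Fragment-replicate join: build an in-memory index over the small
    relation, then stream the big one past it, as `USING replicated`
    does per map task. `big_key`/`small_key` are column indexes.
    """
    index = {}
    for s in small:
        index.setdefault(s[small_key], []).append(s)
    return [b + s for b in big for s in index.get(b[big_key], [])]
```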
[jira] Updated: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1157: Attachment: oomreplicatedjoin.pig replicatedjoinexplain.log Explain output and Pig script. Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on the HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING replicated; E = JOIN A BY a, D by x USING replicated; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
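A replicated join like the ones in this script is a map-side (fragment-replicate) join: the smaller relation is loaded into memory on every mapper and the larger relation is streamed past it, with no reduce phase. A minimal Python sketch of that strategy (illustrative only, with made-up helper names, not Pig's internals):

```python
def replicated_join(big, small, big_key=0, small_key=0):
    """Fragment-replicate join: build an in-memory hash table from the
    small relation, then stream the big relation past it (map-side only)."""
    table = {}
    for row in small:                       # small side must fit in memory
        table.setdefault(row[small_key], []).append(row)
    joined = []
    for row in big:                         # big side is streamed
        for match in table.get(row[big_key], []):
            joined.append(row + match)      # concatenate the matching tuples
    return joined

# Analogous to "D = JOIN C BY x, B BY group USING replicated;"
C = [(1, 'a'), (2, 'b')]
B = [(1, 'x'), (3, 'y')]
D = replicated_join(C, B)
```

Chaining two such joins means holding two replicated tables (plus their loading threads) at once, which is consistent with the launch-time OOM reported above.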
[jira] Created: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Fix For: 0.7.0 Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100. I modified MRPrinter.java to print out the parallelism: {code} ... public void visitMROp(MapReduceOper mr) { mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism()); } ... {code} When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY. Attaching the script and the explain output Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1144: Attachment: brokenparallel.out genericscript_broken_parallel.pig Script and explain output set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified the MRPrinter.java to printout the parallelism {code} ... public void visitMROp(MapReduceOper mr) mStream.println(MapReduce node + mr.getOperatorKey().toString() + Parallelism + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job which does the actual sort, runs as a single reducer job. This can be corrected, by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788436#action_12788436 ] Viraj Bhat commented on PIG-1144: - This happens on the real cluster, where the sorting job did not complete because of a single reducer. set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified the MRPrinter.java to printout the parallelism {code} ... public void visitMROp(MapReduceOper mr) mStream.println(MapReduce node + mr.getOperatorKey().toString() + Parallelism + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job which does the actual sort, runs as a single reducer job. This can be corrected, by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788439#action_12788439 ] Viraj Bhat commented on PIG-1144: - Hi Daniel, One more thing to note is that the Last Sort M/R job has a parallelism of 1. Should it not be -1? Viraj set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified the MRPrinter.java to printout the parallelism {code} ... public void visitMROp(MapReduceOper mr) mStream.println(MapReduce node + mr.getOperatorKey().toString() + Parallelism + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job which does the actual sort, runs as a single reducer job. This can be corrected, by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788481#action_12788481 ] Viraj Bhat commented on PIG-1144: - Hi Daniel, Thanks again for your input. This is more of a performance issue, which users do not detect until they see that the single-reducer job has failed in the sort phase; they assume that the default_parallel keyword will do the trick. Viraj set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100. I modified MRPrinter.java to print out the parallelism: {code} ... public void visitMROp(MapReduceOper mr) { mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism()); } ... {code} When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY. Attaching the script and the explain output Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
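The behavior the reporter expects can be sketched as a small precedence rule (hypothetical helper, not Pig's actual code): an explicit PARALLEL on the operator should win, then the script-level default_parallel, and only then the single-reducer fallback that the sort job is incorrectly taking.

```python
def requested_parallelism(op_parallel, default_parallel):
    """Reducer count an MR job should get. The bug reported here is that
    the final sort job fell through to 1 even when default_parallel was
    set to 100."""
    if op_parallel is not None and op_parallel > 0:
        return op_parallel            # explicit PARALLEL on the operator
    if default_parallel is not None and default_parallel > 0:
        return default_parallel       # script-level "set default_parallel N"
    return 1                          # last-resort fallback
```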
[jira] Created: (PIG-1131) Pig simple join does not work when it contains empty lines
Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.7.0 I have a simple script which does a JOIN. {code} input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); describe input1; input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); describe input2; joineddata = JOIN input1 by $0, input2 by $0; describe joineddata; store joineddata into 'result'; {code} The input data contains empty lines. The join fails in the Map phase with the following error in POLocalRearrange.java: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am surprised that the test cases did not detect this error. Could we add this data which contains empty lines to the test cases? 
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
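The IndexOutOfBoundsException above comes from the local-rearrange step pulling the join key out of a tuple that is shorter than expected, because an empty input line parses to a short (or empty) tuple. A minimal Python sketch of the guard that is missing (hypothetical function, not Pig's actual fix):

```python
def extract_key(tup, key_index):
    """Pull the join key out of a parsed tuple, tolerating short tuples
    produced by empty or malformed input lines instead of raising, which
    mirrors the IndexOutOfBoundsException: Index: 1, Size: 1 above."""
    if key_index >= len(tup):
        return None                    # no key; caller can drop the record
    return tup[key_index]

# Empty lines parse to short tuples; records without the key are skipped.
records = [("1", "a"), ("",), ("2", "b")]
keyed = [(extract_key(r, 1), r) for r in records if extract_key(r, 1) is not None]
```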
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1131: Attachment: simplejoinscript.pig junk2.txt junk1.txt Dummy datasets and pig script Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig I have a simple script, which does a JOIN. {code} input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); describe input1; input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); describe input2; joineddata = JOIN input1 by $0, input2 by $0; describe joineddata; store joineddata into 'result'; {code} The input data contains empty lines. The join fails in the Map phase with the following error in the PRLocalRearrange.java java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am surprised that the test cases did not detect this error. Could we add this data which contains empty lines to the testcases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1124) Unable to set Custom Job Name using the -Dmapred.job.name parameter
Unable to set Custom Job Name using the -Dmapred.job.name parameter --- Key: PIG-1124 URL: https://issues.apache.org/jira/browse/PIG-1124 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Priority: Minor Fix For: 0.6.0 As a Hadoop user I want to control the Job name for my analysis via the command line using the following construct: java -cp pig.jar:$HADOOP_HOME/conf -Dmapred.job.name=hadoop_junkie org.apache.pig.Main broken.pig -Dmapred.job.name should normally set my Hadoop Job name, but somehow during the formation of the job.xml in Pig this information is lost and the job name turns out to be: PigLatin:broken.pig The current workaround seems to be wiring it in the script itself (or using parameter substitution): set job.name 'my job' Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
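The expected precedence can be sketched as follows (a hypothetical model of the intended behavior, not Pig's code): a `set job.name` in the script wins, then a `-Dmapred.job.name` JVM property, and only then the generated `PigLatin:<script>` default. The bug is that the `-D` value is dropped when job.xml is built, so the default always wins.

```python
def effective_job_name(cli_props, script_set_name=None, script_file="broken.pig"):
    """Pick the Hadoop job name. cli_props models the -D system
    properties passed on the java command line."""
    if script_set_name:                       # "set job.name 'my job'"
        return script_set_name
    if "mapred.job.name" in cli_props:        # -Dmapred.job.name=...
        return cli_props["mapred.job.name"]
    return "PigLatin:" + script_file          # generated default
```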
[jira] Created: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement
Pig parser does not recognize its own data type in LIMIT statement -- Key: PIG-1101 URL: https://issues.apache.org/jira/browse/PIG-1101 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Priority: Minor Fix For: 0.6.0 I have a Pig script in which I specify the number of records to limit as a long type. {code} A = LOAD '/user/viraj/echo.txt' AS (txt:chararray); B = LIMIT A 10L; DUMP B; {code} I get a parser error: 2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered LONGINTEGER 10L at line 3, column 13. Was expecting: INTEGER ... at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017) In fact 10L seems to work in the foreach generate construct. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
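The fix amounts to letting the LIMIT clause accept a LONGINTEGER token as well as an INTEGER. A small Python sketch of a tolerant literal parser (illustrative, not the JavaCC grammar change itself):

```python
def parse_limit_literal(token):
    """Accept both INTEGER and LONGINTEGER literals for LIMIT, i.e. both
    '10' and '10L'; the QueryParser grammar above only accepted INTEGER."""
    text = token[:-1] if token and token[-1] in "lL" else token  # strip L suffix
    if not text.isdigit():
        raise ValueError("LIMIT expects an integer literal, got %r" % (token,))
    return int(text)
```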
[jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword
PigCookBook use of PARALLEL keyword --- Key: PIG-1081 URL: https://issues.apache.org/jira/browse/PIG-1081 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.5.0 Reporter: Viraj Bhat Fix For: 0.5.0 Hi all, I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the PARALLEL keyword. http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword We know that currently Pig 0.5 uses Hadoop 20 (as its default), which launches 1 reducer in all cases. In this documentation we recommend: num machines * num reduce slots per machine * 0.9. This guidance was valid for HoD (Hadoop on Demand), where you create your own Hadoop clusters, but if you are using either the Capacity Scheduler http://hadoop.apache.org/common/docs/current/capacity_scheduler.html or the Fair Share Scheduler http://hadoop.apache.org/common/docs/current/fair_scheduler.html, these numbers could mean that you are using around 90% of the reducer slots on your cluster. We should change this to something like: The number of reducers you may need for a particular construct in Pig which forms a Map Reduce boundary depends entirely on your data and the number of intermediate keys you are generating in your mappers. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently. Additionally, it is hard to define the optimum number of reducers, since it completely depends on the partitioner and the distribution of map (combiner) output keys. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
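The suggested wording above implies a size-based rule of thumb rather than a slot-count formula. A minimal sketch of that heuristic, assuming the ~500 MB per reducer figure from the proposed text (the function name and cap are made up for illustration):

```python
def reducers_for(input_bytes, bytes_per_reducer=500 * 1024**2, cap=999):
    """Size-based heuristic: aim for roughly 500 MB of intermediate data
    per reducer instead of 'num machines * slots * 0.9', clamped to a
    sane maximum."""
    n = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer  # ceiling
    return max(1, min(n, cap))
```

This is only a starting point; as the proposed text says, the right number ultimately depends on the partitioner and the key distribution.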
[jira] Created: (PIG-1084) Pig CookBook documentation Take Advantage of Join Optimization additions:Merge and Skewed Join
Pig CookBook documentation Take Advantage of Join Optimization additions: Merge and Skewed Join Key: PIG-1084 URL: https://issues.apache.org/jira/browse/PIG-1084 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 Hi all, We have a host of Join optimizations that have been implemented recently in Pig to improve performance. These include: http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html#JOIN 1) Merge Join 2) Skewed Join It would be nice to mention the Merge Join and Skewed Join in the following section of the PigCookBook http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Take+Advantage+of+Join+Optimization Can we update this for release 0.6? Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits
[ https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12773744#action_12773744 ] Viraj Bhat commented on PIG-1060: - Hi Ankur and Richard, I have a script which demonstrates a similar problem but can be solved by using the -M option. This script can reproduce the problem even without the UNION operator, but it has properties 1 and 2 of the original problem description. Try commenting out the F alias. It works fine. {code} ORIGINALDATA = load '/user/viraj/somedata.txt' using PigStorage() as (col1, col2, col3, col4, col5, col6, col7, col8); --Check data A = foreach ORIGINALDATA generate col1, col2, col3, col4, col5, col6; B = group A all; C = foreach B generate COUNT(A); store C into '/user/viraj/result1'; D = filter A by (col1 == col2) or (col1 == col3); E = group D all; F = foreach E generate COUNT(D); --try commenting F store F into '/user/viraj/result2'; G = filter D by (col4 == col5); H = group G all; I = foreach H generate COUNT(G); store I into '/user/viraj/result3'; J = filter G by (((col6 == 'm') or (col6 == 'M')) and (col6 == 1)) or (((col6 == 'f') or (col6 == 'F')) and (col6 == 0)) or ((col6 == '') and (col6 == -1)); K = group J all; L = foreach K generate COUNT(J); store L into '/user/viraj/result4'; {code} MultiQuery optimization throws error for multi-level splits --- Key: PIG-1060 URL: https://issues.apache.org/jira/browse/PIG-1060 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Assignee: Richard Ding Consider the following scenario: 1. Multi-level splits in the map plan. 2. Each split branch further progressing across a local-global rearrange. 3. Output of each of these finally merged via a UNION. MultiQuery optimizer throws the following error in such a case: ERROR 2146: Internal Error. Inconsistency in key index found during optimization. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1064) Behvaiour of COGROUP with and without schema when using * operator
Behaviour of COGROUP with and without schema when using * operator Key: PIG-1064 URL: https://issues.apache.org/jira/browse/PIG-1064 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have 2 tab-separated files, 1.txt and 2.txt $ cat 1.txt 1 2 2 3 $ cat 2.txt 1 2 2 3 I use the COGROUP feature of Pig in the following way: $ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main {code} grunt> A = load '1.txt'; grunt> B = load '2.txt' as (b0, b1); grunt> C = cogroup A by *, B by *; {code} 2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans Details at logfile: pig_1256845224752.log == If I reverse the order of the schemas: {code} grunt> A = load '1.txt' as (a0, a1); grunt> B = load '2.txt'; grunt> C = cogroup A by *, B by *; {code} 2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both. Details at logfile: pig_1256845224752.log == Now running without a schema: {code} grunt> A = load '1.txt'; grunt> B = load '2.txt'; grunt> C = cogroup A by *, B by *; grunt> dump C; {code} 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: file:/tmp/temp-319926700/tmp-1990275961 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! ((1,2),{(1,2)},{(1,2)}) ((2,3),{(2,3)},{(2,3)}) == Is this a bug or a feature? Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
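COGROUP by * groups each relation by its whole tuple, which is only well-defined when both sides have the same arity; that is why the mixed schema / no-schema cases are rejected. A small Python sketch of the semantics that the no-schema run demonstrates (illustrative, not Pig's implementation):

```python
def cogroup(a, b):
    """COGROUP A BY *, B BY *: for each distinct whole-tuple key, collect
    the matching tuples from each relation into separate bags."""
    keys = sorted(set(a) | set(b))
    return [(k,
             [t for t in a if t == k],   # bag of A tuples for this key
             [t for t in b if t == k])   # bag of B tuples for this key
            for k in keys]

# Mirrors the dump above: ((1,2),{(1,2)},{(1,2)}) ((2,3),{(2,3)},{(2,3)})
rows = cogroup([(1, 2), (2, 3)], [(1, 2), (2, 3)])
```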
[jira] Created: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double
PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double --- Key: PIG-1031 URL: https://issues.apache.org/jira/browse/PIG-1031 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Reporter: Viraj Bhat Fix For: 0.5.0, 0.6.0 I have data stored in a text file as: {(4153E765)} {(AF533765)} I try reading it using PigStorage as: {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:bytearray)}); dump A; {code} I get the following results: {code} ({(Infinity)}) ({(AF533765)}) {code} The problem seems to be with the method: parseFromBytes(byte[] b) in class Utf8StorageConverter. This method uses the TextDataParser (class generated via jjt) to interpret the type of data from content, even though the schema says it is a bytearray. TextDataParser.jjt sample code {code} TOKEN : { ... DOUBLENUMBER: ([-,+])? FLOATINGPOINT ( [e,E] ([ -,+])? FLOATINGPOINT )? FLOATNUMBER: DOUBLENUMBER ([f,F])? ... } {code} I tried the following options, but they will not work, as we still need to call bytesToBag(byte[] b) in the Utf8StorageConverter class. {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term)}); A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:chararray)}); {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
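The root cause is type inference from content: the hex-looking string `4153E765` happens to match the scientific-notation token (4153e765), which overflows to Infinity. A Python sketch of the contrast between guessing and honoring the declared schema type (hypothetical function, not the Utf8StorageConverter API):

```python
def bytes_to_typed(raw, declared):
    """Convert a raw field according to the declared schema type instead
    of guessing from content. Guessing is what turns '4153E765' into a
    double: it parses as 4153e765, which overflows to Infinity."""
    if declared in ("bytearray", "chararray"):
        return raw                      # keep the text untouched
    if declared == "double":
        return float(raw)               # float('4153E765') -> inf
    if declared in ("int", "long"):
        return int(raw)
    raise ValueError("unsupported type: %s" % declared)

kept = bytes_to_typed("4153E765", "bytearray")   # stays a string
guessed = bytes_to_typed("4153E765", "double")   # the reported bug, as a value
```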
[jira] Updated: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double
[ https://issues.apache.org/jira/browse/PIG-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1031: Description: I have a data stored in a text file as: {(4153E765)} {(AF533765)} I try reading it using PigStorage as: {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:bytearray)}); dump A; {code} I get the following results: ({(Infinity)}) ({(AF533765)}) The problem seems to be with the method: parseFromBytes(byte[] b) in class Utf8StorageConverter. This method uses the TextDataParser (class generated via jjt) to interpret the type of data from content, even though the schema tells it is a bytearray. TextDataParser.jjt sample code {code} TOKEN : { ... DOUBLENUMBER: ([-,+])? FLOATINGPOINT ( [e,E] ([ -,+])? FLOATINGPOINT )? FLOATNUMBER: DOUBLENUMBER ([f,F])? ... } {code} I tried the following options, but it will not work as we need to call bytesToBag(byte[] b) in the Utf8StorageConverter class. {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term)}); A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:chararray)}); {code} Viraj was: I have a data stored in a text file as: {(4153E765)} {(AF533765)} I try reading it using PigStorage as: {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:bytearray)}); dump A; {code} I get the following results: {code} ({(Infinity)}) ({(AF533765)}) {code} The problem seems to be with the method: parseFromBytes(byte[] b) in class Utf8StorageConverter. This method uses the TextDataParser (class generated via jjt) to interpret the type of data from content, even though the schema tells it is a bytearray. TextDataParser.jjt sample code {code} TOKEN : { ... DOUBLENUMBER: ([-,+])? FLOATINGPOINT ( [e,E] ([ -,+])? FLOATINGPOINT )? FLOATNUMBER: DOUBLENUMBER ([f,F])? ... 
} {code} I tried the following options, but it will not work as we need to call bytesToBag(byte[] b) in the Utf8StorageConverter class. {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term)}); A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:chararray)}); {code} Viraj PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double --- Key: PIG-1031 URL: https://issues.apache.org/jira/browse/PIG-1031 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Reporter: Viraj Bhat Fix For: 0.5.0, 0.6.0 I have a data stored in a text file as: {(4153E765)} {(AF533765)} I try reading it using PigStorage as: {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:bytearray)}); dump A; {code} I get the following results: ({(Infinity)}) ({(AF533765)}) The problem seems to be with the method: parseFromBytes(byte[] b) in class Utf8StorageConverter. This method uses the TextDataParser (class generated via jjt) to interpret the type of data from content, even though the schema tells it is a bytearray. TextDataParser.jjt sample code {code} TOKEN : { ... DOUBLENUMBER: ([-,+])? FLOATINGPOINT ( [e,E] ([ -,+])? FLOATINGPOINT )? FLOATNUMBER: DOUBLENUMBER ([f,F])? ... } {code} I tried the following options, but it will not work as we need to call bytesToBag(byte[] b) in the Utf8StorageConverter class. {code} A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term)}); A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:chararray)}); {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization
ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization --- Key: PIG-978 URL: https://issues.apache.org/jira/browse/PIG-978 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a Pig script of the following form, which I execute using Multi-query optimization. {code} A = load '/user/viraj/firstinput' using PigStorage(); B = group C = ..aggregation function store C into '/user/viraj/firstinputtempresult/days1'; .. Atab = load '/user/viraj/secondinput' using PigStorage(); Btab = group Ctab = ..aggregation function store Ctab into '/user/viraj/secondinputtempresult/days1'; .. E = load '/user/viraj/firstinputtempresult/' using PigStorage(); F = group G = aggregation function store G into '/user/viraj/finalresult1'; Etab = load '/user/viraj/secondinputtempresult/' using PigStorage(); Ftab = group Gtab = aggregation function store Gtab into '/user/viraj/finalresult2'; {code} The error (2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log) is due to the mismatch of store/load commands. The script first stores files into the 'days1' directory (store C into '/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later loads from the top level directory (E = load '/user/viraj/firstinputtempresult/' using PigStorage()) instead of the original directory (/user/viraj/firstinputtempresult/days1). The current multi-query optimizer can't solve the dependency between these two commands--they have different load file paths. So the jobs will run concurrently and result in the errors. The solution is to add an 'exec' or 'run' command after the first two stores. This will force the first two store commands to run before the remaining commands. 
It would be nice to see this fixed as part of an enhancement to the Multi-query optimization. We should either disable the Multi-query optimization or throw a warning/error message so that the user can correct the load/store statements. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
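Detecting this dependency amounts to a path check: a later LOAD depends on an earlier STORE if it reads the stored directory itself or an ancestor of it (e.g. loading '/user/viraj/firstinputtempresult/' after storing into '.../firstinputtempresult/days1'). A sketch of such a check, assuming POSIX-style HDFS paths (the function is hypothetical, not the optimizer's code):

```python
import posixpath

def load_depends_on_store(load_path, store_path):
    """True if a LOAD reads the directory an earlier STORE wrote into,
    or any ancestor of it. Matching only identical paths, as the current
    optimizer does, misses the ancestor case and lets the jobs run
    concurrently."""
    lp = posixpath.normpath(load_path)    # '/a/b/' -> '/a/b'
    sp = posixpath.normpath(store_path)
    return sp == lp or sp.startswith(lp + "/")
```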
[jira] Created: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
Issues with mv command when used after store when using -param_file/-param options -- Key: PIG-974 URL: https://issues.apache.org/jira/browse/PIG-974 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Environment: Hadoop 18 and 20 Reporter: Viraj Bhat Fix For: 0.6.0 Attachments: studenttab10k I have a Pig script which moves the final output to another HDFS directory to signal completion, so that another Pig script can start working on these results. {code} studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, age:int,gpa:float); X = GROUP studenttab by age; Y = FOREACH X GENERATE group, COUNT(studenttab); store Y into '$finalop' using PigStorage(); mv '$finalop' '$finalmove'; {code} where finalop and finalmove are parameters used for storing intermediate and final results. I run this script as follows: {code} $shell java -cp pig20.jar:/path/tohadoop/site.xml -Dmapred.job.queue.name=default org.apache.pig.Main -M -param finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove testmove.pig {code} or using the param_file option {code} $shell java -cp pig20.jar:/path/tohadoop/site.xml -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file moveparamfile testmove.pig {code} The underlying Map Reduce jobs run well but the move command seems to be failing: 2009-09-23 23:26:21,781 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pigscripts/pig_1253748381778.log 2009-09-23 23:26:21,963 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020 2009-09-23 23:26:22,227 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:50300 2009-09-23 23:26:27,187 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner 2009-09-23 23:26:27,203 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2009-09-23 23:26:27,203 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2009-09-23 23:26:28,828 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2009-09-23 23:26:29,423 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-09-23 23:26:29,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-09-23 23:27:29,828 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2009-09-23 23:27:59,764 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2009-09-23 23:28:57,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-09-23 23:28:57,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: /user/viraj/finaloutput 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 60 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 420 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log {code} $shell hadoop fs -ls /user/viraj/finaloutput Found 1 items -rw--- 3 viraj users420 2009-09-23 23:42 /user/viraj/finaloutput/part-0 {code} Opening the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' does not exist. java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist. at
[jira] Updated: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
[ https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-974:
---
Attachment: studenttab10k

Testdata
[jira] Commented: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
[ https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758962#action_12758962 ]

Viraj Bhat commented on PIG-974:

It turns out that the problem was due to the single quotes:

{code}
mv '$finalop' '$finalmove';
{code}

This modified script works:

{code}
mv $finalop $finalmove;
{code}

The hard part is knowing when to use single quotes around parameters and when not to; this is not documented in the manual. The error message is also confusing:

===
java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
===

I assumed that the single quotes around the filename printed in the error message were simply quoting a correct file name:

{code}
$shell hadoop fs -ls '/user/viraj/finaloutput'
Found 1 items
-rw--- 3 viraj users 420 2009-09-24 01:16 /user/viraj/finaloutput/part-0
{code}

Thanks Viraj
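To make the quoting pitfall concrete, here is a sketch of what parameter substitution produces in each case, using the paths from this report. The expanded forms are inferred from the comment above (the quotes surviving substitution and becoming part of the path), not taken from actual Pig output:

{code}
-- quoted: the single quotes survive substitution and become part of the
-- path handed to the fs command, which then matches no real HDFS path
mv '$finalop' '$finalmove';
-- expands to: mv '/user/viraj/finaloutput' '/user/viraj/finalmove'

-- unquoted: the bare paths are substituted in and the move succeeds
mv $finalop $finalmove;
-- expands to: mv /user/viraj/finaloutput /user/viraj/finalmove
{code}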
[jira] Created: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
Cross site HDFS access using the default.fs.name not possible in Pig Key: PIG-940 URL: https://issues.apache.org/jira/browse/PIG-940 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Environment: Hadoop 20 Reporter: Viraj Bhat Fix For: 0.3.0

I have a script which accesses data from a remote HDFS location (an HDFS instance at hdfs://remotemachine1.company.com/), as I do not want to copy this huge amount of data between HDFS locations. However, I want my Pig script to write data to the HDFS running on localmachine.company.com. Currently Pig does not support that behavior and complains that hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.

{code}
A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b);
B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d);
C = JOIN A by a, B by c;
store C into 'output' using PigStorage();
{code}

===
2009-09-01 00:37:24,032 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localmachine.company.com:8020
2009-09-01 00:37:24,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localmachine.company.com:50300
2009-09-01 00:37:24,567 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage-POForEach to POJoinPackage
2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2009-09-01 00:37:26,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2009-09-01 00:37:26,249 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-09-01 00:37:26,747 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed!
2009-09-01 00:37:26,756 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480
2009-09-01 00:37:26,756 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
===

The error file in Pig contains:

===
ERROR 2998: Unhandled internal error. org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
at org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not
[jira] Commented: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
[ https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749722#action_12749722 ]

Viraj Bhat commented on PIG-940:

One important point to add: the remote files are visible from the local machine.

{code}
localmachine.company.com prompt hadoop fs -ls hdfs://remotemachine1.company.com/user/viraj//*.txt
-rw-r--r-- 3 viraj users 13 2009-08-13 23:42 /user/viraj/A1.txt
-rw-r--r-- 3 viraj users 8 2009-08-29 00:51 /user/viraj/B1.txt
{code}
[jira] Created: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0

I have a Pig script which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0, which corresponds to the first name of the student.

{code}
register mymapudf.jar;
data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray, age:long, marks:float);
genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks;
getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:

===
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742668#action_12742668 ]

Viraj Bhat commented on PIG-919:

This problem can be solved simply by casting firstname to chararray! Why?

{code}
groupgenmap = group filternonnullfirstnames by (chararray)firstname;
dump groupgenmap;
{code}

Is there a problem with the UDF?

Attachments: GenHashList.java, mapscript.pig, mymapudf.jar

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
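For completeness, here is the full script with the cast applied where firstname is first projected (rather than in the group statement); values looked up from a map are bytearrays until cast, which is consistent with the type mismatch reported above. This is an untested sketch of the fix described in the comment:

{code}
register mymapudf.jar;
data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray, age:long, marks:float);
genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks;
-- cast the map lookup to chararray so the group key has a concrete type
getfirstnames = foreach genmap generate (chararray)bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
dump groupgenmap;
{code}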
[jira] Commented: (PIG-913) Error in Pig script when grouping on chararray column
[ https://issues.apache.org/jira/browse/PIG-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740360#action_12740360 ]

Viraj Bhat commented on PIG-913:

The following works, though (as does loading with no declared type, as in the commented-out line):

{code}
data = LOAD '/user/viraj/studenttab10k' AS (s:bytearray);
-- data = LOAD '/user/viraj/studenttab10k' AS (s);
dataSmall = limit data 100;
bb = GROUP dataSmall by $0;
dump bb;
{code}

Error in Pig script when grouping on chararray column - Key: PIG-913 URL: https://issues.apache.org/jira/browse/PIG-913 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.4.0

I have a very simple script which fails at parse time due to the schema I specified in the loader.

{code}
data = LOAD '/user/viraj/studenttab10k' AS (s:chararray);
dataSmall = limit data 100;
bb = GROUP dataSmall by $0;
dump bb;
{code}

=
2009-08-06 18:47:56,297 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
09/08/06 18:47:56 INFO pig.Main: Logging error messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
2009-08-06 18:47:56,459 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://localhost:9000
2009-08-06 18:47:56,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: localhost:9001
2009-08-06 18:47:57,008 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias bb
09/08/06 18:47:57 ERROR grunt.Grunt: ERROR 1002: Unable to store alias bb
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
=

Pig Stack Trace
---
ERROR 1002: Unable to store alias bb
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias bb
at org.apache.pig.PigServer.openIterator(PigServer.java:481)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:531)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias bb
at org.apache.pig.PigServer.store(PigServer.java:536)
at org.apache.pig.PigServer.openIterator(PigServer.java:464)
... 6 more
Caused by: java.lang.NullPointerException
at org.apache.pig.impl.logicalLayer.LOCogroup.unsetSchema(LOCogroup.java:359)
at org.apache.pig.impl.logicalLayer.optimizer.SchemaRemover.visit(SchemaRemover.java:64)
at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:335)
at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:46)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildSchemas(LogicalTransformer.java:67)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:187)
at org.apache.pig.PigServer.compileLp(PigServer.java:854)
at org.apache.pig.PigServer.compileLp(PigServer.java:791)
at org.apache.pig.PigServer.store(PigServer.java:509)
... 7 more
=

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-828) Problem accessing a tuple within a bag
Problem accessing a tuple within a bag -- Key: PIG-828 URL: https://issues.apache.org/jira/browse/PIG-828 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0

The Pig script below creates a tuple containing 3 columns, 2 of which are chararrays; the third column is a bag holding a constant chararray. The script later projects the bag nested within the tuple.

{code}
a = load 'studenttab5' as (name, age, gpa);
b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id, singlebag:{singleTuple:(single)}, article);
describe b;
c = foreach b generate document.singlebag;
dump c;
{code}

When we run this script we get a run-time error in the Map phase:

java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-828) Problem accessing a tuple within a bag
[ https://issues.apache.org/jira/browse/PIG-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-828:
---------------------------
    Attachment: tupleacc.pig
                studenttab5

Input script and data.

Problem accessing a tuple within a bag
--------------------------------------
                 Key: PIG-828
                 URL: https://issues.apache.org/jira/browse/PIG-828
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Viraj Bhat
             Fix For: 0.3.0
         Attachments: studenttab5, tupleacc.pig

The Pig script below creates a tuple with 3 columns: 2 chararrays, and a bag holding a constant chararray. The script then projects the bag from within the tuple.

{code}
a = load 'studenttab5' as (name, age, gpa);
b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id, singlebag:{singleTuple:(single)}, article);
describe b;
c = foreach b generate document.singlebag;
dump c;
{code}

Running this script produces a run-time error in the Map phase:

java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
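The stack trace boils down to an unchecked cast: POProject assumes the value it projects through is a bag, but here it receives the constructed tuple itself. The Python sketch below is an invented analogue of that failure mode (none of these names exist in Pig), for readers without the Pig source at hand:

```python
# Invented analogue of the ClassCastException in POProject: a projection
# that assumes its input is a bag (modeled as a list of tuples) but is
# handed a bare tuple instead.

def project_field(value, index):
    """Project column `index` from a bag; fail on anything else."""
    if not isinstance(value, list):
        # stands in for the DataBag cast that throws in POProject
        raise TypeError("cannot treat %s as a bag" % type(value).__name__)
    return [(t[index],) for t in value]

singlebag = [("sms",)]                      # {('sms')} from the script
document = ("viraj", singlebag, "pig")      # the constructed tuple

print(project_field(singlebag, 0))          # fine: [('sms',)]
try:
    project_field(document, 1)              # fails, like the map task
except TypeError as e:
    print("error:", e)
```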
[jira] Created: (PIG-816) PigStorage() does not accept Unicode characters in its constructor
PigStorage() does not accept Unicode characters in its constructor
------------------------------------------------------------------
                 Key: PIG-816
                 URL: https://issues.apache.org/jira/browse/PIG-816
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Viraj Bhat
            Priority: Critical
             Fix For: 0.3.0

A simple Pig script that passes a Unicode character to the PigStorage() constructor fails with the following error:

{code}
studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab by age;
Y2 = FOREACH X2 GENERATE group, COUNT(studenttab);
store Y2 into '/user/viraj/y2' using PigStorage('\u0001');
{code}

ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backend error: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference #1 is an invalid XML character.

Attaching log file.
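The SAXParseException suggests the delimiter ends up inside XML (an assumption based on the error text: Hadoop serializes job configuration as XML), and U+0001 is simply not a legal XML 1.0 character, while tab (U+0009) is. A minimal Python check of that XML rule:

```python
# U+0001 vs XML 1.0: the character reference &#9; (tab) is legal,
# &#1; is not. Assumption from the SAXParseException: the PigStorage
# delimiter passes through XML-serialized job configuration.
from xml.dom.minidom import parseString

parseString('<delimiter>&#9;</delimiter>')      # '\t' would have worked
try:
    parseString('<delimiter>&#1;</delimiter>')  # '\u0001' cannot survive XML
except Exception as e:
    print(type(e).__name__)                     # an expat parse error
```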
[jira] Updated: (PIG-816) PigStorage() does not accept Unicode characters in its constructor
[ https://issues.apache.org/jira/browse/PIG-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-816:
---------------------------
    Attachment: pig_1243043613713.log

Log file with the detailed error message.

PigStorage() does not accept Unicode characters in its constructor
------------------------------------------------------------------
                 Key: PIG-816
                 URL: https://issues.apache.org/jira/browse/PIG-816
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Viraj Bhat
            Priority: Critical
             Fix For: 0.3.0
         Attachments: pig_1243043613713.log

A simple Pig script that passes a Unicode character to the PigStorage() constructor fails with the following error:

{code}
studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab by age;
Y2 = FOREACH X2 GENERATE group, COUNT(studenttab);
store Y2 into '/user/viraj/y2' using PigStorage('\u0001');
{code}

ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backend error: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference #1 is an invalid XML character.

Attaching log file.
[jira] Commented: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710862#action_12710862 ]

Viraj Bhat commented on PIG-656:
--------------------------------

Another Pig parse issue arises when a UDF is defined in a package whose path contains the "matches" keyword. So something like:

define DISTANCE_SCORE mypackage.pig.udf.matches.LevensteinMatchUDF();

gives a parse error:

ERROR 1000: Error during parsing. Encountered "matches" at line 11, column 42. Was expecting: IDENTIFIER ...

It is legitimate for package or even UDF names to contain Pig keywords - shouldn't Pig be robust enough to handle simple grammar disambiguation of this sort?

Use of eval word in the package hierarchy of a UDF causes parse exception
-------------------------------------------------------------------------
                 Key: PIG-656
                 URL: https://issues.apache.org/jira/browse/PIG-656
             Project: Pig
          Issue Type: Bug
          Components: documentation, grunt
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
             Fix For: 0.2.0
         Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}

Looking at the parser source (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt), it appears that EVAL is a keyword in Pig. Two clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?

Viraj
[jira] Reopened: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat reopened PIG-656:
----------------------------

Documentation should be updated on the eval keyword and what it actually does; otherwise users can get lost trying to track down this error.

Use of eval word in the package hierarchy of a UDF causes parse exception
-------------------------------------------------------------------------
                 Key: PIG-656
                 URL: https://issues.apache.org/jira/browse/PIG-656
             Project: Pig
          Issue Type: Bug
          Components: documentation, grunt
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
             Fix For: 0.2.0
         Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}

Looking at the parser source (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt), it appears that EVAL is a keyword in Pig. Two clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?

Viraj
[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
[ https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-656:
---------------------------
    Summary: Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception  (was: Use of eval word in the package hierarchy of a UDF causes parse exception)

Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception
-----------------------------------------------------------------------------------------
                 Key: PIG-656
                 URL: https://issues.apache.org/jira/browse/PIG-656
             Project: Pig
          Issue Type: Bug
          Components: documentation, grunt
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
             Fix For: 0.2.0
         Attachments: mywordcount.txt, TOKENIZE.jar

Consider a Pig script which does something similar to a word count. It uses the built-in TOKENIZE function, but packages it inside a class hierarchy such as mypackage.eval:

{code}
register TOKENIZE.jar
my_src = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t') AS (mlist: chararray);
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
describe modules;
grouped = GROUP modules BY $0;
describe grouped;
counts = FOREACH grouped GENERATE COUNT(modules), group;
ordered = ORDER counts BY $0;
dump ordered;
{code}

The parser complains:

2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}

Looking at the parser source (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt), it appears that EVAL is a keyword in Pig. Two clarifications:
1) Is there documentation on what the EVAL keyword actually is?
2) Is the EVAL keyword actually implemented?

Viraj
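What these reports ask for is context-sensitive keyword handling: "eval" or "matches" should remain identifiers when they appear as segments of a dotted UDF path. A toy Python sketch of that disambiguation (invented names; nothing here mirrors Pig's JavaCC grammar):

```python
# Toy context-sensitive tokenizer: reserved words stay identifiers when
# they are segments of a dotted package path.
KEYWORDS = {"eval", "matches", "define", "load"}

def classify(tokens):
    out = []
    for i, tok in enumerate(tokens):
        in_dotted_path = (i > 0 and tokens[i - 1] == ".") or \
                         (i + 1 < len(tokens) and tokens[i + 1] == ".")
        if tok == ".":
            out.append((tok, "DOT"))
        elif tok in KEYWORDS and not in_dotted_path:
            out.append((tok, "KEYWORD"))
        else:
            out.append((tok, "IDENTIFIER"))
    return out

# mypackage.eval.TOKENIZE parses; a lone 'eval' is still a keyword
print(classify(["mypackage", ".", "eval", ".", "TOKENIZE"]))
print(classify(["eval"]))
```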
[jira] Created: (PIG-812) COUNT(*) does not work
COUNT(*) does not work
----------------------
                 Key: PIG-812
                 URL: https://issues.apache.org/jira/browse/PIG-812
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
            Priority: Critical
             Fix For: 0.2.0

A Pig script to count the number of rows in the studenttab10k file, which contains 10k records:

{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab ALL;
describe X2;
Y2 = FOREACH X2 GENERATE COUNT(*);
explain Y2;
DUMP Y2;
{code}

returns the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Y2
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log

If you look at the log file:

Caused by: java.lang.ClassCastException
        at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
        at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
[jira] Updated: (PIG-812) COUNT(*) does not work
[ https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-812:
---------------------------
    Attachment: studenttab10k

Input file.

COUNT(*) does not work
----------------------
                 Key: PIG-812
                 URL: https://issues.apache.org/jira/browse/PIG-812
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
            Priority: Critical
             Fix For: 0.2.0
         Attachments: studenttab10k

A Pig script to count the number of rows in the studenttab10k file, which contains 10k records:

{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int, gpa:float);
X2 = GROUP studenttab ALL;
describe X2;
Y2 = FOREACH X2 GENERATE COUNT(*);
explain Y2;
DUMP Y2;
{code}

returns the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Y2
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log

If you look at the log file:

Caused by: java.lang.ClassCastException
        at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
        at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
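The ClassCastException in COUNT$Initial.exec fits a simple reading: the initial stage expects its argument to be a bag, but COUNT(*) hands it something that is not one (COUNT(studenttab) works in the same script). A toy Python analogue of that mismatch (invented names; not Pig's implementation):

```python
# Invented analogue of COUNT$Initial.exec's failed cast: the initial
# stage expects a bag argument; COUNT(*) hands it the grouped tuple.

def count_initial(arg):
    if not isinstance(arg, list):   # stands in for the (DataBag) cast
        raise TypeError("expected a bag, got %s" % type(arg).__name__)
    return len(arg)

studenttab = [("alice", 20, 3.5), ("bob", 21, 3.1)]  # the grouped bag

print(count_initial(studenttab))        # COUNT(studenttab): 2
try:
    count_initial(("all", studenttab))  # COUNT(*): not a bag
except TypeError as e:
    print("error:", e)
```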
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710619#action_12710619 ]

Viraj Bhat commented on PIG-774:
--------------------------------

Hi Daniel,
For this patch to work, is it important to set LESSCHARSET to utf-8 and LANG to en_US.utf8? I am observing that a dry run using pig -r does not yield the right parameter substitution if these variables are not set, and they are not set by default on RHEL 5.0. You have mentioned this in your earlier comments!
Thanks
Viraj

Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
-------------------------------------------------------------------------------------------------------------------------------------
                 Key: PIG-774
                 URL: https://issues.apache.org/jira/browse/PIG-774
             Project: Pig
          Issue Type: Bug
          Components: grunt, impl
    Affects Versions: 0.0.0
            Reporter: Viraj Bhat
            Assignee: Daniel Dai
            Priority: Critical
             Fix For: 0.3.0
         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch

I created a very small test case in which I did the following:
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and a hard-coded Chinese character.

Pig script: chinese_data.pig
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}

Parameter file: nextgen_paramfile
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'

Input file: /user/viraj/chinese.txt
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會

I ran the above set of inputs in the following ways:

Run 1:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}
2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Run 2: removed the parameter substitution in the Pig script and instead used the following statement:
{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig

2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

In both cases:
{code}
shell$ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj
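The LANG/LESSCHARSET dependence mentioned in the comment comes down to which charset decodes the script and parameter file. A small Python sketch of the mismatch: the same UTF-8 bytes decoded under a single-byte charset no longer equal the intended string, so a comparison like the filter above silently matches nothing.

```python
# The same UTF-8 bytes, decoded under the wrong charset, stop matching
# the literal in the script - which is what an unset LANG/LESSCHARSET
# can do to parameter substitution.
querystring = "歌手香港情牽女人心演唱會"
raw = querystring.encode("utf-8")       # bytes as stored in the param file

assert raw.decode("utf-8") == querystring  # utf-8 locale: round-trips
mojibake = raw.decode("latin-1")           # single-byte locale: garbled
print(mojibake == querystring)             # False: the filter finds nothing
```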
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-798:
---------------------------
    Description:

In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():

{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = group A by name;
store B into '/user/viraj/binstoragecreateop' using BinStorage();
dump B;
{code}

I later load the file 'binstoragecreateop' in the following way:

{code}
A = load '/user/viraj/binstoragecreateop' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Result:
(Amy)
(Fred)

The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}

So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?

  was: the same description, with the error message not wrapped in a {code} block

Schema errors when using PigStorage and none when using BinStorage??
--------------------------------------------------------------------
                 Key: PIG-798
                 URL: https://issues.apache.org/jira/browse/PIG-798
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.2.0
            Reporter: Viraj Bhat
             Fix For: 0.2.0
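The inconsistency being questioned can be modeled as a schema-prefix merge with and without promotion of the unknown bytearray type to a declared type. This is a toy Python model (invented function; not Pig's actual merge logic):

```python
# Toy model of the schema-prefix merge: the BinStorage path behaves as
# if an unknown bytearray field may be promoted to the declared type,
# while the PigStorage path rejects the same combination.

def merge_field(stored, declared, promote_bytearray):
    if stored == declared:
        return declared
    if stored == "bytearray" and promote_bytearray:
        return declared
    raise ValueError("Type mismatch merging schema prefix. Field Schema: "
                     "%s. Other Field Schema: %s" % (stored, declared))

print(merge_field("bytearray", "chararray", promote_bytearray=True))
try:
    merge_field("bytearray", "chararray", promote_bytearray=False)
except ValueError as e:
    print("error:", e)
```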