[jira] Created: (PIG-1660) Consider passing result of COUNT/COUNT_STAR to LIMIT

2010-09-30 Thread Viraj Bhat (JIRA)
Consider passing result of COUNT/COUNT_STAR to LIMIT 
-

 Key: PIG-1660
 URL: https://issues.apache.org/jira/browse/PIG-1660
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.9.0


In realistic scenarios we need to split a dataset into segments using LIMIT, 
and we would like to achieve that within the same Pig script. Here is a case:

{code}
A = load '$DATA' using PigStorage(',') as (id, pvs);
B = group A all;
C = foreach B generate COUNT_STAR(A) as row_cnt;
-- get the low 20% segment
D = order A by pvs;
E = limit D (C.row_cnt * 0.2);
store E into '$Eoutput';
-- get the high 20% segment
F = order A by pvs DESC;
G = limit F (C.row_cnt * 0.2);
store G into '$Goutput';
{code}

Since LIMIT only accepts constants, we have to split the operation into two 
steps in order to pass constants to the LIMIT statements. Please consider 
adding this feature so the processing can be more efficient.
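
For reference, a minimal sketch of the current two-step workaround, assuming 
the count produced by the first script is turned into the segment size and 
passed back into the second script via -param (the $ROWCNT name is illustrative):

{code}
-- step1.pig: materialize the row count
A = load '$DATA' using PigStorage(',') as (id, pvs);
B = group A all;
C = foreach B generate COUNT_STAR(A) as row_cnt;
store C into '$CNTDIR';

-- step2.pig: run with -param ROWCNT=<20% of the stored count>
A = load '$DATA' using PigStorage(',') as (id, pvs);
D = order A by pvs;
E = limit D $ROWCNT;  -- LIMIT accepts a constant, so the value is substituted in
store E into '$Eoutput';
{code}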

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1630) Support param_files to be loaded into HDFS

2010-09-20 Thread Viraj Bhat (JIRA)
Support param_files to be loaded into HDFS
--

 Key: PIG-1630
 URL: https://issues.apache.org/jira/browse/PIG-1630
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I want to place the parameters of a Pig script in a param_file. 

But instead of this file being in the local file system where I run my java 
command, I want this to be on HDFS.

{code}
$ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile 
myscript.pig
{code}

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1631) Support for 2-level nested foreach

2010-09-20 Thread Viraj Bhat (JIRA)
Support for 2-level nested foreach
-

 Key: PIG-1631
 URL: https://issues.apache.org/jira/browse/PIG-1631
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


What I would like to do is generate certain metrics for every listing 
impression in the context of a page, like clicks on the page, etc. So, I first 
group to get clicks and impressions together. Now, I want to iterate through 
the mini-table (one per serve-id) and compute metrics. Since a nested foreach 
within a foreach is not supported, I ended up writing a UDF that took both 
the bags and computed the metric. It would have been more elegant to keep the 
logic of iterating over the records in the Pig script itself. 

Here is some pseudocode of how I would have liked to write it:

{code}
-- Let us say in our page context there was a click on rank 2, for which there 
-- were 3 ads 
A1 = LOAD '...' AS (page_id, rank); -- clicks. 
A2 = Load '...' AS (page_id, rank); -- impressions

B = COGROUP A1 by (page_id), A2 by (page_id); 

-- Let us say B contains the following schema 
-- (group, {(A1...)} {(A2...)})  
-- Each record in B would be:
-- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3))}

C = FOREACH B GENERATE {
    -- This won't work in current Pig either. Basically, I would like a
    -- mini-table which represents an entire serve.
    D = FLATTEN(A1), FLATTEN(A2);
    FOREACH D GENERATE
        page_id_1,
        A2::rank,
        -- This UDF returns a value (v1, v2, v3, ...) depending on A1::rank and A2::rank
        SOMEUDF(A1::rank, A2::rank);
};
-- expected output:
-- (page_id, 1, v1)
-- (page_id, 2, v2)
-- (page_id, 3, v3)

DUMP C;
{code}

P.S.: I understand that I could alternatively have flattened the fields of B, 
done a GROUP on page_id, and then iterated through the records calling 
'SOMEUDF' appropriately, but that would be 2 map-reduce operations AFAIK. 
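
For comparison, a rough sketch of that flatten-then-regroup alternative, 
reusing the relation and UDF names from the pseudocode above (and assuming 
SOMEUDF could be rewritten to take the regrouped bag):

{code}
B = COGROUP A1 BY page_id, A2 BY page_id;
-- first M/R job: flatten both bags into per-page rows
D = FOREACH B GENERATE FLATTEN(A1), FLATTEN(A2);
-- second M/R job: regroup to rebuild the per-serve mini-table
E = GROUP D BY A1::page_id;
C = FOREACH E GENERATE group, FLATTEN(SOMEUDF(D));
DUMP C;
{code}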

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour

2010-09-20 Thread Viraj Bhat (JIRA)
Using an alias within Nested Foreach causes indeterminate behaviour


 Key: PIG-1633
 URL: https://issues.apache.org/jira/browse/PIG-1633
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0
Reporter: Viraj Bhat


I have created a RANDOMINT function which generates random numbers between 0 
and a specified value. For example, RANDOMINT(4) gives random numbers between 
0 and 3 (inclusive).

{code}
$hadoop fs -cat rand.dat
f
g
h
i
j
k
l
m
{code}

The pig script is as follows:
{code}
register math.jar;
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A {
r = math.RANDOMINT(4);
generate
data,
r as random,
((r == 3)?1:0) as quarter;
};

dump B;
{code}

The results are as follows:
{code}
{color:red} 
(f,0,0)
(g,3,0)
(h,0,0)
(i,2,0)
(j,3,0)
(k,2,0)
(l,0,1)
(m,1,0)
{color} 
{code}

If you observe, (j,3,0) is produced because r is referenced both in the nested 
block and in the generate clause, and the two references generate different values.

Modifying the above script as below solves the issue. The M/R jobs from both 
scripts are the same; it is just a matter of convenience. 
{code}
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A generate
data,
math.RANDOMINT(4) as r;

C = foreach B generate
data,
r,
((r == 3)?1:0) as quarter;

dump C;
{code}

Is this issue related to PIG-747?
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1634) Multiple names for the group field

2010-09-20 Thread Viraj Bhat (JIRA)
Multiple names for the group field


 Key: PIG-1634
 URL: https://issues.apache.org/jira/browse/PIG-1634
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Viraj Bhat


I am hoping that in Pig, if I type 

{quote} c = cogroup a by foo, b by bar; {quote} 

the fields c.group, c.foo, and c.bar should all map to c.$0.

This would improve the readability of the Pig script.

Here's a real usecase:
{code}
---
pages = LOAD 'pages.dat'  AS (url, pagerank);

visits = LOAD 'user_log.dat'  AS (user_id, url);

page_visits = COGROUP pages BY url, visits BY url;

frequent_visits = FILTER page_visits BY COUNT(visits) == 2;

answer = FOREACH frequent_visits  GENERATE url, FLATTEN(pages.pagerank);
---
{code}

(The important part is the final GENERATE statement, which references the 
field url, which was the grouping field in the earlier COGROUP.)  To get it 
to work today, I have to write it in a less intuitive way.
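
For example, the working but less intuitive form has to project and rename the 
group key by hand (a sketch; the AS rename is the part this proposal would make 
unnecessary):

{code}
answer = FOREACH frequent_visits GENERATE group AS url, FLATTEN(pages.pagerank);
{code}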

Maybe with the new parser changes in Pig 0.9 it would be easier to specify that.
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Viraj Bhat (JIRA)
Return code from Pig is 0 even if the job fails when using -M flag
--

 Key: PIG-1615
 URL: https://issues.apache.org/jira/browse/PIG-1615
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


I have a Pig script of this form, which I used inside a workflow system such as 
Oozie.
{code}
A = load  '$INPUT' using PigStorage();
store A into '$OUTPUT';
{code}

I run this with multi-query optimization turned off:
{quote}
$java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
{quote}

The directory /user/viraj/junk1 is not present

I get the following results:
{quote}
Input(s):
Failed to read data from /user/viraj/junk1
Output(s):
Failed to produce result in /user/viraj/junk2
{quote}

This is expected, but the return code is still 0
{code}
$ echo $?
0
{code}

If I run this script with multi-query optimization turned on, it gives a 
return code of 2, which is correct.

{code}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
...
$ echo $?
2
{code}

I believe the wrong return code from Pig is causing Oozie to believe that the 
Pig script succeeded.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910414#action_12910414
 ] 

Viraj Bhat commented on PIG-1615:
-

I tested this on Pig 0.8, but with a downloaded version which was a little old. 

I re-downloaded the latest source, and it seems to be fixed.

Viraj

 Return code from Pig is 0 even if the job fails when using -M flag
 --

 Key: PIG-1615
 URL: https://issues.apache.org/jira/browse/PIG-1615
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script of this form, which I used inside a workflow system such 
 as Oozie.
 {code}
 A = load  '$INPUT' using PigStorage();
 store A into '$OUTPUT';
 {code}
 I run this with multi-query optimization turned off:
 {quote}
 $java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
 INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
 {quote}
 The directory /user/viraj/junk1 is not present
 I get the following results:
 {quote}
 Input(s):
 Failed to read data from /user/viraj/junk1
 Output(s):
 Failed to produce result in /user/viraj/junk2
 {quote}
 This is expected, but the return code is still 0
 {code}
 $ echo $?
 0
 {code}
 If I run this script with multi-query optimization turned on, it gives a 
 return code of 2, which is correct.
 {code}
 $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
 INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
 ...
 $ echo $?
 2
 {code}
 I believe the wrong return code from Pig is causing Oozie to believe that the 
 Pig script succeeded.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-282) Custom Partitioner

2010-09-15 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-282:
---

Release Note: 
This feature allows specifying a Hadoop Partitioner for the following operations: 
GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner 
controls the partitioning of the keys of the intermediate map outputs. See 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html
 for more details.

To use this feature you can add a PARTITION BY clause to the appropriate operator:

A = load 'input_data';
B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;

Here is the code for SimpleCustomPartitioner:

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    //@Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            return ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
        }
        else {
            return key.hashCode() % numPartitions;
        }
    }
}

  was:
This feature allows specifying a Hadoop Partitioner for the following operations: 
GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner 
controls the partitioning of the keys of the intermediate map outputs. See 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Partitioner.html
 for more details.

To use this feature you can add a PARTITION BY clause to the appropriate operator:

A = load 'input_data';
B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;

Here is the code for SimpleCustomPartitioner:

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    //@Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            return ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
        }
        else {
            return key.hashCode() % numPartitions;
        }
    }
}


 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Amir Youssefi
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
 CustomPartitionerTest.patch


 By adding custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to language e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
 of output partitions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)
Parameter substitution using -param option runs into problems when substituting 
entire pig statements in a shell script (maybe this is a bash problem)


 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat


I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1586:


Description: 
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj

  was:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj




 Parameter substitution using -param option runs into problems when substituting 
 entire pig statements in a shell script (maybe this is a bash problem)
 

 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat

 I have a Pig script as a template:
 {code}
 register Countwords.jar;
 A = $INPUT;
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO $OUTPUT;
 {code}
 I attempt to do parameter substitution using the following shell script:
 {code}
 #!/bin/bash
 java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
 -file sub.pig \
  -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' 
 USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
 '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
 (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
  -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
 {code}
 {code}
 register Countwords.jar;
 A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
 (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
 PigStorage() AS (word:chararray,num:int)) by (word)) generate 
 flatten(examples.udf.CountWords(runsub.sh,,)));
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO /user/viraj/output;
 {code}
 The shell substitutes $0 before passing it to java. 
 a) Is there a workaround for this?  
 b) Is this a Pig param problem?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line

2010-08-27 Thread Viraj Bhat (JIRA)
Difference in Semantics between Load statement in Pig and HDFS client on 
Command line
-

 Key: PIG-1576
 URL: https://issues.apache.org/jira/browse/PIG-1576
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.6.0
Reporter: Viraj Bhat


Here is my directory structure on HDFS which I want to access using Pig. 
This is a sample, but in real use case I have more than 100 of these 
directories.
{code}
$ hadoop fs -ls /user/viraj/recursive/
Found 3 items
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080615
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080616
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080617
{code}
Using the command line I can access them using a variety of options:
{code}
$ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt

$ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/

-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt

-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt

-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt
{code}

I have written a Pig script; none of the below combinations of load statements 
work:
{code}
--A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
A = load '/user/viraj/recursive/{20080615..20080617}/' using 
PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

I get the following error in Pig 0.8
{noformat}
2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil 
- 1 map reduce job(s) failed!
2010-08-27 16:34:27,711 [main] INFO  org.apache.pig.tools.pigstats.PigStats - 
Script Statistics: 
HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
0.20.2  0.8.0-SNAPSHOT  viraj   2010-08-27 16:34:24 2010-08-27 16:34:27 
LIMIT
Failed!
Failed Jobs:
JobId   Alias   Feature Message Outputs
N/A A,ALMessage: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for: /user/viraj/recursive/{20080615..20080617}/
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 
0 files
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
... 7 more
hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
{noformat}

The following works:
{code}
A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

Why is there an inconsistency between the HDFS client and Pig?
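
(A possible explanation: bash expands the {15..17} range itself before the 
hadoop command ever sees the path, whereas Pig hands the pattern to the HDFS 
glob matcher, which supports {a,b} alternation but, as far as I know, not 
{a..b} ranges. If so, spelling out the dates with commas should also work from 
Pig; an untested sketch:)

{code}
A = load '/user/viraj/recursive/{20080615,20080616,20080617}/' using 
PigStorage('\u0001') as (k:int, v:chararray);
{code}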

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files

2010-08-23 Thread Viraj Bhat (JIRA)
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files


 Key: PIG-1561
 URL: https://issues.apache.org/jira/browse/PIG-1561
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a simple Pig script which uses the XMLLoader after the Piggybank is 
built.

{code}
register piggybank.jar;
A = load '/user/viraj/capacity-scheduler.xml.gz' using 
org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
B = limit A 1;
dump B;
--store B into '/user/viraj/handlegz' using PigStorage();
{code}


It returns an empty tuple:
{code}
()
{code}

If you supply the uncompressed XML file, you get:
{code}
(<property>
    <name>mapred.capacity-scheduler.queue.my.capacity</name>
    <value>10</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>)
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1547) Piggybank MultiStorage does not scale when processing around 7k records per bucket

2010-08-17 Thread Viraj Bhat (JIRA)
Piggybank MultiStorage does not scale when processing around 7k records per 
bucket
--

 Key: PIG-1547
 URL: https://issues.apache.org/jira/browse/PIG-1547
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I am trying to use the MultiStorage piggybank UDF
{code}
register pig-svn/trunk/contrib/piggybank/java/piggybank.jar;
A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as 
(a,b,c);
STORE A INTO '/user/viraj/multistore' USING 
org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 
'none', '\u0001');
{code}
The file largebucketinput.txt is around 85MB in size; column b takes 512 
values (0-511), and each value of b (i.e., each bucket) contains about 7k 
records.

a) On a multi-node hadoop installation:
The above Pig script, which spawns a single map-only job, does not succeed: it 
is killed by the TT for running above the memory limit.

== Message == 
TaskTree [pid=24584,tipID=attempt_201008110143_101976_m_00_0] is running 
beyond memory-limits. Current usage : 1661034496bytes. Limit : 1610612736bytes.
== Message == 
We tried increasing the map slots, but it did not succeed.

b) On a single node hadoop installation:
The pig script fails with the following message in the mappers:

2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
createBlockOutputStream java.io.EOFException
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block 
blk_7687609983190239805_126509
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
createBlockOutputStream java.io.EOFException
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block 
blk_2734778934507357565_126509
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
createBlockOutputStream java.io.EOFException
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block 
blk_-1293917224803067377_126509
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
createBlockOutputStream java.io.EOFException
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block 
blk_-2272713260404734116_126509
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer 
Exception: java.io.IOException: Unable to create new block.
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2781)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)

2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_-2272713260404734116_126509 bad datanode[0] nodes == null
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Could not get 
block locations. Source file 
/user/viraj/multistore/_temporary/_attempt_201005141440_0178_m_01_0/444/444-1
 - Aborting...
2010-08-17 16:37:48,619 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.io.Text.readString(Text.java:400)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2837)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2762)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,622 INFO org.apache.hadoop.mapred.TaskRunner: Runnning 
cleanup for the task


Need to investigate more.

Viraj






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-05 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895858#action_12895858
 ] 

Viraj Bhat commented on PIG-1537:
-

Hi Olga, I have given the specific script with UDFs to Daniel to test.  
Thanks Daniel for your help.
The script gives correct results when the ColumnPruner optimization is not 
used, or when it is disabled using -t.
Viraj

 Column pruner causes wrong results when using both Custom Store UDF and 
 PigStorage
 --

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.8.0


 I have script which is of this pattern and it uses 2 StoreFunc's:
 {code}
 register loader.jar
 register piggy-bank/java/build/storage.jar;
 %DEFAULT OUTPUTDIR /user/viraj/prunecol/
 ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
 ss_sc_filtered_0 = FILTER ss_sc_0 BY
 a#'id' matches '1.*' OR
 a#'id' matches '2.*' OR
 a#'id' matches '3.*' OR
 a#'id' matches '4.*';
 ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
 ss_sc_filtered_1 = FILTER ss_sc_1 BY
 a#'id' matches '65.*' OR
 a#'id' matches '466.*' OR
 a#'id' matches '043.*' OR
 a#'id' matches '044.*' OR
 a#'id' matches '0650.*' OR
 a#'id' matches '001.*';
 ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
 ss_sc_all_proj = FOREACH ss_sc_all GENERATE
 a#'query' as query,
 a#'testid' as testid,
 a#'timestamp' as timestamp,
 a,
 b,
 c;
 ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
 ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
 STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
 ss_sc_all_map_count = group ss_sc_all_map all;
 count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
 record_count,COUNT($1);
 STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
 {code}
 I run this script using:
 a) java -cp pig0.7.jar org.apache.pig.Main script.pig
 b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig
 What I observe is that the alias 'count' produces the same number of records 
 but ss_sc_all_map has different sizes when run with the above 2 options.
 Is this due to the fact that there are 2 StoreFuncs used?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1537:


Description: 
I have a script which is of this pattern and it uses 2 StoreFuncs:

{code}
register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar org.apache.pig.Main script.pig
b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig

What I observe is that the alias 'count' produces the same number of records 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFuncs used?

Viraj

  was:
I have a script which is of this pattern and it uses 2 StoreFuncs:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}


I run this script using:

a) java -cp pig0.7.jar org.apache.pig.Main script.pig
b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig

What I observe is that the alias 'count' produces the same number of records 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFuncs used?

Viraj


 Column pruner causes wrong results when using both Custom Store UDF and 
 PigStorage
 --

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat

 I have a script which is of this pattern and it uses 2 StoreFuncs:
 {code}
 register loader.jar
 register piggy-bank/java/build/storage.jar;
 %DEFAULT OUTPUTDIR /user/viraj/prunecol/
 ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
 ss_sc_filtered_0 = FILTER ss_sc_0 BY
 a#'id' matches '1.*' OR
 a#'id' matches '2.*' OR
 a#'id' matches '3.*' OR
 a#'id' matches '4.*';
 ss_sc_1 = LOAD 

[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)
Column pruner causes wrong results when using both Custom Store UDF and 
PigStorage
--

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a script which is of this pattern and it uses 2 StoreFuncs:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}


I run this script using:

a) java -cp pig0.7.jar org.apache.pig.Main script.pig
b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig

What I observe is that the alias 'count' produces the same number of records 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFuncs used?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-05-06 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12864963#action_12864963
 ] 

Viraj Bhat commented on PIG-1345:
-

Richard, thanks for suggesting a workaround. The error message is definitely 
more verbose than the original one. 

At least this way the user can get some idea of where the cast is an issue, 
for example in some addition taking place in the script. 

This Jira was originally created as a task to correlate exactly on which line 
an int is implicitly cast to a float, which I believe is hard to do in the 
current parser as we do not keep track of line numbers.

Viraj

 Link casting errors in POCast to actual lines numbers in Pig script
 ---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 For the purpose of easy debugging, it would be nice to find out where in the 
 Pig script my warnings are coming from. 
 The only known process is to comment out lines in the Pig script and see if 
 these warnings go away.
 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
 I think this may need us to keep track of the line numbers of the Pig script 
 (via our javacc parser) and maintain them in the logical and physical plan.
 It would help users in debugging simple errors/warning related to casting.
 Is this enhancement ("Standardize on Parser and Scanner Technology") listed 
 in http://wiki.apache.org/pig/PigJournal?
 Do we need to change the parser to something other than javacc to make this 
 task simpler?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861097#action_12861097
 ] 

Viraj Bhat commented on PIG-798:


Hi Ashutosh,
 Yes, that is possible. I know that we can do that in BinStorage(), but why 
can we not do this in PigStorage? What do I need to cast as (chararray)?
{code}
A = load 'somedata' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

But this is possible in BinStorage(); why is this not consistent?

Is it that BinStorage() has schemas embedded while PigStorage() does not? 

Should this not be fixed to make it consistent across storage formats?
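
(For reference, a sketch of the explicit-cast form, which does work with 
PigStorage() today since the cast is applied to the bytearray directly:)

{code}
A = load 'somedata' using PigStorage();
B = foreach A generate (chararray)$0 as name;
dump B;
{code}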

Viraj

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from PigStorage(), 
 where it is ok not to specify a schema? Should it not be consistent across 
 both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861106#action_12861106
 ] 

Viraj Bhat commented on PIG-1211:
-

Ashutosh, yes, as more and more people adopt Pig, they expect some type of 
guarantees, since Pig is designed to help people with no experience in writing 
M/R programs.

If I am a novice user and I have a small typo, do I wait for 3-4 hours to 
discover that there is a syntax error? I have not only wasted the CPU cycles 
but also the user's productivity.

The problem here is that dump and hadoop shell commands are treated differently 
in Pig scripts, and Multi-query optimizations are ignored.

I have listed what Milind and Dmitry are suggesting. Maybe this is the way the 
future Pig language will compile, to give you hadoop jar invocations to run in 
sequence or as a DAG.

Pigcc -L myScript.pig - parses pig script, generates logical plan, and stores 
it in myScript.pig.l

Pigcc -P myScript.pig.l - produces physical plan from the logical plan, and 
stores it in myScript.pig.p

Pigcc -M myScript.pig.p - produces map-reduce plan, myScript.pig.m

Pig myScript.pig.m - interprets the MR plan. This can be split into multiple 
sequential MR job plans too, myScript.pig.m.{1,2,3..}, so that a way to 
execute the pig script is to run:

Hadoop jar pigRT.jar myScript.pig.m.1
Hadoop jar pigRT.jar myScript.pig.m.2
Hadoop jar pigRT.jar myScript.pig.m.3
Hadoop jar pigRT.jar myScript.pig.m.4

Thanks Viraj


 Pig script runs half way after which it reports syntax error
 

 Key: PIG-1211
 URL: https://issues.apache.org/jira/browse/PIG-1211
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script which is structured in the following way
 {code}
 register cp.jar
 dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
 col3, col4, col5);
 filtered_dataset = filter dataset by (col1 == 1);
 proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
 rmf $output1;
 store proj_filtered_dataset into '$output1' using PigStorage();
 second_stream = foreach filtered_dataset  generate col2, col4, col5;
 group_second_stream = group second_stream by col4;
 output2 = foreach group_second_stream {
  a =  second_stream.col2
  b =   distinct second_stream.col5;
  c = order b by $0;
  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
 }
 rmf  $output2;
 --syntax error here
 store output2 to '$output2' using PigStorage();
 {code}
 I run this script using the Multi-query option, it runs successfully till the 
 first store but later fails with a syntax error. 
 The usage of the HDFS option rmf causes the first store to execute. 
 The only option I have is to run an explain before running the script 
 (grunt explain -script myscript.pig -out explain.out) 
 or to move the rmf statements to the top of the script.
 Here are some questions:
 a) Can we have an option to do something like checkscript instead of 
 explain to get the same syntax error? In this way I can ensure that I do not 
 run for 3-4 hours before encountering a syntax error.
 b) Can pig not figure out a way to re-order the rmf statements, since all the 
 store directories are variables?
 Thanks
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861134#action_12861134
 ] 

Viraj Bhat commented on PIG-798:


Ashutosh, thanks for clarifying; we will wait till that bug is fixed in 
BinStorage.

Viraj

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from PigStorage(), 
 where it is ok not to specify a schema? Should it not be consistent across 
 both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860397#action_12860397
 ] 

Viraj Bhat commented on PIG-1345:
-

In which release will PIG-908 be fixed? 

Is it guaranteed that if we fix PIG-908, then this issue will be solved?  

 Link casting errors in POCast to actual lines numbers in Pig script
 ---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 For the purpose of easy debugging, it would be nice to find out where in the 
 Pig script my warnings are coming from. 
 The only known process is to comment out lines in the Pig script and see if 
 these warnings go away.
 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
 I think this may need us to keep track of the line numbers of the Pig script 
 (via our javacc parser) and maintain them in the logical and physical plan.
 It would help users in debugging simple errors/warning related to casting.
 Is this enhancement ("Standardize on Parser and Scanner Technology") listed 
 in http://wiki.apache.org/pig/PigJournal?
 Do we need to change the parser to something other than javacc to make this 
 task simpler?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860419#action_12860419
 ] 

Viraj Bhat commented on PIG-1211:
-

Ashutosh, I feel that the user may not be interested in first running his 
script using explain to find his syntax errors, and then running it again to 
get his results.  
They expect Pig to tell them all the errors upfront before submitting a M/R job.

Explain was not designed for checking syntax errors in scripts. 

I believe that if you have a dump statement, explain -script will cause the 
script to run.

Is it not possible for Pig to find out that there is an error with store 
syntax? 

Viraj

 Pig script runs half way after which it reports syntax error
 

 Key: PIG-1211
 URL: https://issues.apache.org/jira/browse/PIG-1211
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script which is structured in the following way
 {code}
 register cp.jar
 dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
 col3, col4, col5);
 filtered_dataset = filter dataset by (col1 == 1);
 proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
 rmf $output1;
 store proj_filtered_dataset into '$output1' using PigStorage();
 second_stream = foreach filtered_dataset  generate col2, col4, col5;
 group_second_stream = group second_stream by col4;
 output2 = foreach group_second_stream {
  a =  second_stream.col2
  b =   distinct second_stream.col5;
  c = order b by $0;
  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
 }
 rmf  $output2;
 --syntax error here
 store output2 to '$output2' using PigStorage();
 {code}
 I run this script using the Multi-query option, it runs successfully till the 
 first store but later fails with a syntax error. 
 The usage of the HDFS option rmf causes the first store to execute. 
 The only option I have is to run an explain before running the script 
 (grunt explain -script myscript.pig -out explain.out) 
 or to move the rmf statements to the top of the script.
 Here are some questions:
 a) Can we have an option to do something like checkscript instead of 
 explain to get the same syntax error? In this way I can ensure that I do not 
 run for 3-4 hours before encountering a syntax error.
 b) Can pig not figure out a way to re-order the rmf statements, since all the 
 store directories are variables?
 Thanks
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860445#action_12860445
 ] 

Viraj Bhat commented on PIG-1339:
-

Hi Ashutosh, this does not work in trunk. I am using the latest build:

{code}
$java -cp  ~/pig-svn/trunk/pig.jar org.apache.pig.Main -version

Apache Pig version 0.8.0-dev (r937554) 
compiled Apr 23 2010, 16:57:32

{code}

2010-04-23 17:31:41,448 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Lexical error at line 1, column 71.  Encountered: 
\u3042 (12354), after : 


This is a valid bug.

Viraj

 International characters in column names not supported
 --

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat

 There is a particular use-case in which someone specifies a column name to be 
 in International characters.
 {code}
 inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
 describe inputdata;
 dump inputdata;
 {code}
 ==
 Pig Stack Trace
 ---
 ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
 Encountered: \u3042 (12354), after : 
 org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
 1, column 64.  Encountered: \u3042 (12354), after : 
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:391)
 ==
 Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860452#action_12860452
 ] 

Viraj Bhat commented on PIG-798:


Hi Ashutosh,
 The problem here is not about using the data interchangeably between 
BinStorage() and PigStorage(); it is about consistency issues in schema 
handling. Sorry if the description was unclear.

I can see that it is possible to write statements such as this using 
BinStorage() 

{code}
A = load 'somedata' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

and not write it using PigStorage().

Should we not support the following statement? As a user I am interested in 
projecting the first column and casting it to a chararray; I am not interested 
in knowing what the schemas of the other columns are.

It fails when I do the following:
{code}
A = load 'somedata' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}
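
As far as I can tell, expressing the same projection with an explicit cast 
(instead of declaring the type in the AS clause) does parse with PigStorage() 
as well; a sketch, not verified against this exact build:

{code}
A = load 'somedata' using PigStorage();
-- explicit cast rather than an AS-clause type
B = foreach A generate (chararray)$0 as name;
dump B;
{code}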

Can you tell me why the schema specification in FOREACH GENERATE works with 
BinStorage and not in PigStorage? 

Viraj

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from those of 
 PigStorage(), where it is ok not to specify a schema? Should they not be 
 consistent across both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-23 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-798:
---

Affects Version/s: 0.6.0
   0.5.0
   0.4.0
   0.3.0
   0.7.0
   0.8.0

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from those of 
 PigStorage(), where it is ok not to specify a schema? Should they not be 
 consistent across both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1378) har url not usable in Pig scripts

2010-04-21 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859384#action_12859384
 ] 

Viraj Bhat commented on PIG-1378:
-

har:// URLs currently work in Pig 0.7 when the underlying hdfs location is 
fully specified.
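
For example, a fully qualified URL of the following form should load (a 
sketch; the namenode host, port, and path here are placeholders, not an actual 
cluster):

{noformat}
grunt> a = load 'har://hdfs-namenode.example.com:8020/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}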

 har url not usable in Pig scripts
 -

 Key: PIG-1378
 URL: https://issues.apache.org/jira/browse/PIG-1378
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I am trying to use har (Hadoop Archives) in my Pig script.
 I can use them through the HDFS shell
 {noformat}
 $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
 Found 1 items
 -rw-------   5 viraj users    1537234 2010-04-14 09:49 
 user/viraj/project/subproject/files/size/data/part-1
 {noformat}
 Using similar URLs in grunt yields:
 {noformat}
 grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
 grunt> dump a;
 {noformat}
 {noformat}
 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible 
 file URI scheme: har : hdfs
 2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
 is no log file to write to.
 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
 Incompatible file URI scheme: har : hdfs
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
 at org.apache.pig.Main.main(Main.java:357)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
 Incompatible file URI scheme: har : hdfs
 at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
 at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
 ... 13 more
 {noformat}
 According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the 
 following as stated in the original description
 {noformat}
 grunt> a = load 
 'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
 grunt> dump a;
 {noformat}
 {noformat}
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
 Unable to create input splits for: 
 har://namenode-location/user/viraj/project/subproject/files/size/data'; 
 ... 8 more
 Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
 at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
 {noformat}
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1378) har url not usable in Pig scripts

2010-04-14 Thread Viraj Bhat (JIRA)
har url not usable in Pig scripts
-

 Key: PIG-1378
 URL: https://issues.apache.org/jira/browse/PIG-1378
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw-------   5 viraj users    1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields:
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file 
URI scheme: har : hdfs
2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
is no log file to write to.
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
... 13 more
{noformat}

According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the 
following as stated in the original description

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

{noformat}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
Unable to create input splits for: 
har://namenode-location/user/viraj/project/subproject/files/size/data'; 
... 8 more
Caused by: java.io.IOException: No FileSystem for scheme: mithrilgold
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
{noformat}

Viraj

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (PIG-1378) har url not usable in Pig scripts

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1378:


Description: 
I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw-------   5 viraj users    1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields:
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file 
URI scheme: har : hdfs
2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
is no log file to write to.
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
... 13 more
{noformat}

According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the 
following as stated in the original description

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

{noformat}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
Unable to create input splits for: 
har://namenode-location/user/viraj/project/subproject/files/size/data'; 
... 8 more
Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
{noformat}

Viraj

  was:
I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw-------   5 viraj users    1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields:
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 

[jira] Commented: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag

2010-04-14 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857157#action_12857157
 ] 

Viraj Bhat commented on PIG-518:


The above script generates the following error in Pig 0.7

2010-04-14 17:10:49,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1048: Two inputs of BinCond must have compatible schemas. left hand side: b: 
bag({colb2: bytearray,colb3: bytearray}) right hand side: 
bag({(chararray,chararray)})


A cast to the right type solves the problem.

{code}
a = load 'sports_views.txt' as (col1:chararray, col2:chararray, 
col3:chararray); 
b = load 'queries.txt' as (colb1:chararray,colb2:chararray,colb3:chararray); 
mycogroup = cogroup a by col1 inner, b by colb1; 
mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
b.(colb2,colb3) : {('','')}));
dump mynewalias; 
{code}

(alice,lakers,3,ipod,3)
(alice,warriors,7,ipod,3)
(peter,sun,7,sun,4)
(peter,nets,7,sun,4)

Closing bug, as Pig yields the correct error message, which the user can use 
to fix the script.



 LOBinCond  exception in LogicalPlanValidationExecutor when providing default 
 values for bag
 ---

 Key: PIG-518
 URL: https://issues.apache.org/jira/browse/PIG-518
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Attachments: queries.txt, sports_views.txt


 The following piece of Pig script, which provides default values for bags 
 {('','')} when the COUNT returns 0, fails with the following error. (Note: 
 files used in this script are attached to this Jira.)
 
 a = load 'sports_views.txt' as (col1, col2, col3);
 b = load 'queries.txt' as (colb1,colb2,colb3);
 mycogroup = cogroup a by col1 inner, b by colb1;
 mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
 b.(colb2,colb3) : {('','')}));
 dump mynewalias;
 
 java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to 
 store for alias: mynewalias [Can't overwrite cause]]
  at java.lang.Throwable.initCause(Throwable.java:320)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494)
  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85)
  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28)
  at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252)
  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121)
  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40)
  at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
  at 
 org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
  at 
 org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:
 79)
  at org.apache.pig.PigServer.compileLp(PigServer.java:684)
  at org.apache.pig.PigServer.compileLp(PigServer.java:655)
  at org.apache.pig.PigServer.store(PigServer.java:433)
  at org.apache.pig.PigServer.store(PigServer.java:421)
  at org.apache.pig.PigServer.openIterator(PigServer.java:384)
  at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
  at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
  at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
  at org.apache.pig.Main.main(Main.java:306)
 Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't 
 overwrite cause]
  ... 26 more
 Caused by: java.lang.IllegalStateException: Can't overwrite cause
  ... 26 more
 

-- 
This message is automatically generated by 

[jira] Resolved: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-518.


Fix Version/s: 0.7.0
   Resolution: Fixed

 LOBinCond  exception in LogicalPlanValidationExecutor when providing default 
 values for bag
 ---

 Key: PIG-518
 URL: https://issues.apache.org/jira/browse/PIG-518
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.7.0

 Attachments: queries.txt, sports_views.txt


 The following piece of Pig script, which provides default values for bags 
 {('','')} when the COUNT returns 0, fails with the following error. (Note: 
 files used in this script are attached to this Jira.)
 
 a = load 'sports_views.txt' as (col1, col2, col3);
 b = load 'queries.txt' as (colb1,colb2,colb3);
 mycogroup = cogroup a by col1 inner, b by colb1;
 mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
 b.(colb2,colb3) : {('','')}));
 dump mynewalias;
 
 java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to 
 store for alias: mynewalias [Can't overwrite cause]]
  at java.lang.Throwable.initCause(Throwable.java:320)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494)
  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85)
  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28)
  at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252)
  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121)
  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40)
  at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
  at 
 org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
  at 
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
  at 
 org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:
 79)
  at org.apache.pig.PigServer.compileLp(PigServer.java:684)
  at org.apache.pig.PigServer.compileLp(PigServer.java:655)
  at org.apache.pig.PigServer.store(PigServer.java:433)
  at org.apache.pig.PigServer.store(PigServer.java:421)
  at org.apache.pig.PigServer.openIterator(PigServer.java:384)
  at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
  at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
  at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
  at org.apache.pig.Main.main(Main.java:306)
 Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't 
 overwrite cause]
  ... 26 more
 Caused by: java.lang.IllegalStateException: Can't overwrite cause
  ... 26 more
 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (PIG-829) DECLARE statement stop processing after special characters such as dot . , + % etc..

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-829.


Fix Version/s: 0.7.0
   Resolution: Fixed

Pig 0.7 yields the correct result.
{code}
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO 'foo.bar';
{code}

 DECLARE statement stop processing after special characters such as dot . , 
 + % etc..
 --

 Key: PIG-829
 URL: https://issues.apache.org/jira/browse/PIG-829
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


 The Pig script below does not work well when special characters are used in 
 the DECLARE statement.
 {code}
 %DECLARE OUT foo.bar
 x = LOAD 'something' as (a:chararray, b:chararray);
 y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
 STORE y INTO '$OUT';
 {code}
 When the above script is run in dry-run mode, the substituted file does 
 not contain the special character.
 {code}
 java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' 
 org.apache.pig.Main -r declaresp.pig
 {code}
 Resulting file: declaresp.pig.substituted
 {code}
 x = LOAD 'something' as (a:chararray, b:chararray);
 y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
 STORE y INTO 'foo';
 {code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (PIG-1377) Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold

2010-04-13 Thread Viraj Bhat (JIRA)
Pig/Zebra fails without proper error message when the 
mapred.jobtracker.maxtasks.per.job exceeds threshold
--

 Key: PIG-1377
 URL: https://issues.apache.org/jira/browse/PIG-1377
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat


I have a Zebra script which generates a huge number of mappers, around 400K. 
The mapred.jobtracker.maxtasks.per.job limit is currently set at 200K. The job 
fails at the initialization phase, and it is very hard to find out the cause.

We need a way to report the right error message to users. Unfortunately, for 
Pig to get this error from the backend, the Map Reduce Jira 
https://issues.apache.org/jira/browse/MAPREDUCE-1049 needs to be fixed.
{code}

-- Sorted format
%set default_parallel 100;
raw = load '/user/viraj/generated/raw/zebra-sorted/20100203'
USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted')
as (id,
timestamp,
code,
ip,
host,
reference,
type,
flag,
params : map[]
);
describe raw;
user_events = filter raw by id == 'viraj';
describe user_events;
dump user_events;
sorted_events = order user_events by id, timestamp;
dump sorted_events;
store sorted_events into 'finalresult';
{code}
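
Until then, the only recourse I know of is on the cluster side: raising the 
limit in the JobTracker configuration. A sketch (the property name is taken 
from this report; the 400000 value is illustrative, and the JobTracker 
typically needs a restart to pick it up):

{code}
<!-- mapred-site.xml on the JobTracker -->
<property>
  <name>mapred.jobtracker.maxtasks.per.job</name>
  <value>400000</value>
</property>
{code}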

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (PIG-1374) Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag

2010-04-12 Thread Viraj Bhat (JIRA)
Order by fails with java.lang.String cannot be cast to 
org.apache.pig.data.DataBag
--

 Key: PIG-1374
 URL: https://issues.apache.org/jira/browse/PIG-1374
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat


The script loads data using BinStorage(), flattens the columns, and then sorts 
on the second column in descending order. The ORDER BY fails with a 
ClassCastException.

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $1 desc;
dump d;
{code}

The sampling job fails with the following error:
===
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.pig.data.DataBag
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:407)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:188)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:329)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
===

The schemas for b, c and d are as follows:

b: {bag_of_tuples: {tuple: (uuid: chararray,velocity: double)}}

c: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}

d: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}

If we modify this script to order on the first column, it seems to work:

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $0 desc;
dump d;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)


There is a workaround, which is to do a projection before the ORDER:

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
newc = foreach c generate $0 as uuid, $1 as velocity;
newd = order newc by velocity desc;
dump newd;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)


The schema for the Loader is as follows:

{code}
  public Schema outputSchema(Schema input) {
      try {
          List<Schema.FieldSchema> list = new ArrayList<Schema.FieldSchema>();
          list.add(new Schema.FieldSchema("uuid", DataType.CHARARRAY));
          list.add(new Schema.FieldSchema("velocity", DataType.DOUBLE));
          Schema tupleSchema = new Schema(list);
          Schema.FieldSchema tupleFs =
                  new Schema.FieldSchema("tuple", tupleSchema, DataType.TUPLE);
          Schema bagSchema = new Schema(tupleFs);
          bagSchema.setTwoLevelAccessRequired(true);
          Schema.FieldSchema bagFs =
                  new Schema.FieldSchema("bag_of_tuples", bagSchema, DataType.BAG);
          return new Schema(bagFs);
      } catch (Exception e) {
          return null;
      }
  }
{code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854762#action_12854762
 ] 

Viraj Bhat commented on PIG-756:


In Pig 0.7 we have moved local mode of Pig to local mode of Hadoop.
https://issues.apache.org/jira/browse/PIG-1053

Closing issue

 UDFs should have API for transparently opening and reading files from HDFS or 
 from local file system with only relative path
 

 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 I have a utility function util.INSETFROMFILE() to which I pass a file name 
 during initialization.
 {code}
 define inQuerySet util.INSETFROMFILE('analysis/queries');
 A = load 'logs' using PigStorage() as ( date int, query chararray );
 B = filter A by inQuerySet(query);
 {code}
 This provides a computationally inexpensive way to effect map-side joins for 
 small sets; in addition, functions of this style can encapsulate more complex 
 matching rules.
 For rapid development and debugging purposes, I want this code to run without 
 modification on both my local file system when I do pig -exectype local and 
 on HDFS.
 Pig needs to provide an API for UDFs that allows them to either:
 1) know when they are in local or HDFS mode and let them open and read 
 from files as appropriate
 2) just provide a file name and read statements and have Pig transparently 
 manage local or HDFS opens and reads for the UDF
 UDFs need to read configuration information off the filesystem and it 
 simplifies the process if one can just flip the switch of -exectype local.
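
As an illustration of the pattern, here is a minimal sketch of such a UDF (my 
sketch, not the actual implementation: the lazy loading and the plain 
FileSystem-based file access are assumptions):

{code}
package util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class INSETFROMFILE extends FilterFunc {
    private final String fileName;
    private Set<String> set; // loaded lazily on the first call

    public INSETFROMFILE(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (set == null) {
            set = new HashSet<String>();
            // After PIG-1053, local mode is Hadoop local mode, so the same
            // FileSystem call resolves both local and HDFS paths.
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(fileName))));
            String line;
            while ((line = in.readLine()) != null) {
                set.add(line.trim());
            }
            in.close();
        }
        return set.contains((String) input.get(0));
    }
}
{code}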

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-756.


   Resolution: Fixed
Fix Version/s: 0.7.0

https://issues.apache.org/jira/browse/PIG-1053 fixes this issue.

 UDFs should have API for transparently opening and reading files from HDFS or 
 from local file system with only relative path
 

 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz
 Fix For: 0.7.0


 I have a utility function util.INSETFROMFILE() to which I pass a file name 
 during initialization.
 {code}
 define inQuerySet util.INSETFROMFILE('analysis/queries');
 A = load 'logs' using PigStorage() as ( date int, query chararray );
 B = filter A by inQuerySet(query);
 {code}
 This provides a computationally inexpensive way to effect map-side joins for 
 small sets; in addition, functions of this style can encapsulate more complex 
 matching rules.
 For rapid development and debugging purposes, I want this code to run without 
 modification on both my local file system when I do pig -exectype local and 
 on HDFS.
 Pig needs to provide an API for UDFs that allows them to either:
 1) know when they are in local or HDFS mode and let them open and read 
 from files as appropriate
 2) just provide a file name and read statements and have Pig transparently 
 manage local or HDFS opens and reads for the UDF
 UDFs need to read configuration information off the filesystem and it 
 simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-03-31 Thread Viraj Bhat (JIRA)
Link casting errors in POCast to actual lines numbers in Pig script
---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


For the purpose of easy debugging, it would be nice to find out where in the 
Pig script my warnings are coming from.

The only known way to do this is to comment out lines in the Pig script and 
see if the warnings go away.

2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26

I think this may need us to keep track of the line numbers of the Pig script 
(via our javacc parser) and maintain them in the logical and physical plans.

It would help users in debugging simple errors/warnings related to casting.

Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?

Do we need to change the parser to something other than javacc to make this 
task simpler?

Standardize on Parser and Scanner Technology

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1339) International characters in column names not supported

2010-03-30 Thread Viraj Bhat (JIRA)
International characters in column names not supported
--

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


There is a particular use-case in which someone specifies a column name in 
international characters.

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
describe inputdata;
dump inputdata;
{code}
==
Pig Stack Trace
---
ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
Encountered: \u3042 (12354), after : 

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, 
column 64.  Encountered: \u3042 (12354), after : 

at 
org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:391)
==

Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20

2010-03-30 Thread Viraj Bhat (JIRA)
Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
-

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat


The script reads in BinStorage data and tries to convert a column which is a 
DataByteArray to a chararray.

{code}
raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;

B = foreach A generate col1#'bcookie'  as reqcolumn;
describe B;
--B: {reqcolumn: bytearray}
X = limit B 5;
dump X;

B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;

{code}

The first dump produces:

(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)

The second dump produces:
()
()
()
()
()

It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
time(s).
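
A hypothetical workaround (the UDF name and package below are made up for 
illustration, not from this report) is to do the conversion in a UDF instead 
of relying on the implicit cast:

{code}
package util;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

public class ToChararray extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        Object o = input.get(0);
        if (o == null) {
            return null;
        }
        if (o instanceof DataByteArray) {
            // Convert the raw bytes to a String explicitly.
            return new String(((DataByteArray) o).get());
        }
        return o.toString();
    }
}
{code}

which would be invoked as, e.g., B = foreach A generate 
util.ToChararray(col1#'bcookie') as convertedcol;
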
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1341:


Component/s: impl
Summary: Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED  (was: Cannot convert DataByeArray to 
Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20)

 Cannot convert DataByeArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 The script reads in BinStorage data and tries to convert a column which is a 
 DataByteArray to a chararray.
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {reqcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-03-30 Thread Viraj Bhat (JIRA)
pig_log file missing even though Main tells it is creating one and an M/R job 
fails 


 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


There is a particular case where I was running with the latest trunk of Pig.

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig

[main] INFO  org.apache.pig.Main - Logging error messages to: 
/homes/viraj/pig_1263420012601.log

$ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}

The job failed and the log file did not contain anything; the only way to 
debug was to look into the Jobtracker logs.

Here are some reasons which could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, should we not error 
on stdout?
2) There are some errors from the backend which are not being captured.

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1308) Inifinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

2010-03-18 Thread Viraj Bhat (JIRA)
Inifinite loop in JobClient when reading from BinStorage Message: 
[org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2]


 Key: PIG-1308
 URL: https://issues.apache.org/jira/browse/PIG-1308
 Project: Pig
  Issue Type: Bug
Reporter: Viraj Bhat
 Fix For: 0.7.0


A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following 
messages repeating endlessly:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:07,402 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:48,596 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:37:28,014 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:04,823 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:38,981 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:39:12,220 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2

The stack trace revealed:

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at 
org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at 
org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at 
org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at 
org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The binstorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1'  using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'mobilesample' using BinStorage();
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1308) Inifinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

2010-03-18 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1308:


Description: 
A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following 
messages repeating endlessly:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:07,402 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:48,596 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:37:28,014 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:04,823 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:38,981 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:39:12,220 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2

The stack trace revealed:

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at 
org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at 
org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at 
org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at 
org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The binstorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1'  using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}

  was:
A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following 
messages repeating endlessly:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - 

[jira] Created: (PIG-1278) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText

2010-03-05 Thread Viraj Bhat (JIRA)
Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableFloatWritable, recieved 
org.apache.pig.impl.io.NullableText 
---

 Key: PIG-1278
 URL: https://issues.apache.org/jira/browse/PIG-1278
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a script which uses Map data, and runs a UDF, which creates random 
numbers and then orders the data by these random numbers.

{code}
REGISTER myloader.jar;
--jar produced from the source code listed below
REGISTER math.jar;

DEFINE generator math.Random();

inputdata = LOAD '/user/viraj/mymapdata'   USING MyMapLoader()AS (s:map[], 
m:map[], l:map[]);

queries = FILTER inputdata   BY m#'key'#'query' IS NOT null;

queries_rand = FOREACH queries  GENERATE generator('') AS rand_num, (CHARARRAY) 
m#'key'#'query' AS query_string;

queries_sorted = ORDER queries_rand  BY rand_num  PARALLEL 10;

queries_limit = LIMIT queries_sorted 1000;

rand_queries = FOREACH queries_limit  GENERATE query_string;

STORE rand_queries INTO 'finalresult';

{code}

UDF source for Random.java
{code}
package math;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

/*
* Implements a random float [0,1) generator.
*/

public class Random extends EvalFunc<Float>
{
    // Use java.util.Random explicitly: the unqualified name would resolve
    // to this class itself and recurse.
    private final java.util.Random m_rand = new java.util.Random();

    public Float exec(Tuple input) throws IOException
    {
        return new Float(m_rand.nextFloat());
    }

    public Schema outputSchema(Schema input)
    {
        final String name = getSchemaName(getClass().getName(), input);
        return new Schema(new Schema.FieldSchema(name, DataType.FLOAT));
    }
}
{code}

Running this script returns the following error in the Mapper
=
java.io.IOException: Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableFloatWritable, recieved 
org.apache.pig.impl.io.NullableText at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845) at 
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) 
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
org.apache.hadoop.mapred.Child.main(Child.java:159) 
=
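One thing worth trying, purely as a hedged sketch (not a verified fix for this ticket), is to cast the UDF output explicitly so the ORDER BY key type cannot be mistaken for text:

{code}
-- hedged sketch: force the sort key to an explicit float; the cast is an
-- assumption, not a confirmed fix for this ticket
queries_rand = FOREACH queries GENERATE (float) generator('') AS rand_num,
    (chararray) m#'key'#'query' AS query_string;
queries_sorted = ORDER queries_rand BY rand_num PARALLEL 10;
{code}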

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1281) Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at Compile Time during creation of logical plan

2010-03-05 Thread Viraj Bhat (JIRA)
Detect org.apache.pig.data.DataByteArray cannot be cast to 
org.apache.pig.data.Tuple type of errors at Compile Time during creation of 
logical plan
---

 Key: PIG-1281
 URL: https://issues.apache.org/jira/browse/PIG-1281
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


This is more of an enhancement request: we could detect simple errors at 
compile time, while the logical plan is being created, rather than at the 
backend.

I created a script which contains an error that is currently detected in the 
backend as a cast error, when in fact it could be caught in the front end 
(group is a single element, so the group.$0 projection will not work).

{code}
inputdata = LOAD '/user/viraj/mymapdata' AS (col1, col2, col3, col4);

projdata = FILTER inputdata BY (col1 is not null);

groupprojdata = GROUP projdata BY col1;

cleandata = FOREACH groupprojdata {
 bagproj = projdata.col1;
 dist_bags = DISTINCT bagproj;
 GENERATE group.$0 as newcol1, COUNT(dist_bags) as newcol2;
  };

cleandata1 = GROUP cleandata by newcol2;

cleandata2 = FOREACH cleandata1 { GENERATE group.$0 as finalcol1, 
COUNT(cleandata.newcol1) as finalcol2; };

ordereddata = ORDER cleandata2 by finalcol2;

store ordereddata into 'finalresult' using PigStorage();
{code}
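Since group is a single field after GROUP ... BY col1, the projection that would pass the front end is simply group rather than group.$0. A corrected sketch of the nested block:

{code}
-- corrected sketch: project the scalar group directly instead of group.$0
cleandata = FOREACH groupprojdata {
 bagproj = projdata.col1;
 dist_bags = DISTINCT bagproj;
 GENERATE group as newcol1, COUNT(dist_bags) as newcol2;
  };
{code}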

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-03-02 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840339#action_12840339
 ] 

Viraj Bhat commented on PIG-1252:
-

A modified version of the script works; does this have to do with the nested 
foreach?

{code}
loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (trueDataTmp, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
 
dump finalData;
{code}

 Diamond splitter does not generate correct results when using Multi-query 
 optimization
 --

 Key: PIG-1252
 URL: https://issues.apache.org/jira/browse/PIG-1252
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.7.0


 I have a script which uses split but somehow does not use one of the split 
 branches. The skeleton of the script is as follows
 {code}
 loadData = load '/user/viraj/zebradata' using 
 org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
 col7');
 prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
 (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
 ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 
 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
 SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
 falseDataTmp IF (validRec == '1' AND splitcond == '');
 grpData = GROUP trueDataTmp BY splitcond;
 finalData = FOREACH grpData {
orderedData = ORDER trueDataTmp BY col1,col2;
GENERATE FLATTEN ( MYUDF (orderedData, 60, 
 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
   }
 dump finalData;
 {code}
 You can see that falseDataTmp is untouched.
 When I run this script with the no-Multiquery (-M) option I get the right result. 
 This could be the result of complex BinCond's in the POLoad. We can get rid 
 of this error by using FILTER instead of SPLIT.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)
Column pruner causes wrong results
--

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


For a simple script, the column pruner optimization removes certain columns from 
the original relation, which leads to wrong results.

Input file kv contains the following columns (tab separated)
{code}
a   1
a   2
a   3
b   4
c   5
c   6
b   7
d   8
{code}

Now running this script in Pig 0.6 produces

{code}
kv = load 'kv' as (k,v);
keys= foreach kv generate k;
keys = distinct keys; 
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)


Running this in Pig 0.5 version without column pruner results in:
(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization it gives the right results.
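For reference, optimizer rules can be turned off from the command line with Pig's -t (-optimizer_off) switch; the exact rule name for the column pruner varies by release, so the name below is an assumption:

{code}
# hedged sketch: the rule name is an assumption and differs across releases
pig -t ColumnPruner myscript.pig
{code}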

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840389#action_12840389
 ] 

Viraj Bhat commented on PIG-1272:
-

Now with Pig 0.7 or trunk we have the following error:

2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.NoSuchFieldError: sJobConf
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj

 Column pruner causes wrong results
 --

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0


 For a simple script, the column pruner optimization removes certain columns 
 from the original relation, which leads to wrong results.
 Input file kv contains the following columns (tab separated)
 {code}
 a   1
 a   2
 a   3
 b   4
 c   5
 c   6
 b   7
 d   8
 {code}
 Now running this script in Pig 0.6 produces
 {code}
 kv = load 'kv' as (k,v);
 keys= foreach kv generate k;
 keys = distinct keys; 
 keys = limit keys 2;
 rejoin = join keys by k, kv by k;
 dump rejoin;
 {code}
 (a,a)
 (a,a)
 (a,a)
 (b,b)
 (b,b)
 Running this in Pig 0.5 version without column pruner results in:
 (a,a,1)
 (a,a,2)
 (a,a,3)
 (b,b,4)
 (b,b,7)
 When we disable the ColumnPruner optimization it gives the right results.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types

2010-02-25 Thread Viraj Bhat (JIRA)
Script producing varying number of records when COGROUPing value of map data 
type with and without types


 Key: PIG-1263
 URL: https://issues.apache.org/jira/browse/PIG-1263
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a Pig script which I am experimenting with. [[Albeit this is not 
optimized and can be done in a variety of ways.]] I get different record counts 
depending on where I place load/store pairs in the script.

Case 1: Returns 424329 records
Case 2: Returns 5859 records
Case 3: Returns 5859 records
Case 4: Returns 5578 records

I am wondering which result is correct.

Here are the scripts.
Case 1: 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypes' using PigStorage();
{code}

Case 2:  Storing and loading intermediate results in J 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--store intermediate data to HDFS and re-read
store J into 'output/20100203/J' using PigStorage('\u0001');

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

--read J into K1
K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, 
id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypesIntStore' using PigStorage();
{code}


Case 3: Types information specified but no intermediate store of J

{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, 
(long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' 
as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as 
id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, 
(chararray)m#'id12' as id12;


I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 

[jira] Created: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-02-22 Thread Viraj Bhat (JIRA)
Diamond splitter does not generate correct results when using Multi-query 
optimization
--

 Key: PIG-1252
 URL: https://issues.apache.org/jira/browse/PIG-1252
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a script which uses split but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that falseDataTmp is untouched.

When I run this script with the no-Multiquery (-M) option I get the right result. 
This could be the result of complex BinCond's in the POLoad. We can get rid of 
this error by using FILTER instead of SPLIT.
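A sketch of that FILTER-based rewrite, keeping only the branch the script actually consumes:

{code}
-- hedged sketch: FILTER in place of SPLIT for the single branch that is used
trueData = FILTER prjData BY (validRec == '1' AND splitcond != '');
grpData = GROUP trueData BY splitcond;
{code}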

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-02-22 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1252:


Description: 
I have a script which uses split but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that falseDataTmp is untouched.

When I run this script with the no-Multiquery (-M) option I get the right result. 
This could be the result of complex BinCond's in the POLoad. We can get rid of 
this error by using FILTER instead of SPLIT.

Viraj

  was:
I have a script which uses split but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that falseDataTmp is untouched.

When I run this script with the no-Multiquery (-M) option I get the right result. 
This could be the result of complex BinCond's in the POLoad. We can get rid of 
this error by using FILTER instead of SPLIT.

Viraj


 Diamond splitter does not generate correct results when using Multi-query 
 optimization
 --

 Key: PIG-1252
 URL: https://issues.apache.org/jira/browse/PIG-1252
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


 I have a script which uses split but somehow does not use one of the split 
 branches. The skeleton of the script is as follows
 {code}
 loadData = load '/user/viraj/zebradata' using 
 org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
 col7');
 prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
 (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
 ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 
 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
 SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
 falseDataTmp IF (validRec == '1' AND splitcond == '');
 grpData = GROUP trueDataTmp BY splitcond;
 finalData = FOREACH grpData {
orderedData = ORDER trueDataTmp BY col1,col2;
GENERATE FLATTEN ( MYUDF (orderedData, 60, 
 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
   }
 dump finalData;
 {code}
 You can see that falseDataTmp is untouched.
 When I run this script with the no-Multiquery (-M) option I get the right result. 
 This could be the result of complex BinCond's in the POLoad. We can get rid 
 of this error by using FILTER instead of SPLIT.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error

2010-02-19 Thread Viraj Bhat (JIRA)
Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
-

 Key: PIG-1247
 URL: https://issues.apache.org/jira/browse/PIG-1247
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a large script in which there are intermediate store statements, one of 
them writing to a directory I do not have permission to write to.

The stack trace I get from Pig is this:

2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error

Details at logfile: /home/viraj/pig_1266632145355.log

Pig Stack Trace
---

ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
java.lang.ClassCastException: 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:386)


The only way to find the error was to look at the javacc-generated 
QueryParser.java code and add a System.out.println().


Here is a script to reproduce the problem:

{code}
A = load '/user/viraj/three.txt' using PigStorage();
B = foreach A generate ['a'#'12'] as b:map[] ;
store B into '/user/secure/pigtest' using PigStorage();
{code}

three.txt has 3 lines which contain nothing but the number 1.

{code}
$ hadoop fs -ls /user/secure/

ls: could not get get listing for 'hdfs://mynamenode/user/secure' : 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx--

{code}


Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1243) Passing Complex map types to and from streaming causes a problem

2010-02-18 Thread Viraj Bhat (JIRA)
Passing Complex map types to and from streaming causes a problem


 Key: PIG-1243
 URL: https://issues.apache.org/jira/browse/PIG-1243
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a program which generates different types of Maps fields and stores it 
into PigStorage.
{code}
A = load '/user/viraj/three.txt' using PigStorage();

B = foreach A generate ['a'#'12'] as b:map[], ['b'#['c'#'12']] as c, 
['c'#{(['d'#'15']),(['e'#'16'])}] as d;

store B into '/user/viraj/pigtest' using PigStorage();
{code}

Now I test the previous output in the below script to make sure I have the 
right results. I also pass this data to a Perl script and I observe that the 
complex Map types I have generated, are lost when I get the result back.

{code}
DEFINE CMD `simple.pl` SHIP('simple.pl');

A = load '/user/viraj/pigtest' using PigStorage() as (simpleFields, mapFields, 
mapListFields);

B = foreach A generate $0, $1, $2;

dump B;

C = foreach A generate  (chararray)simpleFields#'a' as value, $0,$1,$2;

D = stream C through CMD as (a0:map[], a1:map[], a2:map[]);

dump D;
{code}


dumping B results in:

([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])

dumping D results in:

([a#12],,)
([a#12],,)
([a#12],,)

The Perl script used here is:
{code}
#!/usr/local/bin/perl

use warnings;

use strict;

while (<STDIN>) {

my ($bc, $s, $m, $l) = split /\t/;

print("$s\t$m\t$l");

}
{code}

Is there an issue with handling of complex Map fields within streaming? How can 
I fix this to obtain the right result?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-02-10 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat reopened PIG-1194:
-


Hi Richard,
 I ran the script attached to the ticket and found that the map tasks fail 
with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
Error while processing the map plan. at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:281)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
org.apache.hadoop.mapred.Child.main(Child.java:159) 

I am using the latest pig.jar without hadoop.
Viraj

 ERROR 2055: Received Error while processing the map plan
 

 Key: PIG-1194
 URL: https://issues.apache.org/jira/browse/PIG-1194
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0, 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: inputdata.txt, PIG-1194.patch, PIG-1194.patch


 I have a simple Pig script which takes 3 columns out of which one is null. 
 {code}
 input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
 a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 > 11 ? 
 col1 : -1);
 b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, 
 SUM(input.col3) as  col3;
 store b into 'finalresult';
 {code}
 When I run this script I get the following error:
 ERROR 2055: Received Error while processing the map plan.
 org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
 Error while processing the map plan.
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 
 A more useful error message would make this easier to debug.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-08 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831248#action_12831248
 ] 

Viraj Bhat commented on PIG-1131:
-

Olga, I marked it as critical since we claim that Pig can eat any type of 
data, yet the example script shows that we need data with fixed schemas just 
to perform a simple join.

Viraj
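Until the fix lands, a hedged guard is to drop the empty lines before the join; this assumes the empty lines arrive as tuples whose first field is null or empty:

{code}
-- hedged workaround sketch: discard empty input lines before the join
input1f = FILTER input1 BY ($0 is not null) AND ($0 != '');
input2f = FILTER input2 BY ($0 is not null) AND ($0 != '');
joineddata = JOIN input1f BY $0, input2f BY $0;
{code}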

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in 
 POLocalRearrange.java:
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data, which contains empty lines, to the test cases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-08 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831251#action_12831251
 ] 

Viraj Bhat commented on PIG-1131:
-

Ashutosh, I was able to reproduce a similar problem using trunk. 

java -cp pig-withouthadoop.jar org.apache.pig.Main -version


Apache Pig version 0.7.0-dev (r907874) 

compiled Feb 08 2010, 17:35:04

Viraj

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in 
 POLocalRearrange.java:
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data, which contains empty lines, to the test cases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1220) Document unknown keywords as missing or to do in future

2010-02-03 Thread Viraj Bhat (JIRA)
Document unknown keywords as missing or to do in future
---

 Key: PIG-1220
 URL: https://issues.apache.org/jira/browse/PIG-1220
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


To get help at the grunt shell I do the following:

grunt> touchz

2010-02-04 00:59:28,714 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered <IDENTIFIER> "touchz" at line 1, 
column 1.
Was expecting one of:
EOF 
cat ...
fs ...
cd ...
cp ...
copyFromLocal ...
copyToLocal ...
dump ...
describe ...
aliases ...
explain ...
help ...
kill ...
ls ...
mv ...
mkdir ...
pwd ...
quit ...
register ...
rm ...
rmf ...
set ...
illustrate ...
run ...
exec ...
scriptDone ...
 ...
EOL ...
; ...

I looked at the code and found that we do nothing for scriptDone. Is there 
some future value for that command?
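Note that the expected-token list above does include fs ..., so until touchz becomes a first-class grunt command it can be reached through the filesystem passthrough (assuming a Hadoop version whose fs shell ships -touchz):

{code}
grunt> fs -touchz /user/viraj/emptyfile
{code}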

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1174) Creation of output path should be done by storage function

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1174:


Fix Version/s: 0.7.0

 Creation of output path should be done by storage function
 --

 Key: PIG-1174
 URL: https://issues.apache.org/jira/browse/PIG-1174
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
 Fix For: 0.7.0


 When executing a STORE command, Pig creates the output location before the 
 storage function gets called. This causes problems with storage functions 
 that have logic to determine the output location. See this thread:
 http://www.mail-archive.com/pig-user%40hadoop.apache.org/msg01538.html
 For example, when making a request like this:
 STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 
 'none', '\t');
 Pig creates a file '/my/home/output' and then an exception is thrown when 
 MultiStorage tries to make a directory under '/my/home/output'. The 
 workaround is to instead specify a dummy location as the first path like so:
 STORE A INTO '/my/home/output/temp' USING MultiStorage('/my/home/output','0', 
 'none', '\t');
 Two changes should be made:
 1. The path specified in the INTO clause should be available to the storage 
 function so it doesn't need to be duplicated.
 2. The creation of the output paths should be delegated to the storage 
 function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-940:
---

Affects Version/s: (was: 0.3.0)
   0.5.0
Fix Version/s: 0.7.0

 Cross site HDFS access using the default.fs.name not possible in Pig
 

 Key: PIG-940
 URL: https://issues.apache.org/jira/browse/PIG-940
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
 Environment: Hadoop 20
Reporter: Viraj Bhat
 Fix For: 0.7.0


 I have a script which does the following: access data from a remote HDFS 
 location (via an HDFS installed at hdfs://remotemachine1.company.com/) [[as I 
 do not want to copy this huge amount of data between HDFS locations]].
 However I want my Pig script to write data to the HDFS running on 
 localmachine.company.com.
 Currently Pig does not support that behavior and complains that: 
 hdfs://localmachine.company.com/user/viraj/A1.txt does not exist
 {code}
 A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
 B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
 C = JOIN A by a, B by c; 
 store C into 'output' using PigStorage();  
 {code}
 ===
 2009-09-01 00:37:24,032 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localmachine.company.com:8020
 2009-09-01 00:37:24,277 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localmachine.company.com:50300
 2009-09-01 00:37:24,567 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
  - Rewrite: POPackage-POForEach to POJoinPackage
 2009-09-01 00:37:24,573 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 1
 2009-09-01 00:37:24,573 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 1
 2009-09-01 00:37:26,197 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
  - Setting up single store job
 2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-09-01 00:37:26,746 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-09-01 00:37:26,746 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 100% complete
 2009-09-01 00:37:26,747 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 1 map reduce job(s) failed!
 2009-09-01 00:37:26,756 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed to produce result in: 
 hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480
 2009-09-01 00:37:26,756 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
 Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
 ===
 The error file in Pig contains:
 ===
 ERROR 2998: Unhandled internal error. 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
 hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
 at 
 org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
 at 
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
 at 
 

[jira] Updated: (PIG-531) Way for explain to show 1 plan at a time

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-531:
---

Fix Version/s: 0.5.0

Hi Olga,
 I think we have a way to handle this in multi-query optimization. Is it 
reasonable to close this as fixed?

I see the following in the Multi-query document about explain:

http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

explain [-out path] [-brief] [-dot] [-param key=value]* [-param_file 
filename]* [-script scriptname] [handle]

Viraj
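A usage sketch based on the syntax quoted above (the output path and script name are placeholders):

{code}
grunt> explain -out /tmp/plans -dot -script myscript.pig
{code}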

 Way for explain to show 1 plan at a time
 

 Key: PIG-531
 URL: https://issues.apache.org/jira/browse/PIG-531
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Fix For: 0.5.0


 Several users complained that EXPLAIN output is too verbose and is hard to 
 make sense of.
 One way to improve the situation is to realize is that EXPLAIN actually 
 contains several plans: logical, physical, backend specific. So we can update 
 EXPLAIN to allow to show a particular plan. For instance
 EXPLAIN LOGICAL A;
 would show only logical plan.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-01-15 Thread Viraj Bhat (JIRA)
ERROR 2055: Received Error while processing the map plan


 Key: PIG-1194
 URL: https://issues.apache.org/jira/browse/PIG-1194
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0, 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0
 Attachments: inputdata.txt

I have a simple Pig script which takes 3 columns out of which one is null. 
{code}

input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 > 11 ? col1 
: -1);
b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) 
as  col3;
store b into 'finalresult';

{code}


When I run this script I get the following error:

ERROR 2055: Received Error while processing the map plan.

org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
Error while processing the map plan.

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)



A more useful error message would make this easier to debug.

Viraj
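Since the grouping expression divides col3 by col2, a hedged guard while this is investigated is to filter the null rows out first (the comparison operators below are reconstructed; the original mail dropped them):

{code}
-- hedged sketch: drop rows where the division cannot be evaluated
clean = FILTER input BY (col2 is not null) AND (col3 is not null);
a = GROUP clean BY (((double) col3)/((double) col2) > .001 OR col1 > 11 ? col1 : -1);
{code}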

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-01-15 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1194:


Attachment: inputdata.txt

Test data to run with this script.

 ERROR 2055: Received Error while processing the map plan
 

 Key: PIG-1194
 URL: https://issues.apache.org/jira/browse/PIG-1194
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0, 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: inputdata.txt


 I have a simple Pig script which takes 3 columns out of which one is null. 
 {code}
 input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
 a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 > 11 ? 
 col1 : -1);
 b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, 
 SUM(input.col3) as  col3;
 store b into 'finalresult';
 {code}
 When I run this script I get the following error:
 ERROR 2055: Received Error while processing the map plan.
 org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
 Error while processing the map plan.
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 
 A more useful error message would make this easier to debug.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified

2010-01-14 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800315#action_12800315
 ] 

Viraj Bhat commented on PIG-1187:
-

Hi Jeff,
 This is specific to the data we are using; it looks like the parser fails 
while trying to interpret some characters. We have tested this with Chinese 
characters and it works.
Viraj

 UTF-8 (international code) breaks with loader when load with schema is 
 specified
 

 Key: PIG-1187
 URL: https://issues.apache.org/jira/browse/PIG-1187
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


 I have a set of Pig statements which dump an international dataset.
 {code}
 INPUT_OBJECT = load 'internationalcode';
 describe INPUT_OBJECT;
 dump INPUT_OBJECT;
 {code}
 Sample output
 (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)})
 It works and dumps results but when I use a schema for loading it fails.
 {code}
 INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag 
 {T: tuple(label:chararray)});
 describe INPUT_OBJECT;
 {code}
 The error message is as follows:2010-01-14 02:23:27,320 FATAL 
 org.apache.hadoop.mapred.Child: Error running child : 
 org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop 
 caused by repeated empty string matches at line 1, column 21.
   at 
 org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620)
   at 
 org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569)
   at 
 org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651)
   at 
 org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152)
   at 
 org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100)
   at 
 org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382)
   at 
 org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
   at 
 org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68)
   at 
 org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified

2010-01-13 Thread Viraj Bhat (JIRA)
UTF-8 (international code) breaks with loader when load with schema is specified


 Key: PIG-1187
 URL: https://issues.apache.org/jira/browse/PIG-1187
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a set of Pig statements which dump an international dataset.
{code}
INPUT_OBJECT = load 'internationalcode';
describe INPUT_OBJECT;
dump INPUT_OBJECT;
{code}

Sample output

(756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)})

It works and dumps results but when I use a schema for loading it fails.

{code}
INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag 
{T: tuple(label:chararray)});
describe INPUT_OBJECT;
{code}

The error message is as follows:2010-01-14 02:23:27,320 FATAL 
org.apache.hadoop.mapred.Child: Error running child : 
org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop 
caused by repeated empty string matches at line 1, column 21.
at 
org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620)
at 
org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569)
at 
org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651)
at 
org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152)
at 
org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100)
at 
org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382)
at 
org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
at 
org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68)
at 
org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj
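Until the tokenizer issue is resolved, a hedged interim approach is to keep the load schema-free (which works, per the first snippet) and only name the fields, deferring the bag cast:

{code}
-- hedged sketch: name the fields without forcing the bag cast that
-- trips the tokenizer
INPUT_OBJECT = load 'internationalcode';
NAMED = foreach INPUT_OBJECT generate $0 as object_id, $1 as labels;
dump NAMED;
{code}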

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM

2009-12-16 Thread Viraj Bhat (JIRA)
Successive replicated joins do not generate Map Reduce plan and fail due to OOM
---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


Hi all,
 I have a script which does 2 replicated joins in succession. Please note that 
the inputs do not exist on HDFS.

{code}
A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
A1 = FOREACH A GENERATE a;
B = GROUP A1 BY a;
C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
D = JOIN C BY x, B BY group USING replicated;
E = JOIN A BY a, D by x USING replicated;
dump E;
{code}

2009-12-16 19:12:00,253 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 4
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 1 map-only splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 1 map-reduce splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 2 out of total 2 splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 2
2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. unable to create new native thread
Details at logfile: pig_1260990666148.log

Looking at the log file:

Pig Stack Trace
---
ERROR 2998: Unhandled internal error. unable to create new native thread

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
at org.apache.pig.PigServer.store(PigServer.java:522)
at org.apache.pig.PigServer.openIterator(PigServer.java:458)
at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)


If we look at the explain output, we find that no Map Reduce plan is 
generated.

 Why is the M/R plan not generated?


Attaching the script and explain output.
Viraj
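A hedged way to break the chain, borrowing the store-and-reload pattern used elsewhere in these reports, is to materialize the first join so each job sets up a single replicated join (the reloaded field list below is an assumption):

{code}
-- hedged sketch: materialize D between the two replicated joins;
-- the field list on reload is an assumption
D = JOIN C BY x, B BY group USING replicated;
store D into '/tmp/D_intermediate' using PigStorage();
D1 = LOAD '/tmp/D_intermediate' USING PigStorage() AS (x:long, y, grp:long, a1bag);
E = JOIN A BY a, D1 BY x USING replicated;
{code}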

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM

2009-12-16 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1157:


Attachment: oomreplicatedjoin.pig
replicatedjoinexplain.log

Explain output and Pig script.

 Successive replicated joins do not generate Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log


 Hi all,
  I have a script which does two replicated joins in succession. Please note 
 that the inputs do not exist on HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we look at the explain output, we find that no Map Reduce plan is 
 generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)
set default_parallelism construct does not set the number of reducers correctly
---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
 Fix For: 0.7.0


Hi all,
 I have a Pig script where I set the parallelism using the following set 
construct: "set default_parallel 100". I modified MRPrinter.java to 
print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() +
        " Parallelism " + mr.getRequestedParallelism());
}
...
{code}

When I run explain on the script, I see that the last job, which does the 
actual sort, runs as a single-reducer job. This can be corrected by adding the 
PARALLEL keyword to the ORDER BY statement, as sketched below.
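
A minimal sketch of the workaround (relation and column names are hypothetical, 
not from the attached script):

{code}
D = ORDER C BY cnt PARALLEL 100;  -- explicitly request 100 reducers for the sort job
{code}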

Attaching the script and the explain output

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1144:


Attachment: brokenparallel.out
genericscript_broken_parallel.pig

Script and explain output

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: "set default_parallel 100". I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr) {
     mStream.println("MapReduce node " + mr.getOperatorKey().toString() +
         " Parallelism " + mr.getRequestedParallelism());
 }
 ...
 {code}
 When I run explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788436#action_12788436
 ] 

Viraj Bhat commented on PIG-1144:
-

This happens on a real cluster, where the sorting job did not complete 
because it ran with a single reducer. 

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: "set default_parallel 100". I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr) {
     mStream.println("MapReduce node " + mr.getOperatorKey().toString() +
         " Parallelism " + mr.getRequestedParallelism());
 }
 ...
 {code}
 When I run explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788439#action_12788439
 ] 

Viraj Bhat commented on PIG-1144:
-

Hi Daniel,
One more thing to note: the last sort M/R job has a parallelism of 1. 
Should it not be -1?
Viraj

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: "set default_parallel 100". I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr) {
     mStream.println("MapReduce node " + mr.getOperatorKey().toString() +
         " Parallelism " + mr.getRequestedParallelism());
 }
 ...
 {code}
 When I run explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788481#action_12788481
 ] 

Viraj Bhat commented on PIG-1144:
-

Hi Daniel,
 Thanks again for your input. This is more of a performance issue that users 
do not detect until they see the single-reducer job fail in the sort phase. 
They assume that the default_parallel keyword will do the trick.
Viraj

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: "set default_parallel 100". I modified MRPrinter.java to 
 print out the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr) {
     mStream.println("MapReduce node " + mr.getOperatorKey().toString() +
         " Parallelism " + mr.getRequestedParallelism());
 }
 ...
 {code}
 When I run explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY statement.
 Attaching the script and the explain output
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1131) Pig simple join does not work when it contains empty lines

2009-12-07 Thread Viraj Bhat (JIRA)
Pig simple join does not work when it contains empty lines
--

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.7.0


I have a simple script, which does a JOIN.

{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;

input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;

joineddata = JOIN input1 by $0, input2 by $0;

describe joineddata;

store joineddata into 'result';
{code}

The input data contains empty lines.  

The join fails in the Map phase with the following error in 
POLocalRearrange.java:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

I am surprised that the test cases did not detect this error. Could we add 
data containing empty lines to the test cases?
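
A hedged workaround sketch (my assumption, not part of the report): filter out 
the empty records before the join.

{code}
input1f = FILTER input1 BY $0 is not null;  -- drop empty lines from each input
input2f = FILTER input2 BY $0 is not null;
joineddata = JOIN input1f BY $0, input2f BY $0;
{code}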

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2009-12-07 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1131:


Attachment: simplejoinscript.pig
junk2.txt
junk1.txt

Dummy datasets and pig script

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in 
 POLocalRearrange.java:
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 data containing empty lines to the test cases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1124) Unable to set Custom Job Name using the -Dmapred.job.name parameter

2009-12-03 Thread Viraj Bhat (JIRA)
Unable to set Custom Job Name using the -Dmapred.job.name parameter
---

 Key: PIG-1124
 URL: https://issues.apache.org/jira/browse/PIG-1124
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Priority: Minor
 Fix For: 0.6.0


As a Hadoop user I want to control the job name for my analysis via the 
command line, using the following construct:

{code}
java -cp pig.jar:$HADOOP_HOME/conf -Dmapred.job.name=hadoop_junkie 
org.apache.pig.Main broken.pig
{code}

-Dmapred.job.name should normally set my Hadoop job name, but somewhere during 
the formation of the job.xml in Pig this information is lost, and the job name 
turns out to be:

PigLatin:broken.pig

The current workaround seems to be wiring it into the script itself, using the 
following (or using parameter substitution; see the sketch below):

{code}
set job.name 'my job'
{code}
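
A sketch of the parameter-substitution variant mentioned above (the parameter 
name jobname is hypothetical):

{code}
set job.name '$jobname'
{code}

invoked as:

{code}
java -cp pig.jar org.apache.pig.Main -param jobname=hadoop_junkie broken.pig
{code}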

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement

2009-11-20 Thread Viraj Bhat (JIRA)
Pig parser does not recognize its own data type in LIMIT statement
--

 Key: PIG-1101
 URL: https://issues.apache.org/jira/browse/PIG-1101
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Priority: Minor
 Fix For: 0.6.0


I have a Pig script in which I specify the number of records to limit as a long 
type. 

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);

B = LIMIT A 10L;

DUMP B;
{code}

I get a parser error:

2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered "<LONGINTEGER> 10L" at line 3, 
column 13.
Was expecting:
    <INTEGER> ...
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)

In fact, 10L seems to work in the FOREACH ... GENERATE construct; the contrast 
is sketched below.
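
A minimal contrast sketch (these two statements are illustrations, not from 
the report):

{code}
B = LIMIT A 10;               -- integer literal: parses fine
C = FOREACH A GENERATE 10L;   -- long literal: accepted in this context
{code}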

Viraj



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword

2009-11-10 Thread Viraj Bhat (JIRA)
PigCookBook use of PARALLEL keyword
---

 Key: PIG-1081
 URL: https://issues.apache.org/jira/browse/PIG-1081
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: Viraj Bhat
 Fix For: 0.5.0


Hi all,
 I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the 
PARALLEL keyword.

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword 
We know that Pig 0.5 currently uses Hadoop 20 (as its default), which launches 
1 reducer in all cases. 

In this documentation we state: "num machines * num reduce slots per 
machine * 0.9". This guidance was valid for HoD (Hadoop on Demand), where you 
create your own Hadoop clusters. But if you are using either the Capacity 
Scheduler 
(http://hadoop.apache.org/common/docs/current/capacity_scheduler.html) or the 
Fair Share Scheduler 
(http://hadoop.apache.org/common/docs/current/fair_scheduler.html), these 
numbers could mean that you are using around 90% of the reducer slots on your 
cluster.

We should change this to something like: 
"The number of reducers you may need for a particular construct in Pig which 
forms a Map Reduce boundary depends entirely on your data and the number of 
intermediate keys you are generating in your mappers. In the best cases we 
have seen a reducer processing about 500 MB of data behave efficiently. 
Additionally, it is hard to define the optimum number of reducers, since it 
completely depends on the partitioner and the distribution of map (combiner) 
output keys." A sketch of the keyword itself follows.
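
For reference, a minimal sketch of the keyword under discussion (relation and 
field names are hypothetical):

{code}
grouped = GROUP logs BY user PARALLEL 40;  -- explicitly request 40 reducers for this boundary
{code}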

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1084) Pig CookBook documentation Take Advantage of Join Optimization additions: Merge and Skewed Join

2009-11-10 Thread Viraj Bhat (JIRA)
Pig CookBook documentation Take Advantage of Join Optimization 
additions: Merge and Skewed Join


 Key: PIG-1084
 URL: https://issues.apache.org/jira/browse/PIG-1084
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


Hi all,
 We have a host of Join optimizations that have been implemented recently in 
Pig to improve performance. These include:

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html#JOIN

1) Merge Join
2) Skewed Join

It would be nice to mention the Merge Join and the Skewed Join in the 
following section of the Pig CookBook:

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Take+Advantage+of+Join+Optimization

Can we update this for release 0.6?

Thanks
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits

2009-11-04 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12773744#action_12773744
 ] 

Viraj Bhat commented on PIG-1060:
-

Hi Ankur and Richard,
 I have a script which demonstrates a similar problem but can be solved by 
using the -M option (which disables multi-query optimization). This script can 
reproduce the problem even without the UNION operator, but it has properties 1 
and 2 of the original problem description.

Try commenting out the store of alias F; the script then works fine.

{code}

ORIGINALDATA = load '/user/viraj/somedata.txt' using PigStorage() as (col1, 
col2, col3, col4, col5, col6, col7, col8);

--Check data
A = foreach ORIGINALDATA generate col1, col2, col3, col4, col5, col6;
B = group A all;
C = foreach B generate COUNT(A);
store C into '/user/viraj/result1';

D = filter A by (col1 == col2) or (col1 == col3);
E = group D all;
F = foreach E generate COUNT(D);
--try commenting F
store F into '/user/viraj/result2';

G = filter D by (col4 == col5);
H = group G all;
I = foreach H generate COUNT(G);
store I into '/user/viraj/result3';

J = filter G by (((col6 == 'm') or (col6 == 'M')) and (col6 == 1)) or (((col6 
== 'f') or (col6 == 'F')) and (col6 == 0)) or ((col6 == '') and (col6 == -1));
K = group J all;
L = foreach K generate COUNT(J);
store L into '/user/viraj/result4';

{code}
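
For reference, a sketch of running the script with multi-query disabled via 
the -M option mentioned above (invocation style follows other reports in this 
thread; the script name is hypothetical):

{code}
java -cp pig.jar org.apache.pig.Main -M multilevel_split.pig
{code}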



 MultiQuery optimization throws error for multi-level splits
 ---

 Key: PIG-1060
 URL: https://issues.apache.org/jira/browse/PIG-1060
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Ankur
Assignee: Richard Ding

 Consider the following scenario :-
 1. Multi-level splits in the map plan.
 2. Each split branch further progressing across a local-global rearrange.
 3. Output of each of these finally merged via a UNION.
 MultiQuery optimizer throws the following error in such a case:
 ERROR 2146: Internal Error. Inconsistency in key index found during 
 optimization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-10-29 Thread Viraj Bhat (JIRA)
Behaviour of COGROUP with and without schema when using * operator


 Key: PIG-1064
 URL: https://issues.apache.org/jira/browse/PIG-1064
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have 2 tab-separated files, 1.txt and 2.txt:

$ cat 1.txt 
1   2
2   3

$ cat 2.txt 
1   2
2   3

I use the COGROUP feature of Pig in the following way:

$ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1012: Each COGroup input has to have the same number of inner plans
Details at logfile: pig_1256845224752.log
==

If I reverse the order of the schemas:
{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
{code}
2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1013: Grouping attributes can either be star (*) or a list of expressions, but 
not both.
Details at logfile: pig_1256845224752.log

==
Now, running without any schema:
{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}

2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: file:/tmp/temp-319926700/tmp-1990275961
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 2
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
154
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})
==

Is this a bug or a feature?
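
A hedged workaround sketch (my assumption, not verified in this report): give 
both inputs schemas and spell out the grouping columns instead of using *.

{code}
A = load '1.txt' as (a0, a1);
B = load '2.txt' as (b0, b1);
C = cogroup A by (a0, a1), B by (b0, b1);
{code}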

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double

2009-10-20 Thread Viraj Bhat (JIRA)
PigStorage interpreting chararray/bytearray for a tuple element inside a bag as 
float or double
---

 Key: PIG-1031
 URL: https://issues.apache.org/jira/browse/PIG-1031
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
Reporter: Viraj Bhat
 Fix For: 0.5.0, 0.6.0


I have data stored in a text file as:

{(4153E765)}
{(AF533765)}

I try reading it using PigStorage as:
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:

{code}
({(Infinity)})
({(AF533765)})
{code}

The problem seems to be with the method parseFromBytes(byte[] b) in the class 
Utf8StorageConverter. This method uses the TextDataParser (a class generated 
via jjt) to infer the type of the data from its content, even though the 
schema says it is a bytearray. 

TextDataParser.jjt sample code:
{code}
TOKEN :
{
...
  <DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] (["-","+"])? 
<FLOATINGPOINT> )? >
  <FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
...
}
{code}

I tried the following options, but they do not work either, as we still end 
up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:chararray)});
{code}


Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double

2009-10-20 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1031:


Description: 
I have data stored in a text file as:

{(4153E765)}
{(AF533765)}


I try reading it using PigStorage as:

{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:


({(Infinity)})
({(AF533765)})


The problem seems to be with the method parseFromBytes(byte[] b) in the class 
Utf8StorageConverter. This method uses the TextDataParser (a class generated 
via jjt) to infer the type of the data from its content, even though the 
schema says it is a bytearray. 

TextDataParser.jjt sample code:
{code}
TOKEN :
{
...
  <DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] (["-","+"])? 
<FLOATINGPOINT> )? >
  <FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
...
}
{code}

I tried the following options, but they do not work either, as we still end 
up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:chararray)});
{code}


Viraj

  was:
I have data stored in a text file as:

{(4153E765)}
{(AF533765)}

I try reading it using PigStorage as:
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:

{code}
({(Infinity)})
({(AF533765)})
{code}

The problem seems to be with the method parseFromBytes(byte[] b) in the class 
Utf8StorageConverter. This method uses the TextDataParser (a class generated 
via jjt) to infer the type of the data from its content, even though the 
schema says it is a bytearray. 

TextDataParser.jjt sample code:
{code}
TOKEN :
{
...
  <DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] (["-","+"])? 
<FLOATINGPOINT> )? >
  <FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
...
}
{code}

I tried the following options, but they do not work either, as we still end 
up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:chararray)});
{code}


Viraj


 PigStorage interpreting chararray/bytearray for a tuple element inside a bag 
 as float or double
 ---

 Key: PIG-1031
 URL: https://issues.apache.org/jira/browse/PIG-1031
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
Reporter: Viraj Bhat
 Fix For: 0.5.0, 0.6.0


 I have data stored in a text file as:
 {(4153E765)}
 {(AF533765)}
 I try reading it using PigStorage as:
 {code}
 A = load 'pigstoragebroken.dat' using PigStorage() as 
 (intersectionBag:bag{T:tuple(term:bytearray)});
 dump A;
 {code}
 I get the following results:
 ({(Infinity)})
 ({(AF533765)})
 The problem seems to be with the method parseFromBytes(byte[] b) in the class 
 Utf8StorageConverter. This method uses the TextDataParser (a class generated 
 via jjt) to infer the type of the data from its content, even though the 
 schema says it is a bytearray. 
 TextDataParser.jjt sample code:
 {code}
 TOKEN :
 {
 ...
   <DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] (["-","+"])? 
 <FLOATINGPOINT> )? >
   <FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
 ...
 }
 {code}
 I tried the following options, but they do not work either, as we still end 
 up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
 {code}
 A = load 'pigstoragebroken.dat' using PigStorage() as 
 (intersectionBag:bag{T:tuple(term)});
 A = load 'pigstoragebroken.dat' using PigStorage() as 
 (intersectionBag:bag{T:tuple(term:chararray)});
 {code}
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization

2009-09-25 Thread Viraj Bhat (JIRA)
ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) 
and ERROR 2999: (Unexpected internal error. null) when using Multi-Query 
optimization
---

 Key: PIG-978
 URL: https://issues.apache.org/jira/browse/PIG-978
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a Pig script of this form, which I execute using multi-query 
optimization:

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group 
C = ..aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group 
Ctab = ..aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group 
G = aggregation function
store G into '/user/viraj/finalresult1';

Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group 
Gtab = aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}


The error:

2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log

is due to the mismatch of store/load commands. The script first stores files 
into the 'days1' directory (store C into 
'/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later 
loads from the top-level directory (E = load 
'/user/viraj/firstinputtempresult/' using PigStorage()) instead of the 
original directory (/user/viraj/firstinputtempresult/days1).

The current multi-query optimizer can't resolve the dependency between these 
two commands, because they have different load file paths. So the jobs run 
concurrently and result in the error above.

The solution is to add an 'exec' or 'run' command after the first two stores. 
This forces the first two store commands to run before the remaining commands, 
as sketched below.
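
A minimal sketch of the workaround (statements taken from the script above; 
the exec placement follows the suggestion in this report and is not something 
verified here):

{code}
store C into '/user/viraj/firstinputtempresult/days1';
store Ctab into '/user/viraj/secondinputtempresult/days1';
exec
-- the exec barrier runs everything above before the loads below are planned
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
{code}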

It would be nice to see this fixed as part of an enhancement to the 
multi-query optimizer: either disable multi-query in this case or throw a 
warning/error message, so that the user can correct the load/store statements.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)
Issues with mv command when used after store when using -param_file/-param 
options
--

 Key: PIG-974
 URL: https://issues.apache.org/jira/browse/PIG-974
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: Hadoop 18 and 20
Reporter: Viraj Bhat
 Fix For: 0.6.0
 Attachments: studenttab10k

I have a Pig script which moves the final output to another HDFS directory to 
signal completion, so that another Pig script can start working on these 
results.
{code}
studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
age:int,gpa:float);
X = GROUP studenttab by age;
Y = FOREACH X GENERATE group, COUNT(studenttab);
store Y into '$finalop' using PigStorage();
mv '$finalop' '$finalmove';
{code}

where finalop and finalmove are parameters used for storing intermediate and 
final results.

I run this script as this:
{code}
$shell java -cp pig20.jar:/path/tohadoop/site.xml 
-Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
testmove.pig 
{code}
or using the param_file option
{code}
$shell java -cp pig20.jar:/path/tohadoop/site.xml 
-Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
moveparamfile testmove.pig
{code}

The underlying Map Reduce jobs run well but the move command seems to be 
failing:

2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /homes/viraj/pigscripts/pig_1253748381778.log
2009-09-23 23:26:21,963 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://localhost:8020
2009-09-23 23:26:22,227 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: localhost:50300
2009-09-23 23:26:27,187 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer 
- Choosing to move algebraic foreach to combiner
2009-09-23 23:26:27,203 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 1
2009-09-23 23:26:27,203 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 1
2009-09-23 23:26:28,828 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the arguments. Applications should 
implement Tool for the same.
2009-09-23 23:26:29,478 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2009-09-23 23:27:29,828 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 50% complete
2009-09-23 23:27:59,764 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 50% complete
2009-09-23 23:28:57,249 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2009-09-23 23:28:57,249 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Successfully stored result in: /user/viraj/finaloutput
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Records written : 60
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Bytes written : 420
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Success!
2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' 
does not exist.
Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log

{code}
$shell hadoop fs -ls /user/viraj/finaloutput 
Found 1 items
-rw---   3 viraj users420 2009-09-23 23:42 
/user/viraj/finaloutput/part-0
{code}

Opening the log file:

Pig Stack Trace
---
ERROR 2998: Unhandled internal error. File or directory 
'/user/viraj/finaloutput' does not exist.

java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
at 

[jira] Updated: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-974:
---

Attachment: studenttab10k

Testdata

 Issues with mv command when used after store when using -param_file/-param 
 options
 --

 Key: PIG-974
 URL: https://issues.apache.org/jira/browse/PIG-974
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: Hadoop 18 and 20
Reporter: Viraj Bhat
 Fix For: 0.6.0

 Attachments: studenttab10k


 I have a Pig script which moves the final output to another HDFS directory to 
 signal completion, so that another Pig script can start working on these 
 results.
 {code}
 studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
 age:int,gpa:float);
 X = GROUP studenttab by age;
 Y = FOREACH X GENERATE group, COUNT(studenttab);
 store Y into '$finalop' using PigStorage();
 mv '$finalop' '$finalmove';
 {code}
 where finalop and finalmove are parameters used for storing intermediate and 
 final results.
 I run this script as this:
 {code}
 $shell java -cp pig20.jar:/path/tohadoop/site.xml 
 -Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
 finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
 testmove.pig 
 {code}
 or using the param_file option
 {code}
 $shell java -cp pig20.jar:/path/tohadoop/site.xml 
 -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
 moveparamfile testmove.pig
 {code}
 
 The underlying Map Reduce jobs run well but the move command seems to be 
 failing:
 
 2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /homes/viraj/pigscripts/pig_1253748381778.log
 2009-09-23 23:26:21,963 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:8020
 2009-09-23 23:26:22,227 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:50300
 2009-09-23 23:26:27,187 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
  - Choosing to move algebraic foreach to combiner
 2009-09-23 23:26:27,203 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 1
 2009-09-23 23:26:27,203 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 1
 2009-09-23 23:26:28,828 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
  - Setting up single store job
 2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-09-23 23:26:29,478 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-09-23 23:27:29,828 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 50% complete
 2009-09-23 23:27:59,764 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 50% complete
 2009-09-23 23:28:57,249 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 100% complete
 2009-09-23 23:28:57,249 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Successfully stored result in: /user/viraj/finaloutput
 2009-09-23 23:28:57,267 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Records written : 60
 2009-09-23 23:28:57,267 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Bytes written : 420
 2009-09-23 23:28:57,267 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Success!
 2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' 
 does not exist.
 Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log
 
 {code}
 $shell hadoop fs -ls /user/viraj/finaloutput 
 Found 1 items
 -rw---   3 viraj users420 2009-09-23 23:42 
 /user/viraj/finaloutput/part-0
 {code}
 
 Opening the log file:
 

[jira] Commented: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758962#action_12758962
 ] 

Viraj Bhat commented on PIG-974:


It turns out that the problem was due to single quotes.
{code}
mv '$finalop' '$finalmove';
{code}

This modified piece of the script should work:
{code}
mv $finalop $finalmove;
{code}

The hard part here is knowing when to use single quotes around parameters and 
when not to. This is not documented in the manual.

The error message is also confusing..
===
java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
===

I thought that the single quotes around the filename printed in the error 
message referred to the correct file name.

{code}
$shell hadoop fs -ls '/user/viraj/finaloutput' 
Found 1 items
-rw---   3 viraj users420 2009-09-24 01:16 
/user/viraj/finaloutput/part-0
{code}

Thanks Viraj

 Issues with mv command when used after store when using -param_file/-param 
 options
 --

 Key: PIG-974
 URL: https://issues.apache.org/jira/browse/PIG-974
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: Hadoop 18 and 20
Reporter: Viraj Bhat
 Fix For: 0.6.0

 Attachments: studenttab10k


 I have a Pig script which moves the final output to another HDFS directory to 
 signal completion, so that another Pig script can start working on these 
 results.
 {code}
 studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
 age:int,gpa:float);
 X = GROUP studenttab by age;
 Y = FOREACH X GENERATE group, COUNT(studenttab);
 store Y into '$finalop' using PigStorage();
 mv '$finalop' '$finalmove';
 {code}
 where finalop and finalmove are parameters used for storing intermediate and 
 final results.
 I run this script as this:
 {code}
 $shell java -cp pig20.jar:/path/tohadoop/site.xml 
 -Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
 finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
 testmove.pig 
 {code}
 or using the param_file option
 {code}
 $shell java -cp pig20.jar:/path/tohadoop/site.xml 
 -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
 moveparamfile testmove.pig
 {code}
 
 The underlying Map Reduce jobs run well but the move command seems to be 
 failing:
 
 2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /homes/viraj/pigscripts/pig_1253748381778.log
 2009-09-23 23:26:21,963 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:8020
 2009-09-23 23:26:22,227 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:50300
 2009-09-23 23:26:27,187 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
  - Choosing to move algebraic foreach to combiner
 2009-09-23 23:26:27,203 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 1
 2009-09-23 23:26:27,203 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 1
 2009-09-23 23:26:28,828 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
  - Setting up single store job
 2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-09-23 23:26:29,478 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-09-23 23:27:29,828 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 50% complete
 2009-09-23 23:27:59,764 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 50% complete
 2009-09-23 23:28:57,249 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 100% complete
 2009-09-23 23:28:57,249 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Successfully stored result in: /user/viraj/finaloutput
 2009-09-23 23:28:57,267 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Records written : 60
 2009-09-23 23:28:57,267 [main] INFO  
 

[jira] Created: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig

2009-08-31 Thread Viraj Bhat (JIRA)
Cross site HDFS access using the default.fs.name not possible in Pig


 Key: PIG-940
 URL: https://issues.apache.org/jira/browse/PIG-940
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
 Environment: Hadoop 20
Reporter: Viraj Bhat
 Fix For: 0.3.0


I have a script which accesses data from a remote HDFS location (an HDFS 
instance at hdfs://remotemachine1.company.com/), as I do not want to copy 
this huge amount of data between HDFS installations.

However, I want my Pig script to write data to the HDFS running on 
localmachine.company.com.

Currently Pig does not support that behavior and complains that 
hdfs://localmachine.company.com/user/viraj/A1.txt does not exist:

{code}
A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
C = JOIN A by a, B by c; 
store C into 'output' using PigStorage();  
{code}
===
2009-09-01 00:37:24,032 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://localmachine.company.com:8020
2009-09-01 00:37:24,277 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: localmachine.company.com:50300
2009-09-01 00:37:24,567 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
 - Rewrite: POPackage-POForEach to POJoinPackage
2009-09-01 00:37:24,573 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 1
2009-09-01 00:37:24,573 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 1
2009-09-01 00:37:26,197 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the arguments. Applications should 
implement Tool for the same.
2009-09-01 00:37:26,746 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2009-09-01 00:37:26,746 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2009-09-01 00:37:26,747 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 1 map reduce job(s) failed!
2009-09-01 00:37:26,756 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed to produce result in: 
hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480
2009-09-01 00:37:26,756 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
===

The error file in Pig contains:
===
ERROR 2998: Unhandled internal error. 
org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
at 
org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
at 
org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
at 
org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)

java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: 
ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not 

[jira] Commented: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig

2009-08-31 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749722#action_12749722
 ] 

Viraj Bhat commented on PIG-940:


One important point to add:
{code}
localmachine.company.com> hadoop fs -ls 
hdfs://remotemachine1.company.com/user/viraj/*.txt
-rw-r--r--   3 viraj users 13 2009-08-13 23:42 /user/viraj/A1.txt
-rw-r--r--   3 viraj users  8 2009-08-29 00:51 /user/viraj/B1.txt
{code}

 Cross site HDFS access using the default.fs.name not possible in Pig
 

 Key: PIG-940
 URL: https://issues.apache.org/jira/browse/PIG-940
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
 Environment: Hadoop 20
Reporter: Viraj Bhat
 Fix For: 0.3.0


 I have a script which accesses data from a remote HDFS location (an HDFS 
 instance at hdfs://remotemachine1.company.com/), as I do not want to copy 
 this huge amount of data between HDFS installations.
 However, I want my Pig script to write data to the HDFS running on 
 localmachine.company.com.
 Currently Pig does not support that behavior and complains that 
 hdfs://localmachine.company.com/user/viraj/A1.txt does not exist:
 {code}
 A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
 B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
 C = JOIN A by a, B by c; 
 store C into 'output' using PigStorage();  
 {code}
 ===
 2009-09-01 00:37:24,032 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localmachine.company.com:8020
 2009-09-01 00:37:24,277 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localmachine.company.com:50300
 2009-09-01 00:37:24,567 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
  - Rewrite: POPackage-POForEach to POJoinPackage
 2009-09-01 00:37:24,573 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 1
 2009-09-01 00:37:24,573 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 1
 2009-09-01 00:37:26,197 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
  - Setting up single store job
 2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-09-01 00:37:26,746 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-09-01 00:37:26,746 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 100% complete
 2009-09-01 00:37:26,747 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 1 map reduce job(s) failed!
 2009-09-01 00:37:26,756 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed to produce result in: 
 hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480
 2009-09-01 00:37:26,756 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
 Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
 ===
 The error file in Pig contains:
 ===
 ERROR 2998: Unhandled internal error. 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
 hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
 at 
 org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
 at 
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 at 
 

[jira] Created: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-08-12 Thread Viraj Bhat (JIRA)
Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableBytesWritable, recieved 
org.apache.pig.impl.io.NullableText when doing simple group
--

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0


I have a Pig script which takes in a student file and generates a bag of maps. 
I later want to group on the value of the key 'name0', which corresponds to the 
first name of the student.
{code}
register mymapudf.jar;

data = LOAD '/user/viraj/studenttab10k' AS 
(somename:chararray,age:long,marks:float);

genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as 
bp:map[], age, marks;

getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;

filternonnullfirstnames = filter getfirstnames by firstname is not null;

groupgenmap = group filternonnullfirstnames by firstname;

dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:
===
java.io.IOException: Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableBytesWritable, recieved 
org.apache.pig.impl.io.NullableText
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-08-12 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742668#action_12742668
 ] 

Viraj Bhat commented on PIG-919:


This problem can be solved simply by casting firstname to chararray. But why should that be necessary?
{code}
groupgenmap = group filternonnullfirstnames by (chararray)firstname;

dump groupgenmap;
{code}

Is there a problem with the UDF?
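A likely explanation (an assumption, not confirmed in this thread): a value looked up from a map is only known to the front end as bytearray, while the UDF actually emits String values, so at group time the key arrives as a NullableText where the plan expects a NullableBytesWritable. The cast pins the declared type to the runtime type, and it appears to work equally well at the projection step:
{code}
-- hedged sketch: cast the map value where it is first projected, so every
-- downstream operator sees a chararray key
getfirstnames = foreach genmap generate (chararray)bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
{code}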

 Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText when doing simple group
 --

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar


 I have a Pig script, which takes in a student file and generates a bag of 
 maps.  I later want to group on the value of the key name0 which 
 corresponds to the first name of the student.
 {code}
 register mymapudf.jar;
 data = LOAD '/user/viraj/studenttab10k' AS 
 (somename:chararray,age:long,marks:float);
 genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as 
 bp:map[], age, marks;
 getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
 filternonnullfirstnames = filter getfirstnames by firstname is not null;
 groupgenmap = group filternonnullfirstnames by firstname;
 dump groupgenmap;
 {code}
 When I execute this code, I get an error in the Map Phase:
 ===
 java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 ===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-913) Error in Pig script when grouping on chararray column

2009-08-06 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740360#action_12740360
 ] 

Viraj Bhat commented on PIG-913:


The following works, though:
{code}
data = LOAD '/user/viraj/studenttab10k' AS (s:bytearray);
--data =  LOAD '/user/viraj/studenttab10k' AS (s);

dataSmall = limit data 100;

bb = GROUP dataSmall by $0;

dump bb;
{code}

or 
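Presumably the 'or' above was followed by the untyped load shown commented out in the snippet; a sketch of that variant, assuming so:
{code}
data = LOAD '/user/viraj/studenttab10k' AS (s);
dataSmall = limit data 100;
bb = GROUP dataSmall by $0;
dump bb;
{code}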

 Error in Pig script when grouping on chararray column
 -

 Key: PIG-913
 URL: https://issues.apache.org/jira/browse/PIG-913
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.4.0


 I have a very simple script which fails at parse time due to the schema I 
 specified in the loader.
 {code}
 data = LOAD '/user/viraj/studenttab10k' AS (s:chararray);
 dataSmall = limit data 100;
 bb = GROUP dataSmall by $0;
 dump bb;
 {code}
 =
 2009-08-06 18:47:56,297 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 09/08/06 18:47:56 INFO pig.Main: Logging error messages to: 
 /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 2009-08-06 18:47:56,459 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to hadoop 
 file system at: hdfs://localhost:9000
 2009-08-06 18:47:56,694 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to 
 map-reduce job tracker at: localhost:9001
 2009-08-06 18:47:57,008 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1002: Unable to store alias bb
 09/08/06 18:47:57 ERROR grunt.Grunt: ERROR 1002: Unable to store alias bb
 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 =
 =
 Pig Stack Trace
 ---
 ERROR 1002: Unable to store alias bb
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias bb
 at org.apache.pig.PigServer.openIterator(PigServer.java:481)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:531)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: 
 Unable to store alias bb
 at org.apache.pig.PigServer.store(PigServer.java:536)
 at org.apache.pig.PigServer.openIterator(PigServer.java:464)
 ... 6 more
 Caused by: java.lang.NullPointerException
 at 
 org.apache.pig.impl.logicalLayer.LOCogroup.unsetSchema(LOCogroup.java:359)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.SchemaRemover.visit(SchemaRemover.java:64)
 at 
 org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:335)
 at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:46)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildSchemas(LogicalTransformer.java:67)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:187)
 at org.apache.pig.PigServer.compileLp(PigServer.java:854)
 at org.apache.pig.PigServer.compileLp(PigServer.java:791)
 at org.apache.pig.PigServer.store(PigServer.java:509)
 ... 7 more
 =

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-828) Problem accessing a tuple within a bag

2009-06-01 Thread Viraj Bhat (JIRA)
Problem accessing a tuple within a bag
--

 Key: PIG-828
 URL: https://issues.apache.org/jira/browse/PIG-828
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0


The Pig script below creates a tuple which contains 3 columns, 2 of which are 
chararrays; the third column is a bag holding a constant chararray. The script 
later projects the bag nested within the tuple.

{code}
a = load 'studenttab5' as (name, age, gpa);

b = foreach a generate ('viraj', {('sms')}, 'pig') as 
document:(id,singlebag:{singleTuple:(single)}, article);

describe b;

c = foreach b generate document.singlebag;

dump c;
{code}

When we run this script we get a run-time error in the Map phase.

java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast 
to org.apache.pig.data.DataBag
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
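
A possible workaround until this is fixed (a sketch, untested against 0.3.0): flatten the tuple first so that the bag becomes a top-level column, then project the bag directly. Note the flattened field may need to be referenced with the document:: prefix, depending on how names are disambiguated:
{code}
-- flatten the tuple into its component columns, then project the bag
b2 = foreach b generate flatten(document);
c = foreach b2 generate singlebag;
dump c;
{code}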



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-828) Problem accessing a tuple within a bag

2009-06-01 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-828:
---

Attachment: tupleacc.pig
studenttab5

Input script and data.

 Problem accessing a tuple within a bag
 --

 Key: PIG-828
 URL: https://issues.apache.org/jira/browse/PIG-828
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: studenttab5, tupleacc.pig


 The Pig script below creates a tuple which contains 3 columns, 2 of which are 
 chararrays; the third column is a bag holding a constant chararray. The script 
 later projects the bag nested within the tuple.
 {code}
 a = load 'studenttab5' as (name, age, gpa);
 b = foreach a generate ('viraj', {('sms')}, 'pig') as 
 document:(id,singlebag:{singleTuple:(single)}, article);
 describe b;
 c = foreach b generate document.singlebag;
 dump c;
 {code}
 When we run this script we get a run-time error in the Map phase.
 
 java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast 
 to org.apache.pig.data.DataBag
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-816) PigStorage() does not accept Unicode characters in its constructor

2009-05-22 Thread Viraj Bhat (JIRA)
PigStorage() does not accept Unicode characters in its constructor 
--

 Key: PIG-816
 URL: https://issues.apache.org/jira/browse/PIG-816
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.3.0


A simple Pig script which uses a Unicode character in the PigStorage() constructor 
fails with the following error:

{code}
studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
age:int,gpa:float);
X2 = GROUP studenttab by age;
Y2 = FOREACH X2 GENERATE group, COUNT(studenttab);
store Y2 into '/user/viraj/y2' using PigStorage('\u0001');
{code}


ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate 
exception from backend error: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: 
Character reference #1 is an invalid XML character.

Attaching log file.
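
The root cause is presumably that the delimiter ends up in the serialized job configuration, which Hadoop stores as XML, and U+0001 is not a legal XML 1.0 character; hence the SAXParseException about character reference #1. A quick cross-check (a sketch; the output path is hypothetical) is that the identical pipeline with a printable delimiter stores fine:
{code}
-- same pipeline, printable delimiter: success here would isolate the failure
-- to the control character being written into the XML job configuration
store Y2 into '/user/viraj/y2_printable' using PigStorage(',');
{code}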


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-816) PigStorage() does not accept Unicode characters in its contructor

2009-05-22 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-816:
---

Attachment: pig_1243043613713.log

Log file for detailed error message

 PigStorage() does not accept Unicode characters in its constructor 
 --

 Key: PIG-816
 URL: https://issues.apache.org/jira/browse/PIG-816
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.3.0

 Attachments: pig_1243043613713.log


 Simple Pig script which uses Unicode characters in the PigStorage() 
 constructor fails with the following error:
 {code}
 studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
 age:int,gpa:float);
 X2 = GROUP studenttab by age;
 Y2 = FOREACH X2 GENERATE group, COUNT(studenttab);
 store Y2 into '/user/viraj/y2' using PigStorage('\u0001');
 {code}
 
 ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate 
 exception from backend error: org.apache.hadoop.ipc.RemoteException: 
 java.io.IOException: java.lang.RuntimeException: 
 org.xml.sax.SAXParseException: Character reference #1 is an invalid XML 
 character.
 
 Attaching log file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710862#action_12710862
 ] 

Viraj Bhat commented on PIG-656:


Another Pig parse issue arises when a UDF is defined within a package that has 
the keyword 'matches' in its path.

So something like:
define DISTANCE_SCORE mypackage.pig.udf.matches.LevensteinMatchUDF();

gives a parse error:


ERROR 1000: Error during parsing. Encountered  matches matches  at 
line 11, column 42.
Was expecting:
 <IDENTIFIER> ...


It is possible to have Pig keywords within package names or even UDF names - 
shouldn't Pig be robust to simple grammar disambiguation of this sort?
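
Until the grammar is fixed, the practical workaround seems to be repackaging the UDF so that no path component collides with a Pig keyword; the renamed package below is hypothetical:
{code}
-- 'matching' avoids the reserved word 'matches' in the package path
define DISTANCE_SCORE mypackage.pig.udf.matching.LevensteinMatchUDF();
{code}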

 Use of eval word in the package hierarchy of a UDF causes parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code at 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some clarifications:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat reopened PIG-656:



Documentation should be updated on the eval keyword and what it actually does; 
otherwise the user can be lost trying to track down the error.

 Use of eval word in the package hierarchy of a UDF causes parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code at 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some clarifications:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-656:
---

Summary: Use of eval or any other keyword in the package hierarchy of a UDF 
causes parse exception  (was: Use of eval word in the package hierarchy of a 
UDF causes parse exception)

 Use of eval or any other keyword in the package hierarchy of a UDF causes 
 parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code at 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some clarifications:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-812) COUNT(*) does not work

2009-05-19 Thread Viraj Bhat (JIRA)
COUNT(*) does not work 
---

 Key: PIG-812
 URL: https://issues.apache.org/jira/browse/PIG-812
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.2.0


Pig script to count the number of rows in a studenttab10k file which contains 
10k records.
{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);

X2 = GROUP studenttab ALL;

describe X2;

Y2 = FOREACH X2 GENERATE COUNT(*);

explain Y2;

DUMP Y2;

{code}

returns the following error

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator 
for alias Y2
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log


If you look at the log file:

Caused by: java.lang.ClassCastException
at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
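
A workaround, assuming the goal is simply a row count: pass the grouped bag to COUNT explicitly rather than using the star expression:
{code}
-- counting the bag itself avoids the ClassCastException in COUNT$Initial
Y2 = FOREACH X2 GENERATE COUNT(studenttab);
{code}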


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-812) COUNT(*) does not work

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-812:
---

Attachment: studenttab10k

Input file

 COUNT(*) does not work 
 ---

 Key: PIG-812
 URL: https://issues.apache.org/jira/browse/PIG-812
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.2.0

 Attachments: studenttab10k


 Pig script to count the number of rows in a studenttab10k file which contains 
 10k records.
 {code}
 studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);
 X2 = GROUP studenttab ALL;
 describe X2;
 Y2 = FOREACH X2 GENERATE COUNT(*);
 explain Y2;
 DUMP Y2;
 {code}
 returns the following error
 
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator 
 for alias Y2
 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log
 
 If you look at the log file:
 
 Caused by: java.lang.ClassCastException
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

2009-05-18 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710619#action_12710619
 ] 

Viraj Bhat commented on PIG-774:


Hi Daniel,
 For this patch to work, is it important to set:

LESSCHARSET to utf-8

LANG to en_US.utf8

I am observing that a dry run using pig -r does not yield the right parameter 
substitution if these variables are not set.

They are not set by default on RHEL 5.0.

You mentioned this in your earlier comments.

Thanks, Viraj

 Pig does not handle Chinese characters (in both the parameter substitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.3.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 {code}
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 {code}
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script and instead used 
 the following statement.
 =
 {code}
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 {code}
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 {code}
 shell $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj 

[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??

2009-05-01 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-798:
---

Description: 
In the following script I have a tab separated text file, which I load using 
PigStorage() and store using BinStorage()
{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
url:chararray, time:chararray);

B = group A by name;

store B into '/user/viraj/binstoragecreateop' using BinStorage();

dump B;
{code}

I later load file 'binstoragecreateop' in the following way.
{code}

A = load '/user/viraj/binstoragecreateop' using BinStorage();

B = foreach A generate $0 as name:chararray;

dump B;
{code}
Result
===
(Amy)
(Fred)
===
The above code works properly and returns the right results. If I use 
PigStorage() to achieve the same, I get the following error.
{code}
A = load '/user/viraj/visits.txt' using PigStorage();

B = foreach A generate $0 as name:chararray;

dump B;

{code}
===
{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field 
Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}
===
So why should the semantics of BinStorage() be different from PigStorage() 
when the schema is omitted at load time? Should it not be consistent across both?

  was:
In the following script I have a tab separated text file, which I load using 
PigStorage() and store using BinStorage()
{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
url:chararray, time:chararray);

B = group A by name;

store B into '/user/viraj/binstoragecreateop' using BinStorage();

dump B;
{code}

I later load file 'binstoragecreateop' in the following way.
{code}

A = load '/user/viraj/binstoragecreateop' using BinStorage();

B = foreach A generate $0 as name:chararray;

dump B;
{code}
Result
===
(Amy)
(Fred)
===
The above code works properly and returns the right results. If I use 
PigStorage() to achieve the same, I get the following error.
{code}
A = load '/user/viraj/visits.txt' using PigStorage();

B = foreach A generate $0 as name:chararray;

dump B;

{code}
===
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field 
Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
===
So why should the semantics of BinStorage() be different from PigStorage() 
when the schema is omitted at load time? Should it not be consistent across both?


 Schema errors when using PigStorage and none when using BinStorage??
 

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
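
For the failing PigStorage() case above, an explicit cast appears to sidestep the schema merge, since 'as name:chararray' asserts a schema on an already-typed (bytearray) field rather than converting it; a sketch:
{code}
A = load '/user/viraj/visits.txt' using PigStorage();
-- the cast converts the bytearray; 'as' then merely names the column
B = foreach A generate (chararray)$0 as name;
dump B;
{code}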
 
