[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910441#action_12910441 ]

Ankur commented on PIG-1229:

In the putNext() method, count is reset to 0 every time the number of tuples added to the batch exceeds 'batchSize'. The batch is then executed and its parameters cleared. There is currently an ExecException in the putNext() method that is being ignored. Can you try adding some debugging System.outs and check the stdout/stderr of your reducers to see if that is the problem?

allow pig to write output into a JDBC db
--

Key: PIG-1229
URL: https://issues.apache.org/jira/browse/PIG-1229
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
Fix For: 0.8.0
Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch

UDF to store data into a DB

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
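The putNext() control flow described in the comment can be sketched as follows. This is an illustrative stand-in, not the actual DBStorage code: a plain list and a flush callback play the role of PreparedStatement.addBatch()/executeBatch(), and the class, field, and parameter names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the counter/flush logic described above. A real implementation
// would call PreparedStatement.addBatch()/executeBatch(); here a list and a
// callback stand in for JDBC so the control flow is visible.
class BatchWriter {
    private final List<Object[]> batch = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<Object[]>> flusher; // stands in for executeBatch()

    private int count = 0;

    BatchWriter(int batchSize, Consumer<List<Object[]>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void putNext(Object[] tuple) {
        batch.add(tuple);              // stands in for addBatch()
        if (++count >= batchSize) {
            flusher.accept(batch);     // execute the accumulated batch
            batch.clear();             // stands in for clearParameters()
            count = 0;                 // reset, as described in the comment
        }
    }

    int pending() {
        return count;
    }
}
```

Any exception thrown by the flush (an ExecException in the real code) would surface here, which is why silently ignoring it hides exactly the failure the comment asks the reporter to look for in the reducer logs.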
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Here is my understanding of what happens:

1. The main thread in the JVM executing the test initializes MiniDFSCluster, MiniMRCluster and the HSQLDB server, all in different threads.
2. The test setUp() method is then executed to create table 'ttt', to which data will be written by DBStorage() in the test.
3. Pig statements are then executed that spawn an M/R job as a separate process, which tries to get a connection to the database and create a PreparedStatement for table 'ttt'. This sometimes fails because the DB thread does NOT get a chance to fully persist the table information, and the exception is thrown from the map tasks, as noted by Ashutosh.

The fix is to add a 5 sec sleep in the setUp() method to give the DB a chance to persist the table information. This alleviates the problem and the test passes over repeated runs. Note that the ideal fix would have been to busy-wait for table creation to complete, but I don't see a method in HSqlDB to do that.
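The busy-wait the comment calls ideal can be sketched generically (an illustration, not part of the patch). In the test, the condition could poll for the table's existence, e.g. via JDBC's DatabaseMetaData.getTables() (a hypothetical usage here), instead of sleeping a fixed 5 seconds:

```java
import java.util.function.BooleanSupplier;

// Generic busy-wait helper: an alternative sketch to a fixed Thread.sleep(5000).
// In the test, 'condition' would check that table 'ttt' is visible, e.g. by
// asking connection.getMetaData().getTables(...) whether a row comes back.
class TableWait {
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;                       // gave up: condition never held
            }
            try {
                Thread.sleep(pollMs);               // back off briefly between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;                                // condition held within the timeout
    }
}
```

The advantage over a fixed sleep is that the test proceeds as soon as the table is visible and only fails after an explicit timeout, instead of racing a hard-coded delay.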
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229-final.test-fix.patch)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Aaron, Autocommit() was not the issue. The problem was the use of a jdbc:hsqldb:file: URL in the STORE function. Replacing it with jdbc:hsqldb:hsql://localhost/dbname solved the issue. Attaching the updated patch with the test case modification. Really appreciate your help here. Thanks a lot :-)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Attaching the patch with fixes to the test case:

1. Starting the HsqlDB server manually - dbServer.start().
2. Supplying user name and password when initializing DBStorage.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229-final.test-fix.patch)
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.patch

Hope this one finally goes in.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Status: Patch Available (was: In Progress)

Regenerated the patch as per Ashutosh's suggestion.
[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce
[ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892176#action_12892176 ]

Ankur commented on PIG-1516:

Having the finalize method AT ALL for the purpose of deleting files when the object is garbage collected is NOT a good solution. Generally speaking, using finalizers to release non-memory resources like file handles should be avoided, as it can introduce an insidious bug. From the article on object finalization and cleanup - http://www.javaworld.com/jw-06-1998/jw-06-techniques.html:

"Don't rely on finalizers to release non-memory resources. An example of an object that breaks this rule is one that opens a file in its constructor and closes the file in its finalize() method. Although this design seems neat, tidy, and symmetrical, it potentially creates an insidious bug. A Java program generally will have only a finite number of file handles at its disposal. When all those handles are in use, the program won't be able to open any more files."

finalize in bag implementations causes pig to run out of memory in reduce
--

Key: PIG-1516
URL: https://issues.apache.org/jira/browse/PIG-1516
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

*Problem:* Pig bag implementations that are subclasses of DefaultAbstractBag have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory they use is freed only after finalization happens. If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen when a large number of small bags are being created.

*Solution:* The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use for the finalize function. A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>.

*Possible workaround for earlier releases:* Since the fix is going into 0.8, here is a workaround - disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem, as it is likely to slow down the query. To disable the combiner, set the property: -Dpig.exec.nocombiner=true
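The FileList idea described in the solution can be sketched minimally (this is an illustration of the design, not the actual class committed to Pig): only this small helper carries a finalizer, so bags that never spill never enter the finalization queue at all.

```java
import java.io.File;
import java.util.ArrayList;

// Sketch of the FileList design described above: the bag holds a FileList
// instead of an ArrayList<File>, and only this small object has a finalizer.
class FileList extends ArrayList<File> {

    // Deletes every tracked spill file; finalize() simply delegates here,
    // so cleanup can also be invoked deterministically.
    void deleteAll() {
        for (File f : this) {
            f.delete();
        }
    }

    @Override
    protected void finalize() {
        deleteAll(); // spill files are removed when this object is GC'd
    }
}
```

The design point is that the finalization cost is paid only by bags that actually created spill files; a small in-memory bag carries no finalizer and is reclaimed on the normal GC path.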
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885840#action_12885840 ]

Ankur commented on PIG-1482:

Forgot to add - include this change as well for the above script to work:

G = FOREACH F GENERATE group.v1, group.a;
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885839#action_12885839 ]

Ankur commented on PIG-1482:

Casting early alleviates the problem. So this makes the above script work:

C = FOREACH B GENERATE (chararray) v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 : 0) as v3:int;
[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved
[ https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885838#action_12885838 ]

Ankur commented on PIG-1482:

ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias K
    at org.apache.pig.PigServer.openIterator(PigServer.java:521)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:391)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias K
    at org.apache.pig.PigServer.store(PigServer.java:577)
    at org.apache.pig.PigServer.openIterator(PigServer.java:504)
    ... 6 more
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
    at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:89)
    at org.apache.pig.PigServer.validate(PigServer.java:930)
    at org.apache.pig.PigServer.compileLp(PigServer.java:884)
    at org.apache.pig.PigServer.store(PigServer.java:568)
    ... 7 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1053: Cannot resolve load function to use for casting from bytearray to chararray.
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1775)
    at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:67)
    at org.apache.pig.impl.logicalLayer.LOCast.visit(LOCast.java:32)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2819)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2723)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
    at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
    ... 13 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1065: Found more than one load function to use: [PigStorage, TextLoader]
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3161)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3176)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.getLoadFuncSpec(TypeCheckingVisitor.java:3103)
[jira] Created: (PIG-1482) Pig gets confused when more than one loader is involved
Pig gets confused when more than one loader is involved
--

Key: PIG-1482
URL: https://issues.apache.org/jira/browse/PIG-1482
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur

When two relations loaded using different loaders are joined, grouped and projected, pig gets confused trying to find the appropriate loader for the requested cast. Consider the following script:

A = LOAD 'data1' USING PigStorage() AS (s, m, l);
B = FOREACH A GENERATE s#'k1' as v1, m#'k2' as v2, l#'k3' as v3;
C = FOREACH B GENERATE v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 : 0) as v3:int;
D = LOAD 'data2' USING TextLoader() AS (a);
E = JOIN C BY v1, D BY a USING 'replicated';
F = GROUP E BY (v1, a);
G = FOREACH F GENERATE (chararray)group.v1, group.a;
dump G;

This throws an error, the stack trace of which is in the next comment.
[jira] Created: (PIG-1462) No informative error message on parse problem
No informative error message on parse problem
-

Key: PIG-1462
URL: https://issues.apache.org/jira/browse/PIG-1462
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur

Consider the following script:

in = load 'data' using PigStorage() as (m:map[]);
tags = foreach in generate m#'k1' as (tagtuple: tuple(chararray));
dump tags;

This throws the following error message, which does not really say that this is a bad declaration:

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Encountered at line 2, column 38. Was expecting one of:
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:391)
[jira] Commented: (PIG-1462) No informative error message on parse problem
[ https://issues.apache.org/jira/browse/PIG-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881551#action_12881551 ]

Ankur commented on PIG-1462:

Right, the JIRA is for adding a better error message that doesn't leave a user guessing.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869552#action_12869552 ]

Ankur commented on PIG-1229:

Hi Ashutosh, thanks for helping out here. The error that you see - "...The database is already in use by another process" - is due to locking issues in hsqldb 1.8.0.7. Upgrading to 1.8.0.10 alleviates the problem and the test passes successfully. A few changes that I made:

1. Added a placeholder record-writer, as PigOutputFormat calls close() on it and throws a null pointer exception if we return null from our output format.
2. Looks like you missed the ivy.xml and build.xml changes to pull the correct hsqldb jar.
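Point 1 can be illustrated with a small sketch. The interfaces here are hypothetical stand-ins, not Hadoop's actual OutputFormat/RecordWriter API: the point is only that a caller which unconditionally invokes close() will hit a NullPointerException on a null writer, so a do-nothing placeholder is returned instead.

```java
// Sketch of the placeholder-writer pattern described in point 1 above,
// using hypothetical interfaces (not Hadoop's real classes).
class NoOpWriter {
    interface Writer {
        void write(Object key, Object value);
        void close();
    }

    // A do-nothing writer: the actual DB work happens elsewhere (in the
    // store function), so this exists only to keep the caller's close() safe.
    static Writer placeholder() {
        return new Writer() {
            public void write(Object key, Object value) { /* no-op */ }
            public void close() { /* no-op */ }
        };
    }

    // Mimics a PigOutputFormat-style caller that closes unconditionally.
    static boolean closeSafely(Writer w) {
        try {
            w.close(); // NPEs if w is null
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }
}
```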
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: pig-1229.2.patch
[jira] Created: (PIG-1393) Bug in Nested FOREACH
Bug in Nested FOREACH
-

Key: PIG-1393
URL: https://issues.apache.org/jira/browse/PIG-1393
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur
Fix For: 0.8.0

The following script makes the parser throw an error:

A = load 'data' as (a: int, b: map[]);
B = foreach A generate ((chararray) b#'url') as url;
C = foreach B {
    urlQueryFields = url#'queryFields';
    result = (urlQueryFields is not null) ? urlQueryFields : 1;
    generate result;
};
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-v3.patch

Here you go ...
[jira] Created: (PIG-1392) Parser fails to recognize valid field
Parser fails to recognize valid field
-

Key: PIG-1392
URL: https://issues.apache.org/jira/browse/PIG-1392
Project: Pig
Issue Type: Bug
Reporter: Ankur

Using the script below, the parser fails to recognize a valid field in the relation and throws an error:

A = LOAD '/tmp' as (a:int, b:chararray, c:int);
B = GROUP A BY (a, b);
C = FOREACH B {
    bg = A.(b,c);
    GENERATE group, bg;
};

The error thrown is:

2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}}
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-1392:
---

Fix Version/s: 0.7.0
[jira] Created: (PIG-1379) Jars registered from command line should override the ones present in the script
Jars registered from command line should override the ones present in the script
-

Key: PIG-1379
URL: https://issues.apache.org/jira/browse/PIG-1379
Project: Pig
Issue Type: Improvement
Reporter: Ankur
Fix For: 0.7.0

Jars that are registered from the command line when executing the pig script should override the ones that are specified via 'register' in the pig script itself.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856761#action_12856761 ]

Ankur commented on PIG-1229:

Any updates?
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855835#action_12855835 ]

Ankur commented on PIG-1229:

*Sigh* The problem is with hadoop's Path implementation, which has problems understanding JDBC URLs correctly. So turning relToAbsPathForStoreFunction() does NOT help. The URISyntaxException is now propagated to the point of setting the output path for the job. Here is the new trace from the test execution failure with the suggested workaround:

org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:332)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
    at org.apache.pig.PigServer.execute(PigServer.java:828)
    at org.apache.pig.PigServer.access$100(PigServer.java:105)
    at org.apache.pig.PigServer$Graph.execute(PigServer.java:1080)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:288)
    at org.apache.pig.piggybank.test.storage.TestDBStorage.testWriteToDB(Unknown Source)
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:624)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
    at org.apache.hadoop.fs.Path.initialize(Path.java:140)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:45)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:459)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
    at java.net.URI.checkPath(URI.java:1787)
    at java.net.URI.<init>(URI.java:735)
    at org.apache.hadoop.fs.Path.initialize(Path.java:137)
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853843#action_12853843 ] Ankur commented on PIG-1229: So accepting the JDBC URL in setStoreLocation() exposes a flaw in Hadoop's Path class, and it causes the test case to fail with the following exception:

java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
at org.apache.hadoop.fs.Path.initialize(Path.java:140)
at org.apache.hadoop.fs.Path.<init>(Path.java:126)
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:238)
at org.apache.pig.StoreFunc.relToAbsPathForStoreLocation(StoreFunc.java:60)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3587)
...
...
Caused by: java.net.URISyntaxException: Relative path in absolute URI: jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
at java.net.URI.checkPath(URI.java:1787)
at java.net.URI.<init>(URI.java:735)
at org.apache.hadoop.fs.Path.initialize(Path.java:137)

Looking at the code of Path.java, it seems to extract the scheme based on the first occurrence of ':'. This causes the authority and path to be extracted incorrectly, resulting in the above exception being thrown by java.net.URI. However, if I try to initialize a URI directly with the URL string, no exception is thrown. As for the DB reachability check, I think it is OK to check availability at runtime and fail if the DB is unavailable; we do this in prepareToWrite(). As for the performance enhancement, I think we can track that via a separate issue. This patch has taken quite a while now and I wouldn't want to delay it further by depending on a Hadoop fix.
So if a reviewer does not find any blocking issues, then my suggestion is to go ahead with the commit. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852243#action_12852243 ] Ankur commented on PIG-1229: Ashutosh, thanks for the review comments. Accepting the store location via setStoreLocation() definitely makes sense. However, I am not sure about checking database reachability in checkOutputSpecs(), since that may be called on the client side as well, and the DB machine may not be reachable from the client machine. Isn't OutputFormat's setupTask() a better place to do a DB availability check? This sounds like a reasonable ask before a commit; I will incorporate it and submit a new patch. Regarding doing DataType.find(), I assume this is what you have in mind:
1. Get the DB schema information for the table we are writing to.
2. Use the checkSchema() API to validate this against the Pig-supplied schema and cache it.
3. Use the cached information in the putNext() method.
This is more of a performance enhancement and looks like more work, so I would prefer that we track it as a separate JIRA for DBStorage. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
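A fail-fast reachability check of the kind discussed here (attempt a connection where the tasks actually run, rather than on the client) might look like the following. This is an illustrative sketch, not the patch's code; checkReachable and the call site are hypothetical names.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DbReachabilityCheck {
    /**
     * Try to open (and immediately close) a connection so that an
     * unreachable database fails the task early with a clear error,
     * instead of surfacing later during batched writes.
     */
    static void checkReachable(String jdbcUrl, String user, String pass) throws IOException {
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, pass)) {
            // Connection opened successfully; nothing more to verify.
        } catch (SQLException e) {
            throw new IOException("Database not reachable: " + jdbcUrl, e);
        }
    }

    public static void main(String[] args) {
        try {
            // No driver is registered for this made-up URL, so this fails fast.
            checkReachable("jdbc:nosuchdb://example.invalid/db", "user", "pass");
        } catch (IOException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

Run from setupTask() (or prepareToWrite(), as the comment above suggests), this turns a mid-job failure into an immediate, attributable one.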
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: jira-1229-v2.patch Here is the updated patch that compiles against the Pig 0.7 branch and implements the new load/store APIs. Note: I haven't used Hadoop's DBOutputFormat, as that code has not yet been moved to o.a.h.mapreduce.lib and hence there are compatibility issues. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
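The batched-write pattern described for this patch's putNext() (add rows to a JDBC batch, then flush and reset the counter once batchSize rows accumulate) can be sketched without a database by recording calls through a dynamic proxy. BATCH_SIZE, count, and putNext here are illustrative names, not the DBStorage code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class BatchWriteDemo {
    // Records the JDBC calls the batching logic makes, so the flush
    // pattern can be observed without a real database connection.
    static final List<String> calls = new ArrayList<>();
    static int count = 0;
    static final int BATCH_SIZE = 3;

    static PreparedStatement fakeStatement() {
        InvocationHandler handler = (proxy, method, args) -> {
            calls.add(method.getName());
            // executeBatch() is the only method used here with a non-void return.
            return method.getName().equals("executeBatch") ? new int[0] : null;
        };
        return (PreparedStatement) Proxy.newProxyInstance(
                BatchWriteDemo.class.getClassLoader(),
                new Class<?>[] { PreparedStatement.class }, handler);
    }

    // Illustrative putNext(): bind each field, buffer the row, and flush
    // the batch (resetting the counter) once BATCH_SIZE rows accumulate.
    static void putNext(PreparedStatement ps, Object[] row) throws Exception {
        for (int i = 0; i < row.length; i++) {
            ps.setObject(i + 1, row[i]);
        }
        ps.addBatch();
        if (++count >= BATCH_SIZE) {
            ps.executeBatch();
            ps.clearBatch();
            count = 0;
        }
    }

    public static void main(String[] args) throws Exception {
        PreparedStatement ps = fakeStatement();
        for (int i = 0; i < 7; i++) {
            putNext(ps, new Object[] { i, "row" + i });
        }
        long flushes = calls.stream().filter("executeBatch"::equals).count();
        System.out.println("rows=7 flushes=" + flushes); // 2 flushes, 1 row pending
    }
}
```

Note the one design wrinkle this makes visible: rows still pending when the last tuple arrives must be flushed separately (in the real StoreFunc, at task commit/close time), or the tail of the data is silently dropped.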
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: (was: hsqldb.jar) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: (was: jira-1229.patch) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Status: In Progress (was: Patch Available) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Status: Patch Available (was: In Progress) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.8.0 Attachments: jira-1229-v2.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1327) Incorrect column pruning after multiple JOIN operations
Incorrect column pruning after multiple JOIN operations --- Key: PIG-1327 URL: https://issues.apache.org/jira/browse/PIG-1327 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur In a script with multiple JOIN and GROUP operations, the column pruner incorrectly removes some of the fields that it shouldn't. Here is a script that demonstrates the issue:

A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, f:chararray);
join1 = JOIN B by x, A by a;
filtered1 = FILTER join1 BY y == b;
InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
join2 = JOIN InterimData BY b LEFT OUTER, C BY d PARALLEL 2;
proj = FOREACH join2 GENERATE a,b,y,z,e,f;
TopNPrj = FOREACH proj GENERATE a, ((e is not null and e != '') ? e : 'None'), z;
TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), SUM(TopNPrj.z) as views;
TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
TopNDataCount = FOREACH TopNDataRegrp {
    OrderedData = ORDER TopNDataSum BY views desc;
    LimitedData = LIMIT OrderedData 50;
    GENERATE LimitedData;
}
TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
store TopNData into 'tmpTopN';
TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) PARALLEL 2;
describe joinTopNData;
STORE joinTopNData INTO 'output';

The column 'f' from relation 'C' participating in the 2nd JOIN is missing from the final join output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1327) Incorrect column pruning after multiple JOIN operations
[ https://issues.apache.org/jira/browse/PIG-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849995#action_12849995 ] Ankur commented on PIG-1327: Yes, I verified that. Incorrect column pruning after multiple JOIN operations --- Key: PIG-1327 URL: https://issues.apache.org/jira/browse/PIG-1327 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur In a script with multiple JOIN and GROUP operations, the column pruner incorrectly removes some of the fields that it shouldn't. Here is a script that demonstrates the issue:

A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, f:chararray);
join1 = JOIN B by x, A by a;
filtered1 = FILTER join1 BY y == b;
InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
join2 = JOIN InterimData BY b LEFT OUTER, C BY d PARALLEL 2;
proj = FOREACH join2 GENERATE a,b,y,z,e,f;
TopNPrj = FOREACH proj GENERATE a, ((e is not null and e != '') ? e : 'None'), z;
TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), SUM(TopNPrj.z) as views;
TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
TopNDataCount = FOREACH TopNDataRegrp {
    OrderedData = ORDER TopNDataSum BY views desc;
    LimitedData = LIMIT OrderedData 50;
    GENERATE LimitedData;
}
TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
store TopNData into 'tmpTopN';
TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) PARALLEL 2;
describe joinTopNData;
STORE joinTopNData INTO 'output';

The column 'f' from relation 'C' participating in the 2nd JOIN is missing from the final join output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847909#action_12847909 ] Ankur commented on PIG-1229: @Ashtosh Chauhan I read the HSQLDB license and it looked ok to me but I am not a lawyer :-) . Besides that apache cocoon uses it. I think we should be ok pulling it through ivy. I'll make the ivy and load-store related changes and submit a new patch on Monday. Sorry for the delay. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.7.0 Attachments: hsqldb.jar, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1273) Skewed join throws error
Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840482#action_12840482 ] Ankur commented on PIG-1273: Here is a simple script to reproduce it:

a = load 'test.dat' using PigStorage() as (nums:chararray);
b = load 'join.dat' using PigStorage('\u0001') as (number:chararray,text:chararray);
c = filter a by nums == '7';
d = join c by nums LEFT OUTER, b by number USING 'skewed';
dump d;

test.dat:
1
2
3
4
5

join.dat:
1^Aone
2^Atwo
3^Athree

where ^A is the Control-A character used as a separator. Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1273) Skewed join throws error
[ https://issues.apache.org/jira/browse/PIG-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840483#action_12840483 ] Ankur commented on PIG-1273: Complete stack trace of the error thrown by the 3rd M/R job in the pipeline:

java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:448)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 6 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:128)
... 11 more
Caused by: java.lang.RuntimeException: Empty samples file
at org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil.loadPartitionFile(MapRedUtil.java:128)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.configure(SkewedPartitioner.java:125)
... 11 more

Skewed join throws error - Key: PIG-1273 URL: https://issues.apache.org/jira/browse/PIG-1273 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur When the sampled relation is too small or empty then skewed join fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1274) Column pruning throws Null pointer exception
Column pruning throws Null pointer exception Key: PIG-1274 URL: https://issues.apache.org/jira/browse/PIG-1274 Project: Pig Issue Type: Bug Reporter: Ankur In case data has missing values for certain columns in a relation participating in a join, column pruning throws null pointer exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835136#action_12835136 ] Ankur commented on PIG-1233: In the current code path we cannot have a situation where intermediateCount is NOT null but intermediateSum is null, so just checking the former is sufficient. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
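The guard being described, where checking intermediateCount alone suffices because intermediateSum can never lag behind it, can be sketched as below. Field and method names mirror the discussion (intermediateCount, intermediateSum, accumulate(), getValue()); the real AVG implementation differs.

```java
public class AvgGuardDemo {
    // Boxed running state, mirroring the description in the issue:
    // both fields stay null until accumulate() is called at least once.
    private Long intermediateCount;
    private Double intermediateSum;

    public void accumulate(double value) {
        intermediateCount = (intermediateCount == null) ? 1L : intermediateCount + 1;
        intermediateSum = (intermediateSum == null) ? value : intermediateSum + value;
    }

    public Double getValue() {
        // Guard before unboxing: without the null check, the numeric
        // comparison below would throw a NullPointerException whenever
        // accumulate() was never called. Checking intermediateCount alone
        // is enough, since intermediateSum is non-null whenever it is.
        if (intermediateCount != null && intermediateCount > 0) {
            return intermediateSum / intermediateCount;
        }
        return null;
    }

    public static void main(String[] args) {
        AvgGuardDemo avg = new AvgGuardDemo();
        System.out.println(avg.getValue()); // null, instead of an NPE
        avg.accumulate(2.0);
        avg.accumulate(4.0);
        System.out.println(avg.getValue()); // 3.0
    }
}
```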
[jira] Created: (PIG-1238) Dump does not respect the schema
Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834151#action_12834151 ] Ankur commented on PIG-1238: Here is a script to reproduce the issue:

A = LOAD 'two.txt' USING PigStorage();
B = FOREACH A GENERATE ['a'#'12'] as b:map[], ['b'#['c'#'12']] as mapFields;
C = FOREACH B GENERATE (CHARARRAY) mapFields#'b'#'c' AS f1, RANDOM() AS f2;
D = ORDER C BY f2 PARALLEL 10;
E = LIMIT D 20;
F = FOREACH E GENERATE f1;
describe F;
dump F;

With the above script, here is a snippet of the logs that might be useful:
...
...
2010-02-16 10:42:44,814 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete
2010-02-16 10:42:55,966 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2010-02-16 10:42:55,981 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: hdfs://mithrilblue-nn1.blue.ygrid.yahoo.com/tmp/temp-1870551954/tmp-470213889
2010-02-16 10:42:55,991 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 1 time(s).
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 1
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 14
2010-02-16 10:42:55,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

(12,)

Note: If we remove PARALLEL 10 from the ORDER BY, correct results are produced and NO warning is thrown.
Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834642#action_12834642 ] Ankur commented on PIG-1238: Daniel the correct syntax is - ['b'#['c'#'12']] as mapFields. Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834643#action_12834643 ] Ankur commented on PIG-1238: Seems like inner [] are making parts of it appear underlined. Correct syntax is ['b'# ['c'#'12'] ] as mapFields Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834644#action_12834644 ] Ankur commented on PIG-1238: *Sigh* Enclose 'c'#'12' in square brackets, and then enclose 'b'# ... in another pair of square brackets. Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur For complex data type and certain sequence of operations dump produces results with non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834645#action_12834645 ] Ankur commented on PIG-1233: Olga, all queries that use AVG(), have null values for certain keys, and have the accumulator turned on are affected by this. Please see the test case for a sample query. The current workaround is to filter the nulls before averaging. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: In Progress (was: Patch Available) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: In Progress) Retrying as suggested by Olga NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: In Progress (was: Patch Available) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Attachment: jira-1233.patch Added test case NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: In Progress) Retrying hudson after adding the suggested test case NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: jira-1229.patch Updated code with added test case using HSQLDB (binary part of the patch). allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Fix Version/s: 0.6.0 Status: Patch Available (was: Open) allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.6.0 Attachments: jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1229: --- Attachment: hsqldb.jar Attaching hsqldb.jar separately as including it in the patch does not work allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Fix For: 0.6.0 Attachments: hsqldb.jar, jira-1229.patch UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834075#action_12834075 ] Ankur commented on PIG-1233: The test report URLs don't work. Is this the correct one ? http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/205/testReport/ Looks alright to me. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1233) NullPointerException in AVG
NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Attachment: jira-1233.patch Attached is a simple patch that adds the required null checks. The code change is small, so I don't think any new test cases are needed. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: Open) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden method - getValue() in AVG throws null pointer exception in case accumulate() is not called leaving variable 'intermediateCount' initialized to null. This causes java to throw exception when it tries to 'unbox' the value for numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
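To make the PIG-1233 failure mode concrete, here is a minimal, hypothetical Java sketch (illustrative names only, not Pig's actual AVG source): if accumulate() is never called, the boxed intermediate fields stay null, and auto-unboxing them in getValue() throws the NullPointerException; the fix in the attached patch is a null check before any unboxing.

```java
// Hypothetical, simplified model of the bug described in PIG-1233.
// Not Pig's actual AVG code; field and method names are illustrative.
public class AvgSketch {
    private Double intermediateSum = null;
    private Long intermediateCount = null;

    public void accumulate(double v) {
        intermediateSum = (intermediateSum == null ? 0.0 : intermediateSum) + v;
        intermediateCount = (intermediateCount == null ? 0L : intermediateCount) + 1;
    }

    // Buggy version: "intermediateCount > 0" auto-unboxes a null Long -> NPE
    // when accumulate() was never called.
    public Double getValueBuggy() {
        return intermediateCount > 0 ? intermediateSum / intermediateCount : null;
    }

    // Fixed version: null-check before any unboxing, as the patch does.
    public Double getValue() {
        if (intermediateCount == null || intermediateCount == 0) {
            return null;
        }
        return intermediateSum / intermediateCount;
    }

    public static void main(String[] args) {
        AvgSketch avg = new AvgSketch();
        System.out.println(avg.getValue());  // prints null instead of throwing
        avg.accumulate(2.0);
        avg.accumulate(4.0);
        System.out.println(avg.getValue());  // prints 3.0
    }
}
```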
[jira] Assigned: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned PIG-1229: -- Assignee: Ankur allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: DbStorage.java UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db
[ https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831337#action_12831337 ] Ankur commented on PIG-1229: Aaron, Thanks for the suggestions. I'll have an updated patch coming soon. allow pig to write output into a JDBC db Key: PIG-1229 URL: https://issues.apache.org/jira/browse/PIG-1229 Project: Pig Issue Type: New Feature Components: impl Reporter: Ian Holsman Assignee: Ankur Priority: Minor Attachments: DbStorage.java UDF to store data into a DB -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
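The DBStorage UDF under discussion writes tuples through a JDBC PreparedStatement in batches: rows are buffered, and once 'batchSize' rows have accumulated the batch is executed, its parameters cleared, and the count reset to 0. A minimal sketch of that batching pattern follows; the Flusher interface and all names here are hypothetical stand-ins for the real addBatch()/executeBatch() calls, so no database is needed to run it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the putNext() batching pattern discussed in this
// thread. All names are hypothetical; in DBStorage the Flusher role is
// played by a JDBC PreparedStatement.
public class BatchWriter {
    public interface Flusher { void execute(List<String> batch); }

    private final int batchSize;
    private final Flusher flusher;
    private final List<String> batch = new ArrayList<>();
    private int count = 0;

    public BatchWriter(int batchSize, Flusher flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Analogue of putNext(Tuple): buffer the row, flush when the batch fills.
    public void putNext(String row) {
        batch.add(row);
        count++;
        if (count >= batchSize) {
            flusher.execute(batch);  // executeBatch() in the JDBC version
            batch.clear();           // clear the batched parameters
            count = 0;               // reset, as described in the comment
        }
    }

    // Analogue of the commit/close path: flush any trailing partial batch.
    public void finish() {
        if (!batch.isEmpty()) {
            flusher.execute(batch);
            batch.clear();
            count = 0;
        }
    }

    public static void main(String[] args) {
        BatchWriter w = new BatchWriter(3, b -> System.out.println("flush " + b));
        for (int i = 0; i < 7; i++) w.putNext("row" + i);
        w.finish();  // flushes the trailing partial batch of 1 row
    }
}
```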
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800610#action_12800610 ] Ankur commented on PIG-1191: I'll check and update the ticket POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800636#action_12800636 ] Ankur commented on PIG-1191: Case 1, 2: Succeeds Case 3 : Fails Case 4,5: Empty results. Both of them are using consecutive projection of complex fields. I'll add 1 more test case POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800655#action_12800655 ] Ankur commented on PIG-1191: CASE 6: In CASE 1 replace LIMIT with a GROUP BY followed by FOREACH Succeeds with the given patch. POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned PIG-1191: -- Assignee: Pradeep Kamath POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Pradeep Kamath Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800660#action_12800660 ] Ankur commented on PIG-1191: Small correction in the comment dated - 15/Jan/10 09:39 AM: Case 5: Still FAILS POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Pradeep Kamath Priority: Blocker Attachments: PIG-1191-1.patch When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
POCast throws exception for certain sequences of LOAD, FILTER, FORACH - Key: PIG-1191 URL: https://issues.apache.org/jira/browse/PIG-1191 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Priority: Blocker When using a custom load/store function, one that returns complex data (map of maps, list of maps), for certain sequences of LOAD, FILTER, FOREACH pig script throws an exception of the form - org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to actual-type at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639) ... Looking through the code of POCast, apparently the operator was unable to find the right load function for doing the conversion and consequently bailed out with the exception failing the entire pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FORACH
[ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800609#action_12800609 ] Ankur commented on PIG-1191: Listed below are the identified cases.

CASE 1: LOAD - FILTER - FOREACH - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries = FILTER sds BY mapFields#'page_params'#'query' is NOT NULL;
queries_rand = FOREACH queries GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries_limit = LIMIT queries_rand 100;
STORE queries_limit INTO 'out';
RESULT
---
FAILS in the reduce stage with the following exception:
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)

CASE 2: LOAD - FOREACH - FILTER - LIMIT - STORE
===
Note that the FILTER and FOREACH order is reversed.
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries_rand = FOREACH sds GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries = FILTER queries_rand BY query_string IS NOT null;
queries_limit = LIMIT queries 100;
STORE queries_limit INTO 'out';
RESULT
---
SUCCESS - Results are correctly stored. So if a projection is done before FILTER, it receives the LoadFunc in the POCast operator and everything is cool.

CASE 3: LOAD - FOREACH - FOREACH - FILTER - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_filtered = FILTER queries BY query_string IS NOT null;
queries_limit = LIMIT queries_filtered 100;
STORE queries_limit INTO 'out';
RESULT
---
FAILS in the Map stage. Looks like the 2nd FOREACH did not get the loadFunc and bailed out with the following stack trace:
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at

CASE 4: LOAD - FOREACH - FOREACH - LIMIT - STORE
===
SCRIPT
---
sds = LOAD '/my/data/location' USING my.org.MyMapLoader() AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
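The common thread in the failing cases above is that POCast receives a bytearray but no longer knows which load function produced it. A minimal, hypothetical sketch of that mechanism (illustrative names, not Pig's actual POCast code): when the converter link is lost during plan rewriting, the cast has nothing to convert with and must bail out with ERROR 1075.

```java
// Simplified model of the POCast behaviour behind ERROR 1075. Hypothetical
// names; in Pig the ByteConverter role is played by the script's LoadFunc.
public class CastSketch {
    public interface ByteConverter { String bytesToCharArray(byte[] b); }

    // A null caster models the "lost LoadFunc" situation in CASE 1 / CASE 3.
    private final ByteConverter caster;

    public CastSketch(ByteConverter caster) { this.caster = caster; }

    public String castToString(byte[] raw) {
        if (caster == null) {
            // No known producer for the bytes: bail out, as POCast does.
            throw new RuntimeException(
                "ERROR 1075: Received a bytearray from the UDF. "
                + "Cannot determine how to convert the bytearray to string.");
        }
        return caster.bytesToCharArray(raw);
    }

    public static void main(String[] args) {
        CastSketch withCaster = new CastSketch(b -> new String(b));
        System.out.println(withCaster.castToString("query".getBytes()));  // prints query
    }
}
```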
[jira] Commented: (PIG-761) ERROR 2086 on simple JOIN
[ https://issues.apache.org/jira/browse/PIG-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794005#action_12794005 ] Ankur commented on PIG-761: --- Here is a very simple script to reproduce the issue:
- Start -
data1 = LOAD 'data1' as (a:int, b:int, c:chararray);
proj1 = LIMIT data1 5;
data2 = LOAD 'data2' as (x:int, y:chararray, z:chararray);
proj2 = FOREACH data2 GENERATE x, y;
cogrouped = COGROUP proj1 BY a, proj2 BY x INNER PARALLEL 2;
joined = FOREACH cogrouped GENERATE FLATTEN(proj1), FLATTEN(proj2);
store joined into 'results';
- End -
The problem seems to be with the LIMIT operator for one of the relations participating in the join. Seems like this causes the mismatch between the expected and found local re-arrange operators. ERROR 2086 on simple JOIN - Key: PIG-761 URL: https://issues.apache.org/jira/browse/PIG-761 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Environment: mapreduce mode Reporter: Vadim Zaliva ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109 Doing a pretty straightforward join in one of my pig scripts. I am able to 'dump' both relationships involved in this join. When I try to join them I am getting this error. Here is a full log: ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109 at org.apache.pig.PigServer.registerQuery(PigServer.java:296) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:319) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:700) at org.apache.pig.PigServer.execute(PigServer.java:691) at org.apache.pig.PigServer.registerQuery(PigServer.java:292) ... 5 more Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators. 
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.handlePackage(POPackageAnnotator.java:116) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.visitMROp(POPackageAnnotator.java:88) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:194) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:43) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. MapReduceLauncher.compile(MapReduceLauncher.java:198) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:80) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:261) ... 8 more ERROR 1002: Unable to store alias 398 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 398 at org.apache.pig.PigServer.registerQuery(PigServer.java:296) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:319) Caused by: java.lang.NullPointerException at
[jira] Created: (PIG-1168) Dump produces wrong results
Dump produces wrong results --- Key: PIG-1168 URL: https://issues.apache.org/jira/browse/PIG-1168 Project: Pig Issue Type: Bug Reporter: Ankur For a map-only job, dump just re-executes every pig-latin statement from the beginning, assuming that they would produce the same result. The assumption is not valid if there are UDFs that are invoked. Consider the following script:
raw = LOAD '$input' USING PigStorage() AS (text_string:chararray);
DUMP raw;
ccm = FOREACH raw GENERATE MyUDF(text_string);
DUMP ccm;
bug = FOREACH ccm GENERATE ccmObj;
DUMP bug;
The UDF MyUDF generates a tuple with one of the fields being a randomly generated UUID. So even though one would expect relations 'ccm' and 'bug' to contain identical data, they are different because of re-execution from the beginning. This breaks the application logic. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
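A tiny illustration of why re-execution breaks here, assuming a hypothetical MyUDF like the one described: any evaluation that embeds a random UUID yields different output on each run, so two executions of the same statements cannot be expected to match.

```java
import java.util.UUID;

// Hypothetical stand-in for the MyUDF described in PIG-1168: it appends a
// fresh random UUID on every evaluation, so re-executing the script for
// each DUMP produces different data for 'ccm' and 'bug'.
public class MyUDF {
    public static String exec(String text) {
        return text + "|" + UUID.randomUUID();  // fresh UUID per evaluation
    }

    public static void main(String[] args) {
        System.out.println(MyUDF.exec("hello"));
        System.out.println(MyUDF.exec("hello"));  // different UUID suffix
    }
}
```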
[jira] Created: (PIG-1152) bincond operator throws parser error
bincond operator throws parser error Key: PIG-1152 URL: https://issues.apache.org/jira/browse/PIG-1152 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur The bincond operator throws a parser error when the true branch contains a constant bag with 1 tuple containing a single field of int type with a negative value. Here is the script to reproduce the issue:
A = load 'A' as (s: chararray, x: int, y: int);
B = group A by s;
C = foreach B generate group, flatten(((COUNT(A) 1L) ? {(-1)} : A.x));
dump C;
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1114: --- Attachment: Pig_1114_Client.log MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: Richard Ding Priority: Critical Fix For: 0.6.0 Attachments: Pig_1114_Client.log Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784070#action_12784070 ] Ankur commented on PIG-1114: Richard, I ran the above script again with -M option to confirm that Multiquery was not disabled, instead it worked on 2 separated parts of the script. I am attaching the pig client logs from the run for your reference. MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: Richard Ding Priority: Critical Fix For: 0.6.0 Attachments: Pig_1114_Client.log Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783553#action_12783553 ] Ankur commented on PIG-1114: The error thrown is java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableTuple, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Multi-query optimization throws an error when merging 2 level splits. 
Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783554#action_12783554 ] Ankur commented on PIG-1114: The same script works with the -M (multi-query disabled) option, BUT surprisingly the run indicates that multi-query optimization is now being applied separately to the first STORE and the second STORE. This is just a workaround, but it also indicates that in cases like this, disabling multi-query actually DOES NOT disable it completely; instead it just makes it run on parts of the script. MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated
by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1114) MultiQuery optimization throws error when merging 2 level splits
[ https://issues.apache.org/jira/browse/PIG-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1114: --- Fix Version/s: 0.6.0 MultiQuery optimization throws error when merging 2 level splits Key: PIG-1114 URL: https://issues.apache.org/jira/browse/PIG-1114 Project: Pig Issue Type: Bug Reporter: Ankur Priority: Critical Fix For: 0.6.0 Multi-query optimization throws an error when merging 2 level splits. Following is the script to reproduce the error data = LOAD 'data' USING PigStorage() AS (id:int, name:chararray); ids = FOREACH data GENERATE id; allId = GROUP ids all; allIdCount = FOREACH allId GENERATE group as allId, COUNT(ids) as total; idGroup = GROUP ids by id; idGroupCount = FOREACH idGroup GENERATE group as id, COUNT(ids) as count; countTotal = cross idGroupCount, allIdCount; idCountTotal = foreach countTotal generate id, count, total, (double)count / (double)total as proportion; orderedCounts = order idCountTotal by count desc; STORE orderedCounts INTO 'mq_problem/ids'; names = FOREACH data GENERATE name; allNames = GROUP names all; allNamesCount = FOREACH allNames GENERATE group as namesAll, COUNT(names) as total; nameGroup = GROUP names by name; nameGroupCount = FOREACH nameGroup GENERATE group as name, COUNT(names) as count; namesCrossed = cross nameGroupCount, allNamesCount; nameCountTotal = foreach namesCrossed generate name, count, total, (double)count / (double)total as proportion; nameCountsOrdered = order nameCountTotal by count desc; STORE nameCountsOrdered INTO 'mq_problem/names'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1112) FLATTEN eliminates the alias
FLATTEN eliminates the alias Key: PIG-1112 URL: https://issues.apache.org/jira/browse/PIG-1112 Project: Pig Issue Type: Bug Reporter: Ankur Fix For: 0.6.0 If schema for a field of type 'bag' is partially defined then FLATTEN() incorrectly eliminates the field and throws an error. Consider the following example:
A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, ladder:bag{});
B = FOREACH A GENERATE first, FLATTEN(ladder) as third, second;
C = GROUP B by (first,third);
This throws the error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: third in {first: chararray,second: chararray}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1113) Diamond query optimization throws error in JOIN
Diamond query optimization throws error in JOIN --- Key: PIG-1113 URL: https://issues.apache.org/jira/browse/PIG-1113 Project: Pig Issue Type: Bug Reporter: Ankur The following script results in 1 M/R job as a result of diamond query optimization but the script fails. set1 = LOAD 'set1' USING PigStorage as (a:chararray, b:chararray, c:chararray); set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{}); set2_1 = FOREACH set2 GENERATE a as f1, b as f2, (chararray) 0 as f3; set2_2 = FOREACH set2 GENERATE a as f1, FLATTEN((IsEmpty(c) ? null : c)) as f2, (chararray) 1 as f3; all_set2 = UNION set2_1, set2_2; joined_sets = JOIN set1 BY (a,b), all_set2 BY (f2,f3); dump joined_sets; And here is the error org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a bag to a String at org.apache.pig.data.DataType.toString(DataType.java:739) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:625) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1113) Diamond query optimization throws error in JOIN
[ https://issues.apache.org/jira/browse/PIG-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782877#action_12782877 ] Ankur commented on PIG-1113: The script fails even if correct schema is specified for the c:bag{}. So the following change does not alleviate the problem set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{T:tuple(l:chararray)}); Diamond query optimization throws error in JOIN --- Key: PIG-1113 URL: https://issues.apache.org/jira/browse/PIG-1113 Project: Pig Issue Type: Bug Reporter: Ankur The following script results in 1 M/R job as a result of diamond query optimization but the script fails. set1 = LOAD 'set1' USING PigStorage as (a:chararray, b:chararray, c:chararray); set2 = LOAD 'set2' USING PigStorage as (a: chararray, b:chararray, c:bag{}); set2_1 = FOREACH set2 GENERATE a as f1, b as f2, (chararray) 0 as f3; set2_2 = FOREACH set2 GENERATE a as f1, FLATTEN((IsEmpty(c) ? null : c)) as f2, (chararray) 1 as f3; all_set2 = UNION set2_1, set2_2; joined_sets = JOIN set1 BY (a,b), all_set2 BY (f2,f3); dump joined_sets; And here is the error org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a bag to a String at org.apache.pig.data.DataType.toString(DataType.java:739) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:625) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1108) Incorrect map output key type in MultiQuery optimization
[ https://issues.apache.org/jira/browse/PIG-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782787#action_12782787 ] Ankur commented on PIG-1108: In my test run on the 0.6.0 branch, disabling MQ did not work. Pig client logs showed that MQ was still kicking in, and the mappers failed with the same error message as in the description. It would be good if we could add a few points about SecondaryKey here - http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification Incorrect map output key type in MultiQuery optimization Key: PIG-1108 URL: https://issues.apache.org/jira/browse/PIG-1108 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Richard Ding When trying to merge 2 split plans, one of which never progresses along an M/R boundary, Pig sets the map-output key type incorrectly, resulting in the following error:- java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Here is a small script to be used as a reproducible test case rmf plan1 rmf plan2 A = LOAD 'data' USING PigStorage() as (a: int, b: chararray); 
SPLIT A into plan1 IF (a > 5), plan2 IF (a <= 5); B = GROUP plan1 BY b; C = FOREACH B { tmp = ORDER plan1 BY a desc; GENERATE FLATTEN(group) as b, tmp; }; D = FILTER C BY b is not null; STORE D into 'plan1'; STORE plan2 into 'plan2'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1075) Error in Cogroup when key fields types don't match
Error in Cogroup when key fields types don't match -- Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:- A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int); B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int); C = CoGROUP A BY (a,b,c), B BY (a,b,c); D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B); describe D; dump D; The complete stack trace of the error thrown is Pig Stack Trace --- ERROR 1051: Cannot cast to Unknown org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D at org.apache.pig.PigServer.dumpSchema(PigServer.java:436) at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83) at org.apache.pig.PigServer.compileLp(PigServer.java:821) at org.apache.pig.PigServer.dumpSchema(PigServer.java:428) ... 
6 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 11 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451) ... 16 more The error message does not help the user in identifying the issue clearly especially if the pig script is large and complex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
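Until the error message is improved, the underlying mismatch can be avoided by casting the keys to a common type before the COGROUP. A minimal sketch based on the script above (casting b in A to chararray so both key tuples have matching types):

```pig
A = LOAD 'data' USING PigStorage() AS (a:chararray, b:int, c:int);
A1 = FOREACH A GENERATE a, (chararray) b AS b, c;
B = LOAD 'data' USING PigStorage() AS (a:chararray, b:chararray, c:int);
C = COGROUP A1 BY (a, b, c), B BY (a, b, c);
D = FOREACH C GENERATE FLATTEN(A1), FLATTEN(B);
```

With matching key-field types the type checker can resolve the COGROUP output schema, so ERROR 1051 / ERROR 1060 no longer surface.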
[jira] Commented: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774222#action_12774222 ] Ankur commented on PIG-1075: Pig should throw an error message that better identifies the cause of the problem. Error in Cogroup when key fields types don't match -- Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:- A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int); B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int); C = CoGROUP A BY (a,b,c), B BY (a,b,c); D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B); describe D; dump D; The complete stack trace of the error thrown is Pig Stack Trace --- ERROR 1051: Cannot cast to Unknown org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D at org.apache.pig.PigServer.dumpSchema(PigServer.java:436) at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at 
org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83) at org.apache.pig.PigServer.compileLp(PigServer.java:821) at org.apache.pig.PigServer.dumpSchema(PigServer.java:428) ... 6 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 11 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451) ... 16 more The error message does not help the user in identifying the issue clearly especially if the pig script is large and complex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12773389#action_12773389 ] Ankur commented on PIG-958: --- Can you explain this a little bit more - .. In the earlier patch (958.v3.patch), after moving the results from the task's current working directory, I was manually deleting the directory. This is to ensure that empty part files don't get moved to the final output directory. But doing so causes Hadoop to complain that it can no longer write to the task's output dir, and the task fails. I saw compile errors while trying to run unit test: ... Did you compile pig.jar and run the core tests before? This creates the necessary classes and jar files on the local machine required by the contrib tests. On my local machine gan...@grainflydivide-dr:pig_trunk$ ant ... buildJar: [echo] svnString 830456 [jar] Building jar: /home/gankur/eclipse/workspace/pig_trunk/build/pig-0.6.0-dev-core.jar [jar] Building jar: /home/gankur/eclipse/workspace/pig_trunk/build/pig-0.6.0-dev.jar [copy] Copying 1 file to /home/gankur/eclipse/workspace/pig_trunk gan...@grainflydivide-dr:pig_trunk$ ant test ... test-core: [delete] Deleting directory /home/gankur/eclipse/workspace/pig_trunk/build/test/logs [mkdir] Created dir: /home/gankur/eclipse/workspace/pig_trunk/build/test/logs [junit] Running org.apache.pig.test.TestAdd [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.024 sec [junit] Running org.apache.pig.test.TestAlgebraicEval ... gan...@grainflydivide-dr:pig_trunk$ cd contrib/piggybank/java/ gan...@grainflydivide-dr:java$ ant test ... 
test: [echo] *** Running UDF tests *** [delete] Deleting directory /home/gankur/eclipse/workspace/pig_trunk/contrib/piggybank/java/build/test/logs [mkdir] Created dir: /home/gankur/eclipse/workspace/pig_trunk/contrib/piggybank/java/build/test/logs [junit] Running org.apache.pig.piggybank.test.evaluation.TestEvalString [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.15 sec [junit] Running org.apache.pig.piggybank.test.evaluation.TestMathUDF [junit] Tests run: 35, Failures: 0, Errors: 0, Time elapsed: 0.123 sec [junit] Running org.apache.pig.piggybank.test.evaluation.TestStat [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.114 sec [junit] Running org.apache.pig.piggybank.test.evaluation.datetime.TestDiffDate [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.105 sec [junit] Running org.apache.pig.piggybank.test.evaluation.decode.TestDecode [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.089 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestHashFNV [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.094 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 17.163 sec [junit] Running org.apache.pig.piggybank.test.evaluation.string.TestRegex [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.092 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.TestSearchQuery [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.093 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.TestTop [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.099 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestDateExtractor [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.087 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestHostExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, 
Time elapsed: 0.083 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestSearchEngineExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.091 sec [junit] Running org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestSearchTermExtractor [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.1 sec [junit] Running org.apache.pig.piggybank.test.storage.TestCombinedLogLoader [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.535 sec [junit] Running org.apache.pig.piggybank.test.storage.TestCommonLogLoader [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.54 sec [junit] Running org.apache.pig.piggybank.test.storage.TestHelper [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.014 sec [junit] Running org.apache.pig.piggybank.test.storage.TestMultiStorage [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 16.964 sec [junit] Running org.apache.pig.piggybank.test.storage.TestMyRegExLoader [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.452 sec [junit] Running
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772925#action_12772925 ] Ankur commented on PIG-958: --- Can we have an update on this please ? Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
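The core idea of the proposed store function can be sketched independently of the Pig StoreFunc API: at write time, each record is routed to a bucket derived dynamically from the value of a key field. The class and method names below are illustrative, not taken from the patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Routes each record into a bucket named after the value of its key field,
// the way a key-splitting store function routes tuples to per-key output
// files/directories not known until the data is seen.
public class KeySplitter {
    private final Map<String, List<String>> buckets = new HashMap<>();

    // keyIndex selects which tab-separated field determines the bucket.
    public void put(String record, int keyIndex) {
        String key = record.split("\t")[keyIndex];
        buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
    }

    public Map<String, List<String>> getBuckets() {
        return buckets;
    }
}
```

In the real UDF the buckets would be per-key writers under a user-specified parent output directory rather than in-memory lists, but the routing logic is the same.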
[jira] Created: (PIG-1060) MultiQuery optimization throws error for multi-level splits
MultiQuery optimization throws error for multi-level splits --- Key: PIG-1060 URL: https://issues.apache.org/jira/browse/PIG-1060 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Consider the following scenario :- 1. Multi-level splits in the map plan. 2. Each split branch further progressing across a local-global rearrange. 3. Output of each of these finally merged via a UNION. MultiQuery optimizer throws the following error in such a case: ERROR 2146: Internal Error. Inconsistency in key index found during optimization. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits
[ https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771390#action_12771390 ] Ankur commented on PIG-1060: Here's a sample script to illustrate the issue. Note that the sample data isn't very important here, since it is the optimization and execution that fail. === test.pig data = LOAD 'dummy' as (name:chararray, freq:int); filter1 = FILTER data BY freq > 5; group1 = GROUP filter1 BY name; proj1 = FOREACH group1 GENERATE FLATTEN(group), 'string1', SUM(filter1.freq); filter2 = FILTER data by freq > 5; group2 = GROUP filter2 BY name; proj2 = FOREACH group2 GENERATE FLATTEN(group), 'string2', SUM(filter2.freq); filter3 = FILTER filter2 by freq < 10; group3 = GROUP filter3 BY name; proj3 = FOREACH group3 GENERATE FLATTEN(group), 'string3', SUM(filter3.freq); filter4 = FILTER filter3 by freq > 7; group4 = GROUP filter4 BY name; proj4 = FOREACH group4 GENERATE FLATTEN(group), 'string4', SUM(filter4.freq); M1 = LIMIT proj1 10; M2 = LIMIT proj2 10; M3 = LIMIT proj3 10; M4 = LIMIT proj4 10; U = UNION M1, M2, M3, M4; STORE U INTO 'res' USING PigStorage(); The dot output can be dumped via the command 'explain -dot -script test.pig;' to visualize the scenario. A surprising observation is that despite turning MultiQuery off using -M, the MultiQuery optimizer still runs and fails the script. MultiQuery optimization throws error for multi-level splits --- Key: PIG-1060 URL: https://issues.apache.org/jira/browse/PIG-1060 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Consider the following scenario :- 1. Multi-level splits in the map plan. 2. Each split branch further progressing across a local-global rearrange. 3. Output of each of these finally merged via a UNION. MultiQuery optimizer throws the following error in such a case: ERROR 2146: Internal Error. Inconsistency in key index found during optimization. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Attachment: 958.v4.patch 1. When run in cluster mode, the static variable PigMapReduce.sJobConf is null when checked in the UDF constructor but NOT null when the UDF is actually invoked. This caused incorrect initialization of the FileSystem object 'fs' to the local filesystem, causing the test to fail. Moved the 'fs' initialization to the initJobSpecificParams() method. 2. Deleting the temporary directory manually in finish() causes the job to fail. Removed the manual deletion. As a side effect, the user-specified PARENT output directory in the UDF will have empty part-* files. These should be deleted manually by the user. Verified that the UDF works correctly and that the unit tests pass. Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
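Fix 1 above is an instance of the general lazy-initialization pattern: defer an environment-dependent handle out of the constructor (where the job configuration is not yet available) to the first actual invocation. A minimal, self-contained sketch of that pattern; the names here are illustrative stand-ins for the UDF's 'fs' field, not code from the patch:

```java
import java.util.function.Supplier;

// Lazy initialization: the environment-dependent handle is not created in
// the constructor (where the configuration may still be null) but on first
// use, mirroring the move of the FileSystem setup out of the constructor
// and into a per-job initialization method.
public class LazyHandle {
    private Object handle;                       // stands in for the FileSystem field
    private final Supplier<Object> factory;      // stands in for FileSystem lookup

    public LazyHandle(Supplier<Object> factory) {
        this.factory = factory;                  // nothing resolved yet
    }

    public Object get() {
        if (handle == null) {                    // resolved only when actually invoked
            handle = factory.get();
        }
        return handle;
    }
}
```

Because the factory runs at invocation time, it observes the fully populated job state instead of the constructor-time null, which is exactly why moving the 'fs' setup fixed the cluster-mode failure.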
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770042#action_12770042 ] Ankur commented on PIG-958: --- Just back from vacation. Have updated the code with required changes. It should be good to go now. Pradeep can you or any other committer review it ? Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Attachment: 958.v3.patch Pradeep, Thanks for your review comments. I have incorporated the suggestions provided in the code review. The code is vastly simplified, cleaner and more readable :-). Unit tests now pass in local mode but fail in cluster mode after taking an update of the Pig code base. The error I see is :- hdfs://localhost.localdomain:40352/user/gankur/output/_temporary/_attempt_20091009030519686_0001_m_00_0/output, expected: file:/// Looks like a config issue with org.apache.pig.test.MiniCluster in the latest Pig code. I didn't get time to debug this as I am going on vacation. Regardless, I have attached the new patch for your review. Please suggest what needs to be done to pass the unit tests in cluster mode. -Ankur Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-976) Multi-query optimization throws ClassCastException
Multi-query optimization throws ClassCastException -- Key: PIG-976 URL: https://issues.apache.org/jira/browse/PIG-976 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Ankur Multi-query optimization fails to merge 2 branches when 1 is a result of Group By ALL and another is a result of Group By field1 where field 1 is of type long. Here is the script that fails with multi-query on. data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double); A = GROUP data ALL; B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2; C = FOREACH B GENERATE (sum1/sum2) AS rate; STORE C INTO 'result1'; D = GROUP data BY a; E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c); STORE E into 'result2'; Here is the exception from the logs java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Open (was: Patch Available) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch, 958.v2.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Patch Available (was: Open) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v2.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-958: -- Status: Patch Available (was: Open) Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755928#action_12755928 ] Ankur commented on PIG-958: --- Hudson seems to be failing during compilation as my test case defined in package org.apache.pig.piggybank.test.storage is reusing certain classes from org.apache.pig.test, namely 'Util' and MiniCluster. Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v1.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-894) order-by fails when input is empty
[ https://issues.apache.org/jira/browse/PIG-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754883#action_12754883 ] Ankur commented on PIG-894: --- Is 'empty input' referring to relation l ('students.txt') or f (filter l by 1 == 2)? I am seeing a similar issue where the sampler produces an empty file when the number of records in the relation being sorted is too low ( 4 ). order-by fails when input is empty -- Key: PIG-894 URL: https://issues.apache.org/jira/browse/PIG-894 Project: Pig Issue Type: Bug Reporter: Thejas M Nair grunt l = load 'students.txt' ; grunt f = filter l by 1 == 2; grunt o = order f by $0 ; grunt dump o; This results in 3 MR jobs. The 2nd (sampling) MR creates an empty sample file, and the 3rd MR (order-by) fails with the following error in the Map job - java.lang.RuntimeException: java.lang.RuntimeException: Empty samples file at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:104) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.(MapTask.java:348) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Caused by: java.lang.RuntimeException: Empty samples file at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:89) ... 5 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
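The failure above comes from building range boundaries out of an empty sample file. A minimal sketch of the degenerate-case handling, where pickPartition is a hypothetical helper and not Pig's actual WeightedRangePartitioner code, would fall back to a single partition instead of throwing when no samples exist:

```java
import java.util.List;

// Sketch of range partitioning from sampled quantile boundaries.
// RangePartitionSketch is illustrative; the real partitioner reads the
// sampled boundaries from the sample file produced by the sampling job.
public class RangePartitionSketch {
    // quantiles: sorted sample boundaries; a key goes to the first range it fits.
    public static int pickPartition(int key, List<Integer> quantiles) {
        if (quantiles.isEmpty()) {
            return 0; // degenerate case: no samples, use a single partition
        }
        int p = 0;
        for (int q : quantiles) {
            if (key <= q) {
                return p;
            }
            p++;
        }
        return quantiles.size(); // beyond the last boundary
    }
}
```

With boundaries [3, 7] this yields three partitions (keys up to 3, 4 to 7, and above 7), and with no boundaries every key lands in partition 0 rather than failing.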
[jira] Created: (PIG-958) Splitting output data on key field
Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
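The store function described in PIG-958 boils down to demultiplexing records across outputs named by a key field's value, creating each output lazily the first time its key is seen. A minimal in-memory sketch, where KeyedDemux and its StringWriter outputs are illustrative stand-ins for the real per-directory file writers a store function would manage:

```java
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Sketch of splitting output records by a key field: one logical output per
// distinct key, created on first use. StringWriter stands in for a real
// HDFS file writer opened under a key-named directory.
public class KeyedDemux {
    private final Map<String, StringWriter> writers = new HashMap<>();

    // Append one record line to the output identified by its key field.
    public void write(String key, String record) {
        writers.computeIfAbsent(key, k -> new StringWriter())
               .write(record + "\n");
    }

    public String contentsFor(String key) {
        StringWriter w = writers.get(key);
        return w == null ? "" : w.toString();
    }

    public int outputCount() {
        return writers.size();
    }
}
```

Writing records under keys "web" and "img" produces two separate outputs, each preserving the arrival order of its own records.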
[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742748#action_12742748 ] Ankur commented on PIG-919: --- I have seen this issue in other places when the value coming out of a map[] is used in a group/cogroup/join. Pig throws the same error. And Viraj is right, explicit casting to chararray alleviates the issue. But this is confusing for users. Pig should be converting NullableText to NullableBytesWritable automatically. Here is another sample script that throws the error; explicit casting to chararray resolves the issue: data = LOAD 'mydata' USING CustomLoader() AS (f1:double, f2: map[]); dataProjected = FOREACH data GENERATE f2#'Url' as url, f1 as rank; data2 = LOAD 'urlList' AS (url:bytearray); grouped = COGROUP dataProjected BY url, data2 BY url PARALLEL 10; STORE grouped INTO 'results'; Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0, which corresponds to the first name of the student. 
{code} register mymapudf.jar; data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float); genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks; getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks; filternonnullfirstnames = filter getfirstnames by firstname is not null; groupgenmap = group filternonnullfirstnames by firstname; dump groupgenmap; {code} When I execute this code, I get an error in the Map Phase: === java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-871) Improve distribution of keys in reduce phase
Improve distribution of keys in reduce phase Key: PIG-871 URL: https://issues.apache.org/jira/browse/PIG-871 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Ankur The default hashing scheme used to distribute keys in the reduce phase sometimes results in an uneven distribution of keys, with 5-10% of reducers being overloaded with data. This bottleneck makes Pig jobs really slow and gives users a bad impression. While there is no bulletproof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give a high avalanche effect and a more even distribution. We can start by evaluating MurmurHash, which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
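The improvement proposed in PIG-871 amounts to inserting an avalanche mixing step between the key's hash code and the modulo over the reducer count, so that small differences between keys flip many output bits. A rough sketch using the well-known MurmurHash3 32-bit finalizer as the mixer; AvalanchePartition is illustrative, not Pig's actual partitioner:

```java
// Sketch of hashing with better avalanche behavior: mix the raw hashCode
// through the MurmurHash3 32-bit finalizer before reducing modulo the
// number of reducers. partitionFor is a hypothetical helper for illustration.
public class AvalanchePartition {
    // MurmurHash3 fmix32: each output bit depends on every input bit.
    static int mix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static int partitionFor(Object key, int numReducers) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (mix32(key.hashCode()) & Integer.MAX_VALUE) % numReducers;
    }
}
```

The mixer is deterministic, so the same key always lands on the same reducer, which is the property a partitioner must preserve while improving spread.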
[jira] Commented: (PIG-754) Bugs with load and store and filenames passed with -param containing periods
[ https://issues.apache.org/jira/browse/PIG-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724065#action_12724065 ] Ankur commented on PIG-754: --- Verified in the latest code that fixing PIG-564 does resolve this issue. This should be marked as duplicate of PIG-564 and closed Bugs with load and store and filenames passed with -param containing periods Key: PIG-754 URL: https://issues.apache.org/jira/browse/PIG-754 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz This one drove me batty. I have two files file and file.right. file: {code} WRONG This is file, not file.right. {code} file.right: {code} RIGHT This is file.right.. {code} infile.pig: {code} A = load '$infile' using PigStorage(); dump A; {code} When I pass in file.right as the infile parameter value, the wrong file is read: {code} -bash-3.00$ pig -exectype local -param infile=file.right infile.pig USING: /grid/0/gs/pig/current 2009-04-05 23:18:36,291 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:18:36,292 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (WRONG ) (This is file, not file.right.) {code} However, if I pass in infile as ./file.right, the script magically works. {code} -bash-3.00$ pig -exectype local -param infile=./file.right infile.pig USING: /grid/0/gs/pig/current 2009-04-05 23:20:46,735 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:20:46,736 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (RIGHT) (This is file.right.) 
{code} I do not have this problem if I use the file name with a period in the script itself: infile2.pig {code} A = load 'file.right' using PigStorage(); dump A; {code} {code} -bash-3.00$ pig -exectype local infile2.pig USING: /grid/0/gs/pig/current 2009-04-05 23:22:47,022 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-05 23:22:47,023 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (RIGHT) (This is file.right.) {code} I also experience similar problems when I try to pass in param outfile in a store statement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig
[ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717635#action_12717635 ] Ankur commented on PIG-821: --- OK, so I tried writing an NTILE UDF that accepts 1. Number of tiles 2. A bag of sorted tuples The problem with that is that it is essentially a serial process instead of a parallel one, as one would expect. So I am not sure an NTILE operation can be done efficiently via a UDF. An efficient NTILE operation over a sorted dataset should 1. Partition the sorted data into the number of tiles requested 2. Preserve the ordering in each tile. 3. Have each tile contain exactly the number of elements required by NTILE logic. There is a total-ordering partitioner in Hadoop - http://issues.apache.org/jira/browse/HADOOP-3019 - that effects a total ordering of the output data. However, it cannot strictly enforce the number of elements contained in each part output, which is a necessary condition to comply with NTILE logic. Any thoughts? simulate NTILE(n) , rank() functionality in pig --- Key: PIG-821 URL: https://issues.apache.org/jira/browse/PIG-821 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.2.0 Environment: mithril gold -gateway 4000 Reporter: Rekha Fix For: 0.2.0 Hi, I came across a job with some processing that I can't seem to get easily over-the-counter from Pig. These are the NTILE()/rank() operations available in Oracle. While I am trying to write a UDF, that is not working out too well for me yet.. :( I have an ntile(n) over (partition by x, y, z order by a desc, b desc) operation to be done in Pig scripts. Is there a default function in Pig scripting which can do this? For example, let's consider a simple example at http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm So here, what would we ideally substitute NTILE() with? Any Pig counterpart function/udf? 
SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) AS quartile FROM employees WHERE department_id = 100;

LAST_NAME   SALARY  QUARTILE
---------   ------  --------
Greenberg    12000         1
Faviet        9000         1
Chen          8200         2
Urman         7800         2
Sciarra       7700         3
Popp          6900         4

In the real case, I have ntile over multiple columns, so an ideal way to find histograms/boundaries/spit out the bucket number is needed. Similarly, a Pig function is required for rank() over (partition by a,b,c order by d desc) as e. Please let me know soon. Thanks Regards, /Rekha -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
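The NTILE semantics discussed in this issue reduce to a small piece of bucket arithmetic: with total rows and n tiles, the first total % n buckets each hold one extra row, and every row's bucket follows from its position in the sorted order. A sketch of that logic as a pure function; Ntile.bucket is a hypothetical helper, not an existing Pig UDF:

```java
// NTILE bucket assignment: given a row's 0-based position in the sorted
// order, the total row count, and the tile count n, return its 1-based
// bucket. The first (total % n) buckets each hold one extra row, matching
// SQL NTILE semantics.
public class Ntile {
    public static int bucket(int position, int total, int n) {
        int base = total / n;   // minimum rows per bucket
        int extra = total % n;  // the first 'extra' buckets get base+1 rows
        int bigRows = extra * (base + 1); // rows covered by the larger buckets
        if (position < bigRows) {
            return position / (base + 1) + 1;
        }
        return extra + (position - bigRows) / base + 1;
    }
}
```

For the 6-row Oracle example above with NTILE(4), positions 0 through 5 map to quartiles 1, 1, 2, 2, 3, 4, matching the query output.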
[jira] Updated: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-732: -- Attachment: udf.v5.patch Minor issue in a test case was causing a test failure. Fixed in the latest upload - udf.v5.patch. Also changed TopN to Top. Should be good to go now. Utility UDFs - Key: PIG-732 URL: https://issues.apache.org/jira/browse/PIG-732 Project: Pig Issue Type: New Feature Reporter: Ankur Priority: Minor Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, udf.v5.patch Two utility UDFs and their respective test cases. 1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (type long) to use for comparison, and a sorted/unsorted bag of tuples. It outputs a bag containing the top N tuples. 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
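The Top UDF described above can keep memory bounded by maintaining a min-heap of size N while scanning the bag, so only the N largest values by the comparison field survive. A simplified sketch over plain longs standing in for the tuples' comparison field; TopSketch is illustrative, not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Bounded top-N selection: push each value onto a min-heap and evict the
// current minimum whenever the heap exceeds N, leaving the N largest.
public class TopSketch {
    public static List<Long> top(int n, Iterable<Long> values) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // natural-order min-heap
        for (long v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest, keeping the N largest
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(null); // heap iteration order is unspecified, so sort for output
        return out;
    }
}
```

With n = 3 over the values 5, 1, 9, 3, 7, the heap retains [5, 7, 9]; the bag never needs to be materialized or sorted in full, which is what makes the UDF usable on large bags.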
[jira] Commented: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704014#action_12704014 ] Ankur commented on PIG-732: --- If there aren't any other issues, can we go ahead and commit these? Utility UDFs - Key: PIG-732 URL: https://issues.apache.org/jira/browse/PIG-732 Project: Pig Issue Type: New Feature Reporter: Ankur Priority: Minor Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch Two utility UDFs and their respective test cases. 1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (type long) to use for comparison, and a sorted/unsorted bag of tuples. It outputs a bag containing the top N tuples. 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.