[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site

2010-10-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917246#action_12917246
 ] 

Santhosh Srinivasan commented on PIG-1661:
--

Sure, worth a try.

 Add alternative search-provider to Pig site
 ---

 Key: PIG-1661
 URL: https://issues.apache.org/jira/browse/PIG-1661
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Priority: Minor
 Attachments: PIG-1661.patch


 Use the search-hadoop.com service to make search available over Pig sources, 
 mailing lists, wiki, etc.
 This was initially proposed on the user mailing list. The search service was 
 already added to the site's skin (common to all Hadoop-related projects) via 
 AVRO-626, so this issue is about enabling it for Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Santhosh Srinivasan (JIRA)
PigStorage should be able to read back complex data containing delimiters 
created by PigStorage
---

 Key: PIG-1344
 URL: https://issues.apache.org/jira/browse/PIG-1344
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
 Fix For: 0.8.0


With Pig 0.7, the TextDataParser has been removed and the logic to parse 
complex data types has moved to Utf8StorageConverter. However, this does not 
handle the case where the complex data types could contain delimiters ('{', 
'}', ',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
self-contained and more usable.
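
For illustration, a minimal local-mode sketch of the round-trip problem using 
PigServer (file names, data, and the driver class are invented, not from this 
issue):

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class DelimiterRoundTrip {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Suppose 'in.txt' holds one tab-separated line whose second field is
        // a tuple:  foo<TAB>(a,b)
        pig.registerQuery("A = LOAD 'in.txt' AS (k:chararray, t:(x:chararray, y:chararray));");
        // Storing A serializes t as the text "(a,b)".
        pig.store("A", "out");
        // Reading 'out' back makes Utf8StorageConverter re-parse "(a,b)". If x
        // itself contained '(' or ',', the text form would be ambiguous and
        // the original value could not be reconstructed.
        pig.registerQuery("B = LOAD 'out' AS (k:chararray, t:(x:chararray, y:chararray));");
        pig.openIterator("B");
    }
}
{code}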

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service

2010-03-26 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850342#action_12850342
 ] 

Santhosh Srinivasan commented on PIG-1331:
--

Jay, 

In PIG-823 there was a discussion around how Owl is different from Hive's 
metastore. Is that still true today? If not, can you elaborate on the key 
differences between the two systems?

Thanks,
Santhosh

 Owl Hadoop Table Management Service
 ---

 Key: PIG-1331
 URL: https://issues.apache.org/jira/browse/PIG-1331
 Project: Pig
  Issue Type: New Feature
Reporter: Jay Tang

 This JIRA is a proposal to create a Hadoop table management service: Owl. 
 Today, MapReduce and Pig applications interact directly with HDFS 
 directories and files and must deal with low-level data management issues 
 such as storage format, serialization/compression schemes, data layout, and 
 efficient data access, etc., often with different solutions. Owl aims to 
 provide a standard way to address this issue and abstract away the 
 complexities of reading/writing huge amounts of data from/to HDFS.
 Owl has a data access API that is modeled after the traditional Hadoop 
 InputFormat and a management API to manipulate Owl objects.  This JIRA is 
 related to PIG-823 (Hadoop Metadata Service) as Owl has an internal metadata 
 store.  Owl integrates with different storage modules, like Zebra, with a 
 pluggable architecture.
  Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
 time, it makes sense to move it to a Hadoop subproject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables

2010-01-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798917#action_12798917
 ] 

Santhosh Srinivasan commented on PIG-1117:
--

+1 on making it part of the main piggybank. We should not be creating a separate 
directory just to handle Hive.

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
 Fix For: 0.7.0

 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch


 I've coded a LoadFunc implementation that can read from Hive Columnar RC 
 tables; this is needed for a project that I'm working on because all our data 
 is stored using the Hive thrift-serialized Columnar RC format. I have looked 
 at the piggybank but did not find any implementation that could do this. 
 We've been running it on our cluster for the last week and have worked out 
 most bugs.
  
 There are still some improvements I would like to make, such as setting 
 the number of mappers based on date partitioning. It has been optimized to 
 read only specific columns, and it can churn through a data set almost 8 times 
 faster with this improvement because not all column data is read.
 I would like to contribute the class to the piggybank; can you guide me on 
 what I need to do?
 I've used Hive-specific classes to implement this; is it possible to add these 
 to the piggybank Ivy build for automatic download of the dependencies?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775968#action_12775968
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

The schema will then correspond to the prefix, as implemented today. For 
example, if the AS clause is defined for flatten($1), $1 flattens to 10 
columns, and the AS clause has 3 columns, then the first 3 columns get the 
specified names and the remaining columns are left undefined.
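
For illustration, a hypothetical example of the prefix behavior described 
above (data, file name, and driver class invented):

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class FlattenAsPrefix {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // 'wide.txt': a group field followed by a 10-column tuple.
        pig.registerQuery("A = LOAD 'wide.txt' AS (g:chararray, t:(c0,c1,c2,c3,c4,c5,c6,c7,c8,c9));");
        // FLATTEN(t) yields 10 columns but the AS clause names only 3; per the
        // comment above, the names x, y, z are applied as a prefix and the
        // remaining 7 columns are left without aliases.
        pig.registerQuery("B = FOREACH A GENERATE g, FLATTEN(t) AS (x, y, z);");
        pig.dumpSchema("B");
    }
}
{code}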

 In-determinate behaviour of Union when there are 2 non-matching schema's
 

 Key: PIG-1065
 URL: https://issues.apache.org/jira/browse/PIG-1065
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


 I have a script which first does a union of these schemas and then does an 
 ORDER BY of this result.
 {code}
 f1 = LOAD '1.txt' as (key:chararray, v:chararray);
 f2 = LOAD '2.txt' as (key:chararray);
 u0 = UNION f1, f2;
 describe u0;
 dump u0;
 u1 = ORDER u0 BY $0;
 dump u1;
 {code}
 When I run in Map Reduce mode I get the following result:
 $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
 
 Schema for u0 unknown.
 
 (1,2)
 (2,3)
 (1)
 (2)
 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias u1
 at org.apache.pig.PigServer.openIterator(PigServer.java:475)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 Caused by: java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 
 When I run the same script in local mode I get a different result, as we know 
 that local mode does not use any Hadoop classes.
 $java -cp pig.jar org.apache.pig.Main -x local broken.pig
 
 Schema for u0 unknown
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 Here are some questions:
 1) Why do we allow union if the schemas do not match?
 2) Should we not print an error message/warning so that the user knows that 
 this is not allowed or that he can get unexpected results?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776098#action_12776098
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

bq. Aliasing inside foreach is hugely useful for readability. Are you 
suggesting removing the ability to assign aliases inside a foreach, or just to 
change/assign schemas?

For consistency, all relational operators should support the AS clause. 
Gradually, aliasing on a per-column basis in foreach should be removed from 
the documentation, deprecated, and eventually removed. This is a long-term 
recommendation.
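
To make the two styles concrete, a hypothetical comparison (the relation-level 
AS clause is the proposed syntax, not something Pig accepts today):

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class AliasStyles {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("f1 = LOAD '1.txt' AS (key:chararray, v:chararray);");
        // Today: per-column aliasing happens inside FOREACH.
        pig.registerQuery("p = FOREACH f1 GENERATE key AS k, v AS val;");
        // Proposed (not valid Pig Latin today): alias the columns of the
        // relation itself, mirroring LOAD's AS clause, e.g.
        //   u = UNION f1, f2 AS (key, v);
        pig.dumpSchema("p");
    }
}
{code}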

 In-determinate behaviour of Union when there are 2 non-matching schema's
 

 Key: PIG-1065
 URL: https://issues.apache.org/jira/browse/PIG-1065
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


 I have a script which first does a union of these schemas and then does an 
 ORDER BY of this result.
 {code}
 f1 = LOAD '1.txt' as (key:chararray, v:chararray);
 f2 = LOAD '2.txt' as (key:chararray);
 u0 = UNION f1, f2;
 describe u0;
 dump u0;
 u1 = ORDER u0 BY $0;
 dump u1;
 {code}
 When I run in Map Reduce mode I get the following result:
 $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
 
 Schema for u0 unknown.
 
 (1,2)
 (2,3)
 (1)
 (2)
 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias u1
 at org.apache.pig.PigServer.openIterator(PigServer.java:475)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 Caused by: java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 
 When I run the same script in local mode I get a different result, as we know 
 that local mode does not use any Hadoop classes.
 $java -cp pig.jar org.apache.pig.Main -x local broken.pig
 
 Schema for u0 unknown
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 Here are some questions:
 1) Why do we allow union if the schemas do not match?
 2) Should we not print an error message/warning so that the user knows that 
 this is not allowed or that he can get unexpected results?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin

2009-11-05 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774147#action_12774147
 ] 

Santhosh Srinivasan commented on PIG-1073:
--

If my memory serves me correctly, the logical plan cloning was implemented (by 
me) for cloning the inner plans of foreach. As such, top-level plan cloning 
was never tested and some items are marked as TODO (see the visit methods for 
LOLoad, LOStore and LOStream).

If you want to use it as you mention in your test cases, then you need to add 
code for cloning the LOLoad, LOStore, LOStream and LOJoin operators.


 LogicalPlanCloner can't clone plan containing LOJoin
 

 Key: PIG-1073
 URL: https://issues.apache.org/jira/browse/PIG-1073
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan

 Add the following testcase to LogicalPlanBuilder.java:
 public void testLogicalPlanCloner() throws CloneNotSupportedException {
     LogicalPlan lp = buildPlan("C = join (load 'A') by $0, (load 'B') by $0;");
     LogicalPlanCloner cloner = new LogicalPlanCloner(lp);
     cloner.getClonedPlan();
 }
 and this fails with the following stacktrace:
 java.lang.NullPointerException
 at 
 org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171)
 at 
 org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63)
 at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213)
 at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46)
 at 
 org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2009-11-05 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774153#action_12774153
 ] 

Santhosh Srinivasan commented on PIG-1065:
--

Answer to Question 1: Pig 1.0 had that syntax and it was retained for backward 
compatibility. Paolo suggested that for uniformity, the 'AS' clause for the 
load statements should be extended to all relational operators. Gradually, 
column aliasing in foreach should be removed from the documentation and 
eventually removed from the language.

 In-determinate behaviour of Union when there are 2 non-matching schema's
 

 Key: PIG-1065
 URL: https://issues.apache.org/jira/browse/PIG-1065
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


 I have a script which first does a union of these schemas and then does an 
 ORDER BY of this result.
 {code}
 f1 = LOAD '1.txt' as (key:chararray, v:chararray);
 f2 = LOAD '2.txt' as (key:chararray);
 u0 = UNION f1, f2;
 describe u0;
 dump u0;
 u1 = ORDER u0 BY $0;
 dump u1;
 {code}
 When I run in Map Reduce mode I get the following result:
 $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
 
 Schema for u0 unknown.
 
 (1,2)
 (2,3)
 (1)
 (2)
 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias u1
 at org.apache.pig.PigServer.openIterator(PigServer.java:475)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 Caused by: java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 
 When I run the same script in local mode I get a different result, as we know 
 that local mode does not use any Hadoop classes.
 $java -cp pig.jar org.apache.pig.Main -x local broken.pig
 
 Schema for u0 unknown
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 Here are some questions:
 1) Why do we allow union if the schemas do not match?
 2) Should we not print an error message/warning so that the user knows that 
 this is not allowed or that he can get unexpected results?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771287#action_12771287
 ] 

Santhosh Srinivasan commented on PIG-1016:
--

I am summarizing my understanding of the patch that has been submitted by hc 
busy.

Root cause: PIG-880 changed the value type of maps in PigStorage from native 
Java types to DataByteArray. As a result of this change, parsing of complex 
types as map values was disabled.

Proposed fix: Revert the changes made as part of PIG-880 to interpret map 
values as Java types. In addition, change the comparison method to check for 
the object type and call the appropriate compareTo method. The latter is 
required to work around the fact that the front-end assigns the value type 
DataByteArray whereas the backend sees the actual type (Integer, Long, Tuple, 
DataBag, etc.)
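
For reference, a minimal sketch of the kind of data involved (file name and 
schema are hypothetical): a map whose value is a tuple, in PigStorage's text 
form.

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class MapWithTupleValue {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // 'maps.txt' holds lines like:  [k#(1,2)]
        // i.e. a map with key "k" whose value is the tuple (1,2). Before
        // PIG-880 the value was parsed into a real Tuple; after it, map values
        // are treated as DataByteArray, which is what this patch revisits.
        pig.registerQuery("A = LOAD 'maps.txt' AS (m:map[]);");
        pig.registerQuery("B = FOREACH A GENERATE m#'k';");
        pig.openIterator("B");
    }
}
{code}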

Based on this understanding I have the following review comment(s).

Index: 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java
===

Can you explain the checks in the if and the else? Specifically, 
NullableBytesWritable is a subclass of PigNullableWritable. As a result, in the 
if part, the check for both o1 and o2 not being PigNullableWritable is 
confusing as nbw1 and nbw2 are cast to NullableBytesWritable if o1 and o2 are 
not PigNullableWritable.  

{code}
+// findbugs is complaining about nulls. This check sequence will
+// prevent nulls from being dereferenced.
+if (o1 != null && o2 != null) {
+
+    // In case the objects are comparable
+    if ((o1 instanceof NullableBytesWritable && o2 instanceof NullableBytesWritable) ||
+        !(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable)) {
+
+        NullableBytesWritable nbw1 = (NullableBytesWritable)o1;
+        NullableBytesWritable nbw2 = (NullableBytesWritable)o2;
+
+        // If either are null, handle differently.
+        if (!nbw1.isNull() && !nbw2.isNull()) {
+            rc = ((DataByteArray)nbw1.getValueAsPigType()).compareTo((DataByteArray)nbw2.getValueAsPigType());
+        } else {
+            // For sorting purposes two nulls are equal.
+            if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+            else if (nbw1.isNull()) rc = -1;
+            else rc = 1;
+        }
+    } else {
+        // enter here only if both o1 and o2 are non-NullableBytesWritable PigNullableWritables
+        PigNullableWritable nbw1 = (PigNullableWritable)o1;
+        PigNullableWritable nbw2 = (PigNullableWritable)o2;
+        // If either are null, handle differently.
+        if (!nbw1.isNull() && !nbw2.isNull()) {
+            rc = nbw1.compareTo(nbw2);
+        } else {
+            // For sorting purposes two nulls are equal.
+            if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+            else if (nbw1.isNull()) rc = -1;
+            else rc = 1;
+        }
+    }
+} else {
+    if (o1 == null && o2 == null) { rc = 0; }
+    else if (o1 == null) { rc = -1; }
+    else { rc = 1; }
{code}

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser. Where as in almost all 
 documentation it is stated that value of the map can be any time.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1056) table can not be loaded after store

2009-10-27 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770743#action_12770743
 ] 

Santhosh Srinivasan commented on PIG-1056:
--

Do you have the right load statement? I don't see the 'using' clause that 
specifies the Zebra loader.

 table can not be loaded after store
 ---

 Key: PIG-1056
 URL: https://issues.apache.org/jira/browse/PIG-1056
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang

 Pig Stack Trace
 ---
 ERROR 1018: Problem determining schema during load
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. Problem determining schema during load
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem 
 determining schema during load
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
 ... 8 more
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: 
 Problem determining schema during load
 at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
 ... 10 more
 Caused by: java.io.IOException: No table specified for input
 at 
 org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238)
 at 
 org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258)
 at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
 ... 11 more
 
 ~ 
 
 script:
 register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar;
 A = load 'filter.txt' as (name:chararray, age:int);
 B = filter A by age  20;
 --dump B;
 store B into 'filter1' using 
 org.apache.hadoop.zebra.pig.TableStorer('[name];[age]');
 rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader();
 dump rec1;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class

2009-10-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768368#action_12768368
 ] 

Santhosh Srinivasan commented on PIG-1012:
--

I just looked at the first patch. It was setting generate to true in 
TestMRCompiler.java. It should be set to false in order to run the test case 
correctly.

+++ test/org/apache/pig/test/TestMRCompiler.java

-private boolean generate = false;
+private boolean generate = true;
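
For readers unfamiliar with the pattern, a sketch of how such a generate 
switch typically works in golden-file tests (illustrative only, not 
TestMRCompiler's actual code). With generate set to true the test overwrites 
the expected output instead of asserting against it, so nothing is verified:

{code}
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GoldenFileCheck {
    // true only while regenerating the expected ("golden") files
    private boolean generate = false;

    void check(String actualPlan, String goldenFile) throws IOException {
        if (generate) {
            // Regeneration mode: record the current output and assert
            // nothing, so every test passes vacuously.
            Files.write(Paths.get(goldenFile),
                        actualPlan.getBytes(StandardCharsets.UTF_8));
            return;
        }
        String expected = new String(Files.readAllBytes(Paths.get(goldenFile)),
                                     StandardCharsets.UTF_8);
        assertEquals(expected, actualPlan);
    }
}
{code}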

 FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in 
 serializable class
 ---

 Key: PIG-1012
 URL: https://issues.apache.org/jira/browse/PIG-1012
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Attachments: PIG-1012-2.patch, PIG-1012.patch


 Se: Class org.apache.pig.backend.executionengine.PigSlice defines 
 non-transient non-serializable instance field is
 Se: Class org.apache.pig.backend.executionengine.PigSlice defines 
 non-transient non-serializable instance field loader
 Se: java.util.zip.GZIPInputStream stored into non-transient field 
 PigSlice.is
 Se: org.apache.pig.backend.datastorage.SeekableInputStream stored into 
 non-transient field PigSlice.is
 Se: org.apache.tools.bzip2r.CBZip2InputStream stored into non-transient 
 field PigSlice.is
 Se: org.apache.pig.builtin.PigStorage stored into non-transient field 
 PigSlice.loader
 Se: org.apache.pig.backend.hadoop.DoubleWritable$Comparator implements 
 Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator
  implements Comparator but not Serializable
 Se: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator
  implements Comparator but not Serializable
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper 
 defines non-transient non-serializable instance field nig
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
  defines non-transient non-serializable instance field bagIterator
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc
  defines non-transient non-serializable instance field log
 Se: Class 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc
  defines non-transient non-serializable instance field log
 

[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-14 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765779#action_12765779
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Another option is to change the implementation of COUNT itself to reflect the 
proposed semantics. If the underlying UDF is swapped instead, the user should 
be notified via an informational message; otherwise, a user who checks the 
explain output will notice COUNT_STAR and be confused.

 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765194#action_12765194
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Essentially, Pradeep is pointing out an issue in the implementation of COUNT. 
If that is the case, then COUNT has to be fixed, or the semantics of COUNT have 
to be documented to explain the current implementation. I would vote for fixing 
COUNT to have the correct semantics.

 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765357#action_12765357
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

After a discussion with Pradeep, who also graciously ran SQL queries to verify 
the semantics, we have the following proposal:

The semantics of COUNT could be defined as:

1. COUNT( A ) is equivalent to COUNT( A.* ) and the result of COUNT( A ) will 
count null tuples in the relation
2. COUNT( A.$0) will not count null tuples in the relation

3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation 
containing tuples with two columns and will exhibit the behavior of statement 1

OR 

3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation 
containing tuples with two columns and will exhibit the behavior of statement 2

Point 3 needs more discussion.

Comments/thoughts/suggestions/anything else welcome.
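
A hypothetical script illustrating statements 1 and 2 under the proposed 
semantics (data, file name, and driver class invented):

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CountSemantics {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // 'nums.txt' has three rows; the second row's only field is null:
        //   1
        //   (empty line)
        //   3
        pig.registerQuery("A = LOAD 'nums.txt' AS (x:int);");
        pig.registerQuery("G = GROUP A ALL;");
        // Statement 1: COUNT(A), i.e. COUNT(A.*), counts every tuple -> 3
        pig.registerQuery("C1 = FOREACH G GENERATE COUNT(A);");
        // Statement 2: COUNT(A.$0) skips tuples whose field is null -> 2
        pig.registerQuery("C2 = FOREACH G GENERATE COUNT(A.$0);");
        pig.openIterator("C1");
        pig.openIterator("C2");
    }
}
{code}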


 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764771#action_12764771
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

Is Pig trying to guess the user's intent? What if the user wanted to count 
without nulls?

 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764792#action_12764792
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

If the user wants to count all records, nulls included, then the user should 
explicitly use COUNT_STAR. One of the philosophies of Pig has been to allow 
users to do exactly what they want. Here, we are violating that philosophy, 
and secondly, we are second-guessing the user's intention.

 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-12 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764846#action_12764846
 ] 

Santhosh Srinivasan commented on PIG-984:
-

Very quick comment: the parser has a log.info that should be converted to a 
log.debug.

Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
===


+["USING" ("collected" {
+    log.info("Using mapside");


 PERFORMANCE: Implement a map-side group operator to speed up processing of 
 ordered data 
 

 Key: PIG-984
 URL: https://issues.apache.org/jira/browse/PIG-984
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-984.patch, PIG-984_1.patch


 The general group by operation in Pig needs both mappers and reducers (the 
 aggregation is done in the reducers). This incurs disk writes/reads between 
 mappers and reducers.
 However, in the cases where the input data has the following properties:
 1. The records with the same key are grouped together (such as when the data 
 is sorted by the keys).
 2. The records with the same key are in the same mapper input.
 the group by operation can be performed in the mappers only, thus removing 
 the overhead of disk writes/reads.
 Alan proposed adding a hint to the group by clause, like this one:
 {code}
 A = load 'input' using SomeLoader(...);
 B = group A by $0 using mapside;
 C = foreach B generate ...
 {code}
 The proposed addition of "using mapside" to group will be a map-side group 
 operator that collects all records for a given key into a buffer. When it 
 sees a key change it will emit the key and bag for the records it had 
 buffered. It will assume that all records for a given key are collected 
 together and thus there is no need to buffer across keys. 
 It is expected that SomeLoader will be implemented by data systems such as 
 Zebra to ensure the data emitted by the loader satisfies the above 
 properties (1) and (2).
 It will be the responsibility of the user (or the loader) to guarantee 
 properties (1) and (2) before invoking the mapside hint for the group by 
 clause. The Pig runtime can't check for errors in the input data.
 For group by clauses with the mapside hint, Pig Latin will only support 
 group by columns (including *), not group by expressions nor group all.
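
The key-change buffering described above can be sketched as follows 
(simplified record and key types; purely illustrative, not the actual 
operator):

{code}
import java.util.ArrayList;
import java.util.List;

public class MapSideGroupSketch {
    // Emits (key, bufferedRecords) groups from input already grouped by key.
    static void groupOnTheFly(Iterable<Object[]> mapperInput) {
        Object currentKey = null;
        List<Object[]> buffer = new ArrayList<Object[]>();
        for (Object[] record : mapperInput) {
            Object key = record[0];            // assume the key is field 0
            if (currentKey != null && !key.equals(currentKey)) {
                emit(currentKey, buffer);      // key changed: group complete
                buffer = new ArrayList<Object[]>();
            }
            currentKey = key;
            buffer.add(record);
        }
        if (currentKey != null) {
            emit(currentKey, buffer);          // flush the final group
        }
    }

    static void emit(Object key, List<Object[]> group) {
        System.out.println(key + " -> " + group.size() + " record(s)");
    }
}
{code}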
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

2009-10-10 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764368#action_12764368
 ] 

Santhosh Srinivasan commented on PIG-1014:
--

When the semantics of COUNT were changed, I thought this was communicated to 
the users. What is the intention of this JIRA?

 Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
 records are counted without considering nullness of the fields in the records
 

 Key: PIG-1014
 URL: https://issues.apache.org/jira/browse/PIG-1014
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-995) Limit Optimizer throw exception ERROR 2156: Error while fixing projections

2009-10-09 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764119#action_12764119
 ] 

Santhosh Srinivasan commented on PIG-995:
-

Review comments:

The initialization code is fine. However, the try-catch block is shared between 
the rebuildSchemas() and rebuildProjectionMaps() method invocations. This could 
lead to a misleading error message: specifically, if rebuildSchemas() throws an 
exception, the error message will indicate that rebuilding the projection 
maps failed.
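
A sketch of the suggested restructuring (method names from the comment; the 
exception type and messages are illustrative):

{code}
public class RewireSketch {
    void rewire() throws Exception {
        // Wrap each rebuild step in its own try-catch so a failure in one
        // step cannot surface with the other step's error message.
        try {
            rebuildSchemas();
        } catch (Exception e) {
            throw new Exception("Error while rebuilding schemas", e);
        }
        try {
            rebuildProjectionMaps();
        } catch (Exception e) {
            throw new Exception("Error while rebuilding projection maps", e);
        }
    }

    void rebuildSchemas() { /* ... */ }
    void rebuildProjectionMaps() { /* ... */ }
}
{code}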

 Limit Optimizer throw exception ERROR 2156: Error while fixing projections
 

 Key: PIG-995
 URL: https://issues.apache.org/jira/browse/PIG-995
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-995-1.patch, PIG-995-2.patch, PIG-995-3.patch


 The following script fail:
 A = load '1.txt' AS (a0, a1, a2);
 B = order A by a1;
 C = limit B 10;
 D = foreach C generate $0;
 dump D;
 Error log:
 Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2156: Error while 
 fixing projections. Projection map of node to be replaced is null.
 at 
 org.apache.pig.impl.logicalLayer.ProjectFixerUpper.visit(ProjectFixerUpper.java:138)
 at 
 org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:408)
 at org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:58)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.impl.logicalLayer.LOForEach.rewire(LOForEach.java:761)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-10-01 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761270#action_12761270
 ] 

Santhosh Srinivasan commented on PIG-984:
-

bq. But this is in line with what we've done for joins, philosophically, 
semantically, and syntactically.

Not exactly; with joins we are exposing different kinds of joins. Here we are 
exposing underlying aspects of the framework ("mapside"). If there is a 
parallel framework that does not do map-reduce, then having "mapside" in the 
language is philosophically and semantically incorrect.

 PERFORMANCE: Implement a map-side group operator to speed up processing of 
 ordered data 
 

 Key: PIG-984
 URL: https://issues.apache.org/jira/browse/PIG-984
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding

 The general group by operation in Pig needs both mappers and reducers (the 
 aggregation is done in the reducers). This incurs disk writes/reads between 
 mappers and reducers.
 However, in the cases where the input data has the following properties:
 1. The records with the same key are grouped together (such as when the data 
 is sorted by the keys).
 2. The records with the same key are in the same mapper input.
 the group by operation can be performed in the mappers only, thus removing 
 the overhead of disk writes/reads.
 Alan proposed adding a hint to the group by clause, like this one:
 {code}
 A = load 'input' using SomeLoader(...);
 B = group A by $0 using mapside;
 C = foreach B generate ...
 {code}
 The proposed addition of "using mapside" to group will be a map-side group 
 operator that collects all records for a given key into a buffer. When it 
 sees a key change it will emit the key and bag for the records it had 
 buffered. It will assume that all records for a given key are collected 
 together and thus there is no need to buffer across keys. 
 It is expected that SomeLoader will be implemented by data systems such as 
 Zebra to ensure the data emitted by the loader satisfies the above 
 properties (1) and (2).
 It will be the responsibility of the user (or the loader) to guarantee 
 properties (1) and (2) before invoking the mapside hint for the group by 
 clause. The Pig runtime can't check for errors in the input data.
 For group by clauses with the mapside hint, Pig Latin will only support 
 group by columns (including *), not group by expressions nor group all.
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-09-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761028#action_12761028
 ] 

Santhosh Srinivasan commented on PIG-984:
-

A couple of things:

1. I am concerned about extending the language to support features that can 
be handled internally. The scope of the language has not been defined, yet the 
language continues to evolve.

2. I agree with Thejas' comment about allowing expressions that do not alter 
the property. Pig will not be able to check that, but that is no different 
from not being able to check whether the data is sorted.

 PERFORMANCE: Implement a map-side group operator to speed up processing of 
 ordered data 
 

 Key: PIG-984
 URL: https://issues.apache.org/jira/browse/PIG-984
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding

 The general group by operation in Pig needs both mappers and reducers (the 
 aggregation is done in the reducers). This incurs disk writes/reads between 
 mappers and reducers.
 However, in the cases where the input data has the following properties:
 1. The records with the same key are grouped together (such as when the data 
 is sorted by the keys).
 2. The records with the same key are in the same mapper input.
 the group by operation can be performed in the mappers only, thus removing 
 the overhead of disk writes/reads.
 Alan proposed adding a hint to the group by clause, like this one:
 {code}
 A = load 'input' using SomeLoader(...);
 B = group A by $0 using mapside;
 C = foreach B generate ...
 {code}
 The proposed addition of "using mapside" to group will be a map-side group 
 operator that collects all records for a given key into a buffer. When it 
 sees a key change it will emit the key and bag for the records it had 
 buffered. It will assume that all records for a given key are collected 
 together and thus there is no need to buffer across keys. 
 It is expected that SomeLoader will be implemented by data systems such as 
 Zebra to ensure the data emitted by the loader satisfies the above 
 properties (1) and (2).
 It will be the responsibility of the user (or the loader) to guarantee 
 properties (1) and (2) before invoking the mapside hint for the group by 
 clause. The Pig runtime can't check for errors in the input data.
 For group by clauses with the mapside hint, Pig Latin will only support 
 group by columns (including *), not group by expressions nor group all.
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

2009-09-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761073#action_12761073
 ] 

Santhosh Srinivasan commented on PIG-984:
-

bq. This is something that can be inferred by looking at the schema and 
distribution key. I understand wanting a manual handle to turn on the behavior 
while developing, but the production version of this can be done automatically 
(if distributed by and sorted on a subset of the group keys, apply the 
map-side group rule in the optimizer).

+1. That's what I meant when I said:

bq. 1. I am concerned about extending the language to support features that 
can be handled internally. The scope of the language has not been defined, yet 
the language continues to evolve.

 PERFORMANCE: Implement a map-side group operator to speed up processing of 
 ordered data 
 

 Key: PIG-984
 URL: https://issues.apache.org/jira/browse/PIG-984
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding

 The general group by operation in Pig needs both mappers and reducers (the 
 aggregation is done in the reducers). This incurs disk writes/reads between 
 mappers and reducers.
 However, in the cases where the input data has the following properties:
 1. The records with the same key are grouped together (such as when the data 
 is sorted by the keys).
 2. The records with the same key are in the same mapper input.
 the group by operation can be performed in the mappers only, thus removing 
 the overhead of disk writes/reads.
 Alan proposed adding a hint to the group by clause, like this one:
 {code}
 A = load 'input' using SomeLoader(...);
 B = group A by $0 using mapside;
 C = foreach B generate ...
 {code}
 The proposed addition of "using mapside" to group will be a map-side group 
 operator that collects all records for a given key into a buffer. When it 
 sees a key change it will emit the key and bag for the records it had 
 buffered. It will assume that all records for a given key are collected 
 together and thus there is no need to buffer across keys. 
 It is expected that SomeLoader will be implemented by data systems such as 
 Zebra to ensure the data emitted by the loader satisfies the above 
 properties (1) and (2).
 It will be the responsibility of the user (or the loader) to guarantee 
 properties (1) and (2) before invoking the mapside hint for the group by 
 clause. The Pig runtime can't check for errors in the input data.
 For group by clauses with the mapside hint, Pig Latin will only support 
 group by columns (including *), not group by expressions nor group all.
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754349#action_12754349
 ] 

Santhosh Srinivasan commented on PIG-955:
-

Hi Ying,

How are Fragment Replicate Join and Skewed Join related, as you mention in your 
bug description? Also, skewed join has been part of trunk for more than a month 
now, yet your bug description states that Pig needs skewed join.

Thanks,
Santhosh

 Skewed join generates  incorrect results 
 -

 Key: PIG-955
 URL: https://issues.apache.org/jira/browse/PIG-955
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He
 Attachments: PIG-955.patch


 Fragmented replicated join has a few limitations:
  - One of the tables needs to be loaded into memory
  - Join is limited to two tables
 Skewed join partitions the table and joins the records in the reduce phase. 
 It computes a histogram of the key space to account for skewing in the input 
 records. Further, it adjusts the number of reducers depending on the key 
 distribution.
 We need to implement the skewed join in Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-922) Logical optimizer: push up project

2009-08-25 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747560#action_12747560
 ] 

Santhosh Srinivasan commented on PIG-922:
-

For relational operators that require multiple inputs, the list will correspond 
to each of the operator's inputs. If you look at getRequiredFields, the list is 
populated on a per-input basis. In the case of getRequiredInputs, I see that the 
use of the list is not consistent for LOJoin, LOUnion, LOCogroup and LOCross.

 Logical optimizer: push up project
 --

 Key: PIG-922
 URL: https://issues.apache.org/jira/browse/PIG-922
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, 
 PIG-922-p1_2.patch


 This is a continuation work of 
 [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add 
 another rule to the logical optimizer: Push up project, ie, prune columns as 
 early as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-561) Need to generate empty tuples and bags as a part of Pig Syntax

2009-08-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan resolved PIG-561.
-

Resolution: Duplicate

Duplicate of PIG-773

 Need to generate empty tuples and bags as a part of Pig Syntax
 --

 Key: PIG-561
 URL: https://issues.apache.org/jira/browse/PIG-561
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.2.0
Reporter: Viraj Bhat

 There is a need to sometimes generate empty tuples and bags as a part of the 
 Pig syntax rather than using UDFs:
 {code}
 a = load 'mydata.txt' using PigStorage();
 b = foreach a generate () as emptytuple;
 c = foreach a generate {} as emptybag;
 dump c;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-912) Rename/Add 'string' as a type in place of chararray - and deprecate (and probably eventually remove) the use of 'chararray'

2009-08-06 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740357#action_12740357
 ] 

Santhosh Srinivasan commented on PIG-912:
-

+1

 Rename/Add 'string' as a type in place of chararray - and deprecate (and 
 probably eventually remove) the use of 'chararray'
 ---

 Key: PIG-912
 URL: https://issues.apache.org/jira/browse/PIG-912
 Project: Pig
  Issue Type: Bug
Reporter: Mridul Muralidharan

 The type 'chararray' in Pig does not refer to an array of characters (char 
 []) but rather to java.lang.String.
 This is inconsistent and confusing naming; additionally, it will be an 
 interoperability issue with other systems that support schemas (Zebra among 
 others).
 It would be good to have consistent naming across projects, while also 
 having appropriate names for the various types.
 Since use of 'chararray' is already widely deployed, it would be good to:
 a) Add a type 'string' (or equivalent) which is an alias for 'chararray'.
 Additionally, it is possible to envision these too (if deemed necessary - not 
 a main requirement):
 b) Modify documentation and example scripts to use this new type.
 c) Emit warnings about chararray being deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements

2009-08-04 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739147#action_12739147
 ] 

Santhosh Srinivasan commented on PIG-908:
-

+1

This approach has been discussed but not documented.

 Need a way to correlate MR jobs with Pig statements
 ---

 Key: PIG-908
 URL: https://issues.apache.org/jira/browse/PIG-908
 Project: Pig
  Issue Type: Wish
Reporter: Dmitriy V. Ryaboy

 Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
 recent introduction of multi-store capabilities.
 For example, the first script in the Pig tutorial produces 5 MR jobs.
 There is currently very little support for debugging the resulting jobs; if 
 one of the MR jobs fails, it is hard to figure out which part of the script 
 is responsible for it. Explain plans help, but even with the explain plan, a 
 fair amount of effort (and sometimes experimentation) is required to 
 correlate the failing MR job with the corresponding Pig Latin statements.
 This ticket is created to discuss approaches to alleviating this problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch


 I propose the following changes to the Pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
 Load->Filter->Group and Load->Group) you have to write two rules.  But it 
 has the upside that the resulting rules know exactly what they are getting.  
 The original intent of this was to reduce the number of rules that needed to 
 be written.  But the resulting rules have to do a lot of work to understand 
 the operators they are working with.  With exact matches only, each rule 
 will know exactly the operators it is working on and can apply the logic of 
 shifting the operators around.  All four of the existing rules set all 
 entries of required to true, so removing this will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple times as necessary.  This allows us to write simple rules, mostly 
 swaps between neighboring operators, without worrying that we get the plan 
 right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial to the final plan, without needing to understand the big picture of 
 the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-07-31 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: Optimizer_Phase5.patch, OptimizerPhase1.patch, 
 OptimizerPhase1_part2.patch, OptimizerPhase2.patch, 
 OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, 
 OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, 
 OptimizerPhase4_part2.patch


 I propose the following changes to the Pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
 Load->Filter->Group and Load->Group) you have to write two rules.  But it 
 has the upside that the resulting rules know exactly what they are getting.  
 The original intent of this was to reduce the number of rules that needed to 
 be written.  But the resulting rules have to do a lot of work to understand 
 the operators they are working with.  With exact matches only, each rule 
 will know exactly the operators it is working on and can apply the logic of 
 shifting the operators around.  All four of the existing rules set all 
 entries of required to true, so removing this will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple times as necessary.  This allows us to write simple rules, mostly 
 swaps between neighboring operators, without worrying that we get the plan 
 right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial to the final plan, without needing to understand the big picture of 
 the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed 

[jira] Updated: (PIG-880) Order by is borken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Open  (was: Patch Available)

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (PIG-880) Order by is borken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-880 started by Santhosh Srinivasan.

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is borken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: In Progress)

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-898) TextDataParser does not handle delimiters from one complex type in another

2009-07-30 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737319#action_12737319
 ] 

Santhosh Srinivasan commented on PIG-898:
-

In addition, empty bags, tuples, constants, and nulls are not handled.

 TextDataParser does not handle delimiters from one complex type in another
 --

 Key: PIG-898
 URL: https://issues.apache.org/jira/browse/PIG-898
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0


 Currently, TextDataParser does not handle delimiters of one complex type in 
 another. An example of such a case is key1(#value1}, which will not be 
 parsed correctly. The production for strings matches any sequence of 
 characters that does not contain any delimiters for the complex types.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
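
An illustration of the ambiguity (the sample record is hypothetical): when a 
chararray value itself contains one of the complex-type delimiters, a parser 
that treats every delimiter as structural stops too early.

{code}
public class DelimiterAmbiguity {
    public static void main(String[] args) {
        // Intended: a map with key "color" and value "dark}blue".
        // A naive parser reads the '}' inside the value as the end of
        // the map, truncating the value to "dark".
        String mapText = "[color#dark}blue]";
        System.out.println(mapText);
    }
}
{code}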



[jira] Updated: (PIG-880) Order by is borken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Attachment: (was: PIG-880.patch)

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880_1.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is borken with complex fields

2009-07-30 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: In Progress)

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880_1.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736990#action_12736990
 ] 

Santhosh Srinivasan commented on PIG-889:
-

PigHadoopLogger implements the PigLogger interface. As part of the 
implementation it uses the Hadoop reporter for aggregating the warning messages.

 Pig can not access reporter of PigHadoopLog in Load Func
 

 Key: PIG-889
 URL: https://issues.apache.org/jira/browse/PIG-889
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_889_Patch.txt


 I'd like to increment Counter in my own LoadFunc, but it will throw 
 NullPointerException. It seems that the reporter is not initialized.  
 I looked into this problem and find that it need to call 
 PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
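
For context, warning aggregation means counting repeated warnings and 
emitting a single summary instead of one log line per record. A minimal, 
self-contained illustration of the idea follows; this is not the actual 
PigHadoopLogger implementation, just the underlying pattern.

{code}
import java.util.HashMap;
import java.util.Map;

public class AggregatingLogger {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    public void warn(String kind) {
        // Count the warning instead of logging it immediately.
        Long c = counts.get(kind);
        counts.put(kind, c == null ? 1L : c + 1L);
    }

    // Called once at the end instead of logging every occurrence.
    public void flush() {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            System.err.println(e.getKey() + " occurred " + e.getValue() + " time(s)");
        }
    }
}
{code}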



[jira] Created: (PIG-897) Pig should support counters

2009-07-29 Thread Santhosh Srinivasan (JIRA)
Pig should support counters
---

 Key: PIG-897
 URL: https://issues.apache.org/jira/browse/PIG-897
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


Pig should support the use of counters, exposed either via the script or via 
Java APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
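
For reference, this is what incrementing a counter looks like with the Hadoop 
mapred API of that era; the enum and method are illustrative user code, and 
how Pig would surface counters to scripts or UDFs is exactly what this issue 
leaves open.

{code}
import org.apache.hadoop.mapred.Reporter;

public class CounterExample {
    // User-defined counter group; the names are illustrative.
    public enum MyCounter { RECORDS_SEEN, RECORDS_DROPPED }

    public static void countRecord(Reporter reporter, boolean dropped) {
        reporter.incrCounter(dropped ? MyCounter.RECORDS_DROPPED
                                     : MyCounter.RECORDS_SEEN, 1);
    }
}
{code}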



[jira] Assigned: (PIG-880) Order by is borken with complex fields

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan reassigned PIG-880:
---

Assignee: Santhosh Srinivasan

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-880) Order by is borken with complex fields

2009-07-29 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-880:


Status: Patch Available  (was: Open)

 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, 
 PIG-880.patch


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-660) Integration with Hadoop 0.20

2009-07-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736283#action_12736283
 ] 

Santhosh Srinivasan commented on PIG-660:
-

The build.xml in the patch(es) has the reference to hadoop20.jar. The missing 
piece is a hadoop20.jar that Pig can use to build its sources; Pig cannot use 
the hadoop20.jar coming from the Hadoop release.

 Integration with Hadoop 0.20
 

 Key: PIG-660
 URL: https://issues.apache.org/jira/browse/PIG-660
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, 
 PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch


 With Hadoop 0.20, it will be possible to query the status of each map and 
 reduce in a map reduce job. This will allow better error reporting. Some of 
 the other items that could be on Hadoop's feature requests/bugs are 
 documented here for tracking.
 1. Hadoop should return objects instead of strings when exceptions are thrown
 2. The JobControl should handle all exceptions and report them appropriately. 
 For example, when the JobControl fails to launch jobs, it should handle 
 exceptions appropriately and should support APIs that query this state, i.e., 
 failure to launch jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-882) log level not propogated to loggers

2009-07-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736359#action_12736359
 ] 

Santhosh Srinivasan commented on PIG-882:
-

Minor comment:

Index: src/org/apache/pig/Main.java
===

Instead of printing the warning message to stdout, it should be printed to 
stderr.

{code}
+catch (IOException e)
+{
+    System.out.println("Warn: Cannot open log4j properties file, use default");
+}
{code}
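
A self-contained sketch of the suggested fix; the surrounding method is 
illustrative, not Main.java itself:

{code}
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Log4jConfig {
    public static Properties load(String path) {
        Properties props = new Properties();
        FileInputStream in = null;
        try {
            in = new FileInputStream(path);
            props.load(in);
        } catch (IOException e) {
            // Print to stderr so the warning does not pollute stdout.
            System.err.println("Warn: Cannot open log4j properties file, using default");
        } finally {
            if (in != null) {
                try { in.close(); } catch (IOException ignored) { }
            }
        }
        return props;
    }
}
{code}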


The rest of the patch looks fine.

 log level not propogated to loggers 
 

 Key: PIG-882
 URL: https://issues.apache.org/jira/browse/PIG-882
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
 Attachments: PIG-882-1.patch, PIG-882-2.patch


 Pig accepts log level as a parameter. But the log level it captures is not 
 set appropriately, so that loggers in different classes log at the specified 
 level.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735129#action_12735129
 ] 

Santhosh Srinivasan commented on PIG-889:
-

The issue here is the lack of support for counters within Pig.

The intention of the warn method in the PigLogger interface was to allow 
sources within Pig and UDFs to aggregate warnings. Your use of the reporter 
within the logger is not supported. An implementation detail prevents the 
correct use of this interface for load functions: the Hadoop reporter object 
is provided in the getRecordReader, map and reduce calls. For load functions, 
Pig provides an interface and for UDFs, an abstract class. As a result, the 
logger instance cannot be initialized in the loaders until we decide to add a 
method to support it.

Will having the code from PigMapBase.map() in 
PigInputFormat.getRecordReader() work for you?

{code}
PigHadoopLogger pigHadoopLogger = PigHadoopLogger.getInstance();
pigHadoopLogger.setAggregate(aggregateWarning);
pigHadoopLogger.setReporter(reporter);
PhysicalOperator.setPigLogger(pigHadoopLogger);
{code}

Note that this is a workaround for your situation. I would highly recommend 
that you move to the use of counters when they are supported.

 Pig can not access reporter of PigHadoopLog in Load Func
 

 Key: PIG-889
 URL: https://issues.apache.org/jira/browse/PIG-889
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_889_Patch.txt


 I'd like to increment Counter in my own LoadFunc, but it will throw 
 NullPointerException. It seems that the reporter is not initialized.  
 I looked into this problem and find that it need to call 
 PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standar

2009-07-23 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734806#action_12734806
 ] 

Santhosh Srinivasan commented on PIG-892:
-

1. Index: src/org/apache/pig/builtin/FloatAvg.java
===

The size of 't' is not checked before calling t.get(0) in the count method:


{code}
+if (t != null && t.get(0) != null)
+    cnt++;
+}
{code}

2. Index: src/org/apache/pig/builtin/IntAvg.java
===

Same comment as FloatAvg.java

3. Index: src/org/apache/pig/builtin/DoubleAvg.java
===

Same comment as FloatAvg.java

4. Index: src/org/apache/pig/builtin/AVG.java
===

Same comment as FloatAvg.java

5. Index: src/org/apache/pig/builtin/LongAvg.java
===

Same comment as FloatAvg.java


6. Index: src/org/apache/pig/builtin/COUNT_STAR.java
===

I am not sure about the naming convention here. None of the built-in functions 
have a special character in the class name. COUNTSTAR would be better than 
COUNT_STAR.


 Make COUNT and AVG deal with nulls accordingly with SQL standar
 ---

 Key: PIG-892
 URL: https://issues.apache.org/jira/browse/PIG-892
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.4.0

 Attachments: PIG-892.patch, PIG-892_v2.patch


 both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match 
 COUNT(*) in SQL

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
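
A self-contained sketch of the guarded count the review asks for, checking 
the tuple's size as well as the value; the surrounding class is illustrative, 
not the actual AVG implementation:

{code}
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class NullAwareCount {
    public static long count(DataBag bag) throws ExecException {
        long cnt = 0;
        for (Tuple t : bag) {
            // Guard both the size and the value before dereferencing,
            // so empty tuples and nulls are skipped rather than counted.
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }
}
{code}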



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-23 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734810#action_12734810
 ] 

Santhosh Srinivasan commented on PIG-773:
-

+1 for the changes.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
 pig-773_v4.patch, pig-773_v5.patch


 We should be able to create an empty bag constant using {}, an empty tuple 
 constant using (), and an empty map constant using [] within a Pig script.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
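
At runtime the three constants would evaluate to the usual empty containers; 
a sketch of the equivalents using Pig's data APIs (the wrapper class is 
illustrative):

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class EmptyConstants {
    public static void main(String[] args) {
        DataBag emptyBag = BagFactory.getInstance().newDefaultBag();   // {}
        Tuple emptyTuple = TupleFactory.getInstance().newTuple();      // ()
        Map<String, Object> emptyMap = new HashMap<String, Object>();  // []
        System.out.println(emptyBag + " " + emptyTuple + " " + emptyMap);
    }
}
{code}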



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-773:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch has been committed. Thanks for the fix, Ashutosh.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
 pig-773_v4.patch, pig-773_v5.patch


 We should be able to create an empty bag constant using {}, an empty tuple 
 constant using (), and an empty map constant using [] within a Pig script.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standar

2009-07-22 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734417#action_12734417
 ] 

Santhosh Srinivasan commented on PIG-892:
-

I am reviewing the patch.

 Make COUNT and AVG deal with nulls accordingly with SQL standar
 ---

 Key: PIG-892
 URL: https://issues.apache.org/jira/browse/PIG-892
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.4.0

 Attachments: PIG-892.patch, PIG-892_v2.patch


 both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match 
 COUNT(*) in SQL

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-21 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch has been committed.

 Pig should not fail when error logs cannot be created
 -

 Key: PIG-695
 URL: https://issues.apache.org/jira/browse/PIG-695
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Attachments: PIG-695.patch


 Currently, PIG validates the log file location and fails/exits when the log 
 file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-21 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Fix Version/s: 0.4.0

 Pig should not fail when error logs cannot be created
 -

 Key: PIG-695
 URL: https://issues.apache.org/jira/browse/PIG-695
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-695.patch


 Currently, PIG validates the log file location and fails/exits when the log 
 file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-07-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733773#action_12733773
 ] 

Santhosh Srinivasan commented on PIG-893:
-

What are the semantics of casting chararray (string) to numeric types? 

Pig does not support conversion of any non-bytearray type to bytearray. The 
proposal in the jira description is minimalistic. Does it match that of SQL? 

Without clear articulation of what these conversions mean, we cannot/should 
not support chararray to numeric type conversions. PiggyBank already supports 
UDFs that convert strings to int, double, etc.

It's a nice-to-have as part of the language, but it's better positioned as a 
UDF. If clear semantics are laid out, then making it part of the language 
will be a matter of consensus.

 support cast of chararray to other simple types
 ---

 Key: PIG-893
 URL: https://issues.apache.org/jira/browse/PIG-893
 Project: Pig
  Issue Type: New Feature
Reporter: Thejas M Nair

 Pig should support casting of chararray to 
 integer,long,float,double,bytearray. If the conversion fails for reasons such 
 as overflow, cast should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
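
The proposed semantics (return null and warn on failure) would look roughly 
like this as a helper; the class and method names are illustrative, and this 
sketch does not settle the open questions raised in the comment above:

{code}
public class CharArrayCast {
    public static Integer toInt(String s) {
        if (s == null) {
            return null;
        }
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            // A real implementation would also log an aggregable warning.
            return null;
        }
    }
}
{code}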



[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func

2009-07-21 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733825#action_12733825
 ] 

Santhosh Srinivasan commented on PIG-889:
-

Comments:

The reporter inside the logger is set up correctly in PigInputFormat for 
Hadoop. However, the usage of the logger to retrieve the reporter and then 
increment counters is flawed for the following reasons:

1. In the test case, the new loader uses PigHadoopLogger directly. When the 
loader is used in local mode, the notion of Hadoop disappears and the reference 
to PigHadoopLogger is not usable (i.e., will result in a NullPointerException).

{code}
+   @Override
+   public Tuple getNext() throws IOException {
+   PigHadoopLogger.getInstance().getReporter().incrCounter(
+   MyCounter.TupleCounter, 1);
+   return super.getNext();
+   }
{code}

2. The loggers were meant for warning aggregation. Here, there is a case being 
made to expand the capabilities to allow user-defined counter aggregation. If 
that's the case, then new methods have to be added to the PigLogger interface.

 Pig can not access reporter of PigHadoopLog in Load Func
 

 Key: PIG-889
 URL: https://issues.apache.org/jira/browse/PIG-889
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_889_Patch.txt


 I'd like to increment Counter in my own LoadFunc, but it will throw 
 NullPointerException. It seems that the reporter is not initialized.  
 I looked into this problem and find that it need to call 
 PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Attachment: PIG-695.patch

Attached patch ensures that Pig does not error out when the error log file is 
not writable. 

 Pig should not fail when error logs cannot be created
 -

 Key: PIG-695
 URL: https://issues.apache.org/jira/browse/PIG-695
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Attachments: PIG-695.patch


 Currently, PIG validates the log file location and fails/exits when the log 
 file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work stopped: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-695 stopped by Santhosh Srinivasan.

 Pig should not fail when error logs cannot be created
 -

 Key: PIG-695
 URL: https://issues.apache.org/jira/browse/PIG-695
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Attachments: PIG-695.patch


 Currently, PIG validates the log file location and fails/exits when the log 
 file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created

2009-07-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-695:


Status: Patch Available  (was: Open)

 Pig should not fail when error logs cannot be created
 -

 Key: PIG-695
 URL: https://issues.apache.org/jira/browse/PIG-695
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Attachments: PIG-695.patch


 Currently, PIG validates the log file location and fails/exits when the log 
 file cannot be created. Instead, it should print a warning and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Status: In Progress  (was: Patch Available)

 All backend error messages must be logged to preserve the original error 
 messages
 -

 Key: PIG-728
 URL: https://issues.apache.org/jira/browse/PIG-728
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0


 The current error handling framework logs backend error messages only when 
 Pig is not able to parse the error message. Instead, Pig should log the 
 backend error message irrespective of Pig's ability to parse backend error 
 messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
 is not consistent and should avoid the use of class_name + "(" + 
 string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
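
A sketch of the more consistent instantiation the side note alludes to, 
passing the class name and constructor arguments separately instead of 
concatenating them into one string; the target class com.example.MyFunc is 
hypothetical, and the exact call shown is illustrative:

{code}
import org.apache.pig.FuncSpec;
import org.apache.pig.impl.PigContext;

public class FuncInstantiation {
    public static Object create() {
        // Instead of building "com.example.MyFunc(arg1,arg2)" by string
        // concatenation, hand over the pieces separately.
        FuncSpec spec = new FuncSpec("com.example.MyFunc",
                                     new String[] { "arg1", "arg2" });
        return PigContext.instantiateFuncFromSpec(spec);
    }
}
{code}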



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Attachment: PIG-728_1.patch

Attaching a new patch that fixes the findbugs issue.

 All backend error messages must be logged to preserve the original error 
 messages
 -

 Key: PIG-728
 URL: https://issues.apache.org/jira/browse/PIG-728
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-728_1.patch


 The current error handling framework logs backend error messages only when 
 Pig is not able to parse the error message. Instead, Pig should log the 
 backend error message irrespective of Pig's ability to parse backend error 
 messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
 is not consistent and should avoid the use of class_name + "(" + 
 string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-15 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Attachment: PIG-728.patch

The attached patch logs all backend error messages before Pig tries to parse 
the messages. In addition, the log format has been cleaned up to be more 
user-friendly. No new test cases have been added.

 All backend error messages must be logged to preserve the original error 
 messages
 -

 Key: PIG-728
 URL: https://issues.apache.org/jira/browse/PIG-728
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-728.patch


 The current error handling framework logs backend error messages only when 
 Pig is not able to parse the error message. Instead, Pig should log the 
 backend error message irrespective of Pig's ability to parse backend error 
 messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
 is not consistent and should avoid the use of class_name + "(" + 
 string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-07-15 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-728:


Fix Version/s: (was: 0.2.1)
   0.4.0
Affects Version/s: (was: 0.2.1)
   0.3.0
   Status: Patch Available  (was: Open)

 All backend error messages must be logged to preserve the original error 
 messages
 -

 Key: PIG-728
 URL: https://issues.apache.org/jira/browse/PIG-728
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-728.patch


 The current error handling framework logs backend error messages only when 
 Pig is not able to parse the error message. Instead, Pig should log the 
 backend error message irrespective of Pig's ability to parse backend error 
 messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
 is not consistent and should avoid the use of class_name + "(" + 
 string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730525#action_12730525
 ] 

Santhosh Srinivasan commented on PIG-877:
-

It's at optimization time.

 Push up filter does not account for added columns in foreach
 

 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.3.1

 Attachments: PIG-877.patch


 If a filter follows a foreach that produces an added column, then push-up 
 filter fails with a null pointer exception.
 {code}
 ...
 x = foreach w generate $0, COUNT($1);
 y = filter x by $1 > 10;
 {code}
 In the above example, the column in the filter's expression is an added 
 column. As a result, the optimizer rule is not able to map it back to the 
 input, resulting in a null value. The subsequent for loop fails with an 
 NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
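
A self-contained sketch of the guard the fix implies: when a filter column 
has no mapping back to the foreach's input (because it was added, like 
COUNT($1)), the rule should simply not apply instead of dereferencing null. 
The mapping structure and names here are illustrative, not the optimizer's 
actual data structures.

{code}
import java.util.HashMap;
import java.util.Map;

public class PushUpFilterGuard {
    public static void main(String[] args) {
        // Maps each column referenced by the filter back to a column of the
        // foreach's input; added columns have no mapping.
        Map<Integer, Integer> colMapping = new HashMap<Integer, Integer>();
        colMapping.put(0, 0); // $0 passes through the foreach unchanged

        int filterColumn = 1; // $1 is COUNT($1), an added column
        Integer inputColumn = colMapping.get(filterColumn);
        if (inputColumn == null) {
            // Bail out instead of dereferencing null (the NPE in the report).
            System.out.println("push-up filter does not apply");
        } else {
            System.out.println("filter column maps to input column " + inputColumn);
        }
    }
}
{code}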



[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-13 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730595#action_12730595
 ] 

Santhosh Srinivasan commented on PIG-877:
-

Patch has been committed.

 Push up filter does not account for added columns in foreach
 

 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.3.1

 Attachments: PIG-877.patch


 If a filter follows a foreach that produces an added column, then push-up 
 filter fails with a null pointer exception.
 {code}
 ...
 x = foreach w generate $0, COUNT($1);
 y = filter x by $1 > 10;
 {code}
 In the above example, the column in the filter's expression is an added 
 column. As a result, the optimizer rule is not able to map it back to the 
 input, resulting in a null value. The subsequent for loop fails with an 
 NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)
Push up filter does not account for added columns in foreach


 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.3.1


If a filter follows a foreach that produces an added column, then push-up 
filter fails with a null pointer exception.

{code}
...
x = foreach w generate $0, COUNT($1);
y = filter x by $1 > 10;
{code}

In the above example, the column in the filter's expression is an added column. 
As a result, the optimizer rule is not able to map it back to the input, 
resulting in a null value. The subsequent for loop fails with an NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan reassigned PIG-877:
---

Assignee: Santhosh Srinivasan

 Push up filter does not account for added columns in foreach
 

 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.3.1


 If a filter follows a foreach that produces an added column, then push-up 
 filter fails with a null pointer exception.
 {code}
 ...
 x = foreach w generate $0, COUNT($1);
 y = filter x by $1 > 10;
 {code}
 In the above example, the column in the filter's expression is an added 
 column. As a result, the optimizer rule is not able to map it back to the 
 input, resulting in a null value. The subsequent for loop fails with an 
 NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-877:


Status: Patch Available  (was: Open)

 Push up filter does not account for added columns in foreach
 

 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.3.1

 Attachments: PIG-877.patch


 If a filter follows a foreach that produces an added column then push up 
 filter fails with a null pointer exception.
 {code}
 ...
 x = foreach w generate $0, COUNT($1);
y = filter x by $1 > 10;
 {code}
 In the above example, the column in the filter's expression is an added 
 column. As a result, the optimizer rule is not able to map it back to the 
 input resulting in a null value. The subsequent for loop is failing due to 
 NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach

2009-07-09 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-877:


Attachment: PIG-877.patch

Attached patch fixes the NPE.

 Push up filter does not account for added columns in foreach
 

 Key: PIG-877
 URL: https://issues.apache.org/jira/browse/PIG-877
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.3.1

 Attachments: PIG-877.patch


 If a filter follows a foreach that produces an added column then push up 
 filter fails with a null pointer exception.
 {code}
 ...
 x = foreach w generate $0, COUNT($1);
y = filter x by $1 > 10;
 {code}
 In the above example, the column in the filter's expression is an added 
 column. As a result, the optimizer rule is not able to map it back to the 
 input resulting in a null value. The subsequent for loop is failing due to 
 NPE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-873) Optimizer should allow search for global patterns

2009-07-07 Thread Santhosh Srinivasan (JIRA)
Optimizer should allow search for global patterns
-

 Key: PIG-873
 URL: https://issues.apache.org/jira/browse/PIG-873
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


Currently, the optimizer works on the following mechanism:

1. Specify the pattern to be searched
2. For each occurrence of the pattern, check and then apply a transformation

With this approach, the search for a pattern is localized. An example will 
illustrate the problem.

If the pattern to be searched for is foreach (with flatten) connected to any 
operator and if the graph has more than one foreach (with flatten) connected to 
an operator (cross, join, union, etc), then each instance of foreach connected 
to the operator is returned as a match. While this is fine for a localized view 
(per match), at a global view the pattern to be searched for is any number of 
foreach connected to an operator.

The implication of not having a globalized view is more rules: there will be 
one rule for one foreach connected to an operator, one rule for two foreaches 
connected to an operator, etc.
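To make the globalized view concrete, here is a sketch of how per-match results could be collapsed into one global match per target operator; Operator and matcher are stand-ins for the actual plan classes, not the real API:

{code}
// Hypothetical sketch: group the matcher's local matches by their target
// operator so that N foreaches feeding one join become a single global match.
Map<Operator, List<Operator>> byTarget = new HashMap<Operator, List<Operator>>();
for (List<Operator> match : matcher.getAllMatches()) {
    Operator foreach = match.get(0);
    Operator target = match.get(1);  // cross, join, union, etc.
    List<Operator> group = byTarget.get(target);
    if (group == null) {
        group = new ArrayList<Operator>();
        byTarget.put(target, group);
    }
    group.add(foreach);
}
// Each entry of byTarget is now one global match: (all foreaches, one target).
{code}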


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-874) Problems in pushing down foreach with flatten

2009-07-07 Thread Santhosh Srinivasan (JIRA)
Problems in pushing down foreach with flatten
-

 Key: PIG-874
 URL: https://issues.apache.org/jira/browse/PIG-874
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


If the graph contains more than one foreach connected to an operator, pushing 
down foreach with flatten is not possible with the current optimizer pattern 
matching algorithm and current implementation of rewire. The following 
mechanism of pushing foreach with flatten does not work.

1. Search for foreach (with flatten) connected to an operator
2. If checks pass then unflatten the flattened column in the foreach
3. Create a new foreach that flattens the mapped column (the original column 
number could have changed) and insert the new foreach after the old foreach's 
successor.

An example to illustrate the problem:

{code}
A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
B = foreach A generate $0, $1, flatten($2);
C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
D = foreach C generate $0, $1, flatten($2);
E = join B by $0, D by $0 using replicated;
F = limit E 10;
{code}

In the code snippet (see above), the optimizer will find two matches, B->E and 
D->E. For the first pattern match (B->E), $2 will be unflattened and a new 
foreach will be introduced after the join.

{code}
A = load 'myfile' as (name, age, gpa:(letter_grade, point_score));
B = foreach A generate $0, $1, $2;
C = load 'anotherfile' as (name, age, preference:(course_name, instructor));
D = foreach C generate $0, $1, flatten($2);
E = join B by $0, D by $0 using replicated;
E1 = foreach E generate $0, $1, flatten($2), $3, $4, $5, $6;
F = limit E1 10;
{code}

For the second match (D->E), the same transformation is applied. However, this 
transformation will not work, for the following reason. The new foreach is now 
inserted between E and E1. When E1 is rewired, rewire is unable to map $6 
in E1 as it never exists in E. In order to fix such situations, the pattern 
matching should return a global match instead of a local match.

Reference: PIG-873

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726566#action_12726566
 ] 

Santhosh Srinivasan commented on PIG-697:
-

1. Removing added fields from the flattened set.

The flattened set is the set of all flattened columns. It can contain mapped 
and added fields. In order to remove the added fields from this set, the 
removeAll method is used (a minimal sketch follows this list).

2. Comments on why the rule applies only to Order, Cross and Join

Will add these comments.

3. Removing code in LOForEach for flattening a bag with unknown schema

The code that I removed was redundant and also had a bug. The check for a field 
getting mapped was neglected in one case. After I added the check, the code for 
the if and the else was identical. I removed the redundant code and made it 
simpler.
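A minimal illustration of item 1, using java.util.Set.removeAll on hypothetical field numbers:

{code}
// The flattened set contains mapped and added fields; removeAll strips the
// added fields, leaving only the mapped ones.
Set<Integer> flattenedFields = new HashSet<Integer>(Arrays.asList(0, 1, 2));
Set<Integer> addedFields = new HashSet<Integer>(Arrays.asList(2));
flattenedFields.removeAll(addedFields); // flattenedFields is now {0, 1}
{code}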

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726601#action_12726601
 ] 

Santhosh Srinivasan commented on PIG-697:
-

-1 javac. The applied patch generated 250 javac compiler warnings (more than 
the trunk's current 248 warnings).

The additional 2 compiler warning messages are related to type inference. At 
this point these messages are harmless. 

Dodgy warning:
The FindBugs warnings are harmless; there is an explicit check for null to 
print "null" as opposed to the contents of the object.

Correctness warning:
There are checks in place to ensure that the variable can never be null.


 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three 

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-07-02 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726684#action_12726684
 ] 

Santhosh Srinivasan commented on PIG-697:
-

Phase 4 part 2 patch has been committed

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second 

[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-07-02 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-792:


Status: Patch Available  (was: Open)

 PERFORMANCE: Support skewed join in pig
 ---

 Key: PIG-792
 URL: https://issues.apache.org/jira/browse/PIG-792
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
 Attachments: skewedjoin.patch


 Fragmented replicated join has a few limitations:
  - One of the tables needs to be loaded into memory
  - Join is limited to two tables
 Skewed join partitions the table and joins the records in the reduce phase. 
 It computes a histogram of the key space to account for skewing in the input 
 records. Further, it adjusts the number of reducers depending on the key 
 distribution.
 We need to implement the skewed join in pig.
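For reference, a hypothetical usage sketch once the feature lands, assuming the syntax mirrors the existing 'replicated' join hint:

{code}
A = load 'page_views' as (user, url);
B = load 'users' as (user, zip);
-- 'skewed' is the assumed hint name; the histogram and reducer
-- adjustment described above would happen behind this one line.
C = join A by user, B by user using 'skewed';
{code}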

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725261#action_12725261
 ] 

Santhosh Srinivasan commented on PIG-697:
-

Phase 4 part 1 patch has been committed.

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * 

[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725279#action_12725279
 ] 

Santhosh Srinivasan commented on PIG-773:
-

Review comments:

1. In addition to checking the type of the constant, the value should also be 
checked. The checks on the data type are good. A check on the actual contents of 
the empty bag, empty tuple and empty map will complete the testing.

{code}
+LOConst loConst = (LOConst)logOp;
+assertTrue(loConst.getType() == DataType.TUPLE);
+assertTrue(loConst.getValue() instanceof Tuple);
{code}
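For example, an additional assertion along these lines (a sketch reusing the patch's test variables) would verify emptiness as well as type:

{code}
// Check the value, not just the type: the constant tuple must be empty.
assertTrue(((Tuple) loConst.getValue()).size() == 0);
{code}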

2. When you have a bag like {(), (1)}, the schema of this bag is returned as a 
bag that contains a tuple that has no schema. This might be the right approach 
for now, i.e., if a bag contains a tuple with no schema then the schema of the 
bag will contain a tuple with no schema irrespective of the contents of the 
remaining tuple. This approach/idea falls into the bigger question of how to 
handle unknown schemas in Pig. Since Alan is looking at this question for all 
of Pig, it will be good if he can review this part.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch


We should be able to create an empty bag constant using {}, an empty tuple 
constant using (), and an empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-866) Pig should support ability to query unique column name when there is no ambiguity

2009-06-29 Thread Santhosh Srinivasan (JIRA)
Pig should support ability to query unique column name when there is no 
ambiguity
-

 Key: PIG-866
 URL: https://issues.apache.org/jira/browse/PIG-866
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


Currently, the default alias of a column following a flatten contains the 
disambiguator '::'. For columns that have a unique name, the '::' 
disambiguator is not required. Although Pig supports column access via the 
unique name and the disambiguated name, there is no support to retrieve the 
unique column name. This is a nice-to-have enhancement. An example below will 
illustrate the issue:

{code}
grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);

grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}

grunt> d = foreach c generate name;

grunt> describe d;
d: {a::name: bytearray}
{code}

In the example shown above, although the column name is allowed in the relation 
'd', the name of the column appears as 'a::name' in the schema. The workaround 
for this issue is to use the AS clause in the foreach. However, this is 
cumbersome for users and it's something that can be fixed within Pig. A sketch 
of the workaround follows.
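For reference, a sketch of the AS-clause workaround mentioned above (the describe output is assumed):

{code}
grunt> d = foreach c generate a::name as name;
grunt> describe d;
d: {name: bytearray}
{code}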

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-866) Pig should support ability to query unique column name when there is no ambiguity

2009-06-29 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725303#action_12725303
 ] 

Santhosh Srinivasan commented on PIG-866:
-

This support has to be extended to the FieldSchema class when Java APIs are 
used to query the aliases.

 Pig should support ability to query unique column name when there is no 
 ambiguity
 -

 Key: PIG-866
 URL: https://issues.apache.org/jira/browse/PIG-866
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
 Fix For: 0.4.0


 Currently, the default alias of a column following a flatten contains the 
disambiguator '::'. For columns that have a unique name, the '::' 
disambiguator is not required. Although Pig supports column access via the 
unique name and the disambiguated name, there is no support to retrieve the 
unique column name. This is a nice-to-have enhancement. An example below will 
 illustrate the issue:
 {code}
grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);
grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}
grunt> d = foreach c generate name;
grunt> describe d;
d: {a::name: bytearray}
 {code}
 In the example shown above, although the column name is allowed in the 
 relation 'd', the name of the column appears as 'a::name' in the schema. The 
 workaround for this issue is to use the AS clause in the foreach. However, 
this is cumbersome for users and it's something that can be fixed within Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-26 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, assumed to have multiple 

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-26 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724704#action_12724704
 ] 

Santhosh Srinivasan commented on PIG-697:
-

The FindBugs warnings are harmless; there are explicit checks for null to print 
"null" as opposed to the contents of the object.
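The flagged pattern is roughly the following (a schematic, not the actual code):

{code}
// Intentional: print the string "null" rather than dereferencing a null object.
System.out.println(obj == null ? "null" : obj.toString());
{code}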

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1-1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
 OptimizerPhase4_part1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, assumed to have multiple 

[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-22 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722701#action_12722701
 ] 

Santhosh Srinivasan commented on PIG-773:
-

Comments:

1. Minor comment - the comments on the empty productions all have the same 
text (the map comment shown below). For tuple and bag, it should be changed 
to refer to tuple and bag respectively.

{code}
+   |{ } // Match the empty content in map.
{code}

2. I am not sure about the test case testEmptyBagConstRecursive. Here the bag 
contains an empty tuple. As a result, the field schema for the bag should 
contain the schema of the empty tuple. The test case will probably fail.

{code}
+@Test
+public void testEmptyBagConstRecursive() throws FrontendException{
+   
+LogicalPlan lp = buildPlan("a = foreach (load 'b') generate {()};");
+LOForEach foreach = (LOForEach) lp.getLeaves().get(0);
+
+Schema.FieldSchema bagFs = new 
Schema.FieldSchema(null,null,DataType.BAG);
+Schema expectedSchema = new Schema(bagFs);
+   
+assertTrue(Schema.equals(foreach.getSchema(), expectedSchema, false, 
true));
+}

{code}

3. There are no tests that check if the empty constants are actually created, 
i.e., there are no checks for expected empty constants. The test below checks 
if the parser can parse the new syntax for empty constants. In addition, the 
values generated by the parser have to checked against expected values for 
these constants.

{code}
+@Test
+public void testRandomEmptyConst(){
+// Various random scripts to test recursive nature of parser with 
empty constants.
+   
+buildPlan("a = foreach (load 'b') generate {({})};");
+buildPlan("a = foreach (load 'b') generate ({()});");
+buildPlan("a = foreach (load 'b') generate {(),()};");
+buildPlan("a = foreach (load 'b') generate ({},{});");
+buildPlan("a = foreach (load 'b') generate ((),());");
+buildPlan("a = foreach (load 'b') generate ([],[]);");
+buildPlan("a = foreach (load 'b') generate {({},{})};");
+buildPlan("a = foreach (load 'b') generate {([],[])};");
+buildPlan("a = foreach (load 'b') generate (({},{}));");
+buildPlan("a = foreach (load 'b') generate (([],[]));");
+}
{code}
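For example, one such value check might look like this (a sketch; the walk through the foreach's inner plan to reach the constant is elided):

{code}
LogicalPlan lp = buildPlan("a = foreach (load 'b') generate {};");
// ... navigate the foreach's inner plan to the LOConst ...
assertTrue(loConst.getType() == DataType.BAG);
assertTrue(((DataBag) loConst.getValue()).size() == 0); // actually empty
{code}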

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch


We should be able to create an empty bag constant using {}, an empty tuple 
constant using (), and an empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-22 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-851:


Status: In Progress  (was: Patch Available)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Attachments: patch_815.txt


When a UDF returns a map and the outputSchema method is not overridden, Pig 
does not figure out the data type. As a result, the type is set to unknown, 
resulting in a runtime failure. An example script and UDF follow:
 {code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-22 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-851:


Patch Info: [Patch Available]

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Attachments: patch_815.txt


When a UDF returns a map and the outputSchema method is not overridden, Pig 
does not figure out the data type. As a result, the type is set to unknown, 
resulting in a runtime failure. An example script and UDF follow:
 {code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-19 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721975#action_12721975
 ] 

Santhosh Srinivasan commented on PIG-697:
-

1. Some operators do not have any internal state that requires rewiring. 
Examples of such operators include LOStream, LOCross, etc.

2. I think that the additional walking should be removed. I added a TODO as I 
was not sure why it was added in the first place.

3. Yes, it will be added as part of the next patch.

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
understand the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 

[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-19 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722075#action_12722075
 ] 

Santhosh Srinivasan commented on PIG-697:
-

OptimizerPhase3_part2_3.patch has been committed.

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the pattern.
 This has the downside that if a given rule applies to two patterns (say 
Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
 the upside that
 the resulting rules know exactly what they are getting.  The original intent 
 of this was to reduce the number of rules that needed to be written.  But the
resulting rules have to do a lot of work to understand the operators they are 
 working with.  With exact matches only, each rule will know exactly the 
 operators it
 is working on and can apply the logic of shifting the operators around.  All 
 four of the existing rules set all entries of required to true, so removing 
 this
 will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 for (Rule rule : mRules) {
 if (matcher.match(rule)) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches)
 {
   if (rule.transformer.check(match)) {
   // The transformer approves.
   rule.transformer.transform(match);
   }
 }
 }
 }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
 RuleMatcher matcher = new RuleMatcher();
 boolean sawMatch;
 int iterators = 0;
 do {
 sawMatch = false;
 for (Rule rule : mRules) {
List<List<O>> matches = matcher.getAllMatches();
for (List<O> match:matches) {
 // It matches the pattern.  Now check if the transformer
 // approves as well.
 if (rule.transformer.check(match)) {
 // The transformer approves.
 sawMatch = true;
 rule.transformer.transform(match);
 }
 }
 }
 // Not sure if 1000 is the right number of iterations, maybe it
 // should be configurable so that large scripts don't stop too 
 // early.
} while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple
 times as necessary.  This allows us to write simple rules, mostly swaps 
 between neighboring operators, without worrying that we get the plan right in 
 one pass.
 For example, we might have a plan that looks like:  
Load->Join->Filter->Foreach, and we want to optimize it to 
Load->Foreach->Filter->Join.  With two simple
 rules (swap filter and join and swap foreach and filter), applied 
 iteratively, we can get from the initial to final plan, without needing to 
 understanding the
 big picture of the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, 

[jira] Commented: (PIG-753) Provide support for UDFs without parameters

2009-06-18 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721564#action_12721564
 ] 

Santhosh Srinivasan commented on PIG-753:
-

+1 for the code changes. The license header and the failing unit tests still 
have to be checked.

 Provide support for UDFs without parameters
 ---

 Key: PIG-753
 URL: https://issues.apache.org/jira/browse/PIG-753
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Jeff Zhang
 Attachments: Pig_753_Patch.txt


 Pig does not support UDFs without parameters; it forces me to provide a 
 parameter, as in the following statement:
  B = FOREACH A GENERATE bagGenerator();  -- this will generate an error. I 
  have to provide a parameter, like the following:
  B = FOREACH A GENERATE bagGenerator($0);
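
For context, a minimal sketch of the kind of zero-argument UDF being described; the class name bagGenerator and the bag it emits are illustrative assumptions, not code from any attached patch:

{code}
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical UDF that takes no arguments and emits a constant bag.
// With no parameters, the input tuple would arrive empty; today the
// parser rejects the zero-argument call syntax before exec() is reached.
public class bagGenerator extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "generated");
        bag.add(t);
        return bag;
    }
}
{code}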
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721592#action_12721592
 ] 

Santhosh Srinivasan commented on PIG-856:
-

Would that be through a configuration parameter? What would the default be, 1 
or 2?

 PERFORMANCE: reduce number of replicas
 --

 Key: PIG-856
 URL: https://issues.apache.org/jira/browse/PIG-856
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Olga Natkovich

 Currently Pig uses the default number of replicas (3) for data passed 
 between MR jobs. Given the temporary nature of this data, we should never 
 need more than 2 replicas and should explicitly set the value to improve 
 performance and to be nicer to the name node.
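
If this were done via a configuration parameter, a sketch of the mechanism might look like the following; dfs.replication is the standard HDFS key for the replication of written files, while the value 2 and the knob name pig.temp.replication are assumptions for illustration:

{code}
import org.apache.hadoop.conf.Configuration;

public class TempReplicationSketch {
    // Lower the replication factor for intermediate job output.
    // "pig.temp.replication" is a hypothetical knob; "dfs.replication"
    // controls the replication of files this job writes to HDFS.
    static void configureTempReplication(Configuration conf) {
        int tempReplication = conf.getInt("pig.temp.replication", 2);
        conf.setInt("dfs.replication", tempReplication);
    }
}
{code}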

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721596#action_12721596
 ] 

Santhosh Srinivasan commented on PIG-856:
-

Essentially, are we adding more knobs to tune Pig? We should document these 
knobs and explain how they interact with each other.

 PERFORMANCE: reduce number of replicas
 --

 Key: PIG-856
 URL: https://issues.apache.org/jira/browse/PIG-856
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Olga Natkovich

 Currently Pig uses the default number of replicas (3) for data passed 
 between MR jobs. Given the temporary nature of this data, we should never 
 need more than 2 replicas and should explicitly set the value to improve 
 performance and to be nicer to the name node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: (was: OptimizerPhase3_part2_2.patch)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-17 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_1.patch



[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-16 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720292#action_12720292
 ] 

Santhosh Srinivasan commented on PIG-851:
-

Review comments:

1. The new sources test/org/apache/pig/test/utils/MyUDFReturnMap.java and 
test/org/apache/pig/test/TestUDFReturnMap.java need to include the Apache 
license headers
2. The use of package 
sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl is resulting in 3 
compiler warnings and 1 javadoc warning. Can we use a different package?
3. The test case in TestUDFReturnMap runs the test in local mode (i.e., 
ExecType.LOCAL). Another test for map reduce mode, ExecType.MAPREDUCE, should 
be added; a sketch of its shape follows.
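
A rough sketch of what such a test could look like; the cluster bring-up is elided because it depends on the test harness in use, and the assertions are placeholders:

{code}
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;
import org.apache.pig.test.utils.MyUDFReturnMap;

public class TestUDFReturnMapMR {
    public void testReturnMapInMapReduceMode() throws Exception {
        // Assumes a (mini)cluster is already running and reachable
        // through the default configuration picked up by PigServer.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("a = load 'student_tab.data';");
        pig.registerQuery("b = foreach a generate "
                + MyUDFReturnMap.class.getName() + "(1);");
        Iterator<Tuple> it = pig.openIterator("b");
        // ... iterate and assert that each result field is a Map here
    }
}
{code}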

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Fix For: 0.3.0

 Attachments: Pig_815_patch.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 cannot figure out the data type. As a result, the type is set to unknown, 
 resulting in a run time failure. An example script and UDF follow
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
     @Override
     public Map<Object, Object> exec(Tuple input) throws IOException {
         return new HashMap<Object, Object>();
     }
     //Note that the outputSchema method is commented out
     /*
     @Override
     public Schema outputSchema(Schema input) {
         try {
             return new Schema(new Schema.FieldSchema(null, null, 
                 DataType.MAP));
         } catch (FrontendException e) {
             return null;
         }
     }
     */
 }
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_2.patch



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-16 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: OptimizerPhase3_part2_2.patch

The attached patch fixes the findbugs warning and cleans up the sources by 
removing commented-out code. The 35 additional compiler warnings are related 
to type inference. At this point these messages are harmless.

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_2.patch



[jira] Commented: (PIG-728) All backend error messages must be logged to preserve the original error messages

2009-06-15 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719617#action_12719617
 ] 

Santhosh Srinivasan commented on PIG-728:
-

In addition, when the framework is not able to parse the error message, the 
message should be annotated as such. Extraneous details like "Unable to 
recreate exception", "Cannot create exception from empty string", etc. should 
not be communicated to the user. These messages reflect internal workings of 
the error handling framework and do not add value to the user.

 All backend error messages must be logged to preserve the original error 
 messages
 -

 Key: PIG-728
 URL: https://issues.apache.org/jira/browse/PIG-728
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.2.1


 The current error handling framework logs backend error messages only when 
 Pig is not able to parse the error message. Instead, Pig should log the 
 backend error message irrespective of Pig's ability to parse backend error 
 messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java 
 is not consistent and should avoid the use of class_name + "(" + 
 string_constructor_args + ")".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-605) Better explain and console output

2009-06-15 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719731#action_12719731
 ] 

Santhosh Srinivasan commented on PIG-605:
-

In addition, it will be very useful for users if the plans carry the line 
numbers of the pig script that resulted in the final plan. For example, the 
plan should state "Line numbers 10, 12, 14" to help users work backwards from 
the plan to the original script.

 Better explain and console output
 -

 Key: PIG-605
 URL: https://issues.apache.org/jira/browse/PIG-605
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han

 It would be nice if, when we explain the script, the corresponding mapred jobs 
 could be explicitly marked out in a neat way. While we execute the script, the 
 console output could print the name and URL of the corresponding hadoop jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-15 Thread Santhosh Srinivasan (JIRA)
Map type used as return type in UDFs not recognized at all times


 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Fix For: 0.3.0


When a UDF returns a map and the outputSchema method is not overridden, Pig 
cannot figure out the data type. As a result, the type is set to unknown, 
resulting in a run time failure. An example script and UDF follow

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {

    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    //Note that the outputSchema method is commented out

    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;

b: {Unknown}

grunt> dump b;

2009-06-15 17:59:01,776 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!

2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2080: Foreach currently does not handle type Unknown

{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-15 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719886#action_12719886
 ] 

Santhosh Srinivasan commented on PIG-851:
-

A workaround for this issue is to override the outputSchema method and return 
the appropriate schema.
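
Spelled out, the workaround is just the override that is commented out in the quoted example; this sketch restates it with the imports it needs:

{code}
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Declaring the map return type explicitly lets the front end infer
// the type instead of falling back to Unknown.
@Override
public Schema outputSchema(Schema input) {
    try {
        return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
    } catch (FrontendException e) {
        return null;
    }
}
{code}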

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Fix For: 0.3.0


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 cannot figure out the data type. As a result, the type is set to unknown, 
 resulting in a run time failure. An example script and UDF follow
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
     @Override
     public Map<Object, Object> exec(Tuple input) throws IOException {
         return new HashMap<Object, Object>();
     }
     //Note that the outputSchema method is commented out
     /*
     @Override
     public Schema outputSchema(Schema input) {
         try {
             return new Schema(new Schema.FieldSchema(null, null, 
                 DataType.MAP));
         } catch (FrontendException e) {
             return null;
         }
     }
     */
 }
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-842) PigStorage should support multi-byte delimiters

2009-06-11 Thread Santhosh Srinivasan (JIRA)
PigStorage should support multi-byte delimiters
---

 Key: PIG-842
 URL: https://issues.apache.org/jira/browse/PIG-842
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Fix For: 0.3.0


Currently, PigStorage supports single byte delimiters. Users have requested 
multi-byte delimiters. There are performance implications with multi-byte 
delimiters: instead of looking for a single byte, PigStorage would have to 
look for a pattern, a la BinStorage.
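
As a sketch of the difference, splitting on a multi-byte delimiter amounts to matching a whole pattern at each candidate boundary rather than comparing one byte while scanning; the delimiter "::" below is an arbitrary example, not anything PigStorage supports today:

{code}
import java.util.regex.Pattern;

public class MultiByteDelimiterSketch {
    // Single-byte case today: one comparison per byte while scanning.
    // Multi-byte case: match a whole (quoted, non-regex) pattern at
    // each candidate position, as BinStorage does with its marker.
    static String[] split(String record, String delimiter) {
        return Pattern.compile(Pattern.quote(delimiter)).split(record);
    }
    // e.g. split("alice::23::cs", "::") -> ["alice", "23", "cs"]
}
{code}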

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-839) incorrect return codes on failure when using -f or -e flags

2009-06-08 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717451#action_12717451
 ] 

Santhosh Srinivasan commented on PIG-839:
-

There are no unit test cases as the testing was performed manually. Pasting a 
test run below.

{code}

$ cat /errcode.pig 
a = load '/user/sms/data/student_tab.data' ;
b = stream a through `false` ;
store b into '/user/sms/data/errcode.out'; 

#Before fix
$ java -cp pig.jar:/home/y/conf/pig/piglet/released org.apache.pig.Main -f 
errcode.pig 
2009-06-08 14:40:51,917 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2009-06-08 14:40:51,926 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2055: Received Error while processing the map plan: 'false ' failed with 
exit status: 1
Details at logfile: pig_1244497222536.log
afterside 14:40:53 ~/src_pig/pig/trunk_optimizer_phase3_part2 $ echo $?
0

#After fix
$ java -cp pig.jar:/home/y/conf/pig/piglet/released org.apache.pig.Main -f 
/homes/sms/src_pig/pig/trunk_optimizer_phase3_part2/errcode.pig 
2009-06-08 14:42:20,422 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2009-06-08 14:42:20,434 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2055: Received Error while processing the map plan: 'false ' failed with 
exit status: 1
Details at logfile: /homes/sms/src_commit/pig/trunk/pig_1244497306578.log
afterside 14:42:21 ~/src_commit/pig/trunk $ echo $?
2

{code}

 incorrect return codes on failure when using -f or -e flags
 ---

 Key: PIG-839
 URL: https://issues.apache.org/jira/browse/PIG-839
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Attachments: fix_return_code.patch


 To repro: pig -e "a = load 'some file' ; b = stream a through \`false\` ; 
 store b into 'some file';"
 Both the -e and -f flags do not return the right code upon exit. Running the 
 script w/o using -f works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-08 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717508#action_12717508
 ] 

Santhosh Srinivasan commented on PIG-773:
-

You can ignore that.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch


 We should be able to create an empty bag constant using {}, an empty tuple 
 constant using (), and an empty map constant using [] within a Pig script.
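
A short example of the proposed syntax; the relation and file names are illustrative, and whether an explicit schema would be required on such constants is left open by the issue:

{code}
A = load 'data';
B = foreach A generate {}, (), [];
{code}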

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-01 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715143#action_12715143
 ] 

Santhosh Srinivasan commented on PIG-697:
-

The graph operation pushAfter was added as a complementary operation to 
pushBefore. Currently, on the logical side, there are no concrete use cases for 
pushAfter. The only operator that truly supports multiple outputs is split. Our 
current model for split is to have a no-op split operator with multiple 
successors (split outputs), each of which is the equivalent of a filter. The 
split output has inner plans which could have projection operators that hold 
references to the split's predecessor. 

When an operator is pushed after split, the operator will be placed between the 
split and split output. As a result, when rewire on split is called, the call 
is dispatched to the split output. The references in the split output after the 
rewire will now point to split's predecessor instead of pointing to the 
operator that was pushed after.

The intention of pushAfter in the case of a split is to push the operator 
after the split output. However, the generic pushAfter operation does not 
distinguish between split and split output. A possible way out is to override 
this method in the logical plan, duplicating most of the code in OperatorPlan 
and adding new code to handle split.

As of now, the pushAfter will not be used in the logical layer.
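
For reference, a sketch of the pushAfter signature being discussed, mirroring the swap/pushBefore javadoc quoted in the issue description; the parameter names and javadoc wording are assumptions:

{code}
/**
 * Push one operator after another.  This function is for use when
 * the second operator has multiple outputs.  The caller can specify
 * which output of the second operator the first operator should be
 * pushed to.
 * @param first operator, will be pushed after the second operator
 * @param second operator, assumed to have multiple outputs
 * @param outputNum which output of the second operator to push onto
 * @throws PlanException if the operators do not meet the requirements
 */
public void pushAfter(E first, E second, int outputNum) throws PlanException {
    ...
}
{code}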


 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch


