[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771846#action_12771846
 ] 

Hadoop QA commented on PIG-1063:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423638/PIG-1063.patch
  against trunk revision 831169.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 199 javac compiler warnings (more 
than the trunk's current 198 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/console

This message is automatically generated.

 Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in 
 the multistore case
 --

 Key: PIG-1063
 URL: https://issues.apache.org/jira/browse/PIG-1063
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1063.patch


 A StoreFunc implementation can inform pig of an OutputFormat it uses through 
 the getStoragePreparationClass() method. In a query with multiple stores 
 which gets optimized into a single mapred job, Pig does not call the 
 checkOutputSpecs() method on the outputformat. An example of such a script is:
 {noformat}
 a = load 'input.txt';
 b = filter a by $0 > 10;
 store b into 'output1' using StoreWithOutputFormat();
 c = group a by $0;
 d = foreach c generate group, COUNT(a.$0);
 store d into 'output2' using StoreWithOutputFormat();
 {noformat}
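
For context, a minimal sketch of the API shape at issue - a store function advertising a custom OutputFormat through getStoragePreparationClass(), whose checkOutputSpecs() is what gets skipped in the multi-store case. Class names and bodies here are illustrative assumptions, not the actual patch:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.pig.builtin.PigStorage;

// Illustrative sketch only. A StoreFunc can point Pig at an OutputFormat;
// per this issue, checkOutputSpecs() on that format is not called when
// several stores are merged into a single map-reduce job.
public class StoreWithOutputFormat extends PigStorage {
    @Override
    public Class getStoragePreparationClass() {
        return CheckingOutputFormat.class;  // Pig should prepare (and check) this format
    }
}

class CheckingOutputFormat extends TextOutputFormat {
    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws IOException {
        // Output validation (e.g. "output directory must not already exist")
        // belongs here and should run before the job is submitted.
        super.checkOutputSpecs(ignored, job);
    }
}
{code}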

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Attachment: 1035.patch

The attached patch contains modifications to support outer skewed join. It 
follows the same semantics as regular join. Some of the code used by regular 
join has been moved to a common file, CompilerUtils, and is used by both.

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similar to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Patch Available  (was: Open)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similar to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771878#action_12771878
 ] 

Hadoop QA commented on PIG-1048:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423658/pig_1048.patch
  against trunk revision 831169.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/console

This message is automatically generated.

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 {code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 l1 = load 'students.txt';
 l2 = load 'students.txt';
 j = join l1 by $0, l2 by $0 using "skewed";
 store j into 'tmp.txt';
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1035) support for skewed outer join

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771913#action_12771913
 ] 

Hadoop QA commented on PIG-1035:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423670/1035.patch
  against trunk revision 831169.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/console

This message is automatically generated.

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similar to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null

2009-10-30 Thread Bogdan Dorohonceanu (JIRA)
ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected 
internal error. null


 Key: PIG-1066
 URL: https://issues.apache.org/jira/browse/PIG-1066
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.4.0
Reporter: Bogdan Dorohonceanu


-- load the QID_CT_QP20 data
x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS 
(unstem_qid:chararray, jid_score_pairs:chararray);
DESCRIBE x;
--ILLUSTRATE x;

-- load the ID_RQ data
y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, 
query:chararray);
-- force parallelization
-- y1 = ORDER y0 BY sid PARALLEL $NUM;
-- compute unstem_qid
DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 
1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) 
OUTPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', 
'$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt');
y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, 
unstem_qid:chararray);
DESCRIBE y;
--ILLUSTRATE y;
rmf /user/vega/zoom/y_debug
STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t');


2009-10-30 13:36:48,437 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: 
hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop 
file system at: 
hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
2009-10-30 13:36:48,495 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: dd-9c32d04:8889
09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to 
map-reduce job tracker at: dd-9c32d04:8889
2009-10-30 13:36:49,242 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. null
09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. null
Details at logfile: /disk1/vega/zoom/pig_1256909801304.log



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null

2009-10-30 Thread Bogdan Dorohonceanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771948#action_12771948
 ] 

Bogdan Dorohonceanu commented on PIG-1066:
--

In the code above, if I comment out ILLUSTRATE, the code works fine. If I 
uncomment it, Grunt gets an error.

 ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected 
 internal error. null
 

 Key: PIG-1066
 URL: https://issues.apache.org/jira/browse/PIG-1066
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.4.0
Reporter: Bogdan Dorohonceanu

 -- load the QID_CT_QP20 data
 x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS 
 (unstem_qid:chararray, jid_score_pairs:chararray);
 DESCRIBE x;
 --ILLUSTRATE x;
 -- load the ID_RQ data
 y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, 
 query:chararray);
 -- force parallelization
 -- y1 = ORDER y0 BY sid PARALLEL $NUM;
 -- compute unstem_qid
 DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 
 1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) 
 OUTPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', 
 '$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt');
 y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, 
 unstem_qid:chararray);
 DESCRIBE y;
 --ILLUSTRATE y;
 rmf /user/vega/zoom/y_debug
 STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t');
 2009-10-30 13:36:48,437 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop 
 file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 2009-10-30 13:36:48,495 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: dd-9c32d04:8889
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to 
 map-reduce job tracker at: dd-9c32d04:8889
 2009-10-30 13:36:49,242 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2999: Unexpected internal error. null
 09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. 
 null
 Details at logfile: /disk1/vega/zoom/pig_1256909801304.log

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null

2009-10-30 Thread Bogdan Dorohonceanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771951#action_12771951
 ] 

Bogdan Dorohonceanu commented on PIG-1066:
--

Actually, it looks like ILLUSTRATE crashes when used on a relation created with 
STREAM ... THROUGH.

 ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected 
 internal error. null
 

 Key: PIG-1066
 URL: https://issues.apache.org/jira/browse/PIG-1066
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.4.0
Reporter: Bogdan Dorohonceanu

 -- load the QID_CT_QP20 data
 x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS 
 (unstem_qid:chararray, jid_score_pairs:chararray);
 DESCRIBE x;
 --ILLUSTRATE x;
 -- load the ID_RQ data
 y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, 
 query:chararray);
 -- force parallelization
 -- y1 = ORDER y0 BY sid PARALLEL $NUM;
 -- compute unstem_qid
 DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 
 1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) 
 OUTPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', 
 '$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt');
 y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, 
 unstem_qid:chararray);
 DESCRIBE y;
 --ILLUSTRATE y;
 rmf /user/vega/zoom/y_debug
 STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t');
 2009-10-30 13:36:48,437 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop 
 file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 2009-10-30 13:36:48,495 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: dd-9c32d04:8889
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to 
 map-reduce job tracker at: dd-9c32d04:8889
 2009-10-30 13:36:49,242 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2999: Unexpected internal error. null
 09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. 
 null
 Details at logfile: /disk1/vega/zoom/pig_1256909801304.log

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case

2009-10-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771985#action_12771985
 ] 

Pradeep Kamath commented on PIG-1063:
-

The single javac warning is due to the deprecation warning explained in my 
previous comment. The unit test failure seems unrelated and looks like a 
temporary environment issue - resubmitting to check if the tests pass now.

 Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in 
 the multistore case
 --

 Key: PIG-1063
 URL: https://issues.apache.org/jira/browse/PIG-1063
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1063.patch


 A StoreFunc implementation can inform pig of an OutputFormat it uses through 
 the getStoragePreparationClass() method. In a query with multiple stores 
 which gets optimized into a single mapred job, Pig does not call the 
 checkOutputSpecs() method on the outputformat. An example of such a script is:
 {noformat}
 a = load 'input.txt';
 b = filter a by $0 > 10;
 store b into 'output1' using StoreWithOutputFormat();
 c = group a by $0;
 d = foreach c generate group, COUNT(a.$0);
 store d into 'output2' using StoreWithOutputFormat();
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1067) [Zebra] to support pig projection push down in Zebra

2009-10-30 Thread Chao Wang (JIRA)
[Zebra] to support pig projection push down in Zebra


 Key: PIG-1067
 URL: https://issues.apache.org/jira/browse/PIG-1067
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0


Pig tries to determine which fields in a query script file will be needed and 
passes that information to the load function, thereby optimizing the query by 
reducing the data to be loaded.

To support this optimization, Zebra needs to implement the fieldsToRead method 
in the TableLoader class to make use of this information. This JIRA covers that 
new feature.

For more information on this optimization on the Pig side, refer to PIG-653:

https://issues.apache.org/jira/browse/PIG-653
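
As a rough illustration of the hook involved (the old LoadFunc contract; the real Zebra implementation may well differ), fieldsToRead receives the fields Pig determined it needs, and the loader can fold them into a column projection. All bodies and the projectionString field are invented for illustration:

{code}
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Hypothetical sketch only: turn the requested fields into a projection
// string (e.g. "name,age") and remember it for when the table is opened,
// so untouched column groups are never read from disk.
public void fieldsToRead(Schema requiredFields) {
    try {
        StringBuilder projection = new StringBuilder();
        for (int i = 0; i < requiredFields.size(); i++) {
            if (i > 0) {
                projection.append(',');
            }
            projection.append(requiredFields.getField(i).alias);
        }
        this.projectionString = projection.toString(); // hypothetical field
    } catch (FrontendException e) {
        throw new RuntimeException("could not read required-field schema", e);
    }
}
{code}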

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1040) FINDBUGS: MS_SHOULD_BE_FINAL: Field isn't final but should be

2009-10-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1040:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed

 FINDBUGS: MS_SHOULD_BE_FINAL: Field isn't final but should be
 -

 Key: PIG-1040
 URL: https://issues.apache.org/jira/browse/PIG-1040
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1040.patch


 MS
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.USER_COMPARATOR_MARKER
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.weightedParts
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce.sJobConf
  isn't final and can't be protected from malicious code
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.bagFactory
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.reporter
  isn't final and can't be protected from malicious code
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.pigLogger
  should be package protected
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyBag
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyBool
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyDBA
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyDouble
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyFloat
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyInt
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyLong
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyMap
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyString
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyTuple
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.mTupleFactory
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.mTupleFactory
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.mBagFactory
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.mTupleFactory
  isn't final but should be
 MS
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.mTupleFactory
  isn't final but should be
 MS org.apache.pig.builtin.PigDump.recordDelimiter isn't final but should be
 MS org.apache.pig.impl.builtin.GFCross.DEFAULT_PARALLELISM isn't final but 
 should be
 MS org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.classloader isn't 
 final and can't be protected from malicious code
 MS org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.mOpToCloneMap 
 should be package protected
 MS
 org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.canonicalNamer 
 isn't final but should be
 MS
 org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.castLookup 
 isn't final but should be
 MS org.apache.pig.impl.plan.OperatorPlan.log isn't final but should be
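
For readers unfamiliar with the MS_SHOULD_BE_FINAL category, the fix is mechanical; taking one field from the list above as an example (declarations abbreviated for illustration):

{code}
// Before: a mutable public static field; any code on the classpath could
// reassign it, which is what FindBugs MS_SHOULD_BE_FINAL flags.
public static TupleFactory mTupleFactory = TupleFactory.getInstance();

// After: final guarantees the reference is assigned exactly once.
public static final TupleFactory mTupleFactory = TupleFactory.getInstance();
{code}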

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-10-30 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772017#action_12772017
 ] 

Dmitriy V. Ryaboy commented on PIG-1062:


I have ResourceStats hooked up to LogicalOperators already, need to port the 
code to the new branch.  This will let us take statistics, if they are 
available, and pass them into the PoissonSampleLoader at initialization time, 
so it can get the number of tuples and avg tuple size directly from Stats.

That being said, statistics may not always be available...

Before I go into the more fanciful suggestion below -- perhaps a simple hack 
will do.  We have counters in Hadoop. Any reason we can't just read the "bytes 
read in map", "records read in map", "bytes written in map", and "records 
written in map" counters directly?

If I am overlooking something obvious, here's the "ignore counters" suggestion:

If my understanding is correct, in PoissonSampleLoader we are interested in the 
average size of a tuple more than # of tuples -- # of tuples is just used as a 
way of crudely estimating avg size of tuple on disk, which is in turn used to 
crudely estimate the size of tuple in memory.  The estimate is likely to be 
very off, by the way, if we are not loading from BinStorage, but from arbitrary 
loadFuncs, as the underlying data, even if it is a file, might be compressed.

Perhaps we can get the average tuple size directly, instead? We could get that  
in the mappers of the sampling job by recording memory usage at the first 
getNext() call, forcing garbage collection, buffering up K tuples, and getting 
memory usage again. 

We now have the following variables available to each sampling mapper in the 
SkewedPartitioner:

* sample rate S (for the appropriate Poisson distribution)
* total # of mappers, M
* available heap size on the reducer, H
* estimated avg size of tuple, s

The number of tuples we want to sample is then simply T = max(10, S*H/(s*M))

In getNext(), we can now allocate a buffer for T elements, populate it with the 
first T tuples, and continue scanning the partition. For every ith next() call, 
we generate a random number r s.t. 0 <= r < i, and if r < T we insert the new 
tuple into our buffer at position r.  This gives us a nicely random sample of 
the tuples in the partition.
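
That insertion scheme is classic reservoir sampling; here is a self-contained sketch of the idea (invented names, not code from any patch):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative reservoir sampler: keep the first T tuples, then replace a
// random slot with probability T/i for the ith tuple, which leaves a uniform
// random sample of everything scanned so far.
public class Reservoir<E> {
    private final List<E> buffer;
    private final int capacity;          // T above
    private final Random rand = new Random();
    private long seen = 0;               // i: tuples observed so far

    public Reservoir(int capacity) {
        this.capacity = capacity;
        this.buffer = new ArrayList<E>(capacity);
    }

    public void add(E tuple) {
        seen++;
        if (buffer.size() < capacity) {
            buffer.add(tuple);                           // first T: keep all
        } else {
            long r = (long) (rand.nextDouble() * seen);  // r s.t. 0 <= r < i
            if (r < capacity) {
                buffer.set((int) r, tuple);              // overwrite slot r
            }
        }
    }

    public List<E> sample() { return buffer; }
    public long totalSeen() { return seen; }
}
{code}

A side benefit of scanning the whole partition this way: the sampler ends up 
knowing the exact number of tuples it saw, which bears on the record-count 
estimation discussed below.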

So this gets around the need for file size info on that side.

Now, PartitionSkewedKey uses the file size / avg_tuple_disk_size to estimate 
total number of tuples, and uses this estimate, plus the ratio of instances of 
a given key in the sample to the total sample size to predict the total number 
of records with a given key in the input.  But given the number of sampled 
tuples, and the sample rate, couldn't we calculate the total number of records 
in the original file by simply reversing the formula for determining the number 
of tuples to sample?  If we do this, no need to append any metadata.

Lastly, if we do want to move around metadata such as number of records in 
input, etc, and we don't want to use Hadoop counters, we should extend 
BinStorage with ResourceStats serialization, and use ResourceStatistics for 
this.  Even if the original data might not have stats, there is no reason we 
can't generate these basic counts at runtime for the data we write ourselves.

-D

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-10-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1030:
--

Attachment: PIG-1030.patch

Added returns as suggested.

 explain and dump not working with two UDFs inside inner plan of foreach
 ---

 Key: PIG-1030
 URL: https://issues.apache.org/jira/browse/PIG-1030
 Project: Pig
  Issue Type: Bug
Reporter: Ying He
Assignee: Richard Ding
 Attachments: PIG-1030.patch, PIG-1030.patch


 This script does not work:
 register /homes/yinghe/owl/string.jar;
 a = load '/user/yinghe/a.txt' as (id, color);
 b = group a all;
 c = foreach b {
 d = distinct a.color;
 generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
 }
 The UDFs are regular, not algebraic.
 Then if I call dump c; or explain c;, I get this error message:
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan 
 with single leaf. Found 2 leaves.
 The error only occurs the first time; after getting this error, if I call 
 dump c or explain c again, it succeeds.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-10-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1030:
--

Status: Patch Available  (was: Open)

 explain and dump not working with two UDFs inside inner plan of foreach
 ---

 Key: PIG-1030
 URL: https://issues.apache.org/jira/browse/PIG-1030
 Project: Pig
  Issue Type: Bug
Reporter: Ying He
Assignee: Richard Ding
 Attachments: PIG-1030.patch, PIG-1030.patch


 This script does not work:
 register /homes/yinghe/owl/string.jar;
 a = load '/user/yinghe/a.txt' as (id, color);
 b = group a all;
 c = foreach b {
 d = distinct a.color;
 generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
 }
 The UDFs are regular, not algebraic.
 Then if I call dump c; or explain c;, I get this error message:
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan 
 with single leaf. Found 2 leaves.
 The error only occurs the first time; after getting this error, if I call 
 dump c or explain c again, it succeeds.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-920) optimizing diamond queries

2009-10-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772066#action_12772066
 ] 

Richard Ding commented on PIG-920:
--

Added additional comments.

 optimizing diamond queries
 --

 Key: PIG-920
 URL: https://issues.apache.org/jira/browse/PIG-920
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Attachments: PIG-920.patch, PIG-920.patch


 The following query
 A = load 'foo';
 B = filter A by $0 > 1;
 C = filter A by $1 == 'foo';
 D = COGROUP C by $0, B by $0;
 ..
 does not get executed efficiently. Currently, it runs a map-only job that 
 basically reads and writes the same data before doing the query processing.
 A query where the data is loaded twice is actually executed more efficiently.
 This is not an uncommon query, and we should fix this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-920) optimizing diamond queries

2009-10-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-920:
-

Attachment: PIG-920.patch

 optimizing diamond queries
 --

 Key: PIG-920
 URL: https://issues.apache.org/jira/browse/PIG-920
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Attachments: PIG-920.patch, PIG-920.patch


 The following query
 A = load 'foo';
 B = filter A by $0 > 1;
 C = filter A by $1 == 'foo';
 D = COGROUP C by $0, B by $0;
 ..
 does not get executed efficiently. Currently, it runs a map-only job that 
 basically reads and writes the same data before doing the query processing.
 A query where the data is loaded twice is actually executed more efficiently.
 This is not an uncommon query, and we should fix this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772068#action_12772068
 ] 

Hadoop QA commented on PIG-1063:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423638/PIG-1063.patch
  against trunk revision 831169.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 199 javac compiler warnings (more 
than the trunk's current 198 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/console

This message is automatically generated.

 Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in 
 the multistore case
 --

 Key: PIG-1063
 URL: https://issues.apache.org/jira/browse/PIG-1063
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1063.patch


 A StoreFunc implementation can inform pig of an OutputFormat it uses through 
 the getStoragePreparationClass() method. In a query with multiple stores 
 which gets optimized into a single mapred job, Pig does not call the 
 checkOutputSpecs() method on the outputformat. An example of such a script is:
 {noformat}
 a = load 'input.txt';
 b = filter a by $0 > 10;
 store b into 'output1' using StoreWithOutputFormat();
 c = group a by $0;
 d = foreach c generate group, COUNT(a.$0);
 store d into 'output2' using StoreWithOutputFormat();
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case

2009-10-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772072#action_12772072
 ] 

Olga Natkovich commented on PIG-1063:
-

+1; changes look good

 Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in 
 the multistore case
 --

 Key: PIG-1063
 URL: https://issues.apache.org/jira/browse/PIG-1063
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1063.patch


 A StoreFunc implementation can inform pig of an OutputFormat it uses through 
 the getStoragePreparationClass() method. In a query with multiple stores 
 which gets optimized into a single mapred job, Pig does not call the 
 checkOutputSpecs() method on the outputformat. An example of such a script is:
 {noformat}
 a = load 'input.txt';
 b = filter a by $0 > 10;
 store b into 'output1' using StoreWithOutputFormat();
 c = group a by $0;
 d = foreach c generate group, COUNT(a.$0);
 store d into 'output2' using StoreWithOutputFormat();
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case

2009-10-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1063:


   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

Patch committed to trunk

 Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in 
 the multistore case
 --

 Key: PIG-1063
 URL: https://issues.apache.org/jira/browse/PIG-1063
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.6.0

 Attachments: PIG-1063.patch


 A StoreFunc implementation can inform pig of an OutputFormat it uses through 
 the getStoragePreparationClass() method. In a query with multiple stores 
 which gets optimized into a single mapred job, Pig does not call the 
 checkOutputSpecs() method on the outputformat. An example of such a script is:
 {noformat}
 a = load 'input.txt';
 b = filter a by $0 > 10;
 store b into 'output1' using StoreWithOutputFormat();
 c = group a by $0;
 d = foreach c generate group, COUNT(a.$0);
 store d into 'output2' using StoreWithOutputFormat();
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1068) COGROUP fails with 'Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple'

2009-10-30 Thread Vikram Oberoi (JIRA)
COGROUP fails with 'Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableText, recieved 
org.apache.pig.impl.io.NullableTuple'
---

 Key: PIG-1068
 URL: https://issues.apache.org/jira/browse/PIG-1068
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Vikram Oberoi


The COGROUP in the following script fails in its map:

{code}
logs = LOAD '$LOGS' USING PigStorage() AS (ts:int, id:chararray, 
command:chararray, comments:chararray);

SPLIT logs INTO logins IF command == 'login', all_quits IF command == 'quit';

-- Project login clients and count them by ID.
login_info = FOREACH logins {
    GENERATE id as id,
             comments AS client;
};

logins_grouped = GROUP login_info BY (id, client);

count_logins_by_client = FOREACH logins_grouped {
    generate group.id AS id, group.client AS client, COUNT($1) AS count;
}

-- Get the first quit.
all_quits_grouped = GROUP all_quits BY id;

quits = FOREACH all_quits_grouped {
    ordered = ORDER all_quits BY ts ASC;
    last_quit = 

[jira] Updated: (PIG-1068) COGROUP fails with 'Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple'

2009-10-30 Thread Vikram Oberoi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Oberoi updated PIG-1068:
---

Attachment: cogroup-bug.pig
log

Attached the script and some sample data.

 COGROUP fails with 'Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableText, recieved 
 org.apache.pig.impl.io.NullableTuple'
 ---

 Key: PIG-1068
 URL: https://issues.apache.org/jira/browse/PIG-1068
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Vikram Oberoi
 Attachments: cogroup-bug.pig, log


 The COGROUP in the following script fails in its map:
 {code}
 logs = LOAD '$LOGS' USING PigStorage() AS (ts:int, id:chararray, 
 command:chararray, comments:chararray);

 SPLIT logs INTO logins IF command == 'login', all_quits IF command == 'quit';

 -- Project login clients and count them by ID.
 login_info = FOREACH logins {
     GENERATE id as id,
              comments AS client;
 };

 logins_grouped = GROUP login_info BY (id, client);

 count_logins_by_client = FOREACH logins_grouped {
     generate group.id AS id, group.client AS client, COUNT($1) AS count;
 }

 -- Get the first quit.
 all_quits_grouped = GROUP all_quits BY id;

 quits = FOREACH all_quits_grouped {

[jira] Updated: (PIG-920) optimizing diamond queries

2009-10-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-920:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

+1, patch committed, thanks for the contribution Richard!

 optimizing diamond queries
 --

 Key: PIG-920
 URL: https://issues.apache.org/jira/browse/PIG-920
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: PIG-920.patch, PIG-920.patch


 The following query
 A = load 'foo';
 B = filter A by $0 > 1;
 C = filter A by $1 == 'foo';
 D = COGROUP C by $0, B by $0;
 ..
 does not get executed efficiently. Currently, it runs a map-only job that 
 basically reads and writes the same data before doing the query processing.
 A query where the data is loaded twice is actually executed more efficiently.
 This is not an uncommon query, and we should fix this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-10-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


  Component/s: impl
  Description: 
If nested foreach plan contains sort/distinct, it is possible to use hadoop 
secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
query. 

Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
D = order A by $1;
generate group, D;
}
store C into 'myresult';

We can specify a secondary sort on A.$1, and drop order A by $1.

Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
D = A.$1;
E = distinct D;
generate group, E;
}
store C into 'myresult';

We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = 
distinct D" to a special version of distinct, which does not do the sorting.

  was:Since the data coming to the reducer is sorted on group+distinct, we 
don't need to see all distinct values at once

Affects Version/s: 0.4.0
Fix Version/s: 0.6.0
  Summary: Optimize nested distinct/sort to use secondary key  
(was: stream nested distinct for in case of accumulate interface)

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = 
 distinct D" to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-10-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772085#action_12772085
 ] 

Daniel Dai commented on PIG-1038:
-

Here is the design for this optimization:
1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will:
1.1 Discover whether a sort/distinct is used in a nested foreach plan. 
1.2 For the first such sort/distinct, use the sort/distinct key as the 
secondary key.
1.3 Once SecondaryKeyOptimizer discovers a secondary key, it will call 
POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct.

2. Change POLocalRearrange:
2.1 Add setSecondaryPlan to give SecondaryKeyOptimizer a way to set the 
secondary plan.
2.2 Change constructLROutput to make a compound key, which is a tuple: (key, 
secondaryKey).
2.3 Duplicate the logic that strips the key from the values for the secondary 
key as well.

3. Change POPackageAnnotator to patch POPackage with the keyinfo from both the 
key and the secondaryKey.

4. Change POPackage to stitch the secondary key back onto the value.

5. Change MapReduceOper to indicate that the map-reduce operator needs a 
secondary key; JobControlCompiler will then set OutputValueGroupingComparator 
to use the mainKeyComparator.

6. Add mainKeyComparator, which inherits PigNullableWritable and only compares 
the main key. We need that for the OutputValueGroupingComparator; a sketch of 
the idea follows.
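
A hedged sketch of the grouping-comparator piece (CompoundKey is an invented stand-in for the real PigNullableWritable-based key; the actual patch may differ):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Invented compound key: the full ordering uses (mainKey, secondaryKey),
// which is what the sort comparator sees during the shuffle.
class CompoundKey implements WritableComparable<CompoundKey> {
    Text mainKey = new Text();
    Text secondaryKey = new Text();

    public void write(DataOutput out) throws IOException {
        mainKey.write(out);
        secondaryKey.write(out);
    }
    public void readFields(DataInput in) throws IOException {
        mainKey.readFields(in);
        secondaryKey.readFields(in);
    }
    public int compareTo(CompoundKey o) {   // sort order: main, then secondary
        int c = mainKey.compareTo(o.mainKey);
        return c != 0 ? c : secondaryKey.compareTo(o.secondaryKey);
    }
}

// Grouping comparator: compares the main key only, so one reduce() call sees
// all values for a main key, already ordered by the secondary key.
class MainKeyGroupingComparator extends WritableComparator {
    protected MainKeyGroupingComparator() {
        super(CompoundKey.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompoundKey) a).mainKey.compareTo(((CompoundKey) b).mainKey);
    }
}
// Wiring (old mapred API):
// jobConf.setOutputValueGroupingComparator(MainKeyGroupingComparator.class);
{code}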

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = 
 distinct D" to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-10-30 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-1062:
--

Assignee: Thejas M Nair

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1035) support for skewed outer join

2009-10-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772095#action_12772095
 ] 

Pradeep Kamath commented on PIG-1035:
-

The unit test does not seem to check the results of the outer join - it would 
be good to add a check of the actual results. 
In fact, there are already outer join tests in TestJoin.java - you can just 
update those to also test skewed join, since those tests
already check output correctness.

In LogToPhyTranslationVisitor.java, in the following code, the return value of 
op.getSchema() should be checked for null, in
which case the same Exception should be thrown:
{code}
try {
    skj.addSchema(op.getSchema());
} catch (FrontendException e) {
    int errCode = 2015;
    String msg = "Couldn't set the schema for outer join";
    throw new LogicalToPhysicalTranslatorException(msg, 
        errCode, PigException.BUG, e);
}
{code}
With the above code, a schema is required for both inputs to the join. Strictly, 
for left and right outer joins, only the schema of the side where nulls need to 
be projected is needed. Only in a full outer join should both inputs have 
schemas - if possible, for left and right outer joins the restriction should be 
to require a schema only on the relevant input; for reference, left and right 
outer joins in regular join do this.
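
For concreteness, the suggested null check might look like this (a sketch reusing the identifiers from the snippet above, not the committed fix):

{code}
try {
    Schema inputSchema = op.getSchema();
    if (inputSchema == null) {
        // Same error as the catch block below: without a schema we cannot
        // project nulls for the outer side of the join.
        int errCode = 2015;
        String msg = "Couldn't set the schema for outer join";
        throw new LogicalToPhysicalTranslatorException(msg, errCode, 
            PigException.BUG);
    }
    skj.addSchema(inputSchema);
} catch (FrontendException e) {
    int errCode = 2015;
    String msg = "Couldn't set the schema for outer join";
    throw new LogicalToPhysicalTranslatorException(msg, errCode, 
        PigException.BUG, e);
}
{code}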

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similar to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1057) [Zebra] Zebra does not support concurrent deletions of column groups now.

2009-10-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1057:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.

 [Zebra] Zebra does not support concurrent deletions of column groups now.
 -

 Key: PIG-1057
 URL: https://issues.apache.org/jira/browse/PIG-1057
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0

 Attachments: patch_1057


 Zebra does not support concurrent deletions of column groups now.  As a 
 result, the TestDropColumnGroup testcase can sometimes fail due to this.
 In this testcase, multiple threads will be launched together, with each one 
 deleting one particular column group.  The following exception can be thrown 
 (with callstack):
 /*/
 ... 
 java.io.FileNotFoundException: File 
 /.../pig-trunk/build/contrib/zebra/test/data/DropCGTest/CG02 does not exist.
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:290)
   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:716)
   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:741)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
   at 
 org.apache.hadoop.zebra.io.BasicTable$SchemaFile.setCGDeletedFlags(BasicTable.java:1610)
   at 
 org.apache.hadoop.zebra.io.BasicTable$SchemaFile.readSchemaFile(BasicTable.java:1593)
   at 
 org.apache.hadoop.zebra.io.BasicTable$SchemaFile.init(BasicTable.java:1416)
   at 
 org.apache.hadoop.zebra.io.BasicTable.dropColumnGroup(BasicTable.java:133)
   at 
 org.apache.hadoop.zebra.io.TestDropColumnGroup$DropThread.run(TestDropColumnGroup.java:772)
 ...
 /*/
 We plan to fix this in Zebra to support concurrent deletions of column 
 groups. The root cause is that a thread or process reads in some stale file 
 system information (e.g., it sees /CG0 first) and then can fail later on (it 
 tries to access /CG0; however, /CG0 may have been deleted by another thread or 
 process). Therefore, we plan to adopt retry logic to resolve this issue. 
 In more detail, we allow a column-group-dropping thread to retry n times when 
 doing its deletion - n is the total number of column groups (see the sketch 
 below). 
 Note that here we do NOT try to resolve the more general concurrent column 
 group deletions + reads issue. If a process is reading some data that could 
 be deleted by another process, it can fail as we expect.
 Here we only try to resolve the concurrent column group deletions issue. If 
 you have multiple threads or processes deleting column groups, they should 
 succeed.
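
A minimal sketch of that retry idea (variable names and the dropColumnGroup argument list are assumptions for illustration, not the actual patch):

{code}
// Hypothetical sketch: retry the drop up to n times (n = total number of
// column groups), so a stale listing caused by a concurrent deleter is
// retried instead of failing the whole drop.
int n = totalColumnGroups;                     // assumed to be known here
for (int attempt = 1; ; attempt++) {
    try {
        BasicTable.dropColumnGroup(path, conf, cgName);  // argument list assumed
        break;                                 // success
    } catch (FileNotFoundException e) {
        if (attempt >= n) {
            throw e;                           // still failing after n tries
        }
        // Another thread/process removed a column group between our listing
        // and our access; loop around and re-read the file system state.
    }
}
{code}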

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-997:
-

Status: Patch Available  (was: Open)

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through Pig or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through Pig's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as "sorted" on 
 the sorted columns.
 For sorted data creation through map/reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal. setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon the 
 schema, the sorted columns, and the input tuple.
 For sorted data read through Pig's LOAD operator, pass the string "sorted" as 
 an extra argument to the TableLoader constructor to ask for a sorted table to 
 be loaded.
 For sorted data read through map/reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP. 
 Multiple-key sorting is supported, but the ordering of the multiple sort keys 
 is significant, with the first sort column being the primary sort key, the 
 second being the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-997:


Assignee: Yan Zhou

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through Pig or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through Pig's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as "sorted" on 
 the sorted columns.
 For sorted data creation through map/reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal. setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon the 
 schema, the sorted columns, and the input tuple.
 For sorted data read through Pig's LOAD operator, pass the string "sorted" as 
 an extra argument to the TableLoader constructor to ask for a sorted table to 
 be loaded.
 For sorted data read through map/reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP. 
 Multiple-key sorting is supported, but the ordering of the multiple sort keys 
 is significant, with the first sort column being the primary sort key, the 
 second being the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-997:
-

Attachment: SortedTable.patch

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772123#action_12772123
 ] 

Hadoop QA commented on PIG-1030:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423708/PIG-1030.patch
  against trunk revision 831402.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/console

This message is automatically generated.

 explain and dump not working with two UDFs inside inner plan of foreach
 ---

 Key: PIG-1030
 URL: https://issues.apache.org/jira/browse/PIG-1030
 Project: Pig
  Issue Type: Bug
Reporter: Ying He
Assignee: Richard Ding
 Attachments: PIG-1030.patch, PIG-1030.patch


 This script does not work:
 register /homes/yinghe/owl/string.jar;
 a = load '/user/yinghe/a.txt' as (id, color);
 b = group a all;
 c = foreach b {
 d = distinct a.color;
 generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
 }
 The UDFs are regular, not algebraic.
 Then, if I call "dump c;" or "explain c;", I get this error message:
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan 
 with single leaf. Found 2 leaves.
 The error only occurs the first time; after getting this error, if I call 
 "dump c;" or "explain c;" again, it succeeds.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode

2009-10-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772144#action_12772144
 ] 

Alan Gates commented on PIG-1053:
-

For testing purposes we could simply change Main to tell PigContext the mode is 
MapReduce, even when the user selects local mode.  Assuming there are no 
configuration files in the classpath, this will result in using Hadoop in 
local mode.

However, for a real fix, we need to make sure that when the user says "-x 
local", Hadoop's LocalJobRunner and the local file system are chosen even if 
there are configuration files in the classpath.  I believe this would be 
accomplished by changing PigContext so that in local mode it still connects to 
MR and HDFS, but does so with an empty Properties object rather than the one 
that is passed in.  This would affect connect, init, setJobTrackerLocation, and 
perhaps other calls.
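For illustration, a minimal sketch that pins Hadoop's local mode regardless of 
what configuration files are on the classpath, by setting the two selector 
properties directly (a stand-in for the empty-Properties approach described 
above; the property names and their "local" semantics are Hadoop's standard 
ones):
{code}
import org.apache.hadoop.mapred.JobConf;

// Sketch only, not the actual Pig fix: override the two properties that choose
// the job runner and the file system, so cluster values picked up from any
// hadoop-site.xml on the classpath are ignored.
public class ForcedLocalMode {
    static JobConf localConf() {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "local");  // "local" selects Hadoop's LocalJobRunner
        conf.set("fs.default.name", "file:///");  // local file system instead of HDFS
        return conf;
    }
}
{code}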



 Consider moving to Hadoop for local mode
 

 Key: PIG-1053
 URL: https://issues.apache.org/jira/browse/PIG-1053
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates

 We need to consider moving Pig to use Hadoop's local mode instead of its own.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1053) Consider moving to Hadoop for local mode

2009-10-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1053:
---

Assignee: Ankit Modi

 Consider moving to Hadoop for local mode
 

 Key: PIG-1053
 URL: https://issues.apache.org/jira/browse/PIG-1053
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Ankit Modi

 We need to consider moving Pig to use Hadoop's local mode instead of its own.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1048:


   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Sri for fixing this.

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: pig_1048.patch


 {code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 grunt> l1 = load 'students.txt';
 grunt> l2 = load 'students.txt';
 grunt> j = join l1 by $0, l2 by $0 ;
 grunt> store j into 'tmp.txt'
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1036) Fragment-replicate left outer join

2009-10-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772157#action_12772157
 ] 

Pradeep Kamath commented on PIG-1036:
-

In the unit tests in TestFRJoin, there is a check made for the output tuples 
using hashmaps and also using TestHelper.compareBags() - are both required?

In QueryParser.jjt we currently have:
{code}
// in the case of outer joins, only two
// inputs are allowed
isOuter = (isLeftOuter || isRightOuter || isFullOuter);
if(isOuter && gis.size() > 2) {
    throw new ParseException("(left|right|full) outer joins are only supported for two inputs");
}
{code}

I think left outer join should only be supported for 2-way FR joins - it looks 
like the code supports >= 2 inputs:
{code}
// This condition may not reach, as a join has more than one side
if( innerFlags.length >= 2 ) {
    isLeftOuter = !innerFlags[1];
}
{code}

In the code below, TupleFactory.getInstance() can be replaced with 
mTupleFactory. Also, there should be a check that the second input
has a schema, and we should rely on that to determine how many nulls are needed. 
Left outer join should only be supported for the case
where the second input has a schema (assuming we only support 2-way FR join), 
to be consistent with all our implementations of left join.
{code}
Tuple nullTuple = TupleFactory.getInstance().newTuple(iter.next().get(0).size());
{code}

Why is an array of Bags (nullBags) used? Should this be nullTuples, since what 
we really want is just 1 null tuple - also, if we only
support 2-way left outer join, this would just be a nullTuple instead of an 
array.
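A minimal sketch of the suggestion above (a hypothetical helper, not the 
patch itself): build the null tuple once from the second input's declared 
schema instead of sizing it from the first tuple seen at runtime.
{code}
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Hypothetical helper: TupleFactory.newTuple(n) returns an n-field tuple whose
// fields are all null, which is what the unmatched side of a left outer join needs.
Tuple makeNullTuple(TupleFactory mTupleFactory, Schema rightSchema) {
    return mTupleFactory.newTuple(rightSchema.size());
}
{code}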



 Fragment-replicate left outer join
 --

 Key: PIG-1036
 URL: https://issues.apache.org/jira/browse/PIG-1036
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Attachments: LeftOuterFRJoin.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1006) FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples

2009-10-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1006.
-

Resolution: Fixed

already resolved

 FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples
 ---

 Key: PIG-1006
 URL: https://issues.apache.org/jira/browse/PIG-1006
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich

 Eq    org.apache.pig.data.DistinctDataBag$DistinctDataBagIterator$TContainer 
 defines compareTo(DistinctDataBag$DistinctDataBagIterator$TContainer) and 
 uses Object.equals()
 Eq    org.apache.pig.data.SingleTupleBag defines compareTo(Object) and uses 
 Object.equals()
 Eq    org.apache.pig.data.SortedDataBag$SortedDataBagIterator$PQContainer 
 defines compareTo(SortedDataBag$SortedDataBagIterator$PQContainer) and uses 
 Object.equals()
 Eq    org.apache.pig.data.TargetedTuple defines compareTo(Object) and uses 
 Object.equals()
 Eq    org.apache.pig.pen.util.ExampleTuple defines compareTo(Object) and uses 
 Object.equals()

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1058) FINDBUGS: remaining Correctness Warnings

2009-10-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1058:


Status: Patch Available  (was: Open)

This patch resolves all the remaining findbugs issues

 FINDBUGS: remaining Correctness Warnings
 --

 Key: PIG-1058
 URL: https://issues.apache.org/jira/browse/PIG-1058
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1058.patch


 BC    Impossible cast from java.lang.Object[] to java.lang.String[] in 
 org.apache.pig.PigServer.listPaths(String)
 EC    Call to equals() comparing different types in 
 org.apache.pig.impl.plan.Operator.equals(Object)
 GC    java.lang.Byte is incompatible with expected argument type 
 java.lang.Integer in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator$LoRearrangeDiscoverer.visitLocalRearrange(POLocalRearrange)
 IL    There is an apparent infinite recursive loop in 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POCogroup$groupComparator.equals(Object)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.bsR(int)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 MF    Field ConstantExpression.res masks field in superclass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.NoopStoreRemover$PhysicalRemover.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 NP    Possible null pointer dereference of ? in 
 org.apache.pig.impl.logicalLayer.optimizer.PushDownForeachFlatten.check(List)
 NP    Possible null pointer dereference of lo in 
 org.apache.pig.impl.logicalLayer.optimizer.StreamOptimizer.transform(List)
 NP    Possible null pointer dereference of 
 Schema$FieldSchema.Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema, Schema, 
 boolean, boolean)
 NP    Possible null pointer dereference of Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.equals(Schema$FieldSchema,
 Schema$FieldSchema, boolean, boolean)
 NP    Possible null pointer dereference of inp in 
 org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run()
 RCN   Nullcheck of pigContext at line 123 of value previously dereferenced in 
 org.apache.pig.impl.util.JarManager.createJar(OutputStream, List, PigContext)
 RV    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.fixUpDomain(String,
 Properties) ignores return value of java.net.InetAddress.getByName(String)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.getPartition(PigNullableWritable,
 Writable, int)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.impl.plan.DotPlanDumper.getID(Operator)
 UwF   Field only ever set to null: 
 org.apache.pig.impl.builtin.MergeJoinIndexer.dummyTuple

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1058) FINDBUGS: remaining Correctness Warnings

2009-10-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1058:


Attachment: PIG-1058.patch

 FINDBUGS: remaining Correctness Warnings
 --

 Key: PIG-1058
 URL: https://issues.apache.org/jira/browse/PIG-1058
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1058.patch


 BC    Impossible cast from java.lang.Object[] to java.lang.String[] in 
 org.apache.pig.PigServer.listPaths(String)
 EC    Call to equals() comparing different types in 
 org.apache.pig.impl.plan.Operator.equals(Object)
 GC    java.lang.Byte is incompatible with expected argument type 
 java.lang.Integer in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator$LoRearrangeDiscoverer.visitLocalRearrange(POLocalRearrange)
 IL    There is an apparent infinite recursive loop in 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POCogroup$groupComparator.equals(Object)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.bsR(int)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 MF    Field ConstantExpression.res masks field in superclass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.NoopStoreRemover$PhysicalRemover.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 NP    Possible null pointer dereference of ? in 
 org.apache.pig.impl.logicalLayer.optimizer.PushDownForeachFlatten.check(List)
 NP    Possible null pointer dereference of lo in 
 org.apache.pig.impl.logicalLayer.optimizer.StreamOptimizer.transform(List)
 NP    Possible null pointer dereference of 
 Schema$FieldSchema.Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema, Schema, 
 boolean, boolean)
 NP    Possible null pointer dereference of Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.equals(Schema$FieldSchema,
 Schema$FieldSchema, boolean, boolean)
 NP    Possible null pointer dereference of inp in 
 org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run()
 RCN   Nullcheck of pigContext at line 123 of value previously dereferenced in 
 org.apache.pig.impl.util.JarManager.createJar(OutputStream, List, PigContext)
 RV    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.fixUpDomain(String,
 Properties) ignores return value of java.net.InetAddress.getByName(String)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.getPartition(PigNullableWritable,
 Writable, int)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.impl.plan.DotPlanDumper.getID(Operator)
 UwF   Field only ever set to null: 
 org.apache.pig.impl.builtin.MergeJoinIndexer.dummyTuple

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1058) FINDBUGS: remaining Correctness Warnings

2009-10-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772175#action_12772175
 ] 

Daniel Dai commented on PIG-1058:
-

+1

 FINDBUGS: remaining Correctness Warnings
 --

 Key: PIG-1058
 URL: https://issues.apache.org/jira/browse/PIG-1058
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1058.patch


 BC    Impossible cast from java.lang.Object[] to java.lang.String[] in 
 org.apache.pig.PigServer.listPaths(String)
 EC    Call to equals() comparing different types in 
 org.apache.pig.impl.plan.Operator.equals(Object)
 GC    java.lang.Byte is incompatible with expected argument type 
 java.lang.Integer in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator$LoRearrangeDiscoverer.visitLocalRearrange(POLocalRearrange)
 IL    There is an apparent infinite recursive loop in 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POCogroup$groupComparator.equals(Object)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.bsR(int)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 MF    Field ConstantExpression.res masks field in superclass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.NoopStoreRemover$PhysicalRemover.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 NP    Possible null pointer dereference of ? in 
 org.apache.pig.impl.logicalLayer.optimizer.PushDownForeachFlatten.check(List)
 NP    Possible null pointer dereference of lo in 
 org.apache.pig.impl.logicalLayer.optimizer.StreamOptimizer.transform(List)
 NP    Possible null pointer dereference of 
 Schema$FieldSchema.Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema, Schema, 
 boolean, boolean)
 NP    Possible null pointer dereference of Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.equals(Schema$FieldSchema,
 Schema$FieldSchema, boolean, boolean)
 NP    Possible null pointer dereference of inp in 
 org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run()
 RCN   Nullcheck of pigContext at line 123 of value previously dereferenced in 
 org.apache.pig.impl.util.JarManager.createJar(OutputStream, List, PigContext)
 RV    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.fixUpDomain(String,
 Properties) ignores return value of java.net.InetAddress.getByName(String)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.getPartition(PigNullableWritable,
 Writable, int)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.impl.plan.DotPlanDumper.getID(Operator)
 UwF   Field only ever set to null: 
 org.apache.pig.impl.builtin.MergeJoinIndexer.dummyTuple

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Gaurav Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772177#action_12772177
 ] 

Gaurav Jain commented on PIG-997:
-

Reviewed

+1

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1026) [zebra] map split returns null

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1026:
--

Attachment: PIG_1026.patch

 [zebra] map split returns null
 --

 Key: PIG-1026
 URL: https://issues.apache.org/jira/browse/PIG-1026
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: PIG_1026.patch


 Here is the test scenario:
  final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
   //final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
  final static String STR_STORAGE = "[m1#{a}, m2#{x}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1,m2]";
 projection: String projection2 = new String("m1#{b}, m2#{x|z}");
 The user got a null pointer exception on reading m1#{b}.
 Yan, please refer to the test class:
 TestNonDefaultWholeMapSplit.java 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1026) [zebra] map split returns null

2009-10-30 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772192#action_12772192
 ] 

Yan Zhou commented on PIG-1026:
---

Another problem is that during STORE, when the handleMapSplit method is called 
to set up the CG schema mapping with the table schema, an incremented index is 
used as the column index into the CG schemas. This works OK if the columns in 
the CG schemas are in the same order as in the table schema, but is wrong if 
the orders differ, causing no values to be stored in some MAP columns in any 
CG; hence, during LOAD, getValue() returns null.
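
The fix this implies, sketched with hypothetical names (Zebra's actual 
structures differ): resolve each column into its column group by name rather 
than by a running index.
{code}
import java.util.List;

// Hypothetical sketch: look the column up by name within the column group (CG)
// schema, so the mapping stays correct even when a CG orders its columns
// differently from the table schema.
int resolveCgIndex(String columnName, List<String> cgColumns) {
    int idx = cgColumns.indexOf(columnName);   // position within this CG only
    if (idx < 0) {
        throw new IllegalStateException(columnName + " is not stored in this column group");
    }
    return idx;
}
{code}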

 [zebra] map split returns null
 --

 Key: PIG-1026
 URL: https://issues.apache.org/jira/browse/PIG-1026
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: PIG_1026.patch


 Here is the test scenario:
  final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
   //final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
  final static String STR_STORAGE = "[m1#{a}, m2#{x}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1,m2]";
 projection: String projection2 = new String("m1#{b}, m2#{x|z}");
 The user got a null pointer exception on reading m1#{b}.
 Yan, please refer to the test class:
 TestNonDefaultWholeMapSplit.java 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-477) passing properties from command line to the backend

2009-10-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-477.


Resolution: Duplicate

Marking as duplicate of PIG-602

 passing properties from command line to the backend
 ---

 Key: PIG-477
 URL: https://issues.apache.org/jira/browse/PIG-477
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich

 We have users that would like to be able to pass parameters from the command 
 line to their UDFs.
 A natural way to do that would be to pass them as properties from the client 
 to the compute nodes and make them available through System.getProperties on 
 the backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1069) Order Preserving Sorted Table Union

2009-10-30 Thread Yan Zhou (JIRA)
Order Preserving Sorted Table Union
---

 Key: PIG-1069
 URL: https://issues.apache.org/jira/browse/PIG-1069
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou


The output schema adopts schema-union semantics: if an output column appears in 
only one component table, the result rows carry the column's value when they 
come from that component table and null otherwise. If an output column appears 
in multiple component tables, the types of the column in all the component 
tables must be identical; otherwise, an exception is thrown. The result rows 
carry the column's value when they come from a component table that has the 
column, and null otherwise. For example, if one component table has columns 
(a, b) and another has (a, c), the union exposes (a, b, c), with b null for 
rows from the second table and c null for rows from the first.

The order-preserving sort-unioned results can additionally be indexed by 
component table if the projection contains column(s) named "source_table". If 
so specified, the component table index is output at the position(s) specified 
in the projection list. If the underlying table is not a union of sorted 
tables, use of this special column name in a projection causes an exception to 
be thrown.

If an attempt is made to create a table with a column named "source_table", an 
exception is thrown, as the name is reserved by Zebra for this virtual column.
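
A minimal sketch of the schema-union rule, with column types modeled as plain 
strings (illustrative only; not Zebra's schema classes):
{code}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merge component-table schemas (column name -> type). A column
// seen in several components must keep a single type; a column seen in only one
// component is carried through and is null for rows from the other components.
static Map<String, String> unionSchema(List<Map<String, String>> components) {
    Map<String, String> result = new LinkedHashMap<String, String>();
    for (Map<String, String> schema : components) {
        for (Map.Entry<String, String> col : schema.entrySet()) {
            String prev = result.get(col.getKey());
            if (prev == null) {
                result.put(col.getKey(), col.getValue());
            } else if (!prev.equals(col.getValue())) {
                // mirrors the exception described above for mismatched column types
                throw new IllegalArgumentException("type mismatch for column " + col.getKey());
            }
        }
    }
    return result;
}
{code}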



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-10-30 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772197#action_12772197
 ] 

Thejas M Nair commented on PIG-1062:


Dmitriy,
I had overlooked the fact that the input file's size is also being used to 
calculate the number of samples. Thanks for pointing it out.

I don't know if there are any problems in using counters directly, as long as 
the information is required only after the (first map-reduce) sampling phase, 
i.e. it could be used in PartitionSkewedKey().

The logic in PoissonSampleLoader.computeSamples is as follows (a detailed 
explanation will be added soon to the sampler wiki page). The goal is to 
sample all keys from the first input that will need to be partitioned across 
multiple reducers in the join phase.
Let us assume X tuples fit into the available memory in a reducer, and say we 
want to take 10 samples in each set of X tuples, with 95% confidence. Using 
Poisson distribution formulas, we arrive at 17 as the number of tuples to be 
sampled every X tuples. (I don't know why the Poisson distribution is the 
appropriate choice.)

The total number of tuples to be sampled cannot be calculated without knowing 
the total number of tuples. But what we do know is that we should sample one 
tuple every (X/17) tuples. To calculate X, we need the average in-memory size 
of a tuple. Using the process memory usage is unlikely to give a good 
approximation of that, because (as per my understanding) calling the garbage 
collector is not guaranteed to free the memory used by all unused objects. 
Tuple.getMemorySize() can be used to get an estimate of the memory used by a 
tuple, and the average size can be estimated/corrected as we sample more 
tuples.
That is, PoissonSampleLoader.getNext() will return every (H/s)-th tuple in the 
input (using H and s from the previous comment).

In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples 
and the sample rate (H/s) can be used to estimate the total number of tuples.

WeightedRangePartitioner.setConf is another function using fileSize(); that 
needs to change as well. I haven't looked at that yet.
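
A rough sketch of that interval computation (all names here are assumed; only 
the constant 17 and Tuple.getMemorySize() come from the comment above):
{code}
import org.apache.pig.data.Tuple;

// Assumed helper, not the actual PoissonSampleLoader code: keep a running average
// of in-memory tuple size and emit one sample every (X/17) tuples, where X is the
// estimated number of tuples that fit in reducer memory.
class SampleInterval {
    private static final int SAMPLES_PER_MEMORY_FULL = 17;  // from the Poisson argument
    private long seen = 0;
    private long totalTupleBytes = 0;

    boolean shouldSample(Tuple t, long reducerMemoryBytes) {
        totalTupleBytes += t.getMemorySize();            // running size estimate
        seen++;
        long avgTupleBytes = Math.max(1, totalTupleBytes / seen);
        long x = reducerMemoryBytes / avgTupleBytes;     // tuples that fit in memory
        long interval = Math.max(1, x / SAMPLES_PER_MEMORY_FULL);
        return seen % interval == 0;                     // every (X/17)-th tuple
    }
}
{code}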

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and its subclasses - RandomSampleLoader and PoissonSampleLoader - 
 need to be changed to work with the new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1058) FINDBUGS: remaining Correctness Warnings

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772210#action_12772210
 ] 

Hadoop QA commented on PIG-1058:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423734/PIG-1058.patch
  against trunk revision 831481.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/console

This message is automatically generated.

 FINDBUGS: remaining Correctness Warnings
 --

 Key: PIG-1058
 URL: https://issues.apache.org/jira/browse/PIG-1058
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1058.patch


 BC    Impossible cast from java.lang.Object[] to java.lang.String[] in 
 org.apache.pig.PigServer.listPaths(String)
 EC    Call to equals() comparing different types in 
 org.apache.pig.impl.plan.Operator.equals(Object)
 GC    java.lang.Byte is incompatible with expected argument type 
 java.lang.Integer in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator$LoRearrangeDiscoverer.visitLocalRearrange(POLocalRearrange)
 IL    There is an apparent infinite recursive loop in 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POCogroup$groupComparator.equals(Object)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.bsR(int)
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 INT   Bad comparison of nonnegative value with -1 in 
 org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
 MF    Field ConstantExpression.res masks field in superclass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 Nm    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.NoopStoreRemover$PhysicalRemover.visitSplit(POSplit)
 doesn't override method in superclass because parameter type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit
 doesn't match superclass parameter type 
 org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
 NP    Possible null pointer dereference of ? in 
 org.apache.pig.impl.logicalLayer.optimizer.PushDownForeachFlatten.check(List)
 NP    Possible null pointer dereference of lo in 
 org.apache.pig.impl.logicalLayer.optimizer.StreamOptimizer.transform(List)
 NP    Possible null pointer dereference of 
 Schema$FieldSchema.Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema, Schema, 
 boolean, boolean)
 NP    Possible null pointer dereference of Schema$FieldSchema.alias in 
 org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.equals(Schema$FieldSchema,
 Schema$FieldSchema, boolean, boolean)
 NP    Possible null pointer dereference of inp in 
 org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run()
 RCN   Nullcheck of pigContext at line 123 of value previously dereferenced in 
 org.apache.pig.impl.util.JarManager.createJar(OutputStream, List, PigContext)
 RV    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.fixUpDomain(String,
 Properties) ignores return value of java.net.InetAddress.getByName(String)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.getPartition(PigNullableWritable,
 Writable, int)
 RV    Bad attempt to compute absolute value of signed 32-bit hashcode in 
 org.apache.pig.impl.plan.DotPlanDumper.getID(Operator)
 UwF   Field only ever set to null: 
 org.apache.pig.impl.builtin.MergeJoinIndexer.dummyTuple

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772219#action_12772219
 ] 

Hadoop QA commented on PIG-997:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423724/SortedTable.patch
  against trunk revision 831481.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 173 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 355 release audit warnings 
(more than the trunk's current 337 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/console

This message is automatically generated.

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Patch Available  (was: Open)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035new.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Open  (was: Patch Available)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035new.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-997:
-

Status: Open  (was: Patch Available)

Will resubmit the patch with the 18 missing license headers added.

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-997:
-

Attachment: SortedTable.patch

Add missing Apache license headers to 18 new source files

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra

2009-10-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-997:
-

Status: Patch Available  (was: Open)

 [zebra] Sorted Table Support by Zebra
 -

 Key: PIG-997
 URL: https://issues.apache.org/jira/browse/PIG-997
 Project: Pig
  Issue Type: New Feature
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: SortedTable.patch, SortedTable.patch, SortedTable.patch


 This new feature is for Zebra to support sorted data in storage. As a storage 
 library, Zebra will not sort the data by itself, but it will support creation 
 and use of sorted data either through PIG or through map/reduce tasks that 
 use Zebra as the storage format.
 The sorted table keeps the data in a totally sorted manner across all 
 TFiles created by potentially all mappers or reducers.
 For sorted data creation through PIG's STORE operator, if the input data is 
 sorted through ORDER BY, the new Zebra table will be marked as sorted on 
 the sorted columns.
 For sorted data creation through Map/Reduce tasks, three new static methods 
 of the BasicTableOutput class will be provided to allow or help the user to 
 achieve the goal: setSortInfo allows the user to specify the sorted columns 
 of the input tuple to be stored; getSortKeyGenerator and getSortKey help 
 the user generate a key acceptable to Zebra as a sort key, based upon 
 the schema, the sorted columns, and the input tuple.
 For sorted data read through PIG's LOAD operator, pass the string "sorted" as an 
 extra argument to the TableLoader constructor to ask for a sorted table to be 
 loaded.
 For sorted data read through Map/Reduce tasks, a new static method of the 
 TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
 table to be read. Additionally, an overloaded version of the new method can 
 be called to ask for a sorted table on specified sort columns and comparator.
 For this release, sorted tables only support sorting in ascending order, not 
 in descending order. In addition, the sort keys must be of simple types, not 
 complex types such as RECORD, COLLECTION and MAP.
 Multiple-key sorting is supported, but the ordering of the sort keys is 
 significant: the first sort column is the primary sort key, the second is 
 the secondary sort key, etc.
 In this release, the sort keys are stored along with the sort columns from 
 which they were originally created, resulting in some data storage 
 redundancy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.