[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case
[ https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771846#action_12771846 ] Hadoop QA commented on PIG-1063: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423638/PIG-1063.patch against trunk revision 831169. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 199 javac compiler warnings (more than the trunk's current 198 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/131/console This message is automatically generated. Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case -- Key: PIG-1063 URL: https://issues.apache.org/jira/browse/PIG-1063 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-1063.patch A StoreFunc implementation can inform pig of an OutputFormat it uses through the getStoragePreparationClass() method. In a query with multiple stores which gets optimized into a single mapred job, Pig does not call the checkOutputSpecs() method on the outputformat. 
An example of such a script is:
{noformat}
a = load 'input.txt';
b = filter a by $0 > 10;
store b into 'output1' using StoreWithOutputFormat();
c = group a by $0;
d = foreach c generate group, COUNT(a.$0);
store d into 'output2' using StoreWithOutputFormat();
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
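For context, the fix described here amounts to validating the output spec of every store, not just one, before the merged job launches. A minimal sketch of that per-store validation loop, using invented stand-in names rather than Pig's or Hadoop's actual classes:

```python
# Illustrative stand-ins only: these are NOT Pig's or Hadoop's real classes.
# The point is the loop -- one checkOutputSpecs()-style call per STORE, even
# when the multi-query optimizer merges all stores into a single job.

class OutputSpecError(Exception):
    pass

def check_output_specs(existing_dirs, location):
    """Mimics an OutputFormat that refuses to overwrite an existing directory."""
    if location in existing_dirs:
        raise OutputSpecError("Output directory %s already exists" % location)

def validate_all_stores(existing_dirs, store_locations):
    # Called once, up front, before the combined map-reduce job is submitted.
    for loc in store_locations:
        check_output_specs(existing_dirs, loc)

# 'output1' already exists, so validation should fail before the job runs:
try:
    validate_all_stores({"output1"}, ["output1", "output2"])
except OutputSpecError as e:
    print(e)  # Output directory output1 already exists
```

Without the loop, only the first store's output format would be consulted, which is exactly the multistore bug this issue tracks.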
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Attachment: 1035.patch The attached patch contains modifications to support outer skewed join. It follows the same semantics as regular join. Some of the code used by regular join has been moved to a common file, CompilerUtils, and is used by both. support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Patch Available (was: Open) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771878#action_12771878 ] Hadoop QA commented on PIG-1048: +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423658/pig_1048.patch against trunk revision 831169. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/35/console This message is automatically generated. 
inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch
{code}
grunt> cat students.txt
asdfxc M 23 12.44
qwer F 21 14.44
uhsdf M 34 12.11
zxldf M 21 12.56
qwer F 23 145.5
oiue M 54 23.33
l1 = load 'students.txt';
l2 = load 'students.txt';
j = join l1 by $0, l2 by $0 ;
store j into 'tmp.txt'
grunt> cat tmp.txt
oiue M 54 23.33 oiue M 54 23.33
oiue M 54 23.33 oiue M 54 23.33
qwer F 21 14.44 qwer F 21 14.44
qwer F 21 14.44 qwer F 23 145.5
qwer F 23 145.5 qwer F 21 14.44
qwer F 23 145.5 qwer F 23 145.5
uhsdf M 34 12.11 uhsdf M 34 12.11
uhsdf M 34 12.11 uhsdf M 34 12.11
zxldf M 21 12.56 zxldf M 21 12.56
zxldf M 21 12.56 zxldf M 21 12.56
asdfxc M 23 12.44 asdfxc M 23 12.44
asdfxc M 23 12.44 asdfxc M 23 12.44
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771913#action_12771913 ] Hadoop QA commented on PIG-1035: +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423670/1035.patch against trunk revision 831169. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/132/console This message is automatically generated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null
ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null Key: PIG-1066 URL: https://issues.apache.org/jira/browse/PIG-1066 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.4.0 Reporter: Bogdan Dorohonceanu
-- load the QID_CT_QP20 data
x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS (unstem_qid:chararray, jid_score_pairs:chararray);
DESCRIBE x;
--ILLUSTRATE x;
-- load the ID_RQ data
y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, query:chararray);
-- force parallelization
-- y1 = ORDER y0 BY sid PARALLEL $NUM;
-- compute unstem_qid
DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) OUTPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', '$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt');
y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, unstem_qid:chararray);
DESCRIBE y;
--ILLUSTRATE y;
rmf /user/vega/zoom/y_debug
STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t');
2009-10-30 13:36:48,437 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
2009-10-30 13:36:48,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dd-9c32d04:8889
09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: dd-9c32d04:8889
2009-10-30 13:36:49,242 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. null
Details at logfile: /disk1/vega/zoom/pig_1256909801304.log
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null
[ https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771948#action_12771948 ] Bogdan Dorohonceanu commented on PIG-1066: -- In the code above if I comment ILLUSTRATE, the code works fine. If I un-comment it, grunt gets an error. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null
[ https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771951#action_12771951 ] Bogdan Dorohonceanu commented on PIG-1066: -- Actually, it looks like ILLUSTRATE crashes when used on a relation created with STREAM ... THROUGH -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case
[ https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771985#action_12771985 ] Pradeep Kamath commented on PIG-1063: - The 1 javac warning is due to the deprecation warning explained in my previous comment. The unit test failure seems unrelated and looks like a temporary env. issue - resubmitting to check if the tests pass now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1067) [Zebra] to support pig projection push down in Zebra
[Zebra] to support pig projection push down in Zebra Key: PIG-1067 URL: https://issues.apache.org/jira/browse/PIG-1067 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Pig tries to determine which fields in a query script file will be needed and passes that information to the load function, thereby optimizing the query by reducing the data to be loaded. To support this optimization, Zebra needs to implement fieldsToRead method in TableLoader class to utilize this information. This jira is for this new feature. For more information of this optimization on pig side, one can refer to jira: PIG-653 https://issues.apache.org/jira/browse/PIG-653 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
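As a rough illustration of the optimization (a generic sketch with invented names; the real hook is Zebra's TableLoader implementing fieldsToRead, and PIG-653 covers the Pig side): the frontend tells the loader which columns the script actually uses, and the loader materializes only those.

```python
# Generic sketch of projection push-down; class and method names are invented
# and do not match Zebra's actual TableLoader API.

class PrunedLoader:
    def __init__(self):
        self.required = None  # None means "no pruning requested: load everything"

    def fields_to_read(self, field_indexes):
        # The frontend calls this once it knows which columns the script touches.
        self.required = sorted(field_indexes)

    def get_next(self, raw_row):
        # Materialize only the requested columns into the output tuple.
        if self.required is None:
            return list(raw_row)
        return [raw_row[i] for i in self.required]

loader = PrunedLoader()
loader.fields_to_read({0, 2})                       # script only uses $0 and $2
print(loader.get_next(("alice", "F", 23, 12.44)))   # ['alice', 23]
```

The win is on the read path: columns the script never references are never deserialized, which matters for a column-oriented store like Zebra.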
[jira] Updated: (PIG-1040) FINDBUGS: MS_SHOULD_BE_FINAL: Field isn't final but should be
[ https://issues.apache.org/jira/browse/PIG-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1040: Resolution: Fixed Status: Resolved (was: Patch Available) patch committed FINDBUGS: MS_SHOULD_BE_FINAL: Field isn't final but should be - Key: PIG-1040 URL: https://issues.apache.org/jira/browse/PIG-1040 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1040.patch
MS org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.USER_COMPARATOR_MARKER isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.weightedParts isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce.sJobConf isn't final and can't be protected from malicious code
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.bagFactory isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.reporter isn't final and can't be protected from malicious code
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.pigLogger should be package protected
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyBag isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyBool isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyDBA isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyDouble isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyFloat isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyInt isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyLong isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyMap isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyString isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.dummyTuple isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.mTupleFactory isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.mTupleFactory isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.mBagFactory isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.mTupleFactory isn't final but should be
MS org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.mTupleFactory isn't final but should be
MS org.apache.pig.builtin.PigDump.recordDelimiter isn't final but should be
MS org.apache.pig.impl.builtin.GFCross.DEFAULT_PARALLELISM isn't final but should be
MS org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.classloader isn't final and can't be protected from malicious code
MS org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.mOpToCloneMap should be package protected
MS org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.canonicalNamer isn't final but should be
MS org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.castLookup isn't final but should be
MS org.apache.pig.impl.plan.OperatorPlan.log isn't final but should be
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772017#action_12772017 ] Dmitriy V. Ryaboy commented on PIG-1062: I have ResourceStats hooked up to LogicalOperators already, need to port the code to the new branch. This will let us take statistics, if they are available, and pass them into the PoissonSampleLoader at initialization time, so it can get the number of tuples and avg tuple size directly from Stats. That being said, statistics may not always be available... Before I go into the more fanciful suggestion below -- perhaps a simple hack will do. We have counters in Hadoop. Any reason we can't just read bytes read in map, records read in map, bytes written in map, records written in map counters directly? If I am overlooking something obvious, here's the ignore counters suggestion: If my understanding is correct, in PoissonSampleLoader we are interested in the average size of a tuple more than # of tuples -- # of tuples is just used as a way of crudely estimating avg size of tuple on disk, which is in turn used to crudely estimate the size of tuple in memory. The estimate is likely to be very off, by the way, if we are not loading from BinStorage, but from arbitrary loadFuncs, as the underlying data, even if it is a file, might be compressed. Perhaps we can get the average tuple size directly, instead? We could get that in the mappers of the sampling job by recording memory usage at the first getNext() call, forcing garbage collection, buffering up K tuples, and getting memory usage again. 
We now have the following variables available to each sampling mapper in the SkewedPartitioner:
* sample rate S (for the appropriate Poisson distribution)
* total # of mappers, M
* available heap size on the reducer, H
* estimated avg size of tuple, s
The number of tuples we want to sample is then simply T = max(10, S*H/(s*M)). In getNext(), we can now allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, we generate a random number r s.t. 0 <= r <= i, and if r < T we insert the new tuple into our buffer at position r. This gives us a nicely random sample of the tuples in the partition. So this gets around the need for file size info on that side. Now, PartitionSkewedKey uses the file size / avg_tuple_disk_size to estimate total number of tuples, and uses this estimate, plus the ratio of instances of a given key in the sample to the total sample size, to predict the total number of records with a given key in the input. But given the number of sampled tuples, and the sample rate, couldn't we calculate the total number of records in the original file by simply reversing the formula for determining the number of tuples to sample? If we do this, no need to append any metadata. Lastly, if we do want to move around metadata such as number of records in input, etc., and we don't want to use Hadoop counters, we should extend BinStorage with ResourceStats serialization, and use ResourceStatistics for this. Even if the original data might not have stats, there is no reason we can't generate these basic counts at runtime for the data we write ourselves. 
-D load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface --- Key: PIG-1062 URL: https://issues.apache.org/jira/browse/PIG-1062 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to be changed to work with new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
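The buffer scheme described in the comment above is reservoir sampling. A sketch under the comment's assumptions (the numeric inputs below are hypothetical, and this is not Pig's actual PoissonSampleLoader code):

```python
import random

def sample_size(S, H, s, M):
    # T = max(10, S*H/(s*M)), as in the comment above.
    return max(10, int(S * H / (s * M)))

def reservoir_sample(tuples, T, seed=42):
    rng = random.Random(seed)
    buf = []
    for i, t in enumerate(tuples):
        if i < T:
            buf.append(t)              # first T tuples fill the buffer
        else:
            r = rng.randrange(i + 1)   # random r with 0 <= r <= i
            if r < T:
                buf[r] = t             # replace a random buffered tuple
    return buf

# Hypothetical numbers: 0.1% sample rate, 512 MB reducer heap,
# 100-byte tuples, 50 mappers.
T = sample_size(S=0.001, H=512 * 2**20, s=100, M=50)
print(T)  # 107
sample = reservoir_sample(range(1000), T)
print(len(sample))  # 107
```

Since the mapper also counts how many tuples it scanned, it can report an estimated total record count by reversing the same formula, which is the point the comment makes about not needing appended metadata.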
[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach
[ https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1030: -- Attachment: PIG-1030.patch Add returns as suggested. explain and dump not working with two UDFs inside inner plan of foreach --- Key: PIG-1030 URL: https://issues.apache.org/jira/browse/PIG-1030 Project: Pig Issue Type: Bug Reporter: Ying He Assignee: Richard Ding Attachments: PIG-1030.patch, PIG-1030.patch This script does not work:
register /homes/yinghe/owl/string.jar;
a = load '/user/yinghe/a.txt' as (id, color);
b = group a all;
c = foreach b {
  d = distinct a.color;
  generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
}
The UDFs are regular, not algebraic. Then if I call dump c; or explain c, I get this error message: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with single leaf. Found 2 leaves. The error only occurs the first time; after getting this error, if I call dump c or explain c again, it succeeds. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach
[ https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1030: -- Status: Patch Available (was: Open) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-920) optimizing diamond queries
[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772066#action_12772066 ] Richard Ding commented on PIG-920: -- Add additional comments. optimizing diamond queries -- Key: PIG-920 URL: https://issues.apache.org/jira/browse/PIG-920 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Attachments: PIG-920.patch, PIG-920.patch The following query
A = load 'foo';
B = filter A by $0 > 1;
C = filter A by $1 == 'foo';
D = COGROUP C by $0, B by $0;
..
does not get executed efficiently. Currently, it runs a map-only job that basically reads and writes the same data before doing the query processing. A query where the data is loaded twice is actually executed more efficiently. This is not an uncommon query and we should fix this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-920) optimizing diamond queries
[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-920: - Attachment: PIG-920.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case
[ https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772068#action_12772068 ] Hadoop QA commented on PIG-1063: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423638/PIG-1063.patch against trunk revision 831169. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 199 javac compiler warnings (more than the trunk's current 198 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/133/console This message is automatically generated. Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case -- Key: PIG-1063 URL: https://issues.apache.org/jira/browse/PIG-1063 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-1063.patch A StoreFunc implementation can inform pig of an OutputFormat it uses through the getStoragePreparationClass() method. In a query with multiple stores which gets optimized into a single mapred job, Pig does not call the checkOutputSpecs() method on the outputformat. 
An example of such a script is:
{noformat}
a = load 'input.txt';
b = filter a by $0 > 10;
store b into 'output1' using StoreWithOutputFormat();
c = group a by $0;
d = foreach c generate group, COUNT(a.$0);
store d into 'output2' using StoreWithOutputFormat();
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
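The fix described here amounts to validating every store's output before the merged multi-store job launches. Below is a minimal, self-contained Java sketch of that idea; the names (CheckableOutputFormat, DirExistsCheckingFormat, MultiStoreSpecCheck) are illustrative stand-ins, not Pig's or Hadoop's actual classes.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for Hadoop's OutputFormat.checkOutputSpecs() contract.
interface CheckableOutputFormat {
    void checkOutputSpecs(String outputDir) throws IOException;
}

// Mirrors the usual Hadoop behavior: fail early if the output already exists.
class DirExistsCheckingFormat implements CheckableOutputFormat {
    private final List<String> existing;
    DirExistsCheckingFormat(List<String> existing) { this.existing = existing; }
    public void checkOutputSpecs(String outputDir) throws IOException {
        if (existing.contains(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }
}

public class MultiStoreSpecCheck {
    // The fix amounts to looping over every store in the merged job and
    // validating each output before the job is submitted.
    static void checkAllStores(List<CheckableOutputFormat> formats,
                               List<String> outputs) throws IOException {
        for (int i = 0; i < formats.size(); i++) {
            formats.get(i).checkOutputSpecs(outputs.get(i));
        }
    }

    public static void main(String[] args) {
        DirExistsCheckingFormat f =
            new DirExistsCheckingFormat(Arrays.asList("output2"));
        try {
            checkAllStores(Arrays.asList(f, f), Arrays.asList("output1", "output2"));
        } catch (IOException e) {
            // Without the per-store loop, this failure would only surface at runtime.
            System.out.println("spec check failed: " + e.getMessage());
        }
    }
}
```

The point of checking up front is that a bad second store fails the job before any map tasks run, instead of after the first store's output has already been written.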
[jira] Commented: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case
[ https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772072#action_12772072 ] Olga Natkovich commented on PIG-1063: +1; changes look good -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1063) Pig does not call checkOutSpecs() on OutputFormat provided by StoreFunc in the multistore case
[ https://issues.apache.org/jira/browse/PIG-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1063: Resolution: Fixed Fix Version/s: 0.6.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed to trunk -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1068) COGROUP fails with 'Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple'
COGROUP fails with 'Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple' --- Key: PIG-1068 URL: https://issues.apache.org/jira/browse/PIG-1068 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Vikram Oberoi The COGROUP in the following script fails in its map:
{code}
logs = LOAD '$LOGS' USING PigStorage() AS (ts:int, id:chararray, command:chararray, comments:chararray);
SPLIT logs INTO logins IF command == 'login', all_quits IF command == 'quit';

-- Project login clients and count them by ID.
login_info = FOREACH logins {
    GENERATE id as id, comments AS client;
};
logins_grouped = GROUP login_info BY (id, client);
count_logins_by_client = FOREACH logins_grouped {
    generate group.id AS id, group.client AS client, COUNT($1) AS count;
}

-- Get the first quit.
all_quits_grouped = GROUP all_quits BY id;
quits = FOREACH all_quits_grouped {
    ordered = ORDER all_quits BY ts ASC;
    last_quit =
[jira] Updated: (PIG-1068) COGROUP fails with 'Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableTuple'
[ https://issues.apache.org/jira/browse/PIG-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vikram Oberoi updated PIG-1068: --- Attachment: cogroup-bug.pig log Attached the script and some sample data.
[jira] Updated: (PIG-920) optimizing diamond queries
[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-920: --- Resolution: Fixed Fix Version/s: 0.6.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) +1, patch committed, thanks for the contribution Richard! optimizing diamond queries -- Key: PIG-920 URL: https://issues.apache.org/jira/browse/PIG-920 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.6.0 Attachments: PIG-920.patch, PIG-920.patch The following query: A = load 'foo'; B = filter A by $0 > 1; C = filter A by $1 == 'foo'; D = COGROUP C by $0, B by $0; ... does not get executed efficiently. Currently, it runs a map-only job that basically reads and writes the same data before doing the query processing. A query where the data is loaded twice is actually executed more efficiently. This is not an uncommon query and we should fix this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1038: Component/s: impl Description: If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.
Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';
We can specify a secondary sort on A.$1, and drop "order A by $1".
Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';
We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = distinct D" to a special version of distinct, which does not do the sorting.
was: Since the data coming to the reducer is sorted on group+distinct, we don't need to see all distinct values at once
Affects Version/s: 0.4.0 Fix Version/s: 0.6.0 Summary: Optimize nested distinct/sort to use secondary key (was: stream nested distinct for in case of accumulate interface)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772085#action_12772085 ] Daniel Dai commented on PIG-1038: Here is the design for this optimization:
1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will:
   1.1 Discover whether sort/distinct is used in a nested foreach plan.
   1.2 For the first such sort/distinct, use the sort/distinct key as the secondary key.
   1.3 Once SecondaryKeyOptimizer discovers a secondary key, it will call POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct.
2. Change POLocalRearrange:
   2.1 Add setSecondaryPlan to provide a way for SecondaryKeyOptimizer to set the secondary plan.
   2.2 Change constructLROutput to make a compound key, which is a tuple: (key, secondaryKey).
   2.3 Duplicate the logic that strips the key from the values for the secondary key as well.
3. Change POPackageAnnotator to patch POPackage with the keyinfo from both key and secondaryKey.
4. Change POPackage to stitch the secondary key back onto the value.
5. Change MapReduceOper to indicate that the map-reduce operator needs a secondary key; JobControlCompiler will set OutputValueGroupingComparator to use the main-key comparator.
6. Add a main-key comparator, which inherits PigNullableWritable and compares only the main key. We need that for the OutputValueGroupingComparator.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
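The core idea of steps 2.2 and 5-6 can be sketched in plain Java without Hadoop: sort on the compound (key, secondaryKey) pair, but group only on the main key, so each group's values arrive already ordered and no in-memory SortedDataBag is needed. The names here are illustrative, not Pig's actual classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {
    // A record with a main key and a secondary key, mimicking the compound
    // (key, secondaryKey) tuple emitted by constructLROutput in the design.
    record Rec(String key, int secondary) {}

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>(List.of(
            new Rec("b", 3), new Rec("a", 2), new Rec("b", 1), new Rec("a", 5)));

        // Full sort comparator: orders on the compound key, which is what the
        // framework's shuffle/sort phase would use.
        recs.sort(Comparator.comparing((Rec r) -> r.key())
                            .thenComparingInt(r -> r.secondary()));

        // Grouping-comparator analogue (OutputValueGroupingComparator):
        // consecutive records with the same main key form one reduce group,
        // and within the group the secondary values are already in order.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Rec r : recs) {
            groups.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r.secondary());
        }
        System.out.println(groups); // {a=[2, 5], b=[1, 3]}
    }
}
```

Because the values inside each group are pre-sorted, a nested "order by" can be dropped entirely, and a nested "distinct" degenerates to skipping adjacent duplicates.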
[jira] Assigned: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair reassigned PIG-1062: -- Assignee: Thejas M Nair load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface --- Key: PIG-1062 URL: https://issues.apache.org/jira/browse/PIG-1062 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair This is part of the effort to implement the new load/store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses - RandomSampleLoader and PoissonSampleLoader - need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772095#action_12772095 ] Pradeep Kamath commented on PIG-1035: - The unit test does not seem to check the results of the outer join - it would be good to add a check of the actual results. In fact, there are already outer join tests in TestJoin.java - you can just update those to also test skewed join, since those tests already check output correctness. In LogToPhyTranslationVisitor.java, in the following code, the return value of op.getSchema() should be checked for null, in which case the same Exception should be thrown:
{code}
849     try {
850         skj.addSchema(op.getSchema());
851     } catch (FrontendException e) {
852         int errCode = 2015;
853         String msg = "Couldn't set the schema for outer join";
854         throw new LogicalToPhysicalTranslatorException(msg, errCode, PigException.BUG, e);
855     }
{code}
With the above code, a schema is required for both inputs to the join. Strictly, for left and right outer joins, only the schema of the side where nulls need to be projected is needed. Only in a full outer join should both inputs have schemas - if possible, for left and right outer joins the restriction should be to require a schema only on the relevant input (for reference, left and right outer joins in the regular join do this). support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1057) [Zebra] Zebra does not support concurrent deletions of column groups now.
[ https://issues.apache.org/jira/browse/PIG-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1057: Resolution: Fixed Status: Resolved (was: Patch Available) Patch checked in. [Zebra] Zebra does not support concurrent deletions of column groups now. - Key: PIG-1057 URL: https://issues.apache.org/jira/browse/PIG-1057 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Attachments: patch_1057 Zebra does not support concurrent deletions of column groups now. As a result, the TestDropColumnGroup testcase can sometimes fail due to this. In this testcase, multiple threads will be launched together, with each one deleting one particular column group. The following exception can be thrown (with callstack):
{noformat}
java.io.FileNotFoundException: File /.../pig-trunk/build/contrib/zebra/test/data/DropCGTest/CG02 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:290)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:716)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:741)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
    at org.apache.hadoop.zebra.io.BasicTable$SchemaFile.setCGDeletedFlags(BasicTable.java:1610)
    at org.apache.hadoop.zebra.io.BasicTable$SchemaFile.readSchemaFile(BasicTable.java:1593)
    at org.apache.hadoop.zebra.io.BasicTable$SchemaFile.init(BasicTable.java:1416)
    at org.apache.hadoop.zebra.io.BasicTable.dropColumnGroup(BasicTable.java:133)
    at org.apache.hadoop.zebra.io.TestDropColumnGroup$DropThread.run(TestDropColumnGroup.java:772)
{noformat}
We plan to fix this in Zebra to support concurrent deletions of column groups.
The root cause is that a thread or process reads in some stale file system information (e.g., it sees /CG0 first) and can then fail later on (it tries to access /CG0, but /CG0 may have been deleted by another thread or process). Therefore, we plan to adopt retry logic to resolve this issue. In more detail, we allow a column-group-dropping thread to retry up to n times when doing its deletion - n being the total number of column groups. Note that here we do NOT try to resolve the more general concurrent column group deletions + reads issue. If a process is reading some data that could be deleted by another process, it can fail, as we expect. Here we only try to resolve the concurrent column group deletions issue: if you have multiple threads or processes deleting column groups, they should all succeed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
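The retry idea can be sketched as follows. This is a hedged, plain-Java illustration: dropWithRetry and DropAttempt are hypothetical stand-ins for Zebra's actual drop logic, under the assumption that a stale directory listing surfaces as an IOException.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDrop {
    // Functional interface standing in for one attempt to drop a column group.
    interface DropAttempt {
        void run() throws IOException;
    }

    // Retry up to n times (n = total number of column groups), since each
    // failed attempt can be caused by at most one concurrent deletion.
    static boolean dropWithRetry(int numColumnGroups, DropAttempt attempt) {
        for (int i = 0; i < numColumnGroups; i++) {
            try {
                attempt.run();
                return true;            // deletion succeeded
            } catch (IOException stale) {
                // Another thread deleted something under us; the file-system
                // view was stale. Re-read state and try again.
            }
        }
        return false;                   // exhausted all retries
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated attempt: fails twice with stale state, then succeeds.
        boolean ok = dropWithRetry(3, () -> {
            if (calls.incrementAndGet() < 3) throw new IOException("stale listing");
        });
        System.out.println(ok); // true
    }
}
```

Bounding the retries by the number of column groups matches the comment's reasoning: each retry can absorb at most one concurrent deletion, so n attempts suffice when only deleters are racing.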
[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-997: - Status: Patch Available (was: Open) [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch This new feature is for Zebra to support sorted data in storage. As a storage library, Zebra will not sort the data by itself, but it will support creation and use of sorted data either through Pig or through map/reduce tasks that use Zebra as the storage format. A sorted table keeps the data totally sorted across all TFiles created by potentially all mappers or reducers. For sorted data creation through Pig's STORE operator, if the input data is sorted through ORDER BY, the new Zebra table will be marked as sorted on the sorted columns. For sorted data creation through map/reduce tasks, three new static methods of the BasicTableOutput class will be provided to help the user achieve this: setSortInfo allows the user to specify the sorted columns of the input tuple to be stored; getSortKeyGenerator and getSortKey help the user generate a key acceptable to Zebra as a sort key, based upon the schema, the sorted columns, and the input tuple. For sorted data read through Pig's LOAD operator, pass the string "sorted" as an extra argument to the TableLoader constructor to ask for the sorted table to be loaded. For sorted data read through map/reduce tasks, a new static method of the TableInputFormat class, requireSortedTable, can be called to ask for a sorted table to be read; additionally, an overloaded version of the new method can be called to ask for a sorted table on specified sort columns and comparator. For this release, sorted tables support sorting in ascending order only, not in descending order.
In addition, the sort keys must be of simple types, not complex types such as RECORD, COLLECTION and MAP. Multiple-key sorting is supported, but the ordering of the multiple sort keys is significant, with the first sort column being the primary sort key, the second being the secondary sort key, etc. In this release, the sort keys are stored along with the sort columns from which the keys were originally created, resulting in some data storage redundancy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou reassigned PIG-997: Assignee: Yan Zhou -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-997: - Attachment: SortedTable.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach
[ https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772123#action_12772123 ] Hadoop QA commented on PIG-1030: +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423708/PIG-1030.patch against trunk revision 831402. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/36/console This message is automatically generated. explain and dump not working with two UDFs inside inner plan of foreach --- Key: PIG-1030 URL: https://issues.apache.org/jira/browse/PIG-1030 Project: Pig Issue Type: Bug Reporter: Ying He Assignee: Richard Ding Attachments: PIG-1030.patch, PIG-1030.patch This script does not work:
{code}
register /homes/yinghe/owl/string.jar;
a = load '/user/yinghe/a.txt' as (id, color);
b = group a all;
c = foreach b {
    d = distinct a.color;
    generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
}
{code}
The UDFs are regular, not algebraic. If I then call dump c; or explain c, I get this error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with single leaf. Found 2 leaves. The error only occurs the first time; after getting this error, if I call dump c or explain c again, it succeeds. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode
[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772144#action_12772144 ] Alan Gates commented on PIG-1053: - For testing purposes we could simply change Main to tell PigContext the mode is MapReduce, even when the user selects local mode. Assuming there are no configuration files in the classpath, this will result in using Hadoop in local mode. However, for a real fix, we need to make sure that when the user says -x local, Hadoop's LocalJobRunner and the local file system are chosen even if there are configuration files in the classpath. I believe this would be accomplished by changing PigContext so that, in local mode, it still connects to MR and HDFS, but does so with an empty Properties object rather than using the one that is passed in. This would affect connect, init, setJobTrackerLocation, and perhaps other calls. Consider moving to Hadoop for local mode Key: PIG-1053 URL: https://issues.apache.org/jira/browse/PIG-1053 Project: Pig Issue Type: Improvement Reporter: Alan Gates We need to consider moving Pig to use Hadoop's local mode instead of its own. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
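The suggested fix can be sketched with plain java.util.Properties. This is an illustrative stand-in, not PigContext's real code: connect is hypothetical, and the assumption is that with no job-tracker or file-system properties set, Hadoop falls back to LocalJobRunner and the local file system.

```java
import java.util.Properties;

public class LocalModeProps {
    // Sketch of the proposed fix: in local mode, ignore the user-supplied
    // properties when connecting, so cluster settings picked up from the
    // classpath cannot leak in.
    static String connect(boolean localMode, Properties userProps) {
        Properties effective = localMode ? new Properties() : userProps;
        // With these keys unset, Hadoop's defaults select the local runner
        // and the local file system.
        String tracker = effective.getProperty("mapred.job.tracker", "local");
        String fs = effective.getProperty("fs.default.name", "file:///");
        return tracker + " " + fs;
    }

    public static void main(String[] args) {
        Properties clusterProps = new Properties();
        // Simulates a cluster config found on the classpath.
        clusterProps.setProperty("mapred.job.tracker", "jt.example.com:8021");
        System.out.println(connect(true, clusterProps));   // local file:///
        System.out.println(connect(false, clusterProps));  // jt.example.com:8021 file:///
    }
}
```

The design choice mirrors the comment: rather than special-casing every configuration key, handing an empty Properties to the connection path makes -x local immune to whatever cluster configuration happens to be on the classpath.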
[jira] Assigned: (PIG-1053) Consider moving to Hadoop for local mode
[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1053: --- Assignee: Ankit Modi Consider moving to Hadoop for local mode Key: PIG-1053 URL: https://issues.apache.org/jira/browse/PIG-1053 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Ankit Modi We need to consider moving Pig to use Hadoop's local mode instead of its own. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1048: Resolution: Fixed Fix Version/s: 0.6.0 Status: Resolved (was: Patch Available) Patch checked in. Thanks Sri for fixing this. inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: pig_1048.patch
{code}
grunt> cat students.txt
asdfxc M 23 12.44
qwer F 21 14.44
uhsdf M 34 12.11
zxldf M 21 12.56
qwer F 23 145.5
oiue M 54 23.33

l1 = load 'students.txt';
l2 = load 'students.txt';
j = join l1 by $0, l2 by $0;
store j into 'tmp.txt'

grunt> cat tmp.txt
oiue M 54 23.33
oiue M 54 23.33
oiue M 54 23.33
oiue M 54 23.33
qwer F 21 14.44
qwer F 21 14.44
qwer F 21 14.44
qwer F 23 145.5
qwer F 23 145.5
qwer F 21 14.44
qwer F 23 145.5
qwer F 23 145.5
uhsdf M 34 12.11
uhsdf M 34 12.11
uhsdf M 34 12.11
uhsdf M 34 12.11
zxldf M 21 12.56
zxldf M 21 12.56
zxldf M 21 12.56
zxldf M 21 12.56
asdfxc M 23 12.44
asdfxc M 23 12.44
asdfxc M 23 12.44
asdfxc M 23 12.44
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1036) Fragment-replicate left outer join
[ https://issues.apache.org/jira/browse/PIG-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772157#action_12772157 ] Pradeep Kamath commented on PIG-1036: - In the unit tests in TestFRJoin, there is a check made for the output tuples using hashmaps and also using TestHelper.compareBags() - are both required? In QueryParser.jjt we currently have:
{code}
1974    // in the case of outer joins, only two
1975    // inputs are allowed
1976    isOuter = (isLeftOuter || isRightOuter || isFullOuter);
1977    if (isOuter && gis.size() > 2) {
1978        throw new ParseException("(left|right|full) outer joins are only supported for two inputs");
1979    }
{code}
I think left outer join should only be supported for 2-way FR joins - it looks like the code supports >= 2 inputs:
{code}
867     // This condition may not reach, as a join has more than one side
868     if (innerFlags.length >= 2) {
869         isLeftOuter = !innerFlags[1];
870     }
{code}
In the code below, TupleFactory.getInstance() can be replaced with mTupleFactory. Also, there should be checks to see if the second input has a schema, and we should rely on that to determine how many nulls are needed. Left outer join should only be supported for the case where the second input has a schema (assuming we only support 2-way FR join), to be consistent with all our implementations of left join.
{code}
209     Tuple nullTuple = TupleFactory.getInstance().newTuple(iter.next().get(0).size());
{code}
Why is an array of Bags (nullBags) used? Should this be nullTuples, since what we really want is just one null tuple - and if we only support 2-way left outer join, this would just be a nullTuple instead of an array. Fragment-replicate left outer join -- Key: PIG-1036 URL: https://issues.apache.org/jira/browse/PIG-1036 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Ankit Modi Attachments: LeftOuterFRJoin.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
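The null-padding the review asks for (sizing the null tuple from the second input's schema) can be sketched in plain Java. Lists stand in for Pig tuples, and all names here are illustrative, not Pig's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class NullPadSketch {
    // Left outer join against a replicated (in-memory) right side: when the
    // right side has no match, emit the left tuple joined with a single
    // reusable tuple of nulls whose size comes from the right side's schema.
    static List<List<Object>> leftOuter(List<List<Object>> left,
                                        Map<Object, List<Object>> right,
                                        int rightSchemaSize) {
        // One null tuple, built once from the schema size - the review's
        // point that an array of bags is unnecessary.
        List<Object> nullTuple =
            new ArrayList<>(Collections.nCopies(rightSchemaSize, null));
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> l : left) {
            // Join on the first field; pad with nulls on a miss.
            List<Object> match = right.getOrDefault(l.get(0), nullTuple);
            List<Object> joined = new ArrayList<>(l);
            joined.addAll(match);
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> left = List.of(List.of(1, "a"), List.of(2, "b"));
        Map<Object, List<Object>> right = Map.of(1, List.of(1, "x"));
        System.out.println(leftOuter(left, right, 2));
        // [[1, a, 1, x], [2, b, null, null]]
    }
}
```

This also shows why the second input's schema is required for a left outer FR join: without rightSchemaSize there is no way to know how many null fields to emit for an unmatched left tuple.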
[jira] Resolved: (PIG-1006) FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples
[ https://issues.apache.org/jira/browse/PIG-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1006. - Resolution: Fixed already resolved FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples --- Key: PIG-1006 URL: https://issues.apache.org/jira/browse/PIG-1006 Project: Pig Issue Type: Bug Reporter: Olga Natkovich
Eq org.apache.pig.data.DistinctDataBag$DistinctDataBagIterator$TContainer defines compareTo(DistinctDataBag$DistinctDataBagIterator$TContainer) and uses Object.equals()
Eq org.apache.pig.data.SingleTupleBag defines compareTo(Object) and uses Object.equals()
Eq org.apache.pig.data.SortedDataBag$SortedDataBagIterator$PQContainer defines compareTo(SortedDataBag$SortedDataBagIterator$PQContainer) and uses Object.equals()
Eq org.apache.pig.data.TargetedTuple defines compareTo(Object) and uses Object.equals()
Eq org.apache.pig.pen.util.ExampleTuple defines compareTo(Object) and uses Object.equals()
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1058) FINDBUGS: remaining Correctness Warnings
[ https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1058: Status: Patch Available (was: Open) This patch resolves all the remaining findbugs issues FINDBUGS: remaining Correctness Warnings -- Key: PIG-1058 URL: https://issues.apache.org/jira/browse/PIG-1058 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1058.patch
BC Impossible cast from java.lang.Object[] to java.lang.String[] in org.apache.pig.PigServer.listPaths(String)
EC Call to equals() comparing different types in org.apache.pig.impl.plan.Operator.equals(Object)
GC java.lang.Byte is incompatible with expected argument type java.lang.Integer in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator$LoRearrangeDiscoverer.visitLocalRearrange(POLocalRearrange)
IL There is an apparent infinite recursive loop in org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POCogroup$groupComparator.equals(Object)
INT Bad comparison of nonnegative value with -1 in org.apache.tools.bzip2r.CBZip2InputStream.bsR(int)
INT Bad comparison of nonnegative value with -1 in org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
INT Bad comparison of nonnegative value with -1 in org.apache.tools.bzip2r.CBZip2InputStream.getAndMoveToFrontDecode()
MF Field ConstantExpression.res masks field in superclass org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
Nm org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitSplit(POSplit) doesn't override method in superclass because parameter type org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit doesn't match superclass parameter type org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
Nm org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.NoopStoreRemover$PhysicalRemover.visitSplit(POSplit) doesn't override method in superclass because parameter type org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit doesn't match superclass parameter type org.apache.pig.backend.local.executionengine.physicalLayer.relationalOperators.POSplit
NP Possible null pointer dereference of ? in org.apache.pig.impl.logicalLayer.optimizer.PushDownForeachFlatten.check(List)
NP Possible null pointer dereference of lo in org.apache.pig.impl.logicalLayer.optimizer.StreamOptimizer.transform(List)
NP Possible null pointer dereference of Schema$FieldSchema.Schema$FieldSchema.alias in org.apache.pig.impl.logicalLayer.schema.Schema.equals(Schema, Schema, boolean, boolean)
NP Possible null pointer dereference of Schema$FieldSchema.alias in org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema.equals(Schema$FieldSchema, Schema$FieldSchema, boolean, boolean)
NP Possible null pointer dereference of inp in org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run()
RCN Nullcheck of pigContext at line 123 of value previously dereferenced in org.apache.pig.impl.util.JarManager.createJar(OutputStream, List, PigContext)
RV org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.fixUpDomain(String, Properties) ignores return value of java.net.InetAddress.getByName(String)
RV Bad attempt to compute absolute value of signed 32-bit hashcode in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.SkewedPartitioner.getPartition(PigNullableWritable, Writable, int)
RV Bad attempt to compute absolute value of signed 32-bit hashcode in org.apache.pig.impl.plan.DotPlanDumper.getID(Operator)
UwF Field only ever set to null: org.apache.pig.impl.builtin.MergeJoinIndexer.dummyTuple
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1058) FINDBUGS: remaining Correctness Warnings
[ https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1058: Attachment: PIG-1058.patch FINDBUGS: remaining Correctness Warnings -- Key: PIG-1058 URL: https://issues.apache.org/jira/browse/PIG-1058 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1058.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1058) FINDBUGS: remaining Correctness Warnings
[ https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772175#action_12772175 ] Daniel Dai commented on PIG-1058: - +1 FINDBUGS: remaining Correctness Warnings -- Key: PIG-1058 URL: https://issues.apache.org/jira/browse/PIG-1058 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1058.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772177#action_12772177 ] Gaurav Jain commented on PIG-997: - Reviewed +1 [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch This new feature is for Zebra to support sorted data in storage. As a storage library, Zebra will not sort the data by itself. But it will support creation and use of sorted data either through PIG or through map/reduce tasks that use Zebra as the storage format. The sorted table keeps the data in a totally sorted manner across all TFiles created by potentially all mappers or reducers. For sorted data creation through PIG's STORE operator, if the input data is sorted through ORDER BY, the new Zebra table will be marked as sorted on the sorted columns. For sorted data creation through Map/Reduce tasks, three new static methods of the BasicTableOutput class will be provided to allow or help the user to achieve the goal: setSortInfo allows the user to specify the sorted columns of the input tuple to be stored; getSortKeyGenerator and getSortKey help the user to generate a key acceptable to Zebra as a sort key based upon the schema, the sorted columns, and the input tuple. For sorted data read through PIG's LOAD operator, pass the string "sorted" as an extra argument to the TableLoader constructor to ask for a sorted table to be loaded. For sorted data read through Map/Reduce tasks, a new static method of the TableInputFormat class, requireSortedTable, can be called to ask for a sorted table to be read. Additionally, an overloaded version of the new method can be called to ask for a sorted table on specified sort columns and comparator. For this release, sorted tables only support sorting in ascending order, not in descending order.
In addition, the sort keys must be of simple types, not complex types such as RECORD, COLLECTION, and MAP. Multiple-key sorting is supported, but the ordering of the multiple sort keys is significant, with the first sort column being the primary sort key, the second being the secondary sort key, etc. In this release, the sort keys are stored along with the sort columns the keys were originally created from, resulting in some data storage redundancy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
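The multi-key ordering described above (first sort column primary, second secondary, etc.) can be illustrated with a small sketch. This is not the Zebra API - the signatures of setSortInfo/getSortKeyGenerator/getSortKey are not given in the description, so nothing here claims to reproduce them; `CompositeKeySketch` and its tuple-as-`Object[]` model are purely illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: multi-key ordering over tuples (Object[]), where
// sortCols[0] is the primary sort key, sortCols[1] the secondary, etc.
// This is NOT the Zebra sort-key API, just a demonstration of the
// ordering semantics described in PIG-997.
public class CompositeKeySketch {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Comparator<Object[]> bySortColumns(final int[] sortCols) {
        return new Comparator<Object[]>() {
            public int compare(Object[] a, Object[] b) {
                // Compare field by field; the first differing sort column decides.
                for (int c : sortCols) {
                    int cmp = ((Comparable) a[c]).compareTo(b[c]);
                    if (cmp != 0) return cmp;
                }
                return 0;
            }
        };
    }

    public static void main(String[] args) {
        Object[][] rows = {
            {"b", 2, "y"},
            {"a", 2, "x"},
            {"a", 1, "z"},
        };
        // Sort on column 0 (primary key) then column 1 (secondary key).
        Arrays.sort(rows, bySortColumns(new int[]{0, 1}));
        for (Object[] r : rows) System.out.println(Arrays.toString(r));
    }
}
```

Note that only simple (Comparable) field types work here, which mirrors the release restriction above that sort keys must be of simple types.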
[jira] Updated: (PIG-1026) [zebra] map split returns null
[ https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1026: -- Attachment: PIG_1026.patch [zebra] map split returns null -- Key: PIG-1026 URL: https://issues.apache.org/jira/browse/PIG-1026 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Assignee: Yan Zhou Fix For: 0.6.0 Attachments: PIG_1026.patch Here is the test scenario:
{code}
final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
//final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
final static String STR_STORAGE = "[m1#{a}, m2#{x}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1,m2]";
// projection:
String projection2 = new String("m1#{b}, m2#{x|z}");
{code}
The user got a null pointer exception on reading m1#{b}. Yan, please refer to the test class: TestNonDefaultWholeMapSplit.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1026) [zebra] map split returns null
[ https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772192#action_12772192 ] Yan Zhou commented on PIG-1026: --- Another problem is that during STORE, when the handleMapSplit method is called to set up the CG schema mapping with the Table schema, an incremented index was used as the column index into the CG schemas. This works fine if the columns in the CG schemas are in the same order as in the table schema, but will be wrong if the orders are different, causing no values to be stored in some MAP columns in any CG and hence, during LOAD, getValue() returns null. [zebra] map split returns null -- Key: PIG-1026 URL: https://issues.apache.org/jira/browse/PIG-1026 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Assignee: Yan Zhou Fix For: 0.6.0 Attachments: PIG_1026.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
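The indexing bug Yan describes above can be shown generically: when a column group (CG) lists columns in a different order than the table schema, a simply incremented index points at the wrong table column, whereas a lookup by column name stays correct. This is a hypothetical illustration, not Zebra's actual handleMapSplit code; `CgMappingSketch` and its string-list schemas are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the PIG-1026 bug: mapping CG columns to
// table-schema positions. An incremented index {0, 1, ...} is only
// correct when the CG and table orders agree; looking the column up by
// name is correct regardless of order.
public class CgMappingSketch {
    // Correct: find each CG column's index in the table schema by name.
    static int[] mapByName(List<String> tableSchema, List<String> cgSchema) {
        int[] map = new int[cgSchema.size()];
        for (int i = 0; i < cgSchema.size(); i++) {
            map[i] = tableSchema.indexOf(cgSchema.get(i));
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("m1", "m2", "m3");
        List<String> cg = Arrays.asList("m2", "m1"); // different order than the table
        // A buggy incremented index would yield {0, 1}; name lookup yields {1, 0}.
        System.out.println(Arrays.toString(mapByName(table, cg))); // [1, 0]
    }
}
```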
[jira] Resolved: (PIG-477) passing properties from command line to the backend
[ https://issues.apache.org/jira/browse/PIG-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-477. Resolution: Duplicate Marking as duplicate of PIG-602 passing properties from command line to the backend --- Key: PIG-477 URL: https://issues.apache.org/jira/browse/PIG-477 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich We have users that would like to be able to pass parameters from the command line to their UDFs. A natural way to do that would be to pass them as properties from the client to the compute node and make them available through System.getProperties on the backend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1069) Order Preserving Sorted Table Union
Order Preserving Sorted Table Union --- Key: PIG-1069 URL: https://issues.apache.org/jira/browse/PIG-1069 Project: Pig Issue Type: New Feature Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou The output schema will adopt schema-union semantics: if an output column only appears in one component table, the result rows will have the values of the column if the rows are from that component table, and null otherwise; on the other hand, if an output column appears in multiple component tables, the types of the column in all the component tables must be identical, otherwise an exception will be thrown. The result rows will have the values of the column if the rows are from the component tables that have the column themselves, or null otherwise. The order-preserving sort-unioned results can be further indexed by the component tables if the projection contains column(s) named source_table. If so specified, the component table index will be output at the position(s) specified in the projection list. If the underlying table is not a union of sorted tables, use of the special column name in a projection will cause an exception to be thrown. If an attempt is made to create a table with a column named source_table, an exception will be thrown, as the name is reserved by Zebra for the virtual column. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
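The schema-union semantics described in PIG-1069 above can be sketched as a merge of column-to-type maps: a column present in several component tables must have one identical type everywhere, and rows from tables lacking a column get null there. A minimal sketch, assuming hypothetical names (`SchemaUnionSketch`, string type names) rather than Zebra's real schema classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of schema-union semantics: merge per-table
// column->type maps, rejecting any column whose type differs between
// component tables (the case PIG-1069 says must raise an exception).
public class SchemaUnionSketch {
    static Map<String, String> union(List<Map<String, String>> schemas) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map<String, String> s : schemas) {
            for (Map.Entry<String, String> e : s.entrySet()) {
                String prev = out.put(e.getKey(), e.getValue());
                if (prev != null && !prev.equals(e.getValue())) {
                    throw new IllegalArgumentException(
                        "type conflict on column " + e.getKey());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> a = new LinkedHashMap<String, String>();
        a.put("k", "int"); a.put("v1", "string");
        Map<String, String> b = new LinkedHashMap<String, String>();
        b.put("k", "int"); b.put("v2", "float");
        List<Map<String, String>> both = new ArrayList<Map<String, String>>();
        both.add(a); both.add(b);
        // v1 exists only in table a and v2 only in table b; rows from the
        // other table would carry null in those columns.
        System.out.println(union(both).keySet()); // [k, v1, v2]
    }
}
```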
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772197#action_12772197 ] Thejas M Nair commented on PIG-1062: Dmitriy, I had overlooked the fact that the input size of the file is also being used to calculate the number of samples. Thanks for pointing it out. I don't know if there are any problems in using counters directly, as long as the information is required only after the (first mapreduce) sampling phase, i.e. it could be used in PartitionSkewedKey(). The logic in PoissonSampleLoader.computeSamples is as follows (a detailed explanation will be added soon to the sampler wiki page). The goal is to sample all keys from the first input that will need to be partitioned across multiple reducers in the join phase. Let us assume X tuples fit into the available memory in the reducer. Let's say we want 10 samples in each set of X tuples, with 95% confidence. Using Poisson distribution formulas, we arrive at the number 17 as the number of tuples to be sampled for every X tuples. (I don't know why the Poisson distribution is the appropriate choice.) The total number of tuples to be sampled cannot be calculated without knowing the total number of tuples. But what we know is that we should sample one tuple every (X/17) tuples. To calculate X, we need the average size of a tuple in memory. Using the process memory usage is unlikely to give a good approximation of that, because (as per my understanding) calling the garbage collector is not guaranteed to free memory used by all unused objects. Tuple.getMemorySize() can be used to get an estimate of the memory used by the tuple. The average size could be estimated/corrected as we sample more tuples. That is, PoissonSampleLoader.getNext() will return every H/s-th tuple in the input (using H, s from the previous comment). In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples and the sample rate (H/s) can be used to estimate the total number of tuples.
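The sampling arithmetic above can be sketched as follows. The constant 17 and the use of a running average of Tuple.getMemorySize() come from the comment itself; the class and method names here (`SampleIntervalSketch`, `observe`, `sampleInterval`) are hypothetical, not PoissonSampleLoader's actual API.

```java
// Sketch of the sampling-interval arithmetic from the PIG-1062 comment:
// with X tuples fitting in reducer memory and 17 samples wanted per X
// tuples, one tuple is sampled every X/17 tuples. X is estimated from a
// running average of per-tuple memory size (the role the comment gives
// to Tuple.getMemorySize()). All names here are hypothetical.
public class SampleIntervalSketch {
    static final int SAMPLES_PER_MEMORY_BLOCK = 17; // from the comment

    private long totalBytes = 0;
    private long tupleCount = 0;

    // Fold one observed tuple size into the running average.
    void observe(long tupleMemorySize) {
        totalBytes += tupleMemorySize;
        tupleCount++;
    }

    // Interval between samples: X / 17, where X = availableMemory / avgTupleSize.
    long sampleInterval(long availableMemoryBytes) {
        long avg = Math.max(1, totalBytes / Math.max(1, tupleCount));
        long x = availableMemoryBytes / avg;              // tuples fitting in memory
        return Math.max(1, x / SAMPLES_PER_MEMORY_BLOCK); // sample every x/17 tuples
    }

    public static void main(String[] args) {
        SampleIntervalSketch s = new SampleIntervalSketch();
        s.observe(100);
        s.observe(300); // running average tuple size is now 200 bytes
        // 1 MB of memory -> X = 5242 tuples -> sample every 5242/17 = 308 tuples.
        System.out.println(s.sampleInterval(1 << 20)); // 308
    }
}
```

Because the average is corrected as more tuples are observed, the interval naturally adapts over the input, matching the "estimated/corrected as we sample more tuples" idea above.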
WeightedRangePartitioner.setConf is another function using fileSize(). That needs to change as well. I haven't looked at that yet. load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface --- Key: PIG-1062 URL: https://issues.apache.org/jira/browse/PIG-1062 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to be changed to work with new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1058) FINDBUGS: remaining Correctness Warnings
[ https://issues.apache.org/jira/browse/PIG-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772210#action_12772210 ] Hadoop QA commented on PIG-1058: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423734/PIG-1058.patch against trunk revision 831481. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/37/console This message is automatically generated. 
FINDBUGS: remaining Correctness Warnings -- Key: PIG-1058 URL: https://issues.apache.org/jira/browse/PIG-1058 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1058.patch
[jira] Commented: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772219#action_12772219 ] Hadoop QA commented on PIG-997: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423724/SortedTable.patch against trunk revision 831481. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 173 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 355 release audit warnings (more than the trunk's current 337 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/134/console This message is automatically generated. [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Patch Available (was: Open) support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035new.patch Similar to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Open (was: Patch Available) support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035new.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-997: - Status: Open (was: Patch Available) will resubmit patch with 18 missing release notes [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-997: - Attachment: SortedTable.patch Add missing APACHE license agreements in 18 new source files [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch, SortedTable.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-997) [zebra] Sorted Table Support by Zebra
[ https://issues.apache.org/jira/browse/PIG-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-997: - Status: Patch Available (was: Open) [zebra] Sorted Table Support by Zebra - Key: PIG-997 URL: https://issues.apache.org/jira/browse/PIG-997 Project: Pig Issue Type: New Feature Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.6.0 Attachments: SortedTable.patch, SortedTable.patch, SortedTable.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.