[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: hbase-0.18.1-test.jar hbase-0.20.0.jar Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: Pig_HBase_0.20.0.patch Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: zookeeper-hbase-1329.jar Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: (was: hbase-0.18.1-test.jar) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Attachments: build.xml.path, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned PIG-970: -- Assignee: Jeff Zhang Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: (was: hbase-0.20.0-test.jar) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: (was: hbase-0.20.0.jar) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: (was: zookeeper-hbase-1329.jar) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: (was: Pig_HBase_0.20.0.patch) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: hbase-0.20.0-test.jar hbase-0.20.0.jar Pig_HBase_0.20.0.patch Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Attachment: zookeeper-hbase-1329.jar Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772461#action_12772461 ]
Jeff Zhang commented on PIG-970:

Vincent, I do not know how TestHBaseStorage passes with your patch. HBase 0.20 integrates ZooKeeper, so TestHBaseStorage has to be updated accordingly. I have submitted a patch that includes the source code and jars. One tricky point: MiniZookeeperCluster's client port is 21810, which is hard-coded in the source, while ZooKeeper's default client port is 2181. I have therefore attached an hbase-site.xml that overrides the ZooKeeper client port so that it matches MiniZookeeperCluster.

Support of HBase 0.20.0
---
Key: PIG-970
URL: https://issues.apache.org/jira/browse/PIG-970
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
Fix For: 0.5.0
Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar

The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0.
[jira] Updated: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-970: --- Tags: hbase Fix Version/s: 0.5.0 Status: Patch Available (was: Open) Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Fix For: 0.5.0 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772464#action_12772464 ] Hadoop QA commented on PIG-970: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423811/zookeeper-hbase-1329.jar against trunk revision 831481. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 92 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/136/console This message is automatically generated. Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Fix For: 0.5.0 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772465#action_12772465 ]
Jeff Zhang commented on PIG-970:

This patch works on my machine, but it seems I do not have permission to put the jars into Pig trunk. Could anyone help validate the patch on Pig trunk? Thank you in advance.

Support of HBase 0.20.0
---
Key: PIG-970
URL: https://issues.apache.org/jira/browse/PIG-970
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
Fix For: 0.5.0
Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar

The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0.
Definition of equality of bags
I could not find any documentation (in the Pig Latin manual) on how equality of bags is defined (or how it should be defined). Does the order of tuples in a bag matter? The definition of a bag does not imply any ordering. This has implications for the definition of join/cogroup/group on bags.

Thanks,
Thejas
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772562#action_12772562 ]
Daniel Dai commented on PIG-1038:

Hi Ashutosh, I will look into POForeach, find the first nested sort or distinct, and use its key as the secondary sort key for this map-reduce job, so that I can remove or simplify the nested sort/distinct. Yes, we definitely need a framework for the map-reduce layer as well. We will work on that, and we welcome any suggestions and comments.

Optimize nested distinct/sort to use secondary key
---
Key: PIG-1038
URL: https://issues.apache.org/jira/browse/PIG-1038
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
Fix For: 0.6.0

If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Eg1:
    A = load 'mydata';
    B = group A by $0;
    C = foreach B {
        D = order A by $1;
        generate group, D;
    }
    store C into 'myresult';
We can specify a secondary sort on A.$1, and drop order A by $1.

Eg2:
    A = load 'mydata';
    B = group A by $0;
    C = foreach B {
        D = A.$1;
        E = distinct D;
        generate group, E;
    }
    store C into 'myresult';
We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct D to a special version of distinct, which does not do the sorting.
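The optimization described in PIG-1038 can be illustrated outside Pig. What follows is a minimal Python sketch, not Pig's implementation: it assumes records are (group key, value) pairs and that the shuffle hands them to the reducer sorted on the composite key (group key, value), which is the guarantee a Hadoop secondary sort gives. All function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Simulate a shuffle that sorts on (group key, secondary key)."""
    return sorted(records, key=itemgetter(0, 1))


def nested_order(records):
    """Eg1: values arrive already ordered, so no per-group sort is needed."""
    shuffled = shuffle_with_secondary_sort(records)
    return [(key, [value for _, value in group])
            for key, group in groupby(shuffled, key=itemgetter(0))]


def nested_distinct(records):
    """Eg2: duplicates are adjacent after the sort, so one streaming pass
    that compares each value with the previous one is enough."""
    result = []
    shuffled = shuffle_with_secondary_sort(records)
    for key, group in groupby(shuffled, key=itemgetter(0)):
        distinct = []
        for _, value in group:
            if not distinct or distinct[-1] != value:
                distinct.append(value)
        result.append((key, distinct))
    return result


records = [("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 2)]
ordered = nested_order(records)     # per-group values in sorted order
deduped = nested_distinct(records)  # per-group values with duplicates dropped
```

In this sketch neither function sorts or hashes inside a group; all ordering work is pushed into the (simulated) shuffle, which is the point of replacing SortedDataBag and DistinctDataBag.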
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772565#action_12772565 ]
Thejas M Nair commented on PIG-1062:

The use of fileSize() in WeightedRangePartitioner.setConf is alright; it is checking the size of an intermediate file.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
---
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
[jira] Created: (PIG-1070) Review Basics link broken under Getting Started
Review Basics link broken under Getting Started --- Key: PIG-1070 URL: https://issues.apache.org/jira/browse/PIG-1070 Project: Pig Issue Type: Bug Components: site Environment: Apple OS/X Safari Reporter: robert Cook Priority: Trivial The requested URL /pig/docs/r0.4.0/quickstart.html was not found on this server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Definition of equality of bags
It looks like join/cogroup/group is not defined on bags. I assume this is because equality on bags is not defined. It gives an error in map-reduce mode, but not in local mode. Since Pig is likely to drop its custom local-mode implementation and use Hadoop local mode, and that should fix the discrepancy, I am not filing a JIRA.

-Thejas

On 11/2/09 9:19 AM, Thejas Nair te...@yahoo-inc.com wrote:
> I could not find any documentation (in piglatin manual) on what the definition of equality of bags is (or what it should be), does the order of tuples in the bag matter ? But the definition of a bag does not imply any ordering. This has implication on the definition of join/cogroup/group on bags.
> Thanks,
> Thejas
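For reference, one possible order-insensitive definition of bag equality treats a bag as a multiset of tuples. The Python sketch below only illustrates that definition; it is not how Pig compares bags, and the name bags_equal is hypothetical. It assumes the tuples are hashable.

```python
from collections import Counter


def bags_equal(bag_a, bag_b):
    """Compare two bags as multisets: order is ignored, duplicates count."""
    return Counter(bag_a) == Counter(bag_b)


bag1 = [(1, "x"), (2, "y"), (1, "x")]
bag2 = [(2, "y"), (1, "x"), (1, "x")]  # same tuples, different order
bag3 = [(1, "x"), (2, "y")]            # one duplicate fewer

same_multiset = bags_equal(bag1, bag2)
different_counts = bags_equal(bag1, bag3)
```

Under this definition bag1 and bag2 are equal although they differ as ordered sequences, while bag1 and bag3 are not, because the multiplicity of (1, "x") differs. A join or group on bags would need some such definition to be fixed first.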
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772623#action_12772623 ]
Thejas M Nair commented on PIG-1062:

Even after the interface changes, Pig can compute the file size by adding up the size of each split (from InputSplit.getLength()). The documentation of that function in the interface does not make clear whether this is the size on disk, compressed or uncompressed, etc. Assuming it is the size on disk (uncompressed), estimating the total memory the data will require is a challenge: one has to make assumptions about the compression ratio and the serialization method. Using Tuple.getMemorySize() while sampling will give more accurate numbers for the memory the data will consume on the reducer.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
---
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
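The two estimation routes contrasted in this comment can be put side by side. The Python sketch below is only an illustration with made-up numbers: the split lengths stand in for values that InputSplit.getLength() would return, the per-tuple sizes stand in for values that Tuple.getMemorySize() would report during sampling, and the expansion factor is a pure guess. All function names are hypothetical.

```python
def total_input_bytes(split_lengths):
    """Sum the per-split lengths to get the total input size."""
    return sum(split_lengths)


def memory_from_disk_size(total_bytes, expansion_factor):
    """Guess the in-memory size from the on-disk size.

    expansion_factor bundles the assumptions about compression ratio and
    serialization overhead; it cannot be derived from the split lengths.
    """
    return total_bytes * expansion_factor


def memory_from_samples(sampled_tuple_bytes, total_tuples):
    """Scale the measured average in-memory tuple size to the whole input."""
    average = sum(sampled_tuple_bytes) / len(sampled_tuple_bytes)
    return average * total_tuples


split_lengths = [64_000_000, 64_000_000, 22_000_000]  # hypothetical splits
total_bytes = total_input_bytes(split_lengths)
guessed = memory_from_disk_size(total_bytes, 3.0)      # assumed factor
measured = memory_from_samples([120, 80, 100], 1_000_000)
```

The first estimate is only as good as the assumed factor, whereas the second is driven by sizes observed on real tuples, which is the argument the comment makes for measuring during sampling.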
[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach
[ https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-1030:
Resolution: Fixed
Fix Version/s: 0.6.0
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

+1, patch committed. Thanks, Richard!

explain and dump not working with two UDFs inside inner plan of foreach
---
Key: PIG-1030
URL: https://issues.apache.org/jira/browse/PIG-1030
Project: Pig
Issue Type: Bug
Reporter: Ying He
Assignee: Richard Ding
Fix For: 0.6.0
Attachments: PIG-1030.patch, PIG-1030.patch

This script does not work:
    register /homes/yinghe/owl/string.jar;
    a = load '/user/yinghe/a.txt' as (id, color);
    b = group a all;
    c = foreach b {
        d = distinct a.color;
        generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
    }
The UDFs are regular, not algebraic. If I then call dump c; or explain c, I get this error message:
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan with single leaf. Found 2 leaves.
The error only occurs the first time. After getting this error, if I call dump c or explain c again, it succeeds.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-1035:
Resolution: Fixed
Fix Version/s: 0.6.0
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

+1, patch committed. Thanks, Sri!

support for skewed outer join
---
Key: PIG-1035
URL: https://issues.apache.org/jira/browse/PIG-1035
Project: Pig
Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
Fix For: 0.6.0
Attachments: 1035new.patch

Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772704#action_12772704 ]
Thejas M Nair commented on PIG-1062:

As indicated in the previous comment, I am planning to go ahead with the [earlier proposal|https://issues.apache.org/jira/browse/PIG-1062?focusedCommentId=12772197page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12772197] . The current sample frequency would be one tuple every ((H/s) * (1/17)) tuples. In PartitionSkewedKey.exec(), the number of reducers for join key k1 can be computed as (no_of_samples(k1) / 17). But the accuracy of this calculation depends on the accuracy of the computed average tuple size (the s in (H/s) * (1/17)). Sending a special tuple with the number of rows in the split will likely lead to a more accurate estimate of the number of reducers required.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
---
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
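The arithmetic in this comment can be worked through with concrete numbers. The Python sketch below is an illustration, not the PartitionSkewedKey code: it assumes H is a memory budget in bytes (H is not defined in this message), s is the average tuple size in bytes, and that the reducer count is rounded up to at least one. The rounding and all names are assumptions.

```python
import math

SAMPLES_PER_MEMORY_FULL = 17  # the constant 17 from the comment


def sample_interval(heap_bytes, avg_tuple_bytes):
    """Tuples between two samples: (H / s) * (1 / 17)."""
    return (heap_bytes / avg_tuple_bytes) / SAMPLES_PER_MEMORY_FULL


def reducers_for_key(sample_count):
    """no_of_samples(k1) / 17, rounded up (rounding is an assumption)."""
    return max(1, math.ceil(sample_count / SAMPLES_PER_MEMORY_FULL))


heap_bytes = 170_000_000  # hypothetical H
avg_tuple_bytes = 100     # hypothetical s

interval = sample_interval(heap_bytes, avg_tuple_bytes)
# If the true tuple size is 100 bytes but s is estimated as 50, sampling
# becomes half as frequent and every key looks half as heavy as it is.
interval_with_bad_s = sample_interval(heap_bytes, 50)

reducers_small = reducers_for_key(5)
reducers_exact = reducers_for_key(34)
reducers_large = reducers_for_key(40)
```

With these numbers H/s is 1,700,000 tuples per memory-full, so one sample is taken every 100,000 tuples and 17 samples of a key correspond to one memory-full of that key. The second interval shows why an inaccurate s feeds directly into the reducer estimate.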
[jira] Updated: (PIG-1026) [zebra] map split returns null
[ https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chao Wang updated PIG-1026:

Patch reviewed. +1

[zebra] map split returns null
---
Key: PIG-1026
URL: https://issues.apache.org/jira/browse/PIG-1026
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Yan Zhou
Fix For: 0.6.0
Attachments: PIG_1026.patch

Here is the test scenario:
    final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
    //final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
    final static String STR_STORAGE = "[m1#{a}, m2#{x}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1,m2]";
Projection:
    String projection2 = new String("m1#{b}, m2#{x|z}");
The user got a null pointer exception on reading m1#{b}. Yan, please refer to the test class: TestNonDefaultWholeMapSplit.java
[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags
[ https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772772#action_12772772 ] Alan Gates commented on PIG-1037: - The difference is much more than switching from dumping one tuple at a time to multiple tuples. It is about how spilling is activated. In the past, spilling was passive; it was done when the JVM informed us that memory was getting low. This did not work well as the JVM only checks memory usage when it garbage collects. So by the time pig was notified of a low memory condition it was often too late. We often ran out of memory while trying to spill. Now instead, spilling is active. Pig sets aside a buffer for a bag to put its tuples in. For default bags, once this buffer is full any additional tuples are written to disk. For sorted or distinct bags, once the buffer is full it is sorted and dumped to disk, and new records go into the buffer. This particular patch only adds the change for sorted and distinct bags. PIG-975 contains the original patch for default bags. better memory layout and spill for sorted and distinct bags --- Key: PIG-1037 URL: https://issues.apache.org/jira/browse/PIG-1037 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ying He Fix For: 0.6.0 Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
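The active spilling scheme described for sorted bags can be sketched as an external merge sort. The Python code below is a minimal sketch under stated assumptions, not the code in the PIG-1037 patch: the buffer limit is counted in items rather than bytes, spilled runs are pickled to temporary files, and the class and method names are hypothetical.

```python
import heapq
import pickle
import tempfile


class SortedSpillBag:
    """Bag that sorts and spills its buffer whenever the buffer is full."""

    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit  # max items held in memory
        self.buffer = []
        self.spill_files = []             # one sorted run per file

    def add(self, item):
        # Active spilling: check the buffer on every add instead of
        # waiting for a low-memory notification from the runtime.
        if len(self.buffer) >= self.buffer_limit:
            self._spill()
        self.buffer.append(item)

    def _spill(self):
        run = tempfile.TemporaryFile()
        for item in sorted(self.buffer):
            pickle.dump(item, run)
        self.spill_files.append(run)
        self.buffer = []

    @staticmethod
    def _read_run(run):
        run.seek(0)
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    def __iter__(self):
        # Merge the sorted runs on disk with the sorted in-memory buffer.
        runs = [self._read_run(f) for f in self.spill_files]
        runs.append(iter(sorted(self.buffer)))
        return heapq.merge(*runs)


bag = SortedSpillBag(buffer_limit=3)
for value in [5, 1, 4, 2, 9, 3, 7]:
    bag.add(value)
spilled_runs = len(bag.spill_files)
merged = list(bag)
```

A distinct bag could follow the same pattern and additionally drop adjacent equal items while merging. The design point carried over from the comment is that the bag decides when to spill from its own buffer occupancy, so it never depends on the timing of garbage collection.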
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772774#action_12772774 ] Alan Gates commented on PIG-1038: - I agree that we need a framework for optimizations in the backend. I'm hoping we can reuse the framework from the front end. However, there's some cleanup we'd still like to do on the LogicalOptimizer before we use it as a template for a MapReduceOptimizer. But I agree that's where we need to go. Optimize nested distinct/sort to use secondary key -- Key: PIG-1038 URL: https://issues.apache.org/jira/browse/PIG-1038 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.6.0 If nested foreach plan contains sort/distinct, it is possible to use hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query. Eg1: A = load 'mydata'; B = group A by $0; C = foreach B { D = order A by $1; generate group, D; } store C into 'myresult'; We can specify a secondary sort on A.$1, and drop order A by $1. Eg2: A = load 'mydata'; B = group A by $0; C = foreach B { D = A.$1; E = distinct D; generate group, E; } store C into 'myresult'; We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct D to a special version of distinct, which does not do the sorting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-970) Support of HBase 0.20.0
[ https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772781#action_12772781 ] Alan Gates commented on PIG-970: Patch doesn't include binary files. I'll pull together the latest patch plus the jars and test it. Support of HBase 0.20.0 --- Key: PIG-970 URL: https://issues.apache.org/jira/browse/PIG-970 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Vincent BARAT Assignee: Jeff Zhang Fix For: 0.5.0 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar The support of HBase is currently very limited and restricted to HBase 0.18.0. Because the next releases of PIG will support Hadoop 0.20.0, they should also support HBase 0.20.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772797#action_12772797 ] Dmitriy V. Ryaboy commented on PIG-1062: The sampler (in this design) reads all the data, so number of records read is total number of records in dataset, and the number of records written is total number of samples. Same for bytes. The sampler produces a histogram file, which is then used by the join task -- so there is no reliance on counters there. load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface --- Key: PIG-1062 URL: https://issues.apache.org/jira/browse/PIG-1062 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to be changed to work with new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772807#action_12772807 ] Dmitriy V. Ryaboy commented on PIG-1062: Thejas: "sending a special tuple with number of rows in the split will likely lead to more accurate estimate of number of reducers required." You can get the same info from the counters without unnecessarily complicating tuple processing, imo. In fact, you can use (num bytes read / num records read) to get the old calculation, and not rely on the number of samples and local average size estimates. load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface --- Key: PIG-1062 URL: https://issues.apache.org/jira/browse/PIG-1062 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
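The counter-based calculation suggested in the comment above (num bytes read / num records read) can be sketched as follows. This is only an illustration in Python, not code from Pig or from any patch on this issue: the function name and the counter values are hypothetical, and in a real job the two inputs would be read from the job's counters.

```python
def average_record_size(bytes_read: int, records_read: int) -> float:
    """Estimate the average record size in bytes from two job-level totals.

    Both arguments are assumed to be totals for a job that scanned the
    whole input, as the sampler in this design does.
    """
    if records_read <= 0:
        raise ValueError("records_read must be positive")
    return bytes_read / records_read


# Hypothetical counter values, chosen only to make the arithmetic visible.
example_bytes_read = 1_000_000
example_records_read = 4_000
print(average_record_size(example_bytes_read, example_records_read))
```

Because the sampler reads the full dataset, this ratio is an average over all records rather than over the sampled subset.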
two-level access problem?
Could someone explain the nature of the two-level access problem referred to in the Load/Store redesign wiki and in the DataType code? Thanks, -D
[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags
[ https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772890#action_12772890 ] Ashutosh Chauhan commented on PIG-1037: --- Thanks for the explanation, Alan. better memory layout and spill for sorted and distinct bags --- Key: PIG-1037 URL: https://issues.apache.org/jira/browse/PIG-1037 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ying He Fix For: 0.6.0 Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3
[jira] Commented: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772925#action_12772925 ] Ankur commented on PIG-958: --- Can we have an update on this, please? Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where the type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution.
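The idea in the PIG-958 description, routing each output record by the value of a key field that is only known at run time, can be illustrated with a small Python sketch. It is not Pig's store function API and is not taken from the attached patches; the function name, the tuple layout, and the sample records are all hypothetical.

```python
from collections import defaultdict


def split_by_key(records, key_index):
    """Group records by the value found at position key_index.

    Returns a dict mapping each key value (as a string, so it could serve
    as a file or directory name) to the list of records carrying it, in
    their original order.
    """
    groups = defaultdict(list)
    for record in records:
        groups[str(record[key_index])].append(record)
    return dict(groups)


# Hypothetical output tuples; the first field acts as the key.
sample_records = [("click", 1), ("view", 2), ("click", 3)]
for key, rows in split_by_key(sample_records, 0).items():
    print(key, rows)
```

Unlike a fixed set of SPLIT branches, the set of groups here is discovered from the data, which is the case the issue description says a custom store function handles better.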