[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: hbase-0.18.1-test.jar
hbase-0.20.0.jar

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
 Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: Pig_HBase_0.20.0.patch

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
 Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: zookeeper-hbase-1329.jar

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
 Attachments: build.xml.path, hbase-0.18.1-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: (was: hbase-0.18.1-test.jar)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
 Attachments: build.xml.path, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Assigned: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang reassigned PIG-970:
--

Assignee: Jeff Zhang

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: (was: hbase-0.20.0-test.jar)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: (was: hbase-0.20.0.jar)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: (was: zookeeper-hbase-1329.jar)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: (was: Pig_HBase_0.20.0.patch)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: hbase-0.20.0-test.jar
hbase-0.20.0.jar
Pig_HBase_0.20.0.patch

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

Attachment: zookeeper-hbase-1329.jar

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772461#action_12772461
 ] 

Jeff Zhang commented on PIG-970:


Vincent, I do not know how TestHBaseStorage passes with your patch. HBase 0.20 
integrates ZooKeeper, so TestHBaseStorage has to be updated accordingly.

I have submitted a patch that includes the source code and jars. One tricky 
point: MiniZookeeperCluster's client port is 21810, which is hard-coded in the 
source, while ZooKeeper's default port is 2181. I therefore attached an 
hbase-site.xml that overrides the ZooKeeper client port to match 
MiniZookeeperCluster.
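The override described above can be sketched as a small script that writes the relevant hbase-site.xml fragment. This is only an illustration: the property name hbase.zookeeper.property.clientPort is assumed here and should be checked against the HBase version in use, and build_hbase_site is a hypothetical helper, not part of the attached patch.

```python
import xml.etree.ElementTree as ET

# Port hard-coded by the mini ZooKeeper cluster, per the comment above.
MINI_ZK_CLIENT_PORT = 21810


def build_hbase_site(client_port: int) -> str:
    """Return hbase-site.xml content that overrides the ZooKeeper client port."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    # Assumed property name; verify it against your HBase release.
    ET.SubElement(prop, "name").text = "hbase.zookeeper.property.clientPort"
    ET.SubElement(prop, "value").text = str(client_port)
    return ET.tostring(configuration, encoding="unicode")


xml_text = build_hbase_site(MINI_ZK_CLIENT_PORT)
print(xml_text)
```

Placing the resulting file on the test classpath would make a client look for ZooKeeper on the same port the mini cluster listens on, rather than on the default 2181.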



 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Fix For: 0.5.0

 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-970:
---

 Tags: hbase
Fix Version/s: 0.5.0
   Status: Patch Available  (was: Open)

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Fix For: 0.5.0

 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772464#action_12772464
 ] 

Hadoop QA commented on PIG-970:
---

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12423811/zookeeper-hbase-1329.jar
  against trunk revision 831481.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 92 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/136/console

This message is automatically generated.

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Fix For: 0.5.0

 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772465#action_12772465
 ] 

Jeff Zhang commented on PIG-970:


This patch works on my machine, but it seems I do not have permission to put 
the jars into Pig trunk. Could anyone help validate the patch on Pig trunk?

Thank you in advance.

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Fix For: 0.5.0

 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




Definition of equality of bags

2009-11-02 Thread Thejas Nair
I could not find any documentation (in the Pig Latin manual) on how equality
of bags is defined (or how it should be defined). Does the order of tuples in
the bag matter? The definition of a bag does not imply any ordering.

This has implications for the definition of join/cogroup/group on bags.
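To make the question concrete, here is a minimal sketch of one possible definition: two bags are equal when they contain the same tuples with the same multiplicities, regardless of order. This is an assumption for illustration only, not what Pig implements; bags_equal is a hypothetical helper, and it requires the tuples to be hashable.

```python
from collections import Counter


def bags_equal(bag_a, bag_b):
    """Compare two bags as multisets: order is ignored, duplicates are counted."""
    return Counter(bag_a) == Counter(bag_b)


left = [("a", 1), ("b", 2), ("a", 1)]
right = [("a", 1), ("a", 1), ("b", 2)]

print(bags_equal(left, right))                   # same tuples, different order
print(bags_equal(left, [("a", 1), ("b", 2)]))    # one duplicate missing
```

Under an order-sensitive definition the first comparison would fail, which is why the choice matters for join/cogroup/group keys.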

Thanks,
Thejas




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772562#action_12772562
 ] 

Daniel Dai commented on PIG-1038:
-

Hi, Ashutosh,
I will look into POForeach, find the first nested sort or distinct, and use 
its key as the secondary sort key for this map-reduce job, so that I can 
remove or simplify the nested sort/distinct.

Yes, we definitely need a framework for the map-reduce layer also. We will work 
on that, and welcome any suggestions and comments.

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = order A by $1;
     generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = A.$1;
     E = distinct D;
     generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.
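The idea behind both examples can be sketched outside Pig. In this toy model, one sort on (group key, secondary key) stands in for the shuffle with a secondary sort key, so each group's values already arrive ordered and the nested order-by needs no further work; distinct then only has to drop adjacent duplicates. The function names are hypothetical, and for brevity only the $1 values are kept per group rather than whole tuples.

```python
from itertools import groupby
from operator import itemgetter


def group_with_secondary_sort(records):
    """Yield (group, values) with values already ordered by the secondary key."""
    # One sort on (primary, secondary) plays the role of the shuffle.
    shuffled = sorted(records, key=itemgetter(0, 1))
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        yield key, [row[1] for row in rows]


def distinct_sorted(values):
    """Drop duplicates from an already sorted list without sorting again."""
    return [value for value, _ in groupby(values)]


data = [("x", 3), ("y", 1), ("x", 1), ("x", 3), ("y", 2)]
for group, ordered in group_with_secondary_sort(data):
    print(group, ordered, distinct_sorted(ordered))
```

The per-group work becomes a single pass over the values, which is the saving the issue description is after.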




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-02 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772565#action_12772565
 ] 

Thejas M Nair commented on PIG-1062:


The use of fileSize() in WeightedRangePartitioner.setConf is all right; it 
checks the size of an intermediate file.

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Created: (PIG-1070) Review Basics link broken under Getting Started

2009-11-02 Thread robert Cook (JIRA)
Review Basics link broken under Getting Started
---

 Key: PIG-1070
 URL: https://issues.apache.org/jira/browse/PIG-1070
 Project: Pig
  Issue Type: Bug
  Components: site
 Environment: Apple OS/X Safari
Reporter: robert Cook
Priority: Trivial


The requested URL /pig/docs/r0.4.0/quickstart.html was not found on this server.




Re: Definition of equality of bags

2009-11-02 Thread Thejas Nair
It looks like join/cogroup/group is not defined on bags. I assume this is
because equality on bags is not defined.

It gives an error in map-reduce mode, but not in local mode.
Pig is likely to drop its custom local mode implementation in favor of
Hadoop's local mode, which should fix the inconsistency, so I am not filing a
JIRA.
-Thejas



On 11/2/09 9:19 AM, Thejas Nair te...@yahoo-inc.com wrote:

> I could not find any documentation (in piglatin manual) on what the
> definition of equality of bags is (or what it should be), does the order of
> tuples in the bag matter ? But the definition of a bag does not imply any
> ordering.
>
> This has implication on the definition of join/cogroup/group on bags.
>
> Thanks,
> Thejas



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-02 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772623#action_12772623
 ] 

Thejas M Nair commented on PIG-1062:


Even after the interface changes, Pig can compute the file size by adding up 
the size of each split (from InputSplit.getLength()). The documentation of that 
function in the interface does not make it clear whether this is the size on 
disk, compressed or uncompressed, and so on. Assuming it is the size on disk 
(uncompressed), estimating the total memory the data will require is a 
challenge: one has to make assumptions about the compression ratio and the 
serialization method.
Using Tuple.getMemorySize() while sampling will give more accurate numbers for 
the reducer memory that the data will consume.
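The two estimation routes can be compared with a small sketch. All names and numbers below are hypothetical: expansion_factor stands for the guessed ratio between on-disk and in-memory size, and sampled_tuple_bytes stands for values that a call such as Tuple.getMemorySize() would return for the sampled tuples.

```python
def total_input_bytes(split_lengths):
    """Add up per-split lengths to get the input size on disk."""
    return sum(split_lengths)


def estimate_memory_from_disk(disk_bytes, expansion_factor):
    """Guess the in-memory size from the disk size; the factor is an assumption."""
    return disk_bytes * expansion_factor


def estimate_memory_from_samples(sampled_tuple_bytes, tuples_per_sample):
    """Scale the measured in-memory sizes of sampled tuples up to the full input."""
    return sum(sampled_tuple_bytes) * tuples_per_sample


split_lengths = [64 * 2**20, 64 * 2**20, 32 * 2**20]
disk_bytes = total_input_bytes(split_lengths)
print(disk_bytes)
print(estimate_memory_from_disk(disk_bytes, expansion_factor=3.0))
print(estimate_memory_from_samples([120, 80, 100], tuples_per_sample=1000))
```

The first estimate is only as good as the guessed factor, while the second relies on measured in-memory tuple sizes, which is the argument made above for sampling.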

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Updated: (PIG-1030) explain and dump not working with two UDFs inside inner plan of foreach

2009-11-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1030:


   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

+1, patch committed, Thanks Richard!

 explain and dump not working with two UDFs inside inner plan of foreach
 ---

 Key: PIG-1030
 URL: https://issues.apache.org/jira/browse/PIG-1030
 Project: Pig
  Issue Type: Bug
Reporter: Ying He
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: PIG-1030.patch, PIG-1030.patch


 This script does not work:
 register /homes/yinghe/owl/string.jar;
 a = load '/user/yinghe/a.txt' as (id, color);
 b = group a all;
 c = foreach b {
     d = distinct a.color;
     generate group, string.BagCount2(d), string.ColumnLen2(d, 0);
 }
 The UDFs are regular, not algebraic.
 If I then call dump c or explain c, I get this error message:
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2019: Expected to find plan 
 with single leaf. Found 2 leaves.
 The error only occurs the first time; after getting it, if I call dump c or 
 explain c again, it succeeds.




[jira] Updated: (PIG-1035) support for skewed outer join

2009-11-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1035:


   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

+1, Patch committed, thanks Sri!

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: 1035new.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory.




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-02 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772704#action_12772704
 ] 

Thejas M Nair commented on PIG-1062:


As indicated in the previous comment, I am planning to go ahead with the [earlier 
proposal|https://issues.apache.org/jira/browse/PIG-1062?focusedCommentId=12772197&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12772197]
 . The current sample frequency would be one tuple every ( (H/s) * (1/17) ) 
tuples.  

In PartitionSkewedKey.exec(), the number of reducers for join key k1 can be 
computed using (no_of_samples(k1) / 17). But the accuracy of this calculation 
depends on how accurate the computed average tuple size is (s in (H/s) * 
(1/17)). Sending a special tuple with the number of rows in the split will 
likely lead to a more accurate estimate of the number of reducers required.
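The arithmetic can be made concrete with a short sketch. The comment does not define H, so it is taken here to be the memory available for holding tuples, with s the average tuple size; both readings are assumptions. Under them, H/s tuples fill one reducer's memory, one sample stands for (H/s)/17 tuples, and 17 samples of a key therefore correspond to one reducer's worth of data. Rounding up and the minimum of one reducer are also assumptions of this sketch.

```python
import math

SAMPLES_PER_REDUCER = 17  # the constant 17 from the comment above


def sample_interval(heap_bytes, avg_tuple_bytes):
    """Tuples between two samples: (H / s) * (1 / 17)."""
    return (heap_bytes / avg_tuple_bytes) / SAMPLES_PER_REDUCER


def reducers_for_key(sample_count):
    """Reducers for one join key: no_of_samples / 17, rounded up (assumption)."""
    return max(1, math.ceil(sample_count / SAMPLES_PER_REDUCER))


# Hypothetical numbers: 170 MB of usable memory, 100-byte tuples.
print(sample_interval(170_000_000, 100))
print(reducers_for_key(34))
print(reducers_for_key(35))
```

If the estimate of s is off by some factor, the sampling interval and hence the reducer count are off by the same factor, which is the accuracy concern raised above.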

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Updated: (PIG-1026) [zebra] map split returns null

2009-11-02 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1026:
---


Patch reviewed.  +1

 [zebra] map split returns null
 --

 Key: PIG-1026
 URL: https://issues.apache.org/jira/browse/PIG-1026
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Yan Zhou
 Fix For: 0.6.0

 Attachments: PIG_1026.patch


 Here is the test scenario:
  final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
   //final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
  final static String STR_STORAGE = "[m1#{a}, m2#{x}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1,m2]";
 projection: String projection2 = new String("m1#{b}, m2#{x|z}");
 User got null pointer exception on reading m1#{b}.
 Yan, please refer to the test class:
 TestNonDefaultWholeMapSplit.java 




[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772772#action_12772772
 ] 

Alan Gates commented on PIG-1037:
-

The difference is much more than switching from dumping one tuple at a time to 
multiple tuples.  It is about how spilling is activated.  In the past, spilling 
was passive; it was done when the JVM informed us that memory was getting low.  
This did not work well as the JVM only checks memory usage when it garbage 
collects.  So by the time pig was notified of a low memory condition it was 
often too late.  We often ran out of memory while trying to spill.  Now 
instead, spilling is active.  Pig sets aside a buffer for a bag to put its 
tuples in.  For default bags, once this buffer is full any additional tuples 
are written to disk.  For sorted or distinct bags, once the buffer is full it 
is sorted and dumped to disk, and new records go into the buffer.

This particular patch only adds the change for sorted and distinct bags.  
PIG-975 contains the original patch for default bags.
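The active spilling described above for sorted bags can be sketched as a toy class. This is not Pig's implementation: ActiveSortedBag, buffer_limit, and the use of pickle and temporary files are all assumptions for illustration, and merging the spilled runs on read is a guess at how the data would be consumed, since the comment only describes the write path.

```python
import heapq
import pickle
import tempfile


class ActiveSortedBag:
    """Toy sorted bag that spills a sorted run whenever its buffer is full."""

    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit  # maximum tuples held in memory
        self.buffer = []
        self.spill_files = []

    def add(self, item):
        # Spill proactively instead of waiting for a low-memory notification.
        if len(self.buffer) >= self.buffer_limit:
            self._spill()
        self.buffer.append(item)

    def _spill(self):
        # Sort the full buffer and write it to disk as one run.
        self.buffer.sort()
        run = tempfile.TemporaryFile()
        for item in self.buffer:
            pickle.dump(item, run)
        self.spill_files.append(run)
        self.buffer = []

    @staticmethod
    def _read_run(run):
        # Stream one spilled run back from the start of its file.
        run.seek(0)
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    def __iter__(self):
        # Merge the in-memory remainder with every sorted run on disk.
        runs = [self._read_run(run) for run in self.spill_files]
        return heapq.merge(sorted(self.buffer), *runs)


bag = ActiveSortedBag(buffer_limit=3)
for value in [5, 1, 4, 2, 3, 9, 7]:
    bag.add(value)
print(list(bag))
print(len(bag.spill_files))
```

Because the bag decides when to spill from its own buffer size, it never depends on the JVM reporting memory pressure in time, which is the failure mode of the passive approach.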


 better memory layout and spill for sorted and distinct bags
 ---

 Key: PIG-1037
 URL: https://issues.apache.org/jira/browse/PIG-1037
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ying He
 Fix For: 0.6.0

 Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3







[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772774#action_12772774
 ] 

Alan Gates commented on PIG-1038:
-

I agree that we need a framework for optimizations in the backend.  I'm hoping 
we can reuse the framework from the front end.  However, there's some cleanup 
we'd still like to do on the LogicalOptimizer before we use it as a template 
for a MapReduceOptimizer.  But I agree that's where we need to go.

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0


 If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = order A by $1;
     generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = A.$1;
     E = distinct D;
     generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.
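
 To illustrate why the distinct in Eg2 no longer needs its own sort: if the 
 values for a group already arrive ordered by the secondary key, duplicates 
 are adjacent, so one pass comparing each value with the previous one is 
 enough.  The sketch below is Python, and distinct_sorted is a hypothetical 
 name, not an operator in Pig.

```python
def distinct_sorted(values):
    """Yield each value once, assuming equal values arrive adjacent.

    Hypothetical helper: it relies on the input already being sorted
    (for example by a secondary sort) and does no sorting itself.
    """
    first = True
    previous = None
    for value in values:
        if first or value != previous:
            yield value
        previous = value
        first = False
```

 Only the previous value is held in memory, so no bag has to be built or 
 spilled for the distinct step.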




[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-11-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772781#action_12772781
 ] 

Alan Gates commented on PIG-970:


Patch doesn't include binary files.  I'll pull together the latest patch plus 
the jars and test it.

 Support of HBase 0.20.0
 ---

 Key: PIG-970
 URL: https://issues.apache.org/jira/browse/PIG-970
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Vincent BARAT
Assignee: Jeff Zhang
 Fix For: 0.5.0

 Attachments: build.xml.path, hbase-0.20.0-test.jar, hbase-0.20.0.jar, 
 pig-hbase-0.20.0-support.patch, pig-hbase-20-v2.patch, 
 Pig_HBase_0.20.0.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt, 
 TEST-org.apache.pig.test.TestHBaseStorage.txt, zookeeper-hbase-1329.jar


 The support of HBase is currently very limited and restricted to HBase 0.18.0.
 Because the next releases of PIG will support Hadoop 0.20.0, they should also 
 support HBase 0.20.0.




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772797#action_12772797
 ] 

Dmitriy V. Ryaboy commented on PIG-1062:


The sampler (in this design) reads all the data, so the number of records read 
is the total number of records in the dataset, and the number of records 
written is the total number of samples.  The same holds for bytes.  The sampler 
produces a histogram file, which is then used by the join task, so there is no 
reliance on counters there.



 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need 
 to be changed to work with the new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772807#action_12772807
 ] 

Dmitriy V. Ryaboy commented on PIG-1062:


Thejas:

> sending a special tuple with number of rows in the split will likely lead 
> to more accurate estimate of number of reducers required.

You can get the same info from the counters without unnecessarily complicating 
tuple processing, imo. In fact you can use (num bytes read / num records read) 
to get the old calculation, and not rely on number of samples and local average 
size estimates.
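
As a small worked example of the ratio above (a Python sketch; the function 
name and the counter values are made up for illustration, and how the result 
feeds into the reducer estimate is not shown):

```python
def average_record_size(bytes_read, records_read):
    """Average bytes per record from two job-level counters.

    Hypothetical helper; returns 0.0 when no records were read.
    """
    if records_read == 0:
        return 0.0
    return bytes_read / records_read


# Invented counter values, purely for illustration.
avg = average_record_size(bytes_read=1_000_000, records_read=4_000)
```

Because both counters cover the whole input, the average does not depend on 
how many samples were taken or on per-sample size estimates.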





two-level access problem?

2009-11-02 Thread Dmitriy Ryaboy
Could someone explain the nature of the two-level access problem
referred to in the Load/Store redesign wiki and in the DataType code?


Thanks,
-D


[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772890#action_12772890
 ] 

Ashutosh Chauhan commented on PIG-1037:
---

Thanks for the explanation, Alan. 








[jira] Commented: (PIG-958) Splitting output data on key field

2009-11-02 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772925#action_12772925
 ] 

Ankur commented on PIG-958:
---

Can we have an update on this, please?

 Splitting output data on key field
 --

 Key: PIG-958
 URL: https://issues.apache.org/jira/browse/PIG-958
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Ankur
 Attachments: 958.v3.patch, 958.v4.patch


 Pig users often face the need to split the output records into a bunch of 
 files and directories depending on the type of record. Pig's SPLIT operator 
 is useful when record types are few and known in advance. In cases where type 
 is not directly known but is derived dynamically from values of a key field 
 in the output tuple, a custom store function is a better solution.
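
 The idea of routing records by key value can be sketched as follows.  This 
 is plain Python writing to the local filesystem, not a Pig StoreFunc; the 
 function name, the tab-separated format, and the part-00000 file name are 
 assumptions made for the example, and keys are assumed to be safe to use as 
 directory names.

```python
import os


def store_by_key(records, key_index, out_dir):
    """Write each record to out_dir/<key>/part-00000, keyed on one field.

    Hypothetical helper. Output locations are discovered from the data
    as it is written, so the set of keys need not be known in advance.
    """
    handles = {}
    try:
        for record in records:
            key = str(record[key_index])
            if key not in handles:
                key_dir = os.path.join(out_dir, key)
                os.makedirs(key_dir, exist_ok=True)
                path = os.path.join(key_dir, "part-00000")
                handles[key] = open(path, "w")
            line = "\t".join(str(field) for field in record)
            handles[key].write(line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

 Unlike SPLIT, nothing here lists the record types up front: a new key simply 
 opens a new output directory the first time it is seen.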
