罗磊,

You might try looking at Nutch, which, as you may know, was the origin of Hadoop. There is an active issue in the Nutch JIRA for adding integration with HBase: https://issues.apache.org/jira/browse/NUTCH-650
With this change to Nutch, we now have an example usage of HBase that closely matches the table design suggested in the Google Bigtable paper. I downloaded the code for the Nutch branch that integrates with HBase from
http://svn.apache.org/repos/asf/nutch/branches/nutchbase/
You can do some searching in that branch, but the class org.apache.nutch.storage.WebPage seems to have a basic structure for a “web page” table that may be what you’re looking for.

Nutch is using the Gora framework (http://github.com/enis/gora), which I was not familiar with, but it appears to handle the conversion of the persistence/data object class to the underlying HBase table when HBase is used.

Best of luck,
Duane

________________________________
From: 罗磊 <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Fri, 23 Jul 2010 20:27:11 -0700
To: "[email protected]" <[email protected]>
Subject: idea about web page database

Hi,

I'm trying to design a database to store web pages for a search engine. Can you guys give me some good advice on this?

I read the Bigtable paper. Google gives an example of a Webtable, but it leaves me a little confused. Google shows how www.cnn.com is stored, but if I have 2 pages named www.cnn.com/a.html and www.cnn.com/b.html, I don't know whether or not to store the 2 pages in one row.

Google's paper said "In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents". It seems Google uses the domain name as the row key and stores a.html and b.html as column names. But with that layout the anchor design seems impossible: how can users tell which page, a.html or b.html, an anchor text refers to?

Luo Lei
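
On the row-key question: the Bigtable paper's own examples key each page by its full URL with the hostname components reversed (for example, it stores maps.google.com/index.html under com.google.maps/index.html). Under that reading, a.html and b.html would be separate rows, and anchor text would live in the row of the page being linked to, in a column named after the referring site. A minimal Python sketch of that layout follows, using a plain dict as a stand-in for the table. The helper names row_key and put and the sample values are illustrative assumptions, not HBase or Nutch APIs.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a Webtable-style row key: reversed hostname followed by the path."""
    # Accept URLs with or without a scheme.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return host + path


# Toy in-memory stand-in for the table: row key -> {column name -> value}.
webtable = {}


def put(row: str, column: str, value: str) -> None:
    """Store one cell, creating the row on first use (hypothetical helper)."""
    webtable.setdefault(row, {})[column] = value


# Each page gets its own row, keyed by its full (host-reversed) URL.
put(row_key("www.cnn.com/a.html"), "contents:", "<html>page a</html>")
put(row_key("www.cnn.com/b.html"), "contents:", "<html>page b</html>")

# Anchor text is stored in the row of the page it points to; the column
# qualifier names the referring site, as in the paper's "anchor:cnnsi.com".
put(row_key("www.cnn.com/a.html"), "anchor:cnnsi.com", "CNN")

for key in sorted(webtable):
    print(key, sorted(webtable[key]))
```

Because the anchor cell sits in the row for a.html, there is no ambiguity about which page the anchor text refers to, and reversing the hostname keeps pages from the same domain adjacent when rows are sorted by key.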
