Re: idea about web page database

Duane Moore Fri, 23 Jul 2010 22:54:50 -0700

罗磊,

You might try looking at Nutch, which, as you may know, is the project that 
Hadoop originally grew out of. There is an active issue in the Nutch JIRA for 
adding integration with HBase: 
https://issues.apache.org/jira/browse/NUTCH-650


With this change to Nutch, we now have an example usage of HBase which matches 
very closely the table design suggested in the Google Bigtable paper.
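
To make the row-key question concrete, here is a rough sketch of the layout 
described in the Bigtable paper, where each page gets its own row keyed by the 
full URL with the hostname reversed. This is only an in-memory illustration: 
the names row_key, put and webtable are made up for this example and are not 
part of HBase, Nutch or Gora, and the query string and port are simply ignored.

```python
from urllib.parse import urlsplit


def row_key(url):
    """Hypothetical helper: reversed hostname followed by the URL path."""
    # urlsplit needs a scheme to recognise the hostname, so add one if missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    # Assumption: query string and port are dropped to keep the sketch short.
    return reversed_host + parts.path


# Toy stand-in for the table: row key -> {column name -> value}.
webtable = {}


def put(url, column, value):
    """Store one cell in the row that belongs to the given page URL."""
    webtable.setdefault(row_key(url), {})[column] = value


# Each page is its own row, so a.html and b.html never share a row.
put("www.cnn.com/a.html", "contents:", "<html>page a</html>")
put("www.cnn.com/b.html", "contents:", "<html>page b</html>")

# Anchor text is stored on the row of the page being linked to, in a column
# named after the referring site, as in the paper's anchor:cnnsi.com example.
put("www.cnn.com/a.html", "anchor:cnnsi.com", "CNN")
```

With this layout an anchor is unambiguous, because it lives in the row of the 
page it points at, and reversing the hostname keeps pages from the same domain 
next to each other when rows are sorted by key.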

I downloaded the code for the branch of Nutch integrating with HBase at 
http://svn.apache.org/repos/asf/nutch/branches/nutchbase/

You can do some searching in that branch, but the class 
org.apache.nutch.storage.WebPage seems to have a basic structure for a “web 
page” table that may be what you’re looking for.  Nutch is using the Gora 
framework (http://github.com/enis/gora), which I was not familiar with, but it 
appears to handle the mapping from the persistence/data object class to the 
underlying HBase table when HBase is used.

Best of luck,
Duane

________________________________
From: 罗磊 <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Fri, 23 Jul 2010 20:27:11 -0700
To: "[email protected]" <[email protected]>
Subject: idea about web page database

Hi

I'm trying to design a database to store web pages for a search engine. Can 
you give me some good advice on this?

I read the Bigtable paper. Google gives an example of a Webtable, but it 
confuses me a little. Google shows how www.cnn.com is stored, but if I have 
two pages named www.cnn.com/a.html and www.cnn.com/b.html, I don't know 
whether or not to store the two pages in one row.

Google's paper said "In Webtable, we would use URLs as row keys, various 
aspects of web pages as column names, and store the contents of the web pages 
in the contents". It seems Google will use the domain name as the row key and 
store a.html and b.html as column names. But with that design the anchor 
columns seem impossible: how can users tell which page, a.html or b.html, an 
anchor text refers to?


Luo Lei
