On 2010-10-26 09:20, xiao yang wrote:
> Hi, Scott,
> 
> Thanks for your reply.
> I'm curious about the reason why using database is awful.

I only partially agree with Scott's statement - it really depends on the
scale. For a volume below 100k pages I think a DB could be a good
storage platform. But as the volume of data grows, the cost of updates
in a relational DB grows disproportionately. Once you reach tens of
millions of documents, relational database storage will either perform
very poorly or become extremely expensive.


> Here is my requirement: we have two developers who want to do some
> processing and analysis work on the crawled data. If the data is
> stored in database, we can easily share our data, for the well-defined
> data models. What's more, the analysis results can also be easily
> stored back into the database by just adding a few fields.
> For example, I need to know the average number of urls in one site. In
> database, a single SQL will do. If I want to extract and store the
> main part of web pages, I can't easily modify the data structure of
> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
> through the data set.
> The crawled data is structured, then why not using database?

What you want is Nutch 2.0 :) where the storage layer uses Gora (an
abstraction for key-value stores), and one of the supported backend
types is an SQL database. Your use case - to be able to use existing
standard tools for DBs or other data warehousing platforms - was one of
the motivations to redesign Nutch this way.

Please check out Nutch trunk, and configure it to use an SQL backend.
Currently the only supported SQL databases are MySQL and HSQLDB (HBase
is supported as well), but it's not that hard to add support for other
database types.
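
To illustrate the kind of analysis an SQL backend makes possible, here
is a minimal sketch in Python of the "average number of urls in one
site" query. Note the assumptions: the table name `webpage` and its
columns `url` and `host` are hypothetical and are not the schema that
Nutch or Gora actually creates, and an in-memory SQLite database stands
in for MySQL/HSQLDB only to keep the example self-contained.

```python
import sqlite3


def average_urls_per_site(conn):
    """Return the mean number of URLs per host, or None if the table is empty."""
    row = conn.execute(
        # Inner query: URL count per host. Outer query: average of those counts.
        "SELECT AVG(n) FROM "
        "(SELECT COUNT(*) AS n FROM webpage GROUP BY host) AS per_host"
    ).fetchone()
    return row[0]


# Hypothetical table layout, NOT the real Nutch/Gora schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webpage (url TEXT PRIMARY KEY, host TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO webpage (url, host) VALUES (?, ?)",
    [
        ("http://a.example/", "a.example"),
        ("http://a.example/about", "a.example"),
        ("http://a.example/news", "a.example"),
        ("http://b.example/", "b.example"),
    ],
)

print(average_urls_per_site(conn))
```

With a real backend the same query should carry over largely unchanged
once the table and column names are adjusted to the actual schema.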

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
