Re: Are there any web crawlers based on database?

xiao yang Tue, 26 Oct 2010 05:02:36 -0700

Hi, Andrzej

Great, I'll definitely try Nutch 2.0!
As far as I know, Facebook is still using MySQL for storage. I believe
its data scale exceeds 100K. Do you have any idea how they solve the
problem?


Thanks!
Xiao

On Tue, Oct 26, 2010 at 5:30 PM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious why using a database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, relational database storage will
> have very poor performance or extremely high cost.
>
>
>> Here is my requirement: we have two developers who want to do some
>> processing and analysis work on the crawled data. If the data is
>> stored in a database, we can easily share it, thanks to the
>> well-defined data model. What's more, the analysis results can also
>> be stored back into the database by just adding a few fields.
>> For example, I need to know the average number of URLs per site. In
>> a database, a single SQL query will do. If I want to extract and
>> store the main part of web pages, I can't easily modify the data
>> structure of Nutch. Even in Solr, it's difficult and inefficient to
>> iterate through the data set.
>> The crawled data is structured, so why not use a database?
>
> What you want is Nutch 2.0 :) where the storage layer uses Gora (an
> abstraction for key-value stores), and one of the supported backend
> types is an SQL database. Your use case - to be able to use existing
> standard tools for DBs or other data warehousing platforms - was one of
> the motivations to redesign Nutch this way.
>
> Please check out Nutch trunk, and configure it to use an SQL backend.
> Currently only MySQL and HSQLDB databases are supported (and HBase), but
> it's not that hard to add support for other database types.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
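For illustration, the "average number of URLs per site" query from the
quoted message could look roughly like the sketch below. It uses an
in-memory SQLite database as a stand-in; the `pages` table, its `site`
and `url` columns, and the function name are hypothetical and are not
the Nutch 2.0 / Gora schema.

```python
import sqlite3
from urllib.parse import urlparse


def average_urls_per_site(urls):
    """Return the mean number of distinct URLs per site, or None if empty."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per crawled URL, keyed by the URL itself.
    conn.execute("CREATE TABLE pages (site TEXT NOT NULL, url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO pages (site, url) VALUES (?, ?)",
        [(urlparse(u).netloc, u) for u in urls],
    )
    # Count URLs per site, then average those counts in a single statement.
    (avg,) = conn.execute(
        "SELECT AVG(n) FROM "
        "(SELECT COUNT(*) AS n FROM pages GROUP BY site) AS per_site"
    ).fetchone()
    conn.close()
    return avg


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/1",
    ]
    print(average_urls_per_site(sample))
```

Against a MySQL backend the same SELECT should carry over, assuming a
comparable table exists; only the connection code would change.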
