Hi, Andrzej,

Great, I'll definitely try Nutch 2.0!
As far as I know, Facebook is still using MySQL for storage, and I believe its data scale far exceeds 100K. Do you have any clues about how they solve the problem?
Thanks!
Xiao

On Tue, Oct 26, 2010 at 5:30 PM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, a relational database storage will
> have very poor performance or extremely high cost.
>
>
>> Here is my requirement: we have two developers who want to do some
>> processing and analysis work on the crawled data. If the data is
>> stored in database, we can easily share our data, for the well-defined
>> data models. What's more, the analysis results can also be easily
>> stored back into the database by just adding a few fields.
>> For example, I need to know the average number of urls in one site. In
>> database, a single SQL will do. If I want to extract and store the
>> main part of web pages, I can't easily modify the data structure of
>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>> through the data set.
>> The crawled data is structured, then why not using database?
>
> What you want is Nutch 2.0 :) where the storage layer uses Gora (an
> abstraction for key-value stores), and one of the supported backend
> types is an SQL database. Your use case - to be able to use existing
> standard tools for DBs or other data warehousing platforms - was one of
> the motivations to redesign Nutch this way.
>
> Please check out Nutch trunk, and configure it to use an SQL backend.
> Currently only MySQL and HSQLDB databases are supported (and HBase), but
> it's not that hard to add support for other database types.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
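
P.S. To make the "a single SQL will do" point from the quoted message concrete, here is a minimal sketch of the average-URLs-per-site query. Everything in it is an assumption made for illustration: the `pages` table, its `url` and `host` columns, and the sample rows are hypothetical and are not the schema that Nutch 2.0 or Gora creates. It uses an in-memory SQLite database from Python only so that it runs without a server.

```python
import sqlite3

# Hypothetical schema: one row per crawled URL, tagged with its host ("site").
# This is NOT the table layout that Nutch 2.0 / Gora produces.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, host TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO pages (url, host) VALUES (?, ?)",
    [
        ("http://a.example/", "a.example"),
        ("http://a.example/x", "a.example"),
        ("http://a.example/y", "a.example"),
        ("http://b.example/", "b.example"),
    ],
)


def average_urls_per_site(conn):
    """Return the mean number of URLs per host, or None if the table is empty."""
    row = conn.execute(
        # Inner query: URL count per host. Outer query: average of those counts.
        "SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM pages GROUP BY host) AS per_host"
    ).fetchone()
    return row[0]


print(average_urls_per_site(conn))
```

The same nested `COUNT`/`AVG` statement should carry over to MySQL largely unchanged, though at the data volumes discussed in the quoted reply the `GROUP BY` becomes a full scan and is where the cost would show up.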

