On 26 October 2010 10:30, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, a relational database storage will
> have very poor performance or extremely high cost.

I wrote such a system a few years ago and this was exactly what I discovered. Non-locality of database reads and writes proved an insurmountable bottleneck beyond about 10 million pages. [The crawl slowed to 2-3 pages/second with everything on a single PC; using a separate machine as the DB server didn't seem to help.] I might have got further if I hadn't included a table that recorded every inter-page link! But it was really nice to be able to make arbitrary queries of the page collection and its structure, although some queries could be horrendously slow. I was using MySQL, BTW. I have seen some hints that the latest version is much faster, so I might have another go.
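
As a rough illustration only (this is not the original schema), here is a minimal sketch of a page table plus an inter-page link table, and the kind of ad hoc structural query such a layout allows. It uses Python's built-in sqlite3 as a stand-in for MySQL; the table names, column names, sample URLs and the helper functions build_demo_db and inbound_counts are all assumptions made for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical pages/links schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pages (
            id  INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL
        );
        -- One row per inter-page link; this table grows much faster than pages.
        CREATE TABLE links (
            src INTEGER NOT NULL REFERENCES pages(id),
            dst INTEGER NOT NULL REFERENCES pages(id),
            PRIMARY KEY (src, dst)
        );
        -- Index for "who links to this page?" lookups.
        CREATE INDEX links_dst ON links(dst);
    """)

    # Made-up sample data: three pages and four links between them.
    urls = ["http://example.com/", "http://example.com/a", "http://example.com/b"]
    conn.executemany("INSERT INTO pages (url) VALUES (?)", [(u,) for u in urls])
    ids = {url: pid for pid, url in conn.execute("SELECT id, url FROM pages")}

    edges = [
        ("http://example.com/", "http://example.com/a"),
        ("http://example.com/", "http://example.com/b"),
        ("http://example.com/a", "http://example.com/b"),
        ("http://example.com/b", "http://example.com/"),
    ]
    conn.executemany(
        "INSERT INTO links (src, dst) VALUES (?, ?)",
        [(ids[s], ids[d]) for s, d in edges],
    )
    conn.commit()
    return conn


def inbound_counts(conn):
    """Return (url, inbound link count) pairs, most-linked pages first."""
    return conn.execute("""
        SELECT p.url, COUNT(l.src) AS n
        FROM pages p
        LEFT JOIN links l ON l.dst = p.id
        GROUP BY p.id
        ORDER BY n DESC, p.url
    """).fetchall()


if __name__ == "__main__":
    for url, n in inbound_counts(build_demo_db()):
        print(n, url)
```

In a layout like this, each fetched page adds one row to pages but potentially many rows to links, and each of those rows also updates index entries keyed by destination page, which are spread across the index. That is one plausible way the non-local writes described above could arise, though the real system may have differed.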

