On 26 October 2010 10:30, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, a relational database storage will
> have very poor performance or extremely high cost.

I wrote such a system a few years ago and this was exactly what I discovered.
Non-locality of database reads and writes proved an insurmountable bottleneck
beyond about 10 million pages. [The crawl slowed to 2-3 pages/second with
everything on a single PC; using a separate machine as the DB server didn't
seem to help.] I might have got further if I hadn't included a table that
recorded every inter-page link!

But it was really nice to be able to make arbitrary queries of the page
collection and its structure, although some queries could be horrendously
slow.
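
As a rough illustration of what "arbitrary queries of the page collection
and its structure" can look like, here is a minimal sketch. It is not the
schema of the system described above: the table and column names (pages,
links, src, dst) and the example URLs are hypothetical, and Python's built-in
sqlite3 module stands in for MySQL so the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory page/link store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pages (
            id  INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL
        );
        -- One row per inter-page link; this is the table that grows fastest.
        CREATE TABLE links (
            src INTEGER NOT NULL REFERENCES pages(id),
            dst INTEGER NOT NULL REFERENCES pages(id),
            PRIMARY KEY (src, dst)
        );
        -- Index on the target side so inlink lookups avoid a full scan.
        CREATE INDEX links_dst ON links(dst);
    """)
    urls = ["http://a.example/", "http://b.example/", "http://c.example/"]
    conn.executemany("INSERT INTO pages (url) VALUES (?)", [(u,) for u in urls])
    # a -> b, a -> c, b -> c (ids are assigned 1, 2, 3 in insertion order)
    conn.executemany("INSERT INTO links (src, dst) VALUES (?, ?)",
                     [(1, 2), (1, 3), (2, 3)])
    return conn


def inlink_counts(conn):
    """Return (url, number of inbound links), most-linked pages first."""
    return conn.execute("""
        SELECT p.url, COUNT(l.src) AS inlinks
        FROM pages p
        LEFT JOIN links l ON l.dst = p.id
        GROUP BY p.id
        ORDER BY inlinks DESC, p.url
    """).fetchall()


if __name__ == "__main__":
    for url, n in inlink_counts(build_demo_db()):
        print(n, url)
```

A query like this is a one-liner to write, which is the appeal; the cost is
that every crawled page adds many rows to the link table, and those inserts
and lookups land all over the index rather than in one place.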

I was using MySQL, BTW. I have seen some hints that the latest version is
much faster, so I might have another go.
