On 1/16/11 7:40 AM, Alexis wrote:
Dear Otis and Lewis,
According to the few tests I made, I feel MySQL has the best
performance, compared to HSQL and HBase. HSQL is slower and takes up
a lot of disk space. HBase uses more resources. Under HBase, I couldn't
get the Fetch job to complete when holding 5000 pages buffered in
memory without my laptop becoming extremely slow. It finally
worked when flushing to the store every 2500 pages. Under
MySQL, it ran smoothly with a value of 10000.
...and this is of course nowhere near the level of scalability that 1.x
releases had, as they would easily crawl a hundred million pages.
There's a lot of remaining work on Gora and its integration with Nutch
that affects this situation.
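
To make the "flushing frequency" discussed above concrete, here is a
minimal sketch of a write buffer that holds pages in memory and pushes
them to a backend every N pages. This is not Gora's or Nutch's actual
code: BufferedStore, ListBackend, write_batch and flush_every are all
hypothetical names made up for illustration.

```python
class BufferedStore:
    """Hypothetical write buffer: keeps pages in memory and flushes
    them to the backend once `flush_every` pages have accumulated."""

    def __init__(self, backend, flush_every):
        if flush_every < 1:
            raise ValueError("flush_every must be >= 1")
        self.backend = backend        # any object with write_batch(list)
        self.flush_every = flush_every
        self.buffer = []

    def put(self, page):
        self.buffer.append(page)
        # Memory use grows with the buffer, so a lower threshold trades
        # fewer pages held in RAM for more frequent writes to the store.
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            self.backend.write_batch(self.buffer)
            self.buffer = []


class ListBackend:
    """Stand-in for a real store (assumption): records each batch."""

    def __init__(self):
        self.batches = []

    def write_batch(self, pages):
        self.batches.append(list(pages))


backend = ListBackend()
store = BufferedStore(backend, flush_every=2500)
for i in range(6000):
    store.put(f"page-{i}")
store.flush()  # push whatever is left at the end of the job

print([len(b) for b in backend.batches])  # prints [2500, 2500, 1000]
```

With flush_every=2500 at most 2500 pages sit in memory at once, which
is the kind of trade-off behind the 2500 vs. 10000 values reported above.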
Eventually I expect HBase will be the best choice for large scale
crawling, with MySQL backend suitable for small to medium scale, and
HSQL being used only for tests or really small crawls (< 1000 pages).
NoSQL technology scales better, but for a "reasonable" volume MySQL
will do the job fine and faster.
It would be nice to test Cassandra as a Gora backend. Write operations
are allegedly faster than in HBase. I haven't tried it yet.
There are some concurrency limitations in the Cassandra client - OTOH
that's maybe where Gora needs to improve.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com