On 2010-09-28 14:02, Markus Jelsma wrote:
> Hi,
>
> My test setup (local only) now has just over 20 million URLs; I have
> fetched 3M already and the rest still needs to be fetched. Fetching for
> 12 hours at a stretch now wastes less time, because the merge alone
> takes over 5.5 hours!
>
> I've searched but found little information so far. Would now be a good
> time to try running Nutch on a Hadoop cluster (which I don't have), or
> to let Hadoop take advantage of my multiple cores?
Even running Hadoop in pseudo-distributed mode (on a single node but
with real JobTracker/TaskTracker) would be much better. The reason is
that in local mode tasks are NOT executed in parallel but serially.
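For reference, a minimal pseudo-distributed setup on the Hadoop 0.20.x line (current at the time) needs little more than pointing the default filesystem at a local HDFS and setting the JobTracker address to something other than "local". The hostnames and ports below are the conventional defaults, not anything Nutch-specific; adjust to taste:

```xml
<!-- conf/core-site.xml: use a local HDFS instead of the local filesystem -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run a real JobTracker instead of the local runner -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With mapred.job.tracker set to a host:port rather than "local", tasks run in separate child JVMs under the TaskTracker, so several map/reduce tasks can execute concurrently; mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum control how many per node, and setting them to roughly the core count is a reasonable starting point.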
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com