Hi Sebastian, thanks for the response.
The numbers I gave were for a single reduce task, not a whole job. I'll try to
give a better picture.
crawldb/current holds 161.4 GB of data for about 1.6 billion URLs. I don't
know how many hosts or domains that covers, but I assume it is many millions.
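If it helps, one way to estimate the host count is to dump the CrawlDb to text (e.g. with `nutch readdb <crawldb> -dump <outdir>`) and count distinct hostnames in the URL column. A minimal sketch of that counting step, assuming you already have the URLs as plain text (the `count_hosts` helper is hypothetical, not part of Nutch):

```python
from collections import Counter
from urllib.parse import urlparse

def count_hosts(urls):
    """Count URLs per hostname; skips lines that have no parsable host."""
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).hostname
        if host:
            counts[host] += 1
    return counts

# Example with a few sample URLs instead of a real CrawlDb dump:
urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "https://sub.example.org/",
]
by_host = count_hosts(urls)
print(len(by_host))   # number of distinct hosts
```

For a 1.6-billion-URL dump you would stream the file line by line rather than load it into memory, but the per-host tally works the same way.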
Cluster cu
Hi Michael,
> reducer spills a lot of records
The job counter "Spilled Records" is not for the reducers alone.
> 255K input records
Does your CrawlDb only contain 250,000 entries?
Also, how many hosts (or domains/IPs, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per