Hi Sebastian, thanks for the response.
The numbers I gave were for a single reduce task, not a whole job. I'll try to
give a better picture.
crawldb/current holds 161.4 GB of data for about 1.6 billion URLs. I don't
know how many hosts or domains that covers, but I assume it is many millions.
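If it helps, one way to estimate the host count is to dump the CrawlDb to text (e.g. with `nutch readdb <crawldb> -dump <outdir>`) and count distinct hostnames in the URL column. A minimal sketch of that counting step, assuming you already have the URLs as plain text (the `count_hosts` helper is hypothetical, not part of Nutch):

```python
from collections import Counter
from urllib.parse import urlparse

def count_hosts(urls):
    """Count URLs per hostname; skips lines that have no parsable host."""
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).hostname
        if host:
            counts[host] += 1
    return counts

# Example with a few sample URLs instead of a real CrawlDb dump:
urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "https://sub.example.org/",
]
by_host = count_hosts(urls)
print(len(by_host))   # number of distinct hosts
```

For a 1.6-billion-URL dump you would stream the file line by line rather than load it into memory, but the per-host tally works the same way.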
Cluster cu
Hi Michael,
> reducer spills a lot of records
The job counter "Spilled Records" is not for the reducers alone.
> 255K input records
Does your CrawlDb only contain 250,000 entries?
Also, how many hosts (or domains/IPs, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per