Hi everyone,

I hope someone can give me some advice. I run Nutch on the latest version of Cloudera, on 4 servers. I am trying to crawl the start pages and all the links on them (within the same domain). I uploaded about 5 million domains, and this is what I see:
---------------------------
nutch inject /domains

Works fine, everything was uploaded.
---------------------------
nutch generate -topN 40000000 -noFilter -batchId 1432017717-23908

Also works fine.

Map-Reduce Framework:
Map input records=4881110
Map output records=4881110
---------------------------
nutch fetch 1432017717-23908

Fine, but we already have 4881050 records instead of 4881110.

Map-Reduce Framework
Map input records=4881050
Map output records=4881050
---------------------------
nutch parse 1432017717-23908

Map-Reduce Framework
Map input records=713961
Map output records=702082

Only 713961 records were taken as input. Why? I can't understand it.
---------------------------
nutch updatedb 1432017717-23908

Map-Reduce Framework
Map input records=4863372
Map output records=9643464
---------------------------
nutch index 1432017717-23908

Map-Reduce Framework
Map input records=1226
Map output records=1226

Only 1226 records.

I really can't understand what's wrong. I have checked everything, and a previous installation of Nutch on 2 servers works fine.

Full log: http://pastebin.com/BsPrb5WQ

---
Sergey Bolshakov
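
To make the drop-off between steps easier to see, the counters can be put side by side. This is only a rough sketch in Python: the numbers are the "Map input records" values copied from the job output above, and `stages` and `stage_report` are names made up for this illustration, not part of Nutch.

```python
# "Map input records" values copied from the job counters quoted above.
# The names below are hypothetical helpers for this illustration only.
stages = [
    ("generate", 4881110),
    ("fetch", 4881050),
    ("parse", 713961),
    ("updatedb", 4863372),
    ("index", 1226),
]


def stage_report(stages):
    """Return (name, records, percent of the first stage's input) for each stage."""
    baseline = stages[0][1]
    return [(name, count, 100.0 * count / baseline) for name, count in stages]


# Print one line per stage: name, input records, share of the generate input.
for name, count, pct in stage_report(stages):
    print(f"{name:<9} {count:>9} {pct:7.2f}%")
```

This only compares the map input counts of the separate jobs; it says nothing about why the parse and index steps read so few records.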

