Hi everyone,

I hope someone can give me some advice. I run Nutch on the latest version of
Cloudera, on 4 servers. I tried to crawl the start pages and all links from
them (within the same domain). I uploaded about 5 million domains and see the
following:

---------------------------

nutch inject  /domains  - works fine, everything was uploaded

---------------------------

nutch generate -topN 40000000 -noFilter -batchId 1432017717-23908 - also works
fine

Map-Reduce Framework:
Map input records=4881110
Map output records=4881110

---------------------------

nutch fetch 1432017717-23908 - works fine, but we already have 4881050 records
instead of 4881110

Map-Reduce Framework
Map input records=4881050
Map output records=4881050

---------------------------

nutch parse 1432017717-23908

Map-Reduce Framework
Map input records=713961
Map output records=702082

Only 713961 records were taken as input. Why? I can't understand it.

---------------------------
nutch updatedb 1432017717-23908

Map-Reduce Framework
Map input records=4863372
Map output records=9643464

---------------------------

nutch index 1432017717-23908
Map-Reduce Framework
Map input records=1226
Map output records=1226

Only 1226 records reach the index step.
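
To make the drop-off easier to see, here is a minimal sketch that tabulates the
"Map input records" counters quoted above as a percentage of the generate step.
The stage labels and the helper name `retention` are only illustrative; the
numbers are copied from the job counters in this message, not from any Nutch
API.

```python
# "Map input records" values copied from the job counters quoted above.
# The stage labels are just names chosen for this sketch.
STAGES = [
    ("generate", 4881110),
    ("fetch", 4881050),
    ("parse", 713961),
    ("updatedb", 4863372),
    ("index", 1226),
]


def retention(stages):
    """Return (name, records, percent of the first stage) for each stage."""
    baseline = stages[0][1]
    return [(name, count, 100.0 * count / baseline) for name, count in stages]


if __name__ == "__main__":
    # Print one aligned row per stage: name, record count, share of baseline.
    for name, count, pct in retention(STAGES):
        print(f"{name:<10}{count:>10}{pct:>9.2f}%")
```

The big drops are at the parse and index steps, so those are the two jobs
whose logs are worth checking first.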


I really can't understand what's wrong. I have checked everything, and my
previous installation of Nutch on 2 servers works fine.

Full log: http://pastebin.com/BsPrb5WQ

---
Sergey Bolshakov









