Hi Ai,
On Wed, May 20, 2015 at 1:03 AM, <[email protected]> wrote:
>
> I hope someone can give me advice: I run Nutch on the latest version of
> Cloudera, on 4 servers. I tried to crawl start pages and all links
> from them (within the same domain). I uploaded about 5 million domains and see the following:
>
>
[SNIP]
>
> nutch fetch 1432017717-23908 - fine, but we already got 4881050 records
> instead of 4881110
>
> Map-Reduce Framework
> Map input records=4881050
> Map output records=4881050
>
Please have a look at the custom Hadoop counters for the fetch phase, which a
number of us devs have been adding over previous development cycles:
FetcherStatus
ACCESS_DENIED=4846
EXCEPTION=1944714
GONE=1293
HitByTimeLimit-QueueFeeder=0
MOVED=314310
NOTFOUND=30906
NOTMODIFIED=1
SUCCESS=1882487
TEMP_MOVED=150753
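For a quick read of those numbers, you can tally the counters with a short script (a sketch in Python, not part of Nutch; the figures are copied from the FetcherStatus output above, and note that the per-status total does not necessarily equal the map record count):

```python
# Sketch: tally the FetcherStatus counters pasted above
# (numbers copied from the job output; not part of Nutch itself).
fetcher_status = {
    "ACCESS_DENIED": 4846,
    "EXCEPTION": 1944714,
    "GONE": 1293,
    "HitByTimeLimit-QueueFeeder": 0,
    "MOVED": 314310,
    "NOTFOUND": 30906,
    "NOTMODIFIED": 1,
    "SUCCESS": 1882487,
    "TEMP_MOVED": 150753,
}

total = sum(fetcher_status.values())
map_input_records = 4881050  # from the fetch job's Map-Reduce counters

# Show each status as a share of the map input records, largest first.
for status, count in sorted(fetcher_status.items(), key=lambda kv: -kv[1]):
    print(f"{status:28s} {count:>9d}  ({100.0 * count / map_input_records:5.2f}%)")
print(f"{'counter total':28s} {total:>9d}")
```

Seeing the breakdown this way makes it obvious at a glance that EXCEPTION and MOVED, not lost records, account for most of the gap between the fetch list and the SUCCESS count.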
>
> ---------------------------
>
> nutch parse 1432017717-23908
>
> Map-Reduce Framework
> Map input records=713961
> Map output records=702082
>
> We got only 713961 records. Why? I can't understand it.
>
Again, please see the custom counters:
ParserStatus
failed=562
success=258629
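The same quick tally works for the parse side (again just a sketch; the numbers are copied from the ParserStatus counters above and the parse job's Map-Reduce counters):

```python
# Sketch: compare the ParserStatus counters with the parse job's
# map record counts (numbers copied from this thread).
parser_status = {"failed": 562, "success": 258629}

map_input_records = 713961   # records fed to the parse job
map_output_records = 702082  # records it emitted

dropped = map_input_records - map_output_records
print(f"parse success: {parser_status['success']}, failed: {parser_status['failed']}")
print(f"records dropped between map input and output: {dropped}")
```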
Thanks for pasting the entire log. It really helps when we have TRACE-level
logging for this type of debugging.
Thanks
Lewis