Hi Ai,

On Wed, May 20, 2015 at 1:03 AM, <[email protected]> wrote:

>
> I hope someone can give me advice: I run Nutch on the latest version of
> Cloudera, and I have 4 servers. I tried to crawl start pages and all links
> from them (within the same domain). I uploaded about 5 million domains and saw the following
>
>
[SNIP]


>
> nutch fetch 1432017717-23908 - this looks fine, but we already have 4881050 records
> instead of 4881110
>
> Map-Reduce Framework
> Map input records=4881050
> Map output records=4881050
>

Please have a look at the custom Hadoop counters for the fetch job that a number of us
devs have been adding over recent development cycles:

        FetcherStatus
                ACCESS_DENIED=4846
                EXCEPTION=1944714
                GONE=1293
                HitByTimeLimit-QueueFeeder=0
                MOVED=314310
                NOTFOUND=30906
                NOTMODIFIED=1
                SUCCESS=1882487
                TEMP_MOVED=150753
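As an aside, a quick way to sanity-check numbers like these is to tally the counter dump itself. A minimal sketch (this just parses the indented NAME=value lines as pasted above; it is illustrative, not a Nutch API):

```python
# Tally a pasted Hadoop counter dump (indented NAME=value lines),
# e.g. the FetcherStatus block above.
dump = """\
FetcherStatus
        ACCESS_DENIED=4846
        EXCEPTION=1944714
        GONE=1293
        HitByTimeLimit-QueueFeeder=0
        MOVED=314310
        NOTFOUND=30906
        NOTMODIFIED=1
        SUCCESS=1882487
        TEMP_MOVED=150753
"""

def parse_counters(text):
    """Return {counter_name: value} for every NAME=value line."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            name, _, value = line.partition("=")
            counters[name] = int(value)
    return counters

counters = parse_counters(dump)
print(counters["SUCCESS"])     # 1882487 pages fetched successfully
print(sum(counters.values()))  # 4329310 fetch outcomes recorded in total
```

Note how EXCEPTION alone accounts for nearly two million URLs, which is why far fewer records than expected make it through to later stages.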


>
> ---------------------------
>
> nutch parse 1432017717-23908
>
> Map-Reduce Framework
> Map input records=713961
> Map output records=702082
>
> We got only 713961 records - why? I can't understand it.
>

Again, please see the custom counters:

        ParserStatus
                failed=562
                success=258629

Thanks for pasting the entire log. Having TRACE-level logging really helps with
this type of debugging.
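If you want TRACE output from the fetcher and parser specifically, something along these lines in conf/log4j.properties should do it (the logger names below are the usual Nutch package names and the cmdstdout appender from the stock config; adjust for your version):

```properties
# Raise fetch/parse logging to TRACE (assumed package names; verify
# against the log4j.properties shipped with your Nutch release)
log4j.logger.org.apache.nutch.fetcher=TRACE,cmdstdout
log4j.logger.org.apache.nutch.parse=TRACE,cmdstdout
```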
Thanks
Lewis
