Hi Ai,
On Wed, May 20, 2015 at 1:03 AM, <[email protected]> wrote:
>
> I hope someone can give me advice: I run Nutch on the latest version of
> Cloudera, on 4 servers. I tried to crawl start pages and all links
> from them (within the same domain). I uploaded about 5 million domains and see the following:
>
>
[SNIP]
>
> nutch fetch 1432017717-23908 - fine, but we already got 4881050 records
> instead of 4881110
>
> Map-Reduce Framework
> Map input records=4881050
> Map output records=4881050
>
Please have a look at the custom Hadoop counters for the fetch phase, which a
number of us devs have been adding over previous development cycles:
FetcherStatus
ACCESS_DENIED=4846
EXCEPTION=1944714
GONE=1293
HitByTimeLimit-QueueFeeder=0
MOVED=314310
NOTFOUND=30906
NOTMODIFIED=1
SUCCESS=1882487
TEMP_MOVED=150753
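For a quick read of those numbers, you can tally the counters with a short script (a sketch in Python, not part of Nutch; the figures are copied from the FetcherStatus output above, and note that the per-status total does not necessarily equal the map record count):

```python
# Sketch: tally the FetcherStatus counters pasted above
# (numbers copied from the job output; not part of Nutch itself).
fetcher_status = {
    "ACCESS_DENIED": 4846,
    "EXCEPTION": 1944714,
    "GONE": 1293,
    "HitByTimeLimit-QueueFeeder": 0,
    "MOVED": 314310,
    "NOTFOUND": 30906,
    "NOTMODIFIED": 1,
    "SUCCESS": 1882487,
    "TEMP_MOVED": 150753,
}

total = sum(fetcher_status.values())
map_input_records = 4881050  # from the fetch job's Map-Reduce counters

# Show each status as a share of the map input records, largest first.
for status, count in sorted(fetcher_status.items(), key=lambda kv: -kv[1]):
    print(f"{status:28s} {count:>9d}  ({100.0 * count / map_input_records:5.2f}%)")
print(f"{'counter total':28s} {total:>9d}")
```

Seeing the breakdown this way makes it obvious at a glance that EXCEPTION and MOVED, not lost records, account for most of the gap between the fetch list and the SUCCESS count.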
>
> ---------------------------
>
> nutch parse 1432017717-23908
>
> Map-Reduce Framework
> Map input records=713961
> Map output records=702082
>
> We got only 713961 records. Why? I can't understand it.
>
Again, please see the custom counters:
ParserStatus
failed=562
success=258629
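The same quick tally works for the parse side (again just a sketch; the numbers are copied from the ParserStatus counters above and the parse job's Map-Reduce counters):

```python
# Sketch: compare the ParserStatus counters with the parse job's
# map record counts (numbers copied from this thread).
parser_status = {"failed": 562, "success": 258629}

map_input_records = 713961   # records fed to the parse job
map_output_records = 702082  # records it emitted

dropped = map_input_records - map_output_records
print(f"parse success: {parser_status['success']}, failed: {parser_status['failed']}")
print(f"records dropped between map input and output: {dropped}")
```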
Thanks for pasting the entire log. It really helps when we have TRACE-level
logging for this type of debugging.
Thanks
Lewis