Hi John,

Thanks a lot for your answer. The fetch job turned out to be my mistake, but I still can't understand the situation with the parse job (and with the index job as well). We successfully fetched 1882487 documents, but parse reports only 258629 successes + 562 failures, and the MapReduce job took only 713961 input records. I suspect it could be some problem with Hadoop itself, but I have no idea what to check or how. Could you please give a bit more information if you have any assumptions?
BR Sergey Bolshakov

>Hi Ai,
>
>On Wed, May 20, 2015 at 1:03 AM, < [email protected] > wrote:
>
>>
>> I hope someone can give me advice: i run nutch over last version of
>> cloudera, i have 4 servers. I tried to crawl start pages and all links
>> from it (with same domain). I uploaded about 5 mln domains and see the next
>>
>
>[SNIP]
>
>>
>> nutch fetch 1432017717-23908 - fine, but already we got 4881050 instead
>> of 4881110
>>
>> Map-Reduce Framework
>>   Map input records=4881050
>>   Map output records=4881050
>>
>
>Please have a look at the Fetch custom Hadoop counters that a number of us
>dev's have been adding over previous development cycles
>
>  FetcherStatus
>    ACCESS_DENIED=4846
>    EXCEPTION=1944714
>    GONE=1293
>    HitByTimeLimit-QueueFeeder=0
>    MOVED=314310
>    NOTFOUND=30906
>    NOTMODIFIED=1
>    SUCCESS=1882487
>    TEMP_MOVED=150753
>
>>
>> ---------------------------
>>
>> nutch parse 1432017717-23908
>>
>> Map-Reduce Framework
>>   Map input records=713961
>>   Map output records=702082
>>
>> We took only 713961 records, why? I can't uderstand
>>
>
>Again please see the custom Counters
>
>  ParserStatus
>    failed=562
>    success=258629
>
>Thanks for pasting the entire LOG. It really helps when we have TRACE level
>logging for this type of debugging.
>Thanks
>Lewis
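For what it's worth, the FetcherStatus counters pasted above can be tallied to see where the documents went — a minimal sketch of that arithmetic, assuming (this is my assumption, not something the log states) that only SUCCESS fetches produce content the parse job could consume:

```python
# Counter names and values are copied verbatim from the FetcherStatus
# block in the log excerpt above.
fetcher_status = {
    "ACCESS_DENIED": 4846,
    "EXCEPTION": 1944714,
    "GONE": 1293,
    "HitByTimeLimit-QueueFeeder": 0,
    "MOVED": 314310,
    "NOTFOUND": 30906,
    "NOTMODIFIED": 1,
    "SUCCESS": 1882487,
    "TEMP_MOVED": 150753,
}

total_outcomes = sum(fetcher_status.values())
non_success = total_outcomes - fetcher_status["SUCCESS"]

print(f"total fetch outcomes          : {total_outcomes}")
print(f"successful fetches (SUCCESS)  : {fetcher_status['SUCCESS']}")
print(f"failed / redirected / skipped : {non_success}")

# The parse job's reported Map input (713961) is far below SUCCESS
# (1882487); that gap is exactly the discrepancy being asked about here,
# and the numbers alone don't explain it.
parse_map_input = 713961
print(f"SUCCESS minus parse map input : {fetcher_status['SUCCESS'] - parse_map_input}")
```

So the large EXCEPTION count explains the drop from 4881050 fetch inputs to 1882487 successes, but not the further drop to 713961 parse inputs.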

