Hi,
   I did some more digging around - and noticed this in the output from readseg:

Recno:: 0
URL:: http://en.wikipedia.org/wiki/Districts_of_India/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 03 16:52:09 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Tue Jul 03 16:52:17 IST 2012
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1341314531887_pst_: notfound(14), lastModified=0: 
http://en.wikipedia.org/wiki/Districts_of_India/

Note the _pst_ : notfound(14)!!!

Does this mean that the URL returns a 404 status on fetch, and the fetcher 
therefore cannot carry on?
That would be strange, as parsechecker seems perfectly happy fetching this URL 
and parsing its links into outlinks.
So it may be that the failure to parse robots.txt is NOT the issue - the real 
issue is that the fetcher stops because it gets nothing back when it tries to 
fetch the contents of the URL: http://en.wikipedia.org/wiki/Districts_of_India/
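
As a quick sanity check outside of Nutch, something like this stand-alone 
snippet (nothing Nutch-specific assumed, just a plain HTTP HEAD request) should 
show what status code the server actually returns for that URL:

import java.net.HttpURLConnection;
import java.net.URL;

// Stand-alone check: ask the server directly what it returns for the URL
// that the fetcher recorded as notfound(14).
public class CheckStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://en.wikipedia.org/wiki/Districts_of_India/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        // Identify ourselves; some servers answer differently to blank agents.
        conn.setRequestProperty("User-Agent", "status-check-test");
        System.out.println(conn.getResponseCode() + " " + conn.getResponseMessage());
        conn.disconnect();
    }
}

If that prints 404, it would line up with the notfound(14) in the CrawlDatum above.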

Appreciate all the help that has been coming my way.
-Arijit




________________________________
 From: Ken Krugler <[email protected]>
To: [email protected] 
Sent: Monday, July 2, 2012 10:56 PM
Subject: Re: parsechecker fetches url but fetcher fails
 



On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
>   Since learning that nutch will be unable to crawl the javascript function 
>calls in href, I started looking for other alternatives. I decided to crawl 
>http://en.wikipedia.org/wiki/Districts_of_India.
>    I first tried injecting this URL and following the step-by-step approach up 
>to the fetcher - when I realized nutch did not fetch anything from this website. 
>I looked into logs/hadoop.log and found the following 3 lines - which I believe 
>could be saying that nutch is unable to parse the robots.txt on this website, 
>and therefore the fetcher stopped?
>
>    
>
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
>rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
>rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots 
>rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
The issue is that the Wikipedia robots.txt file contains malformed URLs - these 
three are missing the 'A' from the %3A sequence.
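
For what it's worth, the failure is easy to reproduce with a plain URLDecoder 
call - I'm assuming the RobotRulesParser decodes paths in roughly this way, but 
the malformed escape trips up a standard decode just the same:

import java.net.URLDecoder;

// The %3 in these paths is missing its second hex digit (it should be %3A,
// the escape for ':'), so a standard decode rejects the whole path.
public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        String path = "/wiki/Wikipedia%3Mediation_Committee/";
        try {
            System.out.println(URLDecoder.decode(path, "UTF-8"));
        } catch (IllegalArgumentException e) {
            // Prints something like the "can't decode path" warning you saw.
            System.out.println("can't decode path: " + path + " (" + e.getMessage() + ")");
        }
    }
}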


>    I tried checking the URL using parsechecker and no issues there! I think it 
>means that the robots.txt is malformed for this website, which is preventing 
>the fetcher from fetching anything. Is there a way to get around this problem, 
>as parsechecker seems to go on its merry way parsing?
This is an example of where having Nutch use the crawler-commons robots.txt 
parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031
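
To give a rough feel for it, calling crawler-commons directly looks something 
like the sketch below (written from memory, not taken from the actual patch on 
that issue, so treat it as approximate):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Rough sketch of calling crawler-commons directly (not the Nutch integration).
public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "http://en.wikipedia.org/robots.txt";

        // Pull down the robots.txt bytes.
        InputStream in = new URL(robotsUrl).openStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        in.close();

        // Parse with crawler-commons and ask about the page in question.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(robotsUrl, buf.toByteArray(),
                "text/plain", "nutch-test");
        System.out.println(rules.isAllowed(
                "http://en.wikipedia.org/wiki/Districts_of_India/"));
    }
}

The point being that it's built to cope with the messy robots.txt files you 
find in the wild, rather than giving up on them.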

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
