Thanks for your response, Julien. Is there a way I can bypass the robots check in the normal crawl?
Thanks,
Vijay

On Jul 9, 2014, at 11:46 AM, Julien Nioche <[email protected]> wrote:

> The clue is in : Metadata: _ngt_: 1404918941993_pst_: robots_denied(18),
> lastModified=0
>
> The server you are hitting prevents robots, see
> http://79657.70194.14886.graphicspotting.com/robots.txt
>
> The parsechecker does not check for robots.txt whereas the normal crawl
> operations do.
>
> Julien
>
>
> On 9 July 2014 16:34, Vijay Chakilam <[email protected]> wrote:
>
>> Hi,
>>
>> I am using Nutch 1.7 and I tried to crawl this url:
>> http://79657.70194.14886.graphicspotting.com/
>>
>> Created the seed url file with the url:
>> http://79657.70194.14886.graphicspotting.com/
>> Crawled the url using the crawl command: bin/nutch crawl url -depth 1
>> Ran a readseg to dump the segment:
>>
>> Here’s the dump:
>>
>> Recno:: 0
>> URL:: http://79657.70194.14886.graphicspotting.com/
>>
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Jul 09 11:15:39 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1404918941993
>>
>> CrawlDatum::
>> Version: 7
>> Status: 37 (fetch_gone)
>> Fetch time: Wed Jul 09 11:15:46 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0
>>
>> I don’t see any content, no parse data or text.
>>
>> I tried to use parsechecker and here’s the output:
>>
>> vijay$ bin/nutch parsechecker -dumpText
>> http://79657.70194.14886.graphicspotting.com/
>> fetching: http://79657.70194.14886.graphicspotting.com/
>> parsing: http://79657.70194.14886.graphicspotting.com/
>> contentType: text/html
>> signature: 9f695936ef3bf29b0d1556df1aec7da8
>> ---------
>> Url
>> ---------------
>>
>> http://79657.70194.14886.graphicspotting.com/
>> ---------
>> ParseData
>> ---------
>>
>> Version: 5
>> Status: success(1,0)
>> Title: 79657.70194.14886.graphicspotting
>> Outlinks: 3
>> outlink: toUrl:
>> http://79657.70194.14886.graphicspotting.com/../css/style.css anchor:
>> outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php
>> anchor: 79657.70194.14886.graphicspotting
>> outlink: toUrl:
>> http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor:
>> Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close
>> Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> ---------
>> ParseText
>> ---------
>>
>> 79657.70194.14886.graphicspotting Welcome to
>> 79657.70194.14886.graphicspotting ©2014 79657.70194.14886.graphicspotting.
>> All rights reserved
>>
>> Not sure why I am not able to get any parse data or parse text in readseg,
>> whereas parsechecker is able to extract data and text.
>>
>> Thanks,
>> Vijay
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
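
For anyone reading the archive: the difference described in the thread (the normal crawl consults robots.txt before fetching, parsechecker does not) can be illustrated outside Nutch with a small Python sketch. This is not Nutch code. The robots.txt body below is an assumption, since the actual file on the server is not quoted anywhere in the thread, and the agent name "my-nutch-agent" is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# ASSUMED robots.txt body: the real file was not shown in the thread.
# A blanket "Disallow: /" is one rule set that would deny every robot.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

URL = "http://79657.70194.14886.graphicspotting.com/"


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access happens.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A crawler that honours robots.txt asks first and skips denied URLs.
    # "my-nutch-agent" is a hypothetical agent name.
    if is_allowed(ASSUMED_ROBOTS_TXT, "my-nutch-agent", URL):
        print("fetch permitted")
    else:
        print("robots denied, page is not fetched")
```

A tool that skips this check fetches and parses the page regardless, which would be consistent with parsechecker returning content while the crawl segment records robots_denied.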

