Thanks for your response, Julien. Is there a way I can bypass the robots check
in the normal crawl?

Thanks,
Vijay

On Jul 9, 2014, at 11:46 AM, Julien Nioche <[email protected]> 
wrote:

> The clue is in the metadata: _ngt_: 1404918941993_pst_: robots_denied(18),
> lastModified=0
> 
> The server you are hitting prevents robots, see
> http://79657.70194.14886.graphicspotting.com/robots.txt
> 
> The parsechecker does not check for robots.txt whereas the normal crawl
> operations do.
> 
> Julien
> 
> 
> 
> 
> On 9 July 2014 16:34, Vijay Chakilam <[email protected]> wrote:
> 
>> Hi,
>> 
>> I am using Nutch 1.7 and I tried to crawl this url:
>> http://79657.70194.14886.graphicspotting.com/
>> 
>> Created the seed url file with the url:
>> http://79657.70194.14886.graphicspotting.com/
>> Crawled the url using the crawl command: bin/nutch crawl url -depth 1
>> Ran a readseg to dump the segment:
>> 
>> Here’s the dump:
>> 
>> Recno:: 0
>> URL:: http://79657.70194.14886.graphicspotting.com/
>> 
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Jul 09 11:15:39 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1404918941993
>> 
>> CrawlDatum::
>> Version: 7
>> Status: 37 (fetch_gone)
>> Fetch time: Wed Jul 09 11:15:46 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0
>> 
>> I don’t see any content, no parse data or text.
>> 
>> I tried to use parsechecker and here’s the output:
>> 
>> vijay$ bin/nutch parsechecker -dumpText
>> http://79657.70194.14886.graphicspotting.com/
>> fetching: http://79657.70194.14886.graphicspotting.com/
>> parsing: http://79657.70194.14886.graphicspotting.com/
>> contentType: text/html
>> signature: 9f695936ef3bf29b0d1556df1aec7da8
>> ---------
>> Url
>> ---------------
>> 
>> http://79657.70194.14886.graphicspotting.com/
>> ---------
>> ParseData
>> ---------
>> 
>> Version: 5
>> Status: success(1,0)
>> Title: 79657.70194.14886.graphicspotting
>> Outlinks: 3
>>  outlink: toUrl:
>> http://79657.70194.14886.graphicspotting.com/../css/style.css anchor:
>>  outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php
>> anchor: 79657.70194.14886.graphicspotting
>>  outlink: toUrl:
>> http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor:
>> Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close
>> Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> ---------
>> ParseText
>> ---------
>> 
>> 79657.70194.14886.graphicspotting Welcome to
>> 79657.70194.14886.graphicspotting ©2014 79657.70194.14886.graphicspotting.
>> All rights reserved
>> 
>> Not sure why I am not able to get any parse data or parse text in readseg,
>> whereas parsechecker is able to extract data and text.
>> 
>> Thanks,
>> Vijay
> 
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
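
For readers following the thread, the robots check that Julien describes (performed by the normal crawl, skipped by parsechecker) can be sketched roughly as below. This is a minimal illustration in Python using the standard library's urllib.robotparser, not Nutch's own code. The robots.txt body, the agent name "example-crawler", and the helper is_allowed are all assumptions made up for this example; the actual robots.txt served by the site was not quoted in the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that denies every robot. The real file
# on the server was not quoted in this thread, so this is an assumption.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://79657.70194.14886.graphicspotting.com/"

# A polite crawler consults robots.txt before fetching and skips the
# page when access is denied; a one-off checker may fetch regardless.
if is_allowed(ASSUMED_ROBOTS_TXT, "example-crawler", url):
    print("fetch", url)
else:
    print("skip (robots denied)", url)
```

Under the assumed rules the URL is skipped before any content is fetched, which is consistent with a segment dump that records robots_denied and contains no parse data or text.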
