Hi, I am using Nutch 1.7 and I tried to crawl this url: http://79657.70194.14886.graphicspotting.com/

Steps I followed:

1. Created the seed URL file with the URL:
     http://79657.70194.14886.graphicspotting.com/
2. Crawled the URL using the crawl command:
     bin/nutch crawl url -depth 1
3. Ran a readseg to dump the segment.

Here’s the dump:

    Recno:: 0
    URL:: http://79657.70194.14886.graphicspotting.com/

    CrawlDatum::
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Wed Jul 09 11:15:39 EDT 2014
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.0
    Signature: null
    Metadata: _ngt_: 1404918941993

    CrawlDatum::
    Version: 7
    Status: 37 (fetch_gone)
    Fetch time: Wed Jul 09 11:15:46 EDT 2014
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.0
    Signature: null
    Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0

I don’t see any content, parse data, or parse text in the dump.

I then tried parsechecker, and here’s the output:

    vijay$ bin/nutch parsechecker -dumpText http://79657.70194.14886.graphicspotting.com/
    fetching: http://79657.70194.14886.graphicspotting.com/
    parsing: http://79657.70194.14886.graphicspotting.com/
    contentType: text/html
    signature: 9f695936ef3bf29b0d1556df1aec7da8
    --------- Url ---------------
    http://79657.70194.14886.graphicspotting.com/
    --------- ParseData ---------
    Version: 5
    Status: success(1,0)
    Title: 79657.70194.14886.graphicspotting
    Outlinks: 3
      outlink: toUrl: http://79657.70194.14886.graphicspotting.com/../css/style.css anchor:
      outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php anchor: 79657.70194.14886.graphicspotting
      outlink: toUrl: http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor:
    Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15
    Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
    --------- ParseText ---------
    79657.70194.14886.graphicspotting Welcome to 79657.70194.14886.graphicspotting ©2014 79657.70194.14886.graphicspotting. All rights reserved

I am not sure why I don’t get any parse data or parse text from readseg, whereas parsechecker is able to extract both the data and the text.

Thanks,
Vijay

