Hi,

I am using Nutch 1.7 and tried to crawl this URL:
http://79657.70194.14886.graphicspotting.com/

I created a seed URL file containing this URL:
http://79657.70194.14886.graphicspotting.com/
Then I crawled it using the crawl command: bin/nutch crawl url -depth 1
Finally, I ran readseg to dump the segment. Here’s the dump:

Recno:: 0
URL:: http://79657.70194.14886.graphicspotting.com/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jul 09 11:15:39 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1404918941993

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Wed Jul 09 11:15:46 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0

I don’t see any content, parse data, or parse text in the dump.

I tried to use parsechecker and here’s the output:

vijay$ bin/nutch parsechecker -dumpText http://79657.70194.14886.graphicspotting.com/
fetching: http://79657.70194.14886.graphicspotting.com/
parsing: http://79657.70194.14886.graphicspotting.com/
contentType: text/html
signature: 9f695936ef3bf29b0d1556df1aec7da8
---------
Url
---------------

http://79657.70194.14886.graphicspotting.com/
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: 79657.70194.14886.graphicspotting
Outlinks: 3
  outlink: toUrl: http://79657.70194.14886.graphicspotting.com/../css/style.css anchor: 
  outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php anchor: 79657.70194.14886.graphicspotting
  outlink: toUrl: http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor: 
Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15 
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
---------
ParseText
---------

79657.70194.14886.graphicspotting Welcome to 79657.70194.14886.graphicspotting 
©2014 79657.70194.14886.graphicspotting. All rights reserved

I’m not sure why readseg shows no parse data or parse text for this URL, 
whereas parsechecker is able to extract both.
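
One possible lead: the second CrawlDatum in the readseg dump carries 
robots_denied(18) in its metadata, which suggests the fetcher was turned 
away by the site's robots.txt rules. Here is a minimal Python sketch for 
testing a robots.txt body against an agent name offline. The robots.txt 
content and the agent name "MyNutchAgent" are hypothetical placeholders, not 
values taken from the site or from my Nutch configuration, and this uses 
Python's standard-library parser, which may not match Nutch's own rule 
handling exactly.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt body that blocks every agent from the whole site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /
"""

URL = "http://79657.70194.14886.graphicspotting.com/"

if __name__ == "__main__":
    # "MyNutchAgent" is a placeholder for whatever agent name the crawler sends.
    print(is_allowed(SAMPLE_ROBOTS, "MyNutchAgent", URL))
```

To check the real rules, the actual contents of the site's /robots.txt would 
replace SAMPLE_ROBOTS, and the agent name would be the one configured for the 
crawl.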

Thanks,
Vijay
