Unable to fetch content

Vijay Chakilam Thu, 17 Jul 2014 13:10:40 -0700

Hi,

I am trying to crawl the page at: 
"http://0-search.proquest.com.alpha2.latrobe.edu.au/"


Here’s the parsechecker output:

runtime/local/bin/nutch parsechecker -dumpText 
http://0-search.proquest.com.alpha2.latrobe.edu.au/
fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
Fetch failed with protocol status: temp_moved(13), lastModified=0: 
https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F

This looks like a redirect, so I ran parsechecker again on the redirect target:
"https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F"

Fetching and parsing were successful this time. I have set http.redirect.max 
to 5 and tried to crawl using nutch crawl:

bin/nutch crawl testurl -depth 1

and did a readseg on the above crawl. Here’s the readseg dump:

Recno:: 0
URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jul 16 00:43:28 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1405485810821

Content::
Version: -1
url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
contentType: text/plain
metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent 
nutch.crawl.score=1.0 
Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
 _fst_=35 nutch.segment.name=20140716004332 Connection=close 
Content-Type=text/plain Server=III 150 MIME-version=1.0 
Content:

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Wed Jul 16 00:43:38 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1405485810821 Content-Type: text/plain _pst_: temp_moved(13), 
lastModified=0: 
https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F

I am not sure why Nutch didn’t fetch any content or parse any data or text when 
crawling the page. Did I miss setting some property? I am sure I have increased 
http.redirect.max to 5. Using parsechecker, I was able to get the data and text 
parsed in two steps, so I think a maximum of 5 redirects should be sufficient. 
I want to understand why parsechecker works and crawl doesn’t.
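
For reference, the redirect setting described above would look roughly like 
this. It is only a sketch of the property as described in this message: the 
file name conf/nutch-site.xml is the usual override location and is an 
assumption here, not copied from the actual setup.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml (assumed location); only the redirect
     property mentioned in this message is shown. -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- Maximum number of redirects the fetcher follows immediately;
         a value of 0 or less records the redirect target for a later round. -->
    <value>5</value>
  </property>
</configuration>
```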

Thanks,
Vijay
