Hi,

The crawl command is deprecated. Use the crawl script (bin/crawl) instead and give it a number of rounds greater than 1, so that it has a chance to fetch the redirect target.
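To illustrate why the number of rounds matters, here is a toy model in Python. It is not Nutch code: `fake_fetch` and `crawl` are hypothetical helpers, and the sketch assumes that a redirect seen in one round is only recorded, with its target fetched in a later round.

```python
# Toy model of a round-based crawl. This is NOT Nutch code; it only
# illustrates why a single round cannot reach a redirect target.

SOURCE = "http://0-search.proquest.com.alpha2.latrobe.edu.au/"
TARGET = (
    "https://alpha2.latrobe.edu.au:443/validate"
    "?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F"
)


def fake_fetch(url):
    """Hypothetical fetcher: SOURCE redirects to TARGET, anything else succeeds."""
    if url == SOURCE:
        return "redirect", TARGET
    return "success", "page content"


def crawl(seeds, rounds):
    """Run `rounds` generate/fetch/update cycles over a toy crawl database."""
    crawldb = {url: "unfetched" for url in seeds}
    for _ in range(rounds):
        # Generate: select everything that has not been fetched yet.
        batch = [url for url, status in crawldb.items() if status == "unfetched"]
        for url in batch:
            status, payload = fake_fetch(url)
            if status == "redirect":
                # Assumption: the target is only recorded here and is
                # fetched in the next round, not in this one.
                crawldb[url] = "redirected"
                crawldb.setdefault(payload, "unfetched")
            else:
                crawldb[url] = "fetched"
    return crawldb


if __name__ == "__main__":
    for rounds in (1, 2):
        print(rounds, crawl([SOURCE], rounds))
```

With one round the target is known but still unfetched, which matches the readseg dump quoted below. With two rounds the target gets fetched.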
J.

On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:

> Hi,
>
> I am trying to crawl the page at:
> "http://0-search.proquest.com.alpha2.latrobe.edu.au/"
>
> Here's the parse checker output.
>
> runtime/local/bin/nutch parsechecker -dumpText
> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> Looks like a redirection, and I did a parsechecker again for
> "https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F"
>
> The fetching and parsing was successful this time. I have set
> http.redirect.max to 5 and tried to crawl using nutch crawl:
>
> bin/nutch crawl testurl -depth 1
>
> and did a readseg on the above crawl. Here's the readseg dump:
>
> Recno:: 0
> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jul 16 00:43:28 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1405485810821
>
> Content::
> Version: -1
> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> contentType: text/plain
> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
> nutch.crawl.score=1.0 Location=
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> _fst_=35 nutch.segment.name=20140716004332 Connection=close
> Content-Type=text/plain Server=III 150 MIME-version=1.0
> Content:
>
> CrawlDatum::
> Version: 7
> Status: 35 (fetch_redir_temp)
> Fetch time: Wed Jul 16 00:43:38 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
> temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> Not sure why Nutch didn't fetch any content or parse any data or text when
> crawling the page! Did I miss setting some property? I am sure I have
> increased the redirect to 5. Using parsechecker, I was able to get the data
> and text parsed in two steps, so I think max redirect of 5 should be
> sufficient. Want to understand why parse checker works and crawl doesn't.
>
> Thanks,
> Vijay

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

