Thanks for your reply Julien. I am not doing any indexing and I don't have a Solr URL, but the crawl script seems to require one. How do I run the crawl script without specifying a Solr URL? Also, I want to crawl only the page I specify, i.e. a depth of 1, without fetching any outlinks. How does the number of rounds relate to depth? Are they the same? If so, what value should I specify for the number of rounds so that it fetches just the page I specify while still handling the redirects? Are http.redirect.max and the number of rounds related?
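For context, here is how I currently have the redirect property set in my conf/nutch-site.xml (this is my own setting, not something from your reply; as I understand it, the default of 0 means a redirect target is only recorded for a later fetch round rather than followed immediately, which may be why the number of rounds matters here):

```xml
<!-- conf/nutch-site.xml: my current override; the shipped default for
     http.redirect.max is 0, which queues the redirect target for a
     later round instead of following it during the same fetch. -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Maximum number of redirects the fetcher will follow
  when trying to fetch a page.</description>
</property>
```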
Thanks,
Vijay

On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]> wrote:

> Hi,
>
> The crawl command is deprecated, use the crawl script instead and give it a
> number of rounds > 1 so that it has a chance to fetch the redirection
>
> J.
>
>
> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
>
>> Hi,
>>
>> I am trying to crawl the page at:
>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>
>> Here's the parsechecker output:
>>
>> runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>
>> This looks like a redirection, so I ran parsechecker again on
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> and this time the fetching and parsing were successful. I have set
>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>>
>> bin/nutch crawl testurl -depth 1
>>
>> and did a readseg on the above crawl.
>> Here's the readseg dump:
>>
>> Recno:: 0
>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1405485810821
>>
>> Content::
>> Version: -1
>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> contentType: text/plain
>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>> nutch.crawl.score=1.0 Location=
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>> Content:
>>
>> CrawlDatum::
>> Version: 7
>> Status: 35 (fetch_redir_temp)
>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
>> temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>
>> I am not sure why nutch didn't fetch any content or parse any data or text
>> when crawling the page. Did I miss setting some property? I am sure I have
>> increased the redirect limit to 5. Using parsechecker, I was able to get the
>> data and text parsed in two steps, so I think a max redirect of 5 should be
>> sufficient. I want to understand why parsechecker works and crawl doesn't.
>>
>> Thanks,
>> Vijay
>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

