Hi
On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:

> Thanks for your reply Julien. I am not doing any indexing and I don’t have
> a solr url. Looks like crawl script requires me to specify a solr url. How
> do I run crawl script without specifying a solr url.

Just comment out the commands related to SOLR in the script and pass it a
dummy parameter for the SOLR url.

> Also, I want to crawl just the webpage I specify: a depth of 1. I don’t
> want to fetch any outlinks.

That can be done by setting db.update.additions.allowed to false in
nutch-site.xml. No new URLs will be added to the crawldb.

> How does number of rounds relate to depth? Are they same?

No. They would be the same only if there were no redirections and if you
were putting all unfetched URLs in the segments. If there are more unfetched
URLs in the crawldb than you are putting in the segments, then you'll
definitely need several iterations.

> If so, what value should I specify for number of rounds to fetch just the
> page I specify and also take care of the redirects. Are http.redirect.max
> and number of rounds related?

Set http.redirect.max to a value > 0 so that the redirection gets tried
within the same fetch step (i.e. the same round).

HTH

Julien

>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
> wrote:
>
> > Hi,
> >
> > The crawl command is deprecated, use the crawl script instead and give it a
> > number of rounds > 1 so that it has a chance to fetch the redirection
> >
> > J.
> >
> >
> > On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I am trying to crawl the page at:
> >> "http://0-search.proquest.com.alpha2.latrobe.edu.au/"
> >>
> >> Here’s the parse checker output.
> >>
> >> runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> >> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >>
> >> Looks like a redirection and I did a parsechecker again for
> >> "https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F"
> >>
> >> The fetching and parsing was successful this time. I have set
> >> http.redirect.max at 5 and tried to crawl using nutch crawl:
> >>
> >> bin/nutch crawl testurl -depth 1
> >>
> >> and did a readseg on the above crawl. Here’s the readseg dump:
> >>
> >> Recno:: 0
> >> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >>
> >> CrawlDatum::
> >> Version: 7
> >> Status: 1 (db_unfetched)
> >> Fetch time: Wed Jul 16 00:43:28 EDT 2014
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: _ngt_: 1405485810821
> >>
> >> Content::
> >> Version: -1
> >> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> contentType: text/plain
> >> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
> >> nutch.crawl.score=1.0
> >> Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >> _fst_=35 nutch.segment.name=20140716004332 Connection=close
> >> Content-Type=text/plain Server=III 150 MIME-version=1.0
> >> Content:
> >>
> >> CrawlDatum::
> >> Version: 7
> >> Status: 35 (fetch_redir_temp)
> >> Fetch time: Wed Jul 16 00:43:38 EDT 2014
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
> >> temp_moved(13), lastModified=0:
> >> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >>
> >> Not sure why nutch didn’t fetch any content or parse any data or text when
> >> crawling the page! Did I miss setting some property? I am sure I have
> >> increased the redirect to 5. Using parsechecker, I was able to get the data
> >> and text parsed in two steps, so I think max redirect of 5 should be
> >> sufficient. Want to understand why parse checker works and crawl doesn’t.
> >>
> >> Thanks,
> >> Vijay
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
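
For reference, a minimal sketch of how the two settings discussed in this thread might look in conf/nutch-site.xml. The property names are the ones given in the reply above; the value 5 for http.redirect.max is simply the one Vijay reports using (any value > 0 should have the same effect for a single redirect), and the comments are illustrative wording, not text copied from nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Do not add newly discovered URLs (outlinks) to the crawldb,
       so only the injected seed URLs are ever fetched. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>

  <!-- Follow redirects inside the fetch step itself instead of
       queuing the target for a later round. Assumed value: 5. -->
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>
</configuration>
```

With both properties set this way, one round of the crawl script should in principle be enough for a seed that redirects, since the redirect target is tried within the same fetch step rather than waiting for the next generate/fetch cycle.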

