Hi David,

Nutch follows redirects. You should check the URL you are redirected to:
http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409

If it is
- not blocked by URL filters
- or by db.ignore.external.links (because it's an external link)
the redirect URL is fetched in the next round (cycle).
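
As a rough illustration of these two checks (a sketch only, not Nutch's actual code), here is how one could test a redirect target by hand in Python. The function names are made up for this example. It assumes that the external-link check compares host names and that the regex rule -[?*!@=] rejects any URL containing one of those characters.

```python
import re
from urllib.parse import urlsplit

# Character class of the "-[?*!@=]" rule from conf/regex-urlfilter.txt.
SKIP_QUERY_CHARS = re.compile(r"[?*!@=]")

# URLs taken from this thread: the injected URL and its redirect target.
SOURCE = ("http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394"
          "&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article"
          "&issue=6696&title=Nature")
TARGET = ("http://search.ebscohost.com/login.aspx?direct=true&scope=site"
          "&db=a2h&AN=84164637&msid=943330409")


def blocked_by_regex_rule(url):
    """True if the URL contains a character the '-[?*!@=]' rule rejects."""
    return SKIP_QUERY_CHARS.search(url) is not None


def is_external(source_url, target_url):
    """True if the target host differs from the source host (assumed semantics)."""
    return urlsplit(source_url).hostname != urlsplit(target_url).hostname


def redirect_would_be_kept(source_url, target_url,
                           ignore_external_links, query_rule_active):
    """Hypothetical helper: combine both checks for one redirect target."""
    if query_rule_active and blocked_by_regex_rule(target_url):
        return False
    if ignore_external_links and is_external(source_url, target_url):
        return False
    return True


if __name__ == "__main__":
    # With both restrictions switched off, the target should be kept.
    print(redirect_would_be_kept(SOURCE, TARGET, False, False))
```

With either restriction active the target above is dropped: it contains "?" and "=", and its host differs from the host of the injected URL.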

In Nutch 1.x there is a possibility to follow redirects immediately, see
http.redirect.max, but it has one disadvantage: there is no deduplication!
Because multiple URLs (even hundreds) may be redirected to one single document,
a crawler should fetch the redirect target only once.

The properties db.ignore.external.links and the regex URL filter rule
-[?*!@=]
apply to all kinds of links / URLs including redirects.

So, with your configuration changes (nutch-site.xml would be a better place
to do the changes) redirects should be followed. Look for the redirect targets
in the web table, they should be there.

Sebastian

On 01/08/2013 01:15 PM, Michael Gang wrote:
> Hi all,
>
> I have the following problem
>
> I injected the url
> http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> In firefox the url is redirected to another page with the domain
> http://web.ebscohost.com/ehost/detail?...
>
> I want to get the content of the result page.
> In nutch i get
>
> bin/nutch readdb -url '
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature'
> -content
> key:
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> baseUrl:
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> status: 4 (status_redir_temp)
> fetchInterval: 2592000
> fetchTime: 1357644874578
> prevFetchTime: 1357644821312
> retries: 0
> modifiedTime: 0
> protocolStatus: TEMP_MOVED, args=[
> http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
> ]
> parseStatus: (null)
> title: null
> score: 1.0
> markers: {dist=0, _injmrk_=y, _ftcmrk_=1357644850-1310231024,
> _gnmrk_=1357644850-1310231024}
> metadata _csh_ : ?\ufffd
> metadata ___rdrdsc__ : y
> contentType: text/html
> content:start:
> <html><head><title>Object moved</title></head><body>
> <h2>Object moved to <a href="http://search.ebscohost.com/login.aspx?...
> .">here</a>.</h2>
> </body></html>
>
> I see that there is a certain problem with redirect.
> I changed in the nutch-default.xml
> db.ignore.internal.links and db.ignore.external.links to false and in
> conf/regex-urlfilter.txt i commented the line
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> it still does not work.
> What did i do wrong ?
> Which additional file should be changed?
>
> Thanks,
> David
>