Hi David,

Nutch follows redirects. You should check the URL you are redirected to:
  
http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
If it is
 - not blocked by URL filters
 - and not excluded by db.ignore.external.links (because it's an external link)
the redirect URL is fetched in the next round (cycle).

In Nutch 1.x it is possible to follow redirects immediately
(see http.redirect.max), but this has one disadvantage:
there is no deduplication! Multiple URLs (even hundreds)
may be redirected to one single document, and a crawler should fetch
the redirect target only once.
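
For illustration, a minimal sketch of how that property could be set in
Nutch 1.x. The value 3 is only an example I picked; check the description
of http.redirect.max in your nutch-default.xml. As far as I know, with 0 or
a negative value the fetcher does not follow redirects immediately but
records them for a later fetch cycle.

```xml
<!-- conf/nutch-site.xml (Nutch 1.x); goes inside the <configuration> element -->
<property>
  <name>http.redirect.max</name>
  <!-- example value: follow up to 3 redirects immediately -->
  <value>3</value>
</property>
```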

The property
 db.ignore.external.links
and the regex URL filter rule
 -[?*!@=]
apply to all kinds of links / URLs, including redirects.

So, with your configuration changes, redirects should be followed
(nutch-site.xml would be a better place to make the changes than
nutch-default.xml). Look for the redirect targets in the web table;
they should be there.
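
A minimal sketch of the override in nutch-site.xml, assuming the usual
layout where values in conf/nutch-site.xml take precedence over those in
nutch-default.xml. Only the property discussed above is shown; any other
settings you changed would go into the same file.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides, leave nutch-default.xml untouched -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <!-- false: links (and redirects) to other hosts are not ignored -->
    <value>false</value>
  </property>
</configuration>
```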

Sebastian

On 01/08/2013 01:15 PM, Michael Gang wrote:
> Hi all,
> 
> I have the following problem
> 
> I injected the url
> http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> In firefox the url is redirected to another page with the domain
> http://web.ebscohost.com/ehost/detail?...
> 
> I want to get the content of the result page.
> In nutch i get
> 
> bin/nutch readdb -url '
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature'
> -content
> key:
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> baseUrl:
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> status: 4 (status_redir_temp)
> fetchInterval:  2592000
> fetchTime:      1357644874578
> prevFetchTime:  1357644821312
> retries:        0
> modifiedTime:   0
> protocolStatus: TEMP_MOVED, args=[
> http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
> ]
> parseStatus:    (null)
> title:  null
> score:  1.0
> markers:        {dist=0, _injmrk_=y, _ftcmrk_=1357644850-1310231024,
> _gnmrk_=1357644850-1310231024}
> metadata _csh_ :        ?\ufffd
> metadata ___rdrdsc__ :  y
> contentType:    text/html
> content:start:
> <html><head><title>Object moved</title></head><body>
> <h2>Object moved to <a href="http://search.ebscohost.com/login.aspx?...
> .">here</a>.</h2>
> </body></html>
> 
> I see that there is a certain problem with redirect.
> I changed  in the nutch-default.xml
> db.ignore.internal.links and db.ignore.external.links to false and in
> conf/regex-urlfilter.txt i commented the line
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> 
> it still does not work.
> What did i do wrong ?
> Which additional file should be changed?
> 
> Thanks,
> David
> 
