Re: problem with nutch2.1 and redirect

Michael Gang Mon, 14 Jan 2013 23:50:33 -0800

Hi,

Now I have a question.
Let's say I want to fetch a list of URLs and follow redirects,
but I don't want to fetch any other outgoing URLs.
How do I accomplish this with Nutch 2.1?


Thanks,
David


On Tue, Jan 8, 2013 at 10:49 PM, Sebastian Nagel wrote:

> Hi David,
>
> Nutch follows redirects. You should check the URL you are redirected to:
>
> http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
> If it is not blocked
>  - by URL filters
>  - or by db.ignore.external.links (because it's an external link)
> the redirect URL is fetched in the next round (cycle).
>
> In Nutch 1.x there is a possibility to follow redirects immediately,
> see http.redirect.max but it has one disadvantage:
> there is no deduplication! Because multiple URLs (even hundreds)
> may be redirected to one single document, a crawler should fetch
> the redirect target only once.
>
> The property
>  db.ignore.external.links
> and the regex URL filter rule
>  -[?*!@=]
> apply to all kinds of links / URLs including redirects.
>
> So, with your configuration changes (nutch-site.xml would be a better
> place to make them), redirects should be followed. Look for the
> redirect targets in the web table; they should be there.
>
> Sebastian
>
> On 01/08/2013 01:15 PM, Michael Gang wrote:
> > Hi all,
> >
> > I have the following problem.
> >
> > I injected the URL
> >
> http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > In Firefox the URL is redirected to another page with the domain
> > http://web.ebscohost.com/ehost/detail?...
> >
> > I want to get the content of the result page.
> > In Nutch I get
> >
> > bin/nutch readdb -url 'http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature' -content
> > key:
> >
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > baseUrl:
> >
> http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > status: 4 (status_redir_temp)
> > fetchInterval:  2592000
> > fetchTime:      1357644874578
> > prevFetchTime:  1357644821312
> > retries:        0
> > modifiedTime:   0
> > protocolStatus: TEMP_MOVED, args=[
> >
> http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
> > ]
> > parseStatus:    (null)
> > title:  null
> > score:  1.0
> > markers:        {dist=0, _injmrk_=y, _ftcmrk_=1357644850-1310231024,
> > _gnmrk_=1357644850-1310231024}
> > metadata _csh_ :        ?\ufffd
> > metadata ___rdrdsc__ :  y
> > contentType:    text/html
> > content:start:
> > <html><head><title>Object moved</title></head><body>
> > <h2>Object moved to <a href="http://search.ebscohost.com/login.aspx?...
> > .">here</a>.</h2>
> > </body></html>
> >
> > I see that there is a problem with the redirect.
> > In nutch-default.xml I changed db.ignore.internal.links and
> > db.ignore.external.links to false, and in conf/regex-urlfilter.txt
> > I commented out the line
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > It still does not work.
> > What did I do wrong?
> > Which additional file should be changed?
> >
> > Thanks,
> > David
> >
>
>
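
To make the two checks Sebastian describes concrete, here is a minimal Python sketch (not Nutch code) that applies them to the redirect target reported in protocolStatus above. The helper names are hypothetical, and the sketch assumes that "external" simply means the target host differs from the source host; Nutch's own filters are written in Java and may differ in detail.

```python
import re
from urllib.parse import urlsplit

# Character class from the default rule "-[?*!@=]" in regex-urlfilter.txt.
# A leading "-" in that file means "reject URLs matching this pattern".
SKIP_RULE = re.compile(r"[?*!@=]")

# URLs copied from the readdb output quoted in this thread.
SOURCE = (
    "http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394"
    "&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article"
    "&issue=6696&title=Nature"
)
TARGET = (
    "http://search.ebscohost.com/login.aspx?direct=true&scope=site"
    "&db=a2h&AN=84164637&msid=943330409"
)


def blocked_by_skip_rule(url: str) -> bool:
    """Hypothetical helper: True if the URL contains any of ? * ! @ =."""
    return SKIP_RULE.search(url) is not None


def is_external(from_url: str, to_url: str) -> bool:
    """Hypothetical helper: True if the two URLs are on different hosts.

    Assumption: "external" is decided by comparing host names only.
    """
    return urlsplit(from_url).hostname != urlsplit(to_url).hostname


if __name__ == "__main__":
    print("blocked by -[?*!@=]:", blocked_by_skip_rule(TARGET))
    print("external link:", is_external(SOURCE, TARGET))
```

With the default rule active, the redirect target is rejected because it contains "?" and "=". With db.ignore.external.links set to true it would also be dropped, since the host changes from openurl.ebscohost.com to search.ebscohost.com. Either check on its own is enough to keep the target out of the next fetch cycle.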
