Hi Michael,

Add this to nutch-site.xml and try a fresh crawl. (Note that you also need the configs suggested by Sebastian.)

<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

Thanks,
Tejas Patil

On Mon, Jan 14, 2013 at 11:49 PM, Michael Gang <[email protected]> wrote:

> Hi,
>
> Now I have a question.
> Let's say I want to fetch a list of URLs and I want to follow redirects,
> but I don't want to fetch other outgoing URLs.
> How do I accomplish this with Nutch 2.1?
>
> Thanks,
> David
>
> On Tue, Jan 8, 2013 at 10:49 PM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi David,
> >
> > Nutch follows redirects. You should check the URL you are redirected to:
> >
> > http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
> >
> > If it is
> > - not blocked by URL filters
> > - or by db.ignore.external.links (because it's an external link)
> > the redirect URL is fetched in the next round (cycle).
> >
> > In Nutch 1.x there is a possibility to follow redirects immediately,
> > see http.redirect.max, but it has one disadvantage:
> > there is no deduplication! Because multiple URLs (even hundreds)
> > may be redirected to one single document, a crawler should fetch
> > the redirect target only once.
> >
> > The property
> >   db.ignore.external.links
> > and the regex URL filter rule
> >   -[?*!@=]
> > apply to all kinds of links / URLs, including redirects.
> >
> > So, with your configuration changes (nutch-site.xml would be a better
> > place to make the changes), redirects should be followed. Look for the
> > redirect targets in the web table; they should be there.
> >
> > Sebastian
> >
> > On 01/08/2013 01:15 PM, Michael Gang wrote:
> > > Hi all,
> > >
> > > I have the following problem.
> > >
> > > I injected the URL
> > >
> > > http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > >
> > > In Firefox the URL is redirected to another page with the domain
> > > http://web.ebscohost.com/ehost/detail?...
> > >
> > > I want to get the content of the result page.
> > > In Nutch I get
> > >
> > > bin/nutch readdb -url 'http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature' -content
> > > key: http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > > baseUrl: http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
> > > status: 4 (status_redir_temp)
> > > fetchInterval: 2592000
> > > fetchTime: 1357644874578
> > > prevFetchTime: 1357644821312
> > > retries: 0
> > > modifiedTime: 0
> > > protocolStatus: TEMP_MOVED, args=[http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409]
> > > parseStatus: (null)
> > > title: null
> > > score: 1.0
> > > markers: {dist=0, _injmrk_=y, _ftcmrk_=1357644850-1310231024, _gnmrk_=1357644850-1310231024}
> > > metadata _csh_ : ?\ufffd
> > > metadata ___rdrdsc__ : y
> > > contentType: text/html
> > > content:start:
> > > <html><head><title>Object moved</title></head><body>
> > > <h2>Object moved to <a href="http://search.ebscohost.com/login.aspx?....">here</a>.</h2>
> > > </body></html>
> > >
> > > I see that there is a certain problem with the redirect.
> > > In nutch-default.xml I changed db.ignore.internal.links and
> > > db.ignore.external.links to false, and in conf/regex-urlfilter.txt
> > > I commented out the line
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > #-[?*!@=]
> > >
> > > It still does not work.
> > > What did I do wrong?
> > > Which additional file should be changed?
> > >
> > > Thanks,
> > > David
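
To see why the rule -[?*!@=] matters for this redirect, here is a minimal Python sketch of first-match URL filtering in the style of conf/regex-urlfilter.txt. It is only an illustration under stated assumptions, not Nutch's actual code: the function name accepts, the two rule lists, and the catch-all "+." rule are made up for the example. It assumes that a leading '-' rejects, a leading '+' accepts, the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected.

```python
import re

# Hypothetical rule lists modeled on conf/regex-urlfilter.txt.
# DEFAULT_RULES keeps the "probable queries" rule active;
# RELAXED_RULES has it commented out, as described in the thread.
DEFAULT_RULES = ["-[?*!@=]", "+."]
RELAXED_RULES = ["#-[?*!@=]", "+."]

# Redirect target reported in protocolStatus (taken from the thread).
REDIRECT_TARGET = (
    "http://search.ebscohost.com/login.aspx"
    "?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409"
)


def accepts(url, rules):
    """Return True if the first matching rule accepts the URL.

    Assumed semantics: '+' accepts, '-' rejects, blank lines and lines
    starting with '#' are skipped, and a URL that matches no rule is
    rejected.
    """
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    print("default rules:", accepts(REDIRECT_TARGET, DEFAULT_RULES))
    print("relaxed rules:", accepts(REDIRECT_TARGET, RELAXED_RULES))
```

With the rule active, the redirect target is dropped because it contains '?' and '='. With the rule commented out, it passes this filter, so any remaining block would have to come from another setting such as db.ignore.external.links.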

