Hi Jigal,

>> <property>
>>  <name>scoring.depth.max</name>
>>  <value>2</value>

> Will try that.

Please note that 2 is the right value. We've discussed this behind the
scenes and Julien verified that the right value for your use case is 2.

 depth 1 : fetch seeds only
 depth 2 : seeds + pages reachable by one link/hop from the seeds

The description does specify this but does not give an example. Feel free
to open a Jira issue to improve the description. Whether you start list
indexes or counts from 0 or 1 is a frequent source of misunderstandings
among programmers.
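In case it saves you a lookup: activating the plugin only means appending
scoring-depth to plugin.includes in nutch-site.xml. The plugin list below
is just an illustration (yours will differ depending on which protocol,
parse and indexing plugins you use); the only relevant part is the added
scoring-depth:

 <property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
 </property>

And should you ever need a different limit for a single seed, the
"_maxdepth_=VALUE" metadata mentioned in the description goes directly
into the seed file, separated from the URL by a tab (if I remember the
injector format correctly).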
> Is my assumption correct that if
>
> <property>
>  <name>db.fetch.schedule.class</name>
>  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>
> is used that only db.fetch.interval.default is used? All the other
> properties are then ignored?

All db.fetch.schedule.adaptive.* are ignored then. db.fetch.interval.max
is used to determine when 404 pages are retried - removed pages may appear
again after some time.
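For the daily revisit you described, the relevant bits in nutch-site.xml
would look roughly like this (the interval is given in seconds, so 86400
is just my example for one day, adjust as needed):

 <property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
 </property>

 <property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
 </property>

With DefaultFetchSchedule the interval is simply added to the last fetch
time, so every page becomes due for refetching one day later, regardless
of whether it changed.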
> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
>
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
>
> Writing a plug-in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.

I once hit exactly the same problem of such "nice" customized 404 pages,
and my solution was also to handle it on the index level: if the layout of
the 404 pages changes you can react quickly, and if the index is not too
big it is clean again after a couple of minutes, while it definitely takes
longer to reconfigure the crawler and recrawl the content (or reparse and
reindex).
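For the archives, the kind of delete-by-query such a cron job sends to
Solr's update handler looks something like the XML below. The field name
"content" and the core name are only assumptions, adjust them to your
schema:

 <delete>
  <query>content:"Code 303 Description"</query>
 </delete>

posted (e.g. with curl) to http://localhost:8983/solr/<core>/update with
commit=true. Nothing wrong with keeping that approach as long as the
layout of the error pages stays stable.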
Cheers,
Sebastian

On 04/06/2016 04:14 PM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Julien and Sebastian,
>
> Thank you for your replies!
>
> (both replies had a lot of similarities, so I'll answer them both)
>
> On 6 April 2016 at 14:16, Sebastian Nagel <[email protected]>
> wrote:
>
>>> One site is indexed by Nutch. Now it should be limited to the pages that
>>> are linked in the seed URL (no further crawling necessary).
>>
>> Have a look at the plugin "scoring-depth" and add to your nutch-site.xml
>> (cf. conf/nutch-default.xml):
>>
>> <!-- scoring-depth properties
>>  Add 'scoring-depth' to the list of active plugins
>>  in the parameter 'plugin.includes' in order to use it.
>> -->
>>
>> <property>
>>  <name>scoring.depth.max</name>
>>  <value>2</value>
>>  <description>Max depth value from seed allowed by default.
>>  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>>  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>>  to track the distance from the seed it was found from.
>>  The depth is used to prioritise URLs in the generation step so that
>>  shallower pages are fetched first.
>>  </description>
>> </property>
>
> Will try that.
>
>>> Furthermore all
>>> pages must be revisited daily (and new pages must be indexed daily too).
>>
>> See property "db.fetch.interval.default",
>> also take the time to check other
>>  db.fetch.interval.*
>>  db.fetch.schedule.*
>> properties.
>
> Is my assumption correct that if
>
> <property>
>  <name>db.fetch.schedule.class</name>
>  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>  <description>The implementation of fetch schedule. DefaultFetchSchedule
>  simply adds the original fetchInterval to the last fetch time, regardless
>  of page changes.</description>
> </property>
>
> is used that only db.fetch.interval.default is used? All the other
> properties are then ignored?
>
>>> Another wish is to exclude pages with certain content on them. Currently
>>> we do this by a delete query after Nutch finishes. We can keep it this
>>> way, but I wondered if there was a smarter option.
>>
>> How is such content identified?
>
> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
>
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
>
> Writing a plug-in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.
