Hi Jigal,

>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
> Will try that.

Please note that 2 is the right value.  We've discussed this behind
the scenes and Julien confirmed that 2 is the correct value for your
use case:
 depth 1  :  fetch the seeds only
 depth 2  :  seeds + pages reachable by one link/hop from the seeds
The description does specify this but does not give an example.
Feel free to open a Jira issue to improve the description.
Whether list indexes or counts start from 0 or 1 is a frequent
source of misunderstanding among programmers.
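
If it helps, a minimal sketch of the relevant nutch-site.xml entries
(the plugin.includes value below is only illustrative; append
scoring-depth to whatever list you already use):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>scoring.depth.max</name>
  <value>2</value>
</property>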


> Is my assumption correct that if
>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>
> is used that only db.fetch.interval.default is used? All the other properties
> are then ignored?

All db.fetch.schedule.adaptive.* properties are then ignored.
db.fetch.interval.max is still used to determine when 404 pages
are retried, since removed pages may reappear after some time.
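
For your use case (revisit everything daily), a minimal sketch of the
interval settings in nutch-site.xml (values are in seconds; the
numbers below are only examples, adjust to taste):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <!-- 86400 seconds = 1 day -->
  <value>86400</value>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <!-- retry gone/404 pages after 7 days (example value) -->
  <value>604800</value>
</property>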

> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
>
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
>
> Writing a plug in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.
>

I once hit exactly the same problem with such "nice" customized 404 pages,
and my solution was also to handle it at the index level:
if the layout of the 404 pages changes you can react quickly,
and if the index is not too big it is clean again after a couple of minutes,
while it definitely takes longer to reconfigure the crawler and recrawl
the content (or reparse and reindex).
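
For reference, the delete-by-query message sent to the Solr update
handler (POST to .../update followed by a commit; the field name
"content" is the one usually filled by indexer-solr, but check your
schema) could look like:

<delete>
  <query>content:"Code 303 Description"</query>
</delete>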

Cheers,
Sebastian


On 04/06/2016 04:14 PM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Julien and Sebastian,
> 
> Thank you for your replies!
> 
> (both replies had a lot of similarities, so I'll answer them both)
> 
> On 6 April 2016 at 14:16, Sebastian Nagel <[email protected]>
> wrote:
> 
>>> One site is indexed by Nutch. Now it should be limited to the pages that
>>> are linked in the seed URL (no further crawling necessary).
>> Have a look at the plugin "scoring-depth" and add to your nutch-site.xml
>> (cf. conf/nutch-default.xml):
>>
>>
>> <!-- scoring-depth properties
>>  Add 'scoring-depth' to the list of active plugins
>>  in the parameter 'plugin.includes' in order to use it.
>>  -->
>>
>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
>>   <description>Max depth value from seed allowed by default.
>>   Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>>   as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>>   to track the distance from the seed it was found from.
>>   The depth is used to prioritise URLs in the generation step so that
>>   shallower pages are fetched first.
>>   </description>
>> </property>
>>
> 
> Will try that.
> 
> 
>>
>>> Furthermore all
>>> pages must be revisited daily (and new pages must be indexed daily too).
>>
>> See property "db.fetch.interval.default",
>> also take the time to check other
>>   db.fetch.interval.*
>>   db.fetch.schedule.*
>> properties.
>>
> 
> Is my assumption correct that if
> 
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>   <description>The implementation of fetch schedule. DefaultFetchSchedule
> simply
>   adds the original fetchInterval to the last fetch time, regardless of
>   page changes.</description>
> </property>
> 
> is used that only db.fetch.interval.default is used? All the other properties
> are then ignored?
> 
> 
>>> Another wish is to exclude pages with certain content on them. Currently
>>> we do this by a delete query after Nutch finishes. We can keep it this way,
>>> but I wondered if there was a smarter option.
>>
>> How is such content identified?
>>
> 
> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
> 
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
> 
> Writing a plug in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.
> 
