Thanks for that information. I've moved on to using regex-urlfilter instead of 
trying to filter by depth. It's probably better for what I'm trying to do 
anyway.
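In case it helps anyone else, the rules are along these lines in
conf/regex-urlfilter.txt (first matching rule wins; the hosts here are just
examples, not my exact list):

  # accept pages on the seed hosts only
  +^https?://(www\.)?thedailybeast\.com/
  +^https?://(www\.)?nytimes\.com/
  # reject everything else
  -.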


From: Sebastian Nagel <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Sent: Monday, September 25, 2017 9:36 AM
Subject: Re: depth scoring filter
Hi Michael,

I've just tried it with 1.12 and the recent master of 1.x - works as expected,
except for meta refresh redirects when the fetcher isn't parsing.
Actually, this has been an open issue for a few months. I'll try to address it in
the next few days - https://issues.apache.org/jira/browse/NUTCH-2261

A little background on what happens for meta refresh redirects:
 - the _depth_ is copied from the link source to the link target in the segment
 - that value is read back when the CrawlDb is updated with links and fetch
   status from the segment
 - _depth_=1000 is the fall-back if no _depth_ is found in the segment's
   CrawlDatum
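
To inspect what the CrawlDb holds for a single URL, something like this should
print its CrawlDatum including the _depth_ metadata (crawl/crawldb is a
placeholder path; the URL is one from your dump below):

  bin/nutch readdb crawl/crawldb -url http://www.google.com/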

But there may be some other reason. Starting from http://www.cnn.com/ with 3
cycles, I got only one page with the weird _depth_=1000. Maybe try it slowly,
cycle by cycle, and check where an entry in the CrawlDb goes wrong...
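
For example, one cycle done step by step might look like this (paths are
placeholders for your crawl directory):

  # generate a new segment from the CrawlDb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick the newest segment directory (they are timestamped)
  SEGMENT=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  # update the CrawlDb with links and fetch status from the segment
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # dump the CrawlDb and grep for entries that got _depth_=1000
  bin/nutch readdb crawl/crawldb -dump dump_after_cycle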

Best,
Sebastian

On 09/22/2017 04:57 AM, Michael Coffey wrote:
> I am still having trouble with the depth scoring filter, and now I have a 
> simpler test case. It does work, somewhat, when I give it a list of 50 seed 
> URLs, but when I give it a very short list, it fails.
> I have tried depth.max values in the range of 1-6. None of them work for the 
> short-list cases.
> 
> If my seed list contains just http://www.cnn.com/, it can do one
> generate/fetch/update cycle, but then fails saying "0 records selected for
> fetching" on the next cycle.
> The same is true if I give it this short list of urls:
> http://www.thedailybeast.com/
> http://www.thedailybeast.com
> https://thedailybeast.com/
> https://thedailybeast.com
> 
> The same is true for this short list of urls:
> https://nytimes.com/
> http://www.nytimes.com/
> https://www.nytimes.com/
> 
> In each case, the first cycle updates a reasonable-looking list of urls into 
> the crawldb, so it seems strange that the depth filter stops it from 
> selecting anything in subsequent rounds.
> The cnn seed works fine when I use scoring-opic instead of scoring-depth.
> 
> Here is a partial listing of the readdb dump from the failing cnn trial:
> http://www.cnn.com/   Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Sep 22 15:47:46 PDT 2017
> Modified time: Thu Sep 21 15:47:46 PDT 2017
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: d9a6e1aaedca7795ea469dce4929704a
> Metadata: 
>      _depth_=1
>    _pst_=success(1), lastModified=0
>    _rs_=77
>    Content-Type=text/html
>    _maxdepth_=3
>    nutch.protocol.code=200
> 
> http://www.google.com/   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> http://www.googletagservices.com/   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> http://www.i.cdn.cnn.com/   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> http://www.ugdturner.com/   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:11 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> http://z.cdn.turner.com/   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> https://plus.google.com/+cnn/posts   Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata: 
>      _depth_=1000
>    _maxdepth_=3
> 
> 
> From: Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>
> To: user <user@nutch.apache.org>
> Sent: Tuesday, September 19, 2017 11:43 PM
> Subject: Re: depth scoring filter
> Hi,
> 
> On 20 September 2017 at 06:36, Michael Coffey <mcof...@yahoo.com.invalid>
> wrote:
> 
>> I am trying to develop a news crawler and I want to prohibit it from
>> wandering too far away from the seed list that I provide.
>> It seems like I should use the DepthScoringFilter, but I am having trouble
>> getting it to work. After a few crawl cycles, all the _depth_ metadata say
>> either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look
>> like depths.
>> I have added a scoring.depth.max property to nutch-site.xml.
>> <property>
>>  <name>scoring.depth.max</name>
>>  <value>3</value>
>> </property>
>>
>>
> I use the same plugin to index only the seeds plus one level below. The value
> for that is 2, so your setup crawls the seeds plus two levels below.
> 
> I've never looked at the values of the _depth_ metadata and frankly, since the
> plugin does what it's supposed to do, I don't care what it stores in its
> metadata here.
> 
>> What do I need to do to limit the crawl frontier so it won't go more than N
>> hops from the seed list, if that is possible?
>>
>>
> As said above, it should be enough to set the value to N+1.
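> 
> For example, for N=2 the relevant bits of nutch-site.xml would look roughly
> like this (the plugin.includes value is a trimmed example - the key change is
> scoring-depth in place of scoring-opic; keep whatever other plugins you need):
> 
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(html|tika)|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
>  <name>scoring.depth.max</name>
>  <!-- N+1, for N=2 hops from the seeds -->
>  <value>3</value>
> </property>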
> 



   
