Thanks for that information. I've moved on to using regex-urlfilter instead of
trying to filter by depth. It's probably better for what I'm trying to do
anyway.
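For the archives, here is the kind of rules I mean, in conf/regex-urlfilter.txt
(the domains are just placeholders for my actual seed hosts; rules are checked
in order and the first matching rule wins):

  # accept only URLs on the seed hosts
  +^https?://(www\.)?cnn\.com/
  +^https?://(www\.)?nytimes\.com/
  # reject everything else
  -.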
From: Sebastian Nagel <[email protected]>
To: [email protected]
Sent: Monday, September 25, 2017 9:36 AM
Subject: Re: depth scoring filter
Hi Michael,
I've just tried it with 1.12 and the recent master of 1.x - it works as expected,
except for meta refresh redirects and when the fetcher isn't parsing.
Actually, this has been an open issue for a few months. I'll try to address it
in the next few days - https://issues.apache.org/jira/browse/NUTCH-2261
A little background on what happens for meta refresh redirects:
- the _depth_ is copied from the link source to the link target in the segment
- when the CrawlDb is updated with links and fetch status from the segment,
  _depth_=1000 is the fall-back if there is no _depth_ found in the segment's
  CrawlDatum (see the sketch below)
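Roughly, as a simplified sketch in Java (this only illustrates the fall-back
described above; it is not the actual DepthScoringFilter source, and the names
and types are made up for the example):

  import java.util.Map;

  public class DepthFallbackSketch {
    // If the segment's CrawlDatum carries no _depth_, 1000 is assumed.
    static int resolveDepth(Map<String, String> segmentMeta) {
      String d = segmentMeta.get("_depth_");
      return (d == null) ? 1000 : Integer.parseInt(d);
    }

    public static void main(String[] args) {
      // e.g. a meta refresh redirect target whose _depth_ was not copied
      System.out.println(resolveDepth(Map.of("_maxdepth_", "3"))); // 1000
    }
  }

This is why such link targets can show up in a dump with _depth_=1000 instead
of the source depth + 1.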
But there may be some other reason. Starting from http://www.cnn.com/ with 3
cycles, I got only one page with the weird _depth_=1000. Maybe try it slowly,
cycle by cycle, and check at which point an item in the CrawlDb goes wrong...
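For example, after each cycle (the paths are just examples, adjust them to
your crawl directory):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://www.cnn.com/

The -url option prints the CrawlDatum of a single entry, so you can watch the
_depth_ value of a suspicious URL from cycle to cycle.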
Best,
Sebastian
On 09/22/2017 04:57 AM, Michael Coffey wrote:
> I am still having trouble with the depth scoring filter, and now I have a
> simpler test case. It does work, somewhat, when I give it a list of 50 seed
> URLs, but when I give it a very short list, it fails.
> I have tried depth.max values in the range of 1-6. None of them work for the
> short-list cases.
>
> If my seed list contains just http://www.cnn.com/
> it can do one generate/fetch/update cycle, but then fails saying "0 records
> selected for fetching" on the next cycle.
> The same is true if I give it this short list of urls:
> http://www.thedailybeast.com/
> http://www.thedailybeast.com
> https://thedailybeast.com/
> https://thedailybeast.com
>
> The same is true for this short list of urls:
> https://nytimes.com/
> http://www.nytimes.com/
> https://www.nytimes.com/
>
> In each case, the first cycle updates a reasonable-looking list of urls into
> the crawldb, so it seems strange that the depth filter stops it from
> selecting anything in subsequent rounds.
> The cnn seed works fine when I use opic and not scoring-depth.
>
> Here is a partial listing of the readdb dump from the failing cnn trial:
> http://www.cnn.com/ Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Sep 22 15:47:46 PDT 2017
> Modified time: Thu Sep 21 15:47:46 PDT 2017
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: d9a6e1aaedca7795ea469dce4929704a
> Metadata:
> _depth_=1
> _pst_=success(1), lastModified=0
> _rs_=77
> Content-Type=text/html
> _maxdepth_=3
> nutch.protocol.code=200
>
> http://www.google.com/ Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
> http://www.googletagservices.com/ Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
> http://www.i.cdn.cnn.com/ Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
> http://www.ugdturner.com/ Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:11 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
> http://z.cdn.turner.com/ Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:12 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
> https://plus.google.com/+cnn/posts Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Sep 21 15:49:13 PDT 2017
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 5184000 seconds (60 days)
> Score: 0.03125
> Signature: null
> Metadata:
> _depth_=1000
> _maxdepth_=3
>
>
> From: Jigal van Hemert | alterNET internet BV <[email protected]>
> To: user <[email protected]>
> Sent: Tuesday, September 19, 2017 11:43 PM
> Subject: Re: depth scoring filter
>
> Hi,
>
> On 20 September 2017 at 06:36, Michael Coffey <[email protected]>
> wrote:
>
>> I am trying to develop a news crawler and I want to prevent it from
>> wandering too far away from the seed list that I provide.
>> It seems like I should use the DepthScoringFilter, but I am having trouble
>> getting it to work. After a few crawl cycles, all the _depth_ metadata say
>> either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look
>> like depths.
>> I have added a scoring.depth.max property to nutch-site.xml.
>> <property>
>> <name>scoring.depth.max</name>
>> <value>3</value>
>> </property>
>>
>>
> I use the same plugin to index only the seed plus one level below. The value
> for this is 2, so your setup crawls the seed plus two levels below.
>
> I never looked at the values of the _depth_ metadata and, frankly, since
> the plugin does what it's supposed to do, I don't care what it stores in
> its metadata here.
>
> What do I need to do to limit the crawl frontier so it won't go more than N
>> hops from the seed list, if that is possible?
>>
>>
> As said above, it should be enough to set the value to N+1.
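> For example, to crawl only the seed plus one level below (N=1), as in my
> setup:
>
> <property>
> <name>scoring.depth.max</name>
> <value>2</value>
> </property>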
>