If the seed list is large, it would probably be easier to write a custom
ScoringFilter combined with metadata in the Injector, e.g.

1. Add a metadata key to your seed list, e.g. '_origin_', with the seed URL
as its value,
e.g. http://www.cnn.com/    _origin_=http://www.cnn.com/ (a bit tautological,
I know, but never mind)

2. The custom scoring filter would take care of:

   - transmitting the origin metadata to the outlinks
   - removing the outlinks which do not have the same host /
   domain as the origin

The method *distributeScoreToOutlinks* is where it all happens.
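As a rough sketch of the filtering half of step 2 (the ScoringFilter plumbing and the exact *distributeScoreToOutlinks* signature vary between Nutch versions, so the class and method names below are just illustrative assumptions, not Nutch API):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class OriginFilterSketch {

    // Hypothetical helper: keep only the outlinks whose host matches the
    // host of the URL carried in the '_origin_' metadata. In a real
    // ScoringFilter this logic would sit inside distributeScoreToOutlinks,
    // operating on the outlink targets rather than plain strings.
    static List<String> filterByOriginHost(String origin, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        String originHost;
        try {
            originHost = new URL(origin).getHost();
        } catch (MalformedURLException e) {
            return kept; // unparsable origin: keep nothing
        }
        for (String link : outlinks) {
            try {
                if (originHost.equals(new URL(link).getHost())) {
                    kept.add(link);
                }
            } catch (MalformedURLException e) {
                // drop unparsable outlinks
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
            "http://www.cnn.com/world/",
            "http://ads.example.com/banner",
            "http://www.cnn.com/politics/");
        System.out.println(
            filterByOriginHost("http://www.cnn.com/", outlinks));
        // prints [http://www.cnn.com/world/, http://www.cnn.com/politics/]
    }
}
```

If you want to restrict by domain rather than exact host, you would compare a normalised form of the host (e.g. stripping the "www." prefix or using a suffix match) instead of the full host string.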

Of course, if your seed list is not large, or if you don't mind editing the
list of domains by hand, then Dennis's suggestion is a very good solution.

J.

PS: we've implemented that for one of my clients. I'll check with them
whether they are happy to donate it to the project and, if so, add it to JIRA.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


On 23 June 2010 04:35, Dennis Kubes <[email protected]> wrote:

> Try using the DomainUrlFilter.  You will need to do the following:
>
>  1. Activate the domain urlfilter in plugin.includes,
>     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
>  2. In the conf directory add your domains one per line to the
>     domain-urlfilter.txt file.  Entries can be domains
>     (something.com), subdomains (www.something.com), or top level
>     identifiers (.com)
>
> This should work using both the crawl command and calling the individual
> nutch commands directly.
>
> Dennis
>
>
> On 06/22/2010 10:06 PM, Max Lynch wrote:
>
>> I know this is a very popular question based on the searching I've
>> done...but I'm still really confused.
>>
>> I have a seed list that I want nutch to crawl.  And I want to do very deep
>> crawling on each of those domains.  However, I don't want nutch to venture
>> out of each domain on that list.  Also, the list is large which prevents me
>> from building a crawl-urlfilter entry for each domain.
>>
>> I have tried
>> <property>
>> <name>db.ignore.external.links</name>
>> <value>true</value>
>> </property>
>> But that seems to only hit the URL I specify in the url seed list, and
>> doesn't seem to allow nutch to venture more deeply into the domain itself.
>>  For example, it doesn't seem like it will follow a link on
>> http://mydomain.com/index.html to http://mydomain.com/about.html
>>
>> I have also tried   db.max.outlinks.per.page but that doesn't seem to do
>> what I want either.
>>
>> Here is the crawl command I'm issuing:
>>
>> JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth
>> 15 -topN 100 &> crawl.log
>>
>> Out of a list of ~4500 seed urls, nutch only found 800 docs (404s account
>> for most of those).
>>
>> Is there an easy way to do this?
>>
>> Thanks,
>> Max
>>
>
