See https://issues.apache.org/jira/browse/NUTCH-830

This patch has been kindly donated by Ant.com


On 23 June 2010 13:22, Julien Nioche <[email protected]> wrote:

> If the seed list is large, it would probably be easier to write a custom
> Scoring Filter combined with metadata in the Injector, e.g.
>
> 1. add a metadata entry to your seed list, e.g. '_origin_', with the seed
> URL as its value,
> e.g. http://www.cnn.com/    _origin_=http://www.cnn.com/ (a bit
> tautological, I know, but never mind)
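As context for step 1: the Injector convention is tab-separated key=value pairs after the URL. A minimal, Nutch-independent sketch of parsing such a seed line (the class name `SeedLine` is made up for illustration) might look like:

```java
import java.util.HashMap;
import java.util.Map;

public class SeedLine {
    public final String url;
    public final Map<String, String> metadata = new HashMap<>();

    // Parse a seed line of the form "URL\tkey1=value1\tkey2=value2",
    // the tab-separated metadata convention described above.
    public SeedLine(String line) {
        String[] parts = line.split("\t");
        this.url = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                metadata.put(parts[i].substring(0, eq).trim(),
                             parts[i].substring(eq + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        SeedLine seed = new SeedLine(
            "http://www.cnn.com/\t_origin_=http://www.cnn.com/");
        System.out.println(seed.url);
        System.out.println(seed.metadata.get("_origin_"));
    }
}
```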
>
> 2. The custom scoring filter would take care of:
>
>    - transmitting the origin metadata to its outlinks
>    - removing the outlinks which do not have the same host /
>    domain as the origin
>
> The method *distributeScoreToOutlinks* is where it all happens.
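Stripped of the Nutch API, the host comparison at the heart of such a filter can be sketched as follows. This is only an illustration of the filtering logic, not the actual ScoringFilter implementation; the class and method names are invented for the example:

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

public class OriginFilter {
    // True when the outlink shares the host of the origin URL
    // carried in the '_origin_' metadata.
    static boolean sameHost(String origin, String outlink) {
        try {
            String a = new URI(origin).getHost();
            String b = new URI(outlink).getHost();
            return a != null && a.equalsIgnoreCase(b);
        } catch (Exception e) {
            return false; // unparsable URLs are dropped
        }
    }

    // Keep only the outlinks that stay on the origin's host.
    static List<String> filterOutlinks(String origin, List<String> outlinks) {
        return outlinks.stream()
                .filter(o -> sameHost(origin, o))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> kept = filterOutlinks("http://www.cnn.com/",
                List.of("http://www.cnn.com/world/",
                        "http://www.bbc.co.uk/news/"));
        System.out.println(kept); // only the cnn.com outlink survives
    }
}
```

A real implementation would read the origin from the page's CrawlDatum metadata and write it onto each surviving outlink, so the restriction propagates as the crawl deepens.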
>
> Of course, if your seed list is not large, or if you don't mind editing the
> list of domains by hand, then Dennis's suggestion is a very good solution.
>
> J.
>
> PS: we've implemented that for one of my clients. I'll check with them
> whether they are happy to donate it to the project and if so add it to JIRA
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
>
> On 23 June 2010 04:35, Dennis Kubes <[email protected]> wrote:
>
>> Try using the DomainUrlFilter.  You will need to do the following:
>>
>>  1. Activate the domain urlfilter in plugin.includes,
>>     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
>>  2. In the conf directory add your domains one per line to the
>>     domain-urlfilter.txt file.  Entries can be domains
>>     (something.com), subdomains (www.something.com), or top level
>>     identifiers (.com)
>>
>> This should work using both the crawl command and calling the individual
>> nutch commands directly.
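For reference, the two steps above might look roughly like this. The exact plugin.includes value depends on your setup, so treat this as a sketch rather than a drop-in configuration:

```xml
<!-- nutch-site.xml: activate the domain urlfilter -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-domain|parse-(text|html)|index-basic|scoring-opic</value>
</property>
```

```text
# conf/domain-urlfilter.txt: one entry per line
something.com
www.something.com
.com
```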
>>
>> Dennis
>>
>>
>> On 06/22/2010 10:06 PM, Max Lynch wrote:
>>
>>> I know this is a very popular question based on the searching I've
>>> done...but I'm still really confused.
>>>
>>> I have a seed list that I want nutch to crawl.  And I want to do very
>>> deep crawling on each of those domains.  However, I don't want nutch to
>>> venture out of each domain on that list.  Also, the list is large which
>>> prevents me from building a crawl-urlfilter entry for each domain.
>>>
>>> I have tried
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>> </property>
>>> But that seems to only hit the URL I specify in the url seed list, and
>>> doesn't seem to allow nutch to venture more deeply into the domain itself.
>>>  For example, it doesn't seem like it will follow a link on
>>> http://mydomain.com/index.html to http://mydomain.com/about.html
>>>
>>> I have also tried   db.max.outlinks.per.page but that doesn't seem to do
>>> what I want either.
>>>
>>> Here is the crawl command I'm issuing:
>>>
>>> JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth
>>> 15 -topN 100 &> crawl.log
>>>
>>> Out of a list of ~4500 seed urls, nutch only found 800 docs (404s account
>>> for most of those).
>>>
>>> Is there an easy way to do this?
>>>
>>> Thanks,
>>> Max
>>>
>>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
