Thanks Julien, I had missed that urlmeta passes the tags on to the outlinks.
Sebastian

On 06/12/2012 03:42 PM, Julien Nioche wrote:
> forgot to say : this would work by adding a seed metadata to the urls in
> the seed list, the value of which is then propagated by the scoring filter
> in urlmeta
>
> On 12 June 2012 14:41, Julien Nioche <[email protected]> wrote:
>
>> That's the idea indeed. The urlmeta plugin allows to do that simply by
>> setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for
>> description etc...)
>>
>> On 11 June 2012 22:45, Sebastian Nagel <[email protected]> wrote:
>>
>>> Hi Sandeep,
>>>
>>> tracking the seed(s) for a document could be done by a scoring filter.
>>> The seed URL must be passed:
>>> 0 into CrawlDatum's meta by injectedScore()
>>>   (alternatively, use additional fields in the seed file:
>>>      <url> <tab> seed=<url>
>>>    see Injector Javadoc)
>>> 1 in passScoreBeforeParsing():
>>>   from CrawlDatum to Content
>>> 2 in passScoreAfterParsing():
>>>   from Content to ParseData
>>> 3 in distributeScoreToOutlinks():
>>>   from source ParseData to all target/outlink CrawlDatum objects
>>> 4 in updateDbScore():
>>>   resolve inlinks from multiple seeds
>>>
>>> Point 4 shows a little problem: a page may be reachable from multiple
>>> seeds.
>>> The web is a graph not a forest of trees each with one seed as root!
>>>
>>> Finally: amazon.com is definitely linked from apache.org
>>> but it is not a "project" site.
>>> Wouldn't a mapping <domain name> -> <meta data> be more reliable
>>> (though notoriously incomplete)?
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 06/11/2012 08:09 PM, Sandeep C R wrote:
>>>> Hello,
>>>>
>>>> I am trying to find a way in which I can get the seed url of current url
>>>> being parsed. I have many URL's in seed.txt. I am trying to add additional
>>>> metadata for each URL crawled. The metadata depends on the seed URL of the
>>>> current URL. This metadata will be later picked by the indexer. I have
>>>> written a custom plugin for this purpose. However I am unable to get the
>>>> seed url of the current url being parsed.
>>>>
>>>> Ex: This is my seed.txt
>>>>
>>>> http://apache.org
>>>> http://amazon.com
>>>> http://w3.org
>>>>
>>>> For all URL's crawled for every seed URL, I want to add metadata. The value
>>>> of metadata will depend on seed URL. I have a properties file which will
>>>> map seed url to metadata value. If seed url is http://apache.org then my
>>>> metadata will be something like "project". If it is http://amazon.com then
>>>> it will be "estore". I have written a plugin which will add metadata. This
>>>> plugin extends HtmlParserFilter. However I am not able find a way to get
>>>> the seed url of current url. If http://nutch.apache.org is being parsed
>>>> currently, then how do we know the seed url(http:/apache.org) of this url?
>>>> Is there any API which I could use in my plugin? Or is there any better way
>>>> to achieve this?
>>>>
>>>> Regards,
>>>> Sandeep
>>>>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
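
To make point 4 of the quoted thread concrete (a page may be reachable from more than one seed), here is a minimal sketch in plain Java. It does not use any Nutch classes: the class SeedTagSketch, the method propagateSeeds() and the link graph in main() are all made up for illustration, including the http://amazon.com/books page. It only mimics what a seed tag passed along outlinks would end up looking like, under the assumption that every outlink inherits the seed of its source page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: plain Java collections, no Nutch API involved.
class SeedTagSketch {

    /**
     * Pushes each seed URL along the outlinks, breadth-first, and returns
     * for every reached URL the set of seeds it can be reached from.
     */
    static Map<String, Set<String>> propagateSeeds(
            List<String> seeds, Map<String, List<String>> outlinks) {
        Map<String, Set<String>> seedsByUrl = new HashMap<>();
        for (String seed : seeds) {
            Deque<String> queue = new ArrayDeque<>();
            queue.add(seed);
            while (!queue.isEmpty()) {
                String url = queue.poll();
                Set<String> tags = seedsByUrl.computeIfAbsent(url, k -> new TreeSet<>());
                // add() returns false if this seed has already reached the URL,
                // which also stops us from looping on cyclic links.
                if (!tags.add(seed)) {
                    continue;
                }
                queue.addAll(outlinks.getOrDefault(url, List.of()));
            }
        }
        return seedsByUrl;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: apache.org links to amazon.com.
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("http://apache.org",
                List.of("http://nutch.apache.org", "http://amazon.com"));
        outlinks.put("http://amazon.com", List.of("http://amazon.com/books"));

        List<String> seeds = List.of("http://apache.org", "http://amazon.com");
        propagateSeeds(seeds, outlinks)
                .forEach((url, from) -> System.out.println(url + " <- " + from));
    }
}
```

In this toy graph the amazon.com pages end up tagged with both seeds, so a "project" vs. "estore" decision based on the seed alone would be ambiguous for them. Whatever does the final resolution has to pick a policy (first seed wins, keep all, prefer the seed on the same domain, ...), or fall back to a domain name -> metadata mapping as suggested in the thread.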

