Forgot to say: this would work by adding a seed metadata entry to each URL in the seed list; its value is then propagated to the outlinks by the scoring filter in urlmeta.
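
To make that concrete, here is a minimal sketch of the two pieces involved. It assumes a metadata key named "seed" (the key name is my choice for illustration, any name should do) and that the urlmeta plugin is also listed in plugin.includes, which it has to be for the tags to be picked up. The seed list lines are shown inside the XML comment; the separator between the URL and the key=value pair must be a real TAB character.

```xml
<!--
  Seed list (e.g. urls/seed.txt), one URL per line, metadata after a TAB:

    http://apache.org   seed=http://apache.org
    http://amazon.com   seed=http://amazon.com
    http://w3.org       seed=http://w3.org
-->

<!-- nutch-site.xml: tell urlmeta which injected tags to propagate -->
<configuration>
  <property>
    <name>urlmeta.tags</name>
    <!-- comma-separated list of tag names, without surrounding whitespace -->
    <value>seed</value>
  </property>
</configuration>
```

With this in place, pages discovered from http://apache.org should carry seed=http://apache.org in their metadata. Sebastian's caveat still applies: a page reachable from several seeds will not have a unique value.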

On 12 June 2012 14:41, Julien Nioche <[email protected]> wrote:

> That's the idea indeed. The urlmeta plugin allows to do that simply by
> setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for
> description etc...)
>
> On 11 June 2012 22:45, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Sandeep,
>>
>> tracking the seed(s) for a document could be done by a scoring filter.
>> The seed URL must be passed:
>> 0 into CrawlDatum's meta by injectedScore()
>>   (alternatively, use additional fields in the seed file:
>>    <url> <tab> seed=<url>
>>    see Injector Javadoc)
>> 1 in passScoreBeforeParsing():
>>   from CrawlDatum to Content
>> 2 in passScoreAfterParsing():
>>   from Content to ParseData
>> 3 in distributeScoreToOutlinks():
>>   from source ParseData to all target/outlink CrawlDatum objects
>> 4 in updateDbScore():
>>   resolve inlinks from multiple seeds
>>
>> Point 4 shows a little problem: a page may be reachable from multiple
>> seeds.
>> The web is a graph not a forest of trees each with one seed as root!
>>
>> Finally: amazon.com is definitely linked from apache.org
>> but it is not a "project" site.
>> Wouldn't a mapping <domain name> -> <meta data> be more reliable
>> (though notoriously incomplete)?
>>
>> Best,
>> Sebastian
>>
>> On 06/11/2012 08:09 PM, Sandeep C R wrote:
>> > Hello,
>> >
>> > I am trying to find a way in which I can get the seed url of current url
>> > being parsed. I have many URL's in seed.txt. I am trying to add additional
>> > metadata for each URL crawled. The metadata depends on the seed URL of the
>> > current URL. This metadata will be later picked by the indexer. I have
>> > written a custom plugin for this purpose. However I am unable to get the
>> > seed url of the current url being parsed.
>> >
>> > Ex: This is my seed.txt
>> >
>> > http://apache.org
>> > http://amazon.com
>> > http://w3.org
>> >
>> > For all URL's crawled for every seed URL, I want to add metadata. The value
>> > of metadata will depend on seed URL. I have a properties file which will
>> > map seed url to metadata value. If seed url is http://apache.org then my
>> > metadata will be something like "project". If it is http://amazon.com then
>> > it will be "estore". I have written a plugin which will add metadata. This
>> > plugin extends HtmlParserFilter. However I am not able find a way to get
>> > the seed url of current url. If http://nutch.apache.org is being parsed
>> > currently, then how do we know the seed url (http://apache.org) of this url?
>> > Is there any API which I could use in my plugin? Or is there any better way
>> > to achieve this?
>> >
>> > Regards,
>> > Sandeep
>> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

