Segments have a field called 'outlinks'. Could this help?

On Tuesday, June 12, 2012, Sebastian Nagel wrote:
> Hi Sandeep,
>
> tracking the seed(s) for a document could be done by a scoring filter.
> The seed URL must be passed:
>  0 into CrawlDatum's meta by injectedScore()
>    (alternatively, use additional fields in the seed file:
>       <url> <tab> seed=<url>
>     see Injector Javadoc)
>  1 in passScoreBeforeParsing():
>       from CrawlDatum to Content
>  2 in passScoreAfterParsing():
>       from Content to ParseData
>  3 in distributeScoreToOutlinks():
>       from source ParseData to all target/outlink CrawlDatum objects
>  4 in updateDbScore():
>       resolve inlinks from multiple seeds
>
> Point 4 shows a little problem: a page may be reachable from multiple
> seeds.
> The web is a graph, not a forest of trees each with one seed as root!
>
> Finally: amazon.com is definitely linked from apache.org
> but it is not a "project" site.
> Wouldn't a mapping <domain name> -> <meta data> be more reliable
> (though notoriously incomplete)?
>
> Best,
> Sebastian
>
> On 06/11/2012 08:09 PM, Sandeep C R wrote:
> > Hello,
> >
> > I am trying to find a way to get the seed URL of the URL currently
> > being parsed. I have many URLs in seed.txt, and I am trying to add
> > additional metadata for each URL crawled. The metadata depends on the
> > seed URL of the current URL. This metadata will later be picked up by
> > the indexer. I have written a custom plugin for this purpose. However,
> > I am unable to get the seed URL of the URL currently being parsed.
> >
> > Ex: This is my seed.txt
> >
> > http://apache.org
> > http://amazon.com
> > http://w3.org
> >
> > For all URLs crawled from each seed URL, I want to add metadata. The
> > value of the metadata will depend on the seed URL. I have a properties
> > file which maps each seed URL to a metadata value. If the seed URL is
> > http://apache.org then my metadata will be something like "project".
> > If it is http://amazon.com then it will be "estore". I have written a
> > plugin which adds the metadata. This plugin extends HtmlParserFilter.
> > However, I am not able to find a way to get the seed URL of the current
> > URL. If http://nutch.apache.org is currently being parsed, how do we
> > know the seed URL (http://apache.org) of this URL?
> > Is there any API which I could use in my plugin? Or is there a better
> > way to achieve this?
> >
> > Regards,
> > Sandeep
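To make points 3 and 4 of the quoted reply concrete, here is a minimal sketch of the propagation idea in plain Java. It deliberately does not use the Nutch ScoringFilter API: the class name `SeedPropagation`, the method `propagate`, and the tiny link graph are all hypothetical and only illustrate why a page can end up with more than one seed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of seed propagation over a link graph; plain Java, no Nutch classes. */
final class SeedPropagation {

    /**
     * Returns, for every reachable URL, the set of seeds it can be reached from.
     * The outlinks map (page URL to linked URLs) is a hypothetical stand-in for
     * the outlinks a crawler would extract while parsing.
     */
    static Map<String, Set<String>> propagate(List<String> seeds,
                                              Map<String, List<String>> outlinks) {
        Map<String, Set<String>> seedsOf = new TreeMap<>();
        Deque<String> queue = new ArrayDeque<>();

        // Analogous to point 0: every seed starts out tagged with itself.
        for (String seed : seeds) {
            seedsOf.computeIfAbsent(seed, k -> new TreeSet<>()).add(seed);
            queue.add(seed);
        }

        while (!queue.isEmpty()) {
            String page = queue.remove();
            Set<String> pageSeeds = seedsOf.get(page);
            for (String target : outlinks.getOrDefault(page, List.of())) {
                if (target.equals(page)) {
                    continue; // a self-link cannot add new seeds
                }
                // Analogous to point 3: hand the source's seeds to each outlink.
                Set<String> targetSeeds =
                        seedsOf.computeIfAbsent(target, k -> new TreeSet<>());
                // Analogous to point 4: merge seeds arriving via different inlinks.
                // Revisit the target only if its seed set actually grew.
                if (targetSeeds.addAll(pageSeeds)) {
                    queue.add(target);
                }
            }
        }
        return seedsOf;
    }

    public static void main(String[] args) {
        List<String> seeds =
                List.of("http://apache.org", "http://amazon.com", "http://w3.org");
        // Assumed link graph: apache.org links to nutch.apache.org and amazon.com.
        Map<String, List<String>> outlinks = Map.of(
                "http://apache.org",
                List.of("http://nutch.apache.org", "http://amazon.com"));
        propagate(seeds, outlinks)
                .forEach((url, s) -> System.out.println(url + " <- " + s));
    }
}
```

With this assumed graph, http://amazon.com carries two seeds (itself and http://apache.org), so a seed-to-metadata lookup would have to choose between "estore" and "project". That is the ambiguity point 4 describes, and it is why a mapping keyed on the domain name may be the simpler option.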

