Hi Sandeep,

Tracking the seed(s) of a document could be done with a scoring filter.
The seed URL must be passed:
 0  into CrawlDatum's meta by injectedScore()
    (alternatively, use additional fields in the seed file:
      <url> <tab> seed=<url>
     see Injector Javadoc)
 1  in passScoreBeforeParsing():
    from CrawlDatum to Content
 2  in passScoreAfterParsing():
    from Content to ParseData
 3  in distributeScoreToOutlinks():
    from source ParseData to all target/outlink CrawlDatum objects
 4  in updateDbScore():
    resolve inlinks from multiple seeds
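
To make the hand-over in steps 0-3 concrete, here is a rough stand-alone
sketch. It does not use the real ScoringFilter interface: plain maps stand
in for the metadata of CrawlDatum, Content and ParseData, and the key name
"seed" as well as the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone model of steps 0-3. Plain maps stand in for the metadata
// of CrawlDatum, Content and ParseData; this is NOT the Nutch API.
class SeedPropagationSketch {
    static final String SEED_KEY = "seed"; // hypothetical metadata key

    // Step 0: at injection time a seed URL records itself as its own seed.
    static Map<String, String> inject(String seedUrl) {
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put(SEED_KEY, seedUrl);
        return datumMeta;
    }

    // Steps 1 and 2: copy the seed from one metadata container to the next
    // (CrawlDatum -> Content, then Content -> ParseData).
    static void copySeed(Map<String, String> from, Map<String, String> to) {
        String seed = from.get(SEED_KEY);
        if (seed != null) {
            to.put(SEED_KEY, seed);
        }
    }

    // Step 3: every outlink gets its own metadata carrying the source's seed.
    static Map<String, Map<String, String>> distribute(
            Map<String, String> parseMeta, List<String> outlinks) {
        Map<String, Map<String, String>> targets = new HashMap<>();
        for (String outlink : outlinks) {
            Map<String, String> targetMeta = new HashMap<>();
            copySeed(parseMeta, targetMeta);
            targets.put(outlink, targetMeta);
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> datum = inject("http://apache.org");
        Map<String, String> content = new HashMap<>();
        Map<String, String> parseData = new HashMap<>();
        copySeed(datum, content);      // step 1
        copySeed(content, parseData);  // step 2
        System.out.println(
                distribute(parseData, List.of("http://nutch.apache.org/")));
    }
}
```

In a real plugin each of these copies would live in the corresponding
scoring filter method listed above.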

Point 4 exposes a small problem: a page may be reachable from multiple seeds.
The web is a graph, not a forest of trees each rooted at a single seed!
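
One possible way to resolve this in step 4 is to keep the set of all seeds
instead of a single one. The sketch below assumes the seeds are stored as
one space-separated string under a single metadata key; that storage format
and the names SeedMergeSketch/mergeSeeds are only assumptions for
illustration, not anything Nutch prescribes.

```java
import java.util.List;
import java.util.TreeSet;

// Sketch of step 4: union of the seeds already known for a page with the
// seeds arriving via its inlinks. Hypothetical helper, not Nutch code.
class SeedMergeSketch {

    // Splits a space-separated seed list and adds the non-empty entries.
    private static void addSeeds(TreeSet<String> seeds, String value) {
        if (value == null) {
            return;
        }
        for (String seed : value.split(" ")) {
            if (!seed.isEmpty()) {
                seeds.add(seed);
            }
        }
    }

    // Returns the sorted, duplicate-free union as a space-separated string.
    static String mergeSeeds(String existing, List<String> inlinkSeeds) {
        TreeSet<String> seeds = new TreeSet<>();
        addSeeds(seeds, existing);
        for (String inlinkSeed : inlinkSeeds) {
            addSeeds(seeds, inlinkSeed);
        }
        return String.join(" ", seeds);
    }
}
```

The plugin then still has to decide which metadata value wins when a page
ends up with more than one seed.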

Finally: amazon.com is definitely linked from apache.org, so it would
inherit the apache.org seed, but it is not a "project" site.
Wouldn't a mapping <domain name> -> <meta data> be more reliable
(though notoriously incomplete)?
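
A minimal sketch of such a lookup, using only the JDK. The two entries are
taken from your example; the class and method names are hypothetical, and
the suffix matching (nutch.apache.org falls back to apache.org) is just one
way to do it.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Sketch of a <domain name> -> <meta data> lookup. Hypothetical helper;
// in practice the map would be loaded from the properties file.
class DomainMetadataSketch {
    static final Map<String, String> DOMAIN_TO_META = Map.of(
            "apache.org", "project",
            "amazon.com", "estore");

    // Tries the full host first, then strips one leading label at a time.
    static Optional<String> lookup(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a parseable URL
        }
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            String meta = DOMAIN_TO_META.get(host);
            if (meta != null) {
                return Optional.of(meta);
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return Optional.empty(); // domain not in the mapping
            }
            host = host.substring(dot + 1);
        }
    }
}
```

The parse filter could call lookup() with the URL of the page being parsed,
so no seed needs to be tracked at all. Pages from unmapped domains simply
get no value.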

Best,
Sebastian

On 06/11/2012 08:09 PM, Sandeep C R wrote:
> Hello,
> 
> I am trying to find a way in which I can get the seed url of current url
> being parsed. I have many URL's in seed.txt. I am trying to add additional
> metadata for each URL crawled. The metadata depends on the seed URL of the
> current URL. This metadata will be later picked by the indexer. I have
> written a custom plugin for this purpose. However I am unable to get the
> seed url of the current url being parsed.
> 
> Ex: This is my seed.txt
> 
> http://apache.org
> http://amazon.com
> http://w3.org
> 
> For all URL's crawled for every seed URL, I want to add metadata. The value
> of metadata will depend on seed URL. I have a properties file which will
> map seed url to metadata value. If seed url is http://apache.org then my
> metadata will be something like "project". If it is http://amazon.com then
> it will be "estore". I have written a plugin which will add metadata. This
> plugin extends HtmlParserFilter. However I am not able to find a way to get
> the seed url of current url. If http://nutch.apache.org is being parsed
> currently, then how do we know the seed url (http://apache.org) of this url?
> Is there any API which I could use in my plugin? Or is there any better way
> to achieve this?
> 
> Regards,
> Sandeep
> 
