Hello,

I am trying to find a way to get the seed URL of the URL currently being
parsed. I have many URLs in seed.txt, and I want to add extra metadata to
each crawled URL. The metadata depends on the seed URL from which the
current URL was reached, and it will later be picked up by the indexer. I
have written a custom plugin for this purpose, but I am unable to get the
seed URL of the URL being parsed.

For example, this is my seed.txt:

http://apache.org
http://amazon.com
http://w3.org

For every URL crawled from a given seed URL, I want to add metadata whose
value depends on that seed URL. I have a properties file that maps each
seed URL to a metadata value: if the seed URL is http://apache.org, the
metadata will be something like "project"; if it is http://amazon.com, it
will be "estore". The plugin I have written to add the metadata implements
HtmlParseFilter. However, I cannot find a way to get the seed URL of the
current URL. If http://nutch.apache.org is being parsed, how do I find out
that its seed URL is http://apache.org? Is there an API I could use in my
plugin, or is there a better way to achieve this?
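
To make the mapping concrete, here is a minimal, standalone sketch of the
lookup from seed URL to metadata value. It does not use any Nutch API and
does not solve the problem of finding the seed URL. The class name
SeedMetadataLookup, the method metadataFor, the "unknown" fallback and the
inline properties text are all assumptions made up for illustration. One
detail worth noting: java.util.Properties treats an unescaped ':' as a
key/value separator, so the colon in each URL key has to be escaped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: maps a seed URL to a metadata value.
public class SeedMetadataLookup {
    private final Properties mapping = new Properties();

    // propertiesText stands in for the contents of the properties file.
    public SeedMetadataLookup(String propertiesText) {
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the mapped value, or "unknown" (an assumed default)
    // when the seed URL has no entry.
    public String metadataFor(String seedUrl) {
        return mapping.getProperty(seedUrl, "unknown");
    }

    public static void main(String[] args) {
        // In the file itself each key is written as http\://apache.org
        String text = "http\\://apache.org=project\n"
                    + "http\\://amazon.com=estore\n";
        SeedMetadataLookup lookup = new SeedMetadataLookup(text);
        System.out.println(lookup.metadataFor("http://apache.org"));  // project
        System.out.println(lookup.metadataFor("http://amazon.com"));  // estore
        System.out.println(lookup.metadataFor("http://w3.org"));      // unknown
    }
}
```

What is still missing is the seedUrl argument itself, which is the value I
do not know how to obtain inside the plugin.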

Regards,
Sandeep
