There are several ways to do this.
* You can assign some metadata to each URL when injecting and
bootstrapping the system.
* You could embed some meta tags or other distinguishing feature in the URLs
and use the facilities (existing or available in Jira) to identify these
pages.
* You may also be able to attach the original batchId to all original seed
URLs. [0]
I imagine that all of the above require you to adapt the source code at some
stage... this is why we don't release 2.x binaries.
I recently opened an issue [0] which could easily be adapted for the third
point above.
Kiran's contribution porting the metadata plugins to Nutch 2.x would
probably enable you to address the second point.
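
For the first point, a rough sketch of generating such a seed list could look
like the following. As far as I know the injector accepts tab-separated
key=value pairs after each URL in the seed file. The key names (seedUrl,
seedBatch), the helper functions and the example URLs are my own assumptions
for illustration, not anything Nutch defines.

```python
def seed_line(url, **metadata):
    """Build one seed-list line: the URL followed by tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def build_seed_lines(urls, batch_label):
    """Tag every seed URL with itself and a batch label (hypothetical key names)."""
    return [
        seed_line(url, seedUrl=url, seedBatch=batch_label)
        for url in urls
    ]


if __name__ == "__main__":
    # Example URLs and batch label are placeholders.
    seeds = ["http://example.com/", "http://example.org/news/"]
    for line in build_seed_lines(seeds, "batch-2013-03-11"):
        print(line)
```

Writing those lines to a file in the seed directory and injecting it should
tag the seed rows themselves. As far as I can tell, carrying the value over to
pages discovered from those seeds would still need the code changes mentioned
above.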

[0] https://issues.apache.org/jira/browse/NUTCH-1533

On Mon, Mar 11, 2013 at 3:53 AM, Anand Bhagwat <[email protected]> wrote:

> Hi,
> Is there any way to identify seed URL from a record in WebPage table? What
> I am trying to find out is what was the origin of given record? I know
> there are inlinks and outlinks but is there any alternate way?
>
> -Anand.
>



-- 
*Lewis*
