There are several ways to do this.

* You can assign some metadata to each URL when injecting and bootstrapping the system.
* You could embed some meta tags or another distinguishing feature in the URLs and use the facilities (existing or available in Jira) to identify these pages.
* You may also be able to attach the original batchId to all original seed URLs. [0]

I imagine that all of the above require you to adapt the source code at some stage... this is why we don't release 2.x binaries. I recently opened an issue [0] which could easily be adapted for the third point above. Kiran's contribution to porting the metadata plugins to Nutch 2.x would probably enable you to address point 2.
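For the first option, here is a rough sketch in Python of generating a seed file that tags each seed URL with its own origin. It assumes the injector accepts tab-separated key=value metadata after the URL on each seed line; the key names seedOrigin and seedBatch and the helper functions are made up for illustration. As far as I know this only stores the values on the seed records themselves, and carrying them over to pages discovered from those seeds is the part that needs the source changes mentioned above.

```python
def seed_line(url, **metadata):
    """Return one seed-file line: the URL followed by tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def write_seed_file(path, urls, batch_label):
    """Write a seed file in which every URL carries its own origin and a batch label.

    seedOrigin and seedBatch are hypothetical metadata keys, not Nutch built-ins.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(seed_line(url, seedOrigin=url, seedBatch=batch_label) + "\n")
```

Calling write_seed_file("urls/seed.txt", ["http://nutch.apache.org/"], "run-1") would give you a single line with the URL, then seedOrigin and seedBatch, separated by tabs, which you can then inject as usual.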
[0] https://issues.apache.org/jira/browse/NUTCH-1533

On Mon, Mar 11, 2013 at 3:53 AM, Anand Bhagwat <[email protected]> wrote:

> Hi,
> Is there any way to identify seed URL from a record in WebPage table? What
> I am trying to find out is what was the origin of given record? I know
> there are inlinks and outlinks but is there any alternate way?
>
> -Anand.

--
*Lewis*

