If I understand correctly, you want to index/store the URL of the page where the PDF link was found, right? By default we don't track the name of the website. But you could do this (sort of) using the index-links plugin (https://github.com/apache/nutch/tree/master/src/plugin/index-links).
This will allow you to index all the outlinks of a given URL. So if A is the parent URL of B (a PDF file), then you should be able to find B's URL among the outlinks of A. This basically inverts the problem: instead of looking for the parent of B, you would be looking for any URL that has B as an outlink. In theory you could find all the URLs that point to a specific resource (PDF file).

Hope that helps,

Best Regards,
Jorge

On Fri, Sep 28, 2018 at 11:46 AM UMA MAHESWAR <uma.mahes...@in.ebmpapst.com> wrote:

> Hi Sir,
>
> By Parent URL, I mean the page the PDF document is linked from.
>
> In other words, the name of the website where the PDF is present.
>
> Example: I am crawling multiple PDFs from multiple websites. I just wanted
> to index the respective website name along with each PDF crawled from
> the respective websites.
>
> Thanks,
> Uma
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
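
P.S. To make the reverse lookup described above concrete, here is a minimal sketch in plain Python. It does not call Nutch or any index backend. The document shape (a dict with "url" and "outlinks" keys), the helper names, and the example URLs are all assumptions made for illustration; the actual field names depend on how index-links and your indexer are configured.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_parent_index(docs):
    """Map each outlink target to the set of page URLs that link to it."""
    parents = defaultdict(set)
    for doc in docs:
        # "outlinks" is an assumed field name, not a confirmed Nutch schema.
        for target in doc.get("outlinks", []):
            parents[target].add(doc["url"])
    return parents


def parent_sites(parents, target):
    """Return the sorted host names of the pages that link to target."""
    return sorted({urlparse(page).netloc for page in parents.get(target, ())})


# Hypothetical indexed documents: two pages that both link to the same PDF.
docs = [
    {
        "url": "https://site-a.example/downloads.html",
        "outlinks": [
            "https://site-a.example/files/manual.pdf",
            "https://site-a.example/about.html",
        ],
    },
    {
        "url": "https://site-b.example/docs/index.html",
        "outlinks": ["https://site-a.example/files/manual.pdf"],
    },
]

parents = build_parent_index(docs)
print(parent_sites(parents, "https://site-a.example/files/manual.pdf"))
```

In a real setup you would more likely ask the index itself for every document whose outlinks contain the PDF URL, rather than building this map in memory. Note that a PDF can have more than one parent, so the lookup returns a list of sites, not a single one.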