Re: Include parent URL in pdf data - nutch

Jorge Betancourt Fri, 28 Sep 2018 09:33:23 -0700

If I understand correctly, what you want is to index/store the URL where
the PDF link was found, right? We don't track the name of the website (by
default). But you could do this (sort of) using the index-links plugin
(https://github.com/apache/nutch/tree/master/src/plugin/index-links).

This will allow you to index all the outlinks of a given URL. So if A is
the parent URL of B (the PDF file), then you should be able to find B's URL
in the outlinks of A. This basically inverts the problem: instead of
looking for the parent of B, you would be looking for any URL that has B
as an outlink. In theory you could find all the URLs that point to a
specific resource (PDF file).
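
As a rough illustration of that inverted lookup, here is a minimal Python
sketch. It does not call Nutch or the index backend; it assumes you have
already fetched the indexed documents as plain dicts with a "url" field and
an "outlinks" list (both field names and the example URLs are assumptions
made for this sketch, so check what your index actually contains).

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_parent_map(docs):
    """Map every outlink target to the set of page URLs that link to it."""
    parents = defaultdict(set)
    for doc in docs:
        # "outlinks" is an assumed field name; a page without it has no targets.
        for target in doc.get("outlinks", []):
            parents[target].add(doc["url"])
    return parents


def parents_of(docs, target_url):
    """Return the sorted URLs of all pages that have target_url as an outlink."""
    return sorted(build_parent_map(docs).get(target_url, set()))


def parent_hosts(docs, target_url):
    """Return the sorted host names of the pages linking to target_url."""
    return sorted({urlparse(url).netloc for url in parents_of(docs, target_url)})


# Hypothetical documents standing in for what the index would return.
PDF = "https://site-a.example/files/manual.pdf"
docs = [
    {
        "url": "https://site-a.example/downloads.html",
        "outlinks": [PDF, "https://site-a.example/about.html"],
    },
    {"url": "https://site-b.example/resources.html", "outlinks": [PDF]},
    {"url": PDF},  # the PDF itself has no outlinks
]

print(parents_of(docs, PDF))
print(parent_hosts(docs, PDF))
```

The host names returned by parent_hosts are probably the closest thing to
the "website name" asked for, one per page that links to the PDF.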

Hope that helps,

Best Regards,
Jorge

On Fri, Sep 28, 2018 at 11:46 AM UMA MAHESWAR <uma.mahes...@in.ebmpapst.com>
wrote:

> Hi Sir,
>
> By parent URL, I mean the page the PDF document is linked from.
>
> In other words, the name of the website where the PDF is present.
>
> Example: I am crawling multiple PDFs from multiple websites. I just want
> to index the respective website name along with each PDF crawled from
> those websites.
>
> Thanks,
> Uma
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>
