Re: Include parent URL in pdf data - nutch

Sebastian Nagel Thu, 27 Sep 2018 23:24:51 -0700

Hi,

could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (the URL clipped after the last "/")
- ...
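
If it is the third reading (the "directory" of the PDF URL), a minimal
sketch of the clipping could look like the following. It uses only
java.net.URI; the class name ParentUrl and the method directoryOf are
made up for illustration and are not part of Nutch. Query string and
fragment are dropped, which is an assumption about what is wanted.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ParentUrl {

    // Hypothetical helper (not a Nutch API): keep scheme, host and the
    // path up to and including the last "/", drop everything after it.
    public static String directoryOf(String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // no path at all: treat the site root as the directory
        }
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        try {
            // Rebuild the URL without query and fragment (assumption).
            return new URI(uri.getScheme(), uri.getAuthority(), dir, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build parent URL for " + url, e);
        }
    }

    public static void main(String[] args) {
        // prints "https://example.com/docs/"
        System.out.println(directoryOf("https://example.com/docs/report.pdf"));
    }
}
```

Where such a value would be added to the indexed document (for example
in a custom indexing filter) depends on the setup and is not shown here.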


Nutch indexes all successfully fetched pages but not redirects,
404s, etc. Of course, pages not crawled cannot be indexed.

Best,
Sebastian

On 09/27/2018 11:58 AM, UMA MAHESWAR wrote:
> I am using Nutch 1.x for website crawling and indexing in Solr (5.5.0).
> I am trying to include the parent URL along with the PDF data.
> Can someone please suggest a way to do it?
> 
> Thanks in advance for your comments and suggestions
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>
