Hi,

could you explain in detail what you mean by "parent URL"? For example:
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...
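
To make the third option concrete, here is a minimal sketch of clipping a URL after the last "/" of its path. It is plain Python and not Nutch code; the function name parent_directory_url and the example.com URL are made up for illustration, and dropping the query string and fragment is my assumption about what a "directory" URL should look like.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directory_url(url: str) -> str:
    """Return the URL with everything after the last "/" of its path removed."""
    parts = urlsplit(url)
    # Keep the path up to and including its last "/"; an empty path becomes "/".
    directory = parts.path[: parts.path.rfind("/") + 1] or "/"
    # Assumption: query and fragment belong to the document, so they are dropped.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


# Hypothetical PDF URL, for illustration only.
print(parent_directory_url("http://example.com/docs/reports/annual.pdf"))
```

Note that this only derives a URL from the PDF's own address. It says nothing about which page actually links to the PDF, which is the first option in the list.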

Nutch indexes all successfully fetched pages but not redirects,
404s, etc. Of course, pages not crawled cannot be indexed.

Best,
Sebastian

On 09/27/2018 11:58 AM, UMA MAHESWAR wrote:
> I am using Nutch 1.x for website crawling and indexing in Solr (5.5.0).
> I am trying to include the parent URL along with the PDF data.
> Can someone please suggest a way to do it?
> 
> Thanks in advance for your comments and suggestions