Please remove me from this list

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
Sent: Friday, September 28, 2018 2:25 AM
To: user@nutch.apache.org
Subject: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Hi,

could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...

Nutch indexes all successfully fetched pages but not redirects, 404s, etc.
Of course, pages not crawled cannot be indexed.

Best,
Sebastian

On 09/27/2018 11:58 AM, UMA MAHESWAR wrote:
> I am using Nutch 1.x for website crawling and indexing in Solr (5.5.0).
> I am trying to include the parent URL along with the PDF data.
> Can someone please suggest a way to do it?
>
> Thanks in advance for your comments and suggestions.
>
> --
> Sent from:
> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html