Hi, great to hear.

You may also want to add a check whether the parent URL is already set, so
that it does not get overwritten if another page links to the same target.
But that depends on your crawling setup and the structure of the crawled sites.

And of course, in real web crawling there may be millions of parents for some
pages. In this case, storing parent URLs in the CrawlDb would mean too much
overhead. That's why there is the LinkDb. Using the LinkDb in combination with
the plugin index-links could also be a solution, given the crawled sites have
the shape of a tree.

Thanks,
Sebastian

On 11/29/2016 11:33 AM, [email protected] wrote:
> Thanks Sebastian.
>
> I did as you suggested, & it worked like a charm.
> It would have taken me days otherwise. :)
>
> The targets for-loop handles each link. So there I am adding it to the
> metadata.
>
> [changes.jpg]
>
> ________________________________________
> From: Sebastian Nagel <[email protected]>
> Sent: Monday, November 28, 2016 1:09 AM
> To: [email protected]
> Subject: Re: Need to index Parent URL also
>
> Hi,
>
> have a look at the scoring filter interface, esp. the plugin scoring-depth.
> In the method distributeScoreToOutlinks the fromUrl is at hand and it's no
> big deal to add it to the CrawlDatum's metadata of all outlinks.
>
> In the method updateDbScore it must be finally added to CrawlDb's CrawlDatum
> "datum".
>
> Just modify an existing plugin or implement your own.
>
> To finally index the parent URL, add the metadata key which holds the
> parent/from URL to the property index.db.md:
>
> <property>
>   <name>index.db.md</name>
>   <value></value>
>   <description>
>   Comma-separated list of keys to be taken from the crawldb metadata to
>   generate fields.
>   Can be used to index values propagated from the seeds with the plugin
>   urlmeta
>   </description>
> </property>
>
> Cheers,
> Sebastian
>
> On 11/27/2016 12:49 PM, [email protected] wrote:
>> Hi,
>>
>> While nutch1.x is indexing in solr (or Elasticsearch) I need to include the
>> immediate parent URL too.
>>
>> There is no clear help online on where to do this.
>>
>> I don't need the hierarchy till seed url, but just the immediate parent of
>> current parsing document.
>>
>> Someone suggested to do it on outlinks, but in code, can anyone help me
>> where to find this and include it.
>>
>> I have Nutch setup in my eclipse.
>>
>> Thanks in advance,
>>
>> -Ashok.
>>
>> This e-mail and any files transmitted with it are for the sole use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. If you are not the intended recipient(s), please reply to the
>> sender and destroy all copies of the original message. Any unauthorized
>> review, use, disclosure, dissemination, forwarding, printing or copying of
>> this email, and/or any action taken in reliance on the contents of this
>> e-mail is strictly prohibited and may be unlawful. Where permitted by
>> applicable law, this e-mail and other e-mail communications sent to and
>> from Cognizant e-mail addresses may be monitored.
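
For the index.db.md step described in the quoted message, a filled-in
property in nutch-site.xml might look like the sketch below. The key name
"parentUrl" is purely hypothetical; it has to match whatever metadata key the
modified scoring filter writes into the CrawlDatum. This also assumes that the
indexing filter which reads index.db.md (index-metadata, as far as I know) is
enabled in plugin.includes.

```xml
<!-- nutch-site.xml (sketch): "parentUrl" is an assumed key name, not one
     defined by Nutch; use the key your scoring filter actually stores. -->
<property>
  <name>index.db.md</name>
  <value>parentUrl</value>
</property>
```

With this in place, the value stored under that key in the CrawlDb metadata
should show up as a field of the same name in the indexed document.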

