I think what you are describing would require a new tool. In concept it would be similar to the LinkRank tool, where the outlink score is transferred from parent to child. You can use the WebGraph to get the outlinks from a page, and you can get the raw content or the parsed text from the segments' Content or ParseText, respectively. I can see having something like this:

  1. MR job one
        1. outlink and content as input
        2. No mapper.
        3. Reducer outputs outlink and content
  2. MR job two
        1. Job one output and X used as input
              1. Here X would be whatever input you want to change
                 based on the parent content.  This could be Content,
                 ParseData, CrawlDb, etc.  It would be keyed off of the
                 outlink url, which is the child page.
              2. An important thing to consider is that there would be
                 multiple "parent" pages to a single child page: any
                 page that has an outlink to it.
        2. Probably no mapper
        3. Reducer takes job one output and your X and does something
           with it
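
To make the data flow above concrete, here is a rough in-memory sketch in plain Java. It does not use the Hadoop or Nutch APIs at all: ordinary maps stand in for the job inputs and outputs, and the class name, method names, and the choice of child text as "X" are all made up for illustration. A real tool would do the same two joins as MapReduce jobs over the WebGraph and the segments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: plain collections stand in for Hadoop and Nutch types.
class ParentToChildSketch {

    // Job one: join outlinks and content on the parent url, then emit
    // (outlink url, parent content) so the content is keyed by the child url.
    static Map<String, List<String>> jobOne(Map<String, List<String>> outlinksByParent,
                                            Map<String, String> contentByParent) {
        Map<String, List<String>> sorted = new TreeMap<>(outlinksByParent);
        Map<String, List<String>> parentContentByChild = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : sorted.entrySet()) {
            String content = contentByParent.get(e.getKey());
            if (content == null) {
                continue; // no content stored for this parent
            }
            for (String child : e.getValue()) {
                parentContentByChild
                    .computeIfAbsent(child, k -> new ArrayList<>())
                    .add(content);
            }
        }
        return parentContentByChild;
    }

    // Job two: join job one output with X (here: child text) on the child url.
    // A child may have several parents, so it receives a list of contents.
    static Map<String, String> jobTwo(Map<String, List<String>> parentContentByChild,
                                      Map<String, String> childText) {
        Map<String, String> enriched = new TreeMap<>();
        for (Map.Entry<String, String> e : childText.entrySet()) {
            List<String> parents =
                parentContentByChild.getOrDefault(e.getKey(), List.of());
            // "Does something with it": this sketch just appends the parent content.
            String value = parents.isEmpty()
                ? e.getValue()
                : e.getValue() + " [from parents: " + String.join("; ", parents) + "]";
            enriched.put(e.getKey(), value);
        }
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        outlinks.put("http://a/", List.of("http://c/"));
        outlinks.put("http://b/", List.of("http://c/", "http://d/"));
        Map<String, String> content = Map.of("http://a/", "A text", "http://b/", "B text");
        Map<String, String> childText = Map.of("http://c/", "C text", "http://d/", "D text");

        Map<String, List<String>> keyedByChild = jobOne(outlinks, content);
        jobTwo(keyedByChild, childText).forEach((url, text) -> System.out.println(url + " -> " + text));
    }
}
```

The part worth noticing is the re-keying in job one: content that starts out keyed by the parent url ends up keyed by the child url, which is what lets job two line it up with whatever X you pick.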

I can see altering the child page Content in segments based on the parent, storing something in the child ParseData in segments, or altering the CrawlDb. The action itself is up to you. The end result would then flow through the Nutch job stream and end up in the Indexer.

Dennis

On 06/19/2010 03:29 AM, Harry Nutch wrote:
Hi,

I have a scenario where some specific content I'd like to store with a
sub-page, is contained in the parent-page that had the outlink to this
sub-page.
Is there a way I could pass parsed-content from parent page to the outlinked
page, which I can later use while indexing outlinked page?

Thanks,
Harry
