Re: don't crawl links in header

Sebastian Nagel Tue, 22 Mar 2016 10:20:06 -0700

Hi Shani,

> Sometimes the header of a page contains <link> tags that point
> to pages with source code that isn't interesting ...

Yes, that's often true.

> ... I want Nutch to get links only from the body and not from the
> header.
> Is this possible? (I'm using Nutch 1.9)

Have a look at the following property:

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma separated list of HTML tags, from which outlinks
  shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. If you add any of those tags here, it
  won't be taken. Default is empty list. Probably reasonable value
  for most people would be "img,script,link".</description>
</property>

This lets you easily exclude all outlinks taken from "link" tags.
Afaik, there is no way to follow only the links found in
the body. Also, be aware that some "link" links, e.g.,
  <link rel="canonical" href="..." />
are worth following. Of course, well-maintained sites
usually make these pages reachable through ordinary "a" links as well,
so normally that's not a problem.
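
As a minimal sketch, an override that ignores only "link" tags could be
placed inside the <configuration> element of conf/nutch-site.xml. The
value "link" is an assumption about what fits this case, not a tested
recommendation:

```xml
<!-- conf/nutch-site.xml: overrides the empty default from nutch-default.xml -->
<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <!-- Assumption: skipping only "link" outlinks is enough here.
       Use "img,script,link" for the value suggested in the description. -->
  <value>link</value>
</property>
```

Note that this drops every "link" outlink, including rel="canonical"
ones, so those pages are only found if they are linked elsewhere.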

Best,
Sebastian

On 03/22/2016 04:27 PM, Chaushu, Shani wrote:
> Hi,
> Sometimes the header of a page contains <link> tags that point to pages
> with source code that isn't interesting, for example
> http://......../somexmlsettingsdata?type=xml
> This link doesn't have an xml suffix, so I can't filter it out, but I want
> Nutch to get links only from the body and not from the header.
> Is this possible? (I'm using Nutch 1.9)
> 
> Thanks,
> Shani
