Hi Shani,

> Sometimes in the header of pages that are <link> tag that link
> to pages that are source code that doesn't interesting ...

Yes, that's often true.

> ... I want that the nutch will get only links from body and not from the
> header.
> Is this possible? (I'm using nutch 1.9)

Have a look at the following property:

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma separated list of HTML tags, from which outlinks
  shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. If you add any of those tags here, it
  won't be taken. Default is empty list. Probably reasonable value for
  most people would be "img,script,link".</description>
</property>

This lets you easily exclude "link" links altogether. As far as I know,
there is no way to follow only links from the body.

Also, be aware that some "link" links, e.g.,
  <link rel="canonical" href="..." />
are worth following. Of course, well-maintained sites will always
make these pages reachable via ordinary "a" links, so normally
that's not a problem.

Best,
Sebastian

On 03/22/2016 04:27 PM, Chaushu, Shani wrote:
> Hi,
> Sometimes in the header of pages that are <link> tag that link to pages that
> are source code that doesn't interesting for example
> http://......../somexmlsettingsdata?type=xml
> This link is not suffix xml so I can't filter it out but I want that the
> nutch will get only links from body and not from the header.
> Is this possible? (I'm using nutch 1.9)
>
> Thanks,
> Shani
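
For reference, a minimal sketch of how the property mentioned in the reply could be overridden. It assumes the usual Nutch layout where local overrides go into conf/nutch-site.xml rather than nutch-default.xml, and it uses the value "link" only because that is the tag discussed in this thread; the description text here is illustrative.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>parser.html.outlinks.ignore_tags</name>
    <!-- Skip outlinks from <link> tags only; "a", "img", etc. are still extracted. -->
    <value>link</value>
    <description>Do not extract outlinks from link tags.</description>
  </property>
</configuration>
```

Note that this also drops <link rel="canonical" ...> targets, and pages that were already parsed would presumably need to be parsed again before the setting has any visible effect.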

