Re: problems extracting outlinks

2017-08-10 Thread Sebastian Nagel
Hi Carlos, thanks for the follow-up. I've checked the mentioned link and Nutch 1.14:
- with parse-html the link is missing (also some more)
- with parse-tika it's extracted as expected: a self-referential link, the anchor part removed
That's a hint that we should have a closer look at the

Re: problems extracting outlinks

2017-08-09 Thread Carlos Pérez Miguel
Hi Sebastian, thank you for your answer. I am using Nutch 1.12, with the same plugins as you. I am still on this old version because I run a modified build (those plugins are not modified). I guess something changed in the parse-html plugin since my version. Anyway, I think I found a clue about what is happening. This

Re: problems extracting outlinks

2017-08-09 Thread Sebastian Nagel
Hi Carlos, sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and the call
$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
    https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
Could you tell us which