Hi Carlos,
thanks for the follow-up. I've checked the link you mentioned with Nutch 1.14:
- with parse-html the link is missing (as are some others)
- with parse-tika it's extracted as expected: a self-referential link with the
anchor part removed
That's a hint that we should have a closer look at the

Hi Sebastian,
Thank you for your answer. I am using Nutch 1.12, with the same plugins as
you. I am using this older release because I run a modified version of it
(the modifications do not touch those plugins). I guess something has changed
in the parse-html plugin since my version.
Anyway, I think I found a clue about what is happening. This

Hi Carlos,
sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and
the call
$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
Could you tell us which