Hi Carlos,

thanks for the follow-up. I've checked the link you mentioned with
Nutch 1.14:
- with parse-html the link is missing (along with some others)
- with parse-tika it's extracted as expected: a self-referential link
  with the anchor part removed

That's a hint that we should take a closer look at the problem.
Please open an issue on
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 08/09/2017 08:10 PM, Carlos Pérez Miguel wrote:
> Hi Sebastian,
>
> Thank you for your answer. I am using Nutch 1.12, with the same plugins
> as you. I am using this old version because I run a modified build (the
> modifications are not in those plugins). I guess something has changed
> in the parse-html plugin since my version.
>
> Anyway, I think I found a clue about what is happening. This page is in
> Catalan, a language in which single quotes are common. Most of the
> attributes in the HTML code are surrounded by single quotes, and some of
> the attribute values contain single quotes as well, so I think the
> parser is confused by that. For example, on line 278 of that page we can
> see this tag:
>
> <div data-group='#servei-d'atencio-al-client' class="sublevel">
>
> Thanks,
> Carlos
>
> Carlos Pérez Miguel
>
> 2017-08-09 18:47 GMT+02:00 Sebastian Nagel <wastl.na...@googlemail.com>:
>
>> Hi Carlos,
>>
>> sorry, but I'm not able to reproduce the problem using Nutch
>> 1.14-SNAPSHOT and the call
>>
>> $ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
>>     https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>>
>> Could you tell us which Nutch version is used and also which plugins
>> are enabled?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
>>> Hi,
>>>
>>> While crawling a site, I found that the crawl stopped earlier than
>>> expected because lots of the URLs being downloaded were of the form:
>>>
>>> http://www.domain.com/something/"http://www.domain.com"
>>>
>>> After reading the HTML of the pages containing those outlinks, I found
>>> that the outlinks are not included in the source code, so I guess
>>> there may be something incorrect in the page content or in the parsing
>>> done by Nutch. How can I tell which it is? I am a little lost with
>>> this one.
>>>
>>> In order to see the problem:
>>>
>>> $ bin/nutch parsechecker https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>>>
>>> And within the results we can see this particular outlink:
>>>
>>> outlink: toUrl: https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/"http://www.seguroscatalanaoccidente.com" anchor: www.seguroscatalanaoccidente.com
>>>
>>> Is there any way to solve or avoid this? Maybe with the
>>> regex-urlfilter file?
>>>
>>> Thanks
>>>
>>> Carlos Pérez Miguel
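
On the regex-urlfilter question at the end of the quoted thread: a
minimal Python sketch of the filtering idea, not Nutch code. It drops
any URL that contains a double quote, which is what the bogus outlinks
have in common. The pattern and the function name has_embedded_quote
are illustrative assumptions. Whether Nutch percent-encodes the quote
before URL filters run is not established in this thread, so the sketch
checks both the raw character and %22.

```python
import re

# Hypothetical pattern (an assumption for illustration): a literal double
# quote, or its percent-encoded form %22.
QUOTE_IN_URL = re.compile(r'"|%22')


def has_embedded_quote(url: str) -> bool:
    """Return True if the URL contains a raw or percent-encoded double quote."""
    return QUOTE_IN_URL.search(url) is not None


# The first URL is the page from the thread; the second is the bogus outlink
# that parsechecker reported for it.
urls = [
    "https://www.seguroscatalanaoccidente.com/cat/particulars/vida/"
    "assegurances-de-vida/vida-proteccio",
    "https://www.seguroscatalanaoccidente.com/cat/particulars/vida/"
    'assegurances-de-vida/"http://www.seguroscatalanaoccidente.com"',
]

# Keep only the URLs that do not contain an embedded quote.
kept = [u for u in urls if not has_embedded_quote(u)]
```

A filter like this would only hide the symptom during a crawl. The
missing and malformed links from parse-html would still need the issue
report requested above.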