Hi,

While crawling a site, I found that the crawl stopped earlier than expected because many of the URLs being downloaded were of the form:

http://www.domain.com/something/"http://www.domain.com"

After reading the HTML of the pages containing those outlinks, I found that the outlinks are not included in the source code, so I guess there may be something wrong either in the page content or in the parse made by Nutch. How can I tell which of the two is the problem? I am a little lost with this one.

To reproduce the problem:

$ bin/nutch parsechecker https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

Among the results we can see this particular outlink:

outlink: toUrl: https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/ "http://www.seguroscatalanaoccidente.com" anchor: www.seguroscatalanaoccidente.com

Is there any way to solve or avoid this? Maybe with the regex-urlfilter file?

Thanks,
Carlos Pérez Miguel
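
P.S. To make the regex-urlfilter idea concrete, here is a minimal, untested sketch of the kind of rule I have in mind for conf/regex-urlfilter.txt. It rests on two assumptions: that the double quote reaches the filter either literally or percent-encoded as %22, and that the rule is placed above the final catch-all accept line so that it is evaluated first. It would only hide the bad outlinks, not explain where they come from.

```
# Assumption: reject any URL that contains a double quote,
# either literal (") or percent-encoded (%22)
-(%22|")
```

If the quote turns out to be encoded differently by the time the filter runs, the pattern would need adjusting.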