While crawling a site, I found that the crawl stopped earlier than expected
because many of the URLs being downloaded were of the form:
After reading the HTML of the pages containing those outlinks, I found that
the outlinks are not present in the source code, so I guess there may be
something wrong either in the page content or in the parsing done by Nutch.
How can I find out which one is the problem? I am a little lost with this one.
To reproduce the problem, run:
$ bin/nutch parsechecker
And within the results we can see this particular outlink:
Is there any way to solve or avoid this? Maybe with the regex-urlfilter?
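
To make the regex-urlfilter idea concrete, here is a minimal sketch of how a
reject rule could be checked offline before putting it in
conf/regex-urlfilter.txt. Everything in it is an assumption on my side: the
URLs are hypothetical stand-ins (not the ones from my crawl), the reject
pattern is modeled on the repeated-path-segment rule that, as far as I recall,
ships in the stock regex-urlfilter.txt, and the "first matching rule decides"
logic is my understanding of how the filter behaves. It also uses Python's re
module, whereas Nutch uses Java regular expressions, so it only approximates
the real filter.

```python
import re

# Rules written in the style of conf/regex-urlfilter.txt:
# a leading '+' accepts, a leading '-' rejects.
# Assumption: the first rule whose pattern is found in the URL decides.
RULES = [
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",  # path segment repeated 3+ times (loop breaker)
    r"+.",                            # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the URL would be kept under the given rules."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matched by no rule is dropped


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://example.com/docs/index.html",
        "http://example.com/a/img/a/img/a/img/x.html",
    ):
        print(url, "->", "keep" if accepts(url) else "skip")
```

If the unwanted outlinks follow some other shape, the reject pattern would
have to be adapted to it; the sketch only shows how a candidate rule could be
tried against sample URLs first.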
Carlos Pérez Miguel