problems extracting outlinks

Carlos Pérez Miguel Wed, 09 Aug 2017 03:10:01 -0700

Hi,

While crawling a site, I found that the crawl stopped earlier than expected
because many of the URLs being downloaded were of the form:


http://www.domain.com/something/"http://www.domain.com";

After reading the HTML of the pages containing those outlinks, I found that
the outlinks are not present in the source code, so I guess something may be
wrong either in the page content or in the parsing done by Nutch.
How can I tell which one is the problem? I am a little lost with this one.

In order to see the problem:

$ bin/nutch parsechecker https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

And within the results we can see this particular outlink:
 outlink: toUrl:
https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
"http://www.seguroscatalanaoccidente.com"; anchor:
www.seguroscatalanaoccidente.com

Is there any way to solve or avoid this, perhaps with the regex-urlfilter
file?
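
To make the regex-urlfilter idea concrete, here is a minimal sketch in Python of the kind of pattern I have in mind: it flags any URL containing a double quote, either literal or percent-encoded as %22. The pattern and the helper name looks_malformed are my own assumptions for illustration, not something taken from the Nutch configuration, and I have not checked how Nutch encodes the quote internally.

```python
import re

# Hypothetical pattern (an assumption, not from Nutch's defaults):
# matches a literal double quote or its percent-encoded form anywhere in the URL.
QUOTE_IN_URL = re.compile(r'"|%22')


def looks_malformed(url: str) -> bool:
    """Return True if the URL contains a quote character, as the bad outlinks do."""
    return QUOTE_IN_URL.search(url) is not None


# The page being parsed, and the suspicious outlink reported by parsechecker.
good = ("https://www.seguroscatalanaoccidente.com/cat/particulars/vida/"
        "assegurances-de-vida/vida-proteccio")
bad = ("https://www.seguroscatalanaoccidente.com/cat/particulars/vida/"
       'assegurances-de-vida/"http://www.seguroscatalanaoccidente.com"')

print(looks_malformed(good))
print(looks_malformed(bad))
```

If I understand the file format correctly, each rule in regex-urlfilter.txt is a "-" (reject) or "+" (accept) followed by a regular expression, so the equivalent rule would presumably be a line such as -"|%22 placed before the final accept rule. That would only hide the bad URLs, though, not explain where they come from.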

Thanks

Carlos Pérez Miguel
