Re: problems extracting outlinks

Sebastian Nagel Thu, 10 Aug 2017 00:24:52 -0700

Hi Carlos,

thanks for the follow-up. I've checked the mentioned link and Nutch 1.14:
- with parse-html the link is missing (as are a few others)
- with parse-tika it's extracted as expected: a self-referential link with
the anchor part removed


That's a hint that we should take a closer look at the problem.
Please open an issue at
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian


On 08/09/2017 08:10 PM, Carlos Pérez Miguel wrote:
> Hi Sebastian,
> 
> Thank you for your answer. I am using Nutch 1.12, with the same plugins as
> you. I am using this old version because I run a modified build (the
> modifications do not touch those plugins). I guess something has changed in
> the parse-html plugin since my version.
> 
> Anyway, I think I found a clue about what is happening. This page is in
> Catalan, a language in which single quotes (apostrophes) are common. Most of
> the attributes in the HTML code are surrounded by single quotes, and some of
> the attribute values contain single quotes as well, so I think the
> parser is confused by that. For example, on line 278 of that page we can see
> this tag:
> 
> <div data-group='#servei-d'atencio-al-client' class="sublevel">
> 
> Thanks,
> Carlos
> 
> Carlos Pérez Miguel
> 
> 2017-08-09 18:47 GMT+02:00 Sebastian Nagel <[email protected]>:
> 
>> Hi Carlos,
>>
>> sorry but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT
>> and the call
>>
>> $ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
>>   https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>>
>> Could you tell us which Nutch version is used and also which plugins are
>> enabled?
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
>>> Hi,
>>>
>>> While crawling a site, I found that the crawl stopped earlier than expected
>>> because many of the URLs being downloaded were of the form:
>>>
>>> http://www.domain.com/something/"http://www.domain.com";
>>>
>>> After reading the HTML of the pages containing those outlinks, I found that
>>> the outlinks are not included in the source code, so I guess there may
>>> be something incorrect in the page content or in the parsing done by Nutch.
>>> How can I tell which it is? I am a little lost with this one.
>>>
>>> In order to see the problem:
>>>
>>> $ bin/nutch parsechecker \
>>>   https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>>>
>>> And within the results we can see this particular outlink:
>>>  outlink: toUrl: https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/"http://www.seguroscatalanaoccidente.com"; anchor: www.seguroscatalanaoccidente.com
>>>
>>> Is there any way to solve or avoid this? Maybe with the regex-urlfilter
>>> file?
>>>
>>> Thanks
>>>
>>> Carlos Pérez Miguel
>>>
>>
>>
>
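
Editorial note: Carlos's observation about the single-quoted attribute can be illustrated outside Nutch. The sketch below is only an assumption-laden stand-in: it feeds the tag quoted in the thread to Python's standard-library html.parser, which is not the parser used by the parse-html plugin, so it shows how the extra apostrophe can make a tolerant parser split the attribute value early, not what Nutch itself does. The names AttrCollector and parse_attrs are hypothetical helpers introduced for this example.

```python
from html.parser import HTMLParser

# The tag quoted in the thread (reported by Carlos as line 278 of the page).
TAG = """<div data-group='#servei-d'atencio-al-client' class="sublevel">"""


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as the parser tokenized them.
        self.seen.append((tag, attrs))


def parse_attrs(markup):
    """Parse a markup fragment and return the collected start tags."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


if __name__ == "__main__":
    for tag, attrs in parse_attrs(TAG):
        print(tag)
        for name, value in attrs:
            # The intended value "#servei-d'atencio-al-client" is not
            # recovered: the apostrophe inside it closes the quoted value.
            print("  ", repr(name), "=", repr(value))
```

Running the script prints the attribute list as this parser sees it. The data-group value ends at the apostrophe, and the remainder of the intended value is treated as a separate, valueless attribute. A different parser may recover differently, which would be consistent with parse-html and parse-tika disagreeing on this page.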
