OK, the parsing part is more or less clear (in theory): Nutch uses some kind of 
legacy of the ancients for parsing.

The error comes from both of the parsers available for HTML:

  private DocumentFragment parse(InputSource input) throws Exception {
    if (parserImpl.equalsIgnoreCase("tagsoup"))
      return parseTagSoup(input);
    else
      return parseNeko(input);
  }
 
Neko and TagSoup have both been dead for 4+ years 
(https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
If I parse the page online with one of the modern parsers, such as 
https://jsoup.org/, it works fine.

That is quite surprising, considering that HTML parsing is THE core part of any crawler.
 

Sent: Wednesday, November 14, 2018 at 3:32 PM
From: "Semyon Semyonov" <semyon.semyo...@mail.com>
To: user@nutch.apache.org
Subject: Quality problems of crawling. Parsing(Missing attribute name), 
fetching(empty body) and javascript.
Hi everyone,


We are testing the quality of our crawl for one of our country domains against 
another public crawling tool ( 
http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs
 ).
All the web pages were tested via both the crawl script and the parsechecker 
tool, with both Tika and the default HTML plugin. 
 
The results are not very good compared to that tool; I would appreciate it if 
you could give me a hint. 


I classify several types of problems:
 
1) Parsing problems.
 
http://www.vialucy.nl/
During parsing I get a bunch of messages such as "[Error] :4:23: Missing 
attribute name", and as a result I get an empty page back.
 
 
2) Fetching problems 

https://www.vishandelbunschoten.nl/
The fetch returns HTTP/1.1 200 OK in the header but an empty body.
 
 
3) JavaScript problems
 
http://www.amphar.com/Home.html
Returns an empty body because of JavaScript.
 

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
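
For what it is worth, the markup quoted above contains no script, only a meta refresh pointing to Home.html with an empty body. A minimal sketch of how such a target could be extracted and resolved, using only the Java standard library, is below. The class MetaRefreshSketch and the method refreshTarget are hypothetical names, not part of Nutch, and the site root used as the page URL is an assumption. The regular expression only handles the attribute order seen in this page; a real crawler should rely on a proper HTML parser.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): finds the target of a
// <meta http-equiv="refresh"> tag and resolves it against the page URL.
final class MetaRefreshSketch {

    // Matches <meta http-equiv="refresh" content="N;url=TARGET"> with the
    // attributes in exactly this order; other layouts are not handled.
    private static final Pattern REFRESH = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']refresh[\"']\\s+content\\s*=\\s*[\"']"
            + "\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute refresh target, or empty if no refresh tag is found.
    static Optional<URI> refreshTarget(String pageUrl, String html) {
        Matcher m = REFRESH.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // trim() removes any whitespace left around the captured target
        String target = m.group(1).trim();
        return Optional.of(URI.create(pageUrl).resolve(target));
    }

    public static void main(String[] args) {
        String html = "<html><head><title></title>"
            + "<meta http-equiv=\"refresh\" content=\"0;url= Home.html\" />"
            + "</head><body></body></html>";
        // The page URL is an assumption made for illustration.
        System.out.println(refreshTarget("http://www.amphar.com/", html));
    }
}
```

If the resolved target is the same URL that was just fetched, following the refresh would simply loop, which may be worth checking before blaming JavaScript.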
 
Another example:
https://www.sizo.com/

How can we crawl these JavaScript websites? Enabling JavaScript parsing in Tika 
does not help.



Thanks.

Semyon.

 
