Ok, with parsing it is more or less clear (in theory): Nutch uses some kind of
legacy of the ancients for parsing.
The error comes from both of the parsers available for HTML:
    private DocumentFragment parse(InputSource input) throws Exception {
      if (parserImpl.equalsIgnoreCase("tagsoup"))
        return parseTagSoup(input);
      else
        return parseNeko(input);
    }
Neko and TagSoup have both been dead for 4+
years (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
If I try to parse the page online with one of the modern parsers, such as
https://jsoup.org/, it works fine.
Quite surprising, considering that this is THE core part of any parser.
Sent: Wednesday, November 14, 2018 at 3:32 PM
From: "Semyon Semyonov" <[email protected]>
To: [email protected]
Subject: Quality problems of crawling. Parsing(Missing attribute name),
fetching(empty body) and javascript.
Hi everyone,
We are testing the quality of our crawl for one of our country domains against
another public crawling tool (
http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs
).
All the web pages were tested via both the crawl script and the parsechecker
tool, for both Tika and the default HTML plugin.
The results are not very good compared to the tool; I would appreciate it if you
could give me a hint.
I classify several types of problems:
1) Parsing problems
http://www.vialucy.nl/
During parsing I get a bunch of messages such as "[Error] :4:23: Missing
attribute name", and as a result I get an empty page back.
2) Fetching problems
https://www.vishandelbunschoten.nl/
The fetch returns HTTP/1.1 200 OK in the header but an empty body.
3) JavaScript problems
http://www.amphar.com/Home.html
Returns an empty body because of JavaScript:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML
1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
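Note that the markup above is a meta refresh redirect to Home.html with an empty
body, so whatever consumes it has to follow the refresh target to reach the real
page. A minimal sketch of extracting that target in plain Java follows; the class
name MetaRefresh, the regex, and the base URI are illustrative assumptions, not
Nutch code, and the regex only covers the attribute order seen in this page.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls the target out of a
// <meta http-equiv="refresh" content="N;url=TARGET"> tag.
class MetaRefresh {
    // Assumes http-equiv comes before content, as in the page above.
    private static final Pattern REFRESH = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']refresh[\"']\\s+content\\s*=\\s*[\"']"
            + "\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the refresh target resolved against base, or empty if none is found.
    static Optional<URI> target(String html, URI base) {
        Matcher m = REFRESH.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String raw = m.group(1).trim();
        if (raw.isEmpty()) {
            return Optional.empty();
        }
        // Relative targets such as "Home.html" are resolved against the page URI.
        return Optional.of(base.resolve(raw));
    }

    public static void main(String[] args) {
        String html = "<html><head><title></title><meta\n"
            + "http-equiv=\"refresh\" content=\"0;url= Home.html\" /></head>"
            + "<body></body></html>";
        // Assumption: the markup was served from the site root.
        URI base = URI.create("http://www.amphar.com/");
        System.out.println(target(html, base).orElse(null));
    }
}
```

With the site root as base, the relative target resolves to
http://www.amphar.com/Home.html, which is the URL that would need to be fetched
next.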
Another example:
https://www.sizo.com/
How can we crawl these JavaScript websites? Activating JavaScript in Tika doesn't
help.
Thanks.
Semyon.