Hi everyone,
We are testing the quality of our crawl for one of our country domains against another public crawling tool ( http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs ). All the web pages were tested with both the crawl script and the parsechecker tool, using both Tika and the default HTML plugin. The results are not very good compared to that tool, and I would appreciate any hints.

I have classified the problems into several types:

1) Parsing problems
http://www.vialucy.nl/
During parsing I get a bunch of messages such as
[Error] :4:23: Missing attribute name
and as a result I get an empty page back.

2) Fetching problems
https://www.vishandelbunschoten.nl/
The fetch returns HTTP/1.1 200 OK in the header but an empty body.

3) JavaScript problems
http://www.amphar.com/Home.html
Returns an empty body because of JavaScript:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
Another example: https://www.sizo.com/

How can these JavaScript websites be crawled? Activating Tika JavaScript parsing doesn't help.

Thanks,
Semyon
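P.S. In case it helps with reproducing case 3: the body I get back contains a meta refresh tag, so below is a minimal sketch for pulling the refresh target out of such a page. It uses only the Python standard library and is not part of Nutch; the names MetaRefreshFinder and meta_refresh_target are my own, and the base URL in the usage line is only an assumption for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta" or self.target is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;url= Home.html":
        # a delay in seconds, a semicolon, then url=<target>.
        content = attr.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip()


def meta_refresh_target(html, base_url):
    """Return the absolute meta refresh target, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    return urljoin(base_url, finder.target)


# The body returned for the amphar.com example above.
PAGE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title>'
    '<meta http-equiv="refresh" content="0;url= Home.html" />'
    '</head><body></body></html>'
)

if __name__ == "__main__":
    # Base URL assumed here only for the sake of the example.
    print(meta_refresh_target(PAGE, "http://www.amphar.com/"))
```

This only shows where the page wants the client to go next; it does not execute any JavaScript, so it would not help with a site that builds its content in scripts.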

