Hi everyone,

We are testing the quality of our crawl for one of our country domains against 
another public crawling tool 
(http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
All the web pages were tested with both the crawl script and the parsechecker 
tool, using both the Tika and the default HTML parser plugins.
 
The results are not very good compared to that tool, and I would appreciate any 
hints.


I have grouped the problems into several types:
 
1) Parsing problems.
 
http://www.vialucy.nl/
During parsing I get a bunch of messages such as [Error] :4:23: Missing 
attribute name, and as a result I get an empty page back.
 
 
2) Fetching problems 

https://www.vishandelbunschoten.nl/
The fetch returns an HTTP/1.1 200 OK header but an empty body.
 
 
3) JavaScript problems
 
http://www.amphar.com/Home.html 
Returns an empty body because of JavaScript:
 

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 
1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html 
xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta 
http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
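
The markup above has an empty body, and the only instruction in it is a meta 
refresh pointing to Home.html. To make that redirect explicit, here is a minimal 
sketch in Python (standard library only) of how the refresh target could be 
extracted and resolved against the URL the page was fetched from. This is not 
Nutch code; the names MetaRefreshParser and refresh_target are made up for 
illustration, and the base URL in the usage note is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Hypothetical helper: records the target of the first meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag != "meta" or self.target is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        content = attr.get("content") or ""
        # content looks like "0;url= Home.html": a delay, then the target.
        _, _, rest = content.partition(";")
        key, _, value = rest.partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def refresh_target(html, base_url):
    """Return the absolute refresh target, or None if the page has none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)
```

Assuming the markup was served for http://www.amphar.com/, calling 
refresh_target(markup, "http://www.amphar.com/") would give 
http://www.amphar.com/Home.html as the next URL to fetch.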
 
Another example:
https://www.sizo.com/

How can these JavaScript websites be crawled? Activating the Tika JavaScript 
parsing doesn't help.



Thanks.

Semyon.

 
