Hi Semyon,

I've tried to reproduce your problems using the current Nutch master (the upcoming 
1.16).
I cannot see any issues, except that JavaScript is not executed, but that's expected.
Of course, you are free to use parse-tika instead of parse-html, which is legacy.
See results below.
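
To make that switch permanent rather than passing -Dplugin.includes on each call,
plugin.includes can be overridden in conf/nutch-site.xml. A minimal sketch (the plugin
list below is only an example, adjust it to your setup):

<property>
  <name>plugin.includes</name>
  <value>protocol-okhttp|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Use parse-tika instead of the legacy parse-html plugin.</description>
</property>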

Best,
Sebastian

> http://www.vialucy.nl/

Successfully fetched and parsed (no errors). Of course, there is no content kept
because of robots=noindex. Here is the output of parsechecker:

% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
     -dumpText http://www.vialucy.nl/
...
Parse Metadata:
  dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
  Content-Encoding=UTF-8
  generator=WordPress 3.1
  robots=noindex,nofollow
  Content-Language=en-US
  Content-Type=text/html; charset=UTF-8
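
To illustrate the effect of the noindex directive (a minimal standalone sketch using
jsoup, not Nutch code): a parser that honours the robots meta tag still reports the
title and metadata, but keeps no text content.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NoindexDemo {
  public static void main(String[] args) {
    // Hypothetical page carrying robots=noindex,nofollow, like the one above.
    String html = "<html><head><title>Vialucy</title>"
        + "<meta name='robots' content='noindex,nofollow'></head>"
        + "<body><p>Some article text</p></body></html>";
    Document doc = Jsoup.parse(html);
    String robots = doc.select("meta[name=robots]").attr("content");
    boolean noIndex = robots.toLowerCase().contains("noindex");
    // Title and metadata stay available; the extracted text is dropped.
    String text = noIndex ? "" : doc.body().text();
    System.out.println("title=" + doc.title() + " robots=" + robots + " text='" + text + "'");
  }
}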


> https://www.vishandelbunschoten.nl/
Succeeds if you can trick the anti-bot software; otherwise the server sends
empty content back. This was recently discussed on this list.


> 3) JavaScript problems
>
> http://www.amphar.com/Home.html

Yes, JavaScript is not executed. But fetching and parsing work fine for the
HTML page as such:

% bin/nutch parsechecker  -Dplugin.includes='protocol-okhttp|parse-tika' \
     -dumpText http://www.amphar.com/Home.html
fetching: http://www.amphar.com/Home.html
...
Status: success(1,0)
Title: Home
Outlinks: 19
...
Parse Metadata:
  iWeb-Build=local-build-20140815
  X-UA-Compatible=IE=EmulateIE7
  viewport=width=700
  dc:title=Home
  Content-Encoding=UTF-8
  Content-Type-Hint=text/html; charset=UTF-8
  Content-Language=en
  Content-Type=application/xhtml+xml; charset=UTF-8
  Generator=iWeb 3.0.4

Founded in 1975, Amphar B.V. provides solutions, services and support to the generic pharmaceutical industry.
Headquartered in Amsterdam, The Netherlands, we assist our customers in identifying and developing new products, carefully select or initiate appropriate sources for Active Pharmaceutical Ingredients (APIs), develop and test formulations as well as compilation and submission of the required regulatory documentation and data.
With our dedicated staff of experienced professionals and our logistics centre at Amsterdam Schiphol International Airport, we are well positioned to anticipate and react swiftly to the dynamic requirements of our customers.
Amphar B.V.
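
As an aside on the parser question in the quoted message below: the jsoup check
Semyon mentions can be reproduced standalone with a few lines. This is only a
sketch (not Nutch code); it merely confirms that a modern HTML parser handles the
markup without the "Missing attribute name" errors:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParseCheck {
  public static void main(String[] args) throws Exception {
    // Fetch and parse the page that parse-html (NekoHTML/TagSoup) reportedly chokes on.
    Document doc = Jsoup.connect("http://www.vialucy.nl/")
        .userAgent("Mozilla/5.0 (compatible; parse-test)")
        .get();
    System.out.println("title: " + doc.title());
    System.out.println("outlinks: " + doc.select("a[href]").size());
    System.out.println("text length: " + doc.body().text().length());
  }
}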
 


On 11/15/18 1:30 PM, Semyon Semyonov wrote:
> OK, with parsing it is more or less clear (in theory): Nutch uses some kind
> of legacy of the ancients for parsing.
> 
> The error comes from both parsers available for HTML:
> 
>   private DocumentFragment parse(InputSource input) throws Exception {
>     if (parserImpl.equalsIgnoreCase("tagsoup"))
>       return parseTagSoup(input);
>     else
>       return parseNeko(input);
>   }
>  
> Neko and TagSoup have both been dead for 4+ years
> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
> If I try to parse it with a modern parser such as https://jsoup.org/,
> it works fine.
> 
> Quite amazing, considering that this is THE core part of any parser.
>  
> 
> Sent: Wednesday, November 14, 2018 at 3:32 PM
> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
> To: user@nutch.apache.org
> Subject: Quality problems of crawling. Parsing(Missing attribute name), 
> fetching(empty body) and javascript.
> Hi everyone,
> 
> 
> We are testing the quality of our crawl for one of our domain countries
> against another public crawling tool
> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
> All the webpages were tested via both the crawl script and the parsechecker tool,
> with both the Tika and the default HTML plugin.
>  
> The results are not very good compared to that tool; I would appreciate it if
> you could give me a hint.
> 
> 
> I classify the problems into several types:
>  
> 1) Parsing problems.
>  
> http://www.vialucy.nl/
> During parsing I got a bunch of messages such as "[Error] :4:23: Missing
> attribute name" and as a result I got an empty page back.
>  
>  
> 2) Fetching problems 
> 
> https://www.vishandelbunschoten.nl/
> The fetch returns HTTP/1.1 200 OK in the header but an empty body.
>  
>  
> 3) JavaScript problems
>  
> http://www.amphar.com/Home.html
> Returns an empty body because of JavaScript:
>  
> 
> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD 
> XHTML 1.0 Transitional//EN" 
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd[http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd]";><html
>  xmlns="http://www.w3.org/1999/xhtml";><head><title></title><meta 
> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
>  
> Another example:
> https://www.sizo.com/
> 
> How can we crawl these JavaScript websites? Activating Tika JavaScript
> doesn't help.
> 
> 
> 
> Thanks.
> 
> Semyon.
> 
>  
> 
