Everyone, we need some kind of commercial support (maybe extra tools) for
improving the quality of crawling and fixing similar issues. If you are
interested, please contact me.

Sebastian,
My bad, I had another version (a modified 1.14).
In addition, the results are easy to misunderstand.

bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -dumpText \
  http://www.vialucy.nl/
returns:
Parse Metadata: dc:title=Vialucy | nieuws

bin/nutch parsechecker -dumpText http://www.vialucy.nl/
returns:
Parse Metadata:

So the default parser returns empty metadata and no error messages, which is a
bit confusing.
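
(For reference: to avoid passing -Dplugin.includes on every invocation, the
plugin list can also be set permanently in conf/nutch-site.xml. A sketch --
the exact value below is only an example, your plugin list will differ:)

```xml
<!-- conf/nutch-site.xml: make protocol-okhttp and parse-tika the defaults -->
<property>
  <name>plugin.includes</name>
  <value>protocol-okhttp|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```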

Thanks.


Sent: Thursday, November 15, 2018 at 3:05 PM
From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
To: user@nutch.apache.org
Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
fetching(empty body) and javascript.
Hi Semyon,

> Are there any reasons to keep the default HTML plugin there? Only for
> maintenance?

Are there really HTML pages where parse-html fails?

From my experience it still does a good job and parses almost every HTML page,
including HTML5. But I've never run any large scale comparison.

One argument pro: it's much smaller. While parse-tika including dependencies
uses around 60 MB, parse-html ships with only a few hundred kB.

Regarding http://www.vialucy.nl/ : if the noindex is removed the page
is parsed well by parse-tika and parse-html and the outputs only differ
in white space in the parsed text.

Of course, for the long term parse-html should either be actively maintained
or be dropped.

Best,
Sebastian

On 11/15/18 2:39 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>  
> Thanks for the detailed response.
> I will try to migrate to Tika.
>
> Are there any reasons to keep the default HTML plugin there? Only for
> maintenance?
>  
> Semyon. 
>
> Sent: Thursday, November 15, 2018 at 2:23 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
> To: user@nutch.apache.org
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
> fetching(empty body) and javascript.
> Hi Semyon,
>
> I've tried to reproduce your problems using the recent Nutch master (upcoming 
> 1.16).
> I cannot see any issues, except that Javascript is not executed, but that's
> expected.
> Of course, you are free to use parse-tika instead of the legacy parse-html.
> See results below.
>
> Best,
> Sebastian
>
>> http://www.vialucy.nl/
>
> Successfully fetched and parsed (no errors). Of course, there is no content 
> kept
> because of robots=noindex. Here the output of parsechecker:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' 
> -dumpText 
> http://www.vialucy.nl/
> ...
> Parse Metadata:
> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
> Content-Encoding=UTF-8
> generator=WordPress 3.1
> robots=noindex,nofollow
> Content-Language=en-US
> Content-Type=text/html; charset=UTF-8
>
>
>> https://www.vishandelbunschoten.nl/
> Succeeds if you can trick the anti-bot software, otherwise the server sends
> empty content back. Recently discussed on this list.
>
>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>
> Yes, Javascript is not executed. But fetching and parsing works pretty fine
> for the HTML page as such:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
> -dumpText http://www.amphar.com/Home.html
> fetching: http://www.amphar.com/Home.html
> ...
> Status: success(1,0)
> Title: Home
> Outlinks: 19
> ...
> Parse Metadata:
> iWeb-Build=local-build-20140815
> X-UA-Compatible=IE=EmulateIE7
> viewport=width=700
> dc:title=Home
> Content-Encoding=UTF-8
> Content-Type-Hint=text/html; charset=UTF-8
> Content-Language=en
> Content-Type=application/xhtml+xml; charset=UTF-8
> Generator=iWeb 3.0.4
>
> Founded in 1975, Amphar B.V. provides solutions, services and support to the 
> generic pharmaceutical
> industry.
> Headquartered in Amsterdam, The Netherlands, we assist our customers in 
> identifying and developing
> new products, carefully select or initiate appropriate sources for Active 
> Pharmaceutical Ingredients
> (APIs), develop and test formulations as well as compilation and submission 
> of the required
> regulatory documentation and data.
> With our dedicated staff of experienced professionals and our logistics 
> centre at Amsterdam Schiphol
> International Airport, we are well positioned to anticipate and react swiftly 
> to the dynamic
> requirements of our customers.
> Amphar B.V.
>  
>
>
> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>> Ok, with parsing it is more or less clear (in theory): Nutch uses some
>> legacy-of-the-ancients code for parsing.
>>
>> The error comes from both parsers available for html
>>
>> private DocumentFragment parse(InputSource input) throws Exception {
>>   if (parserImpl.equalsIgnoreCase("tagsoup"))
>>     return parseTagSoup(input);
>>   else
>>     return parseNeko(input);
>> }
>>  
>> Neko and TagSoup have both been dead for 4+ years
>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>> If I try to parse the page online with one of the modern parsers such as
>> https://jsoup.org/ it works fine.
>>
>> Quite surprising, considering that this is THE core part of any parser.
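
(An aside from me: lenient parsers generally recover from a stray quote where
an attribute name is expected. A dependency-free sketch using the JDK's own,
very forgiving Swing HTML parser instead of jsoup -- only to illustrate the
recovery, not as a replacement suggestion; the markup below is a made-up
example of the kind of error in question:)

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LenientParseDemo {
    // Extract plain text from (possibly broken) HTML with the JDK's
    // lenient Swing parser; bad attributes are reported but skipped.
    static String extractText(String html) throws Exception {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A bare quote where an attribute name is expected -- the kind of
        // markup that triggers "Missing attribute name" in stricter parsers.
        String html = "<html><body><p \"oops\">hello</p></body></html>";
        System.out.println(extractText(html));
    }
}
```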
>>  
>>
>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
>> To: user@nutch.apache.org
>> Subject: Quality problems of crawling. Parsing(Missing attribute name), 
>> fetching(empty body) and javascript.
>> Hi everyone,
>>
>>
>> We are testing the quality of our crawl for one of our domain countries
>> against another public crawling tool
>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>> All the webpages were tested via both the crawl script and the parsechecker
>> tool, for both the Tika and the default HTML plugin.
>>
>> The results are not very good compared to the tool; I would appreciate a
>> hint.
>>
>>
>> I classify several types of problems:
>>  
>> 1) Parsing problems.
>>  
>> http://www.vialucy.nl/
>> During parsing I get a bunch of messages such as "[Error] :4:23: Missing
>> attribute name" and as a result I get an empty page back.
>>  
>>  
>> 2) Fetching problems 
>>
>> https://www.vishandelbunschoten.nl/
>> The fetch returns an "HTTP/1.1 200 OK" header but an empty body.
>>  
>>  
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>>
>> Returns an empty body because of Javascript:
>>  
>>
>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
>> XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
>> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
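
(Side note from me: a page like this one can actually be followed without
executing any Javascript, because the redirect is declared in a
<meta http-equiv="refresh"> tag, which Nutch's parsers already handle. A
rough, hypothetical sketch of pulling out the target with a regex -- not
Nutch code:)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefreshTarget {
    // Tolerant match for content="<seconds>;url=<target>" in a meta-refresh
    // tag; group(1) captures the redirect target.
    private static final Pattern REFRESH = Pattern.compile(
            "content=\"\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    static String refreshTarget(String html) {
        Matcher m = REFRESH.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"refresh\" "
                    + "content=\"0;url= Home.html\" />";
        System.out.println(refreshTarget(page)); // -> Home.html
    }
}
```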
>>  
>> Another example:
>> https://www.sizo.com/
>>
>> How to crawl these Javascript websites? Activating Tika's Javascript
>> support doesn't help.
>>
>>
>>
>> Thanks.
>>
>> Semyon.
>>
>>  
>>
>  
>
 
