Everyone, we need some kind of commercial support (maybe extra tools) for
improving the quality of crawling and fixing similar issues. If you are
interested, please contact me.

Sebastian,
My bad, I had another version (a modified 1.14).
In addition, the results are easy to misunderstand.

bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -dumpText \
  http://www.vialucy.nl/
returns:
Parse Metadata: dc:title=Vialucy | nieuws

bin/nutch parsechecker -dumpText http://www.vialucy.nl/
returns:
Parse Metadata:

So the default parser returns empty metadata and no error messages, which is a
bit confusing.
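
(For reference: to avoid passing -Dplugin.includes on every invocation, the
plugin list can also be set permanently in conf/nutch-site.xml. A sketch --
the exact value below is only an example, your plugin list will differ:)

```xml
<!-- conf/nutch-site.xml: make protocol-okhttp and parse-tika the defaults -->
<property>
  <name>plugin.includes</name>
  <value>protocol-okhttp|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```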

Thanks.


Sent: Thursday, November 15, 2018 at 3:05 PM
From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
To: user@nutch.apache.org
Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
fetching(empty body) and javascript.
Hi Semyon,

> Are there any reasons to keep the default HTML plugin there? Only for
> maintenance?

Are there really HTML pages where parse-html fails?

From my experience it still does a good job and parses almost every HTML page,
including HTML5. But I've never run any large scale comparison.

One argument pro: it's much smaller. While parse-tika including dependencies
uses around 60 MB, parse-html ships with only a few hundred kB.

Regarding http://www.vialucy.nl/ : if the noindex is removed the page
is parsed well by parse-tika and parse-html and the outputs only differ
in white space in the parsed text.

Of course, for the long term parse-html should either be actively maintained
or be dropped.

Best,
Sebastian

On 11/15/18 2:39 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>  
> Thanks for the detailed response.
> I will try to migrate to Tika.
>
> Are there any reasons to keep the default HTML plugin there? Only for
> maintenance?
>  
> Semyon. 
>
> Sent: Thursday, November 15, 2018 at 2:23 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
> To: user@nutch.apache.org
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
> fetching(empty body) and javascript.
> Hi Semyon,
>
> I've tried to reproduce your problems using the recent Nutch master (upcoming 
> 1.16).
> I cannot see any issues, except that Javascript is not executed, but that's
> expected.
> Of course, you are free to use parse-tika instead of the legacy parse-html.
> See results below.
>
> Best,
> Sebastian
>
>> http://www.vialucy.nl/
>
> Successfully fetched and parsed (no errors). Of course, there is no content 
> kept
> because of robots=noindex. Here the output of parsechecker:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' 
> -dumpText 
> http://www.vialucy.nl/
> ...
> Parse Metadata:
> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
> Content-Encoding=UTF-8
> generator=WordPress 3.1
> robots=noindex,nofollow
> Content-Language=en-US
> Content-Type=text/html; charset=UTF-8
>
>
>> https://www.vishandelbunschoten.nl/
> Succeeds if you can trick the anti-bot software, otherwise the server sends
> empty content back. Recently discussed on this list.
>
>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>
> Yes, Javascript is not executed. But fetching and parsing works pretty fine
> for the HTML page as such:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
> -dumpText http://www.amphar.com/Home.html
> fetching: http://www.amphar.com/Home.html
> ...
> Status: success(1,0)
> Title: Home
> Outlinks: 19
> ...
> Parse Metadata:
> iWeb-Build=local-build-20140815
> X-UA-Compatible=IE=EmulateIE7
> viewport=width=700
> dc:title=Home
> Content-Encoding=UTF-8
> Content-Type-Hint=text/html; charset=UTF-8
> Content-Language=en
> Content-Type=application/xhtml+xml; charset=UTF-8
> Generator=iWeb 3.0.4
>
> Founded in 1975, Amphar B.V. provides solutions, services and support to the 
> generic pharmaceutical
> industry.
> Headquartered in Amsterdam, The Netherlands, we assist our customers in 
> identifying and developing
> new products, carefully select or initiate appropriate sources for Active 
> Pharmaceutical Ingredients
> (APIs), develop and test formulations as well as compilation and submission 
> of the required
> regulatory documentation and data.
> With our dedicated staff of experienced professionals and our logistics 
> centre at Amsterdam Schiphol
> International Airport, we are well positioned to anticipate and react swiftly 
> to the dynamic
> requirements of our customers.
> Amphar B.V.
>  
>
>
> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>> Ok, with parsing it is more or less clear (in theory): Nutch uses some
>> legacy-of-the-ancients code for parsing.
>>
>> The error comes from both parsers available for html
>>
>> private DocumentFragment parse(InputSource input) throws Exception {
>>   if (parserImpl.equalsIgnoreCase("tagsoup"))
>>     return parseTagSoup(input);
>>   else
>>     return parseNeko(input);
>> }
>>  
>> Neko and TagSoup have both been dead for 4+ years
>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>> If I try to parse the page online with one of the modern parsers such as
>> https://jsoup.org/ it works fine.
>>
>> Quite surprising, considering that this is THE core part of any parser.
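
(An aside from me: lenient parsers generally recover from a stray quote where
an attribute name is expected. A dependency-free sketch using the JDK's own,
very forgiving Swing HTML parser instead of jsoup -- only to illustrate the
recovery, not as a replacement suggestion; the markup below is a made-up
example of the kind of error in question:)

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LenientParseDemo {
    // Extract plain text from (possibly broken) HTML with the JDK's
    // lenient Swing parser; bad attributes are reported but skipped.
    static String extractText(String html) throws Exception {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A bare quote where an attribute name is expected -- the kind of
        // markup that triggers "Missing attribute name" in stricter parsers.
        String html = "<html><body><p \"oops\">hello</p></body></html>";
        System.out.println(extractText(html));
    }
}
```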
>>  
>>
>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
>> To: user@nutch.apache.org
>> Subject: Quality problems of crawling. Parsing(Missing attribute name), 
>> fetching(empty body) and javascript.
>> Hi everyone,
>>
>>
>> We are testing the quality of our crawl for one of our domain countries
>> against another public crawling tool
>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>> All the webpages were tested via both the crawl script and the parsechecker
>> tool, for both the Tika and the default HTML plugin.
>>
>> The results are not very good compared to the tool; I would appreciate a
>> hint.
>>
>>
>> I classify several types of problems:
>>  
>> 1) Parsing problems.
>>  
>> http://www.vialucy.nl/
>> During parsing I get a bunch of messages such as "[Error] :4:23: Missing
>> attribute name" and as a result I get an empty page back.
>>  
>>  
>> 2) Fetching problems 
>>
>> https://www.vishandelbunschoten.nl/
>> The fetch returns an "HTTP/1.1 200 OK" header but an empty body.
>>  
>>  
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>>
>> Returns an empty body because of Javascript:
>>  
>>
>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
>> XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
>> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
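
(Side note from me: a page like this one can actually be followed without
executing any Javascript, because the redirect is declared in a
<meta http-equiv="refresh"> tag, which Nutch's parsers already handle. A
rough, hypothetical sketch of pulling out the target with a regex -- not
Nutch code:)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefreshTarget {
    // Tolerant match for content="<seconds>;url=<target>" in a meta-refresh
    // tag; group(1) captures the redirect target.
    private static final Pattern REFRESH = Pattern.compile(
            "content=\"\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    static String refreshTarget(String html) {
        Matcher m = REFRESH.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"refresh\" "
                    + "content=\"0;url= Home.html\" />";
        System.out.println(refreshTarget(page)); // -> Home.html
    }
}
```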
>>  
>> Another example:
>> https://www.sizo.com/
>>
>> How to crawl these Javascript websites? Activating Tika's Javascript
>> support doesn't help.
>>
>>
>>
>> Thanks.
>>
>> Semyon.
>>
>>  
>>
>  
>
 
