Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

Sebastian Nagel Mon, 19 Nov 2018 12:36:01 -0800

Hi Semyon,

> # Logging Threshold
> log4j.threshold=ALL


Ok, I get similar messages with
   log4j.logger.org.apache.nutch=TRACE

  [Error] :24:21: Missing attribute name.
  [Warning] :27:16: Start element <DIV> automatically closes element <P>.

I think they can be ignored unless some content is missing from the output of
parse-html.

Best,
Sebastian

On 11/19/18 5:04 PM, Semyon Semyonov wrote:
> Update: I finally managed to find out why I got these kinds of messages in my 
> version:
> "Missing attribute name and as a result I have an empty page back" 
> 
> It is not because of the code but because of the logging properties. I have 
> now managed to reproduce it with the master branch.
> 
> With these log settings
> # Logging Threshold
> log4j.threshold=ALL
> 
> I receive
> [Error] :23:70: Missing attribute name.
> [Error] :24:68: Missing attribute name.
> [Error] :25:108: Missing attribute name.
> 
> etc...
> 
> Are these errors important?
> 
> 
> 
> 
> 
> 
> Sent: Thursday, November 15, 2018 at 3:33 PM
> From: "Semyon Semyonov" <[email protected]>
> To: [email protected]
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
> fetching(empty body) and javascript.
> Everyone, we need some kind of commercial support (maybe extra tools) for 
> improving the quality of crawling and fixing similar issues. If you are 
> interested, please contact me.
> 
> Sebastian,
> My bad, I had another version (modified 1.14).
> In addition, it is easy to misunderstand the results.
> 
> bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' 
> -dumpText http://www.vialucy.nl/ returns
> Parse Metadata: dc:title=Vialucy | nieuws
> 
> bin/nutch parsechecker -dumpText 
> http://www.vialucy.nl/
> Parse Metadata:
> 
> So the default setup returns empty metadata and no error messages. This is a 
> bit confusing.
> 
> Thanks.
> 
> 
> Sent: Thursday, November 15, 2018 at 3:05 PM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
> fetching(empty body) and javascript.
> Hi Semyon,
> 
>> Are there any reasons to keep the default HTML plugin there? Only for 
>> maintenance?
> 
> Are there really HTML pages where parse-html fails?
> 
> From my experience it still does a good job and parses almost every HTML page,
> including HTML5. But I've never run any large scale comparison.
> 
> One argument pro: it's much smaller. While parse-tika including dependencies 
> uses around 60 MB,
> parse-html ships with only a few hundred kB.
> 
> Regarding http://www.vialucy.nl/ : if the noindex is removed the page
> is parsed well by parse-tika and parse-html and the outputs only differ
> in white space in the parsed text.
> 
> Of course, in the long term parse-html should either be actively maintained
> or be dropped.
> 
> Best,
> Sebastian
> 
> On 11/15/18 2:39 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>  
>> Thanks for the detailed response.
>> I will try to migrate to Tika.
>>
>> Are there any reasons to keep the default HTML plugin there? Only for 
>> maintenance?
>>  
>> Semyon. 
>>
>> Sent: Thursday, November 15, 2018 at 2:23 PM
>> From: "Sebastian Nagel" <[email protected]>
>> To: [email protected]
>> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name), 
>> fetching(empty body) and javascript.
>> Hi Semyon,
>>
>> I've tried to reproduce your problems using the recent Nutch master 
>> (upcoming 1.16).
>> I cannot see any issues, except that Javascript is not executed but that's 
>> clear.
>> Of course, you are free to use parse-tika instead of parse-html which is 
>> legacy.
>> See results below.
>>
>> Best,
>> Sebastian
>>
>>> http://www.vialucy.nl/
>>
>> Successfully fetched and parsed (no errors). Of course, no content is kept
>> because of robots=noindex. Here is the output of parsechecker:
>>
>> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>> -dumpText http://www.vialucy.nl/
>> ...
>> Parse Metadata:
>> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
>> Content-Encoding=UTF-8
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>> Content-Language=en-US
>> Content-Type=text/html; charset=UTF-8
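The effect of robots=noindex,nofollow in the metadata above can be sketched in a few lines of Java. This is only an illustration of how such a directive list might be interpreted, not Nutch's actual implementation; the class name RobotsMetaSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch: interprets the content of a
// <meta name="robots"> tag such as "noindex,nofollow".
final class RobotsMetaSketch {

    // Splits the directive list on commas and normalizes case and whitespace.
    static Set<String> directives(String content) {
        if (content == null) {
            return Set.of();
        }
        return Arrays.stream(content.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // True if the page asks not to be indexed ("none" means noindex plus nofollow).
    static boolean isNoIndex(String content) {
        Set<String> d = directives(content);
        return d.contains("noindex") || d.contains("none");
    }

    // True if the page asks that its links not be followed.
    static boolean isNoFollow(String content) {
        Set<String> d = directives(content);
        return d.contains("nofollow") || d.contains("none");
    }
}
```

A crawler honoring these flags would drop the parsed text when isNoIndex(...) is true and skip the outlinks when isNoFollow(...) is true, which would be consistent with the empty text reported for this page.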
>>
>>
>>> https://www.vishandelbunschoten.nl/
>> Succeeds if you can trick the anti-bot software, otherwise the server sends
>> empty content back. Recently discussed on this list.
>>
>>
>>> 3) JavaScript problems
>>>
>>> http://www.amphar.com/Home.html
>>
>> Yes, Javascript is not executed. But fetching and parsing works pretty fine
>> for the HTML page as such:
>>
>> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>> -dumpText http://www.amphar.com/Home.html
>> fetching: http://www.amphar.com/Home.html
>> ...
>> Status: success(1,0)
>> Title: Home
>> Outlinks: 19
>> ...
>> Parse Metadata: iWeb-Build=local-build-20140815 
>> X-UA-Compatible=IE=EmulateIE7 viewport=width=700
>> dc:title=Home Content-Encoding=UTF-8 Content-Type-Hint=text/html; 
>> charset=UTF-8 Content-Language=en
>> Content-Type=application/xhtml+xml; charset=UTF-8 Generator=iWeb 3.0.4
>>
>> Founded in 1975, Amphar B.V. provides solutions, services and support to the 
>> generic pharmaceutical
>> industry.
>> Headquartered in Amsterdam, The Netherlands, we assist our customers in 
>> identifying and developing
>> new products, carefully select or initiate appropriate sources for Active 
>> Pharmaceutical Ingredients
>> (APIs), develop and test formulations as well as compilation and submission 
>> of the required
>> regulatory documentation and data.
>> With our dedicated staff of experienced professionals and our logistics 
>> centre at Amsterdam Schiphol
>> International Airport, we are well positioned to anticipate and react 
>> swiftly to the dynamic
>> requirements of our customers.
>> Amphar B.V.
>>  
>>
>>
>> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>>> Ok, with parsing it is more or less clear (in theory): Nutch uses some kind 
>>> of legacy of the ancients for parsing.
>>>
>>> The error comes from both parsers available for HTML:
>>>
>>> private DocumentFragment parse(InputSource input) throws Exception {
>>>   if (parserImpl.equalsIgnoreCase("tagsoup"))
>>>     return parseTagSoup(input);
>>>   else
>>>     return parseNeko(input);
>>> }
>>>  
>>> Neko and TagSoup have both been dead for 4+ years 
>>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>>> If I try to parse it online with one of the modern parsers such as 
>>> https://jsoup.org/ it works fine.
>>>
>>> Very amazing considering the fact that it is THE core part of any parser.
>>>  
>>>
>>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>>> From: "Semyon Semyonov" <[email protected]>
>>> To: [email protected]
>>> Subject: Quality problems of crawling. Parsing(Missing attribute name), 
>>> fetching(empty body) and javascript.
>>> Hi everyone,
>>>
>>>
>>> We are testing the quality of our crawl for one of our domain countries 
>>> against another public crawling tool 
>>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>>> All the webpages were tested via both the crawl script and the parsechecker 
>>> tool, for both Tika and the default HTML plugin. 
>>>  
>>> The results are not very good compared to the tool; I would appreciate it if 
>>> you could give me a hint. 
>>>
>>>
>>> I classify several types of problems:
>>>  
>>> 1) Parsing problems.
>>>  
>>> http://www.vialucy.nl/
>>> During the parsing I got a bunch of messages such as [Error] :4:23: Missing 
>>> attribute name and as a result I have an empty page back.   
>>>  
>>>  
>>> 2) Fetching problems 
>>>
>>> https://www.vishandelbunschoten.nl/
>>> Fetch returns HTTP/1.1 200 OK for header but empty body
>>>  
>>>  
>>> 3) JavaScript problems
>>>  
>>> http://www.amphar.com/Home.html
>>>  
>>> Returns an empty body because of JavaScript
>>>  
>>>
>>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD 
>>> XHTML 1.0 Transitional//EN" 
>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>>>  xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta 
>>> http-equiv="refresh" content="0;url= Home.html" 
>>> /></head><body></body></html>
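The snippet above has an empty body but carries a meta refresh pointing at Home.html, so a client has to follow that refresh to reach the real page. A minimal sketch of extracting and resolving such a target follows. The class name MetaRefreshSketch is hypothetical, the base URL http://www.amphar.com/ is an assumption about where the snippet was fetched from, and this is not how Nutch itself handles refreshes.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: resolves the target of a
// <meta http-equiv="refresh" content="0;url= Home.html"> against the page URL.
final class MetaRefreshSketch {

    // Matches "<delay>;url=<target>", allowing whitespace around each part.
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*\\d+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$", Pattern.CASE_INSENSITIVE);

    // Returns the absolute refresh target, or empty if the content has no url part.
    // Quoted targets and targets containing spaces are not handled in this sketch;
    // URI.resolve throws IllegalArgumentException for strings that are not valid URIs.
    static Optional<URI> target(String baseUrl, String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(baseUrl).resolve(m.group(1)));
    }
}
```

Called as MetaRefreshSketch.target("http://www.amphar.com/", "0;url= Home.html"), this resolves the relative target against the base URL; a content value without a url part, such as a bare delay, yields an empty result.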
>>>  
>>> Another example:
>>> https://www.sizo.com/
>>>
>>> How to crawl these JavaScript websites? Activating JavaScript in Tika 
>>> doesn't help.
>>>
>>>
>>>
>>> Thanks.
>>>
>>> Semyon.
>>>
>>>  
>>>
>>  
>>
>  
>
