Hi Semyon,

> # Logging Threshold
> log4j.threshold=ALL

Ok, I get similar messages with
  log4j.logger.org.apache.nutch=TRACE

[Error] :24:21: Missing attribute name.
[Warning] :27:16: Start element <DIV> automatically closes element <P>.

I think they can be ignored unless there is some missing content
not contained in the output of parse-html.

Best,
Sebastian

On 11/19/18 5:04 PM, Semyon Semyonov wrote:
> Upd. I finally managed to find out why I got this kind of message in my
> version
> "Missing attribute name and as a result I have an empty page back"
>
> It is not because of the code but because of the log properties. Now, I
> managed to reproduce it with the master branch.
>
> With these log settings
> # Logging Threshold
> log4j.threshold=ALL
>
> I receive
> [Error] :23:70: Missing attribute name.
> [Error] :24:68: Missing attribute name.
> [Error] :25:108: Missing attribute name.
>
> etc...
>
> Are these errors important?
>
>
> Sent: Thursday, November 15, 2018 at 3:33 PM
> From: "Semyon Semyonov" <[email protected]>
> To: [email protected]
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name),
> fetching(empty body) and javascript.
> Everyone, we need some kind of commercial support (maybe extra tools) for
> improving the quality of crawling and fixing similar issues. If you are
> interested please contact me.
>
> Sebastian,
> My bad, I had another version (modified 1.14).
> In addition it is easy to misunderstand the results.
>
> bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika'
> -dumpText http://www.vialucy.nl/ returns
> Parse Metadata: dc:title=Vialucy | nieuws
>
> bin/nutch parsechecker -dumpText http://www.vialucy.nl/
> Parse Metadata:
>
> So, the default one provides empty metadata and no error messages. This is
> a bit confusing.
>
> Thanks.
>
>
> Sent: Thursday, November 15, 2018 at 3:05 PM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name),
> fetching(empty body) and javascript.
> Hi Semyon,
>
>> Are there any reasons to keep the default HTML plugin there? Only for
>> maintenance?
>
> Are there really HTML pages where parse-html fails?
>
> From my experience it still does a good job and parses almost every HTML page,
> including HTML5. But I've never run any large scale comparison.
>
> One argument pro: it's much smaller. While parse-tika including dependencies
> uses around 60 MB, parse-html ships with only a few 100 kB.
>
> Regarding http://www.vialucy.nl/ : if the noindex is removed the page
> is parsed well by parse-tika and parse-html and the outputs only differ
> in white space in the parsed text.
>
> Of course, for the long term parse-html should either be actively maintained
> or be skipped.
>
> Best,
> Sebastian
>
> On 11/15/18 2:39 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> Thanks for the detailed response.
>> I will try to migrate to Tika.
>>
>> Are there any reasons to keep the default HTML plugin there? Only for
>> maintenance?
>>
>> Semyon.
>>
>> Sent: Thursday, November 15, 2018 at 2:23 PM
>> From: "Sebastian Nagel" <[email protected]>
>> To: [email protected]
>> Subject: Re: Quality problems of crawling. Parsing(Missing attribute name),
>> fetching(empty body) and javascript.
>> Hi Semyon,
>>
>> I've tried to reproduce your problems using the recent Nutch master
>> (upcoming 1.16).
>> I cannot see any issues, except that JavaScript is not executed, but that's
>> clear.
>> Of course, you are free to use parse-tika instead of parse-html, which is
>> legacy.
>> See results below.
>>
>> Best,
>> Sebastian
>>
>>> http://www.vialucy.nl/
>>
>> Successfully fetched and parsed (no errors). Of course, there is no content
>> kept because of robots=noindex. Here the output of parsechecker:
>>
>> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>>     -dumpText http://www.vialucy.nl/
>> ...
>> Parse Metadata:
>> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
>> Content-Encoding=UTF-8
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>> Content-Language=en-US
>> Content-Type=text/html; charset=UTF-8
>>
>>
>>> https://www.vishandelbunschoten.nl/
>> Succeeds if you can trick the anti-bot software, otherwise the server sends
>> empty content back. Recently discussed on this list.
>>
>>
>>> 3) JavaScript problems
>>>
>>> http://www.amphar.com/Home.html
>>
>> Yes, JavaScript is not executed. But fetching and parsing work fine
>> for the HTML page as such:
>>
>> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>>     -dumpText http://www.amphar.com/Home.html
>> fetching: http://www.amphar.com/Home.html
>> ...
>> Status: success(1,0)
>> Title: Home
>> Outlinks: 19
>> ...
>> Parse Metadata: iWeb-Build=local-build-20140815
>> X-UA-Compatible=IE=EmulateIE7 viewport=width=700
>> dc:title=Home Content-Encoding=UTF-8 Content-Type-Hint=text/html;
>> charset=UTF-8 Content-Language=en
>> Content-Type=application/xhtml+xml; charset=UTF-8 Generator=iWeb 3.0.4
>>
>> Founded in 1975, Amphar B.V. provides solutions, services and support to the
>> generic pharmaceutical industry.
>> Headquartered in Amsterdam, The Netherlands, we assist our customers in
>> identifying and developing new products, carefully select or initiate
>> appropriate sources for Active Pharmaceutical Ingredients (APIs), develop
>> and test formulations as well as compilation and submission of the required
>> regulatory documentation and data.
>> With our dedicated staff of experienced professionals and our logistics
>> centre at Amsterdam Schiphol International Airport, we are well positioned
>> to anticipate and react swiftly to the dynamic requirements of our customers.
>> Amphar B.V.
>>
>>
>>
>> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>>> Ok, with parsing it is more or less clear (in theory) - Nutch uses some kind
>>> of legacy of the ancients for parsing.
>>>
>>> The error comes from both parsers available for HTML
>>>
>>> private DocumentFragment parse(InputSource input) throws Exception {
>>>   if (parserImpl.equalsIgnoreCase("tagsoup"))
>>>     return parseTagSoup(input);
>>>   else
>>>     return parseNeko(input);
>>> }
>>>
>>> Neko and TagSoup have both been dead for 4+ years
>>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>>> If I try to parse it online with one of the modern parsers such as
>>> https://jsoup.org/ it works fine.
>>>
>>> Very amazing considering the fact that it is THE core part of any parser.
>>>
>>>
>>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>>> From: "Semyon Semyonov" <[email protected]>
>>> To: [email protected]
>>> Subject: Quality problems of crawling. Parsing(Missing attribute name),
>>> fetching(empty body) and javascript.
>>> Hi everyone,
>>>
>>>
>>> We are testing the quality of our crawl for one of our domain countries
>>> against another public crawling tool
>>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>>> All the webpages were tested via both the crawl script and the parsechecker
>>> tool, for both Tika and the default HTML plugin.
>>>
>>> The results are not very good compared to the tool; I would appreciate it
>>> if you could give me a hint.
>>>
>>>
>>> I classify several types of problems:
>>>
>>> 1) Parsing problems.
>>>
>>> http://www.vialucy.nl/
>>> During the parsing I got a bunch of messages such as [Error] :4:23: Missing
>>> attribute name and as a result I have an empty page back.
>>>
>>>
>>> 2) Fetching problems
>>>
>>> https://www.vishandelbunschoten.nl/
>>> Fetch returns HTTP/1.1 200 OK for the header but an empty body
>>>
>>>
>>> 3) JavaScript problems
>>>
>>> http://www.amphar.com/Home.html
>>>
>>> Returns an empty body because of JavaScript
>>>
>>>
>>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
>>> XHTML 1.0 Transitional//EN"
>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>>> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
>>> http-equiv="refresh" content="0;url= Home.html"
>>> /></head><body></body></html>
>>>
>>> Another example,
>>> https://www.sizo.com/
>>>
>>> How to crawl these JavaScript websites? An activation of Tika JavaScript
>>> doesn't help.
>>>
>>>
>>>
>>> Thanks.
>>>
>>> Semyon.
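
A note on the third case: the markup quoted for amphar.com has an empty body plus a meta http-equiv="refresh" tag pointing at Home.html, so the content sits one redirect away, and the parsechecker run on Home.html earlier in the thread shows it parses without executing any script. As a rough illustration only, here is a minimal Java sketch of pulling the target out of such a content value. The class MetaRefresh, the method target and the regular expression are hypothetical names made up for this example; this is not Nutch code, and the pattern only covers the simple "delay;url=..." form.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only (not part of Nutch). */
public class MetaRefresh {

    // Accepts values such as "0;url= Home.html" or "5; URL='http://example.org/'".
    // Group 1 is the delay in seconds, group 2 the (possibly relative) target.
    private static final Pattern CONTENT = Pattern.compile(
        "^\\s*(\\d+)\\s*[;,]\\s*url\\s*=\\s*['\"]?\\s*([^'\"]*?)\\s*['\"]?\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target of a meta refresh content value, if it has one. */
    public static Optional<String> target(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty(); // e.g. "5" alone: a plain reload, no target
        }
        String url = m.group(2);
        return url.isEmpty() ? Optional.empty() : Optional.of(url);
    }

    public static void main(String[] args) {
        // The content value from the amphar.com markup quoted in the thread.
        System.out.println(target("0;url= Home.html").orElse("(none)")); // prints "Home.html"
    }
}
```

The returned value is relative here, so a caller would still have to resolve it against the URL of the fetched page before following it.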

