Hi Semyon,

I've tried to reproduce your problems using the current Nutch master (upcoming 1.16). I cannot see any issues, except that JavaScript is not executed, which is expected. Of course, you are free to use parse-tika instead of the legacy parse-html. See results below.
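To make parse-tika the only HTML parser for a whole crawl, rather than passing -Dplugin.includes on every command, the same property can be set in conf/nutch-site.xml. This is only a sketch: the protocol and parse entries match the parsechecker commands in this mail, while the remaining plugins are an assumption modelled on the stock list, so keep whatever your own nutch-default.xml already contains.

```xml
<!-- conf/nutch-site.xml, inside the <configuration> element.
     Values here override nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-okhttp and parse-tika are the plugins used in this mail.
       Everything else in the list is an ASSUMPTION: copy the urlfilter,
       index, indexer, scoring and urlnormalizer entries from your own
       nutch-default.xml instead of trusting this line. -->
  <value>protocol-okhttp|urlfilter-(regex|validator)|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-html left out of the list, HTML pages should be handled by parse-tika, provided your parse-plugins.xml mapping allows it.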
Best,
Sebastian

> http://www.vialucy.nl/

Successfully fetched and parsed (no errors). Of course, no content is kept because of robots=noindex. Here is the output of parsechecker:

% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -dumpText http://www.vialucy.nl/
...
Parse Metadata: dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France Content-Encoding=UTF-8 generator=WordPress 3.1 robots=noindex,nofollow Content-Language=en-US Content-Type=text/html; charset=UTF-8

> https://www.vishandelbunschoten.nl/

Succeeds if you can trick the anti-bot software, otherwise the server sends empty content back. This was recently discussed on this list.

> 3) Javascript problems
>
> http://www.amphar.com/Home.html

Yes, JavaScript is not executed. But fetching and parsing work fine for the HTML page as such:

% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
    -dumpText http://www.amphar.com/Home.html
fetching: http://www.amphar.com/Home.html
...
Status: success(1,0)
Title: Home
Outlinks: 19
...
Parse Metadata: iWeb-Build=local-build-20140815 X-UA-Compatible=IE=EmulateIE7 viewport=width=700 dc:title=Home Content-Encoding=UTF-8 Content-Type-Hint=text/html; charset=UTF-8 Content-Language=en Content-Type=application/xhtml+xml; charset=UTF-8 Generator=iWeb 3.0.4

Founded in 1975, Amphar B.V. provides solutions, services and support to the generic pharmaceutical industry. Headquartered in Amsterdam, The Netherlands, we assist our customers in identifying and developing new products, carefully select or initiate appropriate sources for Active Pharmaceutical Ingredients (APIs), develop and test formulations as well as compilation and submission of the required regulatory documentation and data. With our dedicated staff of experienced professionals and our logistics centre at Amsterdam Schiphol International Airport, we are well positioned to anticipate and react swiftly to the dynamic requirements of our customers.
Amphar B.V.


On 11/15/18 1:30 PM, Semyon Semyonov wrote:
> Ok, with parsing it is more or less clear (in theory) - Nutch uses some kind
> of legacy of the ancients for parsing.
>
> The error comes from both parsers available for HTML:
>
> private DocumentFragment parse(InputSource input) throws Exception {
>   if (parserImpl.equalsIgnoreCase("tagsoup"))
>     return parseTagSoup(input);
>   else
>     return parseNeko(input);
> }
>
> Neko and TagSoup have both been dead for 4+
> years (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
> If I try to parse it online with one of the modern parsers such as
> https://jsoup.org/ it works fine.
>
> Very amazing considering the fact that it is THE core part of any parser.
>
>
> Sent: Wednesday, November 14, 2018 at 3:32 PM
> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
> To: user@nutch.apache.org
> Subject: Quality problems of crawling. Parsing(Missing attribute name),
> fetching(empty body) and javascript.
> Hi everyone,
>
>
> We are testing the quality of our crawl for one of our domain countries
> against another public crawling tool (
> http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs
> ).
> All the webpages were tested via both the crawl script and the parsechecker
> tool, with both Tika and the default HTML plugin.
>
> The results are not very good compared to the tool, I would appreciate it if
> you could give me a hint.
>
>
> I classify several types of problems:
>
> 1) Parsing problems.
>
> http://www.vialucy.nl/
> During the parsing I got a bunch of messages such as [Error] :4:23: Missing
> attribute name and as a result I have an empty page back.
>
>
> 2) Fetching problems
>
> https://www.vishandelbunschoten.nl/
> Fetch returns HTTP/1.1 200 OK for header but empty body
>
>
> 3) Javascript problems
>
> http://www.amphar.com/Home.html
> Returns an empty body because of JavaScript
>
>
> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
> XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
>
> Another example,
> https://www.sizo.com/
>
> How to crawl these JavaScript websites? Activating the Tika JavaScript
> support doesn't help.
>
>
>
> Thanks.
>
> Semyon.