Michael, thanks for this; the below sounds like a worthwhile patch. I'll try to test it out this week and see whether it improves crawling for sites like the ones you mention. My belief is that the domains on which we were testing this in Nutch only needed JavaScript rendered inside the body, or possibly we kicked the problem down into the Selenium "handler" interface, where a similar call could be made to grab the whole page (check out protocol-interactiveselenium).
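Roughly, such a handler would just hand back the driver's full page source instead of the body innerHTML. A minimal sketch, assuming the plugin's InteractiveSeleniumHandler interface exposes processDriver(WebDriver) and shouldProcessURL(String); the class name is made up for illustration:

package org.apache.nutch.protocol.interactiveselenium;

import org.openqa.selenium.WebDriver;

/*
 * Hypothetical handler sketch (not the shipped implementation): return the
 * full rendered document so <head> content (title, meta tags) survives for
 * downstream parse plugins, instead of only the <body> innerHTML.
 * The interface method names are assumed from protocol-interactiveselenium.
 */
public class FullPageSourceHandler implements InteractiveSeleniumHandler {

  @Override
  public String processDriver(WebDriver driver) {
    // getPageSource() returns the whole document, including <head>
    return driver.getPageSource();
  }

  @Override
  public boolean shouldProcessURL(String url) {
    // In this sketch, apply the handler to every URL
    return true;
  }
}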
Cheers,
Chris

On 10/25/17, 2:06 PM, "Michael Portnoy" <[email protected]> wrote:

The pages that I'm crawling are dynamically generated (i.e. using javascript), for which purpose I am using the `protocol-selenium` plugin instead of `protocol-http`, as per https://wiki.apache.org/nutch/AdvancedAjaxInteraction.

Problem: protocol-selenium uses lib-selenium which, unlike protocol-http (which returns the entire page source), only returns the data within the <body> tag of the page. This in turn prevents downstream plugins from parsing items such as meta tags and the page title, which normally live outside the page <body>.

My solution (in part):

--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -160,7 +160,7 @@ public class HttpWebClient {
-    return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+    return driver.getPageSource();
 }

Question: Has anyone run into a similar issue, and how did you overcome it? I would think this to be a common problem, and am wondering if there is a (good) reason why lib-selenium has not been patched to date.

Thank you,
Michael
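For anyone following along outside Nutch, here is a standalone Selenium sketch (not Nutch code; the URL and class name are placeholders) of the difference Michael's patch addresses: getPageSource() keeps the <head> content that the body-only approach drops.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

/*
 * Illustration only: compare the full rendered document with the <body>
 * innerHTML. Anything in <head> (title, meta tags) is absent from the latter.
 */
public class PageSourceDemo {
  public static void main(String[] args) {
    WebDriver driver = new FirefoxDriver();
    try {
      driver.get("https://example.org/"); // placeholder URL

      // Full document, including <html>, <head> and <body>
      String fullSource = driver.getPageSource();

      // Only what is inside <body>; <title> and meta tags are lost
      String bodyOnly = driver.findElement(By.tagName("body"))
                              .getAttribute("innerHTML");

      System.out.println("full source contains <title>? "
          + fullSource.contains("<title>"));
      System.out.println("body innerHTML contains <title>? "
          + bodyOnly.contains("<title>"));
    } finally {
      driver.quit();
    }
  }
}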

