Michael, thanks for this; the below sounds like a worthwhile patch. I'll try to
test it out this week and see whether it improves crawling for sites like the
ones you mention. My guess is that the domains we were testing against in Nutch
only ever needed javascript rendered inside the <body>, or else we kicked the
problem down into the Selenium "handler" interface, where a similar call could
be made to grab the whole page source.

(check out protocol-interactiveselenium)
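
For reference, a handler along those lines -- this is a sketch from memory, so
check the plugin source for the exact interface, and the class name is just
illustrative -- could return the full rendered page instead of just the body:

    import org.openqa.selenium.WebDriver;

    public class FullPageSourceHandler implements InteractiveSeleniumHandler {

        // Handle every URL; a real handler could filter here.
        public boolean shouldProcessURL(String url) {
            return true;
        }

        // Return the complete rendered document, <head> included.
        public String processDriver(WebDriver driver) {
            return driver.getPageSource();
        }
    }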

Cheers,
Chris

On 10/25/17, 2:06 PM, "Michael Portnoy" <[email protected]> wrote:

    The pages I'm crawling are dynamically generated (i.e. via javascript), so
    I am using the `protocol-selenium` plugin instead of `protocol-http`, as
    per https://wiki.apache.org/nutch/AdvancedAjaxInteraction.
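
    (For context, enabling the plugin is just a matter of swapping
    protocol-http for protocol-selenium in plugin.includes. A minimal
    nutch-site.xml sketch follows; the rest of the plugin list is the stock
    default and may differ in your Nutch version.)

        <property>
          <name>plugin.includes</name>
          <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>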
    
    Problem:
    
    protocol-selenium uses lib-selenium, which, unlike protocol-http (which
    returns the full page source), only returns the data within the <body> tag
    of the page. This in turn prevents downstream plugins from parsing items
    such as meta tags and the page title, which normally live outside the
    <body>.
    
    My solution (part):
    
    --- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
    +++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
    @@ -160,7 +160,7 @@ public class HttpWebClient {
    -      return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
    +      return driver.getPageSource();
           }
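
    To make the difference concrete, here is a small standalone Selenium
    sketch (the class name and the choice of FirefoxDriver are just for
    illustration) contrasting the two calls:

        import org.openqa.selenium.By;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.firefox.FirefoxDriver;

        public class PageSourceDemo {
            public static void main(String[] args) {
                WebDriver driver = new FirefoxDriver(); // any WebDriver works
                try {
                    driver.get("https://example.com/");
                    // Old behavior: only the contents of <body>, so <head>,
                    // <title> and meta tags are lost.
                    String bodyOnly = driver.findElement(By.tagName("body"))
                        .getAttribute("innerHTML");
                    // Patched behavior: the full rendered document.
                    String fullPage = driver.getPageSource();
                    System.out.println("<title> in body innerHTML? "
                        + bodyOnly.contains("<title>"));
                    System.out.println("<title> in getPageSource?  "
                        + fullPage.contains("<title>"));
                } finally {
                    driver.quit();
                }
            }
        }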
    
    Question:
    
    Has anyone run into a similar issue, and how did you overcome it? I would
    think this is a common problem, and I am wondering whether there is a
    (good) reason why lib-selenium has not been patched to date.
    
    Thank you,
    Michael
    

