The pages I am crawling are dynamically generated (i.e. rendered with
JavaScript), so I am using the `protocol-selenium` plugin instead of
`protocol-http`, as described at
https://wiki.apache.org/nutch/AdvancedAjaxInteraction.
Problem:
protocol-selenium uses lib-selenium which, unlike protocol-http (which
returns the full page source), returns only the content inside the page's
<body> tag. This prevents downstream plugins from parsing items such as
meta tags and the page title, which normally live outside <body>.
My solution (part):
--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -160,7 +160,7 @@ public class HttpWebClient {
-    return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+    return driver.getPageSource();
 }
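To make the effect of the patch concrete, here is a self-contained
illustration (plain string handling, not Selenium or Nutch code; the class
and method names are my own) of what the body-only extraction loses: the
<title> and <meta> tags sit in <head>, so any fragment taken from inside
<body> cannot contain them.

```java
// Illustration only: why returning just the <body> innerHTML drops
// the page title and meta tags that parse plugins need.
public class PageSourceDemo {

    // Rough stand-in for
    // driver.findElement(By.tagName("body")).getAttribute("innerHTML"):
    // returns only what is between <body> and </body>.
    static String bodyInnerHtml(String html) {
        int start = html.indexOf("<body>") + "<body>".length();
        int end = html.indexOf("</body>");
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in for driver.getPageSource(): the complete document.
        String pageSource =
            "<html><head><title>Example</title>"
          + "<meta name=\"description\" content=\"demo\"></head>"
          + "<body><p>Hello</p></body></html>";

        String bodyOnly = bodyInnerHtml(pageSource);

        // The body fragment no longer contains the head metadata,
        // so downstream parsing of title/meta tags fails.
        System.out.println(bodyOnly.contains("<title>"));    // false
        System.out.println(pageSource.contains("<title>"));  // true
    }
}
```

With the patch, lib-selenium hands downstream plugins the equivalent of
`pageSource` rather than `bodyOnly`, so the head metadata survives.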
Question:
Has anyone run into a similar issue, and how did you overcome it? I would
think this is a common problem, and I am wondering whether there is a
(good) reason why lib-selenium has not been patched to date.
Thank you,
Michael