The pages I am crawling are dynamically generated (i.e. rendered with
JavaScript), so I am using the `protocol-selenium` plugin instead of
`protocol-http`, as described at
https://wiki.apache.org/nutch/AdvancedAjaxInteraction.
Problem:
protocol-selenium uses lib-selenium which, unlike protocol-http (which
returns the full page source), returns only the content inside the page's
<body> tag. This prevents downstream plugins from parsing items such as
meta tags and the page title, which normally live outside <body>.
My solution (part):
--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -160,7 +160,7 @@ public class HttpWebClient {
-    return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+    return driver.getPageSource();
 }
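To make the effect of the patch concrete, here is a self-contained
illustration (plain string handling, not Selenium or Nutch code; the class
and method names are my own) of what the body-only extraction loses: the
<title> and <meta> tags sit in <head>, so any fragment taken from inside
<body> cannot contain them.

```java
// Illustration only: why returning just the <body> innerHTML drops
// the page title and meta tags that parse plugins need.
public class PageSourceDemo {

    // Rough stand-in for
    // driver.findElement(By.tagName("body")).getAttribute("innerHTML"):
    // returns only what is between <body> and </body>.
    static String bodyInnerHtml(String html) {
        int start = html.indexOf("<body>") + "<body>".length();
        int end = html.indexOf("</body>");
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in for driver.getPageSource(): the complete document.
        String pageSource =
            "<html><head><title>Example</title>"
          + "<meta name=\"description\" content=\"demo\"></head>"
          + "<body><p>Hello</p></body></html>";

        String bodyOnly = bodyInnerHtml(pageSource);

        // The body fragment no longer contains the head metadata,
        // so downstream parsing of title/meta tags fails.
        System.out.println(bodyOnly.contains("<title>"));    // false
        System.out.println(pageSource.contains("<title>"));  // true
    }
}
```

With the patch, lib-selenium hands downstream plugins the equivalent of
`pageSource` rather than `bodyOnly`, so the head metadata survives.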
Question:
Has anyone run into a similar issue, and how did you overcome it? I would
think this is a common problem, and I am wondering whether there is a
(good) reason why lib-selenium has not been patched to date.
Thank you,
Michael