Dear All,
I need to initialize an HtmlUnit web client (HtmlUnit is a headless
browser used for scraping) inside a Nutch plugin. The code, in
Clojure, is:
(defn parser-filter
  "Called by nutch to perform the parsing. Implementation of
  org.apache.nutch.parse.HtmlParseFilter.filter"
  [this content parse-result meta-tags doc]
  (println "testing 123")
  (try
    (doto (new WebClient)
      (.setJavaScriptEnabled true)
      (.setThrowExceptionOnFailingStatusCode false)
      (.setThrowExceptionOnScriptError false))
    (catch Exception e
      (println "caught")
      (throw e)))
  (println "ending testing 123")
...................
The WebClient class is imported as [com.gargoylesoftware.htmlunit
WebClient]. I believe it uses Apache's HTTP client underneath. The
program appears to hit an exception inside the try block, yet the
exception is not caught.
The output from Nutch:
testing 123
Parsing: http://sg.news.yahoo.com/
Error parsing: http://sg.news.yahoo.com/: failed(2,200):
org.apache.nutch.parse.ParseException: Unable to successfully parse
content
ParseSegment: finished at 2012-06-29 09:16:31, elapsed: 00:00:07
Neither "caught" nor "ending testing 123" was printed out.
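One thing I could check: (catch Exception e ...) only catches
subclasses of java.lang.Exception, so a java.lang.Error would bypass
my catch block entirely. Whether that is what happens here is only a
guess on my part (a NoClassDefFoundError from a jar missing on the
plugin's classpath would be one such case). Below is a minimal sketch
of the check, assuming nothing about HtmlUnit itself: risky-init is a
hypothetical stand-in for (new WebClient), and try-init is a helper
name I made up for illustration.

```clojure
;; Sketch only. `risky-init` is a hypothetical stand-in for (new WebClient).
;; It throws an Error (not an Exception) to simulate a missing class.
(defn risky-init []
  (throw (NoClassDefFoundError. "simulated missing class")))

(defn try-init
  "Calls f with no arguments. Returns [:ok result] on success, or
  [:failed class-name] if any Throwable (Exception or Error) is thrown."
  [f]
  (try
    [:ok (f)]
    ;; Throwable is the common supertype of Exception and Error,
    ;; so this clause also sees failures that (catch Exception ...) misses.
    (catch Throwable t
      [:failed (.getName (class t))])))

;; Print what was actually thrown, so it is visible in the Nutch output.
(println (try-init risky-init))
```

In the real plugin, the same (catch Throwable t ...) clause around the
WebClient construction, printing (class t) and the stack trace, should
at least show what is being thrown.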
Any idea?
-Jiang