Dear All,

I need to initialize an HtmlUnit web client (HtmlUnit is a headless
browser used for scraping) inside a Nutch plugin. The code, in
Clojure, is:

(defn parser-filter
  "Called by nutch to perform the parsing. Implementation of
  org.apache.nutch.parse.HtmlParseFilter.filter"
  [this content parse-result meta-tags doc]

  (println "testing 123")

  (try
    (doto (new WebClient)
      (.setJavaScriptEnabled true)
      (.setThrowExceptionOnFailingStatusCode false)
      (.setThrowExceptionOnScriptError false))
    (catch Exception e
      (println "caught")
      (throw e)))

  (println "ending testing 123")

...................


The WebClient class is imported as [com.gargoylesoftware.htmlunit WebClient].
I believe it is built on top of Apache's HttpClient. The program
appears to hit an exception inside the try block, yet the exception is
not caught.


The output from nutch:

testing 123
Parsing: http://sg.news.yahoo.com/
Error parsing: http://sg.news.yahoo.com/: failed(2,200):
org.apache.nutch.parse.ParseException: Unable to successfully parse
content
ParseSegment: finished at 2012-06-29 09:16:31, elapsed: 00:00:07

Neither "caught" nor "ending testing 123" was printed.
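
One thing that might be worth checking, although it is only a guess and nothing in the output above confirms it: (catch Exception e ...) matches only subclasses of java.lang.Exception, so a java.lang.Error (for example a NoClassDefFoundError raised while loading WebClient) would skip the catch clause and also skip the final println. The sketch below shows that behaviour in isolation. classify-failure is a hypothetical helper written for this illustration; it is not part of nutch or HtmlUnit.

```clojure
;; Hypothetical helper (not from nutch or HtmlUnit): calls f and reports
;; which catch clause, if any, handled what f threw.
(defn classify-failure
  [f]
  (try
    (f)
    :ok
    ;; Matches java.lang.Exception and its subclasses only.
    (catch Exception e
      :exception)
    ;; Matches everything else that can be thrown, including java.lang.Error.
    (catch Throwable t
      :error)))

;; A RuntimeException is handled by the Exception clause.
(println (classify-failure #(throw (RuntimeException. "boom"))))

;; A NoClassDefFoundError is an Error, so only the Throwable clause sees it.
(println (classify-failure #(throw (NoClassDefFoundError. "missing"))))
```

If that is what is happening here, temporarily changing the catch in parser-filter to Throwable and printing the caught object should reveal the real cause.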

Any idea?


-Jiang
