A quick pointer:

Do you have trace logging enabled? If so, try disabling it and see if that
works.
See https://issues.apache.org/jira/browse/NUTCH-1253
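
A further guess, not confirmed anywhere in this thread: in Clojure,
(catch Exception e ...) only matches java.lang.Exception and its subclasses.
If the failure is a java.lang.Error (for example a NoClassDefFoundError caused
by a missing or conflicting jar), it would pass straight through that catch
clause and "caught" would never be printed. The sketch below illustrates this
under that assumption. The names call-logging-throwable and failing-init are
made up for illustration and are not part of Nutch or HtmlUnit.

```clojure
;; Hypothetical helper (not a Nutch or HtmlUnit API): runs f, reports any
;; Throwable on stderr, then rethrows it. Catching Throwable also covers
;; java.lang.Error subclasses, which (catch Exception e ...) does not match.
(defn call-logging-throwable
  [label f]
  (try
    (f)
    (catch Throwable t
      (binding [*out* *err*]
        (println label "failed with" (.getName (class t)) "-" (.getMessage t)))
      (throw t))))

;; Stand-in for (new WebClient), assumed here to fail with an Error
;; rather than an Exception.
(defn failing-init []
  (throw (NoClassDefFoundError. "example/MissingClass")))

;; The inner (catch Exception ...) does not match the Error, so the Error
;; escapes it and is only seen by the outer (catch Throwable ...).
(def result
  (try
    (try
      (failing-init)
      (catch Exception e :caught-as-exception))
    (catch Throwable t :escaped-inner-catch)))
```

In the plugin, the same idea would be to wrap the client construction, e.g.
(call-logging-throwable "WebClient init" #(new WebClient)), or simply to
change the catch clause to Throwable temporarily, so that the actual class of
the failure shows up in the output.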


On Fri, Jun 29, 2012 at 11:17 AM, Jiang Fung Wong
<[email protected]> wrote:

> Dear All,
>
> I have a scenario where I need to initialize an HtmlUnit (a browser
> for scraping) web client inside Nutch plugin code. The code (in
> Clojure) is:
>
> (defn parser-filter
>   "Called by nutch to perform the parsing. Implementation of
>   org.apache.nutch.parse.HtmlParseFilter.filter"
>   [this content parse-result meta-tags doc]
>
>   (println "testing 123")
>
>   (try
>     (doto (new WebClient)
>       (.setJavaScriptEnabled true)
>       (.setThrowExceptionOnFailingStatusCode false)
>       (.setThrowExceptionOnScriptError false))
>     (catch Exception e
>       (println "caught")
>       (throw e)))
>
>   (println "ending testing 123")
>
> ...................
>
>
> The WebClient class comes from [com.gargoylesoftware.htmlunit WebClient].
> I believe it is based on Apache's HTTP client. I found that the program
> encountered an exception inside the try block, yet the exception was not
> caught.
>
>
> The output from nutch:
>
> testing 123
> Parsing: http://sg.news.yahoo.com/
> Error parsing: http://sg.news.yahoo.com/: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse
> content
> ParseSegment: finished at 2012-06-29 09:16:31, elapsed: 00:00:07
>
> Neither "caught" nor "ending testing 123" was printed out.
>
> Any idea?
>
>
> -Jiang
>
