Hi,

On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata <[email protected]> wrote:
> > We can see that some of parse processes were not completed successfully.
>

Yes, I see this. I also see that you have http.proxy.port = 8080 set but no proxy host, and that the protocol-httpclient plugin is not activated. I also see some strange fetcher behaviour, as it seems to fetch the server root itself, e.g.

2013-01-12 05:37:41,987 INFO fetcher.FetcherJob - fetching http://localhost/

However, I assume there is no document at this location on the server...

That being said, as we've established, fetching does not seem to be the problem. Unless you wish to skip parsing for truncated documents, you will need to increase http.content.limit to something over ~400K: the PDF in the log line below is 395125 bytes and is currently being cut off at 65536. Raising the limit should remove the following log output (meaning that the document should be fully parsed):

2013-01-12 05:38:27,508 WARN parse.ParserJob - http://localhost/sapi/Solr-install-v2.pdf skipped. Content of size 395125 was truncated to 65536

You may also wish to consider the parser.skip.truncated property in nutch-site.xml.

I don't suppose these PDFs are password protected or something like that?

I would also explicitly map the content type application/vnd.oasis.opendocument.text to parse-tika in parse-plugins.xml, which is what this log message is asking for:

2013-01-12 05:39:07,594 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/vnd.oasis.opendocument.text, but they are not mapped to it in the parse-plugins.xml file
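
As a rough sketch of the content limit change, something like the following could go inside the <configuration> element of your nutch-site.xml. The value 1048576 (1 MB) is only an assumption on my part; anything comfortably above the 395125 bytes reported in your log should do, and you may want more if you have larger documents.

```xml
<!-- Sketch only: goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed value (1 MB); it just needs to exceed the largest
       document you expect, e.g. the 395125-byte PDF in the log -->
  <value>1048576</value>
</property>
```

If you would rather keep the limit and have truncated documents handed to the parser anyway, parser.skip.truncated is the property to look at instead; check its description in nutch-default.xml for your version before changing it.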

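For the parse-plugins.xml mapping, a minimal sketch would be an entry like the one below, placed alongside the other <mimeType> elements. This assumes your parse-plugins.xml still has the stock alias that maps the id parse-tika to org.apache.nutch.parse.tika.TikaParser; if the alias is missing or named differently in your copy, adjust the id to match.

```xml
<!-- Sketch only: add next to the existing <mimeType> entries
     in conf/parse-plugins.xml -->
<mimeType name="application/vnd.oasis.opendocument.text">
  <!-- Assumes the "parse-tika" alias exists in the <aliases> section -->
  <plugin id="parse-tika" />
</mimeType>
```

With an explicit mapping in place, the ParserFactory INFO message about the content type not being mapped should no longer appear.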
