If you are seeing this warning, it means that parse-pdf IS being used. You should modify nutch-site.xml, not nutch-default.xml, and my bet is that you are doing this in NUTCH_HOME/conf and not in NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki).
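
As a rough sketch of what I mean, an override in NUTCH_HOME/runtime/local/conf/nutch-site.xml could look something like the following. The plugin list is simply the one from your message below, used here for illustration only, so adjust it to whatever you actually need:

```xml
<?xml version="1.0"?>
<!-- Assumed location: NUTCH_HOME/runtime/local/conf/nutch-site.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value copied from the quoted message; not a recommended list. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Properties set in nutch-site.xml take precedence over the same properties in nutch-default.xml, so nutch-default.xml can stay untouched.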

On 29 May 2012 07:31, Tolga <[email protected]> wrote:

> Hi,
>
> I know this issue should have been closed, but I thought I'd continue this
> rather than starting a new thread.
>
> Anyway, I'm getting this:
>
> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
> contentType application/pdf via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
>
> and I have tika in my nutch-default.xml:
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> What's the point of seeing this warning if I already have tika? This
> should be removed IMHO.
>
> Regards,
>
>
> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>
>> Unless your using <= Nutch 1.2 you should not be using
>> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes... all
>> of these document formats are (and have been for some time)
>> implemented as Apache Tika parsers.
>>
>> hth
>>
>>
>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I crawl / index PDF files just fine, but I get the following warning.
>>>
>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>> contentType application/pdf via parse-plugins.xml, but not enabled via
>>> plugin.includes in nutch-default.xml.
>>>
>>> I've got the value
>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>
>>> Regards,
>>>
>>
>>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

