If you are seeing this warning, it means that parse-pdf IS being used.
You should modify nutch-site.xml, not nutch-default.xml, and my bet is
that you are doing this in NUTCH_HOME/conf and not in
NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki).
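
As a rough sketch, an override in NUTCH_HOME/runtime/local/conf/nutch-site.xml
could look like the fragment below. The value is simply the one from your
message quoted further down, so treat it as an assumption and adjust it to
the plugins you actually need. Properties set in nutch-site.xml take
precedence over the defaults in nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the plugin.includes default from nutch-default.xml. -->
  <!-- Value copied from the quoted message; an assumption, not a recommendation. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```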



On 29 May 2012 07:31, Tolga <[email protected]> wrote:

> Hi,
>
> I know this issue should have been closed, but I thought I'd continue this
> rather than starting a new thread.
>
> Anyway, I'm getting this: parse.ParserFactory - ParserFactory: Plugin:
> parse-pdf mapped to contentType application/pdf via parse-plugins.xml, but
> not enabled via plugin.includes in nutch-default.xml and I have tika in my
> nutch-default.xml:
> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>.
> What's the point of seeing this warning if I already have tika? This should
> be removed IMHO.
>
> Regards,
>
>
> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>
>> Unless you're using Nutch <= 1.2, you should not be using
>> msexcel|mspowerpoint|msword|oo|pdf within your plugin.includes... all
>> of these document formats are (and have been for some time)
>> implemented as Apache Tika parsers.
>>
>> hth
>>
>>
>>
>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I crawl / index PDF files just fine, but I get the following warning.
>>>
>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>> contentType application/pdf via parse-plugins.xml, but not enabled via
>>> plugin.includes in nutch-default.xml.
>>>
>>> I've got the value
>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>> for the plugin.includes property in nutch-default.xml. What am I missing?
>>>
>>> Regards,
>>>
>>
>>
>>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
