> I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
> nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.


That's the case. I was just mentioning a recommended practice, not a strict
requirement.
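
As a rough sketch of that practice, an override in
NUTCH_HOME/runtime/local/conf/nutch-site.xml could look like the fragment
below. The plugin.includes value is simply the one quoted further down in
this thread, not a recommended setting, and the rest of the file is assumed
to be otherwise empty. A property set here takes precedence over the same
property in nutch-default.xml, so the defaults file can stay untouched.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative override only: the value is copied from the message
       quoted in this thread; adjust it to the plugins you actually use. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Keeping local changes in nutch-site.xml also makes it easier to see what
differs from the shipped defaults after an upgrade.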



>
>
> On 5/29/12 9:48 AM, Julien Nioche wrote:
>
>> if you are seeing this warning then this means that parse-pdf IS being
>> used. You should modify nutch-site.xml and not nutch-default and my bet is
>> that you are doing this in NUTCH_HOME/conf and not in
>> NUTCH_HOME/runtime/local/conf (see tutorial on WIKI)
>>
>>
>>
>> On 29 May 2012 07:31, Tolga <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I know this issue should have been closed, but I thought I'd continue
>>> this rather than starting a new thread.
>>>
>>> Anyway, I'm getting this: parse.ParserFactory - ParserFactory: Plugin:
>>> parse-pdf mapped to contentType application/pdf via parse-plugins.xml,
>>> but not enabled via plugin.includes in nutch-default.xml and I have tika
>>> in my nutch-default.xml:
>>> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>.
>>> What's the point of seeing this warning if I already have tika? This
>>> should be removed IMHO.
>>>
>>> Regards,
>>>
>>>
>>> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>>>
>>>> Unless you're using Nutch <= 1.2, you should not be using
>>>> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes...
>>>> all of these document formats are (and have been for some time)
>>>> implemented as Apache Tika parsers.
>>>>
>>>> hth
>>>>
>>>>
>>>>
>>>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I crawl / index PDF files just fine, but I get the following warning.
>>>>>
>>>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>>>> contentType application/pdf via parse-plugins.xml, but not enabled
>>>>> via plugin.includes in nutch-default.xml.
>>>>>
>>>>> I've got the value
>>>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>>>>
>>>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>
>>>>
>>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
