Re: Prevent parsing of office documents and PDFs

Julien Nioche Fri, 11 Jul 2014 07:27:56 -0700

Hi Harald

The parsing step is necessary in order to index documents, as this is where
the text and metadata are extracted. A document which is not parsed won't
get indexed. It is not clear what you mean by "the conversion to indexable
text takes place somewhere else": it is done by the parse step.


Julien

On 11 July 2014 14:50, Harald Kirsch <[email protected]> wrote:

> Hi Julien.
>
> The reason is that I want pdfs and such to be indexed.
> But they should not be parsed to find outgoing URLs.
>
> So I guess for indexing they need to be fetched. But Nutch should not try
> to parse them. The conversion to indexable text takes place somewhere else,
> no need for Nutch to sweat over it.
>
> Harald.
>
>
>
> On 11.07.2014 15:27, Julien Nioche wrote:
>
>> You don't need to modify parse-plugins.xml, just remove parse-tika
>> from plugin.includes.
>> Your problem here is that you have an Office Open XML (.docx) document in
>> the segment and no parser to deal with it.
>>
>> Why don't you add a regular expression to the URL filters to remove all
>> URLs ending in .pdf, .docx, .doc? That would prevent such documents from
>> being fetched in the first place.
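
As a rough illustration of the URL-filter idea above (a sketch, not Nutch
code): the snippet below mimics in Python how a regex-urlfilter.txt style
rule list is evaluated, where the first matching rule decides and a URL
matching no rule is dropped. The rule patterns, the accepts() helper and the
intranet.example URLs are made up for this example.

```python
import re

# Hypothetical rule set written in the style of regex-urlfilter.txt:
# a leading '-' rejects, a leading '+' accepts, and the first rule whose
# pattern is found anywhere in the URL decides.
RULES = [
    r"-\.(pdf|doc|docx)$",  # skip URLs ending in .pdf, .doc or .docx
    r"+.",                  # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped


if __name__ == "__main__":
    for url in ("http://intranet.example/index.html",
                "http://intranet.example/report.pdf"):
        print(url, accepts(url))
```

The two rule strings correspond to the lines one would put in
regex-urlfilter.txt. The pattern is case-sensitive and anchored at the end of
the URL, so something like report.pdf?download=1 would still get through.
Note that a filtered URL is never fetched, so it is not indexed either.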
>>
>> Julien
>>
>>
>> On 11 July 2014 13:50, Harald Kirsch <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> in an Intranet, I want Nutch to follow only links found in HTML (and maybe
>>> Javascript, XHTML), but clearly not office documents and PDFs.
>>>
>>> - I took out parse-tika from the plugin.includes.
>>> - I took out everything related to tika in parse-plugins.xml.
>>>
>>> But now I get
>>>
>>> Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
>>> parser not found for contentType=application/x-tika-ooxml
>>> url=http:....docx
>>>
>>> I wonder what is wrong here. Do I need a catchall in parse-plugins.xml?
>>> What does the sneaky <plugin id="feed"/> for some <mimeType> elements
>>> mean?
>>>
>>> Regards,
>>> Harald.
>>>
>>>
>>
>>
>>
> --
> Harald Kirsch
> Raytion GmbH
> Kaiser-Friedrich-Ring 74
> 40547 Duesseldorf
> Fon +49 211 53883-216
> Fax +49-211-550266-19
> http://www.raytion.com
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
