Re: Prevent parsing of office documents and PDFs

Julien Nioche Fri, 11 Jul 2014 06:28:40 -0700

You don't need to modify parse-plugins.xml, just remove parse-tika
from plugin.includes.
Your problem here is that you have an Office Open XML (.docx) document in
the segment and no parser to deal with it.


Why don't you add a regular expression to the URL filters to exclude all
URLs ending in .pdf, .docx or .doc? That would prevent such documents from
being fetched in the first place.
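
A rough sketch of such a rule, assuming the urlfilter-regex plugin is enabled
in plugin.includes and reads its rules from conf/regex-urlfilter.txt (the
exact pattern is only an illustration, adjust the extensions as needed).
Rules are applied top to bottom and the first match wins, so the exclusion
has to come before the final catch-all accept line:

```
# skip PDFs and Word documents (case-insensitive match on the URL suffix)
-(?i)\.(pdf|docx?)$

# accept anything else
+.
```

Because the pattern is anchored with $, a URL such as http://host/file.pdf?x=1
would still get through; a looser pattern would be needed if documents are
served with query strings.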

Julien


On 11 July 2014 13:50, Harald Kirsch <[email protected]> wrote:

> Hi everyone,
>
> in an Intranet, I want Nutch to follow only links found in HTML (and maybe
> Javascript, XHTML), but clearly not office documents and PDFs.
>
> - I took out parse-tika from the plugin.includes.
> - I took out everything related to tika in parse-plugins.xml.
>
> But now I get
>
> Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
> parser not found for contentType=application/x-tika-ooxml
> url=http:....docx
>
> I wonder what is wrong here. Do I need a catchall in parse-plugins.xml?
> What does the sneaky <plugin id="feed"/> for some <mimeType> elements mean?
>
> Regards,
> Harald.
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
