You don't need to modify parse-plugins.xml; just remove parse-tika from plugin.includes. Your problem here is that you have an Office Open XML (.docx) document in the segment and no parser to deal with it.
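For illustration, a plugin.includes override in conf/nutch-site.xml without parse-tika might look roughly like the sketch below. The plugin list is an assumption modelled on a typical Nutch 1.x default, not your actual configuration; keep whatever plugins you already have and change only the parse-(html|tika) part.

```xml
<!-- conf/nutch-site.xml (sketch; the plugin list is an assumption based on
     a typical Nutch 1.x default, so adapt it to your own setup) -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(html|tika) reduced to parse-html so that Tika is not loaded -->
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```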
Why don't you add a regular expression to the URL filters to remove all URLs ending in .pdf, .docx, or .doc? That would prevent such documents from being fetched in the first place.

Julien

On 11 July 2014 13:50, Harald Kirsch <[email protected]> wrote:

> Hi everyone,
>
> in an Intranet, I want Nutch to follow only links found in HTML (and maybe
> Javascript, XHTML), but clearly not office documents and PDFs.
>
> - I took out parse-tika from the plugin.includes.
> - I took out everything related to tika in parse-plugins.xml.
>
> But now I get
>
> Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
> parser not found for contentType=application/x-tika-ooxml
> url=http:....docx
>
> I wonder what is wrong here. Do I need a catchall in parse-plugins.xml?
> What does the sneaky <plugin id="feed"/> for some <mimeType> elements mean?
>
> Regards,
> Harald.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
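As a rough sketch of the URL-filter suggestion above, a rule along these lines in conf/regex-urlfilter.txt should do it, assuming the urlfilter-regex plugin is enabled in plugin.includes. The extension list is only an example; extend it with whatever other office formats turn up in your Intranet.

```
# conf/regex-urlfilter.txt (sketch; the extension list is an example)
# Skip URLs ending in .pdf, .doc or .docx. Rules are tried in order and the
# first match decides, so this line must come before the final "+." rule.
-\.(pdf|PDF|doc|DOC|docx|DOCX)$

# accept anything else
+.
```

Note that the `$` anchor means a URL with a query string after the extension (for example ...docx?version=2) would still pass this rule; whether that matters depends on how your Intranet links to documents.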

