Re: Parsing only common file types

Ferdy Galema Thu, 01 Sep 2011 07:13:29 -0700

Preventing it from being fetched in the first place is a nice optimization of course, but how do you entirely prevent the types from being parsed? Not all files have proper extensions.

Nutch (or rather Tika) has a sophisticated way of determining the mimetype, but how do I properly configure parse-plugins.xml to specify a few mimetypes for a specific plugin without using a wildcard, which sends ALL mimetypes to that parser? As far as I know, it is currently only possible to assign either one mimetype or all mimetypes to a parser. (Correct me if I'm wrong.)

One workaround is to create some sort of dispatcher parser that uses internal, flexible logic to dispatch specific types to specific parsers.
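
To make the dispatcher idea concrete, here is a minimal sketch in plain Java. It does not use the Nutch or Tika APIs: the class name MimeDispatcher, its methods, and the handler signature are all hypothetical, and it assumes the mimetype has already been detected elsewhere. Anything without a registered handler is simply skipped rather than parsed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical dispatcher, not part of Nutch or Tika: routes content to a
// handler chosen by the already-detected mimetype and skips everything else.
final class MimeDispatcher {
    // Maps a normalized mimetype (e.g. "text/html") to the handler for it.
    private final Map<String, Function<byte[], String>> handlers = new HashMap<>();

    // Registers a handler for exactly one mimetype; returns this for chaining.
    MimeDispatcher register(String mimeType, Function<byte[], String> handler) {
        handlers.put(normalize(mimeType), handler);
        return this;
    }

    // Returns the handler's result, or empty if the mimetype is not allowed.
    Optional<String> dispatch(String detectedMimeType, byte[] content) {
        Function<byte[], String> handler = handlers.get(normalize(detectedMimeType));
        if (handler == null) {
            return Optional.empty(); // unwanted type: do not parse at all
        }
        return Optional.ofNullable(handler.apply(content));
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With handlers registered only for, say, text/html and text/plain, a call such as dispatch("application/zip", bytes) returns an empty Optional, so the caller can skip the document. A real Nutch plugin would have to wrap this logic in the actual parser interface, which this sketch deliberately leaves out.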


On 08/31/2011 12:56 PM, Markus Jelsma wrote:


On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:

Hello again,

As I ran into trouble with parsing again and again because there are so
many strange file types around our university network, I am looking for
an easy way to parse only html / text and maybe pdf (but this takes
very long).

Can anybody tell me where and how I could configure it so that the parser
works that way?

Thank you!

BTW: Is there a possibility to stop unwanted content during fetching? As
I see it, the only way is blocking file names in the
regex-urlfilter.txt, am I right?

Yes, you want to prevent it from being fetched in the first place. Take a look
at the suffix filter, a convenient plugin to filter extensions. You can also use a
regex filter to allow only certain files.
