Re: Parsing only common file types

Ferdy Galema Fri, 02 Sep 2011 02:29:31 -0700

The reason Nutch tries to send unmapped types to Tika (or any other parser that has a wildcard in plugin.xml) is that it uses plugin.includes as a last resort.
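
For illustration, plugin.includes is set in nutch-site.xml. The sketch below assumes a stock Nutch 1.x plugin list and simply leaves parse-tika out, on the assumption that a parser plugin that is never activated cannot be picked as a fallback. The exact plugin list is an assumption and should be adapted to your installation.

        <!-- nutch-site.xml: sketch only, the plugin list is an assumption -->
        <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>

With a list like this only parse-html is activated, so other types should fail to find a parser rather than end up in Tika.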

I agree that allowing multiple contentType values, or perhaps a regex, in plugin.xml would be very useful indeed.


On 09/01/2011 05:04 PM, Markus Jelsma wrote:


On Thursday 01 September 2011 16:13:24 Ferdy Galema wrote:

Preventing it from being fetched in the first place is a nice optimization
of course, but how do I entirely prevent the types from being parsed?
Not all files have proper extensions.

Nutch (or rather Tika) has a sophisticated way of determining the
mimetype, but how do I properly configure parse-plugins.xml to specify a
few mimetypes for a specific plugin without using a wildcard, which sends
ALL mimetypes to that parser? As far as I know, it is currently only
possible to assign one or all mimetypes to a parser. (Correct me if I'm
wrong.)

Hmm yes. In parse-plugins.xml you can map multiple MIME types to a single
parser, yet all unmapped MIME types end up in Tika. The problem is you cannot
list a limited set of MIME types in Tika's plugin.xml, only one MIME type or
the wildcard.
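
For reference, a mapping of several MIME types to one parser in parse-plugins.xml looks roughly like this. The plugin ids and extension ids are the ones shipped with Nutch 1.x as far as I know; treat them as assumptions and check them against your own conf/parse-plugins.xml.

        <parse-plugins>
          <!-- several MIME types mapped explicitly to one parser -->
          <mimeType name="text/plain">
            <plugin id="parse-tika" />
          </mimeType>
          <mimeType name="application/pdf">
            <plugin id="parse-tika" />
          </mimeType>
          <mimeType name="text/html">
            <plugin id="parse-html" />
          </mimeType>

          <aliases>
            <alias name="parse-tika"
                   extension-id="org.apache.nutch.parse.tika.TikaParser" />
            <alias name="parse-html"
                   extension-id="org.apache.nutch.parse.html.HtmlParser" />
          </aliases>
        </parse-plugins>

Leaving out the <mimeType name="*"> entry does not by itself keep unmapped types away from Tika, because of the wildcard in Tika's own plugin.xml.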

One workaround is to create some sort of dispatcher parser that uses
internal, flexible logic to dispatch specific types to specific parsers.

It might be easier to allow multiple Content-Type parameters in plugin.xml,
making it multi-valued.

        <parameter name="contentType" value="*"/>
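
To make the proposal concrete, a multi-valued version might look like the following. This syntax is hypothetical: it is what the suggestion would allow, not something plugin.xml is known to support at this point. The two MIME types are only examples.

        <!-- hypothetical: multiple contentType parameters, not currently supported -->
        <parameter name="contentType" value="text/plain"/>
        <parameter name="contentType" value="application/pdf"/>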

The problem with Tika, it seems, is that even when you comment it out in
parse-plugins.xml, it's still used for unmapped MIME types.

On 08/31/2011 12:56 PM, Markus Jelsma wrote:

On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:

Hello again,

As I run into trouble with parsing again and again because there are so
many strange file types around our university network, I am looking for
an easy way to parse only HTML / text and maybe PDF (but this takes
very long).

Can anybody tell me where and how I could configure the parser so that it
works that way?

Thank you!

BTW: Is there a possibility to stop unwanted content during fetching? As
I see it, the only way is blocking file names in the
regex-urlfilter.txt, am I right?

Yes, you want to prevent it from being fetched in the first place. Take a
look at the suffix filter, a convenient plugin for filtering by extension.
You can also use a regex filter to allow only certain files.
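
As a sketch, the suffix filter is the urlfilter-suffix plugin and would be activated through plugin.includes in nutch-site.xml. The rest of the plugin list below is an assumption based on a stock setup. The suffixes themselves go in the plugin's own configuration file (suffix-urlfilter.txt in a stock install, as far as I know).

        <!-- nutch-site.xml: sketch only, the plugin list is an assumption -->
        <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>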
