Preventing it from being fetched in the first place is a nice
optimization, of course, but how do you entirely prevent those types
from being parsed? Not all files have proper extensions.
Nutch (or rather Tika) has a sophisticated way of determining the
mimetype, but how do I properly configure parse-plugins.xml to map a
few mimetypes to a specific plugin without using a wildcard, which
sends ALL mimetypes to that parser? As far as I know, it is currently
only possible to map either one mimetype or all mimetypes to a parser.
(Correct me if I'm wrong.)
One workaround is to create some sort of dispatcher parser that uses
internal, flexible logic to dispatch specific types to specific parsers.
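To make the dispatcher idea a bit more concrete, here is a minimal
sketch in plain Java. It deliberately does not use the Nutch or Tika
parser APIs: MimeDispatcher, register and dispatch are hypothetical
names made up for illustration, and the handlers are stand-ins for real
parsers. It only shows the routing logic, assuming the mimetype has
already been detected.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of a dispatcher; NOT the Nutch Parser interface. */
class MimeDispatcher {
    // Maps a bare, lowercased mimetype to the handler responsible for it.
    private final Map<String, Function<byte[], String>> handlers = new HashMap<>();

    /** Registers one handler for several mimetypes at once (no wildcard needed). */
    MimeDispatcher register(Function<byte[], String> handler, String... mimeTypes) {
        for (String type : mimeTypes) {
            handlers.put(normalize(type), handler);
        }
        return this;
    }

    /**
     * Routes the content to the handler registered for its mimetype.
     * Returns empty when no handler is registered, so the caller can skip the document.
     * Handlers are assumed to return a non-null result.
     */
    Optional<String> dispatch(String mimeType, byte[] content) {
        Function<byte[], String> handler = handlers.get(normalize(mimeType));
        return handler == null ? Optional.empty() : Optional.of(handler.apply(content));
    }

    // Strips parameters such as "; charset=UTF-8" and lowercases the type.
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String bare = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }
}
```

With something like this, one handler could be registered for
"text/html" and "text/plain" and another for "application/pdf", and
everything else would simply come back empty and be skipped.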
On 08/31/2011 12:56 PM, Markus Jelsma wrote:
On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:
Hello again,
As I keep running into trouble with parsing because there are so
many strange file types around our university network, I am looking for
an easy way to parse only HTML / text and maybe PDF (but this takes
very long).
Can anybody tell me where and how I could configure the parser to
work that way?
Thank you!
BTW: Is there a possibility to stop unwanted content during fetching? As
I see it, the only way is blocking file names in the
regex-urlfilter.txt, am I right?
Yes, you want to prevent it from being fetched in the first place. Take a look
at the suffix filter, a convenient plugin for filtering by extension. You can
also use a regex filter to allow only certain files.
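
As a rough illustration of what filtering by extension amounts to, here
is a small sketch in plain Java. It is not the code of the Nutch suffix
filter plugin and does not reproduce its configuration format;
SuffixAllowList and accepts are hypothetical names, and the policy of
letting URLs without any extension through is an assumption made for
this example.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical allow-list on URL suffixes; NOT the Nutch suffix filter plugin. */
class SuffixAllowList {
    private final Set<String> allowed = new HashSet<>();

    SuffixAllowList(Set<String> allowedSuffixes) {
        // Store suffixes lowercased so that matching is case-insensitive.
        for (String suffix : allowedSuffixes) {
            allowed.add(suffix.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if the URL should be fetched. */
    boolean accepts(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // excludes query string and fragment
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (path == null) {
            return false; // opaque URI such as mailto:
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            // Assumption: no extension means we cannot judge the type here,
            // so the URL is let through and must be checked by mimetype later.
            return true;
        }
        return allowed.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

Note that a URL such as http://example.org/download passes this kind of
filter whatever it actually serves, so extension filtering alone cannot
catch files without proper extensions.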