On Thursday 01 September 2011 16:13:24 Ferdy Galema wrote:
> Preventing it from being fetched in the first place is a nice optimization
> of course, but how do I entirely prevent the types from being parsed?
> Not all files have proper extensions.
>
> Nutch (or rather Tika) has a sophisticated way of determining the
> mimetype, but how do I properly configure parse-plugins.xml to specify a
> few mimetypes for a specific plugin without using a wildcard, which sends
> ALL mimetypes to that parser? As far as I know, it is currently only
> possible to specify one or all mimetypes for a parser. (Correct me if I'm
> wrong).
Hmm yes. In parse-plugins.xml you can map multiple MIME types to a single
parser, yet all unmapped MIME types end up in Tika. The problem is you cannot
list a limited set of MIME types in Tika's plugin.xml, only one MIME type or
the wildcard.
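
To illustrate the mapping side, here is a minimal parse-plugins.xml sketch that
routes a few MIME types explicitly. The plugin ids and extension ids are the
ones I recall from a stock Nutch 1.x install, so treat them as assumptions and
check them against your own conf/parse-plugins.xml.

```xml
<parse-plugins>
  <!-- Explicit mappings: each listed MIME type goes to the named parser. -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Aliases tie the plugin ids above to parser extension ids (assumed stock values). -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

There is deliberately no wildcard mimeType entry here, but that alone does not
seem to keep unmapped types away from Tika.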
>
> One workaround is to create some sort of dispatcher parser that uses
> internal, flexible logic to dispatch specific types to specific parsers.
It might be easier to allow multiple contentType parameters in plugin.xml,
making it multi-valued. Currently it takes a single value:
<parameter name="contentType" value="*"/>
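
A multi-valued version could look something like the sketch below. This is a
proposal only: as far as I know plugin.xml does not accept repeated contentType
parameters today, and the surrounding implementation element is written from
memory of parse-tika's plugin.xml, so both are assumptions.

```xml
<!-- HYPOTHETICAL: repeated contentType parameters are a proposal, not supported syntax. -->
<implementation id="org.apache.nutch.parse.tika.TikaParser"
                class="org.apache.nutch.parse.tika.TikaParser">
  <parameter name="contentType" value="text/plain"/>
  <parameter name="contentType" value="application/pdf"/>
</implementation>
```
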
The problem with Tika, it seems, is that even when you comment it out in
parse-plugins.xml, it's still used for unmapped MIME types.
>
> On 08/31/2011 12:56 PM, Markus Jelsma wrote:
> > On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:
> >> Hello again,
> >>
> >> As I ran into trouble with parsing again and again because there are so
> >> many strange file types around our university network, I am looking for
> >> an easy way to parse only HTML / text and maybe PDF (but this takes
> >> very long).
> >>
> >> Can anybody tell me where and how I could configure the parser to
> >> work that way?
> >>
> >> Thank you!
> >>
> >> BTW: Is there a possibility to stop unwanted content during fetching? As
> >> I see it, the only way is blocking file names in the
> >> regex-urlfilter.txt, am I right?
> >
> > Yes, you want to prevent it from being fetched in the first place. Take a
> > look at the suffix filter, a convenient plugin to filter extensions. You
> > can also use a regex filter to allow only certain files.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350