Re: Parsing only common file types

Markus Jelsma Wed, 31 Aug 2011 03:57:02 -0700


On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:
> Hello again,
> 
> As I keep running into trouble with parsing because there are so
> many strange file types around our university network, I am looking for
> an easy way to parse only HTML / text and maybe PDF (but this takes
> very long).
> 
> Can anybody tell me where and how I could configure the parser to
> work that way?
> 
> Thank you!
> 
> BTW: Is there a possibility to stop unwanted content during fetching? As
> I see it, the only way is blocking file names in the
> regex-urlfilter.txt, am I right?


Yes, you want to prevent it from being fetched in the first place. Take a look
at the suffix filter, a convenient plugin for filtering URLs by extension. You
can also use a regex filter to allow only certain files.
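
To make the idea concrete, here is a rough Python sketch of what extension-based allowlisting does to a URL before it is fetched. It is only an illustration and not how the Nutch plugins are implemented; the function name `is_allowed` and the `ALLOWED_SUFFIXES` set are made up for this example. One assumption worth noting: URLs without any extension are kept, since directory-style URLs are usually HTML pages.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical allowlist; adjust to the types you actually want parsed.
ALLOWED_SUFFIXES = {".html", ".htm", ".txt", ".pdf"}


def is_allowed(url, allowed=ALLOWED_SUFFIXES):
    """Return True if the URL path has no extension or an allowed one."""
    # Only look at the path, so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = splitext(path)[1].lower()
    # Assumption: extensionless paths ("/", "/staff/") are HTML pages.
    return suffix == "" or suffix in allowed


if __name__ == "__main__":
    for candidate in (
        "http://example.org/",
        "http://example.org/docs/report.PDF",
        "http://example.org/files/setup.exe",
    ):
        print(candidate, is_allowed(candidate))
```

A strict allowlist that also rejects extensionless URLs would drop most ordinary pages, so it is worth checking how your filter treats those before starting a crawl.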

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
