On Wednesday 31 August 2011 12:49:02 Marek Bachmann wrote:
> Hello again,
>
> As I keep running into trouble with parsing because there are so
> many strange file types around our university network, I am looking for
> an easy way to parse only HTML / text and maybe PDF (but this takes
> very long).
>
> Can anybody tell me where and how I could configure the parser to
> work that way?
>
> Thank you!
>
> BTW: Is there a possibility to stop unwanted content during fetching? As
> I see it, the only way is blocking file names in the
> regex-urlfilter.txt, am I right?
Yes, you want to prevent it from being fetched in the first place. Take a look at the suffix filter, a convenient plugin for filtering by file extension. You can also use a regex filter to allow only certain files.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
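To illustrate the "allow only certain files" approach with a regex filter, here is a minimal Python sketch of the logic. It is not Nutch code. It only mimics the convention used in regex-urlfilter.txt, where each rule is a regular expression prefixed by '+' (accept) or '-' (reject), the first matching rule decides, and a URL that matches nothing is rejected. The rule list, the `accept` function, and the example URLs are all hypothetical, and the choice to let extensionless paths through (so that directory pages can still be crawled) is an assumption you may not want.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first matching rule wins.
RULES = [
    # Skip non-HTTP schemes entirely.
    ("-", r"^(file|ftp|mailto):"),
    # Accept only the wanted extensions, optionally followed by a query or fragment.
    ("+", r"(?i)\.(html?|txt|pdf)([?#].*)?$"),
    # Assumption: also accept paths that contain no dot at all (directories,
    # extensionless pages), otherwise the crawl could not follow index pages.
    ("+", r"(?i)^https?://[^/]+(/[^.?#]*)?([?#].*)?$"),
]


def accept(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule; reject if none match."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for candidate in (
        "http://example.edu/index.html",
        "http://example.edu/paper.PDF",
        "http://example.edu/docs/",
        "http://example.edu/video.avi",
        "ftp://example.edu/notes.txt",
    ):
        print(candidate, "->", "fetch" if accept(candidate) else "skip")
```

In regex-urlfilter.txt itself, the same idea would be written one rule per line with the sign as the first character, for example `+(?i)\.(html?|txt|pdf)$`. Keep in mind that filtering on the URL only looks at the name, so a URL without a telling extension can still return an unwanted content type.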

