Re: content-type crawling problem

2006-05-29 Thread Eugen Kochuev
Any information on this? I really need to limit what Nutch indexes to textual formats only, excluding CSS, JavaScript, and other data not meant for human readers.
> Nutch is trying to crawl everything, including DLL, EXE and all non-textual formats. How to limit nutch to only some desirable content-types? I

Re: content-type crawling problem

2006-05-29 Thread Heiko Dietze
Hello, I also had a similar problem; my small fix was to edit the parse-plugins.xml file. It contains the rule: <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content. I don't know if this is the

Re[2]: content-type crawling problem

2006-05-29 Thread Eugen Kochuev
Thanks for sharing the information, I'll try this. But if I understand correctly, parse-plugins.xml only contains rules for the parser, so undesirable documents will still be fetched and stored in the segments. Is it possible to stop the fetcher from crawling these pages?
> Hello, I also had a similar problem; my

Re[2]: content-type crawling problem

2006-05-29 Thread Eugen Kochuev
Btw, do I need to uncomment this? It's more logical to comment this out. Right?
> <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content.
--
Best regards, Eugen

Re: content-type crawling problem

2006-05-29 Thread Heiko Dietze
Hello,
Eugen Kochuev wrote:
> Btw, do I need to uncomment this? It's more logical to comment this out. Right?
>> <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content.
Sorry for the typo, I
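
For reference, here is a sketch of how the wildcard rule quoted in this thread might look in parse-plugins.xml once it is disabled. The angle brackets and attribute quotes are restored from the flattened snippet, and commenting the rule out (rather than uncommenting it) is an assumption based on Eugen's question, since the correction is cut off in this archive view.

```xml
<!-- Assumed fix: catch-all rule commented out so that content with an
     unmatched MIME type is no longer handed to the parse-text plugin.
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
-->
```

As noted earlier in the thread, this only changes what the parser handles; it does not by itself stop the fetcher from downloading such documents.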

Re: content-type crawling problem

2006-05-29 Thread Stefan Neufeind
Heiko Dietze wrote:
> Hello,
> Eugen Kochuev wrote:
>> Btw, do I need to uncomment this? It's more logical to comment this out. Right?
>>> <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content.

content-type crawling problem

2006-05-25 Thread Eugen Kochuev
Hello, Nutch is trying to crawl everything, including DLL, EXE, and all other non-textual formats. How can I limit Nutch to only some desirable content types? I know it's possible to do this by editing the urlfilter plugin settings, but it's hard to predetermine all the possible extensions and this technique