
Re: content-type crawling problem

2006-05-29 Thread Eugen Kochuev

Any information on this? I really need to limit what Nutch indexes (only textual formats, excluding CSS, JavaScript and other non-human-oriented data). Nutch is trying to crawl everything, including DLL, EXE and all non-textual formats. How to limit Nutch to only some desirable content-types? I

Re: content-type crawling problem

2006-05-29 Thread Heiko Dietze

Hello, I also had a similar problem. My little fix was to edit the parse-plugins.xml file. There is the rule: <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content. I don't know if this is the

Re[2]: content-type crawling problem

2006-05-29 Thread Eugen Kochuev

Thanks for sharing the information, I'll try this. But if I got it right, parse-plugins.xml contains rules for the parser, so undesirable documents will still be fetched and stored in the segments. Is it possible to stop the fetcher from crawling these pages? Hello, i had also a similar problem, my

Re[2]: content-type crawling problem

2006-05-29 Thread Eugen Kochuev

Btw, do I need to uncomment this? It's more logical to comment this out. Right? <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content. -- Best regards, Eugen

Re: content-type crawling problem

2006-05-29 Thread Heiko Dietze

Hello, Eugen Kochuev wrote: Btw, do I need to uncomment this? It's more logical to comment this out. Right? <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content. Sorry for the typo, I
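For readers following the thread, here is a minimal sketch of what the edit discussed above might look like in parse-plugins.xml. It assumes that the typo Heiko refers to is "uncomment", and that the intent was to comment the wildcard rule out so that parse-text no longer catches every otherwise unmatched content type. The surrounding file contents are not shown in the thread and are left out here.

```xml
<!-- Sketch of a parse-plugins.xml fragment (assumption: the intended fix is
     to comment OUT the wildcard rule, not to uncomment it). -->

<!-- Wildcard rule disabled, so it no longer matches every MIME type:
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
-->
```

As Eugen notes in his reply, this only affects which parser is applied. Documents of unwanted types may still be fetched and stored in the segments.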

Re: content-type crawling problem

2006-05-29 Thread Stefan Neufeind

Heiko Dietze wrote: Hello, Eugen Kochuev wrote: Btw, do I need to uncomment this? It's more logical to comment this out. Right? <mimeType name="*"> <plugin id="parse-text" /> </mimeType> Just uncomment this wildcard match. You might also check the other rules for further unwanted content.

content-type crawling problem

2006-05-25 Thread Eugen Kochuev

Hello, Nutch is trying to crawl everything, including DLL, EXE and all non-textual formats. How to limit Nutch to only some desirable content-types? I know it's possible to do this by editing the urlfilter plugin settings, but it's hard to predetermine all the possible extensions and this technique
