Any information on this? I really need to limit what Nutch indexes
(only textual formats, excluding CSS, JavaScript and other
non-human-oriented data).
Nutch is trying to crawl everything, including DLL, EXE and all
non-textual formats. How can I limit Nutch to only certain desirable
content types? I
Hello,
I also had a similar problem; my little fix was to
edit the parse-plugins.xml file. There is this rule:
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
Just uncomment this wildcard match. You might also check
the other rules for further unwanted content.
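For reference, here is a rough sketch of what disabling the wildcard rule might look like in parse-plugins.xml, assuming the goal is to keep parse-text from being applied to every otherwise unmatched MIME type. The placement inside the file is an assumption; please check it against the parse-plugins.xml shipped with your Nutch version.

```xml
<!-- Assumption: this fragment sits inside the root element of parse-plugins.xml. -->
<!-- The wildcard rule is wrapped in an XML comment so that it is ignored. -->
<!--
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
-->
```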
I don't know if this is the
Thanks for sharing the information; I'll try this. But if I understand
correctly, parse-plugins.xml contains rules for the parser, so
undesirable documents will still be fetched and stored in the segments.
Is it possible to stop the fetcher from crawling these pages?
Hello,
I also had a similar problem, my
Btw, do I need to uncomment this? It's more logical to comment this
out. Right?
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
Just uncomment this wildcard match. You might also check
the other rules for further unwanted content.
--
Best regards,
Eugen
Hello,
Eugen Kochuev wrote:
Btw, do I need to uncomment this? It's more logical to comment this
out. Right?
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
Just uncomment this wildcard match. You might also check
the other rules for further unwanted content.
Sorry for the typo, I
Heiko Dietze wrote:
Hello,
Eugen Kochuev wrote:
Btw, do I need to uncomment this? It's more logical to comment this
out. Right?
<mimeType name="*">
  <plugin id="parse-text" />
</mimeType>
Just uncomment this wildcard match. You might also check
the other rules for further unwanted content.
Hello,
Nutch is trying to crawl everything, including DLL, EXE and all
non-textual formats. How can I limit Nutch to only certain desirable
content types? I know it's possible to do this by editing the urlfilter
plugin settings, but it's hard to predetermine all the possible
extensions and this technique