Look in the crawl url filter file (crawl-urlfilter.txt) and/or the regex url filter file (regex-urlfilter.txt). There is a section in there that specifies file extensions you don't want processed, just like the one below. Note that the comment is misleading: some of those extensions can in fact be parsed; I simply chose not to parse them (e.g. pdf, rtf, txt, doc).
# skip image and other suffixes we can't yet parse
-\.(swf|SWF|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|wma|WMA|PSD|psd|dll|DLL|exe|EXE|chm|CHM|db|DB|doc|DOC|pdf|PDF|wpd|WPD)$

Hope this helps

-----Original Message-----
From: Mark Stephenson [mailto:[email protected]]
Sent: Wednesday, September 29, 2010 7:29 PM
To: [email protected]
Subject: Excluding javascript files from indexing and search results.

Hi,

I'm wondering if there's a way to prevent Nutch from indexing javascript files. I would still like to fetch and parse javascript files to find valuable outlinks, but I don't want them to show up in my search results. Is there a good way to do this?

Thanks a lot,
Mark
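As a quick sanity check, the filter line above can be tested outside Nutch. A minimal Python sketch (the sample URLs are made up; `re.IGNORECASE` is a simplification of the explicit upper/lower-case alternations in the original pattern):

```python
import re

# Suffix pattern copied from the regex-urlfilter.txt line above.
# The leading "-" in the config file is Nutch's "deny" marker, not
# part of the regular expression itself, so it is dropped here.
SKIP = re.compile(
    r"\.(swf|mp3|wmv|txt|rtf|avi|m3u|flv|wav|mp4|rss|xml|js|gif|jpg|png|"
    r"ico|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|exe|jpeg|bmp|wma|"
    r"psd|dll|chm|db|doc|pdf|wpd)$",
    re.IGNORECASE,  # the real file spells out both cases instead
)

def should_skip(url: str) -> bool:
    """Return True if the URL would be rejected by the filter line."""
    return SKIP.search(url) is not None

# Hypothetical sample URLs:
print(should_skip("http://example.com/report.PDF"))    # skipped
print(should_skip("http://example.com/index.html"))    # kept
print(should_skip("http://example.com/app.js?v=1"))    # kept!
```

Note the `$` anchor: a URL whose extension is followed by a query string (like `app.js?v=1` above) slips past the filter, which may or may not be what you want.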

