Look in the crawl URL filter file and/or the regex URL filter file.  There is
a section in there that specifies file extensions that you don't want to
be processed, like the one below.  Note that the comment is misleading:
some of these extensions can indeed be parsed.  I simply chose not to
parse them (e.g. pdf, rtf, txt, doc, etc.)

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|wma|WMA|PSD|psd|dll|DLL|exe|EXE|chm|CHM|db|DB|doc|DOC|pdf|PDF|wpd|WPD)$
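If you want to sanity-check which URLs a rule like this would exclude before
crawling, here is a rough sketch in Python (Nutch applies the rule through
Java regexes, but the suffix alternation behaves the same here; the
`is_excluded` helper and the shortened, case-insensitive pattern are my own
illustration, not part of Nutch):

```python
import re

# Sketch of the suffix rule above, compiled case-insensitively so the
# upper/lower-case pairs in the original alternation collapse into one entry.
SUFFIX_PATTERN = re.compile(
    r"\.(swf|mp3|wmv|txt|rtf|avi|m3u|flv|wav|mp4|rss|xml|js|gif|jpg|png|"
    r"ico|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|jpeg|bmp|wma|psd|"
    r"dll|exe|chm|db|doc|pdf|wpd)$",
    re.IGNORECASE,
)

def is_excluded(url: str) -> bool:
    """Mimic a '-' (exclude) rule: True means the URL would be skipped."""
    return SUFFIX_PATTERN.search(url) is not None

print(is_excluded("http://example.com/menu.js"))    # True: .js is excluded
print(is_excluded("http://example.com/page.html"))  # False: .html passes
```

Note the pattern is anchored with `$`, so it only matches URLs that *end*
with one of the listed suffixes; a URL like `/script.js?v=2` would slip
through unless you loosen the anchor.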

Hope this helps

-----Original Message-----
From: Mark Stephenson [mailto:[email protected]] 
Sent: Wednesday, September 29, 2010 7:29 PM
To: [email protected]
Subject: Excluding javascript files from indexing and search results.

Hi,

I'm wondering if there's a way to prevent Nutch from indexing
javascript files.  I would still like to fetch and parse javascript
files to find valuable outlinks, but I don't want them to show up in
my search results.  Is there a good way to do this?

Thanks a lot,
Mark
