You are right. I had to add a custom plugin - InvalidUrlIndexFilter - which
filters out all the invalid URLs while indexing the pages/files. Check out
this blog:
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
Just follow the process it describes for creating/adding a new custom plugin.
I am also facing the same problem.
I thought of developing a plugin that returns null whenever such a URL is
encountered, so that the URL won't be indexed.
But I was wondering what criteria I should use to decide which URLs to
discard.
I hope my approach is correct.
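One simple criterion, suggested by the example URL in this thread, is to treat any URL ending in a trailing slash as a directory listing and reject it. A minimal sketch of that check in plain Java (the class and method names here are illustrative, not part of Nutch's IndexingFilter API - in a real plugin you would return null from the filter method when this check fires):

```java
// Hypothetical helper: decide whether a URL looks like a bare
// directory listing rather than a document. A Nutch indexing-filter
// plugin could return null for such URLs so they are never indexed.
public class DirectoryUrlCheck {

    // Returns true when the URL ends with "/", i.e. points at a
    // directory rather than a file.
    public static boolean isDirectoryUrl(String url) {
        return url != null && url.endsWith("/");
    }

    public static void main(String[] args) {
        System.out.println(isDirectoryUrl("file:/C:/temp/html/"));           // true
        System.out.println(isDirectoryUrl("file:/C:/temp/html/index.html")); // false
    }
}
```

This only catches URLs with an explicit trailing slash; servers that serve a directory index without one would need a different criterion.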
While indexing files on the local file system, why does Nutch interpret the
directory itself as a URL - fetching file:/C:/temp/html/ ?
This causes the index page of this directory to show up in search results.
Any solutions for this issue?
Bharteesh Kulkarni
Because it is indeed a URL.
You can either filter this kind of URL by configuring
crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
regular expression), or filter the search results (for which you would
need to develop a Nutch plugin).
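For what it's worth, the suggested pattern does behave as intended, assuming Nutch treats the part after the leading "-" (the exclude marker in crawl-urlfilter.txt) as a standard Java regular expression. A quick standalone check:

```java
import java.util.regex.Pattern;

// Sanity check for the "^.*/$" pattern suggested above: it should
// match directory-style URLs (trailing slash) and nothing else.
// The leading "-" in crawl-urlfilter.txt is Nutch's exclude marker,
// not part of the regex itself.
public class UrlFilterRegexDemo {

    static final Pattern TRAILING_SLASH = Pattern.compile("^.*/$");

    public static void main(String[] args) {
        System.out.println(TRAILING_SLASH.matcher("file:/C:/temp/html/").matches());          // true
        System.out.println(TRAILING_SLASH.matcher("file:/C:/temp/html/page.html").matches()); // false
    }
}
```

So with the "-" prefix, that rule would exclude every URL ending in "/", which covers the directory-index case reported here.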
Thanks!
Xiao
On Thu, Apr 29, 2010 at 4:33 AM, BK