Re: why does nutch interpret directory as URL

2010-05-01 Thread b k
You are right. I had to add a custom plugin, InvalidUrlIndexFilter, which filters out all the invalid URLs while indexing the pages/files. Check out this blog post: http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html Just follow the process of creating and adding a new custom plugin

Re: why does nutch interpret directory as URL

2010-04-29 Thread arpit khurdiya
I am also facing the same problem. I thought of developing a plugin that returns null when such a URL is encountered, so that the URL won't be indexed. But I am wondering what the criterion should be for discarding the URL. I hope my approach is
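One possible criterion, sketched below in plain Java: treat a URL as a directory when it uses the file: scheme and its path ends with a slash, as in file:/C:/temp/html/ from the original question. The class and method names (DirectoryUrlCheck, looksLikeDirectory) are made up for illustration and are not part of Nutch. The assumption is that a custom indexing filter would call a check like this and return null for matching URLs, as described above; the plugin wiring itself is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a Nutch class: decides whether a URL looks like
// a local directory rather than a file.
final class DirectoryUrlCheck {
    private DirectoryUrlCheck() {}

    /** Returns true for file: URLs whose path ends with "/". */
    static boolean looksLikeDirectory(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url);
            // Only local file URLs are considered; http URLs ending in "/"
            // are usually real pages and are left alone.
            if (!"file".equalsIgnoreCase(uri.getScheme())) {
                return false;
            }
            String path = uri.getPath();
            return path != null && path.endsWith("/");
        } catch (URISyntaxException e) {
            // Unparsable URL: do not guess, keep the document.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDirectory("file:/C:/temp/html/"));
        System.out.println(looksLikeDirectory("file:/C:/temp/html/index.html"));
    }
}
```

Whether a trailing slash is a reliable sign of a directory depends on how the file protocol plugin builds its URLs, so this is worth checking against a real crawl before relying on it.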

why does nutch interpret directory as URL

2010-04-28 Thread BK
While indexing files on the local file system, why does Nutch interpret the directory as a URL (fetching file:/C:/temp/html/)? This causes the index page of this directory to show up in search results. Any solutions for this issue? Bharteesh Kulkarni

Re: why does nutch interpret directory as URL

2010-04-28 Thread xiao yang
Because it is indeed a URL. You can either filter this kind of URL by configuring crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the regular expression) or filter the search results (you would need to develop a Nutch plugin). Thanks! Xiao On Thu, Apr 29, 2010 at 4:33 AM, BK
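Since the regular expression above was offered with some doubt, here is a small standalone Java check of what ^.*/$ matches. It is only a sketch: the class name TrailingSlashRule is made up, and it assumes the URL filter applies the expression roughly the way java.util.regex does with find(). The leading "-" in the rule is taken to be the exclude marker of the filter file, not part of the expression.

```java
import java.util.regex.Pattern;

// Hypothetical test harness for the expression from the suggested rule "-^.*/$".
final class TrailingSlashRule {
    private TrailingSlashRule() {}

    // The expression without the leading "-" exclude marker.
    static final Pattern TRAILING_SLASH = Pattern.compile("^.*/$");

    /** Returns true when the rule would match, i.e. the URL ends with "/". */
    static boolean isExcluded(String url) {
        return TRAILING_SLASH.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "file:/C:/temp/html/",
            "file:/C:/temp/html/page.html",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println(url + " excluded: " + isExcluded(url));
        }
    }
}
```

The expression matches every URL that ends with a slash, including ordinary site roots such as http://example.com/, so it is broader than "local directories only". A URL filter also rejects a URL before it is fetched, so with this rule the directory listings would presumably no longer be followed to find the files inside them. That may be why an indexing-time filter ended up being the solution in this thread.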