Re: why does nutch interpret directory as URL

2010-05-01 Thread b k
You are right. I had to add a custom plugin, InvalidUrlIndexFilter, which
filters out all the invalid URLs while indexing the pages/files. Check out
this blog:
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html

Just follow the process for creating and adding a new custom plugin:
http://wiki.apache.org/nutch/WritingPluginExample-0.9

After adding this plugin, I was able to index the files while skipping the
directory index page. Hope this helps.
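
As a rough illustration of the check such a plugin might perform, here is a minimal, self-contained Java sketch. It is not the actual InvalidUrlIndexFilter code and does not use the Nutch API. The class name DirectoryUrlCheck, its methods, and the file name page1.html are hypothetical, and it assumes that a directory URL can be recognized by its trailing slash, as in file:/C:/temp/html/. Returning null stands in for the "skip this document" signal a real indexing filter would give.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: decides which URLs to skip at indexing time.
final class DirectoryUrlCheck {

    // Assumption: a directory URL ends with '/', as in file:/C:/temp/html/
    static boolean isDirectoryUrl(String url) {
        return url != null && url.endsWith("/");
    }

    // Returns null for URLs that should not be indexed, otherwise the URL unchanged.
    static String filter(String url) {
        return isDirectoryUrl(url) ? null : url;
    }

    // Keeps only the URLs for which filter() did not return null.
    static List<String> keepIndexable(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (filter(url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // page1.html is a made-up file name used only for this example.
        List<String> urls = Arrays.asList(
                "file:/C:/temp/html/",
                "file:/C:/temp/html/page1.html");
        System.out.println(keepIndexable(urls));
    }
}
```

In a real plugin the same test would sit inside the indexing filter's filter method, which would return null instead of the document for a directory URL.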


On Thu, Apr 29, 2010 at 12:31 PM, arpit khurdiya wrote:

> I'm also facing the same problem.
>
> I thought of developing a plugin that returns null when such a URL is
> encountered. As a result, that URL won't be indexed.
>
> But I am not sure what criteria I should use to decide which URLs to
> discard.
>
> I hope my approach is correct.


Re: why does nutch interpret directory as URL

2010-04-29 Thread arpit khurdiya
I'm also facing the same problem.

I thought of developing a plugin that returns null when such a URL is
encountered. As a result, that URL won't be indexed.

But I am not sure what criteria I should use to decide which URLs to
discard.

I hope my approach is correct.

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang  wrote:
> Because it is indeed a URL.
> You can either filter this kind of URL by configuring
> crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
> regular expression) or filter the search results (you need to develop a
> Nutch plugin).
> Thanks!
>
> Xiao



-- 
Regards,
Arpit Khurdiya


Re: why does nutch interpret directory as URL

2010-04-28 Thread xiao yang
Because it is indeed a URL.
You can either filter this kind of URL by configuring
crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
regular expression) or filter the search results (you need to develop a
Nutch plugin).
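
To see what the suggested expression would match, here is a small stand-alone Java sketch. It only exercises the regular expression and does not use Nutch's own filter classes. The class name TrailingSlashRule, the method wouldBeExcluded, and the file name page1.html are hypothetical. In crawl-urlfilter.txt the leading '-' marks a pattern as an exclusion, so only the part after it is compiled here.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; Nutch itself reads the rule from crawl-urlfilter.txt.
final class TrailingSlashRule {

    // The expression from the rule "-^.*/$" without the leading '-',
    // which in the filter file marks the pattern as an exclusion.
    static final Pattern TRAILING_SLASH = Pattern.compile("^.*/$");

    // True when the URL matches the pattern, i.e. when it ends with '/'.
    static boolean wouldBeExcluded(String url) {
        return TRAILING_SLASH.matcher(url).find();
    }

    public static void main(String[] args) {
        // page1.html is a made-up file name used only for this example.
        String[] urls = {
            "file:/C:/temp/html/",
            "file:/C:/temp/html/page1.html"
        };
        for (String url : urls) {
            System.out.println(url + " excluded: " + wouldBeExcluded(url));
        }
    }
}
```

One thing to keep in mind: if the directory URL is itself excluded at crawl time, the crawler would presumably never fetch the listing and so might not discover the files inside it. That may be a reason to filter at indexing time instead.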
Thanks!

Xiao

On Thu, Apr 29, 2010 at 4:33 AM, BK  wrote:
> While indexing files on the local file system, why does Nutch interpret the
> directory as a URL (fetching file:/C:/temp/html/)?
> This causes the index page of this directory to show up in search results.
> Are there any solutions for this issue?
>
>
> Bharteesh Kulkarni
>


why does nutch interpret directory as URL

2010-04-28 Thread BK
While indexing files on the local file system, why does Nutch interpret the
directory as a URL (fetching file:/C:/temp/html/)?
This causes the index page of this directory to show up in search results.
Are there any solutions for this issue?


Bharteesh Kulkarni