Re: Removing URLs from index

Jeroen van Vianen Tue, 17 Aug 2010 04:48:59 -0700

On 17-8-2010 13:35, Alex McLintock wrote:

I happen to have accumulated a lot of URLs in my index with the following
layout:


http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case


Hmmm,

This may be thinking out loud rather than helpful:

I thought ";" was supposed to introduce a session id. I wonder if we
can or should be ignoring everything after the ";" character.

Maybe we should. I'm unsure why these JS fragments have been added to
the URLs to crawl in the first place. The problem is that the webserver
is happily serving URLs with the above structure and generates proper
content, probably because the JS fragment is an invalid session id and
the webserver will automatically create a new session.
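
For illustration only, here is a minimal sketch of what "ignoring everything after the ';'" could look like. It is plain Python using the standard library, not Nutch code, and the function name strip_path_params is hypothetical. It assumes the ';' always appears in the path portion of the URL, as in the two examples above.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_path_params(url: str) -> str:
    """Drop everything from the first ';' in the URL path onward.

    Hypothetical helper for illustration; the query string and
    fragment (if any) are left untouched.
    """
    parts = urlsplit(url)
    # Keep only the part of the path before the first ';'.
    clean_path = parts.path.split(";", 1)[0]
    return urlunsplit(
        (parts.scheme, parts.netloc, clean_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for u in (
        "http://www.company.com/directory1;if(T.getElementsByClassName(",
        "http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case",
    ):
        print(strip_path_params(u))
```

Both example URLs collapse to their original directory URL, which is also why the index ends up with many near-identical documents.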

I've recently seen cases where something which looked like a URL
appeared in some Javascript and Nutch identified it as something to
crawl. I don't know whether there is an easy fix.

There seem to be errors in the discovery of links from one page to the next.
I have now excluded URLs with a ';' in regex-urlfilter.txt.
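
The exact rule is not shown here, so the following is only a rough Python sketch of the idea: an ordered list of accept/reject patterns where the first matching pattern decides, which is how I understand regex-urlfilter.txt to behave. The two rules and the function name accept are assumptions made for the example, not the actual configuration.

```python
import re

# Hypothetical rules for illustration: '-' rejects, '+' accepts.
# Rules are tried in order and the first pattern that matches decides.
RULES = [
    ("-", re.compile(r";")),  # reject any URL containing ';'
    ("+", re.compile(r".")),  # accept everything else
]


def accept(url: str) -> bool:
    """Return True if the URL passes the filter rules above."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so reject


if __name__ == "__main__":
    print(accept("http://www.company.com/directory1"))
    print(accept("http://www.company.com/directory1;if(T.getElementsByClassName("))
```

A filter like this only keeps such URLs from being fetched in future crawls; it does nothing about documents that are already in the index.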

My question now is, how do I remove these documents from the index?



Not sure. I suppose you could add in a plugin of your own which gets
used when you extract the index - but I guess that would be too much
trouble for you.

May I ask why you want them removed from the index? Is it because you
don't want users seeing them?

Yes. I have lots of similar results because of these URLs occurring many
times for the same original URL.


Thanks and best regards,


Jeroen
