On 17-8-2010 13:35, Alex McLintock wrote:
I happen to have accumulated a lot of URLs in my index with the following
layout:
http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
Hmmm,
This may be thinking out loud rather than helpful:
I thought ";" was supposed to introduce a session id. I wonder if we
can or should be ignoring everything after the ";" character.
Maybe we should. I'm unsure why these JS fragments were added to
the URLs to crawl in the first place. The problem is that the webserver
happily serves URLs with the above structure and generates proper content,
probably because the JS fragment is treated as an invalid session id and
the webserver automatically creates a new session.
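
To make the "ignore everything after the ';'" idea concrete, here is a minimal Python sketch. It is not Nutch code; the function name strip_after_semicolon is made up for illustration, and it assumes that anything after the first ';' in the path is disposable, which would also throw away a legitimate session id or path parameter.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_after_semicolon(url: str) -> str:
    """Return url with everything from the first ';' in its path removed."""
    parts = urlsplit(url)
    # Assumption: whatever follows the first ';' in the path is junk
    # (a session id or, as in this thread, a stray JS fragment).
    path = parts.path.split(";", 1)[0]
    # Query string and fragment are left untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for u in (
        "http://www.company.com/directory1;if(T.getElementsByClassName(",
        "http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case",
    ):
        print(strip_after_semicolon(u))
```

Normalizing this way would collapse all the variants onto the original URL, whereas a filter rule simply drops them.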
I've recently seen cases where something which looked like a URL
appeared in some JavaScript and Nutch identified it as something to
crawl. I don't know whether there is an easy fix.
There seem to be errors in the discovery of links from one page to the next.
I have now excluded URLs with a ';' in regex-urlfilter.txt.
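
For anyone reading along, the effect of such an exclusion can be sketched in Python as below. This only imitates the behaviour as I understand it (rules prefixed with '+' or '-', first matching rule decides, no match means the URL is rejected); the RULES list and the accepts function are hypothetical and are not the actual contents of my regex-urlfilter.txt.

```python
import re

# Hypothetical rule list: ('-' = reject, '+' = accept), checked in order.
RULES = [
    ("-", re.compile(r";")),  # reject any URL containing ';'
    ("+", re.compile(r".")),  # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching url is an accept rule."""
    for sign, pattern in rules:
        # search() matches anywhere in the URL, not only at the start.
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL matched by no rule is rejected.
    return False


if __name__ == "__main__":
    print(accepts("http://www.company.com/directory1"))
    print(accepts("http://www.company.com/directory1;if(T.getElementsByClassName("))
```

Note that a filter like this only keeps new URLs out; it does nothing about documents that are already indexed.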
My question now is, how do I remove these documents from the index?
Not sure. I suppose you could add in a plugin of your own which gets
used when you extract the index - but I guess that would be too much
trouble for you.
May I ask why you want them removed from the index? Is it because you
don't want users seeing them?
Yes. I have lots of similar results because of these URLs occurring many
times for the same original URL.
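
To show what I mean, a rough Python sketch that groups the ';' variants under their original URL, so the list of documents to delete is explicit. It assumes the indexed URLs are available as a plain list of strings (how to export them from the index is not covered here), and the function name group_semicolon_variants is made up.

```python
from collections import defaultdict


def group_semicolon_variants(urls):
    """Map each original URL to the indexed variants that carry a ';' suffix."""
    groups = defaultdict(list)
    for url in urls:
        # partition() splits at the first ';'; sep is empty if there is none.
        base, sep, _ = url.partition(";")
        if sep:
            groups[base].append(url)
    return dict(groups)


if __name__ == "__main__":
    indexed = [
        "http://www.company.com/directory1",
        "http://www.company.com/directory1;if(T.getElementsByClassName(",
        "http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case",
    ]
    for base, variants in group_semicolon_variants(indexed).items():
        print(base, len(variants))
```

Every URL in the grouped lists is a candidate for removal; the keys are the originals I want to keep.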
Thanks and best regards,
Jeroen