Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
On Tuesday 17 August 2010 13:47:32 Jeroen van Vianen wrote:
> Yes. I have lots of similar results because of these URLs occurring many
> times for the same original URL.

You can use deduplication [1]. It generates signatures for (near) exact content depending on configuration. It can then opt
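
To illustrate the signature idea in the message above, here is a minimal sketch. It is not Solr's deduplication code: it assumes exact-duplicate detection using an MD5 digest of lowercased, whitespace-normalized text, and the names `content_signature` and `drop_duplicates` are hypothetical. The page texts in the sample data are made up for the example.

```python
import hashlib
import re


def content_signature(text: str) -> str:
    """Return an MD5 hex digest of the text after normalizing case and whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first (url, text) pair seen for each content signature."""
    seen = set()
    unique = []
    for url, text in docs:
        sig = content_signature(text)
        if sig not in seen:
            seen.add(sig)
            unique.append((url, text))
    return unique


# Hypothetical sample: the second entry is a malformed URL with the same content.
docs = [
    ("http://www.company.com/directory1", "Welcome to  Company"),
    ("http://www.company.com/directory1;if(T.getElementsByClassName(", "welcome to company"),
    ("http://www.company.com/directory2", "Another page"),
]
unique_docs = drop_duplicates(docs)
```

With this sketch the malformed URL is dropped only because its content matches an earlier document. Near-duplicate detection would need a fuzzier signature than a plain hash.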

Re: Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
On 17-8-2010 13:35, Markus Jelsma wrote:

I assume it's about your Solr index again (for which you should mail to the Solr mailing list). It features deleteById and deleteByQuery methods but in your case it's going to be rather hard. Your URL field is, using the stock schema, analyzed and has a tok

Re: Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
On 17-8-2010 13:35, Alex McLintock wrote:

I happen to have accumulated a lot of URLs in my index with the following layout:

http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

Hmmm, This may be thinkin

Re: Removing URLs from index

2010-08-17 Thread Alex McLintock
On 17 August 2010 12:04, Jeroen van Vianen wrote:
> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the following
> layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

Hmm

Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
Hi,

I assume it's about your Solr index again (for which you should mail to the Solr mailing list). It features deleteById and deleteByQuery methods but in your case it's going to be rather hard. Your URL field is, using the stock schema, analyzed and has a tokenizer that strips characters such
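
One possible workaround, not proposed in the thread itself, is to collect the exact malformed URLs outside the index and delete them by id. The sketch below assumes the document id is the URL (check the schema's uniqueKey). The `SUSPECT` pattern is only a heuristic, and the helper names are hypothetical. It only builds Solr's XML delete message and does not send it anywhere.

```python
import re
from xml.sax.saxutils import escape

# Heuristic (an assumption): a ';' followed later by a bracket or brace,
# which is typical of JavaScript source rather than a URL path parameter.
SUSPECT = re.compile(r";.*[(){}]")


def is_suspect(url: str) -> bool:
    """Return True if the URL looks like it has JavaScript appended after ';'."""
    return SUSPECT.search(url) is not None


def build_delete_message(ids):
    """Build a Solr XML delete-by-id message for the given ids."""
    body = "".join(f"<id>{escape(i)}</id>" for i in ids)
    return f"<delete>{body}</delete>"


urls = [
    "http://www.company.com/directory1",
    "http://www.company.com/directory1;if(T.getElementsByClassName(",
    "http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case",
    "http://www.company.com/page;jsessionid=ABC123",  # legitimate ';' parameter, kept
]
bad = [u for u in urls if is_suspect(u)]
message = build_delete_message(bad)
```

The resulting message would still have to be posted to the index's update handler and followed by a commit. Review the matched URLs before deleting anything, since the heuristic can match legitimate addresses.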

Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
Hi,

I happen to have accumulated a lot of URLs in my index with the following layout:

http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

There seem to be errors in the discovery of links from one page