Hi,

I assume this is about your Solr index again (for which you should really mail the Solr mailing list). Solr offers deleteById and deleteByQuery methods, but in your case it's going to be rather hard: with the stock schema your URL field is analyzed, and its tokenizer strips characters such as your semicolon, so you cannot query for the ';' itself. Perhaps you can find a common trait amongst your bogus URLs that can be queried. If not, you must remove them manually by ID.
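For illustration, a minimal SolrJ sketch of the deleteByQuery route. The field name "url", the example token and the server URL are assumptions, and the class names are the SolrJ ones of this era (later releases renamed the client). Run the query as a normal search first (q=url:...) to verify what it matches before you delete anything:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class DeleteBogusDocs {
        public static void main(String[] args) throws Exception {
            // Assumption: Solr runs locally on the default port.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // The analyzer won't have indexed the ';' itself, but it may have
            // kept a token that only the bogus URLs share, e.g. the JavaScript
            // fragment below (hypothetical -- inspect your index first).
            solr.deleteByQuery("url:getElementsByClassName");

            // Make the deletes visible to searchers.
            solr.commit();
        }
    }

If no such common token exists, you can collect the IDs of the bad documents with regular queries and feed them to deleteById instead.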
But if you reindex from Nutch, the already fetched and parsed pages will reappear in your Solr index. Removing data from Nutch is really hard, but because of your urlfilter the generate command will no longer add those URLs to the fetch queue. The pages are, however, still in the segments (see the P.S. below for an example rule).

Cheers,

On Tuesday 17 August 2010 13:04:21 Jeroen van Vianen wrote:
> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the
> following layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break
> ;case
>
> There seem to be errors in the discovery of links from one page to the
> next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
>
> My question now is, how do I remove these documents from the index?
>
> Regards,
>
>
> Jeroen

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
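P.S. For reference, a regex-urlfilter.txt rule to keep such URLs out of future generate cycles could look like the sketch below. Rules are a '+' or '-' followed by a Java regex and the first matching rule wins, so it must sit above the final catch-all '+.' line; your exact file layout may differ by Nutch version.

    # skip any URL containing a semicolon
    -[;]

The stock file already skips some characters this way (e.g. -[?*!@=]), so simply adding ';' to that character class works too.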

