Hi,

I assume it's about your Solr index again (a question for the Solr mailing 
list, really). Solr offers deleteById and deleteByQuery methods, but in your 
case it's going to be rather hard: with the stock schema your URL field is 
analyzed, and its tokenizer strips characters such as your semicolon. Perhaps 
you can find a common trait amongst your bogus URLs that can be queried; if 
not, you'll have to delete them manually.
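
For the record, deleteByQuery boils down to posting an XML message to Solr's 
update handler (e.g. /solr/update), followed by a <commit/>. A sketch, 
assuming your field is called "url" and that a distinctive token such as 
getElementsByClassName survives the tokenizer (lowercased by the analyzer) — 
check the actual field name and tokens in your own schema first:

```xml
<!-- sketch: delete every document whose analyzed url field
     contains the token "getelementsbyclassname" -->
<delete>
  <query>url:getelementsbyclassname</query>
</delete>
```

POST that body to the update handler, then POST <commit/> to make the 
deletions visible.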

But if you reindex from Nutch, the already fetched and parsed pages will 
reappear in your Solr index. Removing data from Nutch is really hard. Thanks 
to your urlfilter the generate command will no longer add those URLs to the 
fetch queue, but the pages are still in the segments. 
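
For reference, such an exclusion is just a deny rule in conf/regex-urlfilter.txt. 
Nutch's RegexURLFilter rejects a URL as soon as a '-' pattern matches anywhere 
in it, so a sketch (placed before the final catch-all "+." accept rule) would 
look like:

```
# sketch: skip any URL containing a semicolon
-;
```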

Cheers,

On Tuesday 17 August 2010 13:04:21 Jeroen van Vianen wrote:
> Hi,
> 
> I happen to have accumulated a lot of URLs in my index with the
> following layout:
> 
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break
> ;case
> 
> There seem to be errors in the discovery of links from one page to the
> next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
> 
> My question now is, how do I remove these documents from the index?
> 
> Regards,
> 
> 
> Jeroen
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
