On 17-8-2010 13:35, Markus Jelsma wrote:
I assume it's about your Solr index again (for which you should mail the Solr mailing list). Solr features deleteById and deleteByQuery methods, but in your case it's going to be rather hard. Your URL field is, using the stock schema, analyzed and has a tokenizer that strips characters such as your semicolon. Perhaps you can find a common trait amongst your bogus URLs that can be queried. If not, you must do it manually.
That's too bad, as I'm unsure which URLs to look for. I think I'll just remove the entire domain name and crawl it again.
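
For reference, a rough sketch of what removing a whole domain by query could look like. It posts Solr's XML delete message to the update handler. The field name `host`, the domain `example.com` and the URL `http://localhost:8983/solr/update` are assumptions for illustration only; check your own schema and Solr location before trying anything like this.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_delete_by_query(query: str) -> bytes:
    """Build the XML body of a Solr delete-by-query update message."""
    # Escape &, < and > so the query cannot break the XML envelope.
    return f"<delete><query>{escape(query)}</query></delete>".encode("utf-8")


def post_update(update_url: str, body: bytes) -> int:
    """POST an XML update message to Solr and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def delete_domain(update_url: str, domain: str) -> None:
    """Delete every document whose (assumed) host field matches, then commit."""
    post_update(update_url, build_delete_by_query(f"host:{domain}"))
    post_update(update_url, b"<commit/>")


if __name__ == "__main__":
    # Only prints the message; call delete_domain() yourself once the
    # query has been checked against a normal search first.
    print(build_delete_by_query("host:example.com").decode("utf-8"))
```

Running the same query as an ordinary search first shows which documents would be removed.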
But if you reindex from Nutch, the already fetched and parsed pages will reappear in your Solr index. Removing data from Nutch is really hard. Because of your urlfilter, the generate command will no longer add those URLs to the fetch queue, but the pages are still in the segments.
Clear. Thanks, Jeroen

