On 17-8-2010 13:35, Markus Jelsma wrote:
I assume it's about your Solr index again (for which you should mail the
Solr mailing list). Solr features deleteById and deleteByQuery methods, but in
your case it's going to be rather hard. Your URL field is, using the stock
schema, analyzed and has a tokenizer that strips characters such as your
semicolon. Perhaps you can find a common trait amongst your bogus URLs that
can be queried. If not, you must do it manually.

That's too bad, as I'm unsure which URLs to look for. I think I'll just remove the entire domain name and crawl it again.
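
For reference, removing a whole domain could be done with a delete-by-query message posted to Solr's XML update handler. The following is only a minimal sketch under some assumptions: the update URL, the example domain and the helper names build_delete_by_query and delete_by_query are hypothetical, and it assumes the index has an untokenized `site` field holding the host name. Check your own schema before running anything like this.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_delete_by_query(query: str) -> str:
    """Return the XML body of a Solr delete-by-query update message."""
    # Escape &, < and > so the query cannot break the XML envelope.
    return "<delete><query>" + escape(query) + "</query></delete>"


def delete_by_query(update_url: str, query: str) -> int:
    """POST the delete message to the update handler and commit.

    update_url is an assumption, e.g. "http://localhost:8983/solr/update".
    Returns the HTTP status code of Solr's response.
    """
    body = build_delete_by_query(query).encode("utf-8")
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as delete_by_query("http://localhost:8983/solr/update", "site:example.com") would then drop every document whose `site` field matches that host, assuming the field exists and is indexed as a plain string.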

But if you reindex from Nutch, the already fetched and parsed pages will
reappear in your Solr index. Removing data from Nutch is really hard. Because
of your urlfilter, the generate command will no longer add those URLs to the
fetch queue, but the pages are still in the segments.

Clear.

Thanks,


Jeroen
