Right now the only way to do so is a manual deleteByQuery operation using
Lucene's capability for regex queries. Keep in mind that Nutch's filter does a
find(), where Lucene needs a match() against the whole term, so you have to
rewrite the queries.
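
A rough sketch of that rewrite, in Python, in case it helps. It is not part of
Nutch or Solr: the helper names are made up for illustration, and it assumes
the full URL is stored in an untokenized "id" field and that the pattern uses
only syntax Lucene's regex dialect understands (it is not identical to
java.util.regex). A find() pattern is padded with ".*" on each side unless it
was anchored with ^ or $, since Lucene anchors implicitly.

```python
import json


def _ends_with_anchor(pattern: str) -> bool:
    """True if the pattern ends in an unescaped '$' anchor."""
    if not pattern.endswith("$"):
        return False
    body = pattern[:-1]
    # An odd number of trailing backslashes means the '$' is escaped.
    backslashes = len(body) - len(body.rstrip("\\"))
    return backslashes % 2 == 0


def find_to_match(pattern: str) -> str:
    """Turn a find()-style pattern into one that must match the whole value."""
    # Lucene regexes are implicitly anchored, so '^' and '$' are dropped
    # and a missing anchor becomes an explicit '.*'.
    if pattern.startswith("^"):
        pattern = pattern[1:]
    else:
        pattern = ".*" + pattern
    if _ends_with_anchor(pattern):
        pattern = pattern[:-1]
    else:
        pattern = pattern + ".*"
    return pattern


def delete_by_query_body(pattern: str, field: str = "id") -> str:
    """Build a JSON delete-by-query body; 'field' is an assumed field name."""
    # Regex queries are written between slashes, so literal slashes
    # inside the pattern have to be escaped.
    lucene = find_to_match(pattern).replace("/", "\\/")
    return json.dumps({"delete": {"query": f"{field}:/{lucene}/"}})


if __name__ == "__main__":
    # Pattern as it would appear (without the leading '-') in regex-urlfilter.txt
    print(delete_by_query_body(r"\.(gif|jpg)$"))
```

The resulting body would then be posted to the Solr update handler, followed
by a commit. It is worth running the same regex as a normal search first to
see what it matches before deleting anything.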
-----Original message-----
> From:Bayu Widyasanyata <[email protected]>
> Sent: Tuesday 18th February 2014 0:02
> To: [email protected]
> Subject: How to check URL that have been indexed by Solr?
>
> Hi,
>
> Sometimes we accidentally crawl unneeded URL formats and push them all the
> way through the last "solrindex" step.
>
> As we know, we can drop or delete those URLs by adding a regex to
> regex-urlfilter.txt and running "nutch updatedb". Those URLs will then be
> dropped/deleted from the crawldb database.
>
> But how can we ensure that URLs already indexed by Solr ("nutch solrindex")
> before we ran "nutch updatedb" have also been deleted?
> Are those URLs also deleted when we perform "solrindex" again?
>
> Thank you.
>
> --
> wassalam,
> [bayu]
>