Dear List: I asked a similar question before but haven't solved the problem, so let me re-ask it more clearly and seek your advice.
I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at a rudimentary level. The basic problem I face in crawling/indexing is that I need to control which pages the crawler should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me a headache.

A real-life example: when we crawl CNN.com, we only want to index "real content" pages such as http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1. When we start the crawl from the root, we can't put tight patterns such as

    +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*

in nutch/conf/regex-urlfilter.txt, because the pages on the path between the root and the content pages do not match such patterns. Putting them in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of the crawl.

The closest solution I've got so far (courtesy of Markus) was this:

    nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...

but unfortunately I haven't been able to make it work for me. The content of the urlfilter.regex.file is what I believe to be correct, something like the following:

    +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
    -.

Everything seems quite straightforward. Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it.

Joe
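
P.S. In case the details matter, here is the complete setup I have in mind. The Solr URL and the crawl/ paths below are placeholders for my real ones, and the exact solrindex argument order may differ depending on what "bin/nutch solrindex" prints as its usage for 1.5.1.

The crawl-time filter stays deliberately loose, so the crawler can traverse the home page and section pages to reach the articles:

    # conf/regex-urlfilter.txt -- crawl scope (loose on purpose).
    # Rules are applied top to bottom; the first match wins.
    +^http://([a-z0-9]*\.)*cnn.com/
    # reject everything else
    -.

The index-time filter, passed via -Durlfilter.regex.file, is the tight one that only accepts dated article URLs:

    # /path/to/index-urlfilter.txt -- index scope (tight)
    +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
    -.

And, if I understand Markus's suggestion correctly, the indexing step would look like this (the -D property has to come before the positional arguments, per the usual Hadoop convention):

    bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-urlfilter.txt \
        http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*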

