Hi,

I just tested a small index job that usually writes 1200 records to Solr. It 
works fine if I put -. in a filter file (so nothing gets indexed) and point to 
it with -Durlfilter.regex.file=path, as you do. I assume that by `it doesn't 
work` you mean that it filters nothing and indexes all records from the 
segment. Did you forget the -filter parameter?
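
To sanity-check a rule file away from the indexer, here is a rough sketch in Python. It is not Nutch code: the names parse_rules and accepts are made up for illustration. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides (+ keeps, - drops), and that a URL matching no rule is dropped. Python's regex dialect is close to Java's for patterns like these but not identical.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is dropped

# The rule file from the quoted message below.
JOE_RULES = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""

# The "index nothing" file used in the test described above.
REJECT_ALL = "-."

if __name__ == "__main__":
    rules = parse_rules(JOE_RULES)
    for url in (
        "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
        "http://www.cnn.com/",
    ):
        print(url, "->", "index" if accepts(rules, url) else "skip")
```

If the rules behave as intended here but every record from the segment still reaches Solr, that suggests the filters are not being applied at all rather than a problem with the patterns.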

Cheers 
 
-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Thu 22-Nov-2012 07:29
> To: user <[email protected]>
> Subject: Indexing-time URL filtering again
> 
> Dear List:
> 
> I asked a similar question before, but I haven't solved the problem.
> Therefore I try to re-ask the question more clearly and seek advice.
> 
> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> rudimentary level.
> 
> The basic problem I face in crawling/indexing is that I need to control
> which pages the crawlers should VISIT (so far through
> nutch/conf/regex-urlfilter.txt)
> and which pages are INDEXED by Solr. The latter are only a SUBSET of the
> former, and they are giving me a headache.
> 
> A real-life example would be: when we crawl CNN.com, we only want to index
> "real content" pages such as
> http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> When we start the crawling from the root, we can't specify tight
> patterns (e.g.,
> +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
> in nutch/conf/regex-urlfilter.txt,
> because the pages on the path between root and content pages do not satisfy
> such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt
> would severely jeopardize the coverage of the crawl.
> 
> The closest solution I've got so far (courtesy of Markus) was this:
> 
> nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> 
>  but unfortunately I haven't been able to make it work for me. The content
> of the urlfilter.regex.file is what I thought "correct" --- something like
> the following:
> 
> +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> -.
> 
> Everything seems quite straightforward. Am I doing anything wrong here? Can
> anyone advise? I'd greatly appreciate it.
> 
> Joe
> 
