Dear List:

I asked a similar question before but haven't solved the problem, so I'm
re-asking it more clearly in the hope of getting advice.

I'm using Nutch 1.5.1 and Solr 3.6.1 together, and things work fine at a
rudimentary level.

The basic problem I face in crawling/indexing is that I need to control
which pages the crawler should VISIT (so far through
nutch/conf/regex-urlfilter.txt)
and which pages get INDEXED by Solr. The latter are only a SUBSET of the
former, and that subset is what's giving me a headache.

A real-life example would be: when we crawl CNN.com, we only want to index
"real content" pages such as
http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
When we start the crawl from the root, we can't specify tight
patterns (e.g.,
+^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
in nutch/conf/regex-urlfilter.txt, because the pages on the path between
the root and the content pages don't match such patterns. Putting such
patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the
coverage of the crawl.
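To illustrate why, here is a quick sketch (in Python, just to show the
regex semantics, not Nutch itself) of how a tight date-based pattern
rejects exactly the hub pages the crawler must pass through to reach the
content (note the escaped dot in cnn\.com, which the pattern strictly
should have):

```python
import re

# Tight pattern for "real content" pages (date-based paths only).
content = re.compile(
    r"^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*"
)

urls = [
    "http://www.cnn.com/",       # root: rejected
    "http://www.cnn.com/US/",    # section hub: rejected
    "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html",  # kept
]
for url in urls:
    print(url, "->", bool(content.match(url)))
```

If this were the crawl-time filter, the root and section pages would be
dropped, so the crawler could never discover the content pages at all.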

The closest solution I've got so far (courtesy of Markus) was this:

nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...

 but unfortunately I haven't been able to make it work. The content
of the urlfilter.regex.file is what I'd consider "correct" --- something like
the following:

+^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
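As far as I can tell, the rules themselves do what I intend: the regex
URL filter applies rules top-down and the first match wins ("+" accepts,
"-" rejects). A small Python sketch of that first-match semantics, using
my two rules (with the dot in cnn\.com escaped), behaves as expected:

```python
import re

# Rules as (sign, pattern), in file order, mimicking the
# first-match-wins behaviour of Nutch's regex URL filter.
rules = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*cnn\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")),
    ("-", re.compile(r".")),  # catch-all: reject everything else
]

def keep(url):
    for sign, pat in rules:
        if pat.search(url):
            return sign == "+"
    return False  # no rule matched: reject

print(keep("http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html"))  # True
print(keep("http://www.cnn.com/US/"))                                        # False
```

So the article URLs pass and everything else is rejected, which is
exactly the subset I want indexed.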

Everything seems quite straightforward. Am I doing anything wrong here? Can
anyone advise? I'd greatly appreciate it.

Joe
