I tried the following command:

bin/nutch solrindex http://localhost:8983/solr/ /home/data/crawl/bloomberg7/crawldb/ /home/data/crawl/bloomberg7/segments/* -filter conf/regex-urlfilter-indexing.txt

And here is what I got:

SolrIndexer: starting at 2012-11-04 05:59:39
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/conf/regex-urlfilter-indexing.txt/crawl_fetch
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/conf/regex-urlfilter-indexing.txt/crawl_parse
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/conf/regex-urlfilter-indexing.txt/parse_data
Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/conf/regex-urlfilter-indexing.txt/parse_text

On Sat, Nov 3, 2012 at 4:16 AM, Lewis John Mcgibbney wrote:

> Hi,
>
> Markus was referring to the -filter flag you can add to your solrindex
> command. Please take a look at the relevant wiki entry [0]
>
> You should be able to point this to a specific regex or automaton
> urlfilter file and achieve what you want... hopefully without dabbling
> in Java and indexing filters.
>
> hth
>
> Lewis
>
> [0] http://wiki.apache.org/nutch/bin/nutch%20solrindex
>
> On Sat, Nov 3, 2012 at 3:57 AM, Joe Zhang wrote:
> > Markus gave me a little hint, but he's not available today, and this
> > is an urgent issue.
> >
> > The question is simple (nutch 1.5.1 and solr 3.6.1 working together):
> >
> > - The URL patterns in regex-urlfilter.txt control the behavior of
> > crawling, i.e., which pages to visit (or not to visit)
> > - What I need to do is to specify **which pages are to be indexed by
> > solr** (this is a subset of the pages visited) --> I wonder whether
> > there is a place to specify such URL patterns.
> >
>
> --
> Lewis
>

