Hi,

Maybe you can crawl abc.com as usual and then run the Nutch "mergesegs" tool with a URL filter (for example domain-urlfilter) that keeps only "car.abc.com" URLs. That creates new segments containing only the filtered URLs, and the index can then be built from those filtered segments.
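To make the filtering step concrete, here is a rough Python sketch of the rule logic I have in mind. It is not Nutch code. It only imitates how I understand the regex URL filter to behave: rules are tried in order, the first matching rule decides, and a URL that matches nothing is dropped. The names RULES, accept and filter_urls are made up for the illustration, and the patterns assume the pages are served over http or https.

```python
import re

# Hypothetical rule list in the spirit of regex-urlfilter.txt:
# "+" keeps a URL, "-" drops it, and the first matching rule wins.
RULES = [
    ("+", r"^https?://car\.abc\.com/"),  # keep anything on car.abc.com
    ("-", r"."),                         # drop everything else
]


def accept(url, rules=RULES):
    """Return True if the first rule matching `url` is a "+" rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


def filter_urls(urls, rules=RULES):
    """Keep only the URLs that the rule list accepts, preserving order."""
    return [u for u in urls if accept(u, rules)]


if __name__ == "__main__":
    crawled = [
        "http://abc.com/",
        "http://car.abc.com/toyota",
        "http://abc.com/news/today",
        "http://car.abc.com/honda",
    ]
    for url in filter_urls(crawled):
        print(url)
```

The point is that this filter is applied only when the segments are merged, so the crawl itself still follows every abc.com link and can reach car.abc.com pages that sit several levels deep.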
--
Ashish Almeida

On Thu, Jul 15, 2010 at 9:10 PM, Savannah Beckett <[email protected]> wrote:
> Hi,
> I want nutch to crawl abc.com, but I want to index only car.abc.com.
> car.abc.com links can be at any level in abc.com. So, basically, I want
> nutch to keep crawling abc.com normally, but index only pages that start
> with car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda...
>
> I set regex-urlfilter.txt to include only car.abc.com and ran the command
> "generate crawl/crawldb crawl/segments", but it just says "Generator: 0
> records selected for fetching, exiting ...". I guess car.abc.com links
> exist only several levels deep.
>
> How can I do this? I am using nutch 1.1 and solr 1.4.1.
> Thanks.

