RE: How to Index Only Pages with Certain Urls?

Arkadi.Kosmynin Thu, 15 Jul 2010 23:43:09 -0700

Hi Savannah,

You can control indexing with an indexing plugin (an IndexingFilter). If you
don't want a particular URL in the index, just return null from the filter.
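
To make the idea concrete, here is a minimal sketch of the decision such a plugin would make. It is plain Java with no Nutch dependencies so it can be run on its own. The class name CarHostCheck and the method shouldIndex are hypothetical names chosen for illustration, and car.abc.com is the example host from the question. In a real plugin, the filter method would return null when this check fails and return the document unchanged otherwise. Note that this assumes regex-urlfilter.txt still allows all of abc.com, so that the crawl itself can reach the car.abc.com pages.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CarHostCheck {
    // Assumption: the only host whose pages should reach the index.
    static final String ALLOWED_HOST = "car.abc.com";

    // Returns true if the URL's host is exactly ALLOWED_HOST.
    // Inside an indexing filter, a false result here would translate
    // into "return null" so the page is crawled but not indexed.
    static boolean shouldIndex(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && host.equalsIgnoreCase(ALLOWED_HOST);
        } catch (URISyntaxException e) {
            // Unparseable URLs are kept out of the index.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("http://car.abc.com/toyota"));
        System.out.println(shouldIndex("http://www.abc.com/about"));
    }
}
```

Comparing the parsed host, rather than testing whether the URL string contains "car.abc.com", avoids accidentally indexing pages on other hosts that merely mention that name in their path or query string.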


Regards,

Arkadi 

>-----Original Message-----
>From: Savannah Beckett [mailto:[email protected]]
>Sent: Friday, July 16, 2010 1:41 AM
>To: [email protected]
>Subject: How to Index Only Pages with Certain Urls?
>
>Hi,
>  I want Nutch to crawl abc.com, but I want to index only car.abc.com.
>car.abc.com links can be at any level in abc.com.  So, basically, I want
>Nutch to keep crawling abc.com normally, but index only pages whose URLs
>start with car.abc.com,
>e.g. car.abc.com/toyota...car.abc.com/honda...
>
>
>
>I set regex-urlfilter.txt to include only car.abc.com and ran the command
>"generate crawl/crawldb crawl/segments", but it just says "Generator: 0
>records selected for fetching, exiting ...".  I guess car.abc.com links
>exist only several levels deep.
>
>
>How can I do this?  I am using Nutch 1.1 and Solr 1.4.1.
>Thanks.
>
>
>
