Re: How to Index Only Pages with Certain Urls?

Ashish Almeida Thu, 15 Jul 2010 23:22:46 -0700

Hi,
Maybe you can crawl abc.com as usual and then apply a URL filter (e.g.
domain-urlfilter) through the nutch "mergesegs" tool, run with its -filter
option, to select only urls under "car.abc.com". That creates new segments
containing only the filtered urls, and the index can then be built from
these filtered segments.


On Thu, Jul 15, 2010 at 9:10 PM, Savannah Beckett <
[email protected]> wrote:

> Hi,
>   I want nutch to crawl abc.com, but I want to index only car.abc.com.
> car.abc.com links can appear at any level in abc.com. So, basically, I
> want nutch to keep crawling abc.com normally, but index only pages whose
> urls start with car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda...
>
> I set regex-urlfilter.txt to include only car.abc.com and ran the command
> "generate crawl/crawldb crawl/segments", but it just says "Generator: 0
> records selected for fetching, exiting ...". I guess car.abc.com links
> exist only several levels deep.
>
> How can I do this? I am using nutch 1.1 and solr 1.4.1.
> Thanks.
>
>
>




-- 
Ashish Almeida
---------------------------------
