Re: Crawling just one particular page from a host

Karl Wright Tue, 14 May 2013 05:08:19 -0700

You can set a hopcount filter - that should do it.
Karl
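
A rough sketch of the idea behind a hop count limit, for readers of the
archive. This is a conceptual illustration only, not ManifoldCF's
implementation: the function name, the link graph, and every page other than
forside.xhtml are hypothetical, and the thread does not spell out how
ManifoldCF counts hops. The sketch assumes the seed sits at hop 0 and each
followed link adds one hop.

```python
from collections import deque

SEED = "http://www.ibsen.uio.no/forside.xhtml"

# Hypothetical link graph: page -> pages it links to (invented for illustration).
LINKS = {
    SEED: [
        "http://www.ibsen.uio.no/page-a.xhtml",
        "http://www.ibsen.uio.no/page-b.xhtml",
    ],
    "http://www.ibsen.uio.no/page-a.xhtml": [
        "http://www.ibsen.uio.no/page-c.xhtml",
    ],
}


def reachable_within(seeds, links, max_hops):
    """Return the set of URLs reachable from the seeds in at most max_hops links."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # page is kept, but its outgoing links are not followed
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return set(hops)
```

Under these assumptions, a limit of 0 keeps only the seed page, while a limit
of 1 also admits the pages it links to directly. The value that corresponds to
"seed only" in an actual job configuration should be checked against the
ManifoldCF documentation.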


On Tue, May 14, 2013 at 8:06 AM, Erlend Garåsen <[email protected]> wrote:

> On 14.05.13 13.49, Karl Wright wrote:
>
>> Hi Erlend,
>>
>> "Hosts matching seeds" means that if the domain (in this case
>> www.ibsen.uio.no) is mentioned in a seed, a page with the same
>> domain will be included in the crawl if there is nothing else that
>> excludes it.  So it sounds like it is working as designed.
>>
>
> Yes, you are right. I'm just trying to find a simple way to crawl just the
> starting page of a host and nothing else, i.e.:
> http://www.ibsen.uio.no/forside.xhtml
>
> I tried to place this in the "include in crawl" box:
> http://www\.ibsen\.uio\.no/forside\.xhtml$
>
> Still, it includes everything else from that host unless I write a lot
> of exclude regexp rules.
>
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968,
> VIP: 31050
>
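
For reference, a small check of the include pattern quoted above under
Python's `re` semantics. It assumes the `**` shown in the archived copy of the
message is a formatting artifact and not part of the expression; the two
non-matching URLs are hypothetical. It says nothing about how ManifoldCF
combines include rules with "Hosts matching seeds", which the thread does not
explain.

```python
import re

# Include pattern from the message, with the assumed archive artifact removed.
INCLUDE = re.compile(r"http://www\.ibsen\.uio\.no/forside\.xhtml$")


def is_included(url):
    """True if the URL matches the include pattern from its first character."""
    return INCLUDE.match(url) is not None


front_page = "http://www.ibsen.uio.no/forside.xhtml"
other_page = "http://www.ibsen.uio.no/other.xhtml"        # hypothetical
nested_page = "http://www.ibsen.uio.no/sub/forside.xhtml"  # hypothetical
```

Taken on its own, the expression matches only the front page, so the extra
documents Erlend saw are more likely explained by how the job's other
settings interact with it than by the pattern itself.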
