On 2010-11-16 12:13, ytthet wrote:
> 
> Hi All,
> 
> I have a requirement similar to Beats's.
> 
> I need to crawl certain pages to extract URLs, but not index the pages themselves.
> 
> For example, a blog home page contains snapshots of the latest posts and links to
> them. In that case, I need to extract only the links and not index the page itself.
> 
> I cannot do as Jake suggested and add <meta name="robots"
> content="noindex,follow">, because I do not own the pages. Rather, I am indexing
> a few collections of web sites.
> 
> Has anyone found a solution, or does anyone have suggestions on the matter?

This and similar use case scenarios all boil down to your ability to
specify what is so special about this page, and then simply skip it in
your custom IndexingFilter (returning null from a filter will discard
the page from the index; its outlinks have already been extracted at
parse time, so link discovery is unaffected).
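To make the "return null to discard" contract concrete, here is a minimal,
self-contained sketch of the decision logic. The class and method names are
illustrative stand-ins: a real plugin would implement
org.apache.nutch.indexer.IndexingFilter and work with NutchDocument and
CrawlDatum metadata rather than plain Maps.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a Nutch IndexingFilter; the Maps play the
// roles of the NutchDocument and the per-URL crawl metadata.
public class HomepageFilter {

    /**
     * Mirrors the IndexingFilter contract: returning null tells Nutch
     * to drop the document from the index, while returning the document
     * (possibly modified) lets it through.
     */
    public static Map<String, String> filter(Map<String, String> doc,
                                             Map<String, String> metadata) {
        if ("true".equals(metadata.get("homepage"))) {
            return null; // discard from the index, keep in the crawl
        }
        return doc; // index everything else unchanged
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://blog.example.com/");

        Map<String, String> meta = new HashMap<>();
        meta.put("homepage", "true");

        System.out.println(filter(doc, meta) == null
                ? "discarded" : "indexed");
    }
}
```

Running the main method prints "discarded", since the metadata marks the
URL as a homepage.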

One simple solution, if you know in advance the URLs of the pages that you
want to discard, would be to inject these URLs with an additional
metadata entry "homepage=true" and then check for it in your IndexingFilter.
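For illustration, and worth verifying against your Nutch version: the 1.x
Injector accepts per-URL metadata in the seed list as tab-separated
name=value pairs after the URL, so marking the homepages could look like
(hostnames here are hypothetical):

```
http://blog.example.com/	homepage=true
http://news.example.org/	homepage=true
```

The injected key ends up in the page's CrawlDatum metadata, where a custom
IndexingFilter can read it at indexing time.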


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com