On 2010-11-16 12:13, ytthet wrote:
> Hi All,
>
> I have similar requirements to Beats.
>
> I need to crawl certain pages to extract URLs, but not index those pages.
>
> For example, a blog home page contains snapshots of the latest posts and
> links to them. In that case, I need to extract only the links and not
> index the page itself.
>
> I cannot do as Jake suggested, <meta name="robots"
> content="noindex,follow">, for I do not own the pages. Rather, I am
> indexing a few collections of web sites.
>
> Has anyone found any solutions or suggestions on the matter?
This and similar use cases all boil down to your ability to specify what is special about such a page, and then to skip it in your custom IndexingFilter (returning null from a filter discards the page from the index). One simple solution, if you know in advance the URLs of the pages that you want to discard, would be to inject those URLs with an additional metadata entry such as "homepage=true" and then check for it in your IndexingFilter; see the sketch below.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
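
For reference, here is a minimal sketch of such a filter, written against the Nutch 1.x IndexingFilter API. The class name HomepageSkipFilter and the "homepage" metadata key are invented for illustration, and the exact interface (and whether injected metadata survives later updatedb cycles) varies between Nutch releases, so treat this as a starting point rather than a drop-in plugin:

// HomepageSkipFilter.java -- hypothetical example class, not part of Nutch.
// Discards pages that were injected with "homepage=true" metadata; their
// outlinks are still followed and indexed as usual.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class HomepageSkipFilter implements IndexingFilter {

  private static final Text HOMEPAGE_KEY = new Text("homepage");

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Metadata attached at inject time travels with the CrawlDatum.
    Writable flag = datum.getMetaData().get(HOMEPAGE_KEY);
    if (flag != null && "true".equals(flag.toString())) {
      return null; // discard: this page is dropped from the index
    }
    return doc; // everything else passes through untouched
  }

  // Declared by some 1.x releases of the IndexingFilter interface;
  // a no-op is fine here.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The matching seed-list entry would carry the flag as per-URL injection metadata (tab-separated key=value pairs, supported by recent versions of the Injector):

http://blog.example.com/	homepage=true

As with any Nutch plugin, the class then has to be declared as an indexing-filter extension point in the plugin's plugin.xml and enabled via the plugin.includes property.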

