Hi all, I have requirements similar to Beats'.
I need to crawl certain pages to extract their URLs, but not index the pages themselves. For example, a blog home page contains snippets of the latest posts and links to them; in that case I need to extract only the links and not index the page. I cannot do what Jake suggested, <meta name="robots" content="noindex,follow">, because I do not own the pages; rather, I am indexing a few collections of web sites. Has anyone found a solution or suggestion on this? (A rough sketch of the kind of indexing filter I have in mind follows below the quoted messages.)

Thanks in advance.

Y.T Thet


jakecjacobson wrote:
>
> Hi,
>
> Nutch should follow the meta robots directives, so in page A add this
> meta directive:
>
> <meta name="robots" content="noindex,follow">
>
> http://www.seoresource.net/robots-metatags.htm
>
> Jake Jacobson
>
> http://www.linkedin.com/in/jakejacobson
> http://www.facebook.com/jakecjacobson
> http://twitter.com/jakejacobson
>
> Our greatest fear should not be of failure,
> but of succeeding at something that doesn't really matter.
> -- ANONYMOUS
>
>
>
> On Tue, Jul 14, 2009 at 8:32 AM, Beats<[email protected]> wrote:
>>
>> hi,
>>
>> actually what I want is to crawl a web page, say 'page A', and all its
>> outlinks. I want to index all the content gathered by crawling the
>> outlinks, but not 'page A'.
>> Is there any way to do it in a single run?
>>
>> with regards,
>>
>> Beats
>> [email protected]
>>
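The idea I have in mind is a custom IndexingFilter: the hub pages are still fetched and parsed as usual, so their outlinks enter the crawldb and get crawled, but the filter returns null at indexing time so the pages never reach the index. This is only a minimal sketch, assuming the Nutch 1.x IndexingFilter interface (the exact interface has shifted a little between releases); the class name and the index.skip.url.pattern property are my own placeholders, not part of Nutch:

    package org.example.nutch;  // hypothetical package

    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    /**
     * Drops "hub" pages (e.g. blog home pages) at indexing time. The
     * pages are still fetched and parsed, so their outlinks are
     * discovered and crawled; they just never make it into the index.
     */
    public class SkipHubPagesFilter implements IndexingFilter {

      private Configuration conf;
      private Pattern skipPattern;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Returning null tells Nutch to skip this document entirely.
        if (skipPattern.matcher(url.toString()).matches()) {
          return null;
        }
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
        // "index.skip.url.pattern" is a made-up property name; the
        // default regex matches bare site roots like http://example.com/
        skipPattern = Pattern.compile(
            conf.get("index.skip.url.pattern", "https?://[^/]+/?"));
      }

      public Configuration getConf() {
        return conf;
      }
    }

The plugin would still need the usual plugin.xml registration against the org.apache.nutch.indexer.IndexingFilter extension point, plus an entry in plugin.includes in nutch-site.xml. I have not verified this end to end, so corrections welcome.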

