Hi all, I have requirements similar to Beats'.
I need to crawl certain pages to extract their URLs, but not index the pages themselves. For example, a blog home page contains snippets of the latest posts and links to them; in that case I need to extract only the links and not index the page. I cannot do what Jake suggested, <meta name="robots" content="noindex,follow">, because I do not own the pages; rather, I am indexing a few collections of web sites. Has anyone found a solution or suggestion on this? (A rough sketch of the kind of indexing filter I have in mind follows below the quoted messages.)

Thanks in advance.

Y.T Thet


jakecjacobson wrote:
>
> Hi,
>
> Nutch should follow the meta robots directives, so in page A add this
> meta directive:
>
> <meta name="robots" content="noindex,follow">
>
> http://www.seoresource.net/robots-metatags.htm
>
> Jake Jacobson
>
> http://www.linkedin.com/in/jakejacobson
> http://www.facebook.com/jakecjacobson
> http://twitter.com/jakejacobson
>
> Our greatest fear should not be of failure,
> but of succeeding at something that doesn't really matter.
> -- ANONYMOUS
>
>
>
> On Tue, Jul 14, 2009 at 8:32 AM, Beats<[email protected]> wrote:
>>
>> hi,
>>
>> actually what I want is to crawl a web page, say 'page A', and all its
>> outlinks. I want to index all the content gathered by crawling the
>> outlinks, but not 'page A'.
>> Is there any way to do it in a single run?
>>
>> with regards,
>>
>> Beats
>> [email protected]
>>
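The idea I have in mind is a custom IndexingFilter: the hub pages are still fetched and parsed as usual, so their outlinks enter the crawldb and get crawled, but the filter returns null at indexing time so the pages never reach the index. This is only a minimal sketch, assuming the Nutch 1.x IndexingFilter interface (the exact interface has shifted a little between releases); the class name and the index.skip.url.pattern property are my own placeholders, not part of Nutch:

    package org.example.nutch;  // hypothetical package

    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    /**
     * Drops "hub" pages (e.g. blog home pages) at indexing time. The
     * pages are still fetched and parsed, so their outlinks are
     * discovered and crawled; they just never make it into the index.
     */
    public class SkipHubPagesFilter implements IndexingFilter {

      private Configuration conf;
      private Pattern skipPattern;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Returning null tells Nutch to skip this document entirely.
        if (skipPattern.matcher(url.toString()).matches()) {
          return null;
        }
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
        // "index.skip.url.pattern" is a made-up property name; the
        // default regex matches bare site roots like http://example.com/
        skipPattern = Pattern.compile(
            conf.get("index.skip.url.pattern", "https?://[^/]+/?"));
      }

      public Configuration getConf() {
        return conf;
      }
    }

The plugin would still need the usual plugin.xml registration against the org.apache.nutch.indexer.IndexingFilter extension point, plus an entry in plugin.includes in nutch-site.xml. I have not verified this end to end, so corrections welcome.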

