I don't know about stopping the problems with the issues that you've raised.

But I do know that web sites that aren't idempotent with GET requests are in a 
hurt locker. That seems to be WAY too many of them.
This means: don't do anything with a GET request that changes the contents of 
your web site.
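
As a rough sketch of what I mean (not tied to any particular framework, and the 
class/helper names here are made up for illustration): keep all side effects 
behind POST, and let GET only read.

    // Hypothetical sketch: GET stays read-only, state changes go through POST.
    // renderItem() and deleteItem() are placeholder helpers, not a real API.
    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ItemServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // Safe/idempotent: only reads and renders, never mutates state,
            // so a crawler following this URL cannot change the site.
            resp.getWriter().println(renderItem(req.getParameter("id")));
        }

        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // Side effects (delete, edit, vote, ...) live here; crawlers
            // following plain links never issue POSTs.
            deleteItem(req.getParameter("id"));
            resp.sendRedirect("list");
        }

        private String renderItem(String id) { return "item " + id; } // placeholder
        private void deleteItem(String id) { /* placeholder */ }      // placeholder
    }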

Regarding a more direct answer to your question, you'd probably have to apply 
some sort of filtering (a rough sketch follows below). And anyway, crawlers only 
issue 'queries' based on the URLs found in the site, right? So are you going to 
have weird URLs embedded in your site?
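
Here's one hedged way that filtering could look: a servlet filter that checks 
the User-Agent and keeps known crawlers away from the expensive search endpoint. 
The bot list and the /search path are assumptions for the example, not a 
recommendation.

    // Hypothetical sketch of "some sort of filtering": crawlers get a 403
    // (or a static, cacheable page) instead of hitting the search backend.
    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CrawlerSearchFilter implements Filter {

        // Example markers only; real deployments would maintain a fuller list.
        private static final String[] BOT_MARKERS = {"googlebot", "bingbot", "slurp"};

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletRequest req = (HttpServletRequest) request;
            HttpServletResponse resp = (HttpServletResponse) response;

            String ua = req.getHeader("User-Agent");
            if (ua != null && isBot(ua.toLowerCase())
                    && req.getRequestURI().startsWith("/search")) {
                resp.sendError(HttpServletResponse.SC_FORBIDDEN, "Search is not crawlable");
                return;
            }
            chain.doFilter(request, response);
        }

        private boolean isBot(String userAgent) {
            for (String marker : BOT_MARKERS) {
                if (userAgent.contains(marker)) {
                    return true;
                }
            }
            return false;
        }

        @Override public void init(FilterConfig filterConfig) { }
        @Override public void destroy() { }
    }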

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others' mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



----- Original Message ----
From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 5:41:17 AM
Subject: How to let crawlers in, but prevent their damage?

Hi,

How do people with public search services deal with bots/crawlers?
And I don't mean to ask how one bans them (robots.txt) or slow them down (Delay 
stuff in robots.txt) or prevent them from digging too deep in search results...

What I mean is that when you have publicly exposed search that bots crawl, they 
issue all kinds of crazy "queries" that result in errors, that add noise to Solr 
caches, increase Solr cache evictions, etc. etc.

Are there some known recipes for dealing with them, minimizing their negative 
side-effects, while still letting them crawl you?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
