[Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread richard
Alan Perkins writes:
 > What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time
taken for the last response. e.g. if a site responds in 20 ms,
you can hit it again the same second. If a site takes 4 seconds
to respond, leave it at least 30 seconds before trying again.
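That rule of thumb can be sketched as a small helper. The multiplier of 10 and the one-second "slow server" threshold below are illustrative choices, not a standard; they just reproduce the two examples above (20 ms and 4 s):

```java
// Politeness-delay heuristic: wait several times longer than the last
// response took, with a lower bound for slow servers.
public class CrawlDelay {
    // Illustrative constants, not an accepted standard.
    static final long MULTIPLIER = 10;
    static final long SLOW_THRESHOLD_MS = 1000;
    static final long SLOW_FLOOR_MS = 30_000;

    static long delayFor(long lastResponseMs) {
        long delay = lastResponseMs * MULTIPLIER;
        // A server that took a second or more to answer gets at least
        // a 30-second break before the next request.
        if (lastResponseMs >= SLOW_THRESHOLD_MS) {
            delay = Math.max(delay, SLOW_FLOOR_MS);
        }
        return delay;
    }

    public static void main(String[] args) {
        System.out.println(delayFor(20));   // fast site: 200 ms, same second
        System.out.println(delayFor(4000)); // slow site: 40000 ms
    }
}
```

With these numbers, a 20 ms response allows a revisit within the same second, while a 4-second response forces at least the 30-second floor.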

 > B) The number of robots you are running (e.g. 30 seconds per site per
 > robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a
Mercator-style distribution strategy, this is a non-issue.
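The reason it becomes a non-issue: in a Mercator-style design each host is owned by exactly one crawler, so the per-site interval only has to be enforced locally, with no cross-robot coordination. A minimal sketch of such an assignment (the hash-modulo scheme here is an illustrative assumption, not Mercator's exact function):

```java
// Assign each host to exactly one of n robots. Because no two robots
// ever fetch from the same host, each robot can enforce the per-site
// politeness interval on its own.
public class HostPartition {
    static int robotFor(String host, int numRobots) {
        // Math.floorMod guards against negative hashCode() values.
        return Math.floorMod(host.hashCode(), numRobots);
    }

    public static void main(String[] args) {
        System.out.println(robotFor("example.com", 4));
    }
}
```

Any URL for a given host always lands on the same robot, so the politeness bookkeeping never has to be shared.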

 > D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

 > E) None of the above (i.e. anything goes)
 > 
 > It's clear from the log files I study that some of the big players are
 > not sticking to 30 seconds.  There are good reasons for this and I
 > consider it a good thing (in moderation).  E.g. retrieving one page from
 > a site every 30 seconds only allows 2880 pages per day to be retrieved
 > from a site and this has obvious "freshness" implications when indexing
 > large sites.

Many large sites are split across several servers. Often these can be
hit in parallel - if your robot is clever enough.

Richard
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Re: Looksmart's robots.txt file

2002-05-30 Thread richard


Rasmus Mohr writes:
 > Yes, that would be the case. For some unknown reason Looksmart allows
 > recognized robots/crawlers/spiders and other non-standard user-agents
 > unlimited access according to the robots.txt - all others are excluded.
 > I'd guess the weird-looking "java" user-agent originates from a Java
 > application running on a platform/JVM unable to set the user-agent property.
 > The guys at Looksmart probably detected it in their logfiles...

I don't think so. I think they just processed the web robots list
automatically. In fact, that's what it says at the top of the
robots.txt file. If you look at
http://www.robotstxt.org/wc/active/html/contact.html
you'll see where it comes from.

 > eh...beef?

Gripes, wrath, criticism, complaints, etc. A general feeling of
displeasure directed at some person or thing.

Richard




[Robots] Re: .NET gatherers/spiders

2002-02-21 Thread Richard Chuang


Hi, Erick,

We are developing one, and I would guess that other companies are
working on .NET spiders as well.

Regards,
Richard

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
Behalf Of Erick Thompson
Sent: Friday, February 22, 2002 7:52 AM
To: [EMAIL PROTECTED]
Subject: [Robots] .NET gatherers/spiders



Does anyone know if there are any open source/free .NET spiders under
development? I am developing a custom search engine system, and don't
want to have to reinvent the wheel, but I would like to have everything in
.NET.

Thanks,
Erick


--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).  For list server commands, send "help" in the
body of a message to "[EMAIL PROTECTED]".





[Robots] Re: Java for spiders

2001-06-14 Thread Richard Wheeldon


cpaul wrote:
> Oh yes, it reminds me: This thread sparked my interest. I hate bugs that
> I can't work around, so I sat down to find out how the HttpURLConnection
> works in Java. And the thing is, it is quite simple to set a timeout for a
> URL connection in Java if you get down to some subclassing.

Unfortunately you can't subclass URL and hence override the
URL.openConnection() method, as URL is final. This is also annoying
because the URL class is flawed in other ways (from the point of view
of robot development), such as performing DNS lookups in URL.equals().
Simply changing url1.equals(url2) to
url1.toString().equals(url2.toString()) can double the speed of a Java
robot. Why Sun felt the need to make this class final is anybody's
guess.
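The speedup comes from avoiding the network entirely: URL.equals() resolves host names to compare them, so every comparison can block on DNS, while comparing external forms is a pure string operation. A minimal sketch of the string-based comparison (the helper name is illustrative):

```java
import java.net.URL;

public class UrlCompare {
    // URL.equals() may perform DNS lookups to compare host names,
    // which can block; comparing the external forms is a pure string
    // operation, which is usually what a robot wants anyway.
    static boolean sameUrl(String a, String b) {
        try {
            URL u1 = new URL(a);
            URL u2 = new URL(b);
            return u1.toExternalForm().equals(u2.toExternalForm());
        } catch (java.net.MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameUrl("http://example.com/a",
                                   "http://example.com/a")); // true, no DNS
    }
}
```

Note the trade-off: two URLs on different host names that resolve to the same machine compare equal under URL.equals() but not under the string comparison, which for deduplication purposes is normally the behaviour you want.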

A second point on the subject of Java: Sun have released the JDK 1.4
beta spec, and it has some interesting features which may be relevant,
such as regexp classes, improved CORBA classes, SSL support, etc.

Richard





[Robots] Re: Java for spiders

2001-06-11 Thread Richard Wheeldon


Lars J. Nilsson wrote:
> > I wouldn't think of Java as the first choice for a high-volume web spider.
> > What are the advantages?

> - Java's built-in network capabilities

except that the HttpURLConnection implementations on most of the Linux
versions suck :( Failure to time out properly is one common bug.

> - HTML parsing is a part of the core language

and the HTML parsing is somewhat stricter than it might be, and needs
to be expanded to do anything useful.

> - Portability (write once debug... sorry, run everywhere)

Sort of true. Perl/C can easily be made source-portable; Java can
easily be made non-portable.

> - Existing n-tier server architectures (JSP, Servlets, EJB, JDBC, JNDI and so on)
> - Easy scalability (possibly through JavaSpaces and Jini)

I'd agree with these, plus the threading issues mentioned in another
post. I still think Java is a suitable language for building a spider,
particularly as most of the work goes into waiting for servers, and
it is perfectly possible (if awkward) to use a separate language
for writing the parsers.
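The "most of the work is waiting" point can be illustrated without touching the network: if each fetch mostly blocks, N fetches run in parallel finish in roughly one fetch's time, not N. A sketch using simulated 200 ms "fetches" (all names and numbers illustrative):

```java
// Simulate N fetches that each block for 200 ms, run them in
// parallel, and measure wall-clock time. Because the "work" is
// waiting, 5 parallel fetches take roughly 200 ms, not 1000 ms.
public class WaitingDemo {
    static void simulatedFetch() {
        try {
            Thread.sleep(200); // stands in for blocking network I/O
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static long timeParallel(int n) {
        long start = System.currentTimeMillis();
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(WaitingDemo::simulatedFetch);
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        System.out.println(timeParallel(5) + " ms for 5 parallel fetches");
    }
}
```

This is why even a modest thread pool keeps a crawler saturated: the threads spend their lives blocked on servers, not burning CPU.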

Look here for some more info on the problems with using java:
http://www.research.compaq.com/SRC/mercator/papers/Java99/final.html

Richard





Commerce Robot Needed

2000-03-24 Thread Richard Barnett

Hello everyone,

Our company is seeking an organization that can create a bot, or has
an existing one, designed to crawl affiliate merchant sites and update
a local Oracle database. The functionality we are seeking is similar
to that of popular comparison sites such as MySimon.com, Dealtime.com,
and Bottomdollar.com.

If you have this expertise please contact me via email or telephone. Thank you.

Richard Barnett
[EMAIL PROTECTED]
Tel: (949) 495-9205
Fax: (949) 495-9215