[Robots] Hit Rate - testing, is this mailing list alive?
Alan Perkins writes:
> What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time taken for the last response. For example, if a site responds in 20 ms, you can hit it again within the same second; if a site takes 4 seconds to respond, leave it at least 30 seconds before trying again.

> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style distribution strategy, this is a non-issue.

> D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

> E) None of the above (i.e. anything goes)
>
> It's clear from the log files I study that some of the big players are
> not sticking to 30 seconds. There are good reasons for this and I
> consider it a good thing (in moderation). E.g. retrieving one page from
> a site every 30 seconds only allows 2880 pages per day to be retrieved
> from a site and this has obvious "freshness" implications when indexing
> large sites.

Many large sites are split across several servers. Often these can be hit in parallel - if your robot is clever enough.

Richard

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
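[Editor's note: a minimal Java sketch of the adaptive hit-rate rule described above - wait several times longer than the server's last response time before hitting the same host again. The multiplier and minimum delay are illustrative assumptions, not values prescribed in the thread.]

import java.util.HashMap;
import java.util.Map;

public class PolitenessDelay {
    private static final long MULTIPLIER = 10;      // assumed value for "several times longer"
    private static final long MIN_DELAY_MS = 1000;  // assumed floor so fast hosts are not hammered

    private final Map<String, Long> lastResponseMs = new HashMap<>();

    // Record how long the host took to answer the most recent request.
    public void recordResponse(String host, long responseTimeMs) {
        lastResponseMs.put(host, responseTimeMs);
    }

    // Delay to leave before hitting the same host again.
    public long nextDelayMs(String host) {
        long last = lastResponseMs.getOrDefault(host, 0L);
        // A 20 ms response allows another hit within about a second;
        // a 4 second response means waiting tens of seconds.
        return Math.max(MIN_DELAY_MS, last * MULTIPLIER);
    }

    public static void main(String[] args) {
        PolitenessDelay delay = new PolitenessDelay();
        delay.recordResponse("www.example.com", 4000);                  // a 4 second response...
        System.out.println(delay.nextDelayMs("www.example.com") + " ms"); // ...means waiting 40000 ms
    }
}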
[Robots] Re: Looksmart's robots.txt file
Rasmus Mohr writes:
> Yes, that would be the case. For some unknown reason Looksmart allows
> recognized robots/crawlers/spiders and other non-standard user-agents
> unlimited access according to the robots.txt - all others are excluded.
> I'd guess the weird-looking "java" user-agent originates from a Java
> application running on a platform/JVM unable to set the user-agent property.
> The guys at Looksmart probably detected it in their logfiles...

I don't think so. I think they just processed the web robots list automatically. In fact, that's what it says at the top of the robots.txt file. If you look at http://www.robotstxt.org/wc/active/html/contact.html you'll see where it comes from.

> eh...beef?

Gripes, wrath, criticism, complaints, etc. A general feeling of displeasure directed at some person or thing.

Richard
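[Editor's note: for illustration only, the pattern described above - recognized robots allowed everywhere, everyone else excluded - would look roughly like this in a robots.txt file. This is a hypothetical sketch; the user-agent names are assumptions, not Looksmart's actual entries.]

# Recognized robots get unrestricted access (an empty Disallow permits all paths)...
User-agent: Scooter
Disallow:

User-agent: Googlebot
Disallow:

# ...and all other user-agents are excluded entirely.
User-agent: *
Disallow: /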
[Robots] Re: .NET gatherers/spiders
Hi Erick,

We are developing one, and I would guess there are other companies doing .NET spiders as well.

Regards,
Richard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Erick Thompson
Sent: Friday, February 22, 2002 7:52 AM
To: [EMAIL PROTECTED]
Subject: [Robots] .NET gatherers/spiders

Does anyone know if there are any open source/free .NET spiders under development? I am developing a custom search engine system and don't want to reinvent the wheel, but I would like to have everything in .NET.

Thanks,
Erick

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
[Robots] Re: Java for spiders
cpaul wrote:
> Oh yes, it reminds me: This thread sparked my interest. I hate bugs that
> I can't work around, so I sat down to find out how the HttpURLConnection
> works in Java. And the thing is, it is quite simple to set a timeout for a
> URL connection in Java if you get down to some subclassing.

Unfortunately you can't subclass URL and hence override the URL.openConnection() method, as URL is final. This is also annoying because the URL class is flawed in other ways (from the point of view of robot development), such as using DNS lookups for URL.equals(). Simply changing url1.equals(url2) to url1.toString().equals(url2.toString()) can double the speed of a Java robot. Why Sun felt the need to make this class final is anybody's guess.

A second point on the subject of Java: Sun has released the JDK 1.4 beta spec, and it has some interesting features which may be relevant, such as regexp classes, improved CORBA classes, SSL support, etc.

Richard

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
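[Editor's note: a minimal sketch of the speed-up mentioned above - comparing URLs as strings rather than with URL.equals(), which may resolve host names via DNS before comparing. The class name and URL values are illustrative assumptions.]

import java.net.URL;

public class UrlCompare {
    public static void main(String[] args) throws Exception {
        URL url1 = new URL("http://www.example.com/a.html");
        URL url2 = new URL("http://www.example.com/b.html");

        // Slow path: URL.equals() can trigger DNS lookups for both hosts.
        boolean slow = url1.equals(url2);

        // Fast path: a plain string comparison, as suggested in the post.
        boolean fast = url1.toString().equals(url2.toString());

        System.out.println(slow + " " + fast);
    }
}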
[Robots] Re: Java for spiders
Lars J. Nilsson wrote:
> > I wouldn't think of Java as the first choice for a high-volume web spider.
> > What are the advantages?
>
> - Java's built-in network capabilities

Except that the HttpURLConnection implementations on most of the Linux versions suck :( Failure to time out properly is one common bug.

> - HTML parsing is a part of the core language

And the HTML parsing is somewhat stricter than it might be, and needs to be expanded to do anything useful.

> - Portability (write once debug... sorry, run everywhere)

Sort of true. Perl/C can easily be made source-portable; Java can easily be made non-portable.

> - Existing n-tier server architectures (JSP, Servlets, EJB, JDBC, JNDI and so on)
> - Easy scalability (possibly through JavaSpaces and Jini)

I'd agree with these, plus the threading issues mentioned in another post. I still think Java is a suitable language for building a spider, particularly as most of the work goes into waiting for servers, and it is perfectly possible (if awkward) to use a separate language for writing the parsers. Look here for some more info on the problems with using Java:

http://www.research.compaq.com/SRC/mercator/papers/Java99/final.html

Richard

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
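[Editor's note: a minimal sketch of the point above - because a spider spends most of its time waiting on servers, running many fetches in parallel keeps it busy. This uses thread-pool and timeout APIs from JDKs later than the 1.4 beta under discussion; the URLs, pool size and timeouts are illustrative assumptions.]

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFetcher {
    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://www.example.com/",
                "http://www.example.org/");

        ExecutorService pool = Executors.newFixedThreadPool(8); // assumed pool size
        for (String u : urls) {
            pool.submit(() -> {
                try {
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(u).openConnection();
                    conn.setConnectTimeout(10_000); // avoid hanging on unresponsive hosts
                    conn.setReadTimeout(10_000);
                    try (InputStream in = conn.getInputStream()) {
                        System.out.println(u + ": " + in.readAllBytes().length + " bytes");
                    }
                } catch (Exception e) {
                    System.out.println(u + ": failed (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
    }
}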
Commerce Robot Needed
Hello everyone,

Our company is seeking an organization that can create a bot, or that has an existing one, designed to crawl Affiliate Merchant sites and update a local Oracle database. The functionality we are seeking is similar to popular comparison sites such as MySimon.com, Dealtime.com and Bottomdollar.com. If you have this expertise please contact me via email or telephone.

Thank you.

Richard Barnett
[EMAIL PROTECTED]
Tel: (949) 495-9205
Fax: (949) 495-9215