RE: [Robots] Hit Rate - testing is this mailing list alive?
Hello Robots list,

Well, maybe this list can finally put to rest a great deal of the 30-second-wait issue. Can we all collectively research an adaptive routine? We all need a common code routine that all our spidering modules and connective programs can use, especially when we wish to get as close to the Ethernet optimum (about 80% of true max, I believe) without getting ourselves into the DoS zone beyond that point, where signal collisions start causing failures and the repeat and competing signals effectively collapse the Ethernet communications medium.

Can we not, therefore, settle the issue of finding the balancing point that determines optimum throughput from networks and servers at any given time? Can we not determine the optimum mathematical formula, then program it into our libraries of code, so our spiders can all follow it?

So, in this effort: has anyone found, started to build, or can recommend the building blocks of such an adaptive routine? Can this list supply us all with THE de facto real-time adaptive throttling routine? A routine that tracks and adapts to ever-changing conditions by taking real-time network measurements, feeding them through the formula, and producing the optimum wait time before connecting to the same server again. The wait time resets after each ACK packet from the target server. Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially set through user input, as some users will need to max out their dedicated network communication lines (such as adapter-card-to-adapter-card isolation work on very controlled networks). I suggest an input of 0 for that kind of work. The default setting of 1 would result in the optimal time determined by the formula. Any other integer would simply multiply the time delay between server connections. In this way the user could throttle the spider down to the needs of the local network and servers.
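The structure proposed above (a measured per-server response time scaled by a user multiplier: 0 = no wait for dedicated links, 1 = the formula's optimum, n = n times the optimum) can be sketched as follows. All names and the `BASE_FACTOR` constant are hypothetical — the thread is asking *for* the formula, so this only wires up the agreed shape:

```python
class AdaptiveThrottle:
    """Sketch of the adaptive wait-time routine proposed above.

    The BASE_FACTOR scale is an assumption, not a measured optimum.
    """

    BASE_FACTOR = 10.0  # assumed: wait ~10x the last response time

    def __init__(self, multiplier=1):
        self.multiplier = multiplier
        self.last_response = {}  # host -> last response time (seconds)

    def record(self, host, response_seconds):
        # Call on each completed request; this is the "reset after each
        # ACK from the target server" step in the proposal.
        self.last_response[host] = response_seconds

    def delay_for(self, host):
        # Optimum wait before reconnecting to the same server.
        if self.multiplier == 0:
            return 0.0
        base = self.last_response.get(host, 1.0)  # assume 1 s when unmeasured
        return self.multiplier * self.BASE_FACTOR * base
```

With this shape, a 20 ms response at multiplier 1 yields a 0.2 s wait, and a user on an isolated test network sets the multiplier to 0 to run flat out.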
-Thomas Kay

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing is this mailing list alive?

Alan Perkins writes:

> What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time taken for the last response. E.g. if a site responds in 20 ms, you can hit it again the same second. If a site takes 4 seconds to respond, leave it at least 30 seconds before trying again.

> B) The number of robots you are running (e.g. 30 seconds per site per robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style distribution strategy, this is a non-issue.

> D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

> E) None of the above (i.e. anything goes)

It's clear from the log files I study that some of the big players are not sticking to 30 seconds. There are good reasons for this, and I consider it a good thing (in moderation). E.g. retrieving one page from a site every 30 seconds only allows 2880 pages per day to be retrieved from that site, and this has obvious freshness implications when indexing large sites. Many large sites are split across several servers. Often these can be hit in parallel - if your robot is clever enough.

Richard

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
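Richard's rule of thumb (20 ms response: hit again within the second; 4 s response: wait at least 30 s) reduces to a single response-time factor; 7.5 reproduces his 4 s / 30 s example exactly, though the constant itself is my assumption. His "per site across all your robots" point means the state must be shared, which the small scheduler below assumes:

```python
def politeness_interval(response_seconds, factor=7.5):
    """Interval before hitting the same site again: several times the
    last response time. factor=7.5 maps a 4 s response to a 30 s wait,
    matching the example above; the exact constant is an assumption."""
    return factor * response_seconds


class CrawlScheduler:
    """One shared instance across all robots, so the interval applies
    per site globally rather than per robot."""

    def __init__(self):
        self.next_allowed = {}  # host -> earliest next-fetch timestamp

    def ready(self, host, now):
        return now >= self.next_allowed.get(host, 0.0)

    def mark_fetched(self, host, now, response_seconds):
        # Record when this host may next be contacted.
        self.next_allowed[host] = now + politeness_interval(response_seconds)
```

A Mercator-style setup, where each host is owned by exactly one robot, makes the shared map unnecessary, which is why Richard calls it a non-issue there.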
Re: [Robots] semantic markup
Human Resources Development Canada / Développement des ressources humaines Canada
__________

Is anyone working on a robot that marks up (semantic web style) crawled content and makes it available to new robots not yet so semantically aware? A leader/teacher semantic robot building an index that all new robots/agents can go to, to check their own budding semantic guesswork. As the robots become more aware, they share the semantic markup, cross-reference each other's work, and build greater assurance in their own semantic markup methods. Anyone?

-Thomas Kay
Senior Analyst, Enterprise Information Management Services
Information Resource Management (IRM), HRDC Systems
[EMAIL PROTECTED] (819) 956-1502

-- Original Text --
From: Paul Maddox [EMAIL PROTECTED], on 11/8/2002 4:42 AM:

Hi,

I'm sure even Google themselves would admit that there's scope for improvement. With Answers, Catalogs, Image Search, News, etc., they seem to be quite busy! :-)

As an AI programmer specialising in NLP, personally I'd like to see web bots actually 'understanding' the content they review, rather than indexing by brute force. How about the equivalent of Dmoz or Yahoo Directory, but generated by a web spider?

Paul.

On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote:

> Haven't seen traffic in ages. I guess the theme's pretty much dead. What's there to invent after Google?
> -h
[Robots] Re: Multilanguage robots
Handling multiple character sets within the same file is still a problem. Sometimes the agent encounters a multiple-language file. At times the file appears to use overlapping character sets: character sets like CP1252 and ISO-8859-1 are mixed (and browsers tolerate it, so the source is never corrected!). The agents also encounter the above mixed with HTML-encoded characters used in one part of a file while another part of the same file is encoded as CP1252 and/or ISO-8859-1. This makes the summaries a bit difficult to build and display correctly.

Does anyone have a best practice for a robot agent handling such a multi-part, Rosetta-stone-like translation file? Does anyone have a best practice for encoding such a file?

-Thomas

-- Original Text --
From: Art Pollard [EMAIL PROTECTED], on 4/6/2002 9:32 AM:

At 06:43 PM 4/5/2002 -0800, you wrote:

> I'm working on a multi-language spider, and I've come to a point where I'm not sure what assumption to make.

BIG SNIP

The solution to your problem is to use a language identifier. A language identifier is capable of recognizing not only what language a document is in but also what character set is in use. So all you need to do is download the page and throw it at a language identifier, and it will tell you the language and character set. Or you could do it a paragraph at a time, just in case you are dealing with a mixed-language document.

Just so happens we market one. ;-) It supports ~230 languages in a variety of different character sets, in addition to UTF-8 and Unicode Big/Little Endian. You can play with a simple demo at: http://www.languageidentifier.com/ (Though Chinese isn't included in the demo.) We developed it originally to assist with doing language-specific crawling, among other things.

Interestingly enough, we are finishing up work on a Chinese text segmentation system. (This puts the spaces into Chinese text so that you can index it and search it more efficiently.)
If interested, please contact me at: [EMAIL PROTECTED]

-Art

--
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.
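Short of a full language identifier like Art's, a robot can at least decode the mixed CP1252/ISO-8859-1/entity soup Thomas describes the way forgiving browsers do. The fallback order below is an assumption, not a standard: strict UTF-8 first, then cp1252 (which covers the ISO-8859-1 range and additionally maps the 0x80-0x9F smart-quote bytes), then HTML character-reference expansion for the entity-encoded parts of the same file:

```python
import html

def decode_tolerantly(raw: bytes) -> str:
    """Decode crawled bytes roughly the way tolerant browsers do.

    Assumed strategy: UTF-8 if it validates, else cp1252; finally
    expand HTML character references, since one file may mix raw
    cp1252 bytes with HTML-encoded characters.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps cp1252's few undefined bytes from
        # aborting the summary build.
        text = raw.decode("cp1252", errors="replace")
    return html.unescape(text)
```

This still cannot untangle a file whose paragraphs genuinely alternate between incompatible encodings; per Art's suggestion, running identification a paragraph at a time is the more robust route for those.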
[Robots] Re: User agent
This does lead to the question: what importance do those *ahem* reference *ahem* links actually have to offer a spider? Is the content supplier making a default statement on the value of those references to a spider? Such as: "The wrapper/ad links that I will not show you, Mr. Spider, have zero or near-zero value in relation to the centre content of the page and the links within that body of information." Careful: those links may be the eyeball-grabbing links shown to IE users that say "click me for something completely different, Mr. Monty Python."

Take care that the link-reference counts you use to assign referencing value or weight are not skewed. It may be wise to compare the links returned with and then without the IE agent identity being assumed by the spider. Try to apply that comparative knowledge in the link-weight evaluations for the referential-strength assessment you do on the content of the page (until the Semantic Web does this for us). If the body links are OK, I generally have a higher trust level in those links being real reference links.

Content creators exercise higher due diligence for their article than the content publishers who add wrapper-style navigation, IMHO; but I think everyone can give you examples of the other extreme. (Yet would not those excellent publishers take steps to allow your spider a quality experience in walking their offerings without the need for an IE agent disguise, so as to bring in more readers to a quality site?)

-Thomas Kay

-- Original Text --
From: Erick Thompson [EMAIL PROTECTED], on 28/02/2002 2:32 PM:

A lot of navigation scripts are set up to work with IE or Netscape, and the fall-through navigation HTML is sometimes incomplete or doesn't work at all.

Erick

----- Original Message -----
From: Oskar Bartenstein [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, February 27, 2002 9:06 PM
Subject: [Robots] Re: User agent

What would be a good motivation to do so? Server side benefit? Spider side benefit?
Wed, 27 Feb 2002 19:23:01 -0800 Erick Thompson [EMAIL PROTECTED] said:

> I'm trying to construct a user-agent string that emulates IE5/6, but also

--
Dr. Oskar Bartenstein [EMAIL PROTECTED]
IF Computer Japan http://www.ifcomputer.com

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
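Thomas's suggestion — fetch the page once as IE and once as a plain robot, then compare the resulting link sets — can be sketched without any network code. The function names, and the assumption that the set difference alone identifies the low-weight "wrapper" class, are mine; a real evaluator would also weigh each link's position in the page:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gather href targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

def wrapper_only_links(ie_html, plain_html):
    """Links served only when the spider masquerades as IE --
    candidates for down-weighting in referential-strength scoring."""
    ie, plain = LinkCollector(), LinkCollector()
    ie.feed(ie_html)
    plain.feed(plain_html)
    return ie.links - plain.links
```

Links present in both fetches are the "body links" Thomas trusts more; links that appear only under the IE disguise are the eyeball-grabbing navigation candidates.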
[Robots] Re: Correct URL, slash at the end?
Crazy thought... This is where the robots.txt file could be used to hold that information for the robot agents that need to know the operational order of the "/" default names used on that service:

User-agent: *
Slash: default.htm, default.html, index.htm, index.html, welcome.html, sitemap.html

Robots could then be informed of this little detail.

-Thomas Kay

--

> I like that crazy thought. I think it would be handy for robots, although it would be error-prone, because the default file name is configured in the Web server's config files, and robots.txt would have to be manually kept in sync.
> Otis

If one crazy idea leads to another... then if the above did get into the robots.txt spec, the web servers could edit that Slash part of the robots.txt file themselves. When the web-server config files holding the default file list detect a change event, the web admin is asked whether they also wish to update robots.txt: "Update your robots.txt file in the [doc root] to include the change in the Slash: list?" The task is so simple the web-server programmers would fight to do it just to be first.

User-agent: *
Slash: (old web server config list)

...becomes:

User-agent: *
Slash: (new web server config list)

Alas, it is a crazy idea on how the web servers could relay these important details to the robots and agents that need them.

-Thomas Kay

PS: Canadian hockey-fan/hacker quotation: slash-U-later-A
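The client side of this proposal really would be only a few lines. To be clear, "Slash:" is NOT part of the robots.txt standard; the parser below is purely illustrative of how cheap the robot's half of the crazy idea would be:

```python
def parse_slash_directive(robots_txt):
    """Parse the hypothetical 'Slash:' line proposed above.

    Returns the default-file names in the order the server would
    try them, or an empty list if no such line is present.
    """
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("slash:"):
            names = line.split(":", 1)[1]
            return [n.strip() for n in names.split(",") if n.strip()]
    return []
```

A fuller version would scope the directive to the matching User-agent group, as real robots.txt parsers do for Allow/Disallow.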
[Robots] Re: Correct URL, slash at the end?
I guess it depends on what you are asking to have returned. (And this brings up another robots.txt question... below.)

http://www.abc.de/xyz

Asking for the directory (where the service is allowed to redirect to a temporary default file list, or to another default file, if the service doesn't wish to send you the whole directory).

or

http://www.abc.de/xyz/

Asking directly for the file list or default file of that directory.

A best practice is to add the slash if you are really asking for the default file list or index of that directory when the default file name is not known.

It would be great to know how to ask the HTTP service for the list of default or index file names, so agents could verify which file name was indeed associated with the "/" slash. We could then put the file name on the URL to completely qualify that URL path. Anyone?

We can scan through all the default names for each known HTTP service, but almost all the services I have dealt with allow customization of that default name. The complexity is that the default name can be a list, not a single file name, on the service; so the order in which the service checks for the first matching default name is a concern. This is why I would like to know how agents can query HTTP services for the default name list, with the returned names listed in order of operation. Or at least, why have the web services not added such a useful service query? (Was it just never done, or is there some known security issue?) Anyone?

- - - - -

Crazy thought... This is where the robots.txt file could be used to hold that information for the robot agents that need to know the operational order of the "/" default names used on that service:

User-agent: *
Slash: default.htm, default.html, index.htm, index.html, welcome.html, sitemap.html

The above is just for consideration if robots.txt is ever updated, so the robots could be informed of this little detail.
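Absent a standard query for the default-name list, the scanning approach described above can be sketched as follows. The probing function takes an injected `head` callable (URL to HTTP status code) so the sketch stays network-free; with urllib it would issue HEAD requests. Note that a 200 on a candidate does not strictly prove it is the slash default — comparing response bodies would be a stronger check — so this is a heuristic, not a verification:

```python
def resolve_default_file(directory_url, candidate_names, head):
    """Probe candidate default-file names in server order and return
    the fully qualified URL of the first that exists, else None.

    `head` is a callable mapping a URL to an HTTP status code,
    injected so the sketch can be exercised without a live server.
    """
    if not directory_url.endswith("/"):
        directory_url += "/"
    for name in candidate_names:
        url = directory_url + name
        if head(url) == 200:
            return url
    return None
```

The `candidate_names` list is exactly what the hypothetical Slash: directive would supply; without it, a robot can only guess at both the names and their order.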
-Thomas Kay

-- Original Text --
From: Matthias Jaekle [EMAIL PROTECTED], on 21/11/2001 11:49 AM:

Hello,

I read about adding a slash at the end of URLs if there is no absolute path present. But what about paths ending in subdirectories (xyz)? A link to http://www.abc.de/xyz/ might be more correct than the link to http://www.abc.de/xyz. But is there a possibility to find out whether somebody who wrote http://www.abc.de/xyz meant http://www.abc.de/xyz/ ? In my database of scanned URLs I found both versions, so I believe I analysed many files twice. How do I handle this circumstance correctly?

Many thanks for your help

Matthias
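For Matthias's immediate duplicate problem — both the slashed and slashless forms in the crawl database — a common workaround is to canonicalize URLs to a single key before scheduling. The "no dot in the last path segment means directory" rule below is a heuristic assumption, not a guarantee; a server may legitimately serve different content at each form:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_key(url):
    """Key under which /xyz and /xyz/ collapse to one crawl entry.

    Heuristic: if the last path segment contains no dot, treat it as
    a directory and append the trailing slash. Scheme and host are
    lowercased; fragments are dropped. All of this is an assumption
    about site layout, not an HTTP rule.
    """
    scheme, netloc, path, query, _ = urlsplit(url)
    last = path.rsplit("/", 1)[-1]
    if path and not path.endswith("/") and "." not in last:
        path += "/"
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))
```

Keying the visited-set on `canonical_key(url)` prevents analysing the same directory twice, at the cost of occasionally conflating two genuinely distinct resources.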