RE: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread thomas.kay
Hello Robots list

Well, maybe this list can finally put to rest a great deal of the 30-second wait 
issue.

Can we all collectively research an adaptive routine?

We all need a common code routine that all our spidering modules and connective 
programs can use.  

Especially when we wish to get as close as possible to the Ethernet optimum (about 
80% of true max, I believe) without getting ourselves into the DoS zone beyond 
roughly 80% of Ethernet max, where signal collisions start to cause failures and 
the retries and competing signals effectively collapse the Ethernet communications 
medium.

Can we not, therefore, settle the issue of finding the balancing point in determining 
optimum throughput from networks and servers at any given time?   

Can we not determine the optimum mathematical formula, then program it into our 
code libraries, so that all our spiders can follow this formula?

So, in this effort: has anyone found, started to build, or can anyone recommend 
the building blocks of such an adaptive routine?

Can this list supply us all with THE de facto real-time adaptive throttling routine?

A routine that tracks and adapts to the ever-changing conditions by taking real-time 
network measurements, feeding them through the formula, and producing the optimum 
wait time before connecting to the same server again.  The wait time resets after 
each ACK packet from the target server.

Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially 
set through user input, as some users will need to max out their dedicated network 
communication lines (such as adapter-card-to-adapter-card isolation work on very 
controlled networks).  I suggest an input of 0 for that kind of work.  The default 
setting of 1 will result in the optimal time determined by the formula.  Any other 
integer would simply multiply the time delay between server connections.  In this 
way the user could throttle it down to the needs of the local network and servers.
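
To make the shape of such a routine concrete, here is a minimal Python sketch.  The 
formula inside it (multiplier x factor x last response time, with an arbitrary 
factor of 7.5) is only a placeholder assumption, and the 0/1/n multiplier is the 
config input described above; it is an illustration, not THE de facto routine being 
asked for.

import time

class AdaptiveThrottle:
    """Sketch of a real-time adaptive throttling routine (placeholder formula).

    multiplier = 0 -> no delay (dedicated, isolated lines)
    multiplier = 1 -> the formula's optimum (default)
    multiplier > 1 -> throttle further down for the local network and servers
    """

    def __init__(self, multiplier: float = 1.0, factor: float = 7.5):
        self.multiplier = multiplier
        self.factor = factor        # assumed "several times the last response time"
        self.next_allowed = {}      # host -> earliest time of the next request

    def wait_for(self, host: str) -> None:
        """Sleep until this host may be contacted again."""
        delay = self.next_allowed.get(host, 0.0) - time.time()
        if delay > 0:
            time.sleep(delay)

    def record_response(self, host: str, response_time_s: float) -> None:
        """Reset the wait timer after each response (ACK) from the target server."""
        wait = self.multiplier * self.factor * response_time_s
        self.next_allowed[host] = time.time() + wait

A spider would call wait_for() before each request to a host and record_response() 
with the measured response time afterwards.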

-Thomas Kay



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing is this mailing list alive?


Alan Perkins writes:
  What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time
taken for the last response. E.g. if a site responds in 20 ms,
you can hit it again the same second. If a site takes 4 seconds
to respond, leave it at least 30 seconds before trying again.
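
As a quick illustration of that rule of thumb, a factor of about 7.5 happens to 
reproduce both examples (the factor itself is an assumption, not a number Richard 
gives):

def polite_interval(last_response_s: float, factor: float = 7.5) -> float:
    """Wait several times longer than the last response took."""
    return last_response_s * factor

print(polite_interval(0.020))   # ~0.15 s: you can hit the site again the same second
print(polite_interval(4.0))     # 30.0 s: leave it at least 30 seconds before retrying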

  B) The number of robots you are running (e.g. 30 seconds per site per
  robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style
distribution strategy, this is a non-issue.
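
For context, the Mercator approach partitions hosts across crawler processes, so 
each host is only ever fetched by one robot and the per-host delay is enforced in a 
single place.  A minimal sketch of that assignment (the hashing scheme here is only 
an illustration, not Mercator's exact design):

import hashlib

def robot_for_host(host: str, num_robots: int) -> int:
    """Assign every URL on a given host to the same robot, so only that robot
    ever has to enforce the per-host politeness delay."""
    digest = hashlib.md5(host.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_robots

# All URLs on www.example.com land on the same worker, however the host is spelled.
assert robot_for_host("www.example.com", 30) == robot_for_host("WWW.EXAMPLE.COM", 30)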

  D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

  E) None of the above (i.e. anything goes)
  
  It's clear from the log files I study that some of the big players are
  not sticking to 30 seconds.  There are good reasons for this and I
  consider it a good thing (in moderation).  E.g. retrieving one page from
  a site every 30 seconds only allows 2880 pages per day to be retrieved
  from a site and this has obvious freshness implications when indexing
  large sites.

Many large sites are split across several servers. Often these can be
hit in parallel - if your robot is clever enough.
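
One way a robot might spot such a site (a suggestion only, not something Richard 
spells out) is to look at DNS: a host name resolving to several addresses may 
indicate several physical servers, each of which could get its own politeness delay 
(provided you are reasonably sure they are not one overloaded box behind a load 
balancer):

import socket

def server_ips(host: str, port: int = 80) -> set[str]:
    """Resolve a host name to the set of IP addresses that serve it.
    Several addresses *may* mean several physical servers that a clever
    robot could crawl in parallel, with one politeness delay per address."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}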

Richard
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] semantic markup

2002-11-08 Thread thomas.kay
Human Resources Development Canada / Développement des ressources humaines Canada
__

Anyone working on a robot that marks up (semantic web style) crawled 
content and makes it available to new robots not yet so semantically 
aware?  A leader/teacher semantic robot building an index that all new 
robots/agents can go to to check their own budding semantic guesswork.  As 
the robots become more aware, they share the semantic markup, cross-reference 
each other's work, and build greater assurance in their own semantic markup 
methods.

Anyone?

-Thomas Kay
Senior Analyst, Enterprise Information Management Services
Information Resource Management (IRM), HRDC Systems
[EMAIL PROTECTED]
(819)956-1502
-- Original Text --

From: Paul Maddox [EMAIL PROTECTED], on 11/8/2002 4:42 AM:

Hi,

I'm sure even Google themselves would admit that there's scope for 
improvement.  With Answers, Catalogs, Image Search, News, etc., etc., 
they seem to be quite busy! :-)

As an AI programmer specialising in NLP, personally I'd like to see 
web bots actually 'understanding' the content they review, rather 
than indexing by brute force.  How about the equivalent of Dmoz or 
Yahoo Directory, but generated by a web spider?

Paul.


On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote:
Haven't seen traffic in ages.
I guess the theme's pretty much dead.

What's there to invent after Google?

-h



___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots




[Robots] Re: Multilanguage robots

2002-04-08 Thread thomas.kay


Handling multiple character sets within the same file is still a problem. 
Sometimes the agent encounters a multi-language file.  At times the file 
apparently uses overlapping character sets: character sets like CP1252 
and ISO8859-1 are both used (and browsers tolerate it, so the source is not 
corrected!).  The agents are encountering the above with a mix of HTML-encoded 
characters used in one part of the file while another part of the same file is 
encoded as CP1252 and/or ISO8859-1.  This makes the summaries a bit difficult 
to build and display correctly.

Anyone have a best practice for a robot agent handling such a multi-part, 
Rosetta-Stone-like translation file?  Anyone have a best practice for encoding 
such a file?
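
No standard answer seems to exist, but one pragmatic workaround (a sketch only, not 
an established best practice) is to decode the raw bytes as CP1252, which covers 
the printable range of ISO8859-1 as well, and to unescape the HTML character 
references afterwards so both forms end up as the same Unicode text:

import html

def normalize_text(raw: bytes) -> str:
    """Best-effort normalization of a page mixing CP1252/ISO8859-1 bytes
    with HTML character references."""
    try:
        # CP1252 assigns printable characters to 0x80-0x9F, where ISO8859-1
        # only has control codes, so it is the safer first guess.
        text = raw.decode("cp1252")
    except UnicodeDecodeError:
        # ISO8859-1 maps every byte to *some* character, so it never fails.
        text = raw.decode("iso8859-1")
    # Resolve &eacute;, &#233;, etc. so entity-encoded and raw-byte text
    # become identical Unicode in the summaries.
    return html.unescape(text)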

-Thomas
-- Original Text --

From: Art Pollard [EMAIL PROTECTED], on 4/6/2002 9:32 AM:


At 06:43 PM 4/5/2002 -0800, you wrote:

I'm working on a multi language spider, and I've come to a point where I'm
not sure what assumption to make.

BIG SNIP

The solution to your problem is to use a language identifier.
A language identifier is capable of recognizing not only what
language it is but also what character set is in use.  So all you
need to do is to download the page and throw it at a language
identifier and it will tell you what language and character set
it is.  Or, you could do it a paragraph at a time just in case
you are dealing with a mixed language document.
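
(Art's identifier is a commercial product; purely to illustrate the 
paragraph-at-a-time idea with open-source parts, the langdetect package being an 
arbitrary choice here and not Art's tool, a sketch might look like this:

from langdetect import detect                      # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

def languages_by_paragraph(text: str) -> list[tuple[str, str]]:
    """Tag each paragraph with a detected language code, so a mixed
    English/French page can be handled per language."""
    tagged = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        try:
            tagged.append((detect(para), para))
        except LangDetectException:
            tagged.append(("unknown", para))       # too little text to guess
    return tagged
)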

Just so happens we market one. ;-)  It supports ~230 languages
in a variety of different character sets in addition to UTF-8, and
Unicode Big/Little Endian.  You can play with a simple demo at:
http://www.languageidentifier.com/ (Though Chinese isn't included
in the demo.)

We developed it originally to assist with doing language specific
crawling among other things.  Interestingly enough, we are
finishing up work on a Chinese text segmentation system.
(This puts the spaces into Chinese text so that you can index it
and search it more efficiently.)

If interested, please contact me at: [EMAIL PROTECTED]

-Art
-- 
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.







[Robots] Re: User agent

2002-02-28 Thread thomas.kay



This does lead to the question: what importance level do those *ahem* 
reference *ahem* links actually have to offer a spider?  Is the content 
supplier making a default statement on the value of the references therein 
to a spider?  Such as: the wrapper reference/ad links that I will not show 
you, Mr. Spider, have zero or near-zero value in relation to the centre 
content of the page and the links within that body of information.  Careful, 
though: those links may be the IE-user eyeball-grabbing links that say click 
me for something completely different, Mr. Monty Python.  Take care that the 
link reference counts you use to apply referencing value or weight are not 
skewed.  It may be wise to compare the links returned with and then without 
the IE agent identity being assumed by the spider, and to apply that 
comparative knowledge in the link-weight evaluations for the referential 
strength assessment you do on the content of the page (until the Semantic 
Web does this).

If the body links are OK, I generally have a higher trust level in those 
links being real reference links.  Content creators exercise more due 
diligence over their article than the content publishers that add wrapper-style 
navigation around it, IMHO; but I think everyone can give you examples of 
the other extreme.  (Yet would not those excellent publishers take steps to 
allow your spider a quality experience in walking their offerings without the 
need for an IE agent disguise, so as to bring more readers to a quality 
site?)
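
As a concrete version of that with-and-without comparison, here is a rough sketch; 
the user-agent strings and the crude href regex are illustrative choices, not 
anything prescribed on this list:

import re
import urllib.request

SPIDER_UA = "ExampleBot/1.0 (+http://www.example.com/bot.html)"   # hypothetical spider identity
IE_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"      # IE disguise

LINK_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def links_as(url: str, user_agent: str) -> set[str]:
    """Fetch a page under a given User-Agent and return its href targets."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return set(LINK_RE.findall(body))

def ie_only_links(url: str) -> set[str]:
    """Links served to the IE disguise but not to the honest spider:
    candidates for reduced weight in the link reference counts."""
    return links_as(url, IE_UA) - links_as(url, SPIDER_UA)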

-Thomas Kay
-- Original Text --

From: Erick Thompson [EMAIL PROTECTED], on 28/02/2002 2:32 PM:


A lot of navigation scripts are set up to work with IE or Netscape, and the
fall-through navigation HTML is sometimes incomplete or doesn't work at all.

Erick


- Original Message -
From: Oskar Bartenstein [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, February 27, 2002 9:06 PM
Subject: [Robots] Re: User agent



 What would be a good motivation to do so?
 Server side benefit? Spider side benefit?

 Wed, 27 Feb 2002 19:23:01 -0800 Erick Thompson [EMAIL PROTECTED] said:
  I'm trying to construct a user agent string that emulates IE5/6, but
also

 --
 Dr. Oskar Bartenstein [EMAIL PROTECTED]
 IF Computer Japan  http://www.ifcomputer.com




--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: Correct URL, slash at the end?

2001-11-22 Thread thomas.kay


 Crazy thought...
 
 This is where the robots.txt file could be used to hold that
 information for the robot agents that need to know the operational order
 of the / default names used on that service.
 
 User-agent: *
 Slash: default.htm, default.html, index.htm, index.html, welcome.html, 
sitemap.html

 
 The above is just for consideration if the robots.txt is ever updated so the
 robots could be informed of this little detail.
 
 -Thomas Kay
 --

I like that crazy thought.  I think it would be handy for robots,
although it would be error-prone, because the default file name is
configured in the Web server config files, and robots.txt would have to
be manually kept in sync.

Otis


If one crazy idea leads to another... then if the above did get into the 
robots.txt spec, the web services could then edit that Slash part of the 
robots.txt file themselves.  When the web service config file holding that 
default file list detects a change event, the web admin is then asked whether 
they also wish to update the robots.txt.

Update your robots.txt file in the [doc root] to include the change in the 
Slash: list?

The crazy task is so simple that the web server programmers would fight to do it 
just to be the first.

User-agent: *
Slash: (old web server config list)

..becomes

User-agent: *
Slash: (new web server config list)

Alas, it is just a crazy idea for how the web servers could relay these important 
details to the robots and agents that need them.
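
For what it is worth, consuming such a directive on the robot side would be trivial 
too.  The sketch below treats Slash: as the hypothetical extension proposed in this 
thread; it is not part of the robots.txt standard, and the parsing (which ignores 
User-agent grouping) is only an illustration:

import urllib.request

def fetch_slash_list(site: str) -> list[str]:
    """Read the hypothetical 'Slash:' directive from a site's robots.txt,
    returning the default file names in the server's order of operation."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "slash":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []

# e.g. fetch_slash_list("http://www.example.com") might return
# ["default.htm", "default.html", "index.htm", ...] on a site that adopted it.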

-thomas kay
PS: Canadian hockey-fan/hacker quotation: slash-U-later-A 


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




[Robots] Re: Correct URL, slash at the end?

2001-11-21 Thread thomas.kay


I guess it depends on what you are asking to have returned.  (And this brings 
up another robots.txt question, below.)

http://www.abc.de/xyz
Asking for the directory (where the service is allowed to redirect to a 
temporary default file list or another default file as a reply if the service 
doesn't wish to send you the whole directory).

or

http://www.abc.de/xyz/  
Asking directly for the file list or default file of that directory.

A best practice is to add the slash if you are really asking for the default 
file list or index for that directory when the default file name is not known.
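
One practical way to settle which form to store (a sketch only, not something 
settled in this thread): most services answer a slash-less request for a directory 
with a redirect to the slash form, so issuing a HEAD request and following the 
redirect yields the canonical URL and keeps /xyz and /xyz/ from being indexed as 
two different documents.

import urllib.request

def canonical_form(url: str) -> str:
    """Issue a HEAD request and follow any redirect, so a directory asked
    for without the trailing slash comes back in its canonical slash form."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.geturl()        # final URL after redirects
    except OSError:
        return url                      # unreachable: keep the form we were given

# canonical_form("http://www.abc.de/xyz") would typically return
# "http://www.abc.de/xyz/" if xyz is a directory.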

It would be great to know how to ask that http service for the list of 
default or index file names so the agents could verify what file name was 
indeed associated with the / slash.  We could then put the file name on the 
URL to completely qualify that URL path.   Anyone? 

We can scan through all the default names for each known http service, but 
almost all of the services I have dealt with allow customization of that 
default name.  The complexity is that the default name can be a list, not a 
single file name, on the service; so the order of checking for the first 
served default name is a concern.

This is why I would like to know how the agents can query the http services 
for the default name list, with the returned names listed in order of 
operation.  Or at least why the web services have not added such a useful 
service query?  (Was it just not done before, or is there some known security 
issue?)  Anyone?

- - - - - 
Crazy thought...

This is where the robots.txt file could be used to hold that information for 
the robot agents that need to know the operational order of the / default 
names used on that service.

User-agent: *
Slash: default.htm, default.html, index.htm, index.html, welcome.html, 
sitemap.html

 The above is just for consideration if the robots.txt is ever updated so the 
robots could be informed of this little detail.   

-Thomas Kay
-- Original Text --

From: Matthias Jaekle [EMAIL PROTECTED], on 21/11/2001 11:49 AM:


Hello,

I read about adding a slash at the end of URLs if there is no
absolute path present.

But what about paths ending in subdirectories (xyz)?
A link to http://www.abc.de/xyz/ might be more correct than the link
to http://www.abc.de/xyz

But is there a possibility to find out whether somebody who wrote
http://www.abc.de/xyz meant http://www.abc.de/xyz/ ?

In my database of scanned URLs I found both versions, so I believe I
analysed many files twice.

How do I handle this circumstance correctly?

Many thanks for your help

Matthias






--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].