[Robots] Anti-thesaurus proposal
http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml

This is a proposal for a meta-tag to tell search engines to ignore certain words on a page when scoring relevancy. Among other things, it mentions robots.txt as problematic:

"Also, returning to the robots.txt standard: it may be underused simply because it is a security breach (the file openly lists URLs that webmasters do not want visible through search engines). It is possible that many more webmasters would be using it properly, if not for that security problem."

My opinion is that this is enormously impractical, but perhaps there's the seed of a good idea in it. However, it seems to me that if the authors of a page would actually bother to create meta-tags to increase search efficiency, it would be much easier (semi-automated, even) to create a tag containing the *most* relevant words, not the least.

Nick Arnett Phone/fax: 408-904-7198

-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
[Robots] FW: Re: Correct URL, slash at the end?
-----Original Message-----
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 11:26 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, slash at the end?

It was thus said that the Great George Phillips once stated:

"Don't be misled by relative URLs. Yes, they use . and .. Yes, / is very important. Yes, they operate almost identically to UNIX relative paths (but different enough to keep us on our toes). Yes, they are extremely useful. But they're just rules that take the stuff you used to get the current page and some relative stuff to construct new stuff -- all done by the browser. The web server only understands pure, unadulterated, unrelative stuff."

There are rules for parsing relative URLs in RFC-1808, and no, web servers do understand relative URLs---only they must start (when given in a GET (or other) command) with a leading `/'. I just fed ``/people/../index.html'' to my colocated server (telnet to port 80, feed in the GET request directly) and I got the main index page at http://www.conman.org/ . So the webserver can do the processing as well (at least Apache).

"My suggestion is that the robot construct URLs with care -- always do what a browser would do and respect the fact that the HTTP server may need exactly the same stuff back as it put into the HTML. And always, always store exactly the URL used to retrieve a block of content. But implement some generic mechanism to generalize URL equality beyond strcmp(). Regular expression search and replace looks as promising as anything. Imagine something like this (with perlish regexp): URL-same: s'/(index|default).html?$'/' In other words, if the URL ends in /index.html, /default.html, /index.htm or /default.htm then drop all but the slash and we'll assume the URL will boil down to the same content."

Is this for the robot configuration (on the robot end of things) or for something like robots.txt?

"URL-same: s'[^/]+/..(/|$)'' # condense .."
Make sure you follow RFC-1808 though.

"URL-same: tr'A-Z'a-z' # case fold the whole thing 'cause why not?"

Because not every webserver is case insensitive. The host portion is (it has to be; DNS is case insensitive) but the path portion (at least in the standard) is not. Okay, some sites (like AOL) treat them as case insensitive, but not all sites.

"And something for the pathological sites: URL-same: s'^(http://boston.conman.org/.*/)0+)'$1'g URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g"

What, exactly, does that map? Because I assure you that http://boston.conman.org/2001/11/17.2 is not the same as http://boston.conman.org/2001/11/17, even though the latter contains the content of the former (plus other entries from that day). But http://boston.conman.org/2000/8/10.2-15.5 and http://boston.conman.org/2000/8/10.2-8/15.5 do return the same content (in other words, those are equivalent), whereas http://boston.conman.org/2000/8/10.2-15.5 and http://boston.conman.org/2000/8/10-15 are not (but again, the latter contains the content of the former).

(Yet one more odd case. This: http://boston.conman.org/1999 and this: http://boston.conman.org/1999/12 and this: http://boston.conman.org/1999/12/4-15 are the same, but only because I started keeping entries in December of 1999. You can repeat for a couple of other variations.)

"It would be so cool if a robot could discover these patterns for itself. Seems like it would be a small scale version of covering boston.conman.org's other problem of multiple overlapping data views."

I'm not as sure of that 8-)

-spc (I calculated that http://bible.conman.org/kj/ has over 15 million different URL views into the King James Bible ...)

-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
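The URL-same idea discussed above is easy to sketch in code. Below is a minimal Python illustration (my own sketch, not anything from the thread): an ordered list of rewrite rules applied until the URL stops changing, plus a helper that case-folds only the host, since, as noted above, DNS is case-insensitive but URL paths are not. The specific rules are illustrative assumptions; a real robot would carry a per-site list.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Ordered "URL-same" rewrite rules that collapse URLs believed to point
# at the same content. These particular rules are examples, not a
# standard -- each site would need its own list.
RULES = [
    (re.compile(r'/(?:index|default)\.html?$'), '/'),  # /index.html -> /
    (re.compile(r'(?<=/)\./'), ''),                    # collapse /./
]

def normalize(url):
    """Apply each rewrite rule in order until the URL stops changing."""
    changed = True
    while changed:
        changed = False
        for pattern, repl in RULES:
            new = pattern.sub(repl, url)
            if new != url:
                url, changed = new, True
    return url

def fold_host(url):
    """Lowercase only the scheme and host (DNS is case-insensitive);
    leave the path alone, since not every server treats paths
    case-insensitively."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```

As the thread points out, rules like these only assert that two URLs *probably* return the same content; the robot should still store the exact URL it actually fetched.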
Re: Rumorbot
The company seems more like a contractor for hire. Are they actually starting this as a service? I saw the description of the talk at Bot2001 and it seemed like it was just an idea that they were floating, not a service they were really going to launch. Thanks!

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alexander Macgillivray
Sent: Friday, February 02, 2001 4:23 PM
To: [EMAIL PROTECTED]
Subject: Re: Rumorbot

What would you like to know? They were at Bot2001 (http://seminars.internet.com/bot/sf01/index.html) and I talked to them about their tech and company (also attended the session). Alex

At 12:10 PM 02/02/2001 -0800, Nick Arnett wrote: Anyone know more about this company or project...? http://news.bbc.co.uk/hi/english/sci/tech/newsid_1146000/1146589.stm

Nick Arnett Sr. VP and Co-Founder Opion Inc. Direct phone/fax: 408-733-7613 http://www.opion.com
Robots, km lists back up
From: Nick Arnett [EMAIL PROTECTED]
Date: Tue, 31 Oct 2000 15:48:21 -0800
Subject: Robots, km lists back up

My mail server suffered some sort of ugly disk problem that I'm still trying to fix, but I have installed a temporary backup server until then. The robots and km mailing lists were down since mid-day yesterday, but if this message reaches you, you'll know that they're back up. Some mail *may* have been lost, but that's not very likely, since the mail server forwards it to the list server machine immediately. Note that even if mail at mccmedia.com is down, I can be reached at my Opion address.

Nick -- Senior VP Strategic Development, Co-Founder Opion Inc. [EMAIL PROTECTED] (408) 733-7613
Robot list bounces
Robot list subscribers, I'm getting fairly aggressive about deleting e-mail addresses from the list when they start bouncing. So, if your address stops working temporarily for whatever reason, you may find yourself off the list and you'll need to re-join. I am a bit astonished at the number of bounces that show addresses that are not subscribed to the list... so if you see a few bounces when you post (most come here, as they should), that may be the reason. Nick Arnett Sr. VP and Co-Founder Opion Inc. Direct phone: 408-733-7613 Fax: 408-904-7198 http://www.opion.com
anyone want to license their robot spider for search?
From: Avi Rappoport [EMAIL PROTECTED]
Date: Fri, 10 Nov 2000 14:28:08 -0800
Subject: anyone want to license their robot spider for search?

I have a consulting customer writing a search engine who is looking for a heavy-duty robot spider that can handle millions of URLs. This one would have to be very robust, have a decent API, behave nicely, handle ugly HTML and strange links, etc. Please contact me with rates if you would like to be considered. Avi

PS I also get calls asking for smaller-scale spiders, so let me know if you have that code as well.
-- Complete Guide to Search Engines for Web Sites, Intranets, and Portals: http://www.searchtools.com
[Robots] Re: SV: matching and User-Agent: in robots.txt
Certainly LWP is widely used, but I think it's an open question as to how many LWP users use the robots.txt capabilities. I have used LWP extensively, but have never bothered with the latter. My robots target a handful of sites and really don't recurse, as such, so I just keep an eye on those sites' policies. And they tend to be very large, busy sites, so I'm a mere blip in their stats, I assume... which is not to say that I would lightly ignore anyone's wishes regarding robots. But I'm not really doing the usual search engine robot thing of sucking down every page. I'm heavily focused on tools that figure out which pages are most significant, so my robots behave more like people would... which I hope leaves me a bit more free.

Going back to the original question... I can't quite see why anyone would give a robot a name like Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]. But if that's the name, then that's what robots.txt should reference. A robots.txt that contains a directive for a robot named Banjo is either referring to another robot or it has the wrong name. I think the original poster has confused (conflated, actually) the HTTP User-Agent and From headers.

$ua = LWP::RobotUA->new($agent_name, $from, [$rules])

Your robot's name and the mail address of the human responsible for the robot (i.e. you) are required by the constructor. Create a user-agent object thus:

$ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html [EMAIL PROTECTED]')

The string that gets compared with robots.txt is Banjo/1.1. That's the HTTP User-Agent header. The second parameter is the HTTP From header, which allows the target site's administrator to find you (easily) if your robot misbehaves. Of course, it isn't special to robots. Any HTTP client can send a From header (the default behavior of which in some clients led to much controversy years ago, of course).
From the LWP docs: "The from attribute can be set to the e-mail address of the person responsible for running the application. If this is set, then the address will be sent to the servers with every request." Hope that's reasonably clear. Nick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Otis Gospodnetic
Sent: Thursday, March 14, 2002 8:57 AM
To: [EMAIL PROTECTED]
Subject: [Robots] Re: SV: matching and User-Agent: in robots.txt

LWP? Very popular in a big Perl community.

-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
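The two headers discussed above aren't specific to LWP; any HTTP client can send them. As a rough Python analogue of the LWP::RobotUA constructor (the robot name and contact address here are placeholders borrowed from the example above, not a real robot):

```python
import urllib.request

# Rough Python analogue of LWP::RobotUA->new($agent_name, $from):
# User-Agent carries the robot's name (the string robots.txt directives
# are matched against) and From carries a contact address so the target
# site's administrator can reach whoever runs the robot.
def make_robot_request(url, agent='Banjo/1.1', contact='operator@example.com'):
    return urllib.request.Request(url, headers={
        'User-Agent': agent,
        'From': contact,
    })

req = make_robot_request('http://example.com/')
```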
[Robots] Re: better language for writing a Spider ?
Having worked in Perl and Python, I'll recommend Python. Although I haven't been using it for long, I'm definitely more productive with it. Performance seems fine, though I haven't really pushed hard on it. I'm not seeing long, mysterious time-outs as I occasionally did with LWP. And I hit some weird bug in LWP a few weeks ago, which resulted in a strange error message that I eventually discovered was coming out of the expat DLL for XML. Instead of retrieving the page I wanted, it was misinterpreting a server error. I wish I could be more specific, but I never did figure out what was really going on. Following an LWP request through the debugger is a long and convoluted journey... Nick -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of srinivas mohan Sent: Thursday, March 14, 2002 9:48 AM To: [EMAIL PROTECTED] Subject: [Robots] better language for writing a Spider ? Hello, I am working on a robot develpoment, in java,. We are developing a search enginealmost the complete engine is developed... We used java for the devlopment...but the performance of java api in fetching the web pages is too low, basically we developed out own URL Connection , as we dont have some features like timeout... supported by the java.net.URLConnection api .. Though there are better spiders in java..like Mercator..we could not achive a better performance with our product... Now as the performance is low..we wanted to redevelop our spider..in a language like c or perl...and use it with our existing product.. I will be thankful if any one can help me choosing the better language..where i can get better performance.. Thanks in advance Mohan __ Do You Yahoo!? Yahoo! Sports - live college hoops coverage http://sports.yahoo.com/ -- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED]. 
-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
[Robots] Re: matching and UserAgent: in robots.txt
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Sean M. Burke

"... E.g., http://www.robotstxt.org/wc/norobots.html says: 'User-agent [...] The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.' ...note the 'without version information'. Ditto the spec you cited, which says 'That is, the User-Agent (HTTP) header consists of one or more words, and the very first word is taken to be the name, which is referred to in the robot exclusion files.'"

Ah, now I see your point. That does seem to be a problem, since apparently version numbers were contemplated in User-Agent headers... Sounds like something for the LWP author(s). Or, a convenient excuse for a badly behaved robot...! Nick

-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
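The recommendation quoted above (take the first word of the User-Agent header, drop the version, compare case-insensitively as a substring) is compact enough to sketch. This is my guess at a conforming implementation, not code from any particular robot or from LWP:

```python
def robots_token_matches(token, user_agent):
    """Check a robots.txt User-agent token against an HTTP User-Agent
    header per the exclusion guidelines quoted above: take the first
    word of the header, strip any /version suffix, and do a
    case-insensitive substring comparison."""
    name = user_agent.split()[0].split('/')[0].lower()
    return token.lower() in name
```

Under this reading, a header like "Banjo/1.1 [http://nowhere.int/banjo.html ...]" matches a robots.txt record for "Banjo", which is exactly the version-number wrinkle the message points out.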
[Robots] Python timeouts
I've been hitting problems with a Python-based robot I'm working on and just found out that there's a timeout module that will make it easy to implement the kind of functionality that Tim Bray was suggesting here earlier. It apparently works for any TCP connection. Here's the link: http://www.timo-tasi.org/python/timeoutsocket.py -- [EMAIL PROTECTED] (408) 904-7198
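The timeoutsocket module linked above predates the standard library growing the same capability; in later Python versions the equivalent one-liner is a process-wide default timeout, which applies to every new socket, including those opened by urllib:

```python
import socket

# Modern stdlib replacement for the third-party timeoutsocket module:
# set a process-wide default timeout so any new TCP connection raises
# a timeout error instead of hanging on a dead server. The 30-second
# value is an arbitrary example.
socket.setdefaulttimeout(30.0)
```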
[Robots] Re: unsubscribe
Commands need to be sent to [EMAIL PROTECTED]. Send "unsubscribe robots" in the body of a message to leave this list. Nick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of HuiFang Wang
Sent: Tuesday, March 26, 2002 2:30 AM
To: [EMAIL PROTECTED]
Subject: [Robots] unsubscribe

hello, I want to unsubscribe.
[Robots] Does Yahoo have new robot defenses?
It looks to me as though Yahoo has some sort of robot defense operating. I was just testing a multi-threaded robot that I use to analyze discussions, including Yahoo's stock market boards. On the first run, it seemed to do fine, but when I tried to run it again a few minutes later, it didn't retrieve anything... so I tried going to the message boards using IE on the same machine. Every page is returning a 403 Forbidden error now -- even when I try to see robots.txt. As far as I know, Yahoo has never even had a robots.txt file. I'm guessing that the speed of my robot triggered a block against this IP address. Another machine, in the same subnet, can access the pages just fine.

I've been working on the underlying database for the last few weeks, so I haven't run the spider lately. Thus, I'm not sure when this behavior might have started. My robot is quite fast and my connection yields throughput of about 1 Mbit/sec, so it certainly hit their server fairly hard. But hey, it's Yahoo. If they can't handle getting hit this hard on a mid-day Saturday, it's hard to imagine who can. No lectures about well-behaved robots, please. I know, I know. The next step for that robot will be to have each thread hit completely different domains. Perhaps each one will rotate through a few domains.

Anybody know what Yahoo might be doing, or what its policy is about robots? I haven't been able to find anything that addresses the issue directly. I don't see anything under its TOS that would clearly apply. If they want to have a limit on robots, I sure would appreciate it if they would say what it is... It's been about 30 minutes now and I'm still blocked, it seems. Just checked from another machine -- they still have no robots.txt at all. Nick

-- [EMAIL PROTECTED] (408) 904-7198
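The "each thread hits different domains" plan amounts to per-host throttling: never request from the same host more often than some fixed interval, while threads stay busy elsewhere. A minimal sketch (the delay value is a guess; as the message notes, Yahoo never published an acceptable rate):

```python
import time

class PoliteFetcher:
    """Track the last request time per host and sleep so that no single
    host is hit more often than `delay` seconds apart -- the kind of
    throttling that might avoid an IP block like the one described
    above. The 5-second default is an arbitrary assumption."""

    def __init__(self, delay=5.0):
        self.delay = delay
        self.last = {}  # host -> time of last request

    def wait(self, host):
        """Block until at least `delay` seconds since the last request
        to this host, then record the new request time."""
        earlier = self.last.get(host)
        if earlier is not None:
            remaining = self.delay - (time.monotonic() - earlier)
            if remaining > 0:
                time.sleep(remaining)
        self.last[host] = time.monotonic()
```

Each worker thread would call `wait(host)` before fetching; requests to different hosts proceed without delay.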
[Robots] Safe parameters for spidering Yahoo message header pages?
Anyone here figured out what Yahoo will tolerate in terms of spidering its message header pages before it blocks the robot's IP address? Before I start testing, I figured I'd see if anyone else here has already done so. The duration of the block seems to lengthen, so testing could take a while. Sure would be nice if they'd just say what they consider acceptable... Nick -- [EMAIL PROTECTED] (408) 904-7198 ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
RE: [Robots] Post
As long as we're kicking around what's new, here's mine. I've been working on a system that finds topical Internet discussions (web forums, usenet, mailing lists) and does some analysis of who's who, looking for the people who connect communities together, lead discussions, etc. At the moment, it's focusing on Java developers. It's been quite interesting to see what it discovers in terms of how various subtopics are related and what other things that Java developers tend to be interested in. Regarding markup, etc., in the back of my mind I've had the notion of enhancing my spider to recognize how to parse and recurse forums and list archives, so that I don't have to write new code for every different forum or archiving format. But it's not something I'd be comfortable tossing out into the open, since it obviously would be a tool that spammers could use for address harvesting. I'm essentially creating a toolbox with Python and MySQL, which I'm using to create custom information products for consulting clients. For the moment, those (obviously) are companies with a strong interest in Java. Nick -- Nick Arnett Phone/fax: (408) 904-7198 [EMAIL PROTECTED] ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
RE: [Robots] Efficient crawling of mailing list archives?
At the risk of talking to myself... Would a gateway from mailing lists to NNTP address most of the issues I described? NNTP already knows about threading, updating, etc. However, I've been stymied by the problem of discovering new NNTP servers. -- Nick Arnett Phone/fax: (408) 904-7198 [EMAIL PROTECTED] ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
Re: [Robots] Is this mailing list alive?
[EMAIL PROTECTED] wrote: "I've created a robot, www.dead-links.com, and I wonder if this list is alive." It is alive, but very, very quiet. Nick ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
Re: [Robots] robot in python?
Petter Karlström wrote: "Hello all, Nice to see that this list woke up again! :^)"

And now the list owner finally woke up, too... I hadn't noticed the recent traffic on the list until just now. Are those messages about an address no longer in use going to the whole list? Aghh. I've taken care of that, I hope, but the source address wasn't actually subscribed, so I had to guess. Back to the point at hand... I've written several specialized robots in Python over the last few years. They are specifically for crawling on-line discussions and parsing out individual messages and meta-data. Look for Aahz's examples (do a Google search on Aahz and Python, I'm sure that'll lead you there). He makes multi-threading for your spider pretty easy and adaptable to various kinds of processing.

"I have written crawlers in Perl before, but I wish to try out Python for a hobby project. Has anybody here written a webbot in Python? Python is of course a smaller language, so the libraries aren't as extensive as the Perl counterparts. Also, I find the documentation somewhat lacking (or it could be me being new to the language)."

After switching from Perl to Python a couple of years ago, I haven't ever found the Python libraries lacking, although I expected to. Documentation, in the form of published books, has been a bit scarce, but new ones have been coming out lately. I just looked through one on text applications in Python, but haven't bought it yet. It definitely looked good.

"Are there any small examples available on use of HTMLParser and htmllib? Specifically, I need something like the linkextor available in Perl."

One trick is to search on "import [modulename]" as a phrase. That'll often uncover code you can use as an example. What does linkextor do? Link extractor? If so, I just use regular expressions.

"Also, what is the neatest way to store session data like login and password? PassWordMgr?"

Store in what sense? I'll take a look at my code and see if I can share something generic.
Since we're doing www.opensector.org, I suppose it would only be right for us to share at least *some* of our code! However... I just looked at what I have and the older stuff doesn't really add much to Aahz's examples, other than some simple use of MySQL as the store; my newer stuff is far too specific to the task I'm doing to be able to quickly sanitize it.

The main thing I did to address our specific needs was to create a Java class for message pages in specific types of web-based discussion forums. That's partly to extract URLs, but mostly to extract other features and to intelligently (in the sense of being able to update my database rapidly, re-visiting the minimum number of pages) navigate the threading structures, which work in various ways. The class for Jive-based forums is only 225 lines, as an example. The multi-threaded module that uses it is 100 lines; a single-threaded version is 25 lines.

We also have a Python robot for NNTP servers, which obviously doesn't need recursion. It's about 400 lines. A lot of it deals with things like missing messages, zeroing in on desired date ranges, avoiding downloading huge messages, recovery from failure, etc. All of these talk to MySQL... Nick

-- Nick Arnett Phone/fax: (408) 904-7198 [EMAIL PROTECTED] ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
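For the HTMLParser question a few messages up, here is a minimal link extractor, the Python counterpart of Perl's HTML::LinkExtor. This sketch uses the module path of current Python (html.parser); the old htmllib/HTMLParser organization of the era differed, so treat the import as an assumption:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing --
    the kind of small HTMLParser example asked about above."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

p = LinkExtractor()
p.feed('<html><body><a href="/one">1</a>'
       '<a href="http://example.com/two">2</a></body></html>')
```

Unlike the regular-expression approach mentioned in the reply, a real parser also handles attribute order, quoting variations, and whitespace for free.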
Re: [Robots] Yahoo evolving robots.txt, finally
Walter Underwood wrote: "Nah, they would have e-mailed me directly by now. I used to work with them at Inktomi."

How about dropping them an e-mail to invite them here?

Yahoo limits crawler access to its own site. I haven't tried in the last 9 or 10 months, but the way it was back then, if you crawled the message boards, the crawler's IP address would be blocked for increasingly long time periods -- a day, two days, etc. I tried slowing down our gathering, but couldn't find a speed at which they wouldn't eventually block it. And of course they never responded to any questions about what they'd consider acceptable. And yet, their own servers don't seem to have a robots.txt that defines any limitations. Sure would be nice if *they* would tell *us* what's acceptable when crawling Yahoo! Nick

-- Nick Arnett Director, Business Intelligence Services LiveWorld Inc. Phone/fax: (408) 551-0427 [EMAIL PROTECTED] ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
[Robots] [Fwd: add-robot@robotstxt.org is not working]
Anybody know what this is about?

-------- Original Message --------
Subject: [EMAIL PROTECTED] is not working
Date: Tue, 06 Apr 2004 08:25:29 +0300
From: Max Max [EMAIL PROTECTED]
Reply-To: Max Max [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Dear webmaster! Our team developed the uptimebot web spider and has been using it for three months now. I tried to send the appropriate form to [EMAIL PROTECTED], but the mail delivery failed. I am sending you the same filled-in form. If you are not concerned with adding new robots, I apologize. But can you tell me where to apply? Thanks for your help! The form for the robot is as follows:

robot-id: uptimebot
robot-name: UptimeBot
robot-cover-url: http://www.uptimebot.com
robot-details-url: http://www.uptimebot.com
robot-owner-name: UCO team
robot-owner-url: http://www.uptimebot.com
robot-owner-email: [EMAIL PROTECTED]
robot-status: active
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: uptimebot
robot-exclusion-useragent: no
robot-noindex: no
robot-host: uptimebot.com
robot-from: no
robot-useragent: uptimebot
robot-language: c++
robot-description: UptimeBot is a web crawler that checks return codes of web servers and calculates the average status of the servers it visits. The robot runs daily and visits sites in a random order.
robot-history: This robot is a local research product of the UptimeBot team.
robot-environment: research
modified-date: Sat, 19 March 2004 21:19:03 GMT
modified-by: UptimeBot team

Best regards. Maks (aka Luft)

-- Nick Arnett Director, Business Intelligence Services LiveWorld Inc. Phone/fax: (408) 551-0427 [EMAIL PROTECTED] ___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots