[Robots] Anti-thesaurus proposal

2001-11-20 Thread Nick Arnett


http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml

This is a proposal for a meta-tag to tell search engines to ignore certain
words on a page when scoring relevancy.  Among other things, it mentions
robots.txt as problematic:

Also, returning to the robots.txt standard: it may be underused simply
because it is a security breach (the file openly lists URLs that webmasters
do not want visible through search engines). It is possible that many more
webmasters would be using it properly, if not for that security problem.

My opinion is that this is enormously impractical, but perhaps there's the
seed of a good idea in it.  However, it seems to me that if the authors of a
page would actually bother to create meta-tags to increase search
efficiency, it would be much easier (semi-automated, even) to create a tag
containing the *most* relevant words, not the least.

Nick Arnett
Phone/fax: 408-904-7198


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




[Robots] FW: Re: Correct URL, slash at the end ?

2001-11-24 Thread Nick Arnett




-Original Message-
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 11:26 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, slash at the end ?


It was thus said that the Great George Phillips once stated:

 Don't be misled by relative URLs.  Yes, they use . and ...  Yes,
 / is very important.  Yes, they operate almost identically to
 UNIX relative paths (but different enough to keep us on our toes).
 Yes, they are extremely useful.  But they're just rules that take
 the stuff you used to get the current page and some relative stuff to
 construct new stuff -- all done by the browser.  The web server only
 understands pure, unadulterated, unrelative stuff.

  There are rules for parsing relative URLs in RFC-1808 and no, web servers
do understand relative URLs---only they must start (if giving a GET (or
other) command) with a leading `/'.  I just fed ``/people/../index.html'' to
my colocated server (telnet to port 80, feed in the GET request directly)
and I got the main index page at http://www.conman.org/ .  So the webserver
can do the processing as well (at least Apache).
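
A minimal sketch of both resolution steps, using the current Python standard
library (the host and path are just the ones from this example):

from urllib.parse import urljoin
from posixpath import normpath

# What a browser does with a relative link before sending the request:
print(urljoin("http://www.conman.org/people/", "../index.html"))
# -> http://www.conman.org/index.html

# Roughly what a server such as Apache may do with dot segments in the path:
print(normpath("/people/../index.html"))
# -> /index.html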

 My suggestion is that the robot construct URLs with care -- always do what
 a browser would do and respect the fact that the HTTP server may need
 exactly the same stuff back as it put into the HTML.  And always, always
 store exactly the URL used to retrieve a block of content.  But implement
 some generic mechanism to generalize URL equality beyond strcmp().  Regular
 expression search and replace looks as promising as anything.  Imagine
 something like this (with perlish regexp):

 URL-same: s'/(index|default)\.html?$'/'

 In other words, if the URL ends in /index.html, /default.html, /index.htm or
 /default.htm then drop all but the slash and we'll assume the URL will boil
 down to the same content.
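
As a rough Python translation of the idea, a crawler could keep such rewrite
rules in a small table and run every URL through them before comparing; the
patterns below mirror the ones suggested in this thread and are a sketch, not
production canonicalization:

import re

# (pattern, replacement) pairs in the spirit of the URL-same rules
URL_SAME_RULES = [
    (re.compile(r"/(index|default)\.html?$"), "/"),  # /index.html etc. -> /
    (re.compile(r"[^/]+/\.\.(/|$)"), ""),            # condense ".." segments
]

def canonical(url):
    # Apply every rule repeatedly until the URL stops changing.
    changed = True
    while changed:
        changed = False
        for pattern, replacement in URL_SAME_RULES:
            new = pattern.sub(replacement, url)
            if new != url:
                url, changed = new, True
    return url

print(canonical("http://www.example.com/a/../index.html"))
# -> http://www.example.com/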

  Is this for the robot configuration (on the robot end of things) or for
something like robots.txt?

 URL-same: s'[^/]+/\.\.(/|$)''   # condense ..

  Make sure you follow RFC-1808 though.

 URL-same: tr'A-Z'a-z' # case fold the whole thing 'cause why not?

  Because not every webserver is case insensitive.  The host portion is (has
to be, DNS is case insensitive) but the relative portion (at least in the
standards portions) is not.  Okay, some sites (like AOL) treat them as case
insensitive, but not all sites.

 And something for the pathological sites

 URL-same: s'^(http://boston.conman.org/.*/)0+'$1'g
 URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

  What, exactly does that map?  Because I assure you that

http://boston.conman.org/2001/11/17.2

  is not the same as:

 http://boston.conman.org/2001/11/17

  even though the latter contains the content of the former (plus other
entries from that day).  But ...

  http://boston.conman.org/2000/8/10.2-15.5

  and

 http://boston.conman.org/2000/8/10.2-8/15.5

  do return the same content (in other words, those are equivalent), whereas:

  http://boston.conman.org/2000/8/10.2-15.5

  and

http://boston.conman.org/2000/8/10-15

  Are not (but again, the latter contains the content of the former).

  (Yet one more odd case.  This:

http://boston.conman.org/1999

  and this:

  http://boston.conman.org/1999/12

  and this:

http://boston.conman.org/1999/12/4-15

  Are the same, but only because I started keeping entries in December of
1999.  You can repeat for a couple of other variations).

 It would be so cool if a robot could discover these patterns for itself.
 Seems like it would be a small scale version of covering boston.conman.org's
 other problem of multiple overlapping data views.

  I'm not as sure of that 8-)

  -spc (I calculated that http://bible.conman.org/kj/ has over 15 million
different URL views into the King James Bible ... )



--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




Re: Rumorbot

2001-02-03 Thread Nick Arnett

The company seems more like a contractor for hire.  Are they actually
starting this as a service?  I saw the description of the talk at Bot2001
and it seemed like it was just an idea that they were floating, not a
service they were really going to launch.

Thanks!

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of
Alexander Macgillivray
Sent: Friday, February 02, 2001 4:23 PM
To: [EMAIL PROTECTED]
Subject: Re: Rumorbot


What would you like to know?
They were at Bot2001 (http://seminars.internet.com/bot/sf01/index.html) and
I talked to them about their tech and company (also attended the session).
Alex
At 12:10 PM 02/02/2001 -0800, Nick Arnett wrote:
Anyone know more about this company or project...?

http://news.bbc.co.uk/hi/english/sci/tech/newsid_1146000/1146589.stm

Nick Arnett
Sr. VP and Co-Founder
Opion Inc.
Direct phone/fax: 408-733-7613

http://www.opion.com




[no subject]

2002-02-21 Thread Nick Arnett

Date: Tue, 31 Oct 2000 15:48:21 -0800
From: Nick Arnett [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Robots, km lists back up

My mail server suffered some sort of ugly disk problem that I'm still
trying to fix, but I have installed a temporary backup server until
then.  The robots and km mailing lists have been down since mid-day yesterday,
but if this message reaches you, you'll know that they're back up.  Some
mail *may* have been lost, but that's not very likely, since the mail
server forwards it to the list server machine immediately.

Note that even if mail at mccmedia.com is down, I can be reached at my
Opion address.

Nick

--
Senior VP Strategic Development, Co-Founder
Opion Inc.

[EMAIL PROTECTED]
(408) 733-7613




Robot list bounces

2001-03-08 Thread Nick Arnett

Robot list subscribers,

I'm getting fairly aggressive about deleting e-mail addresses from the list
when they start bouncing.  So, if your address stops working temporarily for
whatever reason, you may find yourself off the list and you'll need to
re-join.

I am a bit astonished at the number of bounces that show addresses that are
not subscribed to the list... so if you see a few bounces when you post
(most come here, as they should), that may be the reason.

Nick Arnett
Sr. VP and Co-Founder
Opion Inc.
Direct phone: 408-733-7613 Fax: 408-904-7198

http://www.opion.com




[no subject]

2002-02-21 Thread Nick Arnett

Date: Fri, 10 Nov 2000 14:28:08 -0800
From: Avi Rappoport [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: anyone want to license their robot spider for search?

I have a consulting customer writing a search engine looking for a
heavy-duty robot spider that can handle millions of URLs.  This one
would have to be very robust, have a decent API, behave nicely,
handle ugly HTML and strange links, etc. etc.

Please contact me with rates if you would like to be considered.

Avi

PS I also get calls asking for smaller-scale spiders, so let me know
if you have that code as well.

--
_
Complete Guide to Search Engines for Web Sites, Intranets,
   and Portals: http://www.searchtools.com




[Robots] Re: SV: matching and User-Agent: in robots.txt

2002-03-14 Thread Nick Arnett


Certainly LWP is widely used, but I think it's an open question as to how
many LWP users use the robots.txt capabilities.  I have used LWP
extensively, but have never bothered with the latter.  My robots target a
handful of sites and really don't recurse, as such, so I just keep an eye on
those sites' policies.  And they tend to be very large, busy sites, so I'm a
mere blip in their stats, I assume... which is not to say that I would
lightly ignore anyone's wishes regarding robots.  But I'm not really doing
the usual search engine robot thing of sucking down every page.  I'm heavily
focused on tools that figure out which pages are most significant, so my
robots behave more like people would... which I hope leaves me a bit more
free.

Going back to the original question... I can't quite see why anyone would
give a robot a name like "Banjo/1.1 [http://nowhere.int/banjo.html
[EMAIL PROTECTED]]".  But if that's the name, then that's what robots.txt
should reference.  A robots.txt that contains a directive for a robot named
"Banjo" is either referring to another robot or it has the wrong name.

I think the original poster has confused (conflated, actually) the HTTP
User-Agent and From headers.

 $ua = LWP::RobotUA->new($agent_name, $from, [$rules])

 Your robot's name and the mail address of the human responsible for the
 robot (i.e. you) is required by the constructor.

Create a user-agent object thus:

$ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html
[EMAIL PROTECTED]')

The string that gets compared with robots.txt is "Banjo/1.1".  That's the
HTTP User-Agent header.  The second parameter is the HTTP From header,
which allows the target site's administrator to find you (easily) if your
robot misbehaves.  Of course, it isn't special to robots.  Any HTTP client
can send a From header (the default behavior of which in some clients led
to much controversy years ago, of course).

From the LWP docs: "The from attribute can be set to the e-mail address of
the person responsible for running the application. If this is set, then the
address will be sent to the servers with every request."
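
For anyone not using LWP, a rough analogue of the same separation with
Python's standard library (urllib.robotparser and urllib.request, not LWP
itself; the robot name and host are the fictional ones from the post, and the
From address is a placeholder):

import urllib.robotparser
import urllib.request

rp = urllib.robotparser.RobotFileParser("http://nowhere.int/robots.txt")
rp.read()

# The short robot name is what gets checked against robots.txt...
if rp.can_fetch("Banjo", "http://nowhere.int/banjo.html"):
    # ...while User-Agent and From travel as ordinary HTTP request headers.
    request = urllib.request.Request(
        "http://nowhere.int/banjo.html",
        headers={"User-Agent": "Banjo/1.1", "From": "robot-owner@example.com"},
    )
    urllib.request.urlopen(request)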

Hope that's reasonably clear.

Nick

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Otis Gospodnetic
 Sent: Thursday, March 14, 2002 8:57 AM
 To: [EMAIL PROTECTED]
 Subject: [Robots] Re: SV: matching and UserAgent: in robots.txt



 LWP?  Very popular in a big Perl community.


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Nick Arnett


Having worked in Perl and Python, I'll recommend Python.  Although I haven't
been using it for long, I'm definitely more productive with it.  Performance
seems fine, though I haven't really pushed hard on it.  I'm not seeing long,
mysterious time-outs as I occasionally did with LWP.  And I hit some weird
bug in LWP a few weeks ago, which resulted in a strange error message that I
eventually discovered was coming out of the expat DLL for XML.  Instead of
retrieving the page I wanted, it was misinterpreting a server error.  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

Nick

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of srinivas mohan
 Sent: Thursday, March 14, 2002 9:48 AM
 To: [EMAIL PROTECTED]
 Subject: [Robots] better language for writing a Spider ?



 Hello,

 I am working on robot development, in Java.
 We are developing a search engine; almost the
 complete engine is developed.
 We used Java for the development, but the performance
 of the Java API in fetching web pages is too low;
 basically, we developed our own URL connection, as
 some features, like timeouts, are not
 supported by the java.net.URLConnection API.

 Though there are better spiders in Java, like
 Mercator, we could not achieve better performance
 with our product.

 Now, as the performance is low, we want to redevelop
 our spider in a language like C or Perl and use
 it with our existing product.

 I will be thankful if anyone can help me choose
 a better language, where I can get better
 performance.

 Thanks in advance
 Mohan



 __
 Do You Yahoo!?
 Yahoo! Sports - live college hoops coverage
 http://sports.yahoo.com/

 --
 This message was sent by the Internet robots and spiders
 discussion list ([EMAIL PROTECTED]).  For list server commands,
 send help in the body of a message to [EMAIL PROTECTED].


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Nick Arnett




 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Sean M. Burke

...

 E.g.,  http://www.robotstxt.org/wc/norobots.html says:
 "User-agent [...] The robot should be liberal in interpreting this field.
 A case insensitive substring match of the name without version information
 is recommended."

 ...note the "without version information".  Ditto the spec you cited, which
 says "That is, the User-Agent (HTTP) header consists of one or more words,
 and the very first word is taken to be the name, which is referred to in
 the robot exclusion files."

Ah, now I see your point.  That does seem to be a problem, since apparently
version numbers were contemplated in User-Agent headers...  Sounds like
something for the LWP author(s).

Or, a convenient excuse for a badly behaved robot... !
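
For what it's worth, the recommended matching is easy enough to do by hand; a
minimal Python sketch (the function and the liberal two-way substring test
are illustrative, not LWP's internals):

def record_applies(http_user_agent, robots_txt_name):
    # First word of the HTTP User-Agent, with any /version stripped off.
    name = http_user_agent.split()[0].split("/")[0].lower()
    token = robots_txt_name.lower()
    # Liberal interpretation: * matches everyone, otherwise accept a
    # case-insensitive substring match in either direction.
    return token == "*" or token in name or name in token

print(record_applies("Banjo/1.1 (http://nowhere.int/banjo.html)", "banjo"))  # True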

Nick


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Python timeouts

2002-03-25 Thread Nick Arnett


I've been hitting problems with a Python-based robot I'm working on and just
found out that there's a timeout module that will make it easy to implement
the kind of functionality that Tim Bray was suggesting here earlier.  It
apparently works for any TCP connection.  Here's the link:

http://www.timo-tasi.org/python/timeoutsocket.py
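
That module pre-dates timeout support in the standard library; current Python
has the same idea built in, so a rough equivalent today looks like this (the
15-second value and the URL are placeholders):

import socket
import urllib.error
import urllib.request

# Give up on any TCP connection that stalls for more than 15 seconds.
socket.setdefaulttimeout(15)

try:
    with urllib.request.urlopen("http://www.example.com/") as page:
        data = page.read()
except (socket.timeout, urllib.error.URLError):
    data = None  # log it and move on instead of letting one slow host hang the robot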

--
[EMAIL PROTECTED]
(408) 904-7198





[Robots] Re: unsubscibe

2002-03-26 Thread Nick Arnett


Commands need to be sent to [EMAIL PROTECTED].

Send "unsubscribe robots" in the body of a message to leave this list.

Nick

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of HuiFang Wang
 Sent: Tuesday, March 26, 2002 2:30 AM
 To: [EMAIL PROTECTED]
 Subject: [Robots] unsubscibe
 
 
 
 hello, 
  I want to unsubscibe.




[Robots] Does Yahoo have new robot defenses?

2002-07-27 Thread Nick Arnett


It looks to me as though Yahoo has some sort of robot defense operating.  I
was just testing a multi-threaded robot that I use to analyze discussions,
including Yahoo's stock market boards.  On the first run, it seemed to do
fine, but when I tried to run it again a few minutes later, it didn't
retrieve anything... so I tried going to the message boards using IE on the
same machine.  Every page is returning a 403 Forbidden error now -- even
when I try to see robots.txt.  As far as I know, Yahoo has never even had a
robots.txt file.

I'm guessing that the speed of my robot triggered a block against this IP
address.  Another machine, in the same subnet, can access the pages just
fine.

I've been working on the underlying database for the last few weeks, so I
haven't run the spider lately.  Thus, I'm not sure when this behavior might
have started.

My robot is quite fast and my connection yields throughput of about 1
mbit/sec, so it certainly hit their server fairly hard.  But hey, it's
Yahoo.  If they can't handle getting hit this hard on a mid-day Saturday,
it's hard to imagine who can.

No lectures about well-behaved robots, please.  I know, I know.  The next
step for that robot will be to have each thread hit completely different
domains.  Perhaps each one will rotate through a few domains.
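
A minimal sketch of that kind of per-host politeness in Python (the
five-second delay is an arbitrary illustration, and a real multi-threaded
robot would need a lock around the shared table):

import time
from urllib.parse import urlsplit

class PerHostThrottle:
    """Space out successive requests to the same host."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.last_hit = {}  # hostname -> time of last request

    def wait(self, url):
        host = urlsplit(url).hostname
        earliest = self.last_hit.get(host, 0.0) + self.delay
        now = time.time()
        if now < earliest:
            time.sleep(earliest - now)
        self.last_hit[host] = time.time()

throttle = PerHostThrottle()
for url in ["http://example.com/a", "http://example.com/b", "http://example.org/c"]:
    throttle.wait(url)  # sleeps only before the second example.com request
    # fetch(url) would go here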

Anybody know what Yahoo might be doing, or what its policy is about robots?
I haven't been able to find anything that addresses the issue directly.  I
don't see anything under its TOS that would clearly apply.  If they want to
have a limit on robots, I sure would appreciate it if they would say what it
is...

It's been about 30 minutes now and I'm still blocked, it seems.

Just checked from another machine -- they still have no robots.txt at all.

Nick

--
[EMAIL PROTECTED]
(408) 904-7198





[Robots] Safe parameters for spidering Yahoo message header pages?

2002-08-02 Thread Nick Arnett

Anyone here figured out what Yahoo will tolerate in terms of spidering its
message header pages before it blocks the robot's IP address?  Before I
start testing, I figured I'd see if anyone else here has already done so.
The duration of the block seems to lengthen, so testing could take a while.

Sure would be nice if they'd just say what they consider acceptable...

Nick

--
[EMAIL PROTECTED]
(408) 904-7198

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots



RE: [Robots] Post

2002-11-08 Thread Nick Arnett
As long as we're kicking around what's new, here's mine.  I've been working
on a system that finds topical Internet discussions (web forums, usenet,
mailing lists) and does some analysis of who's who, looking for the people
who connect communities together, lead discussions, etc.  At the moment,
it's focusing on Java developers.  It's been quite interesting to see what
it discovers in terms of how various subtopics are related and what other
things that Java developers tend to be interested in.

Regarding markup, etc., in the back of my mind I've had the notion of
enhancing my spider to recognize how to parse and recurse forums and list
archives, so that I don't have to write new code for every different forum
or archiving format.  But it's not something I'd be comfortable tossing out
into the open, since it obviously would be a tool that spammers could use
for address harvesting.

I'm essentially creating a toolbox with Python and MySQL, which I'm using to
create custom information products for consulting clients.  For the moment,
those (obviously) are companies with a strong interest in Java.

Nick

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots



RE: [Robots] Efficient crawling of mailing list archives?

2003-02-28 Thread Nick Arnett
At the risk of talking to myself... Would a gateway from mailing lists to
NNTP address most of the issues I described?  NNTP already knows about
threading, updating, etc.

However, I've been stymied by the problem of discovering new NNTP servers.
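
The overview data that NNTP servers hand out already carries the References
chains that define threads, which is most of what a gateway would need; a
minimal sketch with Python's nntplib (the server name is a placeholder, and
nntplib has since been removed from the standard library in Python 3.13, so
treat this as illustrative):

import nntplib

server = nntplib.NNTP("news.example.com")  # placeholder server
resp, count, first, last, name = server.group("comp.lang.python")

# Overview records include Subject, Date, Message-ID and References,
# which is enough to rebuild threads and to update incrementally.
resp, overviews = server.over((max(first, last - 100), last))
for article_number, fields in overviews:
    print(article_number, fields.get("subject"), fields.get("references"))

server.quit()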

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Is this mailing list alive?

2003-11-04 Thread Nick Arnett
[EMAIL PROTECTED] wrote:

> I've created a robot, www.dead-links.com, and I wonder if this list is alive.

It is alive, but very, very quiet.

Nick

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-26 Thread Nick Arnett
Petter Karlström wrote:

> Hello all,
>
> Nice to see that this list woke up again! :^)
And now the list owner finally woke up, too... I hadn't noticed the
recent traffic on the list until just now.  Are those messages about an
address no longer in use going to the whole list?  Aghh.  I've taken
care of that, I hope, but the source address wasn't actually subscribed,
so I had to guess.
Back to the point at hand... I've written several specialized robots in
Python over the last few years.  They are specifically for crawling
on-line discussions and parsing out individual messages and meta-data.
Look for Aahz's examples (do a Google search on Aahz and Python, I'm
sure that'll lead you there).  He makes multi-threading for your spider
pretty easy and adaptable to various kinds of processing.
> I have written crawlers in Perl before, but I wish to try out Python for
> a hobby project. Has anybody here written a webbot in Python?
> Python is of course a smaller language, so the libraries aren't as
> extensive as the Perl counterparts. Also, I find the documentation
> somewhat lacking (or it could be me being new to the language).
After switching from Perl to Python a couple of years ago, I haven't
ever found the Python libraries lacking, although I expected to.
Documentation, in the form of published books, has been a bit scarce,
but new ones have been coming out lately.  I just looked through one on
text applications in Python, but haven't bought it yet.  It definitely
looked good.
> Are there any small examples available on use of HTMLParser and htmllib?
> Specifically, I need something like the linkextor available in Perl.
One trick is to search on "import [modulename]" as a phrase.  That'll
often uncover code you can use as an example.  What does linkextor do?
Link extractor?  If so, I just use regular expressions.
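
Since the question comes up often, here is a minimal HTMLParser-based link
extractor as a sketch (written against today's html.parser module; in the
Python of that era the class lived in HTMLParser/htmllib, but the shape is
the same, and the page URL is a placeholder):

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href attributes, roughly what linkextor does."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("http://www.example.com/forum/")
parser.feed('<a href="thread.html?id=42">a thread</a>')
print(parser.links)  # ['http://www.example.com/forum/thread.html?id=42']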
> Also, what is the neatest way to store session data like login and
> password? PassWordMgr?
Store in what sense?

I'll take a look at my code and see if I can share something generic.
Since we're doing www.opensector.org, I suppose it would only be right
for us to share at least *some* of our code!
However... I just looked at what I have and the older stuff doesn't
really add much to Aahz's examples, other than some simple use of MySQL
as the store; my newer stuff is far too specific to the task I'm doing
to be able to quickly sanitize it.
The main thing I did to address our specific needs was to create a Java
class for message pages in specific types of web-based discussion
forums.  That's partly to extract URLs, but mostly to extract other
features and to intelligently (in the sense of being able to update my
database rapidly, re-visiting the minimum number of pages) navigate the
threading structures, which work in various ways.  The class for
Jive-based forums is only 225 lines, as an example.  The multi-threaded
module that uses it is 100 lines; a single-threaded version is 25 lines.
We also have a Python robot for NNTP servers, which obviously doesn't
need recursion.  It's about 400 lines.  A lot of it deals with things
like missing messages, zeroing in on desired date ranges, avoiding
downloading huge messages, recovery from failure, etc.
All of these talk to MySQL...

Nick

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Yahoo evolving robots.txt, finally

2004-03-15 Thread Nick Arnett
Walter Underwood wrote:


> Nah, they would have e-mailed me directly by now. I used to work
> with them at Inktomi.

How about dropping them an e-mail to invite them here?

Yahoo limits crawler access to its own site.  I haven't tried in the 
last 9 or 10 months, but the way it was back then, if you crawled the 
message boards, the crawler's IP address would be blocked for 
increasingly long time periods -- a day, two days, etc.  I tried slowing 
down our gathering, but couldn't find a speed at which they wouldn't 
eventually block it.  And of course they never responded to any 
questions about what they'd consider acceptable.
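
One thing worth trying is to treat each 403 as a signal to back off
exponentially rather than crawl at a fixed rate; a rough Python sketch (the
status-code check and the delays are illustrative guesses, not anything Yahoo
has published):

import time
import urllib.error
import urllib.request

def polite_fetch(url, base_delay=10, max_tries=5):
    """Fetch a URL, doubling the wait each time the server answers 403."""
    delay = base_delay
    for _ in range(max_tries):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise
            time.sleep(delay)  # blocked: wait, then try again more slowly
            delay *= 2
    return None  # still blocked; give the host a long rest instead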

And yet, their own servers don't seem to have a robots.txt that defines 
any limitations.  Sure would be nice if *they* would tell *us* what's 
acceptable when crawling Yahoo!

Nick

--
Nick Arnett
Director, Business Intelligence Services
LiveWorld Inc.
Phone/fax: (408) 551-0427
[EMAIL PROTECTED]
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] [Fwd: add-robot@robotstxt.org is not working]

2004-04-06 Thread Nick Arnett
Anybody know what this is about?

 Original Message 
Subject: [EMAIL PROTECTED] is not working
Date: Tue, 06 Apr 2004 08:25:29 +0300
From: Max Max [EMAIL PROTECTED]
Reply-To: Max Max [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Dear webmaster!  Our team developed the UptimeBot web spider and has been
using it for three months now.  I tried to send the appropriate form to
[EMAIL PROTECTED], but the mail delivery failed, so I am sending you the
same completed form.  If you are not concerned with adding new robots, I
apologize, but can you tell me where to apply?  Thanks for your help!  The
robot form follows:

robot-id: uptimebot
robot-name: UptimeBot
robot-cover-url: http://www.uptimebot.com
robot-details-url: http://www.uptimebot.com
robot-owner-name: UCO team
robot-owner-url: http://www.uptimebot.com
robot-owner-email: [EMAIL PROTECTED]
robot-status: active
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: uptimebot
robot-exclusion-useragent: no
robot-noindex: no
robot-host: uptimebot.com
robot-from: no
robot-useragent: uptimebot
robot-language: c++
robot-description: UptimeBot is a web crawler that checks return codes of web
servers and calculates an average of current server status.  The robot runs
daily, and visits sites in a random order.
robot-history: This robot is a local research product of the UptimeBot team.
robot-environment: research
modified-date: Sat, 19 March 2004 21:19:03 GMT
modified-by: UptimeBot team





Best regards.
Maks (aka Luft)
--
Nick Arnett
Director, Business Intelligence Services
LiveWorld Inc.
Phone/fax: (408) 551-0427
[EMAIL PROTECTED]
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots