Re: [Robots] Post

2002-11-08 Thread Paul Maddox
Hi all,

Wow, someone posted something!

How many subscribers are there?  What's everyone working on at the 
moment?

To answer your question, Rick, go to the URL below.

Paul.


On Thu, 7 Nov 2002 23:18:01 -0800 (PST), Rick Beacham wrote:
Sign me UP!!






Re: [Robots] Post

2002-11-08 Thread Paul Maddox
Hi,

I'm sure even Google themselves would admit that there's scope for 
improvement.  With Answers, Catalogs, Image Search, News, etc., etc., 
they seem to be quite busy! :-)

As an AI programmer specialising in NLP, I'd personally like to see 
web bots actually 'understand' the content they review, rather than 
index it by brute force.  How about the equivalent of Dmoz or the 
Yahoo Directory, but generated by a web spider?
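
For what it's worth, here's a toy sketch of what I mean in Python (the 
category names and keyword lists are purely illustrative, and keyword 
counting is only a crude stand-in for real NLP):

# Toy sketch: bucket a crawled page into directory-style categories
# by keyword evidence.  Categories and keywords are illustrative only.
import re
from collections import Counter

CATEGORIES = {
    "Computers/Robotics": {"robot", "spider", "crawler", "exclusion"},
    "Science/AI":         {"neural", "nlp", "ontology", "inference"},
    "Business/Search":    {"search", "index", "portal", "relevancy"},
}

def tokenize(html):
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag stripping
    return re.findall(r"[a-z]+", text)

def classify(html, min_hits=3):
    counts = Counter(tokenize(html))
    scores = {cat: sum(counts[w] for w in words)
              for cat, words in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None  # None = leave out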

Paul.


On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote:
Haven't seen traffic in ages.
I guess the theme's pretty much dead.

What's there to invent after Google?

-h






Re: [Robots] Post

2002-11-08 Thread Michel Leonard Goldstein
Hey Paul,

 It's great that somebody is trying to get this group moving again!

 I agree with you that there is still a lot to be done in 'understanding'
web pages. I'm especially hopeful that the "Semantic Web" initiative will,
before too long, give us a more tractable way of generating useful web
indexes. So far, NLP has proven too time-consuming and error-prone for a
task of this size (correct me if I'm wrong! NLP is not really my area).
Ontology use (which is a little closer to my area) has proven to help a
lot in classifying web pages, but there is still a long way to go before
it is as scalable as the brute-force algorithms Google uses.
 Remember that the web is highly dynamic and HUGE. There are no standard
protocols for receiving notification when a new page is created or when
the contents or address of a page change, so you always have to keep
"browsing" and updating your knowledge. My opinion is that Google may be
the best you can get (well, the ranking scheme can always be improved
with some minor changes) when you want to treat all web pages. NLP and
other processing methods can be used on top of this to generate something
better, but the domain has to be constrained.
 Maybe, as a long-term project, many different constrained indexes could
be combined if they are built on the same infrastructure (DAML+OIL/Web
Ontology?).
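
 To make the "same infrastructure" idea a little more concrete, here is a
minimal sketch (my own simplification; DAML+OIL builds on RDF, and real
RDF/XML parsing is much more involved) that pulls (subject, predicate,
object) triples out of the simplest flavor of RDF/XML with Python's
standard XML parser:

# Minimal sketch: extract (subject, predicate, object) triples from the
# simplest RDF/XML form, i.e. rdf:Description elements with literal or
# resource-valued children.  Real DAML+OIL processing is far richer.
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def triples(rdf_xml):
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(RDF + "Description"):
        subject = desc.get(RDF + "about")
        for prop in desc:
            obj = prop.get(RDF + "resource") or (prop.text or "").strip()
            yield (subject, prop.tag, obj)

Two indexes that both expose their contents as triples like these could,
in principle, be merged mechanically.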

 Well, that's my point of view. Actually, my research area is more
related to the integration part. I'm working on an Ontology-enabled Link
Discovery system. The objective of such a system is to find patterns
(spatial and temporal, hopefully seamlessly) in data that has previously
been pre-processed and stored using ontologies (mainly DAML+OIL, with
some extra things to enable pattern definition) as a data-structure
framework.

Michel



RE: [Robots] Post

2002-11-08 Thread Nick Arnett
As long as we're kicking around what's new, here's mine.  I've been working
on a system that finds topical Internet discussions (web forums, usenet,
mailing lists) and does some analysis of who's who, looking for the people
who connect communities together, lead discussions, etc.  At the moment,
it's focusing on Java developers.  It's been quite interesting to see what
it discovers in terms of how various subtopics are related and what other
things Java developers tend to be interested in.
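
As a flavor of the "who connects communities" step, here's a minimal
sketch (the record format and ranking are simplified for illustration,
not my actual schema): build a graph of who replies to whom, then rank
people by how many distinct forums they span.

# Toy sketch of community-connector analysis.  Each message record is
# (author, reply_to_author_or_None, forum); the format is illustrative.
from collections import defaultdict

def connectors(messages, min_forums=2):
    forums_by_author = defaultdict(set)   # author -> forums posted in
    degree = defaultdict(int)             # author -> reply-link count
    for author, reply_to, forum in messages:
        forums_by_author[author].add(forum)
        if reply_to:
            degree[author] += 1
            degree[reply_to] += 1
    # People who post in several forums and sit on many reply links are
    # candidate "connectors" between communities.
    ranked = sorted(forums_by_author,
                    key=lambda a: (len(forums_by_author[a]), degree[a]),
                    reverse=True)
    return [a for a in ranked if len(forums_by_author[a]) >= min_forums]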

Regarding markup, etc., in the back of my mind I've had the notion of
enhancing my spider to recognize how to parse and recurse forums and list
archives, so that I don't have to write new code for every different forum
or archiving format.  But it's not something I'd be comfortable tossing out
into the open, since it obviously would be a tool that spammers could use
for address harvesting.

I'm essentially creating a toolbox with Python and MySQL, which I'm using to
create custom information products for consulting clients.  For the moment,
those (obviously) are companies with a strong interest in Java.

Nick

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]




RE: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
Sounds interesting.
I'd love to see some screenshots of some community graphs and the main
characters in them... possible?

Otis






RE: [Robots] Post

2002-11-08 Thread Brian Broderick
Where are your proposals located?

-Original Message-
From: Sean 'Captain Napalm' Conner [mailto:spc@conman.org] 
Sent: Friday, November 08, 2002 1:46 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Post


  Well, I was surprised to recently find that O'Reilly has mentioned me
in their book _HTTP: The Definitive Guide_; it seems they mentioned my
proposed draft extensions to the Robot Exclusion Protocol [1], although
I'm not sure what they said about it (my friend actually found the
reference in O'Reilly; I haven't had a chance to check it out
myself---page 230 if anyone has it).

  Does anyone know if any robots out there actually implement any of the
proposals?  It'd be interesting to know.

  -spc (Six years since that was proposed ... )




Re: [Robots] Post

2002-11-08 Thread Otis Gospodnetic
I think I remember those proposals, actually.
I have never heard anyone mention them anywhere else, so I don't think
anyone has implemented a crawler that looks for those new things in
robots.txt.
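
Honoring those extensions would mostly be a parsing exercise, though.
Here's a minimal sketch of a tolerant robots.txt reader that keeps
unknown fields around instead of discarding them (I'm recalling field
names like Visit-time and Request-rate from the draft, so treat those
as assumptions):

# Minimal sketch: parse robots.txt but retain extension fields
# (e.g. Visit-time, Request-rate) rather than silently dropping them.
def parse_robots(text):
    records, current = [], None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()     # strip comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # A User-agent line after rules starts a new record.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                records.append(current)
            current["agents"].append(value)
        elif current is not None:
            # Disallow, Allow, Visit-time, Request-rate, ... all kept.
            current["rules"].append((field, value))
    return records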

Otis






RE: [Robots] Post

2002-11-08 Thread Matthew Meadows
Regarding this: 

What's there to invent after Google? 

Quite a lot, actually.  Google has built a magnificent search portal 
for the Internet, but there's still room in the market for companies 
like Inktomi, Verity, DTSearch, AltaVista, and dozens of others big 
and small.  The reason is that search is an extremely rich problem 
domain, and different users have different search needs.  Searching 
source code, tagged documents, databases, log files, archives, LDAP 
servers, Usenet, and the Internet is a lot to ask of any single product. 
Google, AllTheWeb, and other free search engines are optimized for 
one aspect of the IR problem domain: returning relevancy-scored results 
for queries against a massive index of web content.  Their business model 
is largely based on selling advertisements that correspond to keywords 
entered into a search page and providing a compelling portal for 
end users to link out to other sites, and the choices they've made in 
their indexing approach reflect that model.  However, many of these 
choices are not necessarily suitable for other aspects of the IR 
problem. 

For instance, most of these indexing algorithms for internet search are 
lossy, and the index administrators (or programmers) have determined the 
depth of the index.  The index relies on stop terms to keep it a manageable 
size, and the result sets include only a fraction of results out of many 
billions, for good reason.  But these kinds of constraints are not 
suitable for source code, log file, or legal document analysis.  Further, 
the types of weightings used in relevancy scoring are not necessarily the 
same across different document repositories.  For instance, popularity-based 
relevance has little bearing on corporate LANs full of ordinary business 
documents, and whereas keyword and metatag scoring have fallen out of 
favor with free public search engines, they may be very effective parameters 
in scoring a query against a more controlled document repository.  Truly 
creating the most effective index possible requires the index administrator 
or an automated query optimizer to adjust the weightings of a wide range 
of variables that impact the size, depth, and effectiveness of the index. 
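
Concretely, the tunable-weight idea can be as simple as this sketch (the 
feature names and default values are made up for illustration, not any 
product's actual scoring): 

# Sketch of a relevancy scorer with administrator-tunable weights.
# Feature names and default weights are illustrative only.
DEFAULT_WEIGHTS = {
    "keyword_tf":  1.0,   # query-term frequency in body text
    "metatag_hit": 0.5,   # query terms found in meta tags
    "title_hit":   2.0,   # query terms found in the title
    "popularity":  0.1,   # link popularity; near-useless on a LAN
}

def score(doc_features, weights=DEFAULT_WEIGHTS):
    """doc_features maps feature name -> raw value for one document."""
    return sum(weights.get(name, 0.0) * value
               for name, value in doc_features.items())

# On a corporate LAN an administrator might zero out popularity and
# boost metatag scoring, since the metadata there is trustworthy:
lan_weights = dict(DEFAULT_WEIGHTS, popularity=0.0, metatag_hit=1.5)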

Consider also vertical searches, indexes optimized for a specific domain. 
A researcher in a particular discipline may benefit from having a clean index 
with a finely-honed affinity to that discipline.  Such indexes allow for 
a tremendous signal-to-noise ratio.  Imagine for example an index specific to 
Genetic Programming that contains daily traffic from message boards, 
Usenet messages and other online content intersected with information from 
your LAN, your inbox, your source code, and other proprietary sources.  
You can achieve an effective depth and breadth of content in such an index 
with far fewer resources than would be required for a less discriminating 
database. 

Finally, don't forget about cost.  Last time I checked, the enterprise 
versions of Google, AltaVista, and Inktomi (as far as I recall) all charged 
an escalating fee that corresponds to the number of documents indexed, a 
licensing model that may drastically increase the TCO of these solutions as 
the end user's business grows. 

I have built a discriminating filer that has most of these capabilities, and 
many more that I can't describe here.  That's why I never post: I've been 
busy working on the project on the side for over three years.  I can reveal 
more about it in the next couple of months, after my management decides its 
level of interest in ownership of the code.  

It's good to see the activity on the mailing list today.  I suspect that a 
lot of people who would normally post are just busy working on their own 
robots, or are just flat-out lucky enough to be working. 


