Re: [Robots] Post
Hi all,

Wow, someone posted something! How many subscribers are there? What's everyone working on at the moment?

To answer your question, Rick, go to the URL below.

Paul.

On Thu, 7 Nov 2002 23:18:01 -0800 (PST), Rick Beacham wrote:
> Sign me UP!!

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
Re: [Robots] Post
Hi,

I'm sure even Google themselves would admit that there's scope for improvement. With Answers, Catalogs, Image Search, News, etc., they seem to be quite busy! :-)

As an AI programmer specialising in NLP, personally I'd like to see web bots actually 'understanding' the content they review, rather than indexing by brute force. How about the equivalent of Dmoz or the Yahoo Directory, but generated by a web spider?

Paul.

On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote:
> Haven't seen traffic in ages. I guess the theme's pretty much dead.
> What's there to invent after Google?
> -h
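Paul's idea of a spider-generated, Dmoz-style directory boils down to classifying each crawled page's text into a category tree. A minimal sketch of just the classification step, using a plain naive-Bayes model over page words (the categories and training snippets below are invented for illustration; a real directory would need far more training data and a deep category hierarchy):

```python
import math
from collections import Counter, defaultdict

class DirectoryClassifier:
    """Tiny naive-Bayes text classifier: files a crawled page's text
    under one of a handful of directory categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training doc count
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            # log prior plus log likelihood with add-one smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            cat_total = sum(self.word_counts[cat].values())
            for w in words:
                score += math.log(
                    (self.word_counts[cat][w] + 1) / (cat_total + len(self.vocab))
                )
            if score > best_score:
                best, best_score = cat, score
        return best

clf = DirectoryClassifier()
clf.train("Science", "genome protein gene sequence biology")
clf.train("Computers", "java compiler thread bytecode virtual machine")
print(clf.classify("the java virtual machine runs bytecode"))  # Computers
```

In a spider, each fetched page's extracted text would be passed to `classify()` and the URL filed under the winning category, giving a crude automated directory.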
Re: [Robots] Post
Hey Paul,

Great that somebody is trying to get this group moving again! I agree with you that there is still a lot to be done in 'understanding' web pages. I'm especially hopeful that the "Semantic Web" initiative will, in the not-too-long run, give us a more tractable way of generating useful web indexes. So far NLP has shown itself to be too time-consuming and error-prone for a task this size (correct me if I'm wrong! NLP is not really my area). Ontology use (that's a little closer to my area) has been shown to help a lot in classifying web pages, but there is still a long way to go before it is as scalable as the brute-force algorithms used by Google.

Remember that the web is highly dynamic and HUGE. There are no standard protocols for receiving messages when a new page is created or when the contents or address of a page have changed, so you always have to keep "browsing" and updating knowledge. My opinion is that Google is maybe the best you can get (well, the ranking scheme can always get a little better with some minor changes) when you want to treat all web pages. NLP and other processing methods can be used on top of this to generate something better, but the domain has to be constrained. Maybe, as a long-term project, many different constrained indexes could be combined if they are made using the same infrastructure (DAML+OIL / Web Ontology?).

Well, that's my point of view. Actually, my research area is more related to the integration part. I'm working on an ontology-enabled Link Discovery system. The objective of such a system is to find patterns (spatial and temporal, hopefully seamlessly) in data that has previously been pre-processed and stored using ontologies (mainly DAML+OIL, with some extra things to enable pattern definition) as a data-structure framework.

Michel

Paul Maddox wrote:
> I'm sure even Google themselves would admit that there's scope for
> improvement. As an AI programmer specialising in NLP, personally I'd
> like to see web bots actually 'understanding' the content they review,
> rather than indexing by brute force. How about the equivalent of Dmoz
> or the Yahoo Directory, but generated by a web spider?
RE: [Robots] Post
As long as we're kicking around what's new, here's mine. I've been working on a system that finds topical Internet discussions (web forums, Usenet, mailing lists) and does some analysis of who's who, looking for the people who connect communities together, lead discussions, etc. At the moment, it's focusing on Java developers. It's been quite interesting to see what it discovers in terms of how various subtopics are related and what other things Java developers tend to be interested in.

Regarding markup, etc., in the back of my mind I've had the notion of enhancing my spider to recognize how to parse and recurse forums and list archives, so that I don't have to write new code for every different forum or archiving format. But it's not something I'd be comfortable tossing out into the open, since it obviously would be a tool that spammers could use for address harvesting.

I'm essentially creating a toolbox with Python and MySQL, which I'm using to create custom information products for consulting clients. For the moment, those (obviously) are companies with a strong interest in Java.

Nick
--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]
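The "who connects communities" analysis Nick describes can be approximated with a simple co-participation graph: link two authors whenever they post in the same thread, then rank authors by how many distinct people they are linked to. A rough sketch in Python (the thread data here is invented for illustration; a real system would pull it from the MySQL store):

```python
from collections import defaultdict
from itertools import combinations

def build_coparticipation_graph(threads):
    """Given {thread_id: [author, ...]}, count, for every pair of authors,
    how many threads they both posted in."""
    edges = defaultdict(int)
    for authors in threads.values():
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges

def connectors(edges, top=3):
    """Rank authors by how many distinct people they share threads with --
    a crude proxy for 'people who connect communities'."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree, key=degree.get, reverse=True)[:top]

threads = {
    "t1": ["alice", "bob"],
    "t2": ["alice", "carol"],
    "t3": ["alice", "dave"],
    "t4": ["bob", "carol"],
}
edges = build_coparticipation_graph(threads)
print(connectors(edges, top=1))  # ['alice']
```

Distinct-neighbor degree is the simplest centrality measure; a fuller analysis would weight edges by thread count or use betweenness to find true bridges between otherwise separate groups.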
RE: [Robots] Post
Sounds interesting. I'd love to see some screenshots of some community graphs and the main characters in them. Possible?

Otis

--- Nick Arnett [EMAIL PROTECTED] wrote:
> As long as we're kicking around what's new, here's mine. I've been
> working on a system that finds topical Internet discussions (web
> forums, Usenet, mailing lists) and does some analysis of who's who,
> looking for the people who connect communities together, lead
> discussions, etc.
RE: [Robots] Post
Where are your proposals located?

-----Original Message-----
From: Sean 'Captain Napalm' Conner [mailto:spc;conman.org]
Sent: Friday, November 08, 2002 1:46 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Post

Well, I was surprised to recently find that O'Reilly has mentioned me in their book _HTTP: The Definitive Guide_; it seems they mentioned my proposed draft extensions to the Robot Exclusion Protocol [1], although I'm not sure what they said about it (my friend actually found the reference; I haven't had a chance to check it out myself---page 230 if anyone has it). Does anyone know if any robots out there actually implement any of the proposals? It'd be interesting to know.

-spc (Six years since that was proposed ...)
Re: [Robots] Post
I think I remember those proposals, actually. I have never heard anyone mention them anywhere else, so I don't think anyone has implemented a crawler that looks for those new things in robots.txt.

Otis

--- Sean 'Captain Napalm' Conner [EMAIL PROTECTED] wrote:
> Well, I was surprised to recently find that O'Reilly has mentioned me
> in their book _HTTP: The Definitive Guide_; it seems they mentioned my
> proposed draft extensions to the Robot Exclusion Protocol, although
> I'm not sure what they said about it. Does anyone know if any robots
> out there actually implement any of the proposals?
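For anyone wanting to experiment along these lines: Python's standard library ships a parser for the original 1994 exclusion protocol, and recent versions also understand a few later extensions such as Crawl-delay; Sean's draft extensions would still need custom handling on top of it. A quick sketch:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt using the classic directives plus the widely adopted
# Crawl-delay extension (this file content is a made-up example).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Standard Disallow handling from the 1994 protocol
print(rp.can_fetch("AnyBot", "http://example.com/private/secret.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))           # True

# Crawl-delay is exposed directly (Python 3.6+)
print(rp.crawl_delay("AnyBot"))  # 5
```

A crawler honoring unofficial extensions beyond these would read the raw file itself and fall back to `RobotFileParser` for the standard directives.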
RE: [Robots] Post
Regarding this: "What's there to invent after Google?"

Quite a lot, actually. Google has built a magnificent search portal for the Internet, but there's still room in the market for companies like Inktomi, Verity, dtSearch, AltaVista, and dozens of others big and small. The reason is that search is an extremely rich problem domain, and different users have different search needs. Searching source code, tagged documents, databases, log files, archives, LDAP servers, Usenet, and the Internet is a lot to ask of any single product.

Google, AllTheWeb, and other free search engines are optimized for one aspect of the IR problem domain: returning relevancy-scored results to queries against a massive index of web content. Their business model is largely based on selling advertisements that correspond to keywords entered into a search page and on providing a compelling portal for end users to link out to other sites, and the choices they've made in their indexing approach reflect that model. However, many of these choices are not necessarily suitable for other aspects of the IR problem. For instance, most of these indexing algorithms for Internet search are lossy, and the index administrators (or programmers) have determined the depth of the index. The index relies on stop terms to keep it a manageable size, and the result sets include a fraction of results out of orders of billions, for good reason. But these kinds of constraints are not suitable for source code, log file, or legal document analysis.

Further, the types of weightings used in relevancy scoring are not necessarily the same across different document repositories. For instance, popularity-based relevance has little bearing on corporate LANs full of ordinary business documents, and whereas keyword and meta-tag scoring have fallen out of favor with free public search engines, they may be very effective parameters in scoring a query against a more controlled document repository. To truly create the most effective index possible requires the index administrator, or an automated query optimizer, to adjust the weightings of a wide range of variables that impact the size, depth, and effectiveness of the index.

Consider also vertical searches: indexes optimized for a specific domain. A researcher in a particular discipline may benefit from having a clean index with a finely honed affinity to that discipline. Such indexes allow for a tremendous signal-to-noise ratio. Imagine, for example, an index specific to Genetic Programming that contains daily traffic from message boards, Usenet messages, and other online content, intersected with information from your LAN, your inbox, your source code, and other proprietary sources. You can achieve an effective depth and breadth of content in such an index with far fewer resources than would be required for a less discriminating database.

Finally, don't forget about cost. Last time I checked, the enterprise versions of Google, AltaVista, and Inktomi - as far as I recall - all charge an escalating fee that corresponds to the number of documents indexed, a licensing model that may drastically increase the TCO of these solutions as the end user's business grows.

I have built a discriminating filter that has most of these capabilities, and many more that I can't describe here. That's why I never post; I've been busy working on the project on the side for over three years. I can reveal more about it in the next couple of months, after my management decides its level of interest in ownership of the code.

It's good to see the activity on the mailing list today. I suspect that a lot of people who would normally post are just busy working on their own robots, or just flat-out lucky enough to be working.
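The tunable-weighting idea above can be made concrete with a toy scoring function: each signal (term frequency, metadata match, link popularity) is computed separately and combined under administrator-chosen weights. The field names, signals, and weight profiles below are illustrative only, not any product's actual API:

```python
def score(doc, query_terms, weights):
    """Combine several relevance signals under tunable weights."""
    body = doc["body"].lower().split()
    # Raw term frequency, normalized by document length
    tf = sum(body.count(t) for t in query_terms) / max(len(body), 1)
    # Fraction of query terms found in author-supplied keywords/meta tags
    meta = sum(t in doc.get("keywords", ()) for t in query_terms) / max(len(query_terms), 1)
    # Link-based popularity, assumed pre-computed in [0, 1]
    pop = doc.get("popularity", 0.0)
    return weights["tf"] * tf + weights["meta"] * meta + weights["popularity"] * pop

# An intranet profile might ignore popularity entirely and trust meta tags;
# a public-web profile would do roughly the opposite.
doc = {"body": "quarterly report revenue figures",
       "keywords": ["report"], "popularity": 0.9}
intranet = {"tf": 1.0, "meta": 2.0, "popularity": 0.0}
web = {"tf": 1.0, "meta": 0.0, "popularity": 2.0}
print(score(doc, ["report"], intranet))  # 2.25
print(score(doc, ["report"], web))       # 2.05
```

The point is that the same document scores differently under different weight profiles; an index administrator (or an automated optimizer) tunes the profile to the repository, exactly the flexibility a one-size-fits-all public engine does not expose.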
-----Original Message-----
From: Paul Maddox [mailto:paulmdx;hotpop.com]
Sent: Friday, November 08, 2002 3:42 AM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Post

> As an AI programmer specialising in NLP, personally I'd like to see
> web bots actually 'understanding' the content they review, rather
> than indexing by brute force. How about the equivalent of Dmoz or the
> Yahoo Directory, but generated by a web spider?