Gregory,

This should be simple, though time-consuming. If you don't know in advance which domains you want to download news from, there's no need to customize for individual sites. Just get the links from a search engine and parse the list.
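
To illustrate the link-parsing step, here is a minimal sketch in Python rather than Revolution. It assumes the search results page has already been downloaded as HTML; the names `LinkCollector` and `extract_links` are made up for this example, and a real search engine may wrap its result links in redirects that need extra handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the results page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique external http(s) links in html, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parts.netloc == own_host:
            continue  # skip the search engine's own navigation links
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Dropping links that point back to the search engine's own host is only a rough way to separate hits from navigation; it is an assumption of this sketch, not something every results page will honour.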

Then download the pages and determine where the body text starts and finishes, or simply remove everything that probably isn't body text (e.g. text with a relatively large number of exclamation marks, short paragraphs without punctuation, etc.). Finally, store the text and mark for manual review the texts that seem to consist of body text only or garbage only.
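
The "remove everything that probably isn't body text" approach could look roughly like the Python sketch below. The thresholds (`max_exclamation_ratio`, `min_words`) and the function names are assumptions chosen for illustration, not tested values, and the page is assumed to be already split into paragraphs.

```python
def looks_like_body_text(paragraph, max_exclamation_ratio=0.02, min_words=8):
    """Guess whether a paragraph is body text (thresholds are assumptions)."""
    text = paragraph.strip()
    if not text:
        return False
    words = text.split()
    has_punctuation = any(ch in text for ch in ".,;:?")
    # Short paragraphs without punctuation are usually menus or captions.
    if len(words) < min_words and not has_punctuation:
        return False
    # A relatively large share of exclamation marks suggests advertising.
    if text.count("!") / len(text) > max_exclamation_ratio:
        return False
    return True


def filter_page(paragraphs):
    """Return (kept_text, needs_review) for one page's paragraphs.

    A page is flagged for manual review when the filter kept every
    paragraph (body text only) or kept nothing (garbage only).
    """
    non_empty = [p.strip() for p in paragraphs if p.strip()]
    kept = [p for p in non_empty if looks_like_body_text(p)]
    needs_review = len(kept) == 0 or len(kept) == len(non_empty)
    return "\n\n".join(kept), needs_review
```

For example, `filter_page(["Home News Contact", "Some long sentence, with punctuation, that reads like a story."])` keeps only the second paragraph and does not flag the page.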

Naturally, you will want to ignore sites that contain particular words and domains whose names contain particular words. You will probably also want to ignore nonsense domains (say, names longer than 7 characters with 0 or 1 vowels in them). I'm sure you'll find more ways to filter the search results once you start testing.
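
As a rough Python sketch of the domain checks described above: the rule "longer than 7 characters with 0 or 1 vowels" is applied to each dot-separated part of the host name, which is an assumption of this example, as are the names `is_nonsense_label` and `should_ignore` and the `blocked_words` parameter. Filtering on words inside the page itself is not shown.

```python
from urllib.parse import urlparse

VOWELS = set("aeiou")


def is_nonsense_label(label, longer_than=7, max_vowels=1):
    """True for a label longer than 7 characters with at most 1 vowel."""
    vowel_count = sum(1 for ch in label.lower() if ch in VOWELS)
    return len(label) > longer_than and vowel_count <= max_vowels


def should_ignore(url, blocked_words=()):
    """Decide whether a search hit should be skipped, based on its host."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return True  # not a usable URL
    # Skip domains whose name contains any unwanted word.
    if any(word in host for word in blocked_words):
        return True
    # Skip hosts where any part of the name looks like random letters.
    return any(is_nonsense_label(label) for label in host.split("."))
```

With these defaults, `should_ignore("http://xkqzrtwpl.com/story")` is true, while a host such as `www.reuters.com` passes because no part of it is longer than 7 characters.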

It is important to make your filters adjustable, preferably with a nice GUI, so you can tweak them without changing your scripts.
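
One simple way to keep filters adjustable is to read them from a settings file that a GUI (or a text editor) can change. The Python sketch below assumes a JSON file; the setting names in `DEFAULT_SETTINGS` and the two functions are hypothetical and only show the idea.

```python
import copy
import json

# Hypothetical filter settings; the names and values are assumptions.
DEFAULT_SETTINGS = {
    "blocked_words": ["casino", "lottery"],
    "max_exclamation_ratio": 0.02,
    "min_words": 8,
    "nonsense_longer_than": 7,
    "nonsense_max_vowels": 1,
}


def load_settings(path):
    """Return the defaults, overridden by whatever the file at path contains."""
    settings = copy.deepcopy(DEFAULT_SETTINGS)
    try:
        with open(path, encoding="utf-8") as handle:
            settings.update(json.load(handle))
    except FileNotFoundError:
        pass  # no settings file yet: use the defaults
    return settings


def save_settings(path, settings):
    """Write the settings as JSON so they can be edited outside the script."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(settings, handle, indent=2)
```

The filtering code then reads its thresholds from the dictionary returned by `load_settings`, so tuning a filter means editing the file, not the script.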

Best regards,

Mark Schonewille

--

Economy-x-Talk Consulting and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

A large collection of scripts for HyperCard, Revolution, SuperCard and other programming languages can be found at http://runrev.info




On 13 Mar 2008, at 14:34, Gregory Lypny wrote:

Hello everyone,

I'm working on a major research project that involves the analysis of hundreds of thousands of news releases. I've used Revolution to build utility applications that will index news files that I've obtained from Factiva, but now I'd like to expand my news sources. I'm hoping that you can advise me on the feasibility of building something in Revolution that would submit multiple queries (e.g., news for 2005 having to do with patent rejections), extract the links to the hits, then run through them and grab the individual stories to catalogue them. I can appreciate that it would have to be customized for each news site. Any insights on the general approach would be most appreciated.

Regards,


Gregory Lypny

Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
