RE: Analyzing News Stories

Gregory Lypny Wed, 26 Jan 2005 07:42:40 -0800

Thanks for your replies, Jonathan and Alex.

Jonathan,

I don't know what an RSS feed is, but if it refers to new stories or accessing news in real time, that's not what I'm doing. I want to tap into complete archives.

Most news outlets are moving to RSS feeds...

You could probably set up Rev to continuously monitor an RSS feed, and
pull out, save, and categorize those stories that you need.


Alex,

Index every word on a random sample of 100 stories. Eliminate any word that appears in more than 80% of them. Look briefly at those words that appear in between 50% and 80% and see what you think about them; if necessary, adjust the thresholds until it feels right for your purposes.

Easy enough and intuitive. Thanks.
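
A minimal sketch of that sampling idea in Python, assuming each story is already available as a plain-text string. The function names, the tokenizing pattern, and the default thresholds are illustrative choices, not anything prescribed in the thread:

```python
import re
from collections import Counter

def document_frequencies(stories):
    """Return {word: fraction of stories that contain the word}."""
    counts = Counter()
    for text in stories:
        # set() so a word counts at most once per story
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(stories)
    return {word: n / total for word, n in counts.items()}

def split_by_threshold(freqs, drop_above=0.8, review_above=0.5):
    """Words to eliminate outright, and words to look over by hand."""
    drop = sorted(w for w, f in freqs.items() if f > drop_above)
    review = sorted(w for w, f in freqs.items()
                    if review_above <= f <= drop_above)
    return drop, review
```

Calling split_by_threshold(document_frequencies(sample)) on the 100-story sample gives the two lists; adjusting drop_above and review_above is the "until it feels right" step.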


Do you want to index each word separately, or try to accumulate common
roots; e.g. cause, causes, caused, causing, causation ... one entry, 2
entries, 5 entries?


    One entry for starters.
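
One rough way to get a single entry per root is to strip a few common endings before indexing. The sketch below is a deliberately crude illustration in Python, not a real stemming algorithm; the suffix list and the minimum root length of three letters are assumptions chosen so that the example words above collapse together, and it will mis-group many other words:

```python
from collections import defaultdict

# Illustrative suffix list, longest first; an assumption, not a standard.
SUFFIXES = ("ation", "ing", "ed", "es", "s", "e")

def crude_root(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_root(words):
    """Map each crude root to the set of word forms that share it."""
    groups = defaultdict(set)
    for word in words:
        groups[crude_root(word)].add(word.lower())
    return dict(groups)
```

With this list, cause, causes, caused, causing and causation all land under the one root "caus". A proper stemmer would be the next step if the crude version groups too much or too little.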

I'd worry about whether I had deduced the serial numbering scheme fully. Did I get every story? Could there be any particular kind of story that was indexed differently (e.g. stories printed straight from the AP wire might be indexed differently from those written, or extensively modified, by the paper's own writers).

Yes, exactly. I thought about that. It'll require some tinkering.


There are some ethical issues about collecting large amounts of data;
you should, at a minimum, read up on the content of the robots.txt
system, and in general conform to the site's requests as described in
their robots.txt files.

Thanks. I didn't know about robots policies. So, I'll request that information.
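
For reference, a site's robots.txt can also be checked programmatically before any story is fetched. This is a small sketch using Python's standard urllib.robotparser; the rules in SAMPLE_ROBOTS_TXT, the host news.example.com, and the agent name "story-indexer" are all made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/private/
Crawl-delay: 10
"""

def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
agent = "story-indexer"  # hypothetical user-agent name
allowed = parser.can_fetch(
    agent, "http://news.example.com/archive/2005/01/story1.html")
blocked = parser.can_fetch(
    agent, "http://news.example.com/archive/private/story2.html")
delay = parser.crawl_delay(agent)  # seconds between requests, if given
```

Against a live site, set_url() followed by read() on the parser would fetch the real file instead of the sample text. Honoring the crawl delay between requests is part of conforming to the site's wishes.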

        Greg

_______________________________________________
use-revolution mailing list
[email protected]
http://lists.runrev.com/mailman/listinfo/use-revolution
