Thanks for your replies, Jonathan and Alex.

Jonathan,

I don't know what an RSS feed is, but if it refers to new stories or accessing news in real time, that's not what I'm doing. I want to tap into complete archives.

Most news outlets are moving to RSS feeds...

You could probably set up Rev to continuously monitor an RSS feed, and
pull out, save, and categorize those stories that you need.



Alex,

Index every word on a random sample of 100 stories. Eliminate any word
that appears in more than 80% of them. Look briefly at those words that
appear in between 50% and 80% and see what you think about them; if
necessary, adjust the thresholds until it feels right for your purposes.

Easy enough and intuitive. Thanks.
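
A minimal sketch of that thresholding idea in Python, not Rev, just to pin the procedure down. The function names, the word-splitting pattern, and the tiny sample are my own assumptions; the 80% and 50% cutoffs are the ones suggested above and are meant to be adjusted.

```python
import re
from collections import Counter

def document_frequencies(stories):
    """Return {word: fraction of stories that contain the word}."""
    counts = Counter()
    for text in stories:
        # set() so a word counts once per story, however often it repeats
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(stories)
    return {word: n / total for word, n in counts.items()}

def split_by_threshold(freqs, drop_above=0.8, review_above=0.5):
    """Words to eliminate outright, and words to look at by hand."""
    drop = sorted(w for w, f in freqs.items() if f > drop_above)
    review = sorted(w for w, f in freqs.items()
                    if review_above <= f <= drop_above)
    return drop, review

# Hypothetical stand-in for the random sample of 100 stories.
sample = ["the a cat", "the a dog", "the a bird", "the fish", "the cow"]
freqs = document_frequencies(sample)
drop, review = split_by_threshold(freqs)
```

With this sample, "the" lands in the drop list and "a" in the review list; everything else stays in the index.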

Do you want to index each word separately, or try to accumulate common roots; e.g. cause, causes, caused, causing, causation ... one entry, 2 entries, 5 entries?


One entry for starters.
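
To make "one entry" concrete, here is a rough sketch of collapsing word forms onto a shared root by stripping suffixes. It is a deliberately crude stand-in, not a real stemming algorithm; the suffix list and the three-letter minimum are assumptions chosen to handle the cause/causes/caused example, and they will mis-group plenty of other words.

```python
from collections import defaultdict

# Checked in this order, so longer suffixes win over shorter ones.
SUFFIXES = ("ation", "ing", "ed", "es", "e", "s")

def crude_root(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_root(words):
    """Map each root to the set of word forms that reduce to it."""
    groups = defaultdict(set)
    for w in words:
        groups[crude_root(w)].add(w.lower())
    return dict(groups)

groups = group_by_root(["cause", "causes", "caused", "causing", "causation"])
```

All five forms end up under the single root "caus", so the index would carry one entry for them.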

I'd worry about whether I had deduced the serial numbering scheme fully.
Did I get every story? Could there be any particular kind of story that
was indexed differently (e.g. stories printed straight from the AP wire
might be indexed differently from those written, or extensively
modified, by the paper's own writers).

Yes, exactly. I thought about that. It'll require some tinkering.
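
One small piece of that tinkering could be a gap check on the serial numbers already collected. This sketch assumes the serials are plain integers, which is a guess about the scheme; the function name and sample numbers are made up for illustration.

```python
def missing_serials(fetched_ids):
    """Return serial numbers absent between the lowest and highest fetched."""
    if not fetched_ids:
        return []
    have = set(fetched_ids)
    return [n for n in range(min(have), max(have) + 1) if n not in have]

# Hypothetical serials pulled from an archive.
gaps = missing_serials([1001, 1002, 1005])
```

This only finds holes inside the range already seen. It says nothing about stories filed under a different numbering scheme, such as the wire-service case raised above.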

There are some ethical issues about collecting large amounts of data; you should, at a minimum, read up on the content of the robots.txt system, and in general conform to the site's requests as described in their robots.txt files.

Thanks. I didn't know about robots policies. So, I'll request that information.
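
For reference, a short sketch of checking a robots.txt policy with Python's standard urllib.robotparser module. The sample policy, the bot name "my-archive-bot", and the example.com URLs are all invented for illustration; a real site's file will differ.

```python
from urllib.robotparser import RobotFileParser

# Invented policy text; a real one lives at the site's /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/private/
Crawl-delay: 10
"""

def make_checker(robots_txt):
    """Build a parser from robots.txt text that is already in hand."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

checker = make_checker(SAMPLE_ROBOTS_TXT)
agent = "my-archive-bot"  # hypothetical user-agent name

ok_public = checker.can_fetch(agent, "http://example.com/archive/1999/story1.html")
ok_private = checker.can_fetch(agent, "http://example.com/archive/private/story2.html")
delay = checker.crawl_delay(agent)  # seconds to wait between requests, if stated
```

Against a live site, set_url() followed by read() on the same parser fetches the file directly, so the policy does not have to be requested from anyone.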


        Greg



_______________________________________________
use-revolution mailing list
[email protected]
http://lists.runrev.com/mailman/listinfo/use-revolution
