Mark Beckman wrote:
> 
> Here is a set of .site files for the latest version of the Los Angeles
> Times.

Many thanks for these. Guess I better delurk here, as we've found a
novel use for Sitescooper that has solved a huge problem for us.

Collins Dictionaries (along with Birmingham University and COBUILD)
compile The Bank of English, a 400m+ word corpus of English usage for
research and dictionary compilation. [There's more about the Bank of
English here: http://www.cobuild.collins.co.uk/boe_info.html]. We also
collect text in many other languages, including French, German and
Spanish.

We have agreements with many newspapers (inc. the LA Times) to use their
text in our corpus. In the past, newspapers sent us text dumps from
their story databases on an irregular basis, in variable file formats,
on a variety of odd media. Using Sitescooper (and its -html option)
triggered from a cron job, we can harvest stories as they are posted.

We convert the HTML files to text via XML (using Dave Raggett's Tidy and
a series of XSLT scripts) and index them on our corpus server. It's an
unusual use for the program, but someone once said that the mark of a
truly great program is its ability to be used for something its author
never thought of.
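
For anyone wondering what the HTML-to-text step looks like in outline,
here is a minimal sketch. It is not our actual pipeline: it swaps the
Tidy and XSLT chain for Python's standard-library HTML parser, and the
names TextExtractor and html_to_text, along with the choice of which
tags count as block breaks, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    SKIP = {"script", "style"}  # content of these tags is dropped
    BLOCK = {"p", "div", "br", "h1", "h2", "h3", "li", "tr"}  # assumed line breaks

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Calling html_to_text() on each harvested page yields plain text ready
for indexing. A real newspaper page would also need its navigation and
advertising stripped, which this sketch does not attempt.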

 Stewart

-- 
Stewart C. Russell              Senior Analyst Programmer
[EMAIL PROTECTED]       Collins Dictionaries
use Disclaimer; my $opinion;    Bishopbriggs, Scotland

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk