Marek wrote:
> 
> I'm trying to gather tons of text as quickly as possible without regard to
> maintaining an index or hierarchy. I just need to have lots of sentences.

Sounds like you are trying to build a corpus, as I do.

A generic site download tool such as wget or snagorama might do the job
better for you. Feed all the HTML files through w3m -dump, and you have
text, mostly.
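
A rough sketch of that second step, assuming the site has already been
mirrored into a local directory (for instance with wget's recursive mode)
and that w3m is on your PATH. The function names and the numbered output
file layout are made up for illustration; only the "w3m -dump" invocation
comes from the suggestion above, with -T added to force the content type.

```python
import shutil
import subprocess
from pathlib import Path


def find_html_files(root):
    """Return every .html/.htm file under root, sorted for a stable order."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )


def dump_command(html_path):
    """Build the w3m command line that renders one HTML file as plain text."""
    # -dump writes the rendered page to stdout; -T forces the content type.
    return ["w3m", "-dump", "-T", "text/html", str(html_path)]


def dump_tree(root, out_dir):
    """Run w3m -dump over a mirrored tree, writing one .txt file per page."""
    if shutil.which("w3m") is None:
        raise RuntimeError("w3m not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, html in enumerate(find_html_files(root)):
        result = subprocess.run(
            dump_command(html),
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # mirrored pages often have odd encodings
            check=True,
        )
        # Hypothetical naming scheme: 000000.txt, 000001.txt, ...
        target = out_dir / f"{i:06d}.txt"
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```

Calling dump_tree("mirror", "text") would then leave one text file per
page in text/. Expect navigation menus and footers to come through as
well, hence "mostly".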

For cleaner text, it might be worth buying a corpus resource from ELRA,
or a similar organisation. They've gone to the bother of clearing
rights, etc., which you'd have to do yourself if you did anything at all
with the data.

 Stewart

-- 
Stewart C. Russell              Senior Analyst Programmer
[EMAIL PROTECTED]       Collins Dictionaries
use Disclaimer; my $opinion;    Bishopbriggs, Scotland

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/sitescooper-talk
