Marek wrote:
> I'm trying to gather tons of text as quickly as possible without regard to
> maintaining an index or hierarchy. I just need to have lots of sentences.

Sounds like you are trying to build a corpus, like I do. A generic site download tool such as wget or snagorama might do the job better for you. Feed all the HTML files through w3m -dump, and you have text, mostly.

For cleaner text, it might be worth buying a corpus resource from ELRA or a similar organisation. They've gone to the bother of clearing rights, etc., which you'd otherwise have to do yourself if you did anything at all with the data.

Stewart

-- 
Stewart C. Russell, Senior Analyst Programmer
Collins Dictionaries, Bishopbriggs, Scotland
use Disclaimer; my $opinion;
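
For anyone who wants the "HTML files in, sentences out" step without installing w3m, here is a rough sketch in Python. It is a stand-in for w3m -dump, not a reproduction of it: it only strips tags with the standard library's html.parser and does no layout. The names TextDumper, html_to_text and split_sentences are made up for this illustration, and the sentence splitter is deliberately naive (it will trip over abbreviations such as "etc.").

```python
import re
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    dumper = TextDumper()
    dumper.feed(html)
    dumper.close()
    # Joining with a space keeps block elements apart, at the cost of
    # occasionally splitting a word that spans an inline tag.
    return re.sub(r"\s+", " ", " ".join(dumper.parts)).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


if __name__ == "__main__":
    sample = "<html><body><p>First sentence. Second one!</p></body></html>"
    for sentence in split_sentences(html_to_text(sample)):
        print(sentence)
```

Run over a directory of files fetched by a site downloader, this gives one sentence per line, which is usually enough for a quick-and-dirty corpus. The rights question mentioned above still applies to whatever you collect.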
