I've come across some documents that are formatted in such a way that, when converted to HTML, they come out something like this:
<font face="Arial">And</font> <font face="Arial">then</font> <font face="Arial">they</font> <font face="Arial">looked</font>

or even worse:

<font face="Arial">A</font><font face="Arial">n</font><font face="Arial">d</font> ...

I've come up with a way, using PHP's DOMDocument system, to scrape a file clear of these, but it's very slow, and it's basically something that can be done on a stream of text (rather than having to worry about the document's structure).

I'm thinking of writing something in PHP or C to clean stuff like this up, but am wondering if anyone else has any experience and suggestions?

(And yes, I've used "htmltidy", but while that can merge _nested_ styles, e.g., a "<font face="Arial"><font size=+1>" gets combined into its own CSS style, e.g., "<span class="c123">", it doesn't seem to be able to merge _consecutive_ styles, as shown in the examples above. :^/ )

--
-bill!
Sent from my computer
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech
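
For illustration, here is a minimal sketch of the "stream of text" idea, written in Python rather than PHP or C just to keep it short. It is not an existing tool: the function name merge_consecutive_fonts is made up for this example, and it assumes the font elements are flat (no other tags inside them) and that the opening tags to be merged are character-for-character identical, ignoring case. Anything that doesn't fit those assumptions is left untouched.

```python
import re

# A run of two or more flat <font ...>text</font> elements that all share the
# same opening tag, separated only by optional whitespace.
# Assumption: no nested markup inside the elements (hence [^<]*).
RUN = re.compile(
    r'(<font\b[^>]*>)[^<]*</font>'   # first element; group 1 is its opening tag
    r'(?:\s*\1[^<]*</font>)+',       # one or more siblings with the same opening tag
    re.IGNORECASE,
)


def merge_consecutive_fonts(html: str) -> str:
    """Collapse runs of identical consecutive <font> elements into one element."""

    def collapse(match):
        open_tag = match.group(1)
        # Remove every "</font> <font ...>" seam inside the run, but keep the
        # whitespace that sat between the two elements.
        seam = re.compile(r'</font>(\s*)' + re.escape(open_tag), re.IGNORECASE)
        return seam.sub(r'\1', match.group(0))

    return RUN.sub(collapse, html)
```

Run on the first example above, this should give <font face="Arial">And then they looked</font>, and on the per-letter example <font face="Arial">And</font>. Elements with different attributes (say, face="Arial" next to face="Times") are not merged, and anything with tags nested inside the font element would still need a real parser.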