[vox-tech] Suggestions for cleaning up repetitive HTML tags?

Bill Kendrick Wed, 18 Aug 2010 10:49:01 -0700

I've come across some documents that are formatted in
such a way that, when converted to HTML, they come out
something like this:


  <font face="Arial">And</font> <font face="Arial">then</font>
  <font face="Arial">they</font> <font face="Arial">looked</font>

or even worse:

  <font face="Arial">A</font><font face="Arial">n</font><font
  face="Arial">d</font>
  ...


I've come up with a way, using PHP's DOMDocument system, to
strip these out of a file, but it's very slow, and the job is
basically something that can be done on a stream of text
(rather than having to worry about the document's structure).

I'm thinking of writing something in PHP or C to clean stuff
like this up, but am wondering if anyone else has any experience
and suggestions?

(And yes, I've used "htmltidy", but while that can merge _nested_
styles, e.g., "<font face="Arial"><font size=+1>" gets
combined into its own CSS style, e.g., "<span class="c123">",
it doesn't seem to be able to merge _consecutive_ styles,
as shown in the examples above. :^/ )
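
For illustration, here is a minimal sketch in C of the "stream of text"
idea described above. It is not a tested tool, only one possible shape
for it. The function name merge_fonts is hypothetical, and the sketch
assumes that font tags are lowercase, are never nested, and that
repeated opening tags are byte-for-byte identical (so a tag split
across lines, as in the second example, would need its whitespace
normalized first). Whenever a "</font>" is followed, after optional
whitespace, by the same opening tag that is currently open, both tags
are dropped and the whitespace between them is kept.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CLOSE_TAG "</font>"

/*
 * Copy `in` to `out`, merging consecutive identical <font ...> runs.
 * `out` must hold at least strlen(in) + 1 bytes; the result is never
 * longer than the input.
 * Assumptions (not from the original post): lowercase tags, no nested
 * font tags, and identical tags are byte-identical.
 */
static void merge_fonts(const char *in, char *out)
{
    const char *open = NULL;   /* most recent opening font tag */
    size_t open_len = 0;       /* its length, including the '>' */
    const size_t close_len = strlen(CLOSE_TAG);

    while (*in != '\0') {
        if (strncmp(in, "<font ", 6) == 0) {
            const char *end = strchr(in, '>');
            if (end != NULL) {
                /* Remember this opening tag and copy it through. */
                open = in;
                open_len = (size_t)(end - in) + 1;
                memcpy(out, in, open_len);
                out += open_len;
                in += open_len;
                continue;
            }
        } else if (open != NULL && strncmp(in, CLOSE_TAG, close_len) == 0) {
            /* Look past any whitespace for the same tag reopening. */
            const char *next = in + close_len;
            while (isspace((unsigned char)*next))
                next++;
            if (strncmp(next, open, open_len) == 0) {
                /* Drop the close/reopen pair, keep the whitespace. */
                size_t gap = (size_t)(next - (in + close_len));
                memcpy(out, in + close_len, gap);
                out += gap;
                in = next + open_len;
                continue;
            }
            open = NULL;       /* the run really ends here */
        }
        *out++ = *in++;        /* ordinary character: copy as-is */
    }
    *out = '\0';
}
```

Called on the first example line, with an output buffer at least as
large as the input, this should turn the two Arial spans for "And" and
"then" into a single span containing "And then". A different face or
size in the following tag leaves the text untouched. Working on a
whole buffer keeps the sketch short; a true streaming version would
need a small lookahead buffer instead.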


-- 
-bill!
Sent from my computer
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
