On Wed, 2010-08-18 at 10:48 -0700, Bill Kendrick wrote:
> I've come across some documents that are formatted in
> such a way that, when converted to HTML, they come out
> something like this:
>
>   <font face="Arial">And</font> <font face="Arial">then</font>
>   <font face="Arial">they</font> <font face="Arial">looked</font>
>
> or even worse:
>
>   <font face="Arial">A</font><font face="Arial">n</font><font face="Arial">d</font>
>   ...
>
> I've come up with a way, using PHP's DOMDocument system, to
> scrape a file clear of these, but it's very slow, and it's
> basically something that can be done on a stream of text
> (rather than having to worry about the document's structure).
>
> I'm thinking of writing something in PHP or C to clean stuff
> like this up, but am wondering if anyone else has any experience
> and suggestions?
>
> (And yes, I've used "htmltidy", but while that can merge _nested_
> styles, e.g., a "<font face="Arial"><font size=+1>" get
> combined into its own CSS style, e.g., "<span class="c123">",
> it doesn't seem to be able to merge _consecutive_ styles,
> as shown in the examples above. :^/ )

Consider writing a SAX filter that just drops the offending <font> and
</font> tags.

Also consider using XPath, as in the following Ruby example (using the
Nokogiri XML library):

require 'nokogiri'

# Fold every following sibling <font> (and the text between them)
# into the first <font> child of each parent.
def reform xml
  xml.xpath('//font[1]').each do |x|
    textnodes=x.xpath('(following-sibling::text() | following-sibling::font/text())')
    # Use the raw text (content) rather than to_s, so entities such as
    # &amp; are not escaped a second time by content=
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> <b>More</b></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> More</test>')
puts reform(xml).to_xml

#That last example probably does the wrong thing: the trailing " More"
#text gets pulled into the <font> element.
#To fix that you might want the following more complicated version of
#the XPath, which only picks up text nodes that are immediately
#followed by another <font>.

def reform xml
  xml.xpath('//font[1]').each do |x|
    textnodes=x.xpath('(following-sibling::text()[following-sibling::node()[1][self::font]] | following-sibling::font/text())')
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end

#More hackage may be necessary depending on the exact structure of your data.

_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech
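
The question mentions that this could be handled as a stream of text
rather than through the document's structure. As a rough sketch of that
idea (this is not the SAX or XPath approach above, just a
regular-expression pass in plain Ruby), something like the following
might work. The names merge_adjacent_fonts and ADJACENT_FONTS are made
up for illustration, and the pattern assumes each <font> element holds
only plain text (no nested tags) and that "same style" means the
opening tags are textually identical.

```ruby
# Hypothetical helper, not from the thread: merges runs of adjacent
# <font> elements whose opening tags are textually identical.
# Assumes the <font> elements contain plain text only (no nested tags).
ADJACENT_FONTS = %r{(<font\b[^>]*>)([^<]*)</font>(\s*)\1}i

def merge_adjacent_fonts(html)
  merged = html.dup
  # Each pass joins neighbouring pairs, so repeat until a pass changes
  # nothing. gsub! returns nil when no substitution was made.
  loop do
    break unless merged.gsub!(ADJACENT_FONTS, '\1\2\3')
  end
  merged
end

puts merge_adjacent_fonts(
  '<font face="Arial">A</font><font face="Arial">n</font><font face="Arial">d</font>'
)
puts merge_adjacent_fonts(
  '<font face="Arial">And</font> <font face="Arial">then</font> <b>More</b>'
)
```

Whitespace between the merged elements is kept inside the surviving
<font>, and elements with different attributes are left alone. Real
documents with nested markup inside <font> would still need one of the
parser-based approaches.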