[Wtr-general] HTML Pages with bad XML

John Castellucci Wed, 31 Jan 2007 13:14:51 -0800

Howdy all,

I'm working on a short project where I am parsing a page that happens to
contain some nodes that cause REXML to die -- some specific examples are:


<page _extended="true" user:="user:" per="per" Views="Views" />
<[EMAIL PROTECTED] _extended="true" />
<j _extended="true" 221,546="221,546" />

The nodes whose names contain @, : or , all throw:

c:/ruby/lib/ruby/site_ruby/1.8/rexml/parsers/treeparser.rb:90:in `parse':
#<REXML::ParseException: malformed XML: missing tag start
(REXML::ParseException)


I've hacked in a workaround (see below) that massages the HTML source
before passing it to REXML, but then I have to search the Document object
for the nodes I am looking for (instead of using the spiffy
IE.elements_by_xpath).

Any tips on getting Watir to be happy with lousy XML source? 

--john



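# Hack for the Watir::IE object to return an XML document that has been
# scrubbed of offending node names from the HTML source.
#
# Note: each gsub below runs over the whole source, so it also rewrites
# matching text inside attribute values and page content, not just names.
module Watir
        class IE
                def xml_source
                        header = "<?xml version=\"1.0\" encoding=\"us-ascii\"?>\n<HTML>\n"
                        xmlSource = html_source(document.body, header, " ")
                        xmlSource += "\n</HTML>\n"
                        # &nbsp; is not a predefined XML entity; use the numeric form
                        xmlSource = xmlSource.gsub(/&nbsp;/, '&#160;')
                        # drop the trailing colon from the "user:" attribute name
                        xmlSource = xmlSource.gsub(/user:/, 'user')
                        # replace every @ with an underscore
                        xmlSource = xmlSource.gsub(/@/, '_')
                        # remove every comma
                        xmlSource = xmlSource.gsub(/,/, '')
                        return REXML::Document.new(xmlSource)
                end
        end
end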
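
A narrower variant of the scrubbing above can be sketched as follows. Instead
of rewriting every @, : and , in the source, it renames only element and
attribute names that contain characters outside a conservative set, and leaves
attribute values and page text alone. This is a rough sketch, not Watir or
REXML API: safe_name and scrub_names are hypothetical helpers, the regexes
assume double-quoted attribute values with no < or > inside them, and a@b
stands in for the element name that the archive redacted. Colons are replaced
as well, so genuinely namespaced names would be altered too.

```ruby
# Hypothetical helpers (not part of Watir or REXML).

# Matches an opening or closing tag, skipping <!...> and <?...?> constructs.
# Captures: 1 = optional "/", 2 = element name, 3 = rest of the tag.
TAG  = /<(\/?)([^\s<>\/!?][^\s<>\/]*)([^<>]*)>/

# Matches an attribute name followed by =" inside a tag.
# Captures: 1 = leading whitespace, 2 = attribute name, 3 = the =" part.
ATTR = /(\s+)([^\s="'\/]+)(\s*=\s*")/

# Replace every character outside [A-Za-z0-9_.-] with "_", and prefix "_"
# when the result does not start with a letter or underscore.
def safe_name(name)
  cleaned = name.gsub(/[^A-Za-z0-9_.\-]/, '_')
  cleaned =~ /\A[A-Za-z_]/ ? cleaned : "_#{cleaned}"
end

# Rewrite element and attribute names only; values and text are untouched.
def scrub_names(markup)
  markup.gsub(TAG) do
    tag = Regexp.last_match
    attrs = tag[3].gsub(ATTR) do
      attr = Regexp.last_match
      "#{attr[1]}#{safe_name(attr[2])}#{attr[3]}"
    end
    "<#{tag[1]}#{safe_name(tag[2])}#{attrs}>"
  end
end

puts scrub_names('<j _extended="true" 221,546="221,546" />')
# prints: <j _extended="true" _221_546="221,546" />
```

The scrubbed string could then be handed to REXML::Document.new in place of
the @, : and , substitutions in xml_source. Renamed nodes would have to be
looked up under their new names (for example user_ and _221_546).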

_______________________________________________
Wtr-general mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/wtr-general
