I wonder whether the HTML actually declares itself as UTF-8. If it contains a lot of multi-byte characters but declares some other 8-bit encoding, each byte of those characters would be read as a separate character, and you would get the situation you describe, I think.
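To illustrate the kind of mismatch I mean, here is a minimal Python sketch. The sample string and the choice of ISO-8859-1 as the wrongly declared encoding are assumptions for illustration only; I have not seen your file, so the real declaration may differ.

```python
# Hypothetical sample text; the real document content is unknown.
text = "S\u00f3lo"  # "Sólo", with one non-ASCII character

# The file on disk holds UTF-8 bytes: the accented letter takes two bytes.
raw = text.encode("utf-8")

# Assumed wrong declaration: a parser told the data is ISO-8859-1
# turns every byte into its own character.
wrong = raw.decode("iso-8859-1")

# With the correct declaration the two bytes form one character again.
right = raw.decode("utf-8")

print(len(raw), repr(wrong), repr(right))
```

If the output you are seeing has pairs of odd characters wherever an accented letter should be, that pattern points to this sort of mislabelled encoding.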
Can you show us the original HTML?

-- 
Sebastian Rahtz
Information Manager, Oxford University Computing Services