Hi,
I have wrappers around the libxml calls and have already implemented the
"hack" to skip the BOM. But I think there is an issue here: when I use,
for instance, xmlParserInputBufferCreateIO(), the BOM is skipped also
when the encoding is provided. To me this is an inconsistency, since
xmlCtxtReadIO() should work the same way. Or maybe I missed something.
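For reference, the kind of BOM-skipping wrapper mentioned above could
look roughly like the sketch below. It is not the actual wrapper code:
bom_skip_ctx, bom_skip_read, mem_src and mem_read are hypothetical
names, and only the UTF-8 BOM (EF BB BF) is handled. The callback has
the same shape as libxml2's xmlInputReadCallback, so it is assumed it
can be passed as the ioread argument of xmlCtxtReadIO() with a
bom_skip_ctx as the ioctx.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as libxml2's xmlInputReadCallback. */
typedef int (*read_fn)(void *context, char *buffer, int len);

/* Hypothetical wrapper state: the real reader plus BOM bookkeeping. */
typedef struct {
    read_fn inner;       /* underlying read callback (e.g. HTTP stream) */
    void   *inner_ctx;   /* context for the underlying callback */
    int     checked;     /* nonzero once the stream start was inspected */
    char    pending[3];  /* leading bytes that turned out not to be a BOM */
    int     pending_len;
    int     pending_pos;
} bom_skip_ctx;

/* Read callback that drops a leading UTF-8 BOM and forwards the rest. */
static int bom_skip_read(void *context, char *buffer, int len)
{
    bom_skip_ctx *c = (bom_skip_ctx *)context;

    if (len <= 0)
        return 0;
    if (!c->checked) {
        c->checked = 1;
        /* The inner reader may return short reads, so loop up to 3 bytes. */
        while (c->pending_len < 3) {
            int n = c->inner(c->inner_ctx, c->pending + c->pending_len,
                             3 - c->pending_len);
            if (n < 0)
                return n;
            if (n == 0)
                break;
            c->pending_len += n;
        }
        if (c->pending_len == 3 &&
            memcmp(c->pending, "\xEF\xBB\xBF", 3) == 0)
            c->pending_len = 0;              /* it was a BOM: drop it */
    }
    if (c->pending_pos < c->pending_len) {   /* replay non-BOM bytes first */
        int n = c->pending_len - c->pending_pos;
        if (n > len)
            n = len;
        memcpy(buffer, c->pending + c->pending_pos, (size_t)n);
        c->pending_pos += n;
        return n;
    }
    return c->inner(c->inner_ctx, buffer, len);
}

/* Stand-in for the real stream: reads from a memory block. */
typedef struct { const char *data; size_t len; size_t pos; } mem_src;

static int mem_read(void *context, char *buffer, int len)
{
    mem_src *s = (mem_src *)context;
    size_t left = s->len - s->pos;
    size_t n = left < (size_t)len ? left : (size_t)len;

    memcpy(buffer, s->data + s->pos, n);
    s->pos += n;
    return (int)n;
}
```

With the real stream reader in place of mem_read, the parser never sees
the BOM, whether or not an encoding is passed.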
Regards,
Frank
On 02/02/2018 at 19:52, Eric S. Eberhard wrote:
Same advice I just gave to someone else. Unless it is HUGE this
works. Read it into a memory buffer (calloc, malloc, whatever).
Remove BOM. Parse the memory buffer.
If you do this often you can make the buffer address and its size
static so that you don't release it (deliberate memory leak) and then
keep reusing it to minimize allocation overhead (at the cost of more
memory) ... if you need it bigger, realloc.
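A minimal sketch of the approach described above, assuming the document
is UTF-8 and already sits in memory. The names skip_utf8_bom and
reusable_buffer are hypothetical helpers, not libxml2 functions; the
resulting pointer and length would then be handed to a memory parser
such as xmlReadMemory().

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns a pointer past a leading UTF-8 BOM (EF BB BF), if present,
 * and shrinks *len accordingly. The buffer itself is not modified. */
static const char *skip_utf8_bom(const char *buf, size_t *len)
{
    if (*len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *len -= 3;
        return buf + 3;
    }
    return buf;
}

/* Long-lived buffer that is never freed and only grows on demand. */
static char  *g_buf;
static size_t g_cap;

/* Returns a buffer of at least `need` bytes, or NULL on failure. */
static char *reusable_buffer(size_t need)
{
    if (need > g_cap) {
        char *p = realloc(g_buf, need);
        if (p == NULL)
            return NULL;     /* the old buffer is still valid */
        g_buf = p;
        g_cap = need;
    }
    return g_buf;
}

/* Intended use (not compiled here, requires libxml2):
 *   char *buf = reusable_buffer(size);
 *   ... read `size` bytes of the document into buf ...
 *   size_t len = size;
 *   const char *xml = skip_utf8_bom(buf, &len);
 *   doc = xmlReadMemory(xml, (int)len, url, encoding, options);
 */
```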
libxml2.a cannot do everything for everyone -- putting small wrappers
on things is good. I generally use it with giant wrappers (meaning
the open, calloc, parse, etc. are all one routine). Then when changes
occur you can change your wrapper and generally life is good. I would
not recommend coding directly with raw libxml2 calls -- they are
low-level and complex.
BTW -- if the data is HUGE then write it to a /tmp file (removing the
BOM as you do it), parse it and delete the file ... modern machines
are so fast you won't notice. I have systems sending and receiving 2-4
million XML docs per day. Several have to deal with quirks --
especially when dealing with "big box" places or shipping companies
(you cannot get Target or USPS to change for you). One of them does
not put a space between an attribute's closing quote and the start of
the next attribute. It is wrong. It won't parse. So I filter it with
my wrapper. And so forth.
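A filter for the missing-space quirk described above might look roughly
like the sketch below. It is not the actual wrapper: fix_attr_spacing
is a hypothetical name, and the code assumes a single-byte-safe
encoding such as UTF-8 and does not special-case comments, CDATA
sections, processing instructions or DOCTYPE declarations.

```c
#include <stddef.h>

/* Copies n bytes from src to dst, inserting a space after an attribute
 * value's closing quote when something other than whitespace, '>', '/'
 * or '?' follows it directly. dst must have room for 2 * n bytes.
 * Returns the number of bytes written. */
static size_t fix_attr_spacing(const char *src, size_t n, char *dst)
{
    size_t out = 0;
    int in_tag = 0;     /* inside '<' ... '>' */
    char quote = 0;     /* active quote character inside a tag, or 0 */

    for (size_t i = 0; i < n; i++) {
        char ch = src[i];

        dst[out++] = ch;
        if (!in_tag) {
            if (ch == '<')
                in_tag = 1;
        } else if (quote) {
            if (ch == quote) {
                quote = 0;
                if (i + 1 < n) {
                    char nx = src[i + 1];
                    if (nx != ' ' && nx != '\t' && nx != '\r' &&
                        nx != '\n' && nx != '>' && nx != '/' && nx != '?')
                        dst[out++] = ' ';   /* the missing separator */
                }
            }
        } else if (ch == '"' || ch == '\'') {
            quote = ch;
        } else if (ch == '>') {
            in_tag = 0;
        }
    }
    return out;
}
```

Quotes in text content are left alone because the quote state is only
tracked inside a tag.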
The specs are often interpreted differently by other organizations
that you cannot win against. So work around them.
Daniel is a great guy but ... if he had to make an exception and
change for everything I have (and I imagine thousands of others) he
would need 100 clones :-)
E
On 2/2/2018 1:19 AM, Frank Gross wrote:
Hi,
I ran into an issue when trying to parse an XML document from an HTTP
stream. I decode the charset from the HTTP header and then call
xmlCtxtReadIO with that charset value as the encoding parameter. The
problem is that the XML document begins with the three BOM bytes, and
it seems that xmlCtxtReadIO considers the document malformed in that
case (an XML document with a BOM, and xmlCtxtReadIO called with an
encoding value). Note that if I don't provide the encoding value to
xmlCtxtReadIO, the parsing works well, as the BOM is decoded. Is there
a way to ignore the BOM when parsing with xmlCtxtReadIO?
Regards,
Frank
--
Eric S. Eberhard
VICS
2933 W Middle Verde Road
Camp Verde, AZ 86322
928-567-3727 work 928-301-7537 cell
http://www.vicsmba.com/index.html (our work)
http://www.vicsmba.com/ourpics/index.html (fun pictures)
--
Frank GROSS
Software Engineer - Web Services
Four J's Development Tools - http://www.4js.com