Re: [xml] xmlCtxReadIO and BOM

Frank Gross Thu, 08 Feb 2018 01:26:39 -0800

Hi,

I have wrappers around libxml calls, and already implemented the "hack" to skip the BOM. But I think there is an issue here, because when I use for instance xmlParserInputBufferCreateIO(), it skips the BOM when the encoding is provided. So for me there is an inconsistency here, as xmlCtxtReadIO() should work the same way, or maybe I missed something.
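
For readers following the thread, the kind of wrapper described above might look roughly like the sketch below. It is not Frank's actual code. It assumes the stream is UTF-8 (only the EF BB BF byte sequence is handled), and the names bom_filter, bom_filter_read, mem_source and mem_read are hypothetical. The read function has the same shape as a libxml2 input read callback (context pointer, buffer, length), and mem_source merely stands in for the HTTP stream so the sketch does not depend on libxml2 headers.

```c
#include <string.h>

/* Same shape as a libxml2 input read callback: returns bytes read, 0 at EOF, -1 on error. */
typedef int (*read_fn)(void *ctx, char *buf, int len);

/* Hypothetical filter state wrapping an underlying reader. */
struct bom_filter {
    read_fn read;    /* underlying reader, e.g. the HTTP stream reader */
    void *ctx;       /* context passed to the underlying reader */
    int checked;     /* nonzero once the first bytes have been inspected */
    char head[3];    /* first bytes of the stream, held back for inspection */
    int head_len;    /* how many bytes were captured in head */
    int head_pos;    /* how many captured bytes were already delivered or dropped */
};

/* Read callback that drops a leading UTF-8 BOM and passes everything else through. */
static int bom_filter_read(void *context, char *buffer, int len)
{
    struct bom_filter *f = context;

    if (len <= 0)
        return 0;
    if (!f->checked) {
        f->checked = 1;
        /* Collect up to three bytes; the underlying reader may return short reads. */
        while (f->head_len < 3) {
            int n = f->read(f->ctx, f->head + f->head_len, 3 - f->head_len);
            if (n < 0)
                return -1;
            if (n == 0)
                break;
            f->head_len += n;
        }
        if (f->head_len == 3 && memcmp(f->head, "\xEF\xBB\xBF", 3) == 0)
            f->head_pos = 3; /* BOM found: mark all held bytes as consumed */
    }
    if (f->head_pos < f->head_len) {
        /* No BOM: hand back the held bytes before reading further. */
        int n = f->head_len - f->head_pos;
        if (n > len)
            n = len;
        memcpy(buffer, f->head + f->head_pos, (size_t)n);
        f->head_pos += n;
        return n;
    }
    return f->read(f->ctx, buffer, len);
}

/* Stand-in for the real stream, used only to exercise the filter. */
struct mem_source { const char *data; int len; int pos; };

static int mem_read(void *ctx, char *buf, int len)
{
    struct mem_source *s = ctx;
    int n = s->len - s->pos;
    if (n > len)
        n = len;
    memcpy(buf, s->data + s->pos, (size_t)n);
    s->pos += n;
    return n;
}
```

With libxml2, bom_filter_read and a pointer to the filter struct would presumably be passed as the ioread and ioctx arguments of xmlCtxtReadIO(), together with the encoding taken from the HTTP header. Other BOMs (UTF-16, UTF-32) would need their own checks.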


Regards,

Frank


On 02/02/2018 at 19:52, Eric S. Eberhard wrote:

Same advice I just gave to someone else. Unless it is HUGE this works. Read it into a memory buffer (calloc, malloc, whatever). Remove BOM. Parse the memory buffer.
If you do this often you can make the buffer address and its size static so that you don't release it (deliberate memory leak) and then keep using it for minimal context switching (and more memory) ... if you need it bigger, realloc.
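
A minimal sketch of the "remove BOM" step on an in-memory buffer, assuming the document is UTF-8 so the BOM is the three bytes EF BB BF. The helper names utf8_bom_length and skip_utf8_bom are hypothetical and are not part of libxml2; nothing is copied, the helper only returns a pointer past the BOM.

```c
#include <stddef.h>
#include <string.h>

/* Returns 3 if the buffer starts with the UTF-8 BOM (EF BB BF), otherwise 0. */
static size_t utf8_bom_length(const char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (buf != NULL && len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}

/*
 * Hypothetical helper: returns a pointer to the first byte after any UTF-8 BOM
 * and stores the remaining length in *out_len. The buffer itself is untouched,
 * so the caller still frees (or reuses) the original pointer.
 */
static const char *skip_utf8_bom(const char *buf, size_t len, size_t *out_len)
{
    size_t skip;

    if (buf == NULL) {
        *out_len = 0;
        return NULL;
    }
    skip = utf8_bom_length(buf, len);
    *out_len = len - skip;
    return buf + skip;
}
```

The returned pointer and length could then be handed to a memory parser such as xmlReadMemory(), with the length cast to int and the encoding from the HTTP header passed as the encoding argument.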
libxml2.a cannot do everything for everyone -- putting small wrappers on things is good. I generally use it with giant wrappers (meaning the open, calloc, parse, etc are all one routine). Then when changes occur you can change your wrapper and generally life is good. I would not recommend coding directly with raw libxml2 calls -- they are lower level but complex.
BTW -- if the data is HUGE then write it to a /tmp file (removing the BOM as you do it) and parse and delete the file ... modern machines are so fast it won't notice. I have systems sending and receiving 2-4 million XML docs per day. Several have to deal with quirks -- especially when dealing with "big box" places or shipping companies (you cannot get Target or USPS to change for you). One does not put spaces between attribute ending quote and the start of the next attribute. It is wrong. It won't parse. So I filter it with my wrapper. And so forth.
The specs are often interpreted differently by other organizations that you cannot win against. So work around them.
Daniel is a great guy but ... if he had to make an exception and change for everything I have (and I imagine thousands of others) he would need 100 clones :-)
E

On 2/2/2018 1:19 AM, Frank Gross wrote:
Hi,
I came to an issue where I try to parse an XML document from an HTTP stream. I decode the charset from the HTTP header and then call xmlCtxtReadIO with that charset value as the encoding parameter. The problem is that the XML document has three BOM characters, and it seems that xmlCtxtReadIO considers the document as malformed in that case (an XML document with a BOM, when xmlCtxtReadIO is called with an encoding value). Notice that if I don't provide the encoding value to xmlCtxtReadIO, the parsing works well as the BOM is decoded. Is there a way to ignore the BOM when parsing with xmlCtxtReadIO?
Regards,

Frank
--
Eric S. Eberhard
VICS
2933 W Middle Verde Road
Camp Verde, AZ  86322

928-567-3727  work                      928-301-7537  cell

http://www.vicsmba.com/index.html              (our work)
http://www.vicsmba.com/ourpics/index.html      (fun pictures)


--
Frank GROSS
Software Engineer - Web Services
Four J's Development Tools - http://www.4js.com

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
https://mail.gnome.org/mailman/listinfo/xml
