Since encoding support is an oft raised issue (like the way I used that word 'oft' to make it sound like I'm intelligent?), in this episode of Xerces-C Tech Talk, our fearless parser discusses how to tame evil encodings and make them work for you. Ok, so that was a little clichéd, but damnit Jim, I'm a software engineer, not a prose writer. Anyway, I've obviously had too much coffee, so let's move on to the subject at hand. This document describes how text encodings are used by the parser, and the steps it takes to read in an external XML entity (meaning a file, a memory buffer, or some other external representation of XML text.)

There are gazzillions of ways to represent a character or symbol of a language in binary form in storage. Well, ok, maybe there are only bazzillions, but there are still a lot of them. Some of them, such as ASCII, are very well known, since they've been around since the dawn of computerdom. But XML, being designed as the Borg of text formats, must be able to handle XML text stored in any conceivable character encoding. An XML parser must be able to figure out (usually without any help from the studio audience) what encoding each XML entity is in, and how to read that XML text into some internal format.

When the parser encounters an external entity that must be parsed, it creates an XMLReader object for that entity. XMLReader objects represent entities inside the parser, and provide the mechanisms to do all of the work we are discussing here. Each parser has a reader manager object, which is really just a stack of readers. Since an entity can reference an entity which can reference an entity and so on, the parser obviously must have a way to store what it's doing, parse a new entity, and then come back to where it left off. This stack of readers provides that capability. When the parser needs to look at the next character of input, it asks the reader manager for the next character.
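The reader stack just described can be sketched roughly like this. This is a simplified stand-in, not the real XMLReader/ReaderMgr API; all of the names here are invented for illustration:

```cpp
#include <stack>
#include <string>
#include <utility>

// Hypothetical, heavily simplified stand-ins for XMLReader and the
// reader manager; the real Xerces classes carry far more state.
struct Reader {
    std::string content;   // decoded characters for this entity
    size_t      pos = 0;   // next character to hand out
    bool atEnd() const { return pos >= content.size(); }
    char next() { return content[pos++]; }
};

class ReaderMgr {
public:
    void push(Reader r) { readers_.push(std::move(r)); }

    // Hand out the next character, popping exhausted readers so the
    // parser transparently resumes the enclosing entity.
    bool nextChar(char& out) {
        while (!readers_.empty() && readers_.top().atEnd())
            readers_.pop();
        if (readers_.empty())
            return false;      // the original reader is done: parse over
        out = readers_.top().next();
        return true;
    }
private:
    std::stack<Reader> readers_;
};
```

The point of the design is that the high level parsing code only ever sees "give me the next character"; the pushing and popping of nested entities happens entirely below that interface.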
The reader manager in turn asks the reader on the top of the stack for its next character. This way, the parser is, for the most part, unaware of the nesting of entities.

When a new reader is created for a newly referenced external entity, the reader constructor does a series of probes to figure out what the heck it's dealing with. In some cases, if the input source for that entity is provided by client code and provides an encoding override, then no work is done and the client code's word is taken on the encoding. Otherwise, the reader must figure out how to interpret the data it's been given. So the first thing the reader does is read a buffer of raw binary data from the input stream it was given. In order to drastically improve performance, each reader maintains a relatively large buffer of raw data, so that it can read from the source in large chunks. This is particularly important for socket based remote data, where a read operation might involve a good bit of work. The parser will then work out of this large raw buffer until it's empty, then fill it again.

Once it has that first buffer of raw data, the reader will try to 'auto-sense' the basic encoding. In the src/framework/ directory, there is a class called XMLRecognizer. This class contains the smarts to do the auto-sensing operation. It is provided as a public class to allow client code to auto-sense formats if they choose to. The reason I said 'basic' encoding is that each XML entity can have, in its XMLDecl or TextDecl, an encoding="" statement that says exactly what encoding it is in. However, in order to read that first line, we have to know enough about the encoding of the file to decode the first line and find out whether an encoding="" is present. The 'basic' encoding, well... basically, I guess, tells us what family of encodings the file is in. There is a small set of encodings from which most everything out there descends in one way or another.
Since the Decl lines can only contain a very limited set of characters, once we figure out the encoding family, we can get through that first Decl line without knowing the exact encoding. The families of encodings are: UTF-8/ASCII, UTF-16, UCS-4, and EBCDIC. So, if we can figure out which of these families the entity's encoding belongs to, we can get started.

In order to figure this out, we use two tricks. One is the BOM and the other is the XMLDecl/TextDecl itself. Since many XML entities start with this decl, the first characters in the file are often "<?xml ", in some encoding. If so, then the first bytes of the file will follow a known pattern. For instance, if it's in the ASCII/UTF-8 family, it will start with the bytes: 0x3C, 0x3F, 0x78, 0x6D, 0x6C, 0x20. If we see one of these patterns of bytes, we know it's in that particular encoding family. If no decl is present, then the XML spec says that the file must be in UTF-8.

However, as a convenience, many parsers will also look for a BOM, or Byte Order Mark, which is prepended to most UTF-16 text. The BOM is the value 0xFEFF, and indicates that the file is highly likely to be a UTF-16 file. Depending on the endianness of the machine that stored the data, the bytes of this value will be either 0xFE, 0xFF or 0xFF, 0xFE. This tells us the endianness of the data itself and lets us know which way to decode it. By the way, though UCS-4 has no BOM, we check for both little and big endian Decl byte sequences so that we know which endianness it is in as well. The other families are either single byte, or are single byte within the limited set of characters allowed by the XML spec within the decl line.

So, now we've either figured out that the text is in one of the basic encoding families, or we've not figured it out and assumed it's UTF-8. If we figured it out by there being an XMLDecl/TextDecl present, then we need to be able to look through it and find the encoding="" string, if it's present.
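The byte probing just described could look something like this sketch. The real XMLRecognizer handles more cases and the names here are mine, but the byte patterns are the ones given above (the EBCDIC bytes are the cp037 encoding of "<?xm"):

```cpp
#include <cstddef>

enum class EncFamily { UTF8, UTF16LE, UTF16BE, UCS4LE, UCS4BE, EBCDIC };

// Probe the first raw bytes of an entity for a BOM or a known
// encoding of the start of an XMLDecl/TextDecl.
inline EncFamily senseFamily(const unsigned char* b, size_t n) {
    // UTF-16 byte order marks: 0xFE 0xFF (big endian), 0xFF 0xFE (little).
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return EncFamily::UTF16BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return EncFamily::UTF16LE;

    // "<?xml " in ASCII/UTF-8: 3C 3F 78 6D 6C 20
    static const unsigned char ascii[] = {0x3C, 0x3F, 0x78, 0x6D, 0x6C, 0x20};
    if (n >= 6) {
        bool hit = true;
        for (int i = 0; i < 6; ++i)
            if (b[i] != ascii[i]) hit = false;
        if (hit) return EncFamily::UTF8;
    }

    // "<?" in UTF-16 without a BOM: 00 3C 00 3F (BE) or 3C 00 3F 00 (LE).
    if (n >= 4 && b[0]==0x00 && b[1]==0x3C && b[2]==0x00 && b[3]==0x3F)
        return EncFamily::UTF16BE;
    if (n >= 4 && b[0]==0x3C && b[1]==0x00 && b[2]==0x3F && b[3]==0x00)
        return EncFamily::UTF16LE;

    // '<' in UCS-4: 00 00 00 3C (BE) or 3C 00 00 00 (LE). No BOM to help
    // us here, so we check both endian byte sequences of the decl itself.
    if (n >= 4 && b[0]==0x00 && b[1]==0x00 && b[2]==0x00 && b[3]==0x3C)
        return EncFamily::UCS4BE;
    if (n >= 4 && b[0]==0x3C && b[1]==0x00 && b[2]==0x00 && b[3]==0x00)
        return EncFamily::UCS4LE;

    // "<?xm" in EBCDIC (cp037): 4C 6F A7 94
    if (n >= 4 && b[0]==0x4C && b[1]==0x6F && b[2]==0xA7 && b[3]==0x94)
        return EncFamily::EBCDIC;

    // No recognizable pattern: per the XML spec, assume UTF-8.
    return EncFamily::UTF8;
}
```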
The way our parser does this is to manually transcode the Decl line, from within the XMLReader constructor. It puts this decoded text into another buffer, of XMLCh characters. XMLCh is the internal representation of characters inside the parser. The size of the XMLCh type can vary, but it always holds Unicode characters. Since Unicode can represent the characters of all of the possible encodings, using Unicode internally allows the parser to be written to a single character type, vastly simplifying it.

The reader constructor then returns and processing proceeds normally; after some other bookkeeping, the high level parsing code begins trying to parse this new entity and starts asking for characters. At this point, one of two things happens. If an XMLDecl/TextDecl was found and manually pre-transcoded, then the initial parsing code will have enough characters available to it to parse the Decl line. During this process, it will see any encoding="" string. When it does, it will call back into the current reader and ask it to update itself to use this new encoding. At that point, the reader will use the encoding name to create an XMLTranscoder object (which it gets from the installed transcoding service) and will store this for subsequent transcoding duties. If there was no Decl line, then no data was pre-transcoded. So, when the first character is asked for, it will be assumed that the auto-sensed encoding is the correct one, and an XMLTranscoder for that encoding will be created.

When either the end of the Decl line is hit, or the first character is requested and no Decl line was present, the installed transcoder will be used to transcode data from the reader's raw data buffer to its internalized character buffer, about 4K characters at a time. The reader then spools out characters from this internalized buffer until it becomes empty. It then tries to transcode another 4K characters from the raw buffer. If the raw buffer empties out, then it too is refilled.
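The two-buffer flow above can be sketched as follows, using Latin-1 as a stand-in for the installed XMLTranscoder and char16_t for XMLCh. The class name is invented and the real chunk size is about 4K; it is shrunk in the usage example to force several refills:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative pipeline: raw bytes are transcoded a chunk at a time into
// an internal character buffer, and the parser spools characters from
// that buffer until it empties, then another chunk is transcoded.
class CharSpool {
public:
    explicit CharSpool(std::vector<unsigned char> raw, size_t chunk = 4096)
        : raw_(std::move(raw)), chunk_(chunk) {}

    bool nextChar(char16_t& out) {
        if (pos_ == chars_.size() && !refill())
            return false;          // raw data exhausted: reader is done
        out = chars_[pos_++];
        return true;
    }
private:
    // Transcode up to 'chunk_' more source bytes. Latin-1 maps each
    // byte directly to the Unicode code point of the same value.
    bool refill() {
        chars_.clear();
        pos_ = 0;
        while (rawPos_ < raw_.size() && chars_.size() < chunk_)
            chars_.push_back(static_cast<char16_t>(raw_[rawPos_++]));
        return !chars_.empty();
    }
    std::vector<unsigned char> raw_;
    std::vector<char16_t> chars_;
    size_t chunk_, pos_ = 0, rawPos_ = 0;
};
```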
Eventually, no more raw data can be gotten, hence no more characters can be gotten, so the parser knows that this reader is all washed up. It pops the reader off the top of the reader manager's stack and starts working again with the previous reader. This whole sequence repeats itself again and again until the end of the original reader is hit, at which time the end of the parse is at hand, and there is much rejoicing and feasting if it all went well.

When an encoding="" string is seen, and the parser calls back to the reader with the encoding name, the reader will ask the currently installed transcoding service (each per-platform utilities file can choose which one it wants to install) to create an XMLTranscoder for that encoding. This call first goes to the platform independent part of the transcoding service. This code will first check whether the encoding is one of the intrinsically supported encodings. If so, it will short circuit the request and return one of the XMLTranscoder derivatives from the src/util directory. As of 3.1.0, the parser intrinsically supports UTF-8, ASCII, UTF-16, UCS-4, EBCDIC-US, and ISO-8859-1 (Latin1.) If the encoding is not one of these, then the per-platform transcoding service is asked to create a transcoder for the encoding. If it cannot or does not, then the parser will issue an error that a transcoder for encoding X cannot be created. If the encoding string is one of a set of endian ambiguous encodings, such as UTF-16 or CP1200 or UCS-4, then the previously sensed endianness is used to know which flavor of transcoder to create. Where possible, you should always use an endian specific name, such as "UTF-16LE", so that there is no ambiguity.

Related Random Facts: This little addendum is not strictly related to the topic, but it's interesting so I threw it in. Even if the sequence of referenced entities creates a legal sequence of XML characters, that's not necessarily good enough.
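Before getting to that addendum, here is a rough sketch of the endian resolution just described. The function name and the treatment of unknown names are assumptions for illustration, not the real transcoding service lookup:

```cpp
#include <cctype>
#include <string>

// Illustrative only: resolve endian-ambiguous encoding names ("UTF-16",
// "CP1200", "UCS-4") using the endianness sensed during auto-detection,
// and pass unambiguous names through untouched.
inline std::string resolveEncoding(std::string name, bool sensedBigEndian) {
    // Encoding names are matched case insensitively.
    for (char& ch : name)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));

    if (name == "UTF-16" || name == "CP1200")
        return sensedBigEndian ? "UTF-16BE" : "UTF-16LE";
    if (name == "UCS-4")
        return sensedBigEndian ? "UCS-4BE" : "UCS-4LE";

    // Unambiguous names need no help; anything non-intrinsic would be
    // handed to the per-platform transcoding service at this point.
    return name;
}
```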
The XML spec requires that all entities contain 'balanced markup'. So you cannot have the opening quote of a string in one entity and the closing quote in another. You cannot have the start tag of an element in one entity, and the end tag of that element in another entity. You cannot have the open angle bracket of an element decl in one entity and the close bracket in another, and so on. So, the parser has to do a lot of checking to ensure that these rules are met.

The Xerces parser does this by assigning a unique id to each new reader that is created. It uses a synchronized counter (a static member of the reader manager) to create a new id, which it sets on each reader that it creates and pushes onto the stack. So now, for instance, when the high level parsing code sees a < character in the DTD and knows it's about to start parsing some markup decl, it can ask the reader manager for the id of the current reader, which it stores in a local variable. When it gets to the > character, it gets the id again. If it does not get back the same id, it knows that the entity is unbalanced, and it issues a 'Partial Markup' error.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]