Hi all,

Story time: We built a script that reads from the REST API of a web app, writes an .xml file, and uploads it, zipped, to an SFTP endpoint. To write the .xml we used a textbook-like [1] approach [2] to build a nice, horrifying, XML-ish document, with what amounts to unvalidated user input in some of the text nodes.

To no one's surprise (at this point), now, two years later, the receiving party complains that they find illegal characters (0x8) in our uploads, which they cannot parse.

Turns out XML [3] and HTML [4] each have their own opinion about which characters are allowed in their documents. But at least they agree that most control characters (0x0 - 0x8, 0xB, 0xE - 0x1F) are bad; going by the definitions quoted below, XML additionally rejects 0xC, and HTML additionally rejects 0xD. Some other code points are at least 'discouraged' (0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, ...).

Now, the "MarkupBuilder" is first and foremost called "MarkupBuilder". So one could argue that it /does/ handle the markup part just fine and that's all that it /should/ do. On the other hand, the class proclaims itself in the javadoc [5] to be "for creating XML or HTML markup", and the documentation [1] also kind of markets it for that purpose. (And maybe it's a bad look to be able to write invalid .xml?)

So here are my questions to you:

1) Is the MarkupBuilder's behavior okay as-is?
2) If not, what should the behavior be?
3) Is this historically a 'done discussion', and are we unwilling to open up /that/ can of worms again? (What was the previous consensus?)

Going a bit further with this, personally, I could imagine:

* by default sanitizing the output of MarkupBuilder to a subset of characters that is compatible with _both_ formats
* having some config option to switch to 'xml', 'html' or 'off' mode for "character set validation"
* dealing with invalid characters by replacing them with the \uFFFD (�) character (as one comment on the Jeff Atwood answer post [6] suggested)

That might be the maximum extent of change. But I'm eager to hear some of your opinions.
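
To make the ideas above a bit more concrete, here is a rough sketch of what I have in mind. It is not an existing Groovy API: the names CharMode, isXmlChar, isHtmlChar and sanitize are made up for illustration. The XML check follows the Char production quoted in [3]. The HTML check applies the sentence quoted in [4] to raw text, which is my own assumption, since that sentence strictly speaks about numeric character references. Rejecting lone surrogates in the HTML check is also an assumption on my side.

```groovy
// Hypothetical helper, not part of Groovy or MarkupBuilder.
enum CharMode { XML, HTML, BOTH, OFF }

// XML 1.0 "Char" production as quoted in [3].
boolean isXmlChar(int cp) {
    cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF)
}

// Rule quoted in [4]: no U+000D CR, no noncharacters, and no controls
// other than ASCII whitespace (TAB, LF, FF, CR, SPACE).
boolean isHtmlChar(int cp) {
    boolean control = cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)
    boolean asciiWhitespace = cp == 0x9 || cp == 0xA || cp == 0xC || cp == 0xD || cp == 0x20
    boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE
    boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF   // assumption: reject lone surrogates
    cp != 0xD && !(control && !asciiWhitespace) && !nonCharacter && !surrogate
}

// Replaces every code point that the chosen mode rejects with U+FFFD.
String sanitize(CharSequence text, CharMode mode = CharMode.BOTH) {
    String s = text.toString()
    if (mode == CharMode.OFF) {
        return s
    }
    def out = new StringBuilder()
    int i = 0
    while (i < s.length()) {
        int cp = s.codePointAt(i)
        boolean ok = (mode == CharMode.HTML || isXmlChar(cp)) &&
                     (mode == CharMode.XML || isHtmlChar(cp))
        out.appendCodePoint(ok ? cp : 0xFFFD)
        i += Character.charCount(cp)
    }
    out.toString()
}
```

In a script like ours, one would call sanitize(...) on each user-supplied value before handing it to the builder closure in [2]. Whether something like this belongs inside MarkupBuilder itself is exactly the open question.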

Any thoughts / arguments / things I've missed so far? Any chance of finding some kind of consensus on the matter?

Best,
Simon

[1] https://groovy-lang.org/processing-xml.html#_markupbuilder

[2]
    import groovy.xml.MarkupBuilder

    // Runs the given closure against a MarkupBuilder and returns the
    // result as a string, prefixed with an XML declaration.
    private toXmlFile(body) {
        def writer = new StringWriter()
        def xml = new MarkupBuilder(writer)
        body(xml)
        '<?xml version="1.0" encoding="UTF-8"?>' + "\n" + writer.toString() + "\n"
    }

[3] https://www.w3.org/TR/xml/#NT-Char
    "Consequently, XML processors MUST accept any character in the range specified for Char.
    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
    (Note: this is a "positive definition", and could be amended at some point to include /more/ characters.)

[4] https://html.spec.whatwg.org/#character-references
    "The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters <https://infra.spec.whatwg.org/#noncharacter>, and controls <https://infra.spec.whatwg.org/#control> other than ASCII whitespace <https://infra.spec.whatwg.org/#ascii-whitespace>."

[5] https://docs.groovy-lang.org/latest/html/api/groovy/xml/MarkupBuilder.html

[6] https://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504