One of those enlightening topics: Control characters and the MarkupBuilder

Simon Tost Fri, 07 Aug 2020 14:59:23 -0700

Hi all,


Story time:
So we built this script, which reads from the REST API of a webapp,
writes an .xml file and uploads it, zipped, to an sftp endpoint.
For writing the .xml we used a textbook-like [1] approach [2] to build
some nice, horrifying, XML-ish document, with what amounts to
unvalidated user input in some of the text nodes.

To no one's surprise (at this point), now, 2 years later, the receiving
entity complains that they find illegal characters ('0x8') in their
uploads, which they cannot parse.


Turns out XML [3] and HTML [4] each have their own opinion about which
characters are allowed in their documents.
But at least they agree that most control characters (0x0 - 0x8; 0xB;
0xE - 0x1F) are bad. Going by [3] and [4], XML additionally rejects 0xC,
while HTML rejects 0xD in numeric character references. Some others are
at least 'discouraged' (0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, ...).
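
To make the XML side concrete, here is a minimal sketch of a check
against the Char production quoted in [3]. The helper names isXmlChar
and illegalXmlCodePoints are made up for illustration, and the HTML
rules from [4] are deliberately left out.

```groovy
// Sketch only: isXmlChar and illegalXmlCodePoints are hypothetical helper names.

// True if the code point matches the XML 1.0 "Char" production quoted in [3].
boolean isXmlChar(int cp) {
    cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF)
}

// Collects every code point in `text` that falls outside "Char".
List<Integer> illegalXmlCodePoints(String text) {
    List<Integer> found = []
    int i = 0
    while (i < text.length()) {
        // Walk by code point so surrogate pairs count as one character
        int cp = text.codePointAt(i)
        if (!isXmlChar(cp)) {
            found << cp
        }
        i += Character.charCount(cp)
    }
    found
}

// A backspace (0x8) hidden in otherwise harmless input; prints [0x8]
println illegalXmlCodePoints('broken\u0008text').collect { '0x' + Integer.toHexString(it) }
```

Running something like this over the text nodes before upload would at
least tell us which inputs the receiving side is going to choke on.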


Now, the "MarkupBuilder" is first and foremost called "MarkupBuilder".
So one could argue, that it /does/ handle the markup part just fine and
that's all that it /should/ do.

On the other hand, the class proclaims itself in the javadoc [5] to be
"for creating XML or HTML markup".
And the documentation [1] also kind of markets it for that purpose.
(And, maybe, it's a bad look, to be able to write invalid .xml?)


So here is the question to you:
1) Is the MarkupBuilder's behavior okay as-is?
2) (if not 1) What should the behavior be instead?

3) Is this historically a 'done discussion', and are we unwilling to
open up /that/ can of worms again?
(What was the previous consensus?)



Going a bit further with this, personally, I could imagine:
* by default sanitizing the output of MarkupBuilder to a compatible
subset of characters for _both_ formats
* having some config option to switch to 'xml', 'html' or 'off' mode for
"character set validation"
* dealing with invalid characters by replacing them with the \uFFFD (�)
character
  (as one comment on the Jeff Atwood answer post [6] suggested)

Which might be the maximum degree of changing things.
But I'm eager to hear some of your opinions.
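
For illustration, the replacement idea could look roughly like the
sketch below. It only covers what would be the 'xml' mode, using the
Char production from [3]. sanitizeForXml is a hypothetical helper name
and not part of MarkupBuilder, and the differing HTML rules from [4]
are not handled.

```groovy
// Sketch only: sanitizeForXml is a hypothetical helper, not part of MarkupBuilder.
// It replaces every code point outside the XML 1.0 "Char" production from [3]
// with U+FFFD. The HTML rules from [4] are not covered here.
String sanitizeForXml(String text) {
    // Negated class: anything that is NOT tab, LF, CR, or one of the allowed ranges
    def illegal = /[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/
    text.replaceAll(illegal, '\uFFFD')
}

// The backspace is swapped for the replacement character
println sanitizeForXml('broken\u0008text')
```

Until the builder does something like this itself, it can be applied at
the call site, e.g. xml.note(sanitizeForXml(userInput)), where `note`
and `userInput` are placeholders.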

Any thoughts / arguments / things I've missed so far?
Any chance of finding some kind of consensus on the matter?


Best,
Simon


[1] https://groovy-lang.org/processing-xml.html#_markupbuilder
[2]
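    import groovy.xml.MarkupBuilder

    // Renders the markup emitted by `body` and prepends an XML declaration
    private toXmlFile(body) {
        def writer = new StringWriter()
        def xml = new MarkupBuilder(writer)

        // `body` receives the builder and writes the elements
        body(xml)

        '<?xml version="1.0" encoding="UTF-8"?>' + "\n" +
            writer.toString() + "\n"
    }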

[3]
https://www.w3.org/TR/xml/#NT-Char
"Consequently, XML processors MUST accept any character in the range
specified for Char.
[2]       Char       ::=       #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character,
excluding the surrogate blocks, FFFE, and FFFF. */"
(Note: this is a "positive definition", and could be amended at some
point to include /more/ characters.)

[4] https://html.spec.whatwg.org/#character-references
"The numeric character reference forms described above are allowed to
reference any code point excluding U+000D CR, noncharacters
<https://infra.spec.whatwg.org/#noncharacter>, and controls
<https://infra.spec.whatwg.org/#control> other than ASCII whitespace
<https://infra.spec.whatwg.org/#ascii-whitespace>."

[5]
https://docs.groovy-lang.org/latest/html/api/groovy/xml/MarkupBuilder.html
[6]
https://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504
