[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2019-04-27 Thread Stefan Behnel


Stefan Behnel  added the comment:

This is a tricky decision. lxml, for example, validates user input, but that's 
because it has to process it anyway and does it along the way directly on input 
(and very efficiently in C code). ET, on the other hand, is rather lenient 
about what it allows users to do and doesn't apply much processing to user 
input. It even allows invalid trees during processing and only expects the tree 
to be serialisable when requested to serialise it.

I think that's a fair behaviour, because most user input will be ok and 
shouldn't need to suffer the performance penalty of validating all input. 
Null-characters are a very rare thing to find in text, for example, and I think 
it's reasonable to let users handle the few cases by themselves where they can 
occur.

Note that simply replacing invalid characters by the replacement character is 
not a good solution, at least not in the general case, since it silently 
corrupts data. It's probably a better solution for users to make their code 
scream out loudly when it has to deal with data that it cannot serialise in the 
end, and to do that early on input (where it's easy to debug) rather than late 
on serialisation where it might be difficult to understand how the data became 
what it is. Trying to serialise a null-character seems only a symptom of a more 
important problem somewhere else in the processing pipeline.

In the end, users who *really* care about correct output should run some kind 
of schema validation over it *after* serialisation, as that would detect not 
only data issues but also structural and logical issues (such as a missing or 
empty attribute), specifically for their target data format. In some cases, it 
might even detect random data corruption due to old non-ECC RAM in the server 
machine. :)

So, if someone finds a way to augment the text escaping procedure with a bit of 
character validation without making it slower (especially for the extremely 
common very short strings), then I think we can reconsider this as an 
enhancement. Until then, and seeing that no-one has come up with a patch in the 
last 10 years, I'll close this as "won't fix".
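For anyone who wants to measure the cost discussed above, here is a rough timing sketch in Python 3. It is only an illustration under stated assumptions: `escape_only` is a simplified stand-in for text escaping, not ElementTree's actual code, and both function names are made up for this example. The character class follows the XML 1.0 Char production.

```python
import re
import timeit

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")


def escape_only(text):
    # Simplified stand-in for text escaping (hypothetical, not ET's code).
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_checked(text):
    # Same escaping, preceded by one regex scan for disallowed characters.
    if _INVALID.search(text):
        raise ValueError("text contains characters not allowed in XML")
    return escape_only(text)


short = "id42"  # stands in for the very common short strings
for func in (escape_only, escape_checked):
    seconds = timeit.timeit(lambda: func(short), number=100000)
    print("%-15s %.3f s per 100000 calls" % (func.__name__, seconds))
```

The numbers depend on the machine and the Python version, so the sketch prints them rather than claiming a result.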

--
dependencies:  -Document Object Model API - validation
nosy: +scoder
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed
versions: +Python 3.8 -Python 3.4, Python 3.5

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-11-07 Thread Ben Spiller


Change by Ben Spiller :


--
nosy: +Ben Spiller




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-10-19 Thread Ben Spiller


Ben Spiller  added the comment:

To help anyone else struggling with this bug, based on 
https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
 the best workaround I've currently found is to define this:

import re

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    # Replace the characters that XML 1.0 does not allow in a document.
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]',
                  replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates 
XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every python developer to 
copy+paste this kind of thing into their own project to avoid buggy XML 
generation, so it would be better to have the escape_xml_illegal_chars function 
in the python standard library (maybe alongside xml.sax.saxutils.escape - which 
notably does _not_ escape all the unicode characters that aren't valid XML), 
and built-in support for this as part of document.toxml. 
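A small Python 3 check of that last point, using only the standard library function xml.sax.saxutils.escape. The sample string is made up for illustration:

```python
from xml.sax.saxutils import escape

sample = "a & b < c\x00"
escaped = escape(sample)
# The markup characters are escaped, but the NUL character passes through.
print(repr(escaped))  # 'a &amp; b &lt; c\x00'
```

So escape() deals with markup-significant characters only; it is not a filter for characters that XML cannot represent.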

I guess we'd want it to be user-configurable for any users who are prepared to 
tolerate the possibility that unparseable XML documents will be generated in 
return for improved performance in the common case where these characters are 
not present, but not having the capability at all just means most python 
applications that generate XML without special-casing this have a bug. I 
suggest we definitely need some clear warnings about this in the doc.




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-09-10 Thread Ben Spiller


Ben Spiller  added the comment:

Hi it's been a few years now since this was reported and it's still a problem, 
any chance of a fix for this? The API gives the impression that if you pass 
python strings to the XML API then the library will generate valid XML. It 
takes care of the charset/encoding and entity escaping aspects of XML 
generation so it would be logical for it to in some way take care of control 
characters too - especially as silently generating unparseable XML is a 
somewhat dangerous failure mode. 

I think there's a strong case for some built-in functionality to replace/ignore 
the control characters (perhaps as a configurable option, in case of 
performance worries) rather than just throwing an exception, since it's very 
common to have an arbitrary string generated by some other program or user 
input that needs to be written into an XML file (and a lot less common to be 
100% sure in all cases what characters your string might contain). For those 
common use cases, the current situation where every python developer needs to 
implement their own workaround to sanitize strings isn't ideal, especially as 
it's not trivial to get it right and likely a lot of the community who end up 
'rolling their own' are getting it wrong in some way. 

[On the other hand if you guys decide this really isn't going to be fixed, then 
at the very least I'd suggest that the API documentation should prominently 
state that it is up to the users of these libraries to implement their own 
sanitization of control characters, since I'm sure none of us want people using 
python to end up with buggy applications]

--
nosy: +benspiller
versions: +Python 3.5, Python 3.6, Python 3.7




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2014-12-12 Thread Martin Panter

Changes by Martin Panter vadmium...@gmail.com:


--
nosy: +vadmium

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5166
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2014-02-03 Thread Mark Lawrence

Changes by Mark Lawrence breamore...@yahoo.co.uk:


--
nosy:  -BreamoreBoy




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2013-09-02 Thread Eli Bendersky

Changes by Eli Bendersky eli...@gmail.com:


--
nosy: +eli.bendersky




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2012-07-21 Thread Florent Xicluna

Changes by Florent Xicluna florent.xicl...@gmail.com:


--
assignee: effbot -> 
components: +XML
versions: +Python 3.4 -Python 2.7, Python 3.2




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2011-04-08 Thread Santoso Wijaya

Changes by Santoso Wijaya santoso.wij...@gmail.com:


--
nosy: +santa4nt




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2010-03-16 Thread Vetoshkin Nikita

Vetoshkin Nikita nikita.vetosh...@gmail.com added the comment:

What about this example?
>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> el = doc.createElement("Test")
>>> el.setAttribute("with space", "False")
>>> doc.appendChild(el)
<DOM Element: Test at 0xba1440>

>>> #nahhh
... minidom.parseString(doc.toxml())
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 33



Is it worth making another bug report?
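The same check can be written as a small round-trip helper in Python 3. This is a sketch only: `is_reparseable` is a made-up name, and it assumes that minidom accepts the attribute name without validating it, as in the session above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_reparseable(doc):
    """Serialise doc and report whether the result parses again."""
    try:
        minidom.parseString(doc.toxml())
    except ExpatError:
        return False
    return True


doc = minidom.Document()
el = doc.createElement("Test")
el.setAttribute("with space", "False")  # not a valid XML attribute name
doc.appendChild(el)
print(is_reparseable(doc))
```

A round trip like this only detects output that is not well-formed; it says nothing about whether the document matches a schema.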

--
nosy: +nvetoshkin




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-11-24 Thread Andy

Andy strangefeatu...@users.sourceforge.net added the comment:

I'm also of the opinion that this would be a valuable feature to have. I
think it's a reasonable expectation that an XML library produces valid
XML. It's particularly strange that ET would output XML that it can't
itself read. Surely the job of making the input valid falls on the XML
creator - that's the point of using libraries in the first place, to
abstract away from details like not being able to use characters in the
0-32 range, in the same way that ampersands etc are auto-escaped.
Granted, it's not as clear-cut here since the low-range ASCII characters
are likely to be less frequent and the strategy to handle them is less
clear. I think the sanest behaviour would be to raise an exception by
default, although a user-configurable option to replace or omit the
characters would also make sense. If impacting performance is a concern,
maybe it would make sense to be off by default, but I would have thought
that the single regex that could perform the check would have relatively
minimal impact - and it seems to be an acceptable overhead on the
parsing side, so why not on generation?
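To make the suggestion concrete, here is a Python 3 sketch of what such a configurable check might look like. The function name `clean_xml_text` and its `errors` parameter are hypothetical, loosely modelled on the `errors` argument of str.encode; nothing like this exists in ElementTree. The character class follows the XML 1.0 Char production.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")


def clean_xml_text(text, errors="strict", replacement="\uFFFD"):
    """Hypothetical helper: handle characters that XML 1.0 cannot represent.

    errors="strict"  raises ValueError (the default suggested above),
    errors="replace" substitutes `replacement` for each bad character,
    errors="ignore"  drops the bad characters.
    """
    if errors == "strict":
        match = _INVALID.search(text)
        if match:
            raise ValueError("invalid XML character %r at offset %d"
                             % (match.group(), match.start()))
        return text
    if errors == "replace":
        # A function is used so that backslashes in `replacement` stay literal.
        return _INVALID.sub(lambda m: replacement, text)
    if errors == "ignore":
        return _INVALID.sub("", text)
    raise ValueError("unknown errors mode: %r" % (errors,))


print(repr(clean_xml_text("a\x00b", errors="replace")))
print(repr(clean_xml_text("a\x00b", errors="ignore")))
```

In strict mode, text with no bad characters costs a single regex scan and is returned unchanged.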

--
nosy: +strangefeatures




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-11-24 Thread Denis S. Otkidach

Denis S. Otkidach denis.otkid...@gmail.com added the comment:

Here is a regexp I use to clean up text (note, that I don't touch 
compatibility characters that are also not recommended in XML; some 
other developers remove them too):

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and
# FFFF)
_char_tail = ''
# Only wide (UCS-4) builds can express characters above the BMP in a class.
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
                ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' %
                                                        _char_tail,
                re.U).sub
def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)
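The code above is Python 2 (unichr, ur'' literals). A possible Python 3 port is sketched below; it is an adaptation, not the code posted in the message. In Python 3 every str can hold the full Unicode range, so the sys.maxunicode check is dropped here.

```python
import re

# XML 1.0 Char production (http://www.w3.org/TR/REC-xml/#NT-Char):
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_nontext_sub = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]").sub


def replace_nontext(text, replacement="\uFFFD"):
    # A function is used so that backslashes in `replacement` stay literal.
    return _nontext_sub(lambda match: replacement, text)


print(repr(replace_nontext("tab\tis fine, NUL\x00 is not")))
```

Tab, line feed and carriage return are kept; other control characters, lone surrogates, U+FFFE and U+FFFF are replaced.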




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-06-25 Thread Denis S. Otkidach

Denis S. Otkidach denis.otkid...@gmail.com added the comment:

Every blog engine I've ever seen so far passes comments from untrusted
users through to RSS/Atom feeds without proper validation, causing
broken XML in feeds. Sure, this is a bug in web applications, but DOM
manipulation packages should prevent the creation of broken XML to help
detect errors earlier.




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-06-24 Thread Fredrik Lundh

Fredrik Lundh fred...@effbot.org added the comment:

For ET, that's very much on purpose.  Validating data provided by every 
single application would kill performance for all of them, even if only a 
small minority would ever try to serialize data that cannot be represented 
in XML.




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-02-06 Thread Denis S. Otkidach

New submission from Denis S. Otkidach denis.otkid...@gmail.com:

ElementTree and minidom allow creation of not well-formed XML, that
can't be parsed:

>>> from xml.etree import ElementTree
>>> element = ElementTree.Element('element')
>>> element.text = u'\0'
>>> xml = ElementTree.tostring(element, encoding='utf-8')
>>> ElementTree.fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 9

>>> from xml.dom import minidom
>>> doc = minidom.getDOMImplementation().createDocument(None, None, None)
>>> element = doc.createElement('element')
>>> element.appendChild(doc.createTextNode(u'\0'))
DOM Text node 
>>> doc.appendChild(element)
<DOM Element: element at 0xb7ca688c>
>>> xml = doc.toxml(encoding='utf-8')
>>> minidom.parseString(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, colum

I believe they should raise some exception when characters not allowed
in XML (http://www.w3.org/TR/REC-xml/#NT-Char) are used in attribute
values, text nodes and CDATA sections.
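For reference, a Python 3 version of the ElementTree case above, written as a script rather than an interactive session. It is a sketch that assumes ElementTree still serialises the NUL character unchanged; in Python 3 the parse failure surfaces as ElementTree.ParseError.

```python
from xml.etree import ElementTree

element = ElementTree.Element("element")
element.text = "\0"
# Serialisation succeeds even though the text cannot be represented in XML.
xml = ElementTree.tostring(element, encoding="utf-8")
try:
    ElementTree.fromstring(xml)
except ElementTree.ParseError as exc:
    print("serialised output does not parse:", exc)
```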

--
components: Library (Lib)
messages: 81259
nosy: ods
severity: normal
status: open
title: ElementTree and minidom don't prevent creation of not well-formed XML
type: behavior
versions: Python 2.5, Python 2.6, Python 3.0




[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2009-02-06 Thread Georg Brandl

Changes by Georg Brandl ge...@python.org:


--
assignee:  -> effbot
nosy: +effbot
