[issue6191] HTMLParser attribute parsing - 2 test cases when it fails
Paweł Widera added the comment: No. As the value of the href attribute is not supposed to contain spaces, I'd rather expect the parser to assume that there is an ending " missing before the space. -- ___ Python tracker <http://bugs.python.org/issue6191> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue6191] HTMLParser attribute parsing - 2 test cases when it fails
Paweł Widera added the comment: Great! With one "but"... the second case *is* handled by browsers. Browsers do not throw an exception on it as HTMLParser does. So improvement is definitely possible here. Whether it is worth the effort is not for me to judge.
[issue6191] HTMLParser attribute parsing - 2 test cases when it fails
Paweł Widera added the comment: It depends on whether you want HTMLParser to be a useful tool that can deal with real-world HTML or just a toy without practical meaning. Crashing on every little deviation from the standard, where a more relaxed approach is possible, does not sound like a reasonable choice to me. Maybe "guess" is not the proper word... If the standard strict approach fails, the parser should fall back to a less strict one in an attempt to actually parse the document. Throwing an exception and giving up is just not good enough. Can we have somebody else comment on this one, please?
-- status: closed -> open
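The strict-then-relaxed fallback described in this comment can be sketched as follows. This is only a toy illustration on a bare attribute string, not how HTMLParser is or was implemented; the names `parse_strict`, `parse_lenient` and `parse_with_fallback` and the sample input are hypothetical, and `re.fullmatch` assumes Python 3.4 or later.

```python
import re

# Strict form: name="value" pairs separated by whitespace only.
_STRICT = re.compile(r'\s*(?:[A-Za-z_][\w-]*="[^"]*"\s*)*')
_PAIR = re.compile(r'([A-Za-z_][\w-]*)="([^"]*)"')


def parse_strict(attr_text):
    """Return (name, value) pairs, raising ValueError on any deviation."""
    if not _STRICT.fullmatch(attr_text):
        raise ValueError("malformed attributes: %r" % attr_text)
    return _PAIR.findall(attr_text)


def parse_lenient(attr_text):
    """Collect well-formed pairs and ignore whatever sits between them."""
    return _PAIR.findall(attr_text)


def parse_with_fallback(attr_text):
    """Try the strict parser first; fall back to the lenient one on failure."""
    try:
        return parse_strict(attr_text)
    except ValueError:
        return parse_lenient(attr_text)


# Hypothetical input with a stray comma between two attributes.
print(parse_with_fallback('href="x.html", target="_blank"'))
```

The lenient pass here simply skips anything that is not a well-formed pair, so it recovers from a stray comma but would not repair a missing closing quote; that case needs its own recovery rule.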
[issue6191] HTMLParser attribute parsing - 2 test cases when it fails
New submission from Paweł Widera: Of course neither is correct HTML, but both are easy to guess, so I believe the parser should not give up too quickly here.
1) extra comma between attributes
2) missing closing quotation mark for the first attribute
http://xxx.org/xxx.php?a=1 target="_blank">click me
-- components: Library (Lib) messages: 88867 nosy: momat severity: normal status: open title: HTMLParser attribute parsing - 2 test cases when it fails type: behavior versions: Python 2.6
[issue670664] HTMLParser.py - more robust SCRIPT tag parsing
Paweł Widera added the comment: A simple workaround for BeautifulSoup is the following wrapper. It sanitizes the JavaScript code before passing it to the parser by joining the disjoint strings, so that "" becomes "".

    import copy
    import re

    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

    def bs(input):
        # Matches the "+" that joins two adjacent string literals.
        pattern = re.compile(r'"\+"')
        match = lambda x: ""
        # Copy the default list so the class attribute is left unmodified.
        massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
        massage.extend([(pattern, match)])
        return BeautifulSoup(input, markupMassage=massage)

-- ___ Python tracker <http://bugs.python.org/issue670664> ___
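What the substitution in the wrapper does can be checked without BeautifulSoup. The following is a minimal sketch using only the standard `re` module; the helper name `join_split_literals` and the sample script line are hypothetical and chosen purely for illustration.

```python
import re

# Same pattern as in the wrapper: closing quote, plus sign, opening quote.
JOIN_PATTERN = re.compile(r'"\+"')


def join_split_literals(source):
    """Remove every "+" join so adjacent string literals merge into one."""
    return JOIN_PATTERN.sub("", source)


# Hypothetical input: a tag name split across two string literals.
sample = 'document.write("<scr"+"ipt src=\'a.js\'>");'
print(join_split_literals(sample))
```

Note that the pattern only matches joins written without whitespace; a join written as `"a" + "b"` is left untouched.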
[issue670664] HTMLParser.py - more robust SCRIPT tag parsing
Changes by Paweł Widera: -- nosy: +momat