[issue9577] html parser bug related with CDATA sections
Changes by R. David Murray <rdmur...@bitdance.com>:

Removed file: http://bugs.python.org/file18495/unnamed

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9577>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9577] html parser bug related with CDATA sections
New submission from Arman <arman.hunan...@gmail.com>:

When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (file html/parser.py, line 270). This method assigns self.interesting a new value, the pattern r'<(/|\Z)'. But this is not correct. Consider the following case:

    <script language=javascript>
    <!--
    if (window.adgroupid == undefined) {
        window.adgroupid = Math.round(Math.random() * 1000);
    }
    document.write('<scr'+'ipt language=javascript1.1 src=http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'></scri'+'pt>');
    //-->
    </script>

Here </scri'+'pt> matches r'<(/|\Z)', so the parser treats it as the end of the script element, gets confused, and produces wrong results. You can see such real HTML on:

    www.ahram.org.eg
    www.chefkoch.de
    www.chemieonline.de
    www.eip.gov.eg
    www.rezepte.li
    www.scienceworld.com

The solution can be to keep

    interesting_cdata_script = re.compile(r'<(/|\Z)script')
    interesting_cdata_style = re.compile(r'<(/|\Z)style')

instead of

    interesting_cdata = re.compile(r'<(/|\Z)')

and, depending on which tag begins (script or style), set_cdata_mode can assign the correct regexp to the self.interesting member.

Please contact me via email if you need more details.
arman.hunan...@gmail.com

--
components: Library (Lib)
messages: 113688
nosy: Hunanyan
priority: normal
severity: normal
status: open
title: html parser bug related with CDATA sections
type: behavior
versions: Python 3.1
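The mismatch described in this report can be reproduced with the re module alone. The following is a minimal sketch, not the parser's actual code path: it assumes the patterns in the report originally began with `<` (the archived text appears to have lost its angle brackets), and SCRIPT_BODY is a shortened, hypothetical stand-in for the script content quoted in the report.

```python
import re

# Generic CDATA-mode pattern, as quoted in the report (leading '<' assumed).
interesting_cdata = re.compile(r'<(/|\Z)')
# Tag-specific pattern proposed in the report for <script> content.
interesting_cdata_script = re.compile(r'<(/|\Z)script')

# Hypothetical, shortened script body: JavaScript that builds a closing
# tag from two string pieces, followed by the real closing tag.
SCRIPT_BODY = (
    "document.write('<scr'+'ipt src=x.js></scri'+'pt>');\n"
    "//-->\n"
    "</script>"
)

false_end = SCRIPT_BODY.index("</scri'")   # inside the JavaScript string
real_end = SCRIPT_BODY.index("</script>")  # the actual end of the element

generic = interesting_cdata.search(SCRIPT_BODY)
specific = interesting_cdata_script.search(SCRIPT_BODY)

# The generic pattern stops at the first '</', which is still inside the script.
print("generic pattern stops early:", generic.start() == false_end)
# The tag-specific pattern skips it and stops at the real closing tag.
print("specific pattern finds real end:", specific.start() == real_end)
```

In this sketch the generic pattern reports a match at the `</scri` fragment inside the JavaScript string, while the tag-specific pattern only matches at the real `</script>`. Whether that is sufficient for every page listed in the report is not something this example checks.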
[issue9577] html parser bug related with CDATA sections
R. David Murray <rdmur...@bitdance.com> added the comment:

I believe this is a duplicate of Issue670664. If you disagree, please reopen with additional information.

--
nosy: +r.david.murray
resolution:  -> duplicate
stage:  -> committed/rejected
status: open -> closed
superseder:  -> HTMLParser.py - more robust SCRIPT tag parsing