[issue9577] html parser bug related with CDATA sections

2010-08-13 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


Removed file: http://bugs.python.org/file18495/unnamed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9577
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9577] html parser bug related with CDATA sections

2010-08-12 Thread Arman

New submission from Arman arman.hunan...@gmail.com:

When HTMLParser reaches a CDATA element, it enters CDATA mode by calling 
set_cdata_mode (file html/parser.py, line 270). This method assigns the new 
value r'<(/|\Z)' to the self.interesting member, but this is not correct. 
Consider the following case:

<script language=javascript>
<!--
if (window.adgroupid == undefined) {
window.adgroupid = Math.round(Math.random() * 1000);
}
document.write('<scr'+'ipt language=javascript1.1 src=http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'></scri'+'pt>');
//-->
</script>

</scri'+'pt> matches r'<(/|\Z)', so the parser gets confused and produces wrong 
results. You can see such real HTML pages at:

www.ahram.org.eg
www.chefkoch.de
www.chemieonline.de
www.eip.gov.eg
www.rezepte.li
www.scienceworld.com 
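
The false match described above can be reproduced outside the parser. This is a minimal sketch, assuming the CDATA-mode pattern is r'<(/|\Z)' as in the report; the shortened script body and the names rawdata, false_end and true_end are illustrative only and are not taken from html/parser.py.

```python
import re

# CDATA-mode pattern as described in the report (assumed: r'<(/|\Z)').
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened, illustrative script body followed by the real closing tag.
script_body = "document.write('<scr'+'ipt src=x></scri'+'pt>');\n//-->\n"
rawdata = script_body + "</script>"

# The first place the pattern matches is inside the JavaScript string.
match = interesting_cdata.search(rawdata)
false_end = match.start()

# The position where the script element actually ends.
true_end = rawdata.index("</script>")

print("pattern stops at:", repr(rawdata[false_end:false_end + 6]))
print("stops before the real end tag:", false_end < true_end)
```

The pattern only looks for "</", so any "</" inside the script text ends CDATA mode early.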

One solution is to keep

interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

instead of

interesting_cdata = re.compile(r'<(/|\Z)')

so that, depending on which tag begins the section (script or style), 
set_cdata_mode can assign the correct regexp to the self.interesting member.
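
The proposal can be sketched as follows, assuming the patterns are meant to start with "<" and to use \Z. The CDATA_PATTERNS table and the cdata_pattern_for helper are hypothetical stand-ins for the choice set_cdata_mode would make; this is not the code in html/parser.py, and not necessarily the fix that was adopted.

```python
import re

# Tag-specific patterns in the shape proposed above (assumed "<" and \Z).
interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

# Hypothetical lookup table: which pattern to use for which opening tag.
CDATA_PATTERNS = {
    "script": interesting_cdata_script,
    "style": interesting_cdata_style,
}

def cdata_pattern_for(tag):
    """Return the end-of-CDATA pattern for the tag that opened CDATA mode."""
    return CDATA_PATTERNS[tag.lower()]

# Shortened, illustrative script body followed by the real closing tag.
rawdata = "document.write('<scr'+'ipt src=x></scri'+'pt>');\n//-->\n</script>"

# The split "</scri'+'pt>" no longer matches; only the real end tag does.
match = cdata_pattern_for("script").search(rawdata)
print(repr(rawdata[match.start():]))
```

Note that in these patterns the \Z alternative can never be followed by the tag name, so only the "</script" and "</style" forms can actually match.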


Please contact me by email if you need more details.

arman.hunan...@gmail.com

--
components: Library (Lib)
messages: 113688
nosy: Hunanyan
priority: normal
severity: normal
status: open
title: html parser bug related with CDATA sections
type: behavior
versions: Python 3.1




[issue9577] html parser bug related with CDATA sections

2010-08-12 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I believe this is a duplicate of Issue670664.  If you disagree please reopen 
with additional information.

--
nosy: +r.david.murray
resolution:  -> duplicate
stage:  -> committed/rejected
status: open -> closed
superseder:  -> HTMLParser.py - more robust SCRIPT tag parsing
