[issue13358] HTMLParser incorrectly handles cdata elements.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Has anyone else been able to verify this?

On Mon, Nov 7, 2011 at 7:46 AM, Michael Brooks <rep...@bugs.python.org> wrote:
> Michael Brooks <firealwayswo...@gmail.com> added the comment:
>
> This one should also have a priority change. Tested Python 2.7.3.
>
> --Mike
>
> On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <rep...@bugs.python.org> wrote:
>> Michael Brooks <firealwayswo...@gmail.com> added the comment:
>>
>> Yes, I am running Python 2.7.2.
>>
>> On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <rep...@bugs.python.org> wrote:
>>> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>>>
>>> Have you tried with the latest 2.7? (see msg147170)
>>>
>>> ----------
>>> nosy: +ezio.melotti
>>> stage: -> test needed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13358>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13358] HTMLParser incorrectly handles cdata elements.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

OK, so until you fix this bug, I'll be overriding HTMLParser with my fix, because this is a blocking issue for my project. My HTMLParser must behave like a browser, period, end of story.

Thanks.

On Thu, Nov 17, 2011 at 9:24 AM, Ezio Melotti <rep...@bugs.python.org> wrote:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> It seems to me that the arguments are parsed correctly, but handle_data is called multiple times between handle_starttag and handle_endtag. This might happen, e.g., when the source lines are fed one by one to the parser, but in this case it seems to happen whenever </ is found. (The tests didn't detect this because they join the data to avoid buffer artifacts.)
>
> I'm not sure if this can be considered a bug, but the situation can indeed be improved.
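The point about joining the data can be illustrated with a small sketch. This is not the patch from the report and not the code used in the test suite; it assumes Python 3's html.parser (the thread itself concerns the Python 2.7 HTMLParser module), and the names CoalescingParser and script_bodies are made up for the illustration. It simply collects every handle_data() chunk seen between the script start and end tags and joins them, so the caller gets one string no matter how many times handle_data() was called.

```python
from html.parser import HTMLParser


class CoalescingParser(HTMLParser):
    """Hypothetical helper: joins all handle_data() chunks seen inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []   # every start tag the parser reported
        self.scripts = []      # one joined string per <script> element
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # May be called once or several times per element; buffer either way.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def script_bodies(html):
    """Return (joined script bodies, start tags seen) for a piece of HTML."""
    parser = CoalescingParser()
    parser.feed(html)
    parser.close()
    return parser.scripts, parser.start_tags
```

Called on a script element whose body contains markup inside a string, script_bodies() returns the body as a single string, and only the script start tag is reported. How many chunks the parser delivers internally depends on the Python version, which is why the sketch never relies on that number.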
[issue13358] HTMLParser incorrectly handles cdata elements.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Oh, then there is a misunderstanding. No browser will parse the HTML that is declared within a JavaScript variable; it must be treated as one continuous data segment (with cdata properties) until the closing </\s*script\s*> is encountered. If this tag is found anywhere, even in a quoted string, it will still terminate the data segment, because it's a cdata element. The snippet of HTML provided must be a single data segment; </ alone is not a proper terminator.

On Thu, Nov 17, 2011 at 11:17 AM, Ezio Melotti <rep...@bugs.python.org> wrote:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> It already behaves like a browser; it just gives you data in chunks instead of calling handle_data() only once at the end.
> The documentation is not clear about this, though. It says that feed() can be called several times, but it doesn't say that handle_data() (and possibly other methods) might get called more than once. This always seems to be the case when calling feed() several times.
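The termination rule described above can be sketched with nothing but the re module. The pattern is the one quoted in the message, assuming the angle brackets that the archive dropped; script_segment is a hypothetical helper name, and the sketch is only an illustration of the rule, not browser or HTMLParser code.

```python
import re

# Closing script tag with optional whitespace, matched case-insensitively.
# Assumption: this is the pattern from the message with "<" and ">" restored.
SCRIPT_END = re.compile(r"(?i)</\s*script\s*>")


def script_segment(text):
    """Split text at the first closing script tag.

    Returns (data, rest): the data segment before the terminator and
    whatever follows it. If there is no terminator, everything is data.
    """
    match = SCRIPT_END.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.end():]
```

With this rule, a </b> inside a JavaScript string stays part of the data segment, while a </script> inside a quoted string still ends it, which is the behaviour the message attributes to browsers.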
[issue13358] HTMLParser incorrectly handles cdata elements.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

This one should also have a priority change. Tested Python 2.7.3.

--Mike

On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <rep...@bugs.python.org> wrote:
> Michael Brooks <firealwayswo...@gmail.com> added the comment:
>
> Yes, I am running Python 2.7.2.
>
> On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <rep...@bugs.python.org> wrote:
>> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>>
>> Have you tried with the latest 2.7? (see msg147170)
>>
>> ----------
>> nosy: +ezio.melotti
>> stage: -> test needed
[issue13357] HTMLParser parses attributes incorrectly.
New submission from Michael Brooks <firealwayswo...@gmail.com>:

Open the attached file red_test.html in a browser. The bad elements are blue because the style tag isn't parsed by any known browser. However, the HTMLParser library will incorrectly recognize them.

----------
components: Library (Lib)
files: red_test.html
messages: 147169
nosy: Michael.Brooks
priority: normal
severity: normal
status: open
title: HTMLParser parses attributes incorrectly.
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file23618/red_test.html

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13357>
_______________________________________
[issue13358] HTMLParser incorrectly handles cdata elements.
New submission from Michael Brooks <firealwayswo...@gmail.com>:

The HTML tag at the bottom of this page is correctly identified as having cdata-like properties and triggers set_cdata_mode(). Due to the cdata properties of this tag, the only way to end the data segment is with a closing </script> tag; NO OTHER tag can close this data segment. Currently, in cdata mode, HTMLParser uses this regular expression to close the script tag: re.compile(r'<(/|\Z)'). However, this script tag sets a variable with data that contains </b>, which terminates the script tag prematurely.

I have written and tested the following patch on my system:

    import re
    import HTMLParser

    # used to terminate cdata elements
    endtagfind_script = re.compile(r'(?i)</\s*script\s*>')
    endtagfind_style = re.compile(r'(?i)</\s*style\s*>')

    class html_patch(HTMLParser.HTMLParser):
        # Internal -- sets the proper tag terminator based on cdata element type
        def set_cdata_mode(self, tag):
            # We check if the tag is either a style or a script
            # based on self.CDATA_CONTENT_ELEMENTS
            if tag == "style":
                self.interesting = endtagfind_style
            elif tag == "script":
                self.interesting = endtagfind_script
            else:
                self.error("Unknown cdata type:" + tag)  # should never happen
            self.cdata_tag = tag

This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser:

<script> pwa.setup( pwa.searchview, 'lhid_searchheader', 'lhid_content', 'lhid_trayhandle', 'lhid_tray', {'query': 'test', 'tagQuery': '', 'searchScope': '', 'owner': '', 'doCrowding': false, 'isOwner': false, 'albumId': '' ,'experimentalsearchquality': true}, 'firealwaysworks' , {feedUrl: 'https://picasaweb.google.com/data/feed/tiny/all?alt=jsonm&amp;kind=photo&amp;access=public&amp;filter=1&amp;q=test', feedPreload: null}, 
{NEW_HOMEPAGE:1,NEW_ONE_BAR:1,fr:1,tags:1,search:1,globalsearch:1,globalsearchpromo:1,newfeatureslink:1,cart:1,contentcaching:1,developerlink:1,payments:1,newStrings:1,cccquota:1,signups:1,flashSlideshow:1,URL_SHORTENER_VISIBILITY:1,emailupload:1,photopickeralbumview:1,PWA_NEWUI:1,WILDCARD_QUERY_FEED:1,recentphotos:1,editinpicasa:1,imagesearch:1,froptin:1,FR_CONTINUOUS_CLUSTERING:1,asyncUploads:1,PERFORMANCE_EXPERIMENTS:1,BAKED_PRELOAD_FEEDS:1,albumviewlimit:1,HQ_VIDEOS:1,VIDEO_INFO_DISPLAY:1,CSI:1,EXPERIMENTAL_SEARCH_QUALITY:1,COMMENT_TRANSLATION:1,NEW_COMMENT_STYLE:1,ENABLE_NEW_FLAG_ABUSE_FORM:1,QRCODE:1,CHINA:1,GWS_URL_REDIRECTION:1,FEATURED_PHOTOS:1,COMMENT_SUBSCRIPTION:1,COMMENT_SUBSCRIPTION_SETTING:1,PICASA_MAC:1,AD_ON_SEARCHPAGE:1,API_AUTO_ACCOUNTS:1,FOCUS_GROUP_ACL:1,PHOTOSTREAM:1,BACKEND_ACL:1,ADVANCED_SEARCH:1,FACE_SEARCH:1,CAMERA_SEARCH:1,NOTIFICATION:1,PIXELATED_PREVIEW:1,TRANSPARENT_PIXELATED_PREVIEW:1,NEW_SETTINGS_PAGE:1,VIEW_STARRERS:1,FR_FOCUS_MERGE:1,AD_ON_SEARCH_ONE UP:1,GALLERY_COMMENTS:1,COMMENT_ABUSE_BLOCKING:1,FAVORITE_NOTIFICATION:1,IMAGE_ONLY_LINK:1,RECENT_PHOTOS_SLIDESHOW:1,HEART:1,SMALLER_IMAGE:1,FAST_SLIDESHOW:1,VIEW_CONTACTS:1,COLLABORATIVE_ALBUMS:1,PRINT_MARKETPLACE:1,PRINT_MARKETPLACE_REPLACEMENT:1,VIEW_COUNT:1,POST_TO:1,GAPLUS:1,PICASA_PROMO:1,DOUBLECLICK_PREMIUM_ADS:1,DOUBLECLICK_EXPLORE_MAIN:1,DOUBLECLICK_MYPHOTOS:1,DOUBLECLICK_PUBLIC_GALLERY:1,DOUBLECLICK_USER_ALBUM:1,DOUBLECLICK_USER_PHOTO:1,DOUBLECLICK_VISITOR_ADS:1,PRODUCTION:1,NOSCRIPT:1,UNLISTED_GALLERY:1,GA_TRACKING:1,UNLIMITED_GALLERY:1,PICNIK_EDIT:1,MICROSCOPE_ZOOM:1,FR_V2:1,FAVORITE_SUGGESTION:1,FAVORITE_UPDATE:1,MERGED_PROFILES_SOFTLAUNCH:1,MERGED_PROFILES:1,MERGED_PROFILES_ASYNC:1,NEW_FR_UI:1,GAPLUS_UNMERGED_SOCIALIZATION:1,OPTOUT_ACL_NOTIFICATION:1,HTTPS_VISIBILITY:1,DEFAULT_HTTPS:1,EXTENDED_EXIF:1,DOUBLECLICK_MULTISLOT:1,ONEPICK:1,PER_ALBUM_GEO_VISIBILITY:1,FOCUS_MERGE_LINK_DIALOG_VISIBILITY:1,SHAREBOX_VISIBILITY:1,AUTO_DOWNSIZE:1,BULK_ALBUM_EDITOR_VISIBILITY:1,PROF 
ILE_NAME_CHECK:1,COLLABORATIVE_NAMETAGS:1,NOT_FOUND_404:1,REDIRECT_TO_PLUS:1}, { 'gdataVersion': '4.0', 'updateCartPath': '\x2Flh\x2FupdateCart?rtok=b8S9ibYqrTMF', 'editCaptionsPath': '', 'albumMapPath': '', 'albumKmlUrl': '', 'selectedPhotosPath': '\x2Flh\x2FselectedPhotos?tok=QUI1UGxRYk9fNmw1Q2tVeS1DWnY3UlFoTTY1RzRNNWphdzoxMzIwNjAyMzA3NDYx', 'setLicensePath': '', 'setStarPath': '\x2Flh\x2FsetStar?tok=QUI1UGxRWW4zY1ZKb3U0TzROZU5tUHhIV3hhRW9HcUYwQToxMzIwNjAyMzA3NDYx', 'peopleManagerPath': '', 'peopleSearchPath': '', 'clusterViewPath': '', 'frOptStatus': 'OptedIn', 'isNameTagsVisible': '','authUserIsPhotosUser': true, 'authUserNickname': 'Some Nickname', 'authUserPortraitUrl': 'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAI\x2FAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg', 'authUserProfileUrl':'https:\x2F\x2Fprofiles.google.com\x2F115162402406836485912', 'authUser':{name:'firealwaysworks',isProfileUser:1,isLoggedIn:1,user:1,isOwner:1 ,'showGeo': 0 }, 'foreignNickname': '', 'subjects': [ ] , 'owner': {name:'firealwaysworks',nickname:'Michael Brooks',portrait:'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI
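To make the difference between the two terminators concrete, here is a minimal sketch using only the re module. OLD_TERMINATOR is the stock pattern as quoted in the report (with the angle bracket restored), and cdata_terminator() is a hypothetical generalisation of the two patterns in the patch, built per tag name; neither name comes from HTMLParser itself. The sample script body is invented, since the original page is truncated in this archive.

```python
import re

# Stock cdata-mode pattern as quoted in the report: "<" followed by "/" or end of input.
OLD_TERMINATOR = re.compile(r"<(/|\Z)")


def cdata_terminator(tag):
    """Build a case-insensitive pattern for the closing tag of `tag`.

    Hypothetical helper generalising endtagfind_script / endtagfind_style.
    """
    return re.compile(r"</\s*%s\s*>" % re.escape(tag), re.IGNORECASE)


def data_before(pattern, text):
    """Return the part of text that precedes the first match of pattern."""
    match = pattern.search(text)
    return text if match is None else text[:match.start()]


# Invented sample: a script body that assigns markup to a variable.
body = 'var html = "<b>bold</b>";</SCRIPT>'
print(repr(data_before(OLD_TERMINATOR, body)))
print(repr(data_before(cdata_terminator("script"), body)))
```

The old pattern stops at the </b> inside the string, whereas the tag-specific pattern keeps the whole statement as data and stops only at the closing script tag.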
[issue13358] HTMLParser incorrectly handles cdata elements.
Changes by Michael Brooks <firealwayswo...@gmail.com>:

----------
type: -> behavior
[issue13357] HTMLParser parses attributes incorrectly.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Yes, I am running the latest version, which is Python 2.7.2.

On Sun, Nov 6, 2011 at 12:14 PM, Ezio Melotti <rep...@bugs.python.org> wrote:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> Thanks for the report.
> Could you try with the latest 2.7 and see if you can reproduce the problem? (See the devguide for instructions.)
> If you can reproduce the issue even on the latest 2.7, it would be great if you could provide a patch with a test case like the ones in Lib/test/test_htmlparser.py.
>
> ----------
> nosy: +ezio.melotti
> stage: -> test needed
[issue13358] HTMLParser incorrectly handles cdata elements.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Yes, I am running Python 2.7.2.

On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <rep...@bugs.python.org> wrote:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> Have you tried with the latest 2.7? (see msg147170)
>
> ----------
> nosy: +ezio.melotti
> stage: -> test needed
[issue13357] HTMLParser parses attributes incorrectly.
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Python 2.7.3 is still affected by both of these issues.

On Sun, Nov 6, 2011 at 12:56 PM, Ezio Melotti <rep...@bugs.python.org> wrote:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> I mean 2.7.3 (i.e. the development version). You need to get a clone of Python as explained here: http://docs.python.org/devguide/
[issue10599] sgmllib.parse_endtag() is not respecting quoted text
New submission from Michael Brooks <firealwayswo...@gmail.com>:

The attached example is a very simple usage of sgmllib that tries to parse:

    input value=a href=http://buglink/a

The bug is that sgmllib parses this href. Browsers, on the other hand, see it as the input's value. Also keep in mind that escaping of quote marks in HTML is not like Python. \ is not a character literal, thus

    input value=\a href=http://buglink/a

is still quoted text and the href should not be parsed.

Thank you

----------
components: None
files: sgmllib_bug.py
messages: 123016
nosy: Michael.Brooks
priority: normal
severity: normal
status: open
title: sgmllib.parse_endtag() is not respecting quoted text
type: behavior
versions: Python 2.6
Added file: http://bugs.python.org/file19895/sgmllib_bug.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10599>
_______________________________________
[issue10599] sgmllib.parse_endtag() is not respecting quoted text
Michael Brooks <firealwayswo...@gmail.com> added the comment:

Oops, I made a mistake in my bug report.

    input value=\a href=http://buglink/a

is not escaped, and therefore the href should be parsed in this condition, but not in the attached sgmllib_bug.py.
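The browser behaviour the report asks for (markup inside a quoted attribute value is part of the value, not a tag) can be checked with a short sketch. Several things here are assumptions: sgmllib no longer exists in Python 3, so the sketch uses html.parser instead; the archive has stripped the quotes and brackets from the sample in the report, so the input string below is an invented stand-in, not the contents of sgmllib_bug.py; and TagRecorder and start_tags are hypothetical names.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def start_tags(html):
    """Return a list of (tag, attributes) pairs for the start tags in html."""
    parser = TagRecorder()
    parser.feed(html)
    parser.close()
    return parser.seen


# Invented sample: an anchor written inside a double-quoted attribute value.
sample = '<input value="<a href=http://buglink></a>">'
print(start_tags(sample))
```

On this input only the input tag is reported, and the anchor markup appears as the text of its value attribute, which matches what the report says browsers do.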