Re: ![CDATA[]] vs. BeautifulSoup
Ian Kelly, 04.05.2012 01:02:
> BeautifulSoup is supposed to parse like a browser would

Not at all, that would be html5lib.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
Re: ![CDATA[]] vs. BeautifulSoup
On Fri, May 4, 2012 at 12:57 AM, Stefan Behnel stefan...@behnel.de wrote:
> Ian Kelly, 04.05.2012 01:02:
>> BeautifulSoup is supposed to parse like a browser would
>
> Not at all, that would be html5lib.

Well, I guess that depends on whether we're talking about BeautifulSoup 3 (a regex-based screen scraper with methods for navigating and searching the resulting tree) or 4 (purely a parse tree navigation library that relies on external libraries to do the actual parsing). According to the BS3 documentation, "The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors."

If we're talking about BS4, though, then the problem in this instance would have nothing to do with BS4 and instead would be an issue of whatever underlying parser the OP is using.
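Since the BS4 case comes down to which underlying parser is in play, a quick way to check is sketched below. It assumes the preference order the BS4 documentation describes for its default choice (lxml, then html5lib, then the built-in html.parser); the helper name default_bs4_parser is hypothetical and this does not call into BS4 itself, it only looks at what is installed.

```python
from importlib.util import find_spec

# Assumed preference order, taken from the BS4 documentation's description
# of its default parser choice: (importable module name, BS4 parser name).
PREFERENCE = [("lxml", "lxml"), ("html5lib", "html5lib")]


def default_bs4_parser() -> str:
    """Guess which parser BS4 would pick if none is named explicitly.

    Hypothetical helper: it only checks which modules are installed.
    """
    for module_name, parser_name in PREFERENCE:
        if find_spec(module_name) is not None:
            return parser_name
    # The standard-library parser is always available as a fallback.
    return "html.parser"


print(default_bs4_parser())
```

To take the guesswork out entirely, the parser name can be passed as the second argument, for example BeautifulSoup(markup, "html5lib"), which is the one that aims to parse the way a browser does.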
Re: ![CDATA[]] vs. BeautifulSoup
On Thu, May 3, 2012 at 1:59 PM, John Nagle na...@animats.com wrote:
> An HTML page for a major site (http://www.chase.com) has some incorrect HTML. It contains ![CDATA[]] which is not valid HTML, XML, or SGML. However, most browsers ignore it. BeautifulSoup treats it as the start of a CDATA section, and consumes the rest of the document in CDATA format.
>
> Bug?

Seems like a bug to me. BeautifulSoup is supposed to parse like a browser would, so if most browsers just ignore an unterminated CDATA section, then BeautifulSoup probably should too.
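One possible workaround, independent of which parser ends up being used, is to neutralize the marker before parsing. The sketch below assumes the markup in question is a `<![CDATA[` opener in ordinary HTML content, and mimics the HTML5 "bogus comment" rule, under which everything from `<!` up to the first `>` is treated as a comment. The function name neutralize_cdata is hypothetical, and this is a pre-processing hack, not what BeautifulSoup or any browser does internally.

```python
import re

# A CDATA opener followed by everything up to the first '>' (or up to the
# end of the input when no '>' follows). This approximates the HTML5
# "bogus comment" rule for ordinary HTML content.
_BOGUS_CDATA = re.compile(r"<!\[CDATA\[(.*?)(?:>|\Z)", re.DOTALL)


def neutralize_cdata(html: str) -> str:
    """Rewrite CDATA openers as ordinary HTML comments (hypothetical helper)."""

    def to_comment(match: re.Match) -> str:
        body = "[CDATA[" + match.group(1)
        # '--' is not allowed inside a comment, so break up any such runs.
        body = body.replace("--", "- -")
        return "<!--" + body + "-->"

    return _BOGUS_CDATA.sub(to_comment, html)


cleaned = neutralize_cdata("<p>a</p><![CDATA[]]><p>b</p>")
print(cleaned)
```

The cleaned string can then be handed to the parser as usual. Note that this would be wrong for pages with inline SVG or MathML, where CDATA sections are legitimate.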