Re: "<![CDATA[]]" vs. BeautifulSoup

2012-05-04 Thread Stefan Behnel

Ian Kelly, 04.05.2012 01:02:
> BeautifulSoup is supposed to parse like a browser would

Not at all, that would be html5lib.

Stefan

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: "<![CDATA[]]" vs. BeautifulSoup

2012-05-04 Thread Ian Kelly

On Fri, May 4, 2012 at 12:57 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> Ian Kelly, 04.05.2012 01:02:
>> BeautifulSoup is supposed to parse like a browser would
>
> Not at all, that would be html5lib.

Well, I guess that depends on whether we're talking about
BeautifulSoup 3 (a regex-based screen scraper with methods for
navigating and searching the resulting tree) or 4 (purely a parse tree
navigation library that relies on external libraries to do the actual
parsing).

According to the BS3 documentation, "The BeautifulSoup class is full
of web-browser-like heuristics for divining the intent of HTML
authors."

If we're talking about BS4, though, then the problem in this instance
would have nothing to do with BS4 and instead would be an issue of
whatever underlying parser the OP is using.
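
To see what a given underlying parser does with the stray marker, it can
be fed a small sample directly. The following is a minimal sketch using
Python's standard-library html.parser (one of the parsers BS4 can
delegate to). The names EventLogger, log_events and SAMPLE are
hypothetical, and the sample markup is an assumption modelled on the
fragment quoted in this thread, not the actual chase.com page. What
happens after the marker may differ between parsers and between Python
versions, so no output is shown.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback the parser fires, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # Called for marked sections such as CDATA when the parser
        # recognizes them as declarations.
        self.events.append(("unknown_decl", data))


def log_events(markup):
    """Parse markup and return the list of (kind, value) events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush anything the parser was still buffering
    return parser.events


# Assumed sample: an unterminated CDATA-like marker between two paragraphs.
SAMPLE = "<p>before</p><![CDATA[]]<p>after</p>"

if __name__ == "__main__":
    for event in log_events(SAMPLE):
        print(event)
```

If no ("start", "p") event shows up for the second paragraph, the parser
has swallowed the markup after the marker, which is the symptom
described in the original report.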

Re: "<![CDATA[]]" vs. BeautifulSoup

2012-05-03 Thread Ian Kelly

On Thu, May 3, 2012 at 1:59 PM, John Nagle <na...@animats.com> wrote:
> An HTML page for a major site (http://www.chase.com) has
> some incorrect HTML.  It contains
>
>        <![CDATA[]]
>
> which is not valid HTML, XML, or SGML.  However, most browsers
> ignore it.  BeautifulSoup treats it as the start of a CDATA section,
> and consumes the rest of the document in CDATA format.
>
> Bug?

Seems like a bug to me.  BeautifulSoup is supposed to parse like a
browser would, so if most browsers just ignore an unterminated CDATA
section, then BeautifulSoup probably should too.
