[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-21 Thread Paweł Widera

Paweł Widera  added the comment:

No. As the value of the href attribute is not supposed to contain spaces, I'd
rather expect the parser to assume that there is an ending " missing before the
space.
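
A minimal sketch of that heuristic, written as a pre-processing step rather than a change to HTMLParser itself. The helper name `close_missing_quote` and the sample tag are hypothetical (the tag is only modeled on the second test case), and the rule is deliberately limited to href, since the argument rests on href values not containing spaces.

```python
import re

# Hypothetical helper, not part of HTMLParser. It looks for an href value that
# opens with a double quote, runs up to the first whitespace without a closing
# quote, and is followed by what looks like the next attribute (name=).
_UNCLOSED_HREF = re.compile(
    r'(\bhref=")'                   # opening of the href value
    r'([^"\s<>]*)'                  # value text: no quotes, spaces or brackets
    r'(\s+[A-Za-z_:][-\w:.]*=)'     # whitespace, then the next attribute name and "="
)


def close_missing_quote(tag):
    """Insert the missing closing quote before the first space in an href value."""
    return _UNCLOSED_HREF.sub(r'\1\2"\3', tag)


if __name__ == "__main__":
    broken = '<a href="http://xxx.org/xxx.php?a=1 target="_blank">'
    print(close_missing_quote(broken))
```

A well-formed tag is left untouched because the value pattern stops at the closing quote and never reaches the whitespace. The repaired text could then be fed to the parser as usual.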

--

___
Python tracker 
<http://bugs.python.org/issue6191>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-14 Thread Paweł Widera

Paweł Widera  added the comment:

Great! With one "but"... the second case *is* handled by browsers. Browsers do
not throw an exception on it as HTMLParser does. So improvement is definitely
possible here. Whether it is worth the effort is not for me to judge.

--




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Paweł Widera

Paweł Widera  added the comment:

It depends whether you want HTMLParser to be a useful tool that can
deal with real-world HTML or just a toy without practical meaning.
Crashing on every little deviation from the standard, where a more relaxed
approach is possible, doesn't sound to me like a reasonable choice.

Maybe "guess" is not the proper word... If the standard strict approach
fails, the parser should fall back to a less strict one in an attempt to
actually parse the document. Throwing an exception and giving up is just
not good enough.
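
The fallback idea could look roughly like the sketch below. Everything in it is hypothetical: `parse_with_fallback`, `strict_parse` and `lenient_parse` are stand-ins used only to show the control flow, not HTMLParser functions, and the "strict" rule (an even number of double quotes) is just a toy check.

```python
def parse_with_fallback(markup, parsers):
    """Try each parser callable in order and return the first result that succeeds."""
    errors = []
    for parse in parsers:
        try:
            return parse(markup)
        except ValueError as exc:
            # Remember why this parser gave up, then try the next, less strict one.
            errors.append(exc)
    raise ValueError("all parsers failed: %r" % (errors,))


def strict_parse(markup):
    """Toy strict parser: rejects markup with an unbalanced double quote."""
    if markup.count('"') % 2:
        raise ValueError("unbalanced quotation mark")
    return ("strict", markup)


def lenient_parse(markup):
    """Toy lenient parser: accepts anything."""
    return ("lenient", markup)


if __name__ == "__main__":
    print(parse_with_fallback('<a href="x>', [strict_parse, lenient_parse]))
```

The caller still gets an exception when every parser fails, but only after the more tolerant attempts have been made.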

Can we have somebody else commenting on this one please?

--
status: closed -> open




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Paweł Widera

New submission from Paweł Widera :

Of course neither is correct HTML, but both are easy to guess, so I believe
the parser should not give up too quickly here.

1) extra comma between attributes


2) missing closing quotation mark for the first attribute
http://xxx.org/xxx.php?a=1 target="_blank">click me

--
components: Library (Lib)
messages: 88867
nosy: momat
severity: normal
status: open
title: HTMLParser attribute parsing - 2 test cases when it fails
type: behavior
versions: Python 2.6




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera

Paweł Widera  added the comment:

A simple workaround for BeautifulSoup is the following wrapper. It
sanitizes the JavaScript code before passing it to the parser by joining
the disjoint strings, so that "" becomes "".

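import copy
import re
from BeautifulSoup import BeautifulSoup

def bs(input):
    pattern = re.compile(r'"\+"')  # the '"+"' joining two string literals
    match = lambda x: ""  # replace each match with nothing
    # Copy the defaults so the shared MARKUP_MASSAGE list is not mutated.
    massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    massage.extend([(pattern, match)])
    return BeautifulSoup(input, markupMassage=massage)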
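
To see what the massage rule does on its own, here is a small sketch that applies the same regular expression without BeautifulSoup. The function name `join_split_strings` and the sample script line are hypothetical; the input is only an assumed example of a tag name split across two string literals.

```python
import re

# Same pattern as in the wrapper: a closing quote, a plus sign and an opening
# quote with nothing in between.
JOINED_STRINGS = re.compile(r'"\+"')


def join_split_strings(script_source):
    """Remove every '"+"' so that adjacent string literals become one literal."""
    return JOINED_STRINGS.sub("", script_source)


if __name__ == "__main__":
    print(join_split_strings('document.write("<scr"+"ipt>")'))
```

The pattern only matches when there is no whitespace around the plus sign, so a line written as `"a" + "b"` would pass through unchanged.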

--

___
Python tracker 
<http://bugs.python.org/issue670664>
___



[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera

Changes by Paweł Widera :


--
nosy: +momat
