[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-05-14 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

What I described in my previous message is what Firefox does.  If you think 
this should be changed, I suggest you open another issue, possibly attaching 
a test case with the desired behavior and a patch to change it.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue6191
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-21 Thread Paweł Widera

Paweł Widera mo...@man.poznan.pl added the comment:

No. As the value of the href attribute is not supposed to contain spaces, I'd 
rather expect the parser to assume that there is an ending " missing before the 
space.

--




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-14 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The first case has been fixed already in 1cbfeffea19f, the second case is not 
even handled by browsers, so I'm closing this.

--
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> closed




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-14 Thread Paweł Widera

Paweł Widera mo...@man.poznan.pl added the comment:

Great! With one but... the second case *is* handled by browsers. Browsers do 
not throw an exception on it as HTMLParser does. So improvement is definitely 
possible here. Whether it is worth the effort is not for me to judge.

--




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-14 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

So you are suggesting that 
  <a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>
should result in an 'a' element with an href attribute equal to 
"http://xxx.org/xxx.php?a=1 target=" and then discard _blank" as extra data?

--




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2011-04-05 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
versions: +Python 3.2, Python 3.3 -Python 2.6




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-06 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

BeautifulSoup uses SGMLParser for all versions < 3.1. BeautifulSoup 
3.1 is supposed to be compatible with Python 3 and, since SGMLParser is
gone, it now uses HTMLParser, but it's not able to handle some things
anymore.

For more information:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

(FWIW I tried BeautifulSoup 3.1 but it failed where BeautifulSoup 3.0.7
was working, so I went back to 3.0.7)

--
nosy: +ezio.melotti




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Paweł Widera

New submission from Paweł Widera mo...@man.poznan.pl:

Of course neither is correct HTML, but the intent is easy to guess, so I
believe the parser should not give up too quickly here.

1) extra comma between attributes
<form action="/xxx.php?a=1&amp;b=2&amp", method="post">

2) missing closing quotation mark for the first attribute
<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>
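
To see what a given interpreter does with these two inputs, a minimal
sketch along the following lines may help. It assumes Python 3, where the
module is named html.parser, and it assumes the markup below is what the
report contained before the archive stripped quotes and angle brackets.
StartTagCollector and start_tags are names made up for this sketch. No
output is shown because the result depends on the Python version.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as seen by the parser.
        self.tags.append((tag, attrs))


def start_tags(markup):
    """Return the (tag, attrs) pairs that HTMLParser reports for markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# Reconstructed test cases (an assumption, see the note above).
cases = {
    "extra comma": '<form action="/xxx.php?a=1&amp;b=2&amp", method="post">',
    "missing quote": '<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>',
}

for name, markup in cases.items():
    try:
        print(name, start_tags(markup))
    except Exception as exc:
        # Releases that reject the input end up here instead.
        print(name, "raised", type(exc).__name__, exc)
```

Comparing the printed attribute lists with what a browser builds for the
same markup shows how far the two interpretations differ.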

--
components: Library (Lib)
messages: 88867
nosy: momat
severity: normal
status: open
title: HTMLParser attribute parsing - 2 test cases when it fails
type: behavior
versions: Python 2.6




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

I do not think HTMLParser should guess.  Guessing always opens the door
to misinterpretation.

--
nosy: +georg.brandl
resolution:  -> wont fix
status: open -> closed




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Paweł Widera

Paweł Widera mo...@man.poznan.pl added the comment:

It depends on whether you want HTMLParser to be a useful tool that can
deal with real-world HTML or just a toy without practical meaning.
Crashing on every little deviation from the standard, where a more relaxed
approach is possible, doesn't sound like a reasonable choice to me.

Maybe guess is not a proper word... If the standard strict approach
fails, the parser should fall back to a less strict one in an attempt to
actually parse the document. Throwing an exception and giving up is just
not good enough.

Can we have somebody else commenting on this one please?

--
status: closed -> open




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

> Throwing an exception and giving up is just not good enough.

Yes it is, in some cases. There are forgiving HTML parsers out there;
HTMLParser does not strive to be one.

There are *so many* cases where HTML is a bit malformed that it takes
more than just two exceptions to get it right.  It's for a reason that
browsers' parsers are so complex.  If you add these corner cases, people
will come asking for this exception, and that one, etc.

--
status: open -> pending




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

In doing web scraping I started using BeautifulSoup precisely because it
was very lenient in what HTML it accepted (I haven't written such an app
for a while, so I'm not sure what BeautifulSoup currently does...I
thought I heard it was now using HTMLParser...).

There are a lot of messed up web pages out there.

I don't have time right now to evaluate your particular cases, but my
rule of thumb would be that if the major web browsers do something
reasonable with these cases, then a python tool designed to read web
pages should do so as well, where possible.  (Be liberal in what you
accept, and strict in what you generate.)

That said, I'm not sure what HTMLParser's design goals are, so this may
not be an appropriate goal for the module.

--
nosy: +r.david.murray
priority:  -> normal
status: pending -> open




[issue6191] HTMLParser attribute parsing - 2 test cases when it fails

2009-06-04 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

So BeautifulSoup is using HTMLParser? That is interesting, because they
claim to support broken HTML.

In any case, if a quirky mode is added, it should have to be turned on
explicitly by a flag.
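
A rough sketch of what such an opt-in could look like from the caller's
side, written as a wrapper around HTMLParser rather than as a change to
it. Everything here is hypothetical: the quirks argument, the
parse_start_tags and looks_well_formed names, and above all the toy
strictness check, which only counts double quotes and stands in for
whatever a real strict mode would verify.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def looks_well_formed(markup):
    # Toy check, an assumption for this sketch only: an odd number of
    # double quotes means some attribute value was never closed.
    return markup.count('"') % 2 == 0


def parse_start_tags(markup, quirks=False):
    """Parse markup; refuse suspicious input unless quirks is True."""
    if not quirks and not looks_well_formed(markup):
        raise ValueError("malformed markup; pass quirks=True to parse anyway")
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

With the default quirks=False the caller gets an error for doubtful
input, so lenient handling only happens when it was asked for.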

--
resolution: wont fix -> 
