[issue3590] sax.parser considers XML as text rather than bytes

2008-09-03 Thread Benjamin Peterson

Benjamin Peterson [EMAIL PROTECTED] added the comment:

This is a duplicate of #2501.

--
resolution:  -> duplicate
status: open -> closed

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3590
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue3590] sax.parser considers XML as text rather than bytes

2008-08-21 Thread Benjamin Peterson

Changes by Benjamin Peterson [EMAIL PROTECTED]:


--
priority: critical -> release blocker




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Antoine Pitrou [EMAIL PROTECTED] added the comment:

From the discussion on the python-3000, it looks like it would be nice
if sax.parser handled both bytes and unicode streams.

Edward, does your simple fix make sax.parser work entirely well with
byte streams?




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Edward K Ream

Edward K Ream [EMAIL PROTECTED] added the comment:

On Mon, Aug 18, 2008 at 1:51 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:

> Antoine Pitrou [EMAIL PROTECTED] added the comment:
>
> From the discussion on the python-3000, it looks like it would be nice
> if sax.parser handled both bytes and unicode streams.
>
> Edward, does your simple fix make sax.parser work entirely well with
> byte streams?

No. The sax.parser seems to have other problems.  Here is what I *think* I
know ;-)

1. A smallish .leo file (an xml file) containing a single non-ascii (utf-8)
encoded character appears to have been read correctly with Python 3.0.

2. A larger .leo file fails as follows (it's possible that the duplicate
error messages are a Leo problem):

Traceback (most recent call last):
Traceback (most recent call last):

  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString
  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString

  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)

  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)
  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)

  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()
  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()

  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))

  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)

  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74:
character maps to <undefined>
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74:
character maps to <undefined>

The same calls to sax read the file correctly on Python 2.5.

It would be nice to have a message pinpoint the line and character offset of
the problem.
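
As a rough sketch of what such a message could be built from: UnicodeDecodeError carries the byte offset of the failure in its start attribute, which is enough to compute a line and column. The helper name locate_decode_error and the sample bytes are made up for illustration; they are not part of sax or Leo.

```python
def locate_decode_error(data, encoding):
    """Return (line, column, byte) of the first undecodable byte, or None.

    Line and column are 1-based; the column counts bytes, not characters.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first byte that failed.
        line = data.count(b"\n", 0, exc.start) + 1
        line_start = data.rfind(b"\n", 0, exc.start) + 1
        column = exc.start - line_start + 1
        return line, column, data[exc.start]
    return None


# Hypothetical sample: UTF-8 bytes C3 81 on the third line, decoded as cp1252.
sample = b"<leo>\n<v>ok</v>\n<v>\xc3\x81</v>\n</leo>\n"
location = locate_decode_error(sample, "cp1252")
```

Here location points at the 0x81 byte, the same byte value the traceback above complains about.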

My vote would be for the code to work on both kinds of input streams. This
would save the users considerable confusion if sax does the (tricky)
conversions automatically.
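
For comparison, here is a small sketch of what "working on both kinds of input streams" looks like from the caller's side. It assumes a recent Python 3 release, where xml.sax accepts text streams as well as byte streams; that is an observation about later versions, not something this thread establishes. The TextCollector handler and collect_text helper are hypothetical names.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects the character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def collect_text(stream):
    """Parse a text or binary stream and return its character data."""
    handler = TextCollector()
    xml.sax.parse(stream, handler)
    return "".join(handler.chunks)


document = "<leo>caf\u00e9</leo>"
from_text = collect_text(io.StringIO(document))
from_bytes = collect_text(io.BytesIO(document.encode("utf-8")))
```

Both calls should yield the same string, so the caller does not need to know which kind of stream it was handed.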

Imo, now would be the most convenient time to attempt this--there is a
certain freedom in having everything be partially broken :-)

Edward

Edward K. Ream email: [EMAIL PROTECTED]
Leo: http://webpages.charter.net/edreamleo/front.html


Added file: http://bugs.python.org/file11147/unnamed


[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Edward K Ream

Edward K Ream [EMAIL PROTECTED] added the comment:

On Mon, Aug 18, 2008 at 11:00 AM, Antoine Pitrou [EMAIL PROTECTED] wrote:

> Antoine Pitrou [EMAIL PROTECTED] added the comment:
>
> > Just to be clear, I am at present totally confused about io streams :-)
>
> Python 3.0 distinguishes more clearly between unicode strings (called "str"
> in 3.0) and byte strings (called "bytes" in 3.0). The most important
> point is that there is no longer any implicit conversion between the
> two: you must explicitly use .encode() or .decode().
>
> Files opened in binary ("rb") mode return byte strings, but files
> opened in text ("r") mode return unicode strings, which means you can't
> give a text file to a 3.0 library expecting a binary file, or vice versa.
>
> What is more worrying is that XML, until decoded, should be considered a
> byte stream, so sax.parser should accept binary files rather than text
> files. I took a look at test_sax and indeed it considers XML as text
> rather than bytes :-(

Thanks for these remarks.  They confirm what I suspected, but was unsure of,
namely that it seems strange to be passing something other than a byte
stream to parser.parse.
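
A minimal sketch of the distinction described in the quoted comment. The temporary file and the sample string are assumptions made for illustration; neither appears in the thread.

```python
import os
import tempfile

# Hypothetical sample content; any non-ASCII text would do.
text = "caf\u00e9"
data = text.encode("utf-8")  # str -> bytes must be requested explicitly

# Write the bytes to a temporary file so both open modes can be compared.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(data)

with open(path, "rb") as f:  # binary mode yields bytes
    raw = f.read()
with open(path, "r", encoding="utf-8") as f:  # text mode yields str
    decoded = f.read()
os.remove(path)

# Mixing the two types raises TypeError instead of converting implicitly.
try:
    raw + decoded
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The text-mode open() passes an explicit encoding here; without one, the result depends on the platform default.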


> Bumping this as critical because it needs a decision very soon (ideally
> before beta3).

Thanks for taking this seriously.

Edward

P.S.  I love the new unicode plans.  They are going to cause some pain at
first for everyone (Python team and developers), but in the long run they
are going to be a big plus for Python.

EKR

Edward K. Ream email: [EMAIL PROTECTED]
Leo: http://webpages.charter.net/edreamleo/front.html


Added file: http://bugs.python.org/file11148/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Antoine Pitrou [EMAIL PROTECTED] added the comment:

> The same calls to sax read the file correctly on Python 2.5.

What are those calls exactly?
Why is cp1252 used as an encoding? Is it what is specified in the XML
file? Or do you somehow feed stdin to the SAX parser? (if the latter,
you aren't testing bytes handling since stdin/stdout/stderr are text
streams in py3k)




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc [EMAIL PROTECTED] added the comment:

I guess that the file is simply opened in text mode ("r"). This uses the
preferred encoding, which is cp1252 on (western) Windows machines.

--
nosy: +amaury.forgeotdarc
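
The failure mode guessed at above can be reproduced in isolation. This is only a sketch: the character U+00C1 is an assumed example, chosen because its UTF-8 encoding contains byte 0x81, and the thread does not say which character was actually in the .leo file.

```python
import locale

# U+00C1 encodes to the two bytes C3 81 in UTF-8.
utf8_bytes = "\u00c1".encode("utf-8")

# Python's cp1252 codec leaves byte 0x81 unmapped, so decoding fails
# with the same 'charmap' error as in the reported traceback.
try:
    utf8_bytes.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The locale's preferred encoding, which text-mode open() has traditionally
# used when no encoding is given; the value depends on platform and locale.
default_encoding = locale.getpreferredencoding(False)
```

On a western Windows machine of that era, default_encoding would typically have been cp1252, which matches the codec named in the traceback.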




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Edward K Ream

Edward K Ream [EMAIL PROTECTED] added the comment:

On Mon, Aug 18, 2008 at 4:15 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:

> Antoine Pitrou [EMAIL PROTECTED] added the comment:
>
> > The same calls to sax read the file correctly on Python 2.5.
>
> What are those calls exactly?

  parser = xml.sax.make_parser()
  parser.setFeature(xml.sax.handler.feature_external_ges,1)
  handler = saxContentHandler(c,inputFileName,silent,inClipboard)
  parser.setContentHandler(handler)
  parser.parse(theFile)

As discussed in http://bugs.python.org/issue3590

theFile is a file opened with 'rb' attributes
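
For readers without Leo at hand, the calls above can be approximated by a self-contained sketch. The ElementNames handler is a hypothetical stand-in for Leo's saxContentHandler, the sample document is invented, and the setFeature call is left out for brevity. It assumes a recent Python 3 release.

```python
import os
import tempfile
import xml.sax


class ElementNames(xml.sax.ContentHandler):
    """Hypothetical stand-in for Leo's saxContentHandler: records element names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# Invented sample document, written as UTF-8 bytes with a matching declaration.
document = '<?xml version="1.0" encoding="utf-8"?><leo_file><v>\u00c1</v></leo_file>'
fd, path = tempfile.mkstemp(suffix=".leo")
with os.fdopen(fd, "wb") as f:
    f.write(document.encode("utf-8"))

handler = ElementNames()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
with open(path, "rb") as the_file:  # binary mode: the parser sees the raw bytes
    parser.parse(the_file)
os.remove(path)
```

Because the file is opened in binary mode, no locale-dependent decoding happens before the bytes reach the parser; the encoding is taken from the XML declaration.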

Edward


Edward K. Ream email: [EMAIL PROTECTED]
Leo: http://webpages.charter.net/edreamleo/front.html


Added file: http://bugs.python.org/file11151/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Changes by Antoine Pitrou [EMAIL PROTECTED]:


Removed file: http://bugs.python.org/file11145/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Changes by Antoine Pitrou [EMAIL PROTECTED]:


Removed file: http://bugs.python.org/file11147/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Changes by Antoine Pitrou [EMAIL PROTECTED]:


Removed file: http://bugs.python.org/file11148/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Changes by Antoine Pitrou [EMAIL PROTECTED]:


Removed file: http://bugs.python.org/file11151/unnamed




[issue3590] sax.parser considers XML as text rather than bytes

2008-08-18 Thread Antoine Pitrou

Antoine Pitrou [EMAIL PROTECTED] added the comment:

Ok, then xml.sax looks rather broken.

(By the way, can you avoid sending HTML emails? Each time you send one,
the bug tracker attaches a file named "unnamed". I've removed all 4 of
them now.)
