Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hi Antoine,

Antoine Pitrou  writes:
> Damien Diederen writes:
>> I couldn't figure out a way to get rid of it short of multi-#including
>> "templates" and playing with the C preprocessor, however, and have the
>> nagging feeling the latter would be frowned upon by the maintainers.
>> 
>> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
>> wrong about that.  Should I give it a try, and see how "clean" the
>> result can be made?
>
> Keep in mind that json is externally maintained by Bob. The more we rework his
> code, the less easy it will be to backport other changes from the simplejson
> library.
>
> I think we should either keep the code duplication (if we want to keep fast
> paths for both bytes and str objects), or only keep one of the two versions as
> my patch does.

Yes, I was (slowly) reaching the same conclusion.

>> Provided one of the alternatives is dropped, wouldn't it be better to do
>> the opposite, i.e., have the decoder take bytes as input, and the
>> encoder produce bytes—and layer the str functionality on top of that?  I
>> guess the answer depends on how the (most common) lower layers are
>> structured, but it would be nice to allow a straight bytes path to/from
>> the underlying transport.
>
> The straightest path is actually to/from unicode, since JSON data can contain
> unicode strings but no byte strings. Also, the json library /has/ to output
> unicode when `ensure_ascii` is False. In 2.x:
>
> >>> json.dumps([u"éléphant"], ensure_ascii=False)
> u'["\xe9l\xe9phant"]'
>
> In any case, I don't think it will matter much in terms of speed
> whether we take one route or the other. UTF-8 encoding/decoding is
> probably much faster (in characters per second) than JSON
> encoding/decoding is.

You're undoubtedly right.  I was more concerned about the interaction
with other modules, and avoiding unnecessary copies/conversions
especially when they don't make sense from the user's perspective.

I will whip up a patch adding a {loadb,dumpb} API as you suggested in
another email, with the most trivial implementation, and then we'll see
where to go from there.
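
For illustration, the "most trivial implementation" could be sketched as
thin wrappers over the str-based API. The names dumpb and loadb come from
this thread; the UTF-8 default and the `encoding` keyword are assumptions
of this sketch, not a committed design.

```python
import json

def dumpb(obj, encoding="utf-8", **kwargs):
    """Serialize obj to JSON and return bytes (hypothetical helper)."""
    return json.dumps(obj, **kwargs).encode(encoding)

def loadb(data, encoding="utf-8", **kwargs):
    """Parse JSON from a bytes object (hypothetical helper)."""
    return json.loads(data.decode(encoding), **kwargs)

# Round trip: object -> bytes -> object.
payload = dumpb(["éléphant"], ensure_ascii=False)
print(payload)
print(loadb(payload))
```

Layering bytes on top of str this way costs one extra encode/decode
pass, which is the trade-off discussed above.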

It can still be dropped if there is a concern of perpetuating a "bad
idea," or I can follow up with a port of Bob's "bytes" implementation
from 2.x if there is any interest.

> Regards
> Antoine.

Cheers,
Damien

-- 
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Antoine Pitrou
Damien Diederen writes:
> 
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.
> 
> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
> wrong about that.  Should I give it a try, and see how "clean" the
> result can be made?

Keep in mind that json is externally maintained by Bob. The more we rework his
code, the less easy it will be to backport other changes from the simplejson
library.

I think we should either keep the code duplication (if we want to keep fast
paths for both bytes and str objects), or only keep one of the two versions as
my patch does.

> Provided one of the alternatives is dropped, wouldn't it be better to do
> the opposite, i.e., have the decoder take bytes as input, and the
> encoder produce bytes—and layer the str functionality on top of that?  I
> guess the answer depends on how the (most common) lower layers are
> structured, but it would be nice to allow a straight bytes path to/from
> the underlying transport.

The straightest path is actually to/from unicode, since JSON data can contain
unicode strings but no byte strings. Also, the json library /has/ to output
unicode when `ensure_ascii` is False. In 2.x:

>>> json.dumps([u"éléphant"], ensure_ascii=False)
u'["\xe9l\xe9phant"]'

In any case, I don't think it will matter much in terms of speed whether we take
one route or the other. UTF-8 encoding/decoding is probably much faster (in
characters per second) than JSON encoding/decoding is.

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hi Eric,

"Eric Smith"  writes:
>> I couldn't figure out a way to get rid of it short of multi-#including
>> "templates" and playing with the C preprocessor, however, and have the
>> nagging feeling the latter would be frowned upon by the maintainers.
>
> Not sure if this is exactly what you mean, but look at Objects/stringlib.
> str.format() and unicode.format() share the same implementation, using
> stringdefs.h and unicodedefs.h.

That's indeed a much better example!  I'm more comfortable applying the
same technique to the json module now that I see it used in the core.

(Provided Bob and Antoine are not turned away by the relative ugliness,
that is.)

> Eric.

Cheers,
Damien

--
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Bob Ippolito
On Mon, Apr 27, 2009 at 7:25 AM, Damien Diederen  wrote:
>
> Antoine Pitrou  writes:
>> Hello,
>>
>> We're in the process of forward-porting the recent (massive) json
>> updates to 3.1, and we are also thinking of dropping remnants of
>> support of the bytes type in the json library (in 3.1, again). This
>> bytes support almost didn't work at all, but there was a lot of C and
>> Python code for it nevertheless. We're also thinking of dropping the
>> "encoding" argument in the various APIs, since it is useless.
>
> I had a quick look into the module on both branches, and at Antoine's
> latest patch (json_py3k-3).  The current situation on trunk is indeed
> not very pretty in terms of code duplication, and I agree it would be
> nice not to carry that forward.
>
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.
>
> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
> wrong about that.  Should I give it a try, and see how "clean" the
> result can be made?
>
>> Under the new situation, json would only ever allow str as input, and
>> output str as well. By posting here, I want to know whether anybody
>> would oppose this (knowing, once again, that bytes support is already
>> broken in the current py3k trunk).
>
> Provided one of the alternatives is dropped, wouldn't it be better to do
> the opposite, i.e., have the decoder take bytes as input, and the
> encoder produce bytes—and layer the str functionality on top of that?  I
> guess the answer depends on how the (most common) lower layers are
> structured, but it would be nice to allow a straight bytes path to/from
> the underlying transport.
>
> (I'm willing to have a go at the conversion in case somebody is
> interested.)
>
> Bob, would you have an idea of which lower layers are most commonly used
> with the json module, and whether people are more likely to expect strs
> or bytes in Python 3.x?  Maybe that data could be inferred from some bug
> tracking system?

I don't know what Python 3.x users expect. As far as I know, none of
the lower layers of the json package are used directly. They're
certainly not supposed to be or documented as such.

My use case for dumps is typically bytes output because we push it
straight to and from IO. Some people embed JSON in other documents
(e.g. HTML) where you would want it to be text. I'm pretty sure that
the IO case is more common.

-bob


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Eric Smith
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.

Not sure if this is exactly what you mean, but look at Objects/stringlib.
str.format() and unicode.format() share the same implementation, using
stringdefs.h and unicodedefs.h.

Eric.



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hello,

Antoine Pitrou  writes:
> Hello,
>
> We're in the process of forward-porting the recent (massive) json
> updates to 3.1, and we are also thinking of dropping remnants of
> support of the bytes type in the json library (in 3.1, again). This
> bytes support almost didn't work at all, but there was a lot of C and
> Python code for it nevertheless. We're also thinking of dropping the
> "encoding" argument in the various APIs, since it is useless.

I had a quick look into the module on both branches, and at Antoine's
latest patch (json_py3k-3).  The current situation on trunk is indeed
not very pretty in terms of code duplication, and I agree it would be
nice not to carry that forward.

I couldn't figure out a way to get rid of it short of multi-#including
"templates" and playing with the C preprocessor, however, and have the
nagging feeling the latter would be frowned upon by the maintainers.

There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
wrong about that.  Should I give it a try, and see how "clean" the
result can be made?

> Under the new situation, json would only ever allow str as input, and
> output str as well. By posting here, I want to know whether anybody
> would oppose this (knowing, once again, that bytes support is already
> broken in the current py3k trunk).

Provided one of the alternatives is dropped, wouldn't it be better to do
the opposite, i.e., have the decoder take bytes as input, and the
encoder produce bytes—and layer the str functionality on top of that?  I
guess the answer depends on how the (most common) lower layers are
structured, but it would be nice to allow a straight bytes path to/from
the underlying transport.

(I'm willing to have a go at the conversion in case somebody is
interested.)

Bob, would you have an idea of which lower layers are most commonly used
with the json module, and whether people are more likely to expect strs
or bytes in Python 3.x?  Maybe that data could be inferred from some bug
tracking system?

> The bug entry is: http://bugs.python.org/issue4136
>
> Regards
> Antoine.

Regards,
Damien

-- 
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-14 Thread Lino Mastrodomenico
2009/4/13 Daniel Stutzbach :
> print("Content-Type: application/json; charset=utf-8")

Please don't do that! According to RFC 4627 the "charset" parameter is
not allowed for the application/json media type.

Just use "Content-Type: application/json", the charset is only
misleading because even if you specify, e.g., ISO-8859-1 a
standard-compliant receiver will probably still try to interpret the
content as UTF-8/16/32.

OTOH a charset can be used if you send JSON with an
application/javascript MIME type.
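
A receiver can do without the charset parameter because the encoding can
be inferred from the payload itself. As I recall, section 3 of RFC 4627
describes looking at the pattern of NUL octets in the first four bytes,
relying on the first two characters of a JSON text being ASCII. The
function below is a minimal sketch of that idea; its name is hypothetical
and it ignores byte order marks.

```python
def guess_json_encoding(data):
    """Guess the Unicode encoding of JSON bytes from the NUL pattern of
    the first four octets (assumes the text starts with two ASCII chars)."""
    head = data[:4]
    if len(head) < 4:
        return "utf-8"              # too short to tell; assume the default
    nul = [b == 0 for b in head]
    if nul == [True, True, True, False]:
        return "utf-32-be"          # 00 00 00 xx
    if nul == [False, True, True, True]:
        return "utf-32-le"          # xx 00 00 00
    if nul == [True, False, True, False]:
        return "utf-16-be"          # 00 xx 00 xx
    if nul == [False, True, False, True]:
        return "utf-16-le"          # xx 00 xx 00
    return "utf-8"                  # xx xx xx xx

print(guess_json_encoding('["a"]'.encode("utf-16-le")))
```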

-- 
Lino Mastrodomenico


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Stephen J. Turnbull
Warning: Reply-To set to email-sig.

Greg Ewing writes:

 > Only for headers known to be unstructured, I think.
 > Completely unknown headers should be available only
 > as bytes.

Why do I get the feeling that you guys are feeling up an
elephant?

There are four things you might want to do with a header:

(1) Put it on the wire, which must be bytes (in fact, ASCII).
(2) Show it to a user (such as a rootin-tootin spam-fightin mail
admin), which for consistency with well-behaved, implemented
headers (ie, you might want to *gasp* *concatenate* your unknown
header with a string), will sooner or later be string (ie,
Unicode).
(3) (Try to) parse it, in which case an internal representation with
some other structure may or may not be appropriate for storing the
parsed data.
(4) Munge it, in which case an internal representation with some other
structure may or may not be appropriate.

I see no particular reason for restricting these basic API classes for
any header.


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Martin v. Löwis
> Below is a basic CGI application that assumes that json module works
> with str, not bytes.  How would you write it if the json module does not
> support returning a str?

In a CGI application, you shouldn't be using sys.stdin or print().
Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
if you use TextIO, there likely will be bugs (e.g. if you have
attachments of type application/octet-stream).

> print("Content-Type: application/json; charset=utf-8")
> input_object = json.loads(sys.stdin.read())
> output_object = do_some_work(input_object)
> print(json.dumps(output_object))
> print()

out = sys.stdout.buffer.raw
out.write(b"Content-Type: application/json; charset=utf-8\n\n")
input_object = json.loads(sys.stdin.buffer.raw.read())
output_object = do_some_work(input_object)
out.write(json.dumps(output_object))
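
The snippet above assumes a json module that reads and returns bytes.
With a str-only json module, the same binary-IO approach needs an explicit
decode and encode step. The following is a sketch under that assumption;
do_some_work() is a placeholder, handle() is a hypothetical name, and
in-memory streams stand in for sys.stdin.buffer and sys.stdout.buffer.

```python
import io
import json

def do_some_work(obj):
    # Placeholder for the application logic (hypothetical).
    return {"echo": obj}

def handle(stdin, stdout):
    """Read a JSON request from a binary stream, write a JSON response."""
    request = json.loads(stdin.read().decode("utf-8"))
    body = json.dumps(do_some_work(request)).encode("utf-8")
    # Headers are written as bytes too, so no text layer is involved.
    stdout.write(b"Content-Type: application/json\r\n\r\n")
    stdout.write(body)

# In a real CGI script: handle(sys.stdin.buffer, sys.stdout.buffer)
out = io.BytesIO()
handle(io.BytesIO(b'{"n": 1}'), out)
print(out.getvalue())
```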

> What's the benefit of preventing users from getting a str out if that's
> what they want?

If they really want it, there is no benefit from preventing them.
I'm just puzzled why they want it, and what possible applications
might be where they want it. Perhaps they misunderstand something
when they think they want it.

Regards,
Martin



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Greg Ewing

Alexandre Vassalotti wrote:

> print("Content-Type: application/json; charset=utf-8")
> input_object = json.loads(sys.stdin.read())
> output_object = do_some_work(input_object)
> print(json.dumps(output_object))
> print()

That assumes the encoding being used by stdout has
ascii as a subset.
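
To make the point concrete, here is a small sketch that prints the same
header through two text layers. The emit_header() helper is hypothetical;
it wraps an in-memory buffer the way sys.stdout wraps the real output
stream.

```python
import io

def emit_header(encoding):
    """Print a header line through a text layer; return the raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print("Content-Type: application/json", file=text)
    text.flush()
    return raw.getvalue()

utf8_bytes = emit_header("utf-8")       # ASCII-compatible encoding
utf16_bytes = emit_header("utf-16-le")  # not an ASCII superset

print(utf8_bytes)
print(utf16_bytes[:8])
```

With UTF-16 as the stream encoding, every header character is followed by
a NUL byte, so the header on the wire would no longer be ASCII.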

--
Greg


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Antoine Pitrou
Bob Ippolito writes:
> 
> The output of json/simplejson dumps for Python 2.x is either an ASCII
> bytestring (default) or a unicode string (when ensure_ascii=False).
> This is very practical in 2.x because an ASCII bytestring can be
> treated as either text or bytes in most situations, isn't going to get
> mangled over any kind of encoding mismatch (as long as it's an ASCII
> superset), and skips an encoding step if getting sent over the wire.

Which means that the json module already deals with text rather than bytes,
apart from the optimization that pure ASCII text is returned as 8-bit strings.

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Alexandre Vassalotti
On Mon, Apr 13, 2009 at 5:25 PM, Daniel Stutzbach
 wrote:
> On Mon, Apr 13, 2009 at 3:02 PM, "Martin v. Löwis" 
> wrote:
>>
>> > True, I can always convert from bytes to str or vice versa.
>>
>> I think you are missing the point. It will not be necessary to convert.
>
> Sometimes I want bytes and sometimes I want str.  I am going to be
> converting some of the time. ;-)
>
> Below is a basic CGI application that assumes that json module works with
> str, not bytes.  How would you write it if the json module does not support
> returning a str?
>
> print("Content-Type: application/json; charset=utf-8")
> input_object = json.loads(sys.stdin.read())
> output_object = do_some_work(input_object)
> print(json.dumps(output_object))
> print()
>

Like this?

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.buffer.read())
output_object = do_some_work(input_object)
sys.stdout.buffer.write(json.dumps(output_object))


-- Alexandre


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Greg Ewing

Barry Warsaw wrote:
> The default would probably be some unstructured parser for headers
> like Subject.

Only for headers known to be unstructured, I think.
Completely unknown headers should be available only
as bytes.

--
Greg


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Daniel Stutzbach
On Mon, Apr 13, 2009 at 3:02 PM, "Martin v. Löwis" wrote:

> > True, I can always convert from bytes to str or vice versa.
>
> I think you are missing the point. It will not be necessary to convert.


Sometimes I want bytes and sometimes I want str.  I am going to be
converting some of the time. ;-)

Below is a basic CGI application that assumes that json module works with
str, not bytes.  How would you write it if the json module does not support
returning a str?

print("Content-Type: application/json; charset=utf-8")
input_object = json.loads(sys.stdin.read())
output_object = do_some_work(input_object)
print(json.dumps(output_object))
print()

> The question is: which of them is more appropriate, if, what you want,
> is bytes. I argue that the second form is better, since it saves you
> an encode invocation.
>

If what you want is bytes, encoding has to happen somewhere.  If the json
module has some optimizations to do the encoding at the same time as the
serialization, great.  However, based on the original post of this thread,
it sounds like that code doesn't exist or doesn't work correctly.

What's the benefit of preventing users from getting a str out if that's what
they want?

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Daniel Stutzbach
On Mon, Apr 13, 2009 at 3:28 PM, Bob Ippolito  wrote:

> It's not a bug in dumps, it's a matter of not reading the
> documentation. The encoding parameter of dumps decides how byte
> strings should be interpreted, not what the output encoding is.
>

You're right; I apologize for not reading more closely.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Bob Ippolito
On Mon, Apr 13, 2009 at 1:02 PM, "Martin v. Löwis"  wrote:
>> Yes, there's a TCP connection.  Sorry for not making that clear to begin
>> with.
>>
>>     If so, it doesn't matter what representation these implementations chose
>>     to use.
>>
>>
>> True, I can always convert from bytes to str or vice versa.
>
> I think you are missing the point. It will not be necessary to convert.
> You can write the JSON into the TCP connection in Python, and it will
> come out just fine as strings in C# and JavaScript. This
> is how middleware works - it abstracts from programming languages, and
> allows for different representations in different languages, in a
> manner invisible to the participating processes.
>
>> At least one of these two needs to work:
>>
>> json.dumps({}).encode('utf-16le')  # dumps() returns str
>> '{\x00}\x00'
>>
>> json.dumps({}, encoding='utf-16le')  # dumps() returns bytes
>> '{\x00}\x00'
>>
>> In 2.6, the first one works.  The second incorrectly returns '{}'.
>
> Ok, that might be a bug in the JSON implementation - but you shouldn't
> be using utf-16le, anyway. Use UTF-8 always, and it will work fine.
>
> The question is: which of them is more appropriate, if, what you want,
> is bytes. I argue that the second form is better, since it saves you
> an encode invocation.

It's not a bug in dumps, it's a matter of not reading the
documentation. The encoding parameter of dumps decides how byte
strings should be interpreted, not what the output encoding is.

The output of json/simplejson dumps for Python 2.x is either an ASCII
bytestring (default) or a unicode string (when ensure_ascii=False).
This is very practical in 2.x because an ASCII bytestring can be
treated as either text or bytes in most situations, isn't going to get
mangled over any kind of encoding mismatch (as long as it's an ASCII
superset), and skips an encoding step if getting sent over the wire.

>>> simplejson.dumps(['\x00f\x00o\x00o'], encoding='utf-16be')
'["foo"]'
>>> simplejson.dumps(['\x00f\x00o\x00o'], encoding='utf-16be', 
>>> ensure_ascii=False)
u'["foo"]'
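
For comparison, a rough Python 3 counterpart of the example above: since
current Python 3 json.dumps() does not accept bytes values, the caller
interprets the input bytes explicitly before serializing. This is a
sketch of that usage, not of the 2.x `encoding` parameter itself.

```python
import json

raw = b"\x00f\x00o\x00o"          # "foo" encoded as UTF-16-BE
text = raw.decode("utf-16-be")     # interpret the input bytes explicitly
result = json.dumps([text])        # output is str, independent of the input encoding

print(result)
```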

-bob


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Martin v. Löwis
> Yes, there's a TCP connection.  Sorry for not making that clear to begin
> with.
> 
> If so, it doesn't matter what representation these implementations chose
> to use.
> 
> 
> True, I can always convert from bytes to str or vice versa.

I think you are missing the point. It will not be necessary to convert.
You can write the JSON into the TCP connection in Python, and it will
come out just fine as strings in C# and JavaScript. This
is how middleware works - it abstracts from programming languages, and
allows for different representations in different languages, in a
manner invisible to the participating processes.

> At least one of these two needs to work:
> 
> json.dumps({}).encode('utf-16le')  # dumps() returns str
> '{\x00}\x00'
> 
> json.dumps({}, encoding='utf-16le')  # dumps() returns bytes
> '{\x00}\x00'
> 
> In 2.6, the first one works.  The second incorrectly returns '{}'.

Ok, that might be a bug in the JSON implementation - but you shouldn't
be using utf-16le, anyway. Use UTF-8 always, and it will work fine.

The question is: which of them is more appropriate, if, what you want,
is bytes. I argue that the second form is better, since it saves you
an encode invocation.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread James Y Knight

On Apr 13, 2009, at 10:11 AM, Barry Warsaw wrote:
> The email package does not need a parser for every header, but it
> should provide a framework that applications (or third party
> libraries) can use to extend the built-in header parsers.  A bare
> minimum for functionality requires a Content-Type parser.  I think
> the email package should also include an address header (Originator,
> Destination) parser, and a Message-ID header parser.  Possibly others.

Sure, that's fine...

> The default would probably be some unstructured parser for headers
> like Subject.

But for unknown headers, it's not a useful choice to return a "str"  
object. "str" is just one possible structured data representation for  
a header: there's no correct useful decoding of all headers into str.  
Of course for the "Subject" header, str is the correct result type,  
but that's not a default, that's explicit support for "Subject". You  
can't correctly decode "To" into a str, so what makes you think you  
can decode "X-Gabazaborph" into str?


The only useful and correct representation for unknown (or  
unimplemented) headers is the raw bytes.


James



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Daniel Stutzbach
On Mon, Apr 13, 2009 at 12:19 PM, "Martin v. Löwis" wrote:

> > I use the json module in 2.6 to communicate with a C# JSON library and a
> > JavaScript JSON library.  The C# and JavaScript libraries produce and
> > consume the equivalent of str, not the equivalent of bytes.
>
> I assume there is a TCP connection between the json module and the
> C#/JavaScript libraries?
>

Yes, there's a TCP connection.  Sorry for not making that clear to begin
with.

I also sometimes store JSON objects in a database.  In that case, I pass
strings to the database API which stores them in a TEXT field.  Obviously
somewhere they get encoded to bytes, but that's handled by the database.


> If so, it doesn't matter what representation these implementations chose
> to use.


True, I can always convert from bytes to str or vice versa.  Sometimes it is
illustrative to see how others have chosen to solve the same problem.  The
JSON specification and other implementations serialize an object to a
string.  Python's json.dumps() needs to either return a str or let the user
specify an encoding.

At least one of these two needs to work:

json.dumps({}).encode('utf-16le')  # dumps() returns str
'{\x00}\x00'

json.dumps({}, encoding='utf-16le')  # dumps() returns bytes
'{\x00}\x00'

In 2.6, the first one works.  The second incorrectly returns '{}'.
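
For reference, the first form carries over unchanged to a str-returning
dumps() in Python 3. A minimal sketch, using only the standard json
module:

```python
import json

# dumps() returns str; the caller picks the wire encoding explicitly.
encoded = json.dumps({}).encode("utf-16-le")
print(encoded)

# The receiving side decodes with the same encoding before parsing.
decoded = json.loads(encoded.decode("utf-16-le"))
print(decoded)
```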

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Martin v. Löwis
> I use the json module in 2.6 to communicate with a C# JSON library and a
> JavaScript JSON library.  The C# and JavaScript libraries produce and
> consume the equivalent of str, not the equivalent of bytes.

I assume there is a TCP connection between the json module and the
C#/JavaScript libraries?

If so, it doesn't matter what representation these implementations chose
to use.

> Hope that helps,

Maybe I misunderstood, and you are *not* communicating over the wire.
In this case, I'm puzzled how you get the data from Python to the C#
JSON library, or to the JavaScript library.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Daniel Stutzbach
On Fri, Apr 10, 2009 at 10:06 PM, "Martin v. Löwis" wrote:

> However, I really think that this question cannot be answered by
> reading the RFC. It should be answered by verifying how people use
> the json library in 2.x.
>

I use the json module in 2.6 to communicate with a C# JSON library and a
JavaScript JSON library.  The C# and JavaScript libraries produce and
consume the equivalent of str, not the equivalent of bytes.

Yes, the data eventually has to go over a socket as bytes, but that's often
handled by a different layer of code.

For JavaScript, data is typically received via XMLHttpRequest(), which
automatically figures out the encoding from the HTTP headers and/or other
information (defaulting to UTF-8) and returns a str-like object that I pass
to the JavaScript JSON library.

For C#, I wrap the socket in a StreamReader object, which decodes the byte
stream into a string stream (similar to Python's new TextIOWrapper class).
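
The Python equivalent of that layering might look like the sketch below:
a binary stream is wrapped in a TextIOWrapper and the json module only
ever sees text. The read_json() helper is hypothetical, and a BytesIO
object stands in for the socket.

```python
import io
import json

def read_json(binary_stream, encoding="utf-8"):
    """Decode a binary stream as text and parse one JSON document from it."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    return json.load(text_stream)

# BytesIO stands in for the byte stream coming off the wire.
wire = io.BytesIO('{"animal": "éléphant"}'.encode("utf-8"))
message = read_json(wire)
print(message)
```

For a real connection, the binary stream could come from something like
sock.makefile("rb").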

Hope that helps,

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-13 Thread Barry Warsaw

On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:

> Until you write a parser for every header, you simply cannot decode
> to unicode. The only sane choices are:
>
> 1) raw bytes
> 2) parsed structured data


The email package does not need a parser for every header, but it  
should provide a framework that applications (or third party  
libraries) can use to extend the built-in header parsers.  A bare  
minimum for functionality requires a Content-Type parser.  I think the  
email package should also include an address header (Originator,  
Destination) parser, and a Message-ID header parser.  Possibly  
others.  The default would probably be some unstructured parser for  
headers like Subject.
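
One possible shape for such a framework, purely as a sketch: a registry
that maps header names to parser functions. All names here are
hypothetical and this is not the email package's API. Falling back to the
raw bytes for unregistered headers is an assumption of the sketch; an
unstructured text parser could be plugged in as the fallback instead.

```python
_parsers = {}

def register(name):
    """Decorator that registers a parser for one header name."""
    def decorator(func):
        _parsers[name.lower()] = func
        return func
    return decorator

@register("Content-Type")
def parse_content_type(raw):
    # Split "type/subtype; key=value; ..." into a type and a parameter dict.
    main, _, rest = raw.decode("ascii").partition(";")
    params = {}
    for item in rest.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip().strip('"')
    return main.strip().lower(), params

def parse_header(name, raw):
    """Return parsed data if a parser is registered, else the raw bytes."""
    parser = _parsers.get(name.lower())
    return parser(raw) if parser else raw

print(parse_header("Content-Type", b"text/plain; charset=utf-8"))
print(parse_header("X-Gabazaborph", b"\xff"))
```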


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-11 Thread Mark Hammond

On 11/04/2009 6:12 PM, Antoine Pitrou wrote:
> Martin v. Löwis writes:
>> Not sure whether it would be *significantly* faster, but yes, Bob wrote
>> an accelerator for parsing out of a byte string to make it really fast;
>> IIRC, he claims that it is faster than pickling.
>
> Isn't premature optimization the root of all evil?
>
> Besides, the fact that many values in a typical JSON object will be strings, and
> must be encoded from/decoded to unicode objects in py3k, suggests that
> accepting/outputting unicode as default is the laziest (i.e. the best) choice
> performance-wise.


I don't see it as premature optimization, but rather trying to ensure 
the interface/api best suits the actual use cases.



> But you don't have to trust me: look at the quick numbers I've posted. The py3k
> version (in the str-only incarnation I've proposed) is sometimes actually faster
> than the trunk version:
> http://mail.python.org/pipermail/python-dev/2009-April/088498.html


But if all *actual* use-cases involve moving to and from utf8 encoded 
bytes, I'm not sure that little example is particularly useful.  In 
those use-cases, I'd be surprised if there wasn't significant time and 
space benefits in not asking apps to use an 'intermediate' string object 
before getting the bytes they need, particularly when the payload may be 
a significant size.


Assuming the above is all true, I'd see choosing bytes less as a 
premature optimization and more a design choice which best supports 
actual use.  So to my mind the only real question is whether the above 
*is* true, or if there are common use-cases which don't involve 
utf8-off/on-the-wire...
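
To make the trade-off concrete, here is a minimal sketch of what a bytes-oriented caller does on top of a str-only API. It assumes Python 3's standard json module; the payload is made up for illustration.

```python
import json

payload = {"name": "éléphant", "tags": ["a", "b"]}

# Encoder side: json.dumps() returns str, so the caller encodes for the wire.
text = json.dumps(payload, ensure_ascii=False)
wire = text.encode("utf-8")  # the intermediate str is thrown away after this

# Decoder side: the caller decodes the received bytes, then parses the str.
received = json.loads(wire.decode("utf-8"))
```

The intermediate `text` object is the extra copy being discussed; whether it costs anything noticeable depends on payload size.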


Cheers,

Mark



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-11 Thread Chris Withers

gl...@divmod.com wrote:


> My preference would be that
>
>    message.headers['Subject'] = b'Some Bytes'
>
> would simply raise an exception.  If you've got some bytes, you should
> instead do
>
>    message.bytes_headers['Subject'] = b'Some Bytes'


Remind me again why you need to differentiate between headers and 
bytes_headers?


I think bytes headers are evil. If you don't know the encoding when you 
have one, who does or ever will?


>    message.headers['Subject'] = Header(bytes=b'Some Bytes',
>                                        encoding='utf-8')
>
> Explicit is better than implicit, right?


Indeed, and the case for the above would be to keep idempotence of 
incoming messages in applications like mailman...


...otherwise we could just decode them and be done with it.

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-11 Thread Stephen J. Turnbull
Greg Ewing writes:

 > The reason you use a text format in the first place is that
 > you have some way of transmitting text, and you want to
 > send something that isn't text. In that situation, the
 > encoding is already determined by whatever means you're
 > using to send the text.

Determined, yes, but all too often in a nondeterministic way.  That's
precisely the problem that the spec is trying to avert.  People often
schlep "text" around as if that were well-defined, forcing receivers
to guess what is meant.  Having a spec isn't going to stop them, but
at least you can lash them with a wet noodle.

The specification of at least the abstract character repertoire and
coded character set also allows implementers like Python to proceed
confidently with their usual internal encoding.



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-11 Thread Antoine Pitrou
Martin v. Löwis writes:
> 
> Not sure whether it would be *significantly* faster, but yes, Bob wrote
> an accelerator for parsing out of a byte string to make it really fast;
> IIRC, he claims that it is faster than pickling.

Isn't premature optimization the root of all evil?

Besides, the fact that many values in a typical JSON object will be strings, and
must be encoded from/decoded to unicode objects in py3k, suggests that
accepting/outputting unicode as default is the laziest (i.e. the best) choice
performance-wise.

But you don't have to trust me: look at the quick numbers I've posted. The py3k
version (in the str-only incarnation I've proposed) is sometimes actually faster
than the trunk version:
http://mail.python.org/pipermail/python-dev/2009-April/088498.html

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Martin v. Löwis
> I'm personally leaning slightly towards strings, putting the burden on
> bytes-users of json to explicitly use the appropriate encoding, even in
> cases where it *must* be utf8.  On the other hand, I'm too lazy to dig
> back through this large thread, but I seem to recall a suggestion that
> using bytes would be significantly faster. 

Not sure whether it would be *significantly* faster, but yes, Bob wrote
an accelerator for parsing out of a byte string to make it really fast;
IIRC, he claims that it is faster than pickling.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Mark Hammond

[Dropping email sig]

On 11/04/2009 1:06 PM, "Martin v. Löwis" wrote:


> However, I really think that this question cannot be answered by
> reading the RFC. It should be answered by verifying how people use
> the json library in 2.x.


In the absence of anything more formal, here are 2 anecdotes:

* The python-twitter package seems to:
  - Use dumps() mainly to get string objects.  It uses it both for 
__str__, and for an API called 'AsJsonString' - the intent of this seems 
to be to provide strings for the consumer of the twitter API - it's not 
clear how such consumers would use them.  Note that this API doesn't 
seem to need to 'write' json objects, else I suspect they would then be 
expecting dumps to return bytes to put on the wire.  They expect loads 
to accept the bytes they are reading directly off the wire.


* couchdb's wrappers use these functions purely as bytes - they are 
either decoding an application/json object from the bits they read, or 
they are encoding it to use directly in the body of a request (or even 
directly in the URL of the request!)


I find myself conflicted.  On one hand I believe the most common use of 
json will be to exchange data with something inherently byte-based.  On 
the other hand though, json itself seems to be naturally "stringy" and 
the most natural interface for a casual user would be strings.


I'm personally leaning slightly towards strings, putting the burden on 
bytes-users of json to explicitly use the appropriate encoding, even in 
cases where it *must* be utf8.  On the other hand, I'm too lazy to dig 
back through this large thread, but I seem to recall a suggestion that 
using bytes would be significantly faster.  If that is true, I'd be 
happy to settle for bytes as I believe the most common *actual* use of 
json will be via things like the twitter and couch libraries - and may 
even be a key bottleneck for such libraries - so people will not be 
directly exposed to its interface...


Cheers,

Mark


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Martin v. Löwis
>> In email's case this is true, but in JSON's case it's not.  JSON is a
>> format defined as a sequence of code points; MIME is defined as a
>> sequence of octets.
> 
> What is the 'bytes support' issue for json?  Is it about content within
> a json text? Or about the transport format of a json text?

The question is whether the json parsing should take bytes or str as
input, and whether the json marshalling should produce bytes or str.
More specifically, the question is whether it is ok to drop bytes.

I personally think that it needs to support bytes, and that perhaps
str support is optional (as you could always explicitly encode the
str as UTF-8 before passing it to the JSON parser, if you somehow
managed to get a str of JSON to parse).

However, I really think that this question cannot be answered by
reading the RFC. It should be answered by verifying how people use
the json library in 2.x.

> The standard does not specify any correspondence between representations
> and domain objects

And that is not the issue at all; nobody is debating what output the
parsing should produce.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Greg Ewing

Paul Moore wrote:


> 3.  Encoding
>
>    JSON text SHALL be encoded in Unicode.  The default encoding is
>    UTF-8.
>
> This is at best confused (in my utterly non-expert opinion :-)) as
> Unicode isn't an encoding...


I'm inclined to agree. I'd go further and say that if JSON
is really meant to be a text format, the standard has no
business mentioning encodings at all.

The reason you use a text format in the first place is that
you have some way of transmitting text, and you want to
send something that isn't text. In that situation, the
encoding is already determined by whatever means you're
using to send the text.

--
Greg


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Terry Reedy

gl...@divmod.com wrote:
> On 03:21 am, ncogh...@gmail.com wrote:
>> Barry Warsaw wrote:
>>> I don't know whether the parameter thing will work or not, but you're
>>> probably right that we need to get the bytes-everywhere API first.
>>
>> Given that json is a wire protocol, that sounds like the right approach
>> for json as well. Once bytes-everywhere works, then a text API can be
>> built on top of it, but it is difficult to build a bytes API on top of a
>> text one.
>
> I wish I could agree, but JSON isn't really a wire protocol.  According
> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
> serialization of structured data".  There are some notes about encoding,
> but it is very clearly described in terms of unicode code points.
>
>> So I guess the IO library *is* the right model: bytes at the bottom of
>> the stack, with text as a wrapper around it (mediated by codecs).
>
> In email's case this is true, but in JSON's case it's not.  JSON is a
> format defined as a sequence of code points; MIME is defined as a
> sequence of octets.


What is the 'bytes support' issue for json?  Is it about content within 
a json text? Or about the transport format of a json text?


Reading rfc4627, a json text is a unicode string representation of an 
instance of one of 6 classes.  In Python terms, they are NoneType, bool, 
numbers (int, float, decimal?), (unicode) str, list, and [string-keyed] 
dict.  The representation is nearly identical to Python's literals and 
displays.


For transport, the encoding SHALL be one of UTF-8, -16LE/BE, -32LE/BE, 
with UTF-8 the 'default'.


So a json parser (a restricted eval()) tokenizes and parses a stream of 
unicode chars which in Python could come from either a unicode string or 
decoded bytes object.  The bytes decoding could be either bulk or 
incremental.


Similarly, a json generator (a repr()-like function) produces a stream 
of unicode chars which again could be optionally encoded to bytes, 
either incrementally or in bulk.
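
A small sketch of the bulk versus incremental decoding mentioned above, assuming Python 3's codecs and json modules. The json module itself parses a complete str here; only the bytes-to-text step is incremental, and the sample text and chunk size are arbitrary.

```python
import codecs
import json

text = '{"animal": "éléphant"}'
raw = text.encode("utf-8")

# Bulk: decode the whole bytes object, then parse.
bulk = json.loads(raw.decode("utf-8"))

# Incremental: feed fixed-size chunks; the decoder buffers any multi-byte
# sequence that a chunk boundary happens to split (size 4 splits the second é).
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [raw[i:i + 4] for i in range(0, len(raw), 4)]
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))
incremental = json.loads("".join(pieces))
```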


The standard does not specify any correspondence between representations 
and domain objects.  For Python making 'null', 'true', and 'false' 
inter-convert with None, True, False is obvious.  Numbers are slightly 
more problematic.  A generator could produce decimal literals from 
both floats and decimals but without a non-json extension, a parser 
could only convert back to one, so the other would not round-trip. (Int 
could be handled by the presence or absence of '.0'.)  Similarly, tuples 
could be represented, like lists, as json square-bracketed arrays, but 
they would be converted back to lists, not tuples, unless a non-json 
extension were used.


So the two possible byte-support content issues I see are how to 
represent them as legal json strings and/or whether some device should 
be added to make them round-trip.  But as indicated above, these two 
issues are not unique to bytes.
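
The round-trip points above can be checked with a short sketch, assuming Python 3's standard json module; the sample dictionary is made up.

```python
import json
from decimal import Decimal

original = {"flag": True, "nothing": None, "pair": (1, 2), "whole": 1, "real": 1.0}

# Serialize and parse again with the default settings.
restored = json.loads(json.dumps(original))

# A parser can only map a JSON number with a fraction to one Python type;
# parse_float picks Decimal instead of float for this call.
as_decimal = json.loads("[1.10]", parse_float=Decimal)
```

The tuple comes back as a list, while int and float survive because the generator writes `1` and `1.0` differently.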


Terry Jan Reedy



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Stephen J. Turnbull
Robert Brewer writes:

 > Syntactically, there's no sense in providing:
 > 
 > Message.set_header('Subject', 'Some text', encoding='utf-16')
 > 
 > ...since you could more clearly write the same as:
 > 
 > Message.set_header('Subject', 'Some text'.encode('utf-16'))

Which you now must *parse* and guess the encoding to determine how to
RFC-2047-encode the binary mush.  I think the encoding parameter is
necessary here.

 > But it would be far easier to do all the encoding at once in an
 > output() or serialize() method. Do different headers need different
 > encodings?

You can have multiple encodings within a single header (and a naïve
algorithm might very well encode "The price of Gödel-Escher-Bach is
€25" as "The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is
=?ISO-8859-15?Q?=A425?=").

 > If so, make message['Subject'] a subclass of str and give it an
 > .encoding attribute (with a default).

But if you've set the .encoding attribute, you don't need to encode
'Some text'; .set_header() can take care of it for you.  And what
about the possibility that the encoding attributes disagree with the
argument you passed to the codec?
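
As an illustration of several encodings inside one header, here is a minimal sketch that decodes the example header quoted above, assuming Python 3's email.header module.

```python
from email.header import decode_header, make_header

raw = ("The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is "
       "=?ISO-8859-15?Q?=A425?=")

# decode_header() returns (value, charset) pairs, one per run of the header.
parts = decode_header(raw)
charsets = {charset for _, charset in parts if charset}

# make_header() reassembles the parts into a Header that str() can render.
text = str(make_header(parts))
```

Each encoded-word carries its own charset, which is why a single per-header encoding attribute cannot describe this value.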



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Aahz
On Fri, Apr 10, 2009, Barry Warsaw wrote:
> On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:
>>
>> Shouldn't headers always be text?
>
> /me weeps

/me hands Barry a hankie
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > > (3) The default transfer encoding syntax is UTF-8.
 > 
 > Notice that the RFC is partially irrelevant. It only applies
 > to the application/json mime type, and JSON is used in various
 > other protocols, using various other encodings.

Sure.  That's their problem.  In Python, Unicode is the native
encoding, and we have codecs to deal with the outside world, no?  That
happens to match very well not only with RFC 4627, but the sidebar on
json.org that defines JSON.

 > > I think it's a bad idea for any of the core JSON API to accept or
 > > produce bytes in any language that provides a Unicode string type.
 > 
 > So how do you integrate the encoding detection that the RFC suggests
 > to be done?

I suggest you don't.  That's mission creep.  Think about writing tests
for it, and remember that out in the wild those "various other
encodings" almost certainly include Shift JIS, Big5, and KOI8-R.  Both
those considerations point to "er, let's delegate detection and
en/decoding to the nice folks who maintain the codec suite."  Where
it's embedded in some other protocol which specifies a TES, the TES
can be implemented there, too.

As I wrote earlier, I don't see anything wrong with providing a
wrapper module that deals with some default/common/easy cases.  But
I'd stick it in the contrib directory.


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Robert Brewer
On Thu, 2009-04-09 at 22:38 -0400, Barry Warsaw wrote:
> On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:
> 
> > On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw  wrote:
> > Anyway, aside from that decision, I haven't come up with an elegant  
> > way to allow /output/ in both bytes and strings (input is I think  
> > theoretically easier by sniffing the arguments).
> >
> > Won't this work? (assuming dumps() always returns a string)
> >
> > def dumpb(obj, encoding='utf-8', *args, **kw):
> >     s = dumps(obj, *args, **kw)
> >     return s.encode(encoding)
> 
> So, what I'm really asking is this.  Let's say you agree that there  
> are use cases for accessing a header value as either the raw encoded  
> bytes or the decoded unicode.  What should this return:
> 
>  >>> message['Subject']
> 
> The raw bytes or the decoded unicode?
> 
> Okay, so you've picked one.  Now how do you spell the other way?
> 
> The Message class probably has these explicit methods:
> 
>  >>> Message.get_header_bytes('Subject')
>  >>> Message.get_header_string('Subject')
> 
> (or better names... it's late and I'm tired ;).  One of those maps to  
> message['Subject'] but which is the more obvious choice?
> 
> Now, setting headers.  Sometimes you have some unicode thing and  
> sometimes you have some bytes.  You need to end up with bytes in the  
> ASCII range and you'd like to leave the header value unencoded if so.   
> But in both cases, you might have bytes or characters outside that  
> range, so you need an explicit encoding, defaulting to utf-8 probably.
> 
>  >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>  >>> Message.set_header('Subject', b'Some bytes')
> 
> One of those maps to
> 
>  >>> message['Subject'] = ???
> 
> I'm open to any suggestions here!

Syntactically, there's no sense in providing:

Message.set_header('Subject', 'Some text', encoding='utf-16')

...since you could more clearly write the same as:

Message.set_header('Subject', 'Some text'.encode('utf-16'))

The only interesting case is if you provided a *default* encoding, so that:

Message.default_header_encoding = 'utf-16'
Message.set_header('Subject', 'Some text')

...has the same effect.

But it would be far easier to do all the encoding at once in an output()
or serialize() method. Do different headers need different encodings? If
so, make message['Subject'] a subclass of str and give it an .encoding
attribute (with a default). If not, Message.header_encoding should be
sufficient.


Robert Brewer
fuman...@aminus.org



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Barry Warsaw

On Apr 10, 2009, at 1:19 AM, gl...@divmod.com wrote:

> On 02:38 am, ba...@python.org wrote:
>> So, what I'm really asking is this.  Let's say you agree that there
>> are use cases for accessing a header value as either the raw
>> encoded bytes or the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> My personal preference would be to just deprecate this API and
> get rid of it, replacing it with a slightly more explicit one.
>
>   message.headers['Subject']
>   message.bytes_headers['Subject']

This is pretty darn clever Glyph.  Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers
should be the decoded header (rather than have .headers return the
bytes thingie and say .decoded_headers return the decoded thingies),
but I do like the general approach.

>> Now, setting headers.  Sometimes you have some unicode thing and
>> sometimes you have some bytes.  You need to end up with bytes in
>> the ASCII range and you'd like to leave the header value unencoded
>> if so. But in both cases, you might have bytes or characters
>> outside that range, so you need an explicit encoding, defaulting to
>> utf-8 probably.
>
>   message.headers['Subject'] = 'Some text'
>
> should be equivalent to
>
>   message.headers['Subject'] = Header('Some text')

Yes, absolutely.  I think we're all in general agreement that header
values should be instances of Header, or subclasses thereof.

> My preference would be that
>
>   message.headers['Subject'] = b'Some Bytes'
>
> would simply raise an exception.  If you've got some bytes, you
> should instead do
>
>   message.bytes_headers['Subject'] = b'Some Bytes'
>
> or
>
>   message.headers['Subject'] = Header(bytes=b'Some Bytes',
>                                       encoding='utf-8')
>
> Explicit is better than implicit, right?


Yes.

Again, I really like the general idea, if I might quibble about some  
of the details.  Thanks for a great suggestion.
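
A rough sketch of how the proposed split could behave. Nothing here exists in the email package: Header, Message, TextHeaders and BytesHeaders are toy stand-ins written only for this illustration, and the UTF-8 default is an assumption taken from the discussion.

```python
class Header:
    """Toy header value: decoded text plus the encoding used on the wire."""

    def __init__(self, text=None, *, bytes=None, encoding="utf-8"):
        # The keyword is spelled `bytes` only to mirror the thread's example.
        if (text is None) == (bytes is None):
            raise TypeError("pass either text or bytes=, not both")
        self.encoding = encoding
        self.text = text if text is not None else bytes.decode(encoding)

    def to_bytes(self):
        return self.text.encode(self.encoding)


class TextHeaders(dict):
    """Mapping that accepts str or Header and rejects raw bytes."""

    def __setitem__(self, name, value):
        if isinstance(value, str):
            value = Header(value)
        if not isinstance(value, Header):
            raise TypeError("use bytes_headers or Header(bytes=...) for bytes")
        super().__setitem__(name, value)


class BytesHeaders:
    """Bytes view over the same underlying header values."""

    def __init__(self, text_headers):
        self._text_headers = text_headers

    def __getitem__(self, name):
        return self._text_headers[name].to_bytes()

    def __setitem__(self, name, raw):
        self._text_headers[name] = Header(bytes=raw)


class Message:
    def __init__(self):
        self.headers = TextHeaders()
        self.bytes_headers = BytesHeaders(self.headers)
```

Both attributes share one store, so a value set through either view is visible through the other.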


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Martin v. Löwis
> (3) The default transfer encoding syntax is UTF-8.

Notice that the RFC is partially irrelevant. It only applies
to the application/json mime type, and JSON is used in various
other protocols, using various other encodings.

> I think it's a bad idea for any of the core
> JSON API to accept or produce bytes in any language that provides a
> Unicode string type.

So how do you integrate the encoding detection that the RFC suggests
to be done?

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Bob Ippolito
On Fri, Apr 10, 2009 at 8:38 AM, Stephen J. Turnbull  wrote:
> Paul Moore writes:
>
>  > On the other hand, further down in the document:
>  >
>  > """
>  > 3.  Encoding
>  >
>  >    JSON text SHALL be encoded in Unicode.  The default encoding is
>  >    UTF-8.
>  >
>  >    Since the first two characters of a JSON text will always be ASCII
>  >    characters [RFC0020], it is possible to determine whether an octet
>  >    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>  >    at the pattern of nulls in the first four octets.
>  > """
>  >
>  > This is at best confused (in my utterly non-expert opinion :-)) as
>  > Unicode isn't an encoding...
>
> The word "encoding" (by itself) does not have a standard definition
> AFAIK.  However, since Unicode *is* a "coded character set" (plus a
> bunch of hairy usage rules), there's nothing wrong with saying "text
> is encoded in Unicode".  The RFC 2130 and Unicode TR#17 taxonomies are
> annoyingly verbose and pedantic to say the least.
>
> So what is being said there (in UTR#17 terminology) is
>
> (1) JSON is *text*, that is, a sequence of characters.
> (2) The abstract repertoire and coded character set are defined by the
>    Unicode standard.
> (3) The default transfer encoding syntax is UTF-8.
>
>  > That implies that loads can/should also allow bytes as input, applying
>  > the given algorithm to guess an encoding.
>
> It's not a guess, unless the data stream is corrupt---or nonconforming.
>
> But it should not be the JSON package's responsibility to deal with
> corruption or non-conformance (eg, ISO-8859-15-encoded programs).
> That's the whole point of specifying the coded character set in the
> standard in the first place.  I think it's a bad idea for any of the core
> JSON API to accept or produce bytes in any language that provides a
> Unicode string type.
>
> That doesn't mean Python's module shouldn't provide convenience
> functions to read and write JSON serialized as UTF-8 (in fact, that
> *should* be done, IMO) and/or other UTFs (I'm not so happy about
> that).  But those who write programs using them should not report bugs
> until they've checked out and eliminated the possibility of an
> encoding screwup!

The current implementation doesn't do any encoding guesswork and I
have no intention to allow that as a feature. The input must be
unicode, UTF-8 bytes, or an encoding must be specified.

Personally, most of my experience with JSON is as a wire protocol and thus
bytes, so the obvious function to encode json should do that. There
probably should be another function to get unicode output, but nobody
has ever asked for that in the Python 2.x version. They either want
the default behavior (encoding as ASCII str which can be used as
unicode due to implementation details of Python 2.x) or encoding as a
more compact UTF-8 str (without escaping non-ASCII code points).
Perhaps Python 3 users would ask for a unicode output when decoding
though.
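
For comparison, a minimal sketch of the two output styles described above, written against Python 3's json module (where dumps() returns str, so the encode step is explicit). The sample data is arbitrary.

```python
import json

data = ["éléphant"]

# Default behavior: non-ASCII code points are escaped, so the text is pure ASCII.
ascii_text = json.dumps(data)
ascii_bytes = ascii_text.encode("ascii")

# More compact alternative: keep the code points and encode the text as UTF-8.
compact_text = json.dumps(data, ensure_ascii=False)
compact_bytes = compact_text.encode("utf-8")
```

Both forms parse back to the same list; the UTF-8 form is shorter on the wire because each `\u00e9` escape takes six bytes instead of two.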

-bob


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Stephen J. Turnbull
Paul Moore writes:

 > On the other hand, further down in the document:
 > 
 > """
 > 3.  Encoding
 > 
 >    JSON text SHALL be encoded in Unicode.  The default encoding is
 >    UTF-8.
 > 
 >    Since the first two characters of a JSON text will always be ASCII
 >    characters [RFC0020], it is possible to determine whether an octet
 >    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
 >    at the pattern of nulls in the first four octets.
 > """
 > 
 > This is at best confused (in my utterly non-expert opinion :-)) as
 > Unicode isn't an encoding...

The word "encoding" (by itself) does not have a standard definition
AFAIK.  However, since Unicode *is* a "coded character set" (plus a
bunch of hairy usage rules), there's nothing wrong with saying "text
is encoded in Unicode".  The RFC 2130 and Unicode TR#17 taxonomies are
annoyingly verbose and pedantic to say the least.

So what is being said there (in UTR#17 terminology) is

(1) JSON is *text*, that is, a sequence of characters.
(2) The abstract repertoire and coded character set are defined by the
    Unicode standard.
(3) The default transfer encoding syntax is UTF-8.

 > That implies that loads can/should also allow bytes as input, applying
 > the given algorithm to guess an encoding.

It's not a guess, unless the data stream is corrupt---or nonconforming.

But it should not be the JSON package's responsibility to deal with
corruption or non-conformance (eg, ISO-8859-15-encoded programs).
That's the whole point of specifying the coded character set in the
standard in the first place.  I think it's a bad idea for any of the core
JSON API to accept or produce bytes in any language that provides a
Unicode string type.

That doesn't mean Python's module shouldn't provide convenience
functions to read and write JSON serialized as UTF-8 (in fact, that
*should* be done, IMO) and/or other UTFs (I'm not so happy about
that).  But those who write programs using them should not report bugs
until they've checked out and eliminated the possibility of an
encoding screwup!



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread James Y Knight

On Apr 9, 2009, at 10:38 PM, Barry Warsaw wrote:
> So, what I'm really asking is this.  Let's say you agree that there
> are use cases for accessing a header value as either the raw encoded
> bytes or the decoded unicode.

As I said in the thread having nearly the same exact discussion on
web-sig, except about WSGI headers...

> What should this return:
>
> >>> message['Subject']
>
> The raw bytes or the decoded unicode?


Until you write a parser for every header, you simply cannot decode to  
unicode. The only sane choices are:

1) raw bytes
2) parsed structured data

There's no "decoded to unicode but not parsed" option: that's doing  
things in the wrong order. If you RFC2047-decode the header before  
doing tokenization and parsing, you will just have a *broken*  
implementation.


Here's an example where it matters. If you decode the RFC2047 part
before parsing, you'd decide that there are two recipients to the
message. There aren't. "<broken@example.com>, " is the display-name of
"act...@example.com", not a second recipient.


  To: =?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <act...@example.com>

Here's a quote from RFC2047:

   NOTE: Decoding and display of encoded-words occurs *after* a
   structured field body is parsed into tokens. It is therefore
   possible to hide 'special' characters in encoded-words which, when
   displayed, will be indistinguishable from 'special' characters in
   the surrounding text. For this and other reasons, it is NOT
   generally possible to translate a message header containing
   'encoded-word's to an unencoded form which can be parsed by an
   RFC 822 mail reader.

And another quote for good measure:

   (2) Any header field not defined as '*text' should be parsed
   according to the syntax rules for that header field. However, any
   'word' that appears within a 'phrase' should be treated as an
   'encoded-word' if it meets the syntax rules in section 2. Otherwise
   it should be treated as an ordinary 'word'.
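
The ordering problem can be reproduced with a short sketch, assuming Python 3's email.utils and email.header modules. The address someone@example.com is a made-up stand-in, since the real one is elided in the archive, and decode_words is a helper defined only for this example.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses, parseaddr

# Encoded display-name followed by a stand-in address (hypothetical).
raw = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <someone@example.com>"


def decode_words(value):
    """RFC 2047-decode every encoded-word in value and return plain text."""
    return str(make_header(decode_header(value)))


# Right order: tokenize the raw header first, then decode only the display-name.
encoded_name, address = parseaddr(raw)
display_name = decode_words(encoded_name)

# Wrong order: decode the whole header, then parse the decoded text.
broken = getaddresses([decode_words(raw)])
```

Parsing first yields one recipient whose display-name merely looks like an address; decoding first yields two recipients.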



Now, I suppose there's also a third possibility:
3) US-ASCII-only strings, unmolested except for doing  
a .decode('ascii'). That'll give you a string all right, but it's  
really just cheating. It's not actually a text string in any  
meaningful sense.


(in all this I'm assuming your question is not about the "Subject"  
header in particular; that is of course just unstructured text so the  
parse step doesn't actually do anything...).


James



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Martin v. Löwis
>> In email's case this is true, but in JSON's case it's not.  JSON is a 
>> format defined as a sequence of code points; MIME is defined as a 
>> sequence of octets.
> 
> Another way to look at it is that JSON is a subset of Javascript, and as such is
> text rather than bytes.

I don't think this can be approached from a theoretical point of view.
Instead, what matters is how users want to use it.

Regards,
Martin



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Paul Moore
2009/4/10 Nick Coghlan:
> gl...@divmod.com wrote:
>> On 03:21 am, ncogh...@gmail.com wrote:
>>> Given that json is a wire protocol, that sounds like the right approach
>>> for json as well. Once bytes-everywhere works, then a text API can be
>>> built on top of it, but it is difficult to build a bytes API on top of a
>>> text one.
>>
>> I wish I could agree, but JSON isn't really a wire protocol.  According
>> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
>> serialization of structured data".  There are some notes about encoding,
>> but it is very clearly described in terms of unicode code points.
>
> Ah, my apologies - if the RFC defines things such that the native format
> is Unicode, then yes, the appropriate Python 3.x data type for the base
> implementation would indeed be strings.

Indeed, the RFC seems to clearly imply that loads should take a
Unicode string, dumps should produce one, and load/dump should work in
terms of text files (not byte files).

On the other hand, further down in the document:

"""
3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.
"""

This is at best confused (in my utterly non-expert opinion :-)) as
Unicode isn't an encoding...

I would guess that what the RFC is trying to say is that JSON is text
(Unicode) and where a byte stream purporting to be JSON is encountered
without a defined encoding, this is how to guess one.

That implies that loads can/should also allow bytes as input, applying
the given algorithm to guess an encoding. And similarly load
can/should accept a byte stream, on the same basis. (There's no need
to allow the possibility of accepting bytes plus an encoding - in that
case the user should decode the bytes before passing Unicode to the
JSON module).
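
A minimal sketch of the guessing rule quoted from the RFC above, assuming
the input is a complete JSON text of at least four octets with no byte
order mark. The names guess_json_encoding and loads_bytes are made up
for illustration; they are not part of the json module.

```python
import json

def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding from the pattern of nulls in the first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Too short for the four-octet test; fall back to the RFC default.
        return 'utf-8'
    nulls = tuple(b == 0 for b in head)
    if nulls == (True, True, True, False):
        return 'utf-32-be'      # 00 00 00 xx
    if nulls == (False, True, True, True):
        return 'utf-32-le'      # xx 00 00 00
    if nulls == (True, False, True, False):
        return 'utf-16-be'      # 00 xx 00 xx
    if nulls == (False, True, False, True):
        return 'utf-16-le'      # xx 00 xx 00
    return 'utf-8'              # xx xx xx xx (or anything else)

def loads_bytes(data: bytes):
    """Hypothetical bytes front end: guess, decode, then parse as text."""
    return json.loads(data.decode(guess_json_encoding(data)))
```

With this layering the parser itself only ever sees text; the guess is
confined to one small function in front of it.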

An alternative might be for the JSON module to register a special
encoding ('JSON-guess'?) which captures the rules here. Then there's
no need for special bytes parameter handling.

Of course, this is all from a native English speaker, who therefore
has no idea of the real life issues involved in Unicode :-)

Paul.


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Antoine Pitrou
gl...@divmod.com writes:
> 
> In email's case this is true, but in JSON's case it's not.  JSON is a 
> format defined as a sequence of code points; MIME is defined as a 
> sequence of octets.

Another way to look at it is that JSON is a subset of Javascript, and as such is
text rather than bytes.

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-10 Thread Nick Coghlan
gl...@divmod.com wrote:
> On 03:21 am, ncogh...@gmail.com wrote:
>> Given that json is a wire protocol, that sounds like the right approach
>> for json as well. Once bytes-everywhere works, then a text API can be
>> built on top of it, but it is difficult to build a bytes API on top of a
>> text one.
> 
> I wish I could agree, but JSON isn't really a wire protocol.  According
> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
> serialization of structured data".  There are some notes about encoding,
> but it is very clearly described in terms of unicode code points.

Ah, my apologies - if the RFC defines things such that the native format
is Unicode, then yes, the appropriate Python 3.x data type for the base
implementation would indeed be strings.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread glyph


On 03:21 am, ncogh...@gmail.com wrote:
> Barry Warsaw wrote:
>> I don't know whether the parameter thing will work or not, but you're
>> probably right that we need to get the bytes-everywhere API first.
>
> Given that json is a wire protocol, that sounds like the right approach
> for json as well. Once bytes-everywhere works, then a text API can be
> built on top of it, but it is difficult to build a bytes API on top of a
> text one.

I wish I could agree, but JSON isn't really a wire protocol.  According
to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
serialization of structured data".  There are some notes about encoding,
but it is very clearly described in terms of unicode code points.

> So I guess the IO library *is* the right model: bytes at the bottom of
> the stack, with text as a wrapper around it (mediated by codecs).

In email's case this is true, but in JSON's case it's not.  JSON is a
format defined as a sequence of code points; MIME is defined as a
sequence of octets.



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread glyph


On 02:38 am, ba...@python.org wrote:
> So, what I'm really asking is this.  Let's say you agree that there
> are use cases for accessing a header value as either the raw encoded
> bytes or the decoded unicode.  What should this return:
>
> >>> message['Subject']
>
> The raw bytes or the decoded unicode?

My personal preference would be to just deprecate this API, and get
rid of it, replacing it with a slightly more explicit one.

   message.headers['Subject']
   message.bytes_headers['Subject']

> Now, setting headers.  Sometimes you have some unicode thing and
> sometimes you have some bytes.  You need to end up with bytes in the
> ASCII range and you'd like to leave the header value unencoded if so.
> But in both cases, you might have bytes or characters outside that
> range, so you need an explicit encoding, defaulting to utf-8 probably.

   message.headers['Subject'] = 'Some text'

should be equivalent to

   message.headers['Subject'] = Header('Some text')

My preference would be that

   message.headers['Subject'] = b'Some Bytes'

would simply raise an exception.  If you've got some bytes, you should
instead do

   message.bytes_headers['Subject'] = b'Some Bytes'

or

   message.headers['Subject'] = Header(bytes=b'Some Bytes', encoding='utf-8')


Explicit is better than implicit, right?
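
A rough sketch of how two such views might share one store, assuming raw
bytes are the underlying representation. SketchMessage and its helper
classes are hypothetical, the choice of TypeError is an assumption, and
real header values would need RFC 2047 encoding rather than the plain
utf-8 used here to keep the example short.

```python
class _BytesHeaders(dict):
    """Raw view: header values are stored and returned as bytes."""
    def __setitem__(self, name, value):
        if not isinstance(value, bytes):
            raise TypeError('bytes_headers only accepts bytes')
        super().__setitem__(name, value)

class _TextHeaders:
    """Text view over the same store: accepts and returns str only."""
    def __init__(self, raw, encoding):
        self._raw = raw
        self._encoding = encoding

    def __getitem__(self, name):
        return self._raw[name].decode(self._encoding)

    def __setitem__(self, name, value):
        if not isinstance(value, str):
            # Assumed behaviour: refuse bytes instead of guessing an encoding.
            raise TypeError('headers only accepts str; use bytes_headers')
        self._raw[name] = value.encode(self._encoding)

class SketchMessage:
    """Hypothetical message object exposing both views."""
    def __init__(self, encoding='utf-8'):
        self.bytes_headers = _BytesHeaders()
        self.headers = _TextHeaders(self.bytes_headers, encoding)
```

Because both attributes are explicit, message['Subject'] never has to
pick a default type.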


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:

> Barry Warsaw wrote:
>> I don't know whether the parameter thing will work or not, but you're
>> probably right that we need to get the bytes-everywhere API first.
>
> Given that json is a wire protocol, that sounds like the right approach
> for json as well. Once bytes-everywhere works, then a text API can be
> built on top of it, but it is difficult to build a bytes API on top of a
> text one.

Agreed!

> So I guess the IO library *is* the right model: bytes at the bottom of
> the stack, with text as a wrapper around it (mediated by codecs).

Yes, that's a very interesting (and proven?) model.  I don't quite see
how we could apply that to email and json, but it seems like there's a
good idea there. ;)


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Nick Coghlan
Barry Warsaw wrote:
> I don't know whether the parameter thing will work or not, but you're
> probably right that we need to get the bytes-everywhere API first.

Given that json is a wire protocol, that sounds like the right approach
for json as well. Once bytes-everywhere works, then a text API can be
built on top of it, but it is difficult to build a bytes API on top of a
text one.

So I guess the IO library *is* the right model: bytes at the bottom of
the stack, with text as a wrapper around it (mediated by codecs).
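
For reference, the layering in the io module looks roughly like this.
It is only a small illustration using an in-memory buffer in place of a
file or socket; the variable names are arbitrary.

```python
import io

# Bytes at the bottom of the stack: an in-memory binary buffer.
raw = io.BytesIO()

# Text as a wrapper around it, with a codec chosen when the wrapper is created.
text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
text.write('["\u00e9l\u00e9phant"]')
text.flush()                      # push the encoded bytes down to the buffer

encoded = raw.getvalue()          # what would actually go over the wire

# Reading goes the other way: wrap the bytes and get text back.
decoded = io.TextIOWrapper(io.BytesIO(encoded), encoding='utf-8').read()
```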

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 10:52 PM, Aahz wrote:

> On Thu, Apr 09, 2009, Barry Warsaw wrote:
>>
>> So, what I'm really asking is this.  Let's say you agree that there are
>> use cases for accessing a header value as either the raw encoded bytes or
>> the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> Let's make that the raw bytes by default -- we can add a parameter to
> Message() to specify that the default where possible is unicode for
> returned values, if that isn't too painful.

I don't know whether the parameter thing will work or not, but you're
probably right that we need to get the bytes-everywhere API first.


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Aahz
On Thu, Apr 09, 2009, Barry Warsaw wrote:
>
> So, what I'm really asking is this.  Let's say you agree that there are 
> use cases for accessing a header value as either the raw encoded bytes or 
> the decoded unicode.  What should this return:
>
> >>> message['Subject']
>
> The raw bytes or the decoded unicode?

Let's make that the raw bytes by default -- we can add a parameter to
Message() to specify that the default where possible is unicode for
returned values, if that isn't too painful.

Here's my reasoning: ultimately, everyone NEEDS to understand that the
underlying transport for e-mail is bytes (similar to sockets).  We do
people no favors by pasting over this too much.  We can overlay
convenience at various points, but except for text payloads, everything
should be bytes by default.  

Even for text payloads, I'm not entirely certain the default shouldn't be
bytes: consider an HTML attachment that you want to compare against the
output from a webserver.  Still, as long as it's easy to get bytes for
text payloads, I think overall I'm still leaning toward unicode for them.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 2:25 PM, Martin v. Löwis wrote:

>> This is an interesting question, and something I'm struggling with for
>> the email package for 3.x.  It turns out to be pretty convenient to have
>> both a bytes and a string API, both for input and output, but I think
>> email really wants to be represented internally as bytes.  Maybe.  Or
>> maybe just for content bodies and not headers, or maybe both.  Anyway,
>> aside from that decision, I haven't come up with an elegant way to allow
>> /output/ in both bytes and strings (input is I think theoretically
>> easier by sniffing the arguments).
>
> If you allow for content-transfer-encoding: 8bit, I think there is just
> no way to represent email as text. You have to accept conversion to,
> say, base64 (or quoted-unreadable) when converting an email message to
> text.

Agreed.  But applications will want to deal with some parts of the
message as text on the boundaries.  Internally, it should be all bytes
(although even that is a pain to write ;).


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:

> On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw wrote:
>> Anyway, aside from that decision, I haven't come up with an elegant
>> way to allow /output/ in both bytes and strings (input is I think
>> theoretically easier by sniffing the arguments).
>
> Won't this work? (assuming dumps() always returns a string)
>
> def dumpb(obj, encoding='utf-8', *args, **kw):
>     s = dumps(obj, *args, **kw)
>     return s.encode(encoding)


So, what I'm really asking is this.  Let's say you agree that there  
are use cases for accessing a header value as either the raw encoded  
bytes or the decoded unicode.  What should this return:


>>> message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one.  Now how do you spell the other way?

The Message class probably has these explicit methods:

>>> Message.get_header_bytes('Subject')
>>> Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to  
message['Subject'] but which is the more obvious choice?


Now, setting headers.  Sometimes you have some unicode thing and  
sometimes you have some bytes.  You need to end up with bytes in the  
ASCII range and you'd like to leave the header value unencoded if so.   
But in both cases, you might have bytes or characters outside that  
range, so you need an explicit encoding, defaulting to utf-8 probably.


>>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>>> Message.set_header('Subject', b'Some bytes')

One of those maps to

>>> message['Subject'] = ???

I'm open to any suggestions here!
-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:08 AM, Bill Janssen wrote:

> Barry Warsaw wrote:
>
>> Anyway, aside from that decision, I haven't come up with an
>> elegant way to allow /output/ in both bytes and strings (input is I
>> think theoretically easier by sniffing the arguments).
>
> Probably a good thing.  It just promotes more confusion to do things
> that way, IMO.


Very possibly so.  But applications will definitely want stuff like  
the text/plain payload as a unicode, or the image/gif payload as a  
bytes (or even as a PIL image or whatever).


Not that I think the email package needs to know about every content  
type under the sun, but I do think that it should be pluggable so as  
to allow applications to more conveniently access the data that way.   
Possibly the defaults should be unicodes for any text/* type and bytes  
for everything else.


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

> The real problem I came across in storing email in a relational database
> was the inability to store messages as Unicode. Some messages have a
> body in one encoding and an attachment in another, so the only ways to
> store the messages are either as a monolithic bytes string that gets
> parsed when the individual components are required or as a sequence of
> components in the database's preferred encoding (if you want to keep the
> original encoding most relational databases won't be able to help unless
> you store the components as bytes).
>
> All in all, as you might expect from a system that's been growing up
> since 1970 or so, it can be quite intractable.


There are really two ways to look at an email message.  It's either an  
unstructured blob of bytes, or it's a structured tree of objects.   
Those objects have headers and payload.  The payload can be of any  
type, though I think it generally breaks down into "strings" for text/ 
* types and bytes for anything else (not counting multiparts).


The email package isn't a perfect mapping to this, which is something  
I want to improve.  That aside, I think storing a message in a  
database means storing some or all of the headers separately from the  
byte stream (or text?) of its payload.  That's for non-multipart  
types.  It would be more complicated to represent a message tree of  
course.


It does seem to make sense to think about headers as text header names  
and text header values.  Of course, header values can contain almost  
anything and there's an encoding to bring it back to 7-bit ASCII, but  
again, you really have two views of a header value.  Which you want  
really depends on your application.
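
The two views of a header value can be illustrated with the existing
email.header helpers. This is only a sketch: the subject text is made
up, and it says nothing about which view message['Subject'] ought to
return.

```python
from email.header import Header, decode_header, make_header

subject = 'caf\u00e9 au lait'

# "Wire" view: RFC 2047 encoded words, which contain only 7-bit ASCII.
wire = Header(subject, 'utf-8').encode()
raw_value = wire.encode('ascii')          # the raw encoded bytes

# "Text" view: decode the encoded words back into a unicode string.
text_value = str(make_header(decode_header(wire)))
```

An application that only cares about the text would store text_value;
one that must reproduce the message exactly would keep raw_value.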


Maybe you just care about the text of both the header name and value.   
In that case, I think you want the values as unicodes, and probably  
the headers as unicodes containing only ASCII.  So your table would be  
strings in both cases.  OTOH, maybe your application cares about the  
raw underlying encoded data, in which case the header names are  
probably still strings of ASCII-ish unicodes and the values are  
bytes.  It's this distinction (and I think the competing use cases)  
that make a true Python 3.x API for email more complicated.


Thinking about this stuff makes me nostalgic for the sloppy happy days  
of Python 2.x


-Barry





Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Martin v. Löwis
> As far as Python 3 goes, I honestly have not yet familiarized myself
> with the changes to the IO infrastructure and what the new idioms are.
> At this time, I can't make any educated decisions with regard to how
> it should be done because I don't know exactly how bytes are supposed
> to work and what the common idioms are for other libraries in the
> stdlib that do similar things.

It's really very similar to 2.x: the "bytes" type is to be used in all
interfaces that operate on byte sequences that may or may not represent
characters; in particular, for interfaces where the operating system
deliberately uses bytes - i.e. low-level file IO and socket IO; also
for cases where the encoding is embedded in the stream that still
needs to be processed (e.g. XML parsing).

(Unicode) strings should be used where the data is truly text by
nature, i.e. where no encoding information is necessary to find out
what characters are intended. It's used on interfaces where the
encoding is known (e.g. text IO, where the encoding is specified
on opening, XML parser results, with the declared encoding, and
GUI libraries, which naturally expect text).

> Until I figure that out, someone else
> is better off making decisions about the Python 3 version.

Some of us can certainly explain to you how this is supposed to
work. However, we need you to check any assumption against the
known use cases - would the users of the module be happy if it
worked one way or the other?

> My guess is
> that it should work the same way as it does in Python 2.x: take bytes
> or unicode input in loads (which means encoding is still relevant). I
> also think the output of dumps should also be bytes, since it is a
> serialization, but I am not sure how other libraries do this in Python
> 3 because one could argue that it is also text.

This, indeed, had been an endless debate, and, in the end, the decision
was somewhat arbitrary. Here are some examples:

- base64.encodestring expects bytes (naturally, since it is supposed to
  encode arbitrary binary data), and produces bytes (debatably)
- binascii.b2a_hex likewise (expect and produce bytes)
- pickle.dumps produces bytes (uniformly, both for binary and text
  pickles)
- marshal.dumps likewise
- email.message.Message().as_string produces a (unicode) string
  (see Barry's recent thread on whether that's a good thing; the
  email package hasn't been fully ported to 3k, either)
- the XML libraries (continue to) parse bytes, and produce
  Unicode strings
- for the IO libraries, see above
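
The return types in the list above can be checked directly. A small
sketch, with two caveats stated as assumptions: base64.b64encode stands
in for base64.encodestring (which later releases no longer provide), and
the json.dumps entry shows what current Python 3 releases do, which is
the very point under discussion in this thread.

```python
import base64
import binascii
import json
import marshal
import pickle

payload = b'\x00\xffdata'

results = {
    'base64.b64encode': base64.b64encode(payload),
    'binascii.b2a_hex': binascii.b2a_hex(payload),
    'pickle.dumps (default)': pickle.dumps([1, 2]),
    'pickle.dumps (protocol 0)': pickle.dumps([1, 2], protocol=0),
    'marshal.dumps': marshal.dumps([1, 2]),
    'json.dumps': json.dumps([1, 2]),
}

# Map each call to the name of the type it returned.
kinds = {name: type(value).__name__ for name, value in results.items()}
for name, kind in kinds.items():
    print(f'{name:28} -> {kind}')
```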

> If other libraries
> that do text/text encodings (e.g. binascii, mimelib, ...) use str for
> input and output

See above - most of them don't; mimetools no longer exists (it was
replaced by the email package)

> instead of bytes then maybe Antoine's changes are the
> right solution and I just don't know better because I'm not up to
> speed with how people write Python 3 code.

There isn't too much fresh end-user code out there, so we can't really
tell, either. As for standard library users - users will do whatever
the library forces them to do.

This is why I'm so concerned about this issue: we should get it right,
or not do it at all. I still think you would be the best person to
determine what is right.

> I'll do my best to find some time to look into Python 3 more closely
> soon, but thus far I have not been very motivated to do so because
> Python 3 isn't useful for us at work and twiddling syntax isn't a very
> interesting problem for me to solve.

And I didn't expect you to - it seems people are quite willing to do
the actual work, as long as there is some guidance.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Bob Ippolito
On Thu, Apr 9, 2009 at 1:05 PM, "Martin v. Löwis"  wrote:
>>> I can understand that you don't want to spend much time on it. How
>>> about removing it from 3.1? We could re-add it when long-term support
>>> becomes more likely.
>>
>> I'm speechless.
>
> It seems that my statement has surprised you, so let me explain:
>
> I think we should refrain from making design decisions (such as
> API decisions) without Bob's explicit consent, unless we assign
> a new maintainer for the simplejson module (perhaps just for the
> 3k branch, which perhaps would be a fork from Bob's code).
>
> Antoine suggests that Bob did not comment on the issues at hand,
> therefore, we should not proceed with the proposed design. Since
> the 3.1 release is only a few weeks ahead, we have the choice of
> either shipping with the broken version that is currently in the
> 3k branch, or dropping the module from the 3k branch. I believe our
> users are better served by not having to waste time with a module
> that doesn't quite work, or may change.

Most of my time to spend on json/simplejson and these mailing list
discussions is on weekends, I try not to bother with it when I'm busy
doing Actual Work unless there is a bug or some other issue that needs
more immediate attention. I also wasn't aware that I was expected to
comment on those issues. I'm CC'ed on the discussion for issue4136 but
I don't see any unanswered questions directed at me.

I have the issues (issue5723, issue4136) starred in my gmail and I
planned to look at it more closely later, hopefully on Friday or
Saturday.

As far as Python 3 goes, I honestly have not yet familiarized myself
with the changes to the IO infrastructure and what the new idioms are.
At this time, I can't make any educated decisions with regard to how
it should be done because I don't know exactly how bytes are supposed
to work and what the common idioms are for other libraries in the
stdlib that do similar things. Until I figure that out, someone else
is better off making decisions about the Python 3 version. My guess is
that it should work the same way as it does in Python 2.x: take bytes
or unicode input in loads (which means encoding is still relevant). I
also think the output of dumps should also be bytes, since it is a
serialization, but I am not sure how other libraries do this in Python
3 because one could argue that it is also text. If other libraries
that do text/text encodings (e.g. binascii, mimelib, ...) use str for
input and output instead of bytes then maybe Antoine's changes are the
right solution and I just don't know better because I'm not up to
speed with how people write Python 3 code.

I'll do my best to find some time to look into Python 3 more closely
soon, but thus far I have not been very motivated to do so because
Python 3 isn't useful for us at work and twiddling syntax isn't a very
interesting problem for me to solve.

-bob


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Martin v. Löwis
Alexandre Vassalotti wrote:
> On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou  wrote:
>> As for reading/writing bytes over the wire, JSON is often used in the same
>> context as HTML: you are supposed to know the charset and decode/encode the
>> payload using that charset. However, the RFC specifies a default encoding of
>> utf-8. (*)
>>
>>
>> (*) http://www.ietf.org/rfc/rfc4627.txt
>>
> 
> That is one short and sweet RFC. :-)

It is indeed well-specified. Unfortunately, it only talks about the
application/json type; the pre-existing other versions of json in MIME
types vary widely, such as text/plain (possibly with a charset=
parameter), text/json, or text/javascript. For these, the RFC doesn't
apply.
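
For the charset= case, decoding by the declared parameter might look
like the sketch below. The function name decode_json_payload is
hypothetical, and falling back to utf-8 when no charset is given is an
assumption borrowed from the application/json default, not something
the other MIME types guarantee.

```python
from email.message import Message

def decode_json_payload(body: bytes, content_type: str) -> str:
    """Decode a payload using the charset parameter of its Content-Type."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or 'utf-8'   # assumed fallback
    return body.decode(charset)
```

The decoded text can then be handed to a text-only loads().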

> Given the RFC specifies that the encoding used should be one of the
> encodings defined by Unicode, wouldn't it be a better idea to remove the
> "unicode" support, instead? To me, it would make sense to use the
> detection algorithms for Unicode to sniff the encoding of the JSON
> stream and then use the detected encoding to decode the strings embedded
> in the JSON stream.

That might be reasonable. (but then, I also stand by my view that we
shouldn't proceed without Bob's approval).

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Martin v. Löwis
>> I can understand that you don't want to spend much time on it. How
>> about removing it from 3.1? We could re-add it when long-term support
>> becomes more likely.
> 
> I'm speechless.

It seems that my statement has surprised you, so let me explain:

I think we should refrain from making design decisions (such as
API decisions) without Bob's explicit consent, unless we assign
a new maintainer for the simplejson module (perhaps just for the
3k branch, which perhaps would be a fork from Bob's code).

Antoine suggests that Bob did not comment on the issues at hand,
therefore, we should not proceed with the proposed design. Since
the 3.1 release is only a few weeks ahead, we have the choice of
either shipping with the broken version that is currently in the
3k branch, or dropping the module from the 3k branch. I believe our
users are better served by not having to waste time with a module
that doesn't quite work, or may change.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Alexandre Vassalotti
On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou  wrote:
> As for reading/writing bytes over the wire, JSON is often used in the same
> context as HTML: you are supposed to know the charset and decode/encode the
> payload using that charset. However, the RFC specifies a default encoding of
> utf-8. (*)
>
>
> (*) http://www.ietf.org/rfc/rfc4627.txt
>

That is one short and sweet RFC. :-)

> The RFC also specifies a discrimination algorithm for non-supersets of ASCII
> (“Since the first two characters of a JSON text will always be ASCII
>   characters [RFC0020], it is possible to determine whether an octet
>   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>   at the pattern of nulls in the first four octets.”), but it is not
> implemented in the json module:
>

Given the RFC specifies that the encoding used should be one of the
encodings defined by Unicode, wouldn't it be a better idea to remove the
"unicode" support, instead? To me, it would make sense to use the
detection algorithms for Unicode to sniff the encoding of the JSON
stream and then use the detected encoding to decode the strings embedded
in the JSON stream.

Cheers,
-- Alexandre


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Martin v. Löwis
> This is an interesting question, and something I'm struggling with for
> the email package for 3.x.  It turns out to be pretty convenient to have
> both a bytes and a string API, both for input and output, but I think
> email really wants to be represented internally as bytes.  Maybe.  Or
> maybe just for content bodies and not headers, or maybe both.  Anyway,
> aside from that decision, I haven't come up with an elegant way to allow
> /output/ in both bytes and strings (input is I think theoretically
> easier by sniffing the arguments).

If you allow for content-transfer-encoding: 8bit, I think there is just
no way to represent email as text. You have to accept conversion to,
say, base64 (or quoted-unreadable) when converting an email message to
text.
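
To make the conversion concrete: an 8bit body is just octets, and the
two usual transfer encodings turn it into something that is pure ASCII
and can therefore be held as text. A small sketch with a made-up
Latin-1 body:

```python
import base64
import quopri

# An 8-bit body: two of these octets are outside the ASCII range.
body = 'na\u00efve caf\u00e9'.encode('latin-1')

# Either transfer encoding yields ASCII-only output that fits in a str.
as_base64 = base64.b64encode(body).decode('ascii')
as_quoted_printable = quopri.encodestring(body).decode('ascii')

# Both are reversible, so the original octets are not lost.
assert base64.b64decode(as_base64) == body
assert quopri.decodestring(as_quoted_printable.encode('ascii')) == body
```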

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Daniel Stutzbach
On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw  wrote:

> Anyway, aside from that decision, I haven't come up with an elegant way to
> allow /output/ in both bytes and strings (input is I think theoretically
> easier by sniffing the arguments).
>

Won't this work? (assuming dumps() always returns a string)

def dumpb(obj, encoding='utf-8', *args, **kw):
    s = dumps(obj, *args, **kw)
    return s.encode(encoding)
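
A slightly fuller variant of the same idea, under the same assumption
that dumps() returns a string. Making encoding keyword-only is a design
choice of this sketch, so that extra positional arguments cannot be
mistaken for it; loadb is a hypothetical counterpart for input.

```python
import json

def dumpb(obj, *, encoding='utf-8', **kw):
    """Serialize obj to text with json.dumps, then encode to bytes."""
    return json.dumps(obj, **kw).encode(encoding)

def loadb(data, *, encoding='utf-8', **kw):
    """Decode bytes with a known encoding, then parse the text."""
    return json.loads(data.decode(encoding), **kw)

wire = dumpb(['\u00e9l\u00e9phant'], ensure_ascii=False)
roundtrip = loadb(wire)
```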

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Bill Janssen
Barry Warsaw  wrote:

> Anyway, aside from that decision, I haven't come up with an  
> elegant way to allow /output/ in both bytes and strings (input is I  
> think theoretically easier by sniffing the arguments).

Probably a good thing.  It just promotes more confusion to do things
that way, IMO.

Bill


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Steve Holden
Barry Warsaw wrote:
> On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:
> 
>> Guido van Rossum writes:
>>>
>>> I'm kind of surprised that a serialization protocol like JSON wouldn't
>>> support reading/writing bytes (as the serialized format -- I don't
>>> care about having bytes as values, since JavaScript doesn't have
>>> something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
>>> Marshal and Pickle, for example, *always* treat the serialized format
>>> as bytes. And since in most cases it will be sent over a socket, at
>>> some point the serialized representation *will* be bytes, I presume.
>>> What makes supporting this hard?
> 
>> It's not hard, it just means a lot of duplicated code if the library wants to
>> support both str and bytes in an optimized way as Martin alluded to. This
>> duplicated code already exists in the C parts to support the 2.x semantics of
>> accepting unicode objects as well as str, but not in the Python parts, which
>> explains why the bytes support is broken in py3k - in 2.x, the same Python code
>> can be used for str and unicode.
> 
> This is an interesting question, and something I'm struggling with for
> the email package for 3.x.  It turns out to be pretty convenient to have
> both a bytes and a string API, both for input and output, but I think
> email really wants to be represented internally as bytes.  Maybe.  Or
> maybe just for content bodies and not headers, or maybe both.  Anyway,
> aside from that decision, I haven't come up with an elegant way to allow
> /output/ in both bytes and strings (input is I think theoretically
> easier by sniffing the arguments).
> 
The real problem I came across in storing email in a relational database
was the inability to store messages as Unicode. Some messages have a
body in one encoding and an attachment in another, so the only ways to
store the messages are either as a monolithic bytes string that gets
parsed when the individual components are required or as a sequence of
components in the database's preferred encoding (if you want to keep the
original encoding most relational databases won't be able to help unless
you store the components as bytes).

All in all, as you might expect from a system that's been growing up
since 1970 or so, it can be quite intractable.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Dirkjan Ochtman
On Thu, Apr 9, 2009 at 13:10, Antoine Pitrou wrote:
> Sure, but then:
>
> >>> json.loads('[]')
> []
> >>> json.loads(u'[]'.encode('utf16'))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
>     return _default_decoder.decode(s)
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
>     raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded

Right. :) Just wanted to point out that your test might not be testing
what you want to test.

Cheers,

Dirkjan


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Antoine Pitrou
Dirkjan Ochtman writes:
> 
> The RFC states
> that JSON-text = object / array, meaning "loads" for '"hi"' isn't
> strictly valid.

Sure, but then:

>>> json.loads('[]')
[]
>>> json.loads(u'[]'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded


Cheers

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:

> Guido van Rossum writes:
>>
>> I'm kind of surprised that a serialization protocol like JSON wouldn't
>> support reading/writing bytes (as the serialized format -- I don't
>> care about having bytes as values, since JavaScript doesn't have
>> something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
>> Marshal and Pickle, for example, *always* treat the serialized format
>> as bytes. And since in most cases it will be sent over a socket, at
>> some point the serialized representation *will* be bytes, I presume.
>> What makes supporting this hard?
>
> It's not hard, it just means a lot of duplicated code if the library wants to
> support both str and bytes in an optimized way as Martin alluded to. This
> duplicated code already exists in the C parts to support the 2.x semantics of
> accepting unicode objects as well as str, but not in the Python parts, which
> explains why the bytes support is broken in py3k - in 2.x, the same Python code
> can be used for str and unicode.


This is an interesting question, and something I'm struggling with for  
the email package for 3.x.  It turns out to be pretty convenient to  
have both a bytes and a string API, both for input and output, but I  
think email really wants to be represented internally as bytes.   
Maybe.  Or maybe just for content bodies and not headers, or maybe  
both.  Anyway, aside from that decision, I haven't come up with an  
elegant way to allow /output/ in both bytes and strings (input is I  
think theoretically easier by sniffing the arguments).


Barry



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Dirkjan Ochtman
On Thu, Apr 9, 2009 at 07:15, Antoine Pitrou  wrote:
> The RFC also specifies a discrimination algorithm for non-supersets of ASCII
> (“Since the first two characters of a JSON text will always be ASCII
>   characters [RFC0020], it is possible to determine whether an octet
>   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>   at the pattern of nulls in the first four octets.”), but it is not
> implemented in the json module:

Well, your example is bad in the context of the RFC. The RFC states
that JSON-text = object / array, meaning "loads" for '"hi"' isn't
strictly valid. The discrimination algorithm obviously only works in
the context of that grammar, where the first character of a document
must be { or [ and the next character can only be {, [, f, n, t, ", -,
a number, or insignificant whitespace (space, \t, \r, \n).
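
A minimal sketch of enforcing that grammar on top of the existing parser, for anyone who wants the strict RFC 4627 behaviour. `loads_strict` is a hypothetical helper, not part of the json module:

```python
import json

def loads_strict(text):
    # Parse as usual, then reject any top-level value that is not
    # an object (dict) or an array (list), per JSON-text = object / array.
    value = json.loads(text)
    if not isinstance(value, (dict, list)):
        raise ValueError("RFC 4627 JSON text must be an object or array")
    return value
```

Under this check '[]' is accepted, while '"hi"' is rejected even though json.loads() itself parses it.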

> >>> json.loads('"hi"')
> 'hi'
> >>> json.loads(u'"hi"'.encode('utf16'))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
>     return _default_decoder.decode(s)
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
>     raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded

Cheers,

Dirkjan


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-09 Thread Raymond Hettinger


[Antoine Pitrou]
> Besides, Bob doesn't really seem to care about
> porting to py3k (he hasn't said anything about it until now, other than that he
> didn't feel competent to do it).

His actual words were: "I will need some help with 3.0 since I am not well
versed in the changes to the C API or Python code for that, but merging for
2.6.1 should be no big deal."

[MvL]
> That is quite unfortunate, and suggests that perhaps the module
> shouldn't have been added to Python in the first place.

Bob participated actively in http://bugs.python.org/issue4136 and was
responsive to detailed patch review.  He gave a popular talk at PyCon less
than two weeks ago.  He's not derelict.

> I can understand that you don't want to spend much time on it. How
> about removing it from 3.1? We could re-add it when long-term support
> becomes more likely.

I'm speechless.


Raymond 




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Martin v. Löwis
> Besides, Bob doesn't really seem to care about
> porting to py3k (he hasn't said anything about it until now, other than that he
> didn't feel competent to do it).

That is quite unfortunate, and suggests that perhaps the module
shouldn't have been added to Python in the first place.

I can understand that you don't want to spend much time on it. How
about removing it from 3.1? We could re-add it when long-term support
becomes more likely.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Antoine Pitrou
Guido van Rossum writes:
> 
> I'm kind of surprised that a serialization protocol like JSON wouldn't
> support reading/writing bytes (as the serialized format -- I don't
> care about having bytes as values, since JavaScript doesn't have
> something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
> Marshal and Pickle, for example, *always* treat the serialized format
> as bytes. And since in most cases it will be sent over a socket, at
> some point the serialized representation *will* be bytes, I presume.
> What makes supporting this hard?

It's not hard, it just means a lot of duplicated code if the library wants to
support both str and bytes in an optimized way as Martin alluded to. This
duplicated code already exists in the C parts to support the 2.x semantics of
accepting unicode objects as well as str, but not in the Python parts, which
explains why the bytes support is broken in py3k - in 2.x, the same Python code
can be used for str and unicode.

On the other hand, supporting it without chasing the last few percent of
performance should be fairly trivial (by encoding/decoding before doing the
processing proper), and it would avoid the current duplicated code.
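
A rough sketch of that decode-first approach for the input side, assuming the core parser only accepts str. `loads_any` is a hypothetical wrapper name, and defaulting to utf-8 is an assumption taken from the RFC default mentioned below:

```python
import json

def loads_any(s, encoding='utf-8'):
    # Sniff the argument type: bytes-like input is decoded up front,
    # so the parser proper only ever sees str.
    if isinstance(s, (bytes, bytearray)):
        s = s.decode(encoding)
    return json.loads(s)
```

The cost is one extra pass over the input for bytes callers; str callers pay nothing.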

As for reading/writing bytes over the wire, JSON is often used in the same
context as HTML: you are supposed to know the charset and decode/encode the
payload using that charset. However, the RFC specifies a default encoding of
utf-8. (*)


(*) http://www.ietf.org/rfc/rfc4627.txt

The RFC also specifies a discrimination algorithm for non-supersets of ASCII
(“Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.”), but it is not
implemented in the json module:

>>> json.loads('"hi"')
'hi'
>>> json.loads(u'"hi"'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
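
For reference, a minimal sketch of the null-pattern check from section 3 of the RFC, layered in front of a str-only parser. The names `sniff_json_encoding` and `loads_bytes` are hypothetical. Falling back to utf-8 for inputs shorter than four octets is an assumption, and a leading byte-order mark (which the 'utf16' codec used in the session above emits) falls outside the RFC's table and is not handled here:

```python
import json

def sniff_json_encoding(data):
    # Look at which of the first four octets are NUL (RFC 4627, section 3).
    head = data[:4]
    if len(head) < 4:
        return 'utf-8'  # assumption: too short to discriminate
    nulls = tuple(octet == 0 for octet in head)
    if nulls == (True, True, True, False):
        return 'utf-32-be'   # 00 00 00 xx
    if nulls == (True, False, True, False):
        return 'utf-16-be'   # 00 xx 00 xx
    if nulls == (False, True, True, True):
        return 'utf-32-le'   # xx 00 00 00
    if nulls == (False, True, False, True):
        return 'utf-16-le'   # xx 00 xx 00
    return 'utf-8'           # xx xx xx xx

def loads_bytes(data):
    # Decode with the sniffed encoding, then hand a str to the parser.
    return json.loads(data.decode(sniff_json_encoding(data)))
```

The check relies on the first two characters being ASCII, which is why it only holds for texts that start with an object or array.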

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Guido van Rossum
On Wed, Apr 8, 2009 at 4:10 AM, Antoine Pitrou  wrote:
> We're in the process of forward-porting the recent (massive) json updates to
> 3.1, and we are also thinking of dropping remnants of support of the bytes type
> in the json library (in 3.1, again). This bytes support almost didn't work at
> all, but there was a lot of C and Python code for it nevertheless. We're also
> thinking of dropping the "encoding" argument in the various APIs, since it is
> useless.
>
> Under the new situation, json would only ever allow str as input, and output str
> as well. By posting here, I want to know whether anybody would oppose this
> (knowing, once again, that bytes support is already broken in the current py3k
> trunk).
>
> The bug entry is: http://bugs.python.org/issue4136

I'm kind of surprised that a serialization protocol like JSON wouldn't
support reading/writing bytes (as the serialized format -- I don't
care about having bytes as values, since JavaScript doesn't have
something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
Marshal and Pickle, for example, *always* treat the serialized format
as bytes. And since in most cases it will be sent over a socket, at
some point the serialized representation *will* be bytes, I presume.
What makes supporting this hard?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Antoine Pitrou
Martin v. Löwis writes:
> 
> What does Bob Ippolito think about this change? IIUC, he considers
> simplejson's speed one of its primary advantages, and also attributes it
> to the fact that he can parse directly out of byte strings, and marshal
> into them (which is important, as you typically receive them over the
> wire).

The only thing I know is that the new version (the one I've tried to merge) is
massively faster than the old one - several times faster - and within 20-30% of
the speed of the 2.x version (*). Besides, Bob doesn't really seem to care about
porting to py3k (he hasn't said anything about it until now, other than that he
didn't feel competent to do it). But I'm happy with someone proposing an
alternate patch if they want to. As for me, I just wanted to fill the gap and
I'm not interested in doing a lot of work on this issue.

(*)

timeit -s "import json; l=['abc']*100" "json.dumps(l)"

-> trunk: 33.4 usec per loop
-> py3k + patch: 37.1 usec per loop
-> vanilla py3k: 314 usec per loop

timeit -s "import json; s=json.dumps(['abc']*100)" "json.loads(s)"

-> trunk: 44.8 usec per loop
-> py3k + patch: 35.4 usec per loop
-> vanilla py3k: 1.48 msec per loop (!)
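
A small sketch for reproducing these measurements from within Python rather than the timeit command line. The loop count is an arbitrary assumption, and absolute numbers will differ by machine and build:

```python
import json
import timeit

data = ['abc'] * 100          # same payload as the commands above
encoded = json.dumps(data)
loops = 1000                  # arbitrary; raise it for steadier numbers

# timeit.timeit returns total seconds; convert to microseconds per loop.
dump_usec = timeit.timeit(lambda: json.dumps(data), number=loops) / loops * 1e6
load_usec = timeit.timeit(lambda: json.loads(encoded), number=loops) / loops * 1e6

print("dumps: %.1f usec per loop" % dump_usec)
print("loads: %.1f usec per loop" % load_usec)
```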

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Martin v. Löwis
> We're in the process of forward-porting the recent (massive) json updates to
> 3.1, and we are also thinking of dropping remnants of support of the bytes type
> in the json library (in 3.1, again). This bytes support almost didn't work at
> all, but there was a lot of C and Python code for it nevertheless. We're also
> thinking of dropping the "encoding" argument in the various APIs, since it is
> useless.
>
> Under the new situation, json would only ever allow str as input, and output str
> as well. By posting here, I want to know whether anybody would oppose this
> (knowing, once again, that bytes support is already broken in the current py3k
> trunk).

What does Bob Ippolito think about this change? IIUC, he considers
simplejson's speed one of its primary advantages, and also attributes it
to the fact that he can parse directly out of byte strings, and marshal
into them (which is important, as you typically receive them over the
wire). Having to run them through a codec slows parsing down.

Regards,
Martin


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Raymond Hettinger



> We're in the process of forward-porting the recent (massive) json updates to
> 3.1, and we are also thinking of dropping remnants of support of the bytes type
> in the json library (in 3.1, again). This bytes support almost didn't work at
> all, but there was a lot of C and Python code for it nevertheless. We're also
> thinking of dropping the "encoding" argument in the various APIs, since it is
> useless.
>
> Under the new situation, json would only ever allow str as input, and output str
> as well. By posting here, I want to know whether anybody would oppose this
> (knowing, once again, that bytes support is already broken in the current py3k
> trunk).


+1


Raymond


[Python-Dev] Dropping bytes "support" in json

2009-04-08 Thread Antoine Pitrou
Hello,

We're in the process of forward-porting the recent (massive) json updates to
3.1, and we are also thinking of dropping remnants of support of the bytes type
in the json library (in 3.1, again). This bytes support almost didn't work at
all, but there was a lot of C and Python code for it nevertheless. We're also
thinking of dropping the "encoding" argument in the various APIs, since it is
useless.

Under the new situation, json would only ever allow str as input, and output str
as well. By posting here, I want to know whether anybody would oppose this
(knowing, once again, that bytes support is already broken in the current py3k
trunk).

The bug entry is: http://bugs.python.org/issue4136

Regards

Antoine.

