Re: About size of Unicode string

2005-06-13 Thread Fredrik Lundh
Frank Abel Cancio Bello wrote:

> Can I get the number of bytes in a string object independently of its encoding?

strings hold characters, not bytes.  an encoding is used to convert a
stream of characters to a stream of bytes.   if you need to know the
number of bytes needed to hold an encoded string, you need to know
the encoding.

(and in some cases, including UTF-8, you need to *do* the encoding
before you can tell how many bytes you get)
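
A minimal sketch of that point, assuming Python 3 (where str holds characters and bytes holds octets). The helper name encoded_size is hypothetical and not part of any library. It encodes the same two-character string with several codecs and counts the resulting bytes:

```python
def encoded_size(text, encoding):
    """Return the number of bytes needed to hold text in the given encoding."""
    # The only general way to find out is to perform the encoding.
    return len(text.encode(encoding))


text = "G\N{LATIN SMALL LETTER A WITH RING ABOVE}"  # two characters

# "utf-16" (without -le/-be) prepends a two-byte byte order mark.
sizes = {enc: encoded_size(text, enc) for enc in ("utf-8", "latin-1", "utf-16")}
print(len(text), sizes)  # 2 {'utf-8': 3, 'latin-1': 2, 'utf-16': 6}
```

The character count stays at 2 while the byte count changes with the codec, which is why the encoding has to be known before a byte length can be stated.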

> Is the len function the right way to get it?

len() on the encoded string, yes.

> Laci, look at the following code:
>
>   import urllib2
>   request = urllib2.Request(url= 'http://localhost:6000')
>   data = 'data to send\n'.encode('utf_8')
>   request.add_data(data)
>   request.add_header('content-length', str(len(data)))
>   request.add_header('content-encoding', 'UTF-8')
>   file = urllib2.urlopen(request)
>
> Is it always true that the size of the entity-body is len(data),
> independently of the encoding of data?

your data variable contains bytes, not characters, so the answer is yes.

on the other hand, that add_header line isn't really needed -- if you leave
it out, urllib2 will add the content-length header all by itself.

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


About size of Unicode string

2005-06-06 Thread Frank Abel Cancio Bello
Hi all!

I need to know the size of a string object independently of its encoding.
For example:

len('123') == len('123'.encode('utf_8'))

while the size of the '123' object is different from the size of
'123'.encode('utf_8').

More:
I need to send a string in an HTTP request, so I need to know the length
of the string to set the Content-Length header independently of its
encoding.

Any ideas?

Thanks in advance
Frank






Re: About size of Unicode string

2005-06-06 Thread Laszlo Zsolt Nagy
Frank Abel Cancio Bello wrote:

> Hi all!
>
> I need to know the size of a string object independently of its encoding.
> For example:
>
>    len('123') == len('123'.encode('utf_8'))
>
> while the size of the '123' object is different from the size of
> '123'.encode('utf_8').
>
> More:
> I need to send a string in an HTTP request, so I need to know the length
> of the string to set the Content-Length header independently of its
> encoding.
>
> Any ideas?
  

This is from the RFC:


 The Content-Length entity-header field indicates the size of the 
 entity-body, in decimal number of OCTETs, sent to the recipient or, in 
 the case of the HEAD method, the size of the entity-body that would 
 have been sent had the request been a GET.

   Content-Length    = "Content-Length" ":" 1*DIGIT
  

 An example is

   Content-Length: 3495
  

 Applications SHOULD use this field to indicate the transfer-length of 
 the message-body, unless this is prohibited by the rules in section 
 4.4 http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4.

 Any Content-Length greater than or equal to zero is a valid value. 
 Section 4.4 describes how to determine the length of a message-body if 
 a Content-Length is not given.

It looks to me like the Content-Length header has nothing to do with the
encoding. It is very low-level stuff. The content length is given in
OCTETs and it represents the size of the body. Clearly, it has nothing
to do with MIME/encoding etc. It is about the number of bytes transferred
in the body. Try writing your unicode strings into a StringIO and taking
its length.

   Laci
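
One way to read that suggestion, sketched for Python 3 under the assumption that the text is encoded before it is buffered: io.BytesIO accepts only bytes, so the length of the buffer is the number of octets that would go into the body. The function name body_length and the sample string are illustrative, not taken from the thread.

```python
import io


def body_length(text, encoding="utf-8"):
    """Write the encoded text into an in-memory byte buffer and measure it."""
    buffer = io.BytesIO()
    buffer.write(text.encode(encoding))  # the buffer stores octets, not characters
    return len(buffer.getvalue())


message = "G\u00e5\n"  # three characters, one of them outside ASCII
print(len(message), body_length(message), body_length(message, "latin-1"))  # 3 4 3
```

The buffer length still depends on the encoding chosen when writing, so the buffer measures the body but does not make the question encoding-independent.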



RE: About size of Unicode string

2005-06-06 Thread Frank Abel Cancio Bello
Well, I will repeat the question:

Can I get the number of bytes in a string object independently of its encoding?
Is the len function the right way to get it?

Laci, look at the following code:

import urllib2
request = urllib2.Request(url= 'http://localhost:6000')
data = 'data to send\n'.encode('utf_8')
request.add_data(data)
request.add_header('content-length', str(len(data)))
request.add_header('content-encoding', 'UTF-8')
file = urllib2.urlopen(request)

Is it always true that the size of the entity-body is len(data),
independently of the encoding of data?




RE: About size of Unicode string

2005-06-06 Thread Andrew Dalke
Frank Abel Cancio Bello wrote:
> Can I get the number of bytes in a string object independently of its encoding?
> Is the len function the right way to get it?

No.  len(unicode_string) returns the number of characters in the
unicode_string.

The number of bytes depends on how the unicode characters are represented.
Different encodings will use different numbers of bytes.

>>> u = u"G\N{Latin small letter A with ring above}"
>>> u
u'G\xe5'
>>> len(u)
2
>>> u.encode("utf-8")
'G\xc3\xa5'
>>> len(u.encode("utf-8"))
3
>>> u.encode("latin1")
'G\xe5'
>>> len(u.encode("latin1"))
2
>>> u.encode("utf16")
'\xfe\xff\x00G\x00\xe5'
>>> len(u.encode("utf16"))
6
 

> Laci, look at the following code:
>
>   import urllib2
>   request = urllib2.Request(url= 'http://localhost:6000')
>   data = 'data to send\n'.encode('utf_8')
>   request.add_data(data)
>   request.add_header('content-length', str(len(data)))
>   request.add_header('content-encoding', 'UTF-8')
>   file = urllib2.urlopen(request)
>
> Is it always true that the size of the entity-body is len(data),
> independently of the encoding of data?

For this case it is true because the logical length of 'data'
(which is a byte string) is equal to the number of bytes in the
string, and the utf-8 encoding of a byte string with character
values in the range 0-127, inclusive, is unchanged from the
original string.

In general, for example when 'data' is a unicode string, no.

len() returns the logical length of 'data'.  That number does
not need to be the number of bytes used to represent 'data'.
To get the bytes you must encode the object.

Andrew
[EMAIL PROTECTED]
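
The distinction can be checked directly. The sketch below assumes Python 3 and uses a hypothetical helper, utf8_length_matches, to test whether the character count of a string equals its UTF-8 byte count. That holds only when every character is in the range 0-127.

```python
def utf8_length_matches(text):
    """True if len(text) equals the number of bytes in its UTF-8 encoding."""
    return len(text) == len(text.encode("utf-8"))


ascii_text = "data to send\n"  # every character is below 128
mixed_text = "G\u00e5"         # contains U+00E5, which takes two bytes in UTF-8

print(utf8_length_matches(ascii_text), utf8_length_matches(mixed_text))  # True False
```

For the string in the question the two lengths agree, but only because its content happens to be pure ASCII.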



Re: About size of Unicode string

2005-06-06 Thread Leif K-Brooks
Frank Abel Cancio Bello wrote:
>   request.add_header('content-encoding', 'UTF-8')

The Content-Encoding header is for things like gzip, not for
specifying the text encoding. Use the charset parameter to the
Content-Type header for that, as in Content-Type: text/plain;
charset=utf-8.
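
As a rough illustration in Python 3, where urllib2 has become urllib.request, the request from the question could declare its text encoding as follows. The URL is the one used in the thread. Nothing is sent: the sketch only builds the request object and inspects it. Content-Length is left unset on purpose, since an earlier reply in this thread notes that the library adds it when the request is opened.

```python
import urllib.request

data = "data to send\n".encode("utf-8")  # bytes, so len(data) counts octets

request = urllib.request.Request(
    "http://localhost:6000",
    data=data,
    # charset names the text encoding; Content-Encoding is not set at all
    headers={"Content-Type": "text/plain; charset=utf-8"},
)

# Request stores header names capitalized, i.e. as "Content-type".
print(request.get_method(), request.get_header("Content-type"), len(request.data))
```

Supplying data makes the request a POST, and the charset parameter tells the receiver how to turn the 13 body octets back into characters.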


RE: About size of Unicode string

2005-06-06 Thread Frank Abel Cancio Bello
Thanks to all. Andrew's answer was an excellent explanation. Thanks, Leif, for
your suggestion.


