[issue18679] include a codec to handle escaping only control characters but not any others

2014-12-12 Thread Martin Panter

Changes by Martin Panter vadmium...@gmail.com:


--
nosy: +vadmium

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18679
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18679] include a codec to handle escaping only control characters but not any others

2013-10-08 Thread R. David Murray

R. David Murray added the comment:

Well, you could write a streaming codec.  Even if it didn't get accepted for 
the stdlib, you could put it up on PyPI.
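
A minimal sketch of what such a streaming codec might look like, assuming a str-to-str codec that escapes only the C0 control characters and the backslash. The class names, the escape table and the choice of \xNN for the remaining controls are illustrative assumptions, not code from this thread. The encoder needs no state because each character maps independently; the decoder holds back an escape sequence that is split across chunks.

```python
import codecs
import re

# Assumed escape set: C0 controls (0-31) as \xNN, short forms for tab,
# newline and carriage return, and the backslash itself.
ESCAPE_TABLE = {i: '\\x%02x' % i for i in range(32)}
ESCAPE_TABLE.update({9: '\\t', 10: '\\n', 13: '\\r', 92: '\\\\'})

_NAMED = {'t': '\t', 'n': '\n', 'r': '\r', '\\': '\\'}
# Group 1: a complete escape.  Group 2: an incomplete escape at the very
# end of the text, which may be completed by the next chunk.
_TOKEN = re.compile(r'\\(x[0-9a-fA-F]{2}|[tnr\\])|(\\(?:x[0-9a-fA-F]?)?\Z)')


class ControlEscapeEncoder(codecs.IncrementalEncoder):
    def encode(self, text, final=False):
        # Each character is escaped independently, so no buffering is needed.
        return text.translate(ESCAPE_TABLE)


class ControlEscapeDecoder(codecs.IncrementalDecoder):
    def __init__(self, errors='strict'):
        super().__init__(errors)
        self.pending = ''

    def decode(self, text, final=False):
        text, self.pending = self.pending + text, ''

        def replace(match):
            if match.group(2) is not None:
                if final:
                    return match.group(2)   # nothing more is coming: keep as-is
                self.pending = match.group(2)
                return ''
            token = match.group(1)
            if token[0] == 'x':
                return chr(int(token[1:], 16))
            return _NAMED[token]

        return _TOKEN.sub(replace, text)

    def reset(self):
        super().reset()
        self.pending = ''


encoder = ControlEscapeEncoder()
decoder = ControlEscapeDecoder()
for chunk in ['Α\tΩ', '\\n', '\x1f\n']:
    escaped = encoder.encode(chunk)
    print(escaped, repr(decoder.decode(escaped)))
```

Unknown sequences such as \q are passed through unchanged in this sketch; a real codec would have to decide whether to treat them as errors.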

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-10-04 Thread Derek Wilson

Derek Wilson added the comment:

Any update on this? Just so you can see what my workaround is, I'll paste in 
the code I'm using. The major issue I have with this is that performance 
doesn't scale to large strings.

This is also a bytes-to-bytes or str-to-str encoding, because that is the type 
of operation one plans to do with the data one has.

Having a full-fledged streaming codec to handle this would be very helpful when 
writing applications that stream tab- and newline-separated UTF-8 data over 
stdin/stdout.

  
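text_types = (str,)

# Start from repr()'s escapes for the C0 control characters (0-31),
# then override a few with their short backslash forms.
escape_tm = dict((k, repr(chr(k))[1:-1]) for k in range(32))
# Caveat: '\\0' followed by an octal digit is read back by unicode_escape
# as a longer octal escape.
escape_tm[0] = '\\0'
escape_tm[7] = '\\a'
escape_tm[8] = '\\b'
escape_tm[11] = '\\v'
escape_tm[12] = '\\f'
# Escape the backslash itself so that unescaping is unambiguous.
escape_tm[ord('\\')] = '\\\\'

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        # bytes: go through str, preserving undecodable bytes
        return s.decode('utf-8', 'surrogateescape').translate(
            escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode(
            'latin1', 'backslashreplace').decode(
            'unicode_escape').encode('utf-8', 'surrogateescape')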

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-08 Thread Derek Wilson

Derek Wilson added the comment:

using repr(x)[1:-1] is not safe for my use case as i need this for encoding and 
decoding data. the deserialization of repr would be eval, and aside from the 
security issues with that, if I strip the quotes off I can't reliably eval the 
result and get back the original. On top of that, quote escape handling makes 
this non-portable to other languages/tools that do understand control character 
escapes. Consider:

>>> s = Α\t'''Ω
>>> print(s)
Α '''Ω
>>> e = repr(s)[1:-1]
>>> print(e)
Α\t\'\'\'Ω

how do i know what to quote e with before I eval it to get back the value? I 
can't even try all the quoting options and stop when i don't get a syntax error 
because more than one could work and give me a bad result:

>>> d = eval('{}'.format(e))
>>> d == s
False
>>> print(d)
Α   '''Ω
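
The quoting problem described above can be reproduced with a small sketch. The helper names strip_repr and roundtrips and the sample strings are assumptions made for illustration. repr() chooses its outer quote based on the content and only escapes a quote character when the string contains both kinds, so the stripped result carries no record of which quote is needed to read it back.

```python
import ast


def strip_repr(s):
    """Return repr(s) without its surrounding quote characters."""
    return repr(s)[1:-1]


def roundtrips(s, quote):
    """True if wrapping strip_repr(s) in `quote` evaluates back to s."""
    try:
        return ast.literal_eval(quote + strip_repr(s) + quote) == s
    except (SyntaxError, ValueError):
        return False


samples = ["it's", 'say "hi"', 'both \' and "']
for s in samples:
    # Columns: stripped repr, works with '...', works with "..."
    print(strip_repr(s), roundtrips(s, "'"), roundtrips(s, '"'))
```

Neither quote character round-trips all three samples, which is why the choice cannot be made without inspecting the data.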

Aside from python not being able to handle the repr(x)[1:-1] case itself, the 
goal is to use output generated in common tools from cut to hadoop where tab is 
a field separator (aside: wouldn't adoption of ascii 0x1f as a common unit 
separator be great). Sometimes it is useful to separate newlines in data from a 
literal new line in formats (again like hadoop or unix utilities) that treat 
lines as records (and here again ascii 0x1e would have been a nice solution).

But we have to work with what we've got and there are many tools that care 
about tab separated fields and per line records. In these cases, the right tool 
for the interoperability job is a codec that simply backslash escapes control 
characters and nothing else.

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

ast.literal_eval("'%s'" % e)
e.encode().decode('unicode-escape').encode('latin1').decode()
e.encode('latin1', 'backslashescape').decode('unicode-escape')

--
nosy: +serhiy.storchaka




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-08 Thread Derek Wilson

Derek Wilson added the comment:

> ast.literal_eval("'%s'" % e)

this doesn't work if you use the wrong quote. without introspecting the data in 
e you can't reliably choose whether to use '%s', "%s", """%s""" or 
'''%s'''. which ones break (and break silently) depends on the data.


> e.encode().decode('unicode-escape').encode('latin1').decode()

so ... encode the repr()[1:-1] string in utf-8 bytes, decode backslash escape 
sequences and individual bytes as if they are latin1, encode as latin1 (which 
is just byte for byte serialization), then decode the byte representation as if 
it is utf-8 encoded to recombine the characters that were broken with the 
'unicode-escape' decode earlier? 
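
A step-by-step sketch of that chain, using a short escaped sample chosen for illustration (the variable names are not from the thread), shows what each stage produces:

```python
e = 'Α\\tΩ'                              # escaped text: Α, backslash, t, Ω

step1 = e.encode()                       # UTF-8 bytes, escape still literal
step2 = step1.decode('unicode-escape')   # \t becomes a tab; each other byte
                                         # becomes one Latin-1 character
step3 = step2.encode('latin1')           # back to the same byte values
step4 = step3.decode()                   # reassemble the UTF-8 sequences

for name, value in [('step1', step1), ('step2', step2),
                    ('step3', step3), ('step4', step4)]:
    print(name, repr(value))
```

One limitation of this chain: if the escaped text contains an escape such as \u0391 that decodes to a character above U+00FF, the 'latin1' encode step raises UnicodeEncodeError.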

this may work for my example, but this looks and feels very hacky for something 
that should be simple and straightforward. and again tools other than python 
will run into escaped quotes in the data which may cause problems.

> e.encode('latin1', 'backslashescape').decode('unicode-escape')

when i execute this i get a traceback

LookupError: unknown error handler name 'backslashescape'

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> this doesn't work if you use the wrong quote. without introspecting the data 
> in e you can't reliably choose whether to use '%s', "%s", """%s""" or 
> '''%s'''.

Indeed.

> and again tools other than python will run into escaped quotes in the data 
> which may cause problems.

Then use s.translate() or re.sub() for encoding.
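
A minimal sketch of the re.sub() variant for the encoding side, assuming only the C0 control characters and the backslash need escaping. The function name escape_control_re and the set of short escapes are illustrative choices, not code from this thread.

```python
import re

# Matches any C0 control character (0x00-0x1f) or a backslash.
_CONTROL = re.compile(r'[\x00-\x1f\\]')
# Short forms for the most common characters; everything else uses \xNN.
_SHORT = {'\t': '\\t', '\n': '\\n', '\r': '\\r', '\\': '\\\\'}


def escape_control_re(text):
    """Backslash-escape control characters and backslashes, nothing else."""
    def replace(match):
        ch = match.group()
        return _SHORT.get(ch, '\\x%02x' % ord(ch))
    return _CONTROL.sub(replace, text)


print(escape_control_re("Α\t'\"Ω\n"))
```

Quotes and non-ASCII characters pass through untouched, which is the property repr(x)[1:-1] lacks.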

> when i execute this i get a traceback

Sorry, it should be

e.encode('latin1', 'backslashreplace').decode('unicode-escape').

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-08 Thread Derek Wilson

Derek Wilson added the comment:

> e.encode('latin1', 'backslashreplace').decode('unicode-escape')

this works, but still the quotes are backslash escaped. 

translate will do what i need for my use case, but it doesn't support streaming 
for larger chunks of data.

it is nice that there is a workaround but i do still think this is a valuable 
enough feature that there should be a builtin codec for it.

--




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-07 Thread Derek Wilson

New submission from Derek Wilson:

Escaping strings for serialization or display is a common problem. Currently, 
in Python 3, in order to escape a string, you need to do this:

'my\tstring'.encode('unicode_escape').decode('ascii')

This would give you a string that was represented like this:

'my\\tstring'

But this does not present a suitable representation when the string contains 
unicode characters. Consider this example:

s = 'Α\tΩ'

There is no method to write this string with only the control character 
escaped.

Even python itself recognizes this as a problem and implemented a solution 
for it.

>>> s = 'Α\tΩ'
>>> print(s)
Α   Ω
>>> print(repr(s))
'Α\tΩ'
>>> print(s.encode('unicode_escape').decode('ascii'))
\u0391\t\u03a9

What I want is public exposure of the functionality to represent control 
characters with their common \ escape sequences (or \x## for control characters 
where necessary - for instance unit and record separators).

I have numerous use cases for this and python's own str.__repr__ implementation 
shows that the functionality is valuable. I would bet that in the majority of 
cases where people use unicode_escape, something like a control_escape is more 
along the lines of what is desired.

And while we're at it, it would be great if this were a unicode-unicode codec 
like the rot_13 codec. My desired solution would look like this:

>>> import codecs
>>> s = 'Α\tΩ'
>>> e = codecs.encode(s, 'control_escape')
>>> print(e)
Α\tΩ
>>> print(codecs.decode(e, 'control_escape'))
Α   Ω
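
No 'control_escape' codec exists in the standard library. As a rough sketch of how the interface above could be provided from user code, a str-to-str codec can be registered with codecs.register(). The function names, the escape table and the use of \xNN for the remaining control characters are assumptions made for illustration.

```python
import codecs
import re

# Assumed escape set: C0 controls as \xNN, short forms for tab, newline and
# carriage return, plus the backslash itself.
_ESCAPES = {i: '\\x%02x' % i for i in range(32)}
_ESCAPES.update({ord('\t'): '\\t', ord('\n'): '\\n',
                 ord('\r'): '\\r', ord('\\'): '\\\\'})

_NAMED = {'t': '\t', 'n': '\n', 'r': '\r', '\\': '\\'}
_UNESCAPE = re.compile(r'\\(x[0-9a-fA-F]{2}|[tnr\\])')


def _unescape_match(match):
    token = match.group(1)
    if token[0] == 'x':
        return chr(int(token[1:], 16))
    return _NAMED[token]


def control_escape_encode(text, errors='strict'):
    # Codec functions return (result, number of input items consumed).
    return text.translate(_ESCAPES), len(text)


def control_escape_decode(text, errors='strict'):
    return _UNESCAPE.sub(_unescape_match, text), len(text)


def _search(name):
    # Called by the codec registry with the normalized codec name.
    if name == 'control_escape':
        return codecs.CodecInfo(control_escape_encode, control_escape_decode,
                                name='control_escape')
    return None


codecs.register(_search)

s = 'Α\tΩ'
e = codecs.encode(s, 'control_escape')
print(e)
print(codecs.decode(e, 'control_escape') == s)
```

Like rot_13, a codec registered this way is reached through codecs.encode() and codecs.decode() rather than str.encode(), because it does not produce bytes.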

If this is something that could be included in python 3.4, that would be 
awesome. I am willing to work on this if so.

--
components: Library (Lib)
messages: 194625
nosy: underrun
priority: normal
severity: normal
status: open
title: include a codec to handle escaping only control characters but not any 
others
type: enhancement
versions: Python 3.4




[issue18679] include a codec to handle escaping only control characters but not any others

2013-08-07 Thread R. David Murray

R. David Murray added the comment:

In what way does repr(x)[1:-1] not serve your use case?

--
nosy: +r.david.murray
