Re: [Tutor] subprocess.getstatusoutput : UnicodeDecodeError

2017-09-22 Thread Chris Warrick
On 22 September 2017 at 03:57, Evuraan  wrote:
> >>> result = subprocess.run(["tail", "-400", "/tmp/pmaster.txt"],
> ...                         stdout=subprocess.PIPE)
> >>> result.returncode
> 0
> >>> subprocess.getstatusoutput("file  /tmp/pmaster.txt",)
> (0, '/tmp/pmaster.txt: Non-ISO extended-ASCII text, with very long
> lines, with LF, NEL line terminators')


You’re still using the legacy getstatusoutput function in that last call.

>>> subprocess.run(['file', '/tmp/pmaster.txt'], stdout=subprocess.PIPE)
CompletedProcess(args=['file', '/tmp/pmaster.txt'], returncode=0,
stdout=b'/tmp/pmaster.txt: Non-ISO…\n')
>>> result = _  # underscore means result of previous line in interactive mode
>>> result.stdout
b'/tmp/pmaster.txt: Non-ISO…line terminators\n'
>>> result.returncode
0

And if you want to get a Unicode string (this assumes the command’s
output is in your system encoding, hopefully UTF-8):

>>> subprocess.run(['file', '/tmp/pmaster.txt'], stdout=subprocess.PIPE,
...                universal_newlines=True)
CompletedProcess(args=['file', '/tmp/pmaster.txt'], returncode=0,
stdout='/tmp/pmaster.txt: Non-ISO…\n')
>>> # _.stdout is now a Unicode string (str) rather than bytes

Also, going back to your original example: you should not be using
`tail` from within Python. You should not depend on tail being
available (it’s not on Windows), and there may also be version
differences. Instead of tail, you should use Python’s standard file
operations (open()) to accomplish your task.
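
A minimal sketch of that suggestion, not code from the thread: the helper name tail_lines and its defaults are hypothetical, and it assumes that replacing undecodable bytes with U+FFFD is acceptable for the task. It reads the file with open() and keeps only the last n lines.

```python
import tempfile
from collections import deque


def tail_lines(path, n=400, encoding="utf-8", errors="replace"):
    """Return the last n lines of a text file as a list of strings.

    Hypothetical helper: errors="replace" turns bytes that are not valid
    in the chosen encoding into U+FFFD instead of raising an exception.
    """
    with open(path, encoding=encoding, errors=errors) as fobj:
        # A deque with maxlen discards older lines as newer ones arrive,
        # so only the last n lines are held in memory at any time.
        return list(deque(fobj, maxlen=n))


# Demo on a throwaway file that contains a stray 0xe0 byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nsecond \xe0\nthird\n")

for line in tail_lines(tmp.name, n=2):
    print(repr(line))
```

This reads the whole file once from the start, which is simple but not the fastest choice for very large files.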

[advertisement] Extra reading on security (shell=False) and the
necessity of calling subprocesses:
https://chriswarrick.com/blog/2017/09/02/spawning-subprocesses-smartly-and-securely/
[/advertisement]

-- 
Chris Warrick 
PGP: 5EAAEA16
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] subprocess.getstatusoutput : UnicodeDecodeError

2017-09-21 Thread Evuraan
>
> getstatusoutput is a "legacy" function. It still exists for code that
> has already been using it, but it is not recommended for new code.
>
> https://docs.python.org/3.5/library/subprocess.html#using-the-subprocess-module
>
> Since you're using Python 3.5, let's try using the brand new `run`
> function and see if it does better:
>
> import subprocess
> result = subprocess.run(["tail", "-3", "/tmp/pmaster.db"],
> stdout=subprocess.PIPE)
> print("return code is", result.returncode)
> print("output is", result.stdout)
>
>
> It should do better than getstatusoutput, since it returns plain bytes
> without assuming they are ASCII. You can then decode them yourself:
>
> # try this and see if it is sensible
> print("output is", result.stdout.decode('latin1'))
>
> # otherwise this
> print("output is", result.stdout.decode('utf-8', errors='replace'))
>
>
>
>> >>> subprocess.getstatusoutput("tail -3 /tmp/pmaster.db",)
>> Traceback (most recent call last):
> [...]
>>   File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
>> return codecs.ascii_decode(input, self.errors)[0]
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 189:
>> ordinal not in range(128)
>
> Let's look at the error message. getstatusoutput apparently expects only
> pure ASCII output, because it is choking on a non-ASCII byte, namely
> 0xe0. Obviously 0xe0 (or in decimal, 224) is not an ASCII value, since
> ASCII goes from 0 to 127 only.
>
> If there's one non-ASCII byte in the file, there are probably more.
>
> So what is that mystery 0xe0 byte? It is hard to be sure, because it
> depends on the source. If pmaster.db is a binary file, it could mean
> anything or nothing. If it is a text file, it depends on the encoding
> that the file uses. If it comes from a Mac, it might be:
>
> py> b'\xe0'.decode('macroman')
> '‡'
>
> If it comes from Windows in Western Europe, it might be:
>
> py> b'\xe0'.decode('latin1')
> 'à'
>
> If it comes from Windows in Greece, it might be:
>
> py> b'\xe0'.decode('iso 8859-7')
> 'ΰ'
>
> and so forth. There's no absolutely reliable way to tell. This is the
> sort of nightmare that Unicode was invented to fix, but unfortunately
> there still exist millions of files, data formats and applications which
> insist on using rubbish "extended ASCII" encodings instead.
>
>
>
>> That file's content is kryptonite for python apparently. Other shell
>> operations work.
>>
>> >>> subprocess.getstatusoutput("file /tmp/pmaster.db",)
>> (0, '/tmp/pmaster.db: Non-ISO extended-ASCII text, with very long lines,
>> with LF, NEL line terminators')
>
> The `file` command agrees with me: it is not ASCII.


Thank you Steve! subprocess.run handles it better.


>>> subprocess.getstatusoutput("tail -400 /tmp/pmaster.txt",)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/subprocess.py", line 805, in getstatusoutput
    data = check_output(cmd, shell=True, universal_newlines=True, stderr=STDOUT)
  File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 695, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.5/subprocess.py", line 1059, in communicate
    stdout = self.stdout.read()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position
60942: invalid continuation byte

as opposed to:

>>> result = subprocess.run(["tail", "-400", "/tmp/pmaster.txt"],
...                         stdout=subprocess.PIPE)
>>> result.returncode
0
>>> subprocess.getstatusoutput("file  /tmp/pmaster.txt",)
(0, '/tmp/pmaster.txt: Non-ISO extended-ASCII text, with very long
lines, with LF, NEL line terminators')
>>>

That was awesome! :)


Re: [Tutor] subprocess.getstatusoutput : UnicodeDecodeError

2017-09-21 Thread Steven D'Aprano
On Thu, Sep 21, 2017 at 03:46:29PM -0700, Evuraan wrote:

> How can I work around this issue where  subprocess.getstatusoutput gives
> up, on Python 3.5.2:

getstatusoutput is a "legacy" function. It still exists for code that 
has already been using it, but it is not recommended for new code.

https://docs.python.org/3.5/library/subprocess.html#using-the-subprocess-module

Since you're using Python 3.5, let's try using the brand new `run` 
function and see if it does better:

import subprocess
result = subprocess.run(["tail", "-3", "/tmp/pmaster.db"],
                        stdout=subprocess.PIPE)
print("return code is", result.returncode)
print("output is", result.stdout)


It should do better than getstatusoutput, since it returns plain bytes 
without assuming they are ASCII. You can then decode them yourself:

# try this and see if it is sensible
print("output is", result.stdout.decode('latin1'))

# otherwise this
print("output is", result.stdout.decode('utf-8', errors='replace'))
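
The two attempts above can be folded into one helper. This is only a sketch: the name decode_best_effort and the candidate list ("utf-8", then "cp1252") are assumptions for illustration, not something the subprocess module provides. It tries each encoding strictly and falls back to latin1, which maps every byte to a character and therefore never fails.

```python
def decode_best_effort(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first candidate encoding that accepts them.

    Returns (text, encoding_used). The candidate list is a guess; adjust
    it to whatever the data source is likely to have produced.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin1 assigns a character to every byte value, so this cannot fail,
    # although the result may not be what the author of the data intended.
    return raw.decode("latin1"), "latin1"


text, used = decode_best_effort(b"caf\xe0")
print(used, repr(text))
```

A successful decode only means the bytes were valid in that encoding, not that the guess was right.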



> >>> subprocess.getstatusoutput("tail -3 /tmp/pmaster.db",)
> Traceback (most recent call last):
[...]
>   File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
> return codecs.ascii_decode(input, self.errors)[0]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 189:
> ordinal not in range(128)

Let's look at the error message. getstatusoutput apparently expects only 
pure ASCII output, because it is choking on a non-ASCII byte, namely 
0xe0. Obviously 0xe0 (or in decimal, 224) is not an ASCII value, since 
ASCII goes from 0 to 127 only.

If there's one non-ASCII byte in the file, there are probably more.
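
To check that hunch on raw bytes, something like the following could be used. It is a sketch with a made-up helper name (find_non_ascii) and made-up sample data; it also shows that the exception object carries the offset of the offending byte.

```python
def find_non_ascii(raw):
    """Return (offset, byte_value) for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(raw) if b > 127]


sample = b"ok \xe0 ok"  # made-up data containing one 0xe0 byte

try:
    sample.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start and exc.end delimit the bytes the codec could not decode.
    print("failed at offset", exc.start, "on", sample[exc.start:exc.end])

print(find_non_ascii(sample))
```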

So what is that mystery 0xe0 byte? It is hard to be sure, because it 
depends on the source. If pmaster.db is a binary file, it could mean 
anything or nothing. If it is a text file, it depends on the encoding 
that the file uses. If it comes from a Mac, it might be:

py> b'\xe0'.decode('macroman')
'‡'

If it comes from Windows in Western Europe, it might be:

py> b'\xe0'.decode('latin1')
'à'

If it comes from Windows in Greece, it might be:

py> b'\xe0'.decode('iso 8859-7')
'ΰ'

and so forth. There's no absolutely reliable way to tell. This is the 
sort of nightmare that Unicode was invented to fix, but unfortunately 
there still exist millions of files, data formats and applications which 
insist on using rubbish "extended ASCII" encodings instead.


 
> That file's content is kryptonite for python apparently. Other shell
> operations work.
> 
> >>> subprocess.getstatusoutput("file /tmp/pmaster.db",)
> (0, '/tmp/pmaster.db: Non-ISO extended-ASCII text, with very long lines,
> with LF, NEL line terminators')

The `file` command agrees with me: it is not ASCII.


-- 
Steve


Re: [Tutor] subprocess.getstatusoutput : UnicodeDecodeError

2017-09-21 Thread Evuraan
>
> But: do you really want to "tail" what's probably not really a plaintext
> file? Just guessing, but the .db as well as the error msgs are a hint.

Although the filename ends with ".db", it is just a text file. I'm not
tailing a SQLite or other binary file; I just happened to name it that way.

I work around the same sort of problem elsewhere like this:

with open("/tmp/pmaster.db",encoding="utf-8",errors="ignore") as fobj:

or even with codecs

with codecs.open("/tmp/pmaster.db",encoding="utf-8",errors="ignore") as fobj:

I think I'll have to follow suit here instead of using tail; apparently
there's no "errors=ignore" for subprocess.

I was hoping subprocess would be impervious to decoding errors, as the
text is coming from tail.
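
For readers on newer versions: from Python 3.6 onward (not the 3.5.2 used in this thread), subprocess.run accepts encoding and errors arguments, which gives the same control as open(). The sketch below uses a stand-in child process (the current interpreter writing a stray 0xe0 byte) instead of tail so that it runs anywhere; that stand-in is an assumption for illustration.

```python
import subprocess
import sys

# Stand-in for "tail": a child Python process that writes a byte
# sequence which is not valid UTF-8 to its standard output.
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(b'caf\\xe0\\n')"]

# encoding/errors require Python 3.6+; errors="replace" substitutes
# U+FFFD for undecodable bytes instead of raising UnicodeDecodeError.
result = subprocess.run(child, stdout=subprocess.PIPE,
                        encoding="utf-8", errors="replace")

print(result.returncode, repr(result.stdout))
```

On 3.5 the equivalent is to capture bytes and call .decode(..., errors=...) on result.stdout.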


Re: [Tutor] subprocess.getstatusoutput : UnicodeDecodeError

2017-09-21 Thread Mats Wichmann
On 09/21/2017 04:46 PM, Evuraan wrote:
> Greetings!
> 
> My search-fu failed me, so thought of finally asking this question here.
> 
> 
> How can I work around this issue where  subprocess.getstatusoutput gives
> up, on Python 3.5.2:

You picked a fun one!

First off, subprocess.getstatusoutput is a dubious API, because if you
call a command on a UNIX/Linux flavored system, it has the possibility
of writing to both the standard output and standard error streams, but
this API cannot separate them.  I remember fighting this in something
years and years ago.
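
For comparison, a sketch of how subprocess.run keeps the two streams apart. The child command here is a stand-in (the current interpreter printing one line to each stream), chosen only so the example is self-contained.

```python
import subprocess
import sys

# Stand-in command that writes one line to stdout and one to stderr.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

# Separate pipes give separate attributes on the result object.
result = subprocess.run(child, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

print("return code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)
```

Both attributes are bytes here; decoding is left to the caller.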

Second, this thing has changed even in the Python 3.x series, which is a
bad sign (see docs): first it didn't exist, then it was put back, then
the first element of the return tuple changed.

Third, according to the docs, "none of the guarantees described above
regarding security and exception handling consistency are valid for
these functions" (referring to "legacy" getstatusoutput and getoutput)

Probably - don't use it.

You have your main hint in the error messages:

"Non-ISO extended-ASCII text", along with failures claiming a problem
decoding something that is not in the range of standard ASCII: "'ascii'
codec can't decode byte 0xe0 in position 189: ordinal not in range(128)"

Maybe try using subprocess.check_output, although it's not a direct
replacement.

But: do you really want to "tail" what's probably not really a plaintext
file? Just guessing, but the .db as well as the error msgs are a hint.

You'll need to fill us in more on what you want to accomplish.



