Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Fiedler Roman
> From: Michael Selik [mailto:m...@selik.org]
> 
> On Wed, Jun 27, 2018 at 12:04 AM Fiedler Roman wrote:
> 
> Context: we are conducting machine learning experiments that generate
> some kind of nested decision trees. As the tree includes specific
> decision elements (which require custom code to evaluate), we decided
> to store the decision tree (result of the analysis) as generated Python
> code. Thus the decision tree can be transferred to sensor nodes
> (detectors) that will then filter data according to the decision tree
> when executing the given code.
> 
> How do you write tests for the sensor nodes? Do they use code as data for
> test cases?

We have two approaches for test data generation: as we are processing log data, 
we may use adaptive, self-learning log data generators that can then be spiked 
with anomalies. In other tests we used armored zero day exploits on 
production-like test systems to get more realistic data.

The big picture: once everything is working, distributed sensor nodes 
shall pre-process machine log data streams for security analysis in real time 
and report findings back to a central instance. Findings also include data 
that the sensor node cannot classify. The central instance updates its 
internal model, attempting to learn how to classify the new data, and then 
creates new model-evaluation code (the code that caused the crash) that is 
sent to the sensors again. The sensor replaces its model with the generated 
code, thus altering the log data analysis behaviour.

The current implementation uses 
https://packages.debian.org/search?keywords=logdata-anomaly-miner to run the 
sensor nodes; the central instance is experimental code creating configuration 
for the nodes. When the detection methods become more mature, the way of model 
distribution is likely to change to a more robust scheme. We try to apply those 
mining approaches to various domains: attack detection based on log data 
without known structure (proprietary systems, no SIEM regexes available yet, 
no rules), detection of vulnerable code before it is exploited (zero-day 
discovery of LXC container escape vulnerabilities), and detection of the 
execution of zero-day exploits themselves, which we wrote for demonstration 
purposes. See 
https://itsecx.fhstp.ac.at/wp-content/uploads/2016/11/06_RomanFiedler_SyscallAuditLogMining-V1.pdf
(sorry, German slides only)
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Fiedler Roman
> From: Guido van Rossum [mailto:gu...@python.org]
> 
> I consider this a bug -- a violation of Python's (informal) promise to
> the user that when CPython segfaults it is not the user's fault.

Strictly speaking it is not a segfault, just a parser exception that cannot be 
caught (at least I failed to catch it in a quick test). It seems the failure 
happens while the file is still being parsed, before any "except" block in the 
same file can take effect. Apart from that: even when caught, what to do? Your 
program partially refuses to load; the only benefit is that you can die 
gracefully.
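
If the generated code is loaded by a separate loader instead of being run as 
the main script, the failure can be intercepted. A minimal sketch, assuming 
Python 3 and that the generated source is available as a string; 
`nested_source` and `load_generated` are hypothetical helper names, and which 
exception is raised depends on the interpreter version (MemoryError in the 
report above, SyntaxError or RecursionError elsewhere):

```python
def nested_source(depth):
    """Build the reproducer: `depth` nested A([...]) calls around A(None)."""
    return "A([" * depth + "A(None)" + "])" * depth


def load_generated(source, name="<generated>"):
    """Compile generated source; return (code, None) or (None, error)."""
    try:
        return compile(source, name, "exec"), None
    except (MemoryError, RecursionError, SyntaxError) as exc:
        # The exception type is version dependent, so all three are caught.
        return None, exc


# Shallow nesting compiles; only compilation is attempted here, so the
# name A does not need to exist yet.
code, err = load_generated(nested_source(3))
assert err is None

# Very deep nesting is rejected, but the loader process survives and can
# keep running the previous model.
code, err = load_generated(nested_source(1000))
if err is not None:
    print("rejected generated model:", type(err).__name__)
```

This only turns the failure into a reportable error; the generated model 
itself still cannot be loaded in that form.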

> Given typical Python usage patterns, I don't consider this an important bug,
> but maybe someone is interested in trying to fix it.

Acknowledged: I do not know of any software where this has high relevance, but 
my knowledge is quite limited, so I asked the PSRT beforehand to be sure.

> As far as your application is concerned, I'm not sure that generating code
> like that is the right approach. Why don't you generate a data structure
> and a little engine that walks the data structure?

That is what I told the colleague who asked me to assist in analysing the 
crash, too. I guess that the "simple generator" was just easier to write and 
was thus used as a starting point. And now, by chance, a model was generated 
that hit the Python limit of about 50 nested instantiations/lists per 
statement, or whatever the limit actually is. So there is not much "why" to be 
explained, it just happened.

Kind regards,
Roman



Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Michael Selik
On Wed, Jun 27, 2018 at 12:04 AM Fiedler Roman 
wrote:

> Context: we are conducting machine learning experiments that generate some
> kind of nested decision trees. As the tree includes specific decision
> elements (which require custom code to evaluate), we decided to store the
> decision tree (result of the analysis) as generated Python code. Thus the
> decision tree can be transferred to sensor nodes (detectors) that will then
> filter data according to the decision tree when executing the given code.
>

How do you write tests for the sensor nodes? Do they use code as data for
test cases?


Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Nick Coghlan
On 27 June 2018 at 17:04, Fiedler Roman  wrote:
> Hello List,
>
> Context: we are conducting machine learning experiments that generate some 
> kind of nested decision trees. As the tree includes specific decision 
> elements (which require custom code to evaluate), we decided to store the 
> decision tree (result of the analysis) as generated Python code. Thus the 
> decision tree can be transferred to sensor nodes (detectors) that will then 
> filter data according to the decision tree when executing the given code.
>
> Tracking down a crash when executing that generated code, we came to 
> following simplified reproducer that will cause the interpreter to crash (on 
> Python 2/3) when loading the code before execution is started:
>
> #!/usr/bin/python2 -BEsStt
> A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])
>
> The error message is:
>
> s_push: parser stack overflow
> MemoryError
>
> Despite the machine having 16GB of RAM, the code cannot be loaded. Splitting 
> it into two lines using an intermediate variable is the current workaround to 
> still get it running after manual adapting.

This seems like it may indicate a potential problem in the pgen2
parser generator, since the compilation is failing at the original
parse step, but checking the largest version of this that CPython can
parse on my machine gives a syntax tree of only ~77kB:

>>> tree = 
parser.expr("A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])")
>>> sys.getsizeof(tree)
77965

Attempting to print that hints more closely at the potential problem:

>>> tree.tolist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RecursionError: maximum recursion depth exceeded while getting the repr of an object

As far as I'm aware, the CPython parser is using the actual C stack
for recursion, and is hence throwing MemoryError because it ran out of
stack space to recurse into, not because it ran out of memory in
general (RecursionError would be a more accurate exception).
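
To see where a given interpreter stops accepting this kind of nesting, the 
limit can be probed instead of guessed. A rough sketch, assuming Python 3 and 
that acceptance is monotonic in depth (a shallower expression compiles 
whenever a deeper one does); the helper names are made up for illustration, 
and the result differs between versions and builds:

```python
def nested_source(depth):
    """`depth` nested A([...]) calls wrapped around A(None)."""
    return "A([" * depth + "A(None)" + "])" * depth


def compiles(depth):
    """True if the interpreter accepts the expression at this depth."""
    try:
        compile(nested_source(depth), "<probe>", "eval")
        return True
    except (MemoryError, RecursionError, SyntaxError):
        return False


def max_nesting_depth(limit=1000):
    """Binary search for the largest depth in [0, limit] that compiles."""
    lo, hi = 0, limit          # depth 0 is plain "A(None)" and compiles
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if compiles(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print("deepest accepted nesting:", max_nesting_depth())
```

A code generator could run such a probe on the target interpreter and stay 
well below the reported value, although that still gives no guarantee for 
other interpreter versions.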

Trying your original example in PyPy (which uses a different parser
implementation) suggests you may want to try using that as your
execution target before resorting to switching languages entirely:

 tree2 =
parser.expr("A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])]]))])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])")
 len(tree2.tolist())
5

Alternatively, you could explore mimicking the way that scikit-learn
saves its trained models (which I believe is a variation on "use
pickle", but I've never actually gone and checked for sure).
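
A small sketch of the "use pickle" idea, not of what scikit-learn actually 
does: the flat node table below is an assumed layout chosen so that the 
stored structure stays shallow however deep the decision tree is, and the 
names are hypothetical:

```python
import pickle

# Assumed layout: one tuple per node, children referenced by list index.
#   ("test", predicate_name, index_if_true, index_if_false)
#   ("leaf", label)
MODEL = [
    ("test", "is_error", 1, 2),
    ("leaf", "alert"),
    ("leaf", "ignore"),
]

# Protocol 2 can be read by both Python 2.7 and Python 3.
blob = pickle.dumps(MODEL, protocol=2)

# On the sensor node: restore the table from the received bytes.
restored = pickle.loads(blob)
assert restored == MODEL
```

Note that unpickling can execute arbitrary code, so the sensor nodes would 
have to trust the channel from the central instance just as they do with 
generated Python source.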

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Antoine Pitrou


The OP says "crash" (implying some kind of segfault) but here the
snippet raises a mere exception:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])
s_push: parser stack overflow
MemoryError
>>> 

Regards

Antoine.



Re: [Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Guido van Rossum
I consider this a bug -- a violation of Python's (informal) promise to
the user that when CPython segfaults it is not the user's fault.

Given typical Python usage patterns, I don't consider this an important
bug, but maybe someone is interested in trying to fix it.

As far as your application is concerned, I'm not sure that generating code
like that is the right approach. Why don't you generate a data structure
and a little engine that walks the data structure?
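
One way such a data structure and engine might look, as a sketch only: the 
node layout, the predicate names and the record fields are all invented for 
illustration, with the predicates standing in for the custom decision 
elements mentioned in the original post.

```python
# Hypothetical predicates standing in for the custom decision elements.
PREDICATES = {
    "is_error": lambda rec: rec.get("level") == "ERROR",
    "from_root": lambda rec: rec.get("user") == "root",
}

# Each node is ("test", predicate_name, index_if_true, index_if_false)
# or ("leaf", label). Nodes refer to each other by list index, so the
# table stays flat regardless of how deep the tree is.
TREE = [
    ("test", "is_error", 1, 2),
    ("test", "from_root", 3, 4),
    ("leaf", "ignore"),
    ("leaf", "alert"),
    ("leaf", "log"),
]


def evaluate(tree, record, predicates=PREDICATES):
    """Walk the table iteratively from node 0 until a leaf is reached."""
    index = 0
    while True:
        node = tree[index]
        if node[0] == "leaf":
            return node[1]
        _, name, if_true, if_false = node
        index = if_true if predicates[name](record) else if_false


print(evaluate(TREE, {"level": "ERROR", "user": "root"}))
```

Because the walk is a loop rather than nested calls or nested literals, 
neither the parser nor the recursion limit is involved in the tree depth.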


[Python-ideas] Correct way for writing Python code without causing interpreter crashes due to parser stack overflow

2018-06-27 Thread Fiedler Roman
Hello List,

Context: we are conducting machine learning experiments that generate some kind 
of nested decision trees. As the tree includes specific decision elements 
(which require custom code to evaluate), we decided to store the decision tree 
(result of the analysis) as generated Python code. Thus the decision tree can 
be transferred to sensor nodes (detectors) that will then filter data according 
to the decision tree when executing the given code.

Tracking down a crash when executing that generated code, we arrived at the 
following simplified reproducer, which causes the interpreter to crash (on 
Python 2/3) when loading the code, before execution starts:

#!/usr/bin/python2 -BEsStt
A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A([A(None)])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])])

The error message is:

s_push: parser stack overflow
MemoryError

Despite the machine having 16GB of RAM, the code cannot be loaded. Splitting it 
into two lines using an intermediate variable is the current workaround to get 
it running, after manual adaptation.
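
The workaround could also be applied by the generator itself, emitting one 
assignment per nesting level. A minimal sketch, assuming Python 3; 
`flat_source`, the variable name `_t` and the stand-in class `A` are 
hypothetical and only illustrate the idea:

```python
def flat_source(depth, var="_t"):
    """Emit code building the same value as `depth` nested A([...]) calls,
    but with one statement per level instead of one deep expression."""
    lines = ["%s = A(None)" % var]
    lines += ["%s = A([%s])" % (var, var)] * depth
    return "\n".join(lines) + "\n"


class A(object):
    """Stand-in for the generated node type: stores its argument."""
    def __init__(self, children):
        self.children = children


# 1000 levels, far beyond the depth that failed as a single expression.
namespace = {"A": A}
exec(compile(flat_source(1000), "<generated>", "exec"), namespace)

# Count the levels of the resulting object to confirm the structure.
node, depth = namespace["_t"], 0
while node.children is not None:
    node = node.children[0]
    depth += 1
print("nesting depth of built object:", depth)
```

Each emitted statement has a nesting depth of one, so the parser limit no 
longer depends on the size of the model.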

As discussed on the Python security list, crashes when loading such decision 
trees or mathematical formulas (see bug report [1]) should not be a security 
problem. Even though it is not directly covered in the Python security model 
documentation [2], this case comes too close to "arbitrary code execution", 
where Python does not attempt to provide any protection. There might be only 
some border cases of affected software, e.g. Python sandbox systems like 
Zope/Plone or maybe even Python-based smart contract blockchains like 
Ethereum (I do not know if or where they use, or derive work from, the default 
Python interpreter). But in both cases they would also be too close to 
violating the security model, thus no changes to Python are required from this 
side. Thus Python security suggested that the discussion should be continued on 
this list.


Even when no security problem is involved, the crash is still quite an 
annoyance. Development of code generators can be a tedious task. It is then 
rather frustrating when your generated code is not accepted by the 
interpreter, even though you do not feel like you are getting close to any 
system-relevant limits, e.g. 50 elements in a line like above on a 16GB 
machine. You may adapt the generator, but as the error does not say which 
limit you actually violated (number of brackets, function calls, list 
definitions?), you can only experiment or read the Python compiler code to 
figure that out. Even when you fix it, you have no guarantee that you will not 
hit some other obscure limit the next day, or that those limits will not 
change from one Python minor version to the next, causing regressions.

Questions:

* Do you deem it possible/sensible to even attempt to write a Python language 
code generator that produces non-malicious, syntactically valid decision 
tree code/mathematical formulas and still has a sufficiently high probability 
that the Python interpreter will run that code, now and in the near future 
(regressions)?

* Assuming yes to the question above, when generating code, what is the 
maximal nesting depth that a code generator can always expect to compile on 
Python 2.7 and on 3.5 onwards? Are there any other similar restrictions that 
need to be considered by the code generator? Or is generating code that way 
not the preferred solution anyway, and the code generator should e.g. generate 
binary Python code immediately? Note: in the end exactly the same logic will 
run in the Python process; it seems to be only about how it is loaded into the 
Python interpreter.

* If not possible/recommended/sensible, we might generate Java bytecode or 
native x86 code instead, where the likelihood of the (virtual) CPU really 
executing code that complies with the language specification (even with CPU 
errata like the FDIV bug et al.) might be orders of magnitude higher than with 
the Python interpreter.

Any feedback appreciated!

Roman

[1] https://bugs.python.org/issue3971
[2] http://python-security.readthedocs.io/security.html#security-model