Re: [SC-L] [WEB SECURITY] SATE?

2010-06-09 Thread Jim Manico

Great SATE reply from Vadim Okun:


We have been releasing the real deep data. There have been delays, but there 
are no sinister reasons for the delays.

The results of the 2nd SATE (our report and all data) will be released in June 
(we promised to release them between February and May, but we are late with the 
report).

We released the results of the 1st SATE last summer: our report, the raw tool 
reports, and our analysis of the reports. The data is available (below the list 
of cautions) from
http://samate.nist.gov/SATE2008.html

or a direct link:
http://samate.nist.gov/SATE2008/resources/sate2008.tar.gz

I will answer some specific points in Jim's email below, but first, let me 
describe some limitations of SATE and how we are addressing them. SATE 2008 had 
a number of big limitations, including:

1) We analyzed a non-random subset of tool warnings.
2) Determining correctness of tool warnings turned out to be more complicated 
than a binary true/false decision. Also, determining the relevance of a warning 
to security turned out to be more difficult than we thought.
3) In most cases, we did not match warnings from different tools that refer to 
the same weakness. When we started SATE, we thought that we could match 
warnings by line number and weakness name or CWE id (see the sketch after this 
list). In fact, most weaknesses are more complex - see Section 3.4 of our report.
4) Analysis criteria were applied inconsistently.
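
To make the matching problem concrete, here is a minimal, hypothetical C sketch 
(not SATE code; the struct fields, function name, file name, and CWE example are 
invented) of the naive line-and-CWE matching rule, and why it misses a weakness 
that spans more than one line:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical sketch: treat two warnings as the same weakness if they
     * agree on file, line number, and CWE id. */
    struct warning {
        const char *file;  /* source file the tool points at */
        int line;          /* line number of the reported weakness */
        int cwe;           /* CWE id assigned by the tool, 0 if none */
    };

    static int naive_match(const struct warning *a, const struct warning *b)
    {
        return strcmp(a->file, b->file) == 0
            && a->line == b->line
            && a->cwe  == b->cwe;
    }

    int main(void)
    {
        /* Two tools reporting what is in fact the same SQL injection: one
         * flags the sink, the other flags the tainted source two lines up. */
        struct warning tool_a = { "login.c", 120, 89 };
        struct warning tool_b = { "login.c", 118, 89 };
        printf("naive match: %d\n", naive_match(&tool_a, &tool_b)); /* prints 0 */
        return 0;
    }

Because a single weakness often involves a source, a data-flow path, and a sink, 
two tools reporting the same flaw frequently cite different lines or related but 
different CWE ids, and an exact-match rule like this one fails.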

In our publicly released analysis, we used the confirmed/unconfirmed markings 
instead of true/false markings. We describe the reasons for this in our report 
- Section 4.2, page 29 of
http://samate.nist.gov/docs/NIST_Special_Publication_500-279.pdf

In SATE 2009, we made some improvements, including:

1) We randomly selected a subset of tool warnings for analysis.
2) We also looked at tool warnings that were related to human findings by 
security experts.
3) We used four categories for analysis of correctness: true, true but 
insignificant (for security), false, and unknown (see the sketch after this 
list). It is an improvement, but there are still problems: for example, 
distinguishing true from true but insignificant is often hard.
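
A minimal, hypothetical C sketch of the four verdict categories and a uniform 
random selection of warnings (not the actual SATE tooling; the enum names, 
function, and counts are invented for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical verdict categories for a single tool warning. */
    enum correctness {
        VERDICT_TRUE,                /* genuine, security-relevant weakness */
        VERDICT_TRUE_INSIGNIFICANT,  /* real issue, but insignificant for security */
        VERDICT_FALSE,               /* the tool is wrong */
        VERDICT_UNKNOWN              /* the analysts could not decide */
    };

    /* Reservoir sampling: pick n_sample indices uniformly from 0..n_total-1. */
    static void sample_warnings(int n_total, int n_sample, int *out)
    {
        for (int i = 0; i < n_total; i++) {
            if (i < n_sample) {
                out[i] = i;
            } else {
                int j = rand() % (i + 1);   /* uniform in [0, i] */
                if (j < n_sample)
                    out[j] = i;
            }
        }
    }

    int main(void)
    {
        int chosen[30];
        srand((unsigned) time(NULL));
        sample_warnings(10000, 30, chosen);  /* e.g. 30 of 10,000 warnings; numbers are made up */
        for (int i = 0; i < 30; i++)
            printf("analyze warning #%d\n", chosen[i]);
        return 0;
    }

Reservoir sampling is just one simple way to draw an unbiased sample; any 
uniform selection would address the non-random-subset limitation from 2008.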

   

1) false positive rates from these tools are overwhelming
 

First, defining a false positive is tough.  Also, in SATE 2008 the criteria 
that we used for analysis of correctness were inconsistent, we did not analyze 
a random sample of warnings, and our analysis had errors. Steve gave a good 
example in his reply. We corrected some of these problems in 2009, but there is 
still a way to go.

   

2) the workload to triage results from ONE of these tools was man-years
 

We are not the developers of the test cases, so our knowledge of the test case 
code is very limited. Also, we used the tools differently from how they are used 
in practice: we analyzed tool warnings for correctness and looked for related 
warnings from other tools, whereas developers use tools to determine what 
changes need to be made to software, and auditors look for evidence of assurance.

   

3) by every possible measurement, manual review was more cost effective
 

As Steve said, SATE did not consider cost. In SATE 2009, we had security 
contractors analyze two of the test cases and report the most important 
security weaknesses. We then looked at tool warnings that reported the same (or 
related) weaknesses. This will be released as part of the 2009 release. (The 
data set is too small for statistical conclusions.)

A big limitation of SATE has been the lack of ground truth about which security 
weaknesses are really in the test cases. This determination is hard for reasonably 
large software. We are trying to address it with manual analysis by security 
contractors and with "CVE-selected" test cases.

   

the NIST team chose only a small percentage of the automated findings to review
 

A small percentage by itself should not be a problem if the selection of tool 
warnings is done correctly (it was not done correctly in SATE 2008).

Vadim


From: Jim Manico [...@manico.net]
Sent: Thursday, May 27, 2010 5:31 PM
To: 'Webappsec Group'
Subject: [WEB SECURITY] SATE?

I feel that NIST made a few errors in the first 2 SATE studies.

After the second round of SATE, the results were never fully released to
the public - even though NIST agreed to do just that at the inception of
the contest. I do not understand why SATE censored the final results - I
feel such censorship hurts the industry.

And even worse, I felt that vendor pressure encouraged NIST not to
release the final results. If the results (the real deep data, not the
executive summary that NIST released) were favorable to the tool vendors,
I bet they would have welcomed the release of the real data. But
instead, vendor pressure caused NIST to block the release of the final
data set.

The problems that the data would have revealed are:

1) false positive rates from these tools are overwhelming
2) the workload to triage results from ONE of these tools was man-years
3) by every possible measurement, manual review was more cost effective

Re: [SC-L] [WEB SECURITY] SATE?

2010-06-09 Thread Jim Manico

Fantastic SATE reply from Steven M. Christey:


I participated in SATE 2008 and SATE 2009, much more actively in the 
2008 effort.  I'm not completely sure of the 2009 results and final 
publication, as I've been otherwise occupied lately :-/ Looks like a 
final report has been delayed till June (the SATE 2008 report didn't 
get published till July 2009).


For SATE 2008, we did not release final results because the human 
analysis itself had too many false positives - so sometimes we claimed 
a false positive when, in fact, the issue was a true positive.  Given 
this and other data-quality problems (e.g. we only covered ~12% of the 
more than 49,000 items), we believed that releasing the raw data 
would make it way too easy for people to draw completely wrong 
conclusions about the tools.



The problems that the data would have revealed are:

1) false positive rates from these tools are overwhelming


As covered extensively in the 2008 SATE report (see my section, for 
example), there is no clear definition of "false positive", especially 
when it comes to proving that a specific finding is a vulnerability.


For example: suppose a tool reports a buffer overflow in a function. To 
prove the finding is a vulnerability, you have to dig back through all 
the data flow, sometimes going 20 levels deep.  It is often not feasible 
for a human evaluator to determine whether there's really a 
vulnerability.  Or maybe the overflow happens when you're reading a 
configuration file that's only under the control of the 
administrator.  These could be regarded as false positives.  However, 
the finding may be "locally true" - i.e. the function itself might not 
do any validation at all, so *if* it's called incorrectly, an overflow 
will occur (see the sketch below).  My suspicion is that a lot of the 
"false positives" people complain about are actually "locally true." 
And, as we saw in SATE 2008 (and 2009, I suspect), sometimes the human 
evaluator is actually wrong and the finding is correct.  Hopefully we'll 
account for "locally true" in the design of SATE 2010.


2) the workload to triage results from ONE of these tools was man-years


This was also covered (albeit estimated) in the 2008 SATE report, in both 
the original section and my section.



3) by every possible measurement, manual review was more cost effective


There was no consideration of cost in this sense.

One lost opportunity for SATE 2008, however, was comparing the 
results from the manual-review participants (e.g. Aspect) with the 
tools in terms of what kinds of problems got reported.  (This also had 
major implications for how to count the number of results.)  I believe 
that such a focused effort would have shown some differences in what 
got reported. At least that information is in the raw data, since it 
shows who claimed what got found.


While the SATE 2008 report is quite long mostly thanks to my excessive 
verbiage, I believe people who read that document will see that SATE 
has been steadily improving its design over the years.  The reality is 
that any study of this type is going to suffer from limited manpower 
in evaluating the results.


http://samate.nist.gov/docs/NIST_Special_Publication_500-279.pdf

The coverage was limited ONLY to injection and data flow problems 
that tools have a chance of finding. In fact, the NIST team chose 
only a small percentage of the automated findings to review, since it 
would have taken years to review everything due to the massive number 
of false positives. Get the problem here?


While there were focused efforts on various types of issues, there was 
also random sampling to get some exposure to the wide range of problems 
being reported by the tools.  Your critique of SATE with respect to 
its focus on tools versus manual methods is understandable, but SATE 
(and its parent SAMATE project) is really about understanding tools, 
so this focus should not be a surprise.  After all, the first three 
letters of SATE expand to "Static Analysis Tool."


- Steve


___
Secure Coding mailing list (SC-L) SC-L@securecoding.org
List information, subscriptions, etc - http://krvw.com/mailman/listinfo/sc-l
List charter available at - http://www.securecoding.org/list/charter.php
SC-L is hosted and moderated by KRvW Associates, LLC (http://www.KRvW.com)
as a free, non-commercial service to the software security community.
Follow KRvW Associates on Twitter at: http://twitter.com/KRvW_Associates