Re: Static Analysis: proposed interchange format (firehose)

2013-01-23 Thread David Malcolm
On Thu, 2013-01-17 at 13:33 +0800, Daniel Veillard wrote:
 On Wed, Jan 16, 2013 at 03:53:56PM -0500, David Malcolm wrote:
  This is a followup to my proposal in
  http://lists.fedoraproject.org/pipermail/devel/2012-December/175232.html
  
  I want a common output format for static analysis tools so that we can
  easily slurp the results from different tools into a database and have a
  common system for managing the results (marking false positives, having
  automated de-duplication, etc).
  
  (I like the name firehose for the overall system since it describes
  the issue we'll have of managing the flood of data).
  
  I came up with an XML format, which I've uploaded code to here:
  https://github.com/fedora-static-analysis/firehose
  
  Does this look sane?  I think that it should be possible to write
 
   okay, taking the question from the XML side, so analysing the
 firehose.rng schemas driving the format. Points and remarks as i go
 through it:

Thanks!

  - the cwe attribute is a number or free form? If a number, add
    an explicit rule to check its type.

I've constrained it to be an integer as of:
https://github.com/fedora-static-analysis/firehose/commit/43a50c6763f718b4c8163b645bf5ce7a328f6efa

(I hope I got my RELAX-NG correct)
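
For consumers that do not validate reports against the RELAX NG schema, the same integer constraint can be checked in code. This is only a minimal sketch: the element name `issue` and the helper `parse_cwe` are hypothetical, and only the `cwe` attribute itself comes from the discussion above.

```python
import re
import xml.etree.ElementTree as ET

# Plain ASCII digits only; anything else is treated as free-form text.
_CWE_RE = re.compile(r"[0-9]+")


def parse_cwe(element):
    """Return the cwe attribute as an int, or None if it is absent.

    Raises ValueError when the attribute is present but not an integer.
    """
    raw = element.get("cwe")
    if raw is None:
        return None
    if _CWE_RE.fullmatch(raw) is None:
        raise ValueError("cwe must be an integer, got %r" % raw)
    return int(raw)


# "issue" is an assumed element name, used here purely for illustration.
good = ET.fromstring('<issue cwe="401"/>')
bad = ET.fromstring('<issue cwe="memory leak"/>')
```

Rejecting free-form values at load time keeps the attribute usable as a database key later on.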


  - the sut content choice is a bit weird: on one side you have text,
    on the other you have rpm. I would still allow a free-form
    description, but in an element at the same level as rpm,
    something like
      <choice>
        <element name="description">
          <text/>
        </element>
        <element name="rpm">
          ...
        </element>
      </choice>
    For the sake of larger usage, i would also make some room for
    Debian, and also expand that to be able to express a given file,
    to give an example allowing extra details there, and make some
    if not all of the attributes optional, for example to be able
    to express independence from, say, the arch:
      <sut>
        <file>/usr/bin/xmllint</file>
        <package type="rpm" name="libxml2" version="2.9.0" release="1.fc17"/>
      </sut>
    so: optional file element, extra type attribute; use package to not
    feel tied to rpm, but use a type attribute to distinguish :-)

Yeah, I hadn't thought out that part of the schema very well.

I've already made it optional, since I'm finding it easier to add during
post-processing.

I'm thinking that there are several cases:
* analysis done of a source rpm
  * name, version, release, build architecture
* what would Debian want?
* analysis done of a tarball or other archive
  * name, url, sha1sum, build architecture
* analysis done of an scm checkout (e.g. from upstream git)
  * kind (git, svn, etc), url
* etc (what am I missing?)

Some possible examples of these:

<sut>
   <source-rpm name="python-ethtool" version="0.7" release="4.fc19"
               build-arch="x86_64"/>
</sut>

<sut>
   <tarball name="python-ethtool-0.7.tar.bz2">
      <hash alg="sha1">d8334fe3e1a9b31c8f94a4e10e516ddea617cfd2</hash>
   </tarball>
</sut>

<sut>
   <checkout scm="git"
             url="http://git.fedorahosted.org/cgit/python-ethtool.git/tag/?id=v0.7">
   </checkout>
</sut>
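
To illustrate how a consumer might handle these variants uniformly, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the element and attribute names from the examples above, which are proposals rather than a settled schema; the helper name `describe_sut` is hypothetical.

```python
import xml.etree.ElementTree as ET


def describe_sut(xml_text):
    """Return (kind, attributes) for the first child of a <sut> element."""
    sut = ET.fromstring(xml_text)
    if sut.tag != "sut":
        raise ValueError("expected <sut>, got <%s>" % sut.tag)
    children = list(sut)
    if not children:
        # The sut content is optional, so an empty element is acceptable.
        return None, {}
    child = children[0]
    info = dict(child.attrib)
    # A tarball may carry a nested <hash alg="...">digest</hash>.
    hash_elem = child.find("hash")
    if hash_elem is not None:
        info["hash"] = (hash_elem.get("alg"), (hash_elem.text or "").strip())
    return child.tag, info


SRPM = """<sut>
  <source-rpm name="python-ethtool" version="0.7" release="4.fc19"
              build-arch="x86_64"/>
</sut>"""

TARBALL = """<sut>
  <tarball name="python-ethtool-0.7.tar.bz2">
    <hash alg="sha1">d8334fe3e1a9b31c8f94a4e10e516ddea617cfd2</hash>
  </tarball>
</sut>"""

srpm_kind, srpm_info = describe_sut(SRPM)
tarball_kind, tarball_info = describe_sut(TARBALL)
```

Dispatching on the child tag leaves room for further kinds (a Debian source package, say) without changing the caller.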


  - for notes i would separate them
      <notes>
        <note>...</note>
        <note>...</note>
      </notes>
    since they are likely to be entered manually, and you may want to
    track who entered them as you go.

I wasn't very clear in my posting; I'd meant these notes for extra
descriptive data emitted by the static analysis tool, with a vague idea
of a mini markup vocabulary for describing functions, variables, etc.
My cpychecker tool has knowledge about much of the CPython C API, and
knows the URLs for the API docs, so I was hoping to have some way of
providing links to those docs whenever it sees an API call within a
problematic function.


  - I would use where instead of point myself but i understand your
    logic too

There seem to be multiple kinds of location that checkers emit:
* file and line
* file, line and column
* file with range, expressed as a pair of the above (LLVM can emit
  ranges of start line/column to end line/column)
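
One way to model these three kinds of location with a single type is sketched below. The class names `Point` and `Location` and the method `gcc_prefix` are hypothetical and are not taken from the firehose code; the sketch only shows that an optional column and an optional end point cover all three cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Point:
    line: int
    column: Optional[int] = None  # not every checker reports a column


@dataclass(frozen=True)
class Location:
    path: str
    start: Point
    end: Optional[Point] = None  # set only when the checker reports a range

    def gcc_prefix(self):
        """Format the start of the location as 'file:line[:column]'."""
        parts = [self.path, str(self.start.line)]
        if self.start.column is not None:
            parts.append(str(self.start.column))
        return ":".join(parts)


line_only = Location("foo.c", Point(10))
line_col = Location("foo.c", Point(10, 4))
ranged = Location("foo.c", Point(10, 4), Point(12, 1))
```

A range degrades gracefully here: a consumer that only understands single points can use `start` and ignore `end`.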


 Long reply but overall that looks mostly fine from my very narrow POV

Thanks for the review
Dave


-- 
devel mailing list
devel@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/devel

Re: Static Analysis: proposed interchange format (firehose)

2013-01-17 Thread Kamil Dudka
On Wednesday, January 16, 2013 15:53:56 David Malcolm wrote:
 This is a followup to my proposal in
 http://lists.fedoraproject.org/pipermail/devel/2012-December/175232.html
 
 I want a common output format for static analysis tools so that we can
 easily slurp the results from different tools into a database and have a
 common system for managing the results (marking false positives, having
 automated de-duplication, etc).
 
 (I like the name firehose for the overall system since it describes
 the issue we'll have of managing the flood of data).
 
 I came up with an XML format, which I've uploaded code to here:
 https://github.com/fedora-static-analysis/firehose
 
 Does this look sane?  I think that it should be possible to write
 converters that turn the output from other tools into this, and I think
 it's possible to hack up my static analyzers to emit this format.
 
 The firehose.py script is able to turn such an XML report into a text
 format mimicking what GCC emits, which is useful in Emacs (and probably
 other editors) which can parse that text format for clicking through to
 the underlying source code being tested.
 
 Thoughts?

We usually need to maintain more metadata about the scan itself together
with the results: arguments given to the analyzer, date/time the scan 
started/finished, total count of lines processed, hostname, mock config,
etc.

Also if the results are obtained by subtracting the results of an old version
of the package (to report only newly introduced defects), it is good to keep 
metadata of both the scans.  Then you can check that both of them ran with
the same configuration, or prevent reporting newly added defects if the old
build partially failed.
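
The differential check described above could look roughly like the following. This is a sketch under stated assumptions: the `Scan` record, its field names, and the function `new_defects` are hypothetical, and defects are reduced to opaque identifiers for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Scan:
    analyzer_args: Tuple[str, ...]  # arguments given to the analyzer
    mock_config: str                # e.g. the mock chroot used for the build
    build_ok: bool                  # False if the build partially failed
    defects: FrozenSet[str]         # opaque defect identifiers


def new_defects(old, new):
    """Return defects present only in the new scan.

    Raises ValueError if the two scans are not comparable.
    """
    if not old.build_ok:
        raise ValueError("old build partially failed; refusing to report new defects")
    if (old.analyzer_args, old.mock_config) != (new.analyzer_args, new.mock_config):
        raise ValueError("scans ran with different configurations")
    return new.defects - old.defects


old_scan = Scan(("--all",), "fedora-18-x86_64", True, frozenset({"a", "b"}))
new_scan = Scan(("--all",), "fedora-18-x86_64", True, frozenset({"a", "b", "c"}))
```

Keeping the metadata of both scans next to the results is what makes the two guard checks possible.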

Kamil

Static Analysis: proposed interchange format (firehose)

2013-01-16 Thread David Malcolm
This is a followup to my proposal in
http://lists.fedoraproject.org/pipermail/devel/2012-December/175232.html

I want a common output format for static analysis tools so that we can
easily slurp the results from different tools into a database and have a
common system for managing the results (marking false positives, having
automated de-duplication, etc).

(I like the name firehose for the overall system since it describes
the issue we'll have of managing the flood of data).

I came up with an XML format, which I've uploaded code to here:
https://github.com/fedora-static-analysis/firehose

Does this look sane?  I think that it should be possible to write
converters that turn the output from other tools into this, and I think
it's possible to hack up my static analyzers to emit this format.

The firehose.py script is able to turn such an XML report into a text
format mimicking what GCC emits, which is useful in Emacs (and probably
other editors) which can parse that text format for clicking through to
the underlying source code being tested.
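
As an illustration of the converter direction, the first step for a tool that already prints GCC-style diagnostics is to split each line into fields before emitting them in the common format. The sketch below is not taken from firehose.py; the function name `parse_gcc_line` and the dictionary keys are hypothetical, and the pattern only covers the common `file:line[:column]: kind: message` shape.

```python
import re

# Matches lines such as "foo.c:12:5: warning: unused variable 'x'".
GCC_LINE = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?:(?P<column>\d+):)?\s*"
    r"(?P<kind>warning|error|note):\s*(?P<message>.*)$"
)


def parse_gcc_line(text):
    """Split one GCC-style diagnostic line into fields, or return None."""
    m = GCC_LINE.match(text)
    if m is None:
        return None
    column = m.group("column")
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "column": int(column) if column is not None else None,
        "kind": m.group("kind"),
        "message": m.group("message"),
    }


with_column = parse_gcc_line("foo.c:12:5: warning: unused variable 'x'")
without_column = parse_gcc_line("foo.c:12: error: bad thing")
```

Lines that do not match (continuation lines, source excerpts) come back as `None` and can be skipped or attached to the previous diagnostic.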

Thoughts?

BTW, I hope to run a hackfest on Static Analysis in Fedora at FUDCon
Lawrence this weekend.  Anyone around?  [there are plenty of different
tasks requiring different skill sets: Python scripting, web development,
etc - you don't need to know about compiler internals!  though that
would help also :) ]

Dave


Re: Static Analysis: proposed interchange format (firehose)

2013-01-16 Thread Daniel Veillard
On Wed, Jan 16, 2013 at 03:53:56PM -0500, David Malcolm wrote:
 This is a followup to my proposal in
 http://lists.fedoraproject.org/pipermail/devel/2012-December/175232.html
 
 I want a common output format for static analysis tools so that we can
 easily slurp the results from different tools into a database and have a
 common system for managing the results (marking false positives, having
 automated de-duplication, etc).
 
 (I like the name firehose for the overall system since it describes
 the issue we'll have of managing the flood of data).
 
 I came up with an XML format, which I've uploaded code to here:
 https://github.com/fedora-static-analysis/firehose
 
 Does this look sane?  I think that it should be possible to write

  okay, taking the question from the XML side, so analysing the
firehose.rng schemas driving the format. Points and remarks as i go
through it:

 - the cwe attribute is a number or free form? If a number, add
   an explicit rule to check its type.
 - the sut content choice is a bit weird: on one side you have text,
   on the other you have rpm. I would still allow a free-form
   description, but in an element at the same level as rpm,
   something like
     <choice>
       <element name="description">
         <text/>
       </element>
       <element name="rpm">
         ...
       </element>
     </choice>
   For the sake of larger usage, i would also make some room for
   Debian, and also expand that to be able to express a given file,
   to give an example allowing extra details there, and make some
   if not all of the attributes optional, for example to be able
   to express independence from, say, the arch:
     <sut>
       <file>/usr/bin/xmllint</file>
       <package type="rpm" name="libxml2" version="2.9.0" release="1.fc17"/>
     </sut>
   so: optional file element, extra type attribute; use package to not
   feel tied to rpm, but use a type attribute to distinguish :-)

 - for notes i would separate them
     <notes>
       <note>...</note>
       <note>...</note>
     </notes>
   since they are likely to be entered manually, and you may want to
   track who entered them as you go.

 - I would use where instead of point myself but i understand your
   logic too

Long reply but overall that looks mostly fine from my very narrow POV

Daniel


-- 
Daniel Veillard  | Open Source and Standards, Red Hat
veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/