Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Daniel Lange
On 13.11.18 at 23:09, Moritz Muehlenhoff wrote:
> The current data structure works very well for us and splitting the files
> has many downsides.

Could you detail what those many downsides are, besides the scripts that
would need to be amended?



External check

2018-11-13 Thread Security Tracker
CVE-2018-14655: TODO: check
CVE-2018-14657: TODO: check
CVE-2018-14658: TODO: check
CVE-2018-15978: RESERVED
CVE-2018-19208: TODO: check
--
The output might be a bit terse, but the above ids are known elsewhere;
check the references in the tracker. The second part indicates the status
of each id in the tracker at the moment the script was run.



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Moritz Muehlenhoff
On Tue, Nov 13, 2018 at 12:22:54PM -0500, Antoine Beaupré wrote:
> But before going through that trouble, I think we'd need to get approval
> from the security team first, as that's quite a lot of work. I figured
> we would make a feasibility study first...

The current data structure works very well for us and splitting the files
has many downsides.

If we can't get the repository to run on Salsa in a manner that doesn't
impact other repositories (e.g. by disabling the repository browser or
similar), then moving the security tracker repository out of Salsa is
the more likely solution.

Did anyone follow Guido's suggestion to report this upstream to
get their assessment on possible optimisations?

Cheers,
Moritz



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Daniel Lange
> The Python job finished successfully here after 10 hours.
6 hours 40 minutes here, as I ported your improved logic to the python2 version :).

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

The tree-filter blows up the .git/objects store to 13 GB, but that's
nothing a git gc can't fix.
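
(To actually reclaim the space, git gc needs the refs/original/ backups
and the old reflog entries out of the way first; roughly the following
sketch, not copied from this run:)

# delete the refs/original/ backups and expire the reflog so the
# pre-rewrite objects become unreachable, then prune them
$ git for-each-ref --format='%(refname)' refs/original/ | \
    xargs -n 1 git update-ref -d
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive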

> 
> I did some tests on the new git repository. Cloning the repository from
> scratch takes around 2 minutes (the original repo: 21 minutes).
Confirmed.

> So that's about it. I have not done a thorough job at checking the
> actual *integrity* of the results. It's difficult, considering CVE
> identifiers are not sequential in the data/CVE/list file, so a naive
> diff like this will fail:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999}) data/CVE/list | diffstat
>  list |106562 +--
>  1 file changed, 53281 insertions(+), 53281 deletions(-)
> 
> But at least the numbers add up: it looks like no lines were lost. And
> indeed, all the CVE identifiers match:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n) <(grep ^CVE data/CVE/list | sort -n) | diffstat
>  0 files changed
> 
> A cursory look at the diff seems to indicate it is clean, however.

I uploaded "my" version to https://people.debian.org/~dlange/
so people can poke at the log and diffs and see whether there are any
issues left.

> I looked at splitting that file per CVE. That did not scale and just
> created new problems. But splitting by *year* seems like a very
> efficient switch, and I think it would be worth pursuing that idea
> further.

The tools in bin/ would need a brush-through, i.e. throw away the
unused ones and amend the ones that operate on data/CVE/* so they
learn about the split files.
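
For many of them the change could be as small as reading the per-year
files as one stream instead of opening data/CVE/list directly. A minimal
sketch (the helper name and glob are illustrative, not existing code):

# hypothetical shared shell helper: emit the whole CVE list, whether it
# has been split per year or is still a single file
cve_list() {
    cat data/CVE/list.[0-9][0-9][0-9][0-9] 2>/dev/null || cat data/CVE/list
}

# example: count CVE entries across all years
cve_list | grep -c '^CVE-'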



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Antoine Beaupré
On 2018-11-12 12:22:58, Antoine Beaupré wrote:
> I'll start a run on the whole history to see if I can find any problems,
> as soon as a first clone finishes resolving those damn deltas. ;)

The Python job finished successfully here after 10 hours.

I did some tests on the new git repository. Cloning the repository from
scratch takes around 2 minutes (the original repo: 21 minutes). It is
145MB while the original repo is 1.6GB.

Running git annotate on data/CVE/list.2018 takes about 26 seconds, while
it takes basically forever to annotate the original data/CVE/list. (It's
been running for 10 minutes here already.)
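
(Concretely, the comparison is something like the following, run once in
the filtered checkout and once in the original:)

$ time git annotate data/CVE/list.2018 > /dev/null   # filtered repo, ~26 seconds
$ time git annotate data/CVE/list > /dev/null        # original repo, >10 minutes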

So that's about it. I have not done a thorough job at checking the
actual *integrity* of the results. It's difficult, considering CVE
identifiers are not sequential in the data/CVE/list file, so a naive
diff like this will fail:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999}) data/CVE/list | diffstat
 list |106562 +--
 1 file changed, 53281 insertions(+), 53281 deletions(-)

But at least the numbers add up: it looks like no lines were lost. And
indeed, all the CVE identifiers match:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n) <(grep ^CVE data/CVE/list | sort -n) | diffstat
 0 files changed

A cursory look at the diff seems to indicate it is clean, however.
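
A cruder sanity check along the same lines is to compare plain line
counts of the concatenated split files against the original; both of
these should print the same number:

$ cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{1999..2019} | wc -l
$ wc -l < data/CVE/list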

I looked at splitting that file per CVE. That did not scale and just
created new problems. But splitting by *year* seems like a very
efficient switch, and I think it would be worth pursuing that idea
further.
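
For the record, the per-year split itself is conceptually simple. A
minimal sketch (not the actual split-by-year script used above, and
assuming every entry starts with an unindented CVE-YYYY-... header
followed by indented continuation lines):

$ awk '/^CVE-[0-9][0-9][0-9][0-9]-/ { year = substr($1, 5, 4) }
       year != "" { print > ("data/CVE/list." year) }' data/CVE/list

Anything that doesn't follow that pattern would need special handling in
a real implementation.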

A.

-- 
There is no cloud, it's just someone else's computer.
   - Chris Watterson