In answer to your questions about the Equivalent-Packages process:

1) You are right that the tool can get confused when a package contains
little source, or when most of its files are common ones such as
readme/todo/makefile. One thing I could do is exclude source files which
are excessively common. I do that in other tools where I use package
similarity, but I tried to keep this package-equivalence tool as simple as
possible. It is not difficult to implement, and it should reduce some of
the false positives.
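
A minimal sketch of that exclusion, in Python. It assumes each package is
represented as a set of file identifiers (names or content hashes) and that
similarity is a Jaccard score over those sets. Both are assumptions made
for illustration, as are the function names and the 5% cutoff; none of it
is taken from the tool itself.

```python
from collections import Counter


def drop_common_files(packages, max_fraction=0.05):
    """Remove files that occur in more than max_fraction of all packages.

    packages maps a package name to a set of file identifiers. Whether the
    identifiers are file names or content hashes is left open here.
    """
    doc_freq = Counter()
    for files in packages.values():
        # Each package contributes at most one count per file.
        doc_freq.update(files)
    limit = max_fraction * len(packages)
    return {
        name: {f for f in files if doc_freq[f] <= limit}
        for name, files in packages.items()
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With a filter like this, two small packages that share only a README and a
Makefile would no longer score as similar, at the cost of one extra pass
over the repository to count file frequencies.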

2) When there are multiple possible matches, I simply choose the package
with the highest similarity. One thing that I did was to run the tool
against one repo only (instead of, say, the Fedora repo against the Debian
repo) to find near-duplicate packages. So far I have done this only for
Debian:
https://github.com/silviocesare/Equivalent-Packages/blob/master/Clusters/Debian5Cluster

The first entry in the list is the base package; the remaining entries are
the near-duplicate packages and their similarities to the base package. An
example from the Debian repo:

 libxml-um-perl libapp-control-perl:0.846154 libcrypt-des-ede3-perl:0.846154
libdata-buffer-perl:0.846154 libdb-file-lock-perl:0.846154
libio-tee-perl:0.846154 liblingua-preferred-perl:0.846154
liblingua-pt-stemmer-perl:0.846154 liblog-tracemessages-perl:0.846154
libpdf-reuse-barcode-perl:0.846154 libsort-fields-perl:0.846154
libtemplate-plugin-calendar-simple-perl:0.846154
libxml-filter-detectws-perl:0.846154 libxml-filter-saxt-perl:0.846154
libxml-handler-printevents-perl:0.846154 libxml-handler-trees-perl:0.846154
libxml-regexp-perl:0.846154
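
The entry format described above could be read with something like the
following Python sketch. It assumes that each entry occupies a single line
in the cluster file (the wrapping above comes from the mail) and that
fields are separated by whitespace. The function name is hypothetical.

```python
def parse_cluster_line(line):
    """Split one cluster entry into (base, [(package, similarity), ...]).

    Returns (None, []) for a blank line.
    """
    fields = line.split()
    if not fields:
        return None, []
    base, rest = fields[0], fields[1:]
    near_duplicates = []
    for field in rest:
        # rpartition splits on the last ':' only.
        name, _, score = field.rpartition(":")
        near_duplicates.append((name, float(score)))
    return base, near_duplicates
```

The number of near duplicates per base package then falls out as
len(near_duplicates), which is 16 for the libxml-um-perl entry above.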

One possible method of reducing false positives is to ignore packages which
are equivalent to more than one other package. Alternatively, such cases
could require human intervention.

3) I did some trivial testing of unexpected matches. In fact, one thing I
looked at was cases where the same package name was in both Fedora and
Debian but the similarity was so low that the packages didn't match.
Surprisingly, a not insignificant number of packages were like this, and
manual verification of the ones I looked at showed that they were different
packages. This demonstrates that if you base equivalence on names only,
you will get false positives.

I could add heuristics based on the package name to request human
intervention, i.e. if two packages are found similar but their names do not
have 50% overlap, then request human verification. I am not sure how useful
this will be because, from experience, package names can sometimes be
problematic.
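
A rough Python sketch of such a name heuristic. How "overlap" should be
measured is not settled above; this version assumes the ratio from the
standard library's difflib.SequenceMatcher on lower-cased names. The 0.8
match threshold and the function names are hypothetical; only the 50%
figure comes from the paragraph above.

```python
from difflib import SequenceMatcher


def name_overlap(a, b):
    """Return a 0..1 similarity score between two package names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def needs_human_review(name_a, name_b, similarity,
                       match_threshold=0.8, min_name_overlap=0.5):
    """Flag a source-level match whose package names look unrelated.

    similarity is the source similarity already computed by the tool.
    Pairs below match_threshold are not matches, so they are never flagged.
    """
    if similarity < match_threshold:
        return False
    return name_overlap(name_a, name_b) < min_name_overlap
```

A character-based ratio like this would also flag legitimate pairs whose
names differ only by distribution naming conventions, which is one way
that package names can be problematic for this kind of check.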

--
Silvio

On Tue, Feb 1, 2011 at 1:13 AM, Tomas Hoger <tho...@redhat.com> wrote:

> Hi Silvio!
>
> On Mon, 31 Jan 2011 19:21:39 +1100 Silvio Cesare wrote:
>
> > Debian maintain a list of CPE information for packages on their
> > security tracker
> > http://svn.debian.org/wsvn/secure-testing/data/CPE/list
>
> We currently do not use CPE names for security tracking in Fedora, so
> I don't see an obvious benefit maintaining such list.  Can you explain
> briefly how you use it for Debian security tracking and what benefits
> it brings?
>
> > This makes it relatively static except when packages are added or
> > removed from the repository.
>
> It's not that uncommon to see new packages added to Fedora repositories
> even after the release of some Fedora version.
>
> > In the past I generated an automatic mapping between packages in
> > Debian and Fedora
> >
> https://github.com/silviocesare/Equivalent-Packages/blob/master/NearestNeighbour/Debian5_Fedora13_Matches
>
> I played a little more with this list and noticed few problems:
> - quite a few Debian packages map to Fedora arptools or binclock.
>  Probably packages with not much sources, where other file (license,
>  configure) confuse your tool to match unrelated packages
> - there does not seem to be a good way to list cases where multiple
>  components contain the same sources.  In Fedora, mingw32-* packages
>  are a good example, and the list often maps Debian package foo to
>  Fedora package mingw32-foo, while there is Fedora package foo that
>  should be similarly good match.  Another example is
>  zlib:arm-gp2x-linux-zlib.
>
> Did you review "unexpected matches" to see if the sources are really
> similar, and how the match is picked when there are multiple "good
> candidates"?
>
> --
> Tomas Hoger / Red Hat Security Response Team
>
--
security mailing list
security@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/security
