Hi all!

I invite you to try out "Project Ruprecht"[1][2], a tool that measures the
"tangledness" of PHP code, and provides you with a "naughty list" of things to 
fix.

For now, you will have to install this locally. I hope however to soon have this
run automatically against core on a regular basis, perhaps by integrating it
with SonarQube. Maybe some day we can also integrate it with CI, to generate a
warning when a new cyclic dependency is about to be introduced.

So that was the tl;dr. Now for some context, history, and shout-outs. And some
actual real world science, too!

For a while now, I have been talking about the how much of a problem cyclic
dependencies are in MediaWiki core: When two components (classes, namespaces,
libraries, whatever) depend on each other, directly or indirectly, this means
that one cannot be used without the other, nor can it be tested, understood, or
modified without also considering the other. So, in effect, they behave as *one*
component, not two. Applied to MW core, this means that roughly half of our 1600
classes effectively behave like a single giant class. This makes the code rather
hard to deal with.

To fix this, I have been looking for tools that let me identify "tangles" of
classes that depend on each other, and metrics' to measure the progress of
"untangling" the code. However, the classic code quality metrics focus on
"local" properties of the code, so they can't tell us much about the progress of
untangling. And the tools I found that would detect cyclic dependencies in PHP
code would all choke on MediaWiki core: they would try to list all detected
cycles - which, by the super-exponential nature of possible paths through a
graph, would be millions and millions. So, the tools would choke and die. That
approach isn't practical for us.

Two discoveries allowed me to come up with a working solution:
First, I decided to leave the PHP world and turned towards graph analysis tools
built for large data sets. Python's graph-tool did the trick. It's build on top
of boost and numpy, and it's *fast*. It crunched through the 7500 or so class
dependencies in MW core in a split second, and told me that we have 14 "tangles"
(non-trivial strongly connected components), and that 43% of our classes are in
these tangles, with 40% being part of one big tangle that is essentially our
monolith manifest. So now I had a metric to work with: the number of classes in
tangles.

That was great, but still didn't tell me where to start. Graph-tool was still
not fast enough to deal with millions of cycles, and even if it had been, that
data wouldn't be very useful. I needed some smart heuristics. Luckily, I
(totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one evening
at the WMDE office by telling him about this problem. The next day, he told me
that he had been digging into the problem all night, and he had found a paper
that sounded relevant, and it also came with working code: "Breaking Cycles in
Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S.
Parthasarathy. I played with the code a bit, and yes! It spat out a list of 290
or so dependencies[4] that it thought were bad - and I agree for a good number
of them. It's not a clean working list, but it gives a very good idea of where
to start looking.

I find it quite fascinating that this works so well for cleaning up a codebase.
After all, the heuristic wasn't design for this - it was designed for fixing
messy ontologies. Indeed, one of their test data sets was (English language)
Wikipedia's category system! I'd love to see what it does with Wikidata's
subclass hierarchy :)

But I suppose it makes sense - dependencies in software are conceptually a lot
like an ontology, and the same strategies of stratification and abstraction
apply. And the same difficulties, too - it's easy enough to spot a problematic
cycle, but often hard to say where it should be cut. And how to cut it - often,
the solution is not to just remove the dependency, but to introduce a new
abstraction that allows the relationship to exist without a cycle. I'd love to
see the research continue in that direction!

So, a big shout out to the researchers, and to Amir who found the paper!

I hope my ramblings have made you curious to play with Ruprecht, and see what it
has to say about other code bases. There's also another feature to play with
which I haven't discussed here: detection of risky classes using the Page Rank
algorithm. Fun!

Cheers,
Daniel

[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/
[2]
https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-analysis
[3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies
[4] https://phabricator.wikimedia.org/P8513
[5] https://xkcd.com/356/

-- 
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply via email to