Hi,

I would like to comment from the sidelines. At the Haiku project we run one of these huge Git repositories, and for us the trac-git plugin is currently unusable. I have done some work and research to gain insight into why this is, and I have found that the main issue is that the information Trac wants to display simply cannot be retrieved efficiently, due to the way the Git data model is designed.
On Tue, Jun 5, 2012 at 3:10 AM, Peter Stuge <pe...@stuge.se> wrote:
>>> The get_youngest_rev(), previous_rev() and child_revs() Repository
>>> operations are very very expensive; they require traversing the
>>> entire commit graph!
>>
>> That was pretty much my point in our little discussion in #10594.
>> I'm not convinced that simply switching the way we use git (command-line
>> vs. library bindings) will be enough to address the performance issues;
>
> You know that fork and exec are expensive, right? It's not cool to
> have web software work like this in 2012. It was halfway OK in 1995,
> but not anymore. :)
>
> Also, "simply" is downplaying PyGIT. Because it uses git commands it
> remains limited to what git commands can do, where the Git data model
> isn't very well exposed, nor in a manner which is very useful. It's
> all sorts of wrong.

I can concur. In my own experiments [1] I have tried to use Dulwich [2], a pure-Python implementation of the Git data model and of part of the git command set.

The biggest problem is file history. For example, even the simple tree view that the source browser has now already poses problems: to find the revision in which a file was last changed, you need to traverse the commits, starting from the tip, until you reach the commit that last touched that file. In the Haiku repository we have one file that was last changed about 30,000 commits ago. This means that in an uncached universe the server has to walk tens of thousands of commits for each and every file shown in the source browser! (A rough sketch of what this traversal looks like is further down in this mail.)

>> it's rather rethinking carefully how to access the information, when
>> and what to cache, etc.
>
> I very much welcome this effort to review the Trac Repository data
> model! But I have zero expectation that it will result in substantial
> code changes within say the next few months.
>
> Native repo access is on the other hand a bite sized problem, and
> will certainly have noticeable impact on performance. Hopefully you
> and others will turn the Repository interface inside out in parallel
> with pygit2 work, to get even more out of Trac in the end! :)

While I agree that improving the responsiveness of the back-end might improve the situation for small and mid-sized repositories, it is not a sustainable solution for the future. After all, small repositories (hope to) grow big!

One part of the solution is to improve caching, and there are two ways to do this. In trac-dulwich I experimented with a full cache, which means reproducing the structure of the repository in a form from which Trac can easily get the information it needs. While I never got to a complete solution (I did not get as far as caching the relations between file revisions), I am starting to think this is not a good idea in the end: merely caching all file and directory revisions in the Haiku repository already gives me an SQLite database of 192 MB. Imagine what it would look like once all the relations between file revisions are stored in there as well.

I am starting to see more potential in an alternative kind of caching, at a higher level. I looked at cgit as an example [3][4]. cgit is extreme in its caching: it basically caches the HTML output, which I doubt is acceptable for Trac itself. However, I do see potential for the versioncontrol module to cache the data structures that have been requested. That way, whenever a request is repeated and the underlying data model has not changed, the cached data can be returned instead.
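To make that idea a bit more concrete, here is a minimal, hypothetical sketch (not code from Trac or trac-dulwich; RequestCache and build_path_history are made-up names) of caching a requested data structure keyed on the tip commit it was computed against. Because commits are content-addressed, an unchanged tip means the reachable history is unchanged, so the cached result is still valid:

    # Hypothetical sketch: cache a computed structure per request key, keyed
    # on the tip commit it was computed against. If the tip is unchanged,
    # the repository cannot have changed underneath us, so reuse the result.
    class RequestCache(object):
        def __init__(self):
            self._store = {}

        def get(self, key, tip_sha, build):
            cached = self._store.get(key)
            if cached is not None and cached[0] == tip_sha:
                return cached[1]          # repository unchanged: reuse
            value = build()               # expensive: walks the commit graph
            self._store[key] = (tip_sha, value)
            return value

    # Usage (build_path_history is a made-up stand-in for the expensive call):
    # cache = RequestCache()
    # history = cache.get(('path_history', path), repo.head(),
    #                     lambda: build_path_history(repo, path))

Eviction, persistence across requests and invalidation when history is rewritten are of course the hard parts, but the nice property is that the tip commit acts as a cheap fingerprint of everything reachable from it.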
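And coming back to the file-history problem mentioned above, this is roughly the traversal that is needed. A sketch only, not the actual trac-dulwich code: it assumes Dulwich's Repo and tree_lookup_path, a repository-relative path (bytes on newer Dulwich versions), and it follows first parents only:

    from dulwich.repo import Repo
    from dulwich.object_store import tree_lookup_path

    def last_change(repo_path, path):
        """Return the id of the commit that last touched path (sketch)."""
        repo = Repo(repo_path)

        def resolve(commit):
            # (mode, blob sha) the path points to in this commit, or None
            # if the path does not exist there.
            try:
                return tree_lookup_path(repo.get_object, commit.tree, path)
            except KeyError:
                return None

        commit = repo[repo.head()]
        current = resolve(commit)
        # Walk first parents only; a real implementation also has to deal
        # with merges, which makes it even more expensive.
        while commit.parents:
            parent = repo[commit.parents[0]]
            if resolve(parent) != current:
                return commit.id     # changed (or added) in this commit
            commit = parent
        return commit.id             # unchanged since the initial commit

For that one Haiku file this loop has to read on the order of 30,000 commit and tree objects before it can answer, and that is what the source browser ends up triggering per file when nothing is cached.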
Another change, at the higher level, would be to have incremental loading for some operations that we know are expensive. For now I can think of only one: showing the history of a path. Basically, load a frame page first, after which the history is loaded incrementally. However, this cannot be a full replacement for caching.

I'm glad I'm not the only one trying to wrap my head around this problem. My next step personally would be to stop with the 'cache everything' strategy (for now) and move on to the more intelligent, higher-level caching. I could use some ideas and input for that.

Regards,
Niels

[1] https://github.com/nielx/trac-dulwich - Note: the caching code is very ugly right now.
[2] http://www.samba.org/~jelmer/dulwich/
[3] http://hjemli.net/git/cgit/
[4] http://cgit.haiku-os.org/