Hi,

I would like to comment from the sidelines. At the Haiku project we
run one of these huge Git repositories, and for us the trac-gitplugin
is currently unusable. I have done some work and research to
understand why. My main finding is that the information Trac wants
to display cannot be retrieved efficiently, because of the way the
Git data model is designed.

On Tue, Jun 5, 2012 at 3:10 AM, Peter Stuge <pe...@stuge.se> wrote:
>>> The get_youngest_rev(), previous_rev() and child_revs() Repository
>>> operations are very very expensive; they require traversing the
>>> entire commit graph!
>>
>> That was pretty much my point in our little discussion in #10594.
>> I'm not convinced that simply switching the way we use git (command-line
>> vs. library bindings) will be enough to address the performance issues;
>
> You know that fork and exec are expensive, right? It's not cool to
> have web software work like this in 2012. It was halfway OK in 1995,
> but not anymore. :)
>
> Also, "simply" is downplaying PyGIT. Because it uses git commands it
> remains limited to what git commands can do, where the Git data model
> isn't very well exposed, nor in a manner which is very useful. It's
> all sorts of wrong.

I can concur. In my own experiments [1] I have tried to use Dulwich
[2], which is a pure-Python implementation of the Git data model and
of part of the git command set.

The biggest problem is with file history. For example, even the simple
tree view that the source browser has now poses problems. This is
because to get the commit in which a file was last changed, you need
to traverse the commits (starting from the top) down to the commit in
which it was last changed. In the Haiku root we have one file that
was last changed about 30,000 commits ago. This means that in an
uncached universe the server walks through those commits again for
each and every file in the source browser!
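
To make the cost concrete, here is a minimal sketch of that walk. It
is not the trac-gitplugin or Dulwich code: the COMMITS dict is a
made-up toy model with a linear history (merges are ignored), and
last_changed() is a hypothetical helper name.

```python
# Toy model (an assumption for illustration): each commit id maps to
# (parent id or None, {path: blob id}).
COMMITS = {
    "c1": (None, {"README": "b1", "Jamfile": "j1"}),
    "c2": ("c1", {"README": "b2", "Jamfile": "j1"}),
    "c3": ("c2", {"README": "b3", "Jamfile": "j1"}),
    "c4": ("c3", {"README": "b4", "Jamfile": "j1"}),
}


def last_changed(head, path):
    """Return (commit id, commits visited) for the newest change to path."""
    visited = 0
    commit = head
    while commit is not None:
        visited += 1
        parent, tree = COMMITS[commit]
        parent_tree = COMMITS[parent][1] if parent is not None else {}
        # The path changed here if its blob differs from the parent's blob.
        if tree.get(path) != parent_tree.get(path):
            return commit, visited
        commit = parent
    return None, visited
```

In this toy history README is found after visiting one commit, while
Jamfile needs a walk over all four. A tree view repeats that walk for
every entry it lists.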

>> it's rather rethinking carefully how to access the information, when
>> and what to cache, etc.
>
> I very much welcome this effort to review the Trac Repository data
> model! But I have zero expectation that it will result in substantial
> code changes within say the next few months.
>
> Native repo access is on the other hand a bite sized problem, and
> will certainly have noticeable impact on performance. Hopefully you
> and others will turn the Repository interface inside out in parallel
> with pygit2 work, to get even more out of Trac in the end! :)

While I agree that improving the responsiveness of the back-end might
improve the situation for small and mid-sized repositories, it is not
a sustainable solution for the future. After all: small repositories
(hope to) grow big!

One part of the solution is to improve caching. There are two ways to
do this. In trac-dulwich I experimented with a full cache, which
means reproducing the structure of the repository in a form from
which Trac can easily get the information it needs. I got nowhere
near a full solution (I did not get to caching the relations between
files), and I think this might not be a good idea in the end. Merely
caching all file and directory revisions in the Haiku repository
already gives me a sqlite database of 192 MB. Imagine that all the
relations between file revisions are also stored in there.
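
For readers who have not looked at [1], a rough sketch of what such a
full cache could look like. This is not the trac-dulwich schema: the
node_change table, the seq column (position of a commit in a linear
history) and the helper names are all assumptions for illustration.

```python
import sqlite3


def make_cache():
    """Create an in-memory cache with one row per (path, change)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE node_change (
            path TEXT NOT NULL,
            seq  INTEGER NOT NULL,  -- larger means newer (assumed ordering)
            rev  TEXT NOT NULL,
            PRIMARY KEY (path, seq)
        )
        """
    )
    return conn


def record_change(conn, path, seq, rev):
    """Store that commit rev (at position seq) changed path."""
    conn.execute(
        "INSERT INTO node_change (path, seq, rev) VALUES (?, ?, ?)",
        (path, seq, rev),
    )


def last_change(conn, path, at_seq):
    """Return the newest rev that changed path at or before at_seq."""
    row = conn.execute(
        "SELECT rev FROM node_change WHERE path = ? AND seq <= ? "
        "ORDER BY seq DESC LIMIT 1",
        (path, at_seq),
    ).fetchone()
    return row[0] if row else None
```

The lookup becomes a single indexed query, but the table gets one row
for every change to every path, which is where the size comes from.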

I am starting to see more potential in an alternative kind of caching,
at a higher level. I looked at cgit as an example [3][4]. cgit is
extreme in its caching: it basically caches the HTML output, which I
doubt is acceptable for Trac itself. However, I do see potential for
the versioncontrol module to cache the data structures that have been
requested. This way, whenever a request is repeated and the
underlying data model has not changed, the cached data can be
served instead of being recomputed.
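
A minimal sketch of that idea, assuming the current head of the
repository is a good enough signal that "the data model did not
change". RequestCache and its methods are hypothetical names, not an
existing Trac API.

```python
class RequestCache:
    """Memoise results per request key, valid for one repository head."""

    def __init__(self, get_head):
        self._get_head = get_head  # callable returning the current head id
        self._head = None
        self._entries = {}
        self.misses = 0  # number of times compute() actually had to run

    def get(self, key, compute):
        head = self._get_head()
        if head != self._head:
            # The repository changed: drop everything cached for the old head.
            self._entries.clear()
            self._head = head
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = compute()
        return self._entries[key]
```

Usage would be something like cache.get(("history", "/src"), walk),
where walk is whatever expensive function produces the data structure.
Throwing away the whole cache on every push is crude, but it can never
serve stale data.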

Another change, at the higher level, would be to have incremental
loading for operations that we know are expensive. For now I can
think of only one: showing a path history. Basically, load a frame
page first, and then load the history incrementally. However,
this cannot be a full replacement for caching.
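
Roughly, the server side of such incremental loading could hand out
the history one page at a time. A sketch, where history_page() is a
hypothetical helper and the history is just any iterable of entries,
newest first:

```python
from itertools import islice


def history_page(history, offset, limit):
    """Return (entries, next_offset) for one page of the history.

    next_offset is None once the history is exhausted.
    """
    # Fetch one entry more than needed to learn whether another page exists.
    page = list(islice(history, offset, offset + limit + 1))
    has_more = len(page) > limit
    return page[:limit], (offset + limit if has_more else None)
```

The frame page would request offset 0 and keep asking for next_offset
until it gets None. Note that this sketch skips over the first offset
entries on every call, so a real implementation would rather resume
from the last commit it has already seen.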

I'm glad I'm not the only one trying to wrap my head around this
problem. My next step personally would be to stop with the 'cache
everything' strategy (for now) and move on to the more intelligent
higher-level caching. I could use some ideas and input for that.

Regards,

Niels


[1] https://github.com/nielx/trac-dulwich  - Note: the caching code is
very ugly right now.
[2] http://www.samba.org/~jelmer/dulwich/
[3] http://hjemli.net/git/cgit/
[4] http://cgit.haiku-os.org/

-- 
You received this message because you are subscribed to the Google Groups "Trac 
Development" group.
To post to this group, send email to trac-dev@googlegroups.com.
To unsubscribe from this group, send email to 
trac-dev+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/trac-dev?hl=en.