Re: is there a fast web-interface to git for huge repos?
On 07.06.2013 22:21, Constantine A. Murenin wrote:
> I'm totally fine with daily updates; but I think there still has to be
> some better way of doing this than wasting 0.5s of CPU time and 5s of
> HDD time (if completely cold) for each blame / log, at the price of
> more storage and some pre-caching, and (daily, in my use-case)
> fine-grained incremental updates.

To get a feel for the numbers: I would guess 'git blame' is mostly run
against the newest version and the release version of a file, right? I
couldn't find the number of files in BSD, so let's take Linux instead:
that is 25k files for version 2.6.27. Let's say 35k files altogether for
both the release and newer versions of the files.

A typical page of git blame output on GitHub seems to be in the vicinity
of 500 kbytes, but that seems to include lots of overhead for comfort
functions. At least that means it is a good upper-bound value. 35k files
times 500k gives 17.5 Gbytes, a trivial value for a static *disk* based
cache. It is also a manageable value for affordable SSDs.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
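The back-of-the-envelope number above can be sanity-checked in a line of shell; the 35k files and 500 kB per page are the message's rough guesses, not measurements:

```shell
# Rough cache sizing, using the guesses from the message above.
files=35000           # release + newest copies of each file
page_bytes=500000     # ~500 kB of rendered blame HTML per page
total_bytes=$((files * page_bytes))
echo "$total_bytes bytes"   # -> 17500000000 bytes, i.e. 17.5 GB
```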
Re: is there a fast web-interface to git for huge repos?
On 7 June 2013 13:13, Charles McGarvey wrote:
> On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>>> That's a one-time penalty. Why would that be a problem? And why is
>>> wget even mentioned? Did we misunderstand each other?
>>
>> `wget` or `curl --head` would be used to trigger the caching.
>>
>> I don't understand how it's a one-time penalty. No one wants to look
>> at an old copy of the repository. So if, say, I want to have a gitweb
>> of all 4 BSDs, updated daily, then even with lots of RAM (e.g. to
>> eliminate the cold-case 5s penalty, and reduce each page to 0.5s), on
>> a quad-core box I'd be kinda lucky to complete a generation of all
>> the pages within 12h or so, obviously using the machine at, or above,
>> 50% capacity just for the caching. Or several days or even a couple
>> of weeks on an Intel Atom or VIA Nano with 2GB of RAM or so.
>> Obviously not acceptable; there has to be a better solution.
>>
>> One could, I guess, regenerate only the pages which have changed, but
>> it still sounds like an ugly solution, where you'd have to generate a
>> list of files that have changed between one generation and the next,
>> and you'd still have very high CPU, cache and storage requirements.
>
> Have you already ruled out caching on a proxy? Pages would only be
> generated on demand, so the first visitor would still experience the
> delay, but the rest would be fast until the page expires. Even
> expiring pages as often as every five minutes or less would probably
> provide significant processing savings (depending on how many users
> you have), and that level of staleness and the occasional delays may
> be acceptable to your users.
>
> As you say, generating the entire cache upfront and continuously is
> wasteful and probably unrealistic, but any type of caching, by
> definition, is going to involve users seeing stale content, and I
> don't see that you have any other option but some type of caching.
> Well, you could reproduce what git does in a bunch of distributed
> algorithms and run your app on a farm -- which, I guess, is probably
> what GitHub is doing -- but throwing up a caching reverse proxy is a
> lot quicker if you can accept the caveats.

I don't think GitHub / Gitorious / whatever have solved this problem at
all. They're terribly slow on big repos; some pages don't even generate
the first time you click on the link.

I'm totally fine with daily updates; but I think there still has to be
some better way of doing this than wasting 0.5s of CPU time and 5s of
HDD time (if completely cold) for each blame / log, at the price of more
storage and some pre-caching, and (daily, in my use-case) fine-grained
incremental updates.

C.
Re: is there a fast web-interface to git for huge repos?
On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>> That's a one-time penalty. Why would that be a problem? And why is
>> wget even mentioned? Did we misunderstand each other?
>
> `wget` or `curl --head` would be used to trigger the caching.
>
> I don't understand how it's a one-time penalty. No one wants to look
> at an old copy of the repository. So if, say, I want to have a gitweb
> of all 4 BSDs, updated daily, then even with lots of RAM (e.g. to
> eliminate the cold-case 5s penalty, and reduce each page to 0.5s), on
> a quad-core box I'd be kinda lucky to complete a generation of all the
> pages within 12h or so, obviously using the machine at, or above, 50%
> capacity just for the caching. Or several days or even a couple of
> weeks on an Intel Atom or VIA Nano with 2GB of RAM or so. Obviously
> not acceptable; there has to be a better solution.
>
> One could, I guess, regenerate only the pages which have changed, but
> it still sounds like an ugly solution, where you'd have to generate a
> list of files that have changed between one generation and the next,
> and you'd still have very high CPU, cache and storage requirements.

Have you already ruled out caching on a proxy? Pages would only be
generated on demand, so the first visitor would still experience the
delay, but the rest would be fast until the page expires. Even expiring
pages as often as every five minutes or less would probably provide
significant processing savings (depending on how many users you have),
and that level of staleness and the occasional delays may be acceptable
to your users.

As you say, generating the entire cache upfront and continuously is
wasteful and probably unrealistic, but any type of caching, by
definition, is going to involve users seeing stale content, and I don't
see that you have any other option but some type of caching. Well, you
could reproduce what git does in a bunch of distributed algorithms and
run your app on a farm -- which, I guess, is probably what GitHub is
doing -- but throwing up a caching reverse proxy is a lot quicker if you
can accept the caveats.

--
Charles McGarvey
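The caching reverse proxy described above could be sketched with nginx's proxy cache. Everything here (backend port, cache path, the 5-minute validity) is an illustrative assumption, not a setup anyone in the thread describes:

```nginx
# Hypothetical front cache for a slow gitweb/cgit backend.
proxy_cache_path /var/cache/nginx/gitweb levels=1:2
                 keys_zone=gitweb:50m max_size=20g inactive=1d;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;   # the slow git web frontend
        proxy_cache gitweb;
        proxy_cache_valid 200 5m;           # pages go stale after 5 minutes
        proxy_cache_lock on;                # collapse concurrent cold misses
        proxy_cache_use_stale updating;     # serve stale while one request refreshes
    }
}
```

With `proxy_cache_lock` and `proxy_cache_use_stale updating`, roughly one visitor per expiry window pays the 5s blame/log cost; everyone else gets the cached copy.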
Re: is there a fast web-interface to git for huge repos?
On 7 June 2013 10:57, Fredrik Gustafsson wrote:
> On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
>> On 6 June 2013 23:33, Fredrik Gustafsson wrote:
>> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> >> I'm interested in running a web interface to this and other similar
>> >> git repositories (FreeBSD and NetBSD git repositories are even
>> >> much, much bigger).
>> >>
>> >> Software-wise, is there no way to make cold access for git-log and
>> >> git-blame orders of magnitude less than ~5s, and warm access less
>> >> than ~0.5s?
>> >
>> > The obvious way would be to cache the results. You can even put an
>>
>> That would do nothing to prevent slowness of the cold requests, which
>> already run for 5s when completely cold.
>>
>> In fact, unless done right, it would actually slow things down, as
>> lines would not necessarily show up as they're ready.
>
> You need to cache this _before_ the web-request. Don't let the
> web-request trigger a cache update, but a git push to the repository.
>
>> > update-cache hook on the git repositories to make the cache always
>> > be up to date.
>>
>> That's entirely inefficient. It'll probably take hours or days to
>> pre-cache all the HTML pages with a naive wget and the list of all
>> the files. Not a solution at all.
>>
>> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of CPU
>> time for blame/log)
>
> That's a one-time penalty. Why would that be a problem? And why is
> wget even mentioned? Did we misunderstand each other?

`wget` or `curl --head` would be used to trigger the caching.

I don't understand how it's a one-time penalty. No one wants to look at
an old copy of the repository. So if, say, I want to have a gitweb of
all 4 BSDs, updated daily, then even with lots of RAM (e.g. to eliminate
the cold-case 5s penalty, and reduce each page to 0.5s), on a quad-core
box I'd be kinda lucky to complete a generation of all the pages within
12h or so, obviously using the machine at, or above, 50% capacity just
for the caching. Or several days or even a couple of weeks on an Intel
Atom or VIA Nano with 2GB of RAM or so. Obviously not acceptable; there
has to be a better solution.

One could, I guess, regenerate only the pages which have changed, but it
still sounds like an ugly solution, where you'd have to generate a list
of files that have changed between one generation and the next, and
you'd still have very high CPU, cache and storage requirements.

C.

>> > There are some dynamic web frontends like cgit and gitweb out
>> > there, but there are also static ones like git-arr
>> > ( http://blitiri.com.ar/p/git-arr/ ) that might be more of an
>> > option for you.
>>
>> The concept for git-arr looks interesting, but it has neither blame
>> nor log, so it's kinda pointless, because the whole thing that's slow
>> is exactly blame and log.
>>
>> There has to be some way to improve these matters. No one wants to
>> wait 5 seconds until a page is generated; we're not running
>> enterprise software here, latency is important!
>>
>> C.
>
> Git's internal structures make blame pretty expensive. There's nothing
> you really can do for it algorithm-wise (as far as I know; if there
> were, people would already have improved it).
>
> The solution here is to have a "hot" repository to speed things up.
>
> There are of course little things you can do. I imagine that using git
> repack in a sane way could probably speed things up, as well as git gc.
>
> --
> Best regards,
> Fredrik Gustafsson
>
> tel: 0733-608274
> e-mail: iv...@iveqy.com
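The "list of files that have changed between one generation and the next" is cheap to get from git itself. A sketch in a throwaway repository, where `echo` stands in for a hypothetical page renderer:

```shell
# Demo: find which pages would need regeneration after new commits.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Cache Demo"

echo a > a.txt; echo b > b.txt
git add . && git commit -qm 'initial import'
OLD=$(git rev-parse HEAD)      # revision the cache was last built from

echo a2 > a.txt                # only a.txt changes
git commit -qam 'touch a.txt'
NEW=$(git rev-parse HEAD)

# Only these paths need their blame/log pages re-rendered:
changed=$(git diff --name-only "$OLD" "$NEW")
echo "$changed"                # -> a.txt
```

On a daily schedule this turns the 35k-page rebuild into a rebuild of only the paths a day's commits touched.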
Re: is there a fast web-interface to git for huge repos?
On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
> On 6 June 2013 23:33, Fredrik Gustafsson wrote:
> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> >> I'm interested in running a web interface to this and other similar
> >> git repositories (FreeBSD and NetBSD git repositories are even much,
> >> much bigger).
> >>
> >> Software-wise, is there no way to make cold access for git-log and
> >> git-blame orders of magnitude less than ~5s, and warm access less
> >> than ~0.5s?
> >
> > The obvious way would be to cache the results. You can even put an
>
> That would do nothing to prevent slowness of the cold requests, which
> already run for 5s when completely cold.
>
> In fact, unless done right, it would actually slow things down, as
> lines would not necessarily show up as they're ready.

You need to cache this _before_ the web-request. Don't let the
web-request trigger a cache update, but a git push to the repository.

> > update-cache hook on the git repositories to make the cache always
> > be up to date.
>
> That's entirely inefficient. It'll probably take hours or days to
> pre-cache all the HTML pages with a naive wget and the list of all the
> files. Not a solution at all.
>
> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of CPU time
> for blame/log)

That's a one-time penalty. Why would that be a problem? And why is wget
even mentioned? Did we misunderstand each other?

> > There are some dynamic web frontends like cgit and gitweb out there,
> > but there are also static ones like git-arr
> > ( http://blitiri.com.ar/p/git-arr/ ) that might be more of an option
> > for you.
>
> The concept for git-arr looks interesting, but it has neither blame
> nor log, so it's kinda pointless, because the whole thing that's slow
> is exactly blame and log.
>
> There has to be some way to improve these matters. No one wants to
> wait 5 seconds until a page is generated; we're not running enterprise
> software here, latency is important!
>
> C.

Git's internal structures make blame pretty expensive. There's nothing
you really can do for it algorithm-wise (as far as I know; if there
were, people would already have improved it).

The solution here is to have a "hot" repository to speed things up.

There are of course little things you can do. I imagine that using git
repack in a sane way could probably speed things up, as well as git gc.

--
Best regards,
Fredrik Gustafsson

tel: 0733-608274
e-mail: iv...@iveqy.com
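The repack/gc housekeeping suggested at the end might look like this; a sketch against a scratch repository (on a real multi-gigabyte mirror these runs take a while, so they belong in the same daily job as the fetch):

```shell
# Consolidate loose objects and multiple packs into one pack, so that
# cold blame/log walks touch fewer files.
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m 'seed commit'

git repack -a -d -q    # everything reachable into a single pack
git gc --quiet         # prune unreachable objects, pack refs
ls .git/objects/pack/  # a pack-*.pack plus its *.idx
```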
Re: is there a fast web-interface to git for huge repos?
On 6 June 2013 23:33, Fredrik Gustafsson wrote:
> On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> I'm interested in running a web interface to this and other similar
>> git repositories (FreeBSD and NetBSD git repositories are even much,
>> much bigger).
>>
>> Software-wise, is there no way to make cold access for git-log and
>> git-blame orders of magnitude less than ~5s, and warm access less
>> than ~0.5s?
>
> The obvious way would be to cache the results. You can even put an

That would do nothing to prevent slowness of the cold requests, which
already run for 5s when completely cold.

In fact, unless done right, it would actually slow things down, as
lines would not necessarily show up as they're ready.

> update-cache hook on the git repositories to make the cache always be
> up to date.

That's entirely inefficient. It'll probably take hours or days to
pre-cache all the HTML pages with a naive wget and the list of all the
files. Not a solution at all.

(0.5s x 35k files = 5 hours for log/blame, plus another 5h of CPU time
for blame/log)

> There are some dynamic web frontends like cgit and gitweb out there,
> but there are also static ones like git-arr
> ( http://blitiri.com.ar/p/git-arr/ ) that might be more of an option
> for you.

The concept for git-arr looks interesting, but it has neither blame nor
log, so it's kinda pointless, because the whole thing that's slow is
exactly blame and log.

There has to be some way to improve these matters. No one wants to wait
5 seconds until a page is generated; we're not running enterprise
software here, latency is important!

C.
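For reference, the parenthetical estimate above works out as follows (same guessed inputs: 0.5s per warm page, 35k files):

```shell
# 0.5 s per page x 35000 pages, converted to hours, per pass (log or blame).
awk 'BEGIN { printf "%.1f hours\n", 0.5 * 35000 / 3600 }'   # -> 4.9 hours
```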
Re: is there a fast web-interface to git for huge repos?
On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> I'm interested in running a web interface to this and other similar
> git repositories (FreeBSD and NetBSD git repositories are even much,
> much bigger).
>
> Software-wise, is there no way to make cold access for git-log and
> git-blame orders of magnitude less than ~5s, and warm access less than
> ~0.5s?

The obvious way would be to cache the results. You can even put an
update-cache hook on the git repositories to make the cache always be
up to date.

There are some dynamic web frontends like cgit and gitweb out there,
but there are also static ones like git-arr
( http://blitiri.com.ar/p/git-arr/ ) that might be more of an option
for you.

--
Best regards,
Fredrik Gustafsson

tel: 0733-608274
e-mail: iv...@iveqy.com
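The update-cache-hook idea could be sketched as a post-receive hook. `update_web_cache` is a hypothetical placeholder for whatever regenerates the blame/log pages, and the `printf` simulates the stdin git feeds the hook on push:

```shell
# Hypothetical post-receive hook: refresh the web cache when commits
# arrive, instead of when a visitor hits a cold page.
update_web_cache() {
    # placeholder -- a real hook would re-render pages for $1..$2 on $3
    echo "would regenerate pages for $3 ($1..$2)"
}

# git feeds post-receive lines of "<oldrev> <newrev> <refname>" on stdin;
# simulate one pushed branch update:
printf '%s\n' 'a1b2 c3d4 refs/heads/master' |
while read -r oldrev newrev refname; do
    update_web_cache "$oldrev" "$newrev" "$refname"
done
# -> would regenerate pages for refs/heads/master (a1b2..c3d4)
```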