Re: is there a fast web-interface to git for huge repos?

2013-06-14 Thread Holger Hellmuth (IKS)

On 07.06.2013 22:21, Constantine A. Murenin wrote:

I'm totally fine with daily updates; but I think there still has to be
some better way of doing this than wasting 0.5s of CPU time and 5s of
HDD time (if completely cold) for each blame / log, at the price of
more storage, some pre-caching, and (daily, in my use case)
fine-grained incremental updates.


To get a feel for the numbers: I would guess 'git blame' is mostly run
against the newest version and the release version of a file, right? I
couldn't find the number of files in the BSDs, so let's take Linux
instead: that is 25k files for version 2.6.27. Let's say 35k files
altogether for both the release and the newer versions of the files.


A typical page of git blame output on GitHub seems to be in the vicinity
of 500 kbytes, but that seems to include lots of overhead for convenience
features. At least that makes it a good upper bound.


35k files times 500 kbytes gives 17.5 Gbytes, a trivial amount for a
static *disk*-based cache. It is also a manageable amount for affordable
SSDs.
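
If one wanted a tighter per-file figure than GitHub's HTML, measuring
the raw blame output directly is easy enough. A rough sketch (the
repository path is illustrative, and of course running blame over many
files is exactly the slow part under discussion, hence the sampling):

  # Rough sketch: sum the size of raw "git blame" output for a sample of
  # files at HEAD, to sanity-check the 500-kbyte-per-page guess (which
  # includes a lot of HTML overhead).  The repository path below is
  # illustrative.
  import random
  import subprocess

  def blame_size_estimate(repo, rev='HEAD', sample=200):
      files = subprocess.check_output(
          ['git', '-C', repo, 'ls-tree', '-r', '-z', '--name-only', rev]
      ).decode('utf-8', 'replace').split('\0')
      files = [f for f in files if f]
      picked = random.sample(files, min(sample, len(files)))
      total = sum(len(subprocess.check_output(
          ['git', '-C', repo, 'blame', rev, '--', path]))
          for path in picked)
      return len(files), total // len(picked)

  # n_files, avg_bytes = blame_size_estimate('/srv/git/linux.git')
  # print(n_files, 'files, roughly', avg_bytes, 'bytes of raw blame each')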



Re: is there a fast web-interface to git for huge repos?

2013-06-07 Thread Constantine A. Murenin
On 7 June 2013 13:13, Charles McGarvey  wrote:
> On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>>> That's a one-time penalty. Why would that be a problem? And why is wget
>>> even mentioned? Did we misunderstand each other?
>>
>> `wget` or `curl --head` would be used to trigger the caching.
>>
>> I don't understand how it's a one-time penalty.  No one wants to look
>> at an old copy of the repository, so, pretty much, if, say, I want to
>> have a gitweb of all 4 BSDs, updated daily, then, even with lots of
>> RAM (e.g. to eliminate the cold-case 5s penalty and reduce each page
>> to 0.5s), on a quad-core box I'd kinda be lucky to complete a
>> generation of all the pages within 12h or so, obviously using the
>> machine at, or above, 50% capacity just for the caching.  Or several
>> days, or even a couple of weeks, on an Intel Atom or VIA Nano with
>> 2GB of RAM or so.  Obviously not acceptable; there has to be a
>> better solution.
>>
>> One could, I guess, only regenerate the pages which have changed, but
>> it still sounds like an ugly solution, where you'd have to be
>> generating a list of files that have changed between one generation
>> and the next, and you'd still have very high CPU, cache, and storage
>> requirements.
>
> Have you already ruled out caching on a proxy?  Pages would only be generated
> on demand, so the first visitor would still experience the delay but the rest
> would be fast until the page expires.  Even expiring pages as often as five
> minutes or less would probably provide significant processing savings
> (depending on how many users you have), and that level of staleness and the
> occasional delays may be acceptable to your users.
>
> As you say, generating the entire cache upfront and continuously is wasteful
> and probably unrealistic, but any type of caching, by definition, is going to
> involve users seeing stale content, and I don't see that you have any other
> option but some type of caching.  Well, you could reproduce what git does in a
> bunch of distributed algorithms and run your app on a farm--which, I guess, is
> probably what GitHub is doing--but throwing up a caching reverse proxy is a
> lot quicker if you can accept the caveats.

I don't think GitHub / Gitorious / whatever have solved this problem
at all.  They're terribly slow on big repos, and some pages don't even
generate the first time you click on the link.

I'm totally fine with daily updates; but I think there still has to be
some better way of doing this than wasting 0.5s of CPU time and 5s of
HDD time (if completely cold) for each blame / log, at the price of
more storage, some pre-caching, and (daily, in my use case)
fine-grained incremental updates.

C.


Re: is there a fast web-interface to git for huge repos?

2013-06-07 Thread Charles McGarvey
On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>> That's a one-time penalty. Why would that be a problem? And why is wget
>> even mentioned? Did we misunderstand each other?
> 
> `wget` or `curl --head` would be used to trigger the caching.
> 
> I don't understand how it's a one-time penalty.  No one wants to look
> at an old copy of the repository, so, pretty much, if, say, I want to
> have a gitweb of all 4 BSDs, updated daily, then, even with lots of
> RAM (e.g. to eliminate the cold-case 5s penalty and reduce each page
> to 0.5s), on a quad-core box I'd kinda be lucky to complete a
> generation of all the pages within 12h or so, obviously using the
> machine at, or above, 50% capacity just for the caching.  Or several
> days, or even a couple of weeks, on an Intel Atom or VIA Nano with
> 2GB of RAM or so.  Obviously not acceptable; there has to be a
> better solution.
>
> One could, I guess, only regenerate the pages which have changed, but
> it still sounds like an ugly solution, where you'd have to be
> generating a list of files that have changed between one generation
> and the next, and you'd still have very high CPU, cache, and storage
> requirements.

Have you already ruled out caching on a proxy?  Pages would only be generated
on demand, so the first visitor would still experience the delay but the rest
would be fast until the page expires.  Even expiring pages every five
minutes or less would probably provide significant processing savings
(depending on how many users you have), and that level of staleness and the
occasional delays may be acceptable to your users.

As you say, generating the entire cache upfront and continuously is wasteful
and probably unrealistic, but any type of caching, by definition, is going to
involve users seeing stale content, and I don't see that you have any other
option but some type of caching.  Well, you could reproduce what git does in a
bunch of distributed algorithms and run your app on a farm--which, I guess, is
probably what GitHub is doing--but throwing up a caching reverse proxy is a
lot quicker if you can accept the caveats.
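
Even just having the backend mark the expensive pages as cacheable would
let such a proxy do the rest. A minimal sketch, assuming the frontend is
wrapped as a WSGI app (the path test and the 300-second TTL are only
illustrative):

  # Sketch: WSGI middleware that marks blame/log pages as cacheable so a
  # reverse proxy (Varnish, nginx, Squid, ...) can serve repeat hits for
  # a few minutes without touching git.  "app" is whatever WSGI wrapper
  # the frontend already has; nothing here is gitweb- or cgit-specific.

  class CacheHeaders(object):
      def __init__(self, app, ttl=300):
          self.app = app
          self.ttl = ttl

      def __call__(self, environ, start_response):
          def sr(status, headers, exc_info=None):
              path = environ.get('PATH_INFO', '')
              if '/blame/' in path or '/log/' in path:
                  headers.append(('Cache-Control',
                                  'public, max-age=%d' % self.ttl))
              return start_response(status, headers, exc_info)
          return self.app(environ, sr)

  # application = CacheHeaders(existing_wsgi_app, ttl=300)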

-- 
Charles McGarvey





Re: is there a fast web-interface to git for huge repos?

2013-06-07 Thread Constantine A. Murenin
On 7 June 2013 10:57, Fredrik Gustafsson  wrote:
> On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
>> On 6 June 2013 23:33, Fredrik Gustafsson  wrote:
>> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> >> I'm interested in running a web interface to this and other similar
>> >> git repositories (FreeBSD and NetBSD git repositories are even much,
>> >> much bigger).
>> >>
>> >> Software-wise, is there no way to make cold access for git-log and
>> >> git-blame to be orders of magnitude less than ~5s, and warm access
>> >> less than ~0.5s?
>> >
>> > The obvious way would be to cache the results. You can even put an
>>
>> That would do nothing to prevent slowness of the cold requests, which
>> already run for 5s when completely cold.
>>
>> In fact, unless done right, it would actually slow things down, as
>> lines would not necessarily show up as they're ready.
>
> You need to cache this _before_ the web request. Don't let the web
> request trigger the cache update; let a git push to the repository
> trigger it instead.
>
>>
>> > update-cache hook on the git repositories to keep the cache always up to
>> > date.
>>
>> That's entirely inefficient.  It'll probably take hours or days to
>> pre-cache all the html pages with a naive wget and the list of all the
>> files.  Not a solution at all.
>>
>> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
>> for blame/log)
>
> That's a one-time penalty. Why would that be a problem? And why is wget
> even mentioned? Did we misunderstand each other?

`wget` or `curl --head` would be used to trigger the caching.

I don't understand how it's a one-time penalty.  No one wants to look
at an old copy of the repository, so, pretty much, if, say, I want to
have a gitweb of all 4 BSDs, updated daily, then, even with lots of
RAM (e.g. to eliminate the cold-case 5s penalty and reduce each page
to 0.5s), on a quad-core box I'd kinda be lucky to complete a
generation of all the pages within 12h or so, obviously using the
machine at, or above, 50% capacity just for the caching.  Or several
days, or even a couple of weeks, on an Intel Atom or VIA Nano with
2GB of RAM or so.  Obviously not acceptable; there has to be a
better solution.

One could, I guess, only regenerate the pages which have changed, but
it still sounds like an ugly solution, where you'd have to be
generating a list of files that have changed between one generation
and the next, and you'd still have very high CPU, cache, and storage
requirements.
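
That said, the list of files that changed between one generation and the
next is at least cheap to get out of git. A rough sketch of that
incremental route (the refs/cache/last ref and the flat cache layout are
made up for illustration):

  # Rough sketch of the incremental route: re-render blame/log output
  # only for paths that changed since the last generation, instead of
  # all 35k files.  The refs/cache/last ref and the flat file layout
  # under cache_dir are made up for illustration; --diff-filter=d skips
  # deleted paths (evicting their stale pages is left out).
  import os
  import subprocess

  def git(repo, *args):
      return subprocess.check_output(('git', '-C', repo) + args)

  def refresh(repo, cache_dir, new='master', last_ref='refs/cache/last'):
      old = git(repo, 'rev-parse', last_ref).decode().strip()
      changed = git(repo, 'diff', '--name-only', '--diff-filter=d', '-z',
                    old, new)
      names = changed.decode('utf-8', 'replace').split('\0')
      paths = [p for p in names if p]
      for path in paths:
          base = os.path.join(cache_dir, path.replace('/', '__'))
          with open(base + '.blame', 'wb') as f:
              f.write(git(repo, 'blame', new, '--', path))
          with open(base + '.log', 'wb') as f:
              f.write(git(repo, 'log', new, '--', path))
      git(repo, 'update-ref', last_ref, new)
      return len(paths)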

C.

>> > There's some dynamic web frontends like cgit and gitweb out there but
>> > there's also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
>> > ) that might be more of an option to you.
>>
>> The concept for git-arr looks interesting, but it has neither blame
>> nor log, so, it's kinda pointless, because the whole thing that's slow
>> is exactly blame and log.
>>
>> There has to be some way to improve these matters.  No one wants to
>> wait 5 seconds until a page is generated; we're not running enterprise
>> software here, and latency is important!
>>
>> C.
>
> Git's internal structures make blame in particular pretty expensive.
> There's nothing you can really do about it algorithm-wise (as far as I
> know; if there were, people would already have improved it).
>
> The solution here is to have a "hot" repository to speed things up.
>
> There are of course small things you can do. I imagine that using git
> repack in a sane way could probably speed things up, as well as git gc.
>
> --
> Best regards
> Fredrik Gustafsson
>
> tel: 0733-608274
> e-mail: iv...@iveqy.com


Re: is there a fast web-interface to git for huge repos?

2013-06-07 Thread Fredrik Gustafsson
On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
> On 6 June 2013 23:33, Fredrik Gustafsson  wrote:
> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> >> I'm interested in running a web interface to this and other similar
> >> git repositories (FreeBSD and NetBSD git repositories are even much,
> >> much bigger).
> >>
> >> Software-wise, is there no way to make cold access for git-log and
> >> git-blame to be orders of magnitude less than ~5s, and warm access
> >> less than ~0.5s?
> >
> > The obvious way would be to cache the results. You can even put an
> 
> That would do nothing to prevent slowness of the cold requests, which
> already run for 5s when completely cold.
> 
> In fact, unless done right, it would actually slow things down, as
> lines would not necessarily show up as they're ready.

You need to cache this _before_ the web request. Don't let the web
request trigger the cache update; let a git push to the repository
trigger it instead.
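
For example, a post-receive hook along these lines could do the
triggering (a sketch only; warm_cache() is a stand-in for whatever
actually regenerates the blame/log pages):

  #!/usr/bin/env python
  # Sketch of a post-receive hook.  git feeds "<old> <new> <refname>"
  # lines on stdin for every pushed ref; instead of a web request
  # triggering cache generation, the push itself warms the affected
  # paths.  warm_cache() is hypothetical.
  import subprocess
  import sys

  ZERO = '0' * 40

  def changed_paths(old, new):
      if old == ZERO:                      # ref just created
          cmd = ['git', 'ls-tree', '-r', '--name-only', new]
      else:
          cmd = ['git', 'diff', '--name-only', old, new]
      out = subprocess.check_output(cmd)
      return [p for p in out.decode('utf-8', 'replace').splitlines() if p]

  def warm_cache(path, rev):
      pass                                 # hypothetical

  for line in sys.stdin:
      old, new, ref = line.split()
      if new == ZERO:                      # ref deleted: nothing to warm
          continue
      for path in changed_paths(old, new):
          warm_cache(path, new)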

> 
> > update-cache hook on the git repositories to keep the cache always up to
> > date.
> 
> That's entirely inefficient.  It'll probably take hours or days to
> pre-cache all the html pages with a naive wget and the list of all the
> files.  Not a solution at all.
> 
> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
> for blame/log)

That's a one-time penalty. Why would that be a problem? And why is wget
even mentioned? Did we misunderstand each other?

> 
> > There's some dynamic web frontends like cgit and gitweb out there but
> > there's also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
> > ) that might be more of an option to you.
> 
> The concept for git-arr looks interesting, but it has neither blame
> nor log, so, it's kinda pointless, because the whole thing that's slow
> is exactly blame and log.
> 
> There has to be some way to improve these matters.  No one wants to
> wait 5 seconds until a page is generated; we're not running enterprise
> software here, and latency is important!
> 
> C.

Git's internal structures make blame in particular pretty expensive.
There's nothing you can really do about it algorithm-wise (as far as I
know; if there were, people would already have improved it).

The solution here is to have a "hot" repository to speed things up.

There are of course small things you can do. I imagine that using git
repack in a sane way could probably speed things up, as well as git gc.
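
For instance, something along these lines (the window/depth numbers are
purely illustrative, not a recommendation):

  # Sketch of that kind of maintenance: keep everything in one
  # well-packed pack so object lookups during blame/log stay cheap.
  import subprocess

  def maintain(repo, aggressive=False):
      if aggressive:
          # Let git pick its own (expensive) settings; occasional use.
          subprocess.check_call(['git', '-C', repo, 'gc', '--aggressive'])
      else:
          # Full repack into a single pack, recomputing deltas (-f).
          subprocess.check_call(['git', '-C', repo, 'repack',
                                 '-a', '-d', '-f',
                                 '--window=250', '--depth=50'])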

-- 
Best regards
Fredrik Gustafsson

tel: 0733-608274
e-mail: iv...@iveqy.com


Re: is there a fast web-interface to git for huge repos?

2013-06-07 Thread Constantine A. Murenin
On 6 June 2013 23:33, Fredrik Gustafsson  wrote:
> On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> I'm interested in running a web interface to this and other similar
>> git repositories (FreeBSD and NetBSD git repositories are even much,
>> much bigger).
>>
>> Software-wise, is there no way to make cold access for git-log and
>> git-blame to be orders of magnitude less than ~5s, and warm access
>> less than ~0.5s?
>
> The obvious way would be to cache the results. You can even put an

That would do nothing to prevent slowness of the cold requests, which
already run for 5s when completely cold.

In fact, unless done right, it would actually slow things down, as
lines would not necessarily show up as they're ready.

> update-cache hook on the git repositories to keep the cache always up to
> date.

That's entirely inefficient.  It'll probably take hours or days to
pre-cache all the html pages with a naive wget and the list of all the
files.  Not a solution at all.

(0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
for blame/log)

> There's some dynamic web frontends like cgit and gitweb out there but
> there's also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
> ) that might be more of an option to you.

The concept for git-arr looks interesting, but it has neither blame
nor log, so, it's kinda pointless, because the whole thing that's slow
is exactly blame and log.

There has to be some way to improve these matters.  No one wants to
wait 5 seconds until a page is generated; we're not running enterprise
software here, and latency is important!

C.


Re: is there a fast web-interface to git for huge repos?

2013-06-06 Thread Fredrik Gustafsson
On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> I'm interested in running a web interface to this and other similar
> git repositories (FreeBSD and NetBSD git repositories are even much,
> much bigger).
> 
> Software-wise, is there no way to make cold access for git-log and
> git-blame to be orders of magnitude less than ~5s, and warm access
> less than ~0.5s?

The obvious way would be to cache the results. You can even put an
update-cache hook on the git repositories to keep the cache always up to
date.

There's some dynamic web frontends like cgit and gitweb out there but
there's also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
) that might be more of an option to you.

-- 
Best regards
Fredrik Gustafsson

tel: 0733-608274
e-mail: iv...@iveqy.com