Re: repospanner and our Ansible repo

2019-09-18 Thread Neal Gompa
On Wed, Sep 18, 2019 at 9:58 AM Stephen John Smoogen  wrote:
>
> On Wed, 18 Sep 2019 at 09:44, Randy Barlow  
> wrote:
> >
> > On Tue, 2019-09-17 at 19:01 -0400, Neal Gompa wrote:
> > > Out of curiosity, do we know where the bottlenecks are in
> > > repoSpanner?
> > > In theory, the architecture of repoSpanner isn't supposed to be too
> > > different from gitaly, so I'm curious where we're falling down.
> >
> > I believe it needs a more efficient way to store the git objects. As I
> > understand it, it currently stores each one in its own file, resulting
> > in a large number of small files.
>
> So my "hot-take probably wrong" look at things seems to indicate that
> the reason it stores everything as a separate file is to make certain
> git actions faster. When you pack the files, searches, diffs and other
> checks become slower or memory intensive because you have to calculate
> new deltas and other things 'lost' in the packing.
>
> Looking at the gitaly documents, I think that is the reason they have
> multiple different types of in-memory caches at different layers. It
> allows for faster accesses but probably blows up the hardware
> requirements. We have to be careful here because we don't have a
> hardware reserve to dive into for more memory/cpu.
>
> I think that for gitlab.org (versus running a local gitlab) they also
> use a lot of backend 'eventual consistency' caching. You push, and it
> begins to spread that out through the multiple regions it is housed
> in. The 'user' doesn't see this because the front-end level just
> directs you to the known hot caches for that particular pull/push
> request, but if you were somehow hardcoded to a region you might not
> see the update/change for a while because it hasn't mirrored out
> completely. That would also speed up push/pull/changes greatly, and
> it is not something we could 'duplicate'.
>

That definitely explains the performance consistency between
repoSpanner and gitaly in my local deployment. So it's most likely
related to how they simulate better performance while the backend
catches up.

That said, the most recent change to gitaly is that it now does hashed
storage of git objects and does "fast forking" using alternates,
instead of storing forks as separate bare git repos and duplicating
objects on disk.

None of that changes the initial push for a unique repo.
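
For anyone unfamiliar with alternates, here is a minimal sketch of the
mechanism as I understand it (paths invented; this is not gitaly's
actual code): a fork borrows the parent's object store instead of
copying it, so writing refs is the only real work.

    import subprocess
    from pathlib import Path

    parent = Path("/srv/git/parent.git")  # existing bare repo (hypothetical)
    fork = Path("/srv/git/fork.git")

    subprocess.run(["git", "init", "--bare", str(fork)], check=True)

    # objects/info/alternates tells git to also look in the parent's
    # object store, so the fork duplicates nothing on disk.
    alt = fork / "objects" / "info" / "alternates"
    alt.parent.mkdir(parents=True, exist_ok=True)
    alt.write_text(str(parent / "objects") + "\n")

    # Fetching refs is then nearly instant: no objects are transferred
    # because git already finds them all through the alternate.
    subprocess.run(["git", "--git-dir", str(fork), "fetch", str(parent),
                    "refs/heads/*:refs/heads/*"], check=True)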




--
真実はいつも一つ!/ Always, there's only one truth!


Re: repospanner and our Ansible repo

2019-09-18 Thread Stephen John Smoogen
On Wed, 18 Sep 2019 at 09:44, Randy Barlow  wrote:
>
> On Tue, 2019-09-17 at 19:01 -0400, Neal Gompa wrote:
> > Out of curiosity, do we know where the bottlenecks are in
> > repoSpanner?
> > In theory, the architecture of repoSpanner isn't supposed to be too
> > different from gitaly, so I'm curious where we're falling down.
>
> I believe it needs a more efficient way to store the git objects. As I
> understand it, it currently stores each one in its own file, resulting
> in a large number of small files.

So my "hot-take probably wrong" look at things seems to indicate that
the reason it stores everything as a separate file is to make certain
git actions faster. When you pack the files, searches, diffs and other
checks become slower or memory intensive because you have to calculate
new deltas and other things 'lost' in the packing.
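
To illustrate the tradeoff with a toy model (this is not git's actual
delta format): a loose object is a single read, while a packed object
at the end of a delta chain has to be reconstructed base-first on
every access unless something caches the result.

    # Toy stand-in for reading a packed object through a delta chain.
    def apply_delta(base: bytes, edits: list) -> bytes:
        out = bytearray(base)
        for offset, replacement in edits:  # toy (offset, bytes) edits
            out[offset:offset + len(replacement)] = replacement
        return bytes(out)

    def read_packed(base: bytes, chain: list) -> bytes:
        result = base
        for edits in chain:                # cost grows with chain length
            result = apply_delta(result, edits)
        return result

    # A loose object, by contrast, is just: open the file, read, done.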

Looking at the gitaly documents, I think that is the reason they have
multiple different types of in-memory caches at different layers. It
allows for faster accesses but probably blows up the hardware
requirements. We have to be careful here because we don't have a
hardware reserve to dive into for more memory/cpu.
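
To make that tradeoff concrete, a minimal sketch: a bounded in-memory
cache makes repeat reads cheap, and the bound is exactly the RAM that
has to be budgeted for it.

    import time
    from functools import lru_cache

    def slow_read(key: str) -> bytes:
        time.sleep(0.05)             # stand-in for disk iops + delta work
        return key.encode()

    @lru_cache(maxsize=100_000)      # maxsize is the memory-vs-speed knob
    def cached_read(key: str) -> bytes:
        return slow_read(key)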

I think that for gitlab.org (versus running a local gitlab) they also
use a lot of backend 'eventual consistency' caching. You push, and it
begins to spread that out through the multiple regions it is housed
in. The 'user' doesn't see this because the front-end level just
directs you to the known hot caches for that particular pull/push
request, but if you were somehow hardcoded to a region you might not
see the update/change for a while because it hasn't mirrored out
completely. That would also speed up push/pull/changes greatly, and
it is not something we could 'duplicate'.
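
Very roughly, I would picture the routing like this sketch (every name
here is invented; I have no idea what gitlab.com actually runs):

    import time

    hot_region: dict = {}    # repo -> region that took the latest push
    synced_at: dict = {}     # (repo, region) -> last replication time

    def record_push(repo: str, region: str) -> None:
        hot_region[repo] = region

    def route(repo: str, nearest: str, lag_budget: float = 30.0) -> str:
        # Serve from the nearest region if its replica is fresh enough;
        # otherwise send the client to the region that has the push.
        if time.time() - synced_at.get((repo, nearest), 0.0) < lag_budget:
            return nearest
        return hot_region.get(repo, nearest)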


-- 
Stephen J Smoogen.


Re: repospanner and our Ansible repo

2019-09-18 Thread Randy Barlow
On Tue, 2019-09-17 at 19:01 -0400, Neal Gompa wrote:
> Out of curiosity, do we know where the bottlenecks are in
> repoSpanner?
> In theory, the architecture of repoSpanner isn't supposed to be too
> different from gitaly, so I'm curious where we're falling down.

I believe it needs a more efficient way to store the git objects. As I
understand it, it currently stores each one in its own file, resulting
in a large number of small files.
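
For reference, the layout in question is essentially git's standard
loose-object scheme: each object is zlib-compressed into its own file,
named by hash and fanned out into 256 subdirectories. A sketch of the
path computation:

    import hashlib
    from pathlib import Path

    def loose_object_path(objects_dir: Path, data: bytes,
                          kind: str = "blob") -> Path:
        header = b"%s %d\x00" % (kind.encode(), len(data))
        sha = hashlib.sha1(header + data).hexdigest()
        return objects_dir / sha[:2] / sha[2:]  # e.g. objects/3b/18e5...

    print(loose_object_path(Path("objects"), b"hello world\n"))
    # objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad

At the ~200k objects quoted elsewhere in this thread, that means ~200k
small files, so per-file overhead and iops dominate.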




Re: repospanner and our Ansible repo

2019-09-17 Thread Stephen John Smoogen
On Tue, 17 Sep 2019 at 19:02, Neal Gompa  wrote:
>
> On Tue, Sep 17, 2019 at 6:47 PM Randy Barlow
>  wrote:
> >

> > I don't expect it would be useful to perform this test with GitHub
> > since I'd expect essentially the same results (bottlenecked on my home
> > internet connection).
>
> Out of curiosity, do we know where the bottlenecks are in repoSpanner?
> In theory, the architecture of repoSpanner isn't supposed to be too
> different from gitaly, so I'm curious where we're falling down.
>
>

Looking at the architecture of gitaly, there seems to be a redis(?)
cache in front of gitaly and a file cache behind it. If I read that
correctly, those would make things seem much faster, since gitaly
would be interfacing with data held in faster memory. However, that is
just a rough look at what is written up, not domain knowledge.
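
A rough sketch of that read path as I picture it (a dict standing in
for the redis tier; the file-cache location is invented):

    from pathlib import Path

    memory_tier: dict = {}                   # stand-in for redis
    file_tier = Path("/var/cache/gitaly")    # hypothetical file cache

    def read(key: str, fetch_from_backend) -> bytes:
        if key in memory_tier:               # fastest: in-memory hit
            return memory_tier[key]
        path = file_tier / key
        if path.exists():                    # next: file-cache hit
            data = path.read_bytes()
        else:                                # miss both: real backend
            data = fetch_from_backend(key)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        memory_tier[key] = data              # promote on the way out
        return data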


-- 
Stephen J Smoogen.


Re: repospanner and our Ansible repo

2019-09-17 Thread Randy Barlow
stickster asked me today how these numbers would compare to
Git{Hub,Lab}. I did a bit of testing with GitLab just now.

Note that this isn't a particularly apples-to-apples test, because my
repospanner nodes were on the same virtual host, and my git client was
on a 1 Gbps LAN with them. My GitLab test results are from my house,
where I only have a 60x6 Mbps (down/up) connection to the Internet,
and of course, higher latency.

I considered testing from batcave01 to get higher bandwidth, but I
didn't want to try to figure out a safe way to use my GitLab
credentials on a shared server and I didn't want to make a throw away
account just to test this.

On Mon, 2019-09-16 at 18:51 -0400, Randy Barlow wrote:
> I pushed the Ansible repository into it. This took a very long time:
> 298m2.157s!

This took 6m44.705s to get to GitLab. However, since I only have 6 Mbps
outbound and the repository is 268.43 MiB, I calculate that almost all
of this time was just due to waiting on my outbound pipe.

> The next test was to see how long it takes to clone our repo. I did
> this on another machine on the same LAN (so again, ideal network
> latency) and it took 2m27.433s.

This took 0m40.359s, and again, almost all of the time was just due to
how long it would take to send that much data over a 60 Mbps link.
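
As a sanity check on both numbers, here is the raw transfer-time floor
(taking MiB as 2**20 bytes and Mbps as 10**6 bits/s):

    repo_bytes = 268.43 * 2**20  # repository size from above

    print(repo_bytes * 8 / 6e6)   # push at 6 Mbps up: ~375 s of ~405 s measured
    print(repo_bytes * 8 / 60e6)  # clone at 60 Mbps down: ~38 s of ~40 s measured

So the link really does account for almost all of both times.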

> Next, I made a small commit (just added/deleted some lines) and
> pushed
> it into the cluster. This went reasonably quick at 0.366s, which I
> think we would be OK with.

This took 1.443s to GitLab, and I bet most of it was just latency/round
trip crypto setup time.

> The last test I performed was to see how quickly another checkout
> could
> pull that commit, and this was again a speed I might consider to be a
> bit slow at 4.931s, especially considering that the commit was small
> and there was only one.

This took 0m1.523s to GitLab, and I bet most of it was just
latency/round trip crypto setup time.

I don't expect it would be useful to perform this test with GitHub
since I'd expect essentially the same results (bottlenecked on my home
internet connection).




Re: repospanner and our Ansible repo

2019-09-16 Thread Igor Gnatenko
I don't contribute much to the infra repo, although I do pull from time
to time. Sending PRs is definitely cool, but I think waiting tens of
seconds to pull a few commits is not very good.
On Tue, Sep 17, 2019 at 1:01 AM Randy Barlow
 wrote:
>
> Greetings!
>
> Kevin asked me last week whether we are ready to move our
> infrastructure Ansible repository into repospanner. The benefit of
> moving it into repospanner is that it is one way to enable us to allow
> pull requests into the repository, which I think would be nice.
>
> repospanner seems to work correctly as a git server, but it does need
> improvements in its performance, so I offered to do a little
> benchmarking with our Ansible repo and repospanner to see what kind of
> performance we might see.
>
> I deployed a 3-node repospanner cluster today on fairly high
> performance hardware (SSD storage). It was three VMs on the same
> physical machine. Note that due to my test setup, network latency was
> about as good as it could get, and so was storage iops. I believe the
> performance bottlenecks will depend heavily on storage iops. Thus, this
> hardware is not really a great way to predict how the performance might
> be if we deployed into our infra, but it was easy for me to do and get
> a "best case" performance benchmark. I am willing to attempt to
> replicate this test on more realistic hardware in our infra if we want
> more realistic data for our own use case.
>
> I pushed the Ansible repository into it. This took a very long time:
> 298m2.157s! If we were to deploy nodes in different geos and use NAS
> storage, I believe this would take longer. The good thing is that we'd
> only need to do this operation once, if we were to decide to proceed.
>
> The next test was to see how long it takes to clone our repo. I did
> this on another machine on the same LAN (so again, ideal network
> latency) and it took 2m27.433s. That's a pretty long time too I'd say,
> but maybe liveable? This would impact every contributor who wanted to
> clone us, so I'll let the list debate whether that is acceptable.
>
> Next, I made a small commit (just added/deleted some lines) and pushed
> it into the cluster. This went reasonably quick at 0.366s, which I
> think we would be OK with.
>
> The last test I performed was to see how quickly another checkout could
> pull that commit, and this was again a speed I might consider to be a
> bit slow at 4.931s, especially considering that the commit was small
> and there was only one. I would expect this to be somewhat
> proportional to the amount of change that has happened since the user
> last fetched, and this repo does see a lot of activity. So I might
> expect git pull to take tens of seconds for contributors who are
> fairly active and pull once every few days or so, and maybe longer
> for users who pull less frequently.
>
> The repo copy I tested with has 199717 objects and 132918 deltas in it.
> repospanner performance seems to scale fairly proportionally with
> these numbers, as the bodhi repo pushed into it in about an hour
> and has 50kish objects, iirc (didn't write it down, so from memory).
>
> I personally am on the fence about whether we should proceed at this
> time. I am certain that people will notice the speed issues, and I also
> expect that it will be slower than the numbers I listed above since my
> tests were done on consumer hardware. But it would also be pretty sweet
> if we had pull requests on the repo.
>
> Improving repospanner's performance is a goal I am focusing on, so if
> we deployed it now I would hopefully be able to get it into better
> shape soon. Alternatively, we hopefully wouldn't have to wait that long
> if we wanted to wait for performance fixes before proceeding. I could
> see either decision being reasonable.
>
> To reiterate, I'd be willing to replicate the tests I did above on
> infra hardware if we are on the fence about the numbers I've reported
> here and want to see more realistic numbers to make a final decision. I
> think that would give us more realistic numbers since the tests I did
> here were in a much more ideal situation, performance-wise.
>
> What do others think?

repospanner and our Ansible repo

2019-09-16 Thread Randy Barlow
Greetings!

Kevin asked me last week whether we are ready to move our
infrastructure Ansible repository into repospanner. The benefit of
moving it into repospanner is that it is one way to enable us to allow
pull requests into the repository, which I think would be nice.

repospanner seems to work correctly as a git server, but it does need
improvements in its performance, so I offered to do a little
benchmarking with our Ansible repo and repospanner to see what kind of
performance we might see.

I deployed a 3-node repospanner cluster today on fairly high
performance hardware (SSD storage). It was three VMs on the same
physical machine. Note that due to my test setup, network latency was
about as good as it could get, and so was storage iops. I believe the
performance bottlenecks will depend heavily on storage iops. Thus, this
hardware is not really a great way to predict how the performance might
be if we deployed into our infra, but it was easy for me to do and get
a "best case" performance benchmark. I am willing to attempt to
replicate this test on more realistic hardware in our infra if we want
more realistic data for our own use case.

I pushed the Ansible repository into it. This took a very long time:
298m2.157s! If we were to deploy nodes in different geos and use NAS
storage, I believe this would take longer. The good thing is that we'd
only need to do this operation once, if we were to decide to proceed.

The next test was to see how long it takes to clone our repo. I did
this on another machine on the same LAN (so again, ideal network
latency) and it took 2m27.433s. That's a pretty long time too I'd say,
but maybe liveable? This would impact every contributor who wanted to
clone us, so I'll let the list debate whether that is acceptable.

Next, I made a small commit (just added/deleted some lines) and pushed
it into the cluster. This went reasonably quick at 0.366s, which I
think we would be OK with.

The last test I performed was to see how quickly another checkout could
pull that commit, and this was again a speed I might consider to be a
bit slow at 4.931s, especially considering that the commit was small
and there was only one. I would expect this to be somewhat proportional
to the amount of change that has happened since the user last fetched,
and this repo does see a lot of activity. So I might expect git pull to
take tens of seconds for contributors who are fairly active and pull
once every few days or so, and maybe longer for users who pull less
frequently.
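
For anyone who wants to reproduce these timings: they were essentially
just timed git commands, equivalent to this sketch (the URL is made
up):

    import subprocess
    import time

    def timed(*cmd, cwd=None):
        start = time.perf_counter()
        subprocess.run(cmd, cwd=cwd, check=True)
        return time.perf_counter() - start

    url = "https://repospanner.example/ansible.git"  # hypothetical endpoint
    print("clone:", timed("git", "clone", url, "ansible"))
    print("pull: ", timed("git", "pull", cwd="ansible"))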

The repo copy I tested with has 199717 objects and 132918 deltas in it.
repospanner performance seems to scale fairly proportionally with
these numbers, as the bodhi repo pushed into it in about an hour
and has 50kish objects, iirc (didn't write it down, so from memory).
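
Those object and delta counts are presumably what git printed while
packing; you can pull comparable numbers out of any clone with
'git count-objects -v', e.g.:

    import subprocess

    out = subprocess.run(["git", "count-objects", "-v"],
                         capture_output=True, text=True,
                         check=True).stdout
    stats = dict(line.split(": ", 1) for line in out.strip().splitlines())
    print(stats["count"], "loose objects;", stats["in-pack"], "in packs")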

I personally am on the fence about whether we should proceed at this
time. I am certain that people will notice the speed issues, and I also
expect that it will be slower than the numbers I listed above since my
tests were done on consumer hardware. But it would also be pretty sweet
if we had pull requests on the repo.

Improving repospanner's performance is a goal I am focusing on, so if
we deployed it now I would hopefully be able to get it into better
shape soon. Alternatively, we hopefully wouldn't have to wait that long
if we wanted to wait for performance fixes before proceeding. I could
see either decision being reasonable.

To reiterate, I'd be willing to replicate the tests I did above on
infra hardware if we are on the fence about the numbers I've reported
here and want to see more realistic numbers to make a final decision. I
think that would give us more realistic numbers since the tests I did
here were in a much more ideal situation, performance-wise.

What do others think?

