Re: [ccache] why is limit_multiple ignored?

2018-01-30 Thread Anders Björklund via ccache
Den 2018-01-29 kl. 18:59, skrev Joel Rosdahl via ccache:
> On 29 January 2018 at 07:14, Scott Bennett  wrote:
> 
>> Countless data base software implementations handle these situations
>> acceptably well.
> 
> Sigh.
> 
> I see. You're talking about a completely different model than what
> ccache currently uses, which was not clear to me when I read your
> initial description. What you seem to don't understand, or choose to
> ignore, is that ccache can't stop supporting the simple server-less
> file-based model since that would drop support for two important use
> cases:
> 
> 1. Using ccache on a personal account without having access to a system
>service and without having to start a personal server.
> 2. Using a shared cache on NFS.
> 
> (For case 1, it would be feasible with a model where the client starts a
> server on demand, unless the cache is shared.)
> 
> Why are they important? Simply because people have used ccache like that
> for many years.

We have found that use case 2 needs extending, to add a local cache
in front. Then the NFS store is only used as a "secondary cache"...

But this can be done without changing very much in ccache, and I know
I said that we would post some code for that. Will try again. :-)


> It would certainly be possible to add optional client-server backends,
> and for them I fully agree that using a centralized index is obvious,
> but that's just an entirely separate discussion as I see it.
> 
> If we would drop support for simple file-based caches, then it would no
> longer be ccache but something else. Which would be fine, but that's
> another project. (It could of course be ccache version 4, but then I
> expect that ccache version 3 would continue living its own life, so it
> would be a separate project in practice.)

We already have such an implementation, which works fine - in parallell.
We only have to complete the work of merging the dev/memcached branch ?

By starting a memcached daemon, we get the required client/server setup.
And by using "moxi", we can have that daemon talk to a "real" database.


>> Very possibly you have the requisite knowledge/experience yourself.
> 
> Actually yes, so you could have saved both your and my time by just
> asking something like "Have you considered using a client-server model,
> perhaps using a standard database, instead of a file-based cache?"
> instead of trying to educate me.
> 
>> To modify ccache to use data base software is admittedly a major
>> rewriting job, so I expect such an idea to put you off, but it's a
>> project that should ultimately yield a far superior product, IMO.
> 
> I don't disagree, but as I said, that would be another project, and I
> neither have time nor interest in that personally.

As mentioned above, it is very possible to use Couchbase for this.
It is an open source NoSQL database, from https://couchbase.com

Since ccache is talking the standard memcached protocol directly,
it doesn't even have to know that it is using a standard database!

https://www.couchbase.com/memcached

https://github.com/ccache/ccache/tree/dev/memcached

/Anders
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] why is limit_multiple ignored?

2018-01-29 Thread Joel Rosdahl via ccache
On 29 January 2018 at 07:14, Scott Bennett  wrote:

> Countless data base software implementations handle these situations
> acceptably well.

Sigh.

I see. You're talking about a completely different model than what
ccache currently uses, which was not clear to me when I read your
initial description. What you seem to don't understand, or choose to
ignore, is that ccache can't stop supporting the simple server-less
file-based model since that would drop support for two important use
cases:

1. Using ccache on a personal account without having access to a system
   service and without having to start a personal server.
2. Using a shared cache on NFS.

(For case 1, it would be feasible with a model where the client starts a
server on demand, unless the cache is shared.)

Why are they important? Simply because people have used ccache like that
for many years.

It would certainly be possible to add optional client-server backends,
and for them I fully agree that using a centralized index is obvious,
but that's just an entirely separate discussion as I see it.

If we would drop support for simple file-based caches, then it would no
longer be ccache but something else. Which would be fine, but that's
another project. (It could of course be ccache version 4, but then I
expect that ccache version 3 would continue living its own life, so it
would be a separate project in practice.)

> (FWIW, I use cache_dir_levels = 5, which may not be optimal in terms
> of performance. I don't have a good way of determining the optimal
> depth to use for the cache directory trees. It seems to be very, very
> fast for use in building things, but may well be a killer for
> cleanups.)

Let's see. cache_dir_levels = 5 means 16⁵ ≈ 1 million directories on the
lowest level. A large cache might hold, say, 10 million files? Then 10
files per directory is clearly not optimal. How many files do you have
in your cache?

I think that a good rule of thumb would be to store a couple of thousand
or tens of thousands of files per directory, depending on the file
system characteristics. That would mean that cache_dir_levels = 3 would
be enough even for very large caches.

Perhaps lowering cache_dir_levels could partly solve the bad cleanup
performance you have?

> Very possibly you have the requisite knowledge/experience yourself.

Actually yes, so you could have saved both your and my time by just
asking something like "Have you considered using a client-server model,
perhaps using a standard database, instead of a file-based cache?"
instead of trying to educate me.

> To modify ccache to use data base software is admittedly a major
> rewriting job, so I expect such an idea to put you off, but it's a
> project that should ultimately yield a far superior product, IMO.

I don't disagree, but as I said, that would be another project, and I
neither have time nor interest in that personally.

-- Joel

On 29 January 2018 at 07:14, Scott Bennett  wrote:
> Joel Rosdahl  wrote:
>
>> On 7 January 2018 at 14:02, Scott Bennett wrote:
>>
>> > The design problem is that there is no centralized index maintained of
>> > cache entries' paths, their sizes, and their timestamps, necessitating
>> > the plumbing of the directory trees. [...]
>>
>> Thanks for sharing your ideas!
>
>  You may wish to retract any thanks once you've read what follows.  The
> current independence of ccache from any other third-party software is valued
> and for good reasons.  However, I hope to show below a better way to do 
> things.
> That independence can still be maintained, but only at the cost of another
> wheel reinvention. :-(
>>
>> I fully agree that the cleanup algorithm/design hasn't aged well. It has
>> essentially stayed the same since Tridge created ccache in 2002, when
>> storage devices were much smaller and a cache of one GB or two probably
>> was considered quite large.
>>
>> Trying to improve the cleanup algorithm/design has not been a priority
>> since I personally haven't seen such pathological behavior that you
>> describe ("cleanups that can take over a half hour to run and hammer a
>
>  I don't know whether users of other operating systems are using ccache
> in building their systems, but many FreeBSD users do so because the time
> savings are so great.  When one can cut a build time of six hours to, say,
> an hour and a half, one tends to appreciate the tool(s) that make(s) it
> possible.  I.e., we use and love ccache because, in general, it works so well
> and improves performance so much.
>  However, compiling an operating system means a pretty large cache area
> is needed if one is to fit the working set within the cache.  Similarly,
> FreeBSD users who compile third-party software from the ports tree, rather
> than installing it from prebuilt packages, potentially need an even larger
> cache area whose size roughly depends upon the number and size of the ports
> built and installed onto their 

Re: [ccache] why is limit_multiple ignored?

2018-01-28 Thread Scott Bennett via ccache
Joel Rosdahl  wrote:

> On 7 January 2018 at 14:02, Scott Bennett wrote:
>
> > The design problem is that there is no centralized index maintained of
> > cache entries' paths, their sizes, and their timestamps, necessitating
> > the plumbing of the directory trees. [...]
>
> Thanks for sharing your ideas!

 You may wish to retract any thanks once you've read what follows.  The
current independence of ccache from any other third-party software is valued
and for good reasons.  However, I hope to show below a better way to do things.
That independence can still be maintained, but only at the cost of another
wheel reinvention. :-(
>
> I fully agree that the cleanup algorithm/design hasn't aged well. It has
> essentially stayed the same since Tridge created ccache in 2002, when
> storage devices were much smaller and a cache of one GB or two probably
> was considered quite large.
>
> Trying to improve the cleanup algorithm/design has not been a priority
> since I personally haven't seen such pathological behavior that you
> describe ("cleanups that can take over a half hour to run and hammer a

 I don't know whether users of other operating systems are using ccache
in building their systems, but many FreeBSD users do so because the time
savings are so great.  When one can cut a build time of six hours to, say,
an hour and a half, one tends to appreciate the tool(s) that make(s) it
possible.  I.e., we use and love ccache because, in general, it works so well
and improves performance so much.
 However, compiling an operating system means a pretty large cache area
is needed if one is to fit the working set within the cache.  Similarly,
FreeBSD users who compile third-party software from the ports tree, rather
than installing it from prebuilt packages, potentially need an even larger
cache area whose size roughly depends upon the number and size of the ports
built and installed onto their systems.  For example, I currently have over
2300 ports installed, which should make clear the reason my ports cache area
is so large.  Large cache areas take a long time for the "cleanups" to run.
(FWIW, I use cache_dir_levels = 5, which may not be optimal in terms of
performance.  I don't have a good way of determining the optimal depth to
use for the cache directory trees.  It seems to be very, very fast for use
in building things, but may well be a killer for cleanups.)

> hard drive mercilessly"). However, I'm not at all convinced that
> introducing a centralized index is the panacea you describe.

 Countless data base software implementations handle these situations
acceptably well.
>
> Do you have a sketch design of how to maintain a centralized index? Here

 Well, sort of.  I.e., I haven't written up a design spec or anything
of that sort, but some things seem rather obvious.

> are some requirements to consider for the design:
>
> A. It should cope with a ccache process being killed at any time.

 Sure.

> B. It should work reasonably well on flaky and/or slow file systems,
>e.g. NFS.

 No, not at all.  Using a file system as data base software is usually
a Very Bad Idea (tm).

> C. It should not introduce lock contention for reasonable use cases.
> D. It should be quick for cache misses (not only for cleanup).
> E. It should handle cleanup quickly and gracefully.

 In my view, the above are misconceived in the sense that they are
predicated upon the use of file system code as data base software.
>
> I'm guessing that you envision having one centralized lock for the
> index. The tiny stats files already suffer from lock contention in some
> scenarios because they are so few. That's why ideas like
> https://github.com/ccache/ccache/issues/168 and comments like
> https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html
> (comment number 2) pop up. Even if a centralized index only needs a lock
> for writing, it would still serialize writes to the cache. I have
> trouble seeing how that would work out well. But I'll gladly be proved
> wrong.
>
 Try this on for size for a moment.  Imagine the software as two programs,
ccache and ccached.  ccache would contain all the current code analysis and
comparison (including hashes) stuff that it currently has, but it would make
a connection via UDP or TCP to the other program, which we will call ccached,
to access the cache data base.  Modern data base software packages do very
well at handling multiple, simultaneous clients, atomic commission of updates,
multiple indices, and so forth.
 Now, keep in mind that this "ccached" might be a specialized program
linked to data base software or it might simply be a generic data base server.
Multiple caches (in the current sense) might be maintained as separate data
bases, either through a single server instance or as multiple, discrete server
processes, depending upon the software chosen for the purpose, but the
server(s) would be accessed by potentially many concurrent 

Re: [ccache] why is limit_multiple ignored?

2018-01-16 Thread Joel Rosdahl via ccache
On 7 January 2018 at 14:02, Scott Bennett wrote:

> The design problem is that there is no centralized index maintained of
> cache entries' paths, their sizes, and their timestamps, necessitating
> the plumbing of the directory trees. [...]

Thanks for sharing your ideas!

I fully agree that the cleanup algorithm/design hasn't aged well. It has
essentially stayed the same since Tridge created ccache in 2002, when
storage devices were much smaller and a cache of one GB or two probably
was considered quite large.

Trying to improve the cleanup algorithm/design has not been a priority
since I personally haven't seen such pathological behavior that you
describe ("cleanups that can take over a half hour to run and hammer a
hard drive mercilessly"). However, I'm not at all convinced that
introducing a centralized index is the panacea you describe.

Do you have a sketch design of how to maintain a centralized index? Here
are some requirements to consider for the design:

A. It should cope with a ccache process being killed at any time.
B. It should work reasonably well on flaky and/or slow file systems,
   e.g. NFS.
C. It should not introduce lock contention for reasonable use cases.
D. It should be quick for cache misses (not only for cleanup).
E. It should handle cleanup quickly and gracefully.

I'm guessing that you envision having one centralized lock for the
index. The tiny stats files already suffer from lock contention in some
scenarios because they are so few. That's why ideas like
https://github.com/ccache/ccache/issues/168 and comments like
https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html
(comment number 2) pop up. Even if a centralized index only needs a lock
for writing, it would still serialize writes to the cache. I have
trouble seeing how that would work out well. But I'll gladly be proved
wrong.

For reference: When updating the stats files, the current method is to
acquire a lock, write the new content to a temporary file, rename the
temporary file to the target file and release the lock. Writing the full
content to a temporary file and renaming it to the target solves A and
B, and having 16 files instead of 1 improves on C. Having no index
trivially solves D. (And E is not solved well at all.)

> The lack of a centralized index can also result in cache evictions
> that are not actually LRU.

Not having true LRU eviction doesn't bother me at all. I think that it's
a very reasonable trade-off to have "approximate LRU eviction" if the
performance is better and/or the implementation is easier.

> Where does the hysteresis of (0.9-0.8)max_size=0.1*max_size come from?

When the cache has filled up at least once, the fill grade of one of the
16 subdirectories is a random variable between 0.8 and 1.0 with uniform
distribution, so the probability of the total size of the 16
subdirectories is approximately a normal distribution with 0.9 as the
mean. In other words, it's likely that the cache size is around 0.9 and
much less likely that it's near 0.8 or 1.0. For serial usage of ccache,
that is.

> What I've seen is that the cleanups are usually triggered by
> 0.8*max_size, and that does not change when I set limit_multiple =
> 0.95.

As already explained, nothing is triggered at 0.8*max_size or even at
limit_multiple*max_size, so the reason for your 24 GB cache is something
else. And that something else is most likely that when several ccache
invocations trigger a cleanup of the same subdirectory at the same time,
the net effect will be removal of more than
(1-limit_multiple)*max_size/16, potentially much more. I bet that if you
run something like "du $CCACHE_DIR/$X" for each X in [0-9a-f], or just
count the number of files in each subdirectory, you'll see some
subdirectories that are much smaller than limit_multiple*max_size/16 but
some that are near max_size/16.

***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***

During a couple of recent walks with my daughter in the stroller, I've
been thinking more about how to improve ccache's cleanup. I think that I
have come up with something that will be significantly better, but I
don't have time to describe any details right now. Stay tuned.

-- Joel

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] why is limit_multiple ignored?

2018-01-07 Thread Scott Bennett via ccache
 Joel Rosdahl  wrote:

> On 19 December 2017 at 02:16, Scott Bennett via ccache <
> ccache@lists.samba.org> wrote:
>
Hi Joel,
 Sorry about the delay in responding.  I've been off-line for about a
week and a half and may be again shortly.

> >  I set "limit_multiple = 0.95" in ccache.conf and "max_size = 30.0G"
> > in ccache.conf, but cleanups are triggered when space usage reaches 24 GB,
> > which is the default of 0.8.  Why is this happening with ccache 3.3.4?
> >
>
> The ccache manual is not very good at describing what actually happens at
> cleanup. I'll try to improve it.
>
> Here's how cleanup works: After a cache miss, ccache stores the object file
> in (a subdirectory of) one of the 16 top level directories in the cache
> (0-9, a-f). It then checks if that top level directory holds more than
> max_cache_size/16 bytes (and similar for max_files). If yes, ccache removes
> files from that top level directory until it contains at most
> limit_multiple*max_cache_size/16 bytes. This means that if limit_multiple

 The design problem is that there is no centralized index maintained of
cache entries' paths, their sizes, and their timestamps, necessitating the
plumbing of the directory trees.  This very time-consuming task should only
be required when a ccache user determines that the cache is internally
inconsistent somehow, e.g., by having one or more damaged entries, having
erroneous statistics, or by being out of step with the index.  It should not
be part of an ordinary cache eviction procedure.  A command to run a
consistency check/repair should not do any cache evictions based upon space,
which would be done by the next actual use of ccache anyway, but rather only
if the files involved are part(s) of a damaged cache entry.  The overhead of
maintaining the index should be minor, especially when compared to the
current cleanups that can take over a half hour to run and hammer a hard
drive mercilessly.  (A centralized index should also include the total space
in use.)  The lack of a centralized index can also result in cache evictions
that are not actually LRU.  The kludge of using 16 caches instead of a
single, unified cache would be unnecessary with a centralized index as well.
The index would be used to go directly to each file to be deleted without
the need for a directory tree search.  Cleanups ought to be much faster.
Note that some sort of short-term lock would need to be used for updating
the index, too, but the same is already true for the
$CCACHE_DIR/[0-9a-f]/stats files.

> is 0.8, the total cache size is expected to hover around 0.9*max_cache_size
> when it has filled up. But due to the pseudo-randomness of the hash

 Where does the hysteresis of (0.9-0.8)max_size=0.1*max_size come from?

> algorithm, the cache size can be closer to 0.8*max_cache_size or
> 1.0*max_cache_size.
>
> The above should be true for any serial usage of ccache. However, ccache is
> of course very often called in parallel, and then there is a race condition
> since several ccache processes that have stored an object to the same top
> level directory may start the cleanup process simultaneously. Since
> performing cleanup in a large cache with a low limit_multiple can take a
> lot of time, more ccache processes may start to perform cleanup of the same
> directory. The race can lead to the final cache size being below
> limit_multiple*max_cache_size, perhaps very much so. This is a known
> problem. We have had some ideas to improve the admittedly naive cleanup
> logic, but nothing has been done yet.

 That problem, at least, seems relatively straightforward to fix.  First,
only one cleanup need be done in such situations, so a lock should be tested
and set by the first ccache process that decides a cleanup is necessary.  All
later comers should be delayed until that cleanup completes, but then those
others should proceed without also doing cleanups.  Their decisions in favor
of a cleanup are out of date once the cleanup run completes, so they should
just skip any cleanups themselves or at least retest the size of what they
need to store plus the current cache size against max_size to make a fresh
decision.
>
> Maybe the above described problem is why you get a 24 GB cache size?

 See discussion below.
>
> Or maybe you ran "ccache -c"? Unlike what the manual indicates, "ccache -c"

 No, it was automatically triggered.

> will delete files until each top level directory holds at most
> limit_multiple*max_size/16...
>
> why is limit_multiple ignored?
>
>
> It isn't. Or don't you see a difference if you e.g. set it to 0.5?
>
 I haven't tried that.  The caches I have represent a lot of CPU time
and elapsed time, especially given that I have compression turned on, so
I'm not thrilled at the idea of throwing nearly half a cache away just to
try it out.  What I've seen is that the cleanups are usually triggered by
0.8*max_size, and that does not change when I set 

Re: [ccache] why is limit_multiple ignored?

2018-01-04 Thread Joel Rosdahl via ccache
Hi Scott,

On 19 December 2017 at 02:16, Scott Bennett via ccache <
ccache@lists.samba.org> wrote:

>  I set "limit_multiple = 0.95" in ccache.conf and "max_size = 30.0G"
> in ccache.conf, but cleanups are triggered when space usage reaches 24 GB,
> which is the default of 0.8.  Why is this happening with ccache 3.3.4?
>

The ccache manual is not very good at describing what actually happens at
cleanup. I'll try to improve it.

Here's how cleanup works: After a cache miss, ccache stores the object file
in (a subdirectory of) one of the 16 top level directories in the cache
(0-9, a-f). It then checks if that top level directory holds more than
max_cache_size/16 bytes (and similar for max_files). If yes, ccache removes
files from that top level directory until it contains at most
limit_multiple*max_cache_size/16 bytes. This means that if limit_multiple
is 0.8, the total cache size is expected to hover around 0.9*max_cache_size
when it has filled up. But due to the pseudo-randomness of the hash
algorithm, the cache size can be closer to 0.8*max_cache_size or
1.0*max_cache_size.

The above should be true for any serial usage of ccache. However, ccache is
of course very often called in parallel, and then there is a race condition
since several ccache processes that have stored an object to the same top
level directory may start the cleanup process simultaneously. Since
performing cleanup in a large cache with a low limit_multiple can take a
lot of time, more ccache processes may start to perform cleanup of the same
directory. The race can lead to the final cache size being below
limit_multiple*max_cache_size, perhaps very much so. This is a known
problem. We have had some ideas to improve the admittedly naive cleanup
logic, but nothing has been done yet.

Maybe the above described problem is why you get a 24 GB cache size?

Or maybe you ran "ccache -c"? Unlike what the manual indicates, "ccache -c"
will delete files until each top level directory holds at most
limit_multiple*max_size/16...

why is limit_multiple ignored?


It isn't. Or don't you see a difference if you e.g. set it to 0.5?

-- Joel
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache