Re: [ccache] why is limit_multiple ignored?
On 2018-01-29 at 18:59, Joel Rosdahl via ccache wrote:
> On 29 January 2018 at 07:14, Scott Bennett wrote:
>
>> Countless data base software implementations handle these situations
>> acceptably well.
>
> Sigh.
>
> I see. You're talking about a completely different model than what
> ccache currently uses, which was not clear to me when I read your
> initial description. What you don't seem to understand, or choose to
> ignore, is that ccache can't stop supporting the simple server-less
> file-based model since that would drop support for two important use
> cases:
>
> 1. Using ccache on a personal account without having access to a system
>    service and without having to start a personal server.
> 2. Using a shared cache on NFS.
>
> (For case 1, it would be feasible with a model where the client starts a
> server on demand, unless the cache is shared.)
>
> Why are they important? Simply because people have used ccache like that
> for many years.

We have found that use case 2 needs extending, to add a local cache in front. Then the NFS store is only used as a "secondary cache"... But this can be done without changing very much in ccache, and I know I said that we would post some code for that. Will try again. :-)

> It would certainly be possible to add optional client-server backends,
> and for them I fully agree that using a centralized index is obvious,
> but that's just an entirely separate discussion as I see it.
>
> If we would drop support for simple file-based caches, then it would no
> longer be ccache but something else. Which would be fine, but that's
> another project. (It could of course be ccache version 4, but then I
> expect that ccache version 3 would continue living its own life, so it
> would be a separate project in practice.)

We already have such an implementation, which works fine - in parallel. We only have to complete the work of merging the dev/memcached branch? By starting a memcached daemon, we get the required client/server setup. And by using "moxi", we can have that daemon talk to a "real" database.

>> Very possibly you have the requisite knowledge/experience yourself.
>
> Actually yes, so you could have saved both your and my time by just
> asking something like "Have you considered using a client-server model,
> perhaps using a standard database, instead of a file-based cache?"
> instead of trying to educate me.
>
>> To modify ccache to use data base software is admittedly a major
>> rewriting job, so I expect such an idea to put you off, but it's a
>> project that should ultimately yield a far superior product, IMO.
>
> I don't disagree, but as I said, that would be another project, and I
> have neither time nor interest in that personally.

As mentioned above, it is very possible to use Couchbase for this. It is an open source NoSQL database, from https://couchbase.com

Since ccache talks the standard memcached protocol directly, it doesn't even have to know that it is using a standard database!

https://www.couchbase.com/memcached
https://github.com/ccache/ccache/tree/dev/memcached

/Anders

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache
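The "local cache in front of a secondary store" idea from the message above could look roughly like the following. This is only an illustrative sketch, not code from ccache or from the dev/memcached branch: the class name `TwoTierCache` and its methods are hypothetical, and two plain directories stand in for the local cache and the NFS store.

```python
import os
import shutil


class TwoTierCache:
    """Hypothetical two-level lookup: local directory first, shared store second."""

    def __init__(self, local_dir, secondary_dir):
        self.local_dir = local_dir
        self.secondary_dir = secondary_dir
        os.makedirs(local_dir, exist_ok=True)
        os.makedirs(secondary_dir, exist_ok=True)

    def get(self, key):
        """Return the path of the cached entry, or None on a miss in both tiers."""
        local_path = os.path.join(self.local_dir, key)
        if os.path.exists(local_path):
            return local_path
        secondary_path = os.path.join(self.secondary_dir, key)
        if os.path.exists(secondary_path):
            # Promote the entry so that the next lookup is served locally.
            shutil.copyfile(secondary_path, local_path)
            return local_path
        return None

    def put(self, key, data):
        """Store the entry in both tiers."""
        for directory in (self.local_dir, self.secondary_dir):
            with open(os.path.join(directory, key), "wb") as f:
                f.write(data)
```

With this layout the slow store is only touched on a local miss or on a store, which is the sense in which it acts as a "secondary cache".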
Re: [ccache] why is limit_multiple ignored?
On 29 January 2018 at 07:14, Scott Bennett wrote:
> Countless data base software implementations handle these situations
> acceptably well.

Sigh.

I see. You're talking about a completely different model than what ccache currently uses, which was not clear to me when I read your initial description. What you don't seem to understand, or choose to ignore, is that ccache can't stop supporting the simple server-less file-based model since that would drop support for two important use cases:

1. Using ccache on a personal account without having access to a system service and without having to start a personal server.
2. Using a shared cache on NFS.

(For case 1, it would be feasible with a model where the client starts a server on demand, unless the cache is shared.)

Why are they important? Simply because people have used ccache like that for many years.

It would certainly be possible to add optional client-server backends, and for them I fully agree that using a centralized index is obvious, but that's just an entirely separate discussion as I see it.

If we would drop support for simple file-based caches, then it would no longer be ccache but something else. Which would be fine, but that's another project. (It could of course be ccache version 4, but then I expect that ccache version 3 would continue living its own life, so it would be a separate project in practice.)

> (FWIW, I use cache_dir_levels = 5, which may not be optimal in terms
> of performance. I don't have a good way of determining the optimal
> depth to use for the cache directory trees. It seems to be very, very
> fast for use in building things, but may well be a killer for
> cleanups.)

Let's see. cache_dir_levels = 5 means 16⁵ ≈ 1 million directories on the lowest level. A large cache might hold, say, 10 million files? Then 10 files per directory is clearly not optimal. How many files do you have in your cache?

I think that a good rule of thumb would be to store a couple of thousand or tens of thousands of files per directory, depending on the file system characteristics. That would mean that cache_dir_levels = 3 would be enough even for very large caches. Perhaps lowering cache_dir_levels could partly solve the bad cleanup performance you have?

> Very possibly you have the requisite knowledge/experience yourself.

Actually yes, so you could have saved both your and my time by just asking something like "Have you considered using a client-server model, perhaps using a standard database, instead of a file-based cache?" instead of trying to educate me.

> To modify ccache to use data base software is admittedly a major
> rewriting job, so I expect such an idea to put you off, but it's a
> project that should ultimately yield a far superior product, IMO.

I don't disagree, but as I said, that would be another project, and I have neither time nor interest in that personally.

-- Joel
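The directory-count arithmetic in the message above is easy to tabulate. The short Python sketch below assumes, as the message does, 16 subdirectories per level and a cache of 10 million files spread evenly; the function name `files_per_directory` is made up for this illustration.

```python
def files_per_directory(total_files, cache_dir_levels):
    """Average number of files per lowest-level directory (16 subdirs per level)."""
    return total_files / 16 ** cache_dir_levels


# Print: levels, number of lowest-level directories, average files per directory.
for levels in range(1, 6):
    print(levels, 16 ** levels, round(files_per_directory(10_000_000, levels), 1))
```

For 10 million files this gives roughly 2,441 files per directory at cache_dir_levels = 3 and roughly 9.5 at cache_dir_levels = 5, which matches the rule of thumb above.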
Re: [ccache] why is limit_multiple ignored?
Joel Rosdahl wrote:

> On 7 January 2018 at 14:02, Scott Bennett wrote:
>
> > The design problem is that there is no centralized index maintained of
> > cache entries' paths, their sizes, and their timestamps, necessitating
> > the plumbing of the directory trees. [...]
>
> Thanks for sharing your ideas!

You may wish to retract any thanks once you've read what follows. The current independence of ccache from any other third-party software is valued and for good reasons. However, I hope to show below a better way to do things. That independence can still be maintained, but only at the cost of another wheel reinvention. :-(

>
> I fully agree that the cleanup algorithm/design hasn't aged well. It has
> essentially stayed the same since Tridge created ccache in 2002, when
> storage devices were much smaller and a cache of one GB or two probably
> was considered quite large.
>
> Trying to improve the cleanup algorithm/design has not been a priority
> since I personally haven't seen such pathological behavior that you
> describe ("cleanups that can take over a half hour to run and hammer a

I don't know whether users of other operating systems are using ccache in building their systems, but many FreeBSD users do so because the time savings are so great. When one can cut a build time of six hours to, say, an hour and a half, one tends to appreciate the tool(s) that make(s) it possible. I.e., we use and love ccache because, in general, it works so well and improves performance so much.

However, compiling an operating system means a pretty large cache area is needed if one is to fit the working set within the cache. Similarly, FreeBSD users who compile third-party software from the ports tree, rather than installing it from prebuilt packages, potentially need an even larger cache area whose size roughly depends upon the number and size of the ports built and installed onto their systems. For example, I currently have over 2300 ports installed, which should make clear the reason my ports cache area is so large. Large cache areas take a long time for the "cleanups" to run. (FWIW, I use cache_dir_levels = 5, which may not be optimal in terms of performance. I don't have a good way of determining the optimal depth to use for the cache directory trees. It seems to be very, very fast for use in building things, but may well be a killer for cleanups.)

> hard drive mercilessly"). However, I'm not at all convinced that
> introducing a centralized index is the panacea you describe.

Countless data base software implementations handle these situations acceptably well.

>
> Do you have a sketch design of how to maintain a centralized index? Here

Well, sort of. I.e., I haven't written up a design spec or anything of that sort, but some things seem rather obvious.

> are some requirements to consider for the design:
>
> A. It should cope with a ccache process being killed at any time.

Sure.

> B. It should work reasonably well on flaky and/or slow file systems,
>    e.g. NFS.

No, not at all. Using a file system as data base software is usually a Very Bad Idea (tm).

> C. It should not introduce lock contention for reasonable use cases.
> D. It should be quick for cache misses (not only for cleanup).
> E. It should handle cleanup quickly and gracefully.

In my view, the above are misconceived in the sense that they are predicated upon the use of file system code as data base software.

>
> I'm guessing that you envision having one centralized lock for the
> index. The tiny stats files already suffer from lock contention in some
> scenarios because they are so few. That's why ideas like
> https://github.com/ccache/ccache/issues/168 and comments like
> https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html
> (comment number 2) pop up. Even if a centralized index only needs a lock
> for writing, it would still serialize writes to the cache. I have
> trouble seeing how that would work out well. But I'll gladly be proved
> wrong.
>

Try this on for size for a moment. Imagine the software as two programs, ccache and ccached. ccache would contain all the current code analysis and comparison (including hashes) stuff that it currently has, but it would make a connection via UDP or TCP to the other program, which we will call ccached, to access the cache data base. Modern data base software packages do very well at handling multiple, simultaneous clients, atomic commission of updates, multiple indices, and so forth.

Now, keep in mind that this "ccached" might be a specialized program linked to data base software or it might simply be a generic data base server. Multiple caches (in the current sense) might be maintained as separate data bases, either through a single server instance or as multiple, discrete server processes, depending upon the software chosen for the purpose, but the server(s) would be accessed by potentially many concurrent ccache processes and c
Re: [ccache] why is limit_multiple ignored?
On 7 January 2018 at 14:02, Scott Bennett wrote:
> The design problem is that there is no centralized index maintained of
> cache entries' paths, their sizes, and their timestamps, necessitating
> the plumbing of the directory trees. [...]

Thanks for sharing your ideas!

I fully agree that the cleanup algorithm/design hasn't aged well. It has essentially stayed the same since Tridge created ccache in 2002, when storage devices were much smaller and a cache of one GB or two probably was considered quite large.

Trying to improve the cleanup algorithm/design has not been a priority since I personally haven't seen such pathological behavior that you describe ("cleanups that can take over a half hour to run and hammer a hard drive mercilessly"). However, I'm not at all convinced that introducing a centralized index is the panacea you describe.

Do you have a sketch design of how to maintain a centralized index? Here are some requirements to consider for the design:

A. It should cope with a ccache process being killed at any time.
B. It should work reasonably well on flaky and/or slow file systems, e.g. NFS.
C. It should not introduce lock contention for reasonable use cases.
D. It should be quick for cache misses (not only for cleanup).
E. It should handle cleanup quickly and gracefully.

I'm guessing that you envision having one centralized lock for the index. The tiny stats files already suffer from lock contention in some scenarios because they are so few. That's why ideas like https://github.com/ccache/ccache/issues/168 and comments like https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html (comment number 2) pop up. Even if a centralized index only needs a lock for writing, it would still serialize writes to the cache. I have trouble seeing how that would work out well. But I'll gladly be proved wrong.

For reference: When updating the stats files, the current method is to acquire a lock, write the new content to a temporary file, rename the temporary file to the target file and release the lock. Writing the full content to a temporary file and renaming it to the target solves A and B, and having 16 files instead of 1 improves on C. Having no index trivially solves D. (And E is not solved well at all.)

> The lack of a centralized index can also result in cache evictions
> that are not actually LRU.

Not having true LRU eviction doesn't bother me at all. I think that it's a very reasonable trade-off to have "approximate LRU eviction" if the performance is better and/or the implementation is easier.

> Where does the hysteresis of (0.9-0.8)max_size=0.1*max_size come from?

When the cache has filled up at least once, the fill grade of each of the 16 subdirectories is a random variable between 0.8 and 1.0 with uniform distribution, so the total size of the 16 subdirectories approximately follows a normal distribution with 0.9*max_size as the mean. In other words, it's likely that the cache size is around 0.9*max_size and much less likely that it's near 0.8*max_size or 1.0*max_size. For serial usage of ccache, that is.

> What I've seen is that the cleanups are usually triggered by
> 0.8*max_size, and that does not change when I set limit_multiple =
> 0.95.

As already explained, nothing is triggered at 0.8*max_size or even at limit_multiple*max_size, so the reason for your 24 GB cache is something else. And that something else is most likely that when several ccache invocations trigger a cleanup of the same subdirectory at the same time, the net effect will be removal of more than (1-limit_multiple)*max_size/16, potentially much more. I bet that if you run something like "du $CCACHE_DIR/$X" for each X in [0-9a-f], or just count the number of files in each subdirectory, you'll see some subdirectories that are much smaller than limit_multiple*max_size/16 but some that are near max_size/16.
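
The per-subdirectory check suggested above can also be scripted. The following Python sketch is just one way to do it and is not part of ccache: the function name `subdir_sizes` is made up, the fallback to `~/.ccache` when `CCACHE_DIR` is unset is an assumption, and it sums apparent file sizes, so the numbers will differ somewhat from what `du` reports as disk usage.

```python
import os


def subdir_sizes(cache_dir):
    """Return {subdir: (total_bytes, file_count)} for the 16 top-level directories."""
    result = {}
    for name in "0123456789abcdef":
        total_bytes = 0
        file_count = 0
        # os.walk yields nothing if the directory does not exist.
        for root, _dirs, files in os.walk(os.path.join(cache_dir, name)):
            for filename in files:
                try:
                    total_bytes += os.path.getsize(os.path.join(root, filename))
                    file_count += 1
                except OSError:
                    # The file may have been removed by a concurrent cleanup.
                    pass
        result[name] = (total_bytes, file_count)
    return result


if __name__ == "__main__":
    # Assumption: fall back to ~/.ccache when CCACHE_DIR is not set.
    cache_dir = os.environ.get("CCACHE_DIR", os.path.expanduser("~/.ccache"))
    for name, (size, count) in subdir_sizes(cache_dir).items():
        print(f"{name}: {size / 2**30:.2f} GiB in {count} files")
```

Subdirectories that sit far below limit_multiple*max_size/16 while others are close to max_size/16 would be consistent with the concurrent-cleanup explanation.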

*** *** *** *** *** *** *** *** *** *** *** *** ***

During a couple of recent walks with my daughter in the stroller, I've been thinking more about how to improve ccache's cleanup. I think that I have come up with something that will be significantly better, but I don't have time to describe any details right now. Stay tuned.

-- Joel
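
The stats-file update procedure described in the message above (acquire a lock, write the new content to a temporary file, rename it over the target, release the lock) can be illustrated with a minimal Python sketch. It is not ccache's code: the function name `update_stats_file`, the `.lock` suffix, the whitespace-separated counter format and the use of `fcntl.flock` are all assumptions made for illustration. `flock` is Unix-only and its behaviour on NFS varies, so the sketch only demonstrates the write-then-rename pattern.

```python
import fcntl
import os
import tempfile


def update_stats_file(path, update):
    """Lock, rewrite via a temporary file, rename over the target, unlock."""
    lock_path = path + ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # released when lock_file is closed
        try:
            with open(path) as f:
                counters = [int(x) for x in f.read().split()]
        except FileNotFoundError:
            counters = []
        new_counters = update(counters)
        # Create the temporary file next to the target so that the rename
        # stays within one file system.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as tmp:
            tmp.write(" ".join(str(c) for c in new_counters) + "\n")
        # Atomic on POSIX: readers see either the old or the new file.
        os.replace(tmp_path, path)
```

A process killed before the rename leaves at most a stray temporary file behind and never a half-written stats file, which is the property requirement A asks for.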
Re: [ccache] why is limit_multiple ignored?
Joel Rosdahl wrote:

> On 19 December 2017 at 02:16, Scott Bennett via ccache <ccache@lists.samba.org> wrote:
>

Hi Joel,

Sorry about the delay in responding. I've been off-line for about a week and a half and may be again shortly.

> > I set "limit_multiple = 0.95" in ccache.conf and "max_size = 30.0G"
> > in ccache.conf, but cleanups are triggered when space usage reaches 24 GB,
> > which is the default of 0.8. Why is this happening with ccache 3.3.4?
> >
>
> The ccache manual is not very good at describing what actually happens at
> cleanup. I'll try to improve it.
>
> Here's how cleanup works: After a cache miss, ccache stores the object file
> in (a subdirectory of) one of the 16 top level directories in the cache
> (0-9, a-f). It then checks if that top level directory holds more than
> max_cache_size/16 bytes (and similar for max_files). If yes, ccache removes
> files from that top level directory until it contains at most
> limit_multiple*max_cache_size/16 bytes. This means that if limit_multiple

The design problem is that there is no centralized index maintained of cache entries' paths, their sizes, and their timestamps, necessitating the plumbing of the directory trees. This very time-consuming task should only be required when a ccache user determines that the cache is internally inconsistent somehow, e.g., by having one or more damaged entries, having erroneous statistics, or by being out of step with the index. It should not be part of an ordinary cache eviction procedure. A command to run a consistency check/repair should not do any cache evictions based upon space, which would be done by the next actual use of ccache anyway, but rather only if the files involved are part(s) of a damaged cache entry. The overhead of maintaining the index should be minor, especially when compared to the current cleanups that can take over a half hour to run and hammer a hard drive mercilessly. (A centralized index should also include the total space in use.)

The lack of a centralized index can also result in cache evictions that are not actually LRU. The kludge of using 16 caches instead of a single, unified cache would be unnecessary with a centralized index as well. The index would be used to go directly to each file to be deleted without the need for a directory tree search. Cleanups ought to be much faster. Note that some sort of short-term lock would need to be used for updating the index, too, but the same is already true for the $CCACHE_DIR/[0-9a-f]/stats files.

> is 0.8, the total cache size is expected to hover around 0.9*max_cache_size
> when it has filled up. But due to the pseudo-randomness of the hash

Where does the hysteresis of (0.9-0.8)max_size=0.1*max_size come from?

> algorithm, the cache size can be closer to 0.8*max_cache_size or
> 1.0*max_cache_size.
>
> The above should be true for any serial usage of ccache. However, ccache is
> of course very often called in parallel, and then there is a race condition
> since several ccache processes that have stored an object to the same top
> level directory may start the cleanup process simultaneously. Since
> performing cleanup in a large cache with a low limit_multiple can take a
> lot of time, more ccache processes may start to perform cleanup of the same
> directory. The race can lead to the final cache size being below
> limit_multiple*max_cache_size, perhaps very much so. This is a known
> problem. We have had some ideas to improve the admittedly naive cleanup
> logic, but nothing has been done yet.

That problem, at least, seems relatively straightforward to fix. First, only one cleanup need be done in such situations, so a lock should be tested and set by the first ccache process that decides a cleanup is necessary. All later comers should be delayed until that cleanup completes, but then those others should proceed without also doing cleanups. Their decisions in favor of a cleanup are out of date once the cleanup run completes, so they should just skip any cleanups themselves or at least retest the size of what they need to store plus the current cache size against max_size to make a fresh decision.

>
> Maybe the above described problem is why you get a 24 GB cache size?

See discussion below.

>
> Or maybe you ran "ccache -c"? Unlike what the manual indicates, "ccache -c"

No, it was automatically triggered.

> will delete files until each top level directory holds at most
> limit_multiple*max_size/16...
>
> > why is limit_multiple ignored?
> >
> It isn't. Or don't you see a difference if you e.g. set it to 0.5?
>

I haven't tried that. The caches I have represent a lot of CPU time and elapsed time, especially given that I have compression turned on, so I'm not thrilled at the idea of throwing nearly half a cache away just to try it out. What I've seen is that the cleanups are usually triggered by 0.8*max_size, and that does not change when I set limit_multiple = 0.95. 0.95*max
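
The "one cleaner, everybody else waits and then re-tests" idea proposed in the message above might be sketched as follows. This is a hypothetical illustration rather than anything ccache implements: the function name `maybe_cleanup`, the caller-supplied `current_size` and `run_cleanup` callables, and the use of `fcntl.flock` (Unix-only) are all assumptions.

```python
import fcntl


def maybe_cleanup(lock_path, current_size, max_size, run_cleanup):
    """Run at most one cleanup at a time; latecomers wait, then re-test the size.

    current_size and run_cleanup are caller-supplied callables (hypothetical).
    Returns True if this process performed the cleanup.
    """
    if current_size() <= max_size:
        return False
    with open(lock_path, "w") as lock_file:
        # Blocks while another process holds the lock, i.e. while it is cleaning.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        # The earlier decision may be stale by now, so make a fresh one.
        if current_size() <= max_size:
            return False
        run_cleanup()
        return True
```

The second size test is what keeps a queue of waiting processes from each repeating the cleanup once the first one has finished.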
Re: [ccache] why is limit_multiple ignored?
Hi Scott,

On 19 December 2017 at 02:16, Scott Bennett via ccache <ccache@lists.samba.org> wrote:
> I set "limit_multiple = 0.95" in ccache.conf and "max_size = 30.0G"
> in ccache.conf, but cleanups are triggered when space usage reaches 24 GB,
> which is the default of 0.8. Why is this happening with ccache 3.3.4?

The ccache manual is not very good at describing what actually happens at cleanup. I'll try to improve it.

Here's how cleanup works: After a cache miss, ccache stores the object file in (a subdirectory of) one of the 16 top level directories in the cache (0-9, a-f). It then checks if that top level directory holds more than max_cache_size/16 bytes (and similar for max_files). If yes, ccache removes files from that top level directory until it contains at most limit_multiple*max_cache_size/16 bytes. This means that if limit_multiple is 0.8, the total cache size is expected to hover around 0.9*max_cache_size when it has filled up. But due to the pseudo-randomness of the hash algorithm, the cache size can be closer to 0.8*max_cache_size or 1.0*max_cache_size.

The above should be true for any serial usage of ccache. However, ccache is of course very often called in parallel, and then there is a race condition since several ccache processes that have stored an object to the same top level directory may start the cleanup process simultaneously. Since performing cleanup in a large cache with a low limit_multiple can take a lot of time, more ccache processes may start to perform cleanup of the same directory. The race can lead to the final cache size being below limit_multiple*max_cache_size, perhaps very much so. This is a known problem. We have had some ideas to improve the admittedly naive cleanup logic, but nothing has been done yet.

Maybe the above described problem is why you get a 24 GB cache size?

Or maybe you ran "ccache -c"? Unlike what the manual indicates, "ccache -c" will delete files until each top level directory holds at most limit_multiple*max_size/16...

> why is limit_multiple ignored?

It isn't. Or don't you see a difference if you e.g. set it to 0.5?

-- Joel
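
To make the cleanup rule described in this message concrete, here is a small Python simulation of the serial case. It is an idealized model and not ccache's implementation: every stored object has size 1, the hash is replaced by a seeded random choice among the 16 top-level directories, eviction lands exactly on the limit_multiple bound, and the function name `simulate_serial_cleanup` and its parameters are invented for the example.

```python
import random


def simulate_serial_cleanup(max_size=1600, limit_multiple=0.8, stores=100_000, seed=0):
    """Return the final cache size as a fraction of max_size."""
    rng = random.Random(seed)
    per_dir_max = max_size / 16
    sizes = [0.0] * 16
    for _ in range(stores):
        d = rng.randrange(16)        # stand-in for the hash picking a directory
        sizes[d] += 1                # store one object of size 1
        if sizes[d] > per_dir_max:   # only the directory just written is checked
            # Evict until the directory holds at most limit_multiple of its share.
            sizes[d] = limit_multiple * per_dir_max
    return sum(sizes) / max_size


print(simulate_serial_cleanup())
print(simulate_serial_cleanup(limit_multiple=0.95))
```

In this model each directory ends up somewhere between limit_multiple and 1.0 of its share once the cache has filled, so the printed fraction always falls between limit_multiple and 1.0. The parallel race described above is deliberately left out.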