Re: [ccache] why is limit_multiple ignored?

Joel Rosdahl via ccache Mon, 29 Jan 2018 10:00:54 -0800

On 29 January 2018 at 07:14, Scott Bennett <benn...@sdf.org> wrote:

> Countless data base software implementations handle these situations
> acceptably well.


Sigh.

I see. You're talking about a completely different model than what
ccache currently uses, which was not clear to me when I read your
initial description. What you seem not to understand, or choose to
ignore, is that ccache can't stop supporting the simple server-less
file-based model since that would drop support for two important use
cases:

1. Using ccache on a personal account without having access to a system
   service and without having to start a personal server.
2. Using a shared cache on NFS.

(For case 1, it would be feasible with a model where the client starts a
server on demand, unless the cache is shared.)

Why are they important? Simply because people have used ccache like that
for many years.

It would certainly be possible to add optional client-server backends,
and for them I fully agree that using a centralized index is obvious,
but that's just an entirely separate discussion as I see it.

If we dropped support for simple file-based caches, then it would no
longer be ccache but something else. Which would be fine, but that's
another project. (It could of course be ccache version 4, but then I
expect that ccache version 3 would continue living its own life, so it
would be a separate project in practice.)

> (FWIW, I use cache_dir_levels = 5, which may not be optimal in terms
> of performance. I don't have a good way of determining the optimal
> depth to use for the cache directory trees. It seems to be very, very
> fast for use in building things, but may well be a killer for
> cleanups.)

Let's see. cache_dir_levels = 5 means 16⁵ ≈ 1 million directories on the
lowest level. A large cache might hold, say, 10 million files? Then 10
files per directory is clearly not optimal. How many files do you have
in your cache?
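
If you don't have a number at hand, something like the following rough
sketch could produce one. It is not part of ccache; the function name
count_cache_files is made up for illustration, and it simply walks the
cache directory and counts every regular file it finds, including stats
files and anything else stored there.

```python
import os

def count_cache_files(cache_dir):
    """Return (total files, number of directories that hold at least one file).

    Hypothetical helper, not a ccache API: it walks the whole tree, so it
    is as slow as any other full traversal of a large cache.
    """
    n_files = 0
    n_dirs_with_files = 0
    for _root, _subdirs, files in os.walk(cache_dir):
        if files:
            n_dirs_with_files += 1
            n_files += len(files)
    return n_files, n_dirs_with_files
```

Dividing the first number by the second gives the average number of
files per populated directory.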

I think that a good rule of thumb would be to store a couple of thousand
or tens of thousands of files per directory, depending on the file
system characteristics. That would mean that cache_dir_levels = 3 would
be enough even for very large caches.
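
To make the arithmetic concrete, here is a small sketch. It assumes 16
subdirectories per level (as in the 16⁵ figure above) and that files
are spread evenly over the lowest-level directories; the function names
and the 10 million file count are only illustrative.

```python
def leaf_directories(levels, fanout=16):
    """Number of lowest-level directories for a given cache_dir_levels."""
    return fanout ** levels

def files_per_directory(total_files, levels, fanout=16):
    """Average files per lowest-level directory, assuming an even spread."""
    return total_files / leaf_directories(levels, fanout)

if __name__ == "__main__":
    total = 10_000_000  # illustrative cache size in files
    for levels in range(1, 6):
        print(levels, leaf_directories(levels),
              files_per_directory(total, levels))
```

With 10 million files, 3 levels gives 4096 directories and roughly 2400
files in each, while 5 levels gives about a million directories with
fewer than 10 files in each.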

Perhaps lowering cache_dir_levels could partly solve the bad cleanup
performance you have?

> Very possibly you have the requisite knowledge/experience yourself.

Actually yes, so you could have saved both your and my time by just
asking something like "Have you considered using a client-server model,
perhaps using a standard database, instead of a file-based cache?"
instead of trying to educate me.

> To modify ccache to use data base software is admittedly a major
> rewriting job, so I expect such an idea to put you off, but it's a
> project that should ultimately yield a far superior product, IMO.

I don't disagree, but as I said, that would be another project, and I
have neither the time nor the interest for that personally.

-- Joel

On 29 January 2018 at 07:14, Scott Bennett <benn...@sdf.org> wrote:
> Joel Rosdahl <j...@rosdahl.net> wrote:
>
>> On 7 January 2018 at 14:02, Scott Bennett wrote:
>>
>> > The design problem is that there is no centralized index maintained of
>> > cache entries' paths, their sizes, and their timestamps, necessitating
>> > the plumbing of the directory trees. [...]
>>
>> Thanks for sharing your ideas!
>
>      You may wish to retract any thanks once you've read what follows.  The
> current independence of ccache from any other third-party software is valued
> and for good reasons.  However, I hope to show below a better way to do
> things.  That independence can still be maintained, but only at the cost of
> another wheel reinvention. :-(
>>
>> I fully agree that the cleanup algorithm/design hasn't aged well. It has
>> essentially stayed the same since Tridge created ccache in 2002, when
>> storage devices were much smaller and a cache of one GB or two probably
>> was considered quite large.
>>
>> Trying to improve the cleanup algorithm/design has not been a priority
>> since I personally haven't seen such pathological behavior as you
>> describe ("cleanups that can take over a half hour to run and hammer a
>
>      I don't know whether users of other operating systems are using ccache
> in building their systems, but many FreeBSD users do so because the time
> savings are so great.  When one can cut a build time of six hours to, say,
> an hour and a half, one tends to appreciate the tool(s) that make(s) it
> possible.  I.e., we use and love ccache because, in general, it works so well
> and improves performance so much.
>      However, compiling an operating system means a pretty large cache area
> is needed if one is to fit the working set within the cache.  Similarly,
> FreeBSD users who compile third-party software from the ports tree, rather
> than installing it from prebuilt packages, potentially need an even larger
> cache area whose size roughly depends upon the number and size of the ports
> built and installed onto their systems.  For example, I currently have over
> 2300 ports installed, which should make clear the reason my ports cache area
> is so large.  Large cache areas take a long time for the "cleanups" to run.
> (FWIW, I use cache_dir_levels = 5, which may not be optimal in terms of
> performance.  I don't have a good way of determining the optimal depth to
> use for the cache directory trees.  It seems to be very, very fast for use
> in building things, but may well be a killer for cleanups.)
>
>> hard drive mercilessly"). However, I'm not at all convinced that
>> introducing a centralized index is the panacea you describe.
>
>      Countless data base software implementations handle these situations
> acceptably well.
>>
>> Do you have a sketch design of how to maintain a centralized index? Here
>
>      Well, sort of.  I.e., I haven't written up a design spec or anything
> of that sort, but some things seem rather obvious.
>
>> are some requirements to consider for the design:
>>
>> A. It should cope with a ccache process being killed at any time.
>
>      Sure.
>
>> B. It should work reasonably well on flaky and/or slow file systems,
>>    e.g. NFS.
>
>      No, not at all.  Using a file system as data base software is usually
> a Very Bad Idea (tm).
>
>> C. It should not introduce lock contention for reasonable use cases.
>> D. It should be quick for cache misses (not only for cleanup).
>> E. It should handle cleanup quickly and gracefully.
>
>      In my view, the above are misconceived in the sense that they are
> predicated upon the use of file system code as data base software.
>>
>> I'm guessing that you envision having one centralized lock for the
>> index. The tiny stats files already suffer from lock contention in some
>> scenarios because they are so few. That's why ideas like
>> https://github.com/ccache/ccache/issues/168 and comments like
>> https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html
>> (comment number 2) pop up. Even if a centralized index only needs a lock
>> for writing, it would still serialize writes to the cache. I have
>> trouble seeing how that would work out well. But I'll gladly be proved
>> wrong.
>>
>      Try this on for size for a moment.  Imagine the software as two programs,
> ccache and ccached.  ccache would contain all the current code analysis and
> comparison (including hashes) stuff that it currently has, but it would make
> a connection via UDP or TCP to the other program, which we will call ccached,
> to access the cache data base.  Modern data base software packages do very
> well at handling multiple, simultaneous clients, atomic commission of updates,
> multiple indices, and so forth.
>      Now, keep in mind that this "ccached" might be a specialized program
> linked to data base software or it might simply be a generic data base server.
> Multiple caches (in the current sense) might be maintained as separate data
> bases, either through a single server instance or as multiple, discrete server
> processes, depending upon the software chosen for the purpose, but the
> server(s) would be accessed by potentially many concurrent ccache processes
> and could deal with consistency/integrity issues at the cache-entry or
> cache-entry-element level.
>      Please don't ask me for a recommendation of particular data base software
> because I haven't the foggiest idea.  I haven't worked with a data base
> package since the early 1970s, although I did work considerably later with
> various software that today would be thought of as data base applications, but
> were not so thought of at the time, that used IBM's ISAM.  Back then, a data
> base typically involved many files and indices, all interlinked at the record
> level, so an access method like ISAM was not, by itself, sufficient to be
> called a data base, but it was sometimes a component of a data base.  Very
> often, though, people wrote their own data base access methods or bought a
> commercial data base package.  A data base was a more formal affair with every
> field defined in a data dictionary, etc., etc.  ccache needs nothing so
> complex, but you would need to consult someone familiar with each of the
> "modern" types of data base software available to decide which way to go.
> Very possibly you have the requisite knowledge/experience yourself.
>      To modify ccache to use data base software is admittedly a major
> rewriting job, so I expect such an idea to put you off, but it's a project
> that should ultimately yield a far superior product, IMO.  Those are my two
> bits' worth, and you are more than welcome to take shots at what I've written.
>
>
>                                   Scott Bennett, Comm. ASMELG, CFIAG
> **********************************************************************
> * Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
> *--------------------------------------------------------------------*
> * "A well regulated and disciplined militia, is at all times a good  *
> * objection to the introduction of that bane of all free governments *
> * -- a standing army."                                               *
> *    -- Gov. John Hancock, New York Journal, 28 January 1790         *
> **********************************************************************
>

_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache
