sql based cache [was Re: [gentoo-portage-dev] Few things, which imho would make portage better]

Brian Harring Tue, 14 Mar 2006 16:29:22 -0800

On Tue, Mar 14, 2006 at 04:52:14PM +0200, tvali wrote:
> > You're talking about the cache, take a look at the cache subsystem and
> > write a mysql module for it. This will never become a default though (we
> > would get killed if portage starts to depend on mysql).
> 
> I think that it should not become default as mysql module, but if it
> is working, it should become default as "portable" sql module.
> 
> # emerge sqlite pysqlite
> 
> I haven't used sqlite, but it seems to be small and usable. I think
> that it should start with it.
> 
> I think that portage should *support* sql by default, but of course it
> should not be default before it's clear that many people like it and
> use it. What is imho more important is how to make one usable
> interface, which would cover both fs and sql portage db's so that
> development doesn't split into two products.


See the restrictions framework I've started:
http://gentooexperimental.org/~ferringb/blog/archives/2005-07.html#e2005-07-13T01_21_42.txt
http://gentooexperimental.org/~ferring/bzr/pkgcore/dev-notes/framework/restrictions

Short version: converting everything to sql internally sucks badly, since 
any file based backend would then have to parse (ad hoc) sql statements.  
Using sql directly in portage requires encapsulating the sql code so that 
rdbms syntax differences (REPLACE comes to mind) can be worked around.
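
A rough sketch of what that encapsulation could look like, just to make 
the point concrete.  The class name, table layout and dialect table below 
are made up for illustration (they are not what the sql_template in 2.1 
does), and only the sqlite path is actually exercised here:

```python
import sqlite3

# Hypothetical per-dialect statements. Table and column names are invented
# for this sketch; the mysql strings are listed only to show the syntax
# difference and are never executed here.
DIALECTS = {
    "sqlite": {
        "upsert": "INSERT OR REPLACE INTO cache (cpv, depend) VALUES (?, ?)",
        "select": "SELECT depend FROM cache WHERE cpv = ?",
    },
    "mysql": {
        "upsert": "REPLACE INTO cache (cpv, depend) VALUES (%s, %s)",
        "select": "SELECT depend FROM cache WHERE cpv = %s",
    },
}


class SqlCache:
    """Keeps dialect-specific sql in one place so callers never see it."""

    def __init__(self, conn, dialect):
        self.conn = conn
        self.sql = DIALECTS[dialect]

    def store(self, cpv, depend):
        # Insert the entry, overwriting any existing row for this cpv.
        cur = self.conn.cursor()
        cur.execute(self.sql["upsert"], (cpv, depend))
        self.conn.commit()

    def fetch(self, cpv):
        # Return the stored value, or None if the cpv is not cached.
        cur = self.conn.cursor()
        cur.execute(self.sql["select"], (cpv,))
        row = cur.fetchone()
        return row[0] if row else None
```

Callers only ever see store() and fetch(); swapping the rdbms means 
swapping the dialect entry, not touching the calling code.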

Re: an rdbms being faster than an on disk file db: it's only faster in 
certain cases.
With properly designed/coded backends, an rdbms is _only_ faster than a 
local file db when it's returning N records at once.

As to why adding an rdbms backend to stable is a bad idea right now: the 
problem is in querying.  You _could_ add a sql backend (pretty easy; 
2.1 ships with a sql_template and a sqlite backend from my earlier 
work), but it would actually be slower.  Portage does cache lookups 
individually.  Want the data for all bsdiff versions?  Portage does 
this:

keys = []
# cp_list() returns every version (cpv) of one package
for cpv in portdb.cp_list("dev-util/bsdiff"):
        # one cache lookup per version
        keys.append(portdb.aux_get(cpv, ["DEPEND"]))

Each lookup is a separate call; there is no way to leverage rdbms 
speed for N record returns if the calling api is (effectively) single 
row queries.
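
To illustrate the difference, here is a small sketch against an in-memory 
sqlite db.  The table layout and both function names are assumptions made 
up for this example, not portage code; the first function mimics the 
per-entry access pattern, the second asks for all N rows in one 
statement:

```python
import sqlite3


def make_demo_cache():
    # Hypothetical cache table, filled with made-up dependency strings.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cache (cpv TEXT PRIMARY KEY, depend TEXT)")
    conn.executemany(
        "INSERT INTO cache VALUES (?, ?)",
        [
            ("dev-util/bsdiff-4.2", "dep-a"),
            ("dev-util/bsdiff-4.3", "dep-b"),
        ],
    )
    return conn


def lookup_one_by_one(conn, cpvs):
    # N statements for N entries: the rdbms never sees the whole request.
    results = []
    for cpv in cpvs:
        row = conn.execute(
            "SELECT depend FROM cache WHERE cpv = ?", (cpv,)
        ).fetchone()
        results.append(row[0] if row else None)
    return results


def lookup_batched(conn, cpvs):
    # One statement for N entries.
    if not cpvs:
        return []
    marks = ",".join("?" * len(cpvs))
    rows = conn.execute(
        "SELECT cpv, depend FROM cache WHERE cpv IN (%s)" % marks, list(cpvs)
    ).fetchall()
    found = dict(rows)
    # Preserve the caller's ordering; uncached entries come back as None.
    return [found.get(cpv) for cpv in cpvs]
```

Both return the same data; only the second gives the db a chance to do 
the work in one pass.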

To fully leverage an rdbms backend, portage calls need to be 
restructured so that they deal in lists instead of individual 
elements.  For example, under the rewrite:

repository.match(atom("dev-util/bsdiff"))

Via that (and the restriction framework it uses) the api calls are 
designed so that an rdbms can shine: instead of N calls, the 
repository/cache backend can convert the restrictions into a sql 
statement and run _one_ search.
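
A toy version of that conversion, to show the shape of the idea.  The 
restriction classes, column names and match() helper here are 
hypothetical stand-ins, not the actual restriction framework api:

```python
import sqlite3
from dataclasses import dataclass

# Columns a restriction may refer to (hypothetical cache schema).
ALLOWED_FIELDS = {"category", "package", "version"}


@dataclass
class FieldEquals:
    """Restriction: the given field must equal the given value."""
    field: str
    value: str

    def to_sql(self):
        if self.field not in ALLOWED_FIELDS:
            raise ValueError("unknown field: %r" % self.field)
        return "%s = ?" % self.field, [self.value]


@dataclass
class AndRestriction:
    """Restriction: every sub-restriction must hold."""
    parts: list

    def to_sql(self):
        clauses, params = [], []
        for part in self.parts:
            clause, values = part.to_sql()
            clauses.append("(%s)" % clause)
            params.extend(values)
        return " AND ".join(clauses), params


def match(conn, restriction):
    # Translate the whole restriction tree once, then run a single query.
    where, params = restriction.to_sql()
    sql = "SELECT category, package, version FROM cache WHERE " + where
    return conn.execute(sql, params).fetchall()
```

A file based backend would walk the same restriction objects and test 
them against each entry instead, so no sql parsing is needed on that 
side.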

Finally... an rdbms still has problems.  If the repository isn't 'frozen' 
(i.e., it can regen its metadata, as all portage trees in stable 
currently can), you cannot rely on the cache backend for anything 
beyond random access lookups.

Why?

The cache holds dev-util/bsdiff-4.2 and dev-util/bsdiff-4.3, but not 
dev-util/bsdiff-4.4.  If you hand the query off to the cache backend, 
it'll return just those two when it should return all three.
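
The same scenario as a sketch.  Everything here (the function, the dict 
standing in for the cache, the regen callback) is a made-up illustration 
of the problem, not how portage implements it:

```python
def list_matches(tree_cpvs, cache, frozen, regen):
    """Return {cpv: metadata} for one package.

    tree_cpvs: versions that actually exist in the tree
    cache:     dict of cpv -> metadata (stand-in for the cache backend)
    frozen:    True if the cache is guaranteed complete
    regen:     hypothetical callback that regenerates metadata for a cpv
    """
    if frozen:
        # The cache is authoritative, so it can answer the whole query.
        return dict(cache)
    result = {}
    for cpv in tree_cpvs:
        # Non-frozen: the tree decides what exists; the cache is only
        # good for random access lookups, with regen filling the gaps.
        if cpv not in cache:
            cache[cpv] = regen(cpv)
        result[cpv] = cache[cpv]
    return result
```

Treating a non-frozen cache as authoritative silently drops bsdiff-4.4; 
walking the tree finds it, at the cost of going back to per-entry 
lookups.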

~harring
