Simon Slavin-3 wrote: > > How much slower. Did you make an index SQLite could use for that query ? > Using distinct isn't slow, it is what I would expect but count(*) is incredibly fast, it is instantaneous no matter how large is the table. Yes, I tried with an index.
BareFeetWare-2 wrote: > > Or you can try this: > > select count(*) from People > where ID in > ( select People_ID from "Movie People" where Capacity_ID = > (select ID from Capacity where Name = 'director') > ); > > But I expect you'll get the same performance. > Yes, almost the same performance. BareFeetWare-2 wrote: > > You don't want to denormalize and have separate director and writer tables > etc. That will get ugly. > That's what I thought. Petite Abeille-2 wrote: > > Right... looking at something like the casting information in IMdb (actor, > director, writer, etc), there are about 25M movie + role + person > combinations. Which, while not huge, starts to get taxing when queried on > a lowly laptop. > > For example, out of these 25M rows, about 1.2M represent directors (~217K) > in movies (~1M). > Yes, in my application I have in the sidebar all those sections (movies, years, tags, keywords, actors, directors, writers, etc...) and I count each one, how many movies, how many years, etc... It isn't very slow if I update only one item but the problem is when I have to update the entire list, this uses a lot of cpu and takes some time and I have to do it every time a movie is added. http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/AAAAAAAAA2A/YFSGOAdpOGA/s1600/nested-brosers.jpg Here there is an old screenshot from the sidebar. Petite Abeille-2 wrote: > > Assuming a movie_cast table [1] and a cast dimension [2], you could record > the distinct count for each cast in the dimension once, and store it. That > way you don't have to recompute it over and over. > Yes, I am caching all counts in the application level but I have to update it every time the database changes so I have to update it at least every time I add a movie. Petite Abeille-2 wrote: > > Alternatively, you could indeed partition that bulky movie_cast table by > its cast type, which will in effect reduce the search space to at most ~9M > (movies by actors, ~1.3M movies, ~1.3M actors). > > That said, even partitioning will not help you much here as you still have > to search through ~9M records to figure out how many distinct actors you > have. > Actually, I already have different tables for directors, writers, etc... Petite Abeille-2 wrote: > > So... perhaps best to cheat :) > > One way to cheat is to precompute the answer by, for example, adding a > is_* flag on your person table: > > update person set is_actor = ( select count( distinct person_id ) from > movie_actor where movie_actor.person_id = person.id ); > > The person table is much smaller (~3M) and easier to query in bulk: > > explain query plan > select count( * ) from person where is_actor = 1; > 0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor > (is_actor=?) (~1596808 rows) > > 1310119 > CPU Time: user 0.256286 sys 0.000257 > > There you go :) > That's a good idea, I wonder if it isn't better to use a different table to store only the ids for the directors, etc, as I said before. I would end up with these tables: movies people movies_directors (movieId, peopleId) movies_writers (movieId, peopleId) directors (peopleId) writers (peopleId) Then I can count directors with: SELECT count(*) FROM directors; Petite Abeille-2 wrote: > > This is where I wish SQLite could have bitmap indices: > http://en.wikipedia.org/wiki/Bitmap_index > That seems what I need. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30897078.html Sent from the SQLite mailing list archive at Nabble.com. _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users