Simon Slavin-3 wrote:
> 
> How much slower.  Did you make an index SQLite could use for that query ?
> 
Using distinct isn't slow as such; it performs about as I would expect. But
count(*) is incredibly fast: practically instantaneous no matter how large
the table is.
Yes, I tried with an index.
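
Roughly, the sort of thing I mean is this (just a sketch; the table and
column names are borrowed from the query quoted below, not necessarily my
exact schema):

CREATE INDEX IF NOT EXISTS "Movie People_capacity_person"
    ON "Movie People" (Capacity_ID, People_ID);

SELECT count(DISTINCT People_ID)
FROM "Movie People"
WHERE Capacity_ID = (SELECT ID FROM Capacity WHERE Name = 'director');

Even with a covering index like that, count(DISTINCT ...) still has to walk
every matching index entry to deduplicate People_ID, while count(*) only has
to count entries.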


BareFeetWare-2 wrote:
> 
> Or you can try this:
> 
> select count(*) from People
> where ID in
> ( select People_ID from "Movie People" where Capacity_ID =
>                 (select ID from Capacity where Name = 'director')
> );
> 
> But I expect you'll get the same performance. 
> 
Yes, almost the same performance.


BareFeetWare-2 wrote:
> 
> You don't want to denormalize and have separate director and writer tables
> etc. That will get ugly. 
> 
That's what I thought.


Petite Abeille-2 wrote:
> 
> Right... looking at something like the casting information in IMdb (actor,
> director, writer, etc), there are about 25M movie + role + person
> combinations. Which, while not huge, starts to get taxing when queried on
> a lowly laptop.
> 
> For example, out of these 25M rows, about 1.2M represent directors (~217K)
> in movies (~1M). 
> 

Yes, in my application the sidebar has all of those sections (movies, years,
tags, keywords, actors, directors, writers, etc...) and I count each one:
how many movies, how many years, and so on.
It isn't very slow if I update only one item, but the problem is when I have
to update the entire list: that uses a lot of CPU and takes some time, and I
have to do it every time a movie is added.
Here is an old screenshot of the sidebar:
http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/AAAAAAAAA2A/YFSGOAdpOGA/s1600/nested-brosers.jpg
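
Refreshing the sidebar boils down to one count per section, roughly like
this (a sketch only, with made-up join table names along the lines of the
schema I describe further down):

SELECT count(*) FROM movies;
SELECT count(DISTINCT year) FROM movies;
SELECT count(DISTINCT peopleId) FROM movies_directors;
SELECT count(DISTINCT peopleId) FROM movies_writers;
-- ...and so on for tags, keywords, actors, etc.

Each of the DISTINCT counts is a scan of its join table, so doing them all
in one go is what eats the CPU.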


Petite Abeille-2 wrote:
> 
> Assuming a movie_cast table [1] and a cast dimension [2], you could record
> the distinct count for each cast in the dimension once, and store it. That
> way you don't have to recompute it over and over.
> 
Yes, I am caching all the counts at the application level, but I have to
update them every time the database changes, which means at least every time
I add a movie.
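
One thing I could try is keeping that cache inside the database instead of
in the application: a small counts table that gets refreshed only for the
sections a new movie touches (again just a sketch, with made-up names):

CREATE TABLE IF NOT EXISTS sidebar_counts (
    section TEXT PRIMARY KEY,   -- 'movies', 'directors', 'writers', ...
    n       INTEGER NOT NULL
);

-- after inserting a movie, refresh only the affected sections:
INSERT OR REPLACE INTO sidebar_counts (section, n)
SELECT 'directors', count(DISTINCT peopleId) FROM movies_directors;

INSERT OR REPLACE INTO sidebar_counts (section, n)
SELECT 'writers', count(DISTINCT peopleId) FROM movies_writers;

The sidebar itself then only ever reads SELECT section, n FROM
sidebar_counts, which is instant.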


Petite Abeille-2 wrote:
> 
> Alternatively, you could indeed partition that bulky movie_cast table by
> its cast type, which will in effect reduce the search space to at most ~9M
> (movies by actors, ~1.3M movies, ~1.3M actors). 
> 
> That said, even partitioning will not help you much here as you still have
> to search through ~9M records to figure out how many distinct actors you
> have.
> 
Actually, I already have different tables for directors, writers, etc...


Petite Abeille-2 wrote:
> 
> So... perhaps best to cheat :)
> 
> One way to cheat is to precompute the answer by, for example, adding a
> is_* flag on your person table:
> 
> update person set is_actor = ( select count( distinct person_id ) from
> movie_actor where movie_actor.person_id = person.id );
> 
> The person table is much smaller (~3M) and easier to query in bulk:
> 
> explain query plan
> select count( * ) from person where is_actor = 1;
> 0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor
> (is_actor=?) (~1596808 rows)
> 
> 1310119
> CPU Time: user 0.256286 sys 0.000257
> 
> There you go :)
> 
That's a good idea. I wonder whether it wouldn't be better to use a separate
table that stores only the ids of the directors, etc., as I said before.
I would end up with these tables:
movies 
people
movies_directors (movieId, peopleId)
movies_writers (movieId, peopleId)
directors (peopleId)
writers (peopleId)

Then I can count directors with:
SELECT count(*) FROM directors;
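
And the directors/writers tables could be kept in sync automatically with a
pair of triggers, something like this (a sketch, assuming peopleId is the
PRIMARY KEY of directors so INSERT OR IGNORE deduplicates; movies_writers
and writers would get the same pair):

-- add a person to directors the first time they direct something
CREATE TRIGGER IF NOT EXISTS movies_directors_ai
AFTER INSERT ON movies_directors
BEGIN
    INSERT OR IGNORE INTO directors (peopleId) VALUES (NEW.peopleId);
END;

-- remove them only when their last directing credit is deleted
CREATE TRIGGER IF NOT EXISTS movies_directors_ad
AFTER DELETE ON movies_directors
BEGIN
    DELETE FROM directors
    WHERE peopleId = OLD.peopleId
      AND NOT EXISTS (SELECT 1 FROM movies_directors
                      WHERE peopleId = OLD.peopleId);
END;

That way count(*) on directors stays cheap and never goes stale.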


Petite Abeille-2 wrote:
> 
> This is where I wish SQLite could have bitmap indices:
> http://en.wikipedia.org/wiki/Bitmap_index
> 
That seems like what I need.