Hmm. Maybe try yours with union instead of union all? Though if there's only 1 path between any pair of nodes that shouldn't make too much difference. Otherwise I'm getting low on ideas.
What're the record counts for nodes and edges? -----Original Message----- From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of Shane Dev Sent: Thursday, January 04, 2018 5:20 PM To: SQLite mailing list Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG. Hi David I recommend using longer names than 1 letter for your aliases, what you > save in typing you lose a couple times over again when wondering what "r" > is or why "t" has anything to do with "nodes" > Fair enough. I tend to use shorts names to reduce the risk of typos. My original node table was called "tasks". I tried to simplify the query for this forum post but neglected to change the alias. > > In your CTE you're doing a 3 table join. There's no need to include the > nodes table in there at all, you can get the node ID from the edge table. > ...union all select e.child, top from r, edges as e where e.parent = r.id > )... > You're right in this case. My original node table "tasks" had more columns which I wanted in the final result set. > > The big thing though is in the where clause. > where...and id not in (select parent from edges where parent = id)... > That was a sloppy mistake, I changed it to ..and id not in (select parent from edges)... but it was still very slow > > Old: > sqlite> explain query plan with recursive r (id, top) as (select id, id > from nodes union all select t.id, top from nodes as t, edges as e, r > where e.parent = r.id and t.id = e.child) select top, count(*) from r > where top != id and id not in (select parent from edges where parent = id) > group by top; > CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id from nodes union all select t.id, top from nodes as t, edges as e, r where e.parent = r.id and t.id = e.child) select top, count(*) from r where top != id and id not in (select parent from edges where parent = id) group by top; sqlite> select * from v_count_leaves where top=679; top count(*) 679 2 Run Time: real 73.365 user 73.328125 sys 0.000000 > New: > sqlite> explain query plan with recursive r (id, top) as (select id, id > from nodes union all select e.child, top from edges as e, r where e.parent > = r.id) select top, count(*) from r where top != id and id not in (select > parent from edges) group by top; > Now give your modified query a go and let me know how it compares to what > I came up with. > CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id, id from nodes union all select e.child, top from edges as e, r where e.parent = r.id) select top, count(*) from r where top != id and id not in (select parent from edges) group by top; sqlite> select * from v_count_leaves_new where top=679; top count(*) 679 2 Run Time: real 45.099 user 45.093750 sys 0.000000 faster, but about 8 times slower than your query - sqlite> select * from leafcounts where parent=679; parent leafCount 679 2 Run Time: real 5.639 user 5.640625 sys 0.000000 and that is without the reverseEdges index. I still don't understand why "leafcounts" is so much faster than "v_count_leaves_new" > -----Original Message----- > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] > On Behalf Of Shane Dev > Sent: Wednesday, January 03, 2018 12:45 AM > To: SQLite mailing list > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG. > > Hi David, > > Nice work! your query is far quicker than mine- even without the > reverseEdges index. I think you are right about the problem of potentially > double counting leaves. There weren't any multi-parent nodes in my test > data so I didn't notice this mistake. > > Could you please explain why your query is so much faster? > > On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com> > wrote: > > > I think you need a union there instead of a union all. Otherwise you're > > double (or more) counting leaves where there is more than 1 path to get > to > > the leaf. > > > > I don't have a large dataset to test it on, but how about something like: > > > > create table nodes > > ( > > id integer primary key, > > description text > > ); > > > > create table edges > > ( > > parent int not null references nodes, > > child int not null references nodes, > > primary key (parent, child), > > check (parent != child) > > ) without rowid; > > create index reverseEdges on edges (child, parent); > > > > create view leafCounts as with recursive > > leaves (id) as ( > > select nodes.id > > from nodes left outer join edges > > on nodes.id = edges.parent > > where edges.parent is null > > ), > > paths (parent, child) as ( > > select parent, child from edges > > union > > select paths.parent, edges.child > > from paths inner join edges > > on paths.child = edges.parent > > ) > > select parent, count(*) as leafCount > > from paths > > where child in leaves > > group by parent; > > > > > > -----Original Message----- > > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] > > On Behalf Of Shane Dev > > Sent: Monday, January 01, 2018 11:14 AM > > To: SQLite mailing list > > Subject: [sqlite] Efficient query to count number of leaves in a DAG. > > > > Hi, > > > > I want to the count the number of leaves (descendants without children) > for > > each node in a DAG > > > > DAG definition - > > > > CREATE TABLE nodes(id integer primary key, description text); > > CREATE TABLE edges(parent not null references nodes, child not null > > references nodes, primary key(parent, child)); > > > > My query - > > > > CREATE VIEW v_count_leaves as with recursive r(id, top) as ( > > select id, id from nodes > > union all > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and > > t.id > > =e.child) > > select top, count(*) from r where top<>id and id not in (select parent > from > > edges where parent=id) group by top; > > > > It seems to work but is complex to understand and debug despite my aim to > > keep it simple as possible, but more importantly - it is very slow when > > there are more than a few thousand nodes and edges. > > > > It there a more efficient (and ideally simpler) way? > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users