Hmm. Maybe try yours with union instead of union all? Though if there's only 1 
path between any pair of nodes that shouldn't make too much difference. 
Otherwise I'm getting low on ideas.

What're the record counts for nodes and edges?

-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Shane Dev
Sent: Thursday, January 04, 2018 5:20 PM
To: SQLite mailing list
Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.

Hi David

I recommend using longer names than 1 letter for your aliases, what you
> save in typing you lose a couple times over again when wondering what "r"
> is or why "t" has anything to do with "nodes"
>

Fair enough. I tend to use shorts names to reduce the risk of typos. My
original node table was called "tasks". I tried to simplify the query for
this forum post but neglected to change the alias.

>
> In your CTE you're doing a 3 table join. There's no need to include the
> nodes table in there at all, you can get the node ID from the edge table.
> ...union all select e.child, top from r, edges as e where e.parent = r.id
> )...
>

You're right in this case. My original node table "tasks" had more columns
which I wanted in the final result set.

>
> The big thing though is in the where clause.
> where...and id not in (select parent from edges where parent = id)...
>

That was a sloppy mistake, I changed it to  ..and id not in (select parent
from edges)... but it was still very slow

>
> Old:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r
> where e.parent = r.id and t.id = e.child) select top, count(*) from r
> where top != id and id not in (select parent from edges where parent = id)
> group by top;
>

CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
from nodes union all select t.id, top from nodes as t, edges as e, r where
e.parent = r.id and t.id = e.child) select top, count(*) from r where top
!= id and id not in (select parent from edges where parent = id) group by
top;

sqlite> select * from v_count_leaves where top=679;
top     count(*)
679     2
Run Time: real 73.365 user 73.328125 sys 0.000000


> New:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select e.child, top from edges as e, r where e.parent
> = r.id) select top, count(*) from r where top != id and id not in (select
> parent from edges) group by top;
> Now give your modified query a go and let me know how it compares to what
> I came up with.
>

CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
id from nodes union all select e.child, top from edges as e, r where
e.parent = r.id) select top, count(*) from r where top != id and id not in
(select parent from edges) group by top;

sqlite> select * from v_count_leaves_new where top=679;
top     count(*)
679     2
Run Time: real 45.099 user 45.093750 sys 0.000000

faster, but about 8 times slower than your query -

sqlite> select * from leafcounts where parent=679;
parent  leafCount
679     2
Run Time: real 5.639 user 5.640625 sys 0.000000

and that is without the reverseEdges index.

I still don't understand why "leafcounts" is so much faster than
"v_count_leaves_new"




> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> On Behalf Of Shane Dev
> Sent: Wednesday, January 03, 2018 12:45 AM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David,
>
> Nice work! your query is far quicker than mine- even without the
> reverseEdges index. I think you are right about the problem of potentially
> double counting leaves. There weren't any multi-parent nodes in my test
> data so I didn't notice this mistake.
>
> Could you please explain why your query is so much faster?
>
> On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com>
> wrote:
>
> > I think you need a union there instead of a union all. Otherwise you're
> > double (or more) counting leaves where there is more than 1 path to get
> to
> > the leaf.
> >
> > I don't have a large dataset to test it on, but how about something like:
> >
> > create table nodes
> > (
> >   id integer primary key,
> >   description text
> > );
> >
> > create table edges
> > (
> >   parent int not null references nodes,
> >   child int not null references nodes,
> >   primary key (parent, child),
> >   check (parent != child)
> > ) without rowid;
> > create index reverseEdges on edges (child, parent);
> >
> > create view leafCounts as with recursive
> > leaves (id) as (
> >   select nodes.id
> >   from nodes left outer join edges
> >   on nodes.id = edges.parent
> >   where edges.parent is null
> > ),
> > paths (parent, child) as (
> >   select parent, child from edges
> >   union
> >   select paths.parent, edges.child
> >   from paths inner join edges
> >   on paths.child = edges.parent
> > )
> > select parent, count(*) as leafCount
> > from paths
> > where child in leaves
> > group by parent;
> >
> >
> > -----Original Message-----
> > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> > On Behalf Of Shane Dev
> > Sent: Monday, January 01, 2018 11:14 AM
> > To: SQLite mailing list
> > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi,
> >
> > I want to the count the number of leaves (descendants without children)
> for
> > each node in a DAG
> >
> > DAG definition -
> >
> > CREATE TABLE nodes(id integer primary key, description text);
> > CREATE TABLE edges(parent not null references nodes, child not null
> > references nodes, primary key(parent, child));
> >
> > My query -
> >
> > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > select id, id from nodes
> > union all
> > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> > t.id
> > =e.child)
> > select top, count(*) from r where top<>id and id not in (select parent
> from
> > edges where parent=id) group by top;
> >
> > It seems to work but is complex to understand and debug despite my aim to
> > keep it simple as possible, but more importantly - it is very slow when
> > there are more than a few thousand nodes and edges.
> >
> > It there a more efficient (and ideally simpler) way?
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to