[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-04-02 Thread Larry Hastings


Larry Hastings  added the comment:

One final aside.  Let me preface this by saying: I'm not proposing the 
following for graphlib.TopologicalSort.  It's obviously too late to change that 
object this much.  It's just something I'm thinking about--maybe I'll use this 
in my own library.

Where we are now, the graphlib.TopologicalSort object is simultaneously a 
static representation of a graph--which the object doesn't provide a public API 
for you to examine!--and a stateful single-use iterator over that graph.  My 
proposed change to the API seems to increase the tension between these two sets 
of semantics.  Perhaps a better set of semantics, that more successfully maps 
my use case to the graph object, would be as follows:

* you can add() nodes and edges at any time.
* get_ready() always returns the current list of nodes with no 
predecessors.  There is no internal "we already told you about this one" state.
* replace done() with remove(), which removes the node and all edges 
from/to it from the graph.
* static_order() is still fine.

I think this would make it easy to reason about the object's behavior, and 
would be a better match to my use case where you're continually adding (and 
removing?) nodes, not just during an initial "populate the graph" phase.
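A minimal sketch of those semantics (all names here are hypothetical; this is not graphlib code): get_ready() is a pure query, and remove() is the only way state leaves the graph.

```python
from collections import defaultdict

class LiveGraph:
    """Hypothetical sketch, not graphlib: get_ready() is a stateless
    query, and remove() deletes a node plus all its edges."""

    def __init__(self):
        self._preds = defaultdict(set)  # node -> nodes it depends on
        self._succs = defaultdict(set)  # node -> nodes depending on it

    def add(self, node, *predecessors):
        # Legal at any time; just records the nodes and edges.
        self._preds[node].update(predecessors)
        for p in predecessors:
            self._succs[p].add(node)
            self._preds[p]   # touch p so it exists even with no edges in
        self._succs[node]    # likewise for node

    def get_ready(self):
        # No "already told you" state: simply the nodes that currently
        # have no predecessors.
        return tuple(n for n, ps in self._preds.items() if not ps)

    def remove(self, node):
        # Drop the node and every edge from/to it.
        for s in self._succs.pop(node, ()):
            self._preds[s].discard(node)
        for p in self._preds.pop(node, ()):
            self._succs[p].discard(node)
```

Note that calling get_ready() twice without an intervening remove() returns the same nodes, which is the point: the caller, not the graph, tracks what has been handed out.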

Again, not appropriate to consider for graphlib.TopologicalSort.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-04-02 Thread Larry Hastings


Larry Hastings  added the comment:

I agree that the API should have as few surprises as possible.  AFAICT you 
haven't made any terminal pronouncements like "it's impossible to add this 
feature without too many unacceptable surprises", so I'll proceed assuming we 
can find an API (and semantics) that we're all happy with.

I've come around on the idea that forgetting "done" nodes is too surprising.  
The least surprising behavior is: once we've added a node to the graph, the 
graph remembers it.

Now I'll propose a second simple rule, which you've already touched on: you 
can't add a dependency after a node has been returned by get_ready().  
Attempting this always throws an exception; which exception specifically TBD.  
"Tough beans".

I think those two rules are easy enough to remember and to reason about.  
Together they meaningfully expand the utility of the library with minimal 
surprise.  The 
library behaves identically for users of the existing API but permits new use 
cases too.  Adding this functionality would therefore mean fewer users would 
discover too late their use case isn't supported.
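A sketch of how those two rules might look in code (a hypothetical class, not a patch to graphlib):

```python
class RememberingSorter:
    """Sketch of the two rules above (hypothetical, not graphlib):
    1. the graph never forgets a node, even after done();
    2. adding a dependency to a node already returned by get_ready()
       raises."""

    def __init__(self):
        self._waiting = {}       # node -> set of not-yet-done predecessors
        self._handed_out = set()
        self._done = set()

    def add(self, node, *predecessors):
        if node in self._handed_out:
            # Rule 2: too late to add dependencies to this node.
            raise ValueError(f"{node!r} was already returned by get_ready()")
        # Rule 1: predecessors that finished long ago are still
        # remembered, so they never count as unmet.
        unmet = {p for p in predecessors if p not in self._done}
        self._waiting.setdefault(node, set()).update(unmet)
        for p in unmet:
            if p not in self._handed_out:
                self._waiting.setdefault(p, set())

    def get_ready(self):
        ready = [n for n, deps in self._waiting.items() if not deps]
        for n in ready:
            del self._waiting[n]
            self._handed_out.add(n)
        return tuple(ready)

    def done(self, *nodes):
        for n in nodes:
            self._done.add(n)
            for deps in self._waiting.values():
                deps.discard(n)
```

Which exception add() should raise is still TBD, as noted above; ValueError is just a placeholder here.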


As far as my "long-lived graph objects will consume more and more memory over 
time" caveat, there's a better solution than "forget": graph.remove(*nodes).  
(My version already has a "remove" method, and I forgot that graphlib's 
doesn't.)  Allowing the user to remove a node from the graph gives them 
explicit control, and the semantics should be obvious and unsurprising; if 
you--the user--remove a node, and later you--the user--re-add that node to the 
graph, it behaves identically to any other node the graph has never seen 
before.  Removing a node intuitively removes all edges to that node.

Two notes on "remove" if we decide to go that route.  First, I'd ensure you can 
remove a node at any time.  Nodes have three externally visible states wrt 
TopologicalSort:

1) added but not published by get_ready,
2) published by get_ready but not returned using done, and
3) done.

You should be able to remove a node in any of those three states.  Removing a 
node in state 2) should be equivalent to calling done before calling remove; 
that is, if you're removing the node anyway, you don't need to call done.

Second, the current underlying implementation would make remove really slow.  
Nodes don't remember their predecessors, only their successors, so removing a 
node would be O(n).  If we expected remove to get a lot of use, we'd probably 
want to change how the graph is stored to speed up this operation.
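For illustration, the O(n) cost falls straight out of successors-only storage: with no predecessor sets, deleting a node means scanning every other node's successor set for edges pointing at it (a sketch over a plain dict, not graphlib's actual node objects):

```python
def remove_slow(successors, node):
    """Remove `node` from a successors-only adjacency dict.

    Without predecessor sets, the only way to find edges *into*
    `node` is to scan everyone else's successor set: O(n)."""
    successors.pop(node, None)
    for others in successors.values():  # visits every node in the graph
        others.discard(node)
```

Keeping a parallel predecessors mapping would make removal O(degree) instead, at the cost of roughly doubling the edge storage.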




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Tim Peters


Tim Peters  added the comment:

I believe I'm elaborating on your "footgun".

It doesn't matter to me whether we pick some scheme and document it, _if_ that 
scheme is incoherent, impossible to remember, error-prone, etc.

That's how, e.g., regular expression syntax was designed. "I know! ++*??+) 
doesn't have a coherent meaning now, so let's say it means to match a prime 
number of the pattern it follows" ;-)

That is, an error-prone extension is worse than no extension at all.

The algorithm now isn't guessing at anything: it's saying up front that two 
tasks are the same task if and only if they compare equal. Context, execution 
history, ..., nothing else is relevant. It's simple. Complicating a simple 
thing may open new possibilities, but creates new traps too.

One trap is pretty obviously making "the rules" for when two tasks are the same 
task depend on the execution history at the time the question is asked.

That goes away, though, if the current rule is retained ("the same iff =="), 
but can be explicitly overridden by .forget() (or some such).

That doesn't make it a no-brainer, though. For example, do

.add(A, B)

and run until A and B are marked done. Now do

.add(B, C)

What then? We're back in a guessing game again. We've just been told that B 
depends on C first. But we already did B. Depending on _domain_ knowledge we 
cannot have, that may or may not be worthy of raising an exception.

You can make up any rule you want about that and arbitrarily declare victory, 
but the chance that a user will realize it's not the behavior _their_ domain 
needs is approximately 0 until after they've been burned by it. So now it's 
error-prone, at least to them.

FWIW, in that specific case I'd say "tough beans - you told us too late that B 
depends on C to stop B the first time, but now that you've told us we'll 
respect it in future".

Another twist: assuming that's "the rule", what does

.add(B, C)

really mean? If B really is "the same task" as it was the first time around, 
well, it's already been marked done. Are we supposed to do it _again_ now? Why? 
Why not?

It's hard for me to find any _compelling_ sense here - just masses of 
more-or-less arbitrary implementation decisions. In which case "hard to 
remember" follows soon after.

None of that haunts the current API. It all follows quite directly from what 
"depends on" means to essentially everyone.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Larry Hastings


Larry Hastings  added the comment:

I'm not sure I follow you.  What do you suggest graphlib is guessing at?  I 
thought we were discussing what graphlib's (documented, intentional) behavior 
should be; if there was any guessing going on, it was us doing it, guessing at 
what behavior the user would prefer.

I appreciate you corresponding on the issue, but I'm having difficulty 
understanding what light you're shining on the problem.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Tim Peters


Tim Peters  added the comment:

Various kinds of tasks:

- "Power switch must be on." Needs to done the first time. _May_ need to be 
done again later (if some later task turns the power off again). Can be done 
any number of times without harm (beyond the expense of checking), so long as 
the switch is on.

- "Ensure gas tank is full." Probably needs to be done anew for every new added 
task that depends on it.

- "Remove outermost layer of skin." Probably shouldn't be done more than once 
;-)

- "Flip Najdorf switch." Who knows? Doing it a second time - or failing to do 
it a second time - may be necessary, harmless, or deadly.

I'd rather not bother with any of this dicey guessing. While some dynamism may 
be attractive, what it all "should mean" appears to be a rat's nest, and 
depends on domain knowledge graphlib can't possibly have.

I doubt there is a compelling default. As is, two tasks are considered to be 
"the same" task if and only if they compare equal, so that's the least 
_surprising_ default for tasks added later.

"Two tasks that compare equal may or may not be considered the same task, 
depending on the execution history at the time the question is posed" is at 
best expedient, at worst disastrous.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Larry Hastings


Larry Hastings  added the comment:

Having slept on it, I think the footgun is slightly bigger than I gave it 
credit for.  Also the problem of nodes lingering forever is easily solved: give 
the user control.

My next iteration will keep the done nodes around, but I'll also add a forget() 
method that compels the graph to forget all done nodes.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Larry Hastings


Larry Hastings  added the comment:

> Assuming we do want to be able to add() after a get_ready(), is there
> a reason that "forgetting" already-produced nodes is the correct
> behavior, as opposed to remembering all nodes ever added, and
> raising iff the addition creates a cycle among all nodes ever
> added or depends on an already-yielded node?

I'm not sure "correct" applies here, because I don't have a sense that one 
behavior is conceptually more correct than the other.  But in implementing my 
API change, forgetting about the done nodes seemed more practical.

The "benefit" to remembering done nodes: the library can ignore dependencies to 
them in the future, forever.  Adding a dependency to a node that's already been 
marked as "done" doesn't make much conceptual sense to me, but as a practical 
thing maybe it's useful?  I'm not sure, but it doesn't seem all that useful.

I can only come up with a marginal reason why remembering done nodes is useful. 
 Let's say all your tasks fan out from some fundamental task, like "download 
master list of work" or something.  One of your tasks might discover additional 
tasks that need to run, and conceptually those tasks might depend on your 
"download master list of work" task.  If the graph remembers the done list 
forever, then adding that dependency is harmless.  If the graph forgets about 
done nodes, then adding that dependency could re-introduce that task to the 
graph, which could goof things up.  So maybe it's a little bit of a footgun?  
But on the other hand: you already know you're running, and you're a task that 
was dependent on the master list of work, which means you implicitly know that 
dependency has been met.  So just skip adding the redundant dependency and 
you're fine.

On the other hand, forgetting about the nodes has a definite practical benefit: 
the graph consumes less memory.  If you use a graph object for a long time, the 
list of done nodes it's holding references to would continue to grow and grow 
and grow.  If we forget about done nodes, we free up all that memory, and done 
membership testing maybe gets faster.

I guess I'm not married to the behavior.  If someone had a great conceptual or 
practical reason why remembering the done nodes forever was better, I'd be 
willing to listen to reason.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-30 Thread Larry Hastings


Larry Hastings  added the comment:

I'm using my graph library to manage a list of tasks that need doing in some 
sort of proper order.  One task may spawn other tasks at runtime, and we don't 
necessarily know what the tasks will be until runtime.  It's way more 
convenient to simply add such tasks on demand, rather than trying to 
pre-add every possible task before preparing the graph, or 
creating additional graphs.

For example, consider a tool that downloads and digests zip files full of music 
from an online store.  Your initial tasks might represent "download the zip 
file", then "decompress and examine the contents of the zip file".  You could 
then iterate over the contents of the zip file, adding different tasks based on 
what you find--one pipeline of tasks for media files (FLAC/MP3/OGG/etc), 
another for the playlist, a third if you don't *find* a playlist, a fourth for 
image files, etc.  (Not that you'd necessarily write such a tool this way, but 
it's at least plausible.)

The new nodes needn't be connected to the existing nodes for this to still be 
useful.  You could reuse the same graph object for all your tasks.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-29 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

> depends on an already-yielded node

I mean "creates a new not-yet-yielded dependency for an already-yielded node".




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-29 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

Out of curiosity, what are the use cases for adding nodes after get_ready has 
already produced nodes?

I was wondering about avoiding the need to call prepare() by having it 
automatically do the cycle-checking at the first get_ready() call and then 
raising ValueError if add() is called any time thereafter.

Assuming we do want to be able to add() after a get_ready(), is there a reason 
that "forgetting" already-produced nodes is the correct behavior, as opposed to 
remembering all nodes ever added, and raising iff the addition creates a cycle 
among all nodes ever added or depends on an already-yielded node?

--
nosy: +Dennis Sweeney




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-29 Thread Eric V. Smith


Eric V. Smith  added the comment:

My personal usage of a topological sort is restricted to maybe 100 entries 
max, so I can't really test this in any meaningful way.

That, and the project that uses it is stuck on 3.6, so I haven't switched to 
the graphlib version yet.




[issue47145] Improve graphlib.TopologicalSort by removing the prepare step

2022-03-28 Thread Larry Hastings


New submission from Larry Hastings :

I've maintained my own topological sort class for a while now.  The API I came 
up with (or was it Tim and I?) was adopted for graphlib.TopologicalSort().  
Some of my method names are slightly different, but the APIs are so similar, I 
can do a little renaming and run Lib/test/test_graphlib using my 
implementation.  (For clarity I'm going to write the rest of this issue using 
the names from graphlib.)

Recently I made an improvement to my version: it no longer has the modal 
behavior where there's a "before prepare" phase where you add nodes and an 
"after prepare" phase where you can call get_ready and done.

Instead, in my new and improved version the graph is always maintained in a 
"ready" state.  You can call add, get_ready, and done in any order, as much as 
you like.  prepare doesn't do anything besides the cycle check--but read on!

This approach required tweaking some semantics behind the scenes.  My graph 
object now maintains an internal "dirty" flag.  It's set to True after any 
edges are added, and only gets cleared after checking for cycles.  prepare runs 
this cycle check, but get_ready *also* runs this cycle check.  (Though in both 
cases, only when the dirty flag is set, of course.)  In theory this means you 
no longer need the prepare call.  However, it's still useful, because you can 
call it to check for a CycleError.
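The dirty-flag bookkeeping might look roughly like this (an illustrative sketch with invented names, not the actual patch; the stdlib's graphlib.CycleError is reused for the exception):

```python
from graphlib import CycleError

class AlwaysReadyGraph:
    """Sketch of the always-ready design: no modal prepare step, just a
    dirty flag that forces a cycle check on the next prepare/get_ready."""

    def __init__(self):
        self._preds = {}      # node -> set of not-yet-done predecessors
        self._succs = {}      # node -> set of successors
        self._yielded = set()
        self._dirty = False

    def add(self, node, *predecessors):
        self._preds.setdefault(node, set()).update(predecessors)
        self._succs.setdefault(node, set())
        for p in predecessors:
            self._preds.setdefault(p, set())
            self._succs.setdefault(p, set()).add(node)
        if predecessors:
            self._dirty = True   # edges changed: recheck before yielding

    def _check_cycles(self):
        if not self._dirty:      # only pay for the check when needed
            return
        finished, on_stack = set(), set()
        def visit(n):
            on_stack.add(n)
            for s in self._succs[n]:
                if s in on_stack:
                    raise CycleError(f"nodes are in a cycle: {n!r} -> {s!r}")
                if s not in finished:
                    visit(s)
            on_stack.discard(n)
            finished.add(n)
        for n in self._succs:
            if n not in finished:
                visit(n)
        self._dirty = False

    prepare = _check_cycles      # prepare is now *just* the cycle check

    def get_ready(self):
        self._check_cycles()     # may now raise CycleError, unlike today
        ready = tuple(n for n, ps in self._preds.items()
                      if not ps and n not in self._yielded)
        self._yielded.update(ready)
        return ready

    def done(self, *nodes):
        for n in nodes:
            for s in self._succs[n]:
                self._preds[s].discard(n)
```

Existing callers that do all their add() calls before prepare() never see the new behavior: the flag is clean by the time they call get_ready().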

So, on the one hand, this means that get_ready can throw a CycleError, which it 
never did before.  On the other hand, it won't throw for existing users of the 
library, because they never add edges after their prepare call.  So this 
semantic change shouldn't break existing users, assuming they're not misusing 
the existing API.

Another wrinkle: adding a dependency on a node that's already been handed out 
by get_ready (but hasn't been returned by done) is ignored.  Again, not 
something existing users of graphlib can do, so this shouldn't break code.
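That wrinkle is tiny in code; something like the following hypothetical helper, written only to pin down the semantics:

```python
def add_edge(preds, handed_out, node, predecessor):
    """Record that `node` depends on `predecessor`, unless `node` has
    already been handed out by get_ready(), in which case the new
    dependency is silently ignored (hypothetical helper, not graphlib)."""
    if node in handed_out:
        return  # too late to influence a node already handed out
    preds.setdefault(node, set()).add(predecessor)
```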

There's one final semantic change I made worth knowing about: when you return a 
node via done, the graph simply forgets about it.  This means you could add it 
a second time (!) and it could go for a complete ride through the graph again, 
whee!  This also means that when all nodes have been returned by done, the 
graph is essentially returned to its initial pristine state.  This too seems 
completely harmless, because with the existing library it's illegal to add 
nodes to a graph after calling prepare, so nobody's doing it, so it won't break 
any code.

I'm happy to contribute my version.  But I was lazy and didn't worry about 
speed or memory implementation; it's strewn with sets and things.  I figured I 
could always optimize it later.  But "later" could become "now" if we were 
interested in trying to merge this behavior into Python's graphlib.

Interested?

--
assignee: larry
components: Library (Lib)
messages: 416182
nosy: eric.smith, larry, pablogsal, tim.peters
priority: normal
severity: normal
stage: needs patch
status: open
title: Improve graphlib.TopologicalSort by removing the prepare step
type: enhancement
versions: Python 3.11
