On Mon, Jun 22, 2009 at 13:47, Jan Hidders <[email protected]> wrote:
> If we speak about the whole processor, not just the single iteration,
> wouldn't that be ?:
> [0,0]: o1="x"
> [0,1]: o1="y"
> [0]: o1 = ["x", "y"]
Well yes, if we are inside another iteration, then additional
indices will be added to the iteration index.
> []: -1 = [["x","y"], ["u"]]
> Btw. are you really sending the whole ["x", "y"] at the closing of the
> list, because that would seem a bit redundant, or is this just
> conceptually there?
No, just conceptually. 'Really' it's sending a reference to a
registered list - and the items in the list are also references to
registered data - the actual "x" and "y" are registered once,
returning two identifiers.
The two identifiers will be kept until the end of the iteration, at
which point the list itself is registered, and the identifier of that
list is what is returned - and then in turn kept to be registered in
the super-list of [].
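To make that concrete, here is a minimal Java sketch of the
register-and-return-an-identifier idea - all class and method names
below are made up for this email, this is not the actual Taverna data
manager API:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.UUID;

  class ToyDataRegistry {
      private final Map<String, Object> store = new HashMap<>();

      // Register a single value (could be 1 GB); only the id is passed around.
      String register(Object value) {
          String id = UUID.randomUUID().toString();
          store.put(id, value);
          return id;
      }

      // Register a list of already-registered identifiers, returning the list's id.
      String registerList(List<String> ids) {
          return register(new ArrayList<>(ids));
      }

      public static void main(String[] args) {
          ToyDataRegistry registry = new ToyDataRegistry();
          // Iteration [0] produces "x" and "y"; each value is registered once.
          String xId = registry.register("x");
          String yId = registry.register("y");
          // At the end of the iteration the list of identifiers is registered,
          // and only that list identifier travels on to the super-list for [].
          String listId = registry.registerList(List.of(xId, yId));
          System.out.println("list [0] -> " + listId);
      }
  }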
It does mean that if there are 17,000 items in one single list, 17,000
references will be kept in memory - so about 100 bytes * 17,000 ~= 1.7 MB.
However, most heavy iterations are over lists of lists, so it could be
an iteration over 5,000 lists containing 500 items each - and unless
there's heavy threading, only a couple of those 500-item lists of
references would be in memory at any point.
As for simple values like "x", the identifier would take up slightly
more memory than the actual value would have - but this is a tradeoff
that means we can handle the data in the very same way even if the
values are 1 GB each.
> In principle a processor could have two output ports that both produce
> their values as a stream stretched out over time, even if it is not
> iterating, so then they do not really produce their values at exactly
> the same moment, do they?
The processor pushes the values for all its output ports up the
dispatch stack at the same time.
It is true that at the top of the dispatch stack there is an iteration
over these ports that sends them out to the individual outgoing links -
meaning that they would not technically be sent out at exactly the
same time - which would be very difficult to do unless you had one
thread per output port and one CPU core per thread.
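Conceptually that top-of-stack fan-out is just a double loop, roughly
like this Java sketch (all names invented for this email, not the real
dispatch stack code):

  import java.util.List;
  import java.util.Map;

  public class FanOutSketch {
      interface Sink { void receive(String dataId); }

      static void pushResults(Map<String, String> resultIdsByPort,
                              Map<String, List<Sink>> linksByPort) {
          // One pass over all output ports (say 10), then over each port's links (say 5)
          for (Map.Entry<String, List<Sink>> port : linksByPort.entrySet()) {
              String dataId = resultIdsByPort.get(port.getKey());
              for (Sink sink : port.getValue()) {
                  sink.receive(dataId);   // in most cases this just queues the token
              }
          }
      }

      public static void main(String[] args) {
          Sink printer = id -> System.out.println("queued " + id);
          pushResults(Map.of("out1", "ref-42"),
                      Map.of("out1", List.of(printer)));
      }
  }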
This iteration is usually quite fast (say 10x5=50 receiving ports with
10 output ports and 5 links from each), as in the receiving processors
the values would in most cases just be added to a queue while waiting
for the other inputs. If all the inputs are ready for a receiving
processor, the job is sent down the dispatch stack; in Parallelize it
would either just send the job down (if the processor is not saturated
on its maximum number of concurrent jobs), or put it in a queue to be
processed after an earlier iteration has finished in that processor.
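That Parallelize decision could be sketched roughly like this - again
just an illustrative sketch with made-up names, not the actual
Parallelize layer:

  import java.util.ArrayDeque;
  import java.util.Queue;

  public class ParallelizeSketch {
      private final int maxJobs;
      private int runningJobs = 0;
      private final Queue<Runnable> pending = new ArrayDeque<>();

      ParallelizeSketch(int maxJobs) { this.maxJobs = maxJobs; }

      synchronized void receiveJob(Runnable job) {
          if (runningJobs < maxJobs) {
              runningJobs++;
              job.run();                 // not saturated: send the job straight down
          } else {
              pending.add(job);          // saturated: queue until an iteration finishes
          }
      }

      synchronized void jobCompleted() {
          runningJobs--;
          Runnable next = pending.poll();
          if (next != null) {            // a queued job can now go down
              runningJobs++;
              next.run();
          }
      }

      public static void main(String[] args) {
          ParallelizeSketch p = new ParallelizeSketch(1);
          p.receiveJob(() -> System.out.println("job 1 sent down"));
          p.receiveJob(() -> System.out.println("job 2 sent down after job 1 completed"));
          p.jobCompleted();
      }
  }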
Now even if the job does go down past Parallelize in the same thread
(which, as you remember, is in the middle of iterating over the output
ports) - as soon as it hits the activity itself it will invoke the
activity asynchronously in a new thread.
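In spirit the asynchronous invocation looks something like the sketch
below (hypothetical names again - the real code goes through the
activity's asynchronous callback rather than a bare Thread like this):

  public class AsyncInvokeSketch {
      interface Callback { void resultsReceived(String dataId); }

      static void invokeActivity(Runnable activityBody, Callback callback) {
          new Thread(() -> {
              activityBody.run();                 // the activity itself runs here
              callback.resultsReceived("ref-1");  // results pushed back up the stack
          }).start();
          // The calling thread returns immediately and is not blocked.
      }

      public static void main(String[] args) {
          invokeActivity(() -> System.out.println("activity running in its own thread"),
                         id -> System.out.println("callback received " + id));
          System.out.println("dispatch thread continues iterating over output ports");
      }
  }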
In total this is a nice compromise that avoids creating threads
unnecessarily (creating them only at controlled locations like
Parallelize), with the code doing as little work as possible in the
output-iterating thread (which will anyway come in a separate thread
from Parallelize, and originally from that fresh asynchronous thread
for the activity).
But this would mean - depending on when the VM starts the new activity
threads - that some activities could start executing before this full
iteration over the output values has finished. There should however not
really be much difference from all of the iterations being done in
parallel, or from all the new activity threads being kickstarted at
once - in the end you still can't know which activity thread the VM is
going to run first.
> It's probably not even true that these
> streams should start and stop at the same moment, nor do they need to
> have the same length. What I mean is: this is not a rule that holds in
> general. I understand that in many cases this holds, but this is then
> due to the semantics of the activities in the list of activities
> associated with the processor, rather than that the processor enforces
> this. Correct?
Well, they would have to be of the same length, but it's not currently
enforced (enforcing this would mean keeping additional state). The
reason is that if you've 'finished' output on port A, then you can't
really continue outputting on port B, because you would be returning
[1,3,4] b={"fish"} - and what would the value of a be for position
[1,3,4]?
To support this the activity would have to be able to return values
with individual indexes, something like:
{ a[1,2,3] = "fish",
b[1,2] = "soup" }
This is not currently supported in the API, but it might be something
we could add if a use case comes up where activities need to do that.
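Purely as a hypothetical illustration of what such per-port indexed
results could look like if we ever added it (none of these types exist
today):

  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;

  public class IndexedOutputSketch {
      // A value carrying its own iteration index, per output port.
      record IndexedValue(int[] index, String value) {}

      public static void main(String[] args) {
          Map<String, List<IndexedValue>> results = Map.of(
              "a", List.of(new IndexedValue(new int[]{1, 2, 3}, "fish")),
              "b", List.of(new IndexedValue(new int[]{1, 2}, "soup")));
          results.forEach((port, values) ->
              values.forEach(v -> System.out.println(
                  port + Arrays.toString(v.index()) + " = " + v.value())));
      }
  }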
> The only thing that the processor would enforce is that every
> iteration has completely stopped once it has sent its last output
> message.
No - as the activity is invoked through an asynchronous callback, the
processor can't really know that the activity invocation thread has
'stopped'. It does not enforce this - and it probably should not, as
the activity might need to do some clean-up after returning its values.
> Apologies for taking so much of your time, btw., but we are on a
> deadline with our paper and really need to get this correct.
No problem :)
FYI: It's taken 59 minutes this week according to the very exciting
and scary time tracker ManicTime..
--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester