Hi Joe, Steve & All,

On 30/09/2015 17:10, Bill Somerville wrote:
> As an aside: Am I right in thinking that the decoders divide the
> received spectrum and attempt decodes across the spectrum. If so then is
> there a better concurrency opportunity in dividing by frequency,
> assuming that can be done independently, rather than at the lower level
> of codeword matching. I ask because there would seem to be less
> sub-tasks in the former which in theory would incur lower
> parallelization overhead. I realise that this may be most relevant to
> the HF problem but I suspect this is the larger one.
On 01/10/2015 02:44, Joe Taylor wrote:
> Yes, there are several possible ways to slice the JT65 decoding task
> that could benefit from parallel processing.  Divide-by-frequency is one
> possibility.
>
> A disadvantage is that it would provide little or no help in most EME
> situations, which involve only one (or very few) signals.  In such
> cases, where we definitely want the best possible sensitivity, we'll
> want to set "ntrials" to a fairly large number.  Then, processing
> distinct portions of the stochastic loop in different threads could be
> the best way to go.
> On 30/09/2015 19:57, Steven Franke wrote:
>> As Joe said, for EME we probably want to run very large numbers of trials. 
>> If one were to start multiple calls to sfrsd2, each with its own random 
>> seed, all at once, and ensure that each call runs in its own thread, would 
>> this accomplish much of what we are after?
On 30/09/2015 20:32, Joe Taylor wrote:
> That's a possible approach for cases in which many synchronized signals
> need decoding.
>
> My idea (which should be useful even when there's only one signal to
> decode) is simply to parallelize the "ntrials" loop starting around line
> ~147 of sfrsd.  Each thread would begin with its own random seed and
> have its own local variables for era_pos, workdat, etc.  Each thread
> would keep track of its own nhard_min, nsoft_min, etc.  When any one of
> the threads produces a decode, it would terminate the others and
> continue as a normal single thread.
On 01/10/2015 02:02, Steven Franke wrote:
> Yes, that makes sense to me. I'm unencumbered by any hard knowledge about how 
> these things are done, so I understand that my thinking is probably too 
> simplistic to be practical.
>
> I was thinking that there is so little overhead before the ntrials loop that 
> it might make sense to keep sfrsd2 more-or-less intact. Suppose that we 
> re-wrote sfrsd2 using whatever C++ constructs necessary to make it a function 
> that can run in its own thread. There is no need for shared data or 
> communication between multiple sfrsd2 threads except for sending the input 
> data (in the function call) and a return flag. Then, if we create a decode 
> thread for each signal in the spectrum, each of those threads takes care 
> of calling sfrsd2 several times (in multiple threads).  Each decode thread 
> would have to call multiple sfrsd2’s with the same input data and different 
> seeds, and then wait for return flags to come in and act accordingly. This 
> seemed like something that would be kind of modular and easy to program.
On 01/10/2015 02:44, Joe Taylor wrote:
> Your notion to do multi-threading by running multiple simultaneous sfrsd
> threads may indeed be a good way to go.  As you say, the setup time in
> sfrsd (before entering the stochastic loop) is small enough to be
> ignored.  Moreover, that makes it easy to use OpenMP because we can do
> that part in Fortran -- similar to how it's already done when running
> both JT9 and JT65 at the same time.  Nothing special needs to be done to
> the sfrsd code; there's no need for it to be C++ rather than C.  We will
> just need to take care to distinguish shared variables from "per thread"
> variables, do I/O in the main thread, etc.  I managed to make this work
> before, and Bill knows all the tricks.
A few comments from me:

As it stands, the whole sfrsd2 function is not thread safe: it 
initializes a global structure, and that is not acceptable in 
multi-threaded (MT) code. Joe's suggestion of dividing the ntrials loop 
does look like a good opportunity for concurrency, as that part all 
looks reasonably thread safe. Note that the syndrome-calculation option 
of decode_rs_int is not thread safe, but it appears not to be used.

A strategy that starts a pool of threads and distributes work to those 
threads is normally best. Many examples show a thread being launched to 
complete a piece of work, but this approach can be very inefficient if 
the pieces of work are small, since the thread construction and 
destruction overhead will cancel the benefits. This means that the code 
may need more reorganization than a naive approach suggests.

Joe correctly states that some variables would need to be thread local. 
By re-factoring so that the work unit that runs in parallel is a 
function, it is trivial to achieve that by making them local variables 
of that function. Note, however, that there are more shared variables 
than the obvious local ones, for example tt[], ntry[] and correct[], 
which will either need to be re-factored or synchronized between 
threads. It should also be noted that rand() is not thread safe, but 
hopefully rand_r() is available on all our platforms.

I don't think a strategy that attempts to divide by frequency *and* 
divide by trials (divide as in "divide and conquer", by running work 
units on concurrent threads) is useful, since all work units will be 
CPU and memory bound and there will already be more work units to do 
than there are CPU threads to run them. OTOH divide by frequency alone 
would be excellent, except that the current code has many thread-unsafe 
constructs at that level, and it doesn't help with the single-signal 
EME scenario. The divide-by-trials approach Joe suggests is, 
relatively, much easier to achieve.

Allocating trials to work units may be non-trivial. Firstly, the work 
must be divided into few enough work units to fit the available CPU 
threads. I believe each trial is independent and random, so in this 
case it looks as simple as doing something like (ntrials / max(1, 
num_CPU_threads_available)) trials per work unit. It should be noted 
that if it were appropriate to do similar concurrency in the JT9 
decoder, then distributing work amongst CPU threads becomes a harder 
problem when doing dual JT9+JT65 decoding.

The termination criterion is non-trivial as it involves abandoning all 
work units early; this implicitly adds a dependency between the work 
units which will require synchronization. It would appear that a 
simple atomic boolean variable should suffice: it would be set by the 
winning work unit and checked by every work unit on a regular basis, 
say once per trial. This will incur a penalty, since even testing an 
atomic boolean will cause cache flushes due to the implicit memory 
barrier. So long as the work in a trial uses sufficient CPU cycles to 
be dominant it should be efficient, which I believe is the case.

Without fairly major re-factoring, using a thread pool is not going to 
be possible. This means that the simplistic and less efficient approach 
of starting a new thread for each work unit may be a pragmatic route to 
achieving concurrent decoding.

73
Bill
G4WJS.

------------------------------------------------------------------------------
_______________________________________________
wsjt-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/wsjt-devel