Re: DUCC doesn't use all available machines

Eddie Epstein Mon, 17 Nov 2014 12:45:02 -0800

DuccRawTextSpec.job specifies that each job process (JP)
run 8 analytic pipeline threads. So for this job with 100 work
items, no more than 13 JPs (100 / 8, rounded up) would ever be started.
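
The arithmetic behind that limit can be sketched as follows. This is only an illustration, not DUCC code: the function name max_job_processes is made up here, and it assumes each pipeline thread handles one work item at a time.

```python
def max_job_processes(work_items: int, threads_per_jp: int) -> int:
    """Upper bound on useful JPs (hypothetical helper, not a DUCC API).

    Assumes one work item per pipeline thread at a time, so any JP beyond
    ceil(work_items / threads_per_jp) would have nothing to process.
    """
    if work_items < 1 or threads_per_jp < 1:
        raise ValueError("work_items and threads_per_jp must be >= 1")
    # Integer ceiling division.
    return (work_items + threads_per_jp - 1) // threads_per_jp


if __name__ == "__main__":
    print(max_job_processes(100, 8))  # the job discussed in this thread
    print(max_job_processes(100, 4))  # same job with half the threads per JP
```

With 100 work items, 8 threads per JP gives a limit of 13, and 4 threads per JP raises it to 25.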

After successful initialization of the first JP, DUCC begins scaling
up the number of JPs using doubling. During JP scale up the
scheduler monitors the work item completion rate, compares that
with the JP initialization time, and stops scaling up JPs when
starting more JPs will not make the job run any faster.
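
As a rough sketch of the doubling pattern only: the function below lists the JP counts a pure doubling ramp would step through up to a given cap. It is not the scheduler's implementation. The name doubling_schedule is hypothetical, and the sketch deliberately ignores the completion-rate check and the fair-share limit, either of which can stop the ramp earlier.

```python
def doubling_schedule(max_jps: int) -> list[int]:
    """JP counts visited by a pure doubling ramp-up (illustrative only).

    Starts at 1 JP and doubles until the cap is reached. The real scheduler
    may stop sooner; this sketch models only the doubling and the cap.
    """
    if max_jps < 1:
        raise ValueError("max_jps must be >= 1")
    counts = []
    n = 1
    while n < max_jps:
        counts.append(n)
        n *= 2
    counts.append(max_jps)  # the final step is clipped to the cap
    return counts


if __name__ == "__main__":
    print(doubling_schedule(13))
```

For a cap of 13 this gives 1, 2, 4, 8, 13. The procs column in the quoted output below follows the same 1, 2, 4, 8 progression.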

Of course JP scale up is also limited by the job's "fair share"
of resources relative to total resources available for all preemptable jobs.

To see more JPs, increase the number and/or size of the input text files,
or decrease the number of pipeline threads per JP.

Note that it can be counterproductive to run "too many" pipeline
threads per machine. Assuming analytic threads are 100% CPU bound,
running more threads than real cores will often slow down the overall
document processing rate.

On Mon, Nov 17, 2014 at 6:48 AM, Simon Hafner <reactorm...@gmail.com> wrote:

> I fired the DuccRawTextSpec.job on a cluster consisting of three
> machines, with 100 documents. The scheduler only runs the processes on
> two machines instead of all three. Can I mess with a few config
> variables to make it use all three?
>
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:1
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:2
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:4
> id:22 state:Running total:100 done:1 error:0 retry:0 procs:8
> id:22 state:Running total:100 done:6 error:0 retry:0 procs:8
>
