Hello,

Thank you very much for your response and even more so for the detailed explanation.
So, if I understand it correctly, DUCC is better suited to scenarios where we have large input documents rather than many small ones? Thank you once again.

On Fri, 12 Jun 2020, 22:18 Eddie Epstein, <[email protected]> wrote:

> Hi,
>
> In this simple scenario there is a CollectionReader running in a JobDriver process, delivering 100K workitems to multiple remote JobProcesses. The processing time is essentially zero. (30 * 60 seconds) / 100,000 workitems = 18 milliseconds per workitem. This time is roughly the expected overhead of a DUCC JobDriver delivering workitems to remote JobProcesses and recording the results. DUCC jobs are much more efficient when the overhead per workitem is much smaller than the processing time.
>
> Typically, DUCC jobs would process much larger blocks of content per workitem. For example, if a workitem were a document, and the document were parsed into small CASes by the CasMultiplier, the throughput would be much better. However, with this example, as the number of working JobProcess threads is scaled up, the CR (JobDriver) would become a bottleneck. That's why a typical DUCC job will not send the document content as a workitem, but rather send a reference to the workitem content and have the CasMultipliers in the JobProcesses read the content directly from the source.
>
> Even though having the JobProcesses read the content is much more efficient, as scaleout continued to increase for this non-computation scenario, the bottleneck would eventually move to the underlying filesystem, or whatever the document source and JobProcess output destination are. The main motivation for DUCC was jobs similar to those in the DUCC examples, which use OpenNLP to process large documents; that is, jobs where CPU processing is the bottleneck rather than I/O.
>
> Hopefully this helps. If not, happy to continue the discussion.
> Eddie
>
> On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <[email protected]> wrote:
>
> > Hi,
> > Thank you for your reply, and I'm sorry I couldn't get back to this earlier.
> >
> > To get a better picture of the processing speed of DUCC, I made a dummy pipeline where the CollectionReader runs a for loop to generate 100K workitems (so no disk reads); each workitem only holds a simple string. These are then passed on to the CasMultiplier, where for each workitem I create a new CAS with DocumentInfo (again holding only a simple string value) and pass it as a new CAS to the CasConsumer. The CasConsumer doesn't do anything except write the document received in the CAS to the logger. So basically this pipeline isn't doing anything: no input reads, and the only output is the information added to the logger. Running this on the cluster with 2 slave nodes, each with 8 CPUs and 32GB RAM, still takes more than 30 minutes. I don't understand how this is possible, since no heavy I/O processing is happening in the code.
> >
> > Any ideas please?
> >
> > Thank you.
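For concreteness, here is a minimal sketch of the kind of no-op CollectionReader described above. The class name and details are illustrative assumptions, not the actual job code:

    import java.io.IOException;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionException;
    import org.apache.uima.collection.CollectionReader_ImplBase;
    import org.apache.uima.util.Progress;
    import org.apache.uima.util.ProgressImpl;

    // A no-op reader: 100K tiny in-memory workitems, no disk reads at all.
    public class DummyWorkitemReader extends CollectionReader_ImplBase {

        private static final int TOTAL = 100000;
        private int emitted = 0;

        public boolean hasNext() throws IOException, CollectionException {
            return emitted < TOTAL;
        }

        public void getNext(CAS cas) throws IOException, CollectionException {
            // Each workitem carries only a short string, so per-item
            // processing cost is essentially zero; any elapsed time is
            // JobDriver dispatch-and-record overhead.
            cas.setDocumentText("workitem-" + emitted);
            emitted++;
        }

        public Progress[] getProgress() {
            return new Progress[] { new ProgressImpl(emitted, TOTAL, Progress.ENTITIES) };
        }

        public void close() throws IOException {
            // nothing to release
        }
    }

Even with a reader this cheap, every workitem still pays the ~18 ms JobDriver dispatch overhead Eddie describes above, which by itself accounts for the run time: 100,000 * 18 ms = 1,800 s = 30 minutes.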
> > On 2020/05/18 12:47:41, Eddie Epstein <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Removing the AE from the pipeline was a good idea to help isolate the bottleneck. The other two most likely possibilities are the CollectionReader pulling from Elasticsearch or the CAS consumer writing the processing output.
> > >
> > > DUCC jobs are a simple way to scale out compute bottlenecks across a cluster. Scaleout may be of limited or no value for I/O-bound jobs. Please give a more complete picture of the processing scenario on DUCC.
> > >
> > > Regards,
> > > Eddie
> > >
> > > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I've been trying to run a very small UIMA DUCC cluster with 2 slave nodes having 32GB of RAM each. I wrote a custom CollectionReader to read data from an Elasticsearch index and dump it into a new index after some analysis engine processing. The analysis engine is a simple sentiment analysis component. The performance I'm getting is very slow: it is only able to process ~150 documents/minute.
> > > > To test the performance without the analysis engine, I removed the AE from the pipeline, but I still did not get any improvement in processing speed. Can you please guide me as to where I might be going wrong, or what I can do to improve the processing speed?
> > > >
> > > > Thank you.
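To make the reference-passing approach from Eddie's reply concrete, here is a minimal sketch of a CasMultiplier that receives only a file path in the workitem CAS and reads the content itself inside the JobProcess. The class name and the use of the document text as the reference carrier are simplifying assumptions; a real DUCC job would follow DUCC's own workitem conventions:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    public class ReferenceReadingMultiplier extends JCasMultiplier_ImplBase {

        private String pendingText; // content fetched from the referenced source

        @Override
        public void process(JCas workitemCas) throws AnalysisEngineProcessException {
            // The JobDriver shipped only a reference (here: a file path),
            // so the CollectionReader stays cheap; the expensive read
            // happens here, in parallel across JobProcess threads.
            String path = workitemCas.getDocumentText();
            try {
                pendingText = new String(
                        Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new AnalysisEngineProcessException(e);
            }
        }

        @Override
        public boolean hasNext() throws AnalysisEngineProcessException {
            return pendingText != null;
        }

        @Override
        public AbstractCas next() throws AnalysisEngineProcessException {
            JCas childCas = getEmptyJCas();
            childCas.setDocumentText(pendingText);
            pendingText = null;
            return childCas;
        }
    }

Because each JobProcess thread does its own read, the JobDriver only moves short reference strings, and, as noted above, further scaleout eventually shifts the bottleneck to the shared filesystem or document source rather than the JobDriver.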
