I forgot to add: if your application does not require horizontal scale-out to many CPUs on multiple machines, UIMA has a vertical scale-out tool, the CPE, that supports running multiple pipeline threads on a single machine. More information is at http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
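For reference, a minimal sketch of driving a CPE programmatically (the descriptor path desc/cpe.xml is a placeholder); the number of pipeline threads is set in the descriptor itself, via the processingUnitThreadCount attribute on its <casProcessors> element:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionProcessingEngine;
    import org.apache.uima.collection.EntityProcessStatus;
    import org.apache.uima.collection.StatusCallbackListener;
    import org.apache.uima.collection.metadata.CpeDescription;
    import org.apache.uima.util.XMLInputSource;

    public class RunCpe {
      public static void main(String[] args) throws Exception {
        // Parse the CPE descriptor; "desc/cpe.xml" is a placeholder path.
        // In the descriptor, <casProcessors processingUnitThreadCount="8" ...>
        // is what sets the number of parallel pipeline threads.
        CpeDescription cpeDesc = UIMAFramework.getXMLParser()
            .parseCpeDescription(new XMLInputSource("desc/cpe.xml"));
        final CollectionProcessingEngine cpe =
            UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

        cpe.addStatusCallbackListener(new StatusCallbackListener() {
          public void entityProcessComplete(CAS cas, EntityProcessStatus status) {}
          public void initializationComplete() {}
          public void batchProcessComplete() {}
          public void collectionProcessComplete() {
            System.out.println("CPE finished: " + cpe.getPerformanceReport());
          }
          public void paused() {}
          public void resumed() {}
          public void aborted() {}
        });

        cpe.process(); // returns immediately; the listener reports completion
      }
    }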
On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein <[email protected]> wrote:

> In this case the problem is not DUCC; rather, it is the high overhead of
> opening small files and sending them to a remote computer individually.
> I/O works much more efficiently with larger blocks of data. Many small
> files can be merged into larger files using zip archives. DUCC sample
> code shows how to do this for CASes, and very similar code could be used
> for input documents as well.
>
> Implementing efficient scale-out is highly dependent on good treatment
> of input and output data.
>
> Best,
> Eddie
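A minimal sketch of the merging Eddie describes, using plain java.util.zip rather than the actual DUCC sample code (the input-docs directory and the archive name are made up for illustration):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class MergeSmallFiles {
      public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("input-docs");      // many small input files
        Path archive = Paths.get("workitem-001.zip"); // one larger workitem

        try (ZipOutputStream zip =
                 new ZipOutputStream(Files.newOutputStream(archive));
             DirectoryStream<Path> docs = Files.newDirectoryStream(inputDir)) {
          for (Path doc : docs) {
            if (!Files.isRegularFile(doc)) {
              continue; // skip subdirectories etc.
            }
            // One zip entry per small input document
            zip.putNextEntry(new ZipEntry(doc.getFileName().toString()));
            Files.copy(doc, zip);
            zip.closeEntry();
          }
        }
      }
    }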
> On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <[email protected]> wrote:
>
>> Hello,
>>
>> Thank you very much for your response, and even more so for the
>> detailed explanation.
>>
>> So, if I understand it correctly, DUCC is more suited for scenarios
>> where we have large input documents rather than many small ones?
>>
>> Thank you once again.
>>
>> On Fri, 12 Jun 2020, 22:18 Eddie Epstein, <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> In this simple scenario there is a CollectionReader running in a
>>> JobDriver process, delivering 100K workitems to multiple remote
>>> JobProcesses. The processing time is essentially zero: (30 * 60
>>> seconds) / 100,000 workitems = 18 milliseconds per workitem. This is
>>> roughly the expected overhead of a DUCC JobDriver delivering workitems
>>> to remote JobProcesses and recording the results. DUCC jobs are much
>>> more efficient if the overhead per workitem is much smaller than the
>>> processing time.
>>>
>>> Typically DUCC jobs would be processing much larger blocks of content
>>> per workitem. For example, if a workitem were a document, and the
>>> document were parsed into small CASes by the CasMultiplier, the
>>> throughput would be much better. However, with this example, as the
>>> number of working JobProcess threads is scaled up, the CR (JobDriver)
>>> would become a bottleneck. That's why a typical DUCC job will not send
>>> the document content as a workitem, but rather send a reference to the
>>> workitem content and have the CasMultipliers in the JobProcesses read
>>> the content directly from the source.
>>>
>>> Even though content read by the JobProcesses is much more efficient,
>>> as scale-out continued to increase for this non-computation scenario,
>>> the bottleneck would eventually move to the underlying filesystem, or
>>> to whatever the document source and JobProcess output targets are. The
>>> main motivation for DUCC was jobs similar to those in the DUCC
>>> examples, which use OpenNLP to process large documents; that is, jobs
>>> where CPU processing is the bottleneck rather than I/O.
>>>
>>> Hopefully this helps. If not, happy to continue the discussion.
>>> Eddie
>>>
>>> On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <[email protected]> wrote:
>>>
>>>> Hi,
>>>> Thank you for your reply, and I'm sorry I couldn't get back to this
>>>> earlier.
>>>>
>>>> To get a better picture of the processing speed of DUCC, I made a
>>>> dummy pipeline where the CollectionReader runs a for loop to generate
>>>> 100K workitems (so no disk reads); each workitem only has a simple
>>>> string in it. These are then passed on to the CasMultiplier where,
>>>> for each workitem, I create a new CAS with a DocumentInfo (again
>>>> holding only a simple string value) and pass it as a new CAS to the
>>>> CasConsumer. The CasConsumer doesn't do anything except add the
>>>> document received in the CAS to the logger. So basically this
>>>> pipeline isn't doing anything: no input reads, and the only output is
>>>> the information added to the logger. Running this on the cluster with
>>>> two slave nodes, each with 8 CPUs and 32 GB RAM, still takes more
>>>> than 30 minutes. I don't understand how this is possible, since no
>>>> heavy I/O processing is happening in the code.
>>>>
>>>> Any ideas, please?
>>>>
>>>> Thank you.
>>>>
>>>> On 2020/05/18 12:47:41, Eddie Epstein <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> Removing the AE from the pipeline was a good idea to help isolate
>>>>> the bottleneck. The other two most likely possibilities are the
>>>>> collection reader pulling from Elasticsearch or the CAS consumer
>>>>> writing the processing output.
>>>>>
>>>>> DUCC jobs are a simple way to scale out compute bottlenecks across
>>>>> a cluster. Scale-out may be of limited or no value for I/O-bound
>>>>> jobs. Please give a more complete picture of the processing scenario
>>>>> on DUCC.
>>>>>
>>>>> Regards,
>>>>> Eddie
>>>>>
>>>>> On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've been trying to run a very small UIMA DUCC cluster with 2 slave
>>>>>> nodes having 32 GB of RAM each. I wrote a custom CollectionReader
>>>>>> to read data from an Elasticsearch index and dump it into a new
>>>>>> index after certain analysis engine processing. The analysis engine
>>>>>> is simple sentiment analysis code. The performance I'm getting is
>>>>>> very slow, as it is only able to process ~150 documents/minute.
>>>>>> To test the performance without the analysis engine, I removed the
>>>>>> AE from the pipeline, but still I did not get any improvement in
>>>>>> the processing speeds. Can you please guide me as to where I might
>>>>>> be going wrong, or what I can do to improve the processing speeds?
>>>>>>
>>>>>> Thank you.
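To illustrate the reference-passing pattern from Eddie's Jun 12 reply, here is a rough sketch of a CAS multiplier that reads workitem content directly in the JobProcess (this is not the DUCC sample code; it assumes the workitem CAS's document text is simply the path of a zip archive like the one built above):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    public class WorkitemMultiplier extends JCasMultiplier_ImplBase {

      private ZipFile zip; // the archive named by the current workitem CAS
      private Enumeration<? extends ZipEntry> entries;

      @Override
      public void process(JCas workitem) throws AnalysisEngineProcessException {
        try {
          // The workitem carries only a reference: here, the path of a
          // zip archive holding the actual documents.
          zip = new ZipFile(workitem.getDocumentText().trim());
          entries = zip.entries();
        } catch (IOException e) {
          throw new AnalysisEngineProcessException(e);
        }
      }

      @Override
      public boolean hasNext() throws AnalysisEngineProcessException {
        if (entries != null && entries.hasMoreElements()) {
          return true;
        }
        try {
          if (zip != null) {
            zip.close(); // done with this workitem's archive
          }
        } catch (IOException e) {
          throw new AnalysisEngineProcessException(e);
        }
        return false;
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
        ZipEntry entry = entries.nextElement();
        JCas cas = getEmptyJCas(); // one output CAS per archived document
        try (InputStream in = zip.getInputStream(entry)) {
          cas.setDocumentText(
              new String(in.readAllBytes(), StandardCharsets.UTF_8)); // Java 9+
        } catch (IOException e) {
          cas.release();
          throw new AnalysisEngineProcessException(e);
        }
        return cas;
      }
    }

With this arrangement each small document still flows downstream as its own CAS, but the JobDriver only ever handled the single reference, so its per-workitem overhead is paid once per archive rather than once per document.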
