I forgot to add: if your application does not require horizontal scale-out to many CPUs on multiple machines, UIMA has a vertical scale-out tool, the CPE, that supports running multiple pipeline threads on a single machine. More information is at http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
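For reference, a minimal sketch of driving a CPE programmatically (the descriptor path desc/cpe.xml is a placeholder); the number of pipeline threads is set in the descriptor itself, via the processingUnitThreadCount attribute on its <casProcessors> element:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionProcessingEngine;
    import org.apache.uima.collection.EntityProcessStatus;
    import org.apache.uima.collection.StatusCallbackListener;
    import org.apache.uima.collection.metadata.CpeDescription;
    import org.apache.uima.util.XMLInputSource;

    public class RunCpe {
      public static void main(String[] args) throws Exception {
        // Parse the CPE descriptor; "desc/cpe.xml" is a placeholder path.
        // In the descriptor, <casProcessors processingUnitThreadCount="8" ...>
        // is what sets the number of parallel pipeline threads.
        CpeDescription cpeDesc = UIMAFramework.getXMLParser()
            .parseCpeDescription(new XMLInputSource("desc/cpe.xml"));
        final CollectionProcessingEngine cpe =
            UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

        cpe.addStatusCallbackListener(new StatusCallbackListener() {
          public void entityProcessComplete(CAS cas, EntityProcessStatus status) {}
          public void initializationComplete() {}
          public void batchProcessComplete() {}
          public void collectionProcessComplete() {
            System.out.println("CPE finished: " + cpe.getPerformanceReport());
          }
          public void paused() {}
          public void resumed() {}
          public void aborted() {}
        });

        cpe.process(); // returns immediately; the listener reports completion
      }
    }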
On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein <[email protected]> wrote:

> In this case the problem is not DUCC; rather, it is the high overhead of
> opening small files and sending them to a remote computer individually.
> I/O works much more efficiently with larger blocks of data. Many small
> files can be merged into larger files using zip archives. DUCC sample
> code shows how to do this for CASes, and very similar code could be used
> for input documents as well.
>
> Implementing efficient scale-out is highly dependent on good treatment
> of input and output data.
>
> Best,
> Eddie
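A minimal sketch of the merging Eddie describes, using plain java.util.zip rather than the actual DUCC sample code (the input-docs directory and the archive name are made up for illustration):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class MergeSmallFiles {
      public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("input-docs");      // many small input files
        Path archive = Paths.get("workitem-001.zip"); // one larger workitem

        try (ZipOutputStream zip =
                 new ZipOutputStream(Files.newOutputStream(archive));
             DirectoryStream<Path> docs = Files.newDirectoryStream(inputDir)) {
          for (Path doc : docs) {
            if (!Files.isRegularFile(doc)) {
              continue; // skip subdirectories etc.
            }
            // One zip entry per small input document
            zip.putNextEntry(new ZipEntry(doc.getFileName().toString()));
            Files.copy(doc, zip);
            zip.closeEntry();
          }
        }
      }
    }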
> On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <[email protected]> wrote:
>
>> Hello,
>>
>> Thank you very much for your response, and even more so for the
>> detailed explanation.
>>
>> So, if I understand it correctly, DUCC is more suited for scenarios
>> where we have large input documents rather than many small ones?
>>
>> Thank you once again.
>>
>> On Fri, 12 Jun 2020, 22:18 Eddie Epstein, <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> In this simple scenario there is a CollectionReader running in a
>>> JobDriver process, delivering 100K workitems to multiple remote
>>> JobProcesses. The processing time is essentially zero: (30 * 60
>>> seconds) / 100,000 workitems = 18 milliseconds per workitem. This is
>>> roughly the expected overhead of a DUCC JobDriver delivering workitems
>>> to remote JobProcesses and recording the results. DUCC jobs are much
>>> more efficient if the overhead per workitem is much smaller than the
>>> processing time.
>>>
>>> Typically DUCC jobs would be processing much larger blocks of content
>>> per workitem. For example, if a workitem were a document, and the
>>> document were parsed into small CASes by the CasMultiplier, the
>>> throughput would be much better. However, with this example, as the
>>> number of working JobProcess threads is scaled up, the CR (JobDriver)
>>> would become a bottleneck. That's why a typical DUCC job will not send
>>> the document content as a workitem, but rather send a reference to the
>>> workitem content and have the CasMultipliers in the JobProcesses read
>>> the content directly from the source.
>>>
>>> Even though content read by the JobProcesses is much more efficient,
>>> as scale-out continued to increase for this non-computation scenario,
>>> the bottleneck would eventually move to the underlying filesystem, or
>>> to whatever the document source and JobProcess output targets are. The
>>> main motivation for DUCC was jobs similar to those in the DUCC
>>> examples, which use OpenNLP to process large documents; that is, jobs
>>> where CPU processing is the bottleneck rather than I/O.
>>>
>>> Hopefully this helps. If not, happy to continue the discussion.
>>> Eddie
>>>
>>> On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <[email protected]> wrote:
>>>
>>>> Hi,
>>>> Thank you for your reply, and I'm sorry I couldn't get back to this
>>>> earlier.
>>>>
>>>> To get a better picture of the processing speed of DUCC, I made a
>>>> dummy pipeline where the CollectionReader runs a for loop to generate
>>>> 100K workitems (so no disk reads); each workitem only has a simple
>>>> string in it. These are then passed on to the CasMultiplier where,
>>>> for each workitem, I create a new CAS with a DocumentInfo (again
>>>> holding only a simple string value) and pass it as a new CAS to the
>>>> CasConsumer. The CasConsumer doesn't do anything except add the
>>>> document received in the CAS to the logger. So basically this
>>>> pipeline isn't doing anything: no input reads, and the only output is
>>>> the information added to the logger. Running this on the cluster with
>>>> two slave nodes, each with 8 CPUs and 32 GB RAM, still takes more
>>>> than 30 minutes. I don't understand how this is possible, since no
>>>> heavy I/O processing is happening in the code.
>>>>
>>>> Any ideas, please?
>>>>
>>>> Thank you.
>>>>
>>>> On 2020/05/18 12:47:41, Eddie Epstein <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> Removing the AE from the pipeline was a good idea to help isolate
>>>>> the bottleneck. The other two most likely possibilities are the
>>>>> collection reader pulling from Elasticsearch or the CAS consumer
>>>>> writing the processing output.
>>>>>
>>>>> DUCC jobs are a simple way to scale out compute bottlenecks across
>>>>> a cluster. Scale-out may be of limited or no value for I/O-bound
>>>>> jobs. Please give a more complete picture of the processing scenario
>>>>> on DUCC.
>>>>>
>>>>> Regards,
>>>>> Eddie
>>>>>
>>>>> On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've been trying to run a very small UIMA DUCC cluster with 2 slave
>>>>>> nodes having 32 GB of RAM each. I wrote a custom CollectionReader
>>>>>> to read data from an Elasticsearch index and dump it into a new
>>>>>> index after certain analysis engine processing. The analysis engine
>>>>>> is simple sentiment analysis code. The performance I'm getting is
>>>>>> very slow, as it is only able to process ~150 documents/minute.
>>>>>> To test the performance without the analysis engine, I removed the
>>>>>> AE from the pipeline, but still I did not get any improvement in
>>>>>> the processing speeds. Can you please guide me as to where I might
>>>>>> be going wrong, or what I can do to improve the processing speeds?
>>>>>>
>>>>>> Thank you.
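To illustrate the reference-passing pattern from Eddie's Jun 12 reply, here is a rough sketch of a CAS multiplier that reads workitem content directly in the JobProcess (this is not the DUCC sample code; it assumes the workitem CAS's document text is simply the path of a zip archive like the one built above):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    public class WorkitemMultiplier extends JCasMultiplier_ImplBase {

      private ZipFile zip; // the archive named by the current workitem CAS
      private Enumeration<? extends ZipEntry> entries;

      @Override
      public void process(JCas workitem) throws AnalysisEngineProcessException {
        try {
          // The workitem carries only a reference: here, the path of a
          // zip archive holding the actual documents.
          zip = new ZipFile(workitem.getDocumentText().trim());
          entries = zip.entries();
        } catch (IOException e) {
          throw new AnalysisEngineProcessException(e);
        }
      }

      @Override
      public boolean hasNext() throws AnalysisEngineProcessException {
        if (entries != null && entries.hasMoreElements()) {
          return true;
        }
        try {
          if (zip != null) {
            zip.close(); // done with this workitem's archive
          }
        } catch (IOException e) {
          throw new AnalysisEngineProcessException(e);
        }
        return false;
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
        ZipEntry entry = entries.nextElement();
        JCas cas = getEmptyJCas(); // one output CAS per archived document
        try (InputStream in = zip.getInputStream(entry)) {
          cas.setDocumentText(
              new String(in.readAllBytes(), StandardCharsets.UTF_8)); // Java 9+
        } catch (IOException e) {
          cas.release();
          throw new AnalysisEngineProcessException(e);
        }
        return cas;
      }
    }

With this arrangement each small document still flows downstream as its own CAS, but the JobDriver only ever handled the single reference, so its per-workitem overhead is paid once per archive rather than once per document.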
