Thanks for all the input! I have some reading to do now ;-) Best,
Erik

> On 21 Apr 2017, at 23:22, Eddie Epstein <[email protected]> wrote:
>
> Hi Erik,
>
> A few words about DUCC and your application. DUCC is a cluster controller
> that includes a resource manager and three applications: batch processing,
> long-running services, and singleton processes.
>
> The batch processing application consists of a user's CollectionReader,
> which defines work items, and a user's aggregate for processing work items,
> which can be replicated as desired across the cluster of machines. DUCC
> manages the remote process scale-out and the distribution of work items.
> The aggregate can be vertically scaled within each process so that in-heap
> data can be shared by multiple instances of the aggregate. UIMA-AS is not
> required for this simple threading model.
>
> For most applications a work item is itself a collection: a CAS containing
> references to the data to be processed, where the collection size is chosen
> with small enough granularity to support scale-out but big enough
> granularity to avoid bottlenecks.
>
> The user's aggregate normally has an initial CasMultiplier that reads the
> input data and creates the CASes to be fed to the rest of the pipeline.
> When all child CASes have finished processing, the work item CAS is routed
> to the aggregate's CasConsumer to finalize the collection. DUCC considers
> the work item complete only when the work item CAS has been successfully
> processed.
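>
> To make this concrete, here is a bare-bones sketch of such a work-item
> CasMultiplier. It is purely illustrative: how the work item CAS carries its
> document references is application-specific, and I am simply assuming one
> reference per line of the work item's document text.
>
>   import java.util.Arrays;
>   import java.util.Collections;
>   import java.util.Iterator;
>
>   import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
>   import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
>   import org.apache.uima.cas.AbstractCas;
>   import org.apache.uima.jcas.JCas;
>
>   public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {
>
>     private Iterator<String> docRefs = Collections.emptyIterator();
>
>     @Override
>     public void process(JCas workItem) throws AnalysisEngineProcessException {
>       // Assumption: the work item CAS carries one document reference per line.
>       String refs = workItem.getDocumentText();
>       docRefs = refs == null
>           ? Collections.<String>emptyIterator()
>           : Arrays.asList(refs.split("\n")).iterator();
>     }
>
>     @Override
>     public boolean hasNext() {
>       return docRefs.hasNext();
>     }
>
>     @Override
>     public AbstractCas next() throws AnalysisEngineProcessException {
>       // One child CAS per referenced document, fed to the rest of the pipeline.
>       JCas child = getEmptyJCas();
>       child.setDocumentText(fetchDocument(docRefs.next()));
>       return child;
>     }
>
>     private String fetchDocument(String ref) {
>       // Application-specific: load the raw document behind the reference.
>       return "text of " + ref;
>     }
>   }
>
> The multiplier's descriptor must declare outputsNewCASes=true in its
> operational properties so that the framework routes the child CASes it
> produces.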
> The system is quite robust to errors: uncaught exceptions, analytics
> crashing, machines crashing, etc.
>
> Regards,
> Eddie
>
>
> On Fri, Apr 21, 2017 at 2:12 PM, Olga Patterson <[email protected]> wrote:
>
>> Erik,
>>
>> My team at the VA has developed an easy way of implementing UIMA-AS
>> pipelines and scaling them to a large number of nodes, using the Leo
>> framework, which extends UIMA-AS 2.8.1. We have run pipelines on over 200M
>> documents scaled across multiple nodes with dozens of service instances,
>> and it performs great.
>>
>> Here is some info:
>> http://department-of-veterans-affairs.github.io/Leo/
>>
>> The documentation for Leo reflects an earlier version of Leo. If you are
>> interested in using it with Java 8 and UIMA 2.8.1: we have not yet
>> released the latest version on the VA GitHub, but we can share it with you
>> so that you can test it out and possibly provide your comments back to us.
>>
>> Leo has simple-to-use functionality for flexible batch reading and
>> writing, and it works with any UIMA AEs and existing descriptor files and
>> type system descriptions, so if you already have a pipeline, wrapping it
>> with Leo would take just a few lines of code.
>>
>> Let me know if you are interested and I can help you get started.
>>
>> Olga Patterson
>>
>> -----Original Message-----
>> From: Jaroslaw Cwiklik <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday, April 21, 2017 at 8:08 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Synchonizing Batches AE and StatusCallbackListener
>>
>> Erik, thanks. It is now clearer what you are trying to accomplish. First,
>> there are no plans to retire the CPE; it is still supported. The only
>> caveat concerns ongoing development: my efforts are focused on extending
>> and improving UIMA-AS.
>>
>> I don't have an answer yet for how to handle the CPE crash scenario with
>> respect to batching and a subsequent restart from the last known good
>> batch. It seems like some coordination would be needed to avoid redoing
>> the whole collection after a crash. It's been a while since I've looked at
>> the CPE; I will take a look and see what is possible, if anything.
>>
>> There is another Apache UIMA project called DUCC, which stands for
>> Distributed UIMA Cluster Computing. From your email it looks like you have
>> a cluster of machines available. Here is a quick description of DUCC:
>>
>> DUCC is a Linux cluster controller designed to scale out any UIMA pipeline
>> for high-throughput collection processing jobs as well as for low-latency
>> real-time applications. Building on UIMA-AS, DUCC is particularly well
>> suited to running large-memory Java analytics in multiple threads in order
>> to fully utilize multicore machines. DUCC manages the life cycle of all
>> processes deployed across the cluster, including non-UIMA processes such
>> as Tomcat servers or VNC sessions.
>>
>> You can find more info on this here:
>> https://uima.apache.org/doc-uimaducc-whatitam.html
>>
>> In UIMA-AS, batching is an application concern. I am a bit fuzzy on the
>> implementation, so perhaps someone else can comment on how to implement
>> batching and how to handle errors. You can use a CasMultiplier and a
>> custom FlowController to manage CASes and react to errors. The UIMA-AS
>> service can take an input CAS representing your batch, pass it on to the
>> CasMultiplier, generate CASes for each piece of work, and deliver results
>> to the CasConsumer, with a FlowController in the middle orchestrating the
>> flow. I defer to application deployment experts to provide you with more
>> detail.
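>>
>> Very roughly - and this is only a sketch with placeholder delegate keys,
>> not a tested implementation - such a flow controller could look like the
>> following. The work item CAS goes from the CasMultiplier straight to the
>> CasConsumer, each child CAS goes through analysis and then to the
>> CasConsumer, and one possible error policy tolerates a failed analysis
>> step so the work item can still be finalized:
>>
>>   import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
>>   import org.apache.uima.flow.FinalStep;
>>   import org.apache.uima.flow.Flow;
>>   import org.apache.uima.flow.JCasFlowController_ImplBase;
>>   import org.apache.uima.flow.JCasFlow_ImplBase;
>>   import org.apache.uima.flow.SimpleStep;
>>   import org.apache.uima.flow.Step;
>>   import org.apache.uima.jcas.JCas;
>>
>>   public class BatchFlowController extends JCasFlowController_ImplBase {
>>
>>     @Override
>>     public Flow computeFlow(JCas cas) throws AnalysisEngineProcessException {
>>       // Incoming CASes are work items: multiply, then finalize.
>>       return new BatchFlow("CasMultiplier", "CasConsumer");
>>     }
>>
>>     static class BatchFlow extends JCasFlow_ImplBase {
>>       private final String[] route; // placeholder delegate keys
>>       private int pos = 0;
>>
>>       BatchFlow(String... route) { this.route = route; }
>>
>>       @Override
>>       public Step next() {
>>         return pos < route.length ? new SimpleStep(route[pos++]) : new FinalStep();
>>       }
>>
>>       @Override
>>       protected Flow newCasProduced(JCas newCas, String producedBy) {
>>         // Child CASes created by the CasMultiplier get their own route.
>>         return new BatchFlow("Analysis", "CasConsumer");
>>       }
>>
>>       @Override
>>       public boolean continueOnFailure(String failedAeKey, Exception failure) {
>>         // Example policy: a failed analysis step does not kill the work item.
>>         return "Analysis".equals(failedAeKey);
>>       }
>>     }
>>   }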
>>
>> Jerry
>>
>> On Fri, Apr 21, 2017 at 2:21 AM, Erik Fäßler <[email protected]> wrote:
>>
>>> Hi Jerry,
>>>
>>> thanks a lot for your answer! I'm sorry that I didn't make myself
>>> clearer; I will try again! :-)
>>> Here comes a lot of text, sorry for that. The post has two parts: the
>>> first explains my issue, the second responds to the pointer to UIMA-AS.
>>>
>>> First: Yes, I use a CPE. I process text documents - tens of millions of
>>> them. These are the components involved in my issue, running within the
>>> CPE:
>>>
>>> 1. A CAS consumer (internally just an AnalysisEngine, of course). This
>>> consumer is responsible for serialising the document CAS into XMI and
>>> sending the XMI to a database - an XMI-to-database consumer. For
>>> performance reasons, the XMI of multiple CASes is buffered and then sent
>>> as a batch, let's say 50 CAS XMIs at a time.
>>> 2. A CPE StatusCallbackListener, which also writes to the same database,
>>> but to another table. It records in the database which documents have
>>> been successfully processed by the CPE. It also works on a batch basis.
>>>
>>> The goal: the CallbackListener should only mark those documents as
>>> successfully processed (i.e. as "finished") for which the CAS consumer
>>> has actually sent the XMI data to the database.
>>>
>>> Reason: I don't want documents marked as "finished" whose XMI data is not
>>> in the database but still sits in the CAS buffer. If the pipeline crashes
>>> at that point, the XMI data never gets sent to the database, and the
>>> processing state becomes inconsistent: documents that were never written
>>> to the database are marked as successfully processed, but their data is
>>> missing.
>>>
>>> Also, not every XMI is stored: a condition in the consumer decides
>>> whether an XMI is to be stored or not. Thus, I cannot "create
>>> consistency" afterwards by checking which XMI made it into the database.
>>>
>>> Is this clearer?
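>>>
>>> To illustrate what I am after - this is only a sketch of mine, nothing
>>> that UIMA provides out of the box - I imagine a small registry shared
>>> between the consumer and the listener (both run in the same JVM in an
>>> integrated CPE). The consumer reports a document once its XMI batch has
>>> really been committed (or when it deliberately skips the document), and
>>> the listener only marks documents as "finished" that are both completed
>>> and reported:
>>>
>>>   import java.util.Collection;
>>>   import java.util.HashSet;
>>>   import java.util.Set;
>>>
>>>   public final class FlushRegistry {
>>>     // Documents whose XMI the consumer has committed or deliberately skipped.
>>>     private static final Set<String> flushed = new HashSet<>();
>>>     // Documents the CPE has reported as completed to the listener.
>>>     private static final Set<String> completed = new HashSet<>();
>>>
>>>     private FlushRegistry() {}
>>>
>>>     // Consumer: call right after a batch commit (and for skipped documents).
>>>     public static synchronized void markFlushed(Collection<String> docIds) {
>>>       flushed.addAll(docIds);
>>>     }
>>>
>>>     // Listener: call from entityProcessComplete().
>>>     public static synchronized void markCompleted(String docId) {
>>>       completed.add(docId);
>>>     }
>>>
>>>     // Listener: documents now safe to mark "finished" (completed AND flushed).
>>>     public static synchronized Set<String> drainFinished() {
>>>       Set<String> done = new HashSet<>(completed);
>>>       done.retainAll(flushed);
>>>       completed.removeAll(done);
>>>       flushed.removeAll(done);
>>>       return done;
>>>     }
>>>   }
>>>
>>> Documents that have completed processing but whose XMI batch is still
>>> buffered would simply stay pending until the consumer's next flush, and
>>> after a crash only documents actually marked "finished" would be skipped
>>> on restart. Is something along these lines the way to go, or is there a
>>> built-in mechanism for it?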
>>>
>>> Regarding UIMA-AS:
>>>
>>> I tried it out a few years back when it was rather new, around UIMA
>>> 2.3.1. Back then, the process was roughly the following:
>>> 1. Install a broker (ActiveMQ, I believe it was called).
>>> 2. Configure it.
>>> 3. Start it.
>>> 4. For each AE you want to use, deploy the AE on some server in your
>>> cluster (multiple AEs can be bundled into an AAE).
>>> 5. Start a reader process that then fills the broker queue.
>>> 6. Wait until processing is finished.
>>> 7. Stop all the AE services deployed to the cluster, if you want to free
>>> the resources.
>>> 8. Stop the broker.
>>>
>>> This was quite a while back, so perhaps this is not exactly how it was,
>>> but it seemed overly complex to me. I had to log in to each server where
>>> I wanted work to be done, and we have around 20 nodes. Perhaps I could
>>> write a script for that, but then I would have to keep track of which
>>> servers are free at any given time, because I am not the only one using
>>> the cluster.
>>> And then I have to stop all AE "services"; until then, they keep using
>>> memory because they just idle when there is nothing left to do.
>>>
>>> In contrast, CPEs in my case are self-contained projects which I can
>>> distribute easily through our job system (SLURM).
>>>
>>> I thought all the setup for UIMA-AS would pay off in better performance,
>>> but in my - admittedly limited - tests there was not much of a
>>> performance difference. CPEs seemed to be a bit faster due to the lack of
>>> CAS serialization between reader and AEs.
>>>
>>> Of course, this was years in the past. Is the process a bit simpler
>>> today? Or perhaps I got it wrong to begin with, that's possible. But I
>>> read the documentation back then and couldn't see how to do things much
>>> more simply.
>>>
>>> BUT if CPEs can't solve my issue and UIMA-AS can, then perhaps I will try
>>> it again.
>>>
>>> Another question: you said the CPE was replaced by UIMA-AS. Does that
>>> mean that CPEs will eventually be removed from UIMA? Are they still a
>>> part of UIMA 3?
>>>
>>> Sorry for all the text!
>>>
>>> Best regards and thanks!
>>>
>>> Erik
>>>
>>>> On 20 Apr 2017, at 20:31, Jaroslaw Cwiklik <[email protected]> wrote:
>>>>
>>>> Hi Erik, sorry for the delay in responding to your question. This seems
>>>> like a CPE question, is that right? I am not quite following what issue
>>>> you are running into. Could you explain it in more detail? With a
>>>> clearer problem description, perhaps others will jump in with an
>>>> answer :)
>>>>
>>>> Just FYI, the CPE was replaced by UIMA-AS quite a long time ago. Perhaps
>>>> UIMA-AS can work better for you. You can read about it here:
>>>> https://uima.apache.org/d/uima-as-2.9.0/uima_async_scaleout.html
>>>>
>>>> Jerry
>>>> UIMA Team
>>>>
>>>> On Tue, Apr 18, 2017 at 5:56 AM, Erik Fäßler <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a use case where a consumer of mine sends CAS XMI data to a
>>>>> database in batchProcessComplete(). I also use a StatusCallbackListener
>>>>> that records in the database whether a document has completed
>>>>> processing; this is also done batch-wise.
>>>>> Now the issue is: if the pipeline crashes for any reason, I must start
>>>>> over, because the "completion" flag from the CallbackListener and the
>>>>> data actually sent by the XMI consumer are not synchronised. That is, I
>>>>> don't know whether the data has actually been sent for a document that
>>>>> has completed processing, because everything is done batch-wise rather
>>>>> than immediately, for performance reasons. I also cannot simply check
>>>>> the database for which XMI data is there, because XMI is only sent when
>>>>> a certain condition is met.
>>>>>
>>>>> I would like the consumer and the CallbackListener to somehow
>>>>> communicate, so that they send their data for the same documents in
>>>>> agreement. Is there anything I can do to achieve this?
>>>>>
>>>>> Best,
>>>>>
>>>>> Erik
