Thanks for your reply, Jaroslaw. It seems that I misunderstood
the way UIMA AS works.

1)
"... Because the AAE is not thread safe uima as must scale it through
creating multiple instances of it..."

Since the AAE is not thread safe you should not try to scale it out in the
same JVM. If AAE is not thread safe, you should only have one instance of it
per JVM. You can scale it by starting multiple JVMs.

I reduced my AAE to three delegate AEs:

1. HBaseCasMultiplier -> fetches the actual text from HBase
2. Tokenizer -> adds tokens to my CAS
3. HBaseWriter -> writes the tokens back into HBase

These delegates are not thread safe, so to scale these AEs
one instance per worker thread must be created.
That's what I want UIMA AS to do for me, and I think that's
also the case described in section 1.4.1 of the documentation:

"... The classes for annotators and flow controllers do not need to be "thread-safe" with respect to their instance data - meaning, they do not need to be implemented with synchronization locks for access to their instance data, because each instance will only be called using one thread at a time. Scale out for these classes is done using
multiple instances of the class. ..."

2)
"...I must admit the documentation confused me a bit about the meaning of
the async attribute..."

The async attribute is only used for aggregates, and specifies that this
aggregate will be run asynchronously (with input queues in front of all of
its delegates) or not. If you choose async="false" it means that you want to
deploy the aggregate synchronously. Meaning it will be single-threaded. To
UIMA AS, a synchronous aggregate is the same as a UIMA primitive AE.

Thanks, I understand the difference now, so I want async="true".

3)
            ...
            <analysisEngine key="TextAnalysis" async="false">
                <scaleout numberOfInstances="8" />

                <delegates>
                    <analysisEngine key="HBaseCasMultiplier">
                        <casMultiplier poolSize="8"/>
                    </analysisEngine>
                </delegates>
            </analysisEngine>
            ...

The above is an inconsistent configuration.  You are specifying that
"TextAnalysis" should be deployed synchronously but then adding delegate
configuration, which forces the aggregate to be deployed asynchronously.
A synchronous aggregate's delegates are not "visible" to UIMA AS and
cannot be configured in the deployment descriptor.

OK, I changed it to fit the case described above:
           <analysisEngine>
               <delegates>
                   <analysisEngine key="HBaseCasMultiplier">
                       <casMultiplier poolSize="4"/>
                       <scaleout numberOfInstances="2" />
                   </analysisEngine>
                   <analysisEngine key="Tokenizer">
                       <scaleout numberOfInstances="4" />
                   </analysisEngine>
                   <analysisEngine key="HBaseWriter">
                       <scaleout numberOfInstances="4" />
                   </analysisEngine>
               </delegates>
           </analysisEngine>

I would like to scale the HBaseCasMultiplier to more than two
threads, because there is a short delay when reading from HBase.
First, I am not sure which value I should choose for the
CAS multiplier pool size. If numberOfInstances gets larger
than two, I get a few exceptions (stack trace below) when UIMA AS
starts to process the first documents, so I think I am doing something
wrong here. Also, what is the minimal possible casPoolSize? I need
CAS instances for my 4 Tokenizers, 4 HBaseWriters
and 4 (?) for the CAS multiplier, which would result in a minimum
size of 12, right?

The HBaseCasMultiplier receives one CAS containing the id and
then outputs one CAS containing the actual text.

Here is the full stack trace for the exception I get now:
org.apache.uima.UIMARuntimeException: AnalysisComponent "/HBaseCasMultiplier/" requested more CASes (2) than defined in its getCasInstancesRequired() method (1). It is possible that the AnalysisComponent is not properly releasing CASes when it encounters an error.
   at org.apache.uima.impl.UimaContext_ImplBase.getEmptyCas(UimaContext_ImplBase.java:575)
   at org.apache.uima.analysis_component.CasMultiplier_ImplBase.getEmptyCAS(CasMultiplier_ImplBase.java:109)
   at dk.infopaq.nlp.repository.connector.HBaseReadCasMultiplier.hasNext(HBaseReadCasMultiplier.java:107)
   at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl$AnalysisComponentCasIterator.hasNext(PrimitiveAnalysisEngine_impl.java:563)
   at org.apache.uima.aae.controller.PrimitiveAnalysisEngineController_impl.process(PrimitiveAnalysisEngineController_impl.java:388)
   at org.apache.uima.aae.handler.HandlerBase.invokeProcess(HandlerBase.java:130)
   at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestWithCASReference(ProcessRequestHandler_impl.java:655)
   at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:887)
   at org.apache.uima.aae.spi.transport.vm.UimaVmMessageListener.onMessage(UimaVmMessageListener.java:99)
   at org.apache.uima.aae.spi.transport.vm.UimaVmMessageDispatcher$1.run(UimaVmMessageDispatcher.java:66)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFactory.java:69)
   at java.lang.Thread.run(Thread.java:619)
CASAdminException: Can't flush CAS, flushing is disabled.
   at org.apache.uima.cas.impl.CASImpl.reset(CASImpl.java:850)
   at org.apache.uima.util.CasPool.releaseCas(CasPool.java:228)
   at org.apache.uima.resource.impl.CasManager_impl.releaseCas(CasManager_impl.java:141)
   at org.apache.uima.cas.AbstractCas_ImplBase.release(AbstractCas_ImplBase.java:35)
   at org.apache.uima.cas.impl.CASImpl.release(CASImpl.java:3561)
   at org.apache.uima.cas.impl.CASImpl.release(CASImpl.java:3559)
   at org.apache.uima.aae.controller.BaseAnalysisEngineController.dropCAS(BaseAnalysisEngineController.java:1044)
   at org.apache.uima.aae.controller.BaseAnalysisEngineController.dropCAS(BaseAnalysisEngineController.java:1269)
   at org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.dropCAS(AggregateAnalysisEngineController_impl.java:318)
   at org.apache.uima.aae.controller.BaseAnalysisEngineController.handleAction(BaseAnalysisEngineController.java:1212)
   at org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.takeAction(AggregateAnalysisEngineController_impl.java:533)
   at org.apache.uima.aae.error.handler.ProcessCasErrorHandler.handleError(ProcessCasErrorHandler.java:566)
   at org.apache.uima.aae.error.ErrorHandlerChain.handle(ErrorHandlerChain.java:64)
   at org.apache.uima.aae.handler.input.ProcessResponseHandler.handleProcessResponseWithException(ProcessResponseHandler.java:544)
   at org.apache.uima.aae.handler.input.ProcessResponseHandler.handle(ProcessResponseHandler.java:644)
   at org.apache.uima.aae.handler.HandlerBase.delegate(HandlerBase.java:158)
   at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:927)
   at org.apache.uima.aae.spi.transport.vm.UimaVmMessageListener.onMessage(UimaVmMessageListener.java:99)
   at org.apache.uima.aae.spi.transport.vm.UimaVmMessageDispatcher$1.run(UimaVmMessageDispatcher.java:66)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:619)
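
To check that I am reading the first exception correctly, here is a toy model
of the accounting its message describes. This is not UIMA code: ToyCasContext,
EagerMultiplier, LazyMultiplier and CasLimitDemo are hypothetical names, and
the limit of 1 stands in for the value from getCasInstancesRequired(). It only
sketches, under those assumptions, why requesting an empty CAS inside hasNext()
can exceed the limit if an earlier CAS has not been handed back yet. I do not
know whether that is what actually happens in my component.

```java
import java.util.HashSet;
import java.util.Set;

// Toy stand-in for the framework-side bookkeeping; NOT the UIMA API.
class ToyCasContext {
    private final int limit; // plays the role of getCasInstancesRequired()
    private final Set<Object> outstanding = new HashSet<>();

    ToyCasContext(int limit) { this.limit = limit; }

    // Hands out a new "CAS" unless `limit` of them are already outstanding.
    Object getEmptyCas() {
        if (outstanding.size() >= limit) {
            throw new IllegalStateException("requested more CASes ("
                + (outstanding.size() + 1) + ") than allowed (" + limit + ")");
        }
        Object cas = new Object();
        outstanding.add(cas);
        return cas;
    }

    // Marks a CAS as handed back (assumption: folded into next() below for brevity).
    void returned(Object cas) { outstanding.remove(cas); }
}

// Requests a CAS on every hasNext() call, so a second hasNext()
// before next() asks for a CAS while the first is still outstanding.
class EagerMultiplier {
    private final ToyCasContext ctx;
    private Object pending;

    EagerMultiplier(ToyCasContext ctx) { this.ctx = ctx; }

    boolean hasNext() {
        pending = ctx.getEmptyCas();
        return true;
    }

    Object next() {
        Object cas = pending;
        pending = null;
        ctx.returned(cas);
        return cas;
    }
}

// Requests a CAS only inside next(); hasNext() has no side effects.
class LazyMultiplier {
    private final ToyCasContext ctx;
    private int remaining = 1; // one output CAS per input CAS

    LazyMultiplier(ToyCasContext ctx) { this.ctx = ctx; }

    boolean hasNext() { return remaining > 0; }

    Object next() {
        Object cas = ctx.getEmptyCas();
        remaining--;
        ctx.returned(cas);
        return cas;
    }
}

class CasLimitDemo {
    // True if calling hasNext() twice in a row on the eager variant trips the limit.
    static boolean eagerFails() {
        EagerMultiplier m = new EagerMultiplier(new ToyCasContext(1));
        m.hasNext();
        try {
            m.hasNext();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // True if the lazy variant tolerates repeated hasNext() calls and yields one CAS.
    static boolean lazyWorks() {
        LazyMultiplier m = new LazyMultiplier(new ToyCasContext(1));
        boolean ok = m.hasNext() && m.hasNext();
        ok = ok && m.next() != null;
        return ok && !m.hasNext();
    }

    public static void main(String[] args) {
        System.out.println("eager fails: " + eagerFails());
        System.out.println("lazy works: " + lazyWorks());
    }
}
```

If this model is roughly right, moving my getEmptyCAS() call from hasNext()
into next() would be one thing to try.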

Thanks for your help,
Jörn
