Re: Very long Ruta stream initialization

Peter Klügl Wed, 30 Dec 2015 07:44:49 -0800

Hi,

sorry for the delayed reply.


RutaEngine::initializeStream:

The special treatment of MARKUPs that causes the increased initialization time is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.


Deactivate creation of RutaBasic:

Short answer is no. I was already thinking about making RutaBasic optional in the future so that the user can configure whether they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.


Some background information:

RutaBasics are used for three things:

- store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic serves as a cache of which annotations start and end at which position, and which positions are covered by which types.

- provide a container to make this information available across analysis engines. Information shared between analysis engines is normally stored in the CAS, e.g., in annotations (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be, but I will improve it soon. Then there will be no performance decrease when a pipeline is spammed with small ruta engines.

- a basic minimal disjoint partitioning of the document for the coverage-based visibility concept.
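
To make the first point more concrete, here is a minimal sketch of the caching idea in plain Java. It is not the actual RutaBasic implementation: the class PositionCache and its methods are hypothetical names chosen for illustration, types are plain strings, and it keeps one cell per character offset, whereas Ruta keeps one RutaBasic per disjoint segment of the document. The point is only that, once the cells are filled, PARTOF-like and ENDSWITH-like checks become set lookups instead of index operations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is NOT Ruta's RutaBasic class.
class PositionCache {
    // One set of type names per character offset (a simplification:
    // Ruta uses one RutaBasic per disjoint segment, not per character).
    private final List<Set<String>> beginning = new ArrayList<>();
    private final List<Set<String>> ending = new ArrayList<>();
    private final List<Set<String>> covering = new ArrayList<>();

    PositionCache(int documentLength) {
        for (int i = 0; i < documentLength; i++) {
            beginning.add(new HashSet<>());
            ending.add(new HashSet<>());
            covering.add(new HashSet<>());
        }
    }

    // Registers an annotation of the given type spanning [begin, end).
    void add(String type, int begin, int end) {
        if (begin < 0 || end > covering.size() || begin >= end) {
            throw new IllegalArgumentException("invalid span");
        }
        beginning.get(begin).add(type);
        ending.get(end - 1).add(type);
        for (int i = begin; i < end; i++) {
            covering.get(i).add(type);
        }
    }

    // Does some annotation of this type start at offset 'begin'?
    boolean startsWith(int begin, String type) {
        return beginning.get(begin).contains(type);
    }

    // ENDSWITH-like check: does some annotation of this type end at offset 'end'?
    boolean endsWith(int end, String type) {
        return ending.get(end - 1).contains(type);
    }

    // PARTOF-like check: is this offset covered by some annotation of this type?
    boolean partOf(int position, String type) {
        return covering.get(position).contains(type);
    }

    public static void main(String[] args) {
        PositionCache cache = new PositionCache(10);
        cache.add("Sentence", 0, 10);
        cache.add("Token", 0, 4);
        cache.add("Token", 5, 10);
        System.out.println(cache.partOf(6, "Sentence"));
        System.out.println(cache.endsWith(4, "Token"));
        System.out.println(cache.endsWith(5, "Token"));
    }
}
```

The trade-off is the one discussed in this thread: filling the cells costs time and memory up front during initialization, and the lookups during rule matching become cheap in return.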

Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open to other/new ideas on how to solve the challenges (and for incremental usage of internal caches).

What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CASes (PDFs with 500+ pages). However, other components, serialization and the CAS editor were the actual bottlenecks.


Best,

Peter


On 22.12.2015 at 17:26, Mario Gazzo wrote:

I got around it by removing the default seeders by specifying an empty seeders 
list since we don’t need the MARKUP annotations anymore.

I still don’t know why it created so much overhead but it sometimes seemed to 
rival the POS tagger in processing time.

Anyway, this leads me to the next question. Can I disable the creation of Ruta 
basic annotations entirely to save processing overhead and only apply Ruta 
rules to other annotation types created by other AEs such as our own?

Cheers
Mario

On 21 Dec 2015, at 16:09 , Mario Juric <[email protected]> wrote:

Hi Peter,

I noticed that occasionally the initialisation in RutaEngine::initializeStream 
can take a very long time. I can’t really explain it and it seems independent of 
document length since I have seen this with even very small XML documents.

The method seems to spend much time in the DefaultSeeder when creating MARKUP 
annotations during subiterator.moveToNext calls (line 89) and inside 
Subiterator it seems to be the while loop inside adjustForStrictForward (line 
232), which is inside UIMA core classes. I haven’t gone into any deeper 
analysis yet but I would first like to hear whether you have an idea what could be 
the main cause(s) for this?

We use Ruta 2.3.1 with UIMA 2.8.1


Cheers
Mario
