Hi,

On 04.01.2016 at 16:13, Mario Gazzo wrote:
> Hi Peter,
>
> No problem, I was pretty much offline myself during the Christmas holidays anyway.
>
> The term “overhead” is probably an exaggeration in this context, especially
> after I disabled the MARKUP initialisation. We implemented our own XML markup
> annotator earlier, tailored to better fit our needs with additional annotation
> types and properties, so the Ruta MARKUP is currently not used. It just happens
> that we don’t directly use RutaBasic in any of our rules in this particular
> case, so I was curious to know whether we could avoid creating them in the
> first place, since there seem to be quite a few. However, the overall
> processing required by our Ruta scripts compared to other processing steps is
> now small, and sub-optimising this further by making RutaBasic optional would
> currently be of very low priority for us. We would prioritise other features
> higher, e.g. being able to assign annotations to variables, as we discussed
> previously in another thread.
I am working on this right now and there is finally some first progress :-)

I fear that I won't catch all use cases (combinations with language elements)
with the first attempt. If you are interested (and want to make sure I do not
miss your use case), feel free to take a look at the new unit tests:

https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation

It's still work in progress. Proposals for more unit tests are very welcome.

> We haven’t processed documents as large as those you mention, since books have
> so far been divided into chapters and processing could therefore be
> parallelised accordingly. We also drop extreme outliers above a certain size
> if we encounter them and batch process them later in smaller chunks, but this
> has so far not been necessary with our current data sets. Like you, our
> processing bottlenecks are now in different components.

Ah, it's nice to hear that Ruta is not the bottleneck :-D

Best,

Peter

> Cheers
> Mario
>
>> On 30 Dec 2015, at 16:44, Peter Klügl <[email protected]> wrote:
>>
>> Hi,
>>
>> sorry for the delayed reply.
>>
>> RutaEngine::initializeStream:
>>
>> The special treatment of MARKUPs that causes the increased initialization
>> time is just a workaround, because I was too lazy to write a working JFlex
>> rule. Well, I tried but failed. It shouldn't be hard to improve this code...
>> I will create an issue for it. When I did the last performance optimization,
>> UIMA did not check the indexes yet and my test set did not contain markups.
>>
>> Deactivating the creation of RutaBasic:
>>
>> The short answer is no. I was already thinking about making RutaBasic
>> optional in the future, so that the user can configure whether they are used.
>> However, right now, they are required for rule inference and make the rule
>> inference "fast" in the first place. RutaBasic is just an internal annotation
>> like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should
>> not match on them at all.
>>
>> Some background information:
>>
>> RutaBasics are used for three things:
>> - to store additional information in order to avoid index operations. Some
>> useful conditions would require many index operations, e.g. PARTOF or
>> ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end
>> at which position, and which positions are covered by which types.
>> - to provide a container that makes this information available across
>> analysis engines. Information shared by analysis engines is normally stored
>> in the CAS, e.g. in annotations (or in external resources). This is the role
>> of RutaBasic. It is not really implemented right now as it should be, but I
>> will improve it soon. Then there is no performance decrease when a pipeline
>> is spammed with small Ruta engines.
>> - to provide a basic, minimal, disjoint partitioning of the document for the
>> coverage-based visibility concept.
>>
>> Making RutaBasic optional is possible. If there is a real need for it, e.g.
>> in order to reduce the memory footprint or when processing large documents
>> where parts are simply not interesting, then I will put it on my TODO list.
>> I am also open to other/new ideas on how to solve these challenges (and for
>> incremental usage of internal caches).
>>
>> What is your experience with the processing overhead concerning RutaBasic?
>> Is it the rule matching or rather the initialization?
>> I myself already had some performance problems with the initialization and
>> memory consumption in large CASes (500+ page PDFs). However, other
>> components, serialization and the CAS editor, were the actual bottlenecks.
>>
>> Best,
>>
>> Peter
>>
>>
>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>> I got around it by removing the default seeders, i.e. by specifying an
>>> empty seeders list, since we don’t need the MARKUP annotations anymore.
>>>
>>> I still don’t know why it created so much overhead, but it sometimes seemed
>>> to rival the POS tagger in processing time.
>>>
>>> Anyway, this leads me to the next question. Can I disable the creation of
>>> Ruta basic annotations entirely to save processing overhead and only apply
>>> Ruta rules to other annotation types created by other AEs, such as our own?
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 21 Dec 2015, at 16:09, Mario Juric <[email protected]> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> I noticed that occasionally the initialisation in
>>>> RutaEngine::initializeStream can take a very long time. I can’t really
>>>> explain it, and it seems independent of document length, since I have seen
>>>> this even with very small XML documents.
>>>>
>>>> The method seems to spend much time in the DefaultSeeder when creating
>>>> MARKUP annotations during subiterator.moveToNext calls (line 89), and
>>>> inside Subiterator it seems to be the while loop inside
>>>> adjustForStrictForward (line 232), which is inside UIMA core classes. I
>>>> haven’t done any deeper analysis yet, but I would first like to hear
>>>> whether you have an idea what could be the main cause(s) of this?
>>>>
>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>
>>>>
>>>> Cheers
>>>> Mario
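
[Editor's note] For anyone following the workaround Mario describes above
(removing the default seeders by specifying an empty seeders list), a minimal
uimaFIT sketch of such a configuration might look like the one below. It is not
taken from this thread: the script name "MyScript" is a placeholder, and it
assumes the RutaEngine parameter constants of Ruta 2.3.x
(RutaEngine.PARAM_MAIN_SCRIPT, RutaEngine.PARAM_SEEDERS).

// A minimal sketch (not from this thread) of configuring RutaEngine with an
// empty seeders list via uimaFIT, as in the workaround described above.
// "MyScript" is a placeholder for your own Ruta script name.
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.ruta.engine.RutaEngine;

public class EmptySeedersExample {

    public static AnalysisEngineDescription createRutaDescription() throws Exception {
        return AnalysisEngineFactory.createEngineDescription(
                RutaEngine.class,
                // placeholder script name
                RutaEngine.PARAM_MAIN_SCRIPT, "MyScript",
                // empty seeders list: the DefaultSeeder is not applied, so no
                // MARKUP (or other seed token) annotations are created
                RutaEngine.PARAM_SEEDERS, new String[0]);
    }
}

With an empty seeders array the DefaultSeeder is not run, so no MARKUP (or
other seed token) annotations are added; the rules then work on annotations
created by other analysis engines, as discussed in the thread above.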
