Re: Very long Ruta stream initialization

Mario Gazzo Thu, 07 Jan 2016 01:23:08 -0800

Yes, where do we sign this?

:-)


> On 07 Jan 2016, at 10:16 , Peter Klügl <[email protected]> wrote:
> 
> :-) let me know if you need help or have any questions.
> 
> On 07.01.2016 at 10:12, Mario Gazzo wrote:
>> Yes, let us just sign and submit it.
>> 
>>> On 07 Jan 2016, at 10:11 , Peter Klügl <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> thanks, that would be great. Patches are simply attached to the issue.
>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>> 
>>> On 07.01.2016 at 10:08, Mario Gazzo wrote:
>>>> Thanks,
>>>> 
>>>> I just added the JIRA issue: 
>>>> https://issues.apache.org/jira/browse/UIMA-4729 
>>>> <https://issues.apache.org/jira/browse/UIMA-4729>
>>>> 
>>>> If you like, then we can also implement it and submit a patch, just let us 
>>>> know what the process is.
>>>> 
>>>> Cheers
>>>> Mario
>>>> 
>>>>> On 07 Jan 2016, at 09:08 , Peter Klügl <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>>>>> Hi Peter,
>>>>>> 
>>>>>> I had a look at the test cases and I think there are many interesting 
>>>>>> and useful features that cover many of our use cases but I will have to 
>>>>>> experiment with them before I know what might be missing. I have a few 
>>>>>> questions though:
>>>>>> 
>>>>>> 1) It appears that we would then also be able to assign annotations to 
>>>>>> lists, which is nice. I am not sure from looking at the tests whether it 
>>>>>> is possible to use ADD with the annotation lists but I assume so.
>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>> 
>>>>>> 2) The use of addresses is unclear to me just from reading the test, 
>>>>>> maybe you could explain them? This concept is very new to me.
>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>> just a way to combine logic in java with ruta rules or use ruta
>>>>> functionality in java code.
>>>>> Let's say we have a new method like
>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>> and you call it with something like (syntax is not yet specified)
>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>> method would return whether the annotation is covered by a Headline
>>>>> annotation and is followed by a Keyword annotation.
>>>>> 
>>>>>> 3) The annotation feature expression looks nice but I wonder whether an 
>>>>>> array element can also be referenced using an int expression and not 
>>>>>> just a constant e.g. Struct.as[intVar+1]{->T1};
>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>> The current implementation is just a test in order to check whether the
>>>>> internal object model is good enough to cover it. The complete
>>>>> functionality will probably not be included in the next release since
>>>>> there is still much work left in order to get it up and running. The
>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>> the code does not support expressions at all. I still have to think
>>>>> about a way to implement it.
>>>>> 
>>>>>> The label expressions are also useful and will make some of our rules 
>>>>>> more readable.
>>>>>> 
>>>>>> Finally I have one additional question to the MARKUP initialisation. I 
>>>>>> have a case where I need the token seeds coming from the default seeder 
>>>>>> but I don’t want to run the markup initialisation. Is there a separate 
>>>>>> seeder defined for this somewhere? Right now I have my own copy of the 
>>>>>> default seeder without the MARKUP initialisation but obviously I do not 
>>>>>> want to maintain this. It looks as if they could also be split in two 
>>>>>> seeders with both added as default and then I could overwrite with my 
>>>>>> own seeder list containing only the token seeder.
>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>> was also always thinking about adding a DetailedSeeder that creates much
>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>> never on top of my todo list.
>>>>> 
>>>>> Do you want to open a jira issue for it?
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Peter
>>>>> 
>>>>>> Cheers
>>>>>> Mario
>>>>>> 
>>>>>> 
>>>>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>>>>> Hi Peter,
>>>>>>>> 
>>>>>>>> No problem, I was anyway pretty much offline myself during Christmas 
>>>>>>>> holidays.
>>>>>>>> 
>>>>>>>> The term “overhead” is probably an exaggeration in this context 
>>>>>>>> especially after I disabled the MARKUP initialisation. We implemented 
>>>>>>>> earlier our own XML markup annotator tailored to better fit our needs 
>>>>>>>> with additional annotation types and properties, so the Ruta MARKUP is 
>>>>>>>> currently not used. It just happens that we don’t directly use 
>>>>>>>> RutaBasic in any of our rules in this particular case so I was curious 
>>>>>>>> to know whether we could avoid creating them in the first place since 
>>>>>>>> there seems to be quite a few. However, overall processing required by 
>>>>>>>> our Ruta scripts compared to other processing steps is now small and 
>>>>>>>> sub-optimising this further by making RutaBasic optional would 
>>>>>>>> currently be of very low priority to us. We would prioritise other 
>>>>>>>> features higher e.g. being able to assign annotations to variables as 
>>>>>>>> we discussed previously in another thread.
>>>>>>> I am working on this right now and there is finally some first progress 
>>>>>>> :-)
>>>>>>> 
>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>> elements) with the first attempt. If you are interested (and wanna take
>>>>>>> care I do not miss your use case), feel free to take a look at the new
>>>>>>> unit tests:
>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>> 
>>>>>>> It's still work in progress. Proposals for more unit tests are very 
>>>>>>> welcome.
>>>>>>> 
>>>>>>>> We haven’t processed documents as large as those you mention since 
>>>>>>>> books have so far been divided into chapters and processing could 
>>>>>>>> therefore be parallelised accordingly. We also drop extreme outliers 
>>>>>>>> above a certain size if we encounter them and then we batch process 
>>>>>>>> them later in smaller chunks but this has so far not been necessary 
>>>>>>>> with our current data sets. Like you, our processing bottlenecks are 
>>>>>>>> now in different components.
>>>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Peter
>>>>>>> 
>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Mario
>>>>>>>> 
>>>>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> sorry for the delayed reply.
>>>>>>>>> 
>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>> 
>>>>>>>>> The special treatment of MARKUPs that causes the increased time 
>>>>>>>>> required for initialization is just a workaround because I was too 
>>>>>>>>> lazy to write a working JFlex rule. Well, I tried but failed. It 
>>>>>>>>> shouldn't be hard to improve this code... I will create an issue 
>>>>>>>>> for it. When I did the last performance optimization, uima did not 
>>>>>>>>> check the indexes yet and my test set did not contain markups.
>>>>>>>>> 
>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>> Short answer is no. I was already thinking about making RutaBasic 
>>>>>>>>> optional in future so that the user can configure if they are used. 
>>>>>>>>> However, right now, they are required for rule inference and make the 
>>>>>>>>> rule inference "fast" in the first place. RutaBasic is just an 
>>>>>>>>> internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and 
>>>>>>>>> RutaFrame, and rules should not match on them at all.
>>>>>>>>> 
>>>>>>>>> Some background information:
>>>>>>>>> 
>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>> - store additional information in order to avoid index operations. 
>>>>>>>>> Some useful conditions would require many index operations, e.g., 
>>>>>>>>> PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations 
>>>>>>>>> start and end at which position, and which positions are covered by 
>>>>>>>>> which types.
>>>>>>>>> - provide a container to make this information available across 
>>>>>>>>> analysis engines. Information shared by analysis engines is normally 
>>>>>>>>> stored in the CAS, e.g. in annotations, (or in external resources). 
>>>>>>>>> This is the role of RutaBasic. It is not really implemented right now 
>>>>>>>>> as it should be but I will improve it soon. Then, there is no 
>>>>>>>>> performance decrease when a pipeline is spammed with small ruta 
>>>>>>>>> engines.
>>>>>>>>> - a basic minimal disjoint partitioning of the document for the 
>>>>>>>>> coverage-based visibility concept.
>>>>>>>>> 
>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for 
>>>>>>>>> it, e.g., in order to reduce the memory footprint or when processing 
>>>>>>>>> large documents where parts are simply not interesting, then I will 
>>>>>>>>> put it on my TODO list. I am also open for other/new ideas how to 
>>>>>>>>> solve the challenges (and for incremental usage of internal caches).
>>>>>>>>> 
>>>>>>>>> What is your experience with the processing overhead concerning 
>>>>>>>>> RutaBasic? Is it the rule matching or rather the initialization? I 
>>>>>>>>> myself had already some performance problems with the initialization 
>>>>>>>>> and memory consumption in large CAS (500+ pages pdfs). However, other 
>>>>>>>>> components, serialization and the CAS editor were the actual 
>>>>>>>>> bottlenecks.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> Peter
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>>>>> I got around it by removing the default seeders by specifying an 
>>>>>>>>>> empty seeders list since we don’t need the MARKUP annotations 
>>>>>>>>>> anymore.
>>>>>>>>>> 
>>>>>>>>>> I still don’t know why it created so much overhead but it sometimes 
>>>>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>>>> 
>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the 
>>>>>>>>>> creation of Ruta basic annotations entirely to save processing 
>>>>>>>>>> overhead and only apply Ruta rules to other annotation types created 
>>>>>>>>>> by other AEs such as our own?
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> Mario
>>>>>>>>>> 
>>>>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <[email protected]> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>> 
>>>>>>>>>>> I noticed that occasionally the initialisation in 
>>>>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t really 
>>>>>>>>>>> explain it and it seems independent of document length since I 
>>>>>>>>>>> have seen this with even very small XML documents.
>>>>>>>>>>> 
>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when 
>>>>>>>>>>> creating MARKUP annotations during subiterator.moveToNext calls 
>>>>>>>>>>> (line 89) and inside Subiterator it seems to be the while loop 
>>>>>>>>>>> inside adjustForStrictForward (line 232), which is inside UIMA core 
>>>>>>>>>>> classes. I haven’t gone into any deeper analysis yet but I would first 
>>>>>>>>>>> like to hear whether you have an idea what could be the main 
>>>>>>>>>>> cause(s) for this?
>>>>>>>>>>> 
>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>
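
Peter's description of RutaBasic as a cache (which types start, end, and cover
each minimal segment of the document) can be illustrated with a small
stand-alone sketch. This is not the Ruta implementation: the classes Span and
Basic and the methods build, partOf and endsWith are hypothetical names made
up for this illustration, and no UIMA classes are used. It only shows why
conditions such as PARTOF or ENDSWITH can become set lookups once the
segments have been built, instead of repeated index operations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class BasicCacheSketch {

    /** Hypothetical stand-in for a typed annotation (not a UIMA class). */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for RutaBasic: one minimal segment plus cached type info. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> beginning = new HashSet<>(); // types starting at this segment
        final Set<String> ending = new HashSet<>();    // types ending at this segment
        final Set<String> partOf = new HashSet<>();    // types covering this segment
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts the text at every annotation boundary and fills the per-segment caches. */
    static TreeMap<Integer, Basic> build(List<Span> spans) {
        TreeSet<Integer> cuts = new TreeSet<>();
        for (Span s : spans) {
            cuts.add(s.begin);
            cuts.add(s.end);
        }
        TreeMap<Integer, Basic> basics = new TreeMap<>();
        Integer prev = null;
        for (Integer cut : cuts) {
            if (prev != null) {
                basics.put(prev, new Basic(prev, cut));
            }
            prev = cut;
        }
        for (Span s : spans) {
            // subMap is begin-inclusive, end-exclusive: exactly the covered segments.
            for (Basic b : basics.subMap(s.begin, s.end).values()) {
                b.partOf.add(s.type);
                if (b.begin == s.begin) b.beginning.add(s.type);
                if (b.end == s.end) b.ending.add(s.type);
            }
        }
        return basics;
    }

    /** PARTOF-like check: is the first segment of s covered by the given type? */
    static boolean partOf(TreeMap<Integer, Basic> basics, Span s, String type) {
        Basic first = basics.get(s.begin);
        return first != null && first.partOf.contains(type);
    }

    /** ENDSWITH-like check: does an annotation of the given type end where s ends? */
    static boolean endsWith(TreeMap<Integer, Basic> basics, Span s, String type) {
        Map.Entry<Integer, Basic> last = basics.lowerEntry(s.end);
        return last != null && last.getValue().ending.contains(type);
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        Span headline = new Span("Headline", 0, 20);
        Span keyword = new Span("Keyword", 5, 12);
        spans.add(headline);
        spans.add(keyword);
        TreeMap<Integer, Basic> basics = build(spans);
        System.out.println(partOf(basics, keyword, "Headline"));
        System.out.println(endsWith(basics, headline, "Keyword"));
    }
}
```

The trade-off in this sketch mirrors the one discussed in the thread: building
the segments costs time and memory up front, and each later condition check
is then a map lookup plus a set membership test.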