Thanks, I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
If you like, we can also implement it and submit a patch; just let us know what the process is.

Cheers
Mario

> On 07 Jan 2016, at 09:08, Peter Klügl wrote:
>
> Hi,
>
> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>> Hi Peter,
>>
>> I had a look at the test cases and I think there are many interesting and
>> useful features that cover many of our use cases but I will have to
>> experiment with them before I know what might be missing. I have a few
>> questions though:
>>
>> 1) It appears that we would then also be able to assign annotations to
>> lists, which is nice. I am not sure from looking at the tests whether it is
>> possible to use ADD with the annotation lists but I assume so.
>
> Not yet, but I will implement it. It's still work in progress. But
> thanks for pointing it out, I would probably have forgotten about it.
>
>> 2) The use of addresses is unclear to me just from reading the test, maybe
>> you could explain them? This concept is very new to me.
>
> It's not intended to be utilized directly in a rule file. It's rather
> just a way to combine logic in java with ruta rules or use ruta
> functionality in java code.
> Let's say we have a new method like
> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
> and you call it with something like (syntax is not yet specified)
> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
> Then, the "$" would be replaced by the address of the annotation and the
> method would return whether the annotation is covered by a Headline
> annotation and is followed by a Keyword annotation.
>
>> 3) The annotation feature expression looks nice but I wonder whether an
>> array element can also be referenced using an int expression and not just a
>> constant e.g. Struct.as[intVar+1]{->T1};
>
> Yes, without allowing number expressions, it would not really be useful.
> The current implementation is just a test in order to check whether the
> internal object model is good enough to cover it. The complete
> functionality will probably not be included in the next release since
> there is still much work left in order to get it up and running. The
> semantics of such expressions (Struct.as) are resolved on the fly, and
> the code does not support expressions at all. I still have to think
> about a way to implement it.
>
>> The label expressions are also useful and will make some of our rules more
>> readable.
>>
>> Finally I have one additional question about the MARKUP initialisation. I have
>> a case where I need the token seeds coming from the default seeder but I
>> don’t want to run the markup initialisation. Is there a separate seeder
>> defined for this somewhere? Right now I have my own copy of the default
>> seeder without the MARKUP initialisation but obviously I do not want to
>> maintain this. It looks as if they could also be split into two seeders with
>> both added as default and then I could override them with my own seeder list
>> containing only the token seeder.
>
> Yes, we can split them or just add another one that ignores markup. I
> was also always thinking about adding a DetailedSeeder that creates much
> more fine-grained types like different brackets and quotes... but it was
> never on top of my todo list.
>
> Do you want to open a jira issue for it?
>
> Best,
>
> Peter
>
>> Cheers
>> Mario
>>
>>
>>> On 04 Jan 2016, at 17:06, Peter Klügl wrote:
>>>
>>> Hi,
>>>
>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>> Hi Peter,
>>>>
>>>> No problem, I was anyway pretty much offline myself during Christmas
>>>> holidays.
>>>>
>>>> The term “overhead” is probably an exaggeration in this context especially
>>>> after I disabled the MARKUP initialisation.
>>>> We earlier implemented our own
>>>> XML markup annotator tailored to better fit our needs with additional
>>>> annotation types and properties, so the Ruta MARKUP is currently not used.
>>>> It just happens that we don’t directly use RutaBasic in any of our rules
>>>> in this particular case so I was curious to know whether we could avoid
>>>> creating them in the first place since there seem to be quite a few.
>>>> However, overall processing required by our Ruta scripts compared to other
>>>> processing steps is now small and sub-optimising this further by making
>>>> RutaBasic optional would currently be of very low priority to us. We would
>>>> prioritise other features higher e.g. being able to assign annotations to
>>>> variables as we discussed previously in another thread.
>>> I am working on this right now and there is finally some first progress :-)
>>>
>>> I fear that I won't catch all use cases (combinations with language
>>> elements) with the first attempt. If you are interested (and wanna take
>>> care I do not miss your use case), feel free to take a look at the new
>>> unit tests:
>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>
>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>
>>>> We haven’t processed documents as large as those you mention since books
>>>> have so far been divided into chapters and processing could therefore be
>>>> parallelised accordingly. We also drop extreme outliers above a certain
>>>> size if we encounter them and then we batch process them later in smaller
>>>> chunks but this has so far not been necessary with our current data sets.
>>>> Like you, our processing bottlenecks are now in different components.
>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 30 Dec 2015, at 16:44, Peter Klügl wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> sorry for the delayed reply.
>>>>>
>>>>> RutaEngine::initializeStream:
>>>>>
>>>>> The special treatment of MARKUPs that causes the increased time required
>>>>> for initialization is just a workaround because I was too lazy to write a
>>>>> working jflex rule. Well, I tried but failed. It shouldn't be hard to
>>>>> improve this code... I will create an issue for it. When I did the last
>>>>> performance optimization, uima did not check the indexes yet and my test
>>>>> set did not contain markups.
>>>>>
>>>>> Deactivate creation of RutaBasic:
>>>>> Short answer is no. I was already thinking about making RutaBasic
>>>>> optional in future so that the user can configure if they are used.
>>>>> However, right now, they are required for rule inference and make the
>>>>> rule inference "fast" in the first place. RutaBasic is just an internal
>>>>> annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and
>>>>> rules should not match on them at all.
>>>>>
>>>>> Some background information:
>>>>>
>>>>> RutaBasics are used for three things:
>>>>> - store additional information in order to avoid index operations. Some
>>>>> useful conditions would require many index operations, e.g., PARTOF or
>>>>> ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end
>>>>> at which position, and which positions are covered by which types.
>>>>> - provide a container to make this information available across analysis
>>>>> engines. Information shared between analysis engines is normally stored in the
>>>>> CAS, e.g. in annotations, (or in external resources). This is the role of
>>>>> RutaBasic. It is not really implemented right now as it should be but I
>>>>> will improve it soon.
>>>>> Then, there is no performance decrease when a
>>>>> pipeline is spammed with small ruta engines.
>>>>> - a basic minimal disjoint partitioning of the document for the coverage
>>>>> based visibility concept.
>>>>>
>>>>> Making RutaBasic optional is possible. If there is a real need for it,
>>>>> e.g., in order to reduce the memory footprint or when processing large
>>>>> documents where parts are simply not interesting, then I will put it on
>>>>> my TODO list. I am also open to other/new ideas how to solve the
>>>>> challenges (and for incremental usage of internal caches).
>>>>>
>>>>> What is your experience with the processing overhead concerning
>>>>> RutaBasic? Is it the rule matching or rather the initialization? I myself
>>>>> had already some performance problems with the initialization and memory
>>>>> consumption in large CAS (500+ page PDFs). However, other components,
>>>>> serialization and the CAS editor were the actual bottlenecks.
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>> I got around it by removing the default seeders by specifying an empty
>>>>>> seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>
>>>>>> I still don’t know why it created so much overhead but it sometimes
>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>
>>>>>> Anyway, this leads me to the next question. Can I disable the creation
>>>>>> of Ruta basic annotations entirely to save processing overhead and only
>>>>>> apply Ruta rules to other annotation types created by other AEs such as
>>>>>> our own?
>>>>>>
>>>>>> Cheers
>>>>>> Mario
>>>>>>
>>>>>>> On 21 Dec 2015, at 16:09, Mario Juric wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> I noticed that occasionally the initialisation in
>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t really
>>>>>>> explain it and it seems independent of document length since I have
>>>>>>> seen this with even very small XML documents.
>>>>>>>
>>>>>>> The method seems to spend much time in the DefaultSeeder when creating
>>>>>>> MARKUP annotations during subiterator.moveToNext calls (line 89) and
>>>>>>> inside Subiterator it seems to be the while loop inside
>>>>>>> adjustForStrictForward (line 232), which is inside UIMA core classes. I
>>>>>>> haven’t gone into any deeper analysis yet but I would first like to hear
>>>>>>> whether you have an idea what could be the main cause(s) for this?
>>>>>>>
>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
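
A footnote on the RutaBasic background quoted in the thread: below is a minimal, self-contained Java sketch of the caching idea Peter describes. The document is cut into disjoint segments, and each segment remembers which types begin at, end at, or cover it, so PARTOF- or ENDSWITH-style checks become set lookups rather than index operations. All names here (BasicCacheSketch, Span, Basic, build, partOf, endsWith) are hypothetical and the semantics are simplified. It only illustrates the idea and is not Ruta's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: hypothetical names, not Ruta's real RutaBasic code. */
public class BasicCacheSketch {

    /** A typed annotation span [begin, end) over the document text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** One minimal segment; caches the types that begin at, end at, or cover it. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> beginning = new HashSet<>();
        final Set<String> ending = new HashSet<>();
        final Set<String> partOf = new HashSet<>();
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts [0, length) at every annotation boundary and fills the caches once. */
    static List<Basic> build(int length, List<Span> spans) {
        TreeSet<Integer> cuts = new TreeSet<>(List.of(0, length));
        for (Span s : spans) {
            cuts.add(s.begin);
            cuts.add(s.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer prev = null;
        for (int cut : cuts) {
            if (prev != null) {
                basics.add(new Basic(prev, cut));
            }
            prev = cut;
        }
        for (Basic b : basics) {
            for (Span s : spans) {
                if (s.begin <= b.begin && b.end <= s.end) b.partOf.add(s.type);
                if (s.begin == b.begin) b.beginning.add(s.type);
                if (s.end == b.end) b.ending.add(s.type);
            }
        }
        return basics;
    }

    /** Simplified PARTOF: every segment inside [begin, end) is covered by the type. */
    static boolean partOf(List<Basic> basics, int begin, int end, String type) {
        boolean seen = false;
        for (Basic b : basics) {
            if (b.begin >= begin && b.end <= end) {
                if (!b.partOf.contains(type)) return false;
                seen = true;
            }
        }
        return seen;
    }

    /** Simplified ENDSWITH: an annotation of the type ends exactly at the offset. */
    static boolean endsWith(List<Basic> basics, int end, String type) {
        for (Basic b : basics) {
            if (b.end == end) return b.ending.contains(type);
        }
        return false;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("Headline", 0, 20),
                new Span("Keyword", 9, 13),
                new Span("Keyword", 25, 30));
        List<Basic> basics = build(30, spans);
        System.out.println(partOf(basics, 9, 13, "Headline"));  // true
        System.out.println(partOf(basics, 25, 30, "Headline")); // false
        System.out.println(endsWith(basics, 13, "Keyword"));    // true
    }
}
```

The caches are filled once in build, which mirrors the trade-off discussed in the thread: initialization costs time and memory up front, and later rule conditions get cheap lookups in return.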
