Yes, where do we sign this? :-)
> On 07 Jan 2016, at 10:16, Peter Klügl <[email protected]> wrote:
>
> :-) let me know if you need help or have any questions.
>
> On 07.01.2016 at 10:12, Mario Gazzo wrote:
>> Yes, let us just sign and submit it.
>>
>>> On 07 Jan 2016, at 10:11, Peter Klügl <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> thanks, that would be great. Patches are simply attached to the issue.
>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>> On 07.01.2016 at 10:08, Mario Gazzo wrote:
>>>> Thanks,
>>>>
>>>> I just added the JIRA issue:
>>>> https://issues.apache.org/jira/browse/UIMA-4729
>>>>
>>>> If you like, then we can also implement it and submit a patch, just let us
>>>> know what the process is.
>>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 07 Jan 2016, at 09:08, Peter Klügl <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> I had a look at the test cases and I think there are many interesting
>>>>>> and useful features that cover many of our use cases but I will have to
>>>>>> experiment with them before I know what might be missing. I have a few
>>>>>> questions though:
>>>>>>
>>>>>> 1) It appears that we would then also be able to assign annotations to
>>>>>> lists, which is nice. I am not sure from looking at the tests whether it
>>>>>> is possible to use ADD with the annotation lists but I assume so.
>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>
>>>>>> 2) The use of addresses is unclear to me just from reading the test,
>>>>>> maybe you could explain them? This concept is very new to me.
>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>> just a way to combine logic in java with ruta rules or use ruta
>>>>> functionality in java code.
>>>>> Let's say we have a new method like
>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>> and you call it with something like (syntax is not yet specified)
>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>> method would return whether the annotation is covered by a Headline
>>>>> annotation and is followed by a Keyword annotation.
>>>>>
>>>>>> 3) The annotation feature expression looks nice but I wonder whether an
>>>>>> array element can also be referenced using an int expression and not
>>>>>> just a constant e.g. Struct.as[intVar+1]{->T1};
>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>> The current implementation is just a test in order to check whether the
>>>>> internal object model is good enough to cover it. The complete
>>>>> functionality will probably not be included in the next release since
>>>>> there is still much work left in order to get it up and running. The
>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>> the code does not support expressions at all. I still have to think
>>>>> about a way to implement it.
>>>>>
>>>>>> The label expressions are also useful and will make some of our rules
>>>>>> more readable.
>>>>>>
>>>>>> Finally I have one additional question about the MARKUP initialisation. I
>>>>>> have a case where I need the token seeds coming from the default seeder
>>>>>> but I don’t want to run the markup initialisation. Is there a separate
>>>>>> seeder defined for this somewhere? Right now I have my own copy of the
>>>>>> default seeder without the MARKUP initialisation but obviously I do not
>>>>>> want to maintain this. It looks as if they could also be split into two
>>>>>> seeders with both added as default and then I could overwrite with my
>>>>>> own seeder list containing only the token seeder.
>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>> was also always thinking about adding a DetailedSeeder that creates much
>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>> never on top of my todo list.
>>>>>
>>>>> Do you want to open a jira issue for it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>> Cheers
>>>>>> Mario
>>>>>>
>>>>>>
>>>>>>> On 04 Jan 2016, at 17:06, Peter Klügl <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> No problem, I was pretty much offline myself during the Christmas
>>>>>>>> holidays anyway.
>>>>>>>>
>>>>>>>> The term “overhead” is probably an exaggeration in this context
>>>>>>>> especially after I disabled the MARKUP initialisation. We earlier
>>>>>>>> implemented our own XML markup annotator tailored to better fit our needs
>>>>>>>> with additional annotation types and properties, so the Ruta MARKUP is
>>>>>>>> currently not used. It just happens that we don’t directly use
>>>>>>>> RutaBasic in any of our rules in this particular case so I was curious
>>>>>>>> to know whether we could avoid creating them in the first place since
>>>>>>>> there seem to be quite a few. However, the overall processing required by
>>>>>>>> our Ruta scripts compared to other processing steps is now small and
>>>>>>>> sub-optimising this further by making RutaBasic optional would
>>>>>>>> currently be of very low priority to us. We would prioritise other
>>>>>>>> features higher e.g. being able to assign annotations to variables as
>>>>>>>> we discussed previously in another thread.
>>>>>>> I am working on this right now and there is finally some first progress
>>>>>>> :-)
>>>>>>>
>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>> elements) with the first attempt.
>>>>>>> If you are interested (and wanna take
>>>>>>> care I do not miss your use case), feel free to take a look at the new
>>>>>>> unit tests:
>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>
>>>>>>> It's still work in progress. Proposals for more unit tests are very
>>>>>>> welcome.
>>>>>>>
>>>>>>>> We haven’t processed documents as large as those you mention since
>>>>>>>> books have so far been divided into chapters and processing could
>>>>>>>> therefore be parallelised accordingly. We also drop extreme outliers
>>>>>>>> above a certain size if we encounter them and then we batch process
>>>>>>>> them later in smaller chunks but this has so far not been necessary
>>>>>>>> with our current data sets. Like yours, our processing bottlenecks are
>>>>>>>> now in different components.
>>>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 30 Dec 2015, at 16:44, Peter Klügl <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> sorry for the delayed reply.
>>>>>>>>>
>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>>
>>>>>>>>> The special treatment of MARKUPs that causes the increased time
>>>>>>>>> required for initialization is just a workaround because I was too
>>>>>>>>> lazy to write a working jflex rule. Well, I tried but failed. It
>>>>>>>>> shouldn't be hard to improve this code... I will create an issue
>>>>>>>>> for it. When I did the last performance optimization, uima did not
>>>>>>>>> check the indexes yet and my test set did not contain markups.
>>>>>>>>>
>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>> Short answer is no. I was already thinking about making RutaBasic
>>>>>>>>> optional in future so that the user can configure if they are used.
>>>>>>>>> However, right now, they are required for rule inference and make the
>>>>>>>>> rule inference "fast" in the first place. RutaBasic is just an
>>>>>>>>> internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and
>>>>>>>>> RutaFrame, and rules should not match on them at all.
>>>>>>>>>
>>>>>>>>> Some background information:
>>>>>>>>>
>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>> - store additional information in order to avoid index operations.
>>>>>>>>> Some useful conditions would require many index operations, e.g.,
>>>>>>>>> PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which
>>>>>>>>> annotations start and end at which position, and which positions are
>>>>>>>>> covered by which types.
>>>>>>>>> - provide a container to make this information available across
>>>>>>>>> analysis engines. Information shared by analysis engines is normally
>>>>>>>>> stored in the CAS, e.g. in annotations, (or in external resources).
>>>>>>>>> This is the role of RutaBasic. It is not really implemented right now
>>>>>>>>> as it should be but I will improve it soon. Then, there is no
>>>>>>>>> performance decrease when a pipeline is spammed with small ruta
>>>>>>>>> engines.
>>>>>>>>> - a basic minimal disjoint partitioning of the document for the
>>>>>>>>> coverage based visibility concept.
>>>>>>>>>
>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for
>>>>>>>>> it, e.g., in order to reduce the memory footprint or when processing
>>>>>>>>> large documents where parts are simply not interesting, then I will
>>>>>>>>> put it on my TODO list. I am also open to other/new ideas on how to
>>>>>>>>> solve the challenges (and for incremental usage of internal caches).
>>>>>>>>>
>>>>>>>>> What is your experience with the processing overhead concerning
>>>>>>>>> RutaBasic? Is it the rule matching or rather the initialization?
>>>>>>>>> I myself already had some performance problems with the initialization
>>>>>>>>> and memory consumption in large CASes (500+ page PDFs). However, other
>>>>>>>>> components, serialization and the CAS editor were the actual
>>>>>>>>> bottlenecks.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>>>>> I got around it by removing the default seeders by specifying an
>>>>>>>>>> empty seeders list since we don’t need the MARKUP annotations
>>>>>>>>>> anymore.
>>>>>>>>>>
>>>>>>>>>> I still don’t know why it created so much overhead but it sometimes
>>>>>>>>>> seemed to rival the POS tagger in processing time.
>>>>>>>>>>
>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the
>>>>>>>>>> creation of Ruta basic annotations entirely to save processing
>>>>>>>>>> overhead and only apply Ruta rules to other annotation types created
>>>>>>>>>> by other AEs such as our own?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 21 Dec 2015, at 16:09, Mario Juric <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> I noticed that occasionally the initialisation in
>>>>>>>>>>> RutaEngine::initializeStream can take a very long time. I can’t really
>>>>>>>>>>> explain it and it seems independent of document length since I
>>>>>>>>>>> have seen this with even very small XML documents.
>>>>>>>>>>>
>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when
>>>>>>>>>>> creating MARKUP annotations during subiterator.moveToNext calls
>>>>>>>>>>> (line 89) and inside Subiterator it seems to be the while loop
>>>>>>>>>>> inside adjustForStrictForward (line 232), which is inside UIMA core
>>>>>>>>>>> classes. I haven’t gone into any deeper analysis yet but I would first
>>>>>>>>>>> like to hear whether you have an idea what could be the main
>>>>>>>>>>> cause(s) for this?
>>>>>>>>>>>
>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>
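To make the "$" address idea from Peter's reply in the thread a bit more concrete, here is a minimal sketch of the substitution step only. It does not use Ruta or UIMA at all: the class AddressPlaceholders and its bind method are hypothetical names made up for illustration, the addresses are plain ints, and writing the address as a decimal number into the rule text is an assumption, since the thread says the syntax is not yet specified.

```java
// Hypothetical helper, not part of Ruta: shows only how each "$" in a rule
// string could be bound to the address of one of the passed annotations.
final class AddressPlaceholders {

    // Replaces the i-th '$' in the rule with the i-th address (left to right).
    // Assumption: an address is written into the rule as a decimal integer.
    static String bind(String rule, int... addresses) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '$') {
                if (next >= addresses.length) {
                    throw new IllegalArgumentException("more '$' than addresses");
                }
                out.append(addresses[next++]);
            } else {
                out.append(c);
            }
        }
        if (next != addresses.length) {
            throw new IllegalArgumentException("more addresses than '$'");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 1234 stands in for the address of some annotation in the CAS.
        System.out.println(bind("${PARTOF(Headline)} Keyword;", 1234));
    }
}
```

A real Ruta.matches would then have to parse and apply the resulting rule against the CAS, which this sketch leaves out entirely.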

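The role of RutaBasic described in Peter's reply (a minimal disjoint partitioning of the document that caches which types begin, end, and cover each position) can be pictured with a small stand-alone sketch. This is not Ruta's implementation: the classes Span, Basic and BasicPartitionSketch are hypothetical, types are plain strings, and the partition is derived from annotation boundaries only, whereas Ruta builds it from the seeder output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an annotation: a type name plus offsets.
final class Span {
    final String type; final int begin; final int end;
    Span(String type, int begin, int end) { this.type = type; this.begin = begin; this.end = end; }
}

// Hypothetical stand-in for one RutaBasic: a segment plus cached type sets.
final class Basic {
    final int begin; final int end;
    final Set<String> beginning = new TreeSet<>(); // types starting at this.begin
    final Set<String> ending = new TreeSet<>();    // types ending at this.end
    final Set<String> covering = new TreeSet<>();  // types covering the segment
    Basic(int begin, int end) { this.begin = begin; this.end = end; }
}

final class BasicPartitionSketch {

    // Cuts the text at every annotation boundary and fills the caches once.
    static List<Basic> partition(List<Span> spans) {
        TreeSet<Integer> bounds = new TreeSet<>();
        for (Span s : spans) { bounds.add(s.begin); bounds.add(s.end); }
        List<Basic> basics = new ArrayList<>();
        Integer prev = null;
        for (Integer b : bounds) {
            if (prev != null) basics.add(new Basic(prev, b));
            prev = b;
        }
        for (Basic basic : basics) {
            for (Span s : spans) {
                if (s.begin <= basic.begin && basic.end <= s.end) {
                    basic.covering.add(s.type);
                    if (s.begin == basic.begin) basic.beginning.add(s.type);
                    if (s.end == basic.end) basic.ending.add(s.type);
                }
            }
        }
        return basics;
    }

    // A PARTOF-like check becomes a set lookup instead of an index operation.
    static boolean partOf(Basic basic, String type) {
        return basic.covering.contains(type);
    }
}
```

With a Headline over [0,20] and Keywords over [0,7] and [8,15], the sketch yields four segments, and a PARTOF-style question about any of them is answered from the cached sets without touching an annotation index again. The price is the up-front cost and memory of building the segments, which is the overhead the thread is about.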