Hi,

On 04.01.2016 at 16:13, Mario Gazzo wrote:
> Hi Peter,
>
> No problem, I was pretty much offline myself during the Christmas holidays anyway.
>
> The term “overhead” is probably an exaggeration in this context, especially
> after I disabled the MARKUP initialisation. We implemented our own XML markup
> annotator earlier, tailored to better fit our needs with additional annotation
> types and properties, so the Ruta MARKUP is currently not used. It just happens
> that we don’t directly use RutaBasic in any of our rules in this particular
> case, so I was curious to know whether we could avoid creating them in the
> first place, since there seem to be quite a few. However, the overall
> processing required by our Ruta scripts compared to other processing steps is
> now small, and sub-optimising this further by making RutaBasic optional would
> currently be of very low priority for us. We would prioritise other features
> higher, e.g. being able to assign annotations to variables, as we discussed
> previously in another thread.
I am working on this right now and there is finally some first progress :-)

I fear that I won't catch all use cases (combinations with language elements)
with the first attempt. If you are interested (and want to make sure I do not
miss your use case), feel free to take a look at the new unit tests:

https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation

It's still work in progress. Proposals for more unit tests are very welcome.

> We haven’t processed documents as large as those you mention, since books have
> so far been divided into chapters and processing could therefore be
> parallelised accordingly. We also drop extreme outliers above a certain size
> if we encounter them and batch process them later in smaller chunks, but this
> has so far not been necessary with our current data sets. Like you, our
> processing bottlenecks are now in different components.

Ah, it's nice to hear that Ruta is not the bottleneck :-D

Best,

Peter

> Cheers
> Mario
>
>> On 30 Dec 2015, at 16:44, Peter Klügl <[email protected]> wrote:
>>
>> Hi,
>>
>> sorry for the delayed reply.
>>
>> RutaEngine::initializeStream:
>>
>> The special treatment of MARKUPs that causes the increased initialization
>> time is just a workaround, because I was too lazy to write a working JFlex
>> rule. Well, I tried but failed. It shouldn't be hard to improve this code...
>> I will create an issue for it. When I did the last performance optimization,
>> UIMA did not check the indexes yet and my test set did not contain markups.
>>
>> Deactivating the creation of RutaBasic:
>>
>> The short answer is no. I was already thinking about making RutaBasic
>> optional in the future, so that the user can configure whether they are used.
>> However, right now, they are required for rule inference and make the rule
>> inference "fast" in the first place. RutaBasic is just an internal annotation
>> like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should
>> not match on them at all.
>>
>> Some background information:
>>
>> RutaBasics are used for three things:
>> - to store additional information in order to avoid index operations. Some
>> useful conditions would require many index operations, e.g. PARTOF or
>> ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end
>> at which position, and which positions are covered by which types.
>> - to provide a container that makes this information available across
>> analysis engines. Information shared by analysis engines is normally stored
>> in the CAS, e.g. in annotations (or in external resources). This is the role
>> of RutaBasic. It is not really implemented right now as it should be, but I
>> will improve it soon. Then there is no performance decrease when a pipeline
>> is spammed with small Ruta engines.
>> - to provide a basic, minimal, disjoint partitioning of the document for the
>> coverage-based visibility concept.
>>
>> Making RutaBasic optional is possible. If there is a real need for it, e.g.
>> in order to reduce the memory footprint or when processing large documents
>> where parts are simply not interesting, then I will put it on my TODO list.
>> I am also open to other/new ideas on how to solve these challenges (and for
>> incremental usage of internal caches).
>>
>> What is your experience with the processing overhead concerning RutaBasic?
>> Is it the rule matching or rather the initialization?
>> I myself already had some performance problems with the initialization and
>> memory consumption in large CASes (500+ page PDFs). However, other
>> components, serialization and the CAS editor, were the actual bottlenecks.
>>
>> Best,
>>
>> Peter
>>
>>
>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>> I got around it by removing the default seeders, i.e. by specifying an
>>> empty seeders list, since we don’t need the MARKUP annotations anymore.
>>>
>>> I still don’t know why it created so much overhead, but it sometimes seemed
>>> to rival the POS tagger in processing time.
>>>
>>> Anyway, this leads me to the next question. Can I disable the creation of
>>> Ruta basic annotations entirely to save processing overhead and only apply
>>> Ruta rules to other annotation types created by other AEs, such as our own?
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 21 Dec 2015, at 16:09, Mario Juric <[email protected]> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> I noticed that occasionally the initialisation in
>>>> RutaEngine::initializeStream can take a very long time. I can’t really
>>>> explain it, and it seems independent of document length, since I have seen
>>>> this even with very small XML documents.
>>>>
>>>> The method seems to spend much time in the DefaultSeeder when creating
>>>> MARKUP annotations during subiterator.moveToNext calls (line 89), and
>>>> inside Subiterator it seems to be the while loop inside
>>>> adjustForStrictForward (line 232), which is inside UIMA core classes. I
>>>> haven’t done any deeper analysis yet, but I would first like to hear
>>>> whether you have an idea what could be the main cause(s) of this?
>>>>
>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>
>>>>
>>>> Cheers
>>>> Mario
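
[Editor's note] For anyone following the workaround Mario describes above
(removing the default seeders by specifying an empty seeders list), a minimal
uimaFIT sketch of such a configuration might look like the one below. It is not
taken from this thread: the script name "MyScript" is a placeholder, and it
assumes the RutaEngine parameter constants of Ruta 2.3.x
(RutaEngine.PARAM_MAIN_SCRIPT, RutaEngine.PARAM_SEEDERS).

// A minimal sketch (not from this thread) of configuring RutaEngine with an
// empty seeders list via uimaFIT, as in the workaround described above.
// "MyScript" is a placeholder for your own Ruta script name.
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.ruta.engine.RutaEngine;

public class EmptySeedersExample {

    public static AnalysisEngineDescription createRutaDescription() throws Exception {
        return AnalysisEngineFactory.createEngineDescription(
                RutaEngine.class,
                // placeholder script name
                RutaEngine.PARAM_MAIN_SCRIPT, "MyScript",
                // empty seeders list: the DefaultSeeder is not applied, so no
                // MARKUP (or other seed token) annotations are created
                RutaEngine.PARAM_SEEDERS, new String[0]);
    }
}

With an empty seeders array the DefaultSeeder is not run, so no MARKUP (or
other seed token) annotations are added; the rules then work on annotations
created by other analysis engines, as discussed in the thread above.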
