I added another test where the default seeder is replaced by a different one:
https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/engine/DifferentSeederTest.java

It also works as expected: no TokenSeed annotations.

Best,

Peter


On 18.08.2016 at 15:11, Peter Klügl wrote:
> I found a bug (and fixed it), but it was not related to your problem.
>
> I added a unit test where the seeder is removed:
>
> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/engine/NoSeedersTest.java
>
> Seems to work just fine. The problem must be located somewhere else.
>
> Are you sure that the configuration parameter value is correct?
>
> I'll write another unit test...
>
> Best,
>
> Peter
>
>
> On 18.08.2016 at 14:38, Peter Klügl wrote:
>> I'll check that (writing some unit tests right now)
>>
>>
>> On 18.08.2016 at 14:36, [email protected] wrote:
>>> Hi Peter,
>>>
>>> It doesn't work like that for me. I've removed DefaultSeeder and added my own
>>> seeder implementing RutaAnnotationSeeder. Now, I have all of Ruta's
>>> standard tokens plus my own tokenization at the same time.
>>>
>>> Cheers,
>>> Armin
>>>
>>> -----Original Message-----
>>> From: Peter Klügl [mailto:[email protected]]
>>> Sent: Thursday, 18 August 2016 14:23
>>> To: [email protected]
>>> Subject: Re: Ruta 2.4.0 - High memory needs
>>>
>>> Hi,
>>>
>>>
>>> On 18.08.2016 at 14:17, [email protected] wrote:
>>>> Hello Peter!
>>>>
>>>> Please correct me if I'm wrong. My understanding of how Ruta works is as
>>>> follows.
>>>>
>>>> 1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE
>>>> have no influence on annotation creation. They influence the use of those
>>>> types in rules, only.
>>>>
>>> yes
>>>
>>>
>>>> 2. The configuration parameter seeders adds additional seeders, only. It
>>>> cannot be used to remove the default seeder.
>>> No, the parameter specifies all seeders. The default value is set to the
>>> default seeder.
>>> If you set it to an empty list, no seeders should be applied. If you
>>> want to use your own seeder, you simply set the parameter to your
>>> implementation.
>>>
>>> (I am really sure of that, but I will check it again...)
>>>
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>> So how do I tell Ruta not to use the default seeder? How do I tell Ruta to
>>>> use my own seeder? Do I have to replace
>>>> org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this break Ruta?
>>>>
>>>> Best,
>>>> Armin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Peter Klügl [mailto:[email protected]]
>>>> Sent: Wednesday, 10 August 2016 14:50
>>>> To: [email protected]
>>>> Subject: Re: Ruta 2.4.0 - High memory needs
>>>>
>>>> Hi,
>>>>
>>>>
>>>> 18 MB of text in a CAS, well, that's quite a big sofa.
>>>>
>>>>
>>>> Yes, there are some tricks and best practices.
>>>>
>>>>
>>>> First of all, there is the configuration parameter "lowMemoryProfile",
>>>> which reduces the information stored in RutaBasic. It should reduce the
>>>> memory usage considerably, but the processing will take longer,
>>>> especially if the type hierarchy is rather deep. The unit tests for it
>>>> do not cover all functionality of Ruta. I only run all unit tests with
>>>> this option once in a while, and I haven't done this for some time.
>>>>
>>>>
>>>> The second thing to do in order to reduce the memory usage is to
>>>> minimize the annotations and especially the RutaBasic annotations. These
>>>> are automatically created and build up a minimal, atomic partitioning of
>>>> the document. This means that you should create only annotations as
>>>> small as you need them, and only annotations where you need them. The
>>>> first option here is to remove/replace the seeder if you do not rely on
>>>> these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
>>>> tokenizer if you did not include one anyway.
>>>> This will rid you of the annotations for whitespace and so on, and of the
>>>> corresponding RutaBasic annotations. Maybe you also do not need any kind
>>>> of annotation for each section (e.g., restrict the matching window).
>>>> Optimization strongly depends on the use case and the actual rules.
>>>>
>>>> Please mind that text spans without any annotations will be considered
>>>> invisible concerning sequential matching.
>>>>
>>>>
>>>> By the way, the speed of your rules can be improved, especially with the
>>>> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
>>>> conditions in Ruta. I'd rather recommend something like:
>>>>
>>>> Full->{ANY @Full{-> UNMARK(Full)};Full{-> UNMARK(Full) ANY};};
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 09.08.2016 at 12:37, [email protected] wrote:
>>>>> Hello again!
>>>>>
>>>>> One down, one to go. Are there best practices or tricks to reduce Ruta's
>>>>> memory needs? I tried to use the following script to merge names.
>>>>>
>>>>> Document{->GREEDYANCHORING(true)};
>>>>> First+ Full {->MARK(Full)};
>>>>> Full Last+ {->MARK(Full)};
>>>>> First+ Last+ {->MARK(Full)};
>>>>> Document{->GREEDYANCHORING(false)};
>>>>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>>>>> First{PARTOF(Full) -> UNMARK(First)};
>>>>> Last{PARTOF(Full) -> UNMARK(Last)};
>>>>>
>>>>> The engine description is created by ruta-maven-plugin:2.4.0 and used with
>>>>> uimaFIT's
>>>>> AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>>>> For an 18 MB text, it needs gigabytes of RAM.
>>>>>
>>>>> Cheers,
>>>>> Armin
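
The two settings discussed in this thread (replacing the default seeder and enabling "lowMemoryProfile") could be handed to uimaFIT roughly as sketched below. This is only an illustration under assumptions, not a tested Ruta setup: it assumes that "seeders" is a multi-valued string parameter holding class names and that "lowMemoryProfile" is a boolean. The class SeederConfigSketch, the method seederConfiguration and the seeder name com.example.MySeeder are hypothetical. The sketch only builds the name/value pairs, so it runs without UIMA on the classpath.

```java
import java.util.Arrays;

public class SeederConfigSketch {

    // Parameter names as they are written in this thread.
    static final String PARAM_SEEDERS = "seeders";
    static final String PARAM_LOW_MEMORY_PROFILE = "lowMemoryProfile";

    /**
     * Builds alternating name/value pairs, the layout uimaFIT accepts as
     * configuration data. Passing no class names yields an empty list,
     * which according to the thread should mean that no seeder is applied.
     */
    static Object[] seederConfiguration(boolean lowMemoryProfile, String... seederClassNames) {
        return new Object[] {
            PARAM_SEEDERS, seederClassNames,
            PARAM_LOW_MEMORY_PROFILE, lowMemoryProfile
        };
    }

    public static void main(String[] args) {
        // Hypothetical seeder class implementing RutaAnnotationSeeder.
        Object[] config = seederConfiguration(true, "com.example.MySeeder");
        System.out.println(Arrays.deepToString(config));

        // With uimaFIT available, the pairs would be appended to the call
        // already used in this thread:
        // AnalysisEngineFactory.createEngineDescription(descriptorName, config);
    }
}
```

If the additional tokens still appear with such a configuration, checking the parameter value that actually reaches the engine is a reasonable first step, as suggested above.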
