Re: Ruta 2.4.0 - High memory needs

Peter Klügl Thu, 18 Aug 2016 06:25:46 -0700

I added another test where the default seeder is replaced by a different
one:


https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/engine/DifferentSeederTest.java


This also works as expected: no TokenSeed annotations are created.


Best,


Peter


On 18.08.2016 at 15:11, Peter Klügl wrote:
> I found a bug (and fixed it), but it was not related to your problem.
>
>
> I added a unit test where the seeder is removed:
>
> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/engine/NoSeedersTest.java
>
>
> Seems to work just fine. The problem must be located somewhere else.
>
>
> Are you sure that the configuration parameter value is correct?
>
>
> I'll write another unit test...
>
>
> Best,
>
>
> Peter
>
>
> On 18.08.2016 at 14:38, Peter Klügl wrote:
>> I'll check that (writing a unit test right now).
>>
>>
>> On 18.08.2016 at 14:36, [email protected] wrote:
>>> Hi Peter,
>>>
>>> It doesn't work like that for me. I've removed DefaultSeeder and added my
>>> own seeder implementing RutaAnnotationSeeder. Now I get all of Ruta's
>>> standard tokens plus my own tokenization at the same time.
>>>
>>> Cheers,
>>> Armin
>>>
>>> -----Original Message-----
>>> From: Peter Klügl [mailto:[email protected]]
>>> Sent: Thursday, 18 August 2016 14:23
>>> To: [email protected]
>>> Subject: Re: Ruta 2.4.0 - High memory needs
>>>
>>> Hi,
>>>
>>>
>>> On 18.08.2016 at 14:17, [email protected] wrote:
>>>> Hello Peter!
>>>>
>>>> Please correct me if I'm wrong. My understanding of how Ruta works is as 
>>>> follows. 
>>>>
>>>> 1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE
>>>> have no influence on annotation creation. They only influence the use of
>>>> those types in rules.
>>>>
>>> yes
>>>
>>>
>>>> 2. The configuration parameter "seeders" only adds additional seeders. It
>>>> cannot be used to remove the default seeder.
>>> No, the parameter specifies all seeders. The default value is set to
>>> the default seeder. If you set it to an empty list, no seeders should be
>>> applied. If you want to use your own seeder, you simply set the
>>> parameter to your implementation.
>>>
>>> (I am really sure of that, but I will check it again...)
>>>
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>> So how do I tell Ruta not to use the default seeder? How do I tell Ruta to 
>>>> use my own seeder? Do I have to replace 
>>>> org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this break Ruta?
>>>>
>>>> Best,
>>>> Armin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Peter Klügl [mailto:[email protected]]
>>>> Sent: Wednesday, 10 August 2016 14:50
>>>> To: [email protected]
>>>> Subject: Re: Ruta 2.4.0 - High memory needs
>>>>
>>>> Hi,
>>>>
>>>>
>>>> 18 MB of text in a CAS, well, that's quite a big sofa.
>>>>
>>>>
>>>> Yes, there are some tricks and best practices.
>>>>
>>>>
>>>> First of all, there is the configuration parameter "lowMemoryProfile",
>>>> which reduces the information stored in RutaBasic. It should reduce the
>>>> memory usage considerably, but the processing will take longer,
>>>> especially if the type hierarchy is rather deep. The unit tests for it
>>>> do not cover all functionality of Ruta. I only run all unit tests with
>>>> this option once in a while, and I haven't done so for some time.
>>>>
>>>>  
>>>>
>>>> The second thing to do in order to reduce the memory usage is to
>>>> minimize the annotations, especially the RutaBasic annotations. These
>>>> are created automatically and build up a minimal, atomic partitioning of
>>>> the document. This means that you should create only annotations as
>>>> small as you need them, and only annotations where you need them. The
>>>> first option here is to remove the seeder if you do not rely on its
>>>> annotations (ANY, CW, NUM, PERIOD, ...), or to replace it with a
>>>> tokenizer if you did not include one anyway. This will rid you of
>>>> the annotations for whitespace and so on, and of the corresponding
>>>> RutaBasic annotations. Maybe you also do not need any kind of annotation
>>>> for each section (e.g., restrict the matching window). Optimization
>>>> strongly depends on the use case and the actual rules.
>>>>
>>>> Please note that text spans without any annotations will be considered
>>>> invisible for sequential matching.
>>>>
>>>>
>>>> By the way, the speed of your rules can be improved, especially with the
>>>> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
>>>> conditions in Ruta. I'd rather recommend something like:
>>>>
>>>> Full->{ANY @Full{-> UNMARK(Full)}; Full{-> UNMARK(Full)} ANY;};
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 09.08.2016 at 12:37, [email protected] wrote:
>>>>> Hello again!
>>>>>
>>>>> One down, one to go. Are there best practices or tricks to reduce Ruta's 
>>>>> memory needs? I tried to use the following script to merge names. 
>>>>>
>>>>> Document{->GREEDYANCHORING(true)};
>>>>> First+ Full {->MARK(Full)};
>>>>> Full Last+ {->MARK(Full)};
>>>>> First+ Last+ {->MARK(Full)};
>>>>> Document{->GREEDYANCHORING(false)};
>>>>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>>>>> First{PARTOF(Full) -> UNMARK(First)};
>>>>> Last{PARTOF(Full) -> UNMARK(Last)};
>>>>>
>>>>> The engine description is created by ruta-maven-plugin:2.4.0 and used with
>>>>> uimaFIT's
>>>>> AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>>>> For an 18 MB text, it needs gigabytes of RAM.
>>>>>
>>>>> Cheers,
>>>>> Armin
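
A note on the "minimal, atomic partitioning" mentioned in the reply of 10 August: the following is a minimal Java sketch of the idea, not Ruta's actual implementation. It assumes that the text is split at every annotation boundary and that only segments covered by at least one annotation are kept. The class AtomicPartitionSketch and its methods are hypothetical names used for illustration only. It shows why dropping whitespace annotations reduces the number of atomic segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative toy model only; this is not code from Ruta.
class AtomicPartitionSketch {

    /**
     * Splits the text at every annotation boundary and keeps the segments
     * that are covered by at least one annotation.
     * Each annotation and each result is a half-open span {begin, end}.
     */
    static List<int[]> partition(List<int[]> annotations) {
        // Collect all begin and end offsets in sorted order.
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (int[] a : annotations) {
            boundaries.add(a[0]);
            boundaries.add(a[1]);
        }
        // Every pair of adjacent boundaries is a candidate atomic segment.
        List<int[]> atoms = new ArrayList<>();
        Integer previous = null;
        for (Integer offset : boundaries) {
            if (previous != null && isCovered(previous, offset, annotations)) {
                atoms.add(new int[] {previous, offset});
            }
            previous = offset;
        }
        return atoms;
    }

    /** True if some annotation fully contains the span [begin, end). */
    static boolean isCovered(int begin, int end, List<int[]> annotations) {
        for (int[] a : annotations) {
            if (a[0] <= begin && end <= a[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Text "John Smith": two tokens plus one whitespace annotation.
        List<int[]> withWhitespace = Arrays.asList(
                new int[] {0, 4}, new int[] {4, 5}, new int[] {5, 10});
        // Same text, tokens only: the gap [4, 5) carries no annotation.
        List<int[]> tokensOnly = Arrays.asList(
                new int[] {0, 4}, new int[] {5, 10});

        System.out.println(partition(withWhitespace).size());
        System.out.println(partition(tokensOnly).size());
    }
}
```

Run as is, the sketch prints 3 for the input with a whitespace annotation and 2 for the tokens-only input. In this toy model the uncovered gap produces no segment at all, which mirrors the remark that text spans without any annotations are invisible to sequential matching.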
