Re: Limiting the memory used by an annotator?

2017-05-01 Thread Marshall Schor
Hi,

I'm not sure that a limited-size FsIndexRepository would work, because it would
only limit those Feature Structures that are added to the index.

Feature Structures are often created that are referenced from other Feature
Structures but are not added to the index.  One example is instances of the
NonEmptyXxxList kinds of objects: these are used to hold items in a list and are
typically not (individually) added to the index, since the normal way to access
them is via the head of the list.

Even if they are not in the FsIndexRepository indexes, they still take up room
in the main heap storage for Feature Structures.
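
To illustrate (a minimal sketch using the UIMA 2.x JCas API; the class and
offsets are just examples), the list nodes below consume CAS heap space even
though only the final annotation is ever indexed:

    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.cas.EmptyFSList;
    import org.apache.uima.jcas.cas.NonEmptyFSList;
    import org.apache.uima.jcas.tcas.Annotation;

    public class UnindexedListNodes {
        public static void build(JCas jcas) {
            // These list nodes live in the CAS heap but are never indexed;
            // in a real type system they would be referenced from a feature
            // of an indexed Feature Structure (the head of the list).
            NonEmptyFSList node = new NonEmptyFSList(jcas);
            node.setHead(new Annotation(jcas, 0, 2));
            node.setTail(new EmptyFSList(jcas));

            // Only this annotation is visible to the index repository:
            new Annotation(jcas, 0, 5).addToIndexes();
        }
    }

A size-limited index repository would never see the list nodes, yet they still
count against the heap.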

-Marshall


On 4/30/2017 4:15 PM, Hugues de Mazancourt wrote:
> Thanks to all for your advice.
> […]
> Thus, any feature that can help limit the damage of unexpected input 
> would be welcome. And a limited-size FsIndexRepository seems to me a simple 
> yet powerful enough solution to many problems.
>
> Best,
>
> — Hugues
>
>
> PS: apart from occasional problems, Ruta is a great platform for information 
> extraction. I love it!
>
> […]

Re: Limiting the memory used by an annotator?

2017-05-01 Thread Hugues de Mazancourt

> Thanks for the ticket. I haven't checked the implementation yet, but it
> looks very much like a bug.
> The rule looks simple, but the problem is quite complicated, as you could
> replace both rule elements after the wildcard with arbitrarily complex
> composed rule elements. I have to check what exactly went wrong there.

I guess the bug comes from a possible "path" for # between parallel 
annotations that confuses the matching.

[…]
>> PS: apart from occasional problems, Ruta is a great platform for 
>> information extraction. I love it!
> 
> Thanks :-) especially for reporting the problem, which greatly helps to
> improve Ruta

You’re welcome!
I may have found another one… Stay tuned ;-)

— Hugues



Re: Limiting the memory used by an annotator?

2017-05-01 Thread Peter Klügl
Hi,


On 30.04.2017 at 22:15, Hugues de Mazancourt wrote:
> Thanks to all for your advice.
> In my specific case, this was a Ruta problem - Peter, I filed a JIRA issue 
> with a minimal example - which would advocate for the 
> "TooManyMatchesException" feature you propose. I vote for it.


Thanks for the ticket. I haven't checked the implementation yet, but it
looks very much like a bug.
The rule looks simple, but the problem is quite complicated, as you could
replace both rule elements after the wildcard with arbitrarily complex
composed rule elements. I have to check what exactly went wrong there.


> […]
> Thus, any feature that can help limit the damage of unexpected input 
> would be welcome. And a limited-size FsIndexRepository seems to me a simple 
> yet powerful enough solution to many problems.

I can't say anything about the FsIndexRepository, but the limitation
within Ruta will be included soon.
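
As a rough illustration of the planned safeguard (all names here are
hypothetical - this is not actual Ruta code, just the shape of the idea of a
match limit that aborts via a runtime exception):

    // Hypothetical sketch of a match-count limit, not actual Ruta internals.
    public class MatchLimitGuard {
        private final int maxMatches;
        private int matches;

        public MatchLimitGuard(int maxMatches) {
            this.maxMatches = maxMatches;
        }

        // Would be called once per rule (element) match during inference.
        public void onMatch(String ruleInfo) {
            if (++matches > maxMatches) {
                throw new TooManyMatchesException(
                        "More than " + maxMatches + " matches at: " + ruleInfo);
            }
        }

        // A RuntimeException, so it can surface from anywhere in the inference.
        public static class TooManyMatchesException extends RuntimeException {
            public TooManyMatchesException(String message) {
                super(message);
            }
        }
    }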

> Best,
>
> — Hugues
>
>
> PS: apart from occasional problems, Ruta is a great platform for information 
> extraction. I love it!

Thanks :-) especially for reporting the problem, which greatly helps to
improve Ruta


Best,

Peter



> […]

Re: Limiting the memory used by an annotator?

2017-04-30 Thread Hugues de Mazancourt
Thanks to all for your advice.
In my specific case, this was a Ruta problem - Peter, I filed a JIRA issue with 
a minimal example - which would advocate for the "TooManyMatchesException" 
feature you propose. I vote for it.

Of course, I already limit the size of input texts, but this is not enough.
One of the main strengths of UIMA is being able to integrate annotators 
produced by third parties. And each annotator is based on assumptions: at the 
very least, that its input is text, formed of words, etc. Thus, pipelines get 
more and more complex, without the need to code all the processing yourself. 
But in a production environment anything can happen, and assumptions may not be 
respected (e.g. non-textual data can be sent to the engine(s), etc.). Sh** 
always happens in production.

My case is a rather specific one, but I’m sure it can be generalized.

Thus, any feature that can help limit the damage of unexpected input would 
be welcome. And a limited-size FsIndexRepository seems to me a simple yet 
powerful enough solution to many problems.
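
There is no size-limited FsIndexRepository in UIMA out of the box. A rough
caller-side approximation (a sketch only: it sees just the indexed annotations,
per Marshall's caveat above, and only checks after the engine returns, so it
complements rather than prevents the blow-up) could be:

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.CAS;

    public class AnnotationBudget {
        // Fails fast (after the engine returns) if the index grew past 'max'.
        public static void processWithBudget(AnalysisEngine engine, CAS cas, int max)
                throws AnalysisEngineProcessException {
            engine.process(cas);
            int size = cas.getAnnotationIndex().size();
            if (size > max) {
                cas.reset();  // drop the oversized result before it spreads
                throw new RuntimeException(
                        "Annotation budget exceeded: " + size + " > " + max);
            }
        }
    }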

Best,

— Hugues


PS: apart from occasional problems, Ruta is a great platform for information 
extraction. I love it!

> On 30 Apr 2017, at 12:57, Peter Klügl wrote:
> 
> Hi,
> 
> 
> here are some Ruta-specific comments, in addition to Thilo's and Marshall's 
> answers.
> 
> - if you do not want to split the CAS into smaller ones, you can also sometimes 
> apply the rules to just some parts of the document (-> fewer annotations/rule 
> matches created)
> 
> - there is a discussion related to this topic (about memory usage in Ruta): 
> https://issues.apache.org/jira/browse/UIMA-5306
> 
> - I can include configuration parameters which limit the allowed number of 
> rule matches and rule element matches of one rule/rule element. If a rule or 
> rule element exceeds it, a new runtime exception is thrown. I'll open a JIRA 
> ticket for that. This is not a solution for the problem in my opinion, but it 
> can help to identify and fix the problematic rules.
> 
> - I do not want to include code to directly restrict the max memory in Ruta. 
> That should rather happen in the framework or in the code that calls/applies 
> the Ruta analysis engine.
> 
> - I think there is a problem in Ruta, and there are several aspects that need 
> to be considered here: the actual rules, the partitioning with RutaBasic, 
> flaws in the implementation, and the configuration parameters of the analysis 
> engine
> 
> - Are the rules inefficient (combinatorial explosion)? I see Ruta more and 
> more as a programming language for quickly creating maintainable analysis 
> engines. You can write efficient and inefficient code. If the code/rules are 
> too slow or take too long, you should refactor them and replace them with a 
> more efficient approach. Something like ANY+ is a good indicator that the 
> rules are not optimal; you should only match on things if you have to. There 
> is also profiling functionality in the Ruta Workbench which shows you how long 
> each rule took and how long specific conditions/actions took. Granted, this is 
> information about speed rather than memory, but many rule matches take longer 
> and require more memory, so it can be an indicator.
> 
> - There are two specific ways Ruta spends its memory: RutaBasic and 
> RuleMatches. RutaBasic stores additional information which speeds up the rule 
> inference and enables specific functionality. The rule matches are needed to 
> remember where something matched, for the conditions and actions. You can 
> reduce the memory usage by reducing the number of RutaBasic annotations, the 
> number of annotations indexed in the RutaBasic annotations, or the number of 
> RuleMatches -> by refactoring the rules.
> 
> - There are plans to make the implementation of RutaBasic more efficient by 
> using more efficient data structures (there are some prototypes mentioned in 
> the issue linked above). And I added some new configuration parameters (in 
> Ruta 2.6.0, I think) which control which information is stored in RutaBasic; 
> e.g., you do not need information about annotations if they or their types are 
> not used in the rules.
> 
> - I think there is a flaw in the implementation which causes your problem, 
> and which can be fixed. I'll investigate when I find the time. If you can 
> provide a minimal (synthetic) example for reproducing it, that would be 
> great.
> 
> - There is the configuration parameter lowMemoryProfile for reducing what is 
> stored in RutaBasic; it reduces the memory usage but makes the rules run 
> slower.
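
For reference, enabling that parameter with uimaFIT might look like the
following sketch (the script name is a placeholder; verify the parameter
against your Ruta version):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.resource.ResourceInitializationException;
    import org.apache.uima.ruta.engine.RutaEngine;

    public class LowMemoryRuta {
        public static AnalysisEngineDescription create()
                throws ResourceInitializationException {
            // Trades rule speed for a smaller RutaBasic footprint.
            return createEngineDescription(RutaEngine.class,
                    RutaEngine.PARAM_MAIN_SCRIPT, "my.package.MyScript",  // placeholder
                    RutaEngine.PARAM_LOW_MEMORY_PROFILE, true);
        }
    }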
> 
> 
> Best,
> 
> 
> Peter

Re: Limiting the memory used by an annotator?

2017-04-29 Thread Marshall Schor
This has occasionally popped up as a user request.

Thilo makes some good practical suggestions that often work. 

If (in your case) there's some aspect of the data that causes a combinatorial
explosion in some part of the code, and you can identify that part and have any
control over it, you might be able to insert some limiting code there.

Limiting the amount of memory: thinking more about this, if the limit were
reached, what should happen?  It seems the choice would be to throw a new
(subclass of) RuntimeException (runtime, because it could happen almost
anywhere); the "catch" action would be to abort whatever was going on, report
the failure, and reset things (including the CAS).
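
A sketch of such a catch point in the calling code (the reporting and reset
policy are illustrative; note that the framework may wrap an annotator's
RuntimeException in an AnalysisEngineProcessException, so both are caught):

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.CAS;

    public class AbortingCaller {
        // Process one document; on failure, report it and reset the CAS.
        public static void processOne(AnalysisEngine engine, CAS cas) {
            try {
                engine.process(cas);
            } catch (AnalysisEngineProcessException | RuntimeException e) {
                System.err.println("Aborting document: " + e);
                cas.reset();  // reset things, including the CAS, and move on
            }
        }
    }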

This could be done already, because an exception does happen (the out-of-memory
error).  Hopefully, this isn't too late - you mentioned that things slow down as
memory gets short.  (I suppose you could also time things and, if they slow down
dramatically, use that as a trigger.)

So maybe this is the best approach - find a spot in your code where the
"recovery" of aborting and resetting things makes sense, and install an
out-of-memory try / catch point there (or a dramatic-slowdown catcher).

A trick for out-of-memory catchers is to grab a block of memory (say, an int
array) at the start, and then have the out-of-memory code release that block, to
give the catcher room enough to run and recover.  But this might not be needed;
just unwinding the stack due to the throw could also free up memory, if your
catch point is high up the stack.
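
A sketch of that trick (sizes are illustrative; note that Java actually throws
OutOfMemoryError, an Error rather than an Exception):

    public class OomGuard {
        // Reserve ~4 MB up front so the catch block has room to run.
        private static int[] reserve = new int[1_000_000];

        public static void runGuarded(Runnable work, Runnable recover) {
            try {
                work.run();
            } catch (OutOfMemoryError e) {  // an Error, not an Exception
                reserve = null;  // release the safety block
                recover.run();   // e.g. report the failure and reset the CAS
            }
        }
    }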

Hope this helps.  -Marshall

On 4/29/2017 6:53 AM, Hugues de Mazancourt wrote:
> Hello UIMA users,
>
> I’m currently putting a Ruta-based system into production and I sometimes run 
> out of memory.
> This is usually caused by combinatorial explosion in Ruta rules. These rules 
> are not necessarily faulty: they are adapted to the documents I expect to 
> parse. But as this is an open system, people can upload whatever they want, 
> and the parser crashes by multiplying annotations (or at least spends 20 
> minutes garbage-collecting millions of annotations).
>
> […]



Limiting the memory used by an annotator?

2017-04-29 Thread Hugues de Mazancourt
Hello UIMA users,

I’m currently putting a Ruta-based system into production and I sometimes run 
out of memory.
This is usually caused by combinatorial explosion in Ruta rules. These rules 
are not necessarily faulty: they are adapted to the documents I expect to 
parse. But as this is an open system, people can upload whatever they want, and 
the parser crashes by multiplying annotations (or at least spends 20 minutes 
garbage-collecting millions of annotations).

Thus, my question is: is there a way to limit the memory used by an annotator, 
to limit the number of annotations made by an annotator, or to limit the number 
of matches made by Ruta?
I would rather cancel the parse of a given document than suffer 20 minutes of 
downtime for the whole system.
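
One caller-side approach that gives both limits (a sketch; the main class,
classpath, heap cap, and timeout below are all placeholders) is to isolate each
parse in a child JVM, so a runaway document can be killed without taking down
the whole system:

    import java.util.concurrent.TimeUnit;

    public class IsolatedParse {
        // Returns true if the child JVM finished in time; kills it otherwise.
        public static boolean parseWithLimits(String docPath) throws Exception {
            Process p = new ProcessBuilder(
                    "java", "-Xmx512m",           // hard heap cap for the annotator
                    "-cp", "pipeline.jar",        // placeholder classpath
                    "com.example.PipelineRunner", // placeholder main class
                    docPath)
                    .inheritIO()
                    .start();
            if (!p.waitFor(5, TimeUnit.MINUTES)) { // cancel one parse, not the system
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        }
    }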

Several UIMA-based services run in production; I guess others have certainly 
hit the same problem.

Any hint on that topic would be very helpful.

Thanks,

Hugues de Mazancourt
http://about.me/mazancourt