Re: New dictionary annotator

2016-12-02 Thread Hugues de Mazancourt
Cool!
Any idea how far off that near future is?
;-)

— Hugues



> On 2 Dec 2016, at 10:26, Donatas Remeika <donatas.reme...@gmail.com> wrote:
> 
> Hi Hugues,
> 
> Thanks for the feedback. Accent-insensitive matching is indeed a needed
> feature. I will implement it in the near future.
> 
> Best regards,
> Donatas Remeika
> 
> On Fri, Dec 2, 2016 at 11:02 AM Hugues de Mazancourt <hug...@mazancourt.com>
> wrote:
> 
>> Thanks for this contribution.
>> 
>> Do you have any plans to make the lookup accent-insensitive? Or do you know
>> of a component that would do the job?
>> I’m currently using ConceptMapper outside of Ruta and MARKTABLE from
>> within Ruta, but neither handles accents correctly (btw, ConceptMapper
>> is *very* slow on resource loading, which can be a problem).
>> 
>> My point is: I have lists containing elements like « événement » and I
>> would like text like « EVENEMENT » or even « évènement » to match that
>> list. Lowercasing texts is not a solution, as « é » is mapped to uppercase
>> « É » in the French locale, which has nothing to do with « e ». I guess you
>> have the same problem with Latvian.
>> 
>> Best,
>> 
>> 
>> Hugues de Mazancourt
>> http://about.me/mazancourt
>> 
>> 
>> 
>> 
>>> On 30 Nov 2016, at 15:38, Donatas Remeika <donatas.reme...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Just wanted to let you know that we created a new (probably one more)
>>> dictionary annotator.
>>> 
>>> The reasons for creating it were:
>>> - Quite often we used Ruta in our pipelines only because of its MARKTABLE
>>> action, which is able to set several features on an annotation
>>> - Sometimes dictionaries contain duplicate entries with different
>>> features, and we need to create annotations for each entry
>>> - The possibility of using a custom tokenizer for dictionary entries
>>> (the default is a whitespace tokenizer)
>>> 
>>> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE.
>>> Big thanks to their developers!
>>> 
>>> Code with examples can be found at
>>> https://github.com/tokenmill/dictionary-annotator
>>> 
>>> BTW, maybe someone knows a Concept Mapper alternative that is more
>>> uimaFIT friendly?
>>> 
>>> Best regards,
>>> Donatas
>> 
>> 



Re: New dictionary annotator

2016-12-02 Thread Hugues de Mazancourt
Thanks for this contribution.

Do you have any plans to make the lookup accent-insensitive? Or do you know
of a component that would do the job?
I’m currently using ConceptMapper outside of Ruta and MARKTABLE from within
Ruta, but neither handles accents correctly (btw, ConceptMapper is *very*
slow on resource loading, which can be a problem).

My point is: I have lists containing elements like « événement » and I would
like text like « EVENEMENT » or even « évènement » to match that list.
Lowercasing texts is not a solution, as « é » is mapped to uppercase « É » in
the French locale, which has nothing to do with « e ». I guess you have the same
problem with Latvian.
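
To make the request concrete, here is a minimal sketch of the folding I have in mind, assuming plain Java and java.text.Normalizer only. It is not something ConceptMapper, MARKTABLE or the new annotator provides; the class and method names are made up for illustration. The \u escapes spell « événement », « évènement », « Événement » and « avènement ».

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AccentFolding {

    // Decompose accented characters (NFD), drop the combining marks,
    // then lowercase with a locale-independent rule.
    public static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // Build a lookup set holding the folded form of every dictionary entry.
    public static Set<String> foldAll(List<String> entries) {
        Set<String> folded = new HashSet<>();
        for (String entry : entries) {
            folded.add(fold(entry));
        }
        return folded;
    }

    public static void main(String[] args) {
        // Dictionary with a single entry, written with two e-acute characters.
        Set<String> dictionary = foldAll(List.of("\u00e9v\u00e9nement"));
        List<String> candidates = List.of(
                "EVENEMENT",            // upper case, no accents
                "\u00e9v\u00e8nement",  // e-acute then e-grave
                "\u00c9v\u00e9nement",  // capital E-acute
                "av\u00e8nement");      // a different word
        for (String candidate : candidates) {
            System.out.println(fold(candidate) + " -> " + dictionary.contains(fold(candidate)));
        }
    }
}
```

With this folding the first three candidates match the entry and the last one does not. Both the dictionary entries and the text have to be folded with the same function, and the original offsets must be kept for the annotations, which this sketch leaves out.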

Best,


Hugues de Mazancourt
http://about.me/mazancourt




> On 30 Nov 2016, at 15:38, Donatas Remeika <donatas.reme...@gmail.com> wrote:
> 
> Hi,
> 
> Just wanted to let you know that we created a new (probably one more)
> dictionary annotator.
> 
> The reasons for creating it were:
> - Quite often we used Ruta in our pipelines only because of its MARKTABLE
> action, which is able to set several features on an annotation
> - Sometimes dictionaries contain duplicate entries with different features,
> and we need to create annotations for each entry
> - The possibility of using a custom tokenizer for dictionary entries (the
> default is a whitespace tokenizer)
> 
> It was inspired by both DKPro dictionary-annotator and Ruta MARKTABLE. Big
> thanks to their developers!
> 
> Code with examples can be found at
> https://github.com/tokenmill/dictionary-annotator
> 
> BTW, maybe someone knows a Concept Mapper alternative that is more uimaFIT
> friendly?
> 
> Best regards,
> Donatas



Re: [ANNOUNCE] Apache UIMA Ruta 2.6.0 released

2017-03-16 Thread Hugues de Mazancourt
Hi,

Since upgrading to 2.6.0, I can’t get the Ruta Query tool to work in the Ruta
Workbench (no results, even for a « W; » query).
Does anyone else have the same problem? If so, I’ll file an issue in
JIRA.

Thanks,

PS: I’m using a Mac…

Hugues de Mazancourt



> On 10 Mar 2017, at 13:01, Peter Klügl <pklu...@apache.org> wrote:
> 
> The Apache UIMA team is pleased to announce the release of
> Apache UIMA Ruta (Rule-based Text Annotation), version 2.6.0.
> 
> The Unstructured Information Management Architecture (UIMA) is a
> component framework supporting development, discovery, composition, and
> deployment of multi-modal analytics tasked with the analysis of
> unstructured information.
> 
> Apache UIMA is an Apache licensed open source implementation of the UIMA
> specification which is being developed by a technical committee within
> OASIS, a standards organization. The implementation comprises an SDK and
> tooling for composing and running analytic components written in Java
> and C++, with some support for Perl, Python and TCL.
> 
> Apache UIMA Ruta is a rule-based script language supported by
> Eclipse-based tooling. The language is designed to enable rapid
> development of text processing applications within UIMA. A special focus
> lies on the intuitive and flexible domain specific language for defining
> patterns of annotations. The Eclipse-based tooling,
> called the Apache UIMA Ruta Workbench, was created to support the
> user and to facilitate every step when writing rules. Both
> the rule language and the workbench integrate
> smoothly with Apache UIMA.
> 
> Major Changes in this Release
> 
> 
> UIMA Ruta Language and Analysis Engine:
> - Annotation expressions can be restricted using feature matches and
> conditions
> - Several new configuration parameters for RutaEngine
> - Experimental features to optimize internal indexing (for experienced
> users)
> - Minimal support of feature structures in feature match expressions
> - API change report for ruta-core
> - Typesystem descriptors with JCasGen classes are located in separate
> artifact
> - Implementation of RutaBasic is located in separate artifact
> - Many bug fixes and improvements, especially for label expressions
> 
> UIMA Ruta Workbench:
> - Direct debugging of launched scripts in Java is supported
> - Improved error messages in launcher
> - Removed restriction of classpath size causing problems in launcher
> - Deactivated noVM preference
> - Changed UI to set annotation mode in views
> - Launcher uses project encoding
> - Bug fixes
> 
> This release requires an update of script projects and their descriptors
> in the UIMA Ruta Workbench. There are several ways to achieve this.
> The recommended way is to right-click on the UIMA Ruta project and
> select "UIMA Ruta -> Convert to UIMA Ruta project", which will update
> all provided descriptors. Then, select the project and press
> "Project -> Clean..." in the menu, which will regenerate all descriptors
> of your scripts based on the new templates.
> Projects built with the UIMA Ruta Maven Plugin require no manual effort.
> 
> For a full list of the changes, please refer to Jira:
> http://uima.apache.org/d/ruta-2.6.0/issuesFixed/jira-report.html
> 
> More information about UIMA Ruta can be found here:
> http://uima.apache.org/ruta.html
> 
> - Peter Klügl, for the Apache UIMA development team
> 
> 
> 
> 
> 
> 
> 
> 



Re: [ANNOUNCE] Apache UIMA Ruta 2.6.0 released

2017-03-16 Thread Hugues de Mazancourt
Thanks Peter,

I created the issue in JIRA: https://issues.apache.org/jira/browse/UIMA-5371

Best,

Hugues



> On 16 Mar 2017, at 16:37, Peter Klügl <peter.klu...@averbis.com> wrote:
> 
> Hi,
> 
> 
> I can reproduce the problem using Windows.
> 
> I have a slight idea what causes the problem, but I still need to
> validate it.
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> 
> On 16.03.2017 at 15:39, Hugues de Mazancourt wrote:
>> Hi,
>> 
>> Since upgrading to 2.6.0, I can’t get the Ruta Query tool to work in the Ruta
>> Workbench (no results, even for a « W; » query).
>> Does anyone else have the same problem? If so, I’ll file an issue
>> in JIRA.
>> 
>> Thanks,
>> 
>> PS: I’m using a Mac…
>> 
>> Hugues de Mazancourt
>> 
>> 
>> 
>> 
> 



Re: Ruta conflicts with DKPro typesystem

2017-04-10 Thread Hugues de Mazancourt
Hi Peter,

strictImports solves the problem perfectly, thank you.
The advantage over « renaming » the annotation is that it preserves syntax
highlighting in the Ruta Workbench.

Your answer raises another question: you wrote
> If you create the CAS with uimaFIT,
> then there are also types that are not imported in your script. Well, you
> would not even need to import the types in order to use them in your script.

I did create the CAS with uimaFIT. What kind of types are not imported in the
script?

Best,

— Hugues



> On 10 Apr 2017, at 12:25, Peter Klügl <peter.klu...@averbis.com> wrote:
> 
> Hi,
> 
> 
> there are two options for avoiding ambiguous references to types by
> their short name.
> 
> 
> The first one is to use an alias, as you did. However, you have to assign
> an unambiguous alias. Ruta should check whether the alias is ambiguous, but
> obviously doesn't. Try something like:
> 
> IMPORT org.apache.uima.ruta.type.NUM FROM
> org.apache.uima.ruta.engine.BasicTypeSystem AS RutaNum;
> 
> Then you can use "RutaNum" to refer to
> org.apache.uima.ruta.type.NUM in your rules.
> 
> 
> ... or something like IMPORT PACKAGE org.apache.uima.ruta.type FROM
> org.apache.uima.ruta.engine.BasicTypeSystem AS ruta;
> 
> ... then you should be able to use ruta.NUM in your rules.
> 
> 
> (I have not tested either example)
> 
> 
> The second option is to activate the "strictImports" configuration
> parameter. If activated, the type expressions, e.g., by short name, are
> only resolved against the types that are imported. Thus, if you do not
> import the DKPro Core type system, the NUM of the ruta type system will
> be used. If deactivated, the references are resolved against the names
> in the type system of the given CAS. If you create the CAS with uimaFIT,
> then there are also types that are not imported in your script. Well, you
> would not even need to import the types in order to use them in your script.
> 
> 
> Both options have their advantages and disadvantages. Using strictImports
> in generic scripts where you initialize type variables using
> configuration parameters is problematic. If you have a larger pipeline
> with unknown components with unknown type systems, strictImports is
> often required. There may be a conflict with other components, which
> cannot be known when writing the rules.
> 
> 
> btw, there is also an updated exemplary project using DKPro Core in ruta:
> 
> https://github.com/pkluegl/ruta/tree/master/ruta-german-novel-with-dkpro
> 
> 
> 
> Let me know if this helps or if I should provide more information.
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> 
> 
> On 07.04.2017 at 15:01, Hugues de Mazancourt wrote:
>> Hi,
>> 
>> I’m using Ruta to perform information extraction and I mix it in a pipeline 
>> with DKPro-based resources (for POS-tagging and NER). Thus, I have my own 
>> type system, Ruta’s basic type system and some DKPro type systems (especially 
>> the one describing Tokens).
>> 
>> I end up with type conflicts such as this Ruta error:
>> 
>>> java.lang.IllegalArgumentException: NUM is ambiguous, use one of the 
>>> following instead : 
>>> de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.NUM 
>>> org.apache.uima.ruta.type.NUM 
>> I tried to use declarations such as :
>> 
>>> IMPORT org.apache.uima.ruta.type.NUM FROM 
>>> org.apache.uima.ruta.engine.BasicTypeSystem AS NUM;
>> at the top of my Ruta rule files, but this doesn’t help.
>> 
>> I guess using « org.apache.uima.ruta.type.NUM » instead of « NUM » would fix 
>> the problem, but this wouldn’t increase readability of rules !
>> The other solution I see would be to create my own, non-ambiguous, readable 
>> annotation and have a rule that marks all org.apache.uima.ruta.type.NUM with 
>> that annotation, but I’m afraid of performance issues due to these redundant 
>> annotations.
>> 
>> Is there any other solution for Ruta to mask some types or alias them ?
>> 
>> Best,
>> 
>> Hugues de Mazancourt
>> http://about.me/mazancourt
>> 
>> 
>> 
>> 
>> 
> 



Ruta conflicts with DKPro typesystem

2017-04-07 Thread Hugues de Mazancourt
Hi,

I’m using Ruta to perform information extraction and I mix it in a pipeline
with DKPro-based resources (for POS-tagging and NER). Thus, I have my own type
system, Ruta’s basic type system and some DKPro type systems (especially the one
describing Tokens).

I end up with type conflicts such as this Ruta error:

> java.lang.IllegalArgumentException: NUM is ambiguous, use one of the 
> following instead : 
> de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.NUM 
> org.apache.uima.ruta.type.NUM 

I tried to use declarations such as :

> IMPORT org.apache.uima.ruta.type.NUM FROM 
> org.apache.uima.ruta.engine.BasicTypeSystem AS NUM;

at the top of my Ruta rule files, but this doesn’t help.

I guess using « org.apache.uima.ruta.type.NUM » instead of « NUM » would fix
the problem, but this wouldn’t improve the readability of the rules!
The other solution I see would be to create my own unambiguous, readable
annotation and have a rule that marks all org.apache.uima.ruta.type.NUM with
that annotation, but I’m afraid of performance issues due to these redundant
annotations.

Is there any other way for Ruta to mask some types or alias them?

Best,

Hugues de Mazancourt
http://about.me/mazancourt







Limiting the memory used by an annotator ?

2017-04-29 Thread Hugues de Mazancourt
Hello UIMA users,

I’m currently putting a Ruta-based system into production and I sometimes run out
of memory.
This is usually caused by a combinatorial explosion in Ruta rules. These rules are
not necessarily faulty: they are adapted to the documents I expect to parse. But
as this is an open system, people can upload whatever they want, and the parser
crashes by multiplying annotations (or at least spends 20 minutes
garbage-collecting millions of annotations).

Thus, my question is: is there a way to limit the memory used by an annotator,
or to limit the number of annotations made by an annotator, or to limit the
number of matches made by Ruta?
I would rather cancel the parse of a given document than suffer 20 minutes of
downtime for the whole system.
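
The only workaround I can think of on the calling side is a time budget per document, which is not the same thing as a memory limit. A minimal sketch, assuming plain java.util.concurrent; the class and method names are hypothetical, and the lambdas in main stand in for a call such as engine.process(jcas):

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedParse {

    // Runs one parse on a worker thread and gives up after timeoutMillis.
    // Returns an empty Optional when the parse was abandoned.
    public static <T> Optional<T> runWithTimeout(Callable<T> parse, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> result = worker.submit(parse);
        try {
            return Optional.ofNullable(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that never checks
            // for interruption keeps running and keeps its memory.
            result.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("parse failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the real parse: one quick document, one runaway document.
        Optional<String> quick = runWithTimeout(() -> "parsed", 1000);
        Optional<String> runaway = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "parsed";
        }, 100);
        System.out.println(quick.isPresent() + " / " + runaway.isPresent());
    }
}
```

This bounds the time spent on one document but not the heap, and I do not know whether a Ruta rule in the middle of matching reacts to the interrupt, so it is at best a partial answer.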

Several UIMA-based services run in production, so I guess that others have
hit the same problem.

Any hint on that topic would be very helpful.

Thanks,

Hugues de Mazancourt
http://about.me/mazancourt






[Ruta] STRINGLIST behavior

2017-05-17 Thread Hugues de Mazancourt
Hi all,

I struggled for several days with a memory leak on a server running a Ruta
parser.
I think I finally found the cause: I was using a STRINGLIST variable that I
thought was local to an analysis (a JCas), but apparently it is global to the
Ruta interpreter. Since I was adding elements to the list and never
cleared it, things went wrong...
Can you confirm that behavior?
If this is the case, what happens when several Ruta threads are involved?
(I’m creating the instances with UIMAFramework.produceAnalysisEngine(…) with
multiple simultaneous requests)

Thanks,

Hugues de Mazancourt
http://about.me/mazancourt






Re: [Ruta] STRINGLIST behavior

2017-05-17 Thread Hugues de Mazancourt

> On 17 May 2017, at 14:53, Peter Klügl wrote:
> 
> The environments are not reset if the script declaring/initializing
> the variable is called by another one which is called by the main one
> (two stacked scripts). The bug is fixed in the current trunk.
> 
> Is this the case in your application?

That’s exactly my case (2 stacked scripts).

Thanks, Peter

— Hugues



Re: UIMA Ruta thread safe

2017-06-06 Thread Hugues de Mazancourt
Hi,

In my experience, Ruta is actually thread-safe, and using « strictImports »
solves many of the type ambiguities between Ruta and DKPro.
In a concurrent environment, you will certainly gain performance by using a
JCasPool for creating your CASes.

Best,


— Hugues de Mazancourt



> On 6 Jun 2017, at 15:13, Josep María Formentí Serra <jmforme...@aia.es> wrote:
> 
> Thanks Peter,
> 
>> Could it be that you create the CAS differently in your concurrent
>> setting? For example JCasFactory vs ae.newCAS()?
> 
> The CAS is always created using JCasFactory; we will check whether something
> is wrong in the reader.
> 
>> Anyways, this exception in this situation (using DKPro Core) is really
>> annoying. Did you try to activate strictImports?
> 
> It's really annoying. We have a set of requests to test the web service. If
> we execute the test set with one thread, everything is fine and there are no
> exceptions, but when we execute the same set with more than one thread, the
> exceptions appear.
> 
>> If this does not help, do you have a minimal reproducible example?
> 
> No, I don't. I'll check the code, and if I don't find the solution, I'll
> try to prepare a minimal example that reproduces the problem.
> 
> 2017-06-06 11:46 GMT+02:00 Peter Klügl <peter.klu...@averbis.com>:
> 
>> Hi,
>> 
>> 
>> UIMA Ruta should be threadsafe.
>> 
>> 
>> Could it be that you create the CAS differently in your concurrent
>> setting? For example JCasFactory vs ae.newCAS()?
>> 
>> The exception indicates that Ruta cannot resolve the mention "Document"
>> correctly since it is an alias of uima.tcas.DocumentAnnotation and the
>> short name of the DKPro type. This exception should also occur in a
>> non-concurrent setting.
>> 
>> Anyways, this exception in this situation (using DKPro Core) is really
>> annoying. Did you try to activate strictImports?
>> 
>> 
>> If this does not help, do you have a minimal reproducible example?
>> 
>> 
>> 
>> Best,
>> 
>> Peter
>> 
>> 
>> On 06.06.2017 at 11:03, Josep María Formentí Serra wrote:
>>> Hi all,
>>> 
>>>  We are building a web service that uses the RutaEngine directly. During
>>> initialization the engine is built as:
>>> 
>>> UIMAFramework.produceAnalysisEngine(
>>>     AnalysisEngineFactory.createEngineDescription(
>>>         createEngineDescription(RutaEngine.class,
>>>             RutaEngine.PARAM_RULES,
>>>             rule.getRule().getConditions())));
>>> 
>>>  And then we call the *process* method, passing the *JCas*.
>>> 
>>>  Everything works, but when we receive concurrent requests, many
>>> exceptions like these start to appear:
>>> 
>>> Caused by: java.lang.IllegalArgumentException: Document is ambiguous, use
>>> one of the following instead : uima.tcas.DocumentAnnotation
>>> de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document
>>>     at org.apache.uima.ruta.RutaEnvironment.getType(RutaEnvironment.java:459) ~[ruta-core-2.6.0.jar:2.6.0]
>>> 
>>> Caused by: java.lang.NullPointerException: null
>>>     at java.util.ArrayList.addAll(ArrayList.java:577) ~[na:1.8.0_66]
>>>     at org.apache.uima.ruta.condition.ImplicitCondition.eval(ImplicitCondition.java:70) ~[ruta-core-2.6.0.jar:2.6.0]
>>> 
>>>  Is UIMA Ruta thread-safe? Are we doing something wrong?
>>> 
>>> Thanks in advance,
>>>  JM
>>> 
>> 
>> 
> 
> 



Re: Limiting the memory used by an annotator ?

2017-04-30 Thread Hugues de Mazancourt
Thanks to all for your advice.
In my specific case, this was a Ruta problem (Peter, I filed a JIRA issue with
a minimal example), which argues for the « TooManyMatchesException »
feature you propose. I vote for it.

Of course, I already limit the size of input texts, but this is not enough.
One of the main strengths of UIMA is the ability to integrate annotators
produced by third parties. And each annotator is based on assumptions: at least
that its input is a text, made up of words, etc. Thus, pipelines get more and
more complex, without the need to code all of the processing. But, in a production
environment, anything can happen and assumptions may not be respected (e.g.
non-textual data can be sent to the engine(s), etc). Sh** always happens in
production.

My case is a more specific one, but I’m sure it can be generalized.

Thus, any feature that can help limit the damage caused by unexpected input would
be welcome. And a limited-size FsIndexRepository seems to me a simple yet
powerful enough solution to many problems.

Best,

— Hugues


PS: apart from occasional problems, Ruta is a great platform for information
extraction. I love it!

> On 30 Apr 2017, at 12:57, Peter Klügl <peter.klu...@averbis.com> wrote:
> 
> Hi,
> 
> 
> here are some ruta-specific comments in addition to Thilo's and Marshall's 
> answers.
> 
> - if you do not want to split the CAS in smaller ones, you can also sometimes 
> apply the rules just on some parts of the document (-> less annotations/rule 
> matches created)
> 
> - there is a discussion related to this topic (about memory usage in ruta): 
> https://issues.apache.org/jira/browse/UIMA-5306
> 
> - I can include configuration parameters which limit the allowed amount of 
> rule matches and rule element matches of one rule/rule element. If a rule or 
> rule element exceeds it, a new runtime exception is thrown. I'll open a jira 
> ticket for that. This is not a solution for the problem in my opinion, but it 
> can help to identify and fix the problematic rules.
> 
> - I do not want to include code to directly restrict the max memory in ruta. 
> That should rather happen in the framework or in the code that calls/applies 
> the ruta analysis engine.
> 
> - I think there is a problem in ruta and there are several aspects that need 
> to be considered here: the actual rules, the partitioning with RutaBasic, 
> flaws in the implementation and the configuration parameters of the analysis 
> engine
> 
> - Are the rules inefficient (combinatorial explosion)? I see ruta more and more 
> as a programming language for creating maintainable analysis engines faster. 
> You can write efficient and inefficient code. If the code/rules are too slow 
> or take too long, you should refactor them and replace them with a more 
> efficient approach. Something like ANY+ is a good indicator that the rules 
> are not optimal; you should only match on things if you have to. There is 
> also profiling functionality in the Ruta Workbench which shows you how long 
> each rule took and how long specific conditions/actions took. Well, this is 
> information about speed, not about memory, but many rule matches 
> take longer and require more memory, so it could be an indicator.
> 
> - There are two specific aspects how ruta spends its memory: RutaBasic and 
> RuleMatches. RutaBasic stores additional information which speeds up the rule 
> inference and enables specific functionality. The rule matches are needed to 
> remember where something matched, for the conditions and actions. You can 
> reduce the memory usage by reducing the amount of RutaBasic annotations, the 
> amount of the annotations indexed in the RutaBasic annotations, or by 
> reducing the amount of RuleMatches -> refactoring the rules.
> 
> - There are plans to make the implementation of RutaBasic more efficient, by 
> using more efficient data structures (there are some prototypes mentioned in 
> the issue linked above). And I added some new configuration parameters (in 
> ruta 2.6.0 I think) which control which information is stored in RutaBasic, 
> e.g., you do not need information about annotations if they or their types are 
> not used in the rules.
> 
> - I think there is a flaw in the implementation which causes your problem, 
> and which can be fixed. I'll investigate it when I find the time. If you can 
> provide some minimal (synthetic) example for reproducing it, that would be 
> great.
> 
> - There is the configuration parameter lowMemoryProfile for reducing the 
> stuff stored in RutaBasic which reduces the memory usage but makes the rules 
> run slower.
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> 
> On 29.04.2017 at 12:53, Hugues de Mazancourt wrote:
>> Hello UIMA users,

Re: Limiting the memory used by an annotator ?

2017-05-01 Thread Hugues de Mazancourt

> Thanks for the ticket. I haven't checked the implementation yet, but it
> looks as much like a bug as possible.
> The rule looks simple, but the problem is quite complicated, as you could
> replace both rule elements after the wildcard with arbitrarily complex
> composed rule elements. I have to check what exactly went wrong there.

I guess the bug comes from a possible « path » for # between parallel
annotations that confuses the matching.

[…]
>> PS: apart from occasional problems, Ruta is a great platform for 
>> information extraction. I love it!
> 
> Thanks :-) especially for reporting the problem which greatly helps to
> improve ruta

You’re welcome!
I may have found another one… Stay tuned ;-)

— Hugues



Re: UIMA on Spark mimicking CPE pipelines

2017-09-28 Thread Hugues de Mazancourt
Hi,

I would also be very interested. We are working with both UIMA and Spark, but
the two are not directly connected.
Some insight into how this could be done would certainly open up new possibilities.

Best,


Hugues de Mazancourt



> On 27 Sep 2017, at 19:10, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> 
> Hello All,
> 
> I am very happy to hear that this has interest. I work at a for-profit
> company, but we have a process to release full working examples of this.
> We call it technical dissemination. I will work through my organization and
> hopefully provide a bit more than a simple driver.
> 
> Thanks,
> ~Ben
> 
> 
> 
> On Wed, Sep 27, 2017 at 6:07 AM, Benjamin De Boe <
> benjamin.de...@intersystems.com> wrote:
> 
>> Hi Benedict,
>> 
>> I'd be very interested to see an example of this, as we've been playing
>> with the very same idea, but haven't gotten to any actual trial (and
>> error) yet.
>> 
>> Many thanks in advance,
>> benjamin
>> 
>> 
>> Benjamin De Boe
>> Product Manager | InterSystems
>> T: +32 2 464 97 33 | M: +32 495 19 19 27
>> http://www.intersystems.com/
>> 
>> 
>> 
>> -----Original Message-----
>> From: Benedict Holland [mailto:benedict.m.holl...@gmail.com]
>> Sent: Tuesday, September 26, 2017 9:02 PM
>> To: user@uima.apache.org
>> Subject: UIMA on Spark mimicking CPE pipelines
>> 
>> Hello all,
>> 
>> I have a working application that essentially implements the CPE within a
>> spark context. The best part about this is that it does not use UIMAFit or
>> any 3rd party applications. It simply uses hadoop, spark, UIMA, and OpenNLP.
>> 
>> Users are able to configure, design, and build the UIMA pipeline using all
>> of the eclipse XML plugin applications. Instead of running the application
>> via the CPE.process() driver from a main class, it will run from the
>> foreach() function on the Dataframe object.
>> 
>> Oh also, it plugs into a database to get the text and to write results.
>> 
>> Would the UIMA community be interested in getting a working example put
>> together? If so, please feel free to contact me. I think this could be an
>> excellent example of what people would like to use and your examples are
>> particularly good.
>> 
>> Thanks,
>> ~Ben
>> 



Re: UIMA Ruta on an ARM Linux Machine

2017-10-29 Thread Hugues de Mazancourt
Hi Nikolas,

Ruta rules (and rule development) are not tied to Eclipse. Eclipse provides a
very convenient tool for developing Ruta scripts, the Ruta Workbench, but
the Workbench is not required. You can even use your favorite text
editor to write the rules. All you need is a JDK for the architecture you
target; then compile UIMA and Ruta for that target.

As Ruta rules are (sort of) interpreted, it doesn’t matter if your
development environment is not the same as the runtime environment. Thus a
convenient way to work would be to develop and test the rules in Eclipse’s
Ruta Workbench, and then run them on the ARM machine.

Best,


Hugues de Mazancourt
http://www.mazancourt.com
twitter: @mazancourt



> On 29 Oct 2017, at 14:09, Nikolas Nisidis <nikolas.nisi...@fsfe.org> wrote:
> 
> Hi,
> 
> As the title suggests, my question is the following: is it possible and if 
> so, how can I develop a UIMA Ruta project on an ARM machine? To my knowledge, 
> there is no Eclipse version for ARM (or is there?). How should I proceed?
> Is there a way to run at least an already written RUTA script on a text and 
> get the corresponding .xmi file as output?
> 
> Relevant Info:
> $ uname -a
> Linux libre 3.14.0-26-ARCH #1 SMP PREEMPT Sun Jul 2 04:08:36 UTC 2017 armv7l 
> GNU/Linux
> 



Dynamically bind resources to AnalysisEngine

2018-04-11 Thread Hugues de Mazancourt
Hello,

Is there a way to dynamically bind/update resources for an AnalysisEngine?
My use case: I am building a query parser that will be used to retrieve 
information from an indexed text database.
The parser performs spelling correction, but it must not treat words that are 
in the index as spelling mistakes. Thus, the (aggregate) engine is bound to the 
index vocabulary (i.e. a word list).
My point is: when the index gets updated, its vocabulary is updated too. 
I can build a new aggregate parser with the updated resource, but this 
takes time, mainly for loading resources that were already loaded (POS model, 
lexica, etc.). Is there a way to update a given resource on my parser without 
having to rebuild it?

Thanks for your help,
PS: I'm mostly building on top of DKPro components, so I may be missing some 
basic UIMA mechanisms.
Hugues de Mazancourt
Mazancourt Conseil

E: hug...@mazancourt.com
P: +33-6 72 78 70 33
W: http://www.mazancourt.com



Re: Dynamically bind resources to AnalysisEngine

2018-04-15 Thread Hugues de Mazancourt
Thanks to all for your answers.
I guess the simplest method is Marshall’s: having my AE explicitly call 
load() on the resource when changes are detected.

However, reading Richard’s suggestions, especially this one:
> 
> However, if you want to use external resources, having a look at 
> 
> https://svn.apache.org/repos/asf/uima/uimafit/trunk/uimafit-core/src/test/java/org/apache/uima/fit/factory/ExternalResourceFactoryTest.java
>  
> 

…if I read the code correctly, it means that I can bind a POJO as a resource to 
my AE. I thought a shared resource had to be described (and accessed) through a 
DataResource.
If I can directly inject my application’s vocabulary as a resource, then my 
problems are gone, because the vocabulary object gets updated each time the 
index changes. Am I missing something?
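
To make the idea concrete, here is a minimal sketch in plain Java with no UIMA types at all. LiveVocabulary, SpellingChecker and their methods are names made up for this illustration. How such an object would actually be bound to the AE is not shown and would have to be checked against the test class linked above. The only point is that a component holding a reference to a live object sees updates without being rebuilt.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for the index vocabulary; not a UIMA or DKPro class.
final class LiveVocabulary {
    private final AtomicReference<Set<String>> words;

    LiveVocabulary(Set<String> initial) {
        this.words = new AtomicReference<>(Set.copyOf(initial));
    }

    // Called by the application whenever the index (and its word list) changes.
    void replaceAll(Set<String> updated) {
        words.set(Set.copyOf(updated));
    }

    boolean contains(String word) {
        return words.get().contains(word);
    }
}

// Hypothetical stand-in for the spelling-correction step of the parser.
final class SpellingChecker {
    private final LiveVocabulary vocabulary;

    SpellingChecker(LiveVocabulary vocabulary) {
        this.vocabulary = vocabulary;
    }

    // A word found in the index vocabulary is never treated as a misspelling.
    boolean needsCorrection(String word, Set<String> generalLexicon) {
        return !vocabulary.contains(word) && !generalLexicon.contains(word);
    }
}

final class LiveVocabularyDemo {
    public static void main(String[] args) {
        LiveVocabulary vocabulary = new LiveVocabulary(Set.of("uima"));
        SpellingChecker checker = new SpellingChecker(vocabulary);
        Set<String> lexicon = Set.of("parser", "index");

        System.out.println(checker.needsCorrection("ruta", lexicon));
        // The index now contains "ruta"; the checker itself is left untouched.
        vocabulary.replaceAll(Set.of("uima", "ruta"));
        System.out.println(checker.needsCorrection("ruta", lexicon));
    }
}
```

The AtomicReference swap is one possible choice for the case where the index is updated from another thread while the parser is running. A plain field would be enough in a single-threaded setup.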

Best,

— Hugues



Re: Lost in UIMA Ruta Workbench !

2018-02-25 Thread Hugues de Mazancourt
Hello,

The point is that the MARK action simply creates an annotation without any 
features. Reading your code, it seems that you expect the matched value to be 
held in the « value » feature. This is not the case.
You can access the text covered by an annotation through the 
pseudo-feature « .ct » (short for coveredText).
So I guess the following rule should do what you want (not tested):

Document{-> CREATE(INTENT, "value" = "Apply_for_Card")}
    <-{e:ENTITY{e.ct=="card"} # a:ACTIONS{a.ct=="application"};};

But I guess that you will want to capture variations, such as plurals 
(card/cards) or even derivations (apply/application).
In that case, have a look at the MARKTABLE action, which takes as a resource 
a CSV file containing the text to be matched in the input document, plus as 
many features as you want created on the annotation. You could then describe 
dictionary entries for your variations, all of them mapping to a normalized 
value.
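
As a rough picture of what such a table holds, here is a small plain-Java sketch. It only illustrates the data shape (surface form in one column, normalized value in another) and is in no way Ruta's MARKTABLE implementation. The class name VariationTable, the semicolon separator and the sample rows are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustration only: mimics the shape of a lookup table, not Ruta's MARKTABLE.
final class VariationTable {
    private final Map<String, String> normalizedBySurface = new HashMap<>();

    // Each row is "surface form;normalized value", e.g. "cards;card".
    VariationTable(List<String> csvRows) {
        for (String row : csvRows) {
            String[] cells = row.split(";", -1);
            if (cells.length != 2) {
                throw new IllegalArgumentException("Expected 2 columns: " + row);
            }
            String surface = cells[0].trim().toLowerCase(Locale.ROOT);
            normalizedBySurface.put(surface, cells[1].trim());
        }
    }

    // Case-insensitive lookup of the normalized value for one token.
    Optional<String> normalize(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return Optional.ofNullable(normalizedBySurface.get(key));
    }
}

final class VariationTableDemo {
    public static void main(String[] args) {
        VariationTable actions = new VariationTable(List.of(
                "apply;application",
                "applying;application",
                "application;application"));
        System.out.println(actions.normalize("Applying").orElse("<no match>"));
        System.out.println(actions.normalize("close").orElse("<no match>"));
    }
}
```

With a table like this, a rule can compare against the normalized value rather than against every surface form.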

Keep trying. Ruta is a little tricky at first, but it’s worth the effort.

Best regards, 


Hugues de Mazancourt
http://www.mazancourt.com
twitter: @mazancourt



> On 23 Feb 2018, at 17:30, Anna Polychroniou <annapolychron...@gmail.com> wrote:
> 
> Hello,
> I am trying to complete an exercise in NLU using UIMA Ruta.
> I have hit a wall for the last 3 days.
> I would be grateful if you could give a hint on my issue:
> 
> I want to create 2 annotations ENTITY and ACTIONS for a list of sentences.
> I define a list of words for each one.
> Then I want to create a third annotation (INTENT) based on the first 2.
> Different values of ENTITY and ACTIONS must combine the 10 different values
> of INTENT annotation.
> 
> I'm stuck on the final step, where I have to create the combined
> annotation (in bold).
> Could you please help?
> I attach my work below.
> 
> 
> 
> PACKAGE uima.ruta.exercise;
> 
> 
> WORDLIST EntityList = "Entities.txt";
> WORDLIST ActionList = "Actions.txt";
> DECLARE Annotation ENTITY(STRING value);
> DECLARE Annotation ACTIONS(STRING value);
> 
> 
> Document{-> RETAINTYPE(BREAK)};
> DECLARE Sentence;
> BREAK #{->MARK (Sentence)} BREAK;
> 
> DECLARE Annotation INTENT(STRING value);
> BLOCK(ForEach) Sentence{} {
> Document{-> MARKFAST(ENTITY, EntityList)};
> Document{-> MARKFAST(ACTIONS, ActionList)};
> 
> *Document{-> CREATE(INTENT, "value" =
> "Apply_for_Card")}<-{e:ENTITY{e.value=="card"} #
> a:ACTIONS{a.value=="application"};}*
> *;}*
> 
> 
> 
> 
> Thank you,
> Anna



Re: Can UIMA work in docker containers?

2018-10-05 Thread Hugues de Mazancourt
Hello,

I built a UIMA-based parser that is currently used in production.
It is implemented as a web service deployed in Docker containers, and I have 
not run into any problems.

Best,

Hugues de Mazancourt
http://www.mazancourt.com
twitter: @mazancourt



> On 4 Oct 2018, at 05:21, aska@gmail.com wrote:
> 
> Hi all,
> I am new to UIMA. It looks like a great platform for analyzing plain text.
> I would like to run UIMA in containers, but I cannot find any docker image or 
> information regarding using UIMA under docker.
> Is it possible to use UIMA under docker?



Re: What is the status of the CVD viewer?

2018-10-09 Thread Hugues de Mazancourt
Wouldn’t Ruta Workbench be worth a look? I guess its annotation view works for 
standard XMI files, and it is very handy and powerful (searches over a corpus, 
etc.).

Best,

— Hugues

Hugues de Mazancourt
http://www.mazancourt.com
Cell: +33-6 72 78 70 33
twitter: @mazancourt



> On 9 Oct 2018, at 20:55, Marshall Schor wrote:
> 
> It's in svn: https://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-tools/
> 
> cd to some writable directory,
> 
> svn checkout https://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-tools/
> uimaj-tools
> 
> If you're using Eclipse as your ide, you can then import "existing Maven
> projects" and point to the directory where you checked it out.
> 
> Cheers. -Marshall
> 
> On 10/8/2018 3:54 PM, Rune Stilling wrote:
>> Our pipeline takes a long time to run so it’s not practical to use this tool.
>> 
>> Where can I find the source code for the CVD application?
>> 
>> Best,
>> Rune
>> 
>>> On 8 Oct 2018, at 17:48, Marshall Schor wrote:
>>> 
>>> One alternative that may be useful is the DocumentAnalyzer. 
>>> https://uima.apache.org/d/uimaj-current/tools.html#ugr.tools.doc_analyzer
>>> 
>>> Patches welcome :-)
>>> 
>>> -Marshall
>>> 
>>> On 10/8/2018 11:27 AM, Rune Stilling wrote:
>>>> Hi list
>>>> 
>>>> We are using the CVD viewer to view rather complex annotation documents but 
>>>> have stumbled upon some problems.
>>>> 
>>>> First of all, scrolling in the bottom-left annotation pane is not possible on a 
>>>> Mac. The scroll bar simply never shows up, and moving the cursor downwards 
>>>> doesn’t move the contents. This makes the viewer very limited in use.
>>>> 
>>>> Secondly I really miss a search function in the text view especially, so 
>>>> that it would be possible to look up specific words. 
>>>> 
>>>> Is the tool still actively being developed at all? Aren’t people using it, 
>>>> and if not, then how do they analyze their results? Just by looking at the 
>>>> cas.xmi file?
>>>> 
>>>> Best,
>>>> Rune
>> 



Re: Uima and spring

2019-02-21 Thread Hugues de Mazancourt
Hi Sarah,

Good that you found your solution.
Regarding the resources, I don't think it's a good idea to store DKPro
resources locally. If you use Maven, the resources are downloaded by Maven; if
you don't, DKPro uses an Ivy cache to store the downloaded artifacts. In
either case, the resources will be downloaded only once.

Best,

Hugues

On Wed, 20 Feb 2019 at 16:02, Sarah wrote:

> Hi Hugues,
>
> Actually, I found the main cause of my issue: I had some annotators
> wrapped inside of a process method.
> However, I got into the optimising business now. I was wondering about
> your last question. Thus far, I use the standard DKPro resource management.
> Is there a benefit to storing resources locally and binding them to the
> annotators in terms of processing time? You would probably save the time
> for downloading and unpacking the resource, right?
>
> All the best,
> Sarah
>
> > On 20. Feb 2019, at 15:49, Hugues de Mazancourt 
> wrote:
> >
> > Hi Sarah,
> >
> > Sorry it didn't solve your issue.
> > Do all resources get reloaded or just specific ones ?
> > DKPro's components usually perform a check on the JCas typesystem and
> reload the resource if the TS changed. This happen as a prolog in every
> call to process().
> > Thus if the components "feels" that the typesystem has changed for some
> reason, this will trigger a reload of the corresponding resource.
> > Do you use standard DKPro resource management (through language/variant)
> or use an explicit location for your resource (PARAM_RESOURCE_LOCATION,
> ...) ?
> >
> > Best,
> > Hugues de Mazancourt
> >
> > P: 06.72.78.70.33 (tel:06.72.78.70.33 )
> > W: http://www.mazancourt.com <http://www.mazancourt.com/>
> >
> > On Feb 19 2019, at 3:25 pm, Sarah wrote:
> >> Hi all,
> >>
> >> Thanks for the advice!
> >> I have created a JCasPool with - for now - only one JCas instance. I
> run my analysis engines on it, use the results, reset the JCas and release
> it back into the pool. Then I start the same process on the same JCas.
> >> However, the resources still get produced every single time I call
> “process” on my aggregate engine. I assumed that the resource management
> would be taken care of during JCas creation. But that is not the case.
> >>
> >> Does anyone know where exactly the “initialize” method of JCasAnnotator
> is called?
> >> Sarah
> >>> On 18. Feb 2019, at 17:04, Marshall Schor  wrote:
> >>> Hi Sarah,
> >>> I don't have knowledge of DKPro or Spring, but here's some general
> guidance,
> >>> which may (or may not) be of use :-).
> >>>
> >>> External Resources are associated with a Resource Manager instance.
> >>> Try figuring out how to have one Resource Manager instance be reused
> for
> >>> multiple JCas instances.
> >>>
> >>> Also, try to not have multiple JCas instances, beyond what you need to
> keep all
> >>> the cpu "cores" in your host busy.
> >>> Instead of one new JCas instance per piece of work, reusing existing
> instances,
> >>> by calling myJCasInstance.reset() and then using it again.
> >>>
> >>> Hopefully others with specific knowledge may comment also.
> >>> -Marshall
> >>> On 2/18/2019 6:48 AM, Sarah wrote:
> >>>> Hi,
> >>>>
> >>>> I am using uimafit annotators in a spring component. These annotators
> use external resources. These resources are currently produced for every
> JCas even though the Aggregate Engine is created inside of the Spring
> component's init and merely the process method is called on the individual
> JCas objects. This slows my system down.
> >>>> How do I handle external resources appropriately in a spring
> component. I found the SpringContextResourceManager but I don’t know how to
> use it. Can you point me to an example where e.g. the DKPro CoreNLP
> Annotators are used in a spring context?
> >>>>
> >>>> All the best,
> >>>> Sarah
> >>>
> >>
> >>
> >
> >
>
>


Re: Uima and spring

2019-02-20 Thread Hugues de Mazancourt
Hi Sarah,

Sorry it didn't solve your issue.
Do all resources get reloaded, or just specific ones?
DKPro's components usually perform a check on the JCas typesystem and reload 
the resource if the TS changed. This happens as a prologue in every call to 
process().
Thus, if a component "feels" that the typesystem has changed for some reason, 
this will trigger a reload of the corresponding resource.
Do you use standard DKPro resource management (through language/variant) or 
an explicit location for your resource (PARAM_RESOURCE_LOCATION, ...)?
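
To show the kind of check I mean, here is a sketch of the general "reload only when the key changes" pattern in plain Java. It is a guess at the pattern, not DKPro's actual code: ReloadOnChange, its loader function and the use of an Object as a stand-in for the typesystem are all assumptions made for this illustration.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch of "reload only when the key changes"; not DKPro code.
final class ReloadOnChange<K, R> {
    private final Function<K, R> loader;
    private boolean loaded;
    private K lastKey;
    private R resource;
    private int loadCount;

    ReloadOnChange(Function<K, R> loader) {
        this.loader = loader;
    }

    // Meant to run as a prologue of every process() call.
    synchronized R get(K currentKey) {
        if (!loaded || !Objects.equals(currentKey, lastKey)) {
            resource = loader.apply(currentKey); // the expensive part
            lastKey = currentKey;
            loaded = true;
            loadCount++;
        }
        return resource;
    }

    synchronized int loadCount() {
        return loadCount;
    }
}

final class ReloadOnChangeDemo {
    public static void main(String[] args) {
        ReloadOnChange<Object, String> model =
                new ReloadOnChange<>(key -> "model for " + key);

        // Same key object on every call: the loader runs only once.
        Object stableTypeSystem = new Object();
        model.get(stableTypeSystem);
        model.get(stableTypeSystem);
        System.out.println("loads with a stable key: " + model.loadCount());

        // A fresh key object on every call: the loader runs every time.
        model.get(new Object());
        model.get(new Object());
        System.out.println("loads after two fresh keys: " + model.loadCount());
    }
}
```

If every call presents a key that does not compare equal to the previous one, the loader runs each time, which would look exactly like the symptom you describe.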

Best,
Hugues de Mazancourt

P: 06.72.78.70.33
W: http://www.mazancourt.com

On Feb 19 2019, at 3:25 pm, Sarah wrote:
> Hi all,
>
> Thanks for the advice!
> I have created a JCasPool with - for now - only one JCas instance. I run my 
> analysis engines on it, use the results, reset the JCas and release it back 
> into the pool. Then I start the same process on the same JCas.
> However, the resources still get produced every single time I call “process” 
> on my aggregate engine. I assumed that the resource management would be taken 
> care of during JCas creation. But that is not the case.
>
> Does anyone know where exactly the “initialize” method of JCasAnnotator is 
> called?
> Sarah
> > On 18. Feb 2019, at 17:04, Marshall Schor  wrote:
> > Hi Sarah,
> > I don't have knowledge of DKPro or Spring, but here's some general guidance,
> > which may (or may not) be of use :-).
> >
> > External Resources are associated with a Resource Manager instance.
> > Try figuring out how to have one Resource Manager instance be reused for
> > multiple JCas instances.
> >
> > Also, try to not have multiple JCas instances, beyond what you need to keep 
> > all
> > the cpu "cores" in your host busy.
> > Instead of one new JCas instance per piece of work, reusing existing 
> > instances,
> > by calling myJCasInstance.reset() and then using it again.
> >
> > Hopefully others with specific knowledge may comment also.
> > -Marshall
> > On 2/18/2019 6:48 AM, Sarah wrote:
> > > Hi,
> > >
> > > I am using uimafit annotators in a spring component. These annotators use 
> > > external resources. These resources are currently produced for every JCas 
> > > even though the Aggregate Engine is created inside of the Spring 
> > > component's init and merely the process method is called on the 
> > > individual JCas objects. This slows my system down.
> > > How do I handle external resources appropriately in a spring component. I 
> > > found the SpringContextResourceManager but I don’t know how to use it. 
> > > Can you point me to an example where e.g. the DKPro CoreNLP Annotators 
> > > are used in a spring context?
> > >
> > > All the best,
> > > Sarah
> >
>
>



Re: Uima and spring

2019-02-18 Thread Hugues de Mazancourt
Hi Sarah,

We use UimaFit annotators in a Spring (Boot) component, but we get our JCases 
from a JCasPool.
Apparently, the resources are somewhat tied to the JCas (at least when using 
DKPro's init method, which loads resources depending on the JCas language).
With the pool, the resources are loaded only once for each entry in the JCas 
pool. This is still too much, but it reduces the overhead.
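
The pooling pattern itself can be sketched in plain Java as follows. This is not UIMA's JCasPool: Workspace is a made-up stand-in for a JCas, and WorkspacePool with its acquire/release methods is a hypothetical minimal pool, shown only to illustrate that the costly creation happens once per pool entry and that each entry is reset before reuse.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a JCas: costly to create, cheap to reset.
final class Workspace {
    private final StringBuilder text = new StringBuilder();

    void setText(String newText) {
        text.setLength(0);
        text.append(newText);
    }

    String getText() {
        return text.toString();
    }

    void reset() {
        text.setLength(0);
    }
}

// Hypothetical fixed-size pool; all entries are created up front.
final class WorkspacePool {
    private final BlockingQueue<Workspace> idle;

    WorkspacePool(int size, Supplier<Workspace> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until an entry is free.
    Workspace acquire() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }

    // Clears the entry and hands it back for the next piece of work.
    void release(Workspace workspace) {
        workspace.reset();
        idle.add(workspace);
    }
}

final class WorkspacePoolDemo {
    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        WorkspacePool pool = new WorkspacePool(2, () -> {
            created.incrementAndGet();
            return new Workspace();
        });
        for (String doc : new String[] {"first text", "second text", "third text"}) {
            Workspace workspace = pool.acquire();
            try {
                workspace.setText(doc);
                // ... run the analysis on the workspace here ...
            } finally {
                pool.release(workspace);
            }
        }
        System.out.println("workspaces created: " + created.get());
    }
}
```

Sizing the pool to the number of worker threads keeps the number of creations, and of whatever is loaded per entry, bounded regardless of how many documents are processed.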

Best,
Hugues de Mazancourt

P: 06.72.78.70.33
W: http://www.mazancourt.com

On Feb 18 2019, at 12:48 pm, Sarah wrote:
> Hi,
>
> I am using uimafit annotators in a spring component. These annotators use 
> external resources. These resources are currently produced for every JCas 
> even though the Aggregate Engine is created inside of the Spring component's 
> init and merely the process method is called on the individual JCas objects. 
> This slows my system down.
> How do I handle external resources appropriately in a spring component. I 
> found the SpringContextResourceManager but I don’t know how to use it. Can 
> you point me to an example where e.g. the DKPro CoreNLP Annotators are used 
> in a spring context?
>
> All the best,
> Sarah
>



Re: Default value of lexer/seeder in RutaEngine

2021-02-25 Thread Hugues de Mazancourt
Hi,

For simple corpus exploration, I agree that it would be a better default.
In our case, we use our own pipeline to stay consistent with higher-level 
annotations (such as POS); it is not that anything is wrong with the Seeder.

Best,

Hugues


> On 25 Feb 2021, at 14:41, Peter Klügl wrote:
> 
> Hi,
> 
> 
> I am thinking about changing the default value of the seeder parameter
> in the RutaEngine from DefaultSeeder to TextSeeder. I think TextSeeder
> (no MARKUP annotations) is a better default value in most use cases.
> 
> 
> Are there opinions on that?
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> -- 
> Dr. Peter Klügl
> Head of Text Mining/Machine Learning
> 
> Averbis GmbH
> Salzstr. 15
> 79098 Freiburg
> Germany
> 
> Fon: +49 761 708 394 0
> Fax: +49 761 708 394 10
> Email: peter.klu...@averbis.com
> Web: https://averbis.com
> 
> Headquarters: Freiburg im Breisgau
> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>