from:"Marshall Schor"

Re: Performance degradation in UIMA3

2020-06-29 Thread Marshall Schor

Hi,

I've made a possible fix for this in the uimaj-master (the uimaj-core project)

It would be great if you could test this on your big system.

Thanks.

-Marshall

P.S., it's quite possible this issue was the "first one", and there may be
others which will now surface (assuming this one is now fixed), for this level
of multi-threading.  We would love to hear of other issues if they arise :-)

On 6/29/2020 6:15 AM, Augusto Ribeiro Silva wrote:
> Hi all,
>
> I am writing because I noticed that our pipelines got considerably slower 
> after updating to UIMA3. The cause of the slowdown seems to be the fact that 
> the method getJCasRegisteredType located in TypeSystemImpl uses a 
> synchronised block. We haven't noticed it before but when I was running a 
> pipeline in a large machine (48 cores with 80 worker threads) many of the 
> threads were blocked in this specific method.
>
> I just wanted to point that this is a big problem for us and I am not sure if 
> you are aware or if there is some workaround. I am at this point considering 
> packaging our own version of UIMA without the synchronised block since our 
> types are not dynamic and should be loaded when the pipeline starts so it is 
> unlikely that we need the synchronised access to the type registry.
>
> Best regards,
> Augusto
>
>
> 
> Disclaimer:
> This email and any files transmitted with it are confidential and directed 
> solely for the use of the intended addressee or addressees and may contain 
> information that is legally privileged, confidential, and exempt from 
> disclosure. If you have received this email in error, please notify the 
> sender by telephone, fax, or return email and immediately delete this email 
> and any files transmitted along with it. Unintended recipients are not 
> authorized to disclose, disseminate, distribute, copy or take any action in 
> reliance on information contained in this email and/or any files attached 
> thereto, in any manner other than to notify the sender; any unauthorized use 
> is subject to legal prosecution.
>
>
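
The contention described in the quoted report can be illustrated with a generic sketch. This is not the UIMA `TypeSystemImpl` code and not the fix tracked in the Jira issue; `ReadMostlyRegistry` and `getOrRegister` are hypothetical names, and the sketch assumes a registry that is filled once at startup and then only read by many worker threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical read-mostly registry; a stand-in, not UIMA code. */
class ReadMostlyRegistry<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();

    /** Returns the registered value, creating it on first use. */
    V getOrRegister(K key, Function<K, V> loader) {
        V existing = entries.get(key);   // common path: no monitor is acquired
        if (existing != null) {
            return existing;
        }
        // Rare path: the loader runs at most once per key.
        return entries.computeIfAbsent(key, loader);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ReadMostlyRegistry<String, Integer> registry = new ReadMostlyRegistry<>();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    registry.getOrRegister("com.example.Foo", String::length);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("registered keys: " + registry.size());
    }
}
```

With a single `synchronized` block around the lookup, every worker would queue on the same monitor even when the entry already exists; the map-based version only coordinates threads when a key is first registered.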

Re: Performance degradation in UIMA3

2020-06-29 Thread Marshall Schor

https://issues.apache.org/jira/browse/UIMA-6249

On 6/29/2020 8:32 AM, Marshall Schor wrote:
> Hi,
>
> Thanks for this. Investigating... -Marshall
>
>
> On 6/29/2020 6:15 AM, Augusto Ribeiro Silva wrote:
>> Hi all,
>>
>> I am writing because I noticed that our pipelines got considerably slower 
>> after updating to UIMA3. The cause of the slowdown seems to be the fact that 
>> the method getJCasRegisteredType located in TypeSystemImpl uses a 
>> synchronised block. We haven't noticed it before but when I was running a 
>> pipeline in a large machine (48 cores with 80 worker threads) many of the 
>> threads were blocked in this specific method.
>>
>> I just wanted to point that this is a big problem for us and I am not sure 
>> if you are aware or if there is some workaround. I am at this point 
>> considering packaging our own version of UIMA without the synchronised block 
>> since our types are not dynamic and should be loaded when the pipeline 
>> starts so it is unlikely that we need the synchronised access to the type 
>> registry.
>>
>> Best regards,
>> Augusto

Re: Performance degradation in UIMA3

2020-06-29 Thread Marshall Schor

Hi,

Thanks for this. Investigating... -Marshall


On 6/29/2020 6:15 AM, Augusto Ribeiro Silva wrote:
> Hi all,
>
> I am writing because I noticed that our pipelines got considerably slower 
> after updating to UIMA3. The cause of the slowdown seems to be the fact that 
> the method getJCasRegisteredType located in TypeSystemImpl uses a 
> synchronised block. We haven't noticed it before but when I was running a 
> pipeline in a large machine (48 cores with 80 worker threads) many of the 
> threads were blocked in this specific method.
>
> I just wanted to point that this is a big problem for us and I am not sure if 
> you are aware or if there is some workaround. I am at this point considering 
> packaging our own version of UIMA without the synchronised block since our 
> types are not dynamic and should be loaded when the pipeline starts so it is 
> unlikely that we need the synchronised access to the type registry.
>
> Best regards,
> Augusto

Re: Issues upgrading UIMA from 2.x to 3.1

2020-05-29 Thread Marshall Schor

Hi Veena,

Sorry you're having some troubles...

These problems are usually due to classpath issues.

When you do the JCasGen, it will have put the generated jcas classes somewhere.

These newly generated classes have to be in the classpath instead of the
previously generated JCas classes.

Additionally, UIMA Version 3 doesn't use the classes named xxx_Type (e.g.
EntityAnnotation_Type).
Please make sure these (and of course, the version 2 of the JCas gen'd classes)
are not in the class path.
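
One way to check which copy of a class is actually being picked up is to ask the JVM where it loaded the class from. The following is a minimal sketch using only JDK calls; `ClasspathCheck` is a hypothetical helper, and the default class names are taken from the report above (the `com.tas` package for the `_Type` class is an assumption).

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper; uses only JDK APIs. */
class ClasspathCheck {
    /** Returns the jar or directory a loaded class came from, if known. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(no code source, e.g. a JDK class)"
                : src.getLocation().toString();
    }

    /** True if the named class can be loaded (it is not initialized). */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** One-line report for a class name. */
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return className + " <- " + locationOf(c);
        } catch (ClassNotFoundException e) {
            return className + ": not found on classpath";
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args
                : new String[] {"com.tas.EntityAnnotation", "com.tas.EntityAnnotation_Type"};
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

If the `_Type` class is reported as loadable, or the annotation class comes from an older jar than expected, stale version 2 classes are probably still on the classpath.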

-Marshall

On 5/28/2020 4:47 PM, Veena Reddy wrote:
> Trying to update uima from 2.x to 3.1. 
>
> As detailed in the upgrade notes, I ran JCasGen to generate new Annotation
> classes. The Annotator class extends JCasAnnotator_ImplBase and
> implements the process(JCas) method.
>
> During execution a ClassCastException is thrown in the process(JCas)
> method of the Annotator class when creating an annotation using the
> constructor Annotation(JCas jcas, int begin, int end).
>
>  com.tas.EntityAnnotation cannot be cast to org.apache.uima.jcas.impl.JCasImpl
> at org.apache.uima.jcas.cas.TOP.<init>(TOP.java:87)
> at org.apache.uima.jcas.cas.AnnotationBase.<init>(AnnotationBase.java:92)
> at org.apache.uima.jcas.tcas.Annotation.<init>(Annotation.java:83)
>
> org.apache.uima.jcas.cas.TOP constructor implementation changed between both 
> 2.x and 3.x versions and the ClassCastException is being thrown in the 
> TOP.java JCas constructor. 
>
>   public TOP(JCas jcas) {
>   super((JCasImpl) jcas);
>   }
>
>  Any help will be greatly appreciated. 
>
> Best Regards,
> VR
>

Re: list

2020-05-02 Thread Marshall Schor

Hi,

It looks like you might be trying to subscribe to the uima-users mailing list at
apache, unsuccessfully.

To subscribe, send an email to user-subscr...@uima.apache.org

(Notice the "-subscribe" that is part of the "to" address.)  The email may be
"empty" - no content.  It probably works with no subject, but sometimes
"filters" might reject it if it has no subject, so you can put anything you
want for that.

Cheers. -Marshall

On 4/30/2020 9:18 AM, Reinhard Erich Voglmaier wrote:
> Reinhard Erich Voglmaier, CISA
> Medical Excellence and Governance Dept.
> GlaxoSmithKline Spa - Pharmaceuticals

Re: UIMA v2 CAS deserialization in UIMA v3

2020-04-10 Thread Marshall Schor

a most lovely sentiment; you're most welcome :-)  -Marshall

On 4/8/2020 5:20 PM, Peter Klügl wrote:
> Hi,
>
>
> I currently deserialize some UIMA v2 CAS files in UIMA v3 while the
> typesystems have evolved considerably.
>
> I faced some problems, but managed to do what I wanted.
>
>
> I just wanted to thank you Marshall for the great work you did and do :-)
>
>
> Best,
>
>
> Peter
>
>
>

Re: UIMA RUTA NPE in RutaLiteralMatcher (UIMA-6915)

2020-03-19 Thread Marshall Schor

obviously, Peter's reply (which I did not see) takes precedence :-)  -Marshall

On 3/19/2020 10:09 AM, Marshall Schor wrote:
> Hi, thanks for reminding about this issue.  (CCing the dev list, also)
>
> Patches are welcome.  I note that the github mirror of this project is not
> up-to-date, so please use the SVN source.
>
> You can checkout the svn source from
> https://svn.apache.org/repos/asf/uima/uv3/ruta-v3/tags/ruta-3.0.0/
>
> This is the version 3 tag.  If you need to use the "trunk" you can do that 
> too.
>
> It would be great to include a simple test case as well :-)
>
> -Marshall
>
> On 3/19/2020 3:53 AM, Dominic Jehle wrote:
>> Hi,
>> We've been using UIMA and Ruta in production in a machine-translation 
>> related project for a while, but during the update to Ruta we've encountered 
>> the NPE bug described in UIMA-6195 
>> [https://issues.apache.org/jira/browse/UIMA-6195]. It's a critical issue for 
>> our project, the bug blocks us from completing the update. 
>> There has been no documented Jira activity on the issue. Can the issue be 
>> worked on and corrected? Is it possible to supply a patch from our side to 
>> help?
>> If it will be fixed, could there also be a release soon?
>> Thanks!

Re: UIMA RUTA NPE in RutaLiteralMatcher (UIMA-6915)

2020-03-19 Thread Marshall Schor

Hi, thanks for reminding about this issue.  (CCing the dev list, also)

Patches are welcome.  I note that the github mirror of this project is not
up-to-date, so please use the SVN source.

You can checkout the svn source from
https://svn.apache.org/repos/asf/uima/uv3/ruta-v3/tags/ruta-3.0.0/

This is the version 3 tag.  If you need to use the "trunk" you can do that too.

It would be great to include a simple test case as well :-)

-Marshall

On 3/19/2020 3:53 AM, Dominic Jehle wrote:
> Hi,
> We've been using UIMA and Ruta in production in a machine-translation related 
> project for a while, but during the update to Ruta we've encountered the NPE 
> bug described in UIMA-6195 [https://issues.apache.org/jira/browse/UIMA-6195]. 
> It's a critical issue for our project, the bug blocks us from completing the 
> update. 
> There has been no documented Jira activity on the issue. Can the issue be 
> worked on and corrected? Is it possible to supply a patch from our side to 
> help?
> If it will be fixed, could there also be a release soon?
> Thanks!

Re: Java 11 - UIMA-AS 2.10.3

2020-02-26 Thread Marshall Schor

Hi,

This problem was fixed in core uima (uimaj) in version 2.10.3, see Jira issue:
https://issues.apache.org/jira/browse/UIMA-5754

But uima-as version 2.10.3 was built/delivered with a previous version of core
uima (uimaj), and doesn't have this fix.

We'll look into fixes/workarounds for this earlier version.
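
For background, the cast in the quoted trace fails because on Java 9 and later the application class loader is no longer a `URLClassLoader`. The sketch below is a generic illustration of avoiding that cast, not the change made under UIMA-5754; `LoaderCheck` and its methods are hypothetical names.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing how to avoid casting the system class loader. */
class LoaderCheck {
    /** Safe replacement for a blind (URLClassLoader) cast. */
    static boolean isUrlClassLoader(ClassLoader cl) {
        return cl instanceof URLClassLoader;
    }

    /** Turns a classpath string (e.g. from java.class.path) into URLs. */
    static URL[] classpathUrls(String classpath) {
        List<URL> urls = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            try {
                urls.add(new File(entry).toURI().toURL());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("bad classpath entry: " + entry, e);
            }
        }
        return urls.toArray(new URL[0]);
    }

    /** Adds extra URLs through a child loader instead of mutating the system loader. */
    static URLClassLoader childLoader(URL[] extra) {
        return new URLClassLoader(extra, ClassLoader.getSystemClassLoader());
    }

    public static void main(String[] args) {
        ClassLoader system = ClassLoader.getSystemClassLoader();
        System.out.println("system loader is a URLClassLoader: " + isUrlClassLoader(system));
        URL[] urls = classpathUrls(System.getProperty("java.class.path", ""));
        System.out.println("classpath entries: " + urls.length);
    }
}
```

Creating a child `URLClassLoader` works on both older and newer JVMs, whereas adding URLs to the system loader by reflection depends on its concrete class.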

-Marshall


On 2/26/2020 12:58 AM, Hai-Son Nguyen wrote:
> Hi,
> I am receiving an exception when running:
>    bin/runUimaClass.sh org.apache.uima.adapter.jms.service.UIMA_Service ...
> using Java 11, with both the Oracle
>    java 11.0.6 2020-01-14 LTS
> and the OpenJDK versions:
>    openjdk 11.0.3 2019-04-16
>
> Exception in thread "main" java.lang.ClassCastException: class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and java.net.URLClassLoader are in module java.base of loader 'bootstrap')
>   at org.apache.uima.bootstrap.UimaBootstrap.addUrlsToSystemLoader(UimaBootstrap.java:146)
>   at org.apache.uima.bootstrap.UimaBootstrap.main(UimaBootstrap.java:74)
>
> Thanks!
> Hai-Son

Re: Serializing CAS into XMI using UTF-8

2020-01-07 Thread Marshall Schor

not sure, but I think we don't have any code that handles utf-16 encoding when
reading an external xmi cas.
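
To help narrow this down outside of UIMA, here is a JDK-only sketch that checks whether a UTF-16 byte stream with a matching XML declaration parses with the standard JAXP parser. It does not call `CasIOUtils` or `XmiCasSerializer`; `Utf16RoundTrip` is a hypothetical name, and the sketch assumes the text contains no XML special characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical JDK-only check of a UTF-16 XML round trip. */
class Utf16RoundTrip {
    /** Builds a tiny XML document encoded as UTF-16 (byte-order mark plus big-endian). */
    static byte[] utf16Document(String text) {
        // Assumes 'text' needs no XML escaping.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><doc>" + text + "</doc>";
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    /** Parses the bytes and returns the text content of the root element. */
    static String parseRootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRootText(utf16Document("caf\u00e9")));
    }
}
```

If a file written by the serializer fails where this round trip succeeds, comparing the file's first bytes with its declared encoding may show whether the writer and the declaration disagree.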

-Marshall

On 1/7/2020 3:30 AM, Rune Stilling wrote:
> Hi list
>
> I’ve run into a problem with serializing a cas in UTF-16 encoding. I use the 
> following code:
>> XMLSerializer xmlSerializer = new XMLSerializer(pw);
>> xmlSerializer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
>> XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(cas.getTypeSystem());
>> xmiCasSerializer.serialize(cas.getCas(), xmlSerializer.getContentHandler());
> When I try to deserialize this code with the CasIOUtils.load(…) method I get 
> an exception:
>> [Fatal Error] :1:40: Content is not allowed in prolog.
>
> If I set the encoding to UTF-8 there’s no issue.
>
> Best,
> Rune

Re: A new page, UIMA Java cookbook, added to website

2019-11-22 Thread Marshall Schor

cool... I did a slight reformatting that might increase readability - more
"white-space" :-)

-Marshall


On 11/22/2019 3:49 PM, Richard Eckart de Castilho wrote:
> Cool!
>
> I have elaborated a bit - although I fear my elaborations might be a bit 
> cryptic... 
>
> The cookbook also inspired me to new feature requests:
>
> - https://issues.apache.org/jira/browse/UIMA-6151
> - https://issues.apache.org/jira/browse/UIMA-6152
>
> Cheers,
>
> -- Richard

A new page, UIMA Java cookbook, added to website

2019-11-22 Thread Marshall Schor

Hi,

I've put together a new page for the UIMA website, with items that are pretty
simple, but I see in user code that sometimes they're not done properly or
effectively.

The new page is here: https://uima.apache.org/doc-uimaj-cookbook.html

It's linked from the main documentation page, as one of the getting-started 
pages.

I'm sure it can be improved / expanded, but it's a start :-)

Feedback, suggestions for improvement, welcome.

-Marshall

Re: Adding methods to UIMA annotation types defined in XML

2019-11-15 Thread Marshall Schor

Interesting...

I'm wondering how to arrange things so that for JCas class "x.y.z.Foo",
you could have an associated class or interface that could make use of the
getters and setters for the features in the Foo type.
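
A minimal sketch of the idea in plain Java, with no UIMA dependency: `SpanLike` and `FooStandIn` are hypothetical stand-ins for an interface and a generated class, and the getters mimic what generated code would provide.

```java
/** Hypothetical interface carrying extra behaviour as default methods. */
interface SpanLike {
    int getBegin();   // would be supplied by the generated class
    int getEnd();     // would be supplied by the generated class

    /** Derived value computed from the generated getters. */
    default int getLength() {
        return getEnd() - getBegin();
    }

    /** True if this span fully contains the other one. */
    default boolean covers(SpanLike other) {
        return getBegin() <= other.getBegin() && other.getEnd() <= getEnd();
    }
}

/** Stand-in for a generated class such as x.y.z.Foo. */
class FooStandIn implements SpanLike {
    private final int begin;
    private final int end;

    FooStandIn(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int getBegin() {
        return begin;
    }

    @Override
    public int getEnd() {
        return end;
    }

    public static void main(String[] args) {
        FooStandIn outer = new FooStandIn(0, 20);
        FooStandIn inner = new FooStandIn(3, 10);
        System.out.println(inner.getLength() + " " + outer.covers(inner));
    }
}
```

The extra methods live entirely in the interface, so regenerating the class would only need to keep the `implements` clause.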

-Marshall

On 11/15/2019 11:34 AM, Richard Eckart de Castilho wrote:
> On 15. Nov 2019, at 17:26, Marshall Schor  wrote:
>> Also, you might not run the JCasGen code very often, because it only would need
>> to be run if the type system changed.
> I think with Java supporting default methods in interfaces these days, it would
> be great if there was a way to have an annotation type implement additional
> interfaces. That could allow for a nice decoupling of generated code and extra
> functionality if the interfaces e.g. inherit from the FeatureStructure interface
> and thereby would have access to the methods to get/set features.
>
> -- Richard

Re: Adding methods to UIMA annotation types defined in XML

2019-11-15 Thread Marshall Schor

On 11/15/2019 11:20 AM, Alain Désilets wrote:
> Yeah, but won't my additions be destroyed the next time I modify the XML
> file and regenerate the typesystem?

Maybe not.  It might depend on exactly how you are generating.  The basic
JCasGen software (that builds the code) is careful to preserve the existing
modifications you've made.

Also, you might not run the JCasGen code very often, because it only would need
to be run if the type system changed.

See: 
http://uima.apache.org/d/uimaj-2.10.4/references.html#ugr.ref.jcas.keeping_augmentations_when_regenerating

-Marshall

Re: Adding methods to UIMA annotation types defined in XML

2019-11-15 Thread Marshall Schor

Even though the JCas classes can be generated from the XML file, you are allowed
to add additional things to those source files, including

   - additional fields

   - additional methods

See
http://uima.apache.org/d/uimaj-2.10.4/references.html#ugr.ref.jcas.augmenting_generated_code
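
As a rough illustration of a "derived attribute" added by hand, here is a plain-Java stand-in. It is not real JCasGen output: `RelationAnnotationStandIn` and its `source` and `target` features are assumptions made up for the example.

```java
/** Stand-in for a generated class; the features are hypothetical. */
class RelationAnnotationStandIn {
    // --- part that a generator would produce (simplified) ---
    private String source;
    private String target;

    String getSource() {
        return source;
    }

    void setSource(String v) {
        source = v;
    }

    String getTarget() {
        return target;
    }

    void setTarget(String v) {
        target = v;
    }

    // --- hand-written additions: derived attributes ---

    /** Computed from the primitive features, so annotators never set it. */
    String getLabel() {
        return getSource() + "->" + getTarget();
    }

    /** True if the relation points back at its own source. */
    boolean isSelfRelation() {
        return getSource() != null && getSource().equals(getTarget());
    }

    public static void main(String[] args) {
        RelationAnnotationStandIn rel = new RelationAnnotationStandIn();
        rel.setSource("Alice");
        rel.setTarget("Bob");
        System.out.println(rel.getLabel());
    }
}
```

Because the derived values are computed on demand, every annotator that creates the annotation gets the same result without having to set them.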

-Marshall

On 11/15/2019 9:50 AM, Alain Désilets wrote:
> On Thu, Nov 14, 2019 at 4:51 PM Richard Eckart de Castilho 
> wrote:
>
>> Sure. You generate the JCas classes once and then you add the methods you
>> want
>> to them. Cf. e.g.
>>
>>
>> https://github.com/dkpro/dkpro-core/blob/8043e10bf10a61fe47e21946ea609bda9f2278a0/dkpro-core-api-metadata-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/metadata/type/DocumentMetaData.java#L290-L447
>
> I know how to create a subclass of Annotation, RelationAnnotation in my
> case. The problem is that if I try to use this subclass in an Annotator,
> UIMA complains that RelationAnnotation is not in the UIMA type system, and
> it lists the available types. This list is essentially the list of types
> defined in some UIMA xml file. This tells me that only those annotation
> classes defined in the xml file can be used in an Annotator. Or at least,
> that I am missing a step for registering my RelationAnnotation class with
> the UIMA type system.
>
> On the other hand, if I define the RelationAnnotation in the xml file, I
> can use it in an Annotator but then I can't figure out how to add methods
> to it, since the Java source for that class is generated automatically (by
> some UIMA maven plugin I presume).
>
>> But the question is: why do you want to add new methods? (and is it really
>> a good idea?)
>>
> Essentially, I want to add methods for "derived attributes", i.e.
> attributes whose values are computed from primitive attributes defined in
> the xml file.
>
> I guess I could make those attributes be primitive (i.e. defined in the xml
> file), but then, any annotator that creates a RelationAnnotation would have
> to make sure to set those other attributes correctly. I would much rather
> have the RelationAnnotation class compute those derived attributes itself,
> as it guarantees that they will always be computed the same way.
>
> Alain
>

Re: Erratic nullpointer exceptions because feature structure has no type in Ruta

2019-11-15 Thread Marshall Schor

good catch... I (obviously) didn't consider the error path...  -Marshall

On 11/12/2019 3:48 AM, Richard Eckart de Castilho wrote:
> On 11. Nov 2019, at 15:58, Marshall Schor  wrote:
>> This stack trace seems impossible, because the 3rd line shows
>>
>> CASImpl.ll_getFSRef calling a FeatureStructureImpl.toString method, which I
>> believe it doesn't do.
> Actually 
>
>> org.apache.uima.cas.impl.CASImpl.ll_getFSRef(CASImpl.java:3653)
> in UIMA 2.10.2 the code at this line is:
>
> *uimaj-core/src/main/java/org/apache/uima/cas/impl/CASImpl.java:3644ff* [1]
> ```
>   @Override
>   public final int ll_getFSRef(FeatureStructure fsImpl) {
>     if (null == fsImpl) {
>       return NULL;
>     }
>     final FeatureStructureImpl fsi = (FeatureStructureImpl) fsImpl;
>     if (this != fsi.getCASImpl()) {
>       if (this.getBaseCAS() != fsi.getCASImpl().getBaseCAS()) {
>         // https://issues.apache.org/jira/browse/UIMA-3429
>         throw new CASRuntimeException(CASRuntimeException.DEREF_FS_OTHER_CAS,
>             new Object[] {fsi.toString(), this.toString() } );
>       }
>     }
>     return fsi.getAddress();
>   }
> ```
>
> So there *IS* in fact a call to `FeatureStructureImpl.toString()` here!
>
> Cheers,
>
> -- Richard
>
> [1] 
> https://github.com/apache/uima-uimaj/blob/uimaj-2.10.2/uimaj-core/src/main/java/org/apache/uima/cas/impl/CASImpl.java#L3643-L3657
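
The pattern in this thread, where building an error message calls `toString()` and that call itself throws, can be guarded against in general. The sketch below is a generic illustration and not a proposed UIMA patch; `SafeDescribe` and `Broken` are hypothetical names.

```java
/** Hypothetical helper: describe an object without letting toString() mask the real error. */
class SafeDescribe {
    static String describe(Object o) {
        if (o == null) {
            return "null";
        }
        try {
            return o.toString();
        } catch (RuntimeException e) {
            // Fall back to an identity-style description and note the failure.
            return o.getClass().getName() + "@"
                    + Integer.toHexString(System.identityHashCode(o))
                    + " (toString() failed: " + e + ")";
        }
    }

    /** Example object whose toString() throws, like a structure with no type. */
    static class Broken {
        @Override
        public String toString() {
            throw new NullPointerException("no type");
        }
    }

    public static void main(String[] args) {
        // Without the guard, the NullPointerException would replace the intended message.
        System.out.println("cannot dereference " + describe(new Broken()));
    }
}
```

With a guard like this on the error path, the caller would see the intended exception about the mismatched CAS rather than a NullPointerException from pretty-printing.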

Re: Erratic nullpointer exceptions because feature structure has no type in Ruta

2019-11-11 Thread Marshall Schor

This stack trace seems impossible, because the 3rd line shows

CASImpl.ll_getFSRef calling a FeatureStructureImpl.toString method, which I
believe it doesn't do.

-M

On 11/11/2019 3:58 AM, Mario Juric wrote:
> Hi Peter,
>
> A while ago we started to get some erratic null pointer exceptions from Ruta
> because the type of some feature structure element is null (see stack trace
> below). The error is not consistently reproducible; in fact it seldom occurs,
> and when reprocessing the document it doesn’t happen again. We therefore think
> there are some race conditions at play when running in a multithreaded
> environment as we do in production, and I was hoping that maybe you would get
> an idea of what might be causing it just by looking at the stack trace.
>
> Cheers
> Mario
>
> java.lang.NullPointerException
>  at org.apache.uima.cas.impl.FeatureStructureImpl.prettyPrint(FeatureStructureImpl.java:501)
>  at org.apache.uima.cas.impl.FeatureStructureImpl.prettyPrint(FeatureStructureImpl.java:483)
>  at org.apache.uima.cas.impl.FeatureStructureImpl.toString(FeatureStructureImpl.java:472)
>  at org.apache.uima.cas.impl.FeatureStructureImpl.toString(FeatureStructureImpl.java:467)
>  at org.apache.uima.cas.impl.CASImpl.ll_getFSRef(CASImpl.java:3653)
>  at org.apache.uima.cas.impl.FeatureStructureImpl.setFeatureValue(FeatureStructureImpl.java:61)
>  at org.apache.uima.ruta.RutaStream.assignFeatureValue(RutaStream.java:1140)
>  at org.apache.uima.ruta.RutaStream.assignFeatureValues(RutaStream.java:1020)
>  at org.apache.uima.ruta.action.CreateAction.execute(CreateAction.java:74)
>  at org.apache.uima.ruta.rule.AbstractRuleElement.apply(AbstractRuleElement.java:133)
>  at org.apache.uima.ruta.rule.RuleElementCaretaker.applyRuleElements(RuleElementCaretaker.java:121)
>  at org.apache.uima.ruta.rule.ComposedRuleElement.applyRuleElements(ComposedRuleElement.java:621)
>  at org.apache.uima.ruta.rule.AbstractRuleElement.doneMatching(AbstractRuleElement.java:86)
>  at org.apache.uima.ruta.rule.ComposedRuleElement.fallback(ComposedRuleElement.java:526)
>  at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:419)
>  at org.apache.uima.ruta.rule.RutaRuleElement.startMatch(RutaRuleElement.java:103)
>  at org.apache.uima.ruta.rule.ComposedRuleElement.startMatch(ComposedRuleElement.java:76)
>  at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:63)
>  at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:54)
>  at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:36)
>  at org.apache.uima.ruta.block.RutaScriptBlock.apply(RutaScriptBlock.java:67)
>  at org.apache.uima.ruta.RutaModule.apply(RutaModule.java:56)
>  at org.apache.uima.ruta.engine.RutaEngine.process(RutaEngine.java:561)
>  at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
>  at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
>  at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:318)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
>  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:271)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
>  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:271)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
>  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:271)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
>  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:271)
>  at 
>

[ANNOUNCE] Apache UIMA Java SDK version 3.1.1 released

2019-11-11 Thread Marshall Schor



The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 3.1.1.  You can download it here:

https://uima.apache.org/downloads.cgi

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

Version 3.1.1 is an incremental update for the version 3 branch of UIMA. It
includes a performance fix for instances where a user is processing hundreds of
CASes with the same type system, in parallel.

It is also our first release since moving the Source Control from SVN to GIT.
The source is now on https://github.com/apache/uima-uimaj

See the UIMA News item for more details here: https://uima.apache.org/news.html

Please send feedback via the Apache UIMA project mailing lists.

-Marshall Schor, for the Apache UIMA development team.


Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor

Hi,

Here's what I vaguely remember were the driving use cases for the sofa as a URI.

1.  The main use case was for applications where the data was so large, it would
be unreasonable to read it all in and save as a string.

2.  The prohibition on changing a sofa spec (without resetting the CAS) was that
it has the potential for users to invalidate the results, in this (imagined)
scenario:

    a) User creates cas with some sofa data,
    b) User runs annotators, which create annotations that "point into" the sofa
data
    c) User changes the sofa spec, to different data, but now all the
annotations still are pointing into "offsets" in the original data.

You can change the sofa data setting, but only after resetting the CAS. 
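
The imagined scenario in (a) to (c) can be shown with plain strings, without any UIMA classes. In this sketch `StaleOffsets` and `covered` are hypothetical names, and an "annotation" is reduced to a begin/end offset pair.

```java
/** Plain-Java illustration of offsets going stale when the underlying text changes. */
class StaleOffsets {
    /** What an annotation "covers": the substring between its offsets. */
    static String covered(String sofaText, int begin, int end) {
        return sofaText.substring(begin, end);
    }

    public static void main(String[] args) {
        String original = "Alice met Bob";
        int begin = 10;   // offsets created by an annotator run on the original text
        int end = 13;
        System.out.println(covered(original, begin, end));

        // The text is swapped, but the offsets are kept as they were.
        String changed = "Yesterday Alice met Bob";
        System.out.println(covered(changed, begin, end));
    }
}
```

The second lookup still returns three characters, but they are no longer the ones the annotator marked, which is the silent corruption the restriction is meant to prevent.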

    Did you have a use case for wanting to change the sofa data without
resetting the CAS?

It sounds like you have another interesting use case:

    a) want to convert the sofa data uri -> a string and have the normal
getDocumentText etc. work, but
    b) have the serialization serialize the sofaURI, and not the data that's
present there.

This might be a nice convenience.

I can see a couple of issues:
  a) it might need to have a good strategy for handling very large data.  E.g.,
the convert method might need to include a max string size spec.
  b) Since the serialization would serialize the annotations, but not the data
(it would only serialize the URI), the data at that URI could easily change,
making the annotation results meaningless.  Perhaps some "fingerprinting"
(developing a checksum of the data, and serializing that to be able to signal if
that did happen) would be a reasonable protection.
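
Both protections can be sketched with JDK classes only. This is an illustration of the idea, not an existing or planned UIMA API; `SofaGuards`, `readAtMost`, and `sha256Hex` are hypothetical names, and SHA-256 is just one possible choice of checksum.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helpers for size-limited reading and fingerprinting of external data. */
class SofaGuards {
    /** Reads the stream as UTF-8 text, refusing to buffer more than maxBytes. */
    static String readAtMost(InputStream in, int maxBytes) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IllegalStateException("data exceeds " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hex-encoded SHA-256 digest, to be stored alongside the URI. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** True if the data now found at the URI still matches the stored fingerprint. */
    static boolean unchanged(String storedFingerprint, byte[] currentData) {
        return storedFingerprint.equals(sha256Hex(currentData));
    }
}
```

On deserialization, a mismatch between the stored fingerprint and the freshly computed one would signal that the annotations may no longer line up with the data.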

Maybe do a new feature-request issue?

-Marshall

I imagine the JavaDoc for this method would say something like: has the
potential to exceed your memory at run time, due to the potential size of the
data...

On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
>> One other useful source of examples:  The test cases for UIMA, e.g. search the
>> uimaj-core projects *.java files for "getSofaDataStream".
> Ok, let me elaborate :)
>
> One can use setSofaDataURI(url) to tell the CAS that the sofa data is actually
> external. One can then use getSofaDataStream() to resolve the URL and retrieve
> the data as a stream.
>
> So let's assume I have a CAS containing annotations on a text and the text is
> in an external file:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
>
> Works nice when I use getSofaDataStream() to retrieve the data.
>
> But I can't use the "normal" methods like getDocumentText() or
> getCoveredText() at all.
>
> Also, I cannot call setSofaDataString(urlContent, "text/plain") - it throws an
> exception because there is already a sofaURI set. This is a major inconvenience.
>
> The ClearTK guys came up with an approach that tries to make this a bit more
> convenient:
>
> * they introduce a well-known view named "UriView" and set the sofaDataURI in
>   that view.
> * then they use a special reader which looks up the URI in that view, resolves
>   it and drops the content into the sofaDataString of the "_defaultView".
>
> That way they get the benefit of the externally stored sofa as well as the
> ability to use the usual methods to access the text.
>
> When I looked at setSofaDataURI(), I naively expected that it would be resolved
> the first time I try to access the sofa data (e.g. via getDocumentText()) - but
> that doesn't happen.
>
> Then I expected that I would just call getSofaDataStream() and manually drop the
> contents into setSofaDataString() and that this data string would be
> "transient", i.e. not saved into XMI because we already have a sofaDataURI
> set... but that expectation was also not fulfilled.
>
> Could it be useful to introduce some place where we can transiently drop data
> obtained from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do something useful but also such that the data is not included
> when serializing the CAS to whatever format?
>
> Cheers,
>
> -- Richard

Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor

One other useful source of examples:  The test cases for UIMA, e.g. search the
uimaj-core projects *.java files for "getSofaDataStream".

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> Hi there,
>
> does somebody have an example of how to work with CASes where the sofa
> data is not set using setDocumentText() but rather using setSofaDataURI(...)?
> 
>
> It looks like the CAS text is then not accessible via the usual means:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("https://www.apache.org/licenses/LICENSE-2.0.txt", "text/plain");
>   CasIOUtils.save(cas, System.out, SerialFormat.XMI);
>   System.out.println(cas.getDocumentText()); // -> prints "null"
>   System.out.println(cas.getSofaDataString()); // -> prints "null"
>
> Apparently, one needs to call getSofaDataStream() - but even after calling 
> that, getDocumentAnnotation().getCoveredText() returns null.
>
> So how is one expected to work with CASes that are using this data URI 
> concept?
>
> Cheers,
>
> -- Richard

Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor

hi, not my area of expertise, but the docs say

  
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.aas.accessing_sofa_data

that if you're using a URI, then you use the cas.getSofaDataURI(), which returns
a string representation of the URI.

To get the data, the docs say you need to set up some standard Java I/O.

There's also a special cas method, getSofaDataStream, which returns an input
stream, and works with both local and remote data.
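
The "standard Java I/O" part can be sketched roughly as below. This is plain JDK
code, not taken from the UIMA docs; the class and method names (SofaUriRead,
readAll, readUri) are made up for the illustration, and UTF-8 content is an
assumption. In a UIMA setting the InputStream would presumably come from
cas.getSofaDataStream() instead of URL.openStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SofaUriRead {

  // Drains an InputStream into a String, assuming the bytes are UTF-8 text.
  static String readAll(InputStream in) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Resolves a URI string with java.net and reads the whole resource.
  static String readUri(String uri) {
    try (InputStream in = URI.create(uri).toURL().openStream()) {
      return readAll(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // A temp file stands in for the remote document (hypothetical data).
    Path tmp = Files.createTempFile("sofa", ".txt");
    Files.write(tmp, "hello sofa".getBytes(StandardCharsets.UTF_8));
    System.out.println(readUri(tmp.toUri().toString()));
    Files.delete(tmp);
  }
}
```

The same readAll helper works for local and remote streams, since it only
depends on InputStream.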

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> Hi there,
>
> does somebody have an example of how to work with CASes where the sofa
> data is not set using setDocumentText() but rather using setSofaDataURI(...)?
> 
>
> It looks like the CAS text is then not accessible via the usual means:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("https://www.apache.org/licenses/LICENSE-2.0.txt", "text/plain");
>   CasIOUtils.save(cas, System.out, SerialFormat.XMI);
>   System.out.println(cas.getDocumentText()); // -> prints "null"
>   System.out.println(cas.getSofaDataString()); // -> prints "null"
>
> Apparently, one needs to call getSofaDataStream() - but even after calling 
> that, getDocumentAnnotation().getCoveredText() returns null.
>
> So how is one expected to work with CASes that are using this data URI 
> concept?
>
> Cheers,
>
> -- Richard

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Marshall Schor

Here's working code that serializes in XML 1.1 format.

The key idea is to set the output property OutputKeys.VERSION to "1.1".

// needs: java.io.*, javax.xml.transform.OutputKeys,
//        org.apache.uima.cas.impl.XmiCasSerializer,
//        org.apache.uima.util.XMLSerializer
XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
try {
  XMLSerializer xml11Serializer = new XMLSerializer(out);
  xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
  xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
} finally {
  out.close();
}

This is from a test case. -Marshall

On 9/25/2019 2:16 PM, Mario Juric wrote:
> Thanks Marshall,
>
> If you prefer then I can also have a look at it, although I probably need to 
> finish something first within the next 3-4 weeks. It would probably get me 
> faster started if you could share some of your experimental sample code.
>
> Cheers,
> Mario
>
>> On 24 Sep 2019, at 21:32 , Marshall Schor  wrote:
>>
>> yes, makes sense, thanks for posting the Jira.
>>
>> If no one else steps up to work on this, I'll probably take a look in a few
>> days. -Marshall
>>
>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> I added the following feature request to Apache Jira:
>>>
>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>
>>> Hope it makes sense :)
>>>
>>> Thanks a lot for the help, it’s appreciated.
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 23 Sep 2019, at 16:33 , Marshall Schor  wrote:
>>>>
>>>> Re: serializing using XML 1.1
>>>>
>>>> This was not thought of, when setting up the CasIOUtils.
>>>>
>>>> The way it was done (above) was using some more "primitive/lower level" 
>>>> APIs,
>>>> rather than the CasIOUtils.
>>>>
>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>>> might be specified in the CasIOUtils APIs.
>>>>
>>>> Thanks! -Marshall
>>>>
>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>> Hi Marshall,
>>>>>
>>>>> Thanks for the thorough and excellent investigation.
>>>>>
>>>>> We are looking into possible normalisation/cleanup of 
>>>>> whitespace/invisible characters, but I don’t think we can necessarily do 
>>>>> the same for some of the other characters. It sounds to me though that 
>>>>> serialising to XML 1.1 could also be a simple fix right now, but can this 
>>>>> be configured? CasIOUtils doesn’t seem to have an option for this, so I 
>>>>> assume it’s something you have working in your branch.
>>>>>
>>>>> Regarding the other problem. It seems that the JDK bug is fixed from Java 
>>>>> 9 and after. Do you think switching to a more recent Java version would 
>>>>> make a difference? I think we can also try this out ourselves when we 
>>>>> look into migrating to UIMA 3 once our current deliveries are complete. 
>>>>> We also like to switch to Java 11, and like UIMA 3 migration it will 
>>>>> require some thorough testing.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor  wrote:
>>>>>>
>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid 
>>>>>> xml
>>>>>> char, which is the \u0002.
>>>>>>
>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>
>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>
>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with 
>>>>>> xml 1.1:
>>>>>>
>>>>>>   XmiCasSerializer xmiCasSerializer = new
>>>>>> XmiCasSerializer(jCas.getTypeSystem());
>>>>>>   OutputStream out = new File

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-24 Thread Marshall Schor

yes, makes sense, thanks for posting the Jira.

If no one else steps up to work on this, I'll probably take a look in a few
days. -Marshall

On 9/24/2019 6:47 AM, Mario Juric wrote:
> Hi Marshall,
>
> I added the following feature request to Apache Jira:
>
> https://issues.apache.org/jira/browse/UIMA-6128
>
> Hope it makes sense :)
>
> Thanks a lot for the help, it’s appreciated.
>
> Cheers,
> Mario
>
>> On 23 Sep 2019, at 16:33 , Marshall Schor  wrote:
>>
>> Re: serializing using XML 1.1
>>
>> This was not thought of, when setting up the CasIOUtils.
>>
>> The way it was done (above) was using some more "primitive/lower level" APIs,
>> rather than the CasIOUtils.
>>
>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>> might be specified in the CasIOUtils APIs.
>>
>> Thanks! -Marshall
>>
>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> Thanks for the thorough and excellent investigation.
>>>
>>> We are looking into possible normalisation/cleanup of whitespace/invisible 
>>> characters, but I don’t think we can necessarily do the same for some of 
>>> the other characters. It sounds to me though that serialising to XML 1.1 
>>> could also be a simple fix right now, but can this be configured? 
>>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s 
>>> something you have working in your branch.
>>>
>>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 
>>> and after. Do you think switching to a more recent Java version would make 
>>> a difference? I think we can also try this out ourselves when we look into 
>>> migrating to UIMA 3 once our current deliveries are complete. We also like 
>>> to switch to Java 11, and like UIMA 3 migration it will require some 
>>> thorough testing.
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 20 Sep 2019, at 20:52 , Marshall Schor  wrote:
>>>>
>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid 
>>>> xml
>>>> char, which is the \u0002.
>>>>
>>>> This is in part because the xml version being used is xml 1.0.
>>>>
>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>
>>>> Here's a snip from the XmiCasSerializerTest class which serializes with 
>>>> xml 1.1:
>>>>
>>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>   try {
>>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>   } finally {
>>>>     out.close();
>>>>   }
>>>>
>>>> This succeeds and serializes this using xml 1.1.
>>>>
>>>> I also tried serializing some doc text which includes \u77987.  That did 
>>>> not
>>>> serialize correctly.
>>>> I could see it in the code while tracing up to some point down in the 
>>>> innards of
>>>> some internal
>>>> sax java code
>>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where 
>>>> it was
>>>> "Correct" in the Java string.
>>>>
>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>
>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte 
>>>> encoding:
>>>>    1110 xxxx 10xx xxxx 10xx xxxx
>>>>
>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>>>
>>>> But I think it's out of our hands - it's somewhere deep in the sax 
>>>> transform
>>>> java code.
>>>>
>>>> I looked for a bug report and found some
>>>> https://bugs.openjdk.java.net/browse

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor

re: using a later Java - that might make a difference, since fixes keep getting
added.

For some fixes, however, as you've noted, the fixes are backported to previous
versions.

-Marshall

On 9/23/2019 3:45 AM, Mario Juric wrote:
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace/invisible 
> characters, but I don’t think we can necessarily do the same for some of the 
> other characters. It sounds to me though that serialising to XML 1.1 could 
> also be a simple fix right now, but can this be configured? CasIOUtils 
> doesn’t seem to have an option for this, so I assume it’s something you have 
> working in your branch.
>
> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 
> and after. Do you think switching to a more recent Java version would make a 
> difference? I think we can also try this out ourselves when we look into 
> migrating to UIMA 3 once our current deliveries are complete. We also like to 
> switch to Java 11, and like UIMA 3 migration it will require some thorough 
> testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52 , Marshall Schor  wrote:
>>
>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>> char, which is the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 
>> 1.1:
>>
>> XmiCasSerializer xmiCasSerializer = new
>> XmiCasSerializer(jCas.getTypeSystem());
>> OutputStream out = new FileOutputStream(new File 
>> ("odd-doc-txt-v11.xmi"));
>> try {
>>   XMLSerializer xml11Serializer = new XMLSerializer(out);
>>   xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
>>   xmiCasSerializer.serialize(jCas.getCas(),
>> xml11Serializer.getContentHandler());
>> }
>> finally {
>>   out.close();
>> }
>>
>> This succeeds and serializes this using xml 1.1.
>>
>> I also tried serializing some doc text which includes \u77987.  That did not
>> serialize correctly.
>> I could see it in the code while tracing up to some point down in the 
>> innards of
>> some internal
>> sax java code
>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it 
>> was
>> "Correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>
>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte 
>> encoding:
>> 1110 xxxx 10xx xxxx 10xx xxxx
>>
>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>
>> But I think it's out of our hands - it's somewhere deep in the sax transform
>> java code.
>>
>> I looked for a bug report and found some
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line, is, I think to clean out these characters early :-) .
>>
>> -Marshall
>>
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> here's an idea.
>>>
>>> If you have a string, with the surrogate pair  at position 10, and 
>>> you
>>> have some Java code, which is iterating through the string and getting the
>>> code-point at each character offset, then that code will produce:
>>>
>>> at position 10:  the code-point 77987
>>> at position 11:  the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string of characters, assuming 
>>> you
>>> have characters at each point, if you don't handle surrogate pairs.
>>>
>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>> https://tools.ietf.org/html/rfc2781 )
>>>
>>> I worry that even tools like the CVD or similar may not work properly, since
>>> they're not designed to handle surrogate pairs, I think, so I have no idea 
>>> if
>>> they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for 
>>> document
>>> strings.
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>> Thanks Marshall,
>>>>
>>>> Encoding the ch

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor

Re: serializing using XML 1.1

This was not thought of, when setting up the CasIOUtils.

The way it was done (above) was using some more "primitive/lower level" APIs,
rather than the CasIOUtils.

Please open a Jira ticket for this, with perhaps some suggestions on how it
might be specified in the CasIOUtils APIs.

Thanks! -Marshall

On 9/23/2019 3:45 AM, Mario Juric wrote:
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace/invisible 
> characters, but I don’t think we can necessarily do the same for some of the 
> other characters. It sounds to me though that serialising to XML 1.1 could 
> also be a simple fix right now, but can this be configured? CasIOUtils 
> doesn’t seem to have an option for this, so I assume it’s something you have 
> working in your branch.
>
> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 
> and after. Do you think switching to a more recent Java version would make a 
> difference? I think we can also try this out ourselves when we look into 
> migrating to UIMA 3 once our current deliveries are complete. We also like to 
> switch to Java 11, and like UIMA 3 migration it will require some thorough 
> testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52 , Marshall Schor  wrote:
>>
>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>> char, which is the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 
>> 1.1:
>>
>> XmiCasSerializer xmiCasSerializer = new
>> XmiCasSerializer(jCas.getTypeSystem());
>> OutputStream out = new FileOutputStream(new File 
>> ("odd-doc-txt-v11.xmi"));
>> try {
>>   XMLSerializer xml11Serializer = new XMLSerializer(out);
>>   xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
>>   xmiCasSerializer.serialize(jCas.getCas(),
>> xml11Serializer.getContentHandler());
>> }
>> finally {
>>   out.close();
>> }
>>
>> This succeeds and serializes this using xml 1.1.
>>
>> I also tried serializing some doc text which includes \u77987.  That did not
>> serialize correctly.
>> I could see it in the code while tracing up to some point down in the 
>> innards of
>> some internal
>> sax java code
>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it 
>> was
>> "Correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>
>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte 
>> encoding:
>> 1110 xxxx 10xx xxxx 10xx xxxx
>>
>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>
>> But I think it's out of our hands - it's somewhere deep in the sax transform
>> java code.
>>
>> I looked for a bug report and found some
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line, is, I think to clean out these characters early :-) .
>>
>> -Marshall
>>
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> here's an idea.
>>>
>>> If you have a string, with the surrogate pair  at position 10, and 
>>> you
>>> have some Java code, which is iterating through the string and getting the
>>> code-point at each character offset, then that code will produce:
>>>
>>> at position 10:  the code-point 77987
>>> at position 11:  the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string of characters, assuming 
>>> you
>>> have characters at each point, if you don't handle surrogate pairs.
>>>
>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>> https://tools.ietf.org/html/rfc2781 )
>>>
>>> I worry that even tools like the CVD or similar may not work properly, since
>>> they're not designed to handle surrogate pairs, I think, so I have no idea 
>>> if
>>> they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for 
>>> document
>>> strings.
>>>
>>> -M

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor

The "OddDocumentText" test throws because of an invalid XML character, namely
\u0002.

This is in part because the xml version being used is xml 1.0.

XML 1.1 expanded the set of valid characters to include \u0002.
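
To make the 1.0 vs. 1.1 difference concrete, here is a small sketch of the two
"Char" ranges as I read them from the XML specifications. It is plain JDK code
and not part of UIMA; the class and method names are made up. Note that XML 1.1
only accepts control characters such as \u0002 when they are written as
character references (e.g. &#2;), not as raw characters.

```java
public class XmlChars {

  // XML 1.0 "Char": tab, LF, CR, then U+0020..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF.
  static boolean isXml10Char(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  // XML 1.1 "Char": everything from U+0001 up, minus surrogates, U+FFFE and U+FFFF.
  static boolean isXml11Char(int cp) {
    return (cp >= 0x1 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  public static void main(String[] args) {
    int ctrl = 0x2; // the character from the OddDocumentText test
    System.out.println("XML 1.0 allows U+0002: " + isXml10Char(ctrl)); // false
    System.out.println("XML 1.1 allows U+0002: " + isXml11Char(ctrl)); // true
  }
}
```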

Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:

    XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
    OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
    try {
      XMLSerializer xml11Serializer = new XMLSerializer(out);
      xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
      xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
    } finally {
      out.close();
    }

This succeeds and serializes this using xml 1.1.

I also tried serializing some doc text which includes \u77987.  That did not
serialize correctly.  While tracing, I could see that it was still correct in
the Java string as far down as the internal SAX class
com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize.

When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.

This is 1110 0111 1001 1110 1001 1000 0011 0111, which in UTF-8 is a 3-byte
encoding:
    1110 xxxx 10xx xxxx 10xx xxxx

of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
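
One possible reading of these bytes, offered as a guess rather than a
diagnosis: if the test text was written as the Java literal "\u77987", the
compiler takes exactly four hex digits for the escape, so the string holds
U+7798 followed by the digit '7', and E7 9E 98 37 is the UTF-8 for precisely
that. The code point 77987 is a decimal value (hex 130A3) and would need the
surrogate pair \uD80C\uDCA3 instead. The sketch below is plain JDK code with
made-up names, and prints both encodings.

```java
import java.nio.charset.StandardCharsets;

public class EscapeCheck {

  // Returns the UTF-8 bytes of s as an upper-case hex string.
  static String utf8Hex(String s) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The escape consumes four hex digits (7798); the trailing 7 is a plain digit.
    String literal = "\u77987";
    System.out.println(literal.length());              // 2
    System.out.println((int) literal.charAt(0));       // 30616, i.e. 0x7798
    System.out.println(literal.charAt(1));             // 7
    System.out.println(utf8Hex(literal));              // E79E9837

    // Decimal code point 77987 (0x130A3) built explicitly.
    String supplementary = new String(Character.toChars(77987));
    System.out.println(utf8Hex(supplementary));        // F09382A3
  }
}
```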

But I think it's out of our hands - it's somewhere deep in the sax transform
java code.

I looked for a bug report and found this one:
https://bugs.openjdk.java.net/browse/JDK-8058175

Bottom line, I think, is to clean out these characters early :-) .
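
A rough sketch of what "cleaning out early" could look like, assuming the goal
is to keep annotation begin/end offsets stable: every character that XML 1.0
does not allow is replaced by a space instead of being removed. This is plain
JDK code, not something in UIMA; the names XmlSanitizer and sanitize are made
up, and replacing with a space is just one possible policy.

```java
public class XmlSanitizer {

  // XML 1.0 "Char" ranges.
  static boolean isXml10Char(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  // Replaces each disallowed code point with spaces of the same char width,
  // so the result has the same length (in Java chars) as the input.
  static String sanitize(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      int width = Character.charCount(cp);
      if (isXml10Char(cp)) {
        sb.appendCodePoint(cp);
      } else {
        for (int k = 0; k < width; k++) {
          sb.append(' ');
        }
      }
      i += width;
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String dirty = "a\u0002b";
    System.out.println("[" + sanitize(dirty) + "]"); // [a b]
  }
}
```

Valid surrogate pairs pass through unchanged; an unpaired surrogate is treated
like any other disallowed character.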

-Marshall


On 9/20/2019 1:28 PM, Marshall Schor wrote:
> here's an idea.
>
> If you have a string, with the surrogate pair \uD80C\uDCA3 at position 10, and you
> have some Java code, which is iterating through the string and getting the
> code-point at each character offset, then that code will produce:
>
> at position 10:  the code-point 77987
> at position 11:  the code-point 56483
>
> Of course, it's a "bug" to iterate through a string of characters, assuming 
> you
> have characters at each point, if you don't handle surrogate pairs.
>
> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
> https://tools.ietf.org/html/rfc2781 )
>
> I worry that even tools like the CVD or similar may not work properly, since
> they're not designed to handle surrogate pairs, I think, so I have no idea if
> they would work well enough for you.
>
> I'll poke around some more to see if I can enable the conversion for document
> strings.
>
> -Marshall
>
> On 9/20/2019 11:09 AM, Mario Juric wrote:
>> Thanks Marshall,
>>
>> Encoding the characters like you suggest should work just fine for us as 
>> long as we can serialize and deserialise the XMI, so that we can open the 
>> content in a tool like the CVD or similar. These characters are just noise 
>> from the original content that happen to remain in the CAS, but they are not 
>> visible in our final output because they are basically filtered out one way 
>> or the other by downstream components. They become a problem though when 
>> they make it more difficult for us to inspect the content.
>>
>> Regarding the feature name issue: Might you have an idea why we are getting 
>> a different XMI output for the same character in our actual pipeline, where 
>> it results in "”? I investigated the value in the debugger 
>> again, and like you are illustrating it is also just a single codepoint with 
>> the value 77987. We are simply not able to load this XMI because of this, 
>> but unfortunately I couldn’t reproduce it in my small example.
>>
>> Cheers,
>> Mario
>>
>>> On 19 Sep 2019, at 22:41 , Marshall Schor  wrote:
>>>
>>> The odd-feature-text seems to work OK, but has some unusual properties, due 
>>> to
>>> that unicode character.
>>>
>>> Here's what I see:  The FeatureRecord "name" field is set to a
>>> 1-unicode-character, that must be encoded as 2 java characters.
>>>
>>> When output, it shows up in the xmi as >> xmi:id="18"
>>> name="" value="1.0"/>
>>> which seems correct.  The name field only has 1 (extended)unicode character
>>> (taking 2 Java characters to represent),
>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>
>>> When read in, the name field is assigned to a String, that string says it 
>>> has a
>

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor

here's an idea.

If you have a string with the surrogate pair \uD80C\uDCA3 at position 10, and
some Java code that iterates through the string and gets the code point at
each character offset, then that code will produce:

at position 10:  the code-point 77987
at position 11:  the code-point 56483

Of course, it's a "bug" to iterate through a string of characters, assuming you
have characters at each point, if you don't handle surrogate pairs.

The 56483 is just the low 10 bits of the code point (after subtracting
0x10000), added to 0xDC00 (see https://tools.ietf.org/html/rfc2781 )
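
Here is a small JDK-only sketch of that effect (class and method names are made
up). The first method steps one Java char at a time, which is the problematic
pattern; the second steps by whole code points.

```java
public class SurrogateDemo {

  // Problematic pattern: asks for a code point at every char offset,
  // so the low half of a surrogate pair shows up as its own "code point".
  static int[] codePointAtEachOffset(String s) {
    int[] result = new int[s.length()];
    for (int i = 0; i < s.length(); i++) {
      result[i] = s.codePointAt(i);
    }
    return result;
  }

  // Surrogate-aware iteration: one entry per real code point.
  static int[] codePoints(String s) {
    return s.codePoints().toArray();
  }

  // Low surrogate per RFC 2781: 0xDC00 plus the low 10 bits of (cp - 0x10000).
  static int lowSurrogate(int cp) {
    return 0xDC00 + ((cp - 0x10000) & 0x3FF);
  }

  public static void main(String[] args) {
    String s = "0123456789\uD80C\uDCA3";
    int[] naive = codePointAtEachOffset(s);
    System.out.println(naive[10]);             // 77987
    System.out.println(naive[11]);             // 56483
    System.out.println(codePoints(s).length);  // 11
    System.out.println(lowSurrogate(77987));   // 56483
  }
}
```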

I worry that even tools like the CVD or similar may not work properly, since
they're not designed to handle surrogate pairs, I think, so I have no idea if
they would work well enough for you.

I'll poke around some more to see if I can enable the conversion for document
strings.

-Marshall

On 9/20/2019 11:09 AM, Mario Juric wrote:
> Thanks Marshall,
>
> Encoding the characters like you suggest should work just fine for us as long 
> as we can serialize and deserialise the XMI, so that we can open the content 
> in a tool like the CVD or similar. These characters are just noise from the 
> original content that happen to remain in the CAS, but they are not visible 
> in our final output because they are basically filtered out one way or the 
> other by downstream components. They become a problem though when they make 
> it more difficult for us to inspect the content.
>
> Regarding the feature name issue: Might you have an idea why we are getting a 
> different XMI output for the same character in our actual pipeline, where it 
> results in "”? I investigated the value in the debugger 
> again, and like you are illustrating it is also just a single codepoint with 
> the value 77987. We are simply not able to load this XMI because of this, but 
> unfortunately I couldn’t reproduce it in my small example.
>
> Cheers,
> Mario
>
>> On 19 Sep 2019, at 22:41 , Marshall Schor  wrote:
>>
>> The odd-feature-text seems to work OK, but has some unusual properties, due 
>> to
>> that unicode character.
>>
>> Here's what I see:  The FeatureRecord "name" field is set to a
>> 1-unicode-character, that must be encoded as 2 java characters.
>>
>> When output, it shows up in the xmi as > name="" value="1.0"/>
>> which seems correct.  The name field only has 1 (extended)unicode character
>> (taking 2 Java characters to represent),
>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>
>> When read in, the name field is assigned to a String, that string says it 
>> has a
>> length of 2 (but that's because it takes 2 java chars to represent this 
>> char).
>> If you have the name string in a variable "n", and do
>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>> n.codePointCount(0, n.length()) is, as expected, 1.
>>
>> So, the string value serialization and deserialization seems to be "working".
>>
>> The other code - for the sofa (document) serialization, is throwing that 
>> error,
>> because as currently designed, the
>> serialization code checks for these kinds of characters, and if found throws
>> that exception.  The code checking is
>> in XMLUtils.checkForNonXmlCharacters
>>
>> This is because it's highly likely that "fixing this" in the same way as the
>> other, would result in hard-to-diagnose
>> future errors, because the subject of analysis string is processed with 
>> begin /
>> end offset all over the place, and makes
>> the assumption that the characters are all not coded as surrogate pairs.
>>
>> We could change the code to output these like the name, as, e.g.,   
>>
>> Would that help in your case, or do you imagine other kinds of things might
>> break (due to begin/end offsets no longer
>> being on character boundaries, for example).
>>
>> -Marshall
>>
>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>> Hi,
>>>
>>> I investigated the XMI issue as promised and these are my findings.
>>>
>>> It is related to special unicode characters that are not handled by XMI
>>> serialisation, and there seems to be two distinct categories of issues we 
>>> have
>>> identified so far.
>>>
>>> 1) The document text of the CAS contains special unicode characters
>>> 2) Annotations with String features have values containing special unicode
>>> characters
>>>
>>> In both cases we co

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-19 Thread Marshall Schor

 has a well described
>> type system then maybe it just lacks a way to describe schema evolution
>> similar to Apache Avro or similar serialisation frameworks. I think a more
>> formal approach to data migration would be critical to any larger operational
>> setup.
>>
>> Regarding XMI I like to provide some input to the problem we are observing,
>> so that it can be solved. We are primarily using XMI for inspection/debugging
>> purposes, and we are sometimes not able to do this because of this error. I
>> will try to extract a minimum example to avoid involving parts that has to do
>> with our pipeline and type system, and I think this would also be the best
>> way to illustrate that the problem exists outside of this context. However,
>> converting all our data to XMI first in order to do the conversion in our
>> example would not be very practical for us, because it involves a large
>> amount of data.
>>
>> Cheers,
>> Mario
>>
>>> On 16 Sep 2019, at 23:02 , Marshall Schor wrote:
>>>
>>> In this case, the original looks kind-of like this:
>>>
>>> Container
>>>    features -> FSArray of FeatureAnnotation each of which
>>>  has 5 slots: sofaRef, begin, end, name, value
>>>
>>> the new TypeSystem has
>>>
>>> Container
>>>    features -> FSArray of FeatureRecord each of which
>>>                               has 2 slots: name, value
>>>
>>> The deserializer code would need some way to decide how to
>>>    1) create an FSArray of FeatureRecord,
>>>    2) for each element,
>>>   map the FeatureAnnotation to a new instance of FeatureRecord
>>>
>>> I guess I could imagine a default mapping (for item 2 above) of
>>>   1) change the type from A to B
>>>   2) set equal-named features from A to B, drop other features
>>>
>>> This mapping would need to apply to a subset of the A's and B's, namely, 
>>> only
>>> those referenced by the FSArray where the element type changed.  Seems 
>>> complex
>>> and specific to this use case though.
>>>
>>> -Marshall
>>>
>>>
>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>> On 16. Sep 2019, at 19:05, Marshall Schor wrote:
>>>>> I can reproduce the problem, and see what is happening.  The 
>>>>> deserialization
>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>> feature
>>>>> whose range (value) is type  in one type system and type  in the
>>>>> other.
>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>> Without reading the code in detail - could we not relax this check such
>>>> that the element type of FSArrays is not checked and the code simply
>>>> assumes that the source element type has the same features as the target
>>>> element type (with the usual lenient handling of missing features in the
>>>> target type)? - Kind of a "duck typing" approach?
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>
>

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor

In this case, the original looks kind-of like this:

Container
   features -> FSArray of FeatureAnnotation, each of which
               has 5 slots: sofaRef, begin, end, name, value

the new TypeSystem has

Container
   features -> FSArray of FeatureRecord, each of which
               has 2 slots: name, value

The deserializer code would need some way to decide how to
   1) create an FSArray of FeatureRecord,
   2) for each element,
      map the FeatureAnnotation to a new instance of FeatureRecord

I guess I could imagine a default mapping (for item 2 above) of
   1) change the type from A to B
   2) set equal-named features from A to B, drop other features

This mapping would need to apply to a subset of the A's and B's, namely, only
those referenced by the FSArray where the element type changed.  Seems complex
and specific to this use case though.
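
Just to illustrate the default mapping idea, here is a toy sketch. It does not
use any UIMA classes: a feature structure is modelled as a plain Map from
feature name to value, and the names FeatureMapping and mapByName are invented
for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class FeatureMapping {

  // Copies the equal-named features from the source instance into a new
  // target instance; features the target type does not declare are dropped.
  static Map<String, Object> mapByName(Map<String, Object> source, Set<String> targetFeatures) {
    Map<String, Object> target = new LinkedHashMap<>();
    for (String name : targetFeatures) {
      if (source.containsKey(name)) {
        target.put(name, source.get(name));
      }
    }
    return target;
  }

  public static void main(String[] args) {
    // Stand-in for one FeatureAnnotation (5 slots); values are made up.
    Map<String, Object> featureAnnotation = new LinkedHashMap<>();
    featureAnnotation.put("sofaRef", 1);
    featureAnnotation.put("begin", 0);
    featureAnnotation.put("end", 4);
    featureAnnotation.put("name", "weight");
    featureAnnotation.put("value", "1.0");

    // Stand-in for the FeatureRecord type (2 slots).
    Set<String> featureRecordSlots = new LinkedHashSet<>(Arrays.asList("name", "value"));

    System.out.println(mapByName(featureAnnotation, featureRecordSlots));
    // {name=weight, value=1.0}
  }
}
```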

-Marshall


On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
> On 16. Sep 2019, at 19:05, Marshall Schor  wrote:
>> I can reproduce the problem, and see what is happening.  The deserialization
>> code compares the two type systems, and allows for some mismatches (things
>> present in one and not in the other), but it doesn't allow for having a 
>> feature
>> whose range (value) is type  in one type system and type  in the 
>> other. 
>> See CasTypeSystemMapper lines 299 - 315.
> Without reading the code in detail - could we not relax this check such that 
> the element type of FSArrays is not checked and the code simply assumes that 
> the source element type has the same features as the target element type 
> (with the usual lenient handling of missing features in the target type)? - 
> Kind of a "duck typing" approach?
>
> Cheers,
>
> -- Richard

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor

I can reproduce the problem, and see what is happening.  The deserialization
code compares the two type systems, and allows for some mismatches (things
present in one and not in the other), but it doesn't allow for having a feature
whose range (value) is type  in one type system and type  in the other. 
See CasTypeSystemMapper lines 299 - 315.

It may not be easy to fix.  Basically, the deserialization routines are set up
with a lenient kind of accommodation for different type systems, where they can
"skip" over types and features that are missing. 

This particular transformation needs to run a value conversion - from
FeatureAnnotation to FeatureRecord. 

I'm thinking of various approaches, and putting these out for others to expand
upon, etc.

1) Along the lines of Richard's remark, fix the xmi serialization to work with
all binary data, perhaps by base-64 encoding problematic (or specified by
feature name, or all) values, or - if it turns out to just be some "bug" -
fixing the bug.

2) Allow the user to specify some kind of call-back function, in the
deserializer, when the range of the feature doesn't match.  This would take some
kind of representation of the feature value in typesystem1, and the type of the
feature value in type system 2, and would need to produce the value in type
system 2.  This may be quite problematic/awkward to carry out in all the
generalized edge cases, for instance if there are "forward" references to things
not yet deserialized, etc.

At this point, I think #1 could be quite feasible.  To investigate further, it
would help to have a small test case where the xmi serialization currently is
not readable (due to - as you think - character coding issues).
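
To make approach #1 a bit more concrete, here is a minimal sketch of base-64 encoding a string value so that characters which are not legal in XML 1.0 survive a round trip. It uses only java.util.Base64 from the JDK; the class name Base64FeatureValue is hypothetical, and this is not how the XMI serializer currently behaves.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of UIMA: wraps a feature value in base-64 so
// that it contains only characters that are safe in XML.
class Base64FeatureValue {

    static String encode(String raw) {
        return Base64.getEncoder()
                     .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // NUL and SOH are control characters that XML 1.0 does not allow.
        String raw = "a\u0000b\u0001c";
        String encoded = encode(raw);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(raw)); // prints true
    }
}
```

Encoding every value would cost readability and size, which is why restricting it to problematic values or named features, as suggested above, may be preferable.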

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Best Regards,
>
> Mario Juric
> Principal Engineer
> *UNSILO.ai* <http://unsilo.ai/>
> mobile:  +45 3082 4100
>
> skype: mario.juric.dk <http://mario.juric.dk>
>
>
>
>
> Hi Marshall,
>
> I have a small test case with 3 files, excluding any JCasGen-generated types
> and the uimaFIT types file.
>
> First you will have to generate the types and run SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference in
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries using the new type system, without saving them
> again first.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>> On 13 Sep 2019, at 21:55, Mario Juric <m...@unsilo.ai> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon as I get the time to do it.
>> This will also give me a better understanding of the format.
>>
>>
>> Cheers,
>> Mario
>>
>>> On 13 Sep 2019, at 19:32, Marshall Schor <m...@schor.com> wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem
>>> with a small type system.
>>>
>>> With that, I could run in the debugger and see exactly what was happening,
>>> and see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch
>>> between type systems, but mainly one where one is a subset of the other -
>>> see the javadoc for the class
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>> Just a quick follow up.
>>>>
>>>> I pl

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor

oops, ignore that - I see Container is a JCas class ...  -M

On 9/16/2019 9:30 AM, Marshall Schor wrote:
> I may have some version problems.  The LoadCompressedBinary has refs to a class
> "Container", but I don't seem to have that class - where is it coming from?
>
> -Marshall
>
> On 9/16/2019 8:11 AM, Mario Juric wrote:
>> Best Regards,
>>
>> Mario Juric
>> Principal Engineer
>> *UNSILO.ai* <http://unsilo.ai/>
>> mobile:  +45 3082 4100
>>
>> skype: mario.juric.dk <http://mario.juric.dk>
>>
>>
>>
>>
>> Hi Marshall,
>>
>> I have a small test case with 3 files, excluding any JCasGen-generated types
>> and the uimaFIT types file.
>>
>> First you will have to generate the types and run SaveCompressedBinary to
>> produce the 3 binary forms I have been experimenting with. You should then be
>> able to run LoadCompressedBinaries as expected.
>>
>> Next you need to change the element type of Container.features from
>> FeatureAnnotation to FeatureRecord in the type system and generate the type
>> system again. Also change the FeatureAnnotation reference in
>> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
>> previously stored binaries using the new type system, without saving them
>> again first.
>>
>> You can see I have played with different ways of loading just to see if
>> anything worked, but much of it seems to result in exactly the same calls in
>> the lower layers. I didn’t get entirely the same results with the CAS we
>> actually store as in this example. E.g. I experienced some EOF with the
>> compressed filtered whereas I only get a class cast exception during
>> verification in this example. Note also that we keep both types in the new
>> type system, but we want to change the element type of the FSArray in the
>> Container.
>>
>> Hope this will yield some useful insights and thanks a lot :)
>>
>> Cheers
>> Mario
>>
>>> On 13 Sep 2019, at 21:55, Mario Juric <m...@unsilo.ai> wrote:
>>>
>>> Thanks Marshall,
>>>
>>> I’ll get back to you with a small sample as soon as I get the time to do it.
>>> This will also give me a better understanding of the format.
>>>
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 13 Sep 2019, at 19:32, Marshall Schor <m...@schor.com> wrote:
>>>>
>>>> I'm wondering if you could post a very small test case showing this
>>>> problem with a small type system.
>>>>
>>>> With that, I could run in the debugger and see exactly what was happening,
>>>> and see whether or not some small fix would make this work.
>>>>
>>>> The Deserializer for this already supports a certain type of mismatch
>>>> between type systems, but mainly one where one is a subset of the other -
>>>> see the javadoc for the class
>>>>
>>>> org.apache.uima.cas.impl.BinaryCasSerDes6.
>>>>
>>>> But it must not currently cover this particular case.
>>>>
>>>> -Marshall
>>>>
>>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>>> Just a quick follow up.
>>>>>
>>>>> I played around a bit with CasIOUtils, and it seems that it is possible
>>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>>> but I found no way to replace it with the new type system and make the
>>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>>> separate step, but as expected it fails when it reaches the FSArray of X
>>>>> in the source CAS because the destination type system requires elements of
>>>>> type Y. I could make my own modified version of the CasCopier that takes
>>>>> mapping functions for each pair of source and destination types that need
>>>>> to be mapped, but this is where it starts to get too complicated, so I
>>>>> found it not to be worth it at this point, since we might then want to
>>>>> reprocess everything fro

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor

I may have some version problems.  The LoadCompressedBinary has refs to a class
"Container", but I don't seem to have that class - where is it coming from?

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Best Regards,
>
> Mario Juric
> Principal Engineer
> *UNSILO.ai* <http://unsilo.ai/>
> mobile:  +45 3082 4100
>
> skype: mario.juric.dk <http://mario.juric.dk>
>
>
>
>
> Hi Marshall,
>
> I have a small test case with 3 files, excluding any JCasGen-generated types
> and the uimaFIT types file.
>
> First you will have to generate the types and run SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference in
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries using the new type system, without saving them
> again first.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>> On 13 Sep 2019, at 21:55, Mario Juric <m...@unsilo.ai> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon as I get the time to do it.
>> This will also give me a better understanding of the format.
>>
>>
>> Cheers,
>> Mario
>>
>>> On 13 Sep 2019, at 19:32, Marshall Schor <m...@schor.com> wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem
>>> with a small type system.
>>>
>>> With that, I could run in the debugger and see exactly what was happening,
>>> and see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch
>>> between type systems, but mainly one where one is a subset of the other -
>>> see the javadoc for the class
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>> Just a quick follow up.
>>>>
>>>> I played around a bit with CasIOUtils, and it seems that it is possible
>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>> but I found no way to replace it with the new type system and make the
>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>> separate step, but as expected it fails when it reaches the FSArray of X
>>>> in the source CAS because the destination type system requires elements of
>>>> type Y. I could make my own modified version of the CasCopier that takes
>>>> mapping functions for each pair of source and destination types that need
>>>> to be mapped, but this is where it starts to get too complicated, so I
>>>> found it not to be worth it at this point, since we might then want to
>>>> reprocess everything from scratch anyway.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 12 Sep 2019, at 10:41, Mario Juric <m...@unsilo.ai> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We use form 6 compressed binaries to persist the CAS. We now want to make
>>>>> a change to the type system that is not direct

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Marshall Schor

I'm wondering if you could post a very small test case showing this problem with
a small type system. 

With that, I could run in the debugger and see exactly what was happening, and
see whether or not some small fix would make this work.

The Deserializer for this already supports a certain type of mismatch between
type systems, but mainly one where one is a subset of the other - see the
javadoc for the class

org.apache.uima.cas.impl.BinaryCasSerDes6.

But it must not currently cover this particular case.

-Marshall

On 9/13/2019 10:48 AM, Mario Juric wrote:
> Just a quick follow up.
>
> I played around a bit with CasIOUtils, and it seems that it is possible
> to load and use the embedded type system, i.e. the old type system with X,
> but I found no way to replace it with the new type system and make the
> necessary mappings to Y. I tried to see if I could use the CasCopier in a
> separate step, but as expected it fails when it reaches the FSArray of X in
> the source CAS because the destination type system requires elements of type
> Y. I could make my own modified version of the CasCopier that takes mapping
> functions for each pair of source and destination types that need to be
> mapped, but this is where it starts to get too complicated, so I found it
> not to be worth it at this point, since we might then want to reprocess
> everything from scratch anyway.
>
> Cheers,
> Mario
>
>> On 12 Sep 2019, at 10:41 , Mario Juric  wrote:
>>
>> Hi,
>>
>> We use form 6 compressed binaries to persist the CAS. We now want to make a 
>> change to the type system that is not directly compatible, although in 
>> principle the new type system is really a subset from a data perspective, so 
>> we want to migrate existing binaries to the new type system, but we don’t 
>> know how. The change is as follows:
>>
>> In the existing type system we have a type A with an FSArray feature of
>> element type X, and we want to change X to Y, where Y's features are a
>> proper subset of X's. This means we basically want to replace X with Y for
>> the FSArray and ditch a few attributes of X when loading the CAS into the
>> new type system.
>>
>> Had the CAS been stored in JSON this would be trivial by just mapping the 
>> attributes that they have in common, but when I try to load the CAS binary 
>> into the new target type system it chokes with an EOF, so I don’t know if 
>> that is at all possible with a form 6 compressed CAS binary?
>>
>> I poked around a bit in the reference, API and mailing list archive but I
>> was not able to find anything useful. I can of course keep parallel
>> attributes for both X and Y and then have a separate step that makes an
>> explicit conversion/copy, but I prefer to avoid this. I would appreciate any
>> input on the problem, thanks :)
>>
>> Cheers,
>> Mario
>>
>

Re: How can I do efficient FSIndex lookup?

2019-09-09 Thread Marshall Schor

ok.

Re: getting the top 5 or 10 items: Here's a technique you may find of use:

Put the items into a Java PriorityQueue.  Keep a piece of data which is the
bottom item, and in your insert-into-the-queue code, check if the item
to-be-inserted is below that, and if so, skip it.

This gives a very efficient way to get the top 5 or 10 items.
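
A minimal sketch of this technique, assuming the items are compared by a plain numeric score (the class name TopN and the sample scores are made up for illustration). With a min-heap bounded to n entries, the head of the queue is the current "bottom item", so any candidate that is not above it can be skipped without touching the queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopN {

    // Returns the n highest scores, best first.
    static List<Double> topN(Iterable<Double> scores, int n) {
        List<Double> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap: peek() is the lowest score currently kept (the bottom item).
        PriorityQueue<Double> heap = new PriorityQueue<>(n);
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll();   // evict the old bottom item
                heap.add(s);
            }
            // otherwise the candidate is below the bottom item: skip it
        }
        result.addAll(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.3, 0.9, 0.1, 0.7, 0.5, 0.8);
        System.out.println(topN(scores, 3)); // prints [0.9, 0.8, 0.7]
    }
}
```

Each insert costs O(log n) at most, and most candidates are rejected with a single comparison, so one pass over all items is enough and no sorted index is needed.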

HTH. -Marshall

On 9/9/2019 4:09 AM, Mario Juric wrote:
> Hi,
>
> Once again thanks for the response. It is really appreciated :)
>
> I tried moveTo(fs) instead of just using an iterator constructed from the
> FS, and this appeared to give me all items of the specified type when I
> didn’t set any values on it (an accidental experiment), but when I set the
> key property to what I was searching for, I got zero items back. I am not
> sure what I might be doing wrong here, but in the meantime I have learned
> something that may be more important for our use case: the cost of indexing
> far exceeds the benefit of any expected lookup speedup in our case.
>
> We are annotating a number of items with a lot of extracted feature 
> information, and the hope was to be able to quickly get top 5 or 10 or 
> whatever of the items with this or that key, which is why it was sorted by 
> key first in natural sort order and then by the value in reverse order, 
> meaning higher value is better, so that we could quickly get to the first 
> item with the right key and then start pulling the top most items until we 
> have those that we need.
>
> So even if I could get this to work optimally, it would not be beneficial in
> our case given the cost of indexing. It seems we would need many of those
> queries before it pays off, since the amount of feature information is much
> larger than the items they are associated with, so I reached the preliminary
> conclusion not to have features in any index at all and to just use plain FS
> record structures instead. In our case it appears much cheaper to run
> through all target items, of which there are comparatively few, to find what
> we need than to index all associated features and find the relevant target
> items through feature lookup.
>
> Cheers,
> Mario
>
>> On 6 Sep 2019, at 16:50 , Marshall Schor  wrote:
>>
>> Please don't add the FS you're temporarily using as the argument for the
>> moveTo operation to the indexes.  (And of course, if you don't add it, you
>> won't need to remove it...)
>>
>> If you describe your use case in a bit more detail, I can perhaps comment on
>> this more.
>>
>> -Marshall
>>
>> On 9/6/2019 2:50 AM, Mario Juric wrote:
>>> Hi,
>>>
>>> Thanks for responding.
>>>
>>> I tried with a temporary FS where the key value was set, but I got every
>>> annotation from the index, so that didn’t appear to change anything, and it
>>> also broke my unit tests immediately. I also stepped through the iterator
>>> implementation and found the construction of the iterator with an FS quite
>>> complex, so that went over my head without spending time to get a deeper
>>> understanding of the underlying index implementation. Therefore I tried
>>> with an indexed FS and this seemed to return the correct items, but it
>>> would be awkward having to add some FS to the index in order to retrieve
>>> something else and then having to remove the FS from the index again. I am
>>> now also in doubt about the insertion costs, but I haven’t measured that
>>> yet.
>>>
>>> I am not sure how many use custom FSIndex, but currently the API doesn’t 
>>> really support very well the type of use cases that we are working with, so 
>>> this is a disappointment for us. Does UIMA 3 improve on this? We are still 
>>> on 2.x since we are awaiting the next major DKPro release with UIMA 3 
>>> because of dependencies.
>>>
>>> Thanks a lot and cheers,
>>> Mario
>>>
>>>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho  
>>>> wrote:
>>>>
>>>> On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
>>>>> The normal way to get the "binary search" kind of behavior is to get a 
>>>>> plain
>>>>> iterator over the sorted index, and then use the moveTo method, 
>>>>> specifying a
>>>>> target FS as the one to move to.  The target FS can be a "temporary" FS, 
>>>>> one
>>>>> that is never added to the indexes, itself; it is just used to supply 
>>>>> values
>>>>> used in the comparison.
>>>> Is there a way to do this using a "temporary" FS which does not take up 
>>>> CAS heap
>>>> space in UIMAv2?
>>>>
>>>> -- Richard
>

Re: How can I do efficient FSIndex lookup?

2019-09-06 Thread Marshall Schor

Please don't add the FS you're temporarily using as the argument for the
moveTo operation to the indexes.  (And of course, if you don't add it, you
won't need to remove it...)

If you describe your use case in a bit more detail, I can perhaps comment on
this more.

-Marshall

On 9/6/2019 2:50 AM, Mario Juric wrote:
> Hi,
>
> Thanks for responding.
>
> I tried with a temporary FS where the key value was set, but I got every
> annotation from the index, so that didn’t appear to change anything, and it
> also broke my unit tests immediately. I also stepped through the iterator
> implementation and found the construction of the iterator with an FS quite
> complex, so that went over my head without spending time to get a deeper
> understanding of the underlying index implementation. Therefore I tried with
> an indexed FS and this seemed to return the correct items, but it would be
> awkward having to add some FS to the index in order to retrieve something
> else and then having to remove the FS from the index again. I am now also in
> doubt about the insertion costs, but I haven’t measured that yet.
>
> I am not sure how many use custom FSIndex, but currently the API doesn’t 
> really support very well the type of use cases that we are working with, so 
> this is a disappointment for us. Does UIMA 3 improve on this? We are still on 
> 2.x since we are awaiting the next major DKPro release with UIMA 3 because of 
> dependencies.
>
> Thanks a lot and cheers,
> Mario
>
>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho  wrote:
>>
>> On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
>>> The normal way to get the "binary search" kind of behavior is to get a plain
>>> iterator over the sorted index, and then use the moveTo method, specifying a
>>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>>> that is never added to the indexes, itself; it is just used to supply values
>>> used in the comparison.
>> Is there a way to do this using a "temporary" FS which does not take up CAS 
>> heap
>> space in UIMAv2?
>>
>> -- Richard
>

Re: How can I do efficient FSIndex lookup?

2019-09-06 Thread Marshall Schor

Sorry, no. But you can make one "key" per CAS instance, and reuse it (you'll
need to keep some kind of a reference to it).

-Marshall

On 9/5/2019 5:42 PM, Richard Eckart de Castilho wrote:
> On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
>> The normal way to get the "binary search" kind of behavior is to get a plain
>> iterator over the sorted index, and then use the moveTo method, specifying a
>> target FS as the one to move to.  The target FS can be a "temporary" FS, one
>> that is never added to the indexes, itself; it is just used to supply values
>> used in the comparison.
> Is there a way to do this using a "temporary" FS which does not take up CAS 
> heap
> space in UIMAv2?
>
> -- Richard

Re: How can I do efficient FSIndex lookup?

2019-09-05 Thread Marshall Schor

Perhaps the use of a filtered iterator went in the wrong direction.

The normal way to get the "binary search" kind of behavior is to get a plain
iterator over the sorted index, and then use the moveTo method, specifying a
target FS as the one to move to.  The target FS can be a "temporary" FS, one
that is never added to the indexes, itself; it is just used to supply values
used in the comparison.

With this, you can "jump to" the nearest element (see the javadocs for the exact
definition of this).
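
To illustrate the idea without depending on the UIMA API, here is a small plain-Java analogy. A sorted list stands in for the sorted index, a throwaway probe entry stands in for the temporary FS, and a lower-bound binary search plays the role of moveTo. All names (SortedLookup, Entry, lowerBound, topValues) are hypothetical, and the ordering (key ascending, then value descending) is taken from the use case described in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortedLookup {

    // Stand-in for an indexed feature structure with a key and a value.
    static final class Entry {
        final String key;
        final double value;
        Entry(String key, double value) { this.key = key; this.value = value; }
    }

    // Index order: key ascending, then value descending (higher is better).
    static final Comparator<Entry> ORDER = (a, b) -> {
        int byKey = a.key.compareTo(b.key);
        return byKey != 0 ? byKey : Double.compare(b.value, a.value);
    };

    // Index of the first element that is not less than the probe.
    static int lowerBound(List<Entry> sorted, Entry probe) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (ORDER.compare(sorted.get(mid), probe) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Jumps to the first entry for the key, then reads at most n values.
    static List<Double> topValues(List<Entry> sorted, String key, int n) {
        // The probe is never added to the list; it only supplies comparison
        // values. Infinity makes it sort before every real entry of that key.
        Entry probe = new Entry(key, Double.POSITIVE_INFINITY);
        List<Double> out = new ArrayList<>();
        for (int i = lowerBound(sorted, probe);
             i < sorted.size() && out.size() < n
                 && sorted.get(i).key.equals(key); i++) {
            out.add(sorted.get(i).value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>(List.of(
            new Entry("a", 2.0), new Entry("b", 9.0), new Entry("b", 4.0),
            new Entry("a", 7.0), new Entry("c", 1.0), new Entry("b", 6.0)));
        entries.sort(ORDER);
        System.out.println(topValues(entries, "b", 2)); // prints [9.0, 6.0]
    }
}
```

Note that the probe has to carry a value for every field the ordering uses, not only the key; leaving the secondary field at an arbitrary default can position the search after the entries being looked for.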

Does this help?

When using UIMA version 3, the moveTo method can be made to ignore type
priorities in the ordering, which is what is wanted in many use cases.  See
http://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.select

-Marshall

On 9/4/2019 3:45 PM, Mario Juric wrote:
> Hi,
>
> I created a custom FSIndex for an annotation type in the hope of speeding up
> lookup based on one of its fields, but after some profiling I found to my
> surprise that this doesn’t appear to be what I get. I specified the index to
> be sorted according to two fields where the first is a key and the next is a 
> value field. After creating a filtered iterator with the key field as one of 
> the constraints I thought it would do a quick lookup to the first element in 
> the list that matches the key constraint, after all it’s sorted according to 
> that field, so I assume at least binary search is possible, but to my 
> surprise that is not what happens. It seems to simply iterate through all 
> elements and skips those that don’t match the constraint. There doesn’t seem 
> to be other ways I can do a more efficient jump to the first element in the 
> index and then stop iterating when the key no longer matches.
>
> I am somewhat baffled by this, and it appears to me I could have achieved the 
> same using a normal select with some simple filtering, which kinda makes the 
> FSIndex redundant. There is another way to obtain an iterator, which takes a 
> FeatureStructure, but I am not sure if that is more efficient, and does this 
> mean that you create FeatureStructures for the sole purpose of lookup into 
> the index? I would appreciate if someone could explain this to me, thanks! :)
>
> Cheers,
> Mario
>

[ANNOUNCE] Apache UIMA Java SDK version 3.1.0 released

2019-08-16 Thread Marshall Schor



The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 3.1.0. You can download it from here:
   
http://uima.apache.org/downloads.cgi

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

Version 3.1.0 is an incremental update for the version 3 branch of UIMA. It
includes fixes that get the Component Descriptor Editor Eclipse Plugin working
again with the latest release of the Eclipse IDE, as well as some other fixes.

The middle version digit was bumped up because the JCasGen component now
generates JCas classes for the FSArray type that include specific Java generic
typing, if the component type of the array is specified.

See the UIMA News item for more details here:
https://uima.apache.org/news.html

Please send feedback via the Apache UIMA project mailing lists.
 -Marshall Schor, for the Apache UIMA development team.

[ANNOUNCE] Apache UIMA 2.10.4 released

2019-08-09 Thread Marshall Schor



The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 2.10.4. You can download it from here:

   http://uima.apache.org/downloads.cgi

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

Version 2.10.4 is an incremental update for the version 2 branch of UIMA.  It
includes fixes that get the Component Descriptor Editor Eclipse Plugin working
again with the latest release of the Eclipse IDE, as well as some other fixes.

See the UIMA News item for more details here:
https://uima.apache.org/news.html

Please send feedback via the Apache UIMA project mailing lists.

-Marshall Schor, for the Apache UIMA development team.

Re: Build Error with Concept Mapper 2.10.2

2019-06-01 Thread Marshall Schor

This error happens (I'm reading from the stack trace) in

ConceptMapper's typeSystemInit method, on line (ConceptMapper.java:418)

Looking at that line in the source code:
resultAnnotationType = typeSystem.getType(resultAnnotationName);

The NPE is thrown if the value of the string "resultAnnotationName" is null.

This is, according to the source code, set from a "configuration parameter" 
named "ResultingAnnotationName".

So, I'm guessing your setup isn't specifying this configuration parameter.
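
As a generic illustration only (this is not ConceptMapper code), a missing parameter can be turned into a clearer error by checking it at initialization time. The sketch below models the configuration as a plain map; the class name RequiredParam and the sample value are hypothetical.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: fails fast with a descriptive message when a required
// configuration parameter has not been set.
class RequiredParam {

    static String require(Map<String, String> config, String name) {
        return Objects.requireNonNull(config.get(name),
            "missing required configuration parameter: " + name);
    }

    public static void main(String[] args) {
        // "org.example.Hit" is a made-up value for illustration.
        Map<String, String> config =
            Map.of("ResultingAnnotationName", "org.example.Hit");
        System.out.println(require(config, "ResultingAnnotationName"));
    }
}
```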

Let us know if specifying this and related features (see 
https://uima.apache.org/d/uima-addons-current/ConceptMapper/ConceptMapperAnnotatorUserGuide.html#configParams
) fixes this.

-Marshall Schor


On 5/31/2019 1:27 PM, Chinyere O. wrote:
> Hi,
>
> I’m new to UIMA and I’m trying to get the ConceptMapper to run. I’ve
> downloaded the ConceptMapper bin file, and added all the parameter files and
> the jar packages to the class path. Could someone help me figure out what I’m
> doing wrong?
>
> Thank you,
> Chinyere
>
> Error Log:
> 08:09:20.573 - 16: org.apache.uima.conceptMapper.Logger.log(46): INFO: 
> ConceptMapper INFO: Loading Dictionary from 
> file:/C:/Users/Chinyere/eclipse_workspace_2018/NYUW_PathReader/resources/PathologyBiomakers.xml
> 08:09:20.579 - 16: org.apache.uima.conceptMapper.Logger.log(46): INFO: 
> ConceptMapper INFO: Loading dictionary
> 08:09:20.671 - 16: org.apache.uima.jcas.impl.JCasImpl.reportInitErrors(809): 
> WARNING:
> JCas Type "org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" 
> implements getters and setters for feature "tokenType", but the type system 
> doesnt define that feature.
> JCas Type "org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" 
> implements getters and setters for feature "tokenClass", but the type system 
> doesnt define that feature.
>
> 08:09:20.672 - 16: org.apache.uima.jcas.impl.JCasImpl.reportInitErrors(809): 
> WARNING:
> JCas Type "org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" 
> implements getters and setters for feature "tokenType", but the type system 
> doesnt define that feature.
> JCas Type "org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" 
> implements getters and setters for feature "tokenClass", but the type system 
> doesnt define that feature.
>
> 08:09:20.682 - 16: org.apache.uima.conceptMapper.Logger.log(46): INFO: 
> ConceptMapper INFO: Finished loading 4 entries
> 08:09:20.682 - 16: org.apache.uima.conceptMapper.Logger.log(46): INFO: 
> ConceptMapper INFO: ...done loading dictionary from 
> file:/C:/Users/Chinyere/eclipse_workspace_2018/NYUW_PathReader/resources/PathologyBiomakers.xml
> 08:09:27.65 - 16: 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(434):
>  SEVERE: Exception occurred
> org.apache.uima.analysis_engine.AnalysisEngineProcessException
> at 
> org.apache.uima.conceptMapper.ConceptMapper.process(ConceptMapper.java:574)
> at 
> org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
> at 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
> at 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:318)
> at 
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
> at 
> org.apache.uima.tools.cvd.MainFrame.internalRunAE(MainFrame.java:1528)
> at 
> org.apache.uima.tools.cvd.MainFrame.runAE(MainFrame.java:430)
> at 
> org.apache.uima.tools.cvd.control.AnnotatorRerunEventHandler.actionPerformed(AnnotatorRerunEventHandler.java:40)
> at javax.swing.AbstractButton.fireActionPerformed(Unknown 
> Source)
> at javax.swing.AbstractButton$Handler.actionPerformed(Unknown 
> Source)
> at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown 
> Source)
> at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
> at javax.swing.AbstractButton.doClick(Unknown Source)
> at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown 
> Source)
> at 
> javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
> at java.awt.Component.processMouseEvent(Unknown Source)
> at javax.swing.JComponent.processMouseEvent(Unknown Source)
> at j

Re: JCasGen classes for V3 compiler warnings

2019-06-01 Thread Marshall Schor

Hi,

I'm guessing these warnings come from generated JCas classes?

I think that they all represent low priority things that are unlikely to get
improved in the near future, unless someone wants to work on these and submit
patches (always welcome :-) ). 

If you are so inclined, it would be good to submit Jira issues for the ones you
want to work on, so any discussions around how to improve these can be
documented in those Jiras.

Thank you for your interest.  -Marshall Schor


On 5/31/2019 1:01 PM, Hai-son X Nguyen wrote:
> Hello,
>
> I am testing migrating to UIMA V3 (3.0.2) and am seeing a variety of warnings 
> in Eclipse (2019-03, Java 8) using the plugin (v3.0.2).
>
> There are 3 types of warnings:
>
>   1.  “FSArray is a raw type. References to generic type FSArray should be 
> parameterized.”
>   2.  “The import XXX is never used”
>  *   java.lang.invoke.CallSite
>  *   java.lang.invoke.MethodHandle
>  *   and org.apache.uima.cas.impl.TypeSystemImpl)
>   3.  And “The constructor XXX() is deprecated”
>
> Is it appropriate to open a JIRA ticket for these problems?
>
> Thanks!
>

Re: close() never gets called on collection reader

2019-05-23 Thread Marshall Schor

Hi, can you please state which component you're referring to, and in which
project (e.g. UIMA Java SDK, UIMA-AS, uimaFIT, Ruta, etc.)?

Is the ArtifactProducer class the
org.apache.uima.collection.impl.cpm.engine.ArtifactProducer, or some other one?

Thanks. -Marshall

On 5/16/2019 6:26 PM, Benedict Holland wrote:
> Hi all,
>
> This might be a bug but the close method on the collection reader never
> gets called. I set up a breakpoint in an @override close function and it
> didn't trigger so I looked into the run queue that calls hasNext() called
> ArtifactProducer.run() and there isn't a collectionReader.close() command
> anywhere. If this is a bug, I can file a ticket and if not, can I get a
> fast example of how I can get close() called?
>
> I need to do a few things after the collection reader ends and it seems
> like close was the logical choice. The alternative is to something in
> hasNext() when the condition fails but I really don't like that. The close
> commands should all go in the close method.
>
> Thanks,
> ~Ben
>

Problem setting up a uimaFIT pipeline

2019-05-15 Thread Marshall Schor

Cross posted from stackoverflow:

https://stackoverflow.com/questions/56149592/jcas-type-timex3-used-in-java-code-but-was-not-declared-in-the-xml-type-d

Can someone see if the uimaFIT auto configure of the type system would work with
this setup, or if something else is needed?

-Marshall

Re: Eclipse ClassNotFoundException

2019-05-15 Thread Marshall Schor

Hi,

I wonder if the name of the annotator might be an issue.  In general, a class
name cannot have "." (periods) as part of the name - those are used to refer to
the package name, and the package name parts are set up as hierarchically nested
folders (each part corresponding to a segment of the name between periods,
except for the last segment, which is the class name).
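
As an illustration of the naming rule above, here is a minimal Java sketch of
how a fully qualified class name maps onto nested folders. The class and method
names (ClassNameToPath, toRelativeClassFile) are made up for this example; they
are not part of UIMA or Eclipse.

```java
public class ClassNameToPath {
    /**
     * Maps a fully qualified class name to the relative path of its .class file.
     * Every period separates two name segments; all segments except the last
     * are package folders, and the last one is the class name.
     */
    public static String toRelativeClassFile(String qualifiedName) {
        return qualifiedName.replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        // The annotator from this thread: package "annotators", class "SentenceSplittingAnnotator".
        System.out.println(toRelativeClassFile("annotators.SentenceSplittingAnnotator"));
    }
}
```

With this mapping, the class loader would look for the annotator in a folder
named "annotators" somewhere on the project's build path.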

If that's not the issue, another thing to check is whether or not the classpath
for the project has been set up.  The Component Descriptor Editor makes the
assumption that the place where you're defining the descriptor is an Eclipse
project, which has a "build path" (eclipse terminology) set up to include the
annotator class.

-Marshall

On 5/14/2019 4:52 PM, Benedict Holland wrote:
> Oh sorry. I typed it out rather than copying the error. The sml should read
> xml. The paths are correct and uima appears to find the annotator but it
> throws an error on save.
>
> On Tue, May 14, 2019 at 11:48 AM Andrew Trice  wrote:
>
>> Your file suffix is “sml"
>>
>>> On May 14, 2019, at 11:45 AM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>>> Hi All,
>>>
>>> I have a java annotator called annotators.SentenceSplittingAnnotator with
>>> an associated xml descriptor. When I define the "Name of the Java class
>>> file", I browse and select "annotators.SentenceSplittingAnnotator". I then
>>> go to save the xml file and I get the following error:
>>>
>>> The Descriptor is invalid for the following reason:
>>> ResourceInitilizationException: the class
>>> annotators.SentenceSplittingAnnotator could not be found. (Descriptor:
>>> file:/Sentencesplittingannotator.sml) caused by: ClassNotFoundException:
>>> annotators.SentenceSplittingAnnotator cannot be found by
>>> org.apache.uima.runtime_3.0.2.
>>>
>>> I updated to the latest version of uima 3.0.2 and eclipse to 2019-03. Any
>>> ideas what could be causing the problem?
>>>
>>> Thanks,
>>> ~Ben
>>

Re: What is the status of the CVD viewer?

2019-05-02 Thread Marshall Schor

Hi,

I don't think there's current work being done on this tool, but patches are
always welcome!

-Marshall

On 5/1/2019 6:29 AM, Rune Stilling wrote:
> Hi
>
> I just got the CAS Editor up and running. It seems that this tool is more 
> actively being developed. Is that so?
>
> /Rune
>
>> On 8 Oct 2018, at 17:48, Marshall Schor wrote:
>>
>> One alternative that may be useful is the DocumentAnalyzer. 
>> https://uima.apache.org/d/uimaj-current/tools.html#ugr.tools.doc_analyzer
>>
>> Patches welcome :-)
>>
>> -Marshall
>>
>> On 10/8/2018 11:27 AM, Rune Stilling wrote:
>>> Hi list
>>>
>>> We are using the CVD-viewer to view rather complex annotation documents but
>>> have stumbled upon some problems.
>>>
>>> First of all, scrolling in the bottom left annotation pane is not possible on
>>> a Mac. The scroll bar simply never shows up and moving the cursor downwards
>>> doesn’t move the contents. This makes the viewer very limited in use.
>>>
>>> Secondly I really miss a search function in the text view especially, so
>>> that it would be possible to look up specific words.
>>>
>>> Is the tool still actively being developed at all? Aren’t people using it,
>>> and if not, then how do they analyze their results? Just by looking at the
>>> cas.xmi file or?
>>>
>>> Best,
>>> Rune
>

[ANNOUNCE] Apache UIMA 3.0.2 released

2019-04-11 Thread Marshall Schor

The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 3.0.2. You can download it from here:

   http://uima.apache.org/downloads.cgi

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

Version 3.0.2 is an incremental update for version 3.0.1, including better
backwards binary compatibility for UIMA version 2 (permitting migration
without recompiling user source code (except for JCas classes)),
and bug fixes (including some edge cases for the Annotation subiterator).

See the UIMA News items for more details, here:
https://uima.apache.org/news.html

Please send feedback via the Apache UIMA project mailing lists.

-Marshall Schor, for the Apache UIMA development team.

Re: Semantic search File missing

2019-03-19 Thread Marshall Schor

Hi,

I believe the semantic search you found references to is no longer available.

You might find a similar capability in the Apache Solr / Lucene projects.

-Marshall

On 3/19/2019 2:16 AM, Yash Kanojia wrote:
> Hello,
>
> I am building a semantic search engine with the help of UIMA 3.0.1. When
> referring to the documentation of UIMA I came across the descriptor file of
> semantic search. On checking the file on my system I was unable to find it.
> Kindly help me with this.
> Can you please tell me where will I get this file? Or I need to find it
> somewhere else?

Re: Uima and spring

2019-02-18 Thread Marshall Schor

Hi Sarah,

I don't have knowledge of DKPro or Spring, but here's some general guidance,
which may (or may not) be of use :-).

External Resources are associated with a Resource Manager instance.
Try figuring out how to have one Resource Manager instance be reused for
multiple JCas instances.

Also, try not to have more JCas instances than you need to keep all
the CPU "cores" in your host busy.
Instead of creating one new JCas instance per piece of work, reuse existing
instances by calling myJCasInstance.reset() and then using them again.
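
The reuse idea can be sketched in plain Java as a small, fixed-size pool. This
is only an illustration under stated assumptions: Workspace is a hypothetical
stand-in for a JCas (its reset() plays the role of myJCasInstance.reset()), and
ReusePool is not a UIMA class. The sketch covers only the reuse of instances,
not the sharing of a Resource Manager.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ReusePool {
    /** Hypothetical stand-in for an expensive container such as a JCas. */
    public static final class Workspace {
        public static final AtomicInteger CREATED = new AtomicInteger();
        final StringBuilder text = new StringBuilder();

        Workspace() { CREATED.incrementAndGet(); }

        /** Clears per-document state; plays the role of myJCasInstance.reset(). */
        void reset() { text.setLength(0); }
    }

    private final BlockingQueue<Workspace> idle;

    public ReusePool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new Workspace());
        }
    }

    /** Borrows a workspace, does the work, then resets it and hands it back. */
    public int process(String document) {
        Workspace ws;
        try {
            ws = idle.take(); // blocks while every workspace is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        try {
            ws.text.append(document); // stands in for the real analysis
            return ws.text.length();
        } finally {
            ws.reset();
            idle.add(ws);
        }
    }

    public static void main(String[] args) {
        // One workspace per core, as suggested above.
        int cores = Runtime.getRuntime().availableProcessors();
        ReusePool pool = new ReusePool(cores);
        for (int i = 0; i < 1000; i++) {
            pool.process("document " + i);
        }
        System.out.println("workspaces created: " + Workspace.CREATED.get()
                + " for 1000 documents");
    }
}
```

The number of workspaces created stays at the pool size no matter how many
documents are processed, which is the effect the advice above is after.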

Hopefully others with specific knowledge may comment also.

-Marshall

On 2/18/2019 6:48 AM, Sarah wrote:
> Hi,
>
> I am using uimafit annotators in a spring component. These annotators use 
> external resources. These resources are currently produced for every JCas 
> even though the Aggregate Engine is created inside of the Spring component's 
> init and merely the process method is called on the individual JCas objects. 
> This slows my system down.
> How do I handle external resources appropriately in a spring component. I 
> found the SpringContextResourceManager but I don’t know how to use it. Can 
> you point me to an example where e.g. the DKPro CoreNLP Annotators are used 
> in a spring context?
>
> All the best,
> Sarah
>
>
>

Re: How to get TextRuler to work

2019-02-14 Thread Marshall Schor

Hi Mandy,
...
>> By the way, in case anybody wants to pick up maintaining TextRuler
>> again, I would suggest to improve a bit on error handling here.
>
> You can also help :-)

A big +1 to this.  This is frequently how new contributors become involved in
open-source efforts.

And specific suggestions for improving error handling (including patches, even)
are certainly welcome :-).


> Just by creating a bug report in our Jira, you will put it on my todo
> list. Also if I do not find the time to fix it, maybe others will.
>
> https://issues.apache.org/jira/browse/UIMA-5987?jql=component%20%3D%20ruta

Cheers. -Marshall

>

Re: Issues with Ruta workbench (Permission Denied and wrong output view)

2019-02-06 Thread Marshall Schor

hi,

I'm not an expert, but I'm guessing that there still is a permissions issue,
perhaps on a different file or directory than the one you checked.

Try having someone else take a look at your stack trace / error message, and
your file system permissions.  A second pair of eyes often is helpful (I speak
from personal experience).

Cheers. -Marshall

On 2/6/2019 5:44 AM, Mandy Neumann wrote:
> Hi all,
>
> I'm just starting to get familiar with UIMA Ruta and the workbench, and I'm
> having some strange issues.
>
> I got a project from a co-worker who already prepared some scripts for me to
> extend. The project has .html files in the input folder, and he already
> provided a Ruta script to convert HTML markup into annotations. The script is
> adapted from the Ruta manual:
>
>> ENGINE utils.HtmlAnnotator;
>> ENGINE utils.HtmlConverter;
>> ENGINE HtmlViewWriter;
>> TYPESYSTEM utils.HtmlTypeSystem;
>> TYPESYSTEM utils.SourceDocumentInformation;
>>
>> Document{->CONFIGURE(HtmlAnnotator, "onlyContent"=true), EXEC(HtmlAnnotator,
>> {TAG})};
>>
>> Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView",
>>     "outputView" = "plain", "expandOffsets"=false, "replaceLinebreaks"=true,
>> "skipWhitespacs"=true, "linebreakReplacement"=" ", "processAll"=true),
>>   EXEC(HtmlConverter)};
>>
>> Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",
>>     "outputView" = "_InitialView", "output" = "../converted"),
>>     EXEC(HtmlViewWriter)};
>
> On my machine and with my settings, when I run this script, my console get
> spammed with org.apache.uima.analysis_engine.AnalysisEngineProcessExceptions
> caused by java.io.FileNotFoundException
>  with the message "../converted (Permission denied)". I checked the file
> permissions on this directory which were 775 - I even chmodded to 777 but
> still the same issue.
>
> In spite of all these exceptions, the output still gets generated, though. I
> would be fine with it if there weren't another issue - although the script
> should write the annotations into _InitialView, I need to change the view to
> "plain" in the editor to get plain text with HTML annotations. The
> _InitialView still shows the html markup.
>
> I think both issues are related. Any ideas?
>
> Cheers,
>
> Mandy
>
>
> System Info: eclipse Oxygen.3a Release (4.7.3a), UIMA Ruta workbench 2.6.1, OS
> Kubuntu 18.04
>
>

Re: Is it possible to define dynamically typed annotations?

2018-12-15 Thread Marshall Schor

I guess the question is why have a new type?  The answer to that could motivate
what properties the solution should have.

What you propose is fine, but in some ways is not a new type, in that it doesn't
seem to have many of the properties UIMA types have. 

    If that is OK in your application, then that's fine. 

    If not, please say more about what properties of types you want to have,
    that this approach might not satisfy.

Here are some examples of what having types provides:

1) a type hierarchy - subtypes have features inherited from super types.

2) a way to have "indexes" which provide access to a type (and its subtypes)
instances in the CAS.

3) a way to have getters / setters, with special versions for "array" types that
give access to elements

4) for some types, a way to "order" them in the CAS.  For instance, if a type is
a subtype of "Annotation", it gets (via inheritance) a begin and end "feature",
and there's a built-in index that is sorted, making use of these features (and
also making use of "type priority" ordering).

Note that if you don't need this for your types, then they should *not* be
subtypes of Annotation.

5) a way to serialize / deserialize (in several formats) for storage and
transmission (for instance, when some annotators in a pipeline are remote
services). 

Your suggestion to have a general Map might be an issue for
serialization / deserialization.

-Marshall

On 12/15/2018 7:20 AM, Alain Désilets wrote:
> Is it possible to create dynamically typed annotations in UIMA? In other
> words, would it be possible for users of my system to create a new type of
> annotation without having to recompile the Java code?
>
> I need this functionality so that non-dev users can define new types of
> Named Entities and train a model that can recognize them without having to
> recompile the code.
>
> I suspect the answer is no, because all annotation types correspond to a
> Java class. True, those classes are defined in an XML file, but in order to
> use them you have to generate the Java code from the XML and recompile your
> code.
>
> If UIMA does not yet have something that supports dynamic annotations, I
> will have to implement one myself. What I have in mind is to define a
> sub-class of Annotation called say, DynamicallTypedAnnotation, which would
> have two new member variables:
>
> String typeName = null;
> Map attributes = new HashMap();
>
> The 'typeName' variable would correspond to the type of the annotation (ex:
> "Room Number" for an annotation that captures the number of a room) and the
> 'attributes' variable would allow storage of arbitrary information about
> the annotation.
>
> Does that make sense?
> Thx
>

[ANNOUNCE] Apache UIMA 2.10.3 and Apache UIMA 3.0.1 released

2018-11-29 Thread Marshall Schor

The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 2.10.3, and Apache UIMA Java SDK version 3.0.1. 

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

Version 2.10.3 is an incremental drop-in update for the 2.x.x series, but has 
been updated to require Java 8 as a minimum.

Version 3.0.1 is an incremental update for version 3.0.0, including numerous 
fixes to the Select framework to fix bugs and align it with uimaFIT 
implementations.

See the UIMA News items for more details, here:
https://uima.apache.org/news.html

Please send feedback via the Apache UIMA project mailing lists.

-Marshall Schor, for the Apache UIMA development team.

Re: JCasGen like utility for generating POJO Classes in the type system

2018-10-15 Thread Marshall Schor

ok, sorry I didn't understand before :-). 

We currently don't have such a facility. 

But it might not take too much to add this alternative to the existing JCasGen.

-Marshall


On 10/13/2018 6:31 PM, Amit Paradkar wrote:
> Thanks Marshall.
> I would like to construct objects in the typesystem that I have defined
> without having to pass in a cas object in the constructor (after I have
> detected features for a particular class in the text)
>
> e.g., I have a Database type defined in my UIMA type descriptor, based on
> which JCasGen generates a JCas cover class named Database.
> But all the public constructors in this class take a JCas object as an
> argument, which precludes an independent construction of instances of these
> classes.
> I need to be able to construct instances of such classes independent of
> any CAS objects, since I need them to persist even after the current
> input text has been processed completely. So, almost a mirror type system is
> desired. Perhaps I am not using the notion of a type system appropriately...
>
>
>
>
> On 2018/10/13 14:01:25, Marshall Schor  wrote:
>> hmmm, I guess you've looked at jcasGen.
>>
>> I'm not understanding how the POJO class you're thinking of, differs from
>> the one generated by JCasGen?
>>
>> -Marshall
>>
>> On 10/12/2018 8:20 AM, Amit Paradkar wrote:
>>> I would like to generate pojo classes corresponding to the classes defined
>>> in my type descriptor. Is there a JcasGen like utility which takes a type
>>> system descriptor and generates both the UIMA and POJO types? (I could live
>>> with just one typesystem classes as long as I can construct instances
>>> without the Cas object being passed in). Unfortunately the protected
>>> constructor generated in current UIMA classes is not invokeable.
>>> Thanks.

Re: JCasGen like utility for generating POJO Classes in the type system

2018-10-13 Thread Marshall Schor

hmmm, I guess you've looked at jcasGen.

I'm not understanding how the POJO class you're thinking of, differs from the
one generated by JCasGen?

-Marshall

On 10/12/2018 8:20 AM, Amit Paradkar wrote:
> I would like to generate pojo classes corresponding to the classes defined
> in my type descriptor. Is there a JcasGen like utility which takes a type
> system descriptor and generates both the UIMA and POJO types? (I could live
> with just one typesystem classes as long as I can construct instances
> without the Cas object being passed in). Unfortunately the protected
> constructor generated in current UIMA classes is not invokeable.
> Thanks.
>

Re: What is the status of the CVD viewer?

2018-10-09 Thread Marshall Schor

It's in svn: https://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-tools/

cd to some writable directory, then run:

   svn checkout https://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-tools/ uimaj-tools

If you're using Eclipse as your ide, you can then import "existing Maven
projects" and point to the directory where you checked it out.

Cheers. -Marshall

On 10/8/2018 3:54 PM, Rune Stilling wrote:
> Our pipeline takes a long time to run so it’s not practical to use this tool.
>
> Where can I find the source code for the CVD application?
>
> Best,
> Rune
>
>> On 8 Oct 2018, at 17:48, Marshall Schor wrote:
>>
>> One alternative that may be useful is the DocumentAnalyzer. 
>> https://uima.apache.org/d/uimaj-current/tools.html#ugr.tools.doc_analyzer
>>
>> Patches welcome :-)
>>
>> -Marshall
>>
>> On 10/8/2018 11:27 AM, Rune Stilling wrote:
>>> Hi list
>>>
>>> We are using the CVD-viewer to view rather complex annotation documents but
>>> have stumbled upon some problems.
>>>
>>> First of all, scrolling in the bottom left annotation pane is not possible on
>>> a Mac. The scroll bar simply never shows up and moving the cursor downwards
>>> doesn’t move the contents. This makes the viewer very limited in use.
>>>
>>> Secondly I really miss a search function in the text view especially, so
>>> that it would be possible to look up specific words.
>>>
>>> Is the tool still actively being developed at all? Aren’t people using it,
>>> and if not, then how do they analyze their results? Just by looking at the
>>> cas.xmi file or?
>>>
>>> Best,
>>> Rune
>

Re: What is the status of the CVD viewer?

2018-10-08 Thread Marshall Schor

One alternative that may be useful is the DocumentAnalyzer. 
https://uima.apache.org/d/uimaj-current/tools.html#ugr.tools.doc_analyzer

Patches welcome :-)

-Marshall

On 10/8/2018 11:27 AM, Rune Stilling wrote:
> Hi list
>
> We are using the CVD-viewer to view rather complex annotation documents but
> have stumbled upon some problems.
>
> First of all, scrolling in the bottom left annotation pane is not possible on
> a Mac. The scroll bar simply never shows up and moving the cursor downwards
> doesn’t move the contents. This makes the viewer very limited in use.
>
> Secondly I really miss a search function in the text view especially, so that
> it would be possible to look up specific words.
>
> Is the tool still actively being developed at all? Aren’t people using it,
> and if not, then how do they analyze their results? Just by looking at the
> cas.xmi file or?
>
> Best,
> Rune

Re: CAS and Serialization on Emoji codes

2018-09-05 Thread Marshall Schor

Hi, could you post a stack trace of the failure, so we can see the path
between the JMSException and the call to addMessage(msg)?

-Marshall


On 9/5/2018 9:50 AM, Yuqi Zhang wrote:
> Dear UIMA experts,
>
> I need to process a String including an emoji (
> https://www.iemoji.com/view/emoji/2/smileys-people/smiling-face-with-smiling-eyes
> ).
> I put the string "This is a " in a CAS, and sendCAS(cas) to a remote
> server.
> But it failed at addMessage(msg) at line 971 in class
> BaseUIMAAsynchronousEngineCommon_impl with the error message:
>
> javax.jms.JMSException: Failed to build body from content. Serializable
> class not available to broker. Reason: java.lang.ClassNotFoundException:
> Forbidden class org.xml.sax.SAXParseException! This class is not trusted to
> be serialized as ObjectMessage payload.
>
>
> When I check the serialization result of the cas in the msg, I see the 
> is encoded as "".
> Is that the reason this CAS sent failed?
> Because this emoji  can be processed without any problem in my another
> codes where calls the sendAndReceiveCAS(). The serialization result there
> is "".
> How does it happen?
> Besides the sofa content, is there any other factors to affect the
> serialization result?
>
> I am new to UIMA. I have read the UIMA references about the
> serialization and cas sections. But still have no idea how I could make 
> surely serialized into ""
>
> Many thanks for any feedback!
> Best regards,
> Yuqi Zhang
>

Fwd: DUCC Job does not work on any other language except English

2018-08-07 Thread Marshall Schor


-------- Forwarded Message --------
Subject: Re: DUCC Job does not work on any other language except English
Date: Mon, 6 Aug 2018 18:29:22 -0400
From: Eddie Epstein
To: Marshall Schor



Sorry, I meant to say: Clearly this needs to be fixed in the next DUCC release,
with a patch made available sooner. Thanks for the info!

Eddie

On Mon, Aug 6, 2018 at 6:15 PM, Marshall Schor wrote:

I guess I don't really understand what you mean?  can you elaborate?

-Marshall


On 8/4/2018 10:00 AM, Eddie Epstein wrote:
> Hi Rohit,
>
> Hopefully this is something fairly easy to fix. Thanks for the information.
>
> Eddie
>
> On Thu, Aug 2, 2018 at 2:46 AM, Rohit Yadav wrote:
>
>> Hi,
>>
>> I've tried running a DUCC job for various languages, but all the content is
>> replaced by question marks.
>>
>> But for English it works fine. I was wondering whether this is a problem in
>> the configuration of DUCC.
>>
>> Any idea about this?
>>
>> Best,
>>
>> Rohit
>>
>>

Re: run existing AE instance on different view

2018-07-09 Thread Marshall Schor

Hi,

Is anything in
https://uima.apache.org/d/uimaj-2.10.2/tutorials_and_users_guides.html#ugr.tug.mvs.name_mapping_application
helpful?

If not, could you add some details that says why not?

-Marshall


On 7/5/2018 8:52 AM, Jens Grivolla wrote:
> Hi,
>
> I'm trying to run an already instantiated AE on a view other than
> _InitialView. Unfortunately, I can't just call process() on the desired
> view, as there is a call to Util.getStartingView(...)
> in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.
>
> The view mapping methods I found (e.g. using an AggregateBuilder) work on
> AE descriptions, so I would need to create additional instances (with the
> corresponding memory overhead). Is there a way to remap/rename the views in
> a JCas before calling process() so that the desired view is seen as the
> _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
> for this, but it doesn't feel quite right.
>
> Best,
> Jens
>

Re: Dynamically bind resources to AnalysisEngine

2018-04-11 Thread Marshall Schor

Hi,

I don't know about DKPro, so someone more familiar with its conventions could
respond.

UIMA supports a decoupling of resources, shared among annotators running in some
pipeline.  I'm guessing you're asking about this mechanism,  but before
proceeding, there's nothing preventing you from implementing an annotator (let's
call it the spelling corrector annotator) which could load a dictionary (let's
say, specified by a configuration parameter), and then have some mechanism to
"reload it", if it changes.

This link in the UIMA Reference manual describes Resources:
https://uima.apache.org/d/uimaj-2.10.2/references.html#ugr.ref.resources

See also the Javadocs for SharedResourceObject
https://uima.apache.org/d/uimaj-2.10.2/apidocs/org/apache/uima/resource/SharedResourceObject.html

These have a "load" method which the user is supposed to implement to cause the
resource to be "loaded".  Typically, if the resource, for example, implements a
hashmap, the load might read some external file and initialize the hashmap from
that.

The implementation of the load method is the responsibility of the resource
implementer. UIMA will instantiate the resource class, and call the load method,
once.

One possibility would be to have your spelling annotator check "every so often"
to see if the on-disk version has changed, and if so, call the load method
again.  If you consider doing this, remember that your annotator might (in some
deployments) be "scaled up" in multiple Java threads, so you might need to do
this under a synchronization lock.
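
A minimal plain-Java sketch of this "check, then reload under a lock" idea
follows. It is an illustration only: ReloadableResource, versionProbe and loader
are hypothetical names, not part of UIMA, and the sketch checks the version on
every call rather than "every so often". In a real annotator the loader would
be whatever your resource's load logic does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class ReloadableResource<T> {
    private final LongSupplier versionProbe; // e.g. the file's last-modified time
    private final Supplier<T> loader;        // re-reads the resource from its source
    private long loadedVersion;
    private boolean loaded;
    private T value;

    public ReloadableResource(LongSupplier versionProbe, Supplier<T> loader) {
        this.versionProbe = versionProbe;
        this.loader = loader;
    }

    /**
     * Returns the resource, reloading it first if the source has changed.
     * The method is synchronized so that several annotator threads cannot
     * reload at the same time or observe a half-updated state.
     */
    public synchronized T get() {
        long current = versionProbe.getAsLong();
        if (!loaded || current != loadedVersion) {
            value = loader.get();
            loadedVersion = current;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical vocabulary file standing in for the on-disk dictionary.
        Path file = Files.createTempFile("vocabulary", ".txt");
        Files.write(file, Arrays.asList("uima", "ruta"));

        ReloadableResource<Set<String>> vocabulary = new ReloadableResource<Set<String>>(
                () -> file.toFile().lastModified(),
                () -> {
                    try {
                        return new HashSet<>(Files.readAllLines(file));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });

        System.out.println(vocabulary.get().contains("uima"));
    }
}
```

Note that file timestamps can be coarse, so two writes in quick succession may
not be detected; a content hash or an explicit version counter would be a more
robust probe.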

Does this help?  There may be more conventions / built-in ways that DKPro has
for this scenario.

Cheers. -Marshall

On 4/11/2018 9:54 AM, Hugues de Mazancourt wrote:
> Hello,
>
> Is there a way to dynamically bind/update resources for an AnalysisEngine ?
> My use-case is : I build a query parser that will be used to retrieve 
> information in an indexed text database.
> The parser performs spelling correction, but doesn't have to consider words 
> in the index as spelling mistakes. Thus, the (aggregate) engine is bound to 
> the index vocabulary (ie a word list).
> My point is : when the index gets updated, its vocabulary will also be 
> updated. I can re-build a new aggregate parser, with the updated resource, 
> but this takes time, mainly for loading resources that were already loaded 
> (POS model, lexica, etc.). Is there a way to update a given resource on my 
> parser without having to rebuild it ?
>
> Thanks for your help,
> PS: I'm mostly building on top of DKPro components. I may miss some basic 
> UIMA mechanisms
> Hugues de Mazancourt
> Mazancourt Conseil
>
> E: hug...@mazancourt.com
> P: +33-6 72 78 70 33
> W: http://www.mazancourt.com
>
>

Re: UIMAfit CpeBuilder not compatible with CasMultipliers

2018-04-06 Thread Marshall Schor

:-)  isn't open source great?  -Marshall


On 4/5/2018 7:21 AM, Erik Fäßler wrote:
> Let me answer my own question.
> The CpeBuilder only unwraps the *first* AAE level. Thus, I just had to wrap
> my CasMultiplier twice and then it works.
> Sorry to disturb :-)
>
>> On 5. Apr 2018, at 09:09, Erik Fäßler  wrote:
>>
>> Hi all,
>>
>> I wanted to use a CPE containing an AAE with a CASMultiplier with the 
>> UIMAfit CpeBuilder. But it wouldn’t work, the Multiplier would be iterated 
>> through but its CASes wouldn’t be forwarded to downstream AEs.
>> I learned that the CpeBuilder unwraps AAEs to gain access to their 
>> parameters. Unfortunately, this breaks CasMultiplier compatibility. CPEs 
>> cannot handle CasMultipliers directly embedded into them (cf.
>> https://uima.apache.org/d/uimaj-2.10.2/tutorials_and_users_guides.html#ugr.tug.cm.using_cm_in_cpe).
>> An AAE wrapping is required.
>> When using SimplePipeline.run(), everything works as expected. 
>> Unfortunately, SimplePipeline does not support multithreading.
>> Is there an easy way to construct a CPE with multithreading support AND 
>> CasMultipliers?
>>
>> Cheers,
>>
>> Erik
>

Re: newbie question: can't get annotationViewer.sh working...

2018-04-06 Thread Marshall Schor

Hi, sorry you're having troubles.

The error message seems to indicate a failure of the viewer to be able to read
the results.
Here are some things to look at.

When you run the document analyzer, it should put up a "configuration" screen
that has a field for an OutputDirectory.  Can you check to see that field
specifies some directory which is writable by you?

After running the document analyzer (which, from your note, appears to run OK),
instead of trying to view the analysis results, can you go to that
OutputDirectory and see what's in it?

It should have 8 files of type ".xmi"  (not .xml).

Open the New_IBM_Fellows.txt.xmi file, for example. 

It should have a long line, having something like this:

<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
xmlns:examples="http:///org/apache/uima/examples.ecore"
xmlns:tutorial="http:///org/apache/uima/tutorial.ecore"
xmlns:tcas="http:///uima/tcas.ecore"
xmlns:cas="http:///uima/cas.ecore"
xmlns:tokenizer="http:///org/apache/uima/examples/tokenizer.ecore"
xmi:version="2.0">



etc.

Let us know if you can see this data.
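
If it is easier than browsing by hand, the directory check can also be scripted. This is only a minimal sketch using the plain JDK: the class XmiOutputCheck and its methods are made up for illustration and are not part of UIMA, and the directory you pass in is whatever you entered as the OutputDirectory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of UIMA): lists the ".xmi" files in a directory.
final class XmiOutputCheck {

    // Keeps only names ending in ".xmi" (so "foo.txt.xml" is rejected), sorted.
    static List<String> xmiNames(Collection<String> fileNames) {
        return fileNames.stream()
                .filter(name -> name.endsWith(".xmi"))
                .sorted()
                .collect(Collectors.toList());
    }

    // Reads the file names in outputDir and filters them with xmiNames.
    static List<String> listXmi(Path outputDir) {
        try (Stream<Path> entries = Files.list(outputDir)) {
            List<String> names = entries
                    .map(p -> p.getFileName().toString())
                    .collect(Collectors.toList());
            return xmiNames(names);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the OutputDirectory from the configuration screen as the argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> found = listXmi(dir);
        System.out.println(found.size() + " .xmi file(s) in " + dir.toAbsolutePath());
        found.forEach(System.out::println);
    }
}
```

For the bundled examples the count should come out as 8; a count of 0, or only ".xml" names in the directory, would point at the output step rather than the viewer.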

-Marshall

On 4/5/2018 6:47 PM, Andrew Logue wrote:
> Hello,
>
> I'd like to evaluate UIMA for an upcoming project but am having problems
> trying the examples with annotationViewer.sh.
>
> It is throwing an error dialog: "java.lang.NumberFormatException: null" when I
> double click on any of the sample files in the Analyzed
> Documents window (for example, New_IBM_Fellows.xml).
>
> I'll paste the partial error message below but is there anywhere else for me
> to look and try to narrow down the problem on my system?
>
> I'm using Arch Linux with Sun Java version "1.8.0_162".
>
> I am able to build from source but am also having the same problem with the
> binary tarballs.  (2.10.2, 2.10.3, and 3.0.0)
>
> [alogue@freedom bin]$ ./annotationViewer.sh
> java.lang.NumberFormatException: null
>     at java.lang.Integer.parseInt(Integer.java:542)
>     at java.lang.Integer.parseInt(Integer.java:615)
>     at
> org.apache.uima.cas.impl.XmiSerializationSharedData.addOutOfTypeSystemElement(XmiSerializationSharedData.java:202)
>     at
> org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.addToOutOfTypeSystemData(XmiCasDeserializer.java:2015)
>     at
> org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.readFS(XmiCasDeserializer.java:519)
>     at
> org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.startElement(XmiCasDeserializer.java:435)
>     at
> org.apache.uima.util.XmlCasDeserializer$XmlCasDeserializerHandler.startElement(XmlCasDeserializer.java:148)
>     at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
>     at
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:374)
>     at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2784)
>     at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
>     at
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
>     at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
>     at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
>     at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> 
>
> Regards,
>
> Andrew.
>
>

Re: UIMA Installation Problem

2018-03-06 Thread Marshall Schor

Hi, I replied to your posting on the uima-dev list.

-Marshall


On 3/6/2018 4:13 AM, Debbie Zhang wrote:
> Hi,
>
> We updated our Eclipse and reinstalled UIMA recently. However, after the new 
> installation, when I tried to use Component Descriptor Editor to open an 
> "Analysis Engine Descriptor" xml file, I get the following errors:
>
> Does anyone know how to fix it? Thank you.
>
> Regards,
> Debbie 
>
> ::
> Error message:
>  
> Plug-in org.apache.uima.desceditor was unable to load class 
> org.apache.uima.taeconfigurator.editors.MultiPageEditor.
>  
> org.eclipse.core.runtime.CoreException: Plug-in org.apache.uima.desceditor 
> was unable to load class 
> org.apache.uima.taeconfigurator.editors.MultiPageEditor.
> at 
> org.eclipse.core.internal.registry.osgi.RegistryStrategyOSGI.throwException(RegistryStrategyOSGI.java:194)
> at 
> org.eclipse.core.internal.registry.osgi.RegistryStrategyOSGI.createExecutableExtension(RegistryStrategyOSGI.java:176)
> at 
> org.eclipse.core.internal.registry.ExtensionRegistry.createExecutableExtension(ExtensionRegistry.java:905)
> at 
> org.eclipse.core.internal.registry.ConfigurationElement.createExecutableExtension(ConfigurationElement.java:243)
> at 
> org.eclipse.core.internal.registry.ConfigurationElementHandle.createExecutableExtension(ConfigurationElementHandle.java:55)
> at 
> org.eclipse.ui.internal.WorkbenchPlugin$1.run(WorkbenchPlugin.java:291)
> at 
> org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
> at 
> org.eclipse.ui.internal.WorkbenchPlugin.createExtension(WorkbenchPlugin.java:286)
> at 
> org.eclipse.ui.internal.registry.EditorDescriptor.createEditor(EditorDescriptor.java:235)
> at 
> org.eclipse.ui.internal.EditorReference.createPart(EditorReference.java:329)
> at 
> org.eclipse.ui.internal.e4.compatibility.CompatibilityPart.createPart(CompatibilityPart.java:278)
> at 
> org.eclipse.ui.internal.e4.compatibility.CompatibilityEditor.createPart(CompatibilityEditor.java:63)
> at 
> org.eclipse.ui.internal.e4.compatibility.CompatibilityPart.create(CompatibilityPart.java:316)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.eclipse.e4.core.internal.di.MethodRequestor.execute(MethodRequestor.java:55)
> at 
> org.eclipse.e4.core.internal.di.InjectorImpl.processAnnotated(InjectorImpl.java:966)
> at 
> org.eclipse.e4.core.internal.di.InjectorImpl.processAnnotated(InjectorImpl.java:931)
> at 
> org.eclipse.e4.core.internal.di.InjectorImpl.inject(InjectorImpl.java:151)
> at 
> org.eclipse.e4.core.internal.di.InjectorImpl.internalMake(InjectorImpl.java:375)
> at 
> org.eclipse.e4.core.internal.di.InjectorImpl.make(InjectorImpl.java:294)
> at 
> org.eclipse.e4.core.contexts.ContextInjectionFactory.make(ContextInjectionFactory.java:162)
> at 
> org.eclipse.e4.ui.internal.workbench.ReflectionContributionFactory.createFromBundle(ReflectionContributionFactory.java:105)
> at 
> org.eclipse.e4.ui.internal.workbench.ReflectionContributionFactory.doCreate(ReflectionContributionFactory.java:74)
> at 
> org.eclipse.e4.ui.internal.workbench.ReflectionContributionFactory.create(ReflectionContributionFactory.java:56)
> at 
> org.eclipse.e4.ui.workbench.renderers.swt.ContributedPartRenderer.createWidget(ContributedPartRenderer.java:129)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.createWidget(PartRenderingEngine.java:975)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.safeCreateGui(PartRenderingEngine.java:651)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.safeCreateGui(PartRenderingEngine.java:757)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.access$0(PartRenderingEngine.java:728)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$2.run(PartRenderingEngine.java:722)
> at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
> at 
> org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.createGui(PartRenderingEngine.java:706)
> at 
> org.eclipse.e4.ui.workbench.renderers.swt.StackRenderer.showTab(StackRenderer.java:1317)
> at 
>

Apache UIMA Java sdk 3.0.0 released

2018-03-05 Thread Marshall Schor

The Apache UIMA team is pleased to announce the release of the Apache UIMA Java
SDK, version 3.0.0.  This is the first release of a major re-implementation of
the UIMA Java SDK, aligning it with Java 8 and high performance multi-core
processors.

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

This release is a major rewrite of the internals of core UIMA, and
includes many new features, including:
 -- support for arbitrary Java objects in the CAS
 -- New semi-built-in UIMA types: FSArrayList, FSHashSet, IntegerArrayList,
Int2FS map
 -- New "select" framework integrated with Java 8 Streams
 -- Elimination of concurrent modification exception
  while iterating over UIMA indexes
 -- Automatic Garbage Collection of unreferenced Feature Structures
 -- All around better integration into Java 8 idioms and generic typing

See the UIMA News item ( https://uima.apache.org/news.html#05 Mar 2018 ) for
more details.

A full description of the new and changed parts is here:
http://uima.apache.org/d/uimaj-3.0.0/version_3_users_guide.html

This release requires Java 8, and is intended to be backwards compatible with
existing Version 2 pipeline code, except for the need to regenerate or migrate
(tooling provided) user-defined JCas class definitions.

Please send feedback via the Apache UIMA project mailing lists.

 -Marshall Schor, for the Apache UIMA development team

Re: Warnings after importing uimaj-examples

2018-03-05 Thread Marshall Schor

Hi Barbara,

In general, the warnings do not matter.

These typically stem from the evolution of Java, for example: adding Generic
Typing to the language. 
Many of the examples have not been updated to take advantage of Generic Typing,
which produces the "raw type" warnings.
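
To illustrate what the warning is about, here is a small stand-alone sketch (not code from uimaj-examples; the class and method names are made up). The first method uses the pre-generics style that triggers the "raw type" warning, the second the parameterized style that does not.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; this is not code from uimaj-examples.
final class RawTypeDemo {

    // Pre-generics style. Without the @SuppressWarnings annotation, Eclipse
    // flags this method with "ArrayList is a raw type".
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String firstRaw() {
        ArrayList names = new ArrayList();
        names.add("tokenizer");
        return (String) names.get(0); // cast needed, checked only at run time
    }

    // Parameterized style: no warning and no cast.
    static String firstTyped() {
        List<String> names = new ArrayList<>();
        names.add("tokenizer");
        return names.get(0);
    }

    public static void main(String[] args) {
        System.out.println(firstRaw().equals(firstTyped()));
    }
}
```

Both methods behave the same at run time, which is why the warnings can be ignored when just trying out the examples.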

Feel free to contribute patches to improve this :-)

-Marshall


On 3/4/2018 3:39 AM, Barbara Moloney wrote:
> Hi
> I seem to have an error free process when importing uimaj-examples from my
> binary installation of UIMA into Eclipse. However I get a big long list of
> warnings. See excerpt below and attached.
>
> Does anyone have any suggestions? Do the warnings matter?
>
> I'm using Java 1.8.0_161 on Windows 7 with UIMA 2.10.2 and Eclipse 4.7.2
>
> Any help appreciated
>
> Barbara Moloney BVSc MVS(Epidemiology) MANZCVS(Epidemiology &
> Pathology) | Technical Specialist (Epidemiology)
>
> NSW Department of Primary Industries | Biosecurity Intelligence &
> Traceability
>
> Australia
>
>
> Description: ArrayList is a raw type. References to generic type ArrayList
>   should be parameterized
> Resource: AdvancedFixedFlowController.java
> Path: /uimaj-examples/src/org/apache/uima/examples/flow
> Location: line 60
> Type: Java Problem
>
> Description: ArrayList is a raw type. References to generic type ArrayList
>   should be parameterized
> Resource: AdvancedFixedFlowController.java
> Path: /uimaj-examples/src/org/apache/uima/examples/flow
> Location: line 70
> Type: Java Problem
>
>
>
> 
> This message is intended for the addressee named and may contain confidential
> information. If you are not the intended recipient, please delete it and
> notify the sender. Views expressed in this message are those of the individual
> sender, and are not necessarily the views of their organisation.

Re: Running bin files from UIMA_HOME bin directory

2018-02-28 Thread Marshall Schor

please try a java version 8, not version 9.

There are known issues that pop up when running with Java 9 that haven't been
worked on yet.
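
For anyone curious, the stack trace shows a cast of the system class loader to URLClassLoader failing. A minimal stand-alone sketch of that assumption follows; it is not the UimaBootstrap source, and the class name LoaderCheck is made up. I would expect it to report a URLClassLoader on Java 8 and not on Java 9.

```java
import java.net.URLClassLoader;

// Illustration of the failing assumption; this is not the UimaBootstrap source.
final class LoaderCheck {

    // True only if the given loader can safely be cast to URLClassLoader.
    static boolean isUrlClassLoader(ClassLoader loader) {
        return loader instanceof URLClassLoader;
    }

    public static void main(String[] args) {
        ClassLoader system = ClassLoader.getSystemClassLoader();
        System.out.println("java.version  = " + System.getProperty("java.version"));
        System.out.println("system loader = " + system.getClass().getName());
        System.out.println("URLClassLoader? " + isUrlClassLoader(system));
    }
}
```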

Cheers. -Marshall

On 2/27/2018 8:13 PM, Barbara Moloney wrote:
> Hi Marshall
> Thanks for your reply.
> I am using Java version "9.0.4" which I have downloaded as the JDK from
> Oracle at
> http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html#javasejdk
>
> Also I am using Windows 7.
>
> Thanks
> Barbara
>
>
>
>
>
>
>
> On 27 February 2018 at 05:23, Marshall Schor <m...@schor.com> wrote:
>
>> Can you say what Java you're using?
>>
>> (Try the command   java -version)
>>
>> If it is not a mainline Java, e.g., Oracle java or IBM java, please try
>> one of
>> those.
>>
>>
>> -Marshall
>>
>>
>> On 2/24/2018 1:56 AM, Barbara Moloney wrote:
>>> Hi
>>> I'm very new to UIMA and I'd like to run annotationViewer and others in
>> the
>>> bin directory.
>>> Below is the output I'm getting,
>>>
>>> Exception in thread "main" java.lang.ClassCastException:
>>> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader cannot be
>> cast to
>>> java.base/java.net.URLClassLoader
>>> at
>>> org.apache.uima.bootstrap.UimaBootstrap.addUrlsToSystemLoader(
>> UimaBootstrap.java:146)
>>> at org.apache.uima.bootstrap.UimaBootstrap.main(UimaBootstrap.java:74)
>>>
>>>
>>> Any help appreciated
>>>
>>> Barbara
>>>
>>

Re: Running bin files from UIMA_HOME bin directory

2018-02-26 Thread Marshall Schor

Can you say what Java you're using?

(Try the command   java -version    )

If it is not a mainline Java, e.g., Oracle java or IBM java, please try one of
those.


-Marshall


On 2/24/2018 1:56 AM, Barbara Moloney wrote:
> Hi
> I'm very new to UIMA and I'd like to run annotationViewer and others in the
> bin directory.
> Below is the output I'm getting,
>
> Exception in thread "main" java.lang.ClassCastException:
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to
> java.base/java.net.URLClassLoader
> at
> org.apache.uima.bootstrap.UimaBootstrap.addUrlsToSystemLoader(UimaBootstrap.java:146)
> at org.apache.uima.bootstrap.UimaBootstrap.main(UimaBootstrap.java:74)
>
>
> Any help appreciated
>
> Barbara
>

Re: Parameters for PEAR

2018-02-12 Thread Marshall Schor

nope. sorry. -Marshall


On 2/9/2018 3:25 AM, Peter Klügl wrote:
> Hi,
>
>
> did you get an answer?
>
>
> Best,
>
>
> Peter
>
>
> Am 10.01.2018 um 17:12 schrieb Marshall Schor:
>> I'm pinging some people who might know something about LanguageWare's use of
>> this feature. -Marshall
>>
>>
>> On 1/10/2018 6:07 AM, Peter Klügl wrote:
>>> Hi,
>>>
>>>
>>> Am 10.01.2018 um 10:57 schrieb Richard Eckart de Castilho:
>>>>> On 16.12.2017, at 13:48, Peter Klügl <peter.klu...@averbis.com> wrote:
>>>>>
>>>>>> Is it a problem for us to simply implement Matthias's solution: Make use
>>>>>> of the parameters in the PearSpecifier and just set them in the wrapped
>>>>>> analysis engine description if they are compatible?
>>>>>>
>>>>> Are there any opinions on this?
>>>> First, I was a bit confused and thought the "PearSpecifier" would be
>>>> this guy here [1]. Then I realized it is this one [2].
>>>>
>>>> Looking at where the parameters of the PearSpecifier are used: apparently
>>>> the setParameter and getParameter are only ever called directly in unit
>>>> tests.
>>>>
>>>> Does it mean that the framework so far does not make any use of these
>>>> parameters at all? Or maybe they are used via some inherited methods...?
>>>>
>>>> It sounds reasonable to me that these parameters are forwarded to the
>>>> top-level component in the PEAR - the question I am asking myself is
>>>> though: why doesn't this already happen and (maybe) what else were these
>>>> PearSpecifier parameters intended to do then?
>>> Yes, these are exactly the questions we had :-)
>>>
>>> I rather wanted to ask twice before I open an issue or implement
>>> something. Could always be that I missed something. Initially, I thought
>>> that the IBM guys (LanguageWare) made massive use of the PEAR concept
>>> and they surely had some possibility to configure their PEARs.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>>>
>>>> [1] 
>>>> http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.pear.installation_descriptor
>>>> [2] 
>>>> http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.pear.specifier

Re: Parameters for PEAR

2018-01-10 Thread Marshall Schor

I'm pinging some people who might know something about LanguageWare's use of
this feature. -Marshall


On 1/10/2018 6:07 AM, Peter Klügl wrote:
> Hi,
>
>
> Am 10.01.2018 um 10:57 schrieb Richard Eckart de Castilho:
>>> On 16.12.2017, at 13:48, Peter Klügl  wrote:
>>>
>>>> Is it a problem for us to simply implement Matthias's solution: Make use
>>>> of the parameters in the PearSpecifier and just set them in the wrapped
>>>> analysis engine description if they are compatible?
>>>>
>>> Are there any opinions on this?
>> First, I was a bit confused and thought the "PearSpecifier" would be
>> this guy here [1]. Then I realized it is this one [2].
>>
>> Looking at where the parameters of the PearSpecifier are used: apparently the
>> setParameter and getParameter are only ever called directly in unit tests.
>>
>> Does it mean that the framework so far does not make any use of these
>> parameters at all? Or maybe they are used via some inherited methods...?
>>
>> It sounds reasonable to me that these parameters are forwarded to the
>> top-level component in the PEAR - the question I am asking myself is
>> though: why doesn't this already happen and (maybe) what else were these
>> PearSpecifier parameters intended to do then?
> Yes, these are exactly the questions we had :-)
>
> I rather wanted to ask twice before I open an issue or implement
> something. Could always be that I missed something. Initially, I thought
> that the IBM guys (LanguageWare) made massive use of the PEAR concept
> and they surely had some possibility to configure their PEARs.
>
> Best,
>
> Peter
>
>
>> Cheers,
>>
>> -- Richard
>>
>> [1] 
>> http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.pear.installation_descriptor
>> [2] 
>> http://uima.apache.org/d/uimaj-current/references.html#ugr.ref.pear.specifier

Re: Parameters for PEAR

2017-12-12 Thread Marshall Schor

Hi,

Good question...

The use of the word "parameters" in UIMA is unfortunately overloaded with
multiple meanings.

There are in general 2 kinds:  the kind used in produceAnalysisEngine - the
so-called "additional parameters".  The other kind are the "configuration
parameters", also called configuration settings.  These latter have a capability
for specifying global external settings overrides.

If the parameters you want to override are configuration params (which I think
you mean, because you say they're already in the "xml"), take a look at
https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.aes.external_configuration_parameter_overrides

Maybe that will be an easy way to address your use case:  the external settings
could be dynamically written into a temp file and that temp file specified in an
"additional parameters" key to produceAnalysisEngine. 

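A rough sketch of the temp-file part, using only the JDK. Everything named here is an assumption for illustration: TempSettingsFile is not a UIMA class, the setting names are invented, and I am assuming simple "name = value" lines; the section linked above has the exact file syntax and the key to use in the additional parameters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a UIMA class): writes override values to a temp file.
final class TempSettingsFile {

    // Renders one "name = value" line per entry, in insertion order.
    static String render(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Writes the rendered settings to a new temp file and returns its path.
    static Path write(Map<String, String> settings) {
        try {
            Path file = Files.createTempFile("uima-overrides-", ".settings");
            Files.write(file, render(settings).getBytes(StandardCharsets.UTF_8));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("myLanguage", "en");          // made-up setting name
        settings.put("myModelPath", "/tmp/model"); // made-up setting name and value
        Path file = write(settings);
        // The path would then be passed in the "additional parameters" map,
        // under the key described in the linked reference section.
        System.out.println(file);
    }
}
```
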
-Marshall


On 12/12/2017 2:39 AM, Matthias Koch wrote:
> Hi,
>
> I want to configure a PEAR dynamically. (I install the pear and want to
> produce the analysis engine with different parameters than in the xml).
> Is this possible? Can I use the additionalParameters? I have seen that the
> PearSpecifier has an instance variable for parameters, but no one is using
> (calling) it.
>
> I want to produce the analysisEngine with:
> UIMAFramework.produceAnalysisEngine(resourceSpecifer, resourceManager, 
> params);
>
> In this specifier there should be one or more pearSpecifiers that should be
> configured.
>
> I have overridden the PearAnalysisEngineWrapper and built a loop that
> configures the following specifier over the configurationParameterSettings. It
> takes the parameters from the pear specifiers.
>
> line 257-258
> // Parse the resource specifier
> ResourceSpecifier specifier =
> UIMAFramework.getXMLParser().parseResourceSpecifier(in);
>
> ==> added code
> AnalysisEngineDescription analysisEngineDescription =
> (AnalysisEngineDescription) specifier;
> AnalysisEngineMetaData analysisEngineMetaData =
> analysisEngineDescription.getAnalysisEngineMetaData();
> ConfigurationParameterSettings configurationParameterSettings =
> analysisEngineMetaData.getConfigurationParameterSettings();
> for (Parameter parameter : Arrays.asList(pearSpec.getParameters())) {
>
> configurationParameterSettings.setParameterValue(parameter.getName(),
> parameter.getValue());
> }
>
> Is it possible without overriding anything?
>
> UIMAJ Version: 2.10
>
> Sincerely
> Matthias
>

Re: uniqueID() function

2017-11-30 Thread Marshall Schor

Hi,

The uniqueId() function you found is (as you have noticed) not actually a
method.  It's instead, some special syntax that was supported by the
feature-value-path mechanism.

I think this is not what you're looking for.

The best thing for you to do is to design your type system as follows:

1) separate the types into those which you want to store in the database, and
others.  Examples of others might be things like "temporary" types, or types
which are in some sense "derived" and not worth the redundant storage in the DB.

2) for those types you want to store in the DB, add a feature, let's call it:
db_unique_id.  You can have it be whatever kind of value makes the most sense -
an integer, or a string, for example.

3) Then arrange your code to "set" this when the feature structure is created.
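
As a sketch of step 3, one simple way to produce the values is a small allocator that your annotator calls whenever it creates a feature structure of a "stored" type. This is plain Java with made-up names (DbIdAllocator, the "report-17" key); setting the db_unique_id feature itself is left as a comment, since that depends on your type system.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out ids to store in a db_unique_id feature.
final class DbIdAllocator {
    private final String documentKey;          // e.g. a source file name
    private final AtomicLong next = new AtomicLong(1);

    DbIdAllocator(String documentKey) {
        this.documentKey = documentKey;
    }

    // Returns a fresh id such as "report-17#1"; safe to call from several threads.
    String nextId() {
        return documentKey + "#" + next.getAndIncrement();
    }

    public static void main(String[] args) {
        DbIdAllocator ids = new DbIdAllocator("report-17");
        // In an annotator this is where the new feature structure would be
        // created and its db_unique_id feature set to the allocated value.
        System.out.println(ids.nextId());
        System.out.println(ids.nextId());
    }
}
```

Including a per-document key in the id is a design choice that keeps the value usable as a database key across CASs, provided the key itself is unique per document.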

---

Having said that, there is a more-or-less unique "id", for every feature
structure in the CAS.  Of course, it's not unique across CASs.  Given a feature
structure myFeatureStructure, you can access it using

myFeatureStructure.hashCode()

In UIMA v3, we have myFeatureStructure._id() 

-Marshall

On 11/1/2017 1:55 PM, Kameron Cole wrote:
> Hello
>
> I am trying to use the uniqueId() function, and find some examples.
> Basically I want to use the CAS unique ID as the unique id Feature for an
> annotation. For example, a police report Annotation would have a Feature
> reportid, which would leverage the uniqueId() .  The ultimate purpose is to
> send the CAS to a database table, and use the reportid as the row's unique
> ID.
>
> I can't find any information on it, except here:
>
> http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/cas/FeatureValuePath.html
>
> Contains CAS Type and Feature objects to represent a feature path of the
> form feature1/.../featureN. Each part that is enclosed within / is referred
> to as "path snippet" below. Also contains the necessary evaluation logic to
> yield the value of the feature path. For leaf snippets, the following
> "special features" are defined:
>   coveredText() can be accessed using evaluateAsString
>   typeName() can be accessed using evaluateAsString
>   fsId() can be accessed using evaluateAsInt. Its result can be used to
>   retrieve an FS from the current LowLevel-CAS.
>   uniqueId() can be accessed using evaluateAsInt. Its result can be
>   used to uniquely identify an FS for a document (even if the document
>   is split over several CAS chunks)
>
> This is deprecated, and replaced with
>
> http://uima.apache.org/d/uimaj-2.4.2/apidocs/org/apache/uima/cas/FeaturePath.html
>
> However, FeaturePath does not have the uniqueID() method
>
> The feature path syntax also allows some built-in functions on the last
> feature path element. Built-in functions are added with a ":" followed by
> the function name. E.g. "/my/path:fsId()". The allowed built-in functions
> are:
>   coveredText()
>   fsId()
>   typeName()
> Built-in functions are only evaluated if getValueAsString() is called.
>
> At least, I don't get it. Can I get an example?  Thanks
>
>
>
>

[ANNOUNCE] Apache UIMA Java SDK 3.0.0-beta released

2017-11-10 Thread Marshall Schor


The Apache UIMA team is pleased to announce the release of the Apache UIMA Java
SDK, version 3.0.0-beta.  This is a beta release, and is intended
to have stable user-facing APIs and enable a wider set of users to test this
version and give feedback.  

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

This release is a major rewrite of the internals of core UIMA, and
includes many new features, including:
 -- support for arbitrary Java objects in the CAS
 -- New semi-built-in UIMA types: FSArrayList, FSHashSet, IntegerArrayList
 -- New "select" framework integrated with Java 8 Streams
 -- Elimination of concurrent modification exception
  while iterating over UIMA indexes
 -- Automatic Garbage Collection of unreferenced Feature Structures
 -- All around better integration into Java 8 idioms and generic typing

The major changes from the alpha02 release include
  - improved generic typing and better Java 8 idiom integration
  - Eclipse JARs are now Jar-signed
  - Many small bug fixes, improvements, better error reporting

See the UIMA News item ( https://uima.apache.org/news.html#09 Nov 2017 ) for
more details.

A full description of the new and changed parts is here:
http://uima.apache.org/d/uimaj-3.0.0-beta/version_3_users_guide.html

This release requires Java 8, and is intended to be backwards compatible with
existing Version 2 pipeline code, except for the need to regenerate or migrate
(tooling provided) user-defined JCas class definitions.

Please send feedback via the Apache UIMA project mailing lists.

 -Marshall Schor, for the Apache UIMA development team

[ANNOUNCE] Apache UIMA Java SDK 2.10.2 released

2017-11-06 Thread Marshall Schor


The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 2.10.2.

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

This is mostly a bug fix release.  Some changes to the Eclipse plugins:
  - The plugins are now Eclipse jar-signed, which avoids the warning about
    unsigned artifacts during installation.
  - The previous Eclipse update site has been archived (it is still available,
    see https://uima.apache.org/eclipse-update-archives.html ).

For more details, see the UIMA News item (
https://uima.apache.org/news.html#03 Nov 2017 ).

 -Marshall Schor, for the Apache UIMA development team

Re: Attach Javadoc

2017-09-14 Thread Marshall Schor

Hi,

Sorry to hear of the trouble adding Javadocs.

Is it possible that the javadoc location you gave was incorrect? 

If you downloaded the UIMA distribution binary and unzipped it into (for
example) c:\myUima, you should have a folder c:\myUima\apache-uima\docs\d\api

I think that that would be the folder to use when specifying the javadoc for the
UIMA jars.

If you're still having trouble, please post more details, including which
version of UIMA and Eclipse you are using, how you have set up Eclipse to use
UIMA, etc.

-Marshall


On 9/13/2017 5:28 PM, esteban.lla...@correounivalle.edu.co wrote:
> Good afternoon everybody.
>
> I have already installed the Apache UIMA framework on my PC and configured
> the UIMA plugins in my Eclipse IDE. I'm trying some of the default examples
> but I can't attach the Javadoc. I followed the instructions I found in the
> "1.1 Using named Eclipse User Libraries" section, but it doesn't work as I
> expected; I just get this message: "Note: The attached Javadoc could not be
> retrieved as the specified Javadoc location is either wrong or currently not
> accessible.". I would like to have this option available, so if anyone can
> help me solve this problem I would appreciate it.
>
> Thanks.
>

Re: JCasGen failure

2017-09-07 Thread Marshall Schor

Ok, there's a really simple explanation :-).

UIMA **Version 3** changes how JCas classes are built and used.

You have to "match" the version of UIMA you install into Eclipse with the
version of UIMA you are using, where the match means running v2 with Eclipse v2
things, and v3 with Eclipse v3 things. 
In your case, the version installed into Eclipse is a version 2, so it generates
JCas classes suitable for version 2 of UIMA.  These are incompatible with UIMA
version 3.
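
The matching rule amounts to comparing major version numbers. A tiny sketch, with a made-up class name and using the two version strings from your note:

```java
// Hypothetical helper: compares the major version of two UIMA version strings.
final class UimaVersionMatch {

    // "3.0.0-alpha02" -> 3, "2.10.1" -> 2
    static int major(String version) {
        int dot = version.indexOf('.');
        String head = dot < 0 ? version : version.substring(0, dot);
        return Integer.parseInt(head.trim());
    }

    // True when the runtime and the Eclipse tooling are on the same major version.
    static boolean sameMajor(String runtimeVersion, String pluginVersion) {
        return major(runtimeVersion) == major(pluginVersion);
    }

    public static void main(String[] args) {
        System.out.println(sameMajor("3.0.0-alpha02", "2.10.1"));
        System.out.println(sameMajor("2.10.1", "2.10.1"));
    }
}
```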

To fix this situation, I would recommend you change your running environment to
run with the most recent production level of UIMA, which is version 2.10.1.

Or, if you're interested in trying out version 3, you should install the version
3.0.0-alpha02 version of the Eclipse plugins.  These are available at
http://www.apache.org/dist/uima/eclipse-update-site/uimaj-v3-pre-production/ . 
The last part of the name (uimaj-v3-pre-production) is there to remind you that
you're using an "alpha", pre-production level of the code.  We are working to
release version 3 as a normal level, soon (I hope before the end of the year).

Thank you for your interest! 

-Marshall

On 9/7/2017 3:29 PM, esteban.lla...@correounivalle.edu.co wrote:
>
> On 2017-09-07 12:17, Marshall Schor <m...@schor.com> wrote: 
>> Hi,
>>
>> Please say what version of UIMA you are using, and what version of the UIMA
>> Eclipse plugins you are using (if you are using those to run JCasGen).
>>
>> Thanks.  -Marshall
>>
>>
>> On 9/7/2017 11:06 AM, esteban.lla...@correounivalle.edu.co wrote:
>>> Hello, everybody.
>>>
>>> I'm new to Apache UIMA, but I have a problem that keeps me from proceeding
>>> with the starting guide given by Apache. When I create my Analysis Engine
>>> Descriptor, I define my types and their features (as indicated), but when
>>> I use JCasGen to automatically create my Java classes it creates two
>>> files: (TypeName)_Type.java and (TypeName).java. The issue is that both of
>>> them use several deprecated classes and methods, like Annotation_Type and
>>> methods like addGeneratorForType for FSClassRegistry. I've reviewed the API
>>> but I can't find a way to solve these problems.
>>>
>>> Thanks in advance :)
>>>
>>
>  Hi, Marshall
>
> I'm using Apache UIMA Version 3.0.0-alpha02, and the Eclipse Plugins Version 
> published at the official page 
> (http://www.apache.org/dist/uima/eclipse-update-site/), UIMA Tools 2.10.1, 
> Apache UIMA Ruta 2.6.1, Apache UIMA-AS 2.9.0 and UIMA Runtime 2.10.1, i think.
>
> Thanks.
>

Re: JCasGen failure

2017-09-07 Thread Marshall Schor

Hi,

Please say what version of UIMA you are using, and what version of the UIMA
Eclipse plugins you are using (if you are using those to run JCasGen).

Thanks.  -Marshall


On 9/7/2017 11:06 AM, esteban.lla...@correounivalle.edu.co wrote:
> Hello, everybody.
>
> I'm new to Apache UIMA, but I have a problem that keeps me from proceeding
> with the starting guide given by Apache. When I create my Analysis Engine
> Descriptor, I define my types and their features (as indicated), but when I
> use JCasGen to automatically create my Java classes it creates two files:
> (TypeName)_Type.java and (TypeName).java. The issue is that both of them use
> several deprecated classes and methods, like Annotation_Type and methods like
> addGeneratorForType for FSClassRegistry. I've reviewed the API but I can't
> find a way to solve these problems.
>
> Thanks in advance :)
>

[ANNOUNCE] Apache UIMA Java SDK 2.10.1 released

2017-08-30 Thread Marshall Schor

The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 2.10.1. 

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

This is mostly a bug fix release, but includes some new capabilities to use
external resource settings values within UIMA's XML descriptors.  For more
details, see the UIMA News item
<http://uima.apache.org/news.html#29%20Aug%202017>.

-Marshall Schor, for the Apache UIMA development team

Re: Type system commit race condition when maven builds jar(?)

2017-08-23 Thread Marshall Schor

Hi,

Those two listings of the type system appear to show that the *working* version
does not have cTakes, and the *failing* version has cTakes.

Any ideas why this might be so?

The stack trace appears to show that the "FileSystemCollectionReader" class is
running the getNext method, and that is attempting to create an instance of the
JCas class   com.clinacuity.deid.annotations.DocumentInformationAnnotation.

The error message indicates that whatever causes the class
com.clinacuity.deid.annotations.DocumentInformationAnnotation to be "loaded" &
"initialized" by Java's class loader is happening before the type system has
been committed.

It seems that the FileSystemCollectionReader is being run in some separate
thread.  If your system design is permitting this to happen in a race with
another thread which is setting up the type system and committing it, you could
get this error.  The fact that a big type system might take longer to set up
might account for why that one fails (the shorter one wins the race).

Of course, for reliable operation, the race must be eliminated; a fix would be
to delay starting threads that depend on the type system, until the type system
has been committed.
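
One way to sketch that kind of delay, using only the JDK, is a one-shot latch
that worker threads wait on before they touch any JCas class.  The class and
method names below (TypeSystemGate, markCommitted, awaitCommitted) are
made up for illustration and are not UIMA APIs; the "commit" is simulated with
a flag.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: names here are illustrative, not UIMA API.
class TypeSystemGate {
    private final CountDownLatch committed = new CountDownLatch(1);

    // Call once, after the type system has been set up and committed.
    void markCommitted() {
        committed.countDown();
    }

    // Workers call this before loading or using any JCas class.
    void awaitCommitted() throws InterruptedException {
        committed.await();
    }

    // Demo: returns true if the worker only proceeded after the simulated commit.
    static boolean demo() {
        TypeSystemGate gate = new TypeSystemGate();
        AtomicBoolean commitDone = new AtomicBoolean(false);
        ExecutorService workers = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> reader = workers.submit(() -> {
                gate.awaitCommitted();     // blocks until markCommitted() is called
                return commitDone.get();   // stands in for "create JCas instances"
            });
            Thread.sleep(50);              // stands in for a slow type system setup
            commitDone.set(true);          // stands in for the type system commit
            gate.markCommitted();
            return reader.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker saw committed type system: " + demo());
    }
}
```

The point of the latch is that the order no longer depends on how long the type
system setup takes.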

Does this help?

-Marshall



On 8/18/2017 8:55 AM, Andrew Trice wrote:
> I am working with UIMA-Alpha02 and am currently trying to build a jar to make
> a standalone application (not a PEAR, because we want our users to need just
> the JDK and not Eclipse or UIMA). In many cases we use Maven to build
> and package the jar and it runs fine. Other times the jar is broken so that
> the pipeline will always fail with this stack trace:
>  
> 2017-08-17 14:53:54 [ERROR] DeidRunnerController.203: Throwing
> java.lang.ExceptionInInitializerError: null
>   at 
> com.clinacuity.deid.ae.FileSystemCollectionReader.getNext(FileSystemCollectionReader.java:105)
>  ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   at com.clinacuity.deid.ae.DeidPipeline.execute(DeidPipeline.java:109) 
> ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   at 
> com.clinacuity.deid.gui.DeidPipelineTask.call(DeidPipelineTask.java:41) 
> ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   at 
> com.clinacuity.deid.gui.DeidPipelineTask.call(DeidPipelineTask.java:9) 
> ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   at javafx.concurrent.Task$TaskCallable.call(Task.java:1423) 
> ~[jfxrt.jar:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[?:1.8.0_121]
>   at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
> Caused by: org.apache.uima.cas.CASRuntimeException: A JCas class field 
> "documentType" is being initialized by non-framework (user) code before Type 
> System Commit for a type system with a corresponding type. Either change the 
> user load code to not do initialize, or to defer it until after the type 
> system commit.
>   at 
> org.apache.uima.cas.impl.TypeSystemImpl.getAdjustedFeatureOffset(TypeSystemImpl.java:2564)
>  ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   at 
> com.clinacuity.deid.annotations.DocumentInformationAnnotation.<clinit>(DocumentInformationAnnotation.java:23)
>  ~[deid-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
>   ... 7 more
>
> I logged the JCas type system before this error occurs with this command: 
> jCas.getTypeSystem(), when it *passes* it shows the type system as this:
>
> TypeSystem: Type System <2,016,464,189>:
>   uima.cas.TOP: super: 
> uima.cas.Integer: super: uima.cas.TOP
> uima.cas.Float: super: uima.cas.TOP
> uima.cas.String: super: uima.cas.TOP
> uima.cas.ArrayBase: super: uima.cas.TOP
> uima.cas.ListBase: super: uima.cas.TOP
> uima.cas.Boolean: super: uima.cas.TOP
> uima.cas.Byte: super: uima.cas.TOP
> uima.cas.Short: super: uima.cas.TOP
> uima.cas.Long: super: uima.cas.TOP
> uima.cas.Double: super: uima.cas.TOP
> uima.cas.Sofa: super: uima.cas.TOP, FeaturesIntroduced/Range/multiRef: [
>   sofaNum/uima.cas.Integer/F, 
>   sofaID/uima.cas.String/F, 
>   mimeType/uima.cas.String/F, 
>   sofaArray/uima.cas.TOP/T, 
>   sofaString/uima.cas.String/F, 
>   sofaURI/uima.cas.String/F]
> uima.cas.AnnotationBase: super: uima.cas.TOP, 
> FeaturesIntroduced/Range/multiRef: [
>   sofa/uima.cas.Sofa/F]
> com.clinacuity.deid.uima.core.type.Lemma: super: uima.cas.TOP, 
> FeaturesIntroduced/Range/multiRef: [
>   key/uima.cas.String/F, 
>   posTag/uima.cas.String/F]
> com.clinacuity.deid.uima.core.type.OntologyConcept: super: uima.cas.TOP, 
> FeaturesIntroduced/Range/multiRef: [
>   codingScheme/uima.cas.String/F, 
>   code/uima.cas.String/F, 
>   oid/uima.cas.String/F]
>
> but when it *fails* I get this:
>
> TypeSystem: Type System <815,926,310>:
>   uima.cas.TOP: super: 
> uima.cas.Integer: super: uima.cas.TOP
> uima.cas.Float: super: uima.cas.TOP
> uima.cas.String: super: uima.cas.TOP
> uima.cas.ArrayBase: super:

Re: How to install the UIMA Eclipse Plugins Offline

2017-07-26 Thread Marshall Schor

Here's a suggestion.

Use the normal way of Eclipse installing: "Help -> Install New Software".

On the line saying "Work with", click the "Add..." button on the right.

This brings up an "Add repository" box.  Click on the button which says "Local",
and browse to a directory where you have downloaded the part of the Eclipse
update site you want.

To get the Eclipse update site parts, download the files from

http://www.apache.org/dist/uima/eclipse-update-site/uimaj/

and move the files somehow to your Linux machine.  Let's say you put them in
/temp/uima-plugins/uimaj

That directory should have a bunch of folders, jars, etc., like this:

features/
plugins/
artifacts.jar
content.jar
...  a bunch more signature files...

Repeat this for RUTA: download from
http://www.apache.org/dist/uima/eclipse-update-site/ruta/
and put these in, say, /temp/uima-plugins/ruta

Then first install uimaj, then install ruta.

HTH. -Marshall

On 7/26/2017 8:26 AM, Ding Haoqi wrote:
> Hi all,
>
> I need to install the UIMA Eclipse plugins on a virtual machine (Linux)
> without external network access, but it doesn't work when I copy the
> directory where Eclipse is installed on my own PC to the virtual machine: all
> the UIMA and RUTA plugins become unavailable. What should I do? Is it
> possible for me to download the plugins and install them on the virtual
> machine manually?
>
> By the way, I have installed the plugins on my own PC, but there are no files
> whose names contain 'uima' or 'ruta' in Eclipse's plugins or features
> directories. Which files are associated with UIMA and RUTA?
>
> best,
>
> Ding Haoqi

Re: JCasGenMojo.newError(JCasGenMojo.java:239)

2017-07-19 Thread Marshall Schor

:-) -Marshall


On 7/17/2017 8:14 AM, Luca Toldo wrote:
> by upgrading to version 2.10 of the UIMA framework the error message 
> disappeared and instead I got a clear instruction on what is conflicting.
>
> Thank you !
>

Re: How use JCasPool when is exhausted

2017-06-09 Thread Marshall Schor

Since you say that CASes are never being released, perhaps you need to trace the
code where you think an acquired CAS ought to be released, and see why
that code path doesn't call the releaseJCas(cas) method?

Also, were there any log / error messages generated?
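
A common reason for a release never happening is an exception thrown between
the acquire and the release.  Here is a minimal sketch of the usual guard, a
try / finally around the work.  SimplePool is a made-up stand-in built on
JDK classes (it is not JCasPool), and the StringBuilder items stand in for
CASes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a CAS pool; SimplePool is not a UIMA class.
class SimplePool {
    private final BlockingQueue<StringBuilder> free;

    SimplePool(int size) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            free.add(new StringBuilder());
        }
    }

    // Returns null if nothing is released within the timeout.
    StringBuilder acquire(long timeoutMillis) throws InterruptedException {
        return free.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void release(StringBuilder item) {
        item.setLength(0);   // reset before reuse
        free.add(item);
    }

    int available() {
        return free.size();
    }

    // Processes one request; the finally block guarantees the release.
    static boolean process(SimplePool pool, String text, boolean fail) {
        StringBuilder item;
        try {
            item = pool.acquire(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (item == null) {
            return false;    // pool exhausted: report it instead of blocking forever
        }
        try {
            item.append(text);
            if (fail) {
                throw new IllegalStateException("simulated processing failure");
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            pool.release(item);   // runs on both the success and the failure path
        }
    }

    public static void main(String[] args) {
        SimplePool pool = new SimplePool(2);
        process(pool, "ok", false);
        process(pool, "boom", true);
        System.out.println("available after both requests: " + pool.available());
    }
}
```

With this shape a failing request still returns its item, so the pool does not
drain over time.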

-Marshall

On 6/9/2017 6:09 AM, Josep María Formentí Serra wrote:
> Hi all,
>
>   I have doubts how to control when the pool, JCasPool, is exhausted.
>
>   I'm trying something like this:
>
> JCasPool casPool = new JCasPool(100,
> segmenter.getAnalysisEngineMetaData());
>
> 
>
> if (cas == null) {
> synchronized (casPool) {
> {
> casPool.wait();
> cas = casPool.getJCas();
> }
> while (cas == null);
> }
> }
>
>  For release just: casPool.releaseJCas(cas);
>
>  This is not working for me, because when there are a lot of requests the
> application finally is blocked: the pool is full and never is released a
> CAS.
>
> Thanks,
>   JM
>

Re: How use JCasPool when is exhausted

2017-06-09 Thread Marshall Schor

Hi,

I'm having some trouble understanding your code snippet.

Perhaps you intended to have a "do {  } while (cas == null)", but it appears
the "do" is missing?
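
For reference, a guarded wait written as a do / while might look like the
sketch below.  WaitingPool is a made-up class for illustration only (plain
strings stand in for CASes); the idea is to check for a free item first and
only wait when there is none, re-checking after every wake-up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool used only to illustrate the loop shape; not a UIMA class.
class WaitingPool {
    private final Deque<String> free = new ArrayDeque<>();

    WaitingPool(int size) {
        for (int i = 0; i < size; i++) {
            free.push("item-" + i);
        }
    }

    // Blocks until an item is free. The do/while re-checks after every wake-up.
    synchronized String acquire() throws InterruptedException {
        String item;
        do {
            item = free.poll();   // null when the pool is exhausted
            if (item == null) {
                wait();           // gives up the lock until release() notifies
            }
        } while (item == null);
        return item;
    }

    synchronized void release(String item) {
        free.push(item);
        notifyAll();              // wake any thread blocked in acquire()
    }

    // Demo: a second thread blocks on the empty pool until the item is released.
    static String demo() {
        WaitingPool pool = new WaitingPool(1);
        String[] got = new String[1];
        try {
            String first = pool.acquire();
            Thread waiter = new Thread(() -> {
                try {
                    got[0] = pool.acquire();   // blocks: the only item is checked out
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            waiter.start();
            Thread.sleep(50);
            pool.release(first);
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return got[0];
    }

    public static void main(String[] args) {
        System.out.println("waiter received: " + demo());
    }
}
```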

-Marshall


On 6/9/2017 6:09 AM, Josep María Formentí Serra wrote:
> Hi all,
>
>   I have doubts how to control when the pool, JCasPool, is exhausted.
>
>   I'm trying something like this:
>
> JCasPool casPool = new JCasPool(100,
> segmenter.getAnalysisEngineMetaData());
>
> 
>
> if (cas == null) {
> synchronized (casPool) {
> {
> casPool.wait();
> cas = casPool.getJCas();
> }
> while (cas == null);
> }
> }
>
>  For release just: casPool.releaseJCas(cas);
>
>  This is not working for me, because when there are a lot of requests the
> application finally is blocked: the pool is full and never is released a
> CAS.
>
> Thanks,
>   JM
>

Re: Error in CAS Serialization to JSON

2017-06-08 Thread Marshall Schor

I accidentally replied off list.  The reply was:

Can you describe your runtime environment?

The description for this particular error says "Normally, this error is caught
by the compiler; this error can only occur at run time if the definition of a
class has incompatibly changed."

What level of Java and UIMA etc., are you running?  Is it possible you have some
mixed levels?
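
One quick way to check for mixed levels is to print where each of the two
classes in the stack trace was loaded from.  This is only a diagnostic sketch
using standard JDK calls; WhereLoaded is a made-up helper name, and it needs
to be run with the same classpath as the failing program.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of UIMA.
class WhereLoaded {
    // Returns the jar or directory a class was loaded from, or a note if unknown.
    static String locationOf(String className) {
        try {
            // "false" avoids running static initializers just to locate the class
            Class<?> c = Class.forName(className, false, WhereLoaded.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(no code source, probably a JDK class)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Default names are the two classes mentioned in the error in this thread.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.uima.cas.impl.CasSerializerSupport",
            "org.apache.uima.json.JsonCasSerializer"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

If the two classes come from jars with different version numbers, that would
point to a mixed-level classpath.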

-Marshall
On 6/6/2017 9:44 AM, Das, Tanmay wrote:
> Hi
>
> When I try to run the following line of code(as given in the guide) :
> jcs.serialize(cas, sw);
>
> It produces an error -
> Exception in thread "main" java.lang.IllegalAccessError: tried to access 
> field org.apache.uima.cas.impl.CasSerializerSupport.COMPARATOR_SHORT_TYPENAME 
> from class org.apache.uima.json.JsonCasSerializer$JsonDocSerializer
>at 
> org.apache.uima.json.JsonCasSerializer$JsonDocSerializer.collectUsedSubtypes(JsonCasSerializer.java:842)
>at 
> org.apache.uima.json.JsonCasSerializer$JsonDocSerializer.serializeJsonLdContext(JsonCasSerializer.java:717)
>at 
> org.apache.uima.json.JsonCasSerializer$JsonDocSerializer.writeFeatureStructures(JsonCasSerializer.java:532)
>at 
> org.apache.uima.cas.impl.CasSerializerSupport$CasDocSerializer.serialize(CasSerializerSupport.java:439)
>at 
> org.apache.uima.json.JsonCasSerializer.serialize(JsonCasSerializer.java:318)
>at 
> org.apache.uima.json.JsonCasSerializer.serialize(JsonCasSerializer.java:299)
>at 
> org.apache.uima.json.JsonCasSerializer.serialize(JsonCasSerializer.java:289)
>
> The definition of COMPARATOR_SHORT_TYPENAME in class CasSerializerSupport :
>
>   final static Comparator<TypeImpl> COMPARATOR_SHORT_TYPENAME = new
> Comparator<TypeImpl>() {
>     public int compare(TypeImpl object1, TypeImpl object2) {
>       return object1.getShortName().compareTo(object2.getShortName());
>     }
>   };
>
> Can anyone help me with the problem?
>
>
> Thanks
> -TD
>

Re: UIMA Annotation Editor - Annotation Styles - how to reset ?

2017-05-17 Thread Marshall Schor

I'm not too familiar with this, but will take a look if you can post "how to
reproduce" instructions, for a (hopefully) small example :-)

-Marshall


On 5/17/2017 10:54 AM, Luca Toldo wrote:
> Dear all,
> I'm having problems with the Annotation Editor.
> For reasons I don't know, one of the types listed has a "minus" or "-" on its 
> left and cannot be "deselected".
>
> I notice that it coincides with a value shown in the Outline text window even
> when no annotation has been made.
>
> I've removed the .pref-typeSystemDescriptor.xml  and also the 
> typeSystemDescriptorStyleMap.xml but the problem remains.
>
> grep -ril  from the home .eclipse folder  delivers no hits...
>
> any advice ?
> Thanks
>
>
>

Re: conversion from ECORE to UIMA - Ecore2UimaTypeSystem

2017-05-16 Thread Marshall Schor

This missing class is part of Eclipse's EMF support.

See if you can locate a jar for it.  For example, I went to

http://search.maven.org  and searched for "ecore", and found on the 4th hit
org.eclipse.emf.ecore jar which has the ResourceSet class.

Try including that jar (and perhaps its dependencies; look at the POM, or just
update your Maven build if you're using Maven) in your classpath.

I also see this jar in my current Eclipse installation.

-Marshall

On 5/16/2017 10:44 AM, Luca Toldo wrote:
> Dear all,
> I'm trying to generate a UIMA TypeSystem from an UML project.
> According to the UIMA Tutorial and Developers' Guide this should be 
> straightforward using
> java org.apache.uima.ecore.Ecore2UimaTypeSystem  
>
> Unfortunately, however, that class does not exist anymore. 
> in lib/uima-examples.jar I've found the following one :
> org.apache.uima.examples.xmi.Ecore2UimaTypeSystem
>
> however executing it results in a NoClassDefFoundError: 
> org/eclipse/emf/ecore/resource/ResourceSet
>
> Unfortunately no information is provided in the UIMA SDK on how to execute 
> that task (e.g. precise dependency list).
>
> Any experience / feedback on the topic is appreciated. 
>
>

Re: Using REST api with UIMA

2017-05-04 Thread Marshall Schor

I am unfamiliar with this project.

I took a quick look and see that:

a) it's old - not been worked on for more than 3 years

b) The serialization technique is to serialize using UIMA's CAS -> XCAS (an
older form of XML serialization), followed by a transform from the XML to JSON.
I didn't look, but I'd guess that the JSON -> CAS path does a JSON -> XCAS
transform, followed by a deserialization of the XCAS format by UIMA.

This would be relatively inefficient, I think.  I would guess this would be
unlikely to be included in UIMA.

-Marshall


On 5/4/2017 10:43 AM, Luca Toldo wrote:
> Thank you, Marshall, for your fast and authoritative reply.
>
> The deserialization from JSON to CAS is really important since this will 
> „bridge“ the UIMA community with the micro service REST world.
>
> I've found the following project 
> https://github.com/windj007/uima-serialization/ and I am interested on your 
> opinion about its level of „maturity“ / likelihood of inclusion in the UIMA 
> releases.
>
> Thanks
> Luca
>
>
>> On 04.05.2017 at 16:37, Marshall Schor <m...@schor.com> wrote:
>>
>> The core UIMA support has not (yet) implemented a deserializer back into CAS
>> form, for this.
>>
>> The main idea behind CAS -> JSON conversion was to provide the CAS info to
>> JSON Consumers.
>>
>> We have multiple other serializations (CAS -> XMI, etc.) that are designed
>> for "transport" and include deserializers.
>>
>> Of course, it is quite possible to create an addition to UIMA which can
>> deserialize the JSON format(s) - that just hasn't yet been done.
>> Contributions welcome!
>>
>> -Marshall
>>
>>
>> On 5/4/2017 3:49 AM, Luca Toldo wrote:
>>> The following Java code (inspired from 
>>> http://stackoverflow.com/questions/40838999/getting-output-in-json-format-in-uima
>>>  ) 
>>>
>>> import java.io.*;
>>> import org.apache.uima.fit.factory.JCasFactory;
>>> import org.apache.uima.jcas.JCas;
>>> import org.apache.uima.cas.CAS;
>>> import org.apache.uima.json.JsonCasSerializer;
>>> public class test {
>>>     public static void main(String[] args) throws IOException {
>>>         try {
>>>             String note = "Lorem ipsum incididunt ut labore et dolore magna aliqua";
>>>             JCas jcas = JCasFactory.createJCas();
>>>             jcas.setDocumentText(note);
>>>             JsonCasSerializer jcs = new JsonCasSerializer();
>>>             jcs.setPrettyPrint(true);
>>>             StringWriter sw = new StringWriter();
>>>             CAS cas = jcas.getCas();
>>>             jcs.serialize(cas, sw);
>>>             System.out.println(sw.toString());
>>>         } catch (Exception ex) {
>>>             // Report failures instead of silently swallowing them.
>>>             ex.printStackTrace();
>>>         }
>>>     }
>>> }
>>>
>>>
>>> delivers properly formatted JSON CAS:
>>>
>>> {"_context" : {
>>>"_types" : {
>>>  "DocumentAnnotation" : {"_id" : "uima.tcas.DocumentAnnotation", 
>>>"_feature_types" : {"sofa" : "_ref" } }, 
>>>  "Sofa" : {"_id" : "uima.cas.Sofa", 
>>>"_feature_types" : {"sofaArray" : "_ref" } }, 
>>>  "Annotation" : {"_id" : "uima.tcas.Annotation", 
>>>"_feature_types" : {"sofa" : "_ref" }, 
>>>"_subtypes" : ["DocumentAnnotation" ] }, 
>>>  "AnnotationBase" : {"_id" : "uima.cas.AnnotationBase", 
>>>"_feature_types" : {"sofa" : "_ref" }, 
>>>"_subtypes" : ["Annotation" ] }, 
>>>  "TOP" : {"_id" : "uima.cas.TOP", 
>>>"_subtypes" : ["AnnotationBase",  "Sofa" ] } } }, 
>>>  "_views" : {
>>>"_InitialView" : {
>>>  "DocumentAnnotation" : [
>>>{"sofa" : 1,  "begin" : 0,  "end" : 55,  "language" : 
>>> "x-unspecified" } ] } }, 
>>>  "_referenced_fss" : {
>>>"1" : {"_type" : "Sofa",  "sofaNum" : 1,  "sofaID" : "_InitialView",  
>>> "mimeType" : "text",  "sofaString" : "Lorem ipsum incididunt ut labore et 
>>> dolore magna aliqua" } } }
>>>
>>> How to deserialize that back into CAS object ?
>>>
>

Re: Using REST api with UIMA

2017-05-04 Thread Marshall Schor

The core UIMA support has not (yet) implemented a deserializer back into CAS
form, for this. 

The main idea behind CAS -> JSON conversion was to provide the CAS info to JSON
Consumers.

We have multiple other serializations (CAS -> XMI, etc.) that are designed for
"transport" and include deserializers.

Of course, it is quite possible to create an addition to UIMA which can
deserialize the JSON format(s) - that just hasn't yet been done. Contributions
welcome!

-Marshall


On 5/4/2017 3:49 AM, Luca Toldo wrote:
> The following Java code (inspired from 
> http://stackoverflow.com/questions/40838999/getting-output-in-json-format-in-uima
>  ) 
>
> import java.io.*;
> import org.apache.uima.fit.factory.JCasFactory;
> import org.apache.uima.jcas.JCas;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.json.JsonCasSerializer;
> public class test {
>     public static void main(String[] args) throws IOException {
>         try {
>             String note = "Lorem ipsum incididunt ut labore et dolore magna aliqua";
>             JCas jcas = JCasFactory.createJCas();
>             jcas.setDocumentText(note);
>             JsonCasSerializer jcs = new JsonCasSerializer();
>             jcs.setPrettyPrint(true);
>             StringWriter sw = new StringWriter();
>             CAS cas = jcas.getCas();
>             jcs.serialize(cas, sw);
>             System.out.println(sw.toString());
>         } catch (Exception ex) {
>             // Report failures instead of silently swallowing them.
>             ex.printStackTrace();
>         }
>     }
> }
>
>
> delivers properly formatted JSON CAS:
>
> {"_context" : {
> "_types" : {
>   "DocumentAnnotation" : {"_id" : "uima.tcas.DocumentAnnotation", 
> "_feature_types" : {"sofa" : "_ref" } }, 
>   "Sofa" : {"_id" : "uima.cas.Sofa", 
> "_feature_types" : {"sofaArray" : "_ref" } }, 
>   "Annotation" : {"_id" : "uima.tcas.Annotation", 
> "_feature_types" : {"sofa" : "_ref" }, 
> "_subtypes" : ["DocumentAnnotation" ] }, 
>   "AnnotationBase" : {"_id" : "uima.cas.AnnotationBase", 
> "_feature_types" : {"sofa" : "_ref" }, 
> "_subtypes" : ["Annotation" ] }, 
>   "TOP" : {"_id" : "uima.cas.TOP", 
> "_subtypes" : ["AnnotationBase",  "Sofa" ] } } }, 
>   "_views" : {
> "_InitialView" : {
>   "DocumentAnnotation" : [
> {"sofa" : 1,  "begin" : 0,  "end" : 55,  "language" : "x-unspecified" 
> } ] } }, 
>   "_referenced_fss" : {
> "1" : {"_type" : "Sofa",  "sofaNum" : 1,  "sofaID" : "_InitialView",  
> "mimeType" : "text",  "sofaString" : "Lorem ipsum incididunt ut labore et 
> dolore magna aliqua" } } }
>
> How to deserialize that back into CAS object ?
>

Re: Limiting the memory used by an annotator ?

2017-05-01 Thread Marshall Schor

Hi,

I'm not sure that a limited size FsIndexRepository would work, because it only
would limit those Feature Structures that were added to the index.

Many times, Feature Structures are made which are referenced from other Feature
Structures, but are not added to the index.  One example is instances of
NonEmptyXxxList kinds of objects - these are used to hold items in a list, and
typically are not (individually) added to the index, since the normal way to
access these is via the head of the list.

Even if they are not in the FsIndexRepository indexes, they still take up room
in the main storage on the heap for storing Feature Structures.

-Marshall


On 4/30/2017 4:15 PM, Hugues de Mazancourt wrote:
> Thanks to all for your advice.
> In my specific case, this was a Ruta problem - Peter, I filed a JIRA issue 
> with a minimal example - which would advocate for the « 
> TooManyMatchesException » feature you propose. I vote for it.
>
> Of course, I already limit the size of input texts, but this is not enough.
> One of the main strengths of UIMA is to be able to integrate annotators 
> produced by third-parties. And each annotator is based on assumptions, at 
> least to have a text as an input, formed by words, etc. Thus, pipelines get 
> more and more complex, without the need to code all processig. But, in a 
> production environment, anything can happen, assumptions may not be respected 
> (e.g. non-textual data can be sent to the engine(s), etc). Sh** always happen 
> in production.
>
> My case is a more specific one, but I’m sure it can be generalized.
>
> Thus, any feature that can help limiting the damage of non-expected input 
> would be welcome. And a limited-size FsIndexRepository seems to me a simple 
> yet powerful enough solution to many problems.
>
> Best,
>
> — Hugues
>
>
> PS: apart from occasional problems, Ruta is a great platform for information
> extraction. I love it!
>
>> On 30 Apr 2017 at 12:57, Peter Klügl wrote:
>>
>> Hi,
>>
>>
>> here are some ruta-specific comments additionally to Thilo and Marshall's 
>> answers.
>>
>> - if you do not want to split the CAS in smaller ones, you can also 
>> sometimes apply the rules just on some parts of the document (-> less 
>> annotations/rule matches created)
>>
>> - there is a discussion related to this topic (about memory usage in ruta):
>> https://issues.apache.org/jira/browse/UIMA-5306
>>
>> - I can include configuration parameters which limit the allowed amount of 
>> rule matches and rule element matches of one rule/rule element. If a rule or 
>> rule element exceeds it, a new runtime exception is thrown. I'll open a jira 
>> ticket for that. This is not a solution for the problem in my opinion, but 
>> it can help to identify and fix the problematic rules.
>>
>> - I do not want to include code to directly restrict the max memory in ruta. 
>> That should rather happen in the framework or in the code that calls/applies 
>> the ruta analysis engine.
>>
>> - I think there is a problem in ruta and there are several aspects that need 
>> to be considered here: the actual rules, the partitioning with RutaBasic, 
>> flaws in the implementation and the configuration parameters of the analysis 
>> engine
>>
>> - Are the rules inefficient (combinatorial explosion)? I see ruta more and
>> more as a programming language for creating maintainable analysis engines
>> faster. You can write efficient and inefficient code. If the code/rules
>> are too slow or take too long, you should refactor them and replace them with
>> a more efficient approach. Something like ANY+ is a good indicator that the
>> rules are not optimal; you should only match on things if you have to. There
>> is also profiling functionality in the Ruta Workbench which shows you how
>> long each rule took and how long specific conditions/actions took. Well,
>> this is information about the speed but not about the memory, but many rule
>> matches take longer and require more memory, so it could be an indicator.
>>
>> - There are two specific aspects how ruta spends its memory: RutaBasic and 
>> RuleMatches. RutaBasic stores additional information which speeds up the 
>> rule inference and enables specific functionality. The rule matches are 
>> needed to remember where something matched, for the conditions and actions. 
>> You can reduce the memory usage by reducing the amount of RutaBasic 
>> annotations, the amount of the annotations indexed in the RutaBasic 
>> annotations, or by reducing the amount of RuleMatches -> refactoring the 
>> rules.
>>
>> - There are plans to make the implementation of RutaBasic more efficient, by 
>> using more efficient data structures (there are some prototypes mentioned in 
>> the issue linked above). And I added some new configuration parameters (in 
>> ruta 2.6.0 I think) which control which information is stored in RutaBasic, 
>> e.g, you do not need information about annotations if they or their types 
>>

Re: Refreshing external resources periodically

2017-04-29 Thread Marshall Schor

Hi Debbie,

I think this depends on what kind of external resource you have.  Are you able
to see what Java class is implementing the external resource?  For instance, I'm
guessing you must at some point in your code have some code that says something
like:

myResource.myMethodToGetDataFromIt(...).

If you have that, you can see if the class implementing myResource has a method
you could call to reload itself.  If it does, then you could build a little
timer application that went off once a day, and called that api, perhaps with
some synchronization...
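
As a rough sketch of that timer idea, using only JDK classes: the holder below
swaps in freshly loaded data on a schedule, and readers always see either the
old or the new copy.  RefreshingResource and its loader are made-up names
for illustration; in a real annotator the loader would be whatever re-reads
your resource file.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder; the loader stands in for "re-read the resource file".
class RefreshingResource<T> {
    private final AtomicReference<T> current;
    private final Supplier<T> loader;

    RefreshingResource(Supplier<T> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get());
    }

    // Readers (e.g. the annotator) call this; it never blocks.
    T get() {
        return current.get();
    }

    // Loads a fresh copy and swaps it in atomically.
    void reload() {
        current.set(loader.get());
    }

    // Starts a daemon timer thread that calls reload() once per period.
    ScheduledExecutorService scheduleReload(long period, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "resource-reload");
            t.setDaemon(true);
            return t;
        });
        timer.scheduleAtFixedRate(() -> {
            try {
                reload();
            } catch (RuntimeException e) {
                // keep serving the old data; an uncaught exception would cancel the timer
            }
        }, period, period, unit);
        return timer;
    }

    public static void main(String[] args) {
        int[] version = {0};
        RefreshingResource<String> res =
            new RefreshingResource<>(() -> "table-v" + (++version[0]));
        ScheduledExecutorService timer = res.scheduleReload(1, TimeUnit.DAYS);
        System.out.println("current data: " + res.get());
        timer.shutdownNow();
    }
}
```

The atomic swap is the "synchronization" part: a document being processed
keeps using the copy it already fetched.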

Does this help, or have I misunderstood things? -Marshall

On 4/28/2017 4:13 AM, Debbie Zhang wrote:
> Hi UIMA users,
>
> I have a question regarding accessing external resources. If my external 
> resource file is the output of a database table, the data are updated daily. 
> Can I read the resource file daily as well and update my annotations 
> accordingly? I deploy my pear to elsewhere. At the moment, resource files are 
> included in the pear file so no refresh can be done. Any suggestion is very 
> welcome. Thank you.
>
> Regards,
>
> Debbie 
>
> Sent from my iPhone

Re: Limiting the memory used by an annotator ?

2017-04-29 Thread Marshall Schor

This has occasionally popped up as a user request.

Thilo makes some good practical suggestions that often work. 

If (in your case) some aspect of the data causes a combinatorial explosion in
some part of the code, and you can identify that part and have any control over
it, you might be able to insert some limiting code there.

Limiting the amount of memory: thinking more about this, if the limit was
reached, what should happen?  It seems that the choice would be to throw a new
(subclass of) RuntimeException (runtime because it could happen almost
anywhere); the "catch" action would be to abort whatever was going on, report
the failure, and reset things (including the CAS).

This could be done already - because an exception does happen (Java's
OutOfMemoryError).  Hopefully, this isn't too late - you mentioned that things
slow down as memory gets short.  (I suppose you could time things, and if things
slow down dramatically, use that as a trigger, too).

So maybe this is the best approach - find a spot in your code where the
"recovery" of aborting and resetting things makes sense, and install an
out-of-memory exception try / catch point (or a dramatic slow-down catcher).

A trick for out-of-memory catchers is to grab a block of memory (say, an int
array) at the start, and then have the out-of-memory code release that block, to
give the catcher room enough to run and recover.  But this might not be needed;
just unwinding the stack due to the throw also could free up memory, if your
catch point is high up the stack.
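
A minimal sketch of that trick follows.  MemoryReserve and runGuarded are
made-up names, the 1 MiB reserve size is an arbitrary assumption, and the
out-of-memory condition is simulated in main.  Note that an
OutOfMemoryError can also surface in other threads, so this is not a complete
defense.

```java
// Hypothetical sketch of the "reserve block" trick; not part of UIMA.
class MemoryReserve {
    private static final int RESERVE_BYTES = 1024 * 1024;   // arbitrary size
    private static byte[] reserve = new byte[RESERVE_BYTES];

    // Runs one unit of work; returns false if it was aborted for lack of memory.
    static boolean runGuarded(Runnable work, Runnable recover) {
        try {
            work.run();
            return true;
        } catch (OutOfMemoryError e) {
            reserve = null;     // release the reserve so recovery has room to run
            recover.run();      // e.g. report the failure and reset the CAS
            try {
                reserve = new byte[RESERVE_BYTES];   // re-arm for the next document
            } catch (OutOfMemoryError stillShort) {
                // carry on without a reserve; the next failure may be harder to handle
            }
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = runGuarded(
            () -> { throw new OutOfMemoryError("simulated"); },
            () -> System.out.println("aborted document, resetting"));
        System.out.println("completed normally: " + ok);
    }
}
```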

Hope this Helps.  -Marshall

On 4/29/2017 6:53 AM, Hugues de Mazancourt wrote:
> Hello UIMA users,
>
> I’m currently putting a Ruta-based system in production and I sometimes run 
> out of memory.
> This is usually caused by combinatorial explosion in Ruta rules. These rules
> are not necessarily faulty: they are adapted to the documents I expect to
> parse. But as this is an open system, people can upload whatever they want
> and the parser crashes by multiplying annotations (or at least takes 20
> minutes garbage-collecting millions of annotations).
>
> Thus, my question is: is there a way to limit the memory used by an
> annotator, or to limit the number of annotations made by an annotator, or to
> limit the number of matches made by Ruta?
> I would rather cancel the parse of a given document than have 20 minutes of
> downtime for the whole system.
>
> Several UIMA-based services run in production, I guess that others certainly 
> have hit the same problem.
>
> Any hint on that topic would be very helpful.
>
> Thanks,
>
> Hugues de Mazancourt
> http://about.me/mazancourt
>
>
>
>
>

Re: CAS visual debugger works in eclipse but not in the binary

2017-04-20 Thread Marshall Schor

Hi Benedict,

Although it's hard to spot, this too looks like an out of memory problem.  Can
you try adding the -Xmx parameter to however you're launching this to give Java
more memory to work with?
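
To confirm that the setting actually reaches the JVM started by the script, a
tiny check like the one below can be run with the same java command and
options.  HeapCheck is just an illustrative name; the call it uses is the
standard Runtime.maxMemory().

```java
// Prints the maximum heap the running JVM will try to use.
class HeapCheck {
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
    }
}
```

If the number printed does not change when you change -Xmx, the option is
probably not being passed through by the launcher.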

-Marshall

On 4/18/2017 12:54 PM, Benedict Holland wrote:
> Hello All,
>
> I am attempting to integrate the OpenNLP application with the UIMA
> framework. I created the PEAR file successfully.
>
> When I run the UIMA Pear Installer from eclipse, it works. When I attempt
> to run the runPearInstaller.bat file, it fails.
>
> When I run the CAS Visual Debugger from cvd.bat, I get the error below.
> When I run it from eclipse, everything works.
>
> Any idea about how to proceed?
>
> Thanks,
> ~Ben
>
> Error:
>
> 12:54:05.375 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class
> org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper
> 12:54:05.410 - 16:
> org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper.createRM:
> CONFIG: UIMA pear runtime set classpath to
> "F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/jwnl.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-maxent.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/bin;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-tools.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-uima.jar"
> for UIMA component opennlp.uima.OpenNlpTextAnalyzer.
> 12:54:05.410 - 16:
> org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper.createRM:
> CONFIG: UIMA pear runtime set datapath to
> "F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/models" for UIMA
> component opennlp.uima.OpenNlpTextAnalyzer.
> 12:54:05.411 - 16:
> org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper.createRMmap:
> CONFIG: UIMA pear runtime: creating a Map from class
> "F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/jwnl.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-maxent.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/bin;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-tools.jar;F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/lib/opennlp-uima.jar"
> and data path
> "F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/models" to its
> resource manager instance.
> 12:54:05.549 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:05.578 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:05.625 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:06.339 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:07.02 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:07.391 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:07.860 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:09.99 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:09.544 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:09.875 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:10.365 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:11.992 - 16:
> org.apache.uima.util.SimpleResourceFactory.produceResource: CONFIG: trying
> Resource class org.apache.uima.resource.impl.DataResource_impl
> 12:54:37.527 - 16:
> org.apache.uima.tools.cvd.MainFrame.handleException(526): SEVERE: Error
> initializing
> "org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper" from
> descriptor
> file:/F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml.
> org.apache.uima.resource.ResourceInitializationException: Error
> initializing
> "org.apache.uima.analysis_engine.impl.PearAnalysisEngineWrapper" from
> descriptor
> file:/F:/nlp/installed_pear/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml.
> at
> org.apache.uima.util.SimpleResourceFactory.produceResource(SimpleResourceFactory.java:144)
> at
>

Re: Error running PEAR Installer

2017-04-20 Thread Marshall Schor

Please try running with more memory by using the Java command-line parameter
-Xmx.

See, for example, the documentation for this launch parameter on this page:

https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html
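
As a quick check that a larger heap setting actually took effect, a minimal
sketch like the following prints the maximum heap the JVM reports. The class
name HeapCheck and the 2g value mentioned below are illustrative assumptions;
they are not part of the PEAR installer.

```java
// Minimal sketch: report the maximum heap the running JVM is allowed to use.
// HeapCheck is a hypothetical helper, not a UIMA or OpenNLP class.
class HeapCheck {

    // Maximum heap size in MiB, as reported by Runtime.maxMemory().
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

After compiling with javac, running it as "java -Xmx2g HeapCheck" should report
a value in the neighborhood of 2048 MiB (the exact figure can differ somewhat
depending on the JVM and garbage collector).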

-Marshall


On 4/17/2017 5:20 PM, Benedict Holland wrote:
> Hello all,
>
> I get this error when I run the pear installer using the built results from
> the OpenNLP application. Is there anything I can do?
>
> Thanks,
> ~Ben
>
> Verification of opennlp.uima.OpenNlpTextAnalyzer failed =>
>  java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.io.DataInputStream.readUTF(Unknown Source)
> at java.io.DataInputStream.readUTF(Unknown Source)
> at
> opennlp.tools.ml.model.BinaryFileDataReader.readUTF(BinaryFileDataReader.java:59)
> at
> opennlp.tools.ml.model.AbstractModelReader.readUTF(AbstractModelReader.java:80)
> at
> opennlp.tools.ml.model.AbstractModelReader.getPredicates(AbstractModelReader.java:117)
> at
> opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:77)
> at
> opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:62)
> at
> opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:85)
> at
> opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:32)
> at
> opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:29)
> at
> opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:309)
> at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:239)
> at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:173)
> at opennlp.tools.parser.ParserModel.<init>(ParserModel.java:177)
> at
> opennlp.uima.parser.ParserModelResourceImpl.loadModel(ParserModelResourceImpl.java:35)
> at
> opennlp.uima.parser.ParserModelResourceImpl.loadModel(ParserModelResourceImpl.java:26)
> at
> opennlp.uima.util.AbstractModelResource.load(AbstractModelResource.java:35)
> at
> org.apache.uima.resource.impl.ResourceManager_impl.registerResource(ResourceManager_impl.java:750)
> at
> org.apache.uima.resource.impl.ResourceManager_impl.initializeExternalResources(ResourceManager_impl.java:594)
> at
> org.apache.uima.resource.Resource_ImplBase.initialize(Resource_ImplBase.java:210)
> at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.initialize(AnalysisEngineImplBase.java:157)
> at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:128)
> at
> org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
> at
> org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:331)
> at
> org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:448)
> at
> org.apache.uima.pear.tools.InstallationTester.testAnalysisEngine(InstallationTester.java:218)
> at
> org.apache.uima.pear.tools.InstallationTester.doTest(InstallationTester.java:113)
> at
> org.apache.uima.pear.tools.InstallationController.verifyComponentInstallation(InstallationController.java:1110)
> at
> org.apache.uima.pear.tools.InstallationController.verifyComponent(InstallationController.java:1993)
> at
> org.apache.uima.tools.pear.install.InstallPear.installPear(InstallPear.java:389)
>

[ANNOUNCE] Apache UIMA Java SDK 2.10.0 released

2017-04-05 Thread Marshall Schor

The Apache UIMA team is pleased to announce the release of Apache UIMA Java SDK,
version 2.10.0. 

Apache UIMA <http://uima.apache.org> is a component architecture and framework
for the analysis of unstructured content like text, video and audio data.

This is mostly a bug fix release, but includes some new capabilities to use
external resource settings values within UIMA's XML descriptors.  For more
details, see the UIMA News item:
http://uima.apache.org/news.html#4%20Apr%202017

-Marshall Schor, for the Apache UIMA development team




Re: Retrieving annotator back from analysis engine

2017-03-30 Thread Marshall Schor

Hi James,

The UIMA terminology discusses two kinds of entities:

  a) Annotators - take a CAS in, operate on it, update it, etc.  These are the
building blocks of pipelines.

  b) UIMA Applications (e.g., "pipelines") made up of some collection of
Annotators.

In most UIMA applications there is a single pipeline containing a number of
Annotators. Is this what you have?  Or are you running multiple (perhaps
different) collections of annotators, each having its own pipeline?

The produceAnalysisEngine call takes an object which is a ResourceSpecifier. 
That object is a description of the entire pipeline - what annotators are in it,
configuration parameters, etc.  The output of that is an AnalysisEngine object
that represents the whole pipeline.

There's no reference from that AnalysisEngine object back to the
ResourceSpecifier that was used to direct the construction of the pipeline.

So, I don't think what you want to do can be done.
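
To illustrate why the cast fails, here is a minimal sketch in plain Java. All
of the types in it (Engine, PipelineEngine, and this stand-in MyAnnotator) are
hypothetical and are not UIMA classes; the only assumption is that the engine
is a separate wrapper object that holds the annotators internally rather than
being an instance of the annotator class.

```java
import java.util.Collections;
import java.util.List;

// Demonstrates that a wrapper object cannot be cast to the class it wraps.
class CastDemo {

    // Returns true when the engine object cannot be cast to the annotator class.
    static boolean castFails(Engine engine) {
        try {
            Object o = engine;
            MyAnnotator ma = (MyAnnotator) o; // fails: the engine is not a MyAnnotator
            ma.callMyFunction();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Engine ae = new PipelineEngine(Collections.singletonList(new MyAnnotator()));
        System.out.println("cast fails: " + castFails(ae));
    }
}

// Hypothetical stand-in for the engine interface the application sees.
interface Engine {
    void process(StringBuilder cas);
}

// Hypothetical stand-in for a user-written annotator with an extra method.
class MyAnnotator {
    void process(StringBuilder cas) {
        cas.append("[annotated]");
    }

    String callMyFunction() {
        return "extra";
    }
}

// Hypothetical pipeline wrapper: it contains annotators but is not one itself.
class PipelineEngine implements Engine {
    private final List<MyAnnotator> annotators;

    PipelineEngine(List<MyAnnotator> annotators) {
        this.annotators = annotators;
    }

    @Override
    public void process(StringBuilder cas) {
        for (MyAnnotator a : annotators) {
            a.process(cas);
        }
    }
}
```

The wrapper exposes only the engine interface, so the extra methods on the
annotator class are not reachable through it.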

That being said, perhaps the high level design can be adjusted.  I'm wondering
if two things got a bit conflated in the design - the idea of analysis engine
"components" (e.g. Annotators) and the idea of analysis engines themselves (the
pipelines that contain the annotators, configuration data, etc.)?

-Marshall

On 3/29/2017 1:11 PM, James Baker wrote:
> In my UIMA application, I have a number of AnalysisEngines (as you might
> expect). These were created using UIMAFramework.produceAnalysisEngine(...)
> on my annotators, which all extend MyAnnotator (which in turn extends
> JCasAnnotator_ImplBase).
>
> I want to get from the AnalysisEngine back to the original class (cast to
> MyAnnotator) so that I can access some of the additional functions I've
> added to the class. However, I can't seem to work out how to do that. Could
> someone give some pointers?
>
> For clarity, I've included below some code of what I'm trying to achieve
> (I'm aware that the code below doesn't work as I've tried it!)
>
> 
>
> // Get the analysis engine from wherever it is; this bit's not important
> AnalysisEngine ae = getAnalysisEngine();
>
> MyAnnotator ma = (MyAnnotator) ae; //Throws ClassCastException
> ma.callMyFunction(); //This is what I'm really trying to get to
>
> 
>
> Thanks,
> James
>

Re: Updating to 3.0.0-alpha

2017-03-27 Thread Marshall Schor

Hi, this is now fixed (in the next release).

To work around for now, please replace calls to
TypeSystemUtils.classifyType(Type myType) with

TypeSystemImpl.getTypeClass((TypeImpl) myType)


Cheers. -Marshall


On 3/27/2017 11:37 AM, Marshall Schor wrote:
> will fix this under Jira issue https://issues.apache.org/jira/browse/UIMA-5387
>
> -Marshall
>
>
> On 3/27/2017 10:24 AM, Marshall Schor wrote:
>> Hi,
>>
>> sorry, I missed this email...  I'll investigate.  It seems this method was
>> removed as part of the reorganization of how this information is maintained 
>> in V3
>>
>> But you make a good point about backwards compatibility...
>>
>> -Marshall
>>
>>
>> On 2/21/2017 11:39 PM, Chad Cravens wrote:
>>> Hello UIMA group!
>>>
>>> Having a great time working with UIMA!
>>>
>>> I'm still pretty new to UIMA and am working with a pre-existing UIMA
>>> application, upgrading to 3.0.0 alpha from 2.9.0. The document states that
>>> backwards-compatibility is important, which is great in my case.
>>>
>>> I regenerated the JCas classes, and everything seemed to be ok, except for
>>> one line from the Regex Annotator package:
>>> https://uima.apache.org/sandbox.html#regex.annotator
>>>
>>> Line 171 of FeaturePath_Impl.java uses the following code:
>>> // get feature type and type code
>>> Type featureType = feature.getRange();
>>> int featureTypeCode = TypeSystemUtils.classifyType(featureType);
>>>
>>> I pored over the trunk branch on GitHub and realized that the method is
>>> indeed gone:
>>> https://github.com/apache/uima-uimaj/blob/trunk/uimaj-core/src/main/java/org/apache/uima/cas/impl/TypeSystemUtils.java
>>>
>>> Just wondering if others are experiencing the same issue or is there
>>> something I'm missing here?
>>>
>>> Thanks for the help!
>>>
>

Re: Updating to 3.0.0-alpha

2017-03-27 Thread Marshall Schor

I will fix this under Jira issue https://issues.apache.org/jira/browse/UIMA-5387

-Marshall


On 3/27/2017 10:24 AM, Marshall Schor wrote:
> Hi,
>
> sorry, I missed this email...  I'll investigate.  It seems this method was
> removed as part of the reorganization of how this information is maintained 
> in V3
>
> But you make a good point about backwards compatibility...
>
> -Marshall
>
>
> On 2/21/2017 11:39 PM, Chad Cravens wrote:
>> Hello UIMA group!
>>
>> Having a great time working with UIMA!
>>
>> I'm still pretty new to UIMA and am working with a pre-existing UIMA
>> application, upgrading to 3.0.0 alpha from 2.9.0. The document states that
>> backwards-compatibility is important, which is great in my case.
>>
>> I regenerated the JCas classes, and everything seemed to be ok, except for
>> one line from the Regex Annotator package:
>> https://uima.apache.org/sandbox.html#regex.annotator
>>
>> Line 171 of FeaturePath_Impl.java uses the following code:
>> // get feature type and type code
>> Type featureType = feature.getRange();
>> int featureTypeCode = TypeSystemUtils.classifyType(featureType);
>>
>> I pored over the trunk branch on GitHub and realized that the method is
>> indeed gone:
>> https://github.com/apache/uima-uimaj/blob/trunk/uimaj-core/src/main/java/org/apache/uima/cas/impl/TypeSystemUtils.java
>>
>> Just wondering if others are experiencing the same issue or is there
>> something I'm missing here?
>>
>> Thanks for the help!
>>
>

Re: Updating to 3.0.0-alpha

2017-03-27 Thread Marshall Schor

Hi,

Sorry, I missed this email...  I'll investigate.  It seems this method was
removed as part of the reorganization of how this information is maintained in
V3.

But you make a good point about backwards compatibility...

-Marshall


On 2/21/2017 11:39 PM, Chad Cravens wrote:
> Hello UIMA group!
>
> Having a great time working with UIMA!
>
> I'm still pretty new to UIMA and am working with a pre-existing UIMA
> application, upgrading to 3.0.0 alpha from 2.9.0. The document states that
> backwards-compatibility is important, which is great in my case.
>
> I regenerated the JCas classes, and everything seemed to be ok, except for
> one line from the Regex Annotator package:
> https://uima.apache.org/sandbox.html#regex.annotator
>
> Line 171 of FeaturePath_Impl.java uses the following code:
> // get feature type and type code
> Type featureType = feature.getRange();
> int featureTypeCode = TypeSystemUtils.classifyType(featureType);
>
> I pored over the trunk branch on GitHub and realized that the method is
> indeed gone:
> https://github.com/apache/uima-uimaj/blob/trunk/uimaj-core/src/main/java/org/apache/uima/cas/impl/TypeSystemUtils.java
>
> Just wondering if others are experiencing the same issue or is there
> something I'm missing here?
>
> Thanks for the help!
>

Re: Question regarding the encoding of footnotes, marginal notes and images

2017-03-22 Thread Marshall Schor

Hi,

Here are some thoughts.

* You have main-text, images, margin notes, and for the latter two, "position on
the page" information.

You should put the main-text into a sofa, like you say.

You may put the images and margin notes into either additional sofas or feature
structures in the main sofa.

The decision for where to put these depends on what kind of analysis you plan to
do with the images and margin notes.  They should be in sofas if you plan to run
some unstructured analytics annotators over them, for example some image
recognition or classification analytics.  But if you just need to keep these as
artifacts, with no particular kind of analytics for these parts, just put them
in additional feature structures in the main sofa.

Re:  can UIMA handle sofas with different kinds of data:  yes it can.  Each sofa
can be a text string or a byte array (local or remote); see:
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.aas.sofa

Re: can annotations refer to feature structures in other sofas: yes they can.

See
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.mvs.sample_application

-Marshall

On 3/22/2017 10:32 AM, Markus Krug wrote:
> Dear UIMA-users,
>
> we are currently facing the issue, that the documents we are processing
> using UIMA have more than just "linear text".
>
> On top of text we got images and marginal notes that should be encoded
> at the correct positions. (Output of OCR and image segmentation)
>
> So far I do not know if UIMA is capable of handling sofas with different
> types of material (e.g. text and images).
>
> We came up with a concept like this (please comment if this is stupid or
> if better ways to handle this have been found already)
>
> 1. Store the main text in the primary sofa
>
> 2. For each image/marginal note, use a different sofa and store the
> content in there
>
> 3. In the main text, refer to annotations in different sofas (is this
> possible? I never needed this before) at the corresponding position
>
> If there are any best practices for this kind of problem, I would be
> glad if you would let me know.
>
> Thanks in advance
>
> Markus Krug
>
>

Re: Many views in the cas to serialize cause java.lang.NullPointerException in service uima-as

2017-02-15 Thread Marshall Schor

>>>> public int getSofaAddr(int sofaNum) {
>>>>   // skip if initial view && no Sofa yet
>>>>   if (sofaNum != 1 || cas.isInitialSofaCreated()) {
>>>>     // all non-initial-views must have a sofa
>>>>     return ((CASImpl) cas.getView(sofaNum)).getSofaRef(); // line throwing the NPE
>>>>   }
>>>>   return 0;
>>>> }
>>>>
>>>> It looks to me like getView(sofaNum) is returning null. Is it possible that
>>>> two threads are operating on the same CAS, one removing a view while
>>>> another tries to serialize? I have no idea what else it could be.
>>>>
>>>> -jerry
>>>>
>>>>
>>>>
>>>> On Fri, Feb 10, 2017 at 8:45 AM, nelson rivera <
>> nelsonriver...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, the first thing I did was run these tests. I made a simple test case
>>>>> that creates a CAS with 17 views and then serializes it using
>>>>> XmiCasSerializer.serialize(newJCas.getCas(), fis); it serializes
>>>>> correctly.
>>>>> I also ran another test: I initialized the same AE locally with the
>>>>> UIMA API and processed the same input documents. The processing is
>>>>> correct, and the CAS then serializes without problems.
>>>>>
>>>>> The error occurs only when the AE is deployed in UIMA-AS and consumed there.
>>>>>
>>>>> 2017-02-09 17:30 GMT-05:00, Marshall Schor <m...@schor.com>:
>>>>>> one thing that would help track this down is a small isolated test
>>>>>> case.
>>>>>>
>>>>>> Do you think uima-as is needed? I'm wondering if a simple test case
>>>>> which
>>>>>> generated 17 views and then tried to serialize would show the
>>>>>> failure...
>>>>>>
>>>>>> If you could supply a small test case that showed the failure so we
>>>>> could
>>>>>> reproduce it, that would enable a rapid resolution.
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>> On 2/9/2017 3:58 PM, Marshall Schor wrote:
>>>>>>>  The line throwing the null pointer exception is :
>>>>>>>
>>>>>>> cas.getView(sofaNum).getSofaRef()
>>>>>>>
>>>>>>> So the NPE means that either cas is null or getView(sofaNum) is
>>>>>>> returning null.
>>>>>>>
>>>>>>> I'm not sure what the best way is to debug this...
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/9/2017 12:42 PM, nelson rivera wrote:
>>>>>>>> I have an aggregate UIMA-AS service. At the end of the aggregate, the
>>>>>>>> CAS to return is composed of as many views as there are input files,
>>>>>>>> each view holding the annotations from processing.
>>>>>>>> With fewer than 15 input documents the processing always succeeds,
>>>>>>>> but if the number of documents is greater than 15, I get a
>>>>>>>> NullPointerException in the aggregate service when it tries to
>>>>>>>> serialize the CAS, not during the processing of the aggregate AE.
>>>>>>>> The logs of the aggregate service:
>>>>>>>>
>>>>>>>> 11:51:38.815 - 42:
>>>>>>>> cu.datys.xinetica.uima.core.MergerInViewCasMultipler.hasNext(285):
>>>>>>>> INFO: HasNext false
>>>>>>>> 11:51:38.875 - 44:
>>>>>>>> org.apache.uima.uimacpp.UimacppAnalysisComponent.log(396): INFO: :
>>>>>>>> XClusterAnalyzer::process --- OK
>>>>>>>> 11:51:39.145 - 45:
>>>>>>>> org.apache.uima.aae.controller.AggregateAnalysisEngineController_impl.replyToClient:
>>>>>>>> WARNING: Service: XClusterAnalyzerAggregate Runtime Exception
>>>>>>>> 11:51:39.145 - 45:
>>>>>>>> o
