Re: newbie questions about UIMA Types

2016-12-01 Thread Marshall Schor
Here's how I think about UIMA types, versus rich data structures available in
modern programming languages.

UIMA was designed to facilitate combining more-or-less independently developed
components (Annotators). 

How each primitive annotator is programmed internally is invisible to UIMA -
the annotator can make arbitrary use of any of a modern programming language's
rich data structures.  The UIMA types come into play at the annotator boundary -
they are there to specify how data is passed from one annotator in the
pipeline to the next.

When writing a primitive annotator, developers should keep this distinction in
mind: use appropriate internal data structures (in whatever language they're
writing in) for computation, and read from or write to CAS data structures only
the data that is meant to be shared with the other Annotators the CAS is sent to.
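
The distinction above can be sketched in plain Java. This is only an illustration under stated assumptions: `SharedStore`, `Span`, and `RepeatedWordAnnotator` are hypothetical stand-ins, not UIMA classes, and the store plays the role of the CAS only in the loosest sense (flat records that other components can read).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    public static void main(String[] args) {
        SharedStore store = new SharedStore("the cat saw the dog");
        new RepeatedWordAnnotator().process(store);
        for (SharedStore.Span s : store.spans) {
            System.out.println(s.label + " " + s.begin + "-" + s.end);
        }
    }
}

// Hypothetical stand-in for a CAS: document text plus flat, simple records.
class SharedStore {
    static final class Span {
        final int begin;
        final int end;
        final String label;
        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    final String text;
    final List<Span> spans = new ArrayList<>();

    SharedStore(String text) {
        this.text = text;
    }
}

// Hypothetical annotator: marks every occurrence of a word used more than once.
class RepeatedWordAnnotator {
    void process(SharedStore store) {
        // Internal working data: a map of lists, never placed in the store.
        Map<String, List<int[]>> offsetsByWord = new HashMap<>();
        Matcher m = Pattern.compile("\\w+").matcher(store.text);
        while (m.find()) {
            offsetsByWord
                .computeIfAbsent(m.group().toLowerCase(), k -> new ArrayList<>())
                .add(new int[] {m.start(), m.end()});
        }
        // Boundary: only the results other components need are written out.
        for (List<int[]> offsets : offsetsByWord.values()) {
            if (offsets.size() > 1) {
                for (int[] o : offsets) {
                    store.spans.add(new SharedStore.Span(o[0], o[1], "repeated"));
                }
            }
        }
    }
}
```

The map of offset lists never leaves the annotator; a downstream component sees only the simple spans.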

Data in the CAS can be accessed using JCasGen'ed classes.  This is a bridge to
make accessing CAS data more Java-like and convenient.  JCas is one of two main
"APIs" for a CAS.  The other one is the non-JCas style, where you make use of
the CAS APIs directly, using UIMA Types and Features to create instances of
types (feature structures) and to set and get feature values.  This non-JCas
style is the one available in the UIMA C++ (UIMACPP) implementation.  That
implementation has no equivalent to the JCas style, and no equivalent to
JCasGen.
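
As a rough illustration of the two access styles, here is a plain-Java sketch. It does not use the UIMA API: `GenericFS` and `TokenView` are hypothetical names, and the wrapper is only loosely analogous to a JCasGen'ed class. The point is that both styles read and write the same underlying feature values.

```java
import java.util.HashMap;
import java.util.Map;

class ApiStylesDemo {
    static String demo() {
        GenericFS fs = new GenericFS("example.Token");
        fs.setFeature("begin", 0);            // generic style: feature looked up by name
        TokenView token = new TokenView(fs);
        token.setPos("NN");                   // wrapper style: typed setter
        // Each value is visible through the other style as well.
        return fs.typeName + ":" + token.getBegin() + ":" + fs.getFeature("pos");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

// Hypothetical generic feature structure: a type name plus named feature values.
class GenericFS {
    final String typeName;
    private final Map<String, Object> features = new HashMap<>();

    GenericFS(String typeName) {
        this.typeName = typeName;
    }

    void setFeature(String name, Object value) {
        features.put(name, value);
    }

    Object getFeature(String name) {
        return features.get(name);
    }
}

// Hypothetical generated-style wrapper: typed accessors over the same data.
class TokenView {
    private final GenericFS fs;

    TokenView(GenericFS fs) {
        this.fs = fs;
    }

    int getBegin() {
        return (Integer) fs.getFeature("begin");
    }

    String getPos() {
        return (String) fs.getFeature("pos");
    }

    void setPos(String pos) {
        fs.setFeature("pos", pos);
    }
}
```

The generic style needs no generated code, which is why it is the style that carries over to an implementation without a code generator.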

The same principles apply, in that when writing a C/C++ annotator, you may use
all the modern data structures, etc. internally within the annotator, and at the
boundaries, where you want to share data with other annotators, you get or put
data into the CAS types and features.

Data in the CAS has to be serializable in standard ways.  UIMA uses this
property to allow "remoting" annotators and connecting to them using a service
interface capability.  You can see this in the UIMA-AS extension of UIMA, which
has deployment descriptors specifying how you want to deploy various annotators
that make up a pipeline.

There is one more capability UIMA provides for co-located Annotators in a
pipeline; these can be set up to share a common external resource. An external
resource can make use of any modern programming language constructs (e.g., a
HashMap).  But it is not "serialized".  It doesn't participate in the basic
architecture of sending a CAS through a pipeline of (independently developed)
annotators, and supporting the "remoting" of those annotators in a scaled-out
implementation.

User augmentation of JCas classes is possible. But only the data that is in CAS
Types and features is transported when the CAS is sent to a remote annotator (or
from a Java Annotator to a C/C++ Annotator).  As a limited side effect, when a
CAS is sent from a Java annotator to another Java annotator in a pipeline,
running on a single machine in a single JVM, the UIMA framework doesn't
serialize the data - it just passes things internally in memory.  In this
instance, auxiliary data kept in a JCas class is maintained and can be used in
subsequent annotators.  As you can see, extra data in the JCas, because it isn't
reliably available to other annotators except under some controlled situations,
should not generally be used for data to be shared among annotators; that's what
the CAS was designed to support.  But it's fine to use this for data local to
one annotator.  For that usage, though, it would be better to keep this data as
local data associated with the Annotator, rather than with some particular JCas
type.
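
The effect described above can be imitated with standard Java serialization. This is an analogy only: UIMA serializes a CAS in its own formats, not with `java.io`, and `TokenRecord` and `TransportDemo` are hypothetical names. The declared fields play the role of CAS features, and the `transient` field plays the role of extra data added by hand to a JCas class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class TransportDemo {
    // Same JVM: the very same object is handed on, extra data included.
    static TokenRecord passInMemory(TokenRecord t) {
        return t;
    }

    // "Remote" hop: only the serialized form arrives on the other side.
    static TokenRecord passRemote(TokenRecord t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(t);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TokenRecord) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TokenRecord t = new TokenRecord(0, 5);
        t.scratch.put("seen", 1);
        System.out.println(passInMemory(t).scratch);
        System.out.println(passRemote(t).scratch);
    }
}

// Hypothetical record: begin/end are "features", scratch is hand-added extra data.
class TokenRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final int begin;
    final int end;
    transient Map<String, Integer> scratch = new HashMap<>();

    TokenRecord(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}
```

After the in-memory hand-off the scratch map is still there; after the serialized hop the field comes back as `null` while `begin` and `end` survive.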

One other analogy, for these modern times, might be useful.  Think of the trend
today in microservices.  The microservice, itself, can make use of arbitrarily
complex internal data structures.  But its external interface is some fairly
simple set of APIs with fairly simple data.

The CAS is like those APIs in its purpose - to provide a way for multiple
components (annotators), which might have arbitrarily complex internal data
structures, to communicate with other, independently developed components
(annotators).

I hope this answers some of your questions.

-Marshall

On 11/30/2016 3:25 PM, David Fox wrote:
> Does the UIMA Java framework support modifying or extending the Java class 
> generated by JCasGen corresponding to a custom Type?   If so, are there any 
> common circumstances where this is necessary?
>
> I didn’t see anything in the examples or documentation about modifying the 
> generated classes, but I also didn’t see anything saying you couldn’t.  I 
> suspect that this is not supported (and that otherwise you wouldn’t be able 
> to pass a CAS between distributed UIMA AS components, or between a Java 
> annotator and a C++ one).  But it would be nice to know for certain.
>
> The reason I ask is that the set of data structures supported by UIMA types 
> (individual FS references,  FSList linked lists, and FSArray arrays) is 
> fairly limited compared to modern programming languages, which often directly 
> support associative arrays, trees, and graphs.  I’m trying to understand 
> whether this is a restriction on the implementation of custom types (which it 
> would be if modifying/extending the generated class was not supported), or 
> just on the public interface accessible via the UIMA API.
>
> David

Re: newbie questions about UIMA Types

2016-12-01 Thread David Fox
Thanks for the detailed reply and examples.

I've got some tangentially related questions about types in UIMA C++,
which I hope that either you or someone else can answer:

If you need to use a custom Type in an annotator written with the UIMA C++
SDK, 

1) do you need to define a corresponding custom C++ class (analogous to
the one generated by JCasGen)?
2) if so, is there a comparable CppCasGen, or do you need to write it
manually?

Thanks in advance,
David


On 11/30/16, 8:23 PM, "Richard Eckart de Castilho" wrote:

>It is possible to customize the generated JCas classes, yes. You can e.g.
>add your own methods or even your own fields. However, your own fields would
>not be saved/loaded when you persist a CAS e.g. to XMI.
>
>As a case for a custom method, consider e.g. the DKPro Core Token
>"setText(string)" method [1].
>If the "string" passed to the method differs from the covered text of the
>Token, then a new
>"Form" annotation with the value "string" is created, linked to the Token.
>
>Another case would be the "links()" method on the DKPro Core
>CoreferenceChain type [2]. It returns all elements in the respective
>coreference chain as a List, thus saving the user from manually iterating
>over the whole chain to reach all elements.
>
>FSList and friends are built-in types of UIMA Core - you can't modify
>these. But uimaFIT provides
>several methods to make working with these things much more convenient.
>See
>
>- org.apache.uima.fit.util.FSCollectionFactory and its methods to create
>FSList etc from Java collections
>- org.apache.uima.fit.util.JCasUtil has select methods to retrieve
>elements from FSList etc
>- org.apache.uima.fit.util.FSUtil has methods to conveniently get/set
>feature values including multi-valued features.
>
>Best,
>
>-- Richard
>
>[1] 
>https://github.com/dkpro/dkpro-core/blob/71fda5c6ba91748b6e87312554e418ac1e2911c6/dkpro-core-api-segmentation-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/segmentation/type/Token.java#L313
>[2] 
>https://github.com/dkpro/dkpro-core/blob/ba33629fc0f077337f9af39e38e1b58531e1674e/dkpro-core-api-coref-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/coref/type/CoreferenceChain.java#L101
>
>> On 30.11.2016, at 20:25, David Fox wrote:
>> 
>> Does the UIMA Java framework support modifying or extending the Java class
>>generated by JCasGen corresponding to a custom Type?   If so, are there
>>any common circumstances where this is necessary?
>> 
>> I didn't see anything in the examples or documentation about modifying
>>the generated classes, but I also didn't see anything saying you
>>couldn't.  I suspect that this is not supported (and that otherwise you
>>wouldn't be able to pass a CAS between distributed UIMA AS components,
>>or between a Java annotator and a C++ one).  But it would be nice to
>>know for certain.
>> 
>> The reason I ask is that the set of data structures supported by UIMA
>>types (individual FS references,  FSList linked lists, and FSArray
>>arrays) is fairly limited compared to modern programming languages,
>>which often directly support associative arrays, trees, and graphs.  I'm
>>trying to understand whether this is a restriction on the implementation
>>of custom types (which it would be if modifying/extending the generated
>>class was not supported), or just on the public interface accessible via
>>the UIMA API.
>> 
>> David



Re: newbie questions about UIMA Types

2016-11-30 Thread Richard Eckart de Castilho
It is possible to customize the generated JCas classes, yes. You can e.g. add 
your own methods or even your own fields. However, your own fields would not be 
saved/loaded when you persist a CAS e.g. to XMI.

As a case for a custom method, consider e.g. the DKPro Core Token 
"setText(string)" method [1].
If the "string" passed to the method differs from the covered text of the 
Token, then a new
"Form" annotation with the value "string" is created, linked to the Token.

Another case would be the "links()" method on the DKPro Core CoreferenceChain 
type [2]. It returns all elements in the respective coreference chain as a List, 
thus saving the user from manually iterating over the whole chain to reach all 
elements.
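
A convenience method of this kind might look roughly like the following. This is a plain-Java sketch, not the DKPro Core source: `Chain` and `Link` are hypothetical stand-ins for a chain type whose elements are connected through a "next" reference.

```java
import java.util.ArrayList;
import java.util.List;

class ChainDemo {
    static Chain sample() {
        Link a = new Link("Anna");
        Link b = new Link("she");
        Link c = new Link("her");
        a.next = b;
        b.next = c;
        Chain chain = new Chain();
        chain.first = a;
        return chain;
    }

    public static void main(String[] args) {
        for (Link link : sample().links()) {
            System.out.println(link.text);
        }
    }
}

// Hypothetical chain element: some text plus a reference to the next element.
class Link {
    final String text;
    Link next;

    Link(String text) {
        this.text = text;
    }
}

// Hypothetical chain type with a hand-written convenience method.
class Chain {
    Link first;

    // Walks the chain once and returns its elements in order.
    List<Link> links() {
        List<Link> result = new ArrayList<>();
        for (Link l = first; l != null; l = l.next) {
            result.add(l);
        }
        return result;
    }
}
```

The method adds no stored data of its own; it only derives a view from references that are already part of the type, so nothing is lost when the data is persisted.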

FSList and friends are built-in types of UIMA Core - you can't modify these. 
But uimaFIT provides
several methods to make working with these things much more convenient. See

- org.apache.uima.fit.util.FSCollectionFactory and its methods to create FSList 
etc from Java collections
- org.apache.uima.fit.util.JCasUtil has select methods to retrieve elements 
from FSList etc
- org.apache.uima.fit.util.FSUtil has methods to conveniently get/set feature 
values including multi-valued features.
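
To show the kind of bridging such helpers perform, here is a plain-Java sketch of converting between a Java collection and a head/tail linked list. It does not call uimaFIT: `Cell`, `fromCollection`, and `toCollection` are hypothetical names chosen for illustration, and the real helpers operate on UIMA's own list types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListBridgeDemo {
    // Builds the linked form back to front so each cell points at the rest.
    static Cell fromCollection(List<String> items) {
        Cell list = Cell.EMPTY;
        for (int i = items.size() - 1; i >= 0; i--) {
            list = new Cell(items.get(i), list);
        }
        return list;
    }

    // Walks the cells until the shared empty marker is reached.
    static List<String> toCollection(Cell list) {
        List<String> out = new ArrayList<>();
        for (Cell c = list; c != Cell.EMPTY; c = c.tail) {
            out.add(c.head);
        }
        return out;
    }

    public static void main(String[] args) {
        Cell linked = fromCollection(Arrays.asList("a", "b", "c"));
        System.out.println(toCollection(linked));
    }
}

// Hypothetical list cell: a head value and the remainder of the list.
final class Cell {
    static final Cell EMPTY = new Cell(null, null);
    final String head;
    final Cell tail;

    Cell(String head, Cell tail) {
        this.head = head;
        this.tail = tail;
    }
}
```

With helpers like these, annotator code can work with ordinary collections and convert only at the point where the data is stored or read.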

Best,

-- Richard

[1] 
https://github.com/dkpro/dkpro-core/blob/71fda5c6ba91748b6e87312554e418ac1e2911c6/dkpro-core-api-segmentation-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/segmentation/type/Token.java#L313
[2] 
https://github.com/dkpro/dkpro-core/blob/ba33629fc0f077337f9af39e38e1b58531e1674e/dkpro-core-api-coref-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/coref/type/CoreferenceChain.java#L101

> On 30.11.2016, at 20:25, David Fox wrote:
> 
> Does the UIMA Java framework support modifying or extending the Java class 
> generated by JCasGen corresponding to a custom Type?   If so, are there any 
> common circumstances where this is necessary?
> 
> I didn’t see anything in the examples or documentation about modifying the 
> generated classes, but I also didn’t see anything saying you couldn’t.  I 
> suspect that this is not supported (and that otherwise you wouldn’t be able 
> to pass a CAS between distributed UIMA AS components, or between a Java 
> annotator and a C++ one).  But it would be nice to know for certain.
> 
> The reason I ask is that the set of data structures supported by UIMA types 
> (individual FS references,  FSList linked lists, and FSArray arrays) is 
> fairly limited compared to modern programming languages, which often directly 
> support associative arrays, trees, and graphs.  I’m trying to understand 
> whether this is a restriction on the implementation of custom types (which it 
> would be if modifying/extending the generated class was not supported), or 
> just on the public interface accessible via the UIMA API.
> 
> David