Why do you think that you need to have a tokenizer? The example that Adam sent doesn't have a tokenizer in it at all. It simply depends on an Analysis Engine earlier in the pipeline that produces Person annotations.
Perhaps, rather than going through the tokenizer route, you should just try to do some sort of regular expression matching on your list of person names. The version of UIMA from IBM on AlphaWorks includes an example that matches building and room numbers, which might be helpful for you to follow. I haven't looked at the latest documentation to see if this example is around in the Apache versions, and the Apache website doesn't seem to be working for me at the moment. This example might be a little different now, as it references buildings and rooms at IBM.

There are many different ways in which UIMA is valuable, such as enabling distributed processing, but what you seem to be most interested in is using it to connect various extraction processes (i.e. Analysis Engines). There isn't any fixed way to do this, and there shouldn't be. For example, if I write a Person Name Annotator Analysis Engine that depends on tokens that know whether or not they are capitalized, then somewhere in the UIMA pipeline these have to be provided. Suppose I want to use a tokenizer that doesn't do anything with capitalization. In that case I have to write some code, be it in Java, C++, Python, or Perl (in my case I've been using JRuby, and Groovy would work well too). This code would have to provide tokens with knowledge of capitalization. UIMA is itself agnostic on how to do this.

There are so many different possible variations on what could go into a pipeline that it would be impossible to handle them all. If UIMA were to provide tokenization automatically, why shouldn't it also provide video scene segmentation or phonetic syllable segmentation of audio, which UIMA also enables?

That being said, there are some additional tools that might help with putting together various pipelines. For instance, I've been working on (although not very hard) an Analysis Engine that would allow a different Analysis Engine to work on only part of the output of another Analysis Engine.
Suppose I had an Analysis Engine that detected the different languages being used in a text. Then suppose I had a Person Annotation Extractor that only works on Japanese. I might want to be able to send the Japanese parts of my text to the Person Annotation Extractor without writing any code. I'm not at all sure what the best way to go about this would be. Such an Analysis Engine might be good to include in the UIMA package but it might not belong in the specification.

BTW is anybody using the Perlator or Pythonator swig stuff with UIMA 2.x?

-----Original Message-----
From: LASRI YASSINE [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 22, 2007 5:02 PM
To: [email protected]
Subject: Re: Help on UIMA Please !

Hi Adam,

Thanks for the given example !
it's a month that i have started working with UIMA API and i can't until now understand what the value added of UIMA ?
for example : if I want to use external resource and check if an entity in the external resource is matched in the given CAS document ?
why sould I write a tokenizer and other thing of java code to do so
Why UIMA doesn't offer this possibility directly whithout any other java code ?

-Yassine

2007/3/22, Adam Lally <[EMAIL PROTECTED]>:
>
> On 3/15/07, LASRI YASSINE <[EMAIL PROTECTED]> wrote:
> > 2007/3/15, Michael Baessler <[EMAIL PROTECTED]>:
> > >
> > > LASRI YASSINE wrote:
> > > > Exactly what I need, but rule can be either regular expression or
> > > > aggregation of premitifs annotators ?
> > > > Have any example ?
> > > When I understand you correct, you want to have a rules that says:
> > >
> > > rule1: [person] /meets/ [person]
> > >
> > > where the rules consist of a person annotation followed by a regular
> > > expression "meets" followed by another person annotation.
> > > Is that what you mean by "either regular expression or aggregation of
> > > premitifs annotators"?
> > >
> >
> > Yes of course that's what I mean !
> >
> > > Sorry I haven't got any example or implementation that do such kind of
> > > processing. Maybe some other users on the users list can help you here
> > > if they have some experience.
> > >
> >
> > If any user have an example, please send it to me
> >
>
> I don't have a ready-to-run example, but to get yourself started I
> would do something like this:
>
> FSIndex personIndex = aJCas.getAnnotationIndex(Person.type);
> //iterate over pairs of Person annotations
> Iterator personIter = personIndex.iterator();
> while (personIter.hasNext()) {
>   Person person1 = (Person)personIter.next();
>   if (!personIter.hasNext())
>     break;
>   Person person2 = (Person)personIter.next();
>   if (person1.getEnd() < person2.getBegin()) {
>     //check if the text between the annotations contains the word "meets"
>     //(this could easily be a regular expression match instead, of course)
>     String textBetween = aJCas.getDocumentText().substring(
>         person1.getEnd(), person2.getBegin());
>     if (textBetween.indexOf("meets") > -1) {
>       //create annotation
>       MeetsRelationAnnotation newAnnot = new MeetsRelationAnnotation(
>           aJCas, person1.getBegin(), person2.getEnd());
>       newAnnot.addToIndexes();
>     }
>   }
> }
>
> Note I just typed that right into this email, so there might be syntax
> errors. But it should give you the idea.
>
> Now if you want to turn this into a more general annotator that you
> can configure with arbitrary rules that tell it what to match, then
> that's a much more complex question. What we can help you with here
> is how to use the UIMA APIs.
>
> Regards,
> -Adam
>
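
To make the regular-expression suggestion at the top of this message a bit more concrete, here is a minimal sketch in plain Java with no UIMA dependency. The class and method names (NameSpans, Span, findNames, findMeets) are made up for illustration and are not part of UIMA. It assumes the names start and end with word characters, so that the \b word-boundary anchors behave sensibly. Inside a real annotator the offsets would be used to create Person and MeetsRelationAnnotation annotations, as in Adam's code above. Unlike the quoted loop, which advances two annotations at a time, this sketch checks every adjacent pair of names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameSpans {

    /** Character offsets of one match, analogous to an annotation's begin/end. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Finds every occurrence of any listed name, in document order. */
    static List<Span> findNames(String text, List<String> names) {
        List<Span> spans = new ArrayList<>();
        if (names.isEmpty()) {
            return spans;
        }
        // Quote each name so characters such as '.' are matched literally.
        StringBuilder alternation = new StringBuilder();
        for (String name : names) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(name));
        }
        // \b keeps a short name from matching inside a longer word.
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end()));
        }
        return spans;
    }

    /** Returns one span per adjacent pair of names with "meets" between them. */
    static List<Span> findMeets(String text, List<Span> people) {
        List<Span> relations = new ArrayList<>();
        Pattern meets = Pattern.compile("\\bmeets\\b");
        for (int i = 0; i + 1 < people.size(); i++) {
            Span first = people.get(i);
            Span second = people.get(i + 1);
            if (first.end >= second.begin) {
                continue; // nothing between the two names
            }
            String between = text.substring(first.end, second.begin);
            if (meets.matcher(between).find()) {
                // The relation covers both names and the text between them.
                relations.add(new Span(first.begin, second.end));
            }
        }
        return relations;
    }
}
```

You would call findNames with the document text and your list of person names, then pass the resulting spans to findMeets. Building one alternation pattern is simple but will get slow for very large name lists; a dictionary lookup would probably be a better fit in that case.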
