I am still trying to figure out how to count Ruta annotations across a
bunch of input files. There doesn't seem to be any Workbench way to do it.
So now I am trying to call Ruta from UimaFit so I can do the job in Java.
However, I am having serious configuration problems, plus I have a question
on how to bring in PlainTextAnnotator.
I am using Maven, with the jcasgen-maven-plugin, the ruta-maven-plugin, and
the uimafit-maven-plugin. I will include the pom file at the end of this
post.
I want my Java code to be aware of the types declared in the Ruta script -
that is the whole point - I want to count those annotations.
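As a rough illustration of the goal, the tally itself could look something like the sketch below. It is plain Java with no UIMA calls: the class name AnnotationTally and the sample type names are hypothetical, and in a real pipeline the per-document lists would have to come from each processed CAS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AnnotationTally {
    // Running totals across all documents, keyed by type name.
    // A TreeMap keeps the keys sorted so the printed summary is stable.
    private final Map<String, Integer> totals = new TreeMap<>();

    // Record the type names of the annotations found in one document.
    public void addDocument(List<String> annotationTypeNames) {
        for (String name : annotationTypeNames) {
            totals.merge(name, 1, Integer::sum);
        }
    }

    // Total number of annotations of the given type seen so far.
    public int count(String typeName) {
        return totals.getOrDefault(typeName, 0);
    }

    public Map<String, Integer> totals() {
        return totals;
    }

    public static void main(String[] args) {
        AnnotationTally tally = new AnnotationTally();
        // Hypothetical data standing in for two processed input files.
        tally.addDocument(Arrays.asList("INCL", "Line", "INCL"));
        tally.addDocument(Arrays.asList("Line"));
        System.out.println(tally.totals());
    }
}
```

The point of keeping one tally object alive across documents is that the counts accumulate over the whole input directory rather than being reset per file.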
My Ruta script also uses PlainTextAnnotator. The problem with this is that
I can't figure out where to put it. In a Workbench based Ruta project,
PlainTextAnnotator.xml and PlainTextTypeSystem.xml get put
automatically into descriptor/utils, along with a number of other
descriptors that seem to be built into Ruta. But when I create a project
using maven, there is no such location, and these descriptors do not get
put anywhere. I tried a number of places but could not get my script to see
the type system for PlainTextAnnotator. Finally, I hit on putting the files
in target/generated-sources/ruta/descriptor/utils, and finally my script is
able to see the types and I can run it. This is good because at that point,
the ruta-maven-plugin does its job and generates the descriptors for my
script. However, I suspect this is not a good place to put the
PlainTextAnnotator files since doing a clean overwrites them. Where should
they go? Is there any entry in the pom file that is needed?
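One way to narrow this down might be to check whether the descriptors are visible on the classpath at all. The sketch below uses only the JDK. It assumes the two files were copied under one of the resource directories declared in the pom (for example a hypothetical src/desc/utils/ folder), and it assumes Ruta can resolve descriptors from the classpath, which is not confirmed here. The class name DescriptorLocator is made up for illustration.

```java
import java.net.URL;

public class DescriptorLocator {
    // Returns the classpath URL of the resource, or null if it is not visible.
    public static URL locate(String resourcePath) {
        return ClassLoader.getSystemResource(resourcePath);
    }

    public static void main(String[] args) {
        // Hypothetical locations, mirroring the utils/ folder of a Workbench project.
        String[] candidates = {
            "utils/PlainTextAnnotator.xml",
            "utils/PlainTextTypeSystem.xml"
        };
        for (String path : candidates) {
            URL url = locate(path);
            System.out.println(path + " -> " + (url == null ? "NOT on classpath" : url));
        }
    }
}
```

If the files show up here after a clean build, they live in a source-controlled location and no longer depend on anything under target/.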
The second problem is that although my Ruta script works nicely on its own,
the Java code fails. I get the following exception:
Exception in thread "main" org.apache.uima.cas.CASRuntimeException: JCas type "org.apache.uima.examples.SourceDocumentInformation" used in Java code, but was not declared in the XML type descriptor.
    at org.apache.uima.jcas.impl.JCasImpl.getTypeInit(JCasImpl.java:435)
    at org.apache.uima.jcas.impl.JCasImpl.getType(JCasImpl.java:408)
    at org.apache.uima.jcas.cas.TOP.<init>(TOP.java:96)
    at org.apache.uima.jcas.cas.AnnotationBase.<init>(AnnotationBase.java:66)
    at org.apache.uima.jcas.tcas.Annotation.<init>(Annotation.java:54)
    at org.apache.uima.examples.SourceDocumentInformation.<init>(SourceDocumentInformation.java:80)
    at org.apache.uima.examples.cpe.FileSystemCollectionReader.getNext(FileSystemCollectionReader.java:162)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:149)
    at PipelineSystem.<init>(PipelineSystem.java:59)
    at PipelineSystem.main(PipelineSystem.java:73)
I am guessing that I need to put some other descriptor somewhere but I
can't figure out what it might be. Here is the code that causes the problem:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.util.Iterator;
import org.apache.uima.UIMAException;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.collection.CollectionReaderDescription;
import org.apache.uima.examples.cpe.FileSystemCollectionReader;
import org.apache.uima.fit.component.CasDumpWriter;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.CollectionReaderFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.ruta.engine.RutaEngine;
public class PipelineSystem {

    public PipelineSystem() throws IOException, UIMAException
    {
        try {
            CollectionReaderDescription readerDesc =
                CollectionReaderFactory.createReaderDescription(
                    FileSystemCollectionReader.class,
                    FileSystemCollectionReader.PARAM_INPUTDIR,
                    "/home/bonnie/Research/eclipse-uima-projects/PipeLineWithRuta/input",
                    FileSystemCollectionReader.PARAM_ENCODING, "UTF-8",
                    FileSystemCollectionReader.PARAM_LANGUAGE, "English");
            AnalysisEngine rae = AnalysisEngineFactory.createEngine(RutaEngine.class,
                RutaEngine.PARAM_MAIN_SCRIPT,
                "ecClassifierRules");
            AnalysisEngineDescription rutaEngineDesc =
                AnalysisEngineFactory.createEngineDescription(RutaEngine.class,
                    RutaEngine.PARAM_MAIN_SCRIPT,
                    "ecClassifierRules");
            AnalysisEngineDescription writerDesc =
                AnalysisEngineFactory.createEngineDescription(CasDumpWriter.class,
                    CasDumpWriter.PARAM_OUTPUT_FILE, "dump.txt");
            JCas jCas = rae.newJCas();
            SimplePipeline.runPipeline(readerDesc, rutaEngineDesc);
            displayRutaResults(jCas);
        } catch (ResourceInitializationException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (AnalysisEngineProcessException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException, UIMAException {
        PipelineSystem p = new PipelineSystem();
    }

    public void displayRutaResults(JCas jCas)
    {
        System.out.println("in display ruta results");
        TypeSystem ts = jCas.getTypeSystem();
        Iterator<Type> typeItr = ts.getTypeIterator();
        while (typeItr.hasNext()) {
            Type type = (Type) typeItr.next();
            if (type.getName().equals("INCL")) {
                System.out.println("INCL was found");
            }
        }
    }
}
------------------------------------------------------------------------------------------------------------------------------------------------
Yes, I know the code doesn't actually count annotations yet - this is
strictly a test of the configuration. The type INCL is declared in the
script:
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
Document{-> RETAINTYPE(BREAK)};
Document{-> EXEC(PlainTextAnnotator, {Line})};
DECLARE INCL;
"INCLUSION" -> INCL;
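A side note on the check type.getName().equals("INCL") in the Java code above: UIMA's Type.getName() returns the fully qualified type name, while Type.getShortName() returns only the last segment. If Ruta prefixes script-declared types with the script name (an assumption here, which would give something like "ecClassifierRules.INCL"), then comparing the full name against "INCL" would never match. The sketch below shows the short-name comparison on plain strings. The class TypeNames and the sample qualified name are hypothetical.

```java
public class TypeNames {
    // Returns the part of a dotted type name after the last '.',
    // or the whole name if it contains no dot.
    public static String shortName(String qualifiedName) {
        int lastDot = qualifiedName.lastIndexOf('.');
        return lastDot < 0 ? qualifiedName : qualifiedName.substring(lastDot + 1);
    }

    // True if the last segment of the qualified name equals the wanted short name.
    public static boolean matchesShortName(String qualifiedName, String wanted) {
        return shortName(qualifiedName).equals(wanted);
    }

    public static void main(String[] args) {
        // Assumed qualified name; the real prefix depends on how Ruta names the type.
        String qualified = "ecClassifierRules.INCL";
        System.out.println(qualified.equals("INCL"));              // full-name comparison
        System.out.println(matchesShortName(qualified, "INCL"));   // short-name comparison
    }
}
```

In the real code the same effect should be available by calling type.getShortName() instead of type.getName().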
And finally, here is the pom file. I note that the ruta plugin and the
jcasgen plugin are correctly generating the descriptor files for the
script and the Java classes for the types. I have this set up so that the
jcasgen plugin reads the type descriptors from the folder that is generated
by the ruta-maven-plugin (I saw this in one of the examples mentioned
elsewhere on this mailing list).
However, the uimafit plugin does not generate anything.
Thanks for any help. It is really hard to figure out all these moving parts.
Bonnie MacKellar
---------------------------------------------------------------------------------------------------------------------------------
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>PipeLineWithRuta</groupId>
  <artifactId>PipeLineWithRuta</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>PipeLineWithRuta</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <resources>
      <resource>
        <directory>src/main/ruta</directory>
      </resource>
      <resource>
        <directory>src/desc</directory>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.uima</groupId>
        <artifactId>jcasgen-maven-plugin</artifactId>
        <version>2.4.1</version> <!-- change this to the latest version -->
        <executions>
          <execution>
            <goals>
              <goal>generate</goal> <!-- this is the only goal -->
            </goals>
            <!-- runs in phase process-resources by default -->
            <configuration>
              <!-- REQUIRED -->
              <typeSystemIncludes>
                <!-- one or more ant-like file patterns identifying top level descriptors -->
                <typeSystemInclude>target/generated-sources/ruta/descriptor/ecClassifierRulesTypeSystem.xml</typeSystemInclude>
              </typeSystemIncludes>
              <!-- OPTIONAL -->
              <!-- a sequence of ant-like file patterns to exclude from the above include list -->
              <typeSystemExcludes></typeSystemExcludes>
              <!-- OPTIONAL -->
              <!-- where the generated files go -->
              <!-- default value: ${project.build.directory}/generated-sources/jcasgen -->
              <outputDirectory></outputDirectory>
              <!-- true or false, default = false -->
              <!-- if true, then although the complete merged type system will be
                   created internally, only those types whose definition is contained
                   within this maven project will be generated. The others will be
                   presumed to be available via other projects. -->
              <!-- OPTIONAL -->
              <limitToProject>true</limitToProject>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.uima</groupId>
        <artifactId>ruta-maven-plugin</artifactId>
        <version>2.3.1</version>
        <configuration>
          <scriptPaths>
            <scriptPath>src/main/ruta/</scriptPath>
          </scriptPaths>
          <!-- Descriptor paths of the generated analysis engine descriptor. -->
          <!-- default value: none -->
          <descriptorPaths>
            <descriptorPath>${project.build.directory}/generated-sources/ruta/descriptor</descriptorPath>
          </descriptorPaths>
          <!-- Resource paths of the generated analysis engine descriptor. -->
          <!-- default value: none -->
          <resourcePaths>
            <resourcePath>${project.build.directory}/generated-sources/ruta/resources/</resourcePath>
          </resourcePaths>
          <analysisEngineSuffix>Engine</analysisEngineSuffix>
          <typeSystemSuffix>TypeSystem</typeSystemSuffix>
          <!-- Type of type system imports. false = import by location. -->
          <!-- default value: false -->
          <importByName>false</importByName>
          <!-- Option to resolve imports while building. -->
          <!-- default value: false -->
          <resolveImports>false</resolveImports>
          <!-- List of packages with language extensions -->
          <!-- default value: none -->
          <extensionPackages>
            <extensionPackage>org.apache.uima.ruta</extensionPackage>
          </extensionPackages>
          <!-- Add UIMA Ruta nature to .project -->
          <!-- default value: false -->
          <addRutaNature>true</addRutaNature>
          <!-- Buildpath of the UIMA Ruta Workbench (IDE) for this project -->
          <!-- default value: none -->
          <buildPaths>
            <buildPath>script:src/main/ruta/</buildPath>
            <buildPath>descriptor:target/generated-sources/ruta/descriptor/</buildPath>
            <buildPath>resources:src/main/resources/</buildPath>
          </buildPaths>
        </configuration>
        <executions>
          <execution>
            <id>default</id>
            <phase>process-classes</phase>
            <goals>
              <goal>generate</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.uima</groupId>
        <artifactId>uimafit-maven-plugin</artifactId>
        <version>2.2.0</version> <!-- change to latest version -->
        <configuration>
          <!-- OPTIONAL -->
          <!-- Path where the generated resources are written. -->
          <outputDirectory>${project.build.directory}/generated-sources/uimafit</outputDirectory>
          <!-- OPTIONAL -->
          <!-- Skip generation of META-INF/org.apache.uima.fit/components.txt -->
          <skipComponentsManifest>false</skipComponentsManifest>
          <!-- OPTIONAL -->
          <!-- Source file encoding. -->
          <encoding>${project.build.sourceEncoding}</encoding>
        </configuration>
        <executions>
          <execution>
            <id>default</id>
            <phase>process-classes</phase>
            <goals>
              <goal>generate</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimafit-core</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimaj-core</artifactId>
      <version>2.8.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>ruta-maven-plugin</artifactId>
      <version>2.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimaj-cpe</artifactId>
      <version>2.8.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimaj-examples</artifactId>
      <version>2.8.1</version>
    </dependency>
  </dependencies>
</project>