I've run into a severe slowdown when using the XmiCasSerializer in a
CasAnnotator with multiple concurrent AnalysisEngines. I'm wondering if I'm
doing something wrong or if there's a bug. Code for this XMI CasAnnotator is
appended.

I run three scenarios on the same set of documents and the same set of
CAS-updating annotators:

A. 1 thread/AnalysisEngine with the Xmi CasAnnotator.
B. 10 threads/AnalysisEngines without the Xmi CasAnnotator (in fact, no saving
   of any CAS data to disk at all, in any form).
C. 10 threads/AnalysisEngines with the Xmi CasAnnotator.

On Windows XP I use the excellent Process Explorer tool from SysInternals.com to
measure CPU seconds spent in "user" (i.e. process) space and in kernel space.
I also measure elapsed time. Here's what I see for these scenarios:

Scenario   User   Kernel   Elapsed
A          103    5        588      (a lot of time spent blocking on
                                    proprietary remote network services)
B          84     4        135
C          237    139      295

So in A, with just one thread and XMI output, we spend very little time in the
kernel. In B, with 10 threads and no XMI output, we also spend very little time
in the kernel.

C is B plus the Xmi CasAnnotator, so it should cost only slightly more than B.
Instead, relative to B, kernel time increases about 35X (4 to 139), user time
about 3X (84 to 237), and elapsed time about 2X (135 to 295).

So it seems like using the XmiCasSerializer with concurrent AnalysisEngines
creates some sort of thread contention. Either that, or I'm using it
incorrectly.
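
One way to narrow this down outside the full pipeline would be a small harness that runs the same task on N threads and reports user, kernel, and elapsed time from inside the JVM. The following is only a sketch under assumptions: the class name ContentionProbe and its methods are hypothetical, it assumes Java 8 or later, kernel time is approximated as per-thread CPU time minus user time via ThreadMXBean, and the task in main() is a placeholder rather than a real XMI serialization call.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of UIMA: runs one task on several threads
// and sums the per-thread CPU times reported by the JVM.
final class ContentionProbe {

    /** Summed per-thread times in nanoseconds, plus wall-clock time for the run. */
    static final class Timing {
        final long userNanos;
        final long kernelNanos;   // approximated as total CPU time minus user time
        final long elapsedNanos;

        Timing(long userNanos, long kernelNanos, long elapsedNanos) {
            this.userNanos = userNanos;
            this.kernelNanos = kernelNanos;
            this.elapsedNanos = elapsedNanos;
        }
    }

    static Timing run(int threads, int iterationsPerThread, Runnable task) {
        final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isCurrentThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            throw new UnsupportedOperationException("per-thread CPU time is not available");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<long[]>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    // Measure only the work done by this worker thread.
                    long cpu0 = mx.getCurrentThreadCpuTime();
                    long user0 = mx.getCurrentThreadUserTime();
                    for (int i = 0; i < iterationsPerThread; i++) {
                        task.run();
                    }
                    long cpu = mx.getCurrentThreadCpuTime() - cpu0;
                    long user = mx.getCurrentThreadUserTime() - user0;
                    return new long[] { user, Math.max(0L, cpu - user) };
                }));
            }
            long user = 0L;
            long kernel = 0L;
            for (Future<long[]> f : results) {
                long[] r = f.get();   // waits for that worker to finish
                user += r[0];
                kernel += r[1];
            }
            return new Timing(user, kernel, System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the operation under suspicion.
        Runnable placeholder = () -> Math.sqrt(System.nanoTime());
        for (int threads : new int[] { 1, 10 }) {
            Timing t = run(threads, 100_000, placeholder);
            System.out.printf("%d thread(s): user %d ms, kernel %d ms, elapsed %d ms%n",
                    threads, t.userNanos / 1_000_000, t.kernelNanos / 1_000_000,
                    t.elapsedNanos / 1_000_000);
        }
    }
}
```

If the task were a call that serializes an already-populated CAS, comparing the 1-thread and 10-thread runs would show whether the kernel-time jump appears with serialization alone, independent of the other annotators. That wiring is left out here because it depends on the pipeline.
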
Is this a bug?
Greg Holmberg
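
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.CASException;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.XMLSerializer;

public class XmiOutputAnnotator extends CasAnnotator_ImplBase {

    public static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";

    private String outputDirectory;
    private XmiCasSerializer serializer;

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        outputDirectory = (String) context.getConfigParameterValue(PARAM_OUTPUT_DIRECTORY);
    }

    public void typeSystemInit(TypeSystem aTypeSystem) throws AnalysisEngineProcessException {
        // One serializer per annotator instance, created once the type system is known.
        serializer = new XmiCasSerializer(aTypeSystem);
    }

    public void process(CAS cas) throws AnalysisEngineProcessException {
        // Note: base is not used further below.
        JCas base = null;
        try {
            base = cas.getJCas();
        } catch (CASException ce) {
            throw new AnalysisEngineProcessException(ce);
        }

        OutputStream outputStream = null;
        try {
            String identifier = ...
            File inputFile = new File(new URI(identifier));
            // Mirror the input file's absolute path (minus any ':') under outputDirectory.
            File outputFile = new File(outputDirectory,
                    inputFile.getAbsolutePath().replace(":", "") + ".xmi");
            outputFile.getParentFile().mkdirs();
            outputStream = new FileOutputStream(outputFile);
            // A new XMLSerializer (formatted output) is created for every CAS.
            serializer.serialize(cas, new XMLSerializer(outputStream, true).getContentHandler());
        } catch (Exception e) {
            throw new MyAnnotatorException(getClass().getSimpleName(), e);
        } finally {
            if (outputStream != null) {
                try {
                    outputStream.close();
                } catch (IOException e) {
                    // Ignore?
                }
            }
        }
    }
}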
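
The output-path expression in process() is easy to misread, so here is a minimal stdlib-only sketch of what it computes. The class and method names (XmiOutputPaths, toRelativeName, toOutputFile) and the example paths are hypothetical and exist only for illustration. The string transformation itself is copied from the annotator above.

```java
import java.io.File;

// Hypothetical helper that isolates the path logic used in process().
final class XmiOutputPaths {

    /** Drops every ':' (e.g. the Windows drive-letter colon) and appends ".xmi". */
    static String toRelativeName(String absoluteInputPath) {
        return absoluteInputPath.replace(":", "") + ".xmi";
    }

    /** Places the transformed name under the configured output directory. */
    static File toOutputFile(String outputDirectory, String absoluteInputPath) {
        return new File(outputDirectory, toRelativeName(absoluteInputPath));
    }

    public static void main(String[] args) {
        // Example input path is made up for illustration.
        System.out.println(toRelativeName("C:\\docs\\report.txt"));
        // prints C\docs\report.txt.xmi
    }
}
```

The effect is that each input file's full directory structure is reproduced below outputDirectory, which is why process() calls mkdirs() on the parent before opening the stream.
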