Greg,

It doesn't look to me like you're doing anything wrong.

I ran a quick test to try to reproduce this but couldn't, so I
may need more information about your setup.

I created a CPE with the FileSystemCollectionReader,
PersonTitleAnnotator, and your XmiCasAnnotator.  (I filled in the part
about generating an identifier with something that checks the
SourceDocumentInformation annotations put there by the
FileSystemCollectionReader.)

On a particular set of documents, with the CPE descriptor's
processingUnitThreadCount set to 1 I get a total elapsed time of 9.25
seconds, whereas with the processingUnitThreadCount set to 10 I get a
total elapsed time of 6.875 seconds.  (This is on a dual-core
machine.)

A few questions come to mind:  Are you using a CPE to do the
multithreading or something else?  If something else, do you see the
same behavior if you try using a CPE instead?  Does this only happen
with large documents, and/or only when you have a lot of annotations
in the CAS?  (I have very few in my test.)

Regards,
 -Adam



On 6/29/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
I've run into a severe slowdown when using the XmiCasSerializer in a 
CasAnnotator with multiple concurrent AnalysisEngines.  I'm wondering if I'm 
doing something wrong or if there's a bug.  Code for this XMI CasAnnotator is 
appended.

I run three scenarios on the same set of documents and same set of CAS-updating 
annotators.
A. 1 thread/AnalysisEngine with the Xmi CasAnnotator.
B. 10 threads/AnalysisEngines without the Xmi CasAnnotator (in fact, no saving 
of any CAS data to disk at all, in any form).
C. 10 threads/AnalysisEngines with the Xmi CasAnnotator.

On Windows XP I use the excellent ProcExplorer tool from SysInternals.com to measure CPU 
seconds spent in "user" (i.e. process) space and in kernel space.  I also 
measure elapsed time.  Here's what I see for these scenarios:

Scenario   User   Kernel   Elapsed
   A        103      5       588
   B         84      4       135
   C        237    139       295

(In A, a lot of time is spent blocking on proprietary remote network
services.)
So, in A, with just one thread and XMI output, we spend very little time in the 
kernel.
In B, with 10 threads and no XMI output, we also spend very little time in the 
kernel.
C is B+Xmi, and so should be only slightly more than B.  Instead kernel time 
increases 35X, user time increases 3X, and elapsed time increases 2X.
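
The multipliers quoted above can be checked directly against the table. The following is only a minimal sketch of that arithmetic; the class name SlowdownRatios and the helper ratio are made up for illustration and are not part of UIMA or of the annotator code below.

```java
public class SlowdownRatios {

    // Ratio of a scenario C measurement to the matching scenario B measurement.
    public static double ratio(double c, double b) {
        return c / b;
    }

    public static void main(String[] args) {
        // Values copied from the table above, in the order user, kernel, elapsed.
        double[] b = {84, 4, 135};    // scenario B: 10 threads, no XMI output
        double[] c = {237, 139, 295}; // scenario C: 10 threads, with XMI output
        String[] labels = {"user", "kernel", "elapsed"};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-8s C/B = %.1fx%n", labels[i], ratio(c[i], b[i]));
        }
    }
}
```

The exact ratios are about 2.8 for user time, 34.75 for kernel time, and about 2.2 for elapsed time, which round to the 3X, 35X, and 2X figures given above.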

So it seems like using the XmiCasSerializer with concurrent AnalysisEngines 
creates some sort of thread contention. Either that, or I'm using it 
incorrectly.

Is this a bug?


Greg Holmberg


public class XmiOutputAnnotator extends CasAnnotator_ImplBase {

    public static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";

    private String outputDirectory;

    private XmiCasSerializer serializer;

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        outputDirectory = (String) context.getConfigParameterValue(PARAM_OUTPUT_DIRECTORY);
    }

    public void typeSystemInit(TypeSystem aTypeSystem) throws AnalysisEngineProcessException {
        serializer = new XmiCasSerializer(aTypeSystem);
    }

    public void process(CAS cas) throws AnalysisEngineProcessException {
        JCas base = null;
        try {
            base = cas.getJCas();
        } catch (CASException ce) {
            throw new AnalysisEngineProcessException(ce);
        }
        OutputStream outputStream = null;

        try {
            String identifier = ...

            File inputFile = new File(new URI(identifier));
            File outputFile = new File(outputDirectory,
                    inputFile.getAbsolutePath().replace(":", "") + ".xmi");
            outputFile.getParentFile().mkdirs();
            outputStream = new FileOutputStream(outputFile);
            serializer.serialize(cas, new XMLSerializer(outputStream, true).getContentHandler());
        } catch (Exception e) {
            throw new MyAnnotatorException(getClass().getSimpleName(), e);
        } finally {
            if (outputStream != null) {
                try {
                    outputStream.close();
                } catch (IOException e) {
                    // Ignore?
                }
            }
        }
    }

}

