rohan rai wrote:
@Pascal: As I have already said, the timing does not scale linearly.
Secondly, the times I have specified are approximate.
@Frank:
I was talking about the actual adding of annotations to the CAS.
By "record" I mean, let's say, content between tags like these: <a>.....</a>,
and the document consists of such records.
Annotations are added via this code:
MyType annotation = new MyType(jCas);
annotation.setBegin(start);
annotation.setEnd(end);
annotation.addToIndexes();
This takes a lot of time, which is not acceptable.
I don't know what you mean by "a lot of time", but
you can create hundreds of thousands of annotations
like this per second on a standard Windows machine.
You can easily verify this by running this code in
isolation (with mock data).
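To illustrate, here is a minimal stand-alone sketch of such an isolated test. It does not use UIMA at all: `MockAnnotation` is a made-up stand-in for a JCas annotation type (just begin/end offsets), and an ArrayList plays the role of the index, so it only measures raw object creation and insertion cost.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationBench {

    // Hypothetical stand-in for a JCas annotation type: just offsets.
    static final class MockAnnotation {
        final int begin;
        final int end;

        MockAnnotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    public static void main(String[] args) {
        final int n = 500_000;
        // The list plays the role of the annotation index.
        List<MockAnnotation> index = new ArrayList<>(n);

        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            index.add(new MockAnnotation(i, i + 5));
        }
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Created " + index.size()
                + " mock annotations in " + elapsed + " ms");
    }
}
```

On any recent JVM this loop finishes in a small fraction of a second, which suggests that per-annotation creation cost alone cannot explain multi-hour run times.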
You're more likely seeing per-document overhead.
For example, resetting the CAS after processing
a document is not so cheap. However, I still don't
know why things are so slow for you. For example,
I ran the following experiment. I installed the
Whitespace Tokenizer PEAR file into c:\tmp and ran
it 10000 times on its own descriptor. That creates
approx. 10 million annotations. On my 18-month-old Xeon
this ran in about 4 seconds. Code and output are
below, for you to recreate. So I'm not sure you have
correctly identified your bottleneck.
public static void main(String[] args) {
  try {
    System.out.println("Starting setup.");
    XMLParser parser = UIMAFramework.getXMLParser();
    ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(
        new File("c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
    String text = FileUtils.file2String(new File(
        "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
    CAS cas = ae.newCAS();
    System.out.println("Setup done, starting processing.");
    final int max = 10000;
    long time = System.currentTimeMillis();
    for (int i = 0; i < max; i++) {
      cas.reset();
      cas.setDocumentText(text);
      ae.process(cas);
      // There are 1080 annotations created for each run
      if (cas.getAnnotationIndex().size() != 1080) {
        System.out.println("Processing error.");
      }
    }
    time = System.currentTimeMillis() - time;
    System.out.println("Time for processing " + max + " documents, "
        + max * 1080 + " annotations: " + new TimeSpan(time));
  } catch (Exception e) {
    e.printStackTrace();
  }
}
Output on my machine:
Starting setup.
Setup done, starting processing.
Time for processing 10000 documents, 10800000 annotations: 4.078 sec
--Thilo
Regards
Rohan
On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
[EMAIL PROTECTED]> wrote:
Just to clarify, what do you mean by "annotation"? Is there a specific
Analysis Engine that you are using? What is a "record"? Is this a
document? It would actually be surprising for many applications if
annotation were not the bottleneck, given that some annotation processes
are quite expensive, but this doesn't seem to be what you mean here. I
can't tell from your question whether the burden is the process that
determines the annotations or the actual adding of the annotations
to the CAS.
-----Original Message-----
From: rohan rai [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 26, 2008 7:36 AM
To: [email protected]
Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
When I profile a UIMA application, I see that annotation takes a lot of
time: annotating 1 record takes around 0.06 seconds. Now you may say that's
good, but it does not scale up linearly. Here are rough estimates from
experiments: 6000 records take 6 min to annotate, and 800000 records take
around 10 hrs to annotate, which is bad.
One thing is that I am treating each record individually as a CAS. Even
if I treat all the records as a single CAS, it takes around 6-7 hrs, which
is still not good in terms of speed.
Is there a way out?
Can I improve performance by any means??
Regards
Rohan