Re: Performance of UIMAfit JCasUtil.selectCovered() and variants

Erik Fäßler Thu, 22 Oct 2015 02:06:14 -0700

A small follow-up: I stumbled across this by chance, without even searching for it:


http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator
 

Someone has done a performance comparison. It does not strike me as the most 
sophisticated approach, but the code is available and could serve as a 
starting point.

Best,

Erik

> On 21 Oct 2015, at 18:26, Richard Eckart de Castilho <[email protected]> wrote:
> 
> Hi,
> 
> 1 uses the UIMA indexes which I believe use a binary search, so it should
> be something like O(log n).
> 
> 2 is in principle O(n), but since it does a linear scan from the beginning
> and stops when no further annotations can be found, in practice O(n) is
> only the upper bound, reached when it is called for annotations towards
> the end of the document.
> 
> 3 is fastest for repeated use. It should be O(n) for creating
> the index and then uses hashmap lookups.
> 
> So 1 and 3 are better than 2.
> 
> If you need speed and need coverage information a lot, 3 should be the best.
> 
> 1 and 2 are more convenient for coding.
> 
> If you use plain UIMA and have type priorities set up, then using an
> iterator over sentences and a subiterator over tokens is likely to
> be better than 3 because it doesn't need the initial scan that 3 does.
> 
> I'm not aware that anybody has done extensive performance comparisons here.
> Some are done in 
> org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom(),
> which compares 1 and 2. Here are a few lines from the test output (remember
> to increase the ITERATIONS variable if you try it):
> 
> ...
> Speed up factor 5.50 [naive:11 optimized:2 diff:9]
> Speed up factor 6.67 [naive:20 optimized:3 diff:17]
> Speed up factor 4.00 [naive:16 optimized:4 diff:12]
> Speed up factor 2.50 [naive:30 optimized:12 diff:18]
> Speed up factor 7.00 [naive:35 optimized:5 diff:30]
> Speed up factor 5.63 [naive:45 optimized:8 diff:37]
> Speed up factor 7.78 [naive:70 optimized:9 diff:61]
> Speed up factor 8.09 [naive:89 optimized:11 diff:78]
> ...
> 
> Cheers,
> 
> -- Richard
> 
>> On 21.10.2015, at 17:07, Erik Fäßler <[email protected]> wrote:
>> 
>> Hi all,
>> 
>> I’m wondering about the performance differences between
>> 
>> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
>> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
>> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
>> 
>> It is clear that 3) iterates once through the CAS and just returns a map. 
>> Once this is done, map access is swift.
>> 
>> The Javadoc of 2) states that it is slower than 1).
>> 3) states that it is preferable to 2).
>> 
>> Questions:
>> Is 3) also preferable over 2) when there is only one covering annotation or 
>> is the performance of 2) and 3) roughly equal then?
>> Main question: Is 3) also quicker than 1) if there are many covering 
>> annotations?
>> 
>> Use case: I want to iterate through all sentences in paragraphs. Normally, I 
>> would use subiterators(), but the known type priority issue could be a 
>> problem for me. Should I just use 1)? Or would I still benefit from 3) if I 
>> have more than one paragraph?
>> 
>> Thank you very much!
>> 
>> Best,
>> 
>> Erik
>
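
For readers who want to see the three strategies from this thread side by side, here is a minimal plain-Java sketch. It is not uimaFIT's implementation and does not use the UIMA API: `Span`, `scanCovered`, `searchCovered` and the local `indexCovered` are hypothetical stand-ins that only mirror the ideas described above (binary search into a sorted index, linear scan with early exit, and a map built once and then queried). It assumes the covered spans are sorted by begin offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoveredSketch {

    /** Hypothetical stand-in for an annotation: just a begin and an end offset. */
    public record Span(int begin, int end) {}

    /** Like variant 2: scan from the start, stop once spans begin after the window. */
    public static List<Span> scanCovered(List<Span> sorted, int begin, int end) {
        List<Span> out = new ArrayList<>();
        for (Span s : sorted) {
            if (s.begin() > end) {
                break; // sorted by begin, so nothing later can be covered
            }
            if (s.begin() >= begin && s.end() <= end) {
                out.add(s);
            }
        }
        return out;
    }

    /** Like variant 1: binary-search the first candidate, then walk forward. */
    public static List<Span> searchCovered(List<Span> sorted, int begin, int end) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) { // find the first span with begin >= window begin
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).begin() < begin) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        List<Span> out = new ArrayList<>();
        for (int i = lo; i < sorted.size() && sorted.get(i).begin() <= end; i++) {
            if (sorted.get(i).end() <= end) {
                out.add(sorted.get(i));
            }
        }
        return out;
    }

    /** Like variant 3: build the covering -> covered map once, then only do lookups. */
    public static Map<Span, List<Span>> indexCovered(List<Span> covering, List<Span> covered) {
        Map<Span, List<Span>> index = new HashMap<>();
        for (Span c : covering) {
            index.put(c, searchCovered(covered, c.begin(), c.end()));
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up offsets: four "sentences" inside two "paragraphs".
        List<Span> sentences = List.of(
                new Span(0, 10), new Span(11, 20), new Span(21, 30), new Span(31, 40));
        List<Span> paragraphs = List.of(new Span(0, 20), new Span(21, 40));

        Map<Span, List<Span>> index = indexCovered(paragraphs, sentences);
        for (Span p : paragraphs) {
            System.out.println(p + " -> " + index.get(p));
        }
    }
}
```

With the index in hand, the "sentences in paragraphs" use case becomes one map lookup per paragraph. Whether that beats calling the search-based variant per paragraph depends on how often the same coverage information is needed, which is the trade-off Richard describes.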
