Small follow-up, just stumbled across by chance, did not even search for it:
http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator <http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator> Someone has done a performance comparison. It does not necessarily strike me to be the most sophisticated approach, but the code is available and one could use this as a hint. Best, Erik > On 21 Oct 2015, at 18:26, Richard Eckart de Castilho <[email protected]> wrote: > > Hi, > > 1 uses the UIMA indexes which I believe use a binary search, so it should > be something like O(log n). > > 2 is in principle O(n) but since it does a linear scan from the beginning > and stops when no further annotations may be found, it practice O(n) > should be the upper bound when called for annotations towards the end of > the document. > > 3 is fastest for repeated use. It should be O(n) for creating > the index and then uses hashmap lookups. > > So 1 and 3 are better than two. > > If you need speed and need coverage information a lot, 3 should be the best. > > 1 and 2 are more convenient for coding. > > If you use plain UIMA and have type priorities set up, then using an > iterator over sentences and a subiterator over tokens is likely to > be better than 3 because it doesn't need the initial scan that 3 does. > > I'm not aware that anybody did extensive performance comparisons here. > Some are being done in > org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom() > which compares 1 and 2. Here a few lines from the test output (mind to > increase the > ITERATIONS variable if you try): > > ... > Speed up factor 5.50 [naive:11 optimized:2 diff:9] > Speed up factor 6.67 [naive:20 optimized:3 diff:17] > Speed up factor 4.00 [naive:16 optimized:4 diff:12] > Speed up factor 2.50 [naive:30 optimized:12 diff:18] > Speed up factor 7.00 [naive:35 optimized:5 diff:30] > Speed up factor 5.63 [naive:45 optimized:8 diff:37] > Speed up factor 7.78 [naive:70 optimized:9 diff:61] > Speed up factor 8.09 [naive:89 optimized:11 diff:78] > ... > > Cheers, > > -- Richard > >> On 21.10.2015, at 17:07, Erik Fäßler <[email protected]> wrote: >> >> Hi all, >> >> I’m wondering about the performance differences between >> >> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS), >> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and >> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>) >> >> It is clear that 3) iterates once through the CAS and just returns a map. >> Once this is done, map access is swift. >> >> The Javadoc of 2) states that it is slower than 1). >> 3) states that it is preferable to 2). >> >> Questions: >> Is 3) also preferable over 2) when there is only one covering annotation or >> is the performance of 2) and 3) roughly equal then? >> Main question: Is 3) also quicker than 1) if there are many covering >> annotations? >> >> Use case: I want to iterate through all sentences in paragraphs. Normally, I >> would use subiterators(), but the known type priority issue could be a >> problem for me. Should I just use 1)? Or would I still benefit from 3) if I >> have more than one paragraph? >> >> Thank you very much! >> >> Best, >> >> Erik >
