Hi, I am working on a project that extends cTAKES to process a large number of documents, and I am looking into possible routes for improving cTAKES performance.
Is there any information detailing the 50,000-clinical-notes-per-hour benchmark advertised on the cTAKES homepage? I am looking for details such as how cTAKES was configured, which components it included, what kind of system it ran on, how large the documents were, and how long the system ran to produce that figure.

Additionally, LVG is currently one of the major bottlenecks in our pipeline. It can take anywhere from 100 to 800 ms when run on copies of the same document, and caching has not seemed to improve performance. Does anyone know why that might be?

I have also seen conflicting reports on whether LVG is needed at all, particularly in a multithreaded pipeline. The documentation claims that it is necessary for "good" results, with no indication of what "good" means. Some published results, such as this (https://www.researchgate.net/publication/13360565_Evaluating_lexical_variant_generation_to_improve_information_retrieval), seem to indicate a dramatic increase in F-score with different types of variant generation, but I have also seen suggestions, such as in this thread (http://user.ctakes.apache.narkive.com/IKiAQVJQ/running-ctakes-through-java#post4), that others remove LVG from their pipeline because the benefits are minimal. Is there an updated stance on this component that I have not encountered yet?

Thanks,
Hannah
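P.S. For context, the 100-800 ms figures above come from a simple wall-clock wrapper along these lines. The `annotator.accept(doc)` call here is a no-op stand-in for our actual LVG annotator invocation, not real cTAKES API; this is just a sketch of the measurement, assuming repeated runs over the same note:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class AnnotatorTimer {

    /** Times repeated calls on the same input; returns {min, median, max} in ms. */
    static long[] time(Consumer<String> annotator, String doc, int runs) {
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            annotator.accept(doc);  // stand-in for the real LVG process() call
            samples.add((System.nanoTime() - start) / 1_000_000);
        }
        Collections.sort(samples);
        return new long[] { samples.get(0), samples.get(runs / 2), samples.get(runs - 1) };
    }

    public static void main(String[] args) {
        // Dummy annotator in place of the real LVG component.
        long[] stats = time(d -> { /* no-op */ }, "same clinical note", 20);
        System.out.println("min=" + stats[0] + "ms median=" + stats[1]
                + "ms max=" + stats[2] + "ms");
    }
}
```

Even with this naive harness the spread between min and max on identical input is large, which is what made me suspect the variance is not coming from the input text itself.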