Hi,

I am working on a project to extend cTAKES to process a large number of 
documents, and I am looking into possible routes for improving its 
performance.

Is there any information detailing the 50,000 clinical notes per hour benchmark 
advertised on the cTAKES homepage? I am looking for details such as how 
cTAKES was set up, which components were included, what kind of system it ran 
on, how large the documents were, and how long the system was run to get those 
results.

Additionally, LVG is one of the major bottlenecks in our system. It can take 
anywhere from 100 to 800 ms per document, even when running on copies of the 
same document, and caching has not seemed to improve performance. Does anyone 
know why that might be?
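
For concreteness, this is roughly the shape of the caching we tried (the class 
and method names below are illustrative, not actual cTAKES or lvg API calls); 
the variant lookup is simply memoized per surface form:

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    /**
     * Sketch of the token-level cache we put in front of LVG. The
     * lookupVariants function stands in for whatever call our pipeline
     * actually makes into the lvg API; the cache is just a memoizing
     * map keyed on the lower-cased surface form.
     */
    public class LvgVariantCache {

        private final ConcurrentHashMap<String, List<String>> cache =
                new ConcurrentHashMap<>();
        private final Function<String, List<String>> lookupVariants;

        public LvgVariantCache(Function<String, List<String>> lookupVariants) {
            this.lookupVariants = lookupVariants;
        }

        public List<String> variantsFor(String token) {
            // computeIfAbsent only calls into LVG on a cache miss, so
            // repeated tokens (and repeated copies of the same document)
            // should skip the expensive lookup entirely. In practice we
            // still see 100 to 800 ms per document, which is what I am
            // trying to understand.
            return cache.computeIfAbsent(token.toLowerCase(), lookupVariants);
        }
    }
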
I have also seen some conflicting reports on whether LVG is needed at all, 
particularly for a multithreaded pipeline. The documentation claims that it is 
necessary for "good" results, with no indication of what "good" means. Some 
published results, such as this paper 
(https://www.researchgate.net/publication/13360565_Evaluating_lexical_variant_generation_to_improve_information_retrieval),
 seem to indicate a dramatic increase in F-score from different types of 
variant generation, but I have also seen suggestions, such as in this thread 
(http://user.ctakes.apache.narkive.com/IKiAQVJQ/running-ctakes-through-java#post4),
 that others remove LVG from their pipeline because the benefits are minimal. 
Is there an updated stance on this component that I have not encountered yet?

Thanks,
Hannah