Hi,

Thanks for your suggestions. So I will need to change the anno_base_id datatype from "int" to "bigint" when scaling up.
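For reference, a minimal sketch of that schema change done through JDBC, assuming a MySQL-backed cTAKES database. The connection URL, credentials, and the anno_base table name are placeholders here; check them against your own DbConsumer schema before running anything like this.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Hypothetical one-off migration: widen anno_base_id from INT to BIGINT so
    // annotation ids do not overflow when processing tens of millions of notes.
    // URL, credentials, and table/column names are assumptions, not a fixed recipe.
    public class WidenAnnoBaseId {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/ctakes"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                // MySQL syntax; other databases use ALTER COLUMN ... TYPE instead.
                stmt.executeUpdate("ALTER TABLE anno_base MODIFY anno_base_id BIGINT NOT NULL");
            }
        }
    }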
I am also looking at UIMA DUCC: http://uima.apache.org/doc-uimaducc-whatitam.html The problem with Hadoop is that it runs as a batch process, so it cannot be used for low-latency real-time systems. But I still want to explore it; see the mapper sketch after the quoted thread below.

On Tue, Jul 1, 2014 at 6:20 PM, Jonathan Bates <[email protected]> wrote:

> Hi Prasanna,
>
> I am currently using 3.1.2 to process ~40M notes using 14 CPEs with
> AggregatePlaintextUMLSProcessor + DBConsumer. So far, ~34M notes have been
> annotated and stored. Altogether, I'm seeing 0.054 sec/note. This is with
> 4.1k rows in v_snomed_fword_lookup. One modification we had to make was to
> change the anno_base_id datatype from 'int' to 'bigint'. It would be very
> interesting to see Hadoop used with cTAKES...
>
> -Jon
>
> On Tue, Jul 1, 2014 at 1:54 AM, Prasanna Bala <[email protected]> wrote:
>
>> Hi,
>>
>> I have some questions about the run time for processing documents with
>> cTAKES and about using third-party libraries with it. We are able to run
>> cTAKES in batch mode, but we now plan to process 1 million clinical
>> documents. Can anyone tell me whether they have tackled scalability with
>> cTAKES? My idea is to distribute the processing using Hadoop. There are
>> various libraries that can take a UIMA pipeline and distribute it over
>> Hadoop, and since cTAKES is built on UIMA, there should be a way to
>> distribute the processing. Has anyone tried this? Are there any
>> limitations to distributing work with cTAKES? Your thoughts, please?
>>
>> Regards,
>> Prasanna
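For anyone wanting to experiment with the Hadoop route discussed above, here is a minimal sketch of one way to run a UIMA analysis engine inside a Hadoop mapper. It is not an existing cTAKES-on-Hadoop recipe: the descriptor path, the choice of emitting an annotation count per note, and the surrounding job wiring are all assumptions, and a real job would also need the cTAKES resources (dictionaries, UMLS credentials) available on every node.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.util.XMLInputSource;

    // Rough sketch: each map task hosts its own UIMA pipeline and pushes one
    // document (the record value) through it per call to map().
    public class CtakesMapper extends Mapper<Object, Text, Text, Text> {

        private AnalysisEngine engine;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            try {
                // Build the pipeline once per mapper and reuse it for every record.
                // The descriptor name mirrors the aggregate mentioned in the thread,
                // but where it lives on the cluster is an assumption.
                XMLInputSource in = new XMLInputSource("AggregatePlaintextUMLSProcessor.xml");
                engine = UIMAFramework.produceAnalysisEngine(
                        UIMAFramework.getXMLParser().parseResourceSpecifier(in));
            } catch (Exception e) {
                throw new IOException("Failed to initialize UIMA pipeline", e);
            }
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                JCas jcas = engine.newJCas();
                jcas.setDocumentText(value.toString());
                engine.process(jcas);
                // Emit something simple per note, e.g. an annotation count; a real
                // job would serialize the CAS (XMI) or write to a database instead.
                context.write(new Text(key.toString()),
                              new Text(Integer.toString(jcas.getAnnotationIndex().size())));
            } catch (Exception e) {
                throw new IOException("UIMA processing failed", e);
            }
        }

        @Override
        protected void cleanup(Context context) {
            if (engine != null) {
                engine.destroy();
            }
        }
    }

The main design point is that the analysis engine is created once in setup() and reused across records, since pipeline initialization is far more expensive than processing a single note; the same consideration applies whatever distribution framework ends up being used.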
