Hi Mike,

Would you mind sharing the specifications of the nodes you used in your Spark cluster (cores, memory, disk)? We were originally running nodes with more than 2 cores, but ran into occasional contention over the HSQLDB lock on the dictionary file whenever more than one executor landed on the same node. We considered various solutions, but the easiest was simply to run 2-core nodes only, one core for the executor and one for the Spark driver (assuming we have enough machines). I am not sure why HSQLDB needs to lock the file at all, since it should not be modifying it, but so far we have not had success changing the connection string to open it as read-only. Either way, it would be helpful to learn about the setup you settled on.
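For reference, this is the kind of change we have been experimenting with. HSQLDB does document a readonly connection property for file catalogs, and a read-only catalog should not need the .lck lock file, so in principle several executors could share one dictionary. The path below is a placeholder and the helper is just our sketch, not anything from cTAKES itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class ReadOnlyDictUrl {

    // Build an HSQLDB file-catalog URL with the readonly property set.
    // dbPath is the base name of the dictionary (the .script/.properties files),
    // e.g. "/path/to/dictionary" -- substitute your real path.
    static String readOnlyUrl(String dbPath) {
        return "jdbc:hsqldb:file:" + dbPath + ";readonly=true";
    }

    public static void main(String[] args) throws Exception {
        String url = readOnlyUrl("/path/to/dictionary");
        System.out.println(url);
        // With the HSQLDB driver jar on the classpath, the connection would be:
        // Connection conn = DriverManager.getConnection(url, "SA", "");
    }
}
```

We have also seen the hsqldb.lock_file=false URL property mentioned in the HSQLDB documentation as a way to suppress the lock file outright, but we have not verified either approach works with the cTAKES dictionary-lookup configuration, so treat both as untested on our end.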
Thanks,
Jeff

On 2019/02/27 22:59:56, Michael Trepanier <m...@metistream.com> wrote:
> Hi,
>
> We currently have a pipeline which is generating ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit this
> is a very small percentage of notes; out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that the processing time is exponentially related to the note length:
>
> 0-10000 chars: 0.9 seconds (11 million notes)
> 10000-20000 chars: 5.625 seconds (1.5 million notes)
> 210000-220000 chars: 4422 seconds/1.22 hours (3 notes)
> 900000-1000000 chars: 103237 seconds/28.6 hours (1 note)
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns over how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in an entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
> Regards,
>
> Mike