Hi Dima,

Thanks for the feedback! As our pipeline develops, we'll be building in additional functionality (e.g., Temporal Relations) that requires context greater than that of a sentence. Given this, partitioning on document length and shunting the long documents to a separate queue is an excellent solution.
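For concreteness, here is roughly what we have in mind for the routing step. This is a minimal sketch only; the 100k threshold, class name, and queue names are illustrative choices on our side, not part of cTAKES:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Route notes into separate work queues by character count, so the
    // handful of very long notes can't stall the main processing path.
    public class DocRouter {
        // Illustrative cutoff; only ~50 of our ~13M notes exceed 100k chars.
        private static final int LARGE_DOC_THRESHOLD = 100_000;

        private final BlockingQueue<String> smallDocs = new LinkedBlockingQueue<>();
        private final BlockingQueue<String> largeDocs = new LinkedBlockingQueue<>();

        public void route(String noteText) throws InterruptedException {
            if (noteText.length() > LARGE_DOC_THRESHOLD) {
                largeDocs.put(noteText);  // consumed by a dedicated long-document worker
            } else {
                smallDocs.put(noteText);  // fast path for the bulk of the notes
            }
        }
    }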
Thanks,
Mike

On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy <ddlig...@luc.edu> wrote:

> Hi Mike,
>
> We also observed this issue. Splitting large documents into smaller ones
> is an option, but you have to make sure you preserve the integrity of
> individual sentences or you might lose some concept mentions. Since you
> are using cTAKES only for ontology mapping, I don't think you need to worry
> about the integrity of linguistic units larger than a sentence.
>
> FWIW, our solution to this problem was to create a separate queue for
> large documents and process them independently from the smaller documents.
>
> Best,
>
> Dima
>
> On Feb 27, 2019, at 16:59, Michael Trepanier <m...@metistream.com> wrote:
>
> Hi,
>
> We currently have a pipeline which generates ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit this
> is a very small percentage of notes: out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that processing time grows much faster than linearly with note length:
>
> 0-10,000 chars: 0.9 seconds (11 million notes)
> 10,000-20,000 chars: 5.625 seconds (1.5 million notes)
> 210,000-220,000 chars: 4,422 seconds / 1.22 hours (3 notes)
> 900,000-1,000,000 chars: 103,237 seconds / 28.6 hours (1 note)
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns about how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in the entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
> Regards,
>
> Mike
>
> --
> Mike Trepanier | Senior Big Data Engineer | MetiStream, Inc.
> m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com
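P.S. for the archive: below is a minimal sketch of the sentence-preserving split Dima describes, using only the standard java.text.BreakIterator. The chunk size and class name are illustrative, and this is not a cTAKES API:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Split a document into chunks of at most maxChars, always cutting on
    // sentence boundaries so no concept mention is severed mid-sentence.
    public class SentenceChunker {
        public static List<String> chunk(String docText, int maxChars) {
            List<String> chunks = new ArrayList<>();
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(docText);
            StringBuilder current = new StringBuilder();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = docText.substring(start, end);
                // Flush the current chunk before it would exceed maxChars.
                // (A single sentence longer than maxChars becomes its own chunk.)
                if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                current.append(sentence);
            }
            if (current.length() > 0) {
                chunks.add(current.toString());
            }
            return chunks;
        }
    }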