Hi,

We currently have a pipeline which generates ontology mappings for a
repository of clinical notes. However, this repository contains documents
which, after RTF parsing, can exceed 900,000 characters (although these are
a very small fraction of the repository: of ~13 million notes, only around
50 contain more than 100k chars). Looking at some averages across the
dataset, it is clear that processing time grows much faster than linearly
with note length:

0-10,000 chars: 0.9 seconds (11 million notes)
10,000-20,000 chars: 5.625 seconds (1.5 million notes)
210,000-220,000 chars: 4,422 seconds (~1.22 hours) (3 notes)
900,000-1,000,000 chars: 103,237 seconds (~28.6 hours) (1 note)
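
(As a rough check: going from the 10,000-20,000 bucket to the
210,000-220,000 bucket is about a 14x increase in length but nearly an
800x increase in average time, which looks closer to quadratic-or-worse
growth than to linear.)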

Given these results, splitting the longer docs into partitions would speed
up the pipeline considerably. However, our team has some concerns over how
that might affect the context-aware steps of the cTAKES pipeline. How would
the results from splitting a doc on its sentence or paragraph boundaries
compare to feeding in the entire doc? Does the default pipeline API support
a way to use segments instead of the entire document text? A rough sketch
of the kind of splitting we have in mind is below.
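
For reference, this is roughly the pre-processing step we are considering.
The NoteSplitter class, the MAX_CHARS cap, and the blank-line paragraph
heuristic are placeholders on our side, not anything from the cTAKES API;
each chunk would then go through our existing pipeline call, with
annotation offsets shifted back by the chunk's start position when merging
the results.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical pre-processing step: split a long note on blank lines
    // (paragraph boundaries) into chunks of at most MAX_CHARS characters,
    // so that no chunk boundary falls inside a paragraph.
    public final class NoteSplitter {

        private static final int MAX_CHARS = 20_000; // assumed cap; tune as needed

        public static List<String> splitOnParagraphs(String noteText) {
            List<String> chunks = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            // Paragraphs are assumed to be separated by one or more blank lines.
            for (String paragraph : noteText.split("\\n\\s*\\n")) {
                // Flush the current chunk if adding this paragraph would
                // push it past the cap (+2 for the separator re-added below).
                if (current.length() > 0
                        && current.length() + paragraph.length() + 2 > MAX_CHARS) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) {
                    current.append("\n\n");
                }
                current.append(paragraph);
            }
            if (current.length() > 0) {
                chunks.add(current.toString());
            }
            return chunks;
        }
    }

Note that a single paragraph longer than the cap would still come out as
one oversized chunk; sentence-level splitting would be the fallback there.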

Regards,

Mike
