Hi Mike,

We also observed this issue. Splitting large documents into smaller ones is an 
option, but you have to make sure you preserve the integrity of individual 
sentences, or you might lose some concept mentions. Since you are using cTAKES
only for ontology mapping, I don’t think you need to worry about the integrity 
of linguistic units larger than a sentence.
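In case it helps, here is a minimal sketch of that kind of sentence-safe chunking using only java.text.BreakIterator from the JDK. The class name NoteChunker and the maxChars cap are just placeholders for illustration (nothing from the cTAKES API); each chunk would then go through your existing pipeline as its own document:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NoteChunker {

    // Splits note text into chunks of at most maxChars characters,
    // breaking only at sentence boundaries so no sentence is cut in half.
    // (A single sentence longer than maxChars still becomes its own chunk.)
    public static List<String> chunkBySentence(String text, int maxChars) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
        sentences.setText(text);

        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            if (current.length() > 0
                    && current.length() + sentence.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String note = "First sentence. Second sentence. Third sentence.";
        for (String chunk : chunkBySentence(note, 40)) {
            System.out.println("[" + chunk + "]");
        }
    }
}

BreakIterator's sentence rules are not tuned for clinical text, so in practice you would probably want to reuse whatever sentence detector your pipeline already runs and split on its output instead.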

FWIW, our solution to this problem was to create a separate queue for large
documents and process them independently from the smaller documents.
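If it is useful, the routing itself can be as simple as a length check when notes are enqueued. The threshold and class name below are made-up examples for illustration, not values we recommend:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class NoteRouter {

    // Example threshold only; pick a cutoff based on your own timing data.
    private static final int LARGE_NOTE_CHARS = 100_000;

    public static void main(String[] args) {
        List<String> notes = List.of("A short note.", "x".repeat(200_000));

        Queue<String> smallNotes = new ArrayDeque<>();
        Queue<String> largeNotes = new ArrayDeque<>();

        // Route by length so a handful of huge documents cannot hold up
        // the workers that handle the bulk of the corpus.
        for (String note : notes) {
            if (note.length() > LARGE_NOTE_CHARS) {
                largeNotes.add(note);
            } else {
                smallNotes.add(note);
            }
        }

        System.out.println(smallNotes.size() + " small, "
                + largeNotes.size() + " large");
    }
}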

Best,

Dima

On Feb 27, 2019, at 16:59, Michael Trepanier 
<m...@metistream.com> wrote:

Hi,

We currently have a pipeline which is generating ontology mappings for a 
repository of clinical notes. However, this repository contains documents 
which, after RTF parsing, can contain over 900,000 characters (though these are a
very small fraction of notes: out of ~13 million, only around 50 contain more than
100k chars). Looking at some averages across the dataset, it is clear that
processing time grows far faster than linearly with note length:

0-10000 chars: 0.9 seconds (11 million notes)
10000-20000 chars: 5.625 seconds (1.5 million notes)
210000-220000 chars: 4422 seconds/1.22 hours (3 notes)
900000-1000000 chars: 103237 seconds/28.6 hours (1 note)

Given these results, splitting the longer docs into partitions would speed up 
the pipeline considerably. However, our team has some concerns over how that 
might impact the context-aware steps of the cTAKES pipeline. How would the
results from splitting a doc on its sentences or paragraphs compare to feeding 
in an entire doc? Does the default pipeline API support a way to use segments 
instead of the entire document text?

Regards,

Mike

