Hi Dima,

Thanks for the feedback! As our pipeline develops, we'll be building in additional functionality (e.g., Temporal Relations) that requires context greater than that of a sentence. Given this, partitioning on document length and shunting the long documents to a separate queue is an excellent solution.
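For concreteness, here is roughly what we have in mind for the routing step. This is a minimal sketch only; the 100k threshold, class name, and queue names are illustrative choices on our side, not part of cTAKES:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Route notes into separate work queues by character count, so the
    // handful of very long notes can't stall the main processing path.
    public class DocRouter {
        // Illustrative cutoff; only ~50 of our ~13M notes exceed 100k chars.
        private static final int LARGE_DOC_THRESHOLD = 100_000;

        private final BlockingQueue<String> smallDocs = new LinkedBlockingQueue<>();
        private final BlockingQueue<String> largeDocs = new LinkedBlockingQueue<>();

        public void route(String noteText) throws InterruptedException {
            if (noteText.length() > LARGE_DOC_THRESHOLD) {
                largeDocs.put(noteText);  // consumed by a dedicated long-document worker
            } else {
                smallDocs.put(noteText);  // fast path for the bulk of the notes
            }
        }
    }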
Thanks,
Mike

On Thu, Feb 28, 2019 at 4:08 AM Dligach, Dmitriy <ddlig...@luc.edu> wrote:

> Hi Mike,
>
> We also observed this issue. Splitting large documents into smaller ones
> is an option, but you have to make sure you preserve the integrity of
> individual sentences or you might lose some concept mentions. Since you
> are using cTAKES only for ontology mapping, I don't think you need to worry
> about the integrity of linguistic units larger than a sentence.
>
> FWIW, our solution to this problem was to create a separate queue for
> large documents and process them independently from the smaller documents.
>
> Best,
>
> Dima
>
> On Feb 27, 2019, at 16:59, Michael Trepanier <m...@metistream.com> wrote:
>
> Hi,
>
> We currently have a pipeline which generates ontology mappings for a
> repository of clinical notes. However, this repository contains documents
> which, after RTF parsing, can contain over 900,000 characters (albeit this
> is a very small percentage of notes: out of ~13 million, around 50 contain
> more than 100k chars). Looking at some averages across the dataset, it is
> clear that processing time grows much faster than linearly with note length:
>
> 0-10,000 chars: 0.9 seconds (11 million notes)
> 10,000-20,000 chars: 5.625 seconds (1.5 million notes)
> 210,000-220,000 chars: 4,422 seconds / 1.22 hours (3 notes)
> 900,000-1,000,000 chars: 103,237 seconds / 28.6 hours (1 note)
>
> Given these results, splitting the longer docs into partitions would speed
> up the pipeline considerably. However, our team has some concerns about how
> that might impact the context-aware steps of the cTAKES pipeline. How would
> the results from splitting a doc on its sentences or paragraphs compare to
> feeding in the entire doc? Does the default pipeline API support a way to
> use segments instead of the entire document text?
>
> Regards,
>
> Mike
>
> --
> Mike Trepanier | Senior Big Data Engineer | MetiStream, Inc.
> m...@metistream.com | 845 - 270 - 3129 (m) | www.metistream.com
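P.S. for the archive: below is a minimal sketch of the sentence-preserving split Dima describes, using only the standard java.text.BreakIterator. The chunk size and class name are illustrative, and this is not a cTAKES API:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Split a document into chunks of at most maxChars, always cutting on
    // sentence boundaries so no concept mention is severed mid-sentence.
    public class SentenceChunker {
        public static List<String> chunk(String docText, int maxChars) {
            List<String> chunks = new ArrayList<>();
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(docText);
            StringBuilder current = new StringBuilder();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = docText.substring(start, end);
                // Flush the current chunk before it would exceed maxChars.
                // (A single sentence longer than maxChars becomes its own chunk.)
                if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                current.append(sentence);
            }
            if (current.length() > 0) {
                chunks.add(current.toString());
            }
            return chunks;
        }
    }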