Hello,

I've been through most of the pages I could find about the OpenNLP sentence
detector and I still can't answer my question. I'd like to build a sentence
detector model from my own data. I can't just use the shipped models (even the
French ones) since I work with clinical narratives, which are a very specific
type of document.

These documents contain very diverse types of text: some (more or less)
well-formed paragraphs, but also lists of diagnoses, to-do lists, lab
results, etc. Moreover, extracting text from PDF files that have some level of
page formatting sometimes entangles blocks of text and introduces stray
fragments into sentences.

The end result is that text extracted from clinical narratives has a lot of
"pseudo-sentences" which sometimes don't end with a period (or other
punctuation) or don't start with a capital letter. Because of lists, a
significant portion of sentences start with a bullet character or a hyphen
(which is not technically part of the sentence). Finally, there is a lot of
text representing lab results that were laid out as a 2-dimensional table.
This type of text ends up being just one line with a label (e.g. pCO2) and its
value (e.g. 7.8 kPa).

Consequently, I have trouble figuring out exactly how to transform my
Tika-extracted text into example sentences in order to train a sentence
detector model. I have tried, intuitively, inserting a newline wherever I
would consider a chunk of text to be a sentence, even though it is sometimes
far from the grammatical definition (keeping bullets in front of list
elements, keeping extra spaces before a sentence because they were there, not
inserting a period because none was there).
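
To make the approach above concrete, here is a minimal sketch of that kind of
preprocessing. It assumes the trainer reads one sentence per line, which is my
understanding of the format. The function name to_training_lines and the
sample chunks are hypothetical and only serve as illustration; the chunk
boundaries themselves are still decided by hand.

```python
import re


def to_training_lines(chunks):
    """Write manually delimited chunks as one chunk per line.

    Assumption: the training file holds one "sentence" per line.
    Bullets, leading spaces and missing periods are left untouched.
    """
    lines = []
    for chunk in chunks:
        # A chunk has to fit on a single line, so any line break inside it
        # (with the whitespace around it) becomes a single space.
        flat = re.sub(r"\s*\n\s*", " ", chunk.strip("\n"))
        if flat.strip():  # skip chunks that are empty or whitespace only
            lines.append(flat)
    return "\n".join(lines) + "\n"


# Hypothetical chunks: a list item, a lab result, a wrapped sentence.
chunks = [
    "- Hypertension",
    "pCO2 7.8 kPa",
    "Patient admitted for chest pain\nradiating to the left arm.",
]
print(to_training_lines(chunks), end="")
```

Whether bullets and leading spaces should stay in the lines, as they do in
this sketch, is exactly the part I am unsure about.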

I find that there is very little information about the format of the training
data. So the question is: how should I edit the sentences in the training
file, considering that I'm starting with a rather "dirty" extracted document
which is, for the most part, not made of real sentences?

Thank you in advance


FB
