Hi All,
I am trying to split a string into sentences using the sentence detector.
The data is in English, UTF-8 encoded, and contains many abbreviations
(medical text).
I need the sentence detector to accept a list of abbreviations. I am using
the Dictionary class like this:
Dictionary abbrDict = new Dictionary();
try {
    // Alternative I also tried: load the dictionary directly from the file.
    // abbrDict = new Dictionary(new FileInputStream(new File(pathToAbbr)));

    // Collapse tabs and line breaks into single spaces, then add each
    // abbreviation to the dictionary as a one-token entry.
    String abbrString = readFile(pathToAbbr).replaceAll("(\\t|\\r?\\n)+", " ");
    for (String abbr : abbrString.split(" ")) {
        StringList abbrList = new StringList(abbr);
        System.out.println(abbrList.getToken(0));
        abbrDict.put(abbrList);
    }
} catch (Exception ex) {
    ex.printStackTrace();
}
System.out.println(abbrDict.size() + " is the size of dict " + abbrDict.toString());
_______________________________________________________________________________
The output of the last line looks like this:
9 is the size of dict [[L.M.P.], [D.O.A.], [L.S.A.], [R.S.T.], [A.G.A.],
[R.F.P.], [R.S.P.], [S.L.P.], [R.F.A.]]
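To rule out the file-reading step as the culprit, the splitting logic can be checked on its own. The following is only a minimal sketch in plain Java with no OpenNLP involved; the class name AbbrSplit, the method splitAbbreviations, and the sample input are made up for illustration. It uses the same regular expression as the code above and additionally skips empty tokens, which could otherwise end up in the dictionary if the file starts with a blank line or contains repeated spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of OpenNLP.
class AbbrSplit {

    // Collapse tabs and line breaks into single spaces, then split on
    // spaces, dropping any empty tokens left by leading or repeated spaces.
    static List<String> splitAbbreviations(String raw) {
        String normalized = raw.replaceAll("(\\t|\\r?\\n)+", " ");
        List<String> result = new ArrayList<>();
        for (String abbr : normalized.split(" ")) {
            if (!abbr.isEmpty()) {
                result.add(abbr);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample mixing Unix line breaks, Windows line breaks, and a tab.
        String raw = "L.M.P.\nD.O.A.\r\nL.S.A.\tR.S.T.\n";
        System.out.println(splitAbbreviations(raw));
        // prints [L.M.P., D.O.A., L.S.A., R.S.T.]
    }
}
```

Since the printed dictionary contents already look clean, the tokenizing itself is probably not the problem; this only confirms that part in isolation.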
My question is: is this the right way to do it? If so, why does the
sentence detector still not split sentences properly around these
abbreviations?
Any help would be appreciated.
Adi