Hi guys,

We have observed that the sentence detector sometimes splits after tokens
that do not end with a period (I inserted <SENT> at each position where such
a split occurs):

input:

die Bewachungskosten als Neben- kosten auf.die Mieter umzulegen.


output:

die Bewachungskosten als Neben- kosten auf.die<SENT>Mieter umzulegen.


input:

1 Nachtrag Nr. 3 zum Mietvertrag vom 28.03./04,04.1984 für die Mieträume in
San Marino


output:

1 Nachtrag Nr. 3 zum Mietvertrag vom 28.03./04,04.1984<SENT> für die
Mieträume in San Marino


The texts are noisy because they come from OCR. Nevertheless, I still find
it counterintuitive that a sentence split can occur after a token whose last
character is not a period (or a period optionally followed by other
punctuation marks).
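
To make the pattern easier to check, here is a minimal Python sketch. It is
not part of the sentence detector itself, and the function name
tokens_before_splits is hypothetical, chosen only for this illustration. It
lists the token immediately before each <SENT> marker and reports whether
that token ends with a period or merely contains one:

```python
SENT = "<SENT>"  # marker inserted by hand at each observed split position


def tokens_before_splits(marked_text):
    """Return (token, ends_with_period, contains_period) per <SENT> marker."""
    results = []
    pieces = marked_text.split(SENT)
    # Every piece except the last one is followed by a split marker.
    for piece in pieces[:-1]:
        words = piece.split()
        if not words:
            continue  # marker at the very start or after whitespace only
        token = words[-1]
        results.append((token, token.endswith("."), "." in token))
    return results


# The two outputs quoted above, joined onto one line each.
examples = [
    "die Bewachungskosten als Neben- kosten auf.die<SENT>Mieter umzulegen.",
    "1 Nachtrag Nr. 3 zum Mietvertrag vom 28.03./04,04.1984<SENT> für die "
    "Mieträume in San Marino",
]

for text in examples:
    for token, ends, contains in tokens_before_splits(text):
        print(f"{token!r}: ends with period={ends}, contains period={contains}")
```

In both examples the token before the split contains a period but does not
end with one. That might be related to the behavior, although I cannot
confirm it.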

Could someone shed some light on why this happens? Is there a way to
control which tokens are treated as potential sentence-ending tokens and
which are not?

Thank you in advance and kind regards,
Nikolai
