Hi guys,

We have observed that the sentence detector splits after tokens that do not end with a period (I inserted <SENT> at the positions where such a split occurs):
input: die Bewachungskosten als Neben- kosten auf.die Mieter umzulegen.
output: die Bewachungskosten als Neben- kosten auf.die<SENT>Mieter umzulegen.

input: 1 Nachtrag Nr. 3 zum Mietvertrag vom 28.03./04,04.1984 für die Mieträume in San Marino
output: 1 Nachtrag Nr. 3 zum Mietvertrag vom 28.03./04,04.1984<SENT> für die Mieträume in San Marino

The texts are not accurate because they come from OCR. Nevertheless, I still find it counterintuitive that a sentence split can occur after a non-period symbol (optionally followed by other punctuation marks).

Could someone shed some light on why this happens? Is there a way to control what tokens should be considered as potential sentence-ending tokens and which should not?

Thank you in advance and kind regards,
Nikolai
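
P.S. To make the question concrete, here is a minimal Python sketch of the behavior we would expect. It is not tied to any particular library and is only a post-processing workaround, not a fix inside the detector. It assumes the detector's splits are available as character offsets into the text; the names filter_splits, apply_splits and eos_chars are hypothetical and exist only for this illustration.

```python
def filter_splits(text, split_offsets, eos_chars=".!?"):
    """Keep only the split offsets that directly follow a sentence-ending character.

    split_offsets: character offsets at which a (hypothetical) detector
    decided to start a new sentence.
    eos_chars: characters we accept as potential sentence endings (assumption).
    """
    kept = []
    for offset in split_offsets:
        # Look at the last non-whitespace character before the split.
        before = text[:offset].rstrip()
        if before and before[-1] in eos_chars:
            kept.append(offset)
    return kept


def apply_splits(text, split_offsets):
    """Cut the text at the given offsets and return the non-empty pieces."""
    sentences = []
    start = 0
    for offset in sorted(split_offsets):
        sentences.append(text[start:offset].strip())
        start = offset
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]
```

With the first example above, the split before "Mieter" follows the letter "e" rather than a period, so filter_splits would drop it and apply_splits would return the text as a single sentence. What we are really asking is whether the detector itself can be configured to behave this way.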