Hello,

I only added the <SPLIT> tag because the command line tool was not able
to process files separated by whitespace alone. It found just 1 feature,
and the training ended with an exception like "Unable to create model due
to" in the first iteration, with all the likelihoods at 1.0. So I
replaced all whitespace with the <SPLIT> tag, as described in the
documentation.
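
For comparison, here is a minimal sketch of the other way to build a
training line: keep a space wherever the raw text had whitespace, and
write <SPLIT> only between tokens that touch. This follows my reading of
the documented format (one sentence per line, tokens separated by
whitespace or by <SPLIT>). It is plain Python, not the OpenNLP API; the
function name and the raw sample sentences are made up for illustration,
and I have not verified that this resolves the training exception.

```python
SPLIT = "<SPLIT>"


def to_training_line(text, tokens):
    """Rebuild one training line from raw text and its gold tokens.

    A space is emitted where the raw text had whitespace between two
    tokens; <SPLIT> is emitted where the tokens were directly adjacent.
    (Hypothetical helper, not part of OpenNLP.)
    """
    parts = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = text.index(tok, pos)  # raises ValueError if the token is missing
        gap = text[pos:start]
        if gap.strip():
            raise ValueError(f"text not covered by tokens: {gap!r}")
        if i > 0:
            parts.append(" " if gap else SPLIT)
        parts.append(tok)
        pos = start + len(tok)
    return "".join(parts)


# Assumed raw sentence; the original untokenized text is not shown in this thread.
raw = "Washington (dpa) - Sarah Palin hat jetzt eine eigene Show."
gold = ["Washington", "(", "dpa", ")", "-", "Sarah", "Palin", "hat",
        "jetzt", "eine", "eigene", "Show", "."]
print(to_training_line(raw, gold))
```

With this layout most token boundaries remain ordinary spaces, and
<SPLIT> appears only around punctuation that is attached to a word.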

Andreas
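
P.S. On the alphanumeric pattern from my earlier message quoted below: a
small sketch of why an ASCII-only pattern fails for German words. I am
assuming the default has the shape ^[A-Za-z0-9]+$ (please check
Factory.java in your version); the Unicode-aware alternative and both
helper names are only illustrations, written in Python rather than Java.

```python
import re

# Assumed shape of the default alphanumeric pattern (ASCII letters and digits only).
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")
# Illustrative alternative: any Unicode letter or digit (word characters minus "_").
UNICODE_ALNUM = re.compile(r"[^\W_]+")


def is_ascii_alnum(word):
    """True if the whole word consists of ASCII letters and digits."""
    return ASCII_ALNUM.fullmatch(word) is not None


def is_unicode_alnum(word):
    """True if the whole word consists of letters and digits of any script."""
    return UNICODE_ALNUM.fullmatch(word) is not None


for word in ["Analyst", "Düsseldorf", "größtenteils", "RSS-Feed"]:
    print(word, is_ascii_alnum(word), is_unicode_alnum(word))
```

Words with umlauts or ß are rejected by the ASCII-only pattern but
accepted by the Unicode-aware one; hyphenated words are rejected by both.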

On 13.03.2013 20:13, Jörn Kottmann wrote:
> The tokenizer's defaults are for text which is mostly whitespace separated;
> did you lose all the whitespace in the text you want to process?
> 
> Jörn
> 
> On 03/13/2013 04:31 PM, Andreas Niekler wrote:
>> Hello,
>>
>> I give you some examples below this comment. But I already noticed in
>> the code that the standard tokenizerTrainer call uses the standard
>> alphanumeric pattern, which won't work for typical German examples. Most
>> of the data will be separated because of the improper pattern in the
>> standard Factory.java class. My belief is that the de-token.bin model
>> was trained with a proper pattern within another implementation of the
>> training procedure.
>>
>> Here are some training lines:
>>
>> Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
>>
>> Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
>>
>> Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
>>
>> Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
>>
>> Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
>>
>> Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
>>
>> Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
>>
>> Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
>>
>> Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
>>
>> Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
>>
>> IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
>>
>> Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
>>
>> Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
>>
>> Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
>>
>> Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
>>
>> Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.
>>
>>
>> On 13.03.2013 15:52, Jörn Kottmann wrote:
>>> Hello,
>>>
>>> can you tell us a bit more about your training data? Did you manually
>>> annotate these 300k sentences?
>>> Is it possible to post 10 lines or so here?
>>>
>>> Jörn
>>>
>>> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>>>> Dear List,
>>>>
>>>> I created a tokenizer model with 300k German sentences from a very
>>>> clean corpus. A tokenizer using this model separates some words very
>>>> strangely, for example:
>>>>
>>>> stehenge - blieben
>>>> fre - undlicher
>>>>
>>>> and so on. I can't find those in my training data and wonder why OpenNLP
>>>> splits those words without any evidence in the training data and without
>>>> any whitespace in my text files. I trained the model with 500
>>>> iterations, cutoff 5 and alphanumeric optimisation.
>>>>
>>>> Can anyone suggest how I can prevent this?
>>>>
>>>> Thank you
>>>>
>>>> Andreas
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
