[ https://issues.apache.org/jira/browse/OPENNLP-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joern Kottmann closed OPENNLP-1115.
-----------------------------------
    Resolution: Won't Fix

Please ask questions on the user list.

> Document Categorizer all events dropped
> ---------------------------------------
>
>                 Key: OPENNLP-1115
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1115
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Doccat
>    Affects Versions: 1.7.2
>            Reporter: Alessandro Depase
>         Attachments: Train1.train
>
>
> Hi all,
> I'm trying to perform my first (newbie) document categorization using the
> Italian language.
> I'm using the attached train file and I got this output:
> {{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train"
> -encoding UTF-8
> Indexing events using cutoff of 5
> Computing event counts... done. 12 events
> Indexing... Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
> bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to train
> a model.
> To resolve this error use more training data, if this doesn't help there might
> be some fundamental problem with the training data itself.}}
> I already found a couple of other similar issues. They say either that there
> are not enough lines (but I have 6 lines for each category and a cutoff of 5)
> or that without at least 100 lines the categorization quality is not
> sufficient (OK, but that is just a matter of quality: it should work, with bad
> results, but it should work). The reason for the insufficient data is that all
> the lines are dropped.
> I also tried with the Java API, with the same result.
> But why? What did I miss? I cannot find useful documentation...
> Thank you in advance
> Kind Regards
> Alessandro

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
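
One plausible reading of the log above is that the cutoff is not a minimum number of lines per category. Instead it looks like a minimum number of times each individual feature (each bow=... token) must occur across the whole training set, and an event whose features all fall below the cutoff is dropped. The sketch below is not OpenNLP code. It is a small stand-alone simulation under that assumption, using the twelve events exactly as they appear in the "Dropped event" lines. The function name surviving_events and the ">= cutoff" comparison are illustrative assumptions, not taken from the library.

```python
from collections import Counter

# Events reconstructed from the "Dropped event" lines in the log:
# (category, bag-of-words features).
EVENTS = [
    ("Ok", ["ok"]),
    ("Ok", ["tutto", "bene"]),
    ("Ok", ["decisamente", "non", "male"]),
    ("Ok", ["fantastica", "scelta"]),
    ("Ok", ["non", "pensavo", "di", "poter", "essere", "così", "contento"]),
    ("Ok", ["certamente", "un'ottimo", "risultato"]),
    ("no", ["non", "va", "affatto", "bene"]),
    ("no", ["per", "nulla"]),
    ("no", ["niente", "affatto", "divertente"]),
    ("no", ["va", "malissimo"]),
    ("no", ["va", "decisamente", "male"]),
    ("no", ["sono", "molto", "triste"]),
]


def surviving_events(events, cutoff):
    """Hypothetical helper: keep only features seen at least `cutoff` times
    over all events, then drop every event left with no features."""
    counts = Counter(f for _, feats in events for f in feats)
    kept = []
    for category, feats in events:
        remaining = [f for f in feats if counts[f] >= cutoff]
        if remaining:
            kept.append((category, remaining))
    return kept


feature_counts = Counter(f for _, feats in EVENTS for f in feats)

# Highest number of occurrences of any single token in the training data.
print(max(feature_counts.values()))
# Number of events left with a cutoff of 5 (the value reported in the log).
print(len(surviving_events(EVENTS, 5)))
# Number of events left when every feature seen at least once is kept.
print(len(surviving_events(EVENTS, 1)))
```

Under this assumption the most frequent tokens ("non" and "va") occur only three times each, so no feature reaches a cutoff of 5 and every event ends up empty, which would match the twelve "Dropped event" lines. If that reading is right, either a lower cutoff in the training parameters or considerably more training data should let events survive indexing.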