Thank you. So, if I understand correctly, the cutoff applies to features and not to lines, as I had wrongly gathered from some Internet examples (and from not reading the documentation carefully enough... my bad).
Reading the documentation, I see that if no feature generator has been specified, "Bag of words" is used. It is not so clear that this means "tested on every single training line" rather than "on every category" (after your answer it is indeed clearer). Roughly speaking, does this mean that a training line with more words than the cutoff (more than 5 words with the default) is kept, while one with fewer words is dropped? If that is correct, then using the defaults it should suffice to lower the cutoff to 1, not to zero (a line with no words is meaningless anyway, and I think it is already checked earlier as well formed). I'll run some tests in this direction, thank you so much

Alessandro

2017-07-11 16:40 GMT+02:00 Joern Kottmann <kottm...@gmail.com>:
> An event is dropped when the cutoff is so high that all features are
> removed from that event.
> I recommend training with more data or decreasing the cutoff value to
> zero.
>
> Jörn
>
> On Tue, Jul 11, 2017 at 3:44 PM, Alessandro Depase
> <alessandro.dep...@gmail.com> wrote:
> > Hi all,
> > I'm trying to perform my first (newbie) document categorization in
> > Italian.
> > I'm using a very simple file with this content:
> >
> > Ok ok
> > Ok tutto bene
> > Ok decisamente non male
> > Ok fantastica scelta
> > Ok non pensavo di poter essere così contento
> > Ok certamente un'ottimo risultato
> > no non va affatto bene
> > no per nulla
> > no niente affatto divertente
> > no va malissimo
> > no va decisamente male
> > no sono molto triste
> >
> > (no lines before or after the quoted ones - and, yes, I know that in
> > Italian "un'ottimo" is an error, but it was part of my list :) ) and I
> > got this output:
> >
> > $ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> > "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
> > -encoding UTF-8
> > Indexing events using cutoff of 5
> >
> > Computing event counts... done.
> > 12 events
> > Indexing... Dropped event Ok:[bow=ok]
> > Dropped event Ok:[bow=tutto, bow=bene]
> > Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> > Dropped event Ok:[bow=fantastica, bow=scelta]
> > Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
> > bow=così, bow=contento]
> > Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> > Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> > Dropped event no:[bow=per, bow=nulla]
> > Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> > Dropped event no:[bow=va, bow=malissimo]
> > Dropped event no:[bow=va, bow=decisamente, bow=male]
> > Dropped event no:[bow=sono, bow=molto, bow=triste]
> > done.
> > Sorting and merging events...
> >
> > ERROR: Not enough training data
> > The provided training data is not sufficient to create enough events to
> > train a model.
> > To resolve this error use more training data, if this doesn't help there
> > might be some fundamental problem with the training data itself.
> >
> > I already found a couple of other similar issues on the Internet, just
> > saying that there are not enough lines (but I have 6 lines for each
> > category and a cutoff of 5), or that without at least 100 lines the
> > categorization quality is not sufficient (OK, but that's just a quality
> > matter: it should still work, albeit with bad results). The reason for
> > the "insufficient data" is that all the lines are dropped. Some people
> > seem to succeed with as few as 10 lines.
> > But why? What did I miss? I cannot find useful documentation...
> >
> > Please note that my question is about *why* the lines are dropped, about
> > the reason, the logic behind dropping them.
> > I tried to understand the code (I stopped when it required too much
> > time without downloading and debugging it), and this is what I
> > understood:
> > *the AbstractDataIndexer throws the exception in the method
> > sortAndMerge because it "thinks" there isn't enough data*, but it uses
> > the *List eventsToCompare*, which is the result of a previous
> > computation in the same class, in the *method index(ObjectStream<Event>
> > events, Map<String, Integer> predicateIndex)*.
> > There the code builds an int[] from each line in a way I could not
> > completely understand (my question, at the very end, is: what is the
> > logic behind the construction of this array?). If the array has more
> > than one element, then OK, we have elements to compare (and
> > sortAndMerge will not throw this exception); otherwise the line is
> > dropped. So: what is the logic behind dropping the line?
> > The documentation just talks about the cutoff value, but I provided
> > more lines than required by the cutoff.
> > So, to complete the question: is there a way to quantify the minimum
> > number of lines, words or whatever is needed? Why do examples available
> > online work with 10 lines while mine does not? I don't mind the quality
> > here; I completely understand that it will not produce a meaningful
> > result in a real case, but why did I get an exception while others did
> > not?
> >
> > In the meantime I tried with roughly 15 lines and got no exception. The
> > quality of the categorization was very low, as expected (it almost
> > always returned "ok", even for sentences in the training set - is this
> > related to the fact that the corresponding lines were dropped, so
> > training happened only on a few others?). With 29 lines it begins to
> > give meaningful answers; nonetheless, the questions remain.
> >
> > Thank you in advance for your support
> > Kind Regards
> > Alessandro
>
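The dropping behaviour described in Jörn's answer can be sketched in a few lines of Java. This is not OpenNLP's actual indexer code, just an illustration of the logic as described: every bag-of-words feature is counted across the whole corpus, features occurring fewer than `cutoff` times are discarded, and an event whose features are all discarded is dropped. In the 12-line file above no token occurs 5 times (the most frequent ones, "non" and "va", occur 3 times each), so the default cutoff of 5 drops every event, while a cutoff of 1 keeps them all.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CutoffSketch {

    // The 12 training lines above, reduced to their bag-of-words features
    // (the leading "Ok"/"no" is the category label, not a feature).
    static final List<String[]> EVENTS = Arrays.asList(
            new String[]{"ok"},
            new String[]{"tutto", "bene"},
            new String[]{"decisamente", "non", "male"},
            new String[]{"fantastica", "scelta"},
            new String[]{"non", "pensavo", "di", "poter", "essere", "così", "contento"},
            new String[]{"certamente", "un'ottimo", "risultato"},
            new String[]{"non", "va", "affatto", "bene"},
            new String[]{"per", "nulla"},
            new String[]{"niente", "affatto", "divertente"},
            new String[]{"va", "malissimo"},
            new String[]{"va", "decisamente", "male"},
            new String[]{"sono", "molto", "triste"});

    /** Counts events keeping at least one feature that occurs >= cutoff times corpus-wide. */
    static int survivingEvents(List<String[]> events, int cutoff) {
        // First pass: count each feature across the whole corpus.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] event : events)
            for (String feature : event)
                counts.merge(feature, 1, Integer::sum);

        // Second pass: an event survives only if some feature meets the cutoff.
        int kept = 0;
        for (String[] event : events) {
            boolean hasSurvivingFeature = false;
            for (String feature : event)
                if (counts.get(feature) >= cutoff)
                    hasSurvivingFeature = true;
            if (hasSurvivingFeature)
                kept++; // otherwise dropped, like the "Dropped event ..." lines in the log
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(survivingEvents(EVENTS, 5)); // prints 0: every event dropped
        System.out.println(survivingEvents(EVENTS, 1)); // prints 12: every event kept
    }
}
```

Under this reading, lowering the cutoff to 1 is indeed enough for this file: every token occurs at least once, so nothing is dropped, and a cutoff of 0 would change nothing here.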