Hi Jörn, I am trying to train the tokenizer on a Chinese corpus and got the following exception on the console:
Indexing events with TwoPass using cutoff of 5
    Computing event counts...  done. 4476143 events
    Indexing...  done.
Sorting and merging events... done. Reduced 4476143 events to 358244.
Done indexing in 30.55 s.
opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
    at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)

I am new to NLP and don't quite understand what is going on. The code snippet is below:

    InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
            new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
    ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

    TokenizerModel model;
    try {
        // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
        String languageCode = "zh";
        boolean useAlphaNumericOptimization = false;
        model = TokenizerME.train(sampleStream,
                TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
                TrainingParameters.defaultParams());
    } finally {
        sampleStream.close();
    }

    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(
                new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null)
            modelOut.close();
    }

The line I commented out above seems to be from an older API that no longer exists in the latest version. Help!!!

> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Our current tokenizer can be trained to segment Chinese just by
> following the user documentation, but it might not work very well.
> We never tried this.
>
> Do you have a corpus you can train on?
>
> OntoNotes has some Chinese text and could probably be used.
>
> Jörn
>
> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <igor.w...@icloud.com> wrote:
>> Hello everyone,
>>
>> I wonder if there is a tokenizing model for Chinese text, or where to
>> find guidelines on how to generate one myself.
>>
>> Thanks!
>> Aaron
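
P.S. Reading the Javadoc, TokenSampleStream seems to expect each line of training data to mark token boundaries either with whitespace or with a <SPLIT> tag, so I suspect the exception means my raw, unsegmented corpus only ever produces the "no split" outcome. Below is a minimal sketch of what I think one annotated line should look like; the Chinese segmentation in the string is just my own guess for illustration, not taken from any corpus:

    import java.util.Arrays;
    import opennlp.tools.tokenize.TokenSample;

    public class TokenSampleCheck {
        public static void main(String[] args) {
            // One hypothetical line of annotated training data: adjacent tokens
            // with no whitespace between them are separated by the <SPLIT> tag.
            String line = "我<SPLIT>喜欢<SPLIT>自然语言<SPLIT>处理<SPLIT>。";

            // Parse it the way I believe TokenSampleStream does internally;
            // printing the spans shows whether the splits were picked up.
            TokenSample sample = TokenSample.parse(line, TokenSample.DEFAULT_SEPARATOR_CHARS);
            System.out.println(Arrays.toString(sample.getTokenSpans()));
        }
    }

If that is right, I guess I first need a pre-segmented corpus (e.g. OntoNotes, as you suggested) converted into this format before TokenizerME.train() can see more than one outcome. Does that sound correct?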