Hi Jörn, I found that it works if I replace the spaces with <SPLIT> markers in the corpus file.
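In case it helps anyone else, the conversion from a whitespace-segmented Chinese corpus to the <SPLIT> boundary format is trivial. A minimal sketch in plain Java (the class and method names here are mine, not part of OpenNLP):

```java
public class SplitTagConverter {

    // Convert one whitespace-segmented line (e.g. "我 爱 北京") into
    // OpenNLP's TokenSample format, where <SPLIT> marks a token boundary
    // that has no whitespace in the original running text.
    public static String toSplitFormat(String segmentedLine) {
        return segmentedLine.trim().replace(" ", "<SPLIT>");
    }

    public static void main(String[] args) {
        System.out.println(toSplitFormat("我 爱 北京")); // 我<SPLIT>爱<SPLIT>北京
    }
}
```

Note this naive replace assumes exactly one space between tokens; a corpus with runs of spaces or tab separators would need normalizing first.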
Please ignore my last post. Thanks!

> On 4 Sep 2017, at 7:52 AM, 王春华 <igor.w...@icloud.com> wrote:
>
> Hi Jörn,
>
> I am trying to train the tokenizer with a Chinese corpus and got the
> exception below on the console:
>
> Indexing events with TwoPass using cutoff of 5
> Computing event counts... done. 4476143 events
> Indexing... done.
> Sorting and merging events... done. Reduced 4476143 events to 358244.
> Done indexing in 30.55 s.
> opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
>     at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
>     at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
>     at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
>     at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
>
> I am new to NLP and don't quite understand what's going on. The code snippet is below:
>
> InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
>         new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
> Charset charset = Charset.forName("UTF-8");
> ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
> ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
>
> TokenizerModel model;
>
> try {
>     // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
>     boolean useAlphaNumericOptimization = false;
>     String languageCode = "zh";
>     model = TokenizerME.train(sampleStream,
>             TokenizerFactory.create(null, languageCode, null,
>                     useAlphaNumericOptimization, null),
>             TrainingParameters.defaultParams());
> } finally {
>     sampleStream.close();
> }
>
> OutputStream modelOut = null;
> try {
>     modelOut = new BufferedOutputStream(
>             new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
>     model.serialize(modelOut);
> } finally {
>     if (modelOut != null)
>         modelOut.close();
> }
>
> The line I commented out above seems not to work with the latest version.
>
> Help!!!
>
>> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>>
>> Our current tokenizer can be trained to segment Chinese just by
>> following the user documentation, but it might not work very well.
>> We never tried this.
>>
>> Do you have a corpus you can train on?
>>
>> OntoNotes has some Chinese text and could probably be used.
>>
>> Jörn
>>
>> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <igor.w...@icloud.com> wrote:
>>> Hello everyone,
>>>
>>> I wonder if there is any tokenizing model for Chinese text, or where to get
>>> some guidelines on how to generate one myself.
>>>
>>> Thanks!
>>> Aaron
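For anyone else who hits the same InsufficientTrainingDataException: my understanding (an assumption on my part, not something stated in the thread above) is that the trainer learns a split / no-split decision at candidate positions inside whitespace-delimited chunks, so a corpus where every token boundary is plain whitespace produces only one outcome and training aborts. Marking boundaries with <SPLIT> gives the trainer both outcomes. A quick sanity check one could run on the converted corpus before training (hypothetical helper, plain Java):

```java
import java.util.List;

public class CorpusCheck {

    // Returns true if at least one line carries a <SPLIT> boundary marker,
    // i.e. the trainer should see more than one outcome. (Hypothetical
    // helper for pre-flight checking, not part of OpenNLP.)
    public static boolean hasSplitOutcome(List<String> lines) {
        return lines.stream().anyMatch(l -> l.contains("<SPLIT>"));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("北京<SPLIT>欢迎<SPLIT>你", "hello world");
        System.out.println(hasSplitOutcome(corpus)); // prints true
    }
}
```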