Hi Jörn,

I found it works if I replace the spaces with <SPLIT> within the corpus file; that way the trainer sees explicit split decisions instead of whitespace-only boundaries, so it gets more than one outcome.
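
For example, a corpus line then looks like this (made-up sentence; <SPLIT> marks a token boundary with no whitespace around it):

    我<SPLIT>喜欢<SPLIT>自然<SPLIT>语言<SPLIT>处理<SPLIT>。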

Please ignore my last post.

Thanks!
> On 4 Sep 2017, at 7:52 AM, 王春华 <igor.w...@icloud.com> wrote:
> 
> Hi Jörn,
> 
> I am trying to train the tokenizer on a Chinese corpus and got the 
> exception below on the console:
> 
> 
> Indexing events with TwoPass using cutoff of 5
> 
>       Computing event counts...  done. 4476143 events
>       Indexing...  done.
> Sorting and merging events... done. Reduced 4476143 events to 358244.
> Done indexing in 30.55 s.
> opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
>       at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
>       at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
>       at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
>       at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
> 
> I am new to NLP and don't quite understand what's going on. My code 
> snippet is below:
> 
> import java.io.BufferedOutputStream;
> import java.io.File;
> import java.io.FileOutputStream;
> import java.io.OutputStream;
> import java.nio.charset.Charset;
> 
> import opennlp.tools.tokenize.TokenSample;
> import opennlp.tools.tokenize.TokenSampleStream;
> import opennlp.tools.tokenize.TokenizerFactory;
> import opennlp.tools.tokenize.TokenizerME;
> import opennlp.tools.tokenize.TokenizerModel;
> import opennlp.tools.util.InputStreamFactory;
> import opennlp.tools.util.MarkableFileInputStreamFactory;
> import opennlp.tools.util.ObjectStream;
> import opennlp.tools.util.PlainTextByLineStream;
> import opennlp.tools.util.TrainingParameters;
> 
> // Read the corpus line by line and parse each line into a TokenSample.
> InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
>         new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
> Charset charset = Charset.forName("UTF-8");
> ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
> ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
> 
> TokenizerModel model;
> 
> try {
>     // Old API, no longer available in the latest version:
>     // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
>     boolean useAlphaNumericOptimization = false;
>     String languageCode = "zh";
>     model = TokenizerME.train(sampleStream,
>             TokenizerFactory.create(null, languageCode, null,
>                     useAlphaNumericOptimization, null),
>             TrainingParameters.defaultParams());
> } finally {
>     sampleStream.close();
> }
> 
> // Serialize the trained model to disk.
> OutputStream modelOut = null;
> try {
>     modelOut = new BufferedOutputStream(
>             new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
>     model.serialize(modelOut);
> } finally {
>     if (modelOut != null)
>         modelOut.close();
> }
> The line I commented out above seems to be out of date with the latest version.
> 
> Help!!!
> 
> 
>> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>> 
>> Our current tokenizer can be trained to segment Chinese just by
>> following the user documentation,
>> but it might not work very well. We never tried this.
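>> 
>> The training command from the documentation would look something like
>> this (file names are placeholders):
>> 
>>   $ opennlp TokenizerTrainer -model zh-token.bin -lang zh -data zh-token.train -encoding UTF-8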
>> 
>> Do you have a corpus you can train on?
>> 
>> OntoNotes has some Chinese text and could probably be used.
>> 
>> Jörn
>> 
>> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <igor.w...@icloud.com> wrote:
>>> Hello everyone,
>>> 
>>> I wonder if there is any tokenizer model for Chinese text, or where I can 
>>> get some guidance on how to generate one myself.
>>> 
>>> thanks!
>>> Aaron
> 
