Hello,
    I am trying to train a model using data annotated with the brat 
annotator.  I have the annotation config and all the .txt and .ann files.  I 
have successfully trained a model using:


        public TokenNameFinderModel trainModel(File corpusDir) throws IOException{
//
//  set up the directory structure of the corpus….
//
                File trainingDir=new File(corpusDir,"train");
                File testDir=new File(corpusDir,"test");
                File config=new File(corpusDir,"annotation.conf");

//
//  Create a NameSample Stream...
//
                String[] args={"-bratDataDir",trainingDir.getAbsolutePath(),
                                "-annotationConfig",config.getAbsolutePath(),
                                "-ruleBasedTokenizer","simple"
                };
                ObjectStreamFactory<NameSample> basFactory=StreamFactoryRegistry.getFactory(NameSample.class, "brat");
                ObjectStream<NameSample> trainingStream=basFactory.create(args);


//
//  Train the model...
//
                TrainingParameters params = new TrainingParameters();
                params.put(TrainingParameters.ITERATIONS_PARAM, "70");
                params.put(TrainingParameters.CUTOFF_PARAM, "1");
                TokenNameFinderModel nameFinderModel = NameFinderME.train("en", null, trainingStream,
                                params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));

                NameFinderME nameFinder = new NameFinderME(nameFinderModel);
//
// Eval the model...
//
                trainingStream.reset();
                TokenNameFinderEvaluator evaluator=new TokenNameFinderEvaluator(nameFinder, new NameEvaluationErrorListener());
                evaluator=new TokenNameFinderEvaluator(nameFinder);
                evaluator.evaluate(trainingStream);
                System.out.println("on training data\n"+evaluator.getFMeasure());
//  return the model...
                return nameFinderModel;
        }




But, when I try to use the model…


        public void eval(File dir,NameFinderME nameFinder) throws IOException{
// load the data..
                FileFilter txtFileFilter=(File x) -> { return x.getName().endsWith("txt") && x.length()>0; } ;
                File[] files=dir.listFiles(txtFileFilter);

                for (File responseFile:files){
// read the file into a string...
                        String response=readResponse(responseFile);
// break the string into sentences...
                        Span[] sentenceSpans=sentenceDetector.sentPosDetect(response);
                        for (Span sentenceSpan:sentenceSpans){
                                String sentence=sentenceSpan.getCoveredText(response).toString();
// break the sentences into tokens...
                                String[] tokens=tokenizer.tokenize(sentence);
// find the “names”..
                                Span[] spans=nameFinder.find(tokens);
                                int spanId = 0;
                                if (spans.length>0){
                                        System.out.println(responseFile.getName());
                                        for (Span span:spans){
                                                Span offsetSpan=new Span(span,sentenceSpan.getStart());
// print out the “names” found...
                                                System.out.println("\tT"+(++spanId)+"\t"+offsetSpan.getStart()+" "+offsetSpan.getEnd()
                                                                +"\t"+offsetSpan.getCoveredText(response));
                                        }
                                }
                        }
                }
        }

        private String readResponse(File file) throws IOException{
                StringBuilder sb=new StringBuilder();
                try(BufferedReader in=new BufferedReader(new FileReader(file))){
                        sb.append(in.readLine());
                }
                return sb.toString();

        }


I get spans that don’t line up with word boundaries.  So it is worse than wrong… 
it’s nonsense.  Clearly I am doing something wrong.  Any ideas?
Daniel
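
A possible explanation, offered as a sketch rather than a confirmed diagnosis: NameFinderME.find(String[]) returns spans whose start and end are token indices (end exclusive), not character offsets, so shifting them by sentenceSpan.getStart() mixes two different units. Mapping back to characters needs the character offsets of each token. The example below is self-contained and deliberately avoids OpenNLP so it runs on its own; the class TokenSpanOffsets, its two methods, and the whitespace tokenizer are hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: maps a token-index span
// (as returned by a name finder) to document-level character offsets.
class TokenSpanOffsets {

    // Stand-in for a tokenizer that reports positions: returns the
    // {start, end} character offsets (end exclusive) of each
    // whitespace-separated token within the sentence.
    static int[][] whitespaceTokenOffsets(String sentence) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            // skip whitespace between tokens
            while (i < sentence.length() && Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            int start = i;
            // consume one token
            while (i < sentence.length() && !Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets.toArray(new int[0][]);
    }

    // tokenStart/tokenEnd are token indices (end exclusive);
    // sentenceStart is the sentence's character offset in the document.
    static int[] toDocumentCharSpan(int[][] tokenOffsets, int tokenStart,
                                    int tokenEnd, int sentenceStart) {
        int charStart = sentenceStart + tokenOffsets[tokenStart][0];
        int charEnd = sentenceStart + tokenOffsets[tokenEnd - 1][1];
        return new int[] {charStart, charEnd};
    }

    public static void main(String[] args) {
        String document = "Hello. John Smith lives here.";
        int sentenceStart = 7; // character offset of the second sentence
        String sentence = document.substring(sentenceStart);

        int[][] tokenOffsets = whitespaceTokenOffsets(sentence);
        // Pretend the name finder returned the token span [0, 2).
        int[] span = toDocumentCharSpan(tokenOffsets, 0, 2, sentenceStart);

        // prints: 7 17    John Smith   (fields separated by a tab)
        System.out.println(span[0] + " " + span[1] + "\t"
                + document.substring(span[0], span[1]));
    }
}
```

In the eval method above, the equivalent change would presumably be to call tokenizer.tokenizePos(sentence) to get the token character spans, derive the token strings from them with Span.spansToStrings(tokenSpans, sentence), and then compute each name's start as sentenceSpan.getStart() + tokenSpans[span.getStart()].getStart() and its end as sentenceSpan.getStart() + tokenSpans[span.getEnd() - 1].getEnd(). This has not been run against the brat corpus in question.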
