Hello, I am trying to train a model using data annotated with the brat annotation tool. I have the annotation config and all the .txt and .ann files. I have successfully trained a model using:
    public TokenNameFinderModel trainModel(File corpusDir) throws IOException {
        // Set up the directory structure of the corpus...
        File trainingDir = new File(corpusDir, "train");
        File testDir = new File(corpusDir, "test");
        File config = new File(corpusDir, "annotation.conf");

        // Create a NameSample stream...
        String[] args = {
            "-bratDataDir", trainingDir.getAbsolutePath(),
            "-annotationConfig", config.getAbsolutePath(),
            "-ruleBasedTokenizer", "simple"
        };
        ObjectStreamFactory<NameSample> basFactory =
            StreamFactoryRegistry.getFactory(NameSample.class, "brat");
        ObjectStream<NameSample> trainingStream = basFactory.create(args);

        // Train the model...
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "70");
        params.put(TrainingParameters.CUTOFF_PARAM, "1");
        TokenNameFinderModel nameFinderModel = NameFinderME.train("en", null, trainingStream, params,
            TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
        NameFinderME nameFinder = new NameFinderME(nameFinderModel);

        // Evaluate the model on the training data...
        trainingStream.reset();
        TokenNameFinderEvaluator evaluator =
            new TokenNameFinderEvaluator(nameFinder, new NameEvaluationErrorListener());
        evaluator.evaluate(trainingStream);
        System.out.println("on training data\n" + evaluator.getFMeasure());

        // Return the model...
        return nameFinderModel;
    }

But when I try to use the model:

    public void eval(File dir, NameFinderME nameFinder) throws IOException {
        // Load the data...
        FileFilter txtFileFilter = (File x) -> x.getName().endsWith("txt") && x.length() > 0;
        File[] files = dir.listFiles(txtFileFilter);
        for (File responseFile : files) {
            // Read the file into a string...
            String response = readResponse(responseFile);
            // Break the string into sentences...
            Span[] sentenceSpans = sentenceDetector.sentPosDetect(response);
            for (Span sentenceSpan : sentenceSpans) {
                String sentence = sentenceSpan.getCoveredText(response).toString();
                // Break the sentence into tokens...
                String[] tokens = tokenizer.tokenize(sentence);
                // Find the "names"...
                Span[] spans = nameFinder.find(tokens);
                int spanId = 0;
                if (spans.length > 0) {
                    System.out.println(responseFile.getName());
                    for (Span span : spans) {
                        Span offsetSpan = new Span(span, sentenceSpan.getStart());
                        // Print out the "names" found...
                        System.out.println("\tT" + (++spanId) + "\t" + offsetSpan.getStart() + " "
                            + offsetSpan.getEnd() + "\t" + offsetSpan.getCoveredText(response));
                    }
                }
            }
        }
    }

    private String readResponse(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

I get spans that don't line up with word boundaries. So it is worse than wrong... it's nonsense. Clearly I am doing something wrong. Any ideas?

Daniel
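P.S. In case it helps pin down where my thinking goes wrong: since find() takes a token array, I assume the spans it returns are token indices rather than character offsets. Below is a dependency-free sketch (toy whitespace tokenizer, made-up names and sample text; not OpenNLP code) of the arithmetic I think is needed to turn a token-index span into character offsets in the original document:

```java
// Hypothetical illustration: if a name finder returns spans of TOKEN indices,
// adding a character offset to them directly produces nonsense; the token
// spans first have to be mapped through per-token character positions.
import java.util.ArrayList;
import java.util.List;

public class SpanOffsets {

    // A character span within some text: [start, end).
    static final class CharSpan {
        final int start, end;
        CharSpan(int start, int end) { this.start = start; this.end = end; }
    }

    // Toy whitespace tokenizer that records each token's character offsets
    // within the sentence (standing in for a tokenizePos()-style call).
    static List<CharSpan> tokenizePos(String sentence) {
        List<CharSpan> spans = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            while (i < sentence.length() && sentence.charAt(i) == ' ') i++;
            int start = i;
            while (i < sentence.length() && sentence.charAt(i) != ' ') i++;
            if (start < i) spans.add(new CharSpan(start, i));
        }
        return spans;
    }

    // Convert a token-index span [tokStart, tokEnd) into a character span
    // relative to the whole document, given the sentence's start offset.
    static CharSpan toCharSpan(List<CharSpan> tokenSpans,
                               int tokStart, int tokEnd, int sentenceStart) {
        int start = sentenceStart + tokenSpans.get(tokStart).start;
        int end = sentenceStart + tokenSpans.get(tokEnd - 1).end;
        return new CharSpan(start, end);
    }

    public static void main(String[] args) {
        String doc = "Greetings. Alice Smith lives here.";
        String sentence = "Alice Smith lives here.";
        int sentenceStart = doc.indexOf(sentence);

        List<CharSpan> tokenSpans = tokenizePos(sentence);
        // Suppose the name finder flagged tokens [0, 2) ("Alice Smith").
        CharSpan name = toCharSpan(tokenSpans, 0, 2, sentenceStart);

        System.out.println(doc.substring(name.start, name.end)); // "Alice Smith"
    }
}
```

If that assumption is right, then adding sentenceSpan.getStart() to a token-index span, as my eval() does, would explain offsets that land in the middle of words.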