Koji Sekiguchi created OPENNLP-1221:
---------------------------------------

             Summary: FeatureGeneratorUtil.tokenFeature() is too specific for some languages
                 Key: OPENNLP-1221
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
             Project: OpenNLP
          Issue Type: Improvement
    Affects Versions: 1.9.0
            Reporter: Koji Sekiguchi

As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), ALPHA and OTHER as token classes. The classes that FeatureGeneratorUtil.tokenFeature() currently provides are too specific. For example, I don't need to distinguish among lc (all lowercase letters), ac (all capital letters) and ic (initial capital letter).

As a trial, I applied the following patch to avoid generating overly specific token classes:

{code}
diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
index e6b8af95..405938d1 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
@@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
   private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
 
   private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
+  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
+  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
 
   /**
    * Generates a class name for the specified token.
@@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
     else if (pattern.isAllKatakana()) {
       feat = "jak";
     }
-    else if (pattern.isAllLowerCaseLetter()) {
-      feat = "lc";
+    else if (pDigit.matcher(token).find()) {
+      feat = "digit";
     }
-    else if (pattern.digits() == 2) {
-      feat = "2d";
-    }
-    else if (pattern.digits() == 4) {
-      feat = "4d";
-    }
-    else if (pattern.containsDigit()) {
-      if (pattern.containsLetters()) {
-        feat = "an";
-      }
-      else if (pattern.containsHyphen()) {
-        feat = "dd";
-      }
-      else if (pattern.containsSlash()) {
-        feat = "ds";
-      }
-      else if (pattern.containsComma()) {
-        feat = "dc";
-      }
-      else if (pattern.containsPeriod()) {
-        feat = "dp";
-      }
-      else {
-        feat = "num";
-      }
-    }
-    else if (pattern.isAllCapitalLetter()) {
-      if (token.length() == 1) {
-        feat = "sc";
-      }
-      else {
-        feat = "ac";
-      }
-    }
-    else if (capPeriod.matcher(token).find()) {
-      feat = "cp";
-    }
-    else if (pattern.isInitialCapitalLetter()) {
-      feat = "ic";
+    else if (pAlpha.matcher(token).find()) {
+      feat = "alpha";
     }
     else {
       feat = "other";
{code}

With this patch, total F1 increased from 82.00% to 82.13%. The gain is small, but I think there is still a lot of room to tune and improve performance.

Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs to japanese-addon that generate simpler token classes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
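
The five-class scheme described in this issue (DIGIT, HIRA, KATA, ALPHA, OTHER) could be sketched as a standalone classifier along the following lines. This is only an illustration under stated assumptions, not the patch above and not code from japanese-addon: the class name SimpleTokenClass, the method tokenClass() and the lowercase labels are hypothetical, and using Unicode blocks for the kana checks is just one possible choice.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of OpenNLP): maps a token to one of five coarse classes. */
final class SimpleTokenClass {

  // Unicode blocks are used for kana so that the prolonged sound mark (U+30FC),
  // which lies in the Katakana block, is accepted inside katakana tokens.
  private static final Pattern HIRA = Pattern.compile("\\p{InHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{InKatakana}+");
  // Same character properties as in the patch; matches() makes the ^...$ anchors unnecessary.
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");

  private SimpleTokenClass() {
  }

  /** Returns "hira", "kata", "digit", "alpha" or "other" (labels are assumptions). */
  static String tokenClass(String token) {
    // The kana checks come first because \p{IsAlphabetic} also matches kana.
    if (HIRA.matcher(token).matches()) {
      return "hira";
    }
    if (KATA.matcher(token).matches()) {
      return "kata";
    }
    if (DIGIT.matcher(token).matches()) {
      return "digit";
    }
    if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    return "other";
  }

  public static void main(String[] args) {
    // Sample tokens: digits, hiragana, katakana (written as escapes), Latin letters, mixed.
    String[] tokens = {"2018", "\u3072\u3089\u304c\u306a", "\u30ab\u30bf\u30ab\u30ca", "OpenNLP", "U.S."};
    for (String t : tokens) {
      System.out.println(t + "\t" + tokenClass(t));
    }
  }
}
```

The order of the checks matters in this sketch: \p{IsAlphabetic} also matches kana and kanji, so a token made only of kanji ends up in the alpha class unless a separate check is added before it.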