[jira] [Created] (OPENNLP-1221) FeatureGeneratorUtil.tokenFeature() is too specific for some languages

Koji Sekiguchi (JIRA) Tue, 25 Sep 2018 21:23:44 -0700

Koji Sekiguchi created OPENNLP-1221:
---------------------------------------


             Summary: FeatureGeneratorUtil.tokenFeature() is too specific for some languages
                 Key: OPENNLP-1221
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
             Project: OpenNLP
          Issue Type: Improvement
    Affects Versions: 1.9.0
            Reporter: Koji Sekiguchi


As I described in OPENNLP-1197, for Japanese NER we usually use only five 
token classes: DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), 
ALPHA, and OTHER. The classes that FeatureGeneratorUtil.tokenFeature() 
currently provides are too specific. For example, I don't need to distinguish 
among lc (all lowercase letters), ac (all capital letters), and ic (initial 
capital letter).

As a trial, I applied the following patch to avoid generating overly specific 
token classes:

{code}
diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
index e6b8af95..405938d1 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
@@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
   private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
 
   private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
+  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
+  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
 
   /**
    * Generates a class name for the specified token.
@@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
     else if (pattern.isAllKatakana()) {
       feat = "jak";
     }
-    else if (pattern.isAllLowerCaseLetter()) {
-      feat = "lc";
+    else if (pDigit.matcher(token).find()) {
+      feat = "digit";
     }
-    else if (pattern.digits() == 2) {
-      feat = "2d";
-    }
-    else if (pattern.digits() == 4) {
-      feat = "4d";
-    }
-    else if (pattern.containsDigit()) {
-      if (pattern.containsLetters()) {
-        feat = "an";
-      }
-      else if (pattern.containsHyphen()) {
-        feat = "dd";
-      }
-      else if (pattern.containsSlash()) {
-        feat = "ds";
-      }
-      else if (pattern.containsComma()) {
-        feat = "dc";
-      }
-      else if (pattern.containsPeriod()) {
-        feat = "dp";
-      }
-      else {
-        feat = "num";
-      }
-    }
-    else if (pattern.isAllCapitalLetter()) {
-      if (token.length() == 1) {
-        feat = "sc";
-      }
-      else {
-        feat = "ac";
-      }
-    }
-    else if (capPeriod.matcher(token).find()) {
-      feat = "cp";
-    }
-    else if (pattern.isInitialCapitalLetter()) {
-      feat = "ic";
+    else if (pAlpha.matcher(token).find()) {
+      feat = "alpha";
     }
     else {
       feat = "other";
{code}

With this patch, total F1 increased from 82.00% to 82.13%. The gain is small, 
but I think there is still a lot of room to tune and improve performance.

Fortunately, I was able to add the japanese-addon project to opennlp-addons 
in the previous ticket, so I'd like to add some programs to japanese-addon 
that generate simpler token classes.
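
A minimal sketch of what such a simpler token class generator might look like, 
written as a standalone class with no OpenNLP dependency. The class name 
SimpleTokenClass, the method classify(), and the labels "hira", "kata", 
"digit", "alpha" and "other" are hypothetical and are not part of OpenNLP or 
japanese-addon. The hiragana and katakana checks use Unicode blocks as an 
assumption; the digit and alphabetic patterns are the ones from the patch above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: maps a token to one of five
// coarse classes (hira, kata, digit, alpha, other).
public class SimpleTokenClass {

  // Same patterns as in the trial patch.
  private static final Pattern DIGITS = Pattern.compile("^\\p{IsDigit}+$");
  private static final Pattern ALPHAS = Pattern.compile("^\\p{IsAlphabetic}+$");

  // True if the token is non-empty and every code point lies in the block.
  static boolean allInBlock(String token, Character.UnicodeBlock block) {
    if (token.isEmpty()) {
      return false;
    }
    return token.codePoints()
        .allMatch(cp -> Character.UnicodeBlock.of(cp) == block);
  }

  public static String classify(String token) {
    // Kana checks come first because \p{IsAlphabetic} also matches kana.
    if (allInBlock(token, Character.UnicodeBlock.HIRAGANA)) {
      return "hira";
    }
    if (allInBlock(token, Character.UnicodeBlock.KATAKANA)) {
      return "kata";
    }
    if (DIGITS.matcher(token).find()) {
      return "digit";
    }
    if (ALPHAS.matcher(token).find()) {
      return "alpha";
    }
    return "other";
  }

  public static void main(String[] args) {
    String[] tokens = {"ひらがな", "カタカナ", "2018", "OpenNLP", "東京", "A-1"};
    for (String t : tokens) {
      System.out.println(t + " -> " + classify(t));
    }
  }
}
```

Note that \p{IsAlphabetic} also matches kanji, so a token such as 東京 is 
labeled "alpha" in this sketch. A separate class for kanji would need its own 
check before the alphabetic one.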



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
