[jira] [Deleted] (OPENNLP-1297) 17.01.2020
[ https://issues.apache.org/jira/browse/OPENNLP-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi deleted OPENNLP-1297: > 17.01.2020 > - > > Key: OPENNLP-1297 > URL: https://issues.apache.org/jira/browse/OPENNLP-1297 > Project: OpenNLP > Issue Type: Dependency upgrade > Environment: 17.01.2020 >Reporter: Simon poortman >Priority: Critical > Labels: Majesty > > 17.01.2020 > [simon_poort...@icloud.com|mailto:simon_poort...@icloud.com] > Majesty > General -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Deleted] (OPENNLP-1295) 15.01.2020 Simon Poortman
[ https://issues.apache.org/jira/browse/OPENNLP-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi deleted OPENNLP-1295: > 15.01.2020 Simon Poortman > > > Key: OPENNLP-1295 > URL: https://issues.apache.org/jira/browse/OPENNLP-1295 > Project: OpenNLP > Issue Type: Dependency > Environment: 15.01.2020 Simon Poortman >Reporter: Simon poortman >Priority: Major > Labels: Majesty, king, secret-elite > > 15.01.2020 Simon Poortman > > Majesty > King > General -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (OPENNLP-852) CountryContextFile should support multiple regexes
[ https://issues.apache.org/jira/browse/OPENNLP-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-852: --- Comment: was deleted (was: A comment with security level 'Administrators' was removed.) > CountryContextFile should support multiple regexes > -- > > Key: OPENNLP-852 > URL: https://issues.apache.org/jira/browse/OPENNLP-852 > Project: OpenNLP > Issue Type: Improvement > Components: Entity Linker >Affects Versions: addons-1.6.0 > Environment: windows 7, any >Reporter: Mark Giaconia >Assignee: Mark Giaconia >Priority: Major > Original Estimate: 20h > Remaining Estimate: 20h > > This will require reindexing all data, and constructing a new file format. > This will be a big improvement in terms of recall of general location context. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Deleted] (OPENNLP-1251) Simon Poortman19-88
[ https://issues.apache.org/jira/browse/OPENNLP-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi deleted OPENNLP-1251: > Simon Poortman19-88 > - > > Key: OPENNLP-1251 > URL: https://issues.apache.org/jira/browse/OPENNLP-1251 > Project: OpenNLP > Issue Type: Bug >Reporter: Simon Poortman >Priority: Critical > Labels: Criminal > > Simon is de Beste -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1224) Use Daemon threads in executor services
[ https://issues.apache.org/jira/browse/OPENNLP-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1224.
-------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.9.1

> Use Daemon threads in executor services
> ---------------------------------------
>
> Key: OPENNLP-1224
> URL: https://issues.apache.org/jira/browse/OPENNLP-1224
> Project: OpenNLP
> Issue Type: Improvement
> Reporter: Edd Spencer
> Assignee: Koji Sekiguchi
> Priority: Major
> Fix For: 1.9.1
>
> Ideally, all executor services would be configured to use daemon threads.
> Then, if the process needs to shut down, it will not wait for these threads
> to complete before doing so (which can take a long time, depending on the
> operation).
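The request in OPENNLP-1224 can be illustrated with a minimal sketch. It is not the change that was committed for 1.9.1; the class `DaemonExecutors` and its method names are hypothetical and use only standard `java.util.concurrent` calls. The idea is to hand the executor a `ThreadFactory` that marks every worker as a daemon, so the JVM can exit without waiting for the pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of OpenNLP): executors backed by daemon threads. */
final class DaemonExecutors {

  private DaemonExecutors() {
  }

  /** Returns a factory that names its threads and marks them as daemon threads. */
  static ThreadFactory daemonThreadFactory(String namePrefix) {
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory defaults = Executors.defaultThreadFactory();
    return runnable -> {
      Thread thread = defaults.newThread(runnable);
      thread.setName(namePrefix + "-" + counter.incrementAndGet());
      // Daemon threads do not keep the JVM alive once all user threads have ended.
      thread.setDaemon(true);
      return thread;
    };
  }

  /** Creates a fixed-size pool whose workers are all daemon threads. */
  static ExecutorService newFixedDaemonPool(int threads, String namePrefix) {
    return Executors.newFixedThreadPool(threads, daemonThreadFactory(namePrefix));
  }
}
```

One trade-off of this design: work still running on a daemon thread is abandoned when the JVM exits, so it suits tasks that are safe to interrupt.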
[jira] [Assigned] (OPENNLP-1224) Use Daemon threads in executor services
[ https://issues.apache.org/jira/browse/OPENNLP-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1224:
---------------------------------------
    Assignee: Koji Sekiguchi

> Use Daemon threads in executor services
> Key: OPENNLP-1224
> URL: https://issues.apache.org/jira/browse/OPENNLP-1224
[jira] [Reopened] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner
[ https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reopened OPENNLP-1214:
-------------------------------------

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses a linear search to
> check whether each character in the sentence is one of the eos characters. I
> think we should keep eosCharacters in a HashSet instead of a char[].
> In line with this replacement, I'd like to deprecate
> getEndOfSentenceCharacters(), because it returns char[] and nothing in OpenNLP
> calls it at present, and add an equivalent method that returns a Set of eos
> chars. A Set cannot keep the order of the eos chars, but I don't think that
> will be a problem.
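The replacement described in OPENNLP-1214 might look roughly like the sketch below. It is an illustration only, not the code that went into DefaultEndOfSentenceScanner; the class `EosScannerSketch` and its methods are hypothetical names. Each character is checked against a `Set<Character>`, so the per-character lookup no longer loops over a `char[]`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a scanner that keeps its eos characters in a HashSet. */
final class EosScannerSketch {

  private final Set<Character> eosCharacters;

  EosScannerSketch(char[] eosChars) {
    Set<Character> set = new HashSet<>();
    for (char c : eosChars) {
      set.add(c);
    }
    this.eosCharacters = Collections.unmodifiableSet(set);
  }

  /** Set-based accessor; as the ticket notes, the original order is not preserved. */
  Set<Character> getEosCharacters() {
    return eosCharacters;
  }

  /** Returns the index of every end-of-sentence character in the text. */
  List<Integer> getPositions(CharSequence text) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      // Hash lookup per character instead of a linear scan over a char[].
      if (eosCharacters.contains(text.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }
}
```

With only a handful of eos characters the gain per lookup is small, which is consistent with the ticket's Minor priority.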
[jira] [Resolved] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner
[ https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1214.
-------------------------------------
    Resolution: Fixed
    Assignee: Koji Sekiguchi

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Fix For: 1.9.1
[jira] [Created] (OPENNLP-1221) FeatureGeneratorUtil.tokenFeature() is too specific for some languages
Koji Sekiguchi created OPENNLP-1221:
---------------------------------------

    Summary: FeatureGeneratorUtil.tokenFeature() is too specific for some languages
    Key: OPENNLP-1221
    URL: https://issues.apache.org/jira/browse/OPENNLP-1221
    Project: OpenNLP
    Issue Type: Improvement
    Affects Versions: 1.9.0
    Reporter: Koji Sekiguchi

As I described in OPENNLP-1197, for the Japanese NER problem we usually use only DIGIT, HIRA (あ, い, う, え, お etc.), KATA (ア, イ, ウ, エ, オ etc.), ALPHA and OTHER as token classes. The classes that FeatureGeneratorUtil.tokenFeature() provides at present are too specific. For example, I don't need to distinguish among lc (lowercase alphabet), ac (all capital letters) and ic (initial capital letter).

As a trial, I applied the following patch to avoid "too specific token class generation":

{code}
diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
index e6b8af95..405938d1 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
@@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
 
   private static final String TOKEN_AND_CLASS_PREFIX = "w";
   private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
+  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
+  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
 
   /**
    * Generates a class name for the specified token.
@@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
     else if (pattern.isAllKatakana()) {
       feat = "jak";
     }
-    else if (pattern.isAllLowerCaseLetter()) {
-      feat = "lc";
+    else if (pDigit.matcher(token).find()) {
+      feat = "digit";
     }
-    else if (pattern.digits() == 2) {
-      feat = "2d";
-    }
-    else if (pattern.digits() == 4) {
-      feat = "4d";
-    }
-    else if (pattern.containsDigit()) {
-      if (pattern.containsLetters()) {
-        feat = "an";
-      }
-      else if (pattern.containsHyphen()) {
-        feat = "dd";
-      }
-      else if (pattern.containsSlash()) {
-        feat = "ds";
-      }
-      else if (pattern.containsComma()) {
-        feat = "dc";
-      }
-      else if (pattern.containsPeriod()) {
-        feat = "dp";
-      }
-      else {
-        feat = "num";
-      }
-    }
-    else if (pattern.isAllCapitalLetter()) {
-      if (token.length() == 1) {
-        feat = "sc";
-      }
-      else {
-        feat = "ac";
-      }
-    }
-    else if (capPeriod.matcher(token).find()) {
-      feat = "cp";
-    }
-    else if (pattern.isInitialCapitalLetter()) {
-      feat = "ic";
+    else if (pAlpha.matcher(token).find()) {
+      feat = "alpha";
     }
     else {
       feat = "other";
{code}

With this patch, total F1 increased from 82.00% to 82.13%. The gain may be trivial, but I think there is still a lot of room to tune and improve the performance.

Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs that generate simpler token classes to japanese-addon.
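The five coarse classes named in OPENNLP-1221 (DIGIT, HIRA, KATA, ALPHA, OTHER) could be produced by something like the sketch below. This is an assumption-laden illustration, not the code later added to japanese-addon: the class name, the returned labels and the use of the Unicode blocks `InHiragana` and `InKatakana` are all choices made for this example. The digit and alphabetic patterns follow the patch above.

```java
import java.util.regex.Pattern;

/** Hypothetical coarse token classifier in the spirit of OPENNLP-1221. */
final class SimpleTokenClassSketch {

  // matches() requires the whole token to match, so no ^...$ anchors are needed.
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern HIRA = Pattern.compile("\\p{InHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{InKatakana}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");

  private SimpleTokenClassSketch() {
  }

  /** Returns one of "digit", "hira", "kata", "alpha" or "other". */
  static String tokenClass(String token) {
    if (DIGIT.matcher(token).matches()) {
      return "digit";
    }
    // Kana are also alphabetic, so they must be tested before ALPHA.
    if (HIRA.matcher(token).matches()) {
      return "hira";
    }
    if (KATA.matcher(token).matches()) {
      return "kata";
    }
    if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    return "other";
  }
}
```

Note that `\p{IsAlphabetic}` also matches kanji, so in this sketch an all-kanji token falls into "alpha"; a real implementation may want a separate class for it.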
[jira] [Resolved] (OPENNLP-1201) add bailout way for certain languages in order to use POS features
[ https://issues.apache.org/jira/browse/OPENNLP-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1201.
-------------------------------------
    Resolution: Fixed
    Assignee: Koji Sekiguchi

This feature has been added to opennlp-addons. Thanks!

> add bailout way for certain languages in order to use POS features
> ------------------------------------------------------------------
>
> Key: OPENNLP-1201
> URL: https://issues.apache.org/jira/browse/OPENNLP-1201
> Project: OpenNLP
> Issue Type: Improvement
> Components: Command Line Interface, Formats
> Affects Versions: 1.8.4
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Major
>
> OpenNLP tools assume that the text being processed has been tokenized in
> advance (in other words, that the words in the text are separated from each
> other by spaces), so it is difficult for users of certain languages (e.g.
> CJK) to use POS (Part-of-Speech) features.
>
> To simplify the explanation, consider using NameFinder for Japanese text.
> The NameFinder tools (Train, Eval, Recognize) require users to provide
> Japanese text that has already been tokenized, but once we tokenize Japanese
> text, it loses its POS information. (I think Chinese has the same problem.)
>
> Let me describe this problem for users of western languages :) (English,
> French, Italian, etc.) without using Japanese letters. I’ll use the English
> alphabet instead.
>
> Suppose you have the sentence text “isentthemachine”, which you want to give
> to NameFinder, and you use a morphological analyzer to tokenize it. There are
> two possible token sequences:
> - i (PPSS) / sent (VBD) / the (AT) / machine (NP)
> - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)
>
> As you can see, the morphological analyzer not only tokenizes the sentence
> but also assigns a POS tag to each token. The same thing happens in Japanese
> (and in Chinese, I think).
>
> However, the OpenNLP feature generator API accepts only a sequence of tokens,
> i.e. `String[] tokens`, so I cannot produce POS features in the feature
> generator.
>
> To solve this problem (and to invite many users to our community), I’d like
> to suggest that OpenNLP tools allow users to add optional information to each
> tokenized word.
>
> For example, one can give the following text when using NameFinder tools.
> {code}
> $ cat en-ner.train
> I/PPSS sent/VBD the/AT machine/NP
> {code}
>
> When using such text, users must tell the tool that the tokens carry POS tags
> by using a certain option, e.g. -postag
> {code}
> $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
> {code}
>
> We can maintain backward compatibility by setting -postag to false by
> default; in that case, existing feature generators work exactly as before.
> If a user sets the -postag option on the command line, the existing feature
> generators strip the “/POS” part from each “word/POS” token in the text so
> that they produce the same features as before.
>
> In addition to managing the -postag option, I’d like to add a simple feature
> generator that generates only the “POS” part of each “word/POS” token. This
> simple feature generator allows Japanese/Chinese users to produce precise POS
> features.
>
> I’d like to focus on NameFinder in this ticket (let me add this option to
> other tools (chunker, classifier, etc.) in another ticket, if needed).
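The "word/POS" convention proposed in OPENNLP-1201 can be sketched as follows. This is not the implementation that went into opennlp-addons; `PosTaggedTokens` and its `parse` method are hypothetical names, and the choice to split on the last slash and to use an empty tag when none is present is an assumption of this example.

```java
/** Hypothetical holder for a line of "word/POS" tokens, split into words and tags. */
final class PosTaggedTokens {

  final String[] words;
  final String[] tags;

  private PosTaggedTokens(String[] words, String[] tags) {
    this.words = words;
    this.tags = tags;
  }

  /** Parses a whitespace-separated line such as "I/PPSS sent/VBD the/AT machine/NP". */
  static PosTaggedTokens parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return new PosTaggedTokens(new String[0], new String[0]);
    }
    String[] tokens = trimmed.split("\\s+");
    String[] words = new String[tokens.length];
    String[] tags = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      // Split on the last slash so that a word containing '/' keeps its own slashes.
      int slash = tokens[i].lastIndexOf('/');
      if (slash > 0) {
        words[i] = tokens[i].substring(0, slash);
        tags[i] = tokens[i].substring(slash + 1);
      } else {
        // Assumption: a token without a tag is kept as-is with an empty tag.
        words[i] = tokens[i];
        tags[i] = "";
      }
    }
    return new PosTaggedTokens(words, tags);
  }
}
```

Here `words` corresponds to what existing feature generators would see after the "/POS" part is stripped, and `tags` to what a POS-only feature generator would emit.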
[jira] [Resolved] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator
[ https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1219.
-------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.9.1

> change private instance variable featureGenerators to protected in DefaultNameContextGenerator
> ----------------------------------------------------------------------------------------------
>
> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
> TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via
> the -factory option. Because I want to override
> DefaultNameContextGenerator.getContext(), I created a subclass of
> TokenNameFinderFactory and, in its constructor, created an instance of my
> subclass of DefaultNameContextGenerator.
> However, I couldn't implement the getContext() method of my
> DefaultNameContextGenerator subclass, because I couldn't access the private
> member featureGenerators.
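The visibility problem in OPENNLP-1219 is plain Java and can be shown without OpenNLP. In the sketch below, both classes are hypothetical stand-ins (they are not DefaultNameContextGenerator or its real signature): because the field is `protected`, the subclass can reuse the generators inside its overridden `getContext()`. With `private`, the marked loop would not compile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical base class standing in for a context generator. */
class BaseContextGeneratorSketch {

  // protected (rather than private) so that subclasses can reuse the generators.
  protected final List<Function<String, String>> featureGenerators;

  BaseContextGeneratorSketch(List<Function<String, String>> featureGenerators) {
    this.featureGenerators = featureGenerators;
  }

  List<String> getContext(String token) {
    List<String> features = new ArrayList<>();
    for (Function<String, String> generator : featureGenerators) {
      features.add(generator.apply(token));
    }
    return features;
  }
}

/** Hypothetical subclass that adds its own feature before the inherited ones. */
class CustomContextGeneratorSketch extends BaseContextGeneratorSketch {

  CustomContextGeneratorSketch(List<Function<String, String>> featureGenerators) {
    super(featureGenerators);
  }

  @Override
  List<String> getContext(String token) {
    List<String> features = new ArrayList<>();
    features.add("len=" + token.length());
    // This loop compiles only because featureGenerators is visible to subclasses.
    for (Function<String, String> generator : featureGenerators) {
      features.add(generator.apply(token));
    }
    return features;
  }
}
```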
[jira] [Assigned] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator
[ https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1219:
---------------------------------------
    Assignee: Koji Sekiguchi

> change private instance variable featureGenerators to protected in DefaultNameContextGenerator
> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
[jira] [Updated] (OPENNLP-1219) change private instance variable featureGenerators to protected in DefaultNameContextGenerator
[ https://issues.apache.org/jira/browse/OPENNLP-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated OPENNLP-1219:
------------------------------------
    Summary: change private instance variable featureGenerators to protected in DefaultNameContextGenerator (was: change instance variable featureGenerators to protected in DefaultNameContextGenerator)

> Key: OPENNLP-1219
> URL: https://issues.apache.org/jira/browse/OPENNLP-1219
[jira] [Created] (OPENNLP-1219) change instance variable featureGenerators to protected in DefaultNameContextGenerator
Koji Sekiguchi created OPENNLP-1219: --- Summary: change instance variable featureGenerators to protected in DefaultNameContextGenerator Key: OPENNLP-1219 URL: https://issues.apache.org/jira/browse/OPENNLP-1219 Project: OpenNLP Issue Type: Improvement Affects Versions: 1.9.0 Reporter: Koji Sekiguchi TokenNameFinderTrainer allows users to customize TokenNameFinderFactory via -factory option. As I want to override DefaultNameContextGenerator.getContext(), I made the sub-class of TokenNameFinderFactory and created an instance of the sub-class of DefaultNameContextGenerator in the constructor of my TokenNameFinderFactory. However, I couldn't implement getContext() method of my DefaultNameContextGenerator because I couldn't access private member featureGenerators. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1216) opennlp command should allow users to set heap size
[ https://issues.apache.org/jira/browse/OPENNLP-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1216.
-------------------------------------
    Resolution: Fixed
    Assignee: Koji Sekiguchi

> opennlp command should allow users to set heap size
> ---------------------------------------------------
>
> Key: OPENNLP-1216
> URL: https://issues.apache.org/jira/browse/OPENNLP-1216
> Project: OpenNLP
> Issue Type: Documentation
> Components: Command Line Interface
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
> When I used ParserTrainer, I got an OutOfMemoryError. I checked the opennlp
> shell script and found that users cannot change the heap size without
> editing the script.
> I think we should allow users to set it like this:
> {code}
> $ JAVA_HEAP=4096m opennlp ParserTrainer ...
> {code}
[jira] [Resolved] (OPENNLP-1217) opennlp Parser can take only one model file
[ https://issues.apache.org/jira/browse/OPENNLP-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1217. - Resolution: Fixed Assignee: Koji Sekiguchi > opennlp Parser can take only one model file > --- > > Key: OPENNLP-1217 > URL: https://issues.apache.org/jira/browse/OPENNLP-1217 > Project: OpenNLP > Issue Type: Documentation >Affects Versions: 1.9.0 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.9.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1215) ParserTrainer's option -head-rules in the document should be -headRules
[ https://issues.apache.org/jira/browse/OPENNLP-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1215.
-------------------------------------
    Resolution: Fixed
    Assignee: Koji Sekiguchi

> ParserTrainer's option -head-rules in the document should be -headRules
> -----------------------------------------------------------------------
>
> Key: OPENNLP-1215
> URL: https://issues.apache.org/jira/browse/OPENNLP-1215
> Project: OpenNLP
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Trivial
> Fix For: 1.9.1
>
> The documentation has a section that describes the option as -head-rules,
> so I tried to execute:
> {code}
> opennlp ParserTrainer -model en-parser.bin -data en-parser.train -head-rules opennlp-tools/lang/en/parser/en-head_rules -lang en
> {code}
> and I got the error `Missing mandatory parameter: -headRules`
[jira] [Created] (OPENNLP-1217) opennlp Parser can take only one model file
Koji Sekiguchi created OPENNLP-1217: --- Summary: opennlp Parser can take only one model file Key: OPENNLP-1217 URL: https://issues.apache.org/jira/browse/OPENNLP-1217 Project: OpenNLP Issue Type: Documentation Affects Versions: 1.9.0 Reporter: Koji Sekiguchi Fix For: 1.9.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1216) opennlp command should allow users to set heap size
Koji Sekiguchi created OPENNLP-1216:
---------------------------------------

    Summary: opennlp command should allow users to set heap size
    Key: OPENNLP-1216
    URL: https://issues.apache.org/jira/browse/OPENNLP-1216
    Project: OpenNLP
    Issue Type: Documentation
    Components: Command Line Interface
    Affects Versions: 1.9.0
    Reporter: Koji Sekiguchi
    Fix For: 1.9.1

When I used ParserTrainer, I got an OutOfMemoryError. I checked the opennlp shell script and found that users cannot change the heap size without editing the script.

I think we should allow users to set it like this:

{code}
$ JAVA_HEAP=4096m opennlp ParserTrainer ...
{code}
[jira] [Created] (OPENNLP-1215) ParserTrainer's option -head-rules in the document should be -headRules
Koji Sekiguchi created OPENNLP-1215:
---------------------------------------

    Summary: ParserTrainer's option -head-rules in the document should be -headRules
    Key: OPENNLP-1215
    URL: https://issues.apache.org/jira/browse/OPENNLP-1215
    Project: OpenNLP
    Issue Type: Documentation
    Components: Documentation
    Affects Versions: 1.9.0
    Reporter: Koji Sekiguchi
    Fix For: 1.9.1

The documentation has a section that describes the option as -head-rules, so I tried to execute:

{code}
opennlp ParserTrainer -model en-parser.bin -data en-parser.train -head-rules opennlp-tools/lang/en/parser/en-head_rules -lang en
{code}

and I got the error `Missing mandatory parameter: -headRules`
[jira] [Resolved] (OPENNLP-1213) Use ja for Japanese language code rather than jp
[ https://issues.apache.org/jira/browse/OPENNLP-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1213.
-------------------------------------
    Resolution: Fixed

> Use ja for Japanese language code rather than jp
> ------------------------------------------------
>
> Key: OPENNLP-1213
> URL: https://issues.apache.org/jira/browse/OPENNLP-1213
> Project: OpenNLP
> Issue Type: Bug
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
> It seems that the sentdetect Factory uses "jp" as the Japanese language
> code, but I think that is a country code. Let's use "ja" instead.
> We could leave "jp" in place for back-compat, but I don't think we need to.
> So I'll just replace "jp" with "ja" in the patch.
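The distinction behind OPENNLP-1213 can be checked with the JDK alone. The sketch below does not touch OpenNLP; `LanguageCodeSketch` is a hypothetical helper that looks a code up in the ISO 639 language list and the ISO 3166 country list exposed by `java.util.Locale`.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper that tells ISO 639 language codes from ISO 3166 country codes. */
final class LanguageCodeSketch {

  private LanguageCodeSketch() {
  }

  /** True if the code is a two-letter ISO 639 language code known to the JDK. */
  static boolean isIsoLanguage(String code) {
    return Arrays.asList(Locale.getISOLanguages()).contains(code.toLowerCase(Locale.ROOT));
  }

  /** True if the code is a two-letter ISO 3166 country code known to the JDK. */
  static boolean isIsoCountry(String code) {
    return Arrays.asList(Locale.getISOCountries()).contains(code.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // Locale.JAPAN carries language "ja" and country "JP".
    System.out.println(Locale.JAPAN.getLanguage() + " / " + Locale.JAPAN.getCountry());
    System.out.println("ja is a language code: " + isIsoLanguage("ja"));
    System.out.println("jp is a language code: " + isIsoLanguage("jp"));
    System.out.println("jp is a country code: " + isIsoCountry("jp"));
  }
}
```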
[jira] [Assigned] (OPENNLP-1213) Use ja for Japanese language code rather than jp
[ https://issues.apache.org/jira/browse/OPENNLP-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1213:
---------------------------------------
    Assignee: Koji Sekiguchi

> Use ja for Japanese language code rather than jp
> Key: OPENNLP-1213
> URL: https://issues.apache.org/jira/browse/OPENNLP-1213
[jira] [Resolved] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
[ https://issues.apache.org/jira/browse/OPENNLP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1212.
-------------------------------------
    Resolution: Fixed

> TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
> -------------------------------------------------------------------
>
> Key: OPENNLP-1212
> URL: https://issues.apache.org/jira/browse/OPENNLP-1212
> Project: OpenNLP
> Issue Type: Bug
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
> TokenFeatureGenerator accepts a lowercase flag, but
> TokenFeatureGeneratorFactory doesn't allow us to set it, so
> TokenFeatureGenerator always returns lowercase tokens.
[jira] [Resolved] (OPENNLP-1211) Improve WindowFeatureGeneratorTest
[ https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1211.
-------------------------------------
    Resolution: Fixed

> Improve WindowFeatureGeneratorTest
> ----------------------------------
>
> Key: OPENNLP-1211
> URL: https://issues.apache.org/jira/browse/OPENNLP-1211
> Project: OpenNLP
> Issue Type: Test
> Components: Build, Packaging and Test
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Trivial
> Fix For: 1.9.1
>
> I'd like to improve WindowFeatureGeneratorTest in the following respects:
> * testWindowSizeOne should check the contents of the returned features. At
> present it checks only the length of the features.
> * Most of the test methods pass the arguments of
> Assert.assertEquals(expected, actual) in the opposite order when checking
> the contents of the returned features.
> {code}
> Assert.assertEquals(features.get(0), testSentence[testTokenIndex]);
> {code}
> should be
> {code}
> Assert.assertEquals(testSentence[testTokenIndex], features.get(0));
> {code}
> * Beyond the argument order in assertEquals() pointed out above, I think we
> should use an exact concrete string for the expected value rather than an
> expression such as testSentence[testTokenIndex]. Also,
> testForCorrectFeatures uses the contains method when checking the contents
> of the returned features, but I think we should avoid contains when checking
> the items in a List. Rather than writing this:
> {code}
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" +
>     testSentence[testTokenIndex - 2]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" +
>     testSentence[testTokenIndex - 1]));
> Assert.assertTrue(features.contains(testSentence[testTokenIndex]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" +
>     testSentence[testTokenIndex + 1]));
> Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" +
>     testSentence[testTokenIndex + 2]));
> {code}
> I'd like to rewrite them like this:
> {code}
> Assert.assertEquals("d",features.get(0));
> Assert.assertEquals("p1c",features.get(1));
> Assert.assertEquals("p2b",features.get(2));
> Assert.assertEquals("n1e",features.get(3));
> Assert.assertEquals("n2f",features.get(4));
> {code}
> The second form helps us understand how WindowFeatureGenerator works, and
> it is easier to read.
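To make the expected values in OPENNLP-1211 easier to follow, here is a small imitation of a window feature generator. It is inferred only from the five expected strings in the ticket and is not OpenNLP's WindowFeatureGenerator; the class name, the method signature and the example sentence "a b c d e f g h" are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical imitation of a window feature generator, inferred from the ticket. */
final class WindowFeatureSketch {

  private WindowFeatureSketch() {
  }

  /**
   * Emits the current token, then up to prevWindow previous tokens prefixed with
   * "p1", "p2", ..., then up to nextWindow following tokens prefixed with "n1", "n2", ...
   */
  static List<String> windowFeatures(String[] tokens, int index, int prevWindow, int nextWindow) {
    List<String> features = new ArrayList<>();
    features.add(tokens[index]);
    for (int i = 1; i <= prevWindow && index - i >= 0; i++) {
      features.add("p" + i + tokens[index - i]);
    }
    for (int i = 1; i <= nextWindow && index + i < tokens.length; i++) {
      features.add("n" + i + tokens[index + i]);
    }
    return features;
  }

  public static void main(String[] args) {
    String[] sentence = {"a", "b", "c", "d", "e", "f", "g", "h"};
    // Token index 3 is "d"; the window covers two tokens on each side.
    System.out.println(windowFeatures(sentence, 3, 2, 2));
  }
}
```

Because the output order is fixed, a test can compare each position with a concrete string, which is the style the ticket argues for over `contains`.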
[jira] [Assigned] (OPENNLP-1211) Improve WindowFeatureGeneratorTest
[ https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1211:
---------------------------------------
    Assignee: Koji Sekiguchi

> Improve WindowFeatureGeneratorTest
> Key: OPENNLP-1211
> URL: https://issues.apache.org/jira/browse/OPENNLP-1211
[jira] [Assigned] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
[ https://issues.apache.org/jira/browse/OPENNLP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1212:
---------------------------------------
    Assignee: Koji Sekiguchi

> TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
> Key: OPENNLP-1212
> URL: https://issues.apache.org/jira/browse/OPENNLP-1212
[jira] [Created] (OPENNLP-1214) use hash to avoid linear search in DefaultEndOfSentenceScanner
Koji Sekiguchi created OPENNLP-1214:
---------------------------------------

    Summary: use hash to avoid linear search in DefaultEndOfSentenceScanner
    Key: OPENNLP-1214
    URL: https://issues.apache.org/jira/browse/OPENNLP-1214
    Project: OpenNLP
    Issue Type: Improvement
    Affects Versions: 1.9.0
    Reporter: Koji Sekiguchi
    Fix For: 1.9.1

When DefaultEndOfSentenceScanner scans a sentence, it uses a linear search to check whether each character in the sentence is one of the eos characters. I think we should keep eosCharacters in a HashSet instead of a char[].

In line with this replacement, I'd like to deprecate getEndOfSentenceCharacters(), because it returns char[] and nothing in OpenNLP calls it at present, and add an equivalent method that returns a Set of eos chars. A Set cannot keep the order of the eos chars, but I don't think that will be a problem.
[jira] [Created] (OPENNLP-1213) Use ja for Japanese language code rather than jp
Koji Sekiguchi created OPENNLP-1213: --- Summary: Use ja for Japanese language code rather than jp Key: OPENNLP-1213 URL: https://issues.apache.org/jira/browse/OPENNLP-1213 Project: OpenNLP Issue Type: Bug Affects Versions: 1.9.0 Reporter: Koji Sekiguchi Fix For: 1.9.1 It seems that the Factory of sentdetect uses "jp" as the Japanese language code, but I think "jp" is a country code. Let's use "ja" instead. We could keep "jp" for back-compat, but I don't think we need to, so I'll just replace "jp" with "ja" in the patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory
[ https://issues.apache.org/jira/browse/OPENNLP-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-1206: Fix Version/s: 1.9.1 > add TrigramNameFeatureGeneratorFactory > -- > > Key: OPENNLP-1206 > URL: https://issues.apache.org/jira/browse/OPENNLP-1206 > Project: OpenNLP > Issue Type: Task > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 1.9.1 > > > Surprisingly, it's missing. :) I noticed it when I tried to use it in my > feature generator XML. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1212) TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag
Koji Sekiguchi created OPENNLP-1212: --- Summary: TokenFeatureGeneratorFactory doesn't allow us to set lowercase flag Key: OPENNLP-1212 URL: https://issues.apache.org/jira/browse/OPENNLP-1212 Project: OpenNLP Issue Type: Bug Affects Versions: 1.9.0 Reporter: Koji Sekiguchi Fix For: 1.9.1 TokenFeatureGenerator can accept a lowercase flag, but TokenFeatureGeneratorFactory doesn't allow us to set it, so TokenFeatureGenerator always returns lowercase tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
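To make the problem in OPENNLP-1212 concrete, here is a small sketch of a generator with a lowercase flag and a factory-style method that passes the flag through. Everything here is assumed for illustration: TokenFeatureSketch, the "w=" feature prefix, the "lowercase" property name, and the default of true are not taken from the OpenNLP source.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in; not OpenNLP's TokenFeatureGenerator or its factory.
class TokenFeatureSketch {
  private final boolean lowercase;

  TokenFeatureSketch(boolean lowercase) {
    this.lowercase = lowercase;
  }

  // Emits one feature for the token, lowercased only when the flag is set.
  String feature(String token) {
    return "w=" + (lowercase ? token.toLowerCase(Locale.ROOT) : token);
  }

  // Factory-style construction: reads an assumed "lowercase" property and
  // defaults to true when it is absent, so existing configs keep their behavior.
  static TokenFeatureSketch fromProperties(Map<String, String> props) {
    String value = props.get("lowercase");
    boolean lower = value == null || Boolean.parseBoolean(value);
    return new TokenFeatureSketch(lower);
  }
}
```

If the factory ignores the property and always builds the generator with the flag on, the second case below can never be reached, which is the situation the ticket describes.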
[jira] [Updated] (OPENNLP-1211) Improve WindowFeatureGeneratorTest
[ https://issues.apache.org/jira/browse/OPENNLP-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-1211: Description: I'd like to improve WindowFeatureGeneratorTest from the following perspective: * testWindowSizeOne should check the contents of the returned features. It checks the length of the features only now * most of test methods uses Assert.assertEquals(expected, actual) in opposite way for its arguments when checking the contents of the returned features {code} Assert.assertEquals(features.get(0), testSentence[testTokenIndex]); {code} should be {code} Assert.assertEquals(testSentence[testTokenIndex], features.get(0)); {code} * Though I pointed out the arguments in assertEquals() above, I think we'd better use exact concrete string rather than expression such like testSentence[testTokenIndex] for the expected. And also, testForCorrectFeatures uses contains method when checking the contents of the returned features but I think we should avoid using contains when checking the items in a List, rather than writing like this: {code} Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" + testSentence[testTokenIndex - 2])); Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" + testSentence[testTokenIndex - 1])); Assert.assertTrue(features.contains(testSentence[testTokenIndex])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" + testSentence[testTokenIndex + 1])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" + testSentence[testTokenIndex + 2])); {code} but I'd like to rewrite them like this: {code} Assert.assertEquals("d",features.get(0)); Assert.assertEquals("p1c",features.get(1)); Assert.assertEquals("p2b",features.get(2)); Assert.assertEquals("n1e",features.get(3)); Assert.assertEquals("n2f",features.get(4)); {code} The second form helps us to understand how WindowFeatureGenerator works and it's easier 
to read. was: I'd like to improve WindowFeatureGeneratorTest from the following perspective: * testWindowSizeOne should check the contents of the returned features. It checks the length of the features only now * most of test methods uses Assert.assertEquals(expected, actual) in opposite way for its arguments when checking the contents of the returned features {code} Assert.assertEquals(features.get(0), testSentence[testTokenIndex]); {code} should be {code} Assert.assertEquals(testSentence[testTokenIndex], features.get(0)); {code} * Though I pointed out the arguments in assertEquals() above, I think we'd better use exact concrete string rather than expression such like testSentence[testTokenIndex] for the expected. And also, testForCorrectFeatures uses contains method when checking the contents of the returned features but I think we should avoid using contains when checking the items in a List, rather than writing like this: {code} Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" + testSentence[testTokenIndex - 2])); Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" + testSentence[testTokenIndex - 1])); Assert.assertTrue(features.contains(testSentence[testTokenIndex])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" + testSentence[testTokenIndex + 1])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" + testSentence[testTokenIndex + 2])); {code} but I'd like to rewrite them like this: {code} Assert.assertTrue("d",features.get(0)); Assert.assertTrue("p1c",features.get(1)); Assert.assertTrue("p2b",features.get(2)); Assert.assertTrue("n1e",features.get(3)); Assert.assertTrue("n2f",features.get(4)); {code} The second form helps us to understand how WindowFeatureGenerator works and it's easier to read. 
> Improve WindowFeatureGeneratorTest > -- > > Key: OPENNLP-1211 > URL: https://issues.apache.org/jira/browse/OPENNLP-1211 > Project: OpenNLP > Issue Type: Test > Components: Build, Packaging and Test >Affects Versions: 1.9.0 >Reporter: Koji Sekiguchi >Priority: Trivial > Fix For: 1.9.1 > > > I'd like to improve WindowFeatureGeneratorTest from the following perspective: > * testWindowSizeOne should check the contents of the returned features. It > checks the length of the features only now > * most of test methods uses Assert.assertEquals(expected, actual) in opposite > way for its arguments when checking the contents of the returned features > {code} > Assert.assertEquals(features.get(0), testSentence[testTokenIndex]); > {code} > should be > {code} >
[jira] [Created] (OPENNLP-1211) Improve WindowFeatureGeneratorTest
Koji Sekiguchi created OPENNLP-1211: --- Summary: Improve WindowFeatureGeneratorTest Key: OPENNLP-1211 URL: https://issues.apache.org/jira/browse/OPENNLP-1211 Project: OpenNLP Issue Type: Test Components: Build, Packaging and Test Affects Versions: 1.9.0 Reporter: Koji Sekiguchi Fix For: 1.9.1 I'd like to improve WindowFeatureGeneratorTest from the following perspective: * testWindowSizeOne should check the contents of the returned features. It checks the length of the features only now * most of test methods uses Assert.assertEquals(expected, actual) in opposite way for its arguments when checking the contents of the returned features {code} Assert.assertEquals(features.get(0), testSentence[testTokenIndex]); {code} should be {code} Assert.assertEquals(testSentence[testTokenIndex], features.get(0)); {code} * Though I pointed out the arguments in assertEquals() above, I think we'd better use exact concrete string rather than expression such like testSentence[testTokenIndex] for the expected. 
And also, testForCorrectFeatures uses contains method when checking the contents of the returned features but I think we should avoid using contains when checking the items in a List, rather than writing like this: {code} Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "2" + testSentence[testTokenIndex - 2])); Assert.assertTrue(features.contains(WindowFeatureGenerator.PREV_PREFIX + "1" + testSentence[testTokenIndex - 1])); Assert.assertTrue(features.contains(testSentence[testTokenIndex])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "1" + testSentence[testTokenIndex + 1])); Assert.assertTrue(features.contains(WindowFeatureGenerator.NEXT_PREFIX + "2" + testSentence[testTokenIndex + 2])); {code} but I'd like to rewrite them like this: {code} Assert.assertTrue("d",features.get(0)); Assert.assertTrue("p1c",features.get(1)); Assert.assertTrue("p2b",features.get(2)); Assert.assertTrue("n1e",features.get(3)); Assert.assertTrue("n2f",features.get(4)); {code} The second form helps us to understand how WindowFeatureGenerator works and it's easier to read. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
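The point about asserting concrete strings by position can be tried out with a simplified stand-in for the generator. This is a sketch under assumptions: WindowFeatureSketch is hypothetical, and the "p"/"n" prefixes and the feature order (token, previous tokens, next tokens) are inferred from the expected values listed in the ticket, not from the OpenNLP source.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for illustration; not OpenNLP's WindowFeatureGenerator.
class WindowFeatureSketch {
  static final String PREV_PREFIX = "p"; // assumed value
  static final String NEXT_PREFIX = "n"; // assumed value

  // Emits the token itself, then up to prevSize previous tokens,
  // then up to nextSize next tokens, each tagged with prefix and distance.
  static List<String> windowFeatures(String[] tokens, int index, int prevSize, int nextSize) {
    List<String> features = new ArrayList<>();
    features.add(tokens[index]);
    for (int i = 1; i <= prevSize; i++) {
      if (index - i >= 0) {
        features.add(PREV_PREFIX + i + tokens[index - i]);
      }
    }
    for (int i = 1; i <= nextSize; i++) {
      if (index + i < tokens.length) {
        features.add(NEXT_PREFIX + i + tokens[index + i]);
      }
    }
    return features;
  }
}
```

For the tokens a to h with index 3 and a window of two on each side, this yields d, p1c, p2b, n1e, n2f in that order. A contains-based check would also pass if the order were shuffled or extra features were present, while per-index equality checks catch both.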
[jira] [Resolved] (OPENNLP-1210) Outdated documentation on -lang argument?
[ https://issues.apache.org/jira/browse/OPENNLP-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1210. - Resolution: Fixed Assignee: Koji Sekiguchi Fix Version/s: 1.9.1 Thanks Xiang Ji! :) > Outdated documentation on -lang argument? > - > > Key: OPENNLP-1210 > URL: https://issues.apache.org/jira/browse/OPENNLP-1210 > Project: OpenNLP > Issue Type: Bug >Reporter: Xiang Ji >Assignee: Koji Sekiguchi >Priority: Major > Fix For: 1.9.1 > > > I encountered "Unsupported language: en" error when I was trying to run the > `TokenNameFinderConverter` or the `{{TokenNameFinderTrainer}}`. > > I'm not sure if I understood the bug correctly but it seems that after 2 > hours of trying, I found out that apparently in a certain version after > `1.5.3`, OpenNLP changed the language codes from two characters to three > characters, i.e. one should have passed in `eng` instead of `en`. But the > documentation was never updated on this and no meaningful error message was > given (i.e. the program didn't suggest "supported languages" instead). > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory
[ https://issues.apache.org/jira/browse/OPENNLP-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1206. - Resolution: Fixed > add TrigramNameFeatureGeneratorFactory > -- > > Key: OPENNLP-1206 > URL: https://issues.apache.org/jira/browse/OPENNLP-1206 > Project: OpenNLP > Issue Type: Task > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > Surprisingly, it's missing. :) I noticed it when I tried to use it in my > feature generator XML. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1206) add TrigramNameFeatureGeneratorFactory
Koji Sekiguchi created OPENNLP-1206: --- Summary: add TrigramNameFeatureGeneratorFactory Key: OPENNLP-1206 URL: https://issues.apache.org/jira/browse/OPENNLP-1206 Project: OpenNLP Issue Type: Task Components: Machine Learning Affects Versions: 1.8.4 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Surprisingly, it's missing. :) I noticed it when I tried to use it in my feature generator XML. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (OPENNLP-1205) use new XML format of feature generator in OntoNotes4NameFinderEval
[ https://issues.apache.org/jira/browse/OPENNLP-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi closed OPENNLP-1205. --- Resolution: Invalid I'm sorry I saw 1.8 source. This has been done already. Closing as invalid. > use new XML format of feature generator in OntoNotes4NameFinderEval > --- > > Key: OPENNLP-1205 > URL: https://issues.apache.org/jira/browse/OPENNLP-1205 > Project: OpenNLP > Issue Type: Task > Components: Build, Packaging and Test >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1205) use new XML format of feature generator in OntoNotes4NameFinderEval
Koji Sekiguchi created OPENNLP-1205: --- Summary: use new XML format of feature generator in OntoNotes4NameFinderEval Key: OPENNLP-1205 URL: https://issues.apache.org/jira/browse/OPENNLP-1205 Project: OpenNLP Issue Type: Task Components: Build, Packaging and Test Affects Versions: 1.8.4 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
[ https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1197. - Resolution: Fixed Fix Version/s: 1.9.0 > FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words > -- > > Key: OPENNLP-1197 > URL: https://issues.apache.org/jira/browse/OPENNLP-1197 > Project: OpenNLP > Issue Type: Bug > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Major > Fix For: 1.9.0 > > > FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" > (lower case). It looks a bug to me because they're not lower case letters, > but other than that, it seems that FeatureGeneratorUtil.tokenFeature() takes > care only Europe/American languages. > For example, in Japanese NER problem, typical token classes are as follows: > - DIGIT > - HIRA : あ, い, う, え, お etc. > - KATA : ア, イ, ウ, エ, オ etc. > - ALPHA : we don't need to distinguish lower/upper case > - OTHER > I think it's possible that we get FeatureGeneratorUtil.tokenFeature() to have > additional token classes I mentioned above, but later on, someone who comes > from Asia and may claim similar thing. > I'd like to make FeatureGeneratorUtil plugable, but I don't have any idea now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
[ https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reopened OPENNLP-1197: - After applying this patch, the Eval tests, which are not run via mvn test, no longer pass. I'm reopening this to investigate. > FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words > -- > > Key: OPENNLP-1197 > URL: https://issues.apache.org/jira/browse/OPENNLP-1197 > Project: OpenNLP > Issue Type: Bug > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Major > > FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" > (lower case). It looks a bug to me because they're not lower case letters, > but other than that, it seems that FeatureGeneratorUtil.tokenFeature() takes > care only Europe/American languages. > For example, in Japanese NER problem, typical token classes are as follows: > - DIGIT > - HIRA : あ, い, う, え, お etc. > - KATA : ア, イ, ウ, エ, オ etc. > - ALPHA : we don't need to distinguish lower/upper case > - OTHER > I think it's possible that we get FeatureGeneratorUtil.tokenFeature() to have > additional token classes I mentioned above, but later on, someone who comes > from Asia and may claim similar thing. > I'd like to make FeatureGeneratorUtil plugable, but I don't have any idea now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1201) add bailout way for certain languages in order to use POS features
Koji Sekiguchi created OPENNLP-1201: --- Summary: add bailout way for certain languages in order to use POS features Key: OPENNLP-1201 URL: https://issues.apache.org/jira/browse/OPENNLP-1201 Project: OpenNLP Issue Type: Improvement Components: Command Line Interface, Formats Affects Versions: 1.8.4 Reporter: Koji Sekiguchi Because OpenNLP tools require the text being processed to be tokenized in advance (in other words, the words in the text must be separated from each other by spaces), it is difficult for users of certain languages (e.g. CJK) to use POS (Part-of-Speech) features. To simplify the explanation, consider using NameFinder for Japanese text. The NameFinder tools (Train, Eval, Recognize) require users to provide Japanese text that has already been tokenized, but once we tokenize Japanese text, it loses its POS information. (I think Chinese has the same problem.) Let me describe this problem for users of western languages :) (English, French, Italian, etc.) without using Japanese letters. I’ll try to use the English alphabet instead. Suppose you have a sentence “isentthemachine” which you want to give to NameFinder, and you use a morphological analyzer to tokenize it. There are two possible token sequences: - i (PPSS) / sent (VBD) / the (AT) / machine (NP) - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP) As you can see, the morphological analyzer not only tokenizes the sentence but also assigns a POS tag to each token. The same thing takes place in Japanese (and Chinese, I think). However, because the OpenNLP feature generator API accepts only a sequence of tokens, i.e. `String[] tokens`, I cannot produce POS features in the feature generator. To solve this problem (and to invite many users to our community), I’d like to suggest that OpenNLP tools allow users to add optional information to each tokenized word. For example, one can give the following text when using NameFinder tools. 
{code} $ cat en-ner.train I/PPSS sent/VBD the/AT machine/NP {code} When using such text, users must tell the tool that the tokens carry POS tags by using a certain option, e.g. -postag {code} $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag {code} We can maintain backward compatibility by setting -postag to false by default; in this case, existing feature generators work exactly the same as before. If a user sets the -postag option on the command line, the existing feature generators strip the “/POS” part of each “word/POS” token so that they produce the same features as before. In addition to handling the -postag option, I’d like to add a simple feature generator which generates only the “POS” part of each “word/POS” token. This simple feature generator allows Japanese/Chinese users to produce precise POS features. I’d like to focus on NameFinder in this ticket (let me add this option to other tools (chunker, classifier, etc.) in another ticket, if needed). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
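The "word/POS" handling proposed in OPENNLP-1201 could look something like the sketch below. It only illustrates the splitting step; PosTaggedTokenSketch and its methods are hypothetical, and how a -postag option would be wired into the tools is left open, as in the ticket.

```java
// Hypothetical helper for illustration; not part of OpenNLP.
class PosTaggedTokenSketch {

  // Returns the word part of "word/POS", or the token unchanged if it has no tag.
  static String word(String token) {
    int slash = token.lastIndexOf('/');
    boolean tagged = slash > 0 && slash < token.length() - 1;
    return tagged ? token.substring(0, slash) : token;
  }

  // Returns the POS part of "word/POS", or null if the token has no tag.
  static String pos(String token) {
    int slash = token.lastIndexOf('/');
    boolean tagged = slash > 0 && slash < token.length() - 1;
    return tagged ? token.substring(slash + 1) : null;
  }

  // Splits a whitespace-tokenized line and strips the tags, which is what the
  // existing feature generators would see when the option is set.
  static String[] words(String line) {
    String[] tokens = line.trim().split("\\s+");
    String[] out = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      out[i] = word(tokens[i]);
    }
    return out;
  }

  // Same split, but keeps only the tags, which a POS-only feature generator would use.
  static String[] posTags(String line) {
    String[] tokens = line.trim().split("\\s+");
    String[] out = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      out[i] = pos(tokens[i]);
    }
    return out;
  }
}
```

Splitting on the last slash keeps words that themselves contain a slash intact, at the cost of assuming that tags never contain one.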
[jira] [Resolved] (OPENNLP-1199) Correct Loop Bounds for NgramGenerator.generate function
[ https://issues.apache.org/jira/browse/OPENNLP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1199. - Resolution: Fixed > Correct Loop Bounds for NgramGenerator.generate function > > > Key: OPENNLP-1199 > URL: https://issues.apache.org/jira/browse/OPENNLP-1199 > Project: OpenNLP > Issue Type: Improvement >Reporter: Prachi Prakash >Assignee: Joern Kottmann >Priority: Minor > Labels: pull-request-available > > A small enhancement to the loop condition of NGramGenerator.generate function > which saves a subsequent if condition check. I have also attached the PR link > [Pull Request|https://github.com/apache/opennlp/pull/318] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (OPENNLP-1198) add more tests to NGramGeneratorTest
[ https://issues.apache.org/jira/browse/OPENNLP-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned OPENNLP-1198: --- Assignee: Koji Sekiguchi > add more tests to NGramGeneratorTest > > > Key: OPENNLP-1198 > URL: https://issues.apache.org/jira/browse/OPENNLP-1198 > Project: OpenNLP > Issue Type: Test > Components: Build, Packaging and Test >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > > At present, NGramGeneratorTest has only 2-gram test against the example > sentence "This is a sentence". I think we'd better to have 1-gram, 3-gram and > 4-gram test cases for this example sentence. > In addition, it checks the return values by doing like this: > {code} > Assert.assertEquals(3, ngrams.size()); > Assert.assertTrue(ngrams.contains("This-is")); > Assert.assertTrue(ngrams.contains("is-a")); > Assert.assertTrue(ngrams.contains("a-sentence")); > {code} > but it cannot check the sequence. I think we should check it like this, > instead: > {code} > Assert.assertEquals(3, ngrams.size()); > Assert.assertEquals("This-is", ngrams.get(0)); > Assert.assertEquals("is-a", ngrams.get(1)); > Assert.assertEquals("a-sentence", ngrams.get(2)); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1198) add more tests to NGramGeneratorTest
[ https://issues.apache.org/jira/browse/OPENNLP-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1198. - Resolution: Fixed > add more tests to NGramGeneratorTest > > > Key: OPENNLP-1198 > URL: https://issues.apache.org/jira/browse/OPENNLP-1198 > Project: OpenNLP > Issue Type: Test > Components: Build, Packaging and Test >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > > At present, NGramGeneratorTest has only 2-gram test against the example > sentence "This is a sentence". I think we'd better to have 1-gram, 3-gram and > 4-gram test cases for this example sentence. > In addition, it checks the return values by doing like this: > {code} > Assert.assertEquals(3, ngrams.size()); > Assert.assertTrue(ngrams.contains("This-is")); > Assert.assertTrue(ngrams.contains("is-a")); > Assert.assertTrue(ngrams.contains("a-sentence")); > {code} > but it cannot check the sequence. I think we should check it like this, > instead: > {code} > Assert.assertEquals(3, ngrams.size()); > Assert.assertEquals("This-is", ngrams.get(0)); > Assert.assertEquals("is-a", ngrams.get(1)); > Assert.assertEquals("a-sentence", ngrams.get(2)); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1198) add more tests to NGramGeneratorTest
Koji Sekiguchi created OPENNLP-1198: --- Summary: add more tests to NGramGeneratorTest Key: OPENNLP-1198 URL: https://issues.apache.org/jira/browse/OPENNLP-1198 Project: OpenNLP Issue Type: Test Components: Build, Packaging and Test Affects Versions: 1.8.4 Reporter: Koji Sekiguchi At present, NGramGeneratorTest has only a 2-gram test against the example sentence "This is a sentence". I think we'd better have 1-gram, 3-gram and 4-gram test cases for this example sentence as well. In addition, it checks the return values like this: {code} Assert.assertEquals(3, ngrams.size()); Assert.assertTrue(ngrams.contains("This-is")); Assert.assertTrue(ngrams.contains("is-a")); Assert.assertTrue(ngrams.contains("a-sentence")); {code} but this cannot check the order of the n-grams. I think we should check it like this instead: {code} Assert.assertEquals(3, ngrams.size()); Assert.assertEquals("This-is", ngrams.get(0)); Assert.assertEquals("is-a", ngrams.get(1)); Assert.assertEquals("a-sentence", ngrams.get(2)); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
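The additional n-gram cases suggested in OPENNLP-1198 can be checked against a small stand-in generator. This is a sketch only: NGramSketch.generate is a hypothetical reimplementation written for this example, not OpenNLP's NGramGenerator, and its loop bound is chosen so that no extra range check is needed inside the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not OpenNLP's NGramGenerator.
class NGramSketch {

  // Joins every run of n consecutive tokens with the separator, in input order.
  static List<String> generate(List<String> input, int n, String separator) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be at least 1");
    }
    List<String> ngrams = new ArrayList<>();
    // i + n <= size guarantees that subList(i, i + n) is always in range.
    for (int i = 0; i + n <= input.size(); i++) {
      ngrams.add(String.join(separator, input.subList(i, i + n)));
    }
    return ngrams;
  }
}
```

For "This is a sentence", n = 1 gives the four tokens, n = 3 gives This-is-a and is-a-sentence, n = 4 gives the whole sentence as a single n-gram, and n = 5 gives an empty list.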
[jira] [Updated] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
[ https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-1197: Description: FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" (lower case). It looks a bug to me because they're not lower case letters, but other than that, it seems that FeatureGeneratorUtil.tokenFeature() takes care only Europe/American languages. For example, in Japanese NER problem, typical token classes are as follows: - DIGIT - HIRA : あ, い, う, え, お etc. - KATA : ア, イ, ウ, エ, オ etc. - ALPHA : we don't need to distinguish lower/upper case - OTHER I think it's possible that we get FeatureGeneratorUtil.tokenFeature() to have additional token classes I mentioned above, but later on, someone who comes from Asia and may claim similar thing. I'd like to make FeatureGeneratorUtil plugable, but I don't have any idea now. was: FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" (lower case). It looks a bug to me because they're not lower case letters, but other than that, it seems that FeatureGeneratorUtil.tokenFeature() takes care only Europe/American languages. For example, in Japanese NER problem, typical token classes are as follows: - DIGIT - HIRA : あ, い, う, え, お etc. - KATA : ア, イ, ウ, エ, オ etc. - ALPHA : we don't distinguish lower/upper case - OTHER I think it's possible that we get FeatureGeneratorUtil.tokenFeature() to have additional token classes I mentioned above, but later on, someone who comes from Asia and may claim similar thing. I'd like to make FeatureGeneratorUtil plugable, but I don't have any idea now. 
> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words > -- > > Key: OPENNLP-1197 > URL: https://issues.apache.org/jira/browse/OPENNLP-1197 > Project: OpenNLP > Issue Type: Bug > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Priority: Major > > FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" > (lower case). It looks a bug to me because they're not lower case letters, > but other than that, it seems that FeatureGeneratorUtil.tokenFeature() takes > care only Europe/American languages. > For example, in Japanese NER problem, typical token classes are as follows: > - DIGIT > - HIRA : あ, い, う, え, お etc. > - KATA : ア, イ, ウ, エ, オ etc. > - ALPHA : we don't need to distinguish lower/upper case > - OTHER > I think it's possible that we get FeatureGeneratorUtil.tokenFeature() to have > additional token classes I mentioned above, but later on, someone who comes > from Asia and may claim similar thing. > I'd like to make FeatureGeneratorUtil plugable, but I don't have any idea now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
Koji Sekiguchi created OPENNLP-1197: --- Summary: FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words Key: OPENNLP-1197 URL: https://issues.apache.org/jira/browse/OPENNLP-1197 Project: OpenNLP Issue Type: Bug Components: Machine Learning Affects Versions: 1.8.4 Reporter: Koji Sekiguchi FeatureGeneratorUtil.tokenFeature() always recognizes Japanese words as "lc" (lower case). This looks like a bug to me because they're not lower case letters. Apart from that, it seems that FeatureGeneratorUtil.tokenFeature() only takes care of European/American languages. For example, in Japanese NER, typical token classes are as follows: - DIGIT - HIRA : あ, い, う, え, お etc. - KATA : ア, イ, ウ, エ, オ etc. - ALPHA : we don't distinguish lower/upper case - OTHER I think we could make FeatureGeneratorUtil.tokenFeature() support the additional token classes mentioned above, but later on someone else from Asia may ask for something similar. I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete idea yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
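The token classes listed in OPENNLP-1197 could be computed along the following lines. This is a rough sketch, not a proposal for the actual fix: JapaneseTokenClassSketch is hypothetical, the rule that mixed-script tokens fall into OTHER is an assumption, and ALPHA is restricted to ASCII letters here for simplicity.

```java
// Hypothetical classifier for illustration; not OpenNLP's FeatureGeneratorUtil.
class JapaneseTokenClassSketch {

  enum TokenClass { DIGIT, HIRA, KATA, ALPHA, OTHER }

  // Classifies a token by the class shared by all of its code points;
  // empty or mixed tokens are reported as OTHER (an assumption of this sketch).
  static TokenClass classify(String token) {
    TokenClass result = null;
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      TokenClass current = classOf(cp);
      if (result == null) {
        result = current;
      } else if (result != current) {
        return TokenClass.OTHER;
      }
      i += Character.charCount(cp);
    }
    return result == null ? TokenClass.OTHER : result;
  }

  private static TokenClass classOf(int cp) {
    if (Character.isDigit(cp)) {
      return TokenClass.DIGIT;
    }
    Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
    if (block == Character.UnicodeBlock.HIRAGANA) {
      return TokenClass.HIRA;
    }
    if (block == Character.UnicodeBlock.KATAKANA) {
      return TokenClass.KATA;
    }
    if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')) {
      return TokenClass.ALPHA; // lower and upper case are deliberately not distinguished
    }
    return TokenClass.OTHER;
  }
}
```

A pluggable design could hide a method like classify behind an interface selected per language, but that part is not sketched here since the ticket leaves it open.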
[jira] [Resolved] (OPENNLP-1195) use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer and NaiveBayesTrainer
[ https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1195. - Resolution: Fixed > use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer > and NaiveBayesTrainer > > > Key: OPENNLP-1195 > URL: https://issues.apache.org/jira/browse/OPENNLP-1195 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > > PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() > method and they are identical. > Why don't we move it to their parent class? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (OPENNLP-1195) use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer and NaiveBayesTrainer
[ https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-1195: Summary: use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer and NaiveBayesTrainer (was: move maxIndex method to AbstractEventTrainer) > use ArrayMath.argmax() rather than private maxIndex() in PerceptronTrainer > and NaiveBayesTrainer > > > Key: OPENNLP-1195 > URL: https://issues.apache.org/jira/browse/OPENNLP-1195 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > > PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() > method and they are identical. > Why don't we move it to their parent class? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (OPENNLP-1195) move maxIndex method to AbstractEventTrainer
[ https://issues.apache.org/jira/browse/OPENNLP-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned OPENNLP-1195: --- Assignee: Koji Sekiguchi > move maxIndex method to AbstractEventTrainer > > > Key: OPENNLP-1195 > URL: https://issues.apache.org/jira/browse/OPENNLP-1195 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > > PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() > method and they are identical. > Why don't we move it to their parent class? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1196) move ArrayMath to a more general package
[ https://issues.apache.org/jira/browse/OPENNLP-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1196. - Resolution: Fixed > move ArrayMath to a more general package > > > Key: OPENNLP-1196 > URL: https://issues.apache.org/jira/browse/OPENNLP-1196 > Project: OpenNLP > Issue Type: Improvement > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > In OPENNLP-1195, [~joern] mentioned this. > {quote} > There are more usages of argmax in the OpenNLP source code. > I propose we create one common method and then try to only use that one. > We could move the ArrayMath to a more general package and place a common > method there, or keep the existing one > {quote} > I want to solve this before OPENNLP-1195. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (OPENNLP-1196) move ArrayMath to a more general package
[ https://issues.apache.org/jira/browse/OPENNLP-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned OPENNLP-1196: --- Assignee: Koji Sekiguchi > move ArrayMath to a more general package > > > Key: OPENNLP-1196 > URL: https://issues.apache.org/jira/browse/OPENNLP-1196 > Project: OpenNLP > Issue Type: Improvement > Components: Machine Learning >Affects Versions: 1.8.4 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > In OPENNLP-1195, [~joern] mentioned this. > {quote} > There are more usages of argmax in the OpenNLP source code. > I propose we create one common method and then try to only use that one. > We could move the ArrayMath to a more general package and place a common > method there, or keep the existing one > {quote} > I want to solve this before OPENNLP-1195. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OPENNLP-1196) move ArrayMath to a more general package
Koji Sekiguchi created OPENNLP-1196: --- Summary: move ArrayMath to a more general package Key: OPENNLP-1196 URL: https://issues.apache.org/jira/browse/OPENNLP-1196 Project: OpenNLP Issue Type: Improvement Components: Machine Learning Affects Versions: 1.8.4 Reporter: Koji Sekiguchi In OPENNLP-1195, [~joern] mentioned this. {quote} There are more usages of argmax in the OpenNLP source code. I propose we create one common method and then try to only use that one. We could move the ArrayMath to a more general package and place a common method there, or keep the existing one {quote} I want to solve this before OPENNLP-1195. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
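The "one common method" proposed in the quote above could look roughly like the following. This is only a sketch: the class name `ArgMaxSketch` and the tie-breaking rule (the first maximum wins) are assumptions made for illustration, and this is not the actual ArrayMath code.

```java
// Hypothetical utility class; not the actual OpenNLP ArrayMath implementation.
final class ArgMaxSketch {

  private ArgMaxSketch() {
    // static helper, not meant to be instantiated
  }

  /**
   * Returns the index of the largest value in x.
   * Ties resolve to the lowest index (an assumption of this sketch).
   */
  static int argmax(double[] x) {
    if (x == null || x.length == 0) {
      throw new IllegalArgumentException("Vector x is null or empty");
    }
    int maxIdx = 0;
    for (int i = 1; i < x.length; i++) {
      if (x[i] > x[maxIdx]) {
        maxIdx = i;
      }
    }
    return maxIdx;
  }
}
```

With a single shared helper like this, trainers that each carry a private maxIndex() copy could delegate to it instead.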
[jira] [Created] (OPENNLP-1195) move maxIndex method to AbstractEventTrainer
Koji Sekiguchi created OPENNLP-1195: --- Summary: move maxIndex method to AbstractEventTrainer Key: OPENNLP-1195 URL: https://issues.apache.org/jira/browse/OPENNLP-1195 Project: OpenNLP Issue Type: Improvement Affects Versions: 1.8.4 Reporter: Koji Sekiguchi PerceptronTrainer and NaiveBayesTrainer have their own private maxIndex() method and they are identical. Why don't we move it to their parent class? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config
[ https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1160. - Resolution: Fixed > avoid letting users specify CachedFeatureGeneratorFactory in XML config > --- > > Key: OPENNLP-1160 > URL: https://issues.apache.org/jira/browse/OPENNLP-1160 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > This is similar to OPENNLP-1159. When I'm working on OPENNLP-1154, I think we > should do it for better use. > I'd like to implement this as an independent ticket from OPENNLP-1154 and > OPENNLP-1159 to make patch easy to read. > And this ticket is somewhat different from OPENNLP-1159 as users must be able > to control the framework uses CachedFeatureGeneratorFactory or not. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config
[ https://issues.apache.org/jira/browse/OPENNLP-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1159. - Resolution: Fixed > avoid letting users specify AggregatedFeatureGeneratorFactory in XML config > --- > > Key: OPENNLP-1159 > URL: https://issues.apache.org/jira/browse/OPENNLP-1159 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > When I'm working on OPENNLP-1154, I think we should do it for better use. > I'd like to implement this as an independent ticket from OPENNLP-1154 to make > patch easy to read. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config
[ https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306169#comment-16306169 ] Koji Sekiguchi commented on OPENNLP-1160: - I suggest adding a `cache` attribute to the top-level tag: {code:xml} ... {code} > avoid letting users specify CachedFeatureGeneratorFactory in XML config > --- > > Key: OPENNLP-1160 > URL: https://issues.apache.org/jira/browse/OPENNLP-1160 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > This is similar to OPENNLP-1159. When I'm working on OPENNLP-1154, I think we > should do it for better use. > I'd like to implement this as an independent ticket from OPENNLP-1154 and > OPENNLP-1159 to make patch easy to read. > And this ticket is somewhat different from OPENNLP-1159 as users must be able > to control the framework uses CachedFeatureGeneratorFactory or not. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1175) explain the new format of feature generator XML config
Koji Sekiguchi created OPENNLP-1175: --- Summary: explain the new format of feature generator XML config Key: OPENNLP-1175 URL: https://issues.apache.org/jira/browse/OPENNLP-1175 Project: OpenNLP Issue Type: Bug Components: Documentation Reporter: Koji Sekiguchi Priority: Minor The documentation should explain the new format of the feature generator XML config rather than the classic format. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1154. - Resolution: Fixed > change the XML format for feature generator config in NameFinder and POS > Tagger > --- > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... 
is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1171) some tests create temp files and directories but never delete them
[ https://issues.apache.org/jira/browse/OPENNLP-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1171. - Resolution: Fixed > some tests create temp files and directories but never delete them > -- > > Key: OPENNLP-1171 > URL: https://issues.apache.org/jira/browse/OPENNLP-1171 > Project: OpenNLP > Issue Type: Bug > Components: Build, Packaging and Test >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 1.8.4 > > > Some temporary files and directories that are created in some tests are never > deleted and the number of temporary files/directories is increasing after > running mvn clean test. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1171) some tests create temp files and directories but never delete them
Koji Sekiguchi created OPENNLP-1171: --- Summary: some tests create temp files and directories but never delete them Key: OPENNLP-1171 URL: https://issues.apache.org/jira/browse/OPENNLP-1171 Project: OpenNLP Issue Type: Bug Components: Build, Packaging and Test Affects Versions: 1.8.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 1.8.4 Some temporary files and directories that are created in some tests are never deleted and the number of temporary files/directories is increasing after running mvn clean test. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
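One way a test can avoid leaving temporary files behind is to delete the directory in a finally block. The sketch below uses only the JDK; the class and method names are hypothetical, and it is not necessarily how the fix for this ticket was implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical test helper; the names are illustrative only.
final class TempDirCleanupSketch {

  /** Deletes dir and everything below it, children before parents. */
  static void deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dir)) {
      // Reverse natural order lists deeper paths before their parents.
      List<Path> ordered = paths.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
      for (Path p : ordered) {
        Files.delete(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a temp directory, uses it, and always cleans it up afterwards. */
  static Path createUseAndCleanUp() {
    try {
      Path dir = Files.createTempDirectory("opennlp-test");
      try {
        // Stand-in for whatever the test writes.
        Files.write(dir.resolve("model.bin"), new byte[] {1, 2, 3});
      } finally {
        deleteRecursively(dir);
      }
      return dir;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a JUnit test the same cleanup would typically live in a tear-down method so that it also runs when an assertion fails.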
[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293972#comment-16293972 ] Koji Sekiguchi commented on OPENNLP-1154: - What I did in this patch: * moved all static *FeatureGeneratorFactory classes out of GeneratorFactory.java and made them individual factory classes such as BrownClusterTokenFeatureGeneratorFactory.java, BigramNameFeatureGeneratorFactory.java etc., so that users can avoid specifying nested class names, e.g. opennlp.tools.util.featuregen.GeneratorFactory.BigramNameFeatureGeneratorFactory, in the XML config file * provided the AbstractXmlFeatureGeneratorFactory class, which all *FeatureGeneratorFactory classes must extend. It has an init() method that is called by the framework when the XML config file is read. It helps *FeatureGeneratorFactory classes set their parameters if they are specified in the nested way like: {code:xml} 2 2 {code} * *FeatureGeneratorFactory classes can read parameters set in the XML config file via getter methods, e.g. getInt("parameter name"), getStr("parameter name"), as long as they extend AbstractXmlFeatureGeneratorFactory. AbstractXmlFeatureGeneratorFactory stores the parameters in a LinkedHashMap in its init() method. I used LinkedHashMap rather than HashMap because it must respect the order in which the parameters are written, since multiple sub-generators can be specified in a parent FeatureGeneratorFactory (although only AggregatedFeatureGeneratorFactory supports multiple sub-generators for now). * the classic format is still supported for back-compat reasons. I provided test cases that check support for both the classic and new formats. The classic format XML files can be found under the src/test/resources folder with *_classic.xml file names. GeneratorFactory recognizes which format is used in its createGenerator() method. * the extractArtifactSerializerMappings() method supports both the classic and new formats. 
* > change the XML format for feature generator config in NameFinder and POS > Tagger > --- > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... 
is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
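The parameter handling described in the comment above (getter methods backed by a LinkedHashMap so that the written order is kept) can be sketched as follows. The class `ParameterHolderSketch` and its `put` and `names` methods are hypothetical stand-ins; this is not the real AbstractXmlFeatureGeneratorFactory, and only the getInt/getStr names come from the comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the parameter handling described above;
// not the real AbstractXmlFeatureGeneratorFactory.
class ParameterHolderSketch {

  // LinkedHashMap iterates in insertion order, i.e. the order in which
  // the parameters were written in the config. HashMap gives no such guarantee.
  private final Map<String, String> params = new LinkedHashMap<>();

  /** Would be called while the config is read (an assumption of this sketch). */
  void put(String name, String value) {
    params.put(name, value);
  }

  int getInt(String name) {
    return Integer.parseInt(getStr(name).trim());
  }

  String getStr(String name) {
    String value = params.get(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing parameter: " + name);
    }
    return value;
  }

  /** Parameter names in the order they were added. */
  Iterable<String> names() {
    return params.keySet();
  }
}
```

A factory subclass would then read its settings with calls such as getInt("someLength"), where the parameter name is whatever the config file used.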
[jira] [Comment Edited] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293972#comment-16293972 ] Koji Sekiguchi edited comment on OPENNLP-1154 at 12/17/17 12:18 AM: What I did in this patch: * moved all static *FeatureGeneratorFactory classes out of GeneratorFactory.java and made them individual factory classes such as BrownClusterTokenFeatureGeneratorFactory.java, BigramNameFeatureGeneratorFactory.java etc., so that users can avoid specifying nested class names, e.g. opennlp.tools.util.featuregen.GeneratorFactory.BigramNameFeatureGeneratorFactory, in the XML config file * provided the AbstractXmlFeatureGeneratorFactory class, which all *FeatureGeneratorFactory classes must extend. It has an init() method that is called by the framework when the XML config file is read. It helps *FeatureGeneratorFactory classes set their parameters if they are specified in the nested way like: {code:xml} 2 2 {code} * *FeatureGeneratorFactory classes can read parameters set in the XML config file via getter methods, e.g. getInt("parameter name"), getStr("parameter name"), as long as they extend AbstractXmlFeatureGeneratorFactory. AbstractXmlFeatureGeneratorFactory stores the parameters in a LinkedHashMap in its init() method. I used LinkedHashMap rather than HashMap because it must respect the order in which the parameters are written, since multiple sub-generators can be specified in a parent FeatureGeneratorFactory (although only AggregatedFeatureGeneratorFactory supports multiple sub-generators for now). * the classic format is still supported for back-compat reasons. I provided test cases that check support for both the classic and new formats. The classic format XML files can be found under the src/test/resources folder with *_classic.xml file names. GeneratorFactory recognizes which format is used in its createGenerator() method. * the extractArtifactSerializerMappings() method supports both the classic and new formats. 
* > change the XML format for feature generator config in NameFinder and POS > Tagger > --- > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd
[jira] [Closed] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()
[ https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi closed OPENNLP-1161. --- Resolution: Won't Fix The suggested solution can be implemented in the blocked issue, OPENNLP-1154. > avoid using concrete tag names of XML config in > GeneratorFactory.extractArtifactSerializerMappings() > > > Key: OPENNLP-1161 > URL: https://issues.apache.org/jira/browse/OPENNLP-1161 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Blocker > > When working on OPENNLP-1154, I noticed this. > In GeneratorFactory.extractArtifactSerializerMappings(), it specifies > concrete XML tag names: > {code:java} > for (int i = 0; i < allElements.getLength(); i++) { > if (allElements.item(i) instanceof Element) { > Element xmlElement = (Element) allElements.item(i); > String dictName = xmlElement.getAttribute("dict"); > if (dictName != null) { > switch (xmlElement.getTagName()) { > case "wordcluster": > mapping.put(dictName, new > WordClusterDictionary.WordClusterDictionarySerializer()); > break; > case "brownclustertoken": > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclustertokenclass"://, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclusterbigram": //, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "dictionary": > mapping.put(dictName, new DictionarySerializer()); > break; > } > } > String modelName = xmlElement.getAttribute("model"); > if (modelName != null) { > switch (xmlElement.getTagName()) { > case "tokenpos": > mapping.put(modelName, new POSModelSerializer()); > break; > } > } > } > } > {code} > Instead, we'd better let FeatureGeneratorFactories implement a method that > returns mapping (Map) and in > GeneratorFactory.extractArtifactSerializerMappings(), the 
framework just > calls the method of FeatureGeneratorFactories, which are found in XML config. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()
[ https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277921#comment-16277921 ] Koji Sekiguchi commented on OPENNLP-1161: - I submitted a patch as PR 292, but I (and one of the committers) don't like it because the generators have to return the artifact serializer mappings. As I suggested in this ticket, we should let the FeatureGeneratorFactories (not the generators) implement a method that returns the mapping (Map), and GeneratorFactory.extractArtifactSerializerMappings() should just call that method on the FeatureGeneratorFactories found in the XML config. I couldn't implement this here because I needed to keep back-compat, so I'll withdraw this patch. Instead, I think I can achieve this in the blocked issue, OPENNLP-1154. > avoid using concrete tag names of XML config in > GeneratorFactory.extractArtifactSerializerMappings() > > > Key: OPENNLP-1161 > URL: https://issues.apache.org/jira/browse/OPENNLP-1161 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Blocker > > When working on OPENNLP-1154, I noticed this. 
> In GeneratorFactory.extractArtifactSerializerMappings(), it specifies > concrete XML tag names: > {code:java} > for (int i = 0; i < allElements.getLength(); i++) { > if (allElements.item(i) instanceof Element) { > Element xmlElement = (Element) allElements.item(i); > String dictName = xmlElement.getAttribute("dict"); > if (dictName != null) { > switch (xmlElement.getTagName()) { > case "wordcluster": > mapping.put(dictName, new > WordClusterDictionary.WordClusterDictionarySerializer()); > break; > case "brownclustertoken": > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclustertokenclass"://, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclusterbigram": //, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "dictionary": > mapping.put(dictName, new DictionarySerializer()); > break; > } > } > String modelName = xmlElement.getAttribute("model"); > if (modelName != null) { > switch (xmlElement.getTagName()) { > case "tokenpos": > mapping.put(modelName, new POSModelSerializer()); > break; > } > } > } > } > {code} > Instead, we'd better let FeatureGeneratorFactories implement a method that > returns mapping (Map ) and in > GeneratorFactory.extractArtifactSerializerMappings(), the framework just > calls the method of FeatureGeneratorFactories, which are found in XML config. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()
[ https://issues.apache.org/jira/browse/OPENNLP-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273875#comment-16273875 ] Koji Sekiguchi commented on OPENNLP-1161: - This is a blocker of OPENNLP-1154 because in OPENNLP-1154, I try to change the XML format from classic to new one. And the current implementation in GeneratorFactory.extractArtifactSerializerMappings() depends on the classic format. > avoid using concrete tag names of XML config in > GeneratorFactory.extractArtifactSerializerMappings() > > > Key: OPENNLP-1161 > URL: https://issues.apache.org/jira/browse/OPENNLP-1161 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Blocker > > When working on OPENNLP-1154, I noticed this. > In GeneratorFactory.extractArtifactSerializerMappings(), it specifies > concrete XML tag names: > {code:java} > for (int i = 0; i < allElements.getLength(); i++) { > if (allElements.item(i) instanceof Element) { > Element xmlElement = (Element) allElements.item(i); > String dictName = xmlElement.getAttribute("dict"); > if (dictName != null) { > switch (xmlElement.getTagName()) { > case "wordcluster": > mapping.put(dictName, new > WordClusterDictionary.WordClusterDictionarySerializer()); > break; > case "brownclustertoken": > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclustertokenclass"://, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "brownclusterbigram": //, ; > mapping.put(dictName, new > BrownCluster.BrownClusterSerializer()); > break; > case "dictionary": > mapping.put(dictName, new DictionarySerializer()); > break; > } > } > String modelName = xmlElement.getAttribute("model"); > if (modelName != null) { > switch (xmlElement.getTagName()) { > case "tokenpos": > mapping.put(modelName, new POSModelSerializer()); > break; > } > } > } > } > {code} > 
Instead, we'd better let FeatureGeneratorFactories implement a method that > returns mapping (Map) and in > GeneratorFactory.extractArtifactSerializerMappings(), the framework just > calls the method of FeatureGeneratorFactories, which are found in XML config. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1161) avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings()
Koji Sekiguchi created OPENNLP-1161: --- Summary: avoid using concrete tag names of XML config in GeneratorFactory.extractArtifactSerializerMappings() Key: OPENNLP-1161 URL: https://issues.apache.org/jira/browse/OPENNLP-1161 Project: OpenNLP Issue Type: Improvement Components: Formats, Name Finder Affects Versions: 1.8.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Blocker When working on OPENNLP-1154, I noticed this. In GeneratorFactory.extractArtifactSerializerMappings(), it specifies concrete XML tag names:
{code:java}
for (int i = 0; i < allElements.getLength(); i++) {
  if (allElements.item(i) instanceof Element) {
    Element xmlElement = (Element) allElements.item(i);
    String dictName = xmlElement.getAttribute("dict");
    if (dictName != null) {
      switch (xmlElement.getTagName()) {
        case "wordcluster":
          mapping.put(dictName, new WordClusterDictionary.WordClusterDictionarySerializer());
          break;
        case "brownclustertoken":
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;
        case "brownclustertokenclass"://, ;
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;
        case "brownclusterbigram": //, ;
          mapping.put(dictName, new BrownCluster.BrownClusterSerializer());
          break;
        case "dictionary":
          mapping.put(dictName, new DictionarySerializer());
          break;
      }
    }
    String modelName = xmlElement.getAttribute("model");
    if (modelName != null) {
      switch (xmlElement.getTagName()) {
        case "tokenpos":
          mapping.put(modelName, new POSModelSerializer());
          break;
      }
    }
  }
}
{code}
Instead, we'd better let FeatureGeneratorFactories implement a method that returns mapping (Map) and in GeneratorFactory.extractArtifactSerializerMappings(), the framework just calls the method of FeatureGeneratorFactories, which are found in XML config. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
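The alternative proposed at the end of this ticket, where each factory reports its own serializer mapping and the framework only collects them, might be sketched like this. Every type and method name below is a hypothetical stand-in invented for illustration; none of it is the OpenNLP API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All types below are hypothetical stand-ins, not OpenNLP classes.
interface SerializerSketch {
}

final class DictionarySerializerSketch implements SerializerSketch {
}

interface FactorySketch {
  /** Each factory reports the serializers for the artifacts it needs. */
  Map<String, SerializerSketch> getArtifactSerializerMapping();
}

final class DictionaryFactorySketch implements FactorySketch {

  private final String dictName;

  DictionaryFactorySketch(String dictName) {
    this.dictName = dictName;
  }

  @Override
  public Map<String, SerializerSketch> getArtifactSerializerMapping() {
    Map<String, SerializerSketch> mapping = new LinkedHashMap<>();
    mapping.put(dictName, new DictionarySerializerSketch());
    return mapping;
  }
}

final class MappingCollectorSketch {

  /** Framework side: no switch on tag names, it just asks each factory. */
  static Map<String, SerializerSketch> extractMappings(List<FactorySketch> factories) {
    Map<String, SerializerSketch> mapping = new LinkedHashMap<>();
    for (FactorySketch factory : factories) {
      mapping.putAll(factory.getArtifactSerializerMapping());
    }
    return mapping;
  }
}
```

The point of the design is that adding a new factory no longer requires touching the collecting code, because the knowledge of which serializer goes with which artifact lives in the factory itself.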
[jira] [Commented] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config
[ https://issues.apache.org/jira/browse/OPENNLP-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269980#comment-16269980 ] Koji Sekiguchi commented on OPENNLP-1160: - After committing all of OPENNLP-1154, OPENNLP-1159 and OPENNLP-1160, the XML config looks like: {code:xml} 2 2 2 2 true false {code} > avoid letting users specify CachedFeatureGeneratorFactory in XML config > --- > > Key: OPENNLP-1160 > URL: https://issues.apache.org/jira/browse/OPENNLP-1160 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > This is similar to OPENNLP-1159. When I'm working on OPENNLP-1154, I think we > should do it for better use. > I'd like to implement this as an independent ticket from OPENNLP-1154 and > OPENNLP-1159 to make patch easy to read. > And this ticket is somewhat different from OPENNLP-1159 as users must be able > to control the framework uses CachedFeatureGeneratorFactory or not. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config
[ https://issues.apache.org/jira/browse/OPENNLP-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269969#comment-16269969 ] Koji Sekiguchi commented on OPENNLP-1159: - After committing OPENNLP-1154, the XML config looks like: {code:xml} 2 2 2 2 true false {code} Then after committing this ticket, the XML config looks like: {code:xml} 2 2 2 2 true false {code} CachedFeatureGeneratorFactory should be avoided letting users specify explicitly but I prefer to implement it in OPENNLP-1160. > avoid letting users specify AggregatedFeatureGeneratorFactory in XML config > --- > > Key: OPENNLP-1159 > URL: https://issues.apache.org/jira/browse/OPENNLP-1159 > Project: OpenNLP > Issue Type: Improvement > Components: Formats, Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > When I'm working on OPENNLP-1154, I think we should do it for better use. > I'd like to implement this as an independent ticket from OPENNLP-1154 to make > patch easy to read. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1160) avoid letting users specify CachedFeatureGeneratorFactory in XML config
Koji Sekiguchi created OPENNLP-1160: --- Summary: avoid letting users specify CachedFeatureGeneratorFactory in XML config Key: OPENNLP-1160 URL: https://issues.apache.org/jira/browse/OPENNLP-1160 Project: OpenNLP Issue Type: Improvement Components: Formats, Name Finder Affects Versions: 1.8.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor This is similar to OPENNLP-1159. While working on OPENNLP-1154, I came to think we should do this for better usability. I'd like to implement this as a ticket independent of OPENNLP-1154 and OPENNLP-1159 to make the patch easy to read. This ticket differs somewhat from OPENNLP-1159, as users must be able to control whether the framework uses CachedFeatureGeneratorFactory or not. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1159) avoid letting users specify AggregatedFeatureGeneratorFactory in XML config
Koji Sekiguchi created OPENNLP-1159: --- Summary: avoid letting users specify AggregatedFeatureGeneratorFactory in XML config Key: OPENNLP-1159 URL: https://issues.apache.org/jira/browse/OPENNLP-1159 Project: OpenNLP Issue Type: Improvement Components: Formats, Name Finder Affects Versions: 1.8.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor While working on OPENNLP-1154, I came to think we should do this for better usability. I'd like to implement this as a ticket independent of OPENNLP-1154 to make the patch easy to read. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256750#comment-16256750 ] Koji Sekiguchi commented on OPENNLP-1154: - As Joern suggested, this should be used not only for NameFinder but also for POS Tagger, so I added "POS Tagger" to the title. > change the XML format for feature generator config in NameFinder and POS > Tagger > --- > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... 
is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (OPENNLP-1154) change the XML format for feature generator config in NameFinder and POS Tagger
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated OPENNLP-1154: Summary: change the XML format for feature generator config in NameFinder and POS Tagger (was: change the XML format for feature generator config in NameFinder) > change the XML format for feature generator config in NameFinder and POS > Tagger > --- > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... 
is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (OPENNLP-1154) change the XML format for feature generator config in NameFinder
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247381#comment-16247381 ] Koji Sekiguchi edited comment on OPENNLP-1154 at 11/10/17 11:58 AM: I'll post the first patch soon. It fails few tests yet because I didn't care about serialize/deserialize for the new format and other details stuff. The purpose of posting the first patch, before implementing further (serialize/deserialize, test cases, etc.), I'd like to know committers' thought about the new format. And also, I think we can support "classic" format for back-compat reasons, if needed. In the first patch, I did it, but there are many Deprecated annotations due to it. I'd like to know your thought about back-compat support as well. I don't still understand the versioning system in OpenNLP. If we have this new format in 1.9, don't I need to consider "classic" format? was (Author: koji): I'll post the first patch soon. It fails one test yet because I didn't care about serialize/deserialize for the new format. The purpose of posting the first patch, before implementing further (serialize/deserialize, test cases, etc.), I'd like to know committers' thought about the new format. And also, I think we can support "classic" format for back-compat reasons, if needed. In the first patch, I did it, but there are many Deprecated annotations due to it. I'd like to know your thought about back-compat support as well. I don't still understand the versioning system in OpenNLP. If we have this new format in 1.9, don't I need to consider "classic" format? > change the XML format for feature generator config in NameFinder > > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). 
Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. > I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1154) change the XML format for feature generator config in NameFinder
[ https://issues.apache.org/jira/browse/OPENNLP-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247381#comment-16247381 ] Koji Sekiguchi commented on OPENNLP-1154: - I'll post the first patch soon. It fails one test yet because I didn't care about serialize/deserialize for the new format. The purpose of posting the first patch, before implementing further (serialize/deserialize, test cases, etc.), I'd like to know committers' thought about the new format. And also, I think we can support "classic" format for back-compat reasons, if needed. In the first patch, I did it, but there are many Deprecated annotations due to it. I'd like to know your thought about back-compat support as well. I don't still understand the versioning system in OpenNLP. If we have this new format in 1.9, don't I need to consider "classic" format? > change the XML format for feature generator config in NameFinder > > > Key: OPENNLP-1154 > URL: https://issues.apache.org/jira/browse/OPENNLP-1154 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.3 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi > > NameFinder provides many kinds of feature generator (factories). Users can > define their config via XML which looks like: > {code:xml} > > > > > > > > > > > > > > > > > {code} > If a user wants to implement their own feature generator, he can use .../>, but if he wants to have two or more feature generators at once, he may > be able to implement it by providing a wrapper feature generator which wraps > two or more feature generators that he originally wants to have, but it is > not good. 
> I'd like to suggest that we make the config format more flexible like below: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > > > class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > > > > > {code} > If ... is too noisy, I'm thinking another format as well: > {code:xml} > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.CachedFeatureGeneratorFactory"> > class="opennlp.tools.util.featuregen.AggregatedFeatureGeneratorFactory"> >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"/> > >class="opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"> > 2 > 2 > class="opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"/> > > > > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1154) change the XML format for feature generator config in NameFinder
Koji Sekiguchi created OPENNLP-1154: --- Summary: change the XML format for feature generator config in NameFinder Key: OPENNLP-1154 URL: https://issues.apache.org/jira/browse/OPENNLP-1154 Project: OpenNLP Issue Type: Improvement Components: Name Finder Affects Versions: 1.8.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi NameFinder provides many kinds of feature generator (factories). Users can define their config via XML which looks like: {code:xml} {code} If a user wants to implement their own feature generator, he can use , but if he wants to have two or more feature generators at once, he may be able to implement it by providing a wrapper feature generator which wraps two or more feature generators that he originally wants to have, but it is not good. I'd like to suggest that we make the config format more flexible like below: {code:xml} 2 2 2 2 {code} If ... is too noisy, I'm thinking another format as well: {code:xml} 2 2 2 2 {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (OPENNLP-1149) remove unused member in PlainTextByLineStream
[ https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned OPENNLP-1149: --- Assignee: Koji Sekiguchi > remove unused member in PlainTextByLineStream > - > > Key: OPENNLP-1149 > URL: https://issues.apache.org/jira/browse/OPENNLP-1149 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > PlainTextByLineStream has a private member variable "channel" but it is never > set and hence, it is always null. It can be removed to simplify code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1149) remove unused member in PlainTextByLineStream
[ https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1149. - Resolution: Fixed Thanks everyone for reviewing this! :) > remove unused member in PlainTextByLineStream > - > > Key: OPENNLP-1149 > URL: https://issues.apache.org/jira/browse/OPENNLP-1149 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > PlainTextByLineStream has a private member variable "channel" but it is never > set and hence, it is always null. It can be removed to simplify code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1145) Javadoc of NaiveBayesTrainer class looks incorrect
[ https://issues.apache.org/jira/browse/OPENNLP-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1145. - Resolution: Fixed Assignee: Koji Sekiguchi Thanks everyone for reviewing this! :) > Javadoc of NaiveBayesTrainer class looks incorrect > -- > > Key: OPENNLP-1145 > URL: https://issues.apache.org/jira/browse/OPENNLP-1145 > Project: OpenNLP > Issue Type: Bug > Components: Machine Learning >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > It seems that Javadoc of NaiveBayesTrainer class was copied from > PerceptronTrainer and hence, it says "Trains models using the perceptron > algorithm." :) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1150) TokenNameFinderTrainerTool should use ModelUtil.createDefaultTrainingParameters() when mlParams is null
Koji Sekiguchi created OPENNLP-1150: --- Summary: TokenNameFinderTrainerTool should use ModelUtil.createDefaultTrainingParameters() when mlParams is null Key: OPENNLP-1150 URL: https://issues.apache.org/jira/browse/OPENNLP-1150 Project: OpenNLP Issue Type: Improvement Components: Name Finder Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Priority: Trivial Fix For: 1.8.3 Unlike the other TrainerTools, TokenNameFinderTrainerTool creates an empty TrainingParameters by calling the constructor when mlParams is null. TokenNameFinderTrainerTool should initialize mlParams with ModelUtil.createDefaultTrainingParameters(), as the other TrainerTools do. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1149) remove unused member in PlainTextByLineStream
[ https://issues.apache.org/jira/browse/OPENNLP-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214646#comment-16214646 ] Koji Sekiguchi commented on OPENNLP-1149: - I'll change the type of private member "encoding" from String to Charset in this patch. > remove unused member in PlainTextByLineStream > - > > Key: OPENNLP-1149 > URL: https://issues.apache.org/jira/browse/OPENNLP-1149 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > PlainTextByLineStream has a private member variable "channel" but it is never > set and hence, it is always null. It can be removed to simplify code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1149) remove unused member in PlainTextByLineStream
Koji Sekiguchi created OPENNLP-1149: --- Summary: remove unused member in PlainTextByLineStream Key: OPENNLP-1149 URL: https://issues.apache.org/jira/browse/OPENNLP-1149 Project: OpenNLP Issue Type: Improvement Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Priority: Trivial Fix For: 1.8.3 PlainTextByLineStream has a private member variable "channel" but it is never set and hence, it is always null. It can be removed to simplify code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1148) use StandardCharsets.UTF_8 in doc
Koji Sekiguchi created OPENNLP-1148: --- Summary: use StandardCharsets.UTF_8 in doc Key: OPENNLP-1148 URL: https://issues.apache.org/jira/browse/OPENNLP-1148 Project: OpenNLP Issue Type: Improvement Components: Documentation Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 1.8.3 In the doc, PlainTextByLineStream() is not used consistently. Besides passing StandardCharsets.UTF_8 as its second parameter, there are the following variations: - String "UTF-8" - StandardCharsets.UTF8 (not UTF_8) - Charset.forName("UTF-8") Let's unify them to StandardCharsets.UTF_8. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
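The variations listed in OPENNLP-1148 can be compared with a small sketch that uses only the JDK. It does not call PlainTextByLineStream itself; the class name CharsetVariants and the helper readLines are made up for illustration. Of the listed variations, StandardCharsets.UTF8 is not a constant in the JDK, so only the other two compile.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of OpenNLP.
class CharsetVariants {

  // Preferred form: a compile-time constant, no lookup and no checked exception.
  static final Charset PREFERRED = StandardCharsets.UTF_8;

  // Variation from the doc: looked up by name at runtime; a typo in the
  // name only fails when this line executes.
  static final Charset BY_NAME = Charset.forName("UTF-8");

  // Reads all lines from the stream, decoding bytes with the given charset.
  static List<String> readLines(InputStream in, Charset charset) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }
}
```

Both constants denote the same charset, so switching the doc to StandardCharsets.UTF_8 should not change behavior; it only removes the runtime lookup and the misspelled constant.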
[jira] [Created] (OPENNLP-1147) Missing URLs in doc
Koji Sekiguchi created OPENNLP-1147: --- Summary: Missing URLs in doc Key: OPENNLP-1147 URL: https://issues.apache.org/jira/browse/OPENNLP-1147 Project: OpenNLP Issue Type: Bug Components: Documentation Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 1.8.3 While reading the name finder part of the documentation, I found some missing URLs. I'd like to correct those for which I could find the latest or alternative ones. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1146) remove unnecessary serialVersionUID
[ https://issues.apache.org/jira/browse/OPENNLP-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1146. - Resolution: Fixed Thanks all for reviewing this! > remove unnecessary serialVersionUID > --- > > Key: OPENNLP-1146 > URL: https://issues.apache.org/jira/browse/OPENNLP-1146 > Project: OpenNLP > Issue Type: Improvement > Components: Build, Packaging and Test >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > We saw several classes that have unnecessary serialVersionUID constant > declaration. Most of them are Stemmer classes that are created by the > Snowball to Java compiler. I think we can just remove serialVersionUID from > Stemmer classes. Other than Stemmer classes, Exception classes which extend > RuntimeException or IOException have serialVersionUID. I'll remove > serialVersionUID from these Exception classes as well but add > @SuppressWarnings("serial") just in case. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1146) remove unnecessary serialVersionUID
Koji Sekiguchi created OPENNLP-1146: --- Summary: remove unnecessary serialVersionUID Key: OPENNLP-1146 URL: https://issues.apache.org/jira/browse/OPENNLP-1146 Project: OpenNLP Issue Type: Improvement Components: Build, Packaging and Test Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 1.8.3 We saw several classes that have unnecessary serialVersionUID constant declaration. Most of them are Stemmer classes that are created by the Snowball to Java compiler. I think we can just remove serialVersionUID from Stemmer classes. Other than Stemmer classes, Exception classes which extend RuntimeException or IOException have serialVersionUID. I'll remove serialVersionUID from these Exception classes as well but add @SuppressWarnings("serial") just in case. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
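As a rough illustration of the change described in OPENNLP-1146, an exception class without an explicit serialVersionUID might look like the sketch below. The class name ExampleFormatException is hypothetical and does not refer to any OpenNLP class. Exceptions are Serializable through Throwable, so the annotation is there to silence the compiler's "serial" lint warning about the missing field.

```java
// Hypothetical exception class, for illustration only.
// No serialVersionUID is declared; the annotation suppresses the
// "serial" lint warning that javac can emit for Serializable classes.
@SuppressWarnings("serial")
class ExampleFormatException extends RuntimeException {

  ExampleFormatException(String message) {
    super(message);
  }

  ExampleFormatException(String message, Throwable cause) {
    super(message, cause);
  }
}
```

Without the field, the JVM computes a default serialVersionUID from the class structure, which is usually acceptable for classes whose instances are not persisted across versions.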
[jira] [Resolved] (OPENNLP-1141) Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible
[ https://issues.apache.org/jira/browse/OPENNLP-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1141. - Resolution: Invalid > Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible > --- > > Key: OPENNLP-1141 > URL: https://issues.apache.org/jira/browse/OPENNLP-1141 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > BioCodec and BilouCodec implement areOutcomesCompatible(). I think they can > be written as a DFA (Deterministic Finite Automaton). > In this ticket, I'll add a simple implementation of a DFA and change > areOutcomesCompatible() to use it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OPENNLP-1141) Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible
[ https://issues.apache.org/jira/browse/OPENNLP-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192675#comment-16192675 ] Koji Sekiguchi commented on OPENNLP-1141: - After a discussion with Joern, I learned that we should treat outcomes as a set, not a sequence. A DFA cannot be applied to SequenceCodec.areOutcomesCompatible(), but it can be used in the sequence validators. I'll withdraw this. > Add DFA and use it from SequenceCodec.areOutcomesCompatible if possible > --- > > Key: OPENNLP-1141 > URL: https://issues.apache.org/jira/browse/OPENNLP-1141 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > BioCodec and BilouCodec implement areOutcomesCompatible(). I think they can > be written as a DFA (Deterministic Finite Automaton). > In this ticket, I'll add a simple implementation of a DFA and change > areOutcomesCompatible() to use it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
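To make the idea of a DFA over a tag sequence concrete, here is a minimal sketch of a two-state automaton that accepts BIO-style sequences. It is not the withdrawn patch and not OpenNLP's validator: the class BioSequenceDfa, the state names, and the bare tag strings "start", "cont" and "other" (without entity-type prefixes) are all assumptions made for this example.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical DFA for BIO-style tag sequences; every state is accepting,
// so a sequence is rejected only when a transition is missing.
class BioSequenceDfa {

  enum State { OUTSIDE, INSIDE }

  private final Map<State, Map<String, State>> transitions = new EnumMap<>(State.class);

  BioSequenceDfa() {
    Map<String, State> fromOutside = new HashMap<>();
    fromOutside.put("start", State.INSIDE);
    fromOutside.put("other", State.OUTSIDE);
    // Deliberately no "cont" transition: a name cannot continue before it starts.
    transitions.put(State.OUTSIDE, fromOutside);

    Map<String, State> fromInside = new HashMap<>();
    fromInside.put("start", State.INSIDE);  // a new name begins right after another
    fromInside.put("cont", State.INSIDE);
    fromInside.put("other", State.OUTSIDE);
    transitions.put(State.INSIDE, fromInside);
  }

  // Returns true if the whole tag sequence can be consumed from the OUTSIDE state.
  boolean accepts(List<String> tags) {
    State state = State.OUTSIDE;
    for (String tag : tags) {
      state = transitions.get(state).get(tag);
      if (state == null) {
        return false; // unknown tag or forbidden transition
      }
    }
    return true;
  }
}
```

The automaton depends on the order of the tags, which is why it fits a sequence validator and not a check that looks at outcomes as an unordered set.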
[jira] [Resolved] (OPENNLP-1138) Add more tests to Span
[ https://issues.apache.org/jira/browse/OPENNLP-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1138. - Resolution: Fixed Fix Version/s: 1.8.3 > Add more tests to Span > -- > > Key: OPENNLP-1138 > URL: https://issues.apache.org/jira/browse/OPENNLP-1138 > Project: OpenNLP > Issue Type: Test >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > Span's constructor can throw IllegalArgumentException but there are no tests > for it. I'll add tests for that and, in addition, I'll fix the test > for toString() because it doesn't actually test it :), and I'll remove a redundancy > from a constructor. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1139) BilouCodec should use its own constants
[ https://issues.apache.org/jira/browse/OPENNLP-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1139. - Resolution: Fixed Fix Version/s: 1.8.3 > BilouCodec should use its own constants > --- > > Key: OPENNLP-1139 > URL: https://issues.apache.org/jira/browse/OPENNLP-1139 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > It seems that BilouCodec accidentally uses BioCodec's constants such as > BioCodec.START, BioCodec.CONTINUE, etc. It should use its own ones. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1139) BilouCodec should use its own constants
Koji Sekiguchi created OPENNLP-1139: --- Summary: BilouCodec should use its own constants Key: OPENNLP-1139 URL: https://issues.apache.org/jira/browse/OPENNLP-1139 Project: OpenNLP Issue Type: Bug Components: Name Finder Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial It seems that BilouCodec accidentally uses BioCodec's constants such as BioCodec.START, BioCodec.CONTINUE, etc. It should use its own ones. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1138) Add more tests to Span
Koji Sekiguchi created OPENNLP-1138: --- Summary: Add more tests to Span Key: OPENNLP-1138 URL: https://issues.apache.org/jira/browse/OPENNLP-1138 Project: OpenNLP Issue Type: Test Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Span's constructor can throw IllegalArgumentException but there are no tests for it. I'll add tests for that and, in addition, I'll fix the test for toString() because it doesn't actually test it :), and I'll remove a redundancy from a constructor. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OPENNLP-1137) Add more tests and check overlapping of name spans to NameSample
[ https://issues.apache.org/jira/browse/OPENNLP-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1137. - Resolution: Fixed Fix Version/s: 1.8.3 > Add more tests and check overlapping of name spans to NameSample > > > Key: OPENNLP-1137 > URL: https://issues.apache.org/jira/browse/OPENNLP-1137 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 1.8.2 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.3 > > > NameSample has the following TODO in its constructor: > {quote}// TODO: Check that name spans are not overlapping, otherwise throw > exception{quote} > I added simple code for it and its test. > And I added a test for nested name spans. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OPENNLP-1137) Add more tests and check overlapping of name spans to NameSample
Koji Sekiguchi created OPENNLP-1137: --- Summary: Add more tests and check overlapping of name spans to NameSample Key: OPENNLP-1137 URL: https://issues.apache.org/jira/browse/OPENNLP-1137 Project: OpenNLP Issue Type: Improvement Components: Name Finder Affects Versions: 1.8.2 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial NameSample has the following TODO in its constructor: {quote}// TODO: Check that name spans are not overlapping, otherwise throw exception{quote} I added simple code for it and its test. And I added a test for nested name spans. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
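The overlap check named in the TODO can be sketched as below. This is not the code from the patch: SimpleSpan is a stand-in for OpenNLP's Span with an inclusive start and exclusive end, checkNoOverlap is a hypothetical helper, and treating nested spans as overlapping is an assumption of this example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of an overlap check for name spans.
class SpanOverlapCheck {

  // Minimal stand-in for a span: start is inclusive, end is exclusive.
  static final class SimpleSpan {
    final int start;
    final int end;

    SimpleSpan(int start, int end) {
      if (start < 0 || end < start) {
        throw new IllegalArgumentException("invalid span: [" + start + ".." + end + ")");
      }
      this.start = start;
      this.end = end;
    }
  }

  // Throws if any two spans share a token. After sorting by start,
  // it is enough to compare each span with its predecessor: if any
  // pair overlaps, the earlier span also overlaps its direct successor.
  static void checkNoOverlap(SimpleSpan[] names) {
    SimpleSpan[] sorted = names.clone(); // leave the caller's array untouched
    Arrays.sort(sorted, Comparator.comparingInt((SimpleSpan s) -> s.start));
    for (int i = 1; i < sorted.length; i++) {
      if (sorted[i].start < sorted[i - 1].end) {
        throw new IllegalArgumentException("name spans overlap at index " + sorted[i].start);
      }
    }
  }
}
```

Adjacent spans such as [0..2) and [2..4) pass, while [0..3) and [2..4) fail. Sorting keeps the check at O(n log n) instead of comparing every pair.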
[jira] [Resolved] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework
[ https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved OPENNLP-1044. - Resolution: Fixed Fix Version/s: 1.8.0 > Add validate() which checks validity of parameters in the process of the > framework > -- > > Key: OPENNLP-1044 > URL: https://issues.apache.org/jira/browse/OPENNLP-1044 > Project: OpenNLP > Issue Type: Improvement >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 1.8.0 > > > When I worked on OPENNLP-1039, I saw that the client code throws > IllegalArgumentException when isValid() returns false, but I think this kind > of method should throw the exception itself, and when it is called > should be controlled by the framework. > So it should look like: > {code} > public abstract class AbstractTrainer { > @Deprecated > public boolean isValid() { ... } > // if the subclass overrides this, it should call super.validate(); > public void validate() throws IllegalArgumentException { > // default implementation here > } > // this is the controller of the flow of training... > public final void train() { > // initializing > init(); > // validating parameters > validate(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework
[ https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi closed OPENNLP-1044. --- > Add validate() which checks validity of parameters in the process of the > framework > -- > > Key: OPENNLP-1044 > URL: https://issues.apache.org/jira/browse/OPENNLP-1044 > Project: OpenNLP > Issue Type: Improvement >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 1.8.0 > > > When I worked on OPENNLP-1039, I saw that the client code throws > IllegalArgumentException when isValid() returns false, but I think this kind > of method should throw the exception itself, and when it is called > should be controlled by the framework. > So it should look like: > {code} > public abstract class AbstractTrainer { > @Deprecated > public boolean isValid() { ... } > // if the subclass overrides this, it should call super.validate(); > public void validate() throws IllegalArgumentException { > // default implementation here > } > // this is the controller of the flow of training... > public final void train() { > // initializing > init(); > // validating parameters > validate(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework
[ https://issues.apache.org/jira/browse/OPENNLP-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned OPENNLP-1044: --- Assignee: Koji Sekiguchi > Add validate() which checks validity of parameters in the process of the > framework > -- > > Key: OPENNLP-1044 > URL: https://issues.apache.org/jira/browse/OPENNLP-1044 > Project: OpenNLP > Issue Type: Improvement >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > > When I worked on OPENNLP-1039, I saw that the client code throws > IllegalArgumentException when isValid() returns false, but I think this kind > of method should throw the exception itself, and when it is called > should be controlled by the framework. > So it should look like: > {code} > public abstract class AbstractTrainer { > @Deprecated > public boolean isValid() { ... } > // if the subclass overrides this, it should call super.validate(); > public void validate() throws IllegalArgumentException { > // default implementation here > } > // this is the controller of the flow of training... > public final void train() { > // initializing > init(); > // validating parameters > validate(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (OPENNLP-1044) Add validate() which checks validity of parameters in the process of the framework
Koji Sekiguchi created OPENNLP-1044: --- Summary: Add validate() which checks validity of parameters in the process of the framework Key: OPENNLP-1044 URL: https://issues.apache.org/jira/browse/OPENNLP-1044 Project: OpenNLP Issue Type: Improvement Reporter: Koji Sekiguchi Priority: Minor When I worked on OPENNLP-1039, I saw that the client code throws IllegalArgumentException when isValid() returns false, but I think this kind of method should throw the exception itself, and when it is called should be controlled by the framework. So it should look like: {code} public abstract class AbstractTrainer { @Deprecated public boolean isValid() { ... } // if the subclass overrides this, it should call super.validate(); public void validate() throws IllegalArgumentException { // default implementation here } // this is the controller of the flow of training... public final void train() { // initializing init(); // validating parameters validate(); } } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
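The flow proposed in OPENNLP-1044 is essentially a template method: a final train() fixes the order of steps, and validate() throws by itself. The following is a runnable sketch under stated assumptions, not the committed OpenNLP code. SketchTrainer, SketchPerceptronTrainer, doTrain(), and the concrete parameter checks are invented for illustration.

```java
// Hypothetical base class; stands in for the AbstractTrainer of the ticket.
abstract class SketchTrainer {
  protected final int iterations;
  protected final int cutoff;

  SketchTrainer(int iterations, int cutoff) {
    this.iterations = iterations;
    this.cutoff = cutoff;
  }

  // Throws instead of returning a boolean. Subclasses that override
  // this should call super.validate() so the common checks still run.
  public void validate() {
    if (iterations < 1) {
      throw new IllegalArgumentException("iterations must be positive: " + iterations);
    }
    if (cutoff < 0) {
      throw new IllegalArgumentException("cutoff must not be negative: " + cutoff);
    }
  }

  // The algorithm-specific part, supplied by subclasses.
  protected abstract String doTrain();

  // The framework controls the flow: callers cannot skip validation.
  public final String train() {
    validate();
    return doTrain();
  }
}

// Hypothetical subclass adding one check of its own.
class SketchPerceptronTrainer extends SketchTrainer {
  private final String algorithm;

  SketchPerceptronTrainer(int iterations, int cutoff, String algorithm) {
    super(iterations, cutoff);
    this.algorithm = algorithm;
  }

  @Override
  public void validate() {
    super.validate();
    if (algorithm != null && !"PERCEPTRON".equals(algorithm)) {
      throw new IllegalArgumentException("unsupported algorithm: " + algorithm);
    }
  }

  @Override
  protected String doTrain() {
    return "trained with " + iterations + " iterations";
  }
}
```

Because train() is final, client code no longer needs to remember to check isValid() before training; an invalid configuration fails at the start of train() with a message that names the offending parameter.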
[jira] [Closed] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()
[ https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi closed OPENNLP-1039. --- > PerceptronTrainer should call super.isValid() in its isValid() > -- > > Key: OPENNLP-1039 > URL: https://issues.apache.org/jira/browse/OPENNLP-1039 > Project: OpenNLP > Issue Type: Bug >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 1.8.0 > > > The current implementation of PerceptronTrainer#isValid() is: > {code} > public boolean isValid() { > String algorithmName = getAlgorithm(); > return !(algorithmName != null && > !(PERCEPTRON_VALUE.equals(algorithmName))); > } > {code} > but it should call super.isValid() to check iterations and cutoff parameters > because PerceptronTrainer uses them. > And if possible, I'd like to rewrite the last line (return statement) because > I needed a few minutes to understand it as it has three exclamation points in > one line. :) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()
[ https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved OPENNLP-1039.
-
Resolution: Fixed
Fix Version/s: 1.8.0

> PerceptronTrainer should call super.isValid() in its isValid()
> --
>
> Key: OPENNLP-1039
> URL: https://issues.apache.org/jira/browse/OPENNLP-1039
> Project: OpenNLP
> Issue Type: Bug
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Trivial
> Fix For: 1.8.0
>
> The current implementation of PerceptronTrainer#isValid() is:
> {code}
> public boolean isValid() {
>   String algorithmName = getAlgorithm();
>   return !(algorithmName != null &&
>       !(PERCEPTRON_VALUE.equals(algorithmName)));
> }
> {code}
> but it should call super.isValid() to check iterations and cutoff parameters
> because PerceptronTrainer uses them.
> And if possible, I'd like to rewrite the last line (return statement) because
> I needed a few minutes to understand it as it has three exclamation points in
> one line. :)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (OPENNLP-1039) PerceptronTrainer should call super.isValid() in its isValid()
[ https://issues.apache.org/jira/browse/OPENNLP-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned OPENNLP-1039:
---
Assignee: Koji Sekiguchi

> PerceptronTrainer should call super.isValid() in its isValid()
> --
>
> Key: OPENNLP-1039
> URL: https://issues.apache.org/jira/browse/OPENNLP-1039
> Project: OpenNLP
> Issue Type: Bug
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Trivial
>
> The current implementation of PerceptronTrainer#isValid() is:
> {code}
> public boolean isValid() {
>   String algorithmName = getAlgorithm();
>   return !(algorithmName != null &&
>       !(PERCEPTRON_VALUE.equals(algorithmName)));
> }
> {code}
> but it should call super.isValid() to check iterations and cutoff parameters
> because PerceptronTrainer uses them.
> And if possible, I'd like to rewrite the last line (return statement) because
> I needed a few minutes to understand it as it has three exclamation points in
> one line. :)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)